Comparison of Sentiment Labeling Using Textblob, Vader, and Flair with IndoBERT

Abstract

This project presents a comparative analysis of three sentiment labeling techniques, TextBlob, VADER, and Flair, and evaluates how well each captures public opinion. An IndoBERT model is then used to validate and enhance the resulting labels, giving insight into the strengths and limitations of each method in the context of the Indonesian language. Flair achieved the highest accuracy (81.29%), followed by VADER (74.86%) and TextBlob (73.35%), with IndoBERT serving as the common validation model.

Introduction

Sentiment analysis has become an indispensable tool for understanding public opinions and emotions from vast amounts of text data. With the increasing volume of data generated on social media and other online platforms, the ability to accurately gauge sentiment is crucial for businesses, researchers, and policymakers alike. This project is motivated by the need to compare the performance of established sentiment analysis tools while considering the role of large language models like IndoBERT, specifically for the Indonesian language, which often possesses unique linguistic characteristics requiring tailored approaches. This study aims to guide practitioners in selecting the most suitable sentiment labeling method.

Methodology

In this study, we implemented and evaluated three primary approaches to sentiment analysis:

TextBlob: A popular Python library for text processing, providing a simple API for rule-based and lexicon-based sentiment analysis.
VADER (Valence Aware Dictionary and sEntiment Reasoner): A lexicon and rule-based approach specifically optimized for sentiment expressed in social media contexts.
Flair: A lightweight and powerful NLP framework, offering pre-trained sentiment models based on deep learning and contextual embeddings.
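
As a minimal sketch of how the two lexicon-based tools above end up producing class labels: TextBlob reports a polarity score and VADER a compound score, both in the range -1 to 1, and the score is then mapped to a label. The cut-offs below are assumptions, since the source does not state them: zero as the neutral point for TextBlob polarity, and the 0.05 band commonly used with VADER. The function names are illustrative. Flair's pre-trained sentiment model returns a positive or negative label directly, so no threshold is involved there.

```python
def label_from_polarity(polarity: float) -> str:
    """Map a TextBlob-style polarity score in [-1, 1] to a label.

    Assumption: exactly 0 is neutral; the source does not give its cut-off.
    """
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


def label_from_compound(compound: float, band: float = 0.05) -> str:
    """Map a VADER-style compound score in [-1, 1] to a label.

    Assumption: scores strictly inside (-band, band) are neutral.
    """
    if compound >= band:
        return "positive"
    if compound <= -band:
        return "negative"
    return "neutral"
```

Under these assumptions, a compound score of 0.03 falls inside the neutral band, while the same value as a TextBlob polarity would be labeled positive. Differences of this kind are one way the two tools can disagree on the same text.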

Subsequently, we utilized IndoBERT, a BERT-based language model trained on an Indonesian text corpus. In this project, IndoBERT was employed to enhance and validate the sentiment labels generated by TextBlob, VADER, and Flair. Its role was to provide a more robust reference for evaluating the performance of each method within the Indonesian language context.

We used data collected from X (Twitter), YouTube comments, and Instagram comments related to the 2024 presidential inauguration. Data was collected by crawling (via the X API) and by scraping (YouTube and Instagram, using the Google Data Scrapper extension). The final dataset consisted of 5,283 data points. The project evaluates each labeling method (TextBlob, VADER, Flair) using standard metrics (Accuracy, Precision, Recall, and F1-score), with IndoBERT as the validation model.

Experiments

We conducted experiments on a dataset consisting of 5,283 data points collected from X (Twitter), YouTube, and Instagram social media platforms, focusing on public opinion after the 2024 presidential inauguration.

The experimental process involved the following steps:

Data Preprocessing: Raw data underwent preprocessing stages including case folding, data cleaning (removing mentions, numbers, links, punctuation, symbols), tokenization, and negation handling. This preprocessing was performed using Pandas, Regular Expression, String, and NLTK libraries.
Data Translation: The preprocessed data was then translated from Indonesian to English, because TextBlob, VADER, and Flair are known to be effective for English text.
Sentiment Labeling: Each method (TextBlob, VADER, Flair) was applied to label sentiment on the translated dataset. TextBlob and VADER classify sentiment into positive, negative, and neutral categories based on polarity values or compound scores, respectively. Flair uses a pre-trained 'en-sentiment' model that classifies only positive and negative sentiment.
Data Splitting: The dataset was split into three subsets: 80% for training (4,225 data points), 10% for validation (529 data points), and 10% for testing (529 data points), using stratified sampling to maintain sentiment balance.
Validation and Evaluation with IndoBERT: The IndoBERT model (specifically the indobert-base-p1 variant) was trained and tested on the labels produced by TextBlob, VADER, and Flair. This process involved tokenization, padding and truncation, embeddings, and transformer layers. Performance was evaluated using a confusion matrix to derive Accuracy, Precision, Recall, and F1-score.
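
The preprocessing and splitting steps above can be sketched with the standard library alone. This is an illustration under assumptions, not the project's actual code: the regular expressions, the negation word list, the choice to join a negation word to the token that follows it, and the function names are all hypothetical, and the original work used Pandas and NLTK where this sketch uses plain Python.

```python
import random
import re
from collections import defaultdict

# Hypothetical negation list; the source does not say which words were handled.
NEGATIONS = {"tidak", "bukan", "belum", "jangan", "tak", "gak", "nggak"}


def preprocess(text):
    """Case-fold, clean, tokenize, and mark negations in one comment."""
    text = text.lower()                                  # case folding
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # links
    text = re.sub(r"@\w+", " ", text)                    # mentions
    text = re.sub(r"\d+", " ", text)                     # numbers
    text = re.sub(r"[^\w\s]", " ", text)                 # punctuation, symbols
    tokens = text.split()                                # whitespace tokenization

    # Assumed negation handling: glue the negation word to the next token.
    merged, i = [], 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def stratified_split(labels, train=0.8, val=0.1, seed=42):
    """Return (train_idx, val_idx, test_idx), keeping per-class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx, test_idx = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * train)
        n_val = round(len(indices) * val)
        train_idx.extend(indices[:n_train])
        val_idx.extend(indices[n_train:n_train + n_val])
        test_idx.extend(indices[n_train + n_val:])   # remainder goes to test
    return train_idx, val_idx, test_idx
```

For example, `preprocess("@budi Saya TIDAK suka pidato ini!!! https://t.co/abc 2024")` yields `['saya', 'tidak_suka', 'pidato', 'ini']` under these assumptions. Splitting within each class, rather than over the whole dataset, is what keeps the sentiment proportions similar across the three subsets.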

Results

The results of our experiments, based on validation using the IndoBERT model, show the following performance comparison:

Flair: Achieved the highest accuracy of the three methods, 81.29%. For the negative category, Precision 0.8090, Recall 0.8412, F1-score 0.8248. For positive, Precision 0.8174, Recall 0.7817, F1-score 0.7992. Flair did not detect neutral sentiment.
TextBlob: Recorded an accuracy of 73.35%. For the negative category, Precision 0.6552, Recall 0.6064, F1-score 0.6298. For positive, Precision 0.7225, Recall 0.7366, F1-score 0.7295. For neutral, Precision 0.7725, Recall 0.7826, F1-score 0.7775. TextBlob predominantly classified neutral sentiment (43.51% of total data).
VADER: Recorded an accuracy of 74.86%. For the negative category, Precision 0.7079, Recall 0.7826, F1-score 0.7434. For positive, Precision 0.8008, Recall 0.7714, F1-score 0.7859. For neutral, Precision 0.7043, Recall 0.6585, F1-score 0.6807. VADER predominantly classified positive sentiment (46.28% of total data).
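
For reference, the per-class figures above follow from a confusion matrix in the usual way. The sketch below uses a made-up two-class matrix (the project's actual confusion matrices are not reproduced here) and hypothetical function names. It also offers a quick way to check that a reported precision and recall are consistent with the reported F1-score.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def per_class_metrics(confusion, labels):
    """Compute accuracy and per-class (precision, recall, F1).

    confusion[i][j] is the number of items whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(labels)))
    report = {}
    for k, name in enumerate(labels):
        tp = confusion[k][k]
        predicted = sum(row[k] for row in confusion)   # column sum: TP + FP
        actual = sum(confusion[k])                     # row sum: TP + FN
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        report[name] = (precision, recall, f1(precision, recall))
    return correct / total, report


# Hypothetical counts, for illustration only (rows = true, columns = predicted).
toy = [[40, 10],
       [5, 45]]
accuracy, report = per_class_metrics(toy, ["negative", "positive"])
```

As a consistency check on the reported numbers, `f1(0.8090, 0.8412)` evaluates to roughly 0.8248, matching the F1-score listed for Flair's negative class.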

Further details regarding performance metrics (precision, recall, F1-score) and sentiment class-wise analysis are available in the project's GitHub repository.

Discussion

Our findings indicate that Flair, with its deep learning foundation and contextual embeddings, outperformed TextBlob and VADER in the sentiment labeling task on the Indonesian dataset (81.29% accuracy versus 73.35% and 74.86%). Flair's higher accuracy can be attributed to its ability to learn more complex sentiment patterns and nuances from the data. Although Flair did not detect neutral sentiment, it performed well on the two classes it does label, with F1-scores of 0.8248 (negative) and 0.7992 (positive).

On the other hand, TextBlob and VADER, as lexicon- and rule-based approaches, showed limitations in capturing finer or context-specific sentiment in Indonesian that deep learning-based models can handle. TextBlob tended to label more text as neutral, making it better suited to basic sentiment analysis. VADER, designed for social media text, handled informal language better and detected more explicit sentiment, especially positive sentiment.

IndoBERT's role was pivotal in this study. While not directly compared as a primary labeling model, its utilization for validation provided a robust and reliable benchmark for the obtained results. This affirms the potential of large language models trained on specific languages to serve as crucial evaluation and quality enhancement tools in NLP analysis. The variation in labeling results across libraries (as shown by Table 3 in the original article) also highlights the importance of choosing the appropriate labeling method for sentiment analysis.

Conclusion

This project affirms the complexity of sentiment analysis. It highlights the importance of selecting appropriate tools based on data and language characteristics and the significance of validating with more robust models. This comparison provides valuable insights for practitioners and researchers aiming to apply sentiment analysis effectively.

In this research, 5,283 data points from X, YouTube, and Instagram were collected, preprocessed, and translated. Sentiment labeling was performed using TextBlob, VADER, and Flair, and the resulting labels were then used to train and validate an IndoBERT model. TextBlob predominantly detected neutral sentiment and VADER positive sentiment, while Flair predominantly detected negative sentiment and could not detect neutral sentiment. Flair yielded the best results, with an accuracy of 81.29%, which we attribute to its use of deep learning and contextual embeddings rather than a lexicon. This indicates that the choice of sentiment labeling method affects both the labels and the downstream model's performance in sentiment analysis.

References

Anam, K., & Kusnawi. (2025). Comparison of Sentiment Labeling Using Textblob, Vader, and Flair in Public Opinion Analysis Post-2024 Presidential Inauguration with IndoBERT. Jurnal Teknik Informatika (JUTIF), 6(2), 803-818. https://jutif.if.unsoed.ac.id/index.php/jurnal/article/view/4015/812
Zulham. (2023). Communication of Political Identity & Indonesian Presidential Candidacy in the 2024 Election. Int. J. Humanit. Soc. Stud., 11(1), 60-63. https://doi.org/10.24940/theijhss/2023/v11/i1/hs2301-014
Konovalova, E., Le Mens, G., & Schöll, N. (2023). Social media feedback and extreme opinion expression. PLoS One, 18(11 November), 1-12. https://doi.org/10.1371/journal.pone.0293805
Elhan, A., Hardhienata, M. K. D., Yeni, H., Hartono, S. W., & Adisantoso, J. (2022). Analisis Sentimen Pengguna Twitter terhadap Vaksinasi COVID-19 di Indonesia menggunakan Algoritme Random Forest dan BERT. J. Ilmu Komput. Agri-informatika, 9(2), 199-211. https://doi.org/10.29244/jika.9.2.199-211
Kokab, S. T., Asghar, S., & Naz, S. (2022). Transformer-based deep learning models for the sentiment analysis of social media data. Array, 14(October 2021), 100157. https://doi.org/10.1016/j.array.2022.100157
Rangarjan, P. K., et al. (2024). The social media sentiment analysis framework: deep learning for sentiment analysis on social media. Int. J. Electr. Comput. Eng., 14(3), 3394-3405. https://doi.org/10.11591/ijece.v14i3.pp3394-3405
Pathak, U., & Rai, E. P. (2023). Sentiment Analysis: Methods, Applications, and. Int. J. Res. Appl. Sci. Eng. Technol., 11(February). https://doi.org/10.22214/ijraset.2023.49165
Mikula, M., Gao, X., & Mach, M. (2020). Lexicon-based Sentiment Analysis Using the Particle Swarm Optimization. J. Electron., 1-22. https://doi.org/10.3390/electronics9081317
Kalaiarasu, M., & Kumar, C. R. (2022). Sentiment Analysis using Improved Novel Convolutional Neural Network (SNCNN). Int. J. Comput. Commun. Control, 17(2), 1-15. https://doi.org/10.15837/ijccc.2022.2.4351
Hussein, D. J., Rashad, M. N., Mirza, K. I., & Hussein, D. L. (2022). Machine Learning Approach to Sentiment Analysis in Data Mining. Passer J. Basic Appl. Sci., 4(1), 71-77. https://doi.org/10.24271/psr.2022.312664.1101
Ramadhan, N. G., Wibowo, M., Mohd Rosely, N. F. L., & Quix, C. (2022). Opinion mining indonesian presidential election on twitter data based on decision tree method. J. Infotel, 14(4), 243-248. https://doi.org/10.20895/infotel.v14i4.832
Alenzi, B. M., Khan, M. B., Hasanat, M. H. A., Saudagar, A. K. J., Alkhathami, M., & Altameem, A. (2022). Automatic Annotation Performance of TextBlob and VADER on Covid Vaccination Dataset. Intell. Autom. Soft Comput., 34(2), 1311-1331. https://doi.org/10.32604/iasc.2022.025861
Sivalakshmi, P., Kumar, P. U., Vasanth, M., Srinath, R., & Yokesh, M. (2021). COVID-19 Vaccine Public Sentiment Analysis Using Python's Textblob Approach. Int. J. Curr. Res. Rev., 13(11), 166-172. https://doi.org/10.31782/ijcrr.2021.sp218
Prof, A., & Gujjar, P. (2021). Sentiment Analysis: Textblob For Decision Making Department of Business Analytics. Int. J. Sci. Res. Eng. Trends, 7(2), 1097-1099.
Rosenberg, E., et al. (2023). Sentiment analysis on Twitter data towards climate action. Results Eng., 19(June), 101287. https://doi.org/10.1016/j.rineng.2023.101287
Darji, D. A., & Goswami, S. A. (2024). The Comparative study of Python Libraries for Natural Language Processing (NLP).
Shah, P., Patel, H., & Swaminarayan, P. (2024). Multitask Sentiment Analysis and Topic Classification. EAI Endorsed Trans. Scalable Inf. Syst., 1-12. https://doi.org/10.4108/eetsis.5287
Alammary, A. S. (2022). BERT Models for Arabic Text Classification: A Systematic Review. J. Appl. Sci., 1-20. https://doi.org/10.3390/app12115720
Fitriyana, V., Hakim, L., Candra, D., Novitasari, R., & Hanif, A. (2023). Analisis Sentimen Ulasan Aplikasi Jamsostek Mobile Menggunakan Metode Support Vector Machine. J. Buana Inform., 14(April), 40-49. https://doi.org/10.24002/jbi.v14i01.6909
Rifaldi, D., Fadlil, A., & Herman. (2023). Teknik Preprocessing Pada Text Mining Menggunakan Data Tweet 'Mental Health'. Decod. J. Pendidik. Teknol. Inf., 3(2), 161-171. https://doi.org/10.51454/decode.v3i2.131
Kusnawi, K., & Wijaya, A. H. (2021). Sentiment Analysis of Pancasila Values in Social Media Life Using the Naive Bayes Algorithm. In 2021 International Seminar on Application for Technology of Information and Communication (iSemantic) (pp. 96-101). https://doi.org/10.1109/iSemantic52711.2021.9573194
Chouhan, K. U., Jha, R. S., Pradeep, N., Jha, K., & Kamaluddin, S. I. (2023). Legal Document Analysis. Int. J. Res. Appl. Sci. Eng. Technol., 11(IV). https://doi.org/10.22214/ijraset.2023.50123
Muzaki, A., & Witanti, A. (2021). Sentiment Analysis Of The Community In The Twitter To The 2020 Election In Pandemic Covid-19 By Method Naive Bayes Classifier. J. Tek. Inform., 2(2), 101-107. https://doi.org/10.20884/1.jutif.2021.2.2.51
Khotimah, A. C., et al. (2022). Comparison Naïve Bayes Classifier, K-Nearest Neighbor And Support Vector Machine In The Classification Of Individual On Perbandingan Algoritma Naïve Bayes Classifier, K-Nearest Neighbor Dan Support Vector Machine Dalam Klasifikasi. J. Tek. Inform., 3(3), 673-680. https://doi.org/10.20884/1.jutif.2022.3.3.254
Puspitasari, R., Findawati, Y., & Rosid, M. A. (2023). Sentiment Analysis Of Post-Covid-19 Inflation Based On Twitter Using The K-Nearest Neighbor And Support Vector Machine. J. Tek. Inform., 4(4), 669-679. https://doi.org/10.52436/1.jutif.2023.4.4.801
Sistem, R., Putra, I. M. S., Jhonarendra, P., Kadek, N., & Rusjayanthi, D. (2021). Deteksi Kesamaan Teks Jawaban pada Sistem Test Essay Online dengan Pendekatan Neural Network. J. RESTI (Rekayasa Sist. Dan Teknol. Inf.), 5(158), 3-12. https://doi.org/10.29207/resti.v5i6.3544
Makkar, K., Kumar, P., Poriye, M., & Aggarwal, S. (2024). Improving Sentiment Analysis using Negation Scope Detection and Negation Handling. Int. J. Comput. Digit. Syst., 1(1), 239-247. https://doi.org/10.12785/ijcds/160119
Hazarika, D., Konwar, G., & Deb, S. (2020). Sentiment Analysis on Twitter by Using TextBlob for Natural Language Processing. Proc. Int. Conf. Res. Manag. Technovation, 24, 63-67. https://doi.org/10.15439/2020KM20
Dewi, S., & Arianto, D. B. (2022). Twitter Sentiment Analysis Towards Qatar As Host Of The 2022 World Cup Using Textblob. J. Soc. Res., 2(2), 443-454. https://doi.org/10.55324/josr.v2i2.615
Arief, M., & Samsudin, N. A. (2023). Hybrid Approach with VADER and Multinomial Logistic Regression for Multiclass Sentiment Analysis in Online Customer Review. Int. J. Adv. Comput. Sci. Appl., 14(12), 311-320. https://doi.org/10.3390/s23010506
Pano, T., & Kashef, R. (2020). A Complete VADER-Based Sentiment Analysis of Bitcoin (BTC) Tweets during the Era of COVID-19. https://doi.org/10.3390/bdcc4040033
Herwanto, G. B., Ningtyas, A. M., Mujiyatna, I. G., & Nyoman, I. (2021). Hate Speech Detection in Indonesian Twitter using Contextual Embedding Approach. IJCCS (Indonesian J. Comput. Cybern. Syst.), 15(2). https://doi.org/10.22146/ijccs.64916
Ali, M. F., Irfan, R., & Lashari, T. A. (2023). Comprehensive sentimental analysis of tweets towards COVID-19 in Pakistan: a study on governmental preventive measures. PeerJ Comput. Sci. https://doi.org/10.7717/peerj-cs.1220
Fadlil, A., Riadi, I., & Andrianto, F. (2024). Improving Sentiment Analysis in Digital Marketplaces through SVM Kernel Fine-Tuning. Int. J. Comput. Digit. Syst., 1(1). https://doi.org/10.12785/ijcds/160113
Yulianti, E., & Nissa, N. K. (2024). ABSA of Indonesian customer reviews using IndoBERT: single-sentence and sentence-pair classification approaches. 13(5), 3579-3589. https://doi.org/10.11591/eei.v13i5.8032
Cahyadi, M. F., & Rochadiani, T. H. (2025). Implementasi Ensemble Deep Learning Untuk Analisis Sentimen Terhadap Genre Game Mobile. 8, 1512-1523. https://doi.org/10.30865/mib.v8i3.7832
Kusnawi, M., Rahardi, M., & Pandiangan, V. D. (2023). Sentiment Analysis of Neobank Digital Banking Using Support Vector Machine Algorithm in Indonesia. Int. J. INFORMATICS Vis., 7(June), 377-383. https://doi.org/10.30630/joiv.7.2.1652
Özel, M., & Çetinkaya Bozkurt, Ö. (2024). Sentiment Analysis on GPT-4 with Comparative Models Using Twitter Data. Acta Infologica, 0(0), 0-0. https://doi.org/10.26650/acin.1418834
Miranda, E., Gabriella, V., Wahyudi, S. A., & Chai, J. (2023). Text Classification untuk Menganalisis Sentimen Pendapat Masyarakat Indonesia terhadap Vaksinasi Covid - 19. J. Sist. Inf., 12, 438-451. http://sistemasi.ftik.unisi.ac.id
Bellar, O., Baina, A., & Bellafkih, M. (2023). Sentiment Analysis of Tweets on Social Issues Using Machine Learning Approach. Proc. 2023 Int. Conf. Digit. Age Technol. Adv. Sustain. Dev. ICDATA 2023, 9(4), 126-131. https://doi.org/10.1109/ICDATA58816.2023.00031
Mushtaq, M. F., Fareed, M. M. S., Almutairi, M., Ullah, S., Ahmed, G., & Munir, K. (2022). Analyses of Public Attention and Sentiments towards Different COVID-19 Vaccines Using Data Mining Techniques. Vaccines, 10(5). https://doi.org/10.3390/vaccines10050661
Marrapu, S., Senn, W., & Prybutok, V. (2024). Sentiment Analysis of Twitter Discourse on Omicron Vaccination in the USA Using VADER and BERT. J. Data Sci. Intell. Syst., 00(January), 1-11. https://doi.org/10.47852/bonviewjdsis42022441
Illia, F., et al. (2021). Sentiment Analysis on PeduliLindungi Application Using TextBlob and VADER Library. Proc. Int. Conf. Data Sci. Off. Stat., 64, 278-288. https://doi.org/10.34123/icdsos.v2021i1.236
Rajkhowa, P., et al. (2023). Factors Influencing Monkeypox Vaccination A Cue to Policy Implementation. J. Epidemiol. Glob. Health, 13(2), 226-238. https://doi.org/10.1007/s44197-023-00100-9
Asri, Y., Suliyanti, W. N., Kuswardani, D., & Fajri, M. (2022). Pelabelan Otomatis Lexicon Vader dan Klasifikasi Naive Bayes dalam menganalisis sentimen data ulasan PLN Mobile. Petir, 15(2), 264-275.
Arifiyanti, A. A., Kartika, D. S. Y., & Prawiro, C. J. (2022). Using Pre-Trained Models for Sentiment Analysis in Indonesian Tweets. Proc. Int. Conf. Informatics Comput. Sci., 2022-Septe(February), 78-83. https://doi.org/10.1109/ICICOS56336.2022.9930599
Maqbool, J., Aggarwal, P., Kaur, R., Mittal, A., & Ali, I. (2023). Stock Prediction by Integrating Sentiment Scores of Financial News and MLP-Regressor: A Machine Learning Approach. Elsevier, 218, 1067-1078.

Contact

Email: khoerulanam231@gmail.com
LinkedIn: https://www.linkedin.com/in/khoerul-anam-a7b627221/