Sentiment Analysis and Topic Modeling of Pet Market Reviews Using BERTimbau and Active Learning

Abstract

This project combines BERTimbau and Active Learning to classify the sentiment of pet service reviews in Brazilian Portuguese. Starting with 500 labeled examples, entropy-based selection identifies the most uncertain reviews for manual annotation, improving model performance with less effort. Additionally, topic modeling using BERT embeddings and KeyBERT extracts key themes from both positive and negative reviews, offering valuable insights into customer satisfaction and dissatisfaction.

Introduction

This project applies Natural Language Processing (NLP) and Active Learning to classify sentiment in customer reviews written in Brazilian Portuguese, focusing on pet-related businesses in Santo André, Brazil. Using the BERTimbau model, a transformer pre-trained for Portuguese, the system learns to predict review sentiment based on text rather than relying solely on user-provided star ratings.

To reduce the cost of manual labeling, an entropy-based Active Learning loop was implemented, allowing the model to focus on the most uncertain and informative examples. In parallel, topic modeling was performed using UMAP, HDBSCAN, and KeyBERT, producing interpretable topics that highlight common patterns in both positive and negative reviews.

The result is a practical and efficient NLP pipeline that delivers accurate sentiment predictions while also helping businesses understand key drivers of customer satisfaction and complaints.

Objectives

Classify customer reviews on a 1–5 star scale using fine-tuned BERTimbau.
Apply entropy-based Active Learning to select the most uncertain unlabeled examples for annotation.
Extract and rank key topics from both positive and negative reviews using BERT embeddings, UMAP, HDBSCAN, and KeyBERT.
Understand mismatches between star ratings and the sentiment expressed in review text.

Dataset and Context

The dataset was collected using the Google Places API and consists of customer reviews written in Brazilian Portuguese. The businesses include pet shops, veterinary clinics, and grooming centers.

Each review includes:

A star rating (1 to 5)
A free-text comment describing the experience
Metadata from the business location

Text length and quality were validated to ensure informative content. Very short or incomplete reviews were excluded during preprocessing.
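The length check described above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing: the `filter_reviews` helper, the `text` and `rating` field names, and the three-word threshold are all assumptions made for the example.

```python
def is_informative(text, min_words=3):
    """Return True when the text has at least `min_words` whitespace-separated tokens."""
    return len(text.split()) >= min_words


def filter_reviews(reviews, min_words=3):
    """Keep reviews whose comment is present and long enough.

    `reviews` is a list of dicts with hypothetical keys "rating" and "text".
    The threshold is an illustrative assumption, not the value used in the project.
    """
    kept = []
    for review in reviews:
        # Missing comments are treated as empty strings.
        text = (review.get("text") or "").strip()
        if is_informative(text, min_words):
            kept.append({**review, "text": text})
    return kept
```

A word-count threshold is only one possible quality criterion; a character-count limit or a language check could be substituted without changing the structure.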

Sentiment Classification with BERTimbau

The model used for classification is neuralmind/bert-base-portuguese-cased, a pre-trained BERT model for Brazilian Portuguese.

The text was tokenized and fed into BERTimbau with a classification head.
Training began with an initial set of 500 manually labeled reviews.
The model was evaluated quantitatively against the original star ratings and also assessed qualitatively. The qualitative review showed that its predictions often matched the actual content of a review more closely than the user-provided star rating did.

This highlighted a key motivation for the project: many star ratings are inconsistent or misleading when compared to the text. BERTimbau helped overcome this subjectivity by focusing on the linguistic content.
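To make the classification step concrete, the sketch below shows how the raw outputs (logits) of a five-way classification head could be turned into class probabilities and a star prediction. It assumes that class index 0 corresponds to 1 star and index 4 to 5 stars; that mapping and the function names are illustrative assumptions, not details confirmed by the project.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_stars(logits):
    """Return (predicted_star, probabilities) for one review.

    Assumes class index i stands for a rating of i + 1 stars.
    """
    probs = softmax(logits)
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return best_index + 1, probs
```

The probabilities returned here are the same quantities an uncertainty measure such as entropy operates on, so keeping them alongside the predicted label is convenient.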

Active Learning Strategy

To avoid labeling thousands of reviews manually, an Active Learning loop was implemented using entropy-based uncertainty sampling:

The model was trained on the initial 500 labeled reviews.
Predictions were made on the remaining unlabeled pool.
The entropy of each predicted class distribution was calculated to measure the model’s uncertainty.
The 100 most uncertain examples were selected and manually labeled.
These new examples were added to the training set and the model was retrained.

This cycle was repeated, allowing the model to focus its learning on ambiguous or borderline cases, which led to faster performance gains with fewer labeled samples.
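The selection step can be sketched as follows. For a predicted distribution p over the classes, the Shannon entropy is H(p) = -sum(p_i * log(p_i)); it is highest when the model spreads probability evenly and lowest when it is confident in one class. The function names are illustrative, and the default batch size of 100 simply mirrors the number quoted above.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of one probability distribution."""
    # Terms with zero probability contribute nothing and are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(pool_probs, k=100):
    """Return indices of the k pool items with the highest entropy.

    `pool_probs` holds one probability distribution per unlabeled review.
    """
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]
```

For example, a review predicted as `[0.2, 0.2, 0.2, 0.2, 0.2]` ranks ahead of one predicted as `[0.96, 0.01, 0.01, 0.01, 0.01]`, because the first distribution carries no preference for any rating.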

Figure 1. Active Learning loop using entropy-based uncertainty sampling.
The model is initially trained on a small labeled set. At each iteration, it selects the most uncertain examples (based on entropy), which are then manually annotated and added to the training set for retraining.
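The bookkeeping of the loop in Figure 1 could look roughly like the sketch below. It is deliberately generic: `train`, `predict_proba`, `select`, and `annotate` are hypothetical callables supplied by the caller (standing in for fine-tuning, inference, uncertainty ranking, and human labeling), and the number of rounds is an assumption, since the text does not state how many iterations were run.

```python
def active_learning_loop(labeled, pool, train, predict_proba, select, annotate,
                         rounds=3, batch_size=100):
    """Run a pool-based active learning loop and return (model, labeled, pool).

    labeled: list of (text, label) pairs used for training.
    pool: list of unlabeled texts.
    All four callables are placeholders provided by the caller.
    """
    model = train(labeled)
    for _ in range(rounds):
        if not pool:
            break  # nothing left to annotate
        probs = predict_proba(model, pool)
        picked = set(select(probs, batch_size))
        # Ask the annotator for labels on the selected items only.
        newly_labeled = [(pool[i], annotate(pool[i])) for i in sorted(picked)]
        labeled = labeled + newly_labeled
        # Remove the annotated items from the unlabeled pool.
        pool = [text for i, text in enumerate(pool) if i not in picked]
        model = train(labeled)
    return model, labeled, pool
```

Passing the steps in as functions keeps the loop independent of any particular model or labeling tool, which makes it easy to test with stubs before wiring in a real classifier.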

Topic Modeling for Interpretability

To go beyond classification and extract interpretable themes from the data, a topic modeling pipeline was applied:

Reviews were embedded using BERTimbau to capture nuanced semantic information in Portuguese.
UMAP was used for dimensionality reduction of the BERT embeddings.
The reduced embeddings were clustered using HDBSCAN, a density-based algorithm effective at discovering clusters of varying shapes and sizes.
Within each resulting cluster, KeyBERT was used to extract keywords that summarized the dominant topic.
Topic modeling was performed separately for positive reviews (4–5 stars) and negative reviews (1–2 stars) to better isolate patterns.

This revealed key findings such as:

Positive reviews consistently highlighted excellent veterinarians, caring and friendly service, clean and organized facilities, and affectionate treatment of pets. Emotional expressions like “amoo” (“I loove it”), “ficou lindaaaaaa” (“she looks gorgeous”), and “maravilhosos” (“wonderful”) were frequently used to convey strong satisfaction.
Negative reviews often pointed to poor service, rude staff, maltreatment of animals, and distrust in diagnoses, sometimes expressed with sarcasm or disbelief.
These insights help businesses understand customer perceptions and guide improvements in service and care.

Figure 2. Topic Modeling Pipeline Using BERT Embeddings and KeyBERT.
Customer reviews are preprocessed and transformed into semantic embeddings. UMAP reduces dimensionality, HDBSCAN identifies topic clusters, and KeyBERT extracts representative keywords for interpretation.
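Two pieces of glue logic around this pipeline can be sketched in plain Python: splitting reviews by polarity before modeling, and grouping texts by cluster label so that a keyword extractor can summarize each cluster. The helper names and the `rating`/`text` fields are assumptions for illustration. Skipping the label -1 follows HDBSCAN's convention of marking noise points with -1; the labels themselves would come from the clustering step, which is not reproduced here.

```python
from collections import defaultdict


def split_by_polarity(reviews):
    """Return (positive_texts, negative_texts); 3-star reviews are left out."""
    positive = [r["text"] for r in reviews if r["rating"] >= 4]
    negative = [r["text"] for r in reviews if r["rating"] <= 2]
    return positive, negative


def documents_per_cluster(texts, labels):
    """Join the texts of each cluster into one document per cluster.

    `labels` holds one cluster id per text; -1 marks noise and is ignored.
    """
    grouped = defaultdict(list)
    for text, label in zip(texts, labels):
        if label == -1:
            continue
        grouped[label].append(text)
    return {label: " ".join(docs) for label, docs in grouped.items()}
```

Each joined document could then be passed to a keyword extractor such as KeyBERT to obtain the terms that characterize that cluster.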

Results

BERTimbau successfully learned to classify review sentiment with increasing accuracy as new examples were added via Active Learning.
The Active Learning strategy reduced labeling effort by concentrating annotation on the most informative samples.
Topic modeling uncovered concrete drivers of both satisfaction and dissatisfaction, providing a second layer of value beyond simple classification.
The project demonstrated that star ratings alone are not reliable indicators of sentiment, and that a model trained on review text can offer a more nuanced understanding of customer feedback.

Figure 3. Comparison Between Actual and Predicted Star Ratings.
The chart shows how the model’s predicted sentiment ratings align with the original user-provided star ratings, highlighting discrepancies and subjectivity in user annotations.
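A comparison like the one in Figure 3 can be tabulated with a few lines of Python. The sketch below counts how often each (actual, predicted) pair occurs and lists reviews where the two ratings differ by a chosen margin. The function names and the two-star margin are illustrative assumptions, and the ratings in the usage note are made up purely to show the call pattern.

```python
from collections import Counter


def rating_confusion(actual, predicted):
    """Count occurrences of each (actual_star, predicted_star) pair."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))


def large_disagreements(actual, predicted, min_gap=2):
    """Return indices where the ratings differ by at least `min_gap` stars."""
    return [
        i for i, (a, p) in enumerate(zip(actual, predicted))
        if abs(a - p) >= min_gap
    ]
```

Calling `large_disagreements([5, 1, 4, 5], [5, 4, 4, 2])` returns `[1, 3]`, the positions of the reviews whose text-based prediction departs most from the star rating. Such cases are natural candidates for manual inspection.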

Figure 4. UMAP Projection of Clusters from Positive Reviews.
Semantic embeddings of 4- and 5-star reviews were reduced using UMAP and clustered with HDBSCAN. The visualization reveals distinct thematic groupings within positive customer feedback.

Why This Project Matters

This project highlights the value of applying domain-specific NLP to customer reviews in Brazilian Portuguese. By combining deep learning, interpretability, and Active Learning, it delivers a cost-effective and accurate sentiment analysis pipeline. Beyond immediate insights, it also demonstrates how a flexible architecture can be reused or adapted for other types of business reviews, whether in different industries or local domains, making it a scalable and practical solution for analyzing customer sentiment in low-resource languages.

GitHub Repository

The full source code, notebooks, and a reproducible pipeline are available at:
https://github.com/ricardo-yos/bertimbau-sentiment-analysis
