This project combines BERTimbau and Active Learning to classify the sentiment of pet service reviews in Brazilian Portuguese. Starting with 500 labeled examples, entropy-based selection identifies the most uncertain reviews for manual annotation, improving model performance with less effort. Additionally, topic modeling using BERT embeddings and KeyBERT extracts key themes from both positive and negative reviews, offering valuable insights into customer satisfaction and dissatisfaction.
This project applies Natural Language Processing (NLP) and Active Learning to classify sentiment in customer reviews written in Brazilian Portuguese, focusing on pet-related businesses in Santo André, Brazil. Using the BERTimbau model, a transformer pre-trained for Portuguese, the system learns to predict review sentiment based on text rather than relying solely on user-provided star ratings.
To reduce the cost of manual labeling, an entropy-based Active Learning loop was implemented, allowing the model to focus on the most uncertain and informative examples. In parallel, topic modeling was performed using UMAP, HDBSCAN, and KeyBERT, producing interpretable topics that highlight common patterns in both positive and negative reviews.
The result is a practical and efficient NLP pipeline that delivers accurate sentiment predictions while also helping businesses understand key drivers of customer satisfaction and complaints.
The dataset was collected using the Google Places API and consists of customer reviews written in Brazilian Portuguese. The businesses include pet shops, veterinary clinics, and grooming centers.
Each review includes:
Text length and quality were validated to ensure informative content. Very short or incomplete reviews were excluded during preprocessing.
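As a rough illustration of the length check described above, the filter could look like the sketch below. The three-word threshold and the function names `is_informative` and `filter_reviews` are assumptions made for this example; the project's actual rules may differ.

```python
import re
from typing import List

MIN_WORDS = 3  # hypothetical threshold, not taken from the project


def is_informative(text: str, min_words: int = MIN_WORDS) -> bool:
    """Return True if the review has at least `min_words` word-like tokens."""
    if not text:
        return False
    # \w+ matches runs of letters and digits, including accented Portuguese letters.
    words = re.findall(r"\w+", text)
    return len(words) >= min_words


def filter_reviews(reviews: List[str]) -> List[str]:
    """Strip surrounding whitespace and keep only reviews that pass the check."""
    cleaned = (r.strip() for r in reviews if r)
    return [r for r in cleaned if is_informative(r)]


sample = ["Ótimo!", "  ", "Atendimento excelente e preço justo.", "Bom"]
kept = filter_reviews(sample)  # only the third review has three or more words
```

A word-count rule is only one possible notion of "informative"; a character-length or language-detection check could be swapped in without changing the structure.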
The model used for classification is neuralmind/bert-base-portuguese-cased, a pre-trained BERT model for Brazilian Portuguese.
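For readers who want to try the checkpoint, the following is a minimal sketch of loading it with the Hugging Face `transformers` library and turning logits into class probabilities. The helper names `load_classifier` and `predict_proba`, the default `num_labels=2`, and `max_length=128` are assumptions for illustration, not settings confirmed by the project.

```python
MODEL_NAME = "neuralmind/bert-base-portuguese-cased"


def load_classifier(num_labels: int = 2):
    """Load the BERTimbau tokenizer and a sequence-classification model.

    num_labels is an assumption; set it to the number of sentiment classes.
    """
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=num_labels
    )
    return tokenizer, model


def predict_proba(texts, tokenizer, model, max_length: int = 128):
    """Return class probabilities, one row per text, as nested Python lists."""
    import torch

    model.eval()
    batch = tokenizer(
        list(texts),
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**batch).logits
    return torch.softmax(logits, dim=-1).tolist()
```

Note that the classification head loaded this way is randomly initialized, so the probabilities are only meaningful after fine-tuning on labeled reviews.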
A key motivation for the project is that many star ratings are inconsistent or misleading when compared to the review text. Classifying with BERTimbau helps overcome this subjectivity because the prediction depends on the linguistic content rather than on the rating the user chose.
To avoid labeling thousands of reviews manually, an Active Learning loop was implemented using entropy-based uncertainty sampling:
1. The model was trained on the initial 500 labeled reviews.
2. Predictions were made on the remaining unlabeled pool.
3. The entropy of each predicted class distribution was calculated to measure the model's uncertainty.
4. The 100 most uncertain examples (those with the highest entropy) were selected and manually labeled.
5. These new examples were added to the training set and the model was retrained.
This cycle was repeated, allowing the model to focus its learning on ambiguous or borderline cases, which led to faster performance gains with fewer labeled samples.
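The selection step of this loop can be sketched in a few lines. The example assumes the classifier's predicted probabilities for the unlabeled pool are already available as a list of per-class distributions; the names `entropy` and `select_most_uncertain` are illustrative, and the project's own code may organize this differently.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    # H(p) = -sum_i p_i * log(p_i); terms with p_i == 0 contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(
    pool_probs: Sequence[Sequence[float]], k: int = 100
) -> List[int]:
    """Return indices of the k pool examples with the highest entropy."""
    scores = [entropy(p) for p in pool_probs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy pool of three reviews with binary predictions (made-up numbers).
pool_probs = [[0.98, 0.02], [0.55, 0.45], [0.80, 0.20]]
to_label = select_most_uncertain(pool_probs, k=2)  # -> [1, 2]
```

The confident prediction (index 0) is left in the pool, while the two distributions closest to uniform are sent for annotation. In the loop described above, `k` would be 100 and the returned indices would be labeled, moved to the training set, and followed by retraining.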
Figure 1. Active Learning loop using entropy-based uncertainty sampling.
The model is initially trained on a small labeled set. At each iteration, it selects the most uncertain examples (based on entropy), which are then manually annotated and added to the training set for retraining.
To go beyond classification and extract interpretable themes from the data, a topic modeling pipeline was applied:
This revealed key findings such as:
Figure 2. Topic Modeling Pipeline Using BERT Embeddings and KeyBERT.
Customer reviews are preprocessed and transformed into semantic embeddings. UMAP reduces dimensionality, HDBSCAN identifies topic clusters, and KeyBERT extracts representative keywords for interpretation.
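The pipeline in Figure 2 might be wired together roughly as follows. This is a sketch under several assumptions: embeddings are computed beforehand and passed in, the UMAP and HDBSCAN hyperparameters are placeholder values, and the KeyBERT backbone `paraphrase-multilingual-MiniLM-L12-v2` is a stand-in rather than the model the project necessarily uses. The helper names `group_by_cluster` and `extract_topics` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_by_cluster(
    texts: Sequence[str], labels: Sequence[int]
) -> Dict[int, List[str]]:
    """Group texts by cluster label, dropping HDBSCAN noise points (label -1)."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for text, label in zip(texts, labels):
        if label != -1:
            groups[int(label)].append(text)
    return dict(groups)


def extract_topics(texts, embeddings, min_cluster_size=10, top_n=5):
    """Cluster precomputed embeddings and describe each cluster with keywords."""
    # Lazy imports: umap-learn, hdbscan and keybert are third-party packages.
    import umap
    import hdbscan
    from keybert import KeyBERT

    # Placeholder hyperparameters; tune them for the actual dataset.
    reduced = umap.UMAP(
        n_neighbors=15, n_components=5, metric="cosine", random_state=42
    ).fit_transform(embeddings)
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)

    # Assumed multilingual keyword model; the project may use a different one.
    kw_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")
    topics = {}
    for label, docs in group_by_cluster(texts, labels).items():
        # Treat each cluster as one long document and extract its keywords.
        topics[label] = kw_model.extract_keywords(
            " ".join(docs), keyphrase_ngram_range=(1, 2), top_n=top_n
        )
    return topics
```

Running `extract_topics` separately on positive and on negative reviews would give one keyword list per cluster for each sentiment, which matches the way the findings are presented here.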
Figure 3. Comparison Between Actual and Predicted Star Ratings.
The chart shows how the model’s predicted sentiment ratings align with the original user-provided star ratings, highlighting discrepancies and subjectivity in user annotations.
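A comparison like the one in Figure 3 can be summarized numerically. The sketch below assumes both the user ratings and the model predictions are expressed as integer star values; the function names and the toy data are made up for illustration and are not results from the project.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def rating_agreement(actual: Sequence[int], predicted: Sequence[int]) -> float:
    """Fraction of reviews where the predicted rating equals the user's rating."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        return 0.0
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


def top_disagreements(
    actual: Sequence[int], predicted: Sequence[int], n: int = 3
) -> List[Tuple[Tuple[int, int], int]]:
    """Most frequent (actual, predicted) pairs where the two ratings differ."""
    pairs = Counter((a, p) for a, p in zip(actual, predicted) if a != p)
    return pairs.most_common(n)


# Toy data, invented for this example only.
actual = [5, 5, 4, 1, 3, 5]
predicted = [5, 2, 4, 1, 1, 2]
agreement = rating_agreement(actual, predicted)  # 3 of 6 match -> 0.5
worst = top_disagreements(actual, predicted, n=1)  # [((5, 2), 2)]
```

Pairs such as a 5-star rating with a low predicted score are the cases worth reading by hand, since they are where the star rating and the text disagree most.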
Figure 4. UMAP Projection of Clusters from Positive Reviews.
Semantic embeddings of 4- and 5-star reviews were reduced using UMAP and clustered with HDBSCAN. The visualization reveals distinct thematic groupings within positive customer feedback.
This project highlights the value of applying domain-specific NLP to customer reviews in Brazilian Portuguese. By combining deep learning, interpretability, and Active Learning, it delivers a cost-effective and accurate sentiment analysis pipeline. Beyond immediate insights, it also demonstrates how a flexible architecture can be reused or adapted for other types of business reviews, whether in different industries or local domains, making it a scalable and practical solution for analyzing customer sentiment in low-resource languages.
Full source code, notebooks, and reproducible pipeline are available at:
https://github.com/ricardo-yos/bertimbau-sentiment-analysis