Imagine you're a small business owner launching a new app. Overnight, hundreds of reviews pour in on app stores and social media. Some rave about its sleek design, others complain about bugs, and a few sit in neutral territory. Manually sifting through this feedback is overwhelming, time-consuming, and prone to bias. What if AI could instantly classify these opinions as positive, negative, or neutral, helping you spot trends, fix issues, and celebrate wins?

This real-world need inspired my Sentiment Analysis project. Born from the frustration of working with unstructured text data, it uses natural language processing (NLP) to turn raw reviews into quantifiable sentiment labels. It is trained on movie reviews but can be adapted to other domains, so businesses, researchers, and developers can make data-driven decisions faster. Let's dive into how it all comes together.
This GitHub-hosted project (palaemezie/Sentiment-Analysis) performs sentiment analysis using the DistilBERT model from Hugging Face. It classifies text reviews as positive or negative (a neutral class is a possible extension) and includes a Streamlit web interface for interactive use, a FastAPI deployment for API access, and a Jupyter notebook that covers training end to end. The model reaches approximately 80.87% accuracy on the test set while staying lightweight enough for real-world applications such as customer feedback analysis.
Key features:
The project uses the IMDB Dataset, a collection of 50,000 movie reviews labelled as positive or negative. This binary setup provides a strong baseline for sentiment classification, simulating real customer feedback scenarios.
To load and preview the data, the notebook starts with:
    import pandas as pd

    # Load the local IMDB dataset
    df = pd.read_csv('IMDB Dataset.csv')
    df.head(5)
This displays sample reviews like:
| review | sentiment |
|---|---|
| One of the other reviewers has mentioned that ... | positive |
| A wonderful little production. ... | positive |
The dataset is split 80-20 for training and testing:
    from sklearn.model_selection import train_test_split

    # Create train and test splits (80-20 split)
    train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
You can explore the full dataset in the repo: IMDB Dataset.csv
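The split above is a plain random split. As an optional variation (not what the notebook is shown doing), passing `stratify` keeps the positive/negative ratio identical in both halves. The sketch below uses a tiny made-up frame, `toy_df`, with the same two column names so that it runs without the CSV.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up frame with the same two columns as the IMDB CSV.
toy_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "sentiment": ["positive", "negative"] * 10,
})

# stratify= keeps the label proportions equal in train and test.
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["sentiment"]
)

print(toy_train["sentiment"].value_counts().to_dict())
print(toy_test["sentiment"].value_counts().to_dict())
```

With a dataset as balanced as IMDB, a random split is usually close to balanced anyway; stratifying matters more when one label is rare.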
Text data needs tokenization to convert reviews into numerical inputs suitable for transformers. We use Hugging Face's AutoTokenizer, loaded from the DistilBERT checkpoint. A custom PyTorch Dataset class handles this:
    import torch
    from torch.utils.data import Dataset


    class IMDBDatasetForTrainer(Dataset):
        def __init__(self, df, tokenizer, max_length=128):
            self.reviews = df['review'].values
            self.sentiments = df['sentiment'].values
            self.tokenizer = tokenizer
            self.max_length = max_length

        def __len__(self):
            return len(self.reviews)

        def __getitem__(self, idx):
            review = str(self.reviews[idx])
            label = 1 if self.sentiments[idx] == 'positive' else 0

            # Tokenize the review
            encoding = self.tokenizer(
                review,
                truncation=True,
                max_length=self.max_length,
                padding='max_length',
                return_tensors='pt'
            )

            # Drop the batch dimension added by return_tensors='pt'
            return {
                'input_ids': encoding['input_ids'][0],
                'attention_mask': encoding['attention_mask'][0],
                'labels': torch.tensor(label, dtype=torch.long)
            }
This ensures truncation to 128 tokens, padding, and label mapping (1 for positive, 0 for negative). Datasets are then created:
    from transformers import AutoTokenizer

    # Initialize tokenizer and create datasets
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    train_dataset = IMDBDatasetForTrainer(train_df, tokenizer)
    test_dataset = IMDBDatasetForTrainer(test_df, tokenizer)
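To make the truncation and padding behaviour concrete, here is a toy sketch in plain Python. It is not the DistilBERT WordPiece tokenizer: the whitespace split, the `toy_encode` name, the tiny vocabulary, and the pad/unknown ids are all assumptions made for illustration. It only mirrors the shape of what `truncation=True`, `max_length`, and `padding='max_length'` produce.

```python
# Toy illustration only: a whitespace "tokenizer" standing in for WordPiece.
PAD_ID = 0  # hypothetical id used for padding
UNK_ID = 1  # hypothetical id for words missing from the toy vocabulary


def toy_encode(text, vocab, max_length):
    """Return fixed-length input_ids and attention_mask for one review."""
    ids = [vocab.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:max_length]            # truncation: drop tokens past max_length
    mask = [1] * len(ids)             # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [PAD_ID] * n_pad           # padding: fill up to max_length
    mask += [0] * n_pad               # 0 marks padding the model should ignore
    return {"input_ids": ids, "attention_mask": mask}


vocab = {"a": 2, "wonderful": 3, "little": 4, "production": 5}
short_review = toy_encode("A wonderful little production", vocab, max_length=6)
long_review = toy_encode("A wonderful little production " * 3, vocab, max_length=6)

print(short_review)
print(long_review)
```

Every example comes out with the same length, which is what lets reviews be stacked into a batch. The real tokenizer also adds special tokens such as `[CLS]` and `[SEP]`, which count toward the 128-token limit.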
DistilBERT is a distilled version of BERT – smaller, faster, and nearly as accurate. We load it for sequence classification with 2 labels:
    from transformers import AutoModelForSequenceClassification

    # Initialize model
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased",
        num_labels=2
    )
To optimize for efficiency, we freeze the base model's parameters and train only the classifier head. This reduces memory use, speeds up training, and helps limit overfitting on smaller datasets.
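The freezing step itself is not shown in this article, so the following is a minimal sketch of one common way to do it, not necessarily the notebook's exact code. To run offline it builds a tiny, randomly initialised DistilBERT from a made-up config; the config values are assumptions chosen only to keep the model small. The freezing loop is the same for a model loaded with `from_pretrained`.

```python
from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Deliberately tiny, randomly initialised config so the sketch runs offline.
# The article's model comes from from_pretrained("distilbert-base-uncased").
config = DistilBertConfig(
    vocab_size=100, max_position_embeddings=32,
    n_layers=1, n_heads=2, dim=32, hidden_dim=64, num_labels=2,
)
model = DistilBertForSequenceClassification(config)

# Freeze the encoder; the classification head stays trainable.
for param in model.distilbert.parameters():
    param.requires_grad = False

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
frozen = [name for name, p in model.named_parameters() if not p.requires_grad]

print("trainable:", trainable)
print("frozen tensors:", len(frozen))
```

Only the `pre_classifier` and `classifier` layers remain trainable, so the optimizer updates a small fraction of the weights.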
Training uses Hugging Face's Trainer API for simplicity. We define accuracy as the metric:
    import numpy as np

    # Define metrics computation function
    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        # Pick the class with the highest logit for each example
        predictions = np.argmax(predictions, axis=1)
        return {"accuracy": (predictions == labels).mean()}
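As a quick sanity check, the metric function can be called by hand with a small batch of logits. The logits and labels below are made up for illustration; the function body is repeated so the snippet runs on its own.

```python
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}


# Made-up logits for four reviews: columns are (negative, positive).
logits = np.array([[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4]])
labels = np.array([0, 1, 1, 1])

# argmax gives [0, 1, 0, 1], so three of the four predictions match.
metrics = compute_metrics((logits, labels))
print(metrics)
```

The function receives raw logits rather than probabilities; `argmax` gives the same predicted class either way.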
The training setup (detailed in the notebook) includes parameters like batch size, epochs (~3 minutes per epoch), and evaluation strategy. The model trains on 40,000 samples, with frozen layers enabling quick convergence.
For the full training code and hyperparameters, check the notebook: sentiment-analysis.ipynb
On the 10,000-sample test set, the model reaches approximately 80.87% accuracy.
This result highlights its balance of accuracy and efficiency, outperforming baselines while remaining lightweight.
Launch a Streamlit app for easy testing:
streamlit run sentiment_interface.py
Enter a review and receive an instant sentiment prediction with confidence scores.
For production, deploy via FastAPI:
uvicorn app:app --reload
Endpoints include:
- /health: Check API status.
- /predict: Single text prediction.
- /predict-batch: Batch processing.

Test with curl:
curl -X POST "http://127.0.0.1:8000/predict" -H "Content-Type: application/json" -d '{"text": "This movie was amazing!"}'
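The same request can be sent from Python with only the standard library. This is a sketch under assumptions: the URL and the `{"text": ...}` payload are taken from the curl command above, while the helper names `build_predict_request` and `predict` are hypothetical and not part of the repo. The shape of the response is not assumed.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/predict"  # same address as the curl example


def build_predict_request(text, url=API_URL):
    """Build (but do not send) a POST request carrying one review."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, url=API_URL, timeout=10):
    """Send the request and decode the JSON reply (needs a running server)."""
    request = build_predict_request(text, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_predict_request("This movie was amazing!")
print(request.get_method(), request.full_url)
```

With the uvicorn server running, `predict("This movie was amazing!")` returns whatever JSON the `/predict` endpoint sends back.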
Programmatically, use:
    from sentiment_analysis import predict_sentiment

    result = predict_sentiment("This is great!")
    print(result)  # {'sentiment': 'positive', 'confidence': 0.95}
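The result dictionary above pairs a label with a confidence. One common way to derive such a pair from a two-class model is to apply softmax to the logits and report the largest probability. The sketch below shows that idea in plain Python. It is an assumption about how a confidence could be computed, not a copy of the repo's `predict_sentiment`; the `logits_to_result` name and the example logits are made up.

```python
import math

LABELS = ("negative", "positive")  # index 0 / 1, matching the training labels


def logits_to_result(logits):
    """Map two raw scores to a label and its softmax probability."""
    shift = max(logits)                            # subtract max for stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"sentiment": LABELS[best], "confidence": probs[best]}


result = logits_to_result([-1.2, 1.8])  # made-up logits
print(result)
```

A confidence near 0.5 means the two classes scored almost equally, which is a reasonable signal for flagging a review for manual inspection.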
Clone the repo and install dependencies (uv recommended for speed):
    git clone https://github.com/palaemezie/Sentiment-Analysis.git
    cd Sentiment-Analysis
    uv venv
    uv pip install -r requirements.txt
Verify:
python -c "import torch, transformers; print('✅ Installation successful!')"
Fork the repo, create a branch, and submit a pull request. For issues, visit GitHub Issues.
This project acknowledges Hugging Face, PyTorch, and the IMDB dataset creators. Dive in, experiment, and let's make sentiment analysis even smarter!