Sentiment Analysis: Unlocking Customer Insights from Text

The Spark: Why Build a Sentiment Analyser?

Imagine you're a small business owner who has just launched a new app, and overnight hundreds of reviews pour in on app stores and social media. Some rave about its sleek design, others complain about bugs, and a few sit in neutral territory. Sifting through this feedback by hand is overwhelming, time-consuming, and prone to bias. What if AI could instantly classify these opinions as positive, negative, or neutral, helping you spot trends, fix issues, and celebrate wins?

This real-world need inspired my Sentiment Analysis project. Born from the frustration of working with unstructured text, it uses natural language processing (NLP) to turn raw reviews into quantifiable sentiment labels. It is trained on movie reviews but can be adapted to other domains, giving businesses, researchers, and developers a faster route to data-driven decisions. Let's dive into how it all comes together.

Project Overview

This GitHub-hosted project (palaemezie/Sentiment-Analysis) performs sentiment analysis with the DistilBERT model from Hugging Face. It classifies text reviews as positive or negative (a neutral class is a possible extension) and includes a Streamlit web interface for interactive use, a FastAPI service for API access, and a Jupyter notebook that covers the full training process. The model reaches 80.87% accuracy on the test set, a workable baseline for applications such as customer feedback analysis.

Key features:

Dataset: IMDB movie reviews (50,000 labelled examples).
Model: DistilBERT-base-uncased for sequence classification.
Training: Fine-tuned with frozen base parameters for speed and efficiency.
Deployment: Web UI and API endpoints.

Dataset: The Foundation of Insights

The project uses the IMDB Dataset, a collection of 50,000 movie reviews labelled as positive or negative. This binary setup provides a strong baseline for sentiment classification, simulating real customer feedback scenarios.

To load and preview the data, the notebook starts with:

import pandas as pd

# Load the local IMDB dataset
df = pd.read_csv('IMDB Dataset.csv')
df.head(5)

This displays sample reviews like:

review	sentiment
One of the other reviewers has mentioned that ...	positive
A wonderful little production. ...	positive

The dataset is split 80-20 for training and testing:

from sklearn.model_selection import train_test_split

# Create train and test splits (80-20 split); random_state makes the split reproducible
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

You can explore the full dataset in the repo: IMDB Dataset.csv

Preprocessing: Preparing Text for the Model

Text data needs tokenization to convert reviews into numerical inputs suitable for transformers. We use Hugging Face's AutoTokenizer from the DistilBERT model. A custom PyTorch Dataset class handles this:

import torch
from torch.utils.data import Dataset

class IMDBDatasetForTrainer(Dataset):
    def __init__(self, df, tokenizer, max_length=128):
        self.reviews = df['review'].values
        self.sentiments = df['sentiment'].values
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.reviews)

    def __getitem__(self, idx):
        review = str(self.reviews[idx])
        # Map the string label to an integer class id
        label = 1 if self.sentiments[idx] == 'positive' else 0

        # Tokenize the review to a fixed length of max_length tokens
        encoding = self.tokenizer(
            review,
            truncation=True,
            max_length=self.max_length,
            padding='max_length',
            return_tensors='pt'
        )

        # The tokenizer returns tensors of shape (1, max_length); [0] drops the batch dimension
        return {
            'input_ids': encoding['input_ids'][0],
            'attention_mask': encoding['attention_mask'][0],
            'labels': torch.tensor(label, dtype=torch.long)
        }

This ensures truncation to 128 tokens, padding, and label mapping (1 for positive, 0 for negative). Datasets are then created:

from transformers import AutoTokenizer

# Initialize tokenizer and create datasets
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
train_dataset = IMDBDatasetForTrainer(train_df, tokenizer)
test_dataset = IMDBDatasetForTrainer(test_df, tokenizer)
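
To make the truncation and padding step concrete, here is a toy sketch in plain Python of what a fixed length of `max_length` means for `input_ids` and `attention_mask`. It is an illustration only, not the tokenizer's implementation: the helper name `pad_or_truncate` and the token ids are illustrative, and the real tokenizer also keeps the closing special token when it truncates, which this simple slice does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncation: drop anything past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)
    ids += [pad_id] * padding            # pad on the right up to max_length
    mask += [0] * padding                # 0 marks padding the model should ignore
    return ids, mask


# A short review is padded; a long one is cut off.
short_ids, short_mask = pad_or_truncate([101, 2023, 3185, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example ends up the same length, the examples can be stacked into batches without a custom collate function. The cost is that reviews longer than 128 tokens lose their tail.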

Model Architecture: Leveraging DistilBERT

DistilBERT is a distilled version of BERT: according to its authors, it is about 40% smaller and 60% faster while retaining roughly 97% of BERT's language-understanding performance. We load it for sequence classification with 2 labels:

from transformers import AutoModelForSequenceClassification

# Initialize model with a 2-class classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2
)

To optimize for efficiency, we freeze the base model's parameters and train only the classifier head. The trade-off is that a frozen encoder usually reaches lower accuracy than full fine-tuning would. Benefits include:

Faster Training: Only classifier layers update.
Less Memory Usage: No gradients are stored for the frozen layers, which suits limited hardware.
Prevents Overfitting: Far fewer trainable parameters are fitted to the data.
Stable Features: Keeps DistilBERT's learned representations intact.
Good for Small Datasets: Avoids destroying pre-trained knowledge.
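
The freezing step itself is not shown above, so the following is a minimal sketch of one common way to do it, not necessarily the notebook's code. The helper name `freeze_base_model` is hypothetical. It relies on the `base_model` attribute that Hugging Face models expose and on the standard PyTorch `requires_grad` flag.

```python
def freeze_base_model(model):
    """Freeze the pre-trained encoder and report (trainable, total) parameter counts.

    Assumes `model` exposes `base_model` (as Hugging Face PreTrainedModel
    subclasses do) and PyTorch-style parameters.
    """
    # Turn off gradient tracking for every encoder parameter
    for param in model.base_model.parameters():
        param.requires_grad = False

    # Whatever still requires gradients is the classification head
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total
```

Calling `trainable, total = freeze_base_model(model)` on the model loaded above should leave only the classification head trainable. For `distilbert-base-uncased` with two labels, that is roughly 0.6 million of about 67 million parameters.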

Training: Fine-Tuning with Hugging Face Trainer

Training uses Hugging Face's Trainer API for simplicity. We define accuracy as the metric:

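import numpy as np

# Define metrics computation function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred  # model logits and true labels
    # Pick the highest-scoring class for each example
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}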
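
As a quick sanity check, the metric function can be exercised on a handful of made-up logits before it is handed to the Trainer. The numbers below are invented for illustration and have nothing to do with the project's results.

```python
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}


# Four toy examples: each row holds [negative_logit, positive_logit]
toy_logits = np.array([
    [2.0, -1.0],   # predicted 0
    [0.3, 0.9],    # predicted 1
    [-0.5, 1.5],   # predicted 1 (wrong, the true label is 0)
    [1.2, 0.1],    # predicted 0
])
toy_labels = np.array([0, 1, 0, 0])

metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)  # three of the four predictions match, so accuracy is 0.75
```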

The training setup (detailed in the notebook) specifies the batch size, number of epochs, and evaluation strategy. The model trains on the 40,000-review training split, and because the base layers are frozen, each epoch takes only about 3 minutes.

For the full training code and hyperparameters, check the notebook: sentiment-analysis.ipynb

Evaluation: Measuring Performance

On the 10,000-sample test set, the model delivers:

Accuracy: 80.87%
Loss: 0.4140
Inference Speed: ~58.7 samples/second

These metrics show a reasonable balance of accuracy and efficiency: the model scores well above the roughly 50% expected from guessing on a balanced binary task, while training only a lightweight classifier head.
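
To put the reported numbers in concrete terms, the short calculation below converts them into counts and durations. It uses only the figures quoted above. It also assumes that the reported loss is a mean cross-entropy in nats, which is the usual case for this kind of model but is not stated in the source.

```python
import math

test_size = 10_000     # reviews in the test split
accuracy = 0.8087      # reported accuracy
loss = 0.4140          # reported loss (assumed: mean cross-entropy in nats)
throughput = 58.7      # reported samples per second

correct = round(test_size * accuracy)
misclassified = test_size - correct
eval_seconds = test_size / throughput

# exp(-mean cross-entropy) is the geometric mean of the probability
# the model assigned to the correct label
typical_true_class_prob = math.exp(-loss)

print(f"correct: {correct}, misclassified: {misclassified}")
print(f"full evaluation pass: about {eval_seconds:.0f} s")
print(f"typical probability on the true label: {typical_true_class_prob:.3f}")
```

So about 1,900 of the 10,000 test reviews are misclassified, and a full pass over the test set takes a little under three minutes at the reported throughput.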

Usage: From Notebook to Production

Interactive Web Interface

Launch a Streamlit app for easy testing:

streamlit run sentiment_interface.py

Enter a review and receive an instant sentiment prediction with confidence scores.
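
The interface code is not reproduced here, so how the confidence score is computed is an assumption. A common choice is the softmax probability of the predicted class. The sketch below shows that calculation in plain Python; the function name `logits_to_prediction` and the example logits are illustrative, and the label order matches the dataset class above (0 negative, 1 positive).

```python
import math

LABELS = {0: "negative", 1: "positive"}


def logits_to_prediction(logits):
    """Turn raw class scores into a label and a softmax confidence."""
    # Subtract the max before exponentiating for numerical stability
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # The predicted class is the one with the highest probability
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"sentiment": LABELS[best], "confidence": probs[best]}


result = logits_to_prediction([-1.2, 1.7])
print(result["sentiment"], round(result["confidence"], 3))
```

With two classes, a confidence computed this way always lies between 0.5 and 1.0, so values near 0.5 indicate that the model is close to undecided.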

API Deployment

To serve the model over HTTP, run the FastAPI app with uvicorn (the --reload flag is meant for development; drop it in production):

uvicorn app:app --reload

Endpoints include:

/health: Check API status.
/predict: Single text prediction.
/predict-batch: Batch processing.

Test with curl:

curl -X POST "http://127.0.0.1:8000/predict" -H "Content-Type: application/json" -d '{"text": "This movie was amazing!"}'
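
The same request can be sent from Python with only the standard library. This is a sketch that mirrors the curl call above: the helper names are hypothetical, the base URL assumes a local uvicorn server on its default port, and the response is returned as parsed JSON without assuming its exact fields.

```python
import json
import urllib.request


def build_predict_request(text, base_url="http://127.0.0.1:8000"):
    """Build (but do not send) a POST request for the /predict endpoint."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, base_url="http://127.0.0.1:8000", timeout=10):
    """Send the request and return the decoded JSON body."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running, `predict("This movie was amazing!")` returns whatever JSON the /predict endpoint produces.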

Programmatically, use:

from sentiment_analysis import predict_sentiment
result = predict_sentiment("This is great!")
print(result)  # {'sentiment': 'positive', 'confidence': 0.95}
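
The overview mentions a neutral class as a possible extension. One simple way to approximate it without retraining is to treat low-confidence predictions as neutral. The sketch below assumes a result dictionary shaped like the one printed above; the function name `with_neutral_band` and the 0.6 threshold are hypothetical choices, not part of the project.

```python
def with_neutral_band(result, threshold=0.6):
    """Relabel a low-confidence binary prediction as 'neutral'.

    `result` is assumed to look like {'sentiment': ..., 'confidence': ...}.
    """
    if result["confidence"] < threshold:
        # Keep the original confidence so callers can still inspect it
        return {**result, "sentiment": "neutral"}
    return dict(result)


confident = with_neutral_band({"sentiment": "positive", "confidence": 0.95})
uncertain = with_neutral_band({"sentiment": "negative", "confidence": 0.55})
print(confident["sentiment"], uncertain["sentiment"])
```

If the confidence is a two-class softmax probability, it cannot fall below 0.5, so a useful threshold sits somewhere between 0.5 and 1.0 and is best tuned on labelled examples from the target domain.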

Installation and Getting Started

Clone the repo and install dependencies (uv recommended for speed):

git clone https://github.com/palaemezie/Sentiment-Analysis.git
cd Sentiment-Analysis
uv venv
uv pip install -r requirements.txt

Verify the installation (activate the virtual environment first so that python resolves to it):

python -c "import torch, transformers; print('✅ Installation successful!')"

Contributing and Support

Fork the repo, create a branch, and submit a pull request. For issues, visit GitHub Issues.

This project acknowledges Hugging Face, PyTorch, and the IMDB dataset creators. Dive in, experiment, and let's make sentiment analysis even smarter!