Sentiment Analysis of IMDB Dataset
Abstract
This project implements an end-to-end sentiment analysis pipeline on the IMDB Movie Reviews Dataset. The pipeline covers data acquisition and storage in an SQLite database, preprocessing and exploratory analysis, training a machine learning model to classify reviews as positive or negative, and deployment of the trained model through a Flask API for real-time predictions. Several machine learning algorithms are compared using accuracy, precision, recall, and F1-score. The best-performing model classifies review sentiment with high accuracy, which suggests the pipeline is applicable to real-world use. The project closes with its contributions to natural language processing tasks and possible directions for future work.
Introduction
This project implements an end-to-end sentiment analysis pipeline using the IMDB Movie Reviews Dataset. It includes:
Data acquisition and storage in an SQLite database.
Data preprocessing and exploratory analysis.
Training a machine learning model to classify reviews as positive or negative.
Deploying the trained model via a Flask API for real-time predictions.
Methodology
Data Acquisition & Storage
The IMDB dataset is sourced from Hugging Face and stored in an SQLite database (imdb_reviews.db) using data_setup.py. This script also establishes the database schema defined in imdb_schema.sql.
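The storage step can be sketched as follows. This is a minimal illustration, not the contents of data_setup.py or imdb_schema.sql: the table name, column names, and the store_reviews helper are assumptions made for the example.

```python
import sqlite3

# Hypothetical schema for illustration; the project's imdb_schema.sql may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS reviews (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    split TEXT NOT NULL,
    text TEXT NOT NULL,
    label INTEGER NOT NULL CHECK (label IN (0, 1))
);
"""


def store_reviews(conn, rows, split):
    """Insert (text, label) pairs and return how many rows the split now holds."""
    conn.executescript(SCHEMA)
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT INTO reviews (split, text, label) VALUES (?, ?, ?)",
            ((split, text, int(label)) for text, label in rows),
        )
    cursor = conn.execute(
        "SELECT COUNT(*) FROM reviews WHERE split = ?", (split,)
    )
    return cursor.fetchone()[0]
```

In practice the rows would come from the downloaded dataset, for example by iterating over the training split and passing (text, label) pairs, with the connection opened on imdb_reviews.db rather than an in-memory database.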
Data Preprocessing & Exploratory Analysis
Preprocessing steps include text cleaning, tokenization, and vectorization. Exploratory analysis is conducted to understand data distribution and sentiment trends.
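The three preprocessing steps can be illustrated with a small standard-library sketch. It assumes that cleaning means removing HTML tags (IMDB reviews often contain line-break tags) and lowercasing, and it uses a plain bag-of-words count as the vectorizer; the project's actual cleaning rules and vectorizer are not specified here, and all function names are hypothetical.

```python
import html
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")       # matches HTML tags such as <br />
TOKEN_RE = re.compile(r"[a-z0-9']+")  # lowercase word-like tokens


def clean_text(raw):
    """Decode HTML entities, replace tags with spaces, and lowercase."""
    text = html.unescape(raw)
    text = TAG_RE.sub(" ", text)
    return text.lower()


def tokenize(raw):
    """Split a cleaned review into word tokens."""
    return TOKEN_RE.findall(clean_text(raw))


def build_vocabulary(documents):
    """Map each distinct token to a column index, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for token in tokenize(doc):
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(raw, vocab):
    """Return a bag-of-words count vector; out-of-vocabulary tokens are ignored."""
    counts = Counter(tokenize(raw))
    vector = [0] * len(vocab)
    for token, n in counts.items():
        if token in vocab:
            vector[vocab[token]] = n
    return vector
```

The vocabulary is built from the training reviews only, so that words seen only at prediction time do not change the feature layout.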
Model Training
A machine learning model is trained to classify reviews as positive or negative. Training is handled by train_model.py, which uses scikit-learn.
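A training step of this kind might look like the sketch below. The choice of TF-IDF features with logistic regression is an assumption made for illustration; the algorithm actually used in train_model.py is not stated, and train_sentiment_model is a hypothetical name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train_sentiment_model(texts, labels):
    """Fit a TF-IDF + logistic regression pipeline (assumed model choice).

    Labels follow the convention 1 = positive, 0 = negative.
    """
    model = make_pipeline(
        TfidfVectorizer(lowercase=True),
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, labels)
    return model
```

Bundling the vectorizer and classifier in one pipeline means the fitted object accepts raw review text directly, which simplifies saving the model and serving it later.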
Model Deployment
The trained model is deployed using a Flask API (app.py), enabling real-time sentiment predictions for new movie reviews.
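A prediction endpoint could be structured roughly as follows. The route name /predict, the JSON field "review", and the response shape are assumptions for this sketch and may not match app.py; the model is passed in so that any object with a scikit-learn-style predict method can be used.

```python
from flask import Flask, jsonify, request


def create_app(model):
    """Build a Flask app around a fitted model (hypothetical layout)."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True)
        review = payload.get("review") if isinstance(payload, dict) else None
        if not isinstance(review, str) or not review.strip():
            return jsonify(error="Field 'review' must be a non-empty string."), 400
        # Assumed label convention: 1 = positive, 0 = negative.
        label = int(model.predict([review])[0])
        return jsonify(sentiment="positive" if label == 1 else "negative")

    return app
```

A client would then send a POST request with a JSON body such as {"review": "A moving, well-acted film."} and receive the predicted sentiment in the response.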
Experiments
Various machine learning algorithms are evaluated to determine the most effective model for sentiment classification. Performance metrics such as accuracy, precision, recall, and F1-score are used to assess model efficacy.
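For reference, the four metrics can be computed from the confusion counts as in the sketch below. It is written in plain Python to make the definitions explicit; the function name is hypothetical, and the positive class is assumed to be labeled 1.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

With three positive and five negative reviews, two true positives, one false positive, and one false negative, this gives an accuracy of 0.75 and precision, recall, and F1 of 2/3 each. scikit-learn provides equivalent functions in sklearn.metrics, which would be the natural choice inside the training script.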
Results
The best-performing model classifies movie reviews with high accuracy, which indicates that the preprocessing and training pipeline is effective. The Flask API serves predictions in real time, which supports the model's practical applicability.
Conclusion
This project successfully implements a comprehensive sentiment analysis system, from data ingestion to model deployment. The integration of data storage, preprocessing, machine learning, and web deployment showcases a robust approach to natural language processing tasks.