This project focuses on developing a movie recommender system using natural language processing (NLP) techniques and machine learning. The dataset used for this system includes various features like genres, keywords, cast, and crew information. The methodology involves data preprocessing, feature extraction, and recommendation generation based on cosine similarity.
A movie recommender system suggests films to users based on their preferences. Recommender systems play a crucial role in enhancing user experience by filtering large sets of items and providing personalized suggestions. This project leverages NLP techniques like text vectorization and machine learning algorithms to recommend movies.
Data Preprocessing
Data Loading: We started by importing the required libraries and loading the dataset, which consists of two files: tmdb_5000_movies.csv and tmdb_5000_credits.csv.
Data Merging: The two datasets were merged on the title column to create a unified dataset for analysis.
Data Cleaning: Unnecessary columns like homepage and tagline were removed, and missing values were handled appropriately.
Text Normalization: Genres, keywords, cast, and crew information were extracted, cleaned, and normalized to ensure consistency in the dataset.
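The preprocessing steps above can be sketched roughly as follows. This is a minimal sketch rather than the project's actual code. It assumes that the genres, keywords, cast, and crew columns hold JSON-encoded lists of objects with a "name" field (plus a "job" field for crew), that only directors are kept from crew, and that rows with missing values are dropped. The helper names and the toy rows are hypothetical.

```python
import json

import pandas as pd


def names_from_json(cell):
    """Return the "name" field of every entry in a JSON-encoded list."""
    return [entry["name"] for entry in json.loads(cell)]


def directors_from_json(cell):
    """Return the names of crew entries whose job is "Director"."""
    return [e["name"] for e in json.loads(cell) if e.get("job") == "Director"]


def strip_spaces(names):
    """Collapse multi-word names into one token, e.g. "Jane Doe" -> "JaneDoe"."""
    return [name.replace(" ", "") for name in names]


def preprocess(movies, credits):
    # Merge the two tables on the shared title column.
    df = movies.merge(credits, on="title")
    # Drop columns that are not needed; ignore any that are absent.
    df = df.drop(columns=["homepage", "tagline"], errors="ignore")
    # Assumed policy for missing values: drop incomplete rows.
    df = df.dropna(subset=["genres", "keywords", "cast", "crew"]).copy()
    for column in ["genres", "keywords", "cast"]:
        df[column] = df[column].apply(names_from_json).apply(strip_spaces)
    df["crew"] = df["crew"].apply(directors_from_json).apply(strip_spaces)
    return df


# Toy rows standing in for the two CSV files (hypothetical values).
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "homepage": ["http://example.com/a", None],
    "tagline": ["A tagline", "Another tagline"],
    "genres": ['[{"name": "Action"}, {"name": "Science Fiction"}]',
               '[{"name": "Drama"}]'],
    "keywords": ['[{"name": "space war"}]', None],
})
credits = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "cast": ['[{"name": "Jane Doe"}]', '[{"name": "John Doe"}]'],
    "crew": ['[{"name": "Jane Roe", "job": "Director"}, '
             '{"name": "Sam Poe", "job": "Editor"}]',
             '[{"name": "John Roe", "job": "Director"}]'],
})

clean = preprocess(movies, credits)
print(clean[["title", "genres", "keywords", "cast", "crew"]])
```

With the real files, the two toy frames would be replaced by pd.read_csv("tmdb_5000_movies.csv") and pd.read_csv("tmdb_5000_credits.csv").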
Feature Extraction
Bag of Words (BoW): We used the Bag of Words model for text vectorization. This involved tokenizing the text data and converting each movie's text into a numerical vector of word counts.
Stemming: The words were reduced to their stems using the Porter Stemmer, so that different variations of the same word (for example, "loved" and "loving") are counted as one term.
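To make the two ideas above concrete, here is a small illustrative sketch. It assumes NLTK is installed and uses its PorterStemmer; the bag_of_words helper is a hand-rolled, hypothetical function written only to show what a count vector looks like, not the vectorizer used in the project.

```python
from collections import Counter

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()


def stem_text(text):
    """Lowercase the text, split on whitespace, and stem every token."""
    return " ".join(stemmer.stem(token) for token in text.lower().split())


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of documents."""
    tokenized = [doc.split() for doc in docs]
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One count per vocabulary word, in vocabulary order.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical tag strings for two movies.
docs = [stem_text("loved loving action"), stem_text("action adventure")]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(vectors)
```

Because "loved" and "loving" share a stem, they land in the same vocabulary slot and that slot gets a count of 2 for the first document.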
Similarity Measure
Cosine Similarity: We computed the cosine similarity between movie vectors to identify the closest matches. This helped in generating recommendations based on the similarity of movies.
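The similarity step can be illustrated with a short sketch. Cosine similarity between two vectors is their dot product divided by the product of their lengths. The code below assumes NumPy, no all-zero rows, and a tiny made-up count matrix; the recommend function and its top_n parameter are hypothetical names for illustration, not the project's actual interface.

```python
import numpy as np

# Hypothetical titles and count vectors (one row per movie).
titles = ["Movie A", "Movie B", "Movie C"]
vectors = np.array([
    [2, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 3],
])


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms  # scale every row to unit length
    return unit @ unit.T    # dot products of unit vectors


similarity = cosine_similarity_matrix(vectors)


def recommend(title, top_n=5):
    """Return up to top_n titles most similar to the given title."""
    idx = titles.index(title)
    # Sort all movies by similarity to the query, highest first.
    ranked = np.argsort(similarity[idx])[::-1]
    return [titles[i] for i in ranked if i != idx][:top_n]


print(recommend("Movie A", top_n=2))
```

In this toy matrix, "Movie A" and "Movie B" share terms while "Movie C" shares none with them, so "Movie B" ranks first for "Movie A".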
Exploratory Data Analysis (EDA)
Language Distribution: We visualized the distribution of movies by their original language, which revealed a dominance of English-language films.
Budget and Revenue Analysis: We analyzed the movies with the highest budgets and the highest revenues, and visualized the top five in each category using bar charts.
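The numbers behind these charts can be computed along the following lines. This is a sketch under assumptions: it uses pandas, assumes the movies table has original_language, budget, and revenue columns, and runs on made-up rows rather than the real dataset.

```python
import pandas as pd

# Hypothetical rows standing in for tmdb_5000_movies.csv.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E", "Movie F"],
    "original_language": ["en", "en", "fr", "en", "ja", "en"],
    "budget": [100, 250, 50, 300, 10, 400],
    "revenue": [500, 200, 900, 100, 50, 700],
})

# Number of movies per original language, most common first.
language_counts = movies["original_language"].value_counts()

# The five movies with the largest budget and the largest revenue.
top_budget = movies.nlargest(5, "budget")[["title", "budget"]]
top_revenue = movies.nlargest(5, "revenue")[["title", "revenue"]]

print(language_counts)
print(top_budget)
print(top_revenue)
```

If matplotlib is installed, language_counts.plot.bar() and top_budget.plot.bar(x="title", y="budget") would draw the corresponding bar charts.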
Data Transformation
Feature Engineering: Combined multiple features into a unified tags column, representing the essential information for each movie.
Vectorization: We converted the combined tags into numerical vectors using CountVectorizer, limiting the vocabulary to the 5,000 most frequent words.
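The two transformation steps above might look roughly like this. The sketch assumes pandas and scikit-learn, and it assumes the feature columns already hold lists of cleaned tokens; the build_tags helper, the choice to lowercase the tags, and the toy rows are hypothetical. Only max_features=5000 comes from the description above.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical rows with already-cleaned list columns.
df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genres": [["Action", "ScienceFiction"], ["Drama"]],
    "keywords": [["spacewar"], ["family"]],
    "cast": [["JaneDoe"], ["JohnDoe"]],
    "crew": [["JaneRoe"], ["JohnRoe"]],
})

feature_columns = ["genres", "keywords", "cast", "crew"]


def build_tags(row):
    """Concatenate the token lists of one row into a single lowercase string."""
    tokens = [token for column in feature_columns for token in row[column]]
    return " ".join(tokens).lower()


df["tags"] = df.apply(build_tags, axis=1)

# Keep at most the 5,000 most frequent terms across all tags.
vectorizer = CountVectorizer(max_features=5000)
vectors = vectorizer.fit_transform(df["tags"]).toarray()

print(df[["title", "tags"]])
print(vectors.shape)
```

Each row of vectors is the count vector for one movie, with one column per vocabulary term.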
Recommendation Quality: The movie recommender system successfully suggests relevant movies based on the user's input by finding the highest cosine similarity with other movies in the dataset.
Model Performance: The Bag of Words and cosine similarity approach tended to recommend movies that share genres, keywords, and cast with the input movie. To see the end product please refer:
The movie recommender system developed in this project effectively recommends films using NLP and machine learning techniques. The approach of combining genres, keywords, cast, and crew information into a single feature vector significantly improved the relevance of the recommendations. Future work could include enhancing the model's accuracy by exploring deep learning techniques or incorporating user ratings for more personalized recommendations.