Understanding public sentiment has become important for businesses, governments, and influencers alike. My project, Social Media Sentiment Analysis, uses natural language processing and machine learning to classify the emotions and opinions expressed in social media posts. By labeling each post as positive, negative, or neutral, it gives organizations a view of audience mood that can inform their communication strategies and decisions.
In this project, I analyzed social media sentiment to understand public opinion and trends, using the Tweets dataset (Tweets.csv), whose airline_sentiment column labels each tweet as positive, negative, or neutral. Here are some key highlights:
Handling missing data is an important step before any modeling. I filled the gaps in each categorical column with that column's most frequent value (the mode) and the gaps in negativereason_confidence with the column median, so that no column is left with missing values.
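The imputation idea can be sketched on a toy table. This is a minimal illustration, not the project's exact code: the `toy` frame, its values, and the `impute` helper are assumptions made up for the example, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the tweets data; the values are invented.
toy = pd.DataFrame({
    "negativereason": ["Late Flight", None, "Late Flight", "Lost Luggage", None],
    "negativereason_confidence": [0.7, np.nan, 1.0, 0.4, np.nan],
})

def impute(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and other columns with the mode."""
    filled = frame.copy()
    for column in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[column]):
            filled[column] = filled[column].fillna(filled[column].median())
        else:
            # mode() ignores missing values by default
            filled[column] = filled[column].fillna(filled[column].mode()[0])
    return filled

imputed = impute(toy)
print(imputed)
```

Mode and median are simple choices that keep every row. For columns that are mostly empty, they also make the filled values look far more uniform than the real data probably is.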
Machine Learning Algorithm - LogisticRegressionCV:
For the sentiment classification task, I chose LogisticRegressionCV. It fits a logistic regression model and uses built-in cross-validation to select the regularization strength, which improves model performance and helps limit overfitting.
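As a rough sketch of what the cross-validated search does, the example below fits LogisticRegressionCV on synthetic three-class data. The data, the `Cs=5` grid, and `cv=3` are illustrative assumptions. The project itself keeps the defaults apart from `max_iter=1000`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the vectorized tweets.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=6,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Cs=5 tries five regularization strengths on a log scale; cv=3 scores
# each candidate with three-fold cross-validation on the training split.
clf = LogisticRegressionCV(Cs=5, cv=3, max_iter=1000)
clf.fit(X_train, y_train)

held_out_accuracy = clf.score(X_test, y_test)
print("Selected C:", clf.C_)
print("Held-out accuracy:", held_out_accuracy)
```

The search only ever sees the training split, so the held-out accuracy remains an independent check on the selected model.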
For text preprocessing, I used NLTK and scikit-learn. Each tweet is stripped of non-letter characters, lowercased, split into tokens, filtered for English stopwords, and stemmed with the Porter stemmer. The cleaned text is then converted to TF-IDF features so that the model can classify the sentiment expressed in each post.
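The preprocessing steps can be sketched as a single function. To run without downloading the NLTK stopword corpus, this version uses a small hand-picked `STOP_WORDS` set. That set, the `preprocess` name, and the sample tweet are assumptions for illustration only.

```python
import re

from nltk.stem.porter import PorterStemmer

# Small hand-picked stopword set so the sketch runs without the NLTK corpus;
# the project itself uses stopwords.words('english').
STOP_WORDS = {"the", "was", "were", "and", "my", "is", "a", "to", "for", "of", "i"}
stemmer = PorterStemmer()

def preprocess(text: str) -> list[str]:
    """Keep letters only, lowercase, drop stopwords, and stem each token."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text).lower()
    tokens = letters_only.split()
    return [stemmer.stem(tok) for tok in tokens if tok not in STOP_WORDS]

tokens = preprocess("@united my flights were cancelled and the crew was rude!")
print(tokens)
```

Stemming maps inflected forms such as "flights" and "flight" to one token, which keeps the vocabulary smaller.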
Achieved high accuracy in sentiment classification, measured on both the training split and the held-out 20% test split.
Produced a dataset with no missing values through mode and median imputation.
Extracted TF-IDF features from cleaned, stopword-filtered, and stemmed tweet text.
This project showcases the integration of data science and machine learning to derive actionable insights from social media data. I'm thrilled with the results and the potential applications in market research, customer feedback analysis, and beyond.
import re

import numpy as np
import pandas as pd
import seaborn as sns
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
df = pd.read_csv('Tweets.csv')
df.shape
df.head(600)
df.info()
df.isnull().sum()
sns.heatmap(df.isnull())
sns.displot(df['airline_sentiment'],color='skyblue')
# Fill each categorical column with its most frequent value (mode() skips NaN by default)
categorical_cols = ['negativereason', 'airline_sentiment_gold', 'negativereason_gold',
                    'tweet_coord', 'tweet_location', 'user_timezone']
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])
# Fill the numeric confidence column with its median
df['negativereason_confidence'] = df['negativereason_confidence'].fillna(df['negativereason_confidence'].median())
df.head(100)
df.isnull().sum()
sns.heatmap(df.isnull())
# Encode the sentiment labels as integers: negative=0, positive=1, neutral=2
df['airline_sentiment'] = df['airline_sentiment'].map({'negative': 0, 'positive': 1, 'neutral': 2})
df['airline_sentiment'].value_counts()
sns.displot(df['airline_sentiment'],color='orange')
# Run once to download the stopword list:
# import nltk
# nltk.download('stopwords')
print(stopwords.words('english'))
port_stem=PorterStemmer()
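# Build the stopword set once; rebuilding the list for every word is very slow
stop_words = set(stopwords.words('english'))

def stemming(content):
    # Keep letters only, lowercase, and split into tokens
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    # Drop stopwords and reduce each remaining word to its stem
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content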
df['stemmed_content'] = df['text'].apply(stemming)
print(df['stemmed_content'])
x = df['stemmed_content'].values
y = df['airline_sentiment'].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2)
print(x.shape, x_train.shape, x_test.shape)
print(x_train)
print(x_test)
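# Fit the vocabulary and IDF weights on the training text only, then reuse them for the test text
vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(x_train)
x_test = vectorizer.transform(x_test)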
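The fit-on-train, transform-on-test pattern is easier to see on a few short documents. The sketch below uses made-up sentences and a separate `toy_vectorizer` so that it does not clash with the project's variables.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration only.
train_docs = ["flight delay", "great crew", "lost bag delay"]
test_docs = ["great flight", "rude agent"]

toy_vectorizer = TfidfVectorizer()
train_matrix = toy_vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = toy_vectorizer.transform(test_docs)        # reuses them unchanged

print(sorted(toy_vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
# "rude" and "agent" never appeared in training, so that row has no non-zero entries.
print(test_matrix[1].nnz)
```

Words that appear only in the test text are silently ignored, so the test features never leak information into the learned vocabulary.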
print(x_train)
print(x_test)
model = LogisticRegressionCV(max_iter=1000)
model.fit(x_train, y_train)
x_train_prediction = model.predict(x_train)
training_data_accuracy = accuracy_score(y_train, x_train_prediction)
print('Accuracy Score of Train Data:',training_data_accuracy)
x_test_prediction = model.predict(x_test)
testing_data_accuracy = accuracy_score(y_test, x_test_prediction)
print('Accuracy Score of Test Data:',testing_data_accuracy)
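Accuracy alone can hide weak performance on the less frequent classes, so per-class precision and recall are a natural complement to the scores above. The sketch below uses hypothetical labels and predictions, not results from this project, with the same encoding (0 = negative, 1 = positive, 2 = neutral).

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels and predictions: 0 = negative, 1 = positive, 2 = neutral.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 2, 1, 0, 2, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)

print(classification_report(
    y_true, y_pred,
    labels=[0, 1, 2],
    target_names=["negative", "positive", "neutral"],
    zero_division=0,
))
```

In the notebook, the same two calls could be given `y_test` and `x_test_prediction` in place of the hypothetical lists.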
import pickle

# Save the trained model; the with-blocks close the file handles automatically
filename = 'trained_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
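A quick way to check a saved model is to confirm that it predicts the same labels after reloading. This sketch uses a tiny stand-in LogisticRegression model and a temporary directory. The data, `small_model`, and `path` are assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression

# Tiny stand-in model; in the project this would be the fitted
# LogisticRegressionCV instance.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
small_model = LogisticRegression().fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "trained_model.sav")

# Context managers close the file handles even if an error occurs.
with open(path, "wb") as fh:
    pickle.dump(small_model, fh)
with open(path, "rb") as fh:
    restored = pickle.load(fh)

same = bool((restored.predict(X) == small_model.predict(X)).all())
print("Predictions match after reload:", same)
```

To classify new raw tweets later, the fitted TfidfVectorizer would need to be saved alongside the model, since the classifier only accepts features built from that same vocabulary.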
df.head()
x_new = x_test[569]
print(y_test[569])
prediction = model.predict(x_new)
print(prediction)
if prediction[0] == 1:
    print('positive tweet')
elif prediction[0] == 0:
    print('negative tweet')
else:
    print('neutral tweet')
x_new = x_test[3]
print(y_test[3])
prediction = model.predict(x_new)
print(prediction)
if prediction[0] == 1:
    print('positive tweet')
elif prediction[0] == 0:
    print('negative tweet')
else:
    print('neutral tweet')