Oct 16, 2024●11 reads●No License

SOCIALMEDIA_SENTIMENT_ANALYSIS

Hello, I am Tanmoy Bera, a tech enthusiast from India. As a data enthusiast with a passion for using technology to solve real-world problems, I developed "Social Media Sentiment Analysis" to decode public sentiment on social media. The project uses natural language processing and machine learning to analyze posts and comments from social platforms and to classify the emotions and opinions they express. A tool like this can help brands, organizations, and influencers make more informed decisions, tailor their communication strategies, and stay connected to their audience's evolving sentiments in real time.

Project: Social Media Sentiment Analysis

In an increasingly digital world, understanding public sentiment has become crucial for businesses, governments, and influencers alike. My project, Social Media Sentiment Analysis, aims to harness the power of natural language processing and machine learning to analyze and interpret emotions and opinions expressed across social platforms. By examining real-time data from diverse sources, my project provides actionable insights into audience moods, helping organizations enhance their communication strategies and decision-making processes.

In this project, I analyzed social media sentiment to understand public opinion and trends, using a Tweets dataset (Tweets.csv) in which each tweet carries a positive, negative, or neutral airline_sentiment label. Here are some key highlights:

Data Imputation Technique:

Handling missing data is crucial to the accuracy of any ML model. I filled the gaps with mode imputation for the categorical columns and median imputation for the numeric negativereason_confidence column, so that no column is left with null values.

Machine Learning Algorithm - LogisticRegressionCV:

For the sentiment classification task, I chose LogisticRegressionCV. It is logistic regression with built-in cross-validation that selects the regularization strength, which helps limit overfitting.

NLP Toolkit:

I used NLTK to preprocess the text data: stripping non-letter characters, lowercasing, splitting into tokens, removing stopwords, and stemming with the Porter stemmer. The cleaned text is then converted to TF-IDF features so that the model can classify the sentiment expressed in each post.

Key Achievements:

Trained a three-class sentiment classifier and scored its accuracy on both the training data and a held-out test set.
Removed all missing values from the dataset through mode and median imputation.
Turned raw tweet text into model-ready features with stopword removal, stemming, and TF-IDF.

This project showcases the integration of data science and machine learning to derive actionable insights from social media data. I'm thrilled with the results and the potential applications in market research, customer feedback analysis, and beyond.

Importing Important Packages and Libraries

import pandas as pd
import numpy as np
import re
import seaborn as sns
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer

Reading the dataset

df=pd.read_csv('Tweets.csv')
df.shape
df.head(600)

Checking total missing values in the dataset

df.info()

df.isnull().sum()
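
Raw null counts are easier to judge as a share of all rows. The sketch below runs on a small made-up frame rather than on Tweets.csv and ranks the columns by how much is missing. The `missing_report` helper and the sample values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column null counts and percentages, most-missing first."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return report.sort_values("missing", ascending=False)

# Toy stand-in for the tweets table (values are made up).
sample = pd.DataFrame({
    "airline_sentiment": ["negative", "positive", "neutral", "negative"],
    "negativereason": ["Late Flight", np.nan, np.nan, "Bad Flight"],
    "tweet_coord": [np.nan, np.nan, np.nan, "[40.7, -74.0]"],
})

report = missing_report(sample)
print(report)
```

Calling `missing_report(df)` on the real data would give the same kind of table for every column at once.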

Heatmap of null values

sns.heatmap(df.isnull())

Distribution of the sentiment labels

sns.displot(df['airline_sentiment'],color='skyblue')

Handling the missing values to clean the dataset

1-Mode imputation on negativereason

mode=df[df['negativereason'].notna()]['negativereason'].mode()[0]
df['negativereason']=df['negativereason'].fillna(mode)

2-Median imputation on negativereason_confidence

df['negativereason_confidence']=df['negativereason_confidence'].fillna(df['negativereason_confidence'].median())

3-Mode imputation on airline_sentiment_gold

mode1=df[df['airline_sentiment_gold'].notna()]['airline_sentiment_gold'].mode()[0]
df['airline_sentiment_gold']=df['airline_sentiment_gold'].fillna(mode1)

4-Mode imputation on negativereason_gold

mode2=df[df['negativereason_gold'].notna()]['negativereason_gold'].mode()[0]
df['negativereason_gold']=df['negativereason_gold'].fillna(mode2)

5-Mode imputation on tweet_coord

mode3=df[df['tweet_coord'].notna()]['tweet_coord'].mode()[0]
df['tweet_coord']=df['tweet_coord'].fillna(mode3)

6-Mode imputation on tweet_location

mode4=df[df['tweet_location'].notna()]['tweet_location'].mode()[0]
df['tweet_location']=df['tweet_location'].fillna(mode4)

7-Mode imputation on user_timezone

mode5=df[df['user_timezone'].notna()]['user_timezone'].mode()[0]
df['user_timezone']=df['user_timezone'].fillna(mode5)

df.head(100)
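
The seven imputation steps above repeat the same two patterns, so they can be folded into one loop. This is a minimal sketch on made-up data; the `impute_columns` helper is a hypothetical name, not something from the notebook or from pandas.

```python
import numpy as np
import pandas as pd

def impute_columns(frame, mode_cols, median_cols):
    """Fill categorical columns with their mode and numeric columns with their median."""
    filled = frame.copy()  # leave the caller's frame untouched
    for col in mode_cols:
        # Series.mode() ignores NaN by default, so no notna() filter is needed.
        filled[col] = filled[col].fillna(filled[col].mode()[0])
    for col in median_cols:
        filled[col] = filled[col].fillna(filled[col].median())
    return filled

# Toy stand-in for the tweets table (values are made up).
sample = pd.DataFrame({
    "negativereason": ["Late Flight", np.nan, "Late Flight", "Lost Luggage"],
    "user_timezone": [np.nan, "Eastern Time (US & Canada)",
                      np.nan, "Eastern Time (US & Canada)"],
    "negativereason_confidence": [0.6, np.nan, 1.0, 0.7],
})

clean = impute_columns(
    sample,
    mode_cols=["negativereason", "user_timezone"],
    median_cols=["negativereason_confidence"],
)
print(clean)
```

Note that mode imputation assigns a negative reason even to tweets that are not negative. Since only the text and airline_sentiment columns feed the classifier in this project, that affects the cleaned table but not the model.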

After handling the missing values, here is the clean data

df.isnull().sum()

Checking the heatmap again after cleaning

sns.heatmap(df.isnull())

Replacing the sentiment labels with numerical values for modeling

# Encode the labels in one pass: negative = 0, positive = 1, neutral = 2
df['airline_sentiment']=df['airline_sentiment'].map({'negative':0,'positive':1,'neutral':2})

Counting the total airline_sentiment values

df['airline_sentiment'].value_counts()

Sentiment distribution after encoding the labels

sns.displot(df['airline_sentiment'],color='orange')

Downloading only the stopwords corpus (uncomment on first run)

#import nltk
#nltk.download('stopwords')

Commonly used English words (stopwords)

print(stopwords.words('english'))

Creating the stemmer used to clean the text content

port_stem=PorterStemmer()
# Build the stopword set once; a set lookup is far cheaper than reloading the list for every word
stop_words=set(stopwords.words('english'))

Stemming the text content

def stemming(content):
    # Keep letters only, replacing every other character with a space
    stemmed_content = re.sub('[^a-zA-Z]',' ',content)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    # Drop stopwords and reduce the remaining words to their stems
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
    stemmed_content = ' '.join(stemmed_content)

    return stemmed_content

df['stemmed_content'] = df['text'].apply(stemming)

print(df['stemmed_content'])
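
To see what the cleaning steps do to a single tweet, here is a small standalone sketch of the same pipeline. It assumes a hand-picked stopword set so that it runs without `nltk.download()`; the notebook itself uses NLTK's full English list, and the sample tweet is made up.

```python
import re
from nltk.stem.porter import PorterStemmer

# Small hand-picked stopword set (an assumption for this sketch only).
STOP_WORDS = {"the", "were", "my", "and", "are"}
stemmer = PorterStemmer()

def clean_and_stem(text):
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)  # drop digits, punctuation, symbols
    tokens = letters_only.lower().split()           # lowercase, split on whitespace
    return " ".join(stemmer.stem(t) for t in tokens if t not in STOP_WORDS)

tweet = "@airline my flights were DELAYED and the bags are missing!!"
cleaned = clean_and_stem(tweet)
print(cleaned)
```

The mention marker, the capital letters, and the exclamation marks disappear, the stopwords are dropped, and plural or inflected words such as "flights" and "bags" are reduced to their stems.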

x=df['stemmed_content'].values
y=df['airline_sentiment'].values

Split Train Test Data

x_train, x_test, y_train, y_test = train_test_split( x, y, test_size=0.2, random_state=2)

print(x.shape, x_train.shape, x_test.shape)
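
When the sentiment classes are unevenly sized, a plain random split can leave the test set with a slightly different class mix than the training set. One option, shown here as a sketch on made-up labels and not what the notebook does, is the `stratify` argument of `train_test_split`.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 0 = negative, 1 = positive, 2 = neutral.
texts = [f"tweet {i}" for i in range(100)]
labels = [0] * 60 + [2] * 20 + [1] * 20

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=2, stratify=labels
)

# Both parts keep the 60/20/20 class proportions of the full data.
print(Counter(y_tr), Counter(y_te))
```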

print(x_train)

print(x_test)

Converting the text content to numerical values (TF-IDF features) so that the ML model can use it

vectorizer=TfidfVectorizer()
x_train=vectorizer.fit_transform(x_train)
x_test=vectorizer.transform(x_test)

print(x_train)

print(x_test)
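
The vectorizer is fitted on the training text only and then reused on the test text, so both matrices share the same columns and nothing from the test set leaks into the vocabulary. A minimal sketch with made-up documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["flight delay", "great crew", "lost bag delay"]
test_docs = ["great flight", "rude agent"]  # "rude" and "agent" never occur in training

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them; unseen words are ignored

print(sorted(vec.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

The second test document contains only unseen words, so its row in `test_matrix` has no non-zero entries.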

Training the model using LogisticRegressionCV

model = LogisticRegressionCV(max_iter=1000)
model.fit(x_train, y_train)

x_train_prediction = model.predict(x_train)
training_data_accuracy = accuracy_score(y_train, x_train_prediction)
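
LogisticRegressionCV tries a grid of regularization strengths and keeps the one that scores best under cross-validation. The sketch below makes that search visible on synthetic data from `make_classification`; the values `Cs=5` and `cv=3` are arbitrary choices for illustration, not settings taken from this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic three-class data standing in for the TF-IDF features.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    n_classes=3, random_state=0,
)

# Cs=5 asks for five candidate C values on a log scale; cv=3 uses 3-fold CV.
clf = LogisticRegressionCV(Cs=5, cv=3, max_iter=1000)
clf.fit(X, y)

print("candidate C values:", clf.Cs_)
print("selected C:", np.atleast_1d(clf.C_))
```

Inspecting `model.Cs_` and `model.C_` on the fitted notebook model in the same way shows which regularization strength the cross-validation settled on.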

Score the Train and Test Data Accuracy

print('Accuracy Score of Train Data:',training_data_accuracy)

x_test_prediction = model.predict(x_test)
testing_data_accuracy = accuracy_score(y_test, x_test_prediction)

print('Accuracy Score of Test Data:',testing_data_accuracy)
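
Accuracy alone can hide weak performance on the smaller classes. As a sketch on made-up labels (not results from this project), a confusion matrix and a per-class report show where the errors fall:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up labels: 0 = negative, 1 = positive, 2 = neutral.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

print("accuracy:", acc)
print(cm)  # rows = true class, columns = predicted class
print(classification_report(
    y_true, y_pred, labels=[0, 1, 2],
    target_names=["negative", "positive", "neutral"], zero_division=0,
))
```

Here the accuracy is 0.7 even though no neutral tweet is classified correctly. Passing `y_test` and `x_test_prediction` to the same functions would give the corresponding breakdown for the trained model.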

Saving the trained model to a .sav file and loading it back

import pickle
filename='trained_model.sav'
# Context managers close the file even if pickling fails
with open(filename,'wb') as f:
    pickle.dump(model,f)

with open(filename,'rb') as f:
    loaded_model = pickle.load(f)
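
The pickled model expects TF-IDF features, so scoring new text later also requires the fitted vectorizer. One way to keep the two together, sketched here under the assumption that a scikit-learn Pipeline is acceptable for this project, is to pickle a single pipeline object. The tiny training set is made up.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: 0 = negative, 1 = positive.
texts = ["late flight lost bag", "great crew smooth flight",
         "rude agent long delay", "love friendly staff"]
labels = [0, 1, 0, 1]

# Plain LogisticRegression keeps the toy example small; the notebook's
# LogisticRegressionCV could take its place in the same pipeline.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(texts, labels)

blob = pickle.dumps(pipeline)   # same bytes that pickle.dump would write to a file
restored = pickle.loads(blob)

new_texts = ["lost bag delay", "friendly crew"]
print(restored.predict(new_texts))
```

The restored pipeline accepts raw (cleaned) text directly, because the vectorizer travels with the classifier.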

df.head()

Checking whether the prediction is right or wrong

x_new = x_test[569]
print(y_test[569])
prediction=model.predict(x_new)
print(prediction)

if (prediction[0]==1):
    print('positive tweet')
elif(prediction[0]==0):
    print('negative tweet')
else:
    print('neutral tweet')

x_new = x_test[3]
print(y_test[3])
prediction=model.predict(x_new)
print(prediction)

if (prediction[0]==1):
    print('positive tweet')
elif(prediction[0]==0):
    print('negative tweet')
else:
    print('neutral tweet')
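
The two checks above repeat the same if/elif chain. A small lookup table does the same job in one place; `LABEL_NAMES` and `describe` are hypothetical names introduced for this sketch, and the lists stand in for the output of `model.predict`.

```python
# Encoding used in this notebook: negative = 0, positive = 1, neutral = 2.
LABEL_NAMES = {0: "negative", 1: "positive", 2: "neutral"}

def describe(prediction):
    """Turn the array returned by model.predict into a readable message."""
    return f"{LABEL_NAMES[int(prediction[0])]} tweet"

# Stand-ins for model.predict(...) output so the sketch runs on its own.
for fake_prediction in ([1], [0], [2]):
    print(describe(fake_prediction))
```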
