The question, in its essence, is: are wine reviews written over time sufficient to 'unconsciously' tell who it was that wrote them?
As a start, I wanted to resist the temptation to consider other factors such as country, region, points, etc., even though they would give us more insight into how a reviewer's region affects their access to a wine, or into how he or she evaluates one.
I believe so, since it's the journey (the explanation) that tells a lot more than the destination (the quantified evaluation).
A little bit of background about myself: I am definitely not a wine taster. All I have ever wondered about alcohol is, does it taste good and does it get me the buzz? What I am trying to say is, I do not really know what it is that a wine reviewer looks for in a wine.
These observations, which I will talk about as we go through the code, led me to believe that the reviews contained sufficient information to determine who the reviewer was.
import numpy as np
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

nltk.download('punkt')
nltk.download('stopwords')
PATH = "C:\\Users\\Rohit\\Deep Learning\\"
df_raw = pd.read_csv(f'{PATH}winemag-data-130k-v2.csv', low_memory=False)
# generating word frequencies
def gen_freq(text):
    # loop over all the text docs and extract words into word_list
    word_list = []
    for sentence in text:
        for word in sentence.split():
            word_list.append(word)
    # create word frequencies from word_list
    word_freq = pd.Series(word_list).value_counts()
    return word_freq
stop_words = set(stopwords.words('english'))
# NLTK's stop words are all lowercase, so the capitalised 'With' is not in the set
'With' in stop_words
for i in df_raw['taster_name'].dropna().unique():
    newb = df_raw[['description', 'taster_name']][df_raw['taster_name'] == i]
    word_freq_view = gen_freq(newb['description'].tolist())
    # note: WordCloud only applies its stopwords in generate_from_text,
    # so stop words still appear in these first clouds
    wc_normal = WordCloud(width=400, height=330, max_words=50,
                          background_color='white',
                          stopwords=stop_words).generate_from_frequencies(word_freq_view)
    print(newb['taster_name'].unique())
    plt.figure(figsize=(14, 10))
    plt.imshow(wc_normal, interpolation='bilinear')
    plt.axis('off')
    plt.show()
What I saw in those word clouds (with the stop words still in) is that some reviewers look for multiple flavours as well as aromas, and the culmination of the two.
for i in df_raw['taster_name'].dropna().unique():
    newb = df_raw[['description', 'taster_name']][df_raw['taster_name'] == i]
    # drop stop words, short tokens and non-alphabetic tokens before counting
    newb['description'] = newb['description'].apply(
        lambda x: ' '.join([w for w in nltk.word_tokenize(x)
                            if w.isalpha() and len(w) > 3 and w not in stop_words]))
    word_freq_view = gen_freq(newb['description'].tolist())
    wc_normal = WordCloud(width=400, height=330, max_words=50,
                          background_color='white',
                          stopwords=stop_words).generate_from_frequencies(word_freq_view)
    print(newb['taster_name'].unique())
    plt.figure(figsize=(14, 10))
    plt.imshow(wc_normal, interpolation='bilinear')
    plt.axis('off')
    plt.show()
After removing the stop words, the word clouds helped me understand what a reviewer looks for first, and as a whole, while tasting the wine.
So yes: the reviews do contain information that relays what a reviewer looks for first, how they consume the wine, and what the wine makes them feel. I do think we can predict a reviewer based on their reviews.
ax = sns.countplot(x=df_raw['taster_name'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()
df_raw['taster_name'].value_counts(normalize=True).plot(kind='bar')
df_raw_2 = df_raw.groupby(['taster_name']).size().reset_index(name='count')
df_raw_2
Reviewers = df_raw['taster_name'].fillna('Unknown_Reviewer')
Reviews = df_raw['description'].dropna()
Reviewers.shape
X_train, X_test, Y_train, Y_test = train_test_split(
    Reviews, Reviewers, stratify=Reviewers, test_size=0.2, random_state=100)
tv = TfidfVectorizer(min_df=0., max_df=1., norm='l2', use_idf=True, smooth_idf=True)
train_tfidf = tv.fit_transform(X_train)
test_tfidf = tv.transform(X_test)
print(train_tfidf.shape)
print(test_tfidf.shape)
My primary intuition for starting here was that the high-frequency words can be considered the dimensions of the manifold the documents create, with the reviews as data points in that manifold. 'Curve fitting' over that space should then do a tremendous job, and as seen, it definitely did, although the F1-score was low for reviewers with fewer than 100 reviews.
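To make the 'words as dimensions' picture concrete, each review becomes a sparse vector whose axes are the learned vocabulary. A quick check on the vectorizer fitted above (get_feature_names_out needs scikit-learn >= 1.0; older versions call it get_feature_names):

feature_names = tv.get_feature_names_out()   # the word axes of the space
print(len(feature_names))                    # dimensionality of the manifold
print(train_tfidf[0].nnz)                    # coordinates review 0 actually touches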
log_model = LogisticRegression(penalty='l2', solver='lbfgs', multi_class='ovr',
                               max_iter=1000, C=1, random_state=100)
log_model.fit(train_tfidf, Y_train)
log_predictions = log_model.predict(test_tfidf)
print(classification_report(Y_test,log_predictions))
SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.
I chose k = 4 nearest neighbours, since Fiona Adams had only 5 of her reviews represented in the training split, so SMOTE cannot look for more than 4 minority-class neighbours.
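As a minimal sketch of that convex combination (my own illustration, not imblearn's internals; smote_synthetic is a hypothetical helper):

import numpy as np

def smote_synthetic(a, b, rng=np.random.default_rng(0)):
    # a: a minority-class sample, b: one of its k nearest minority neighbours
    lam = rng.uniform(0.0, 1.0)   # random position along the segment a -> b
    return a + lam * (b - a)      # the synthetic point, a convex combination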
sm = SMOTE(k_neighbors=4)
X_sm, Y_sm = sm.fit_resample(train_tfidf, Y_train)
log_model = LogisticRegression(penalty='l2', solver='lbfgs', multi_class='ovr',
                               max_iter=1000, C=1, random_state=100)
log_model.fit(X_sm, Y_sm)
log_sm_predictions = log_model.predict(test_tfidf)
print(classification_report(Y_test, log_sm_predictions))
Reviews_No_Stop_Words = df_raw['description'].dropna().apply(
    lambda x: ' '.join([w for w in nltk.word_tokenize(x)
                        if w.isalpha() and len(w) > 3 and w not in stop_words]))
X_train, X_test, Y_train, Y_test = train_test_split(
    Reviews_No_Stop_Words, Reviewers, stratify=Reviewers, test_size=0.2, random_state=100)
tv = TfidfVectorizer(min_df=0., max_df=1., norm='l2', use_idf=True, smooth_idf=True)
train_tfidf = tv.fit_transform(X_train)
test_tfidf = tv.transform(X_test)
print(train_tfidf.shape)
print(test_tfidf.shape)
I chose to stick with SMOTE here as well, in order to include Fiona Adams.
sm = SMOTE(k_neighbors=4)
X_sm, Y_sm = sm.fit_resample(train_tfidf, Y_train)
log_model = LogisticRegression(penalty='l2', solver='lbfgs', multi_class='ovr',
                               max_iter=1000, C=1, random_state=100)
log_model.fit(X_sm, Y_sm)
log_sm_predictions = log_model.predict(test_tfidf)
print(classification_report(Y_test, log_sm_predictions))
LBFGS uses an approximation of the inverse Hessian matrix, which means it saves a lot of memory, although it carries the risk of not converging at all. I believed that, the drawback notwithstanding, it would surely converge on a dataset this small. Which it did.
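For reference, the textbook BFGS update of the inverse Hessian that LBFGS approximates is, with $s_k = x_{k+1} - x_k$, $y_k = \nabla f_{k+1} - \nabla f_k$ and $\rho_k = 1/(y_k^\top s_k)$:

$$H_{k+1} = (I - \rho_k s_k y_k^\top)\, H_k\, (I - \rho_k y_k s_k^\top) + \rho_k s_k s_k^\top$$

LBFGS never stores $H_k$ explicitly; it keeps only the last few $(s_k, y_k)$ pairs, which is where the memory saving comes from.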
log_model_2 = LogisticRegression(penalty='l2', solver='liblinear', multi_class='ovr',
                                 max_iter=1000, C=1, random_state=100)
log_model_2.fit(X_sm, Y_sm)
log_sm_predictions = log_model_2.predict(test_tfidf)
print(classification_report(Y_test, log_sm_predictions))
SAG (Stochastic Average Gradient) incorporates a memory of previous gradient values, leading to a much faster convergence rate.
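A toy sketch of that idea for a binary logistic loss (my own illustration; sag_epoch is a hypothetical helper, and scikit-learn's actual solver is more involved):

import numpy as np

def sag_epoch(w, X, y, grad_mem, avg_grad, lr=0.1, rng=np.random.default_rng(0)):
    # grad_mem[i] remembers the last gradient computed for sample i;
    # avg_grad is the running mean of all stored gradients
    n = len(y)
    for _ in range(n):
        i = rng.integers(n)                      # visit one sample at random
        p = 1.0 / (1.0 + np.exp(-X[i] @ w))      # logistic prediction
        g_new = (p - y[i]) * X[i]                # fresh gradient for sample i
        avg_grad += (g_new - grad_mem[i]) / n    # refresh the average in place
        grad_mem[i] = g_new                      # store for later visits
        w = w - lr * avg_grad                    # step with the averaged gradient
    return w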
log_model_3 = LogisticRegression(penalty='l2', solver='sag', multi_class='ovr',
                                 max_iter=1000, C=1, random_state=100)
log_model_3.fit(X_sm, Y_sm)
log_sm_predictions = log_model_3.predict(test_tfidf)
print(classification_report(Y_test, log_sm_predictions))
# comparing against the string 'nan' matches nothing; NaN entries need isnull()
newb = df_raw[['description', 'taster_name']][df_raw['taster_name'].isnull()]
test_data=df_raw[['description']][df_raw['taster_name'].isnull()]
test_one = [test_data.at[33,'description']]
test_data_see = test_data['description'].apply(
    lambda x: ' '.join([w for w in nltk.word_tokenize(x)
                        if w.isalpha() and len(w) > 3 and w not in stop_words]))
test_data_see.at[33]
test_one_review = tv.transform([test_data_see.at[33]])
log_model_2.predict(test_one_review)
This works well: Unknown_Reviewer, the label I used to replace the NaNs, is, as labeled, unknown and mysterious. I suspect these reviews were taken from a reviewer with an account somewhere that carries no details. And Unknown_Reviewer having quite high recall and precision does suggest it is the same reviewer throughout, which is quite weird, to be honest.
Because of the limitations of my laptop and my time constraints, I didn't use deeper trees. I did test this on Colab, but it still didn't perform better than logistic regression. That said, the minority classes' precision and recall were a tad better than with logistic regression. I used Gini impurity; I suspect that had I used information gain, the precision and recall would have been better still, although the training time would be high (only a CPU was available for this).
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=15)
model.fit(X_sm, Y_sm)
Y_predicted = model.predict(test_tfidf)
print(classification_report(Y_test, Y_predicted))
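Had I had the time, the information-gain variant would have been a one-parameter change in scikit-learn (a sketch, not run here; model_ig is a hypothetical name):

# hypothetical variant: split on information gain (entropy) instead of Gini
model_ig = RandomForestClassifier(n_estimators=100, criterion='entropy', max_depth=15)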
I used cosine similarity, suspecting that a similarity based on the angles between data points over the manifold of word dimensions would be a better measure than Euclidean distance.
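Since the vectorizer above used norm='l2', cosine similarity between raw TF-IDF rows reduces to a plain dot product; the SMOTE rows, being interpolations, lose the unit norm, which is why metric='cosine' is passed explicitly below. A quick check on two training rows:

from sklearn.metrics.pairwise import cosine_similarity

print(cosine_similarity(train_tfidf[0], train_tfidf[1]))   # cosine of the angle
print((train_tfidf[0] @ train_tfidf[1].T).toarray())       # equal, thanks to the L2 norm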
from sklearn.neighbors import KNeighborsClassifier

model_2 = KNeighborsClassifier(n_neighbors=30, metric='cosine')
model_2.fit(X_sm, Y_sm)
Y_predicted = model_2.predict(test_tfidf)
print(classification_report(Y_test, Y_predicted))
I always call SVMs the magicians, because the very construction of the algorithm is such that minimising the loss reduces to a dual optimisation problem over a paraboloid, which has a high probability of reaching the global optimum, since constructing the hyperplane only requires the dot products of pairs of points over the manifold.
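For reference, the textbook soft-margin dual this intuition refers to (LinearSVC itself solves a closely related problem through liblinear) is:

$$\max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_i \alpha_i y_i = 0$$

Only the dot products $x_i \cdot x_j$ of pairs of points appear, which is exactly the point.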
model_3 = LinearSVC(multi_class='ovr', dual=True)
model_3.fit(X_sm, Y_sm)
Y_predicted = model_3.predict(test_tfidf)
print(classification_report(Y_test, Y_predicted))
All in all, I would go with SVMs.
Try to get more reviews written by Christina, Fiona, and Carrie. I didn't want to oversample the data any further, for fear that I would just end up generating my own data rather than the true distribution of the dataset.
I wouldn't recommend moving to deep learning, since the classical ML algorithms already made pretty good models.
I might also try to increase the weight of the minority classes (reviewers with fewer reviews).
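For example (untested here; log_model_w is a hypothetical name), scikit-learn can reweight classes inversely to their frequency with class_weight='balanced':

# hypothetical re-run with inverse-frequency class weights
log_model_w = LogisticRegression(penalty='l2', solver='lbfgs', multi_class='ovr',
                                 max_iter=1000, C=1, class_weight='balanced',
                                 random_state=100)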