The following notebook creates a model to predict (or classify) whether a given Swiss product review is helpful or not.
This notebook follows the same steps as the predicting_helpfulness_small notebook, the main difference being that this one uses a larger dataset of helpful reviews.
This work was inspired by the papers "Predicting Amazon review helpfulness" and "Predicting Helpfulness Ratings of Amazon Product Reviews".
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import yaml
Open the file containing the filtered helpful Swiss reviews. This notebook uses the larger dataset, in which each Swiss review received at least 5 helpfulness votes, totalling 23755 reviews.
In [3]:
with open("data/swiss-reviews-helpful-correct-bigger.txt", 'r') as fp:
s_swiss_rev_help = fp.readlines()
In [4]:
len(s_swiss_rev_help)
Out[4]:
In [5]:
s_swiss_rev_help[0]
Out[5]:
The ARI (Automated Readability Index) score is a measure of the readability of a piece of text, defined as 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43; it is one of the features in the later model. The small 0.1 terms in the implementation below guard against division by zero.
In [6]:
import re
import nltk.data
from nltk import wordpunct_tokenize
# tokenize by punctuation and keep only tokens that contain at least one
# alphanumeric character or hyphen (pure punctuation and hyphen-only tokens are dropped)
RE = re.compile('[0-9a-z-]', re.I)
s_sent = nltk.data.load('tokenizers/punkt/english.pickle')
In [7]:
def get_ARI_score(text):
words = list(filter(lambda w: RE.search(w) and w.replace('-', ''), wordpunct_tokenize(text)))
wordc = len(words)
charc = sum(len(w) for w in words)
sents = s_sent.tokenize(text)
sentc = len(sents)
return 4.71 * charc / (wordc + 0.1) + 0.5 * (wordc + 0.1) / (sentc + 0.1) - 21.43
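As a quick sanity check, the function can be applied to an arbitrary piece of text (the sample sentence below is made up and purely illustrative):

# Illustrative usage of get_ARI_score on a short, made-up review text
sample_text = "This product works well. I would definitely buy it again."
print(get_ARI_score(sample_text))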
In the next step we convert the file containing the reviews into a pandas DataFrame, which is easier to work with.
In [8]:
def convert_to_dict(line):
l = line.rstrip('\n')
l = eval(l)
return l
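A note on eval: since each line appears to be a plain Python dict literal, ast.literal_eval would be a safer drop-in alternative, as it only parses literals and cannot execute arbitrary code. A minimal sketch, assuming the same line format:

import ast

def convert_to_dict_safe(line):
    # parse a single line into a dict without executing arbitrary code
    return ast.literal_eval(line.rstrip('\n'))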
In [9]:
s_dict_list = []
counter = 1
for item in s_swiss_rev_help:
if(counter%500 == 0):
print(counter)
counter += 1
s_dict_list.append(convert_to_dict(item))
In [10]:
s_features = pd.DataFrame(s_dict_list)
We remove the 'reviewTime', 'reviewerName' and 'unixReviewTime' columns from the dataframe, as they will neither be used as features nor be used to construct other features. As a note, in the papers that inspired the current work these columns were used to derive features such as the age of a review and its time delay; a possible derivation is sketched below.
In [11]:
s_features.drop(['reviewTime', 'reviewerName', 'unixReviewTime'],inplace=True,axis=1)
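For reference, a review-age feature in the spirit of the cited papers could be derived from unixReviewTime before dropping it. A hedged sketch; the column name rev_age_days and the cutoff date are illustrative and not taken from the papers:

# Hypothetical sketch; would have to run before the drop above,
# while 'unixReviewTime' (seconds since the epoch) is still available.
cutoff_ts = pd.Timestamp('2014-07-01').timestamp()
s_features['rev_age_days'] = (cutoff_ts - s_features['unixReviewTime']) / 86400.0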
In [12]:
s_features.head(3)
Out[12]:
Feature 1: review length, measured as the number of characters in the text. Note that each feature, once created, is also standardized (zero mean, unit variance); a small helper for this recurring pattern is sketched below.
In [13]:
# Review length
s_features["rev_length"] = s_features['reviewText'].apply(lambda x: len(x))
s_features["rev_length"] = (s_features["rev_length"] - \
s_features["rev_length"].mean()) / s_features["rev_length"].std()
Feature 2: the ratio of the number of capital letters to the total number of characters. +1 is added to the denominator to prevent division by zero (apparently there are zero-length reviews).
In [14]:
# Number of all caps
def count_all_caps(text):
count = 0
for i in text:
if(i.isupper()):
count += 1
return count/(len(text) + 1)
In [15]:
s_features["all_caps"] = s_features['reviewText'].apply(count_all_caps)
s_features["all_caps"] = (s_features["all_caps"] - \
s_features["all_caps"].mean()) / s_features["all_caps"].std()
Feature 3: Total number of sentences
In [16]:
# Sentence count
s_features["sent_count"] = s_features['reviewText'].apply(lambda x: \
len(nltk.tokenize.sent_tokenize(x)))
In [17]:
s_features["sent_count"] = (s_features["sent_count"] - \
s_features["sent_count"].mean()) / s_features["sent_count"].std()
Feature 4: the ratio of question and exclamation marks to the length of the text
In [18]:
# Question and exclamation marks ratio to characters
def count_ques_exc_ratio(text):
count = 0
for i in text:
if(i == '!' or i == '?'):
count += 1
return count*1.0/(len(text)+1)
In [19]:
s_features["que_exc_ratio"] = s_features['reviewText'].apply(count_ques_exc_ratio)
s_features["que_exc_ratio"] = (s_features["que_exc_ratio"] - \
s_features["que_exc_ratio"].mean()) / s_features["que_exc_ratio"].std()
Feature 5: ARI readability score
In [20]:
# ARI readability index
s_features["ari_score"] = s_features['reviewText'].apply(get_ARI_score)
In [21]:
s_features["ari_score"] = (s_features["ari_score"] - \
s_features["ari_score"].mean()) / s_features["ari_score"].std()
Feature 6: The deviation of the review's rating from the mean rating score of the respective product
In [22]:
# load mean rating scores for considered asins
with open("data/asin_ratings_bigger.json", "r") as fp:
s_asin_rev_ratings = fp.readlines()
In [23]:
s_dict_list_ar = []
for item in s_asin_rev_ratings:
s_dict_list_ar.append(convert_to_dict(item))
In [24]:
s_dict_ar = dict([(key,d[key]) for d in s_dict_list_ar for key in d])
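To make the construction above concrete: each line of asin_ratings_bigger.json is assumed to be a single-entry dict mapping an asin to its mean rating, and the comprehension merges these into one lookup dict. The values below are made up for illustration:

# Illustrative only: hypothetical asins and ratings, not real data
example_list = [{'ASIN_A': 4.3}, {'ASIN_B': 3.8}]
example_dict = dict([(key, d[key]) for d in example_list for key in d])
# example_dict == {'ASIN_A': 4.3, 'ASIN_B': 3.8}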
In [25]:
def get_dev_mean_score(row):
if(row['asin'] in s_dict_ar.keys()):
return row['overall'] - s_dict_ar[row['asin']]
print("Should not get here")
return 0
In [26]:
s_features['dev_mean_score'] = s_features.apply(get_dev_mean_score, axis=1)
In [27]:
# should only be standardized after dev_mean_score has been computed
s_features["overall"] = (s_features["overall"] - \
s_features["overall"].mean()) / s_features["overall"].std()
In [28]:
s_features["dev_mean_score"] = (s_features["dev_mean_score"] - \
s_features["dev_mean_score"].mean()) / s_features["dev_mean_score"].std()
In [29]:
len(s_features)
Out[29]:
In the next step we remove rows with duplicate ('asin', 'reviewText') pairs. There are other approaches as well, such as removing duplicates by the 'reviewText' field alone, but this would reduce the dataset considerably. An interesting observation here is that many reviews are copied verbatim across different products.
In [30]:
s_features.drop_duplicates(subset=['asin','reviewText'], keep='first', inplace=True)
In [31]:
len(s_features)
Out[31]:
In the next part we create the labels by considering a review 'helpful' only if its helpful ratio (helpful votes / total votes) is at least 0.6. This approach was also used in the mentioned papers, the motivation being that the intuitive threshold of 0.5 would be somewhat ambiguous, so a stricter threshold is used.
In [32]:
def create_labels(pair):
if(pair[0]*1.0/pair[1] >= 0.6):
return 1
else:
return 0
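For example, a review with 4 helpful votes out of 5 total votes has a ratio of 0.8 and is labelled helpful, while 2 out of 5 (0.4) is not:

# 'helpful' entries are [helpful_votes, total_votes] pairs
print(create_labels([4, 5]))  # 0.8 >= 0.6 -> 1
print(create_labels([2, 5]))  # 0.4 <  0.6 -> 0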
In [33]:
s_features['Y'] = s_features['helpful'].apply(create_labels)
To complete the set of features considered for the learning model we apply tf-idf to the corpus of reviews. In a little more detail, we first build the vocabulary of the corpus and then attach to each word its tf-idf score; each review is then represented over this vocabulary, with the tf-idf scores of the words it contains as feature values.
In [34]:
from sklearn.feature_extraction.text import CountVectorizer
s_count_vect = CountVectorizer(stop_words='english')
s_X_train_counts = s_count_vect.fit_transform(list(s_features['reviewText']))
In [35]:
s_X_train_counts.shape
Out[35]:
In [36]:
from sklearn.feature_extraction.text import TfidfTransformer
s_tf_transformer = TfidfTransformer().fit(s_X_train_counts)
s_X_train_tf = s_tf_transformer.transform(s_X_train_counts)
In [37]:
s_X_train_tf.shape
Out[37]:
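As an aside, scikit-learn's TfidfVectorizer combines CountVectorizer and TfidfTransformer into a single step; a sketch of the equivalent formulation:

from sklearn.feature_extraction.text import TfidfVectorizer

# Equivalent to CountVectorizer(stop_words='english') followed by TfidfTransformer()
s_tfidf_vect = TfidfVectorizer(stop_words='english')
s_X_train_tf_alt = s_tfidf_vect.fit_transform(list(s_features['reviewText']))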
In order to preserve the sparsity of the matrix we scale by the maximum absolute value (centering on the mean would make the matrix dense).
In [38]:
from sklearn.preprocessing import scale, maxabs_scale
In [39]:
maxabs_scale(s_X_train_tf, copy=False)
Out[39]:
In [40]:
s_Y = s_features.as_matrix(['Y'])
In [41]:
s_Y.shape
Out[41]:
In [42]:
s_X_train_else = s_features.as_matrix(['overall','rev_length','all_caps','sent_count','que_exc_ratio',
'ari_score', 'dev_mean_score'])
In [43]:
type(s_X_train_else)
Out[43]:
We combine the features constructed until now into the final feature matrix s_X
In [44]:
from scipy.sparse import hstack
s_X = hstack([s_X_train_tf, s_X_train_else])
In [45]:
type(s_X)
Out[45]:
In [46]:
s_X.shape
Out[46]:
In [47]:
s_Y.shape
Out[47]:
In [48]:
s_Y = s_Y.reshape((s_Y.shape[0],))
In [49]:
s_Y.shape
Out[49]:
Load the classifier models to be tested: random forest, logistic regression, AdaBoost, and SVC (support vector classifier)
In [50]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, KFold, StratifiedKFold
from sklearn.metrics import confusion_matrix, roc_curve, auc
Split the dataset into training and test sets and get a rough idea of how the classifiers behave
In [51]:
x_tr, x_te, y_tr, y_te = train_test_split(s_X, s_Y, train_size=0.8)
In [52]:
s_cllr = LogisticRegression(class_weight='balanced', n_jobs=-1, C=1e2)
In [53]:
s_cllr.fit(x_tr, y_tr)
print("Test set accuracy:" , s_cllr.score(x_te, y_te))
print("Train set accuracy:", s_cllr.score(x_tr, y_tr))
In [54]:
s_clrf = RandomForestClassifier(n_estimators = 10, class_weight='balanced', n_jobs=-1)
In [55]:
s_clrf.fit(x_tr, y_tr)
print("Test set accuracy:" , s_clrf.score(x_te, y_te))
print("Train set accuracy:", s_clrf.score(x_tr, y_tr))
In [56]:
s_clab = AdaBoostClassifier()
In [57]:
s_clab.fit(x_tr, y_tr)
print("Test set accuracy:" , s_clab.score(x_te, y_te))
print("Train set accuracy:", s_clab.score(x_tr, y_tr))
Note: Unfortunately the SVC classifier (SVM with a linear kernel) takes a long time to train, so it was removed from the analysis; a cheaper alternative is sketched below.
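A possible faster alternative (not run here) is LinearSVC, which uses the liblinear solver and scales much better to high-dimensional sparse inputs than SVC with a linear kernel. A minimal sketch; the variable name and hyperparameters are illustrative:

from sklearn.svm import LinearSVC

# Not executed in this notebook; shown only as a cheaper linear-SVM option
s_cllsvc = LinearSVC(class_weight='balanced', C=1.0)
# s_cllsvc.fit(x_tr, y_tr)
# print("Test set accuracy:", s_cllsvc.score(x_te, y_te))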
In [58]:
sum(y_tr)/len(y_tr)
Out[58]:
The accuracy does not bring any meaningful insight, as the two classes are highly imbalanced: the ratio of positive (helpful) to negative (unhelpful) samples is approximately 3:1.
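To make this concrete, a baseline that always predicts the majority class already reaches an accuracy close to the positive-class ratio; a sketch using scikit-learn's DummyClassifier (not part of the original analysis):

from sklearn.dummy import DummyClassifier

# Always predicts the most frequent class; its accuracy approximates the class ratio
s_dummy = DummyClassifier(strategy='most_frequent')
s_dummy.fit(x_tr, y_tr)
print("Majority-class baseline accuracy:", s_dummy.score(x_te, y_te))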
One alternative is to compute the F1 score for each classifier using cross-validation
In [59]:
shuffle = KFold(n_splits=10, shuffle=True)
In [60]:
f1s_rf = cross_val_score(s_clrf, s_X, s_Y, cv=shuffle, scoring='f1', n_jobs=-1)
In [61]:
print("F1 score mean -- Random forest:" , f1s_rf.mean())
print("F1 score std -- Random forest:", f1s_rf.std())
In [62]:
f1s_lr = cross_val_score(s_cllr, s_X, s_Y, cv=shuffle, scoring='f1', n_jobs=-1)
In [63]:
print("F1 score mean -- Logistic regression:" , f1s_lr.mean())
print("F1 score std -- Logistic regression:", f1s_lr.std())
In [64]:
f1s_ab = cross_val_score(s_clab, s_X, s_Y, cv=shuffle, scoring='f1', n_jobs=-1)
In [65]:
print("F1 score mean -- AdaBoost:" , f1s_ab.mean())
print("F1 score std -- AdaBoost:", f1s_ab.std())
Based on the F1 score, the random forest classifier seems to do a little better than the logistic regression and AdaBoost classifiers.
Next we can also take a look at the confusion matrix for each model.
In [66]:
s_y_pred_rf = cross_val_predict(s_clrf, s_X, s_Y, cv=shuffle, n_jobs=-1)
In [67]:
s_y_pred_lr = cross_val_predict(s_cllr, s_X, s_Y, cv=shuffle, n_jobs=-1)
In [68]:
s_y_pred_ab = cross_val_predict(s_clab, s_X, s_Y, cv=shuffle, n_jobs=-1)
In [69]:
s_cm_rf = confusion_matrix(s_Y, s_y_pred_rf)
In [70]:
s_cm_lr = confusion_matrix(s_Y, s_y_pred_lr)
In [71]:
s_cm_ab = confusion_matrix(s_Y, s_y_pred_ab)
In [72]:
import itertools
def plot_confusion_matrix(cm, classes,
normalize=False,
title='Confusion matrix',
cmap=plt.cm.Blues):
"""
This function prints and plots the confusion matrix.
Normalization can be applied by setting `normalize=True`.
"""
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
print(cm)
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, cm[i, j],
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
In [73]:
# Plot non-normalized confusion matrix -- Random forest
plt.figure()
plot_confusion_matrix(s_cm_rf, classes=['Unhelpful', 'Helpful'],
title='Confusion matrix, without normalization')
plt.show()
In [74]:
# Plot non-normalized confusion matrix -- Logistic regression
plt.figure()
plot_confusion_matrix(s_cm_lr, classes=['Unhelpful', 'Helpful'],
title='Confusion matrix, without normalization')
plt.show()
In [75]:
# Plot non-normalized confusion matrix -- AdaBoost
plt.figure()
plot_confusion_matrix(s_cm_ab, classes=['Unhelpful', 'Helpful'],
title='Confusion matrix, without normalization')
plt.show()
After performing cross-validation to obtain the predictions of each model, we plot the confusion matrix of each model's predictions.
There are different measurements based on the confusion matrix; the F1 score (the harmonic mean of precision and recall) computed before is one of them. A measurement which could be more useful is the false positive rate (FPR = FP / N, where FP is the number of false positives and N the total number of negatives). The intuition is the following: classifying truly helpful reviews as unhelpful is less costly than classifying truly unhelpful reviews as helpful. We see that all the models perform well in identifying helpful reviews, yet some are better at handling the unhelpful ones. The false positive rate focuses on which model is better, or more sensitive, at handling the negative samples (unhelpful reviews).
In [76]:
tn_rf, fp_rf, fn_rf, tp_rf = s_cm_rf.ravel()
In [77]:
false_positive_rate_rf = fp_rf / (fp_rf + tn_rf)
false_positive_rate_rf
Out[77]:
In [78]:
tn_lr, fp_lr, fn_lr, tp_lr = s_cm_lr.ravel()
In [79]:
false_positive_rate_lr = fp_lr / (fp_lr + tn_lr)
false_positive_rate_lr
Out[79]:
In [80]:
tn_ab, fp_ab, fn_ab, tp_ab = s_cm_ab.ravel()
In [81]:
false_positive_rate_ab = fp_ab / (fp_ab + tn_ab)
false_positive_rate_ab
Out[81]:
Logistic regression (again) provides the smallest false positive rate, suggesting that it is the better model for handling the negative cases.
The next measurement step is to analyze the ROC curves of each model.
In [82]:
# with StratifiedKFold the folds are made by preserving the percentage of samples for each class
shuffle_5 = StratifiedKFold(n_splits=5, shuffle=True)
In [83]:
from itertools import cycle
colors = cycle(['cyan', 'indigo', 'seagreen', 'yellow', 'blue'])
In [84]:
from scipy import interp
def plot_roc_cv(classifier):
i = 0
mean_tpr = 0.0
mean_fpr = np.linspace(0, 1, 100)
for (train, test), color in zip(shuffle_5.split(s_X, s_Y), colors):
classifier.fit(s_X.tocsr()[train], s_Y[train])
probas_ = classifier.predict_proba(s_X.tocsr()[test])
# Compute ROC curve and area under the curve
fpr, tpr, thresholds = roc_curve(s_Y[test], probas_[:, 1])
mean_tpr += interp(mean_fpr, fpr, tpr)
mean_tpr[0] = 0.0
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, lw=2, color=color, label='ROC fold %d (area = %0.4f)' % (i, roc_auc))
i += 1
mean_tpr /= shuffle_5.get_n_splits()
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, color='r', linestyle='--', label='Mean ROC (area = %0.4f)' % mean_auc, lw=2)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
In [85]:
# ROC curve(s) for Random Forest
plt.figure()
plot_roc_cv(s_clrf)
plt.show()
In [86]:
# ROC curve(s) for Logistic Regression
plt.figure()
plot_roc_cv(s_cllr)
plt.show()
In [87]:
# ROC curve(s) for AdaBoost
plt.figure()
plot_roc_cv(s_clab)
plt.show()
Based on the results above, random forest provides the best mean area under the ROC curve, followed by logistic regression.
After observing the results of the F1 score, the FPR (false positive rate) and the area under the ROC curve (AUC), both the random forest and the logistic regression models seem to perform better than AdaBoost; however, it is hard to say which of the two is better, as their results are quite similar.
In the last part of the analysis we try to get an intuition about the most relevant features when training the random forest. For the other classifiers, identifying the most important features is not as straightforward as it is for a random forest (a rough alternative for the logistic regression is sketched at the end).
s_feature_names stores the names of the features (the vocabulary together with the constructed features)
In [88]:
s_words_list = s_count_vect.get_feature_names()
s_other_feats = ['overall','rev_length','all_caps','sent_count','que_exc_ratio',
'ari_score', 'dev_mean_score']
s_feature_names = s_words_list + s_other_feats
In [89]:
len(s_feature_names) == s_X.shape[1]
Out[89]:
In [90]:
# the number of the most relevant features to be plotted
number_of_features = 15
importances = s_clrf.feature_importances_
# sort the indices based on the importance values
indices = np.argsort(importances)[::-1]
# the names of the features based on the indices and how many to select
temp_names = []
for i in range(number_of_features):
temp_names.append(s_feature_names[indices[i]])
In [91]:
# Plot the feature importances of the random forest
plt.figure(figsize=(20,10))
plt.title("Feature importances", fontsize=20)
plt.bar(range(number_of_features), importances[indices[:number_of_features]],
color="r", align="center")#, yerr=std[indices])
plt.xticks(range(number_of_features), temp_names, fontsize=20, rotation=45)
plt.xlim([-1, number_of_features])
plt.show()
# Print the feature ranking
print("Feature ranking:")
for f in range(number_of_features):
print("%d. feature %s (%f)" % (f + 1, s_feature_names[indices[f]], importances[indices[f]]))