Amazon Food Reviews Prediction

This dataset contains 568,455 customer reviews of food products. The columns in the Reviews table are:

  • Id - unique identifier for the review
  • ProductId - unique identifier for the product
  • UserId - unique identifier for the user
  • ProfileName - profile name of the user
  • HelpfulnessNumerator - number of users who found the review helpful
  • HelpfulnessDenominator - number of users who indicated whether they found the review helpful
  • Score - rating between 1 and 5
  • Time - timestamp for the review
  • Summary - brief summary of the review
  • Text - text of the review
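As a sanity check before pulling the whole table, the schema can be inspected directly against the SQLite file (a minimal sketch; it assumes database.sqlite sits in the working directory, as in the cells below):

import sqlite3
con = sqlite3.connect('database.sqlite')
# PRAGMA table_info returns one row per column: (cid, name, type, notnull, default, pk)
for cid, name, ctype, notnull, default, pk in con.execute("PRAGMA table_info(Reviews)"):
    print name, ctype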

Load the Dataset


In [1]:
import sqlite3
import pandas as pd
con = sqlite3.connect('database.sqlite')

In [2]:
dat = pd.read_sql_query("""SELECT * FROM Reviews""", con)

In [3]:
dat.head(5)


Out[3]:
Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenominator Score Time Summary Text
0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian 1 1 5 1303862400 Good Quality Dog Food I have bought several of the Vitality canned d...
1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa 0 0 1 1346976000 Not as Advertised Product arrived labeled as Jumbo Salted Peanut...
2 3 B000LQOCH0 ABXLMWJIXXAIN Natalia Corres "Natalia Corres" 1 1 4 1219017600 "Delight" says it all This is a confection that has been around a fe...
3 4 B000UA0QIQ A395BORC6FGVXV Karl 3 3 2 1307923200 Cough Medicine If you are looking for the secret ingredient i...
4 5 B006K2ZZ7K A1UQRSCLF8GW1T Michael D. Bigham "M. Wassir" 0 0 5 1350777600 Great taffy Great taffy at a great price. There was a wid...

In [18]:
import matplotlib.pyplot as plt
%matplotlib inline

In [53]:
print 'Fraction of unique products: %f' % (len(set(dat['ProductId'])) / float(len(dat['ProductId'])))


Fraction of unique products: 0.130632

In [54]:
print 'Fraction of unique customers: %f' % (len(set(dat['UserId'])) / float(len(dat['UserId'])))


Fraction of unique customers: 0.450448

In [55]:
dat['Score'].plot.hist()


Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x36003fb0>

The plot shows that positive reviews dominate. Let's treat scores of 4 and 5 as positive and 1, 2, and 3 as negative for our prediction.


In [4]:
def labelData(score):
    if score >= 4:
        return 'positive'
    return 'negative'

dat['PScore'] = dat['Score'].map(labelData)
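A vectorized equivalent (a sketch using numpy, which avoids the per-row Python call):

import numpy as np
# scores of 4 and 5 become 'positive', everything else 'negative'
dat['PScore'] = np.where(dat['Score'] >= 4, 'positive', 'negative')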

In [5]:
dat[['Score','PScore']].head(5)


Out[5]:
Score PScore
0 5 positive
1 1 negative
2 4 positive
3 2 negative
4 5 positive

We can also combine the Summary and Text columns into a single review string.


In [6]:
def combineText(row):
    return row['Summary'] +" "+ row['Text']

dat['reviews'] = dat.apply(combineText,axis=1)
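The same column can also be built with vectorized pandas string concatenation, which is typically much faster than a row-wise apply (a sketch):

# element-wise string concatenation over the whole column
dat['reviews'] = dat['Summary'] + ' ' + dat['Text']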

In [7]:
dat[['reviews','PScore']].head(5)


Out[7]:
reviews PScore
0 Good Quality Dog Food I have bought several of... positive
1 Not as Advertised Product arrived labeled as J... negative
2 "Delight" says it all This is a confection tha... positive
3 Cough Medicine If you are looking for the secr... negative
4 Great taffy Great taffy at a great price. The... positive

Save the DataFrame to disk


In [8]:
dat[['reviews','PScore']].to_pickle('amzreviews')

Read the data back for further processing


In [20]:
dat = pd.read_pickle('amzreviews')

In [21]:
dat.head(5)


Out[21]:
reviews PScore
0 Good Quality Dog Food I have bought several of... positive
1 Not as Advertised Product arrived labeled as J... negative
2 "Delight" says it all This is a confection tha... positive
3 Cough Medicine If you are looking for the secr... negative
4 Great taffy Great taffy at a great price. The... positive

Use NLTK to clean the data (we keep only the first 10,000 rows to avoid memory errors)


In [22]:
reviews = dat['reviews'].tolist()[:10000]

In [23]:
reviews[:2]


Out[23]:
[u'Good Quality Dog Food I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.',
 u'Not as Advertised Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".']

In [24]:
len(reviews)


Out[24]:
10000

Tokenize words


In [27]:
from nltk.tokenize import word_tokenize
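word_tokenize depends on the punkt models; if they are missing, NLTK raises a LookupError, and a one-time download fixes it (the stopwords corpus used in a later step can be fetched the same way):

import nltk
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # stopword lists used for the stopword-removal step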

In [28]:
tokenized_reviews = []

for doc in reviews:
    tokenized_reviews.append(word_tokenize(doc))
    if len(tokenized_reviews) % 10000 == 0:
        print len(tokenized_reviews)


10000

In [29]:
tokenized_reviews[:3]


Out[29]:
[[u'Good',
  u'Quality',
  u'Dog',
  u'Food',
  u'I',
  u'have',
  u'bought',
  u'several',
  u'of',
  u'the',
  u'Vitality',
  u'canned',
  u'dog',
  u'food',
  u'products',
  u'and',
  u'have',
  u'found',
  u'them',
  u'all',
  u'to',
  u'be',
  u'of',
  u'good',
  u'quality',
  u'.',
  u'The',
  u'product',
  u'looks',
  u'more',
  u'like',
  u'a',
  u'stew',
  u'than',
  u'a',
  u'processed',
  u'meat',
  u'and',
  u'it',
  u'smells',
  u'better',
  u'.',
  u'My',
  u'Labrador',
  u'is',
  u'finicky',
  u'and',
  u'she',
  u'appreciates',
  u'this',
  u'product',
  u'better',
  u'than',
  u'most',
  u'.'],
 [u'Not',
  u'as',
  u'Advertised',
  u'Product',
  u'arrived',
  u'labeled',
  u'as',
  u'Jumbo',
  u'Salted',
  u'Peanuts',
  u'...',
  u'the',
  u'peanuts',
  u'were',
  u'actually',
  u'small',
  u'sized',
  u'unsalted',
  u'.',
  u'Not',
  u'sure',
  u'if',
  u'this',
  u'was',
  u'an',
  u'error',
  u'or',
  u'if',
  u'the',
  u'vendor',
  u'intended',
  u'to',
  u'represent',
  u'the',
  u'product',
  u'as',
  u'``',
  u'Jumbo',
  u"''",
  u'.'],
 [u'``',
  u'Delight',
  u"''",
  u'says',
  u'it',
  u'all',
  u'This',
  u'is',
  u'a',
  u'confection',
  u'that',
  u'has',
  u'been',
  u'around',
  u'a',
  u'few',
  u'centuries',
  u'.',
  u'It',
  u'is',
  u'a',
  u'light',
  u',',
  u'pillowy',
  u'citrus',
  u'gelatin',
  u'with',
  u'nuts',
  u'-',
  u'in',
  u'this',
  u'case',
  u'Filberts',
  u'.',
  u'And',
  u'it',
  u'is',
  u'cut',
  u'into',
  u'tiny',
  u'squares',
  u'and',
  u'then',
  u'liberally',
  u'coated',
  u'with',
  u'powdered',
  u'sugar',
  u'.',
  u'And',
  u'it',
  u'is',
  u'a',
  u'tiny',
  u'mouthful',
  u'of',
  u'heaven',
  u'.',
  u'Not',
  u'too',
  u'chewy',
  u',',
  u'and',
  u'very',
  u'flavorful',
  u'.',
  u'I',
  u'highly',
  u'recommend',
  u'this',
  u'yummy',
  u'treat',
  u'.',
  u'If',
  u'you',
  u'are',
  u'familiar',
  u'with',
  u'the',
  u'story',
  u'of',
  u'C.S',
  u'.',
  u'Lewis',
  u"'",
  u'``',
  u'The',
  u'Lion',
  u',',
  u'The',
  u'Witch',
  u',',
  u'and',
  u'The',
  u'Wardrobe',
  u"''",
  u'-',
  u'this',
  u'is',
  u'the',
  u'treat',
  u'that',
  u'seduces',
  u'Edmund',
  u'into',
  u'selling',
  u'out',
  u'his',
  u'Brother',
  u'and',
  u'Sisters',
  u'to',
  u'the',
  u'Witch',
  u'.']]

Removing punctuation


In [30]:
import re
import string
regex = re.compile('[%s]' % re.escape(string.punctuation))
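A quick sanity check of what this regex does to individual tokens (hypothetical inputs, chosen to match the tokenizer output above):

print regex.sub(u'', u'``')      # -> empty string, so the token gets dropped
print regex.sub(u'', u'Jumbo.')  # -> Jumbo
print regex.sub(u'', u'...')     # -> empty string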

In [31]:
tokenized_reviews_no_punctuation = []

for review in tokenized_reviews:
    
    new_review = []
    for token in review: 
        new_token = regex.sub(u'', token)
        if new_token != u'':
            new_review.append(new_token)
    
    tokenized_reviews_no_punctuation.append(new_review)
    
print tokenized_reviews_no_punctuation[:3]


[[u'Good', u'Quality', u'Dog', u'Food', u'I', u'have', u'bought', u'several', u'of', u'the', u'Vitality', u'canned', u'dog', u'food', u'products', u'and', u'have', u'found', u'them', u'all', u'to', u'be', u'of', u'good', u'quality', u'The', u'product', u'looks', u'more', u'like', u'a', u'stew', u'than', u'a', u'processed', u'meat', u'and', u'it', u'smells', u'better', u'My', u'Labrador', u'is', u'finicky', u'and', u'she', u'appreciates', u'this', u'product', u'better', u'than', u'most'], [u'Not', u'as', u'Advertised', u'Product', u'arrived', u'labeled', u'as', u'Jumbo', u'Salted', u'Peanuts', u'the', u'peanuts', u'were', u'actually', u'small', u'sized', u'unsalted', u'Not', u'sure', u'if', u'this', u'was', u'an', u'error', u'or', u'if', u'the', u'vendor', u'intended', u'to', u'represent', u'the', u'product', u'as', u'Jumbo'], [u'Delight', u'says', u'it', u'all', u'This', u'is', u'a', u'confection', u'that', u'has', u'been', u'around', u'a', u'few', u'centuries', u'It', u'is', u'a', u'light', u'pillowy', u'citrus', u'gelatin', u'with', u'nuts', u'in', u'this', u'case', u'Filberts', u'And', u'it', u'is', u'cut', u'into', u'tiny', u'squares', u'and', u'then', u'liberally', u'coated', u'with', u'powdered', u'sugar', u'And', u'it', u'is', u'a', u'tiny', u'mouthful', u'of', u'heaven', u'Not', u'too', u'chewy', u'and', u'very', u'flavorful', u'I', u'highly', u'recommend', u'this', u'yummy', u'treat', u'If', u'you', u'are', u'familiar', u'with', u'the', u'story', u'of', u'CS', u'Lewis', u'The', u'Lion', u'The', u'Witch', u'and', u'The', u'Wardrobe', u'this', u'is', u'the', u'treat', u'that', u'seduces', u'Edmund', u'into', u'selling', u'out', u'his', u'Brother', u'and', u'Sisters', u'to', u'the', u'Witch']]

In [33]:
print len(tokenized_reviews_no_punctuation)


10000

Removing stopwords


In [19]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # build the set once; per-word lookup is O(1)

tokenized_reviews_no_stopwords = []
for doc in tokenized_reviews_no_punctuation:
    new_term_vector = []
    for word in doc:
        if word not in stop_words:
            new_term_vector.append(word)
    tokenized_reviews_no_stopwords.append(new_term_vector)

print tokenized_reviews_no_stopwords[:4]


[[u'Good', u'Quality', u'Dog', u'Food', u'I', u'bought', u'several', u'Vitality', u'canned', u'dog', u'food', u'products', u'found', u'good', u'quality', u'The', u'product', u'looks', u'like', u'stew', u'processed', u'meat', u'smells', u'better', u'My', u'Labrador', u'finicky', u'appreciates', u'product', u'better'], [u'Not', u'Advertised', u'Product', u'arrived', u'labeled', u'Jumbo', u'Salted', u'Peanuts', u'peanuts', u'actually', u'small', u'sized', u'unsalted', u'Not', u'sure', u'error', u'vendor', u'intended', u'represent', u'product', u'Jumbo'], [u'Delight', u'says', u'This', u'confection', u'around', u'centuries', u'It', u'light', u'pillowy', u'citrus', u'gelatin', u'nuts', u'case', u'Filberts', u'And', u'cut', u'tiny', u'squares', u'liberally', u'coated', u'powdered', u'sugar', u'And', u'tiny', u'mouthful', u'heaven', u'Not', u'chewy', u'flavorful', u'I', u'highly', u'recommend', u'yummy', u'treat', u'If', u'familiar', u'story', u'CS', u'Lewis', u'The', u'Lion', u'The', u'Witch', u'The', u'Wardrobe', u'treat', u'seduces', u'Edmund', u'selling', u'Brother', u'Sisters', u'Witch'], [u'Cough', u'Medicine', u'If', u'looking', u'secret', u'ingredient', u'Robitussin', u'I', u'believe', u'I', u'found', u'I', u'got', u'addition', u'Root', u'Beer', u'Extract', u'I', u'ordered', u'good', u'made', u'cherry', u'soda', u'The', u'flavor', u'medicinal']]

In [20]:
len(tokenized_reviews_no_stopwords)


Out[20]:
10000

Stemming


In [34]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
preprocessed_docs = []
for doc in tokenized_reviews_no_punctuation:  # stemming the docs with stopwords kept
#for doc in tokenized_reviews_no_stopwords:   # swap in to stem the stopword-filtered docs instead
    final_doc = []
    for word in doc:
        final_doc.append(porter.stem(word))
    preprocessed_docs.append(final_doc)

print preprocessed_docs[:4]


[[u'Good', u'Qualiti', u'Dog', u'Food', u'I', u'have', u'bought', u'sever', u'of', u'the', u'Vital', u'can', u'dog', u'food', u'product', u'and', u'have', u'found', u'them', u'all', u'to', u'be', u'of', u'good', u'qualiti', u'The', u'product', u'look', u'more', u'like', u'a', u'stew', u'than', u'a', u'process', u'meat', u'and', u'it', u'smell', u'better', u'My', u'Labrador', u'is', u'finicki', u'and', u'she', u'appreci', u'thi', u'product', u'better', u'than', u'most'], [u'Not', u'as', u'Advertis', u'Product', u'arriv', u'label', u'as', u'Jumbo', u'Salt', u'Peanut', u'the', u'peanut', u'were', u'actual', u'small', u'size', u'unsalt', u'Not', u'sure', u'if', u'thi', u'wa', u'an', u'error', u'or', u'if', u'the', u'vendor', u'intend', u'to', u'repres', u'the', u'product', u'as', u'Jumbo'], [u'Delight', u'say', u'it', u'all', u'Thi', u'is', u'a', u'confect', u'that', u'ha', u'been', u'around', u'a', u'few', u'centuri', u'It', u'is', u'a', u'light', u'pillowi', u'citru', u'gelatin', u'with', u'nut', u'in', u'thi', u'case', u'Filbert', u'And', u'it', u'is', u'cut', u'into', u'tini', u'squar', u'and', u'then', u'liber', u'coat', u'with', u'powder', u'sugar', u'And', u'it', u'is', u'a', u'tini', u'mouth', u'of', u'heaven', u'Not', u'too', u'chewi', u'and', u'veri', u'flavor', u'I', u'highli', u'recommend', u'thi', u'yummi', u'treat', u'If', u'you', u'are', u'familiar', u'with', u'the', u'stori', u'of', u'CS', u'Lewi', u'The', u'Lion', u'The', u'Witch', u'and', u'The', u'Wardrob', u'thi', u'is', u'the', u'treat', u'that', u'seduc', u'Edmund', u'into', u'sell', u'out', u'hi', u'Brother', u'and', u'Sister', u'to', u'the', u'Witch'], [u'Cough', u'Medicin', u'If', u'you', u'are', u'look', u'for', u'the', u'secret', u'ingredi', u'in', u'Robitussin', u'I', u'believ', u'I', u'have', u'found', u'it', u'I', u'got', u'thi', u'in', u'addit', u'to', u'the', u'Root', u'Beer', u'Extract', u'I', u'order', u'which', u'wa', u'good', u'and', u'made', u'some', u'cherri', u'soda', u'The', u'flavor', u'is', u'veri', u'medicin']]
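To see what the Porter stemmer is doing, here are a few words from the reviews above run through it individually (note that newer NLTK versions lowercase the result, while the run above preserved case):

for w in ['quality', 'centuries', 'looking', 'processed']:
    print w, '->', porter.stem(w)
# quality -> qualiti, centuries -> centuri, looking -> look, processed -> process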

In [35]:
# to reduce memory
del tokenized_reviews
del tokenized_reviews_no_punctuation
#del tokenized_reviews_no_stopwords

In [36]:
len(preprocessed_docs)


Out[36]:
10000

In [37]:
final_reviews = []
for doc in preprocessed_docs:
    final_reviews.append((" ".join(doc)).lower())
print final_reviews[:2]


[u'good qualiti dog food i have bought sever of the vital can dog food product and have found them all to be of good qualiti the product look more like a stew than a process meat and it smell better my labrador is finicki and she appreci thi product better than most', u'not as advertis product arriv label as jumbo salt peanut the peanut were actual small size unsalt not sure if thi wa an error or if the vendor intend to repres the product as jumbo']

In [38]:
print len(final_reviews)


10000

In [39]:
scores = dat['PScore'].tolist()[:10000]

In [40]:
len(scores)


Out[40]:
10000

In [41]:
from collections import Counter
Counter(scores)


Out[41]:
Counter({'negative': 2384, 'positive': 7616})
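The classes are imbalanced: roughly 76% of the sampled reviews are positive, so a classifier that always predicts 'positive' already gets about 0.76 accuracy. A quick baseline check (a sketch using the Counter above):

counts = Counter(scores)
print 'Majority-class baseline accuracy: %.4f' % (max(counts.values()) / float(len(scores)))
# -> 0.7616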

Convert into a DataFrame and save it


In [42]:
dat = pd.DataFrame({'reviews':final_reviews,'scores':scores})

In [43]:
dat.head(5)


Out[43]:
reviews scores
0 good qualiti dog food i have bought sever of t... positive
1 not as advertis product arriv label as jumbo s... negative
2 delight say it all thi is a confect that ha be... positive
3 cough medicin if you are look for the secret i... negative
4 great taffi great taffi at a great price there... positive

In [44]:
dat.to_pickle('reviewsAndscores')

In [45]:
dat = pd.read_pickle('reviewsAndscores')

In [46]:
dat.head(5)


Out[46]:
reviews scores
0 good qualiti dog food i have bought sever of t... positive
1 not as advertis product arriv label as jumbo s... negative
2 delight say it all thi is a confect that ha be... positive
3 cough medicin if you are look for the secret i... negative
4 great taffi great taffi at a great price there... positive

Let's do some prediction

For our prediction we fit a logistic regression model and two naive Bayes classifiers (multinomial and Bernoulli).


In [47]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

In [48]:
count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(dat['reviews'])           
tfidf_transformer = TfidfTransformer()
train_tfidf = tfidf_transformer.fit_transform(train_counts)
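The same two steps can be collapsed into one with TfidfVectorizer, which combines CountVectorizer and TfidfTransformer (an equivalent sketch):

from sklearn.feature_extraction.text import TfidfVectorizer
train_tfidf_alt = TfidfVectorizer().fit_transform(dat['reviews'])
print train_tfidf_alt.shape  # same shape as train_tfidf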

Split into train and test sets


In [49]:
from sklearn.cross_validation import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(train_tfidf, dat['scores'], test_size=0.2, random_state=21)
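Given the class imbalance noted earlier, a stratified split keeps the positive/negative ratio the same in both sets. In newer scikit-learn versions train_test_split lives in sklearn.model_selection and accepts a stratify argument (a sketch):

from sklearn.model_selection import train_test_split  # replaces sklearn.cross_validation
X_train, X_test, Y_train, Y_test = train_test_split(
    train_tfidf, dat['scores'], test_size=0.2, random_state=21,
    stratify=dat['scores'])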

Fit Logistic regression


In [50]:
from sklearn import linear_model
prediction = dict()
logreg = linear_model.LogisticRegression(C=1e5)  # large C means very weak regularization
logreg.fit(X_train, Y_train)
prediction['Logistic'] = logreg.predict(X_test)

Fit Multinomial Naive Bayes


In [51]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB().fit(X_train, Y_train)
prediction['Multinomial'] = model.predict(X_test)

Fit Bernoulli Naive Bayes


In [52]:
from sklearn.naive_bayes import BernoulliNB
model = BernoulliNB().fit(X_train, Y_train)
prediction['Bernoulli'] = model.predict(X_test)
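Before looking at ROC curves, a quick accuracy comparison across the three fitted models (a sketch using the prediction dict built above):

from sklearn.metrics import accuracy_score
for name, predicted in prediction.items():
    print '%s accuracy: %.4f' % (name, accuracy_score(Y_test, predicted))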

Metrics


In [53]:
%matplotlib inline
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc

def format(x):
    # map string labels to binary: negative -> 0, positive -> 1
    if x == 'negative':
        return 0
    return 1
vfunc = np.vectorize(format)

cmp = 0
colors = ['b', 'g', 'y', 'm', 'k']
for model, predicted in prediction.items():
    false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test.map(format), vfunc(predicted))
    roc_auc = auc(false_positive_rate, true_positive_rate)
    plt.plot(false_positive_rate, true_positive_rate, colors[cmp], label='%s: AUC %0.2f'% (model,roc_auc))
    cmp += 1

plt.title('Classifier comparison with ROC')
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
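Note that the ROC curves above are built from hard label predictions, so each one is effectively a single operating point. A smoother, more informative curve uses predicted probabilities instead (a sketch for the logistic model; predict_proba returns columns in the order of logreg.classes_, which is alphabetical, so column 1 is 'positive'):

probs = logreg.predict_proba(X_test)[:, 1]          # P(review is positive)
fpr, tpr, _ = roc_curve(Y_test.map(format), probs)  # format maps labels to 0/1
print 'Logistic AUC from probabilities: %.3f' % auc(fpr, tpr)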



In [54]:
# classes are ordered alphabetically, so 'negative' must come first in target_names
print(metrics.classification_report(Y_test, prediction['Logistic'], target_names=["negative", "positive"]))


             precision    recall  f1-score   support

   negative       0.72      0.71      0.72       451
   positive       0.92      0.92      0.92      1549

avg / total       0.87      0.87      0.87      2000


In [55]:
from sklearn.metrics import confusion_matrix

In [56]:
confusion_matrix(Y_test, prediction['Logistic'])


Out[56]:
array([[ 322,  129],
       [ 125, 1424]])

In [58]:
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    labels = sorted(set(scores))  # ['negative', 'positive'] -- matches confusion_matrix ordering
    tick_marks = np.arange(len(labels))
    plt.xticks(tick_marks, labels, rotation=45)
    plt.yticks(tick_marks, labels)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    
# Compute confusion matrix
cm = confusion_matrix(Y_test, prediction['Logistic'])
np.set_printoptions(precision=2)
plt.figure()
plot_confusion_matrix(cm)    

cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
plt.figure()
plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')

plt.show()

