TEXT MINING AMAZON REVIEWS

By Andrew Botero

PART 3 - Exploratory Data Analysis


In [1]:
import numpy as np
import pandas as pd
import gzip
import json
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')

pd.set_option('display.max_colwidth', -1)

In [2]:
# Some functions to handle files

def parse(path):
    # The raw file stores one Python-literal dict per line, hence eval;
    # for strict JSON lines, json.loads(d) would be the safer choice.
    with open(path) as data_file:
        for d in data_file:
            yield eval(d)

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

Let's load the data into a pandas DataFrame. The data can be downloaded from http://jmcauley.ucsd.edu/data/amazon/; I have uncompressed it into my data directory.
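Since gzip is already imported, the compressed file can also be streamed directly without uncompressing it first. A minimal sketch, assuming a gzipped JSON-lines file (the .json.gz filename below is illustrative):

def parse_gz(path):
    # Stream the gzipped file line by line, yielding one review dict per line
    with gzip.open(path, 'rb') as g:
        for line in g:
            yield eval(line)

getDF could then be pointed at parse_gz instead of parse.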


In [3]:
%%time
#Load the dataset
df = getDF('data/Electronics_5_sample_400k.json')


CPU times: user 2min 32s, sys: 16.5 s, total: 2min 48s
Wall time: 3min 6s

Let's take a quick look at the data:


In [4]:
df.head()


Out[4]:
reviewerID asin reviewerName helpful unixReviewTime reviewText overall reviewTime summary
0 A267CNXZTWED3N B007BJHETS C. Harmon "Florida Maveric" [0, 0] 1394236800 Sandisk memory cards are among the best made. Price from Amamzon was very good too. Easy camera insertion of card is a plus. 5.0 03 8, 2014 SanDisk Memory cards
1 A3LPMZJXKFPZ6J B007BJHETS chen [0, 0] 1335312000 For SD or microSD card, sandisk is the only reliable choice.Perfect for the camera user.Good for the price which I paid 17. 5.0 04 25, 2012 Fast
2 A3OQYGKVU7LLU0 B007BJHETS Ches [0, 0] 1362096000 This card was exactly what I was looking for at a much better price than the same ones in my local camera store. I would recommend it. 5.0 03 1, 2013 Just what I was looking for.
3 A3G8HQP6FR4NWS B007BJHETS chris and sherry [0, 0] 1401753600 It shows up as 14 something gigabytes when you put it in something, they all show up as less. This card will fit a ton of stuff.I recorded a couple of movies, and still had room for a few hundred pictures. It will work with both my canon and nikon cameras, both my ho and toshiba laptops. Very much worth the little bit of money. Great buy! 5.0 06 3, 2014 Excellent SD Card
4 AHOTLR1X2YG0Q B007BJHETS Chris B. "Chris" [0, 0] 1398643200 It works well, it is a memory card. I have never had an issue with it, and I do not expect to have an issue with it. I have bought many SanDisk product before and then pleased with all of them. 5.0 04 28, 2014 What is there to say?

In [5]:
df.shape


Out[5]:
(400000, 9)

Let's convert unixReviewTime to a datetime type.


In [6]:
# Transform the Unix timestamp column to a datetime
df['date'] = pd.to_datetime(df['unixReviewTime'], unit='s')
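To sanity-check the conversion, we can print the time span the reviews cover:

print('Reviews span {} to {}'.format(df['date'].min(), df['date'].max()))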

Now let's see if there are any null values in the data.


In [7]:
# Identify rows with missing values
null_data = df[df.isnull().any(axis=1)]
print('There are {} rows with missing values'.format(len(null_data.index)))


There are 9603 rows with missing values

In [8]:
[col for col in null_data.columns if df[col].isnull().any()]


Out[8]:
['reviewerName']

Since the only column with nulls is reviewerName, and we are interested in the review text, we will keep these rows.
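If the reviewer name were needed downstream, the nulls could instead be filled with a placeholder; an optional one-liner:

df['reviewerName'] = df['reviewerName'].fillna('Unknown')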

Let's look at how long the reviews are.


In [9]:
# Review length distribution
df['reviewLength'] = df['reviewText'].apply(len)
df['reviewWordCount'] = df['reviewText'].map(lambda x: len(x.split()))

# Reviews over 1000 characters are quite rare, so we exclude them from the plot
ax = df[df['reviewLength'] < 1000]['reviewLength'].plot(kind='hist')
ax.set_xlabel("Review Length")
ax.set_ylabel("Frequency")
plt.title('Review Length Distribution')
plt.show()



In [10]:
ax = df[df['reviewWordCount'] < 600]['reviewWordCount'].plot(kind='hist')
ax.set_xlabel("Word Count")
ax.set_ylabel("Frequency")
plt.title('Word Count Distribution')
plt.show()


Let's summarize the overall scores.


In [11]:
df['overall'].describe()


Out[11]:
count    400000.000000
mean     4.259350     
std      1.141574     
min      1.000000     
25%      4.000000     
50%      5.000000     
75%      5.000000     
max      5.000000     
Name: overall, dtype: float64

In [12]:
ax = df['overall'].value_counts().plot.bar()
ax.set_xlabel("Rating")
ax.set_ylabel("Frequency")
plt.title('Rating Distribution')
plt.show()


Let's see if there is a relation between the number of reviews and the score.


In [13]:
g1 = df.groupby(["asin", "overall"]).size().reset_index(name='count')

g1 = g1.sort_values(by=['count'], ascending=[False])

# Top 10 (product, rating) pairs with the most reviews
g1.head(10)


Out[13]:
asin overall count
5968 B007WTAJTO 5.0 3922
43009 B00DR0PDNE 5.0 1903
24593 B009SYZ8OC 5.0 1817
34138 B00BGGDVOO 5.0 1473
44014 B00E3W15P0 5.0 1103
15134 B008OHNZI0 5.0 876
4177 B007R5YDYA 5.0 846
43008 B00DR0PDNE 4.0 820
1501 B007I5JT4S 5.0 813
32305 B00B588HY2 5.0 649

In [14]:
g1.tail(10)


Out[14]:
asin overall count
40914 B00D440SYW 3.0 1
11753 B008CXTX5U 1.0 1
40918 B00D45JWXY 4.0 1
27519 B00A81S4YU 3.0 1
11763 B008CXYPE4 2.0 1
40922 B00D467DRK 3.0 1
40924 B00D48OET8 1.0 1
5722 B007VSRAZW 4.0 1
40926 B00D49VO3G 1.0 1
26649 B00A2RB82A 1.0 1

In [15]:
ax = g1[g1['count'] < 30].boxplot(column='count', by='overall')
ax.set_xlabel("Overall rating")
ax.set_ylabel("Counts")
plt.title('')
plt.show()


There seems to be a relation between high scores and the number of reviews.

Let's use the VADER lexicon to do some sentiment analysis.


In [16]:
#import nltk
#nltk.download('vader_lexicon')

In [18]:
%%time
from nltk.sentiment.vader import SentimentIntensityAnalyzer

SIA = SentimentIntensityAnalyzer()

# Compound score: positive if >= 0.5, negative if <= -0.5,
# neutral if -0.5 < compound < 0.5
def vader_scores(text):
    scores = SIA.polarity_scores(text)  # score each review once, not three times
    return pd.Series({
        'vader_compound': scores['compound'],
        'vader_pos': scores['pos'],
        'vader_neg': scores['neg']
    })

df = df.merge(df.reviewText.apply(vader_scores), left_index=True, right_index=True)


CPU times: user 55min 42s, sys: 25.3 s, total: 56min 8s
Wall time: 56min 8s
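As a quick illustration of what the analyzer returns, a single (made-up) sentence can be scored directly:

SIA.polarity_scores("This camera is great, but the battery dies fast.")
# -> a dict with 'neg', 'neu', 'pos' and 'compound' keys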

In [19]:
ax = df.boxplot(column=['vader_compound'], by='overall')
ax.set_xlabel("Overall rating")
ax.set_ylabel("Compound Score")
plt.title('')
plt.show()



In [20]:
ax = df.boxplot(column=['vader_pos', 'vader_neg'], by='overall')
plt.show()


Let's label our dataset, combining the VADER compound score with the star rating.


In [21]:
def get_sentiment(row):
    # The positive branch is checked first, so a review that is positive by
    # either signal (VADER compound or star rating) is labeled Positive.
    if row['vader_compound'] >= 0.5 or row['overall'] > 3:
        return 'Positive'
    elif row['vader_compound'] <= -0.5 or row['overall'] < 3:
        return 'Negative'
    else:
        return 'Neutral'

df['sentiment'] = df.apply(get_sentiment, axis=1)

In [28]:
df[['overall', 'vader_compound', 'reviewText', 'sentiment']].head(30)


Out[28]:
overall vader_compound reviewText sentiment
0 5.0 0.8899 Sandisk memory cards are among the best made. Price from Amamzon was very good too. Easy camera insertion of card is a plus. Positive
1 5.0 0.0000 For SD or microSD card, sandisk is the only reliable choice.Perfect for the camera user.Good for the price which I paid 17. Positive
2 5.0 0.6597 This card was exactly what I was looking for at a much better price than the same ones in my local camera store. I would recommend it. Positive
3 5.0 0.8430 It shows up as 14 something gigabytes when you put it in something, they all show up as less. This card will fit a ton of stuff.I recorded a couple of movies, and still had room for a few hundred pictures. It will work with both my canon and nikon cameras, both my ho and toshiba laptops. Very much worth the little bit of money. Great buy! Positive
4 5.0 0.6124 It works well, it is a memory card. I have never had an issue with it, and I do not expect to have an issue with it. I have bought many SanDisk product before and then pleased with all of them. Positive
5 5.0 0.6249 Bought it along with a new 3DS XL and it works great. Don't know why others are having issues maybe it is their cards. Plenty of room for a 3DS. Positive
6 5.0 0.6059 Have many Sandisk Ultra cards for many different cameras and devices and they are always well made and never fail. Positive
7 5.0 -0.6908 Sandisk makes great memory products, this one is no exception. It is not the fastest, only 30MB/s, but it keeps up with video recording no problem on my Rebel 4Ti. Positive
8 5.0 0.0000 Needed a back up and bought this. Very handy. Positive
9 5.0 0.3384 I used it in my Nikon D3200 and it has greatly increased the speed with my picture taking. I have yet to try it taking videos. Positive
10 5.0 -0.2144 Got this for recording HD video on GoPro and Canon cameras. Never had a problem. It exceeds the 'required' speed by quite a bit, but I'd rather have too much speed rather than not enough. Positive
11 5.0 0.7034 Was just what I needed to order, plus I received a GREAT PRICE. I take lots of pictures, so I need lots of storage. Positive
12 5.0 0.7003 Great for High resolution and HD-capable cameras.8gb is enough for stills, but recommend at least 16gb if shooting HD video. Positive
13 5.0 -0.0183 The higher data transfer rates of this chip and its even more &#34;quick and costly&#34; brethren are worth jumping up a bit in price to get. With modern cameras having large images (16MB a norm and some holding 24MB), even your JPEG compressed images both fill an SD chip faster than before (many many images on 16MGB though) AND most importantly, there's no &#34;wait&#34; as the camera digests all that data you just shoved at it by hitting the shutter. If you want to have rapid fire sequences this chip or an even faster one is a necessity with the large sized images. Positive
14 4.0 0.5966 Very good and reliable product, excellent for photography, I have in the past used other brands and had problem with either the memory or case not fitting well and getting stuck in my camera.I would recommend this brand to anyone is in photography. Positive
15 5.0 -0.4019 works fine in my panasonic lumix camera. No issues with data loss or anything. Sandisk is always reliable I have found. Positive
16 5.0 -0.0516 does what it says no problems here I would purchase again . nothing much else to say . im happy Positive
17 5.0 0.7475 SD card is fast and reliable. It's worth the money if you care for the images you take and store and cannot afford to lose any. Positive
18 5.0 -0.1316 Was going on a of shore fishing trip and need more capacity. bought two as I didn't want to run out of space. Didn't need it as I took more than 3000 pictures and still had room for more! Positive
19 1.0 -0.7469 I bought two of these and both were defective. When using for large downloads whether PC or Mac, the transfer produces an error. Tried to convert to Ex-Fat, but still, no luck. These are just bad cards... Negative
20 5.0 0.7073 The capacity of the card and the transfer speed attainable make this a great alternative to USB memory sticks. I use it for data back up and thus am not inserting and removing it often. It barely projects from the slot in my laptop and thus it is less in the way than the USB devices. Positive
21 4.0 0.8550 I purchased two of these for my Sony NEX-5N. Shooting 16.1 MP RAW photos, I can store over 1000 photos, and more than 4000 if I'm shooting in JPEG format. Either way, its more than enough storage space for any shooting session, or even a week-long vacation. The card speed seems to work perfectly with the camera. I'm never waiting on photos to be saved or pulled up for review. Highly recommended! Positive
22 5.0 0.0910 writes and reads no error just what it should do, received quickly would definitely buy again can't go wrong with this item Positive
23 5.0 0.9273 This memory card worked perfectly and was a good price on a deal of the day. We used in a Canon digital camera and it was perfect for saving the hundreds of photos we shot on our vacation. Highly recommend. Positive
24 5.0 0.6597 Very fast for multimedia data and devices. No compatibility issues were encountered. Great purchase. Would recommend for music and movies. Positive
25 5.0 0.3149 I've bought several of this exact card over time, and have had no issues whatsoever! Very good performance and reliable media in my experience. Positive
26 4.0 -0.3400 I took this on numerous shots in testing environments and it has given me no problems. I would recommend it. Positive
27 5.0 0.8914 SanDisk ULTRA cards are GREAT - have used them for years - they never seem to wear out. I am a non-professional photographer - but take hundreds of pictures. I recently learned the the CLASS 10 is the one to use for video photography - the CLASS 6 being fine for individual shots - which is what I take. My picture quality is 1600 x 1200 - which is fine for 4 x 6 prints. This 8GB card will hold hundreds of pictures this size. The reason I got this CLASS 10 card is tohave it in case I want to take video. Price is good - less than at Walmart or Kmart. Positive
28 5.0 0.4199 Exactly what I needed! Works as designed, not much more to say. Install and ready to go. Can get much more time on one of these and do the slow motion pictures. Positive
29 5.0 0.7475 i ordered this for my camera and it worked great, had plenty of room for pictures and a couple videos, didnt have any problems with it Positive

Let's visualize the most common words. If you use Anaconda, install the wordcloud package with conda install -c conda-forge wordcloud=1.2.1.


In [22]:
from wordcloud import WordCloud, STOPWORDS
stopwords = set(STOPWORDS)
# Transform to single string
positive_reviews_str = df[df['sentiment'] == 'Positive'].reviewText.str.cat()

# Create wordclouds
wordcloud_positive = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=200,
        max_font_size=40, 
        scale=3,
        random_state=1 # chosen at random by flipping a coin; it was heads
    ).generate(positive_reviews_str)

fig = plt.figure(figsize=(30,10))

ax1 = fig.add_subplot(211)
ax1.imshow(wordcloud_positive,interpolation='bilinear')
ax1.axis("off")
ax1.set_title('Reviews with Positive Scores', fontsize=20)
plt.show()



In [23]:
negative_reviews_str = df[df['sentiment'] == 'Negative'].reviewText.str.cat()


wordcloud_negative = WordCloud(
        background_color='black',
        stopwords=stopwords,
        max_words=200,
        max_font_size=40, 
        scale=3,
        random_state=1 # chosen at random by flipping a coin; it was heads
    ).generate(negative_reviews_str)

fig = plt.figure(figsize=(30,10))

ax1 = fig.add_subplot(211)
ax1.imshow(wordcloud_negative,interpolation='bilinear')
ax1.axis("off")
ax1.set_title('Reviews with Negative Scores', fontsize=20)
plt.show()


Overall Sentiment


In [51]:
df.groupby(['sentiment']).size()


Out[51]:
sentiment
Negative    29581 
Positive    359367
dtype: int64
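The classes are heavily imbalanced: roughly 90% of the reviews are positive, which is worth keeping in mind when judging the classifiers in Part 4. The proportions can be computed directly:

df['sentiment'].value_counts(normalize=True)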

PART 4 - Modelling Performance


In [24]:
from sklearn.model_selection import train_test_split
from sklearn import feature_extraction
from sklearn.metrics import confusion_matrix

Let's focus only on the positive and negative reviews.


In [25]:
df = df[df['sentiment'] != 'Neutral']
df = df.reset_index(drop=True)

For our model, the sentiment column needs to be transformed into a binary label.


In [26]:
df['label'] = df['sentiment'].map(lambda x: 1 if x == "Positive" else 0)

In [27]:
train, test = train_test_split(df, test_size=0.2)
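Note that without a fixed seed the split (and therefore the exact numbers below) differs between runs; a seeded variant, if reproducibility matters, would be:

train, test = train_test_split(df, test_size=0.2, random_state=2016)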

In [36]:
%%time
vectorizer = feature_extraction.text.CountVectorizer(analyzer='word', stop_words='english', max_features=1000)
vectorizer.fit(train.reviewText)
x_train = vectorizer.transform(train.reviewText)
x_test = vectorizer.transform(test.reviewText)
y_train = train.label
y_test = test.label
prediction = dict()


CPU times: user 3min 54s, sys: 15.3 s, total: 4min 9s
Wall time: 12min 27s
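To see which tokens survived the max_features cut, the learned vocabulary can be inspected (get_feature_names is the accessor in this version of scikit-learn):

print(vectorizer.get_feature_names()[:20])  # first 20 of the 1000 kept tokens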

In [32]:
def print_confusion(y, y_hat):
    # Rows are the true labels, columns are the model's predictions
    confusion = pd.crosstab(y, y_hat, rownames=['True'], colnames=['Predicted'], margins=True)
    print(confusion)

In [33]:
def print_score(model, x, y):
    # Accuracy (in %) on the supplied set; called below with the training set
    print('Model Score: {:2.4}%'.format(model.score(x, y) * 100))

Multinomial Naïve Bayes learning method


In [37]:
%%time
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB().fit(x_train, y_train)
prediction['Multinomial'] = model.predict(x_test)

print_confusion(y_test, prediction['Multinomial'])
print_score(model, x_train, y_train)


Predicted     0      1    All
True                         
0          2791   3065   5856
1          4722  67212  71934
All        7513  70277  77790
Model Score: 90.11%
CPU times: user 712 ms, sys: 132 ms, total: 844 ms
Wall time: 7.38 s
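Since confusion_matrix is already imported, scikit-learn can produce the same table directly, and classification_report adds per-class precision and recall; a minimal sketch:

from sklearn.metrics import classification_report
print(confusion_matrix(y_test, prediction['Multinomial']))
print(classification_report(y_test, prediction['Multinomial']))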

Bernoulli Naïve Bayes learning method


In [38]:
%%time
from sklearn.naive_bayes import BernoulliNB
model = BernoulliNB().fit(x_train, y_train)
prediction['Bernoulli'] = model.predict(x_test)

print_confusion(y_test, prediction['Bernoulli'])
print_score(model, x_train, y_train)


Predicted      0      1    All
True                          
0           3968   1888   5856
1          12309  59625  71934
All        16277  61513  77790
Model Score: 81.8%
CPU times: user 916 ms, sys: 100 ms, total: 1.02 s
Wall time: 1.57 s

Logistic regression


In [39]:
%%time
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e5)
logreg_result = logreg.fit(x_train, y_train)
prediction['Logistic'] = logreg.predict(x_test)

print_confusion(y_test, prediction['Logistic'])
print_score(logreg, x_train, y_train)


Predicted     0      1    All
True                         
0          1703   4153   5856
1           845  71089  71934
All        2548  75242  77790
Model Score: 93.63%
CPU times: user 10.5 s, sys: 168 ms, total: 10.7 s
Wall time: 11.8 s

Support Vector Machine


In [40]:
%%time
from sklearn import svm
svc_model = svm.LinearSVC(penalty='l1', dual=False, C=1.0, random_state=2016)
svc_model.fit(x_train, y_train)
prediction['SVM'] = svc_model.predict(x_test)

print_confusion(y_test, prediction['SVM'])
print_score(svc_model, x_train, y_train)


Predicted     0      1    All
True                         
0          1170   4686   5856
1           572  71362  71934
All        1742  76048  77790
Model Score: 93.29%
CPU times: user 9.71 s, sys: 84 ms, total: 9.8 s
Wall time: 10 s

Random Forest


In [41]:
from sklearn.ensemble import RandomForestClassifier

In [42]:
%%time
rfc_model = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=2016)
rfc_model.fit(x_train, y_train)
prediction['Random Forest'] = rfc_model.predict(x_test)


CPU times: user 1h 21min 49s, sys: 144 ms, total: 1h 21min 49s
Wall time: 1h 21min 53s

In [43]:
%%time
from sklearn.model_selection import cross_val_score

scores = cross_val_score(rfc_model, x_train, y_train, scoring="roc_auc")
print("CV AUC {}, Average AUC {}".format(scores, scores.mean()))

print_confusion(y_test, prediction['Random Forest'])
print_score(rfc_model, x_train, y_train)


CV AUC [ 0.89135839  0.89398969  0.89295773], Average AUC 0.892768601089
Predicted     0      1    All
True                         
0          1217   4639   5856
1           612  71322  71934
All        1829  75961  77790
Model Score: 99.91%
CPU times: user 1h 58min, sys: 1.33 s, total: 1h 58min 2s
Wall time: 1h 58min 8s
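Note the gap between the 99.91% training accuracy and the roughly 0.89 cross-validated AUC: the random forest is overfitting the training set, so the training score should not be read as generalization performance.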

Plot ROC AUC


In [44]:
from sklearn.metrics import roc_curve, auc

def plot_roc_auc(y, prediction, title_text):

    # 'b' replaces the original 'o', which is a marker style, not a color
    colors = ['g', 'b', 'y', 'k', 'm']

    for i, (model, predicted) in enumerate(prediction.items()):
        false_positive_rate, true_positive_rate, thresholds = roc_curve(y, predicted)
        roc_auc = auc(false_positive_rate, true_positive_rate)
        plt.plot(false_positive_rate, true_positive_rate, colors[i], label='%s: AUC %0.2f' % (model, roc_auc))
    

    plt.title(title_text)
    plt.legend(loc='lower right')
    plt.plot([0,1],[0,1],'r--')
    plt.xlim([-0.1,1.2])
    plt.ylim([-0.1,1.2])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

In [45]:
plot_roc_auc(y_test, prediction, 'Classifiers comparison with ROC')
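Because these curves are built from hard 0/1 predictions, each one has only a single interior point. Predicted probabilities give a proper curve; a minimal sketch for the logistic regression model:

probs = logreg.predict_proba(x_test)[:, 1]  # probability of the positive class
fpr, tpr, _ = roc_curve(y_test, probs)
print('Logistic regression AUC from probabilities: {:.3f}'.format(auc(fpr, tpr)))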



In [ ]: