In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use('ggplot')
%matplotlib inline

In [2]:
products = pd.read_csv('../data/amazon_baby.csv')

Let's explore the data


In [3]:
products.head()


Out[3]:
name review rating
0 Planetwise Flannel Wipes These flannel wipes are OK, but in my opinion ... 3
1 Planetwise Wipe Pouch it came early and was not disappointed. i love... 5
2 Annas Dream Full Quilt with 2 Shams Very soft and comfortable and warmer than it l... 5
3 Stop Pacifier Sucking without tears with Thumb... This is a product well worth the purchase. I ... 5
4 Stop Pacifier Sucking without tears with Thumb... All of my kids have cried non-stop when I trie... 5

In [4]:
len(products)


Out[4]:
183531

In [5]:
# scikit-learn does not handle missing values well, so drop every row that has an empty field
products = products.dropna()
len(products)


Out[5]:
182384
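
The two lengths above imply that 183531 - 182384 = 1147 rows were removed. As a rough illustration of what `dropna()` does, here is a minimal sketch on a made-up frame (the rows and the `toy` name are invented for this example, not taken from amazon_baby.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second row has no review text, the third has no name.
toy = pd.DataFrame({
    'name': ['Wipes', 'Pouch', np.nan, 'Quilt'],
    'review': ['Soft and useful', np.nan, 'Arrived early', 'Very warm'],
    'rating': [3, 5, 4, 5],
})

# Missing values per column, useful for seeing where the gaps are.
missing_per_column = toy.isna().sum()

# dropna() removes every row that has at least one missing field.
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(rows_dropped)
```

Running `products.isna().sum()` before the drop would show, in the same way, which columns of the real data contain the gaps.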

Build the word count vector for each review

Here scikit-learn works differently from GraphLab. Word counts are stored in a sparse matrix in which every column is a unique word and every row is a review. For demonstration purposes, and to stay in line with the lecture, a word_counts column is added to the DataFrame here, but the model does not use it later on. Instead, the reviews are transformed with the fitted CountVectorizer cv.
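
To make the matrix layout concrete, the following sketch runs a CountVectorizer on two made-up reviews (the `toy_` names and the sentences are assumptions for illustration only). It is not part of the analysis; it only shows how rows and columns line up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews, used only to illustrate the matrix layout.
toy_reviews = ['the wipes are soft', 'soft and warm wipes']

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_reviews)  # scipy sparse matrix

# vocabulary_ maps each word to its column index; sort the words by that index.
vocabulary = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
dense = toy_counts.toarray()  # only sensible for tiny examples

print(vocabulary)  # one word per column
print(dense)       # one row per review
```

On the full review set the matrix is far too large to convert with `toarray()`, which is why it is kept sparse.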


In [6]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [7]:
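cv.fit(products['review']) # Create the word count vector
# Note: in the pandas version used here, this assignment appears to place the same
# sparse-matrix object in every cell, which is why each row of Out[8] shows identical counts.
products['word_counts'] = cv.transform(products['review'])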

In [8]:
products.head()


Out[8]:
name review rating word_counts
0 Planetwise Flannel Wipes These flannel wipes are OK, but in my opinion ... 3 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\...
1 Planetwise Wipe Pouch it came early and was not disappointed. i love... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\...
2 Annas Dream Full Quilt with 2 Shams Very soft and comfortable and warmer than it l... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\...
3 Stop Pacifier Sucking without tears with Thumb... This is a product well worth the purchase. I ... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\...
4 Stop Pacifier Sucking without tears with Thumb... All of my kids have cried non-stop when I trie... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\...

In [9]:
products['name'].describe()


Out[9]:
count                               182384
unique                               32315
top       Vulli Sophie the Giraffe Teether
freq                                   779
Name: name, dtype: object

The total number of reviews is lower than in the lecture video, most likely because the reviews with missing values were dropped.

Explore Vulli Sophie


In [10]:
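# .copy() gives an independent frame, so adding a column to it later does not trigger SettingWithCopyWarning
giraffe_reviews = products[products['name'] == 'Vulli Sophie the Giraffe Teether'].copy()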

In [11]:
len(giraffe_reviews)


Out[11]:
779

In [12]:
giraffe_reviews['rating'].hist()


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x1169cd908>

In [13]:
giraffe_reviews['rating'].value_counts()


Out[13]:
5    531
4     93
3     62
1     56
2     37
Name: rating, dtype: int64
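
As a small worked check on these counts, the sketch below computes the share of 4- and 5-star reviews, both overall and after leaving out the 3-star reviews (the same cut the classifier section uses). The numbers are copied from Out[13]; the variable names are only illustrative.

```python
# Rating counts for the giraffe teether, copied from Out[13].
rating_counts = {5: 531, 4: 93, 3: 62, 1: 56, 2: 37}

total = sum(rating_counts.values())
positive = rating_counts[5] + rating_counts[4]

# Share of 4- and 5-star reviews among all giraffe reviews.
share_positive = positive / total

# Same share once the 3-star reviews are left out.
share_positive_excl_3 = positive / (total - rating_counts[3])

print(total, round(share_positive, 3), round(share_positive_excl_3, 3))
```

The total matches the 779 reviews found in Out[11], and roughly four in five giraffe reviews carry 4 or 5 stars.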

Build a sentiment classifier

Define what counts as a positive and a negative review


In [14]:
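# Ignore all 3-star reviews
# .copy() makes the filtered frame independent, so the sentiment column can be added without a warning
products = products[products['rating'] != 3].copy()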

In [15]:
products['sentiment'] = products['rating'] >= 4

In [16]:
products.head()


Out[16]:
name review rating word_counts sentiment
1 Planetwise Wipe Pouch it came early and was not disappointed. i love... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... True
2 Annas Dream Full Quilt with 2 Shams Very soft and comfortable and warmer than it l... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... True
3 Stop Pacifier Sucking without tears with Thumb... This is a product well worth the purchase. I ... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... True
4 Stop Pacifier Sucking without tears with Thumb... All of my kids have cried non-stop when I trie... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... True
5 Stop Pacifier Sucking without tears with Thumb... When the Binky Fairy came to our house, we did... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... True

Let's train the sentiment classifier


In [17]:
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split

# Because the train/test split is random, the model will differ slightly
# from the one in the lectures from here on.
train_data, test_data = train_test_split(products, test_size=0.2, random_state=42)

In [18]:
from sklearn.linear_model import LogisticRegression

cv.fit(train_data['review']) # Use the count vector, but fit only the train data

sentiment_model = LogisticRegression().fit(cv.transform(train_data['review']), train_data['sentiment'])

In [19]:
# Predict sentiment for the test data, based on the sentiment model
# The cv.transform is necessary to get the test_data review data in the right format for the model
predicted = sentiment_model.predict(cv.transform(test_data['review']))
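
The fit-on-train, transform-on-test pattern above can also be bundled into a single object. The sketch below is one possible way to do that with scikit-learn's `make_pipeline`, shown on made-up reviews; the `toy_` names and the sentences are assumptions, and this is not how the notebook's model was built.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training and test reviews, for illustration only.
toy_train = ['love it great', 'terrible and broke',
             'great toy love', 'awful terrible smell']
toy_labels = [True, False, True, False]
toy_test = ['great', 'awful']

# fit() learns the vocabulary and the weights from the training reviews only;
# predict() reuses that vocabulary to transform the test reviews.
toy_model = make_pipeline(CountVectorizer(), LogisticRegression())
toy_model.fit(toy_train, toy_labels)

toy_predicted = toy_model.predict(toy_test)
print(toy_predicted)
```

Keeping the vectorizer inside the pipeline makes it harder to fit it on test data by accident, since every call goes through the same fitted vocabulary.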

Evaluate the sentiment model


In [20]:
from sklearn import metrics

# These metrics will differ slightly from those in the lecture, due to the different
# train/test data split and differences in how the model is fitted.
# Note: roc_auc_score is given the hard True/False predictions here, so it scores a
# single threshold rather than the full ROC curve.

print("Accuracy:", metrics.accuracy_score(test_data['sentiment'], predicted))
print("ROC AUC Score:", metrics.roc_auc_score(test_data['sentiment'], predicted))
print("Confusion matrix:")
print(metrics.confusion_matrix(test_data['sentiment'], predicted))
print(metrics.classification_report(test_data['sentiment'], predicted))


Accuracy: 0.936232496379
ROC AUC Score: 0.860505266666
Confusion matrix:
[[ 3997  1341]
 [  772 27026]]
             precision    recall  f1-score   support

      False       0.84      0.75      0.79      5338
       True       0.95      0.97      0.96     27798

avg / total       0.93      0.94      0.93     33136
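
The reported numbers can be re-derived by hand from the confusion matrix. The sketch below does that arithmetic, assuming scikit-learn's usual layout (rows are true labels, columns are predicted labels, in the order False, True); the variable names are only illustrative.

```python
# Confusion matrix copied from the output above.
# Rows are the true labels (False, True); columns are the predicted labels.
tn, fp = 3997, 1341
fn, tp = 772, 27026

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

precision_true = tp / (tp + fp)
recall_true = tp / (tp + fn)    # true positive rate
precision_false = tn / (tn + fn)
recall_false = tn / (tn + fp)   # true negative rate

# With hard True/False predictions there is a single operating point, so the
# area under the ROC "curve" reduces to the mean of the two recalls.
auc_from_labels = (recall_true + recall_false) / 2

print(total, accuracy, auc_from_labels)
print(round(precision_false, 2), round(recall_false, 2))
print(round(precision_true, 2), round(recall_true, 2))
```

The mean of the two recalls reproduces the printed ROC AUC score, which confirms that the score was computed from the thresholded labels rather than from the predicted probabilities.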


In [21]:
# For the ROC curve we need the predicted probabilities rather than the True/False labels;
# these come from the .predict_proba method instead of .predict
predicted_probs = sentiment_model.predict_proba(cv.transform(test_data['review']))

In [22]:
false_positive_rate, true_positive_rate, _ = metrics.roc_curve(test_data['sentiment'], predicted_probs[:,1])

In [23]:
plt.plot(false_positive_rate, true_positive_rate)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Sentiment Analysis')
plt.show()
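
The area under this curve can be summarised in one number. The sketch below, on made-up labels and probabilities (all values and names here are assumptions for illustration), shows that `roc_auc_score` gives a different answer when it is fed probabilities than when it is fed thresholded labels, and that the probability-based value agrees with integrating the `roc_curve` points.

```python
from sklearn import metrics

# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1, 1]
y_prob = [0.2, 0.6, 0.55, 0.7, 0.9]

# Hard predictions at a 0.5 threshold.
y_hard = [int(p >= 0.5) for p in y_prob]

auc_from_probs = metrics.roc_auc_score(y_true, y_prob)
auc_from_labels = metrics.roc_auc_score(y_true, y_hard)

# roc_curve returns (FPR, TPR) points across thresholds; auc integrates them.
fpr, tpr, _ = metrics.roc_curve(y_true, y_prob)
auc_from_curve = metrics.auc(fpr, tpr)

print(auc_from_probs, auc_from_labels, auc_from_curve)
```

Applied to this notebook, the equivalent call would presumably be `metrics.roc_auc_score(test_data['sentiment'], predicted_probs[:,1])`; its value is not shown here because it was not part of the recorded run.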


Applying the learned model to understand sentiment for Giraffe


In [24]:
giraffe_reviews['predicted_sentiment'] = sentiment_model.predict_proba(cv.transform(giraffe_reviews['review']))[:,1]

In [25]:
giraffe_reviews.head()


Out[25]:
name review rating word_counts predicted_sentiment
34313 Vulli Sophie the Giraffe Teether He likes chewing on all the parts especially t... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 0.998540
34314 Vulli Sophie the Giraffe Teether My son loves this toy and fits great in the di... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 0.999326
34315 Vulli Sophie the Giraffe Teether There really should be a large warning on the ... 1 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 0.340178
34316 Vulli Sophie the Giraffe Teether All the moms in my moms\' group got Sophie for... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 0.973703
34317 Vulli Sophie the Giraffe Teether I was a little skeptical on whether Sophie was... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 0.335945

Sort the reviews based on the predicted sentiment and explore


In [26]:
giraffe_reviews.sort_values(by='predicted_sentiment', inplace=True, ascending=False)

In [27]:
# Despite the slightly different model, the same review is ranked highest in predicted sentiment
giraffe_reviews.head(10)


Out[27]:
name review rating word_counts predicted_sentiment
34892 Vulli Sophie the Giraffe Teether Sophie, oh Sophie, your time has come. My gran... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 1.0
34434 Vulli Sophie the Giraffe Teether My Mom-in-Law bought Sophie for my son when he... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 1.0
34515 Vulli Sophie the Giraffe Teether As every mom knows, you always want to give yo... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 1.0
34442 Vulli Sophie the Giraffe Teether Yes, it\'s imported. Yes, it\'s expensive. And... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 1.0
34341 Vulli Sophie the Giraffe Teether I\'ll be honest...I bought this toy because al... 4 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 1.0
34429 Vulli Sophie the Giraffe Teether Let me just start off by addressing the chokin... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 1.0
34746 Vulli Sophie the Giraffe Teether Sophie the Giraffe is the perfect teething toy... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 1.0
34975 Vulli Sophie the Giraffe Teether My 8 week old LOVES Sophie. The rubber feels s... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 1.0
34410 Vulli Sophie the Giraffe Teether Our son really likes Sopie...the problem is th... 5 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 1.0
35035 Vulli Sophie the Giraffe Teether My 4 month old son is teething, and I\'ve trie... 4 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 1.0

In [28]:
giraffe_reviews.iloc[0]['review']


Out[28]:
"Sophie, oh Sophie, your time has come. My granddaughter, Violet is 5 months old and starting to teeth. What joy little Sophie brings to Violet. Sophie is made of a very pliable rubber that is sturdy but not tough. It is quite easy for Violet to twist Sophie into unheard of positions to get Sophie into her mouth. The little nose and hooves fit perfectly into small mouths, and the drooling has purpose. The paint on Sophie is food quality.Sophie was born in 1961 in France. The maker had wondered why there was nothing available for babies and made Sophie from the finest rubber, phthalate-free on St Sophie\\'s Day, thus the name was born. Since that time millions of Sophie\\'s populate the world. She is soft and for babies little hands easy to grasp. Violet especially loves the bumpy head and horns of Sophie. Sophie has a long neck that easy to grasp and twist. She has lovely, sizable spots that attract Violet\\'s attention. Sophie has happy little squeaks that bring squeals of delight from Violet. She is able to make Sophie squeak and that brings much joy. Sophie\\'s smooth skin is soothing to Violet\\'s little gums. Sophie is 7 inches tall and is the exact correct size for babies to hold and love.As you well know the first thing babies grasp, goes into their mouths- how wonderful to have a toy that stimulates all of the senses and helps with the issue of teething. Sophie is small enough to fit into any size pocket or bag. Sophie is the perfect find for babies from a few months to a year old. How wonderful to hear the giggles and laughs that emanate from babies who find Sophie irresistible. Viva La Sophie!Highly Recommended.  prisrob 12-11-09"

Let's look at the negative reviews


In [29]:
# The lowest-scoring review in the lecture is ranked 10th lowest in this analysis
giraffe_reviews.tail(10)


Out[29]:
name review rating word_counts predicted_sentiment
35018 Vulli Sophie the Giraffe Teether My son (now 2.5) LOVED his Sophie, and I bough... 1 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 4.980937e-04
34706 Vulli Sophie the Giraffe Teether Totally overpriced for what it is. Go to Pets... 1 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 3.623238e-04
34556 Vulli Sophie the Giraffe Teether Alright, first off didn\'t realize this was a ... 2 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 3.557328e-04
34709 Vulli Sophie the Giraffe Teether i looked at this teether forever before i fina... 4 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 2.725130e-04
34860 Vulli Sophie the Giraffe Teether This children\'s toy is nostalgic and very cut... 1 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 1.553006e-04
34824 Vulli Sophie the Giraffe Teether I got one of these as a showe gift that my bab... 1 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 1.473995e-04
34994 Vulli Sophie the Giraffe Teether When I received this the paint was peeling off... 1 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 1.093208e-04
34411 Vulli Sophie the Giraffe Teether I was so looking forward to getting this for m... 1 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 1.074097e-04
34732 Vulli Sophie the Giraffe Teether Received the product and smells like cheap rub... 1 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 5.586384e-05
34687 Vulli Sophie the Giraffe Teether I wanted to love this product and was excited ... 1 (0, 303)\t1\n (0, 3288)\t1\n (0, 4889)\t1\... 1.802705e-09

In [30]:
giraffe_reviews.iloc[-1]['review']


Out[30]:
'I wanted to love this product and was excited to buy it when I became pregnant but am now hesitant to let my baby use it after reading about the recall in Europe. Apparently, as I understand it, their toxin standards of measurement are lower than ours so they have not been recalled here (apparently we are OK with low levels of nitrates in the toys our children put in their mouths, but Europeans are not...hmmm)...Be that as it may, toxins registering even CLOSE to a dangerous level made me nervous about using. After digging around online I did discover that the company claims to have changed the product after a certain date and lists manufacturing codes so you can check yours (those listed were made after a certain date and are said to be safer). Sadly mine was not made after the &#34;improved&#34; date but I could not return it because there was no formal recall in our country. I considered returning it and hunting for one with an approved manufacturing date but man that was just too much work. Bummed but not ready to take a risk with my baby.'