In [6]:
# import
import graphlab as gl

gl.canvas.set_target('ipynb')

In [3]:
# reading the data
data = gl.SFrame("data/amazon_baby.gl/")
data.head(5)


[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: C:\Users\Atul\AppData\Local\Temp\graphlab_server_1502390410.log.0
Out[3]:
+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
|    Planetwise Flannel Wipes   | These flannel wipes are OK... |  3.0   |
|     Planetwise Wipe Pouch     | it came early and was not ... |  5.0   |
| Annas Dream Full Quilt wit... | Very soft and comfortable ... |  5.0   |
| Stop Pacifier Sucking with... | This is a product well wor... |  5.0   |
| Stop Pacifier Sucking with... | All of my kids have cried ... |  5.0   |
+-------------------------------+-------------------------------+--------+
[5 rows x 3 columns]


In [4]:
# Build a word count vector
data['word_count'] = gl.text_analytics.count_words(data['review'])

In [5]:
data.head(4)


Out[5]:
+-------------------------------+-------------------------------+--------+-------------------------------+
|              name             |             review            | rating |           word_count          |
+-------------------------------+-------------------------------+--------+-------------------------------+
|    Planetwise Flannel Wipes   | These flannel wipes are OK... |  3.0   | {'and': 5L, 'stink': 1L, '... |
|     Planetwise Wipe Pouch     | it came early and was not ... |  5.0   | {'and': 3L, 'love': 1L, 'i... |
| Annas Dream Full Quilt wit... | Very soft and comfortable ... |  5.0   | {'and': 2L, 'quilt': 1L, '... |
| Stop Pacifier Sucking with... | This is a product well wor... |  5.0   | {'ingenious': 1L, 'and': 3... |
+-------------------------------+-------------------------------+--------+-------------------------------+
[4 rows x 4 columns]

Defining Positive and Negative Sentiment
Ignore 3-star ratings (neutral)
1- and 2-star reviews are treated as negative
4- and 5-star reviews are treated as positive


In [8]:
# ignore the neutral 3-star reviews
data2 = data[data['rating'] != 3]

In [9]:
data2['sentiment'] = data2['rating'] > 3
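The two cells above implement a simple labeling rule. In plain Python (with a hypothetical ratings list for illustration) it amounts to:

```python
# Sentiment labeling rule: drop 3-star (neutral) reviews, then label
# ratings above 3 as positive (1) and the rest as negative (0).
ratings = [5.0, 4.0, 3.0, 2.0, 1.0]  # hypothetical example values

kept = [r for r in ratings if r != 3]
sentiment = [1 if r > 3 else 0 for r in kept]

print(sentiment)  # -> [1, 1, 0, 0]
```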

In [11]:
data2.head(5)


Out[11]:
+-------------------------------+-------------------------------+--------+-------------------------------+-----------+
|              name             |             review            | rating |           word_count          | sentiment |
+-------------------------------+-------------------------------+--------+-------------------------------+-----------+
|     Planetwise Wipe Pouch     | it came early and was not ... |  5.0   | {'and': 3L, 'love': 1L, 'i... |     1     |
| Annas Dream Full Quilt wit... | Very soft and comfortable ... |  5.0   | {'and': 2L, 'quilt': 1L, '... |     1     |
| Stop Pacifier Sucking with... | This is a product well wor... |  5.0   | {'ingenious': 1L, 'and': 3... |     1     |
| Stop Pacifier Sucking with... | All of my kids have cried ... |  5.0   | {'and': 2L, 'parents!!': 1... |     1     |
| Stop Pacifier Sucking with... | When the Binky Fairy came ... |  5.0   | {'and': 2L, 'cute': 1L, 'h... |     1     |
+-------------------------------+-------------------------------+--------+-------------------------------+-----------+
[5 rows x 5 columns]


In [12]:
# Training the classifier model
# First, split the data into train and test datasets

train_data, test_data = data2.random_split(0.8, seed=0)
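Conceptually, `random_split` sends each row to the first split with the given probability. A rough pure-Python stand-in (GraphLab uses its own seeded hashing, so the exact row membership will differ):

```python
import random

def random_split_sketch(rows, fraction=0.8, seed=0):
    # Each row lands in the first split with probability `fraction`.
    # This only mimics the idea of SFrame.random_split, not its exact
    # seeded assignment.
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

train, test = random_split_sketch(range(10000))
print(len(train), len(test))  # roughly 8000 / 2000
```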

In [14]:
sentiment_model = gl.logistic_classifier.create(train_data, 
                                    target='sentiment', 
                                    features=['word_count'],
                                   validation_set=test_data)


WARNING: The number of feature dimensions in this problem is very large in comparison with the number of examples. Unless an appropriate regularization value is set, this model may not provide accurate predictions for a validation/test set.
Logistic regression:
--------------------------------------------------------
Number of examples          : 133448
Number of classes           : 2
Number of feature columns   : 1
Number of unpacked features : 219217
Number of coefficients    : 219218
Starting L-BFGS
--------------------------------------------------------
+-----------+----------+-----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+-----------+--------------+-------------------+---------------------+
| 1         | 5        | 0.000002  | 1.825216     | 0.841481          | 0.839989            |
| 2         | 9        | 3.000000  | 3.739485     | 0.947425          | 0.894877            |
| 3         | 10       | 3.000000  | 4.463968     | 0.923768          | 0.866232            |
| 4         | 11       | 3.000000  | 5.223472     | 0.971779          | 0.912743            |
| 5         | 12       | 3.000000  | 5.923937     | 0.975511          | 0.908900            |
| 6         | 13       | 3.000000  | 6.629406     | 0.899991          | 0.825967            |
| 7         | 15       | 1.000000  | 7.639079     | 0.984548          | 0.921451            |
| 8         | 16       | 1.000000  | 8.329540     | 0.985118          | 0.921871            |
| 9         | 17       | 1.000000  | 8.991981     | 0.987066          | 0.919709            |
| 10        | 18       | 1.000000  | 9.697445     | 0.988715          | 0.916256            |
+-----------+----------+-----------+--------------+-------------------+---------------------+
TERMINATED: Iteration limit reached.
This model may not be optimal. To improve it, consider increasing `max_iterations`.

In [15]:
# Evaluate the sentiment model
sentiment_model.evaluate(test_data, metric='roc_curve')


Out[15]:
{'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+----------------+----------------+-------+------+
 | threshold |      fpr       |      tpr       |   p   |  n   |
 +-----------+----------------+----------------+-------+------+
 |    0.0    |      1.0       |      1.0       | 27976 | 5328 |
 |   1e-05   | 0.909346846847 | 0.998856162425 | 27976 | 5328 |
 |   2e-05   | 0.896021021021 | 0.998748927652 | 27976 | 5328 |
 |   3e-05   | 0.886448948949 | 0.998462968259 | 27976 | 5328 |
 |   4e-05   | 0.879692192192 | 0.998284243637 | 27976 | 5328 |
 |   5e-05   | 0.875187687688 | 0.998212753789 | 27976 | 5328 |
 |   6e-05   | 0.872184684685 | 0.998177008865 | 27976 | 5328 |
 |   7e-05   | 0.868618618619 | 0.998034029168 | 27976 | 5328 |
 |   8e-05   | 0.864677177177 | 0.997998284244 | 27976 | 5328 |
 |   9e-05   | 0.860735735736 | 0.997962539319 | 27976 | 5328 |
 +-----------+----------------+----------------+-------+------+
 [100001 rows x 5 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}

In [16]:
sentiment_model.show(view='Evaluation')



In [17]:
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

In [18]:
def word_count(line):
    wc={}
    for word in line.split():
        wc[word] = wc.get(word, 0) +1
    return wc
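A quick sanity check of this helper on a short string (function repeated so the snippet runs on its own). Unlike `gl.text_analytics.count_words`, it does no lowercasing or punctuation handling, so counts on real reviews may differ slightly:

```python
def word_count(line):
    # Count whitespace-separated tokens into a dict.
    wc = {}
    for word in line.split():
        wc[word] = wc.get(word, 0) + 1
    return wc

print(word_count("soft and comfortable and warm"))
# -> {'soft': 1, 'and': 2, 'comfortable': 1, 'warm': 1}
```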

In [60]:
#data3['word_count_dic'] = data2['review'].apply(word_count)

In [20]:
data2.head(5)


Out[20]:
+-------------------------------+-------------------------------+--------+-------------------------------+-----------+
|              name             |             review            | rating |           word_count          | sentiment |
+-------------------------------+-------------------------------+--------+-------------------------------+-----------+
|     Planetwise Wipe Pouch     | it came early and was not ... |  5.0   | {'and': 3L, 'love': 1L, 'i... |     1     |
| Annas Dream Full Quilt wit... | Very soft and comfortable ... |  5.0   | {'and': 2L, 'quilt': 1L, '... |     1     |
| Stop Pacifier Sucking with... | This is a product well wor... |  5.0   | {'ingenious': 1L, 'and': 3... |     1     |
| Stop Pacifier Sucking with... | All of my kids have cried ... |  5.0   | {'and': 2L, 'parents!!': 1... |     1     |
| Stop Pacifier Sucking with... | When the Binky Fairy came ... |  5.0   | {'and': 2L, 'cute': 1L, 'h... |     1     |
+-------------------------------+-------------------------------+--------+-------------------------------+-----------+
[5 rows x 5 columns]

# Earlier draft that counted the selected words from the raw review
# string; superseded by the dict-based version in the next cell.
def selected_word_count(line, selected_words=['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']):
    wc={}
    for word in line.split():
        if word in selected_words:
            wc[word] = wc.get(word, 0) +1
    return wc

In [37]:
def selected_word_count(line, selected_words=['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']):
    wc={}
    for key in selected_words:
         if key in line.keys():
            wc[key] = line[key]
    return wc

In [68]:
#data2['selected_word_count_dic'] = data2['word_count'].apply(selected_word_count)
data2['selected_word_count_dic'] = data2['word_count'].dict_trim_by_keys(selected_words, exclude=False)
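On a single word-count dictionary, `dict_trim_by_keys` behaves like this pure-Python sketch (`trim_by_keys` is a hypothetical stand-in for the SArray method, applied here to one dict rather than a whole column):

```python
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love',
                  'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

def trim_by_keys(wc, keys, exclude=True):
    # Sketch of SArray.dict_trim_by_keys on one dict: drop the listed
    # keys when exclude=True, keep only the listed keys when exclude=False.
    if exclude:
        return {k: v for k, v in wc.items() if k not in keys}
    return {k: v for k, v in wc.items() if k in keys}

wc = {'and': 3, 'love': 1, 'it': 2, 'highly': 1}
print(trim_by_keys(wc, selected_words, exclude=False))  # -> {'love': 1}
```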

In [69]:
def get_count(data, word):
    return data.get(word,0)

In [70]:
for word in selected_words:
    data2[word+'_count'] = data2['selected_word_count_dic'].apply(lambda line: get_count(line, word))

In [71]:
data2.head(4)


Out[71]:
+-------------------------------+-------------------------------+--------+-------------------------------+-----------+
|              name             |             review            | rating |           word_count          | sentiment |
+-------------------------------+-------------------------------+--------+-------------------------------+-----------+
|     Planetwise Wipe Pouch     | it came early and was not ... |  5.0   | {'and': 3L, 'love': 1L, 'i... |     1     |
| Annas Dream Full Quilt wit... | Very soft and comfortable ... |  5.0   | {'and': 2L, 'quilt': 1L, '... |     1     |
| Stop Pacifier Sucking with... | This is a product well wor... |  5.0   | {'ingenious': 1L, 'and': 3... |     1     |
| Stop Pacifier Sucking with... | All of my kids have cried ... |  5.0   | {'and': 2L, 'parents!!': 1... |     1     |
+-------------------------------+-------------------------------+--------+-------------------------------+-----------+
+-------------------------+---------------+-------------+-----------------+---------------+------------+----------------+
| selected_word_count_dic | awesome_count | great_count | fantastic_count | amazing_count | love_count | horrible_count |
+-------------------------+---------------+-------------+-----------------+---------------+------------+----------------+
|       {'love': 1L}      |       0       |      0      |        0        |       0       |     1      |       0        |
|            {}           |       0       |      0      |        0        |       0       |     0      |       0        |
|       {'love': 2L}      |       0       |      0      |        0        |       0       |     2      |       0        |
|      {'great': 1L}      |       0       |      1      |        0        |       0       |     0      |       0        |
+-------------------------+---------------+-------------+-----------------+---------------+------------+----------------+
+-----------+----------------+-------------+-----------+------------+
| bad_count | terrible_count | awful_count | wow_count | hate_count |
+-----------+----------------+-------------+-----------+------------+
|     0     |       0        |      0      |     0     |     0      |
|     0     |       0        |      0      |     0     |     0      |
|     0     |       0        |      0      |     0     |     0      |
|     0     |       0        |      0      |     0     |     0      |
+-----------+----------------+-------------+-----------+------------+
[4 rows x 17 columns]


In [44]:
train_data,test_data = data2.random_split(.8, seed=0)

In [45]:
selected_words_model = gl.logistic_classifier.create(train_data,
                                 target='sentiment', 
                                 features=['selected_word_count_dic'],
                                 validation_set=test_data)


Logistic regression:
--------------------------------------------------------
Number of examples          : 133448
Number of classes           : 2
Number of feature columns   : 1
Number of unpacked features : 11
Number of coefficients    : 12
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+--------------+-------------------+---------------------+
| 1         | 2        | 0.177118     | 0.844299          | 0.842842            |
| 2         | 3        | 0.304202     | 0.844186          | 0.842842            |
| 3         | 4        | 0.463308     | 0.844276          | 0.843142            |
| 4         | 5        | 0.609405     | 0.844269          | 0.843142            |
| 5         | 6        | 0.746498     | 0.844269          | 0.843142            |
| 6         | 7        | 0.873581     | 0.844269          | 0.843142            |
+-----------+----------+--------------+-------------------+---------------------+
SUCCESS: Optimal solution found.


In [67]:
#gl.SFrame.print_rows(num_rows=12, num_columns=5)
coef=selected_words_model['coefficients'].sort('value')
coef.print_rows(num_rows=12, num_columns=5)


+-------------------------+-----------+-------+------------------+
|           name          |   index   | class |      value       |
+-------------------------+-----------+-------+------------------+
| selected_word_count_dic |  terrible |   1   |  -2.09049998487  |
| selected_word_count_dic |  horrible |   1   |  -1.99651800559  |
| selected_word_count_dic |   awful   |   1   |  -1.76469955631  |
| selected_word_count_dic |    hate   |   1   |  -1.40916406276  |
| selected_word_count_dic |    bad    |   1   | -0.985827369929  |
| selected_word_count_dic |    wow    |   1   | -0.0541450123333 |
| selected_word_count_dic |   great   |   1   |  0.883937894898  |
| selected_word_count_dic | fantastic |   1   |  0.891303090304  |
| selected_word_count_dic |  amazing  |   1   |  0.892802422508  |
| selected_word_count_dic |  awesome  |   1   |  1.05800888878   |
|       (intercept)       |    None   |   1   |  1.36728315229   |
| selected_word_count_dic |    love   |   1   |  1.39989834302   |
+-------------------------+-----------+-------+------------------+
+------------------+
|      stderr      |
+------------------+
| 0.0967241912229  |
| 0.0973584169028  |
|  0.134679803365  |
| 0.0771983993506  |
| 0.0433603009142  |
|  0.275616449416  |
| 0.0217379527921  |
|  0.154532343591  |
|  0.127989503231  |
|  0.110865296265  |
| 0.00861805467824 |
| 0.0287147460124  |
+------------------+
[12 rows x 5 columns]
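Because the model is linear in the 11 word counts, its probabilities can be reproduced by hand from the coefficients above: P(positive) = sigmoid(intercept + Σ coefficient · count). For reviews whose `selected_word_count_dic` is `{}` or `{'love': 1}` this matches the `sel_pred_sentiment` values that appear in the Diaper Champ cells further below:

```python
import math

# Coefficients copied from the selected_words_model table above.
intercept = 1.36728315229
coef = {'terrible': -2.09049998487, 'horrible': -1.99651800559,
        'awful': -1.76469955631, 'hate': -1.40916406276,
        'bad': -0.985827369929, 'wow': -0.0541450123333,
        'great': 0.883937894898, 'fantastic': 0.891303090304,
        'amazing': 0.892802422508, 'awesome': 1.05800888878,
        'love': 1.39989834302}

def predict_proba(word_counts):
    # P(sentiment = 1) = 1 / (1 + exp(-(intercept + sum(coef * count))))
    score = intercept + sum(coef[w] * c for w, c in word_counts.items())
    return 1.0 / (1.0 + math.exp(-score))

print(predict_proba({}))           # -> ~0.79694 (no selected words present)
print(predict_proba({'love': 1}))  # -> ~0.94088
```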


In [47]:
# accuracy
selected_words_model.evaluate(test_data)


Out[47]:
{'accuracy': 0.8431419649291376,
 'auc': 0.6648096413721418,
 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      0       |        0        |  234  |
 |      1       |        0        |  130  |
 |      0       |        1        |  5094 |
 |      1       |        1        | 27846 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns],
 'f1_score': 0.914242563530107,
 'log_loss': 0.40547471103656485,
 'precision': 0.8453551912568306,
 'recall': 0.9953531598513011,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+-----+-----+-------+------+
 | threshold | fpr | tpr |   p   |  n   |
 +-----------+-----+-----+-------+------+
 |    0.0    | 1.0 | 1.0 | 27976 | 5328 |
 |   1e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   2e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   3e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   4e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   5e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   6e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   7e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   8e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   9e-05   | 1.0 | 1.0 | 27976 | 5328 |
 +-----------+-----+-----+-------+------+
 [100001 rows x 5 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}
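The headline metrics above follow directly from the confusion matrix; recomputing them by hand is a useful check:

```python
# Counts copied from selected_words_model's confusion matrix above.
tn, fn, fp, tp = 234, 130, 5094, 27846
total = tp + tn + fp + fn

accuracy  = (tp + tn) / float(total)   # -> ~0.84314
precision = tp / float(tp + fp)        # -> ~0.84536
recall    = tp / float(tp + fn)        # -> ~0.99535
f1        = 2 * precision * recall / (precision + recall)  # -> ~0.91424

print(accuracy, precision, recall, f1)
```

Note the very high recall: the model predicts the positive class almost everywhere, which is why its accuracy barely exceeds the positive-class base rate.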

In [48]:
sentiment_model.evaluate(test_data)


Out[48]:
{'accuracy': 0.916256305548883,
 'auc': 0.9446492867438502,
 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      1       |        0        |  1461 |
 |      0       |        1        |  1328 |
 |      0       |        0        |  4000 |
 |      1       |        1        | 26515 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns],
 'f1_score': 0.9500349343413533,
 'log_loss': 0.2610669843242187,
 'precision': 0.9523039902309378,
 'recall': 0.9477766657134686,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+----------------+----------------+-------+------+
 | threshold |      fpr       |      tpr       |   p   |  n   |
 +-----------+----------------+----------------+-------+------+
 |    0.0    |      1.0       |      1.0       | 27976 | 5328 |
 |   1e-05   | 0.909346846847 | 0.998856162425 | 27976 | 5328 |
 |   2e-05   | 0.896021021021 | 0.998748927652 | 27976 | 5328 |
 |   3e-05   | 0.886448948949 | 0.998462968259 | 27976 | 5328 |
 |   4e-05   | 0.879692192192 | 0.998284243637 | 27976 | 5328 |
 |   5e-05   | 0.875187687688 | 0.998212753789 | 27976 | 5328 |
 |   6e-05   | 0.872184684685 | 0.998177008865 | 27976 | 5328 |
 |   7e-05   | 0.868618618619 | 0.998034029168 | 27976 | 5328 |
 |   8e-05   | 0.864677177177 | 0.997998284244 | 27976 | 5328 |
 |   9e-05   | 0.860735735736 | 0.997962539319 | 27976 | 5328 |
 +-----------+----------------+----------------+-------+------+
 [100001 rows x 5 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}
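The same arithmetic makes the comparison concrete: recomputing accuracy from both confusion matrices shows the full-vocabulary sentiment_model clearly beating the 11-word model.

```python
# Accuracy from the two confusion matrices above.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / float(tp + tn + fp + fn)

acc_sentiment = accuracy(tp=26515, tn=4000, fp=1328, fn=1461)  # all words
acc_selected  = accuracy(tp=27846, tn=234,  fp=5094, fn=130)   # 11 words

print(acc_sentiment)  # -> ~0.91626
print(acc_selected)   # -> ~0.84314
```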

In [49]:
# Analyze why sentiment_model works better than selected_words_model

diaper_champ_reviews = data2[data2['name']=='Baby Trend Diaper Champ']
diaper_champ_reviews.head(2)


Out[49]:
+-------------------------+-------------------------------+--------+-------------------------------+-----------+-------------------------+
|           name          |             review            | rating |           word_count          | sentiment | selected_word_count_dic |
+-------------------------+-------------------------------+--------+-------------------------------+-----------+-------------------------+
| Baby Trend Diaper Champ | Ok - newsflash. Diapers ar... |  4.0   | {'just': 2L, 'less': 1L, '... |     1     |            {}           |
| Baby Trend Diaper Champ | My husband and I selected ... |  1.0   | {'just': 1L, 'less': 1L, '... |     0     |            {}           |
+-------------------------+-------------------------------+--------+-------------------------------+-----------+-------------------------+
(all eleven *_count columns are 0 for both rows)
[2 rows x 17 columns]


In [50]:
diaper_champ_reviews['pred_sentiment'] = sentiment_model.predict(diaper_champ_reviews, output_type='probability')
diaper_champ_reviews_sorted = diaper_champ_reviews.sort('pred_sentiment', ascending=False)
diaper_champ_reviews_sorted.head(2)


Out[50]:
+-------------------------+-------------------------------+--------+-------------------------------+-----------+-------------------------+----------------+
|           name          |             review            | rating |           word_count          | sentiment | selected_word_count_dic | pred_sentiment |
+-------------------------+-------------------------------+--------+-------------------------------+-----------+-------------------------+----------------+
| Baby Trend Diaper Champ | Baby Luke can turn a clean... |  5.0   | {'all': 1L, 'less': 1L, "f... |     1     |            {}           | 0.999999937267 |
| Baby Trend Diaper Champ | I LOOOVE this diaper pail!... |  5.0   | {'just': 1L, 'over': 1L, '... |     1     |       {'love': 1L}      | 0.999999917406 |
+-------------------------+-------------------------------+--------+-------------------------------+-----------+-------------------------+----------------+
(the *_count columns are 0 except love_count = 1 in the second row)
[2 rows x 18 columns]


In [75]:
diaper_champ_reviews['sel_pred_sentiment'] = selected_words_model.predict(diaper_champ_reviews, output_type='probability')
# Keep the ordering from sentiment_model so both models' predictions
# can be compared on the same top reviews.
diaper_champ_reviews_sel_sorted = diaper_champ_reviews.sort('pred_sentiment', ascending=False)
diaper_champ_reviews_sel_sorted.head(2)


Out[75]:
+-------------------------+-------------------------------+--------+-------------------------------+-----------+-------------------------+----------------+--------------------+
|           name          |             review            | rating |           word_count          | sentiment | selected_word_count_dic | pred_sentiment | sel_pred_sentiment |
+-------------------------+-------------------------------+--------+-------------------------------+-----------+-------------------------+----------------+--------------------+
| Baby Trend Diaper Champ | Baby Luke can turn a clean... |  5.0   | {'all': 1L, 'less': 1L, "f... |     1     |            {}           | 0.999999937267 |   0.796940851291   |
| Baby Trend Diaper Champ | I LOOOVE this diaper pail!... |  5.0   | {'just': 1L, 'over': 1L, '... |     1     |       {'love': 1L}      | 0.999999917406 |   0.940876393428   |
+-------------------------+-------------------------------+--------+-------------------------------+-----------+-------------------------+----------------+--------------------+
(the *_count columns are 0 except love_count = 1 in the second row)
[2 rows x 19 columns]


In [77]:
diaper_champ_reviews_sel_sorted[0:1]['review']


Out[77]:
dtype: str
Rows: 1
['Baby Luke can turn a clean diaper to a dirty diaper in 3 seconds flat. The diaper champ turns the smelly diaper into "what diaper smell" in less time than that. I hesitated and wondered what I REALLY needed for the nursery. This is one of the best purchases we made. The champ, the baby bjorn, fluerville diaper bag, and graco pack and play bassinet all vie for the best baby purchase.Great product, easy to use, economical, effective, absolutly fabulous.UpdateI knew that I loved the champ, and useing the diaper genie at a friend's house REALLY reinforced that!! There is no comparison, the chanp is easy and smell free, the genie was difficult to use one handed (which is absolutly vital if you have a little one on a changing pad) and there was a deffinite odor eminating from the genieplus we found that the quick tie garbage bags where the ties are integrated into the bag work really well because there isn't any added bulk around the sealing edge of the champ.']

In [74]:
# Out of the 11 words in selected_words, which one is most used in the reviews in the dataset?
print('hate_count',data2['hate_count'].sum())
print('wow_count',data2['wow_count'].sum())
print('awful_count',data2['awful_count'].sum())
print('terrible_count',data2['terrible_count'].sum())
print('bad_count',data2['bad_count'].sum())
print('horrible_count',data2['horrible_count'].sum())
print('love_count',data2['love_count'].sum())
print('amazing_count',data2['amazing_count'].sum())
print('fantastic_count',data2['fantastic_count'].sum())
print('great_count',data2['great_count'].sum())
print('awesome_count',data2['awesome_count'].sum())


('hate_count', 1057L)
('wow_count', 131L)
('awful_count', 345L)
('terrible_count', 673L)
('bad_count', 3197L)
('horrible_count', 659L)
('love_count', 40277L)
('amazing_count', 1305L)
('fantastic_count', 873L)
('great_count', 42420L)
('awesome_count', 2002L)
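Picking the most-used word out of these totals is a one-liner (counts copied from the output above):

```python
# Totals of the 11 selected words across all reviews, from the cell above.
totals = {'hate': 1057, 'wow': 131, 'awful': 345, 'terrible': 673,
          'bad': 3197, 'horrible': 659, 'love': 40277, 'amazing': 1305,
          'fantastic': 873, 'great': 42420, 'awesome': 2002}

most_used = max(totals, key=totals.get)
print(most_used, totals[most_used])  # -> great 42420
```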

In [59]:
# It is quite common to use the **majority class classifier** as a baseline (or reference)
# model when evaluating a classifier. It predicts the majority class for every data point.
# At the very least, a model should comfortably beat the majority class classifier;
# otherwise it is (usually) pointless.

num_positive = (train_data['sentiment'] == 1).sum()
num_negative = (train_data['sentiment'] == 0).sum()  # sentiment is stored as 0/1, not +1/-1
print (num_positive)
print (num_negative)
print (num_positive*1.0/len(train_data))


112283
21165
0.841398896949

In [58]:
test_num_positive = (test_data['sentiment'] == 1).sum()
test_num_negative = (test_data['sentiment'] == 0).sum()  # sentiment is stored as 0/1
print (test_num_positive)
print (test_num_negative)
print(len(test_data))
print(test_num_positive*1.0/len(test_data))


27976
5328
33304
0.840019216911
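That last number is the majority-class baseline accuracy on the test set, computed directly from the counts above:

```python
# Majority-class baseline on the test set: always predict positive.
test_positive, test_total = 27976, 33304

baseline_accuracy = test_positive / float(test_total)
print(baseline_accuracy)  # -> ~0.84002
```

selected_words_model (~0.8431) barely beats this baseline, while sentiment_model (~0.9163) beats it by a wide margin.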

In [ ]: