Predicting sentiment from product reviews

Fire up GraphLab Create


In [1]:
import graphlab

In [157]:
products = graphlab.SFrame('amazon_baby.gl/')

In [158]:
products = products[products['rating'] != 3]

In [159]:
products['sentiment'] = products['rating'] >= 4
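
Ratings of 4 or 5 become positive sentiment (1), ratings of 1 or 2 negative (0); the ambiguous 3-star reviews were dropped in the previous cell. A minimal pure-Python sketch of the same labeling rule (plain dicts standing in for SFrame rows):

```python
# Sketch of the rating -> sentiment rule, mirroring products['rating'] >= 4.
# (3-star rows are assumed to have been filtered out already.)
sample = [{'rating': 5.0}, {'rating': 1.0}, {'rating': 4.0}, {'rating': 2.0}]
for row in sample:
    row['sentiment'] = 1 if row['rating'] >= 4 else 0

labels = [row['sentiment'] for row in sample]  # [1, 0, 1, 0]
```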

In [4]:
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

In [123]:
selected_words_counts = [x + '_count' for x in selected_words]

In [160]:
products['word_count'] = graphlab.text_analytics.count_words(products['review'])

In [161]:
for word in selected_words:
    products[word + '_count'] = products['review'].apply(lambda r: r.count(word))

In [149]:
for word in selected_words:
    print word, products[word + '_count'].sum()


awesome 3380
great 51091
fantastic 1611
amazing 2533
love 65236
horrible 1081
bad 5099
terrible 1126
awful 694
wow 135
hate 3795
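
Note that `str.count` matches raw substrings, case-sensitively, rather than whole words, so the totals above are somewhat inflated ('great' also matches inside 'greatest') while capitalized occurrences like 'Wow!' are missed. A quick illustration:

```python
review = "The greatest pail ever. Wow! Not bad at all."

# str.count is a substring count: 'great' matches inside 'greatest' ...
hits_great = review.count('great')   # 1, from 'greatest'
# ... and it is case-sensitive: the capitalized 'Wow!' is not matched.
hits_wow = review.count('wow')       # 0
hits_bad = review.count('bad')       # 1
```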

In [162]:
train_data, test_data = products.random_split(.8, seed=0)
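
`random_split` assigns each row to the training set with probability 0.8, so the split is only approximately 80/20; the seed makes it reproducible. A rough stand-in using the standard library (`approx_random_split` is a hypothetical helper, not the SFrame implementation, which partitions rows internally):

```python
import random

def approx_random_split(rows, fraction, seed):
    """Probabilistic split in the spirit of SFrame.random_split:
    each row lands in the first part with probability `fraction`,
    so the resulting sizes are approximate, not exact."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

train, test = approx_random_split(list(range(10000)), 0.8, seed=0)
```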

In [163]:
selected_words_model = graphlab.logistic_classifier.create(train_data,
                                                     target='sentiment',
                                                     features=selected_words_counts,
                                                     validation_set=test_data)


PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 133448
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 11
PROGRESS: Number of unpacked features : 11
PROGRESS: Number of coefficients      : 12
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 2        | 0.210012     | 0.845820          | 0.844193            |
PROGRESS: | 2         | 3        | 0.360021     | 0.845775          | 0.844253            |
PROGRESS: | 3         | 4        | 0.503029     | 0.845873          | 0.844313            |
PROGRESS: | 4         | 5        | 0.650038     | 0.845865          | 0.844313            |
PROGRESS: | 5         | 6        | 0.798046     | 0.845873          | 0.844313            |
PROGRESS: | 6         | 7        | 0.943054     | 0.845873          | 0.844313            |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+

In [164]:
selected_words_model


Out[164]:
Class                         : LogisticClassifier

Schema
------
Number of coefficients        : 12
Number of examples            : 133448
Number of classes             : 2
Number of feature columns     : 11
Number of unpacked features   : 11

Hyperparameters
---------------
L1 penalty                    : 0.0
L2 penalty                    : 0.01

Training Summary
----------------
Solver                        : auto
Solver iterations             : 6
Solver status                 : SUCCESS: Optimal solution found.
Training time (sec)           : 0.9831

Settings
--------
Log-likelihood                : 53146.5739

Highest Positive Coefficients
-----------------------------
(intercept)                   : 1.3043
love_count                    : 1.2008
amazing_count                 : 1.0732
awesome_count                 : 1.0221
fantastic_count               : 0.7931

Lowest Negative Coefficients
----------------------------
horrible_count                : -2.1574
terrible_count                : -2.0737
awful_count                   : -1.9094
bad_count                     : -1.0069
hate_count                    : -0.7167
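
In logistic regression a coefficient is the change in log-odds per unit of its feature, so each extra occurrence of a word multiplies the odds of positive sentiment by exp(coefficient). Using the rounded values printed above:

```python
import math

coefficients = {'love_count': 1.2008, 'horrible_count': -2.1574}

# One extra 'love' multiplies the odds of a positive label by ~3.3 ...
love_odds_ratio = math.exp(coefficients['love_count'])
# ... while one extra 'horrible' shrinks them by a factor of ~8.6.
horrible_odds_ratio = math.exp(coefficients['horrible_count'])
```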

In [133]:
selected_words_model.evaluate(test_data)


Out[133]:
{'accuracy': 0.8443129954359837, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      1       |        0        |  120  |
 |      0       |        0        |  263  |
 |      0       |        1        |  5065 |
 |      1       |        1        | 27856 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns]}
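
The 0.8443 accuracy can be recovered directly from the confusion matrix, and it is worth comparing against the majority-class baseline (always predicting positive), which this 11-word model only barely beats:

```python
# Counts copied from the confusion matrix above, keyed (target, predicted).
cm = {(1, 0): 120, (0, 0): 263, (0, 1): 5065, (1, 1): 27856}

total = sum(cm.values())
correct = cm[(0, 0)] + cm[(1, 1)]
accuracy = correct / float(total)             # ~0.8443, matching evaluate()

# Baseline: always predict the majority (positive) class.
positives = cm[(1, 0)] + cm[(1, 1)]
majority_accuracy = positives / float(total)  # ~0.8400
```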

In [117]:
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                     target='sentiment',
                                                     features=['word_count'],
                                                     validation_set=test_data)


PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 133448
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 219217
PROGRESS: Number of coefficients      : 219218
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 5        | 0.000002  | 1.502086     | 0.841481          | 0.839989            |
PROGRESS: | 2         | 9        | 3.000000  | 2.888165     | 0.947425          | 0.894877            |
PROGRESS: | 3         | 10       | 3.000000  | 3.439197     | 0.923768          | 0.866232            |
PROGRESS: | 4         | 11       | 3.000000  | 4.017230     | 0.971779          | 0.912743            |
PROGRESS: | 5         | 12       | 3.000000  | 4.600263     | 0.975511          | 0.908900            |
PROGRESS: | 6         | 13       | 3.000000  | 5.174296     | 0.899991          | 0.825967            |
PROGRESS: | 10        | 18       | 1.000000  | 7.686440     | 0.988715          | 0.916256            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+

In [118]:
sentiment_model.evaluate(test_data)


Out[118]:
{'accuracy': 0.916256305548883, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      1       |        0        |  1461 |
 |      0       |        1        |  1328 |
 |      0       |        0        |  4000 |
 |      1       |        1        | 26515 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns]}
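
Checking against this confusion matrix the same way: the full word-count model reaches ~0.9163 accuracy, a clear improvement over the 0.8443 of the selected-words model, since it can weight all 219,217 distinct words rather than just 11:

```python
# Counts copied from the confusion matrix above, keyed (target, predicted).
cm = {(1, 0): 1461, (0, 1): 1328, (0, 0): 4000, (1, 1): 26515}

total = sum(cm.values())
accuracy = (cm[(0, 0)] + cm[(1, 1)]) / float(total)  # ~0.9163
```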

In [165]:
diaper_champ_reviews = products[products['name'] == 'Baby Trend Diaper Champ']

In [166]:
diaper_champ_reviews['predicted_sentiment'] = sentiment_model.predict(diaper_champ_reviews, output_type='probability')

In [167]:
diaper_champ_reviews = diaper_champ_reviews.sort('predicted_sentiment', ascending=False)

In [168]:
diaper_champ_reviews


Out[168]:
+-------------------------+------------------------------------------------------+--------+-----------+--------------------------------------------------------+---------------+
| name                    | review                                               | rating | sentiment | word_count                                             | awesome_count |
+-------------------------+------------------------------------------------------+--------+-----------+--------------------------------------------------------+---------------+
| Baby Trend Diaper Champ | Baby Luke can turn a clean diaper to a dirty ...     |  5.0   |     1     | {'all': 1L, 'less': 1L, "friend's": 1L, '(whi ...      |       0       |
| Baby Trend Diaper Champ | I LOOOVE this diaper pail! Its the easies ...        |  5.0   |     1     | {'just': 1L, 'over': 1L, 'rweek': 1L, 'sooo': 1L, ...  |       0       |
| Baby Trend Diaper Champ | We researched all of the different types of di ...   |  4.0   |     1     | {'all': 2L, 'just': 4L, "don't": 2L, 'one,': 1L, ...   |       0       |
| Baby Trend Diaper Champ | My baby is now 8 months and the can has been ...     |  5.0   |     1     | {"don't": 1L, 'when': 1L, 'over': 1L, 'soon': 1L, ...  |       0       |
| Baby Trend Diaper Champ | This is absolutely, by far, the best diaper ...      |  5.0   |     1     | {'just': 3L, 'money': 1L, 'not': 2L, 'mechanism': ...  |       0       |
| Baby Trend Diaper Champ | Diaper Champ or Diaper Genie? That was my ...        |  5.0   |     1     | {'all': 1L, 'bags.': 1L, 'son,': 1L, '(i': 1L, ...     |       0       |
| Baby Trend Diaper Champ | Wow! This is fabulous. It was a toss-up between ...  |  5.0   |     1     | {'and': 4L, '"genie".': 1L, 'since': 1L, ...           |       0       |
| Baby Trend Diaper Champ | I originally put this item on my baby registry ...   |  5.0   |     1     | {'lysol': 1L, 'all': 2L, 'bags.': 1L, 'feedback': ...  |       0       |
| Baby Trend Diaper Champ | Two girlfriends and two family members put me ...    |  5.0   |     1     | {'just': 1L, 'when': 1L, 'both': 1L, 'results': ...    |       0       |
| Baby Trend Diaper Champ | I am one of those super-critical shoppers who ...    |  5.0   |     1     | {'taller': 1L, 'bags.': 1L, 'just': 1L, "don't": ...   |       0       |
+-------------------------+------------------------------------------------------+--------+-----------+--------------------------------------------------------+---------------+
+-------------+-----------------+---------------+------------+----------------+-----------+----------------+-------------+
| great_count | fantastic_count | amazing_count | love_count | horrible_count | bad_count | terrible_count | awful_count |
+-------------+-----------------+---------------+------------+----------------+-----------+----------------+-------------+
|      0      |        0        |       0       |     1      |       0        |     0     |       0        |      0      |
|      0      |        0        |       0       |     1      |       0        |     0     |       0        |      0      |
|      0      |        0        |       0       |     0      |       0        |     0     |       0        |      0      |
|      2      |        0        |       0       |     0      |       0        |     1     |       0        |      0      |
|      0      |        0        |       0       |     2      |       0        |     0     |       0        |      0      |
|      0      |        0        |       0       |     0      |       0        |     0     |       0        |      0      |
|      0      |        0        |       0       |     0      |       0        |     0     |       0        |      0      |
|      0      |        0        |       0       |     0      |       0        |     0     |       0        |      0      |
|      0      |        0        |       1       |     0      |       1        |     0     |       0        |      0      |
|      0      |        0        |       0       |     1      |       0        |     0     |       0        |      0      |
+-------------+-----------------+---------------+------------+----------------+-----------+----------------+-------------+
+-----------+------------+---------------------+
| wow_count | hate_count | predicted_sentiment |
+-----------+------------+---------------------+
|     0     |     0      |    0.999999937267   |
|     0     |     0      |    0.999999917406   |
|     0     |     0      |    0.999999899509   |
|     0     |     0      |    0.999999836182   |
|     0     |     0      |    0.999999824745   |
|     0     |     0      |    0.999999759315   |
|     0     |     0      |    0.999999692111   |
|     0     |     0      |    0.999999642488   |
|     0     |     0      |    0.999999604504   |
|     0     |     0      |    0.999999486804   |
+-----------+------------+---------------------+
[298 rows x 17 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [173]:
selected_words_model.predict(diaper_champ_reviews[0:3], output_type='probability')


Out[173]:
dtype: float
Rows: 3
[0.9244960743879546, 0.9244960743879546, 0.786557230500331]

In [171]:
for word in selected_words:
    print word, diaper_champ_reviews[2]['review'].count(word)


awesome 0
great 0
fantastic 0
amazing 0
love 0
horrible 0
bad 0
terrible 0
awful 0
wow 0
hate 0
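
This explains the third prediction above: with every selected-word count at zero, the selected-words model falls back to the intercept alone, and sigmoid(1.3043) ≈ 0.7866, matching the 0.786557 probability it assigned. The first two reviews each contain 'love' once, which adds the love_count coefficient. A sketch using the rounded coefficients from the model summary (so the results match only to a few decimals):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

intercept, love_coef = 1.3043, 1.2008   # rounded values from the summary

# Review with no selected words: the prediction is the intercept alone.
p_no_words = sigmoid(intercept)               # ~0.7866
# Review containing 'love' once: intercept plus the love_count coefficient.
p_one_love = sigmoid(intercept + love_coef)   # ~0.9245
```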