Predicting sentiment from product reviews

Fire up GraphLab Create


In [1]:
import graphlab

In [157]:
products = graphlab.SFrame('amazon_baby.gl/')

In [158]:
products = products[products['rating'] != 3]

In [159]:
products['sentiment'] = products['rating'] >= 4
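
Ratings of 4 or 5 become positive sentiment (1), ratings of 1 or 2 negative (0); the ambiguous 3-star reviews were dropped in the previous cell. A minimal pure-Python sketch of the same labeling rule (plain dicts standing in for SFrame rows):

```python
# Sketch of the rating -> sentiment rule, mirroring products['rating'] >= 4.
# (3-star rows are assumed to have been filtered out already.)
sample = [{'rating': 5.0}, {'rating': 1.0}, {'rating': 4.0}, {'rating': 2.0}]
for row in sample:
    row['sentiment'] = 1 if row['rating'] >= 4 else 0

labels = [row['sentiment'] for row in sample]  # [1, 0, 1, 0]
```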

In [4]:
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

In [123]:
selected_words_counts = [x + '_count' for x in selected_words]

In [160]:
products['word_count'] = graphlab.text_analytics.count_words(products['review'])

In [161]:
for word in selected_words:
    products[word + '_count'] = products['review'].apply(lambda r: r.count(word))

In [149]:
for word in selected_words:
    print word, products[word + '_count'].sum()


awesome 3380
great 51091
fantastic 1611
amazing 2533
love 65236
horrible 1081
bad 5099
terrible 1126
awful 694
wow 135
hate 3795
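
Note that `str.count` matches raw substrings, case-sensitively, rather than whole words, so the totals above are somewhat inflated ('great' also matches inside 'greatest') while capitalized occurrences like 'Wow!' are missed. A quick illustration:

```python
review = "The greatest pail ever. Wow! Not bad at all."

# str.count is a substring count: 'great' matches inside 'greatest' ...
hits_great = review.count('great')   # 1, from 'greatest'
# ... and it is case-sensitive: the capitalized 'Wow!' is not matched.
hits_wow = review.count('wow')       # 0
hits_bad = review.count('bad')       # 1
```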

In [162]:
train_data, test_data = products.random_split(.8, seed=0)
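
`random_split` assigns each row to the training set with probability 0.8, so the split is only approximately 80/20; the seed makes it reproducible. A rough stand-in using the standard library (`approx_random_split` is a hypothetical helper, not the SFrame implementation, which partitions rows internally):

```python
import random

def approx_random_split(rows, fraction, seed):
    """Probabilistic split in the spirit of SFrame.random_split:
    each row lands in the first part with probability `fraction`,
    so the resulting sizes are approximate, not exact."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

train, test = approx_random_split(list(range(10000)), 0.8, seed=0)
```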

In [163]:
selected_words_model = graphlab.logistic_classifier.create(train_data,
                                                     target='sentiment',
                                                     features=selected_words_counts,
                                                     validation_set=test_data)


PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 133448
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 11
PROGRESS: Number of unpacked features : 11
PROGRESS: Number of coefficients      : 12
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 2        | 0.210012     | 0.845820          | 0.844193            |
PROGRESS: | 2         | 3        | 0.360021     | 0.845775          | 0.844253            |
PROGRESS: | 3         | 4        | 0.503029     | 0.845873          | 0.844313            |
PROGRESS: | 4         | 5        | 0.650038     | 0.845865          | 0.844313            |
PROGRESS: | 5         | 6        | 0.798046     | 0.845873          | 0.844313            |
PROGRESS: | 6         | 7        | 0.943054     | 0.845873          | 0.844313            |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+

In [164]:
selected_words_model


Out[164]:
Class                         : LogisticClassifier

Schema
------
Number of coefficients        : 12
Number of examples            : 133448
Number of classes             : 2
Number of feature columns     : 11
Number of unpacked features   : 11

Hyperparameters
---------------
L1 penalty                    : 0.0
L2 penalty                    : 0.01

Training Summary
----------------
Solver                        : auto
Solver iterations             : 6
Solver status                 : SUCCESS: Optimal solution found.
Training time (sec)           : 0.9831

Settings
--------
Log-likelihood                : 53146.5739

Highest Positive Coefficients
-----------------------------
(intercept)                   : 1.3043
love_count                    : 1.2008
amazing_count                 : 1.0732
awesome_count                 : 1.0221
fantastic_count               : 0.7931

Lowest Negative Coefficients
----------------------------
horrible_count                : -2.1574
terrible_count                : -2.0737
awful_count                   : -1.9094
bad_count                     : -1.0069
hate_count                    : -0.7167
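
In logistic regression a coefficient is the change in log-odds per unit of its feature, so each extra occurrence of a word multiplies the odds of positive sentiment by exp(coefficient). Using the rounded values printed above:

```python
import math

coefficients = {'love_count': 1.2008, 'horrible_count': -2.1574}

# One extra 'love' multiplies the odds of a positive label by ~3.3 ...
love_odds_ratio = math.exp(coefficients['love_count'])
# ... while one extra 'horrible' shrinks them by a factor of ~8.6.
horrible_odds_ratio = math.exp(coefficients['horrible_count'])
```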

In [133]:
selected_words_model.evaluate(test_data)


Out[133]:
{'accuracy': 0.8443129954359837, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      1       |        0        |  120  |
 |      0       |        0        |  263  |
 |      0       |        1        |  5065 |
 |      1       |        1        | 27856 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns]}
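
The 0.8443 accuracy can be recovered directly from the confusion matrix, and it is worth comparing against the majority-class baseline (always predicting positive), which this 11-word model only barely beats:

```python
# Counts copied from the confusion matrix above, keyed (target, predicted).
cm = {(1, 0): 120, (0, 0): 263, (0, 1): 5065, (1, 1): 27856}

total = sum(cm.values())
correct = cm[(0, 0)] + cm[(1, 1)]
accuracy = correct / float(total)             # ~0.8443, matching evaluate()

# Baseline: always predict the majority (positive) class.
positives = cm[(1, 0)] + cm[(1, 1)]
majority_accuracy = positives / float(total)  # ~0.8400
```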

In [117]:
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                     target='sentiment',
                                                     features=['word_count'],
                                                     validation_set=test_data)


PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 133448
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 219217
PROGRESS: Number of coefficients      : 219218
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 5        | 0.000002  | 1.502086     | 0.841481          | 0.839989            |
PROGRESS: | 2         | 9        | 3.000000  | 2.888165     | 0.947425          | 0.894877            |
PROGRESS: | 3         | 10       | 3.000000  | 3.439197     | 0.923768          | 0.866232            |
PROGRESS: | 4         | 11       | 3.000000  | 4.017230     | 0.971779          | 0.912743            |
PROGRESS: | 5         | 12       | 3.000000  | 4.600263     | 0.975511          | 0.908900            |
PROGRESS: | 6         | 13       | 3.000000  | 5.174296     | 0.899991          | 0.825967            |
PROGRESS: | 10        | 18       | 1.000000  | 7.686440     | 0.988715          | 0.916256            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+

In [118]:
sentiment_model.evaluate(test_data)


Out[118]:
{'accuracy': 0.916256305548883, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      1       |        0        |  1461 |
 |      0       |        1        |  1328 |
 |      0       |        0        |  4000 |
 |      1       |        1        | 26515 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns]}
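
Checking against this confusion matrix the same way: the full word-count model reaches ~0.9163 accuracy, a clear improvement over the 0.8443 of the selected-words model, since it can weight all 219,217 distinct words rather than just 11:

```python
# Counts copied from the confusion matrix above, keyed (target, predicted).
cm = {(1, 0): 1461, (0, 1): 1328, (0, 0): 4000, (1, 1): 26515}

total = sum(cm.values())
accuracy = (cm[(0, 0)] + cm[(1, 1)]) / float(total)  # ~0.9163
```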

In [165]:
diaper_champ_reviews = products[products['name'] == 'Baby Trend Diaper Champ']

In [166]:
diaper_champ_reviews['predicted_sentiment'] = sentiment_model.predict(diaper_champ_reviews, output_type='probability')

In [167]:
diaper_champ_reviews = diaper_champ_reviews.sort('predicted_sentiment', ascending=False)

In [168]:
diaper_champ_reviews


Out[168]:
+-------------------------+------------------------------------------------------+--------+-----------+--------------------------------------------------------+---------------+
| name                    | review                                               | rating | sentiment | word_count                                             | awesome_count |
+-------------------------+------------------------------------------------------+--------+-----------+--------------------------------------------------------+---------------+
| Baby Trend Diaper Champ | Baby Luke can turn a clean diaper to a dirty ...     |  5.0   |     1     | {'all': 1L, 'less': 1L, "friend's": 1L, '(whi ...      |       0       |
| Baby Trend Diaper Champ | I LOOOVE this diaper pail! Its the easies ...        |  5.0   |     1     | {'just': 1L, 'over': 1L, 'rweek': 1L, 'sooo': 1L, ...  |       0       |
| Baby Trend Diaper Champ | We researched all of the different types of di ...   |  4.0   |     1     | {'all': 2L, 'just': 4L, "don't": 2L, 'one,': 1L, ...   |       0       |
| Baby Trend Diaper Champ | My baby is now 8 months and the can has been ...     |  5.0   |     1     | {"don't": 1L, 'when': 1L, 'over': 1L, 'soon': 1L, ...  |       0       |
| Baby Trend Diaper Champ | This is absolutely, by far, the best diaper ...      |  5.0   |     1     | {'just': 3L, 'money': 1L, 'not': 2L, 'mechanism': ...  |       0       |
| Baby Trend Diaper Champ | Diaper Champ or Diaper Genie? That was my ...        |  5.0   |     1     | {'all': 1L, 'bags.': 1L, 'son,': 1L, '(i': 1L, ...     |       0       |
| Baby Trend Diaper Champ | Wow! This is fabulous. It was a toss-up between ...  |  5.0   |     1     | {'and': 4L, '"genie".': 1L, 'since': 1L, ...           |       0       |
| Baby Trend Diaper Champ | I originally put this item on my baby registry ...   |  5.0   |     1     | {'lysol': 1L, 'all': 2L, 'bags.': 1L, 'feedback': ...  |       0       |
| Baby Trend Diaper Champ | Two girlfriends and two family members put me ...    |  5.0   |     1     | {'just': 1L, 'when': 1L, 'both': 1L, 'results': ...    |       0       |
| Baby Trend Diaper Champ | I am one of those super-critical shoppers who ...    |  5.0   |     1     | {'taller': 1L, 'bags.': 1L, 'just': 1L, "don't": ...   |       0       |
+-------------------------+------------------------------------------------------+--------+-----------+--------------------------------------------------------+---------------+
+-------------+-----------------+---------------+------------+----------------+-----------+----------------+-------------+
| great_count | fantastic_count | amazing_count | love_count | horrible_count | bad_count | terrible_count | awful_count |
+-------------+-----------------+---------------+------------+----------------+-----------+----------------+-------------+
|      0      |        0        |       0       |     1      |       0        |     0     |       0        |      0      |
|      0      |        0        |       0       |     1      |       0        |     0     |       0        |      0      |
|      0      |        0        |       0       |     0      |       0        |     0     |       0        |      0      |
|      2      |        0        |       0       |     0      |       0        |     1     |       0        |      0      |
|      0      |        0        |       0       |     2      |       0        |     0     |       0        |      0      |
|      0      |        0        |       0       |     0      |       0        |     0     |       0        |      0      |
|      0      |        0        |       0       |     0      |       0        |     0     |       0        |      0      |
|      0      |        0        |       0       |     0      |       0        |     0     |       0        |      0      |
|      0      |        0        |       1       |     0      |       1        |     0     |       0        |      0      |
|      0      |        0        |       0       |     1      |       0        |     0     |       0        |      0      |
+-------------+-----------------+---------------+------------+----------------+-----------+----------------+-------------+
+-----------+------------+---------------------+
| wow_count | hate_count | predicted_sentiment |
+-----------+------------+---------------------+
|     0     |     0      |    0.999999937267   |
|     0     |     0      |    0.999999917406   |
|     0     |     0      |    0.999999899509   |
|     0     |     0      |    0.999999836182   |
|     0     |     0      |    0.999999824745   |
|     0     |     0      |    0.999999759315   |
|     0     |     0      |    0.999999692111   |
|     0     |     0      |    0.999999642488   |
|     0     |     0      |    0.999999604504   |
|     0     |     0      |    0.999999486804   |
+-----------+------------+---------------------+
[298 rows x 17 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [173]:
selected_words_model.predict(diaper_champ_reviews[0:3], output_type='probability')


Out[173]:
dtype: float
Rows: 3
[0.9244960743879546, 0.9244960743879546, 0.786557230500331]

In [171]:
for word in selected_words:
    print word, diaper_champ_reviews[2]['review'].count(word)


awesome 0
great 0
fantastic 0
amazing 0
love 0
horrible 0
bad 0
terrible 0
awful 0
wow 0
hate 0
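
This explains the third prediction above: with every selected-word count at zero, the selected-words model falls back to the intercept alone, and sigmoid(1.3043) ≈ 0.7866, matching the 0.786557 probability it assigned. The first two reviews each contain 'love' once, which adds the love_count coefficient. A sketch using the rounded coefficients from the model summary (so the results match only to a few decimals):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

intercept, love_coef = 1.3043, 1.2008   # rounded values from the summary

# Review with no selected words: the prediction is the intercept alone.
p_no_words = sigmoid(intercept)               # ~0.7866
# Review containing 'love' once: intercept plus the love_count coefficient.
p_one_love = sigmoid(intercept + love_coef)   # ~0.9245
```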