In [6]:
# import
import graphlab as gl

gl.canvas.set_target('ipynb')

In [3]:
# reading the data
data = gl.SFrame("data/amazon_baby.gl/")
data.head(5)


[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: C:\Users\Atul\AppData\Local\Temp\graphlab_server_1502390410.log.0
Out[3]:
+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
|    Planetwise Flannel Wipes   | These flannel wipes are OK... |  3.0   |
|     Planetwise Wipe Pouch     | it came early and was not ... |  5.0   |
| Annas Dream Full Quilt wit... | Very soft and comfortable ... |  5.0   |
| Stop Pacifier Sucking with... | This is a product well wor... |  5.0   |
| Stop Pacifier Sucking with... | All of my kids have cried ... |  5.0   |
+-------------------------------+-------------------------------+--------+
[5 rows x 3 columns]


In [4]:
# Build a word count vector
data['word_count'] = gl.text_analytics.count_words(data['review'])

In [5]:
data.head(4)


Out[5]:
+-------------------------------+-------------------------------+--------+-------------------------------+
|              name             |             review            | rating |           word_count          |
+-------------------------------+-------------------------------+--------+-------------------------------+
|    Planetwise Flannel Wipes   | These flannel wipes are OK... |  3.0   | {'and': 5L, 'stink': 1L, '... |
|     Planetwise Wipe Pouch     | it came early and was not ... |  5.0   | {'and': 3L, 'love': 1L, 'i... |
| Annas Dream Full Quilt wit... | Very soft and comfortable ... |  5.0   | {'and': 2L, 'quilt': 1L, '... |
| Stop Pacifier Sucking with... | This is a product well wor... |  5.0   | {'ingenious': 1L, 'and': 3... |
+-------------------------------+-------------------------------+--------+-------------------------------+
[4 rows x 4 columns]

Defining Positive and Negative Sentiment
Ignore 3-star ratings (neutral)
1- and 2-star reviews are treated as negative
4- and 5-star reviews are treated as positive


In [8]:
# ignore the neutral 3-star reviews
data2 = data[data['rating'] != 3]

In [9]:
data2['sentiment'] = data2['rating'] > 3
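The two cells above implement a simple labeling rule. In plain Python (with a hypothetical ratings list for illustration) it amounts to:

```python
# Sentiment labeling rule: drop 3-star (neutral) reviews, then label
# ratings above 3 as positive (1) and the rest as negative (0).
ratings = [5.0, 4.0, 3.0, 2.0, 1.0]  # hypothetical example values

kept = [r for r in ratings if r != 3]
sentiment = [1 if r > 3 else 0 for r in kept]

print(sentiment)  # -> [1, 1, 0, 0]
```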

In [11]:
data2.head(5)


Out[11]:
+-------------------------------+-------------------------------+--------+-------------------------------+-----------+
|              name             |             review            | rating |           word_count          | sentiment |
+-------------------------------+-------------------------------+--------+-------------------------------+-----------+
|     Planetwise Wipe Pouch     | it came early and was not ... |  5.0   | {'and': 3L, 'love': 1L, 'i... |     1     |
| Annas Dream Full Quilt wit... | Very soft and comfortable ... |  5.0   | {'and': 2L, 'quilt': 1L, '... |     1     |
| Stop Pacifier Sucking with... | This is a product well wor... |  5.0   | {'ingenious': 1L, 'and': 3... |     1     |
| Stop Pacifier Sucking with... | All of my kids have cried ... |  5.0   | {'and': 2L, 'parents!!': 1... |     1     |
| Stop Pacifier Sucking with... | When the Binky Fairy came ... |  5.0   | {'and': 2L, 'cute': 1L, 'h... |     1     |
+-------------------------------+-------------------------------+--------+-------------------------------+-----------+
[5 rows x 5 columns]


In [12]:
# Training the classifier model
# First, split the data into train and test datasets

train_data, test_data = data2.random_split(0.8, seed=0)
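Conceptually, `random_split` sends each row to the first split with the given probability. A rough pure-Python stand-in (GraphLab uses its own seeded hashing, so the exact row membership will differ):

```python
import random

def random_split_sketch(rows, fraction=0.8, seed=0):
    # Each row lands in the first split with probability `fraction`.
    # This only mimics the idea of SFrame.random_split, not its exact
    # seeded assignment.
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

train, test = random_split_sketch(range(10000))
print(len(train), len(test))  # roughly 8000 / 2000
```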

In [14]:
sentiment_model = gl.logistic_classifier.create(train_data, 
                                    target='sentiment', 
                                    features=['word_count'],
                                   validation_set=test_data)


WARNING: The number of feature dimensions in this problem is very large in comparison with the number of examples. Unless an appropriate regularization value is set, this model may not provide accurate predictions for a validation/test set.
Logistic regression:
--------------------------------------------------------
Number of examples          : 133448
Number of classes           : 2
Number of feature columns   : 1
Number of unpacked features : 219217
Number of coefficients    : 219218
Starting L-BFGS
--------------------------------------------------------
+-----------+----------+-----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+-----------+--------------+-------------------+---------------------+
| 1         | 5        | 0.000002  | 1.825216     | 0.841481          | 0.839989            |
| 2         | 9        | 3.000000  | 3.739485     | 0.947425          | 0.894877            |
| 3         | 10       | 3.000000  | 4.463968     | 0.923768          | 0.866232            |
| 4         | 11       | 3.000000  | 5.223472     | 0.971779          | 0.912743            |
| 5         | 12       | 3.000000  | 5.923937     | 0.975511          | 0.908900            |
| 6         | 13       | 3.000000  | 6.629406     | 0.899991          | 0.825967            |
| 7         | 15       | 1.000000  | 7.639079     | 0.984548          | 0.921451            |
| 8         | 16       | 1.000000  | 8.329540     | 0.985118          | 0.921871            |
| 9         | 17       | 1.000000  | 8.991981     | 0.987066          | 0.919709            |
| 10        | 18       | 1.000000  | 9.697445     | 0.988715          | 0.916256            |
+-----------+----------+-----------+--------------+-------------------+---------------------+
TERMINATED: Iteration limit reached.
This model may not be optimal. To improve it, consider increasing `max_iterations`.

In [15]:
# Evaluate the sentiment model
sentiment_model.evaluate(test_data, metric='roc_curve')


Out[15]:
{'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+----------------+----------------+-------+------+
 | threshold |      fpr       |      tpr       |   p   |  n   |
 +-----------+----------------+----------------+-------+------+
 |    0.0    |      1.0       |      1.0       | 27976 | 5328 |
 |   1e-05   | 0.909346846847 | 0.998856162425 | 27976 | 5328 |
 |   2e-05   | 0.896021021021 | 0.998748927652 | 27976 | 5328 |
 |   3e-05   | 0.886448948949 | 0.998462968259 | 27976 | 5328 |
 |   4e-05   | 0.879692192192 | 0.998284243637 | 27976 | 5328 |
 |   5e-05   | 0.875187687688 | 0.998212753789 | 27976 | 5328 |
 |   6e-05   | 0.872184684685 | 0.998177008865 | 27976 | 5328 |
 |   7e-05   | 0.868618618619 | 0.998034029168 | 27976 | 5328 |
 |   8e-05   | 0.864677177177 | 0.997998284244 | 27976 | 5328 |
 |   9e-05   | 0.860735735736 | 0.997962539319 | 27976 | 5328 |
 +-----------+----------------+----------------+-------+------+
 [100001 rows x 5 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}

In [16]:
sentiment_model.show(view='Evaluation')



In [17]:
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

In [18]:
def word_count(line):
    wc={}
    for word in line.split():
        wc[word] = wc.get(word, 0) +1
    return wc
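A quick sanity check of this helper on a short string (function repeated so the snippet runs on its own). Unlike `gl.text_analytics.count_words`, it does no lowercasing or punctuation handling, so counts on real reviews may differ slightly:

```python
def word_count(line):
    # Count whitespace-separated tokens into a dict.
    wc = {}
    for word in line.split():
        wc[word] = wc.get(word, 0) + 1
    return wc

print(word_count("soft and comfortable and warm"))
# -> {'soft': 1, 'and': 2, 'comfortable': 1, 'warm': 1}
```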

In [60]:
#data3['word_count_dic'] = data2['review'].apply(word_count)

In [20]:
data2.head(5)


Out[20]:
+-------------------------------+-------------------------------+--------+-------------------------------+-----------+
|              name             |             review            | rating |           word_count          | sentiment |
+-------------------------------+-------------------------------+--------+-------------------------------+-----------+
|     Planetwise Wipe Pouch     | it came early and was not ... |  5.0   | {'and': 3L, 'love': 1L, 'i... |     1     |
| Annas Dream Full Quilt wit... | Very soft and comfortable ... |  5.0   | {'and': 2L, 'quilt': 1L, '... |     1     |
| Stop Pacifier Sucking with... | This is a product well wor... |  5.0   | {'ingenious': 1L, 'and': 3... |     1     |
| Stop Pacifier Sucking with... | All of my kids have cried ... |  5.0   | {'and': 2L, 'parents!!': 1... |     1     |
| Stop Pacifier Sucking with... | When the Binky Fairy came ... |  5.0   | {'and': 2L, 'cute': 1L, 'h... |     1     |
+-------------------------------+-------------------------------+--------+-------------------------------+-----------+
[5 rows x 5 columns]

# Earlier draft that counted the selected words from the raw review
# string; superseded by the dict-based version in the next cell.
def selected_word_count(line, selected_words=['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']):
    wc={}
    for word in line.split():
        if word in selected_words:
            wc[word] = wc.get(word, 0) +1
    return wc

In [37]:
def selected_word_count(line, selected_words=['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']):
    wc={}
    for key in selected_words:
         if key in line.keys():
            wc[key] = line[key]
    return wc

In [68]:
#data2['selected_word_count_dic'] = data2['word_count'].apply(selected_word_count)
data2['selected_word_count_dic'] = data2['word_count'].dict_trim_by_keys(selected_words, exclude=False)
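On a single word-count dictionary, `dict_trim_by_keys` behaves like this pure-Python sketch (`trim_by_keys` is a hypothetical stand-in for the SArray method, applied here to one dict rather than a whole column):

```python
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love',
                  'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

def trim_by_keys(wc, keys, exclude=True):
    # Sketch of SArray.dict_trim_by_keys on one dict: drop the listed
    # keys when exclude=True, keep only the listed keys when exclude=False.
    if exclude:
        return {k: v for k, v in wc.items() if k not in keys}
    return {k: v for k, v in wc.items() if k in keys}

wc = {'and': 3, 'love': 1, 'it': 2, 'highly': 1}
print(trim_by_keys(wc, selected_words, exclude=False))  # -> {'love': 1}
```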

In [69]:
def get_count(data, word):
    return data.get(word,0)

In [70]:
for word in selected_words:
    data2[word+'_count'] = data2['selected_word_count_dic'].apply(lambda line: get_count(line, word))

In [71]:
data2.head(4)


Out[71]:
+-------------------------------+-------------------------------+--------+-------------------------------+-----------+
|              name             |             review            | rating |           word_count          | sentiment |
+-------------------------------+-------------------------------+--------+-------------------------------+-----------+
|     Planetwise Wipe Pouch     | it came early and was not ... |  5.0   | {'and': 3L, 'love': 1L, 'i... |     1     |
| Annas Dream Full Quilt wit... | Very soft and comfortable ... |  5.0   | {'and': 2L, 'quilt': 1L, '... |     1     |
| Stop Pacifier Sucking with... | This is a product well wor... |  5.0   | {'ingenious': 1L, 'and': 3... |     1     |
| Stop Pacifier Sucking with... | All of my kids have cried ... |  5.0   | {'and': 2L, 'parents!!': 1... |     1     |
+-------------------------------+-------------------------------+--------+-------------------------------+-----------+
+-------------------------+---------------+-------------+-----------------+---------------+------------+----------------+
| selected_word_count_dic | awesome_count | great_count | fantastic_count | amazing_count | love_count | horrible_count |
+-------------------------+---------------+-------------+-----------------+---------------+------------+----------------+
|       {'love': 1L}      |       0       |      0      |        0        |       0       |     1      |       0        |
|            {}           |       0       |      0      |        0        |       0       |     0      |       0        |
|       {'love': 2L}      |       0       |      0      |        0        |       0       |     2      |       0        |
|      {'great': 1L}      |       0       |      1      |        0        |       0       |     0      |       0        |
+-------------------------+---------------+-------------+-----------------+---------------+------------+----------------+
+-----------+----------------+-------------+-----------+------------+
| bad_count | terrible_count | awful_count | wow_count | hate_count |
+-----------+----------------+-------------+-----------+------------+
|     0     |       0        |      0      |     0     |     0      |
|     0     |       0        |      0      |     0     |     0      |
|     0     |       0        |      0      |     0     |     0      |
|     0     |       0        |      0      |     0     |     0      |
+-----------+----------------+-------------+-----------+------------+
[4 rows x 17 columns]


In [44]:
train_data,test_data = data2.random_split(.8, seed=0)

In [45]:
selected_words_model = gl.logistic_classifier.create(train_data,
                                 target='sentiment', 
                                 features=['selected_word_count_dic'],
                                 validation_set=test_data)


Logistic regression:
--------------------------------------------------------
Number of examples          : 133448
Number of classes           : 2
Number of feature columns   : 1
Number of unpacked features : 11
Number of coefficients    : 12
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+--------------+-------------------+---------------------+
| 1         | 2        | 0.177118     | 0.844299          | 0.842842            |
| 2         | 3        | 0.304202     | 0.844186          | 0.842842            |
| 3         | 4        | 0.463308     | 0.844276          | 0.843142            |
| 4         | 5        | 0.609405     | 0.844269          | 0.843142            |
| 5         | 6        | 0.746498     | 0.844269          | 0.843142            |
| 6         | 7        | 0.873581     | 0.844269          | 0.843142            |
+-----------+----------+--------------+-------------------+---------------------+
SUCCESS: Optimal solution found.


In [67]:
#gl.SFrame.print_rows(num_rows=12, num_columns=5)
coef=selected_words_model['coefficients'].sort('value')
coef.print_rows(num_rows=12, num_columns=5)


+-------------------------+-----------+-------+------------------+
|           name          |   index   | class |      value       |
+-------------------------+-----------+-------+------------------+
| selected_word_count_dic |  terrible |   1   |  -2.09049998487  |
| selected_word_count_dic |  horrible |   1   |  -1.99651800559  |
| selected_word_count_dic |   awful   |   1   |  -1.76469955631  |
| selected_word_count_dic |    hate   |   1   |  -1.40916406276  |
| selected_word_count_dic |    bad    |   1   | -0.985827369929  |
| selected_word_count_dic |    wow    |   1   | -0.0541450123333 |
| selected_word_count_dic |   great   |   1   |  0.883937894898  |
| selected_word_count_dic | fantastic |   1   |  0.891303090304  |
| selected_word_count_dic |  amazing  |   1   |  0.892802422508  |
| selected_word_count_dic |  awesome  |   1   |  1.05800888878   |
|       (intercept)       |    None   |   1   |  1.36728315229   |
| selected_word_count_dic |    love   |   1   |  1.39989834302   |
+-------------------------+-----------+-------+------------------+
+------------------+
|      stderr      |
+------------------+
| 0.0967241912229  |
| 0.0973584169028  |
|  0.134679803365  |
| 0.0771983993506  |
| 0.0433603009142  |
|  0.275616449416  |
| 0.0217379527921  |
|  0.154532343591  |
|  0.127989503231  |
|  0.110865296265  |
| 0.00861805467824 |
| 0.0287147460124  |
+------------------+
[12 rows x 5 columns]
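Because the model is linear in the 11 word counts, its probabilities can be reproduced by hand from the coefficients above: P(positive) = sigmoid(intercept + Σ coefficient · count). For reviews whose `selected_word_count_dic` is `{}` or `{'love': 1}` this matches the `sel_pred_sentiment` values that appear in the Diaper Champ cells further below:

```python
import math

# Coefficients copied from the selected_words_model table above.
intercept = 1.36728315229
coef = {'terrible': -2.09049998487, 'horrible': -1.99651800559,
        'awful': -1.76469955631, 'hate': -1.40916406276,
        'bad': -0.985827369929, 'wow': -0.0541450123333,
        'great': 0.883937894898, 'fantastic': 0.891303090304,
        'amazing': 0.892802422508, 'awesome': 1.05800888878,
        'love': 1.39989834302}

def predict_proba(word_counts):
    # P(sentiment = 1) = 1 / (1 + exp(-(intercept + sum(coef * count))))
    score = intercept + sum(coef[w] * c for w, c in word_counts.items())
    return 1.0 / (1.0 + math.exp(-score))

print(predict_proba({}))           # -> ~0.79694 (no selected words present)
print(predict_proba({'love': 1}))  # -> ~0.94088
```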


In [47]:
# accuracy
selected_words_model.evaluate(test_data)


Out[47]:
{'accuracy': 0.8431419649291376,
 'auc': 0.6648096413721418,
 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      0       |        0        |  234  |
 |      1       |        0        |  130  |
 |      0       |        1        |  5094 |
 |      1       |        1        | 27846 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns],
 'f1_score': 0.914242563530107,
 'log_loss': 0.40547471103656485,
 'precision': 0.8453551912568306,
 'recall': 0.9953531598513011,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+-----+-----+-------+------+
 | threshold | fpr | tpr |   p   |  n   |
 +-----------+-----+-----+-------+------+
 |    0.0    | 1.0 | 1.0 | 27976 | 5328 |
 |   1e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   2e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   3e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   4e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   5e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   6e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   7e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   8e-05   | 1.0 | 1.0 | 27976 | 5328 |
 |   9e-05   | 1.0 | 1.0 | 27976 | 5328 |
 +-----------+-----+-----+-------+------+
 [100001 rows x 5 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}
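The headline metrics above follow directly from the confusion matrix; recomputing them by hand is a useful check:

```python
# Counts copied from selected_words_model's confusion matrix above.
tn, fn, fp, tp = 234, 130, 5094, 27846
total = tp + tn + fp + fn

accuracy  = (tp + tn) / float(total)   # -> ~0.84314
precision = tp / float(tp + fp)        # -> ~0.84536
recall    = tp / float(tp + fn)        # -> ~0.99535
f1        = 2 * precision * recall / (precision + recall)  # -> ~0.91424

print(accuracy, precision, recall, f1)
```

Note the very high recall: the model predicts the positive class almost everywhere, which is why its accuracy barely exceeds the positive-class base rate.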

In [48]:
sentiment_model.evaluate(test_data)


Out[48]:
{'accuracy': 0.916256305548883,
 'auc': 0.9446492867438502,
 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      1       |        0        |  1461 |
 |      0       |        1        |  1328 |
 |      0       |        0        |  4000 |
 |      1       |        1        | 26515 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns],
 'f1_score': 0.9500349343413533,
 'log_loss': 0.2610669843242187,
 'precision': 0.9523039902309378,
 'recall': 0.9477766657134686,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+----------------+----------------+-------+------+
 | threshold |      fpr       |      tpr       |   p   |  n   |
 +-----------+----------------+----------------+-------+------+
 |    0.0    |      1.0       |      1.0       | 27976 | 5328 |
 |   1e-05   | 0.909346846847 | 0.998856162425 | 27976 | 5328 |
 |   2e-05   | 0.896021021021 | 0.998748927652 | 27976 | 5328 |
 |   3e-05   | 0.886448948949 | 0.998462968259 | 27976 | 5328 |
 |   4e-05   | 0.879692192192 | 0.998284243637 | 27976 | 5328 |
 |   5e-05   | 0.875187687688 | 0.998212753789 | 27976 | 5328 |
 |   6e-05   | 0.872184684685 | 0.998177008865 | 27976 | 5328 |
 |   7e-05   | 0.868618618619 | 0.998034029168 | 27976 | 5328 |
 |   8e-05   | 0.864677177177 | 0.997998284244 | 27976 | 5328 |
 |   9e-05   | 0.860735735736 | 0.997962539319 | 27976 | 5328 |
 +-----------+----------------+----------------+-------+------+
 [100001 rows x 5 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}
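The same arithmetic makes the comparison concrete: recomputing accuracy from both confusion matrices shows the full-vocabulary sentiment_model clearly beating the 11-word model.

```python
# Accuracy from the two confusion matrices above.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / float(tp + tn + fp + fn)

acc_sentiment = accuracy(tp=26515, tn=4000, fp=1328, fn=1461)  # all words
acc_selected  = accuracy(tp=27846, tn=234,  fp=5094, fn=130)   # 11 words

print(acc_sentiment)  # -> ~0.91626
print(acc_selected)   # -> ~0.84314
```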

In [49]:
# Analyze why sentiment_model works better than selected_words_model

diaper_champ_reviews = data2[data2['name']=='Baby Trend Diaper Champ']
diaper_champ_reviews.head(2)


Out[49]:
+-------------------------+-------------------------------+--------+-------------------------------+-----------+-------------------------+
|           name          |             review            | rating |           word_count          | sentiment | selected_word_count_dic |
+-------------------------+-------------------------------+--------+-------------------------------+-----------+-------------------------+
| Baby Trend Diaper Champ | Ok - newsflash. Diapers ar... |  4.0   | {'just': 2L, 'less': 1L, '... |     1     |            {}           |
| Baby Trend Diaper Champ | My husband and I selected ... |  1.0   | {'just': 1L, 'less': 1L, '... |     0     |            {}           |
+-------------------------+-------------------------------+--------+-------------------------------+-----------+-------------------------+
(all eleven *_count columns are 0 for both rows)
[2 rows x 17 columns]


In [50]:
diaper_champ_reviews['pred_sentiment'] = sentiment_model.predict(diaper_champ_reviews, output_type='probability')
diaper_champ_reviews_sorted = diaper_champ_reviews.sort('pred_sentiment', ascending=False)
diaper_champ_reviews_sorted.head(2)


Out[50]:
+-------------------------+-------------------------------+--------+-------------------------------+-----------+-------------------------+----------------+
|           name          |             review            | rating |           word_count          | sentiment | selected_word_count_dic | pred_sentiment |
+-------------------------+-------------------------------+--------+-------------------------------+-----------+-------------------------+----------------+
| Baby Trend Diaper Champ | Baby Luke can turn a clean... |  5.0   | {'all': 1L, 'less': 1L, "f... |     1     |            {}           | 0.999999937267 |
| Baby Trend Diaper Champ | I LOOOVE this diaper pail!... |  5.0   | {'just': 1L, 'over': 1L, '... |     1     |       {'love': 1L}      | 0.999999917406 |
+-------------------------+-------------------------------+--------+-------------------------------+-----------+-------------------------+----------------+
(the *_count columns are 0 except love_count = 1 in the second row)
[2 rows x 18 columns]


In [75]:
diaper_champ_reviews['sel_pred_sentiment'] = selected_words_model.predict(diaper_champ_reviews, output_type='probability')
# Keep the ordering from sentiment_model so both models' predictions
# can be compared on the same top reviews.
diaper_champ_reviews_sel_sorted = diaper_champ_reviews.sort('pred_sentiment', ascending=False)
diaper_champ_reviews_sel_sorted.head(2)


Out[75]:
+-------------------------+-------------------------------+--------+-------------------------------+-----------+-------------------------+----------------+--------------------+
|           name          |             review            | rating |           word_count          | sentiment | selected_word_count_dic | pred_sentiment | sel_pred_sentiment |
+-------------------------+-------------------------------+--------+-------------------------------+-----------+-------------------------+----------------+--------------------+
| Baby Trend Diaper Champ | Baby Luke can turn a clean... |  5.0   | {'all': 1L, 'less': 1L, "f... |     1     |            {}           | 0.999999937267 |   0.796940851291   |
| Baby Trend Diaper Champ | I LOOOVE this diaper pail!... |  5.0   | {'just': 1L, 'over': 1L, '... |     1     |       {'love': 1L}      | 0.999999917406 |   0.940876393428   |
+-------------------------+-------------------------------+--------+-------------------------------+-----------+-------------------------+----------------+--------------------+
(the *_count columns are 0 except love_count = 1 in the second row)
[2 rows x 19 columns]


In [77]:
diaper_champ_reviews_sel_sorted[0:1]['review']


Out[77]:
dtype: str
Rows: 1
['Baby Luke can turn a clean diaper to a dirty diaper in 3 seconds flat. The diaper champ turns the smelly diaper into "what diaper smell" in less time than that. I hesitated and wondered what I REALLY needed for the nursery. This is one of the best purchases we made. The champ, the baby bjorn, fluerville diaper bag, and graco pack and play bassinet all vie for the best baby purchase.Great product, easy to use, economical, effective, absolutly fabulous.UpdateI knew that I loved the champ, and useing the diaper genie at a friend's house REALLY reinforced that!! There is no comparison, the chanp is easy and smell free, the genie was difficult to use one handed (which is absolutly vital if you have a little one on a changing pad) and there was a deffinite odor eminating from the genieplus we found that the quick tie garbage bags where the ties are integrated into the bag work really well because there isn't any added bulk around the sealing edge of the champ.']

In [74]:
# Out of the 11 words in selected_words, which one is most used in the reviews in the dataset?
print('hate_count',data2['hate_count'].sum())
print('wow_count',data2['wow_count'].sum())
print('awful_count',data2['awful_count'].sum())
print('terrible_count',data2['terrible_count'].sum())
print('bad_count',data2['bad_count'].sum())
print('horrible_count',data2['horrible_count'].sum())
print('love_count',data2['love_count'].sum())
print('amazing_count',data2['amazing_count'].sum())
print('fantastic_count',data2['fantastic_count'].sum())
print('great_count',data2['great_count'].sum())
print('awesome_count',data2['awesome_count'].sum())


('hate_count', 1057L)
('wow_count', 131L)
('awful_count', 345L)
('terrible_count', 673L)
('bad_count', 3197L)
('horrible_count', 659L)
('love_count', 40277L)
('amazing_count', 1305L)
('fantastic_count', 873L)
('great_count', 42420L)
('awesome_count', 2002L)
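Picking the most-used word out of these totals is a one-liner (counts copied from the output above):

```python
# Totals of the 11 selected words across all reviews, from the cell above.
totals = {'hate': 1057, 'wow': 131, 'awful': 345, 'terrible': 673,
          'bad': 3197, 'horrible': 659, 'love': 40277, 'amazing': 1305,
          'fantastic': 873, 'great': 42420, 'awesome': 2002}

most_used = max(totals, key=totals.get)
print(most_used, totals[most_used])  # -> great 42420
```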

In [59]:
# It is quite common to use the **majority class classifier** as a baseline (or reference)
# model when evaluating a classifier. It predicts the majority class for every data point.
# At the very least, a model should comfortably beat the majority class classifier;
# otherwise it is (usually) pointless.

num_positive = (train_data['sentiment'] == 1).sum()
num_negative = (train_data['sentiment'] == 0).sum()  # sentiment is stored as 0/1, not +1/-1
print (num_positive)
print (num_negative)
print (num_positive*1.0/len(train_data))


112283
21165
0.841398896949

In [58]:
test_num_positive = (test_data['sentiment'] == 1).sum()
test_num_negative = (test_data['sentiment'] == 0).sum()  # sentiment is stored as 0/1
print (test_num_positive)
print (test_num_negative)
print(len(test_data))
print(test_num_positive*1.0/len(test_data))


27976
5328
33304
0.840019216911
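That last number is the majority-class baseline accuracy on the test set, computed directly from the counts above:

```python
# Majority-class baseline on the test set: always predict positive.
test_positive, test_total = 27976, 33304

baseline_accuracy = test_positive / float(test_total)
print(baseline_accuracy)  # -> ~0.84002
```

selected_words_model (~0.8431) barely beats this baseline, while sentiment_model (~0.9163) beats it by a wide margin.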

In [ ]: