Yelp Reviews Classification

Introduction

This exercise uses a small subset of the data from Kaggle's Yelp Business Rating Prediction competition.

Description of the data:

  • yelp.csv contains the dataset. It is stored in the repository (in the data directory), so there is no need to download anything from the Kaggle website.
  • Each observation (row) in this dataset is a review of a particular business by a particular user.
  • The stars column is the number of stars (1 through 5) assigned by the reviewer to the business. (A higher number of stars is better.) In other words, it is the rating of the business by the person who wrote the review.
  • The text column is the text of the review.

Goal: Predict the star rating of a review using only the review text.

Tip: After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.
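For example, after loading a DataFrame you might run checks like these (a sketch, assuming the DataFrame is named data):

print(data.shape)    # number of rows and columns
print(data.head())   # the first five rows
print(data.dtypes)   # the type of each column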


In [3]:
## Task 1

# Read **`yelp.csv`** into a pandas DataFrame and examine it.

import pandas as pd

# use a raw string so the backslashes in the Windows path are not treated as escape sequences
data = pd.read_csv(r"D:\Machine Learning\pycon-2016-tutorial-master\data\yelp.csv")

# print(data.iloc[0:3, [4, 3]])  # quick peek at the text and stars columns

# keep only the review text and the star rating
df1 = data.loc[:, ['text', 'stars']]
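
A few quick checks to examine what was loaded (a sketch; the shape is confirmed as (10000, 2) in Task 9 below):

print(df1.shape)                 # (10000, 2): one row per review
print(df1.head())                # the first few reviews
print(df1.stars.value_counts())  # how many reviews of each star rating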

In [4]:
## Task 2

# Create a new DataFrame that only contains the **5-star** and **1-star** reviews.

# **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this.

df2 = df1[(df1['stars'] == 5) | (df1['stars'] == 1)]

df2[1:4]  # positional slice: the second through fourth rows of the filtered frame


Out[4]:
                                                text  stars
1  I have no idea why some people give bad review...      5
3  Rosie, Dakota, and I LOVE Chaparral Dog Park!!...      5
4  General Manager Scott Petello is a good egg!!!...      5
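
An equivalent filter uses Series.isin, which reads more cleanly if more classes are ever added (a sketch):

df2_alt = df1[df1.stars.isin([1, 5])]
print(df2_alt.shape)  # (4086, 2): the 3064 + 1022 rows seen after the split in Task 3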

In [6]:
## Task 3

# Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.

# **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.

X = df2.text
y = df2.stars

# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)


(3064,)
(1022,)
(3064,)
(1022,)
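
Because the 5-star class heavily outnumbers the 1-star class, a stratified split is a common variant that preserves the class proportions in both sets (not used above; just a sketch):

X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, random_state=1, stratify=y)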

In [7]:
## Task 4

# Use CountVectorizer to create **document-term matrices** from X_train and X_test.

# instantiate the vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words='english', max_df=0.5, min_df=2)

# learn the vocabulary from the training data and transform it into a document-term matrix
X_train_dtm = vect.fit_transform(X_train)

# transform the testing data (using the fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
In [11]:
## Task 5: Building and evaluating a model

# Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

# **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.

# Background video on the relevant probability distributions:
# https://www.youtube.com/watch?v=os-NaA0ldGs&list=PLBv09BD7ez_6CxkuiFTbL3jsn2Qd1IU7B&index=1

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

# train the model (and time the fit)
%time nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

# calculate the accuracy of the class predictions
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)


Wall time: 16 ms
0.922700587084
Out[11]:
array([[148,  36],
       [ 43, 795]])
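
Treating 1-star as the negative class and 5-star as the positive class, the 2x2 matrix unpacks into its four cells:

tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred_class).ravel()
print(tn, fp, fn, tp)  # 148 36 43 795 -- e.g. 36 true 1-star reviews were predicted as 5-star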

In [12]:
## Task 6 (Challenge)

# Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.

# **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!

# examine the class distribution of the testing set
print(y_test.value_counts())

# the most frequent class is 5-star (838 of the 1022 testing reviews)
print(1 - (184 / (838 + 184)))


5    838
1    184
Name: stars, dtype: int64
0.8199608610567515
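
The same null accuracy can be computed without hard-coding the counts:

print(y_test.value_counts(normalize=True).max())  # 0.8199...: frequency of the majority (5-star) class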

In [16]:
## Task 7 (Challenge)

# false positives: 1-star reviews incorrectly classified as 5-star reviews
X_test[y_test < y_pred_class].head(10)

# inspect one of the misclassified reviews -- the model is reacting to the positive words
X_test[1781]

Out[16]:
"If you like the stuck up Scottsdale vibe this is a good place for you. The food isn't impressive. Nice outdoor seating."

In [22]:
## Task 8 (Challenge)

# Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.

# **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.

# store the vocabulary of X_train (use vect.get_feature_names() on scikit-learn < 1.0)
X_train_tokens = vect.get_feature_names_out()
len(X_train_tokens)

# rows are the classes (1-star, 5-star), columns are the tokens
nb.feature_count_.shape


Out[22]:
(2, 8495)

In [23]:
# store the number of times each token appears in each class
one_star_token_count = nb.feature_count_[0, :]
five_star_token_count = nb.feature_count_[1, :]


In [24]:
# combine the counts into a DataFrame with one row per token
tokens = pd.DataFrame({'token': X_train_tokens, 'one_star': one_star_token_count, 'five_star': five_star_token_count}).set_index('token')

In [28]:
# add 1 to one-star and five-star counts to avoid dividing by 0
tokens['one_star'] = tokens.one_star + 1
tokens['five_star'] = tokens.five_star + 1
tokens.head(5)


Out[28]:
       five_star  one_star
token
00          42.0      29.0
000          8.0       7.0
00am         5.0       6.0
00pm         7.0       4.0
01           5.0       4.0

In [33]:
# first number is the count of one-star reviews in the training set, second is five-star reviews
print(nb.class_count_)

# total (smoothed) token count across all five-star reviews
sum(tokens['five_star'])


[  565.  2499.]
Out[33]:
150133.0

In [34]:
# convert the one-star and five-star counts into frequencies (average occurrences per review in each class)
tokens['one_star'] = tokens.one_star / nb.class_count_[0]
tokens['five_star'] = tokens.five_star / nb.class_count_[1]

In [35]:
# calculate the ratio of five-star to one-star for each token
tokens['five_star_ratio'] = tokens.five_star / tokens.one_star

In [36]:
tokens.sort_values('five_star_ratio', ascending=False).head(10)


Out[36]:
           five_star  one_star  five_star_ratio
token
perfect     0.098840  0.008850        11.168868
fantastic   0.078031  0.007080        11.021909
favorite    0.138856  0.015929         8.717042
amazing     0.186074  0.024779         7.509432
awesome     0.115246  0.021239         5.426170
flavors     0.044818  0.008850         5.064426
yum         0.025610  0.005310         4.823263
great       0.601841  0.127434         4.722778
excellent   0.108443  0.023009         4.713116
love        0.340136  0.072566         4.687241
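
Sorting in ascending order gives the other half of the task, the 10 tokens most predictive of 1-star reviews:

tokens.sort_values('five_star_ratio', ascending=True).head(10)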

Task 9 (Challenge)

Up to this point, we have framed this as a binary classification problem by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a 5-class classification problem.

Here are the steps:

  • Define X and y using the original DataFrame. (y should contain 5 different classes.)
  • Split X and y into training and testing sets.
  • Create document-term matrices using CountVectorizer.
  • Calculate the testing accuracy of a Multinomial Naive Bayes model.
  • Compare the testing accuracy with the null accuracy, and comment on the results.
  • Print the confusion matrix, and comment on the results. (This Stack Overflow answer explains how to read a multi-class confusion matrix.)
  • Print the classification report, and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!

In [40]:
import pandas as pd

# re-read the full dataset (raw string again for the Windows path)
data = pd.read_csv(r"D:\Machine Learning\pycon-2016-tutorial-master\data\yelp.csv")

In [42]:
# Define X and y using the original DataFrame. (y should contain 5 different classes.)

df1 = data.loc[:, ['text', 'stars']]

X = df1.text
y = df1.stars

df1.shape


Out[42]:
(10000, 2)

In [43]:
# Split X and y into training and testing sets.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)


(7500,)
(2500,)
(7500,)
(2500,)

In [44]:
# Create document-term matrices using CountVectorizer.

# instantiate the vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words='english', max_df=0.5, min_df=2)

# learn the vocabulary from the training data and transform it into a document-term matrix
X_train_dtm = vect.fit_transform(X_train)

# transform the testing data (using the fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)

In [45]:
# Calculate the testing accuracy of a Multinomial Naive Bayes model.
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

%time nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)


Wall time: 22 ms

In [46]:
# Calculate the testing accuracy and print the confusion matrix.

from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

# print the confusion matrix (rows are true classes 1-5, columns are predicted classes 1-5)
metrics.confusion_matrix(y_test, y_pred_class)


0.496
Out[46]:
array([[ 84,  34,  32,  24,  11],
       [ 36,  48,  56,  71,  23],
       [ 14,  18,  93, 196,  44],
       [ 17,  13,  46, 571, 237],
       [ 17,   7,  18, 346, 444]])

In [47]:
# calculate the null accuracy: frequency of the most common class (4-star)
y_test.value_counts().head(1) / y_test.shape[0]


Out[47]:
4    0.3536
Name: stars, dtype: float64
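
Comment: the testing accuracy (0.496) comfortably beats the null accuracy (0.354), so the model is learning real signal from the text, but the 5-class problem is much harder than the binary 5-star/1-star version, which reached 0.92 accuracy. The confusion matrix shows that most errors land on adjacent ratings, especially between 4-star and 5-star.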

In [ ]:
# **Precision** answers the question: "When a given class is predicted, how often are those predictions correct?" To calculate the precision for class 1, for example, you divide 84 (the correct class-1 predictions) by the sum of the first column of the confusion matrix above.
# manually calculate the precision for class 1
precision = 84 / float(84 + 36 + 14 + 17 + 17)
print(precision)

In [48]:
# **Recall** answers the question: "When a given class is the true class, how often is that class predicted?" To calculate the recall for class 1, for example, you divide 84 by the sum of the first row of the confusion matrix above.
# manually calculate the recall for class 1
recall = 84 / float(84 + 34 + 32 + 24 + 11)
print(recall)


0.4540540540540541

In [ ]:
# **F1 score** is a weighted average of precision and recall.
# manually calculate the F1 score for class 1
f1 = 2 * (precision * recall) / (precision + recall)
print(f1)
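
The same arithmetic can be computed for every class at once with NumPy (a sketch; cm is the confusion matrix shown in Out[46]):

import numpy as np

cm = metrics.confusion_matrix(y_test, y_pred_class)
print(np.diag(cm) / cm.sum(axis=0))  # per-class precision: correct predictions / column totals
print(np.diag(cm) / cm.sum(axis=1))  # per-class recall: correct predictions / row totals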

In [ ]:
# **Support** answers the question: "How many observations exist for which a given class is the true class?" To calculate the support for class 1, for example, you sum the first row of the confusion matrix.
# manually calculate the support for class 1
support = 84 + 34 + 32 + 24 + 11
print(support)
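
scikit-learn can produce all of these per-class metrics in one table via the classification report:

print(metrics.classification_report(y_test, y_pred_class))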

Classification report comments:

  • Class 1 has low recall, meaning that the model has a hard time detecting the 1-star reviews, but comparatively high precision, meaning that when the model does predict a review is 1-star, it's right about half the time.
  • Classes 2 and 3 have the lowest recall: mid-range reviews use less polarized language and are mostly mistaken for 4-star reviews.
  • Classes 4 and 5 have the highest recall, and class 5 the highest precision, probably because 5-star reviews have polarized language and the model has the most observations to learn from.
