This exercise uses a small subset of the data from Kaggle's Yelp Business Rating Prediction competition.
Description of the data:
yelp.csv contains the dataset. It is stored in the repository (in the data directory), so there is no need to download anything from the Kaggle website.
Goal: Predict the star rating of a review using only the review text.
Tip: After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.
In [3]:
## Task 1
import pandas as pd
# raw string, so the backslashes in the Windows path are not read as escape sequences
data = pd.read_csv(r"D:\Machine Learning\pycon-2016-tutorial-master\data\yelp.csv")
# print(data.iloc[0:3,[4,3]])
df1 = data.loc[:,['text','stars']]
# Read **`yelp.csv`** into a pandas DataFrame and examine it.
In [4]:
## Task 2
# Create a new DataFrame that only contains the **5-star** and **1-star** reviews.
df2 = df1[(df1['stars'] == 5) | (df1['stars'] == 1)]
df2[1:4]
# **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this.
Out[4]:
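A minimal sketch of the two-condition filter from Task 2, next to an equivalent `isin` form. The `reviews` DataFrame and its values are made up for illustration and are not taken from yelp.csv.

```python
import pandas as pd

# Toy stand-in for the review data (hypothetical values).
reviews = pd.DataFrame({
    "text": ["awful", "fine", "great", "ok", "terrible", "superb"],
    "stars": [1, 3, 5, 4, 1, 5],
})

# Two conditions joined by | ; each condition needs its own parentheses.
with_or = reviews[(reviews["stars"] == 5) | (reviews["stars"] == 1)]

# Same rows selected with isin, which is easier to extend to more values.
with_isin = reviews[reviews["stars"].isin([1, 5])]

print(with_isin)
print(with_or.equals(with_isin))
```

Both forms keep the original row index, so later label-based lookups still refer to the rows of the unfiltered data.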
In [6]:
## Task 3
## Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.
X = df2.text
y= df2.stars
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
#- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.
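As a rough check of what `train_test_split` does with its default settings, the sketch below splits 20 made-up reviews. The names `X_toy` and `y_toy` are hypothetical. It assumes the default `test_size` of 0.25 and shows that X and y stay aligned by index.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy Series standing in for review text and star ratings (hypothetical data).
X_toy = pd.Series([f"review {i}" for i in range(20)])
y_toy = pd.Series([1] * 4 + [5] * 16)

# With no test_size given, 25% of the rows go to the testing set.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1)

print(X_tr.shape, X_te.shape)

# The same random_state reproduces the same split on a second call.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, random_state=1)
print(X_te.index.equals(X_te2.index))
```

Because each piece is still a pandas Series, it can be passed directly to CountVectorizer.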
In [7]:
## Task 4
## Use CountVectorizer to create **document-term matrices** from X_train and X_test.
# Instantiate the Vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words='english',max_df=0.5,min_df=2)
X_train_dtm = vect.fit_transform(X_train)
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
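The difference between `fit_transform` and `transform` can be seen on a few made-up sentences. This sketch uses a default CountVectorizer (without the `stop_words`, `max_df`, and `min_df` settings used above) so that the vocabulary is easy to read; the documents are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and testing documents.
train_docs = ["great food great service", "terrible food", "great place"]
test_docs = ["great pizza", "terrible service terrible pizza"]

toy_vect = CountVectorizer()

# fit_transform learns the vocabulary from the training text and counts tokens.
train_dtm = toy_vect.fit_transform(train_docs)

# transform reuses that vocabulary; tokens unseen in training ("pizza") are dropped.
test_dtm = toy_vect.transform(test_docs)

vocab = sorted(toy_vect.vocabulary_)
print(vocab)
print(train_dtm.shape, test_dtm.shape)
print(test_dtm.toarray())
```

Both matrices end up with the same number of columns, which is what allows a model trained on one to predict on the other.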
In [11]:
## Task 5 Building and evaluating a model
## Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.
## **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.
# https://www.youtube.com/watch?v=os-NaA0ldGs&list=PLBv09BD7ez_6CxkuiFTbL3jsn2Qd1IU7B&index=1
# - Link to understand different Gaussian distribution
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
%time nb.fit(X_train_dtm,y_train)
y_pred_class = nb.predict(X_test_dtm)
from sklearn import metrics
print(metrics.accuracy_score(y_test,y_pred_class))
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)
Out[11]:
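To make the layout of the confusion matrix concrete, here is a small sketch on five made-up labels (not the Yelp results). In scikit-learn, rows correspond to the true class and columns to the predicted class, with classes in sorted order.

```python
from sklearn import metrics

# Hypothetical true and predicted star ratings.
y_true_toy = [1, 1, 5, 5, 5]
y_pred_toy = [1, 5, 5, 5, 1]

acc = metrics.accuracy_score(y_true_toy, y_pred_toy)
cm = metrics.confusion_matrix(y_true_toy, y_pred_toy)

print(acc)
# Row 0: true 1-star reviews; row 1: true 5-star reviews.
# Column 0: predicted 1-star; column 1: predicted 5-star.
print(cm)
```

The diagonal holds the correct predictions, so the accuracy equals the sum of the diagonal divided by the total count.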
In [12]:
## Task 6 (Challenge)
## Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.
print(y_test.value_counts())
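# null accuracy = share of the most frequent class (838 of the 1022 testing reviews in this run)
print(y_test.value_counts().max() / len(y_test))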
##- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!
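One way to compute null accuracy without typing the class counts by hand is sketched below. The Series `y_toy_test` is a hypothetical stand-in built from the two counts used in this cell (838 and 184).

```python
import pandas as pd

# Stand-in for y_test: 838 reviews of the majority class and 184 of the other.
y_toy_test = pd.Series([5] * 838 + [1] * 184)

# normalize=True turns the counts into proportions; the largest one is the null accuracy.
null_accuracy = y_toy_test.value_counts(normalize=True).max()
print(null_accuracy)
```

This approach works for any number of classes, so the same line can be reused for the 5-class problem.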
In [16]:
## Task 7 (Challenge)
X_test[y_test < y_pred_class].head(10)
# 1-star reviews incorrectly classified as 5-star reviews
X_test[1781] # Model is reacting to positive words
Out[16]:
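The comparison `y_test < y_pred_class` works because a true rating of 1 with a prediction of 5 is the only way the true label can be smaller. Flipping the comparison selects the opposite error. A minimal sketch on made-up data (the texts, labels, and index values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical review text and true labels sharing the same index.
texts = pd.Series(
    ["bad but great staff", "loved it", "not good", "amazing"],
    index=[10, 11, 12, 13],
)
y_true_toy = pd.Series([1, 5, 1, 5], index=[10, 11, 12, 13])

# Predictions come back from a model as a plain array, in the same row order.
y_pred_toy = np.array([5, 5, 1, 1])

# True 1-star reviews predicted as 5-star.
one_as_five = texts[y_true_toy < y_pred_toy]

# True 5-star reviews predicted as 1-star.
five_as_one = texts[y_true_toy > y_pred_toy]

print(one_as_five)
print(five_as_one)
```

Since the selected rows keep their original index labels, an individual review can then be read in full with a label lookup such as `texts[10]`.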
In [22]:
## Task 8 (Challenge)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_train_tokens = vect.get_feature_names_out()
len(X_train_tokens)
nb.feature_count_.shape
# Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.
# - **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.
Out[22]:
In [23]:
# store the number of times each token appears across each class
one_star_token_count = nb.feature_count_[0, :]
five_star_token_count = nb.feature_count_[1, :]
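Row 0 of `feature_count_` corresponds to one-star reviews because the rows follow the sorted order of `classes_`. The sketch below checks this on a tiny made-up count matrix; `X_counts`, `y_stars`, and `toy_nb` are hypothetical names.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical document-term counts: 3 documents, 2 tokens.
X_counts = np.array([[2, 0],
                     [1, 1],
                     [0, 3]])
y_stars = np.array([1, 1, 5])

toy_nb = MultinomialNB()
toy_nb.fit(X_counts, y_stars)

# classes_ gives the row order used by feature_count_ and class_count_.
print(toy_nb.classes_)
# Token counts summed per class: one row per class, one column per token.
print(toy_nb.feature_count_)
# Number of training documents seen in each class.
print(toy_nb.class_count_)
```

Checking `classes_` first avoids hard-coding an assumption about which row belongs to which star rating.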
In [24]:
tokens = pd.DataFrame({'token':X_train_tokens, 'one_star':one_star_token_count, 'five_star':five_star_token_count}).set_index('token')
In [28]:
# add 1 to one-star and five-star counts to avoid dividing by 0
tokens['one_star'] = tokens.one_star + 1
tokens['five_star'] = tokens.five_star + 1
tokens.head(5)
Out[28]:
In [33]:
# first number is one-star reviews, second number is five-star reviews
print(nb.class_count_)
sum(tokens['five_star'])
Out[33]:
In [34]:
# convert the one-star and five-star counts into frequencies
tokens['one_star'] = tokens.one_star / nb.class_count_[0]
tokens['five_star'] = tokens.five_star / nb.class_count_[1]
In [35]:
# calculate the ratio of five-star to one-star for each token
tokens['five_star_ratio'] = tokens.five_star / tokens.one_star
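The smoothing, normalization, and ratio steps can be traced end to end on three made-up tokens. All counts below are hypothetical. Sorting the ratio in descending order surfaces tokens associated with 5-star reviews, and sorting in ascending order surfaces tokens associated with 1-star reviews.

```python
import numpy as np
import pandas as pd

# Hypothetical per-class token counts and class sizes.
token_names = ["awful", "food", "fantastic"]
feature_count = np.array([[8.0, 10.0, 0.0],    # one-star row
                          [0.0, 40.0, 19.0]])  # five-star row
class_count = np.array([10.0, 40.0])

toy = pd.DataFrame(
    {"one_star": feature_count[0], "five_star": feature_count[1]},
    index=token_names,
)

# Add 1 to every count so that no ratio divides by zero.
toy = toy + 1

# Divide by the number of reviews in each class to correct for class imbalance.
toy["one_star"] = toy["one_star"] / class_count[0]
toy["five_star"] = toy["five_star"] / class_count[1]

toy["five_star_ratio"] = toy["five_star"] / toy["one_star"]

top_five_star = toy.sort_values("five_star_ratio", ascending=False).index[0]
top_one_star = toy.sort_values("five_star_ratio", ascending=True).index[0]
print(toy)
print(top_five_star, top_one_star)
```

A token such as "food" that is common in both classes ends up with a ratio near 1, which is why it ranks at neither extreme.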
In [36]:
tokens.sort_values('five_star_ratio', ascending=False).head(10)
Out[36]:
Up to this point, we have framed this as a binary classification problem by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a 5-class classification problem.
Here are the steps:
In [40]:
import pandas as pd
# raw string, so the backslashes in the Windows path are not read as escape sequences
data = pd.read_csv(r"D:\Machine Learning\pycon-2016-tutorial-master\data\yelp.csv")
In [42]:
# Define X and y using the original DataFrame. (y should contain 5 different classes.)
df1 = data.loc[:,['text','stars']]
df1.shape
Out[42]:
In [43]:
# Split X and y into training and testing sets.
X = df1.text
y= df1.stars
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
In [44]:
# Create document-term matrices using CountVectorizer.
# Instantiate the Vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words='english',max_df=0.5,min_df=2)
X_train_dtm = vect.fit_transform(X_train)
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
In [45]:
# Calculate the testing accuracy of a Multinomial Naive Bayes model.
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
%time nb.fit(X_train_dtm,y_train)
y_pred_class = nb.predict(X_test_dtm)
In [46]:
# Compare the testing accuracy with the null accuracy, and comment on the results.
from sklearn import metrics
print(metrics.accuracy_score(y_test,y_pred_class))
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)
Out[46]:
In [47]:
# calculate the null accuracy
# value_counts() sorts by frequency, so the first entry is the most frequent class
y_test.value_counts().head(1) / len(y_test)
Out[47]:
In [ ]:
# **Precision** answers the question: "When a given class is predicted, how often are those predictions correct?" To calculate the precision for class 1, for example, you divide 55 by the sum of the first column of the confusion matrix.
# manually calculate the precision for class 1
precision = 55 / float(55 + 28 + 5 + 7 + 6)
print(precision)
In [48]:
# **Recall** answers the question: "When a given class is the true class, how often is that class predicted?" To calculate the recall for class 1, for example, you divide 55 by the sum of the first row of the confusion matrix.
# manually calculate the recall for class 1
recall = 55 / float(55 + 14 + 24 + 65 + 27)
print(recall)
In [ ]:
# **F1 score** is the harmonic mean of precision and recall.
# manually calculate the F1 score for class 1
f1 = 2 * (precision * recall) / (precision + recall)
print(f1)
In [ ]:
**Support** answers the question: "How many observations exist for which a given class is the true class?" To calculate the support for class 1, for example, you sum the first row of the confusion matrix.
In [ ]:
# manually calculate the support for class 1
support = 55 + 14 + 24 + 65 + 27
print(support)
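The four class-1 metrics can also be derived together from the first row and first column of the confusion matrix, using the numbers quoted in the cells above. This is a sketch in plain Python; the variable names are hypothetical.

```python
# First row of the confusion matrix: true class 1, split by predicted class.
row_class1 = [55, 14, 24, 65, 27]
# First column of the confusion matrix: predicted class 1, split by true class.
col_class1 = [55, 28, 5, 7, 6]

true_positives = row_class1[0]

# Precision: correct class-1 predictions out of all class-1 predictions.
precision_1 = true_positives / sum(col_class1)

# Recall: correct class-1 predictions out of all true class-1 reviews.
recall_1 = true_positives / sum(row_class1)

# F1: harmonic mean of precision and recall.
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Support: number of true class-1 reviews.
support_1 = sum(row_class1)

print(precision_1, recall_1, f1_1, support_1)
```

Summing the row and column instead of typing each term makes it harder to mix up which axis belongs to precision and which to recall.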
Classification report comments:
In [ ]: