10 Text Representation

This exercise uses a small subset of the data from Kaggle's Yelp Business Rating Prediction competition. It is stored in the local yelp.csv file.

Description of the data:

  • yelp.csv contains the dataset. It is stored in the repository (in the data directory), so there is no need to download anything from the Kaggle website.
  • Each observation (row) in this dataset is a review of a particular business by a particular user.
  • The stars column is the number of stars (1 through 5) that the reviewer assigned to the business; more stars means a better rating.
  • The text column is the text of the review.

Goal: Predict the star rating of a review using only the review text.

First, we read yelp.csv into a pandas DataFrame and examine it.


In [1]:
import pandas as pd
# path is relative to the notebook; adjust it if yelp.csv is stored elsewhere in your checkout
path = 'material/yelp.csv'
yelp = pd.read_csv(path)

In [ ]:
# examine the shape
yelp.shape

In [ ]:
# examine the first row
yelp.head(1)

In [ ]:
# examine the last three rows
yelp.tail(3)

In [ ]:
# select only the 5-star reviews
yelp[yelp['stars'] == 5]

In [ ]:
# All columns
yelp.columns

In [ ]:
# The first sample
yelp.iloc[0]

In [ ]:
# examine the class distribution
yelp.stars.value_counts().sort_index()

Task 1

Create a new DataFrame that only contains the 5-star and 1-star reviews.
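
As a hint, the filtering pattern can be sketched on a toy DataFrame. The rows below are made up for illustration and do not come from yelp.csv; the names toy and toy_best_worst are hypothetical. The same idea should carry over to the yelp DataFrame.

```python
import pandas as pd

# Toy stand-in for the Yelp data (hypothetical rows, not taken from yelp.csv)
toy = pd.DataFrame({
    'stars': [5, 1, 3, 4, 5, 1, 2],
    'text': ['great', 'awful', 'okay', 'good', 'loved it', 'never again', 'meh'],
})

# Keep only the rows whose star rating is 1 or 5
toy_best_worst = toy[toy['stars'].isin([1, 5])]

print(toy_best_worst.shape)
print(toy_best_worst['stars'].value_counts().sort_index())
```

An equivalent filter combines two comparisons with the | operator: toy[(toy['stars'] == 5) | (toy['stars'] == 1)].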


In [ ]:

Task 2

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the review text as the only feature and the star rating as the response.
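
A minimal sketch of this step on made-up data follows. The toy rows, the test_size of 0.25, and the random_state of 1 are arbitrary choices for illustration, not values prescribed by the exercise. Note that X is a one-dimensional Series of strings, which is the form CountVectorizer expects later on.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the filtered reviews (hypothetical rows)
toy = pd.DataFrame({
    'stars': [5, 1, 5, 1, 5, 1, 5, 1],
    'text': [
        'great food', 'terrible service', 'loved it', 'never again',
        'friendly staff', 'rude staff', 'amazing place', 'awful food',
    ],
})

# Feature: the review text (1-D Series); response: the star rating
X = toy['text']
y = toy['stars']

# Hold out a quarter of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

print(X_train.shape, X_test.shape)
```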


In [ ]:

Task 3

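Use CountVectorizer to create document-term matrices from X_train and X_test. Think carefully about how each set is used: fit the vectorizer on X_train only, then use the fitted vectorizer to transform both X_train and X_test. You must not use X_test during fitting!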
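
The fit/transform distinction can be illustrated with a few made-up documents (train_docs and test_docs are hypothetical stand-ins for X_train and X_test). The vocabulary is learned from the training documents only, so a token that appears only in the test documents is silently ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents standing in for X_train and X_test
train_docs = ['great food great service', 'terrible food', 'great place']
test_docs = ['terrible service', 'amazing food']

vect = CountVectorizer()

# Learn the vocabulary from the training documents and build their matrix
X_train_dtm = vect.fit_transform(train_docs)

# Reuse the learned vocabulary for the test documents (no fitting here)
X_test_dtm = vect.transform(test_docs)

# Both matrices have one column per training-vocabulary token
print(X_train_dtm.shape, X_test_dtm.shape)

# 'amazing' never occurred in the training documents, so it has no column
print('amazing' in vect.vocabulary_)
```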


In [ ]:

Task 4

Use multinomial Naive Bayes (from sklearn.naive_bayes import MultinomialNB) to predict the star rating for the reviews in the testing set, and then calculate the accuracy and print the confusion matrix. Note: you can use the simple confusion matrix from sklearn.metrics (confusion_matrix).
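
The overall workflow might look like the following sketch, again on made-up documents rather than the Yelp reviews. All variable names and the toy labels are assumptions for illustration; with so few documents the resulting numbers say nothing about performance on the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training and testing data
train_docs = [
    'great food great service',
    'terrible food rude staff',
    'great place friendly staff',
    'terrible service',
]
y_train = [5, 1, 5, 1]
test_docs = ['great friendly place', 'rude terrible staff']
y_test = [5, 1]

# Build document-term matrices (fit on the training documents only)
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(train_docs)
X_test_dtm = vect.transform(test_docs)

# Train the classifier and predict the class of each test document
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred = nb.predict(X_test_dtm)

# Accuracy: fraction of test documents classified correctly
acc = accuracy_score(y_test, y_pred)
print(acc)

# Confusion matrix: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
```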


In [ ]:

Task 5

Calculate which 10 tokens are the most predictive of 5-star reviews, and which 10 tokens are the most predictive of 1-star reviews.

  • Hint: Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the feature_count_ and class_count_ attributes of the Naive Bayes model object.

In [ ]: