Linear regression homework with Yelp votes

Introduction

This assignment uses a small subset of the data from Kaggle's Yelp Business Rating Prediction competition.

Description of the data:

  • yelp.json is the original file; yelp.csv contains the same data in a more convenient format. Both files are in this repo, so there is no need to download the data from the Kaggle website.
  • Each observation in this dataset is a review of a particular business by a particular user.
  • The "stars" column is the number of stars (1 through 5) assigned by the reviewer to the business. (Higher is better.) In other words, it is the rating of the business by the person who wrote the review.
  • The "cool" column is the number of "cool" votes this review received from other Yelp users. All reviews start with 0 "cool" votes, and there is no limit to how many "cool" votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business.
  • The "useful" and "funny" columns are similar to the "cool" column.

Task 1

Read yelp.csv into a DataFrame.


In [1]:
# access yelp.csv using a relative path
import pandas as pd
yelp = pd.read_csv('../data/yelp.csv')
yelp.head(1)


Out[1]:
business_id date review_id stars text type user_id cool useful funny
0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 My wife took me here on my birthday for breakf... review rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0

Task 1 (Bonus)

Ignore the yelp.csv file, and construct this DataFrame yourself from yelp.json. This involves reading the data into Python, decoding the JSON, converting it to a DataFrame, and adding individual columns for each of the vote types.


In [2]:
# read the data from yelp.json into a list of rows
# each row is decoded into a dictionary using json.loads()
import json
with open('../data/yelp.json', 'r') as f:
    data = [json.loads(row) for row in f]

In [3]:
# show the first review
data[0]


Out[3]:
{u'business_id': u'9yKzy9PApeiPPOUJEtnvkg',
 u'date': u'2011-01-26',
 u'review_id': u'fWKvX83p0-ka4JS3dc6E5A',
 u'stars': 5,
 u'text': u'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.  I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!',
 u'type': u'review',
 u'user_id': u'rLtl8ZkDX5vH5nAx9C3q5Q',
 u'votes': {u'cool': 2, u'funny': 0, u'useful': 5}}

In [4]:
# convert the list of dictionaries to a DataFrame
ydata = pd.DataFrame(data)
ydata.head(2)


Out[4]:
business_id date review_id stars text type user_id votes
0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 My wife took me here on my birthday for breakf... review rLtl8ZkDX5vH5nAx9C3q5Q {u'funny': 0, u'useful': 5, u'cool': 2}
1 ZRJwVLyzEJq1VAihDhYiow 2011-07-27 IjZ33sJrzXqU-0X6U8NwyA 5 I have no idea why some people give bad review... review 0a2KyEL0d3Yb1V6aivbIuQ {u'funny': 0, u'useful': 0, u'cool': 0}

In [5]:
# add DataFrame columns for cool, useful, and funny
votes = pd.DataFrame.from_records(ydata.votes)
ydata = pd.concat([ydata, votes], axis=1)
ydata.head(2)


Out[5]:
business_id date review_id stars text type user_id votes cool funny useful
0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 My wife took me here on my birthday for breakf... review rLtl8ZkDX5vH5nAx9C3q5Q {u'funny': 0, u'useful': 5, u'cool': 2} 2 0 5
1 ZRJwVLyzEJq1VAihDhYiow 2011-07-27 IjZ33sJrzXqU-0X6U8NwyA 5 I have no idea why some people give bad review... review 0a2KyEL0d3Yb1V6aivbIuQ {u'funny': 0, u'useful': 0, u'cool': 0} 0 0 0

In [6]:
# drop the votes column and then display the head
ydata.drop("votes", axis=1, inplace=True)
ydata.head(2)


Out[6]:
business_id date review_id stars text type user_id cool funny useful
0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 My wife took me here on my birthday for breakf... review rLtl8ZkDX5vH5nAx9C3q5Q 2 0 5
1 ZRJwVLyzEJq1VAihDhYiow 2011-07-27 IjZ33sJrzXqU-0X6U8NwyA 5 I have no idea why some people give bad review... review 0a2KyEL0d3Yb1V6aivbIuQ 0 0 0
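The expand-then-drop pattern above can be sketched end-to-end on a toy dict column (hypothetical review ids and vote counts, not the Yelp data), assuming every row's votes dict has the same keys:

```python
import pandas as pd

toy = pd.DataFrame({'review_id': ['r1', 'r2'],
                    'votes': [{'cool': 2, 'funny': 0, 'useful': 5},
                              {'cool': 0, 'funny': 0, 'useful': 0}]})

# expand the dict column into one column per key, then drop the original
expanded = pd.concat([toy.drop('votes', axis=1),
                      pd.DataFrame(toy['votes'].tolist())], axis=1)
print(list(expanded.columns))  # ['review_id', 'cool', 'funny', 'useful']
```

`pd.DataFrame` on a list of dicts builds one column per key, which is the same effect as `from_records` on the votes Series.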

Task 2

Explore the relationship between each of the vote types (cool/useful/funny) and the number of stars.


In [7]:
# treat stars as a categorical variable: compare the mean of each vote type across the star groups
ydata.groupby('stars')[['cool', 'funny', 'useful']].mean().T


Out[7]:
stars 1 2 3 4 5
cool 0.576769 0.719525 0.788501 0.954623 0.944261
funny 1.056075 0.875944 0.694730 0.670448 0.608631
useful 1.604806 1.563107 1.306639 1.395916 1.381780

In [8]:
# display a correlation matrix of the vote types (cool/useful/funny) and stars
%matplotlib inline
import seaborn as sns
sns.heatmap(yelp.corr())


Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ac5d8d0>

In [9]:
# display multiple scatter plots (cool, useful, funny) with linear regression line
feat_cols = ['cool', 'useful', 'funny']
sns.pairplot(ydata, x_vars=feat_cols, y_vars='stars', kind='reg', size=5)


Out[9]:
<seaborn.axisgrid.PairGrid at 0x11db6c150>

Task 3

Define cool/useful/funny as the feature matrix X, and stars as the response vector y.


In [10]:
X = ydata[['cool', 'useful', 'funny']]
y = ydata['stars']

Task 4

Fit a linear regression model and interpret the coefficients. Do the coefficients make intuitive sense to you? Explore the Yelp website to see if you detect similar trends.


In [11]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X, y)

# print the coefficients
print lr.intercept_
print lr.coef_
zip(X, lr.coef_)


3.83989479278
[ 0.27435947 -0.14745239 -0.13567449]
Out[11]:
[(u'cool', 0.27435946858853977),
 (u'useful', -0.14745239099401466),
 (u'funny', -0.13567449053706701)]
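To see what these coefficients mean, here is a minimal sketch on synthetic data where the true relationship is known (a hypothetical toy, not the Yelp data): with a noiseless linear response, LinearRegression recovers the intercept and the per-feature slopes, each slope being the expected change in y for a one-unit increase in that feature, holding the others fixed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data with a known relationship: y = 3 + 2*x1 - 1*x2, no noise
rng = np.random.RandomState(0)
X_toy = rng.rand(100, 2)
y_toy = 3 + 2 * X_toy[:, 0] - 1 * X_toy[:, 1]

toy = LinearRegression().fit(X_toy, y_toy)
print(toy.intercept_)  # ~3.0
print(toy.coef_)       # ~[2.0, -1.0]
```

By the same reading, the fitted Yelp model says each additional "cool" vote is associated with about 0.27 more stars, while "useful" and "funny" votes are associated with slightly fewer stars.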

Task 5

Evaluate the model by splitting it into training and testing sets and computing the RMSE. Does the RMSE make intuitive sense to you?


In [12]:
from sklearn.cross_validation import train_test_split
from sklearn import metrics
import numpy as np

In [13]:
# define a function that accepts a list of features and returns the testing RMSE
def train_test_rmse(feat_cols):
    X = ydata[feat_cols]
    y = ydata.stars
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))


In [14]:
# calculate RMSE with all three features
print train_test_rmse(['cool', 'funny', 'useful'])


1.17336862742
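As a sanity check on what this number measures, here is a minimal sketch (synthetic ratings, not the Yelp split) computing RMSE by hand and via sklearn.metrics; the two agree, and the result is in the same units as the response (stars):

```python
import numpy as np
from sklearn import metrics

y_true = np.array([3, 5, 4, 2])
y_pred = np.array([3.5, 4.0, 4.0, 3.0])

# RMSE = sqrt(mean((y_true - y_pred)^2))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sklearn = np.sqrt(metrics.mean_squared_error(y_true, y_pred))
print(rmse_manual)  # 0.75, identical to rmse_sklearn
```

So an RMSE of about 1.17 means predictions are typically off by a bit over one star, which is plausible given how weakly the vote counts relate to the rating.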

Task 6

Try removing some of the features and see if the RMSE improves.


In [15]:
print train_test_rmse(['cool', 'funny', 'useful'])
print train_test_rmse(['cool', 'funny'])
print train_test_rmse(['cool'])

### RMSE is lowest (best) with all three features


1.17336862742
1.1851949299
1.20049049928

Task 7 (Bonus)

Think of some new features you could create from the existing data that might be predictive of the response. Figure out how to create those features in Pandas, add them to your model, and see if the RMSE improves.


In [19]:
# new feature: number of reviews per business_id
# (a business with more reviews may be more popular, which could correlate with stars)
ydata['review_freq'] = ydata.groupby('business_id')['stars'].transform('count')
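The key point is that transform('count') broadcasts each group's size back to every row, unlike .count(), which returns one value per group. A toy sketch with hypothetical business ids:

```python
import pandas as pd

toy = pd.DataFrame({'business_id': ['a', 'a', 'b'],
                    'stars':       [5,   3,   4]})

# each row receives the review count of its own business
toy['review_freq'] = toy.groupby('business_id')['stars'].transform('count')
print(toy['review_freq'].tolist())  # [2, 2, 1]
```

This row-aligned output is what lets the result be assigned directly as a new DataFrame column.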

In [20]:
# new feature: binary "favored" flag, 1 if the business has more than 3 reviews, else 0
ydata['favored'] = (ydata.review_freq > 3).astype(int)

In [23]:
# add new features to the model and calculate RMSE
print train_test_rmse(['cool', 'funny', 'useful','review_freq'])


1.16823194233

Task 8 (Bonus)

Compare your best RMSE on the testing set with the RMSE for the "null model", which is the model that ignores all features and simply predicts the mean response value in the testing set.


In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

# create a NumPy array with the same shape as y_test
y_null = np.zeros_like(y_test, dtype=float)

# fill the array with the mean value of y_test
y_null.fill(y_test.mean())
y_null


Out[24]:
array([ 3.7808,  3.7808,  3.7808, ...,  3.7808,  3.7808,  3.7808])

In [25]:
np.sqrt(metrics.mean_squared_error(y_test, y_null))


Out[25]:
1.2019781029619465

The null model performs worse than the slightly improved model with the feature added in Task 7.
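This null-model RMSE has a closed form: predicting the test-set mean for every row yields an RMSE equal to the population standard deviation of y_test, which is why it serves as the baseline any useful model must beat. A minimal sketch with synthetic numbers:

```python
import numpy as np

y_test = np.array([1, 2, 3, 4, 5], dtype=float)
y_null = np.full_like(y_test, y_test.mean())

rmse_null = np.sqrt(np.mean((y_test - y_null) ** 2))
print(rmse_null)  # equals y_test.std(ddof=0), here sqrt(2)
```

The fitted models above (RMSE ~1.17) beat the null model's 1.20 only slightly, consistent with vote counts being weak predictors of stars.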