This assignment uses a small subset of the data from Kaggle's Yelp Business Rating Prediction competition.
Description of the data:
yelp.json is the original format of the file. yelp.csv contains the same data in a more convenient format. Both files are in this repo, so there is no need to download the data from the Kaggle website.
In [1]:
# access yelp.csv using a relative path
import pandas as pd
yelp = pd.read_csv('../data/yelp.csv')
yelp.head(1)
Out[1]:
In [2]:
# read the data from yelp.json into a list of rows
# each row is decoded into a dictionary using json.loads(); the dictionaries are collected in a list named "data"
import json
with open('../data/yelp.json', 'r') as f:
    data = [json.loads(row) for row in f]
In [3]:
# show the first review
data[0]
Out[3]:
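The line-by-line decoding above can be tried without the data file. The following is a minimal sketch, assuming each line of the file holds one complete JSON object. The two records and their values are invented for illustration and are not taken from yelp.json.

```python
import io
import json

# two made-up records, one JSON object per line (hypothetical values)
fake_file = io.StringIO(
    '{"stars": 5, "votes": {"cool": 2, "funny": 0, "useful": 5}}\n'
    '{"stars": 3, "votes": {"cool": 0, "funny": 1, "useful": 1}}\n'
)

# iterating over a file-like object yields one line at a time
rows = [json.loads(line) for line in fake_file]

print(len(rows))
print(rows[0]["votes"]["useful"])
```

Each element of rows is an ordinary dictionary, so nested fields such as votes are still dictionaries at this point.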
In [4]:
# convert the list of dictionaries to a DataFrame
ydata = pd.DataFrame(data)
ydata.head(2)
Out[4]:
In [5]:
# expand the votes dictionaries into separate DataFrame columns for cool, useful, and funny
votes = pd.DataFrame(ydata.votes.tolist())
ydata = pd.concat([ydata, votes], axis=1)
ydata.head(2)
Out[5]:
In [6]:
# drop the votes column and then display the head
ydata.drop("votes", axis=1, inplace=True)
ydata.head(2)
Out[6]:
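The expand-and-drop steps above can be checked on a tiny frame. This is a sketch under the assumption that every row's votes entry is a dictionary with the same keys. The frame toy and its values are hypothetical.

```python
import pandas as pd

# hypothetical frame with a column of dictionaries, mimicking the votes column
toy = pd.DataFrame({
    'stars': [5, 3],
    'votes': [
        {'cool': 2, 'funny': 0, 'useful': 5},
        {'cool': 0, 'funny': 1, 'useful': 1},
    ],
})

# build one column per dictionary key, aligned on the original index
expanded = pd.DataFrame(toy['votes'].tolist(), index=toy.index)

# attach the new columns and drop the original dictionary column
flat = pd.concat([toy.drop(columns='votes'), expanded], axis=1)
print(flat)
```

Passing index=toy.index keeps the rows aligned in pd.concat even when the original frame does not use a default integer index.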
In [7]:
# treat stars as a categorical variable and compare the mean vote counts across the star groups
ydata.groupby('stars')[['cool', 'funny', 'useful']].mean().T
Out[7]:
In [8]:
# display a correlation matrix of the vote types (cool/useful/funny) and stars
%matplotlib inline
import seaborn as sns
sns.heatmap(ydata[['stars', 'cool', 'useful', 'funny']].corr())
Out[8]:
In [9]:
# display scatter plots of stars against each vote type (cool, useful, funny), each with a linear regression line
feat_cols = ['cool', 'useful', 'funny']
sns.pairplot(ydata, x_vars=feat_cols, y_vars='stars', kind='reg', height=5)
Out[9]:
In [10]:
# define the feature matrix X and the response vector y
X = ydata[['cool', 'useful', 'funny']]
y = ydata['stars']
In [11]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X, y)
# print the intercept and the coefficients
print(lr.intercept_)
print(lr.coef_)
# pair each feature name with its coefficient
list(zip(X.columns, lr.coef_))
Out[11]:
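To see how the intercept and coefficients map back to feature names, it can help to fit data whose true relationship is known. The sketch below assumes a hypothetical frame toy whose target follows y = 1 + 2*a - 3*b exactly. None of these names or values come from the Yelp data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical data that follows y = 1 + 2*a - 3*b exactly
toy = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [1, 0, 2, 1, 3]})
target = 1 + 2 * toy['a'] - 3 * toy['b']

model = LinearRegression()
model.fit(toy[['a', 'b']], target)

# pair each column name with its fitted coefficient
coef_by_feature = dict(zip(toy.columns, model.coef_))
print(model.intercept_)
print(coef_by_feature)
```

Because the toy target has no noise, the fitted intercept and coefficients should recover 1, 2, and -3 up to floating-point error.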
In [12]:
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np
In [13]:
# define a function that accepts a list of features and returns testing RMSE
def train_test_rmse(feat_cols):
    X = ydata[feat_cols]
    y = ydata.stars
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))

# show the split of the global X and y defined earlier
train_test_split(X, y, random_state=123)
Out[13]:
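The function above relies on two properties of train_test_split: the default split proportions and the repeatability given by random_state. The following sketch illustrates both on made-up arrays. X_toy and y_toy are hypothetical and unrelated to the Yelp data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical data: 20 rows, 2 features
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# with no test_size given, scikit-learn holds out 25% of the rows
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=123)
print(X_tr.shape, X_te.shape)

# the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, random_state=123)
print(np.array_equal(y_te, y_te2))
```

Fixing random_state means that different feature lists passed to train_test_rmse are scored on the same test rows, so their RMSE values are comparable.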
In [14]:
# calculate RMSE with all three features
print(train_test_rmse(['cool', 'funny', 'useful']))
In [15]:
print(train_test_rmse(['cool', 'funny', 'useful']))
print(train_test_rmse(['cool', 'funny']))
print(train_test_rmse(['cool']))
# RMSE is lowest (best) with all three features
In [19]:
# new feature: number of reviews per business_id. Do businesses with more reviews tend to be rated more favorably?
# count the occurrences of each business_id and store the count on every row
ydata['review_freq'] = ydata.groupby(['business_id'])['stars'].transform('count')
In [20]:
# new feature: favored
# 1 if the business has at least 4 reviews, otherwise 0
ydata["favored"] = [1 if x > 3 else 0 for x in ydata.review_freq]
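The two engineered features above can be traced on a small example. This is a sketch with a hypothetical frame toy. The business IDs and star values are invented, and the threshold of more than 3 reviews mirrors the one used above.

```python
import pandas as pd

# hypothetical reviews: business 'a' has 4 reviews, 'b' has 2, 'c' has 1
toy = pd.DataFrame({
    'business_id': ['a', 'a', 'a', 'a', 'b', 'b', 'c'],
    'stars': [5, 4, 4, 3, 2, 5, 1],
})

# transform('count') returns the group size on every row of that group
toy['review_freq'] = toy.groupby('business_id')['stars'].transform('count')

# flag businesses with at least 4 reviews
toy['favored'] = (toy['review_freq'] > 3).astype(int)
print(toy)
```

Unlike an aggregation, transform keeps the original row count, which is why the result can be assigned directly as a new column.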
In [23]:
# add the new review_freq feature to the model and calculate RMSE
print(train_test_rmse(['cool', 'funny', 'useful', 'review_freq']))
In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
# create a NumPy array with the same shape as y_test
y_null = np.zeros_like(y_test, dtype=float)
# fill the array with the mean value of y_test
y_null.fill(y_test.mean())
y_null
Out[24]:
In [25]:
# calculate the RMSE of the null model
np.sqrt(metrics.mean_squared_error(y_test, y_null))
Out[25]:
The null model performs worse (higher RMSE) than the slightly improved model with the added features from task 7.
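One way to read the null-model number: when the constant prediction is the mean of the same values it is scored against, as it is above with y_test.mean(), the RMSE equals the population standard deviation of those values. A minimal sketch with made-up ratings (y_true is hypothetical):

```python
import numpy as np

# hypothetical response values
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# null model: predict the mean for every observation
y_null = np.full_like(y_true, y_true.mean())

# RMSE computed directly from its definition
null_rmse = np.sqrt(np.mean((y_true - y_null) ** 2))

print(null_rmse)
print(np.std(y_true))  # population standard deviation (ddof=0)
```

Any model worth keeping should score below this baseline on the same test set.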