We are going to analyze the businesses contained in the Yelp Phoenix Academic Dataset to see if we can find relationships between any of the variables and the rating each business has.
The first thing we are going to do is preprocess the data and leave it ready for the analysis. This preprocessing is divided into two steps: extraction and transformation.
Since we have all the data in the same JSON file, extraction is a rather trivial step that just consists of loading the data file. We can see how it is done in the next three lines of code:
In [ ]:
import json
business_file_path = 'yelp_academic_dataset_business.json'
records = [json.loads(line) for line in open(business_file_path)]
In [2]:
def drop_fields(fields, dictionary_list):
    """
    Removes the specified fields from every dictionary in the list
    :rtype : void
    :param fields: a list of strings with the names of the fields that
    are going to be removed from every dictionary in the list
    :param dictionary_list: a list of dictionaries
    """
    for record in dictionary_list:
        for field in fields:
            del record[field]
def add_transpose_list_column(field, dictionary_list):
    """
    Takes a list of dictionaries and adds to every dictionary a new field
    for each value contained in the specified list field across all the
    dictionaries, leaving 1 for the values that are present in the
    dictionary and 0 for the values that are not. It can be seen as
    transposing the dictionary matrix.
    :param field: the field which is going to be transposed
    :param dictionary_list: a list of dictionaries
    :return: the modified list of dictionaries
    """
    values_set = set()
    for dictionary in dictionary_list:
        values_set |= set(dictionary[field])
    for dictionary in dictionary_list:
        for value in values_set:
            dictionary[value] = 1 if value in dictionary[field] else 0
    return dictionary_list
def add_transpose_single_column(field, dictionary_list):
    """
    Takes a list of dictionaries and adds to every dictionary a new field
    for each distinct value of the specified scalar field across all the
    dictionaries, leaving 1 for the value present in the dictionary and 0
    for the values that are not. It can be seen as transposing the
    dictionary matrix.
    :param field: the field which is going to be transposed
    :param dictionary_list: a list of dictionaries
    :return: the modified list of dictionaries
    """
    values_set = set()
    for dictionary in dictionary_list:
        values_set.add(dictionary[field])
    for dictionary in dictionary_list:
        for value in values_set:
            # Use equality rather than 'in': for string fields 'in' would
            # also match substrings of the field's value.
            dictionary[value] = 1 if value == dictionary[field] else 0
    return dictionary_list
def drop_unwanted_fields(dictionary_list):
    """
    Drops fields that are not useful for data analysis in the business
    data set
    :rtype : void
    :param dictionary_list: the list of dictionaries containing the data
    """
    unwanted_fields = [
        'attributes',
        'business_id',
        'categories',
        'city',
        'full_address',
        'hours',
        'name',
        'neighborhoods',
        'open',
        'state',
        'type'
    ]
    drop_fields(unwanted_fields, dictionary_list)
Finally, with these auxiliary functions we create another one that is in charge of loading the data and transforming it. Since we are going to perform a linear regression to analyze the data, all of the values must be numeric. But there are cases in which we cannot translate a qualitative value directly into a quantitative one, such as the city of the business.
For these cases we simply transpose the matrix and add each possible city as a column. Then, if the business belongs to that city, we put a 1 in that cell; if it doesn't, we put a 0.
As you can see, with the help of our auxiliary functions, extracting and transforming the data is very straightforward.
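To make the transposition concrete, here is a minimal sketch of the same idea on made-up toy records (not actual Yelp data): each distinct category across all records becomes a binary column.

```python
# Toy illustration of the 'categories' transposition (made-up records).
records = [
    {'name': 'A', 'categories': ['Pizza', 'Bars']},
    {'name': 'B', 'categories': ['Pizza']},
]

# Collect every category that appears in any record.
all_categories = set()
for record in records:
    all_categories |= set(record['categories'])

# Add one binary column per category to every record.
for record in records:
    for category in all_categories:
        record[category] = 1 if category in record['categories'] else 0

print(records[0]['Pizza'], records[0]['Bars'])  # 1 1
print(records[1]['Pizza'], records[1]['Bars'])  # 1 0
```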
In [3]:
def load_file(file_path):
    """
    Loads the Yelp Phoenix Academic Data Set file for business data, and
    transforms it so it can be analyzed
    :type file_path: str
    :param file_path: the path for the file that contains the businesses
    data
    :return: a list of dictionaries with the preprocessed data
    """
    records = [json.loads(line) for line in open(file_path)]
    records = add_transpose_list_column('categories', records)
    records = add_transpose_single_column('city', records)
    drop_unwanted_fields(records)
    return records
Our records now have the shape of a numeric matrix in which most of the values are binary, due to the inclusion of a column for each category and each city. It's a very wide matrix with a total of 663 columns. The first record looks like this:
In [4]:
records = load_file(business_file_path)
print(records[0])
len(records[0])
Out[4]:
Now that we have our data in the way we want it, it's time to analyze it. For this, we are going to start with a simple linear regression using the number of reviews ('review_count') as the independent variable and the rating ('stars') as the dependent variable.
Before starting with the linear regression, we are going to plot the data to see if there appears to be a correlation between the two variables at first sight.
In [5]:
import matplotlib.pyplot as plt
records = load_file(business_file_path)
x = [record['review_count'] for record in records]
y = [record['stars'] for record in records]
plt.scatter(x, y)
Out[5]:
In [ ]:
import numpy as np
from sklearn import linear_model

records = load_file(business_file_path)
ratings = np.array([record['stars'] for record in records])
data = np.array([[record['review_count']] for record in records])

# Use the first 80% of the records for training and the rest for testing
num_training_records = int(len(ratings) * 0.8)
training_data = data[:num_training_records]
testing_data = data[num_training_records:]
training_ratings = ratings[:num_training_records]
testing_ratings = ratings[num_training_records:]

# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(training_data, training_ratings)

# The coefficients
slope = regr.coef_[0]
intercept = regr.intercept_
print('Slope: \n', slope)
print('Intercept: \n', intercept)

# The root mean square error on the testing set
rmse = np.sqrt(np.mean((regr.predict(testing_data) - testing_ratings) ** 2))
print("RMSE: %.2f" % rmse)

plt.scatter(testing_data, testing_ratings, color='black')
plt.plot(testing_data, regr.predict(testing_data), color='blue',
         linewidth=3)
plt.xticks(())
plt.yticks(())
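As a quick sanity check on the metric itself: the root mean square error is just the square root of the mean squared difference between predictions and true values. A minimal sketch on toy numbers (illustrative values, not Yelp data):

```python
import numpy as np

# Hand-computed RMSE on toy predictions (illustrative values only).
predicted = np.array([3.5, 4.0, 2.5])
actual = np.array([3.0, 4.5, 2.5])

# Errors are 0.5, -0.5 and 0; their squares average to 0.5 / 3.
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(rmse)  # ~0.408
```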
In [7]:
from sklearn.cross_validation import KFold

def multiple_lineal_regression():
    records = load_file(business_file_path)
    ratings = np.array([record['stars'] for record in records])
    drop_fields(['stars'], records)
    data = np.array([record.values() for record in records])

    # Create linear regression object and train it using all the data
    model = linear_model.LinearRegression(fit_intercept=True)
    model.fit(data, ratings)

    # The root mean square error on the training data
    p = model.predict(data)
    e = p - ratings
    total_error = np.dot(e, e)
    rmse_train = np.sqrt(total_error / len(p))

    # The root mean square error using 10-fold cross-validation
    kf = KFold(len(data), n_folds=10)
    err = 0
    for train, test in kf:
        model.fit(data[train], ratings[train])
        p = model.predict(data[test])
        e = p - ratings[test]
        err += np.dot(e, e)
    rmse_10cv = np.sqrt(err / len(data))

    print('RMSE on training: {}'.format(rmse_train))
    print('RMSE on 10-fold CV: {}'.format(rmse_10cv))
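Note that sklearn.cross_validation is the old module name; in newer scikit-learn versions the same 10-fold procedure uses sklearn.model_selection. A minimal sketch of the equivalent loop, run here on synthetic stand-in data since the Yelp records are not loaded:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (not the Yelp records): a noisy linear target.
rng = np.random.RandomState(0)
data = rng.rand(100, 3)
ratings = data @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# Accumulate squared errors across 10 folds, then take the root mean.
model = LinearRegression()
kf = KFold(n_splits=10)
err = 0.0
for train, test in kf.split(data):
    model.fit(data[train], ratings[train])
    e = model.predict(data[test]) - ratings[test]
    err += np.dot(e, e)
rmse_10cv = np.sqrt(err / len(data))
print(rmse_10cv)  # close to the injected noise level of 0.1
```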
But before we use this function we are going to modify the load_file function to see how the RMSE changes when we add more variables.
In [8]:
def load_file(file_path):
    """
    Loads the Yelp Phoenix Academic Data Set file for business data, and
    transforms it so it can be analyzed
    :type file_path: str
    :param file_path: the path for the file that contains the businesses
    data
    :return: a list of dictionaries with the preprocessed data
    """
    records = [json.loads(line) for line in open(file_path)]
    records = add_transpose_list_column('categories', records)
    records = add_transpose_single_column('city', records)
    drop_unwanted_fields(records)
    return records
As we can see above, we are also adding the categories and cities as matrix columns, along with the initial review_count, longitude and latitude fields. The stars field is our dependent variable.
Now we are ready to execute our multiple linear regression function to see how much the accuracy improves when we include several variables.
In [10]:
multiple_lineal_regression()
We can see that the RMSE has decreased from 0.91 to 0.829488852978.