San Diego Burrito Analytics: Linear models

Scott Cole

21 May 2016

This notebook attempts to predict the overall rating of a burrito as a linear combination of its dimensions. Interpretation of these models is complicated by the significant correlations between dimensions (such as meat quality and non-meat filling quality).

Imports


In [1]:
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

import seaborn as sns
sns.set_style("white")

Load data


In [2]:
import util
df = util.load_burritos()
N = df.shape[0]

Linear model 1: Predict overall rating from the individual dimensions


In [26]:
# Define predictors of the model
m_lm = ['Tortilla','Temp','Meat','Fillings','Meat:filling',
               'Uniformity','Salsa','Wrap']

# Remove incomplete data 
dffull = df[np.hstack((m_lm,'overall'))].dropna()
X = sm.add_constant(dffull[m_lm])
y = dffull['overall']
my_glm = sm.GLM(y,X)
res = my_glm.fit()
print(res.summary())
print(1 - np.var(res.resid_pearson) / np.var(y))


                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                overall   No. Observations:                  263
Model:                            GLM   Df Residuals:                      254
Model Family:                Gaussian   Df Model:                            8
Link Function:               identity   Scale:                  0.144242751889
Method:                          IRLS   Log-Likelihood:                -113.98
Date:                Sat, 08 Apr 2017   Deviance:                       36.638
Time:                        18:16:06   Pearson chi2:                     36.6
No. Iterations:                     4                                         
================================================================================
                   coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
const           -0.3267      0.178     -1.839      0.066        -0.675     0.021
Tortilla         0.0738      0.033      2.213      0.027         0.008     0.139
Temp             0.0516      0.025      2.085      0.037         0.003     0.100
Meat             0.2880      0.035      8.114      0.000         0.218     0.358
Fillings         0.3756      0.038      9.922      0.000         0.301     0.450
Meat:filling     0.1404      0.028      5.005      0.000         0.085     0.195
Uniformity       0.0696      0.025      2.807      0.005         0.021     0.118
Salsa            0.0611      0.028      2.199      0.028         0.007     0.116
Wrap             0.0391      0.023      1.735      0.083        -0.005     0.083
================================================================================
0.7182737583817811

In [62]:
# Visualize coefficients
sns.set_style("whitegrid")
from tools.plt import bar
newidx = np.argsort(-res.params.values)
temp = np.arange(len(newidx))
newidx = np.delete(newidx,temp[newidx==0])
plt.figure(figsize=(15,7))
plt.bar(np.arange(len(newidx)), res.params[newidx].values, color='.5',yerr=res.bse[newidx].values)
plt.xticks(np.arange(len(newidx)),res.bse[newidx].keys())
ax = plt.gca()
ax.set_ylim((-.5,.5))
ax.set_yticks(np.arange(-.5,.6,.1))
ax.set_xticks([])
figname = 'overall_metric_linearmodelcoef'
plt.savefig('/gh/fig/burrito/'+figname + '.png')


Linear model 2: predict overall rating from ingredients

This linear model is no better than generating random features, showing that simply a good choice of ingredients is not sufficient to making a high quality burrito.


In [ ]:
# Get all ingredient keys
startingredients = 29
ingredientkeys = df.keys()[startingredients:]

# Get all ingredient keys with at least 10 burritos
Nlim = 10
ingredientkeys = ingredientkeys[df.count()[startingredients:].values>=Nlim]

# Make a dataframe for all ingredients
dfing = df[ingredientkeys]

# Convert data to binary
for k in dfing.keys():
    dfing[k] = dfing[k].map({'x':1,'X':1,1:1})
    dfing[k] = dfing[k].fillna(0)

In [ ]:
# Run a general linear model to predict overall burrito rating from ingredients
X = sm.add_constant(dfing)
y = df.overall
lm = sm.GLM(y,X)
res = lm.fit()
print(res.summary())
origR2 = 1 - np.var(res.resid_pearson) / np.var(y)

In [ ]:
# Test if the variance explained in this linear model is significantly better than chance
np.random.seed(0)
Nsurr = 1000
randr2 = np.zeros(Nsurr)
for n in range(Nsurr):
    Xrand = np.random.rand(X.shape[0],X.shape[1])
    Xrand[:,0] = np.ones(X.shape[0])
    lm = sm.GLM(y,Xrand)
    res = lm.fit()
    randr2[n] = 1 - np.var(res.resid_pearson) / np.var(y)
print 'p = ' , np.mean(randr2>origR2)

Linear model 3. Predicting Yelp ratings

Can also do this for Google ratings Note, interestingly, that the Tortilla rating is most positively correlated with Yelp and Google ratings. This is significant in a linear model when accounting for the overall rating.


In [ ]:
# Average each metric over each Location
# Avoid case issues; in the future should avoid article issues
df.Location = df.Location.str.lower()
m_Location = ['Location','N','Yelp','Google','Hunger','Cost','Volume','Tortilla','Temp','Meat','Fillings','Meat:filling',
               'Uniformity','Salsa','Synergy','Wrap','overall']

tacoshops = df.Location.unique()
TS = len(tacoshops)
dfmean = pd.DataFrame(np.nan, index=range(TS), columns=m_Location)
for ts in range(TS):
    dfmean.loc[ts] = df.loc[df.Location==tacoshops[ts]].mean()
    dfmean['N'][ts] = sum(df.Location == tacoshops[ts])
dfmean.Location = tacoshops

In [ ]:
# Note high correlations between features
m_Yelp = ['Google','Yelp','Hunger','Cost','Volume','Tortilla','Temp','Meat','Fillings','Meat:filling',
               'Uniformity','Salsa','Synergy','Wrap','overall']
M = len(m_Yelp)
dfmeancorr = dfmean[m_Yelp].corr()

from matplotlib import cm
clim1 = (-1,1)
plt.figure(figsize=(10,10))
cax = plt.pcolor(range(M+1), range(M+1), dfmeancorr, cmap=cm.bwr)
cbar = plt.colorbar(cax, ticks=(-1,-.5,0,.5,1))
cbar.ax.set_ylabel('Pearson correlation (r)', size=30)
plt.clim(clim1)
cbar.ax.set_yticklabels((-1,-.5,0,.5,1),size=20)
ax = plt.gca()
ax.set_yticks(np.arange(M)+.5)
ax.set_yticklabels(m_Yelp,size=25)
ax.set_xticks(np.arange(M)+.5)
ax.set_xticklabels(m_Yelp,size=25)
plt.xticks(rotation='vertical')
plt.xlim((0,M))
plt.ylim((0,M))
plt.tight_layout()

In [ ]:
# GLM for Yelp: all dimensions
m_Yelp = ['Hunger','Cost','Tortilla','Temp','Meat','Fillings','Meat:filling',
               'Uniformity','Salsa','Synergy','Wrap','overall']
dffull = dfmean[np.hstack((m_Yelp,'Yelp'))].dropna()
X = sm.add_constant(dffull[m_Yelp])
y = dffull['Yelp']
my_glm = sm.GLM(y,X)
res = my_glm.fit()
print(res.summary())
print(res.pvalues)
print 1 - np.var(res.resid_pearson) / np.var(y)

In [ ]:
# GLM for Yelp: some dimensions
m_Yelp = ['Tortilla','overall']
dffull = dfmean[np.hstack((m_Yelp,'Yelp'))].dropna()
X = sm.add_constant(dffull[m_Yelp])
y = dffull['Yelp']
my_glm = sm.GLM(y,X)
res = my_glm.fit()
print(res.summary())

In [ ]:
plt.figure(figsize=(4,4))
ax = plt.gca()
dfmean.plot(kind='scatter',x='Tortilla',y='Yelp',ax=ax,**{'s':40,'color':'k','alpha':.3})
plt.xlabel('Average Tortilla rating',size=20)
plt.ylabel('Yelp rating',size=20)
plt.xticks(np.arange(0,6),size=15)
plt.yticks(np.arange(0,6),size=15)
plt.ylim((2,5))
plt.tight_layout()
print sp.stats.spearmanr(dffull.Yelp,dffull.Tortilla)

figname = 'corr-Yelp-tortilla'
plt.savefig('C:/gh/fig/burrito/'+figname + '.png')

In [ ]: