# San Diego Burrito Analytics: Data characterization

Scott Cole

1 July 2016

This notebook applies nonlinear techniques to analyze the contributions of burrito dimensions to the overall burrito rating.

1. Create the ‘vitalness’ metric. For each dimension, identify the burritos that scored below average on it (defined as 2 or lower), then calculate the linear model’s predicted overall score and compare it to the actual overall score. For which dimensions is this distribution not symmetric around 0? If the distribution of (Overall_predicted - Overall_actual) trends greater than 0, the actual score is lower than the predicted score, meaning the dimension is ‘vital’: when it is bad, the whole burrito is disproportionately bad. If vitalness < 0, then the dimension being really bad does not hurt the overall rating as much as the linear model expects.
2. In the opposite vein, create the ‘savior’ metric from all burritos in which the dimension was rated 4.5 or 5.
3. For the dimensions whose distributions are significantly different from 0, quantify the effect size (e.g., a burrito rated 2 or lower on this dimension has its overall rating disproportionately lowered by XX points).
4. How many dimensions have a nonzero effect? If all of them are 0, then burritos are perfectly linear, which would be surprising. If many are nonzero, then burritos are highly nonlinear.
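As a toy sketch of steps 1–3 (hypothetical ratings, not the real data): the paired t-test asks whether predicted-minus-actual differences center on 0, and the effect size is simply their mean.

```python
import numpy as np
from scipy import stats

# Hypothetical actual vs. GLM-predicted overall ratings for burritos that
# scored low on some dimension (illustrative numbers only)
overall_actual = np.array([2.5, 3.0, 2.0, 2.8, 3.2])
overall_pred = np.array([3.1, 3.4, 2.9, 3.0, 3.6])

# Step 3: effect size = mean paired difference (predicted - actual)
effect = np.mean(overall_pred - overall_actual)  # 0.5: actual ratings run lower

# Steps 1-2: paired t-test of actual vs. predicted overall ratings
t, p = stats.ttest_rel(overall_actual, overall_pred)
```

A positive mean difference here would mark the dimension as ‘vital’: scoring badly on it drags the overall rating below what the linear model predicts.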

NOTE: A neural network is not recommended because we should have roughly 30x as many examples as weights. For a 3-layer network with 4 nodes in each of the first 2 layers and 1 node in the last layer, that is 16 + 4 = 20 weights, so we would need about 600 burritos. One option would be to augment the data artificially.
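The sample-size arithmetic in the note can be sketched directly (layer sizes as stated above; bias terms are not counted, matching the 16 + 4 estimate):

```python
# Weights in a fully connected net with layers [4, 4, 1]:
# 4*4 between the two hidden layers, plus 4*1 into the output node.
layer_sizes = [4, 4, 1]
n_weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))

# Rule of thumb: ~30 training examples per weight
n_needed = 30 * n_weights
print(n_weights, n_needed)  # 20 600
```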

### Default imports

``````

In [1]:

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import pandasql

import seaborn as sns
sns.set_style("white")

``````

``````

In [2]:

import util

# df (the burrito ratings DataFrame) is assumed to be loaded via util
N = df.shape[0]

``````

# Vitalness metric

``````

In [3]:

def vitalness(df, dim, rating_cutoff=2,
              metrics=['Hunger', 'Tortilla', 'Temp', 'Meat', 'Fillings', 'Meat:filling',
                       'Uniformity', 'Salsa', 'Wrap']):
    # Fit GLM to get predicted values
    dffull = df[np.hstack((metrics, 'overall'))].dropna()
    X = dffull[metrics]  # design matrix of burrito dimensions
    y = dffull['overall']
    my_glm = sm.GLM(y, X)
    res = my_glm.fit()
    dffull['overallpred'] = res.fittedvalues

    # Make an exception for Meat:filling in order to avoid a pandasql error
    if dim == 'Meat:filling':
        dffull = dffull.rename(columns={'Meat:filling': 'Meatfilling'})
        dim = 'Meatfilling'

    # Compare predicted and actual overall ratings for burritos rated
    # at or below the cutoff on this dimension
    q = """
    SELECT
    overall, overallpred
    FROM
    dffull
    WHERE
    """
    q = q + dim + ' <= ' + str(rating_cutoff)
    df2 = pandasql.sqldf(q.lower(), locals())
    return sp.stats.ttest_rel(df2.overall, df2.overallpred)

``````
``````

In [4]:

vital_metrics = ['Hunger', 'Tortilla', 'Temp', 'Meat', 'Fillings', 'Meat:filling',
                 'Uniformity', 'Salsa', 'Wrap']
for metric in vital_metrics:
    print metric
    if metric == 'Volume':
        rating_cutoff = .7
    else:
        rating_cutoff = 1
    print vitalness(df, metric, rating_cutoff=rating_cutoff, metrics=vital_metrics)

``````
``````

Hunger
Ttest_relResult(statistic=-3.365883957256508, pvalue=0.0780709343672088)
Tortilla
(nan, nan)
Temp
Ttest_relResult(statistic=-0.81348890082577441, pvalue=0.5652447600464281)
Meat

C:\Users\Scott\Anaconda2\lib\site-packages\numpy\core\_methods.py:82: RuntimeWarning: Degrees of freedom <= 0 for slice
warnings.warn("Degrees of freedom <= 0 for slice", RuntimeWarning)

Ttest_relResult(statistic=nan, pvalue=nan)
Fillings
(nan, nan)
Meat:filling
Ttest_relResult(statistic=-3.0834043861018237, pvalue=0.036809511325454659)
Uniformity
Ttest_relResult(statistic=-1.0261459655791099, pvalue=0.33896960697814343)
Salsa
Ttest_relResult(statistic=0.54602701695372891, pvalue=0.61407301073787002)
Wrap
Ttest_relResult(statistic=-0.81741849256402688, pvalue=0.43735446532724853)

``````

# Savior metric

``````

In [5]:

def savior(df, dim, rating_cutoff=2,
           metrics=['Hunger', 'Tortilla', 'Temp', 'Meat', 'Fillings', 'Meat:filling',
                    'Uniformity', 'Salsa', 'Wrap']):

    # Fit GLM to get predicted values
    dffull = df[np.hstack((metrics, 'overall'))].dropna()
    X = dffull[metrics]  # design matrix of burrito dimensions
    y = dffull['overall']
    my_glm = sm.GLM(y, X)
    res = my_glm.fit()
    dffull['overallpred'] = res.fittedvalues

    # Make an exception for Meat:filling in order to avoid a pandasql error
    if dim == 'Meat:filling':
        dffull = dffull.rename(columns={'Meat:filling': 'Meatfilling'})
        dim = 'Meatfilling'

    # Compare predicted and actual overall ratings for burritos rated
    # at or above the cutoff on this dimension
    q = """
    SELECT
    overall, overallpred
    FROM
    dffull
    WHERE
    """
    q = q + dim + ' >= ' + str(rating_cutoff)
    df2 = pandasql.sqldf(q.lower(), locals())
    print len(df2)
    return sp.stats.ttest_rel(df2.overall, df2.overallpred)

``````
``````

In [11]:

vital_metrics = ['Hunger', 'Tortilla', 'Temp', 'Meat', 'Fillings', 'Meat:filling',
                 'Uniformity', 'Salsa', 'Wrap']
for metric in vital_metrics:
    print metric
    print savior(df, metric, rating_cutoff=5, metrics=vital_metrics)
print 'Volume'
vital_metrics = ['Hunger', 'Tortilla', 'Temp', 'Meat', 'Fillings', 'Meat:filling',
                 'Uniformity', 'Salsa', 'Wrap', 'Volume']
print savior(df, 'Volume', rating_cutoff=.9, metrics=vital_metrics)

``````
``````

Hunger
3
Ttest_relResult(statistic=-0.67484451038709892, pvalue=0.56933333515759021)
Tortilla
1
Ttest_relResult(statistic=nan, pvalue=nan)
Temp
21
Ttest_relResult(statistic=-0.88123919839347509, pvalue=0.38865627728412522)
Meat
6
Ttest_relResult(statistic=-2.3211799942541269, pvalue=0.067953603108261804)
Fillings
7
Ttest_relResult(statistic=-1.0948905704812826, pvalue=0.31555790997898076)
Meat:filling
9
Ttest_relResult(statistic=-1.7913421705001222, pvalue=0.11101175319719182)
Uniformity
14
Ttest_relResult(statistic=-0.26954678056889914, pvalue=0.79174000333648464)
Salsa
3
Ttest_relResult(statistic=-3.8348725849845149, pvalue=0.061765323303517049)
Wrap
38
Ttest_relResult(statistic=-1.950837610088058, pvalue=0.058680682095782977)
Volume
14
Ttest_relResult(statistic=-0.73069258289976113, pvalue=0.47793010155576998)

``````