San Diego Burrito Analytics: Data characterization

Scott Cole

1 July 2016

This notebook applies nonlinear techniques to analyze the contributions of individual burrito dimensions to the overall burrito rating.

  1. Create the ‘vitalness’ metric. For each dimension, identify the burritos that scored below average on that dimension (defined here as 2 or lower), then calculate the linear model’s predicted overall score and compare it to the actual overall score (see the sketch after this list). For which dimensions is the distribution of Overall_predict - Overall_actual not symmetric around 0? If the distribution trends greater than 0, the actual score is lower than the predicted score, meaning the dimension is ‘vital’: when it is bad, the whole burrito is bad. If vitalness < 0, then the dimension being really bad does not drag down the overall rating as much as the linear model predicts.
  2. In the opposite direction, create the ‘savior’ metric: the same comparison, restricted to burritos for which the dimension was rated 4.5 or 5.
  3. For the dimensions whose distribution is significantly different from 0, quantify the effect size (e.g., a burrito rated 2 or lower on this dimension has its overall rating disproportionately impacted by XX points).
  4. How many of the dimensions show a nonzero effect? If all of them are 0, then the overall rating is a purely linear function of the dimensions, which would be surprising. If many of them are nonzero, then burrito ratings are highly nonlinear.
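
To make step 1 concrete, here is a minimal sketch of the comparison. The dffull frame, the 'Tortilla' column, and the cutoff of 2 are placeholders; the frame is assumed to already hold the linear model's prediction 'overallpred' next to the actual 'overall' rating, as computed later in this notebook.

import scipy as sp

# Sketch of the vitalness comparison for a single dimension
low = dffull[dffull['Tortilla'] <= 2]           # burritos that scored poorly on this dimension
diff = low['overallpred'] - low['overall']      # Overall_predict - Overall_actual
print diff.mean()                               # effect size, in overall-rating points
print sp.stats.ttest_rel(low['overall'], low['overallpred'])  # is this distribution shifted from 0?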

NOTE: A neural network is not recommended because we should have roughly 30x as many examples as weights. For a 3-layer network with 4 nodes in each of the first 2 layers and 1 node in the last layer, that is 4x4 + 4x1 = 20 weights, so we would need about 600 burritos. One option would be to artificially create data.
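
A quick check of the weight count in that note (a minimal sketch; the 4-4-1 layer sizes and the 30x rule of thumb are just the ones stated above):

# Parameter count for the 4-4-1 network described above
n_weights = 4 * 4 + 4 * 1           # layer 1 -> layer 2 plus layer 2 -> output = 20
examples_needed = 30 * n_weights    # ~30 examples per weight
print examples_needed               # 600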

Default imports


In [1]:
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import pandasql

import seaborn as sns
sns.set_style("white")

Load data


In [2]:
import util
df = util.load_burritos()
N = df.shape[0]

Vitalness metric


In [3]:
def vitalness(df, dim, rating_cutoff=2,
              metrics=['Hunger', 'Tortilla', 'Temp', 'Meat', 'Fillings', 'Meat:filling',
                       'Uniformity', 'Salsa', 'Wrap']):
    # Fit a GLM (default Gaussian family, i.e. an ordinary linear model) to get predicted overall ratings
    dffull = df[np.hstack((metrics, 'overall'))].dropna()
    X = sm.add_constant(dffull[metrics])
    y = dffull['overall']
    my_glm = sm.GLM(y, X)
    res = my_glm.fit()
    dffull['overallpred'] = res.fittedvalues

    # Rename Meat:filling because the colon is not a valid column name in the pandasql query below
    if dim == 'Meat:filling':
        dffull = dffull.rename(columns={'Meat:filling': 'Meatfilling'})
        dim = 'Meatfilling'

    # Compare predicted and actual overall ratings for burritos at or below the rating cutoff on this dimension
    q = """
    SELECT
    overall, overallpred
    FROM
    dffull
    WHERE
    """
    q = q + dim + ' <= ' + str(rating_cutoff)
    df2 = pandasql.sqldf(q.lower(), locals())
    return sp.stats.ttest_rel(df2.overall, df2.overallpred)
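
For reference, the subsetting done with pandasql above can also be done with plain pandas boolean indexing, which sidesteps the Meat:filling rename entirely; a minimal sketch, assuming the dffull frame built inside the function:

# Hypothetical pandas-only alternative to the pandasql query (no column rename needed)
sub = dffull[dffull[dim] <= rating_cutoff]
sp.stats.ttest_rel(sub['overall'], sub['overallpred'])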

In [4]:
vital_metrics = ['Hunger', 'Tortilla', 'Temp', 'Meat', 'Fillings', 'Meat:filling',
                 'Uniformity', 'Salsa', 'Wrap']
for metric in vital_metrics:
    print metric
    # Volume is rated on a different scale, so it gets its own cutoff
    # (unused here because Volume is not in vital_metrics)
    if metric == 'Volume':
        rating_cutoff = .7
    else:
        rating_cutoff = 1
    print vitalness(df, metric, rating_cutoff=rating_cutoff, metrics=vital_metrics)


Hunger
Ttest_relResult(statistic=-3.365883957256508, pvalue=0.0780709343672088)
Tortilla
(nan, nan)
Temp
Ttest_relResult(statistic=-0.81348890082577441, pvalue=0.5652447600464281)
Meat
C:\Users\Scott\Anaconda2\lib\site-packages\numpy\core\_methods.py:82: RuntimeWarning: Degrees of freedom <= 0 for slice
  warnings.warn("Degrees of freedom <= 0 for slice", RuntimeWarning)
Ttest_relResult(statistic=nan, pvalue=nan)
Fillings
(nan, nan)
Meat:filling
Ttest_relResult(statistic=-3.0834043861018237, pvalue=0.036809511325454659)
Uniformity
Ttest_relResult(statistic=-1.0261459655791099, pvalue=0.33896960697814343)
Salsa
Ttest_relResult(statistic=0.54602701695372891, pvalue=0.61407301073787002)
Wrap
Ttest_relResult(statistic=-0.81741849256402688, pvalue=0.43735446532724853)

Savior metric


In [5]:
def savior(df, dim, rating_cutoff=2,
           metrics=['Hunger', 'Tortilla', 'Temp', 'Meat', 'Fillings', 'Meat:filling',
                    'Uniformity', 'Salsa', 'Wrap']):

    # Fit a GLM (default Gaussian family, i.e. an ordinary linear model) to get predicted overall ratings
    dffull = df[np.hstack((metrics, 'overall'))].dropna()
    X = sm.add_constant(dffull[metrics])
    y = dffull['overall']
    my_glm = sm.GLM(y, X)
    res = my_glm.fit()
    dffull['overallpred'] = res.fittedvalues

    # Rename Meat:filling because the colon is not a valid column name in the pandasql query below
    if dim == 'Meat:filling':
        dffull = dffull.rename(columns={'Meat:filling': 'Meatfilling'})
        dim = 'Meatfilling'

    # Compare predicted and actual overall ratings for burritos at or above the rating cutoff on this dimension
    q = """
    SELECT
    overall, overallpred
    FROM
    dffull
    WHERE
    """
    q = q + dim + ' >= ' + str(rating_cutoff)
    df2 = pandasql.sqldf(q.lower(), locals())
    print len(df2)  # number of burritos at or above the cutoff
    return sp.stats.ttest_rel(df2.overall, df2.overallpred)
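
Since vitalness and savior differ only in the direction of the comparison, the two could be collapsed into a single helper; a minimal sketch (the name rating_extreme_ttest is hypothetical, and the frame is assumed to already contain the 'overallpred' predictions):

# Hypothetical combined helper: side='low' reproduces vitalness, side='high' reproduces savior
def rating_extreme_ttest(dffull, dim, rating_cutoff, side='low'):
    if side == 'low':
        sub = dffull[dffull[dim] <= rating_cutoff]
    else:
        sub = dffull[dffull[dim] >= rating_cutoff]
    return sp.stats.ttest_rel(sub['overall'], sub['overallpred'])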

In [11]:
vital_metrics = ['Hunger', 'Tortilla', 'Temp', 'Meat', 'Fillings', 'Meat:filling',
                 'Uniformity', 'Salsa', 'Wrap']
for metric in vital_metrics:
    print metric
    print savior(df, metric, rating_cutoff=5, metrics=vital_metrics)

# Volume is rated on a different scale, hence its own cutoff
print 'Volume'
vital_metrics = ['Hunger', 'Tortilla', 'Temp', 'Meat', 'Fillings', 'Meat:filling',
                 'Uniformity', 'Salsa', 'Wrap', 'Volume']
print savior(df, 'Volume', rating_cutoff=.9, metrics=vital_metrics)


Hunger
3
Ttest_relResult(statistic=-0.67484451038709892, pvalue=0.56933333515759021)
Tortilla
1
Ttest_relResult(statistic=nan, pvalue=nan)
Temp
21
Ttest_relResult(statistic=-0.88123919839347509, pvalue=0.38865627728412522)
Meat
6
Ttest_relResult(statistic=-2.3211799942541269, pvalue=0.067953603108261804)
Fillings
7
Ttest_relResult(statistic=-1.0948905704812826, pvalue=0.31555790997898076)
Meat:filling
9
Ttest_relResult(statistic=-1.7913421705001222, pvalue=0.11101175319719182)
Uniformity
14
Ttest_relResult(statistic=-0.26954678056889914, pvalue=0.79174000333648464)
Salsa
3
Ttest_relResult(statistic=-3.8348725849845149, pvalue=0.061765323303517049)
Wrap
38
Ttest_relResult(statistic=-1.950837610088058, pvalue=0.058680682095782977)
Volume
14
Ttest_relResult(statistic=-0.73069258289976113, pvalue=0.47793010155576998)