In this analysis, I will walk you through my general data science process by analyzing restaurant inspections in San Francisco using publicly available data from the Department of Public Health. We will explore this data to map the cleanliness of the city and get a better perspective on the relative meaning of these scores by looking at statistics of the data. Throughout this notebook, I show how to use a spectrum of powerful tools for data science (from the UNIX shell to pandas and matplotlib) and provide some tips and tricks. If you would just like to see what insights we garnered, read the associated blog post or simply jump to the bottom of this notebook. This notebook can be downloaded (with associated data) from its repo.
Our approach is documented here to encourage anyone (and everyone) to repeat our analyses and to provide complete transparency into our results. And because we love the scientific method.
This is one of many projects aimed at democratizing data and letting it roam free (free-range data?). Stay tuned to our feeds or blog if this is the kind of thing that excites you!
If you want to learn more or dive deeper into any of these subjects, we are always happy to discuss (and can talk for days). And if you just can't get enough of this stuff (and want a completely immersive environment), you can apply for our intensive data science bootcamp starting September 16th.
We would love to hear from all of you! Please reach out with any suggestions, feedback, or to just say hi: jonathan@zipfianacademy.com
If you would like to get involved with our school, please contact us! We are always looking for guest speakers/instructors, novel datasets, and companies to partner with, either for internships/externships or for our hiring day.
In [2]:
# Import pylab to provide scientific Python libraries (NumPy, SciPy, Matplotlib)
%pylab --no-import-all
#import pylab as pl
# import the Image display module
from IPython.display import Image
# inline allows us to embed matplotlib figures directly into the IPython notebook
%pylab inline
In [3]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/animate.gif', width=700)
Out[3]:
The first step of the process is to define the problem we want to address. To do so, let us review what we have set out to accomplish and begin exploring the questions we want answered.
How clean are SF restaurants?
It is often best to arrive at a simple yet illuminating question to give you direction. Of course, there are a number of sub-questions that relate to our overarching problem; we can address these when we determine our goals for the analysis.
In [4]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/goal.png', width=500)
Out[4]:
Now that we have a problem we hope to solve, let us begin to quantify our analysis. Since our Problem Statement is often qualitative and broad, we can ask further questions to better define what we hope to achieve.
How does an individual restaurant's score compare to the whole/aggregate of SF?
Are SF's inspections better or worse than other cities?
If a restaurant has not yet been inspected, can we approximate/predict what score it will receive?
In [5]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/explore.png', width=500)
Out[5]:
The Explore stage of the analysis is where we will most likely spend most of our time. Now comes the fun part (in my opinion)! At this stage we will use a variety of tools (with documentation for each linked inline) to figure out where and how to obtain data, what it looks like once we have it, and how to use it to answer our questions and achieve our goals.
Luckily, San Francisco has much of its public government data freely accessible online. There are also great initiatives by SF companies collaborating with non-profits and government to develop open data standards. Such standardization allows for much more transparency, leading ultimately to a more engaged citizenry.
The relevant data has been downloaded for your convenience and can be found in the repo.
If you are working with this IPython notebook, download the data files into the same directory from which you ran the ipython notebook command.
Now that we have found the relevant data, we can begin to peer inside to understand what we are working with. I recommend starting with an iterative approach: use the quickest/easiest tools first and slowly build up to more complicated analyses. UNIX provides us with many powerful tools and can carry us quite far by itself. In our case the dataset came with documentation of its contents, but it is still essential to look at the raw data and compare it to the docs.
In [6]:
# Let us display a few lines of data to understand its format and fields
# Any command prefixed with '!' instructs IPython to run a shell command
# http://ipython.org/ipython-doc/rel-0.13.1/interactive/reference.html#system-shell-acces
!head -n 5 happy-healthy-hungry-master/data/SFBusinesses/businesses.csv
head is a UNIX command that prints only the first few lines of a file (in this case 5). This is very useful for exploring very large files (which happens a lot in data science) quickly and easily. You can read more about it here or by consulting its manual page on the command line: man head
In [7]:
# [PROTIP]: IPython has built in tab completion for commands.
# Partially type a command or file name and hit tab.
!head -n 5 happy-healthy-hungry-master/data/SFBusinesses/inspections.csv
In [8]:
!head -n 5 happy-healthy-hungry-master/data/SFBusinesses/violations.csv
In [10]:
# It is a little hard to read since the headers are much
# shorter than the data. Let's see if we can prettify it.
!head -n 5 happy-healthy-hungry-master/data/SFBusinesses/violations.csv | column -t -s ','
In [11]:
!head -n 5 happy-healthy-hungry-master/data/SFBusinesses/ScoreLegend.csv | column -t -s ','
There are two different data directories, each of which has similar files. Let's try to figure out the difference between the two, since the documentation on the data does not mention anything.
In [12]:
%%bash
# We can use IPython cell 'magics' to run a cell in a subprocess. In this case we run
# the entire cell in a bash process (notice no exclamation point before the shell command)
head -n 5 happy-healthy-hungry-master/data/SFFoodProgram_Complete_Data/businesses_plus.csv
IPython cell magics and other tricks
In [13]:
%%bash
# Can we somehow compare the columns?
head -n 1 happy-healthy-hungry-master/data/SFFoodProgram_Complete_Data/businesses_plus.csv | awk -F, '{ print NF }'
head -n 1 happy-healthy-hungry-master/data/SFBusinesses/businesses.csv | awk -F, '{ print NF }'
We see here that the files in SFFoodProgram_Complete_Data/ seem to be augmented with additional data: businesses_plus.csv has almost twice as many fields as businesses.csv.
We used awk to figure this out by passing the header row from each file to a simple awk script that counts the number of fields (NF).
AWK is actually a full programming language (in addition to a command-line utility) and can be quite powerful when used correctly. It is one of the standard tools at our disposal as data scientists for exploring and manipulating data.
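To give a small taste of that flexibility, here is an illustrative one-liner (not required for our analysis) that prints each column name from the businesses.csv header alongside its position. We run it through the %%bash cell magic rather than a plain ! command, since IPython may try to expand the awk script's curly braces in a ! line.

%%bash
# list each header field of businesses.csv with its column number
head -n 1 happy-healthy-hungry-master/data/SFBusinesses/businesses.csv | awk -F, '{ for (i = 1; i <= NF; i++) print i": "$i }'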
In [14]:
# In addition to the extra fields, is anything else different?
!wc happy-healthy-hungry-master/data/SFBusinesses/businesses.csv happy-healthy-hungry-master/data/SFFoodProgram_Complete_Data/businesses_plus.csv
wc is another standard UNIX utility that we will find ourselves coming back to time and time again. Here we compare the line counts (number of records) and sizes of the files.
The first column is the number of newlines (how many records are contained), the second is the word count, and the last is the number of bytes.
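As a small aside, if we only care about the record counts, wc can be restricted to just the line counts with its -l flag:

!wc -l happy-healthy-hungry-master/data/SFBusinesses/businesses.csv happy-healthy-hungry-master/data/SFFoodProgram_Complete_Data/businesses_plus.csv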
There are some interesting things to note from this very simple (yet illustrative) exploration of our data. As we might have guessed, the files in SFFoodProgram_Complete_Data/ have more information added to the original SFBusinesses/ files. While the 'complete' data has more columns, it actually has fewer records (6353 compared to 6384), possibly because it is harder to get the additional data for some businesses. But while a few businesses might be missing (~30), there is almost twice as much data in the 'complete' files if we look at byte counts (1.2MB compared to 656KB).
We could endlessly explore and compare these files and their contents; I encourage you to perform similar comparisons with the other files in SFFoodProgram_Complete_Data/ and to dive deeper into each file itself. For our examination of the data, I am happy with what we have accomplished. Given these new insights, we have enough information to continue on with our analysis.
This is typically what people refer to as data 'munging' (or 'wrangling'), and it is often the most tedious part of working with messy data. Due to increasing awareness of the importance of data quality, the city of SF has been making great strides toward more open and accessible data. If you (the city of SF) know the format you will need going into the data collection process (inspecting restaurants), you can hopefully avoid a lot of pain later in the analysis process.
The preparation process of our analysis is not as long and cumbersome as it typically might be, thanks to the high quality of the raw data. Because of this, I will spare you much of the tedium of this step so we can focus on the more interesting aspects of the analysis. If you want to see (and experience) the pain (all you masochists out there), we will get much deeper into data acquisition and scrubbing techniques in the data wrangling post of this series.
Now that we know the structure of our data, we can begin examining it statistically to get a macroscopic look at its distribution. This part of the tutorial will use much of the powerful built-in functionality of NumPy, SciPy, matplotlib, and pandas. If you want to get more experience with these, there are great resources and tutorials covering these libraries in much more depth than I will here. I highly recommend taking a look at these if this analysis interests you even in the least bit.
In [15]:
'''
To perform some interesting statistical analyses, we first need to "join" our CSV files in order to associate businesses
with their inspection scores. This data currently resides in SFBusinesses/businesses.csv and SFBusinesses/inspections.csv
'''
# import pandas library which provides an R like environment for python.
# if you do not have it installed: sudo easy_install pandas.
import pandas as pd
import scipy as sp
from scipy import stats
# store relevant file paths in variables since we may use them frequently
root_dir = 'happy-healthy-hungry-master/data/SFBusinesses/'
businesses = root_dir + 'businesses.csv'
inspections = root_dir + 'inspections.csv'
# load each file into a Pandas DataFrame, pandas automatically converts the first line into a header for the columns
df_business = pd.read_csv(businesses)
df_inspection = pd.read_csv(inspections)
# inspect the first 10 rows of the DataFrame
df_inspection.head(10)
Out[15]:
In [16]:
'''
we can 'join' DataFrames just as we would database tables.
note that pandas' merge performs an inner join by default, so only
businesses with at least one matching inspection row are kept;
pass how='left' if you want to keep every business.
'''
# join the two DataFrames on the 'business_id' column
big_table = df_business.merge(df_inspection, on='business_id')
# the joined DataFrame columns: frame1 columns + frame2 columns
# in our case it is the concatenation of df_business and df_inspection columns
print 'Business:\t' + str(df_business.columns) + '\n'
print 'Inspection:\t' + str(df_inspection.columns) + '\n'
print 'Big Table:\t' + str(big_table.columns)
# allows for row and column indexing succinctly
big_table.iloc[:10, :4]
Out[16]:
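Because merge defaults to an inner join, businesses that have never been inspected are dropped from big_table. If we wanted to keep them (with NaN in the inspection columns), a minimal variation would be the following (big_table_left is just an illustrative name, not used elsewhere in this notebook):

# keep every business, filling missing inspection fields with NaN
big_table_left = df_business.merge(df_inspection, on='business_id', how='left')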
Now that we have our joined data, we can start exploring it
In [18]:
# let us first group our data by business so we can find its most recent score for the inspections
grouped_business = big_table.groupby('business_id')
# a function that takes a DataFrame and returns the row with the newest date
def most_recent(df, column='date'):
    return df.sort_values(by=column)[-1:]
# input to most_recent is the DataFrame of each group, in this case
# all of the rows and columns for each business (grouped on business_id).
most_recent_inspection_results = grouped_business.apply(most_recent)
# We applied the most_recent function to extract the row
# of the DataFrame with the most recent inspection.
most_recent_inspection_results.head()
Out[18]:
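As a design note, the apply-based approach above is the most general, but the same result can usually be reached more cheaply by sorting once and deduplicating. A minimal sketch of that alternative (newest_per_business is an illustrative name):

# sort by inspection date, then keep only the last (newest) row for each business
newest_per_business = big_table.sort_values(by='date').drop_duplicates('business_id', keep='last')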
In [19]:
# Filter out records without lat/long for mapping
r = most_recent_inspection_results
zero_filtered = r[(r['latitude'] != 0) & (r['longitude'] != 0)]
filtered = zero_filtered.dropna(subset=['latitude', 'longitude'])[['business_id','name', 'address', 'Score', 'date', 'latitude', 'longitude']]
filtered.to_csv('geolocated_rest.csv', index=False)
In [20]:
Image(url='http://inundata.org/R_talks/meetup/images/splitapply.png', width=500)
Out[20]:
We can bin the restaurants by score to better understand the distribution of inspections. Below we create a histogram of the most recent inspection scores.
In [21]:
from scipy.stats import expon
# create a matplotlib figure with size [15,7]
figure(figsize=[15,7])
# pandas built-in histogram function automatically distributes and counts bin values
h = most_recent_inspection_results['Score'].hist(bins=100)
# create x-axis ticks at even numbers across the score range
xticks(np.arange(40, 100, 2))
# add a title to the current figure, our histogram
h.set_title("Histogram of Inspection Scores")
Out[21]:
In [22]:
# Now that we have a basic idea of the distribution, let us look at some more interesting statistics
scores = most_recent_inspection_results['Score']
mean = scores.mean()
median = scores.median()
# compute descriptive summary statistics of the inspection scores
summary = scores.describe()
mode = sp.stats.mode(scores)
skew = scores.skew()
# compute quantiles
ninety = scores.quantile(0.9)
eighty = scores.quantile(0.8)
seventy = scores.quantile(0.7)
sixty = scores.quantile(0.6)
print "Skew: " + str(skew)
print "90%: " + str(ninety)
print "80%: " + str(eighty)
print "70%: " + str(seventy)
print "60%: " + str(sixty)
print summary
In [23]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/solutions.png', width=500)
Out[23]:
Since we have explored our data and have a better idea of its nature, we can begin to devise a plan to answer our questions. This is usually the most iterative part of the entire process: as we learn more about our data we modify our approach, and as we modify our solutions we must re-examine our data.
How does an individual restaurant's score compare to the whole/aggregate of SF?
Are SF's inspections better or worse than other cities?
If a restaurant has not yet been inspected, can we approximate/predict what score it will receive?
Collect summary statistics (mean, median, standard deviation, etc.) about distribution of scores.
Acquire data on inspection scores for other cities, compare distribution of cities.
Perform a linear regression on historic data from past inspections combined with scores from other 'similar' restaurants (a rough, hypothetical sketch of this idea follows below).
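Purely as a hypothetical illustration of that third idea (this notebook does not actually build such a model), one could regress observed scores on a feature like the mean score of nearby, already-inspected restaurants, e.g. with scipy.stats.linregress. All of the numbers below are made up.

# hypothetical sketch only: predict an uninspected restaurant's score from the
# mean score of already-inspected restaurants in its neighborhood
import numpy as np
from scipy import stats

neighborhood_mean = np.array([92.0, 78.5, 88.0, 95.0, 70.0])  # made-up per-neighborhood mean scores
observed_score = np.array([94.0, 75.0, 90.0, 96.0, 72.0])     # made-up scores of inspected restaurants

slope, intercept, r_value, p_value, std_err = stats.linregress(neighborhood_mean, observed_score)
predicted = intercept + slope * 85.0  # rough guess for a new restaurant in a neighborhood averaging 85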
In [24]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/metrics.png', width=500)
Out[24]:
This is the step where derivative values are often calculated, including summary statistics, transformations on the data, and correlations. There is also a bit of traditional data mining involved, as most machine learning occurs in the solutions and metrics stages (in our formulation). We could even go so far as to say that the results of predictive models are simply additional metrics: the probability of defaulting on a loan, the cluster a new product belongs in, or the score of a restaurant that has not been inspected yet.
The purpose of this part of the process is to calculate the information you need to begin evaluating and testing your solutions and hypotheses.
In [25]:
# recall that in the Score Legend, each numeric score corresponds to a more qualitative description
!head -n 5 happy-healthy-hungry-master/data/SFBusinesses/ScoreLegend.csv | column -t -s ','
In [26]:
'''
let us compute a histogram of these descriptions as well
'''
# first we need to discretize the numerical values, this can be
# thought of as converting a continuous variable into a categorical one.
descriptions = ['Poor', 'Needs Improvement', 'Adequate', 'Good']
bins = [-1, 70, 85, 90, 100]
# copy the scores from our grouped DataFrame, DataFrames manipulate
# in place if we do not explicitly copy them.
scores = most_recent_inspection_results['Score'].copy()
score_transform = most_recent_inspection_results.copy()
# built-in pandas function which assigns each data point in
# 'scores' to an interval in 'bins' with labels from 'descriptions'
discretized_scores = pd.cut(scores, bins, labels=descriptions)
In [27]:
# transform the original DataFrame's "Score" column with the new descriptions
score_transform['Score'] = discretized_scores
score_transform[['name', 'date','Score']].head(15)
Out[27]:
By quantizing the scores of the restaurant inspections, we can get a better qualitative insight into the ratings. Let us compare this new distribution of quantized scores to the raw numeric values.
In [25]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/evaluate.png', width=500)
Out[25]:
In [28]:
'''
plot a histogram of the discretized scores
'''
# create a figure with 2 subplots
fig = figure(figsize=[30,7])
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)
# count each occurrence of the descriptions in the 'Score' column,
# and reverse the result so 'Poor' is leftmost and 'Good' is rightmost
counts = score_transform['Score'].value_counts()[::-1]
bar_ax = counts.plot(kind='bar', ax=ax2)
# decorate the plot and axis with text
ax2.set_title("Restaurant Inspections (%i total)" % sum(counts))
ax2.set_ylabel("Counts")
ax2.set_xlabel("Description")
# let us add some labels to each bar
for x, y in enumerate(counts):
    bar_ax.text(x + 0.5, y + 200, '%.f' % y, ha='left', va='top')
# plot the original raw scores (same graph as earlier)
most_recent_inspection_results['Score'].hist(bins=100, ax=ax1)
# create x-axis ticks at even numbers across the score range
ax1.set_xticks(np.arange(40, 100, 2))
# add a title to the current figure, our histogram
ax1.set_title("Histogram of Inspection Scores")
ax1.set_ylabel("Counts")
ax1.set_xlabel("Score")
savefig('histograms.png', bbox_inches=0)
We can see that a majority of restaurants are adequate or good; according to the quantiles, only 25% have scores less than 88. While the histogram of the numeric scores gives us a more granular look at the data, it can be quite difficult to derive value from it. Is an 86 a filthy/unhealthy restaurant, or did it simply forget a few nuanced inspection rules? The Score Legend provides a mapping from a raw score to a meaningful value, similar to the scaling of standardized test raw scores.
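To put a number on that qualitative claim, we can normalize the value counts of the discretized scores to see what share of restaurants falls into each description bin:

# fraction of restaurants in each description bin, as percentages
print score_transform['Score'].value_counts(normalize=True) * 100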
If we are not satisfied with our evaluation, we need to iterate on our approach:
In [29]:
# create a matplotlib figure with size [15,7]
figure(figsize=[15,7])
# pandas built-in histogram function automatically distributes and counts bin values
h = most_recent_inspection_results['Score'].hist(bins=100)
# summary statistics vertical lines
axvline(x=mean,color='red',ls='solid', lw="3", label="mean")
axvline(x=median,color='green',ls='solid', lw="3", label="median")
axvline(x=mode[0][0],color='orange',ls='solid', lw="3", label="mode")
# 25th quantile
axvline(x=summary['25%'],color='maroon',ls='dashed', lw="3", label="25th")
axvspan(40, summary['25%'], facecolor="maroon", alpha=0.3)
# 75th quantile
axvline(x=summary['75%'],color='black',ls='dashed', lw="3", label="75th")
axvspan(40, summary['75%'], facecolor="yellow", alpha=0.3 )
# create x-axis ticks at even numbers across the score range
xticks(np.arange(40, 104, 2))
# add legend to graph
legend(loc=2)
# add a title to the current figure, our histogram
h.set_title("Histogram of Inspection Scores")
savefig('quantiles.png', bbox_inches=0, transparent=True)
print summary
In [31]:
import re as re
import collections as c
import pprint as pp
# first let us form a 'big table' by joining the violations to the most recent inspection scores
file="happy-healthy-hungry-master/data/SFFoodProgram_Complete_Data/violations_plus.csv"
df_violations = pd.read_csv(file)
violation_table = most_recent_inspection_results.merge(df_violations, on=['business_id','date'])
violation_table.head()
Out[31]:
In [32]:
# Let us see how the violations are distributed
figure(figsize=[18,7])
violation_hist = violation_table['description'].value_counts().plot(kind="bar")
In [33]:
# Let us see what violations a restaurant can have and still get a perfect score
figure(figsize=[18,7])
perfect_scores = violation_table[violation_table['Score'] == 100]
violation_hist = perfect_scores['description'].value_counts().plot(kind="bar")
perfect_scores
Out[33]:
In [41]:
violation_table['description'].value_counts()[:10]
Out[41]:
In [42]:
# Hmmm, apparently high-risk vermin infestations are minor violations.
# If that is a minor violation, what counts as a severe one?
df_violations['ViolationSeverity'].drop_duplicates()
# well, apparently there are only minor violations
Out[42]:
In [43]:
# Let us bin the health violations using the city's quantization
descriptions = ['Poor', 'Needs Improvement', 'Adequate', 'Good']
bins = [-1, 70, 85, 90, 100]
# copy the scores from our grouped DataFrame, DataFrames manipulate
# in place if we do not explicitly copy them.
scores = violation_table['Score'].copy()
violation_transform = violation_table.copy()
# built-in pandas function which assigns each data point in
# 'scores' to an interval in 'bins' with labels from 'descriptions'
discretized_scores = pd.cut(scores, bins, labels=descriptions)
violation_transform["Scores"] = discretized_scores
In [44]:
grouped = violation_transform.groupby('Scores')
In [45]:
# let us find the most common violations for each group
# a function that takes a DataFrame and returns the top violations
def common_offenses(df):
    return pd.DataFrame(df['description'].value_counts(normalize=True) * 100).head(10)
top_offenses = grouped.apply(common_offenses)
top_offenses
Out[45]:
In [46]:
f = figure(figsize=[18,7])
ax = f.add_subplot(1, 1, 1)
colors = ['r', 'b', 'y', 'g']
# overlay the violation counts for each score group on the same axes
for name, group in grouped:
    group['description'].value_counts().plot(kind="bar", ax=ax, alpha=0.5, color=colors.pop())
In [37]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/communicate.png', width=500)
Out[37]:
An important part of visualization that is often overlooked is informational accessibility, or rather, how interpretable the results are. Interactivity can go a long way toward making the content of your visualization much more digestible by allowing the consumer to 'discover' the data at their own rate.
Each restaurant is geographically binned using the D3.js hexbin plugin. The color gradient of each hexagon reflects the median inspection score of the bin, and the radius of the hexagon is proportional to the number of restaurants in the bin. Binning is first computed with a uniform hexagon radius over the map, and then the radius of each individual hexagon is adjusted for how many restaurants ended up in its bin.
Large blue hexagons represent many high-scoring restaurants, and small red ones represent a few very poorly scoring restaurants. The controls on the map allow users to adjust the radius (Bin:) of the hexagons used for the binning as well as the range (Score:) of scores to show on the map. The color of the Bin: slider represents the median color of the Score: range, and its size represents the radius of the hexagons. The colors of the Score: sliders represent the threshold colors for those scores; i.e., if the range is 40 - 100, the left slider's color corresponds to a score of 40 and the right slider's to a score of 100. The colors for every score in between are computed using a linear gradient.
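The interactive map itself is built with D3.js, but as a rough static approximation of the same encoding, matplotlib's hexbin can color hexagonal bins by the median score using the geolocated_rest.csv file we wrote earlier (note that, unlike the D3 version, it does not vary the hexagon radius by restaurant count):

# rough static sketch of the hexbin map: color = median inspection score per bin
geo = pd.read_csv('geolocated_rest.csv')
figure(figsize=[10, 10])
hexbin(geo['longitude'], geo['latitude'], C=geo['Score'],
       reduce_C_function=np.median, gridsize=40, cmap=cm.RdYlBu)
cb = colorbar()
cb.set_label('Median inspection score')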
Thank you to Yelp and Code for America for spearheading the initiative to make civic data more open. And this would not have been possible without the city of SF and all the health inspectors, thank you for keeping us safe.