In this analysis, I will walk you through my general data science process by analyzing restaurant inspections in San Francisco using publicly available data from the Department of Public Health. We will explore this data to map the cleanliness of the city and get a better perspective on the relative meaning of these scores by looking at statistics of the data. Throughout this notebook, I show how to use a spectrum of powerful tools for data science (from the UNIX shell to pandas and matplotlib) and provide some tips and tricks. If you would just like to see what insights we garnered, read the associated blog post or simply jump to the bottom of this notebook. This notebook can be downloaded (with associated data) from its repo.
Our approach is documented here to encourage anyone (and everyone) to repeat our analyses and to provide complete transparency into our results. And because we love the scientific method.
This is one of many projects aimed at democratizing data and letting it roam free (free-range data?). Stay tuned to our feeds or blog if this is the kind of thing that excites you!
If you want to learn more or dive deeper into any of these subjects, we are always happy to discuss (and can talk for days). And if you just can't get enough of this stuff (and want a completely immersive environment), you can apply for our intensive data science bootcamp starting September 16th.
We would love to hear from all of you! Please reach out with any suggestions, feedback, or to just say hi: jonathan@zipfianacademy.com
If you would like to get involved with our school, please contact us! We are always looking for guest speakers/instructors, novel datasets, and companies to partner with, either for internships/externships or for our hiring day.
In [2]:
# Import pylab to provide scientific Python libraries (NumPy, SciPy, Matplotlib)
%pylab --no-import-all
#import pylab as pl
# import the Image display module
from IPython.display import Image
# inline allows us to embed matplotlib figures directly into the IPython notebook
%pylab inline
In [3]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/animate.gif', width=700)
Out[3]:
The first step of the process is to define the problem we want to address. To do so, let us review what we have set out to accomplish and begin exploring the questions we want answered.
How clean are SF restaurants?
It is often best to arrive at a simple yet illuminating question to give you direction. Of course, there are a number of sub-questions that relate to our overarching problem; we can address these when we determine our goals for the analysis.
In [4]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/goal.png', width=500)
Out[4]:
Now that we have a problem we hope to solve, let us begin to quantify our analysis. Since our Problem Statement is often qualitative and broad, we can ask further questions to better define what we hope to achieve.
How does an individual restaurant's score compare to the whole/aggregate of SF?
Are SF's inspections better or worse than other cities?
If a restaurant has not yet been inspected, can we approximate/predict what score it will receive?
In [5]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/explore.png', width=500)
Out[5]:
The Explore stage of the analysis is where we will most likely spend most of our time. Now comes the fun part (in my opinion)! At this stage we will use a variety of tools (with documentation for each linked inline) to figure out where and how to obtain data, what it looks like once we have it, and how to use it to answer our questions and achieve our goals.
Luckily, San Francisco has much of its public government data freely accessible online. There are also great initiatives by SF companies collaborating with non-profits and government to develop open data standards. Such standardization allows for much more transparency, leading ultimately to a more engaged citizenry.
The relevant data has been downloaded for your convenience and can be found in the repo.
If you are working with this IPython notebook, download the data files into the same directory from which you ran the ipython notebook command.
Now that we have found the relevant data, we can begin to peer inside to understand what we are working with. I recommend starting with an iterative approach: use the quickest/easiest tools first and slowly build up to more complicated analyses. UNIX provides us with many powerful tools and can carry us quite far by itself. In our case the dataset came with documentation of its contents, but it is still essential to look at the raw data and compare it to the docs.
In [6]:
# Let us display a few lines of data to understand its format and fields
# Any command prefixed with '!' instructs IPython to run a shell command
# http://ipython.org/ipython-doc/rel-0.13.1/interactive/reference.html#system-shell-acces
!head -n 5 happy-healthy-hungry-master/data/SFBusinesses/businesses.csv
head is a UNIX command that prints only the first few lines of a file (in this case 5). This is very useful for exploring very large files (which happens a lot in data science) quickly and easily. You can read more about it here or by consulting its manual page on the command line: man head
In [7]:
# [PROTIP]: IPython has built in tab completion for commands.
# Partially type a command or file name and hit tab.
!head -n 5 happy-healthy-hungry-master/data/SFBusinesses/inspections.csv
In [8]:
!head -n 5 happy-healthy-hungry-master/data/SFBusinesses/violations.csv
In [10]:
# It is a little hard to read since the headers are much
# shorter than the data. Let's see if we can prettify it.
!head -n 5 happy-healthy-hungry-master/data/SFBusinesses/violations.csv | column -t -s ','
In [11]:
!head -n 5 happy-healthy-hungry-master/data/SFBusinesses/ScoreLegend.csv | column -t -s ','
There are two different data directories, each of which has similar files. Let's try to figure out the difference between the two, since the documentation on the data does not mention anything.
In [12]:
%%bash
# We can use IPython cell 'magics' to run a cell in a subprocess. In this case we run
# the entire cell in a bash process (notice no exclamation point before the shell command)
head -n 5 happy-healthy-hungry-master/data/SFFoodProgram_Complete_Data/businesses_plus.csv
IPython cell magics and other tricks
In [13]:
%%bash
# Can we somehow compare the columns?
head -n 1 happy-healthy-hungry-master/data/SFFoodProgram_Complete_Data/businesses_plus.csv | awk -F, '{ print NF }'
head -n 1 happy-healthy-hungry-master/data/SFBusinesses/businesses.csv | awk -F, '{ print NF }'
We see here that the files in SFFoodProgram_Complete_Data/ seem to be augmented with additional data: businesses_plus.csv has almost twice as many fields as businesses.csv.
We used awk to figure this out by passing the header row from each file to a simple awk script that counts the number of fields (NF).
AWK is actually a full programming language (in addition to a command-line utility) and can be quite powerful when used correctly. It is one of the standard tools at our disposal as data scientists for exploring and manipulating data.
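To give a small taste of that flexibility, here is an illustrative one-liner (not required for our analysis) that prints each column name from the businesses.csv header alongside its position. We run it through the %%bash cell magic rather than a plain ! command, since IPython may try to expand the awk script's curly braces in a ! line.

%%bash
# list each header field of businesses.csv with its column number
head -n 1 happy-healthy-hungry-master/data/SFBusinesses/businesses.csv | awk -F, '{ for (i = 1; i <= NF; i++) print i": "$i }'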
In [14]:
# In addition to the extra fields, is anything else different?
!wc happy-healthy-hungry-master/data/SFBusinesses/businesses.csv happy-healthy-hungry-master/data/SFFoodProgram_Complete_Data/businesses_plus.csv
wc is another standard UNIX utility that we will find ourselves coming back to time and time again. Here we compare the line counts (number of records) and sizes of the files.
The first column is the number of newlines (how many records are contained), the second is the word count, and the last is the number of bytes.
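As a small aside, if we only care about the record counts, wc can be restricted to just the line counts with its -l flag:

!wc -l happy-healthy-hungry-master/data/SFBusinesses/businesses.csv happy-healthy-hungry-master/data/SFFoodProgram_Complete_Data/businesses_plus.csv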
There are some interesting things to note from this very simple (yet illustrative) exploration of our data. As we might have guessed, the files in SFFoodProgram_Complete_Data/ have more information added to the original SFBusinesses/ files. While the 'complete' data has more columns, it actually has fewer records (6353 compared to 6384), possibly because it is harder to get the additional data for some businesses. But while a few businesses might be missing (~30), there is almost twice as much data in the 'complete' files if we look at byte counts (1.2MB compared to 656KB).
We could endlessly explore and compare these files and their contents; I encourage you to perform similar comparisons with the other files in SFFoodProgram_Complete_Data/ and to dive deeper into each file itself. For our examination of the data, I am happy with what we have accomplished. Given these new insights, we have enough information to continue on with our analysis.
This is typically what people refer to as data 'munging' (or 'wrangling'), and it is often the most tedious part of working with messy data. Due to increasing awareness of the importance of data quality, the city of SF has been making great strides toward more open and accessible data. If you (the city of SF) know the format you will need going into the data collection process (inspecting restaurants), you can hopefully avoid a lot of pain later in the analysis process.
The preparation process of our analysis is not as long and cumbersome as it typically might be, thanks to the high quality of the raw data. Because of this, I will spare you much of the tedium of this step so we can focus on the more interesting aspects of the analysis. If you want to see (and experience) the pain (all you masochists out there), we will get much deeper into data acquisition and scrubbing techniques in the data wrangling post of this series.
Now that we know the structure of our data, we can begin examining it statistically to get a macroscopic look at its distribution. This part of the tutorial will use much of the powerful built-in functionality of NumPy, SciPy, matplotlib, and pandas. If you want to get more experience with these, there are great resources and tutorials covering these libraries in much more depth than I will here. I highly recommend taking a look at these if this analysis interests you even in the least bit.
In [15]:
'''
To perform some interesting statistical analyses, we first need to "join" our CSV files in order to associate businesses
with their inspection scores. This data currently resides in SFBusinesses/businesses.csv and SFBusinesses/inspections.csv
'''
# import pandas library which provides an R like environment for python.
# if you do not have it installed: sudo easy_install pandas.
import pandas as pd
import scipy as sp
from scipy import stats
# store relevant file paths in variables since we may use them frequently
root_dir = 'happy-healthy-hungry-master/data/SFBusinesses/'
businesses = root_dir + 'businesses.csv'
inspections = root_dir + 'inspections.csv'
# load each file into a Pandas DataFrame, pandas automatically converts the first line into a header for the columns
df_business = pd.read_csv(businesses)
df_inspection = pd.read_csv(inspections)
# inspect the first 10 rows of the DataFrame
df_inspection.head(10)
Out[15]:
In [16]:
'''
we can 'join' DataFrames just as we would database tables.
note that pandas' merge performs an inner join by default, so only
businesses with at least one matching inspection row are kept;
pass how='left' if you want to keep every business.
'''
# join the two DataFrames on the 'business_id' column
big_table = df_business.merge(df_inspection, on='business_id')
# the joined DataFrame columns: frame1 columns + frame2 columns
# in our case it is the concatenation of df_business and df_inspection columns
print 'Business:\t' + str(df_business.columns) + '\n'
print 'Inspection:\t' + str(df_inspection.columns) + '\n'
print 'Big Table:\t' + str(big_table.columns)
# allows for row and column indexing succinctly
big_table.iloc[:10, :4]
Out[16]:
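Because merge defaults to an inner join, businesses that have never been inspected are dropped from big_table. If we wanted to keep them (with NaN in the inspection columns), a minimal variation would be the following (big_table_left is just an illustrative name, not used elsewhere in this notebook):

# keep every business, filling missing inspection fields with NaN
big_table_left = df_business.merge(df_inspection, on='business_id', how='left')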
Now that we have our joined data, we can start exploring it
In [18]:
# let us first group our data by business so we can find its most recent score for the inspections
grouped_business = big_table.groupby('business_id')
# a function that takes a DataFrame and returns the row with the newest date
def most_recent(df, column='date'):
    return df.sort_values(by=column)[-1:]
# input to most_recent is the DataFrame of each group, in this case
# all of the rows and columns for each business (grouped on business_id).
most_recent_inspection_results = grouped_business.apply(most_recent)
# We applied the most_recent function to extract the row
# of the DataFrame with the most recent inspection.
most_recent_inspection_results.head()
Out[18]:
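As a design note, the apply-based approach above is the most general, but the same result can usually be reached more cheaply by sorting once and deduplicating. A minimal sketch of that alternative (newest_per_business is an illustrative name):

# sort by inspection date, then keep only the last (newest) row for each business
newest_per_business = big_table.sort_values(by='date').drop_duplicates('business_id', keep='last')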
In [19]:
# Filter out records without lat/long for mapping
r = most_recent_inspection_results
zero_filtered = r[(r['latitude'] != 0) & (r['longitude'] != 0)]
filtered = zero_filtered.dropna(subset=['latitude', 'longitude'])[['business_id','name', 'address', 'Score', 'date', 'latitude', 'longitude']]
filtered.to_csv('geolocated_rest.csv', index=False)
In [20]:
Image(url='http://inundata.org/R_talks/meetup/images/splitapply.png', width=500)
Out[20]:
We can bin the restaurants by score to better understand the distribution of inspections. Below we create a histogram of the most recent inspection scores.
In [21]:
from scipy.stats import expon
# create a matplotlib figure with size [15,7]
figure(figsize=[15,7])
# pandas built-in histogram function automatically distributes and counts bin values
h = most_recent_inspection_results['Score'].hist(bins=100)
# create x-axis ticks at even numbers across the score range
xticks(np.arange(40, 100, 2))
# add a title to the current figure, our histogram
h.set_title("Histogram of Inspection Scores")
Out[21]:
In [22]:
# Now that we have a basic idea of the distribution, let us look at some more interesting statistics
scores = most_recent_inspection_results['Score']
mean = scores.mean()
median = scores.median()
# compute descriptive summary statistics of the inspection scores
summary = scores.describe()
mode = sp.stats.mode(scores)
skew = scores.skew()
# compute quantiles
ninety = scores.quantile(0.9)
eighty = scores.quantile(0.8)
seventy = scores.quantile(0.7)
sixty = scores.quantile(0.6)
print "Skew: " + str(skew)
print "90%: " + str(ninety)
print "80%: " + str(eighty)
print "70%: " + str(seventy)
print "60%: " + str(sixty)
print summary
In [23]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/solutions.png', width=500)
Out[23]:
Since we have explored our data and have a better idea of its nature, we can begin to devise a plan to answer our questions. This is usually the most iterative part of the entire process: as we learn more about our data we modify our approach, and as we modify our solutions we must re-examine our data.
How does an individual restaurant's score compare to the whole/aggregate of SF?
Are SF's inspections better or worse than other cities?
If a restaurant has not yet been inspected, can we approximate/predict what score it will receive?
Collect summary statistics (mean, median, standard deviation, etc.) about distribution of scores.
Acquire data on inspection scores for other cities, compare distribution of cities.
Perform a linear regression on historic data from past inspections combined with scores from other 'similar' restaurants (a rough, hypothetical sketch of this idea follows below).
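Purely as a hypothetical illustration of that third idea (this notebook does not actually build such a model), one could regress observed scores on a feature like the mean score of nearby, already-inspected restaurants, e.g. with scipy.stats.linregress. All of the numbers below are made up.

# hypothetical sketch only: predict an uninspected restaurant's score from the
# mean score of already-inspected restaurants in its neighborhood
import numpy as np
from scipy import stats

neighborhood_mean = np.array([92.0, 78.5, 88.0, 95.0, 70.0])  # made-up per-neighborhood mean scores
observed_score = np.array([94.0, 75.0, 90.0, 96.0, 72.0])     # made-up scores of inspected restaurants

slope, intercept, r_value, p_value, std_err = stats.linregress(neighborhood_mean, observed_score)
predicted = intercept + slope * 85.0  # rough guess for a new restaurant in a neighborhood averaging 85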
In [24]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/metrics.png', width=500)
Out[24]:
This is the step where derivative values are often calculated, including summary statistics, transformations on the data, and correlations. There is also a bit of traditional data mining involved, as most machine learning occurs in the solutions and metrics stages (in our formulation). We could even go so far as to say that the results of predictive models are simply additional metrics: the probability of defaulting on a loan, the cluster a new product belongs in, or the score of a restaurant that has not been inspected yet.
The purpose of this part of the process is to calculate the information you need to begin evaluating and testing your solutions and hypotheses.
In [25]:
# recall that in the Score Legend, each numeric score corresponds to a more qualitative description
!head -n 5 happy-healthy-hungry-master/data/SFBusinesses/ScoreLegend.csv | column -t -s ','
In [26]:
'''
let us compute a histogram of these descriptions as well
'''
# first we need to discretize the numerical values, this can be
# thought of as converting a continuous variable into a categorical one.
descriptions = ['Poor', 'Needs Improvement', 'Adequate', 'Good']
bins = [-1, 70, 85, 90, 100]
# copy the scores from our grouped DataFrame, DataFrames manipulate
# in place if we do not explicitly copy them.
scores = most_recent_inspection_results['Score'].copy()
score_transform = most_recent_inspection_results.copy()
# built-in pandas function which assigns each data point in
# 'scores' to an interval in 'bins' with labels from 'descriptions'
discretized_scores = pd.cut(scores, bins, labels=descriptions)
In [27]:
# transform the original DataFrame's "Score" column with the new descriptions
score_transform['Score'] = discretized_scores
score_transform[['name', 'date','Score']].head(15)
Out[27]:
By quantizing the scores of the restaurant inspections, we can get a better qualitative insight into the ratings. Let us compare this new distribution of quantized scores to the raw numeric values.
In [25]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/evaluate.png', width=500)
Out[25]:
In [28]:
'''
plot a histogram of the discretized scores
'''
# create a figure with 2 subplots
fig = figure(figsize=[30,7])
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)
# count each occurrence of the descriptions in the 'Score' column,
# and reverse the result so 'Poor' is leftmost and 'Good' is rightmost
counts = score_transform['Score'].value_counts()[::-1]
bar_ax = counts.plot(kind='bar', ax=ax2)
# decorate the plot and axis with text
ax2.set_title("Restaurant Inspections (%i total)" % sum(counts))
ax2.set_ylabel("Counts")
ax2.set_xlabel("Description")
# let us add some labels to each bar
for x, y in enumerate(counts):
    bar_ax.text(x + 0.5, y + 200, '%.f' % y, ha='left', va='top')
# plot the original raw scores (same graph as earlier)
most_recent_inspection_results['Score'].hist(bins=100, ax=ax1)
# create x-axis ticks at even numbers across the score range
ax1.set_xticks(np.arange(40, 100, 2))
# add a title to the current figure, our histogram
ax1.set_title("Histogram of Inspection Scores")
ax1.set_ylabel("Counts")
ax1.set_xlabel("Score")
savefig('histograms.png', bbox_inches=0)
We can see that a majority of restaurants are adequate or good; according to the quantiles, only 25% have scores less than 88. While the histogram of the numeric scores gives us a more granular look at the data, it can be quite difficult to derive value from it. Is an 86 a filthy/unhealthy restaurant, or did it simply forget a few nuanced inspection rules? The Score Legend provides a mapping from a raw score to a meaningful value, similar to the scaling of standardized test raw scores.
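To put a number on that qualitative claim, we can normalize the value counts of the discretized scores to see what share of restaurants falls into each description bin:

# fraction of restaurants in each description bin, as percentages
print score_transform['Score'].value_counts(normalize=True) * 100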
If we are not satisfied with our evaluation, we need to iterate on our approach:
In [29]:
# create a matplotlib figure with size [15,7]
figure(figsize=[15,7])
# pandas built-in histogram function automatically distributes and counts bin values
h = most_recent_inspection_results['Score'].hist(bins=100)
# summary statistics vertical lines
axvline(x=mean,color='red',ls='solid', lw="3", label="mean")
axvline(x=median,color='green',ls='solid', lw="3", label="median")
axvline(x=mode[0][0],color='orange',ls='solid', lw="3", label="mode")
# 25th quantile
axvline(x=summary['25%'],color='maroon',ls='dashed', lw="3", label="25th")
axvspan(40, summary['25%'], facecolor="maroon", alpha=0.3)
# 75th quantile
axvline(x=summary['75%'],color='black',ls='dashed', lw="3", label="75th")
axvspan(40, summary['75%'], facecolor="yellow", alpha=0.3 )
# create x-axis ticks at even numbers across the score range
xticks(np.arange(40, 104, 2))
# add legend to graph
legend(loc=2)
# add a title to the current figure, our histogram
h.set_title("Histogram of Inspection Scores")
savefig('quantiles.png', bbox_inches=0, transparent=True)
print summary
In [31]:
import re as re
import collections as c
import pprint as pp
# first let us form a 'big table' by joining the violations to the most recent inspection scores
file="happy-healthy-hungry-master/data/SFFoodProgram_Complete_Data/violations_plus.csv"
df_violations = pd.read_csv(file)
violation_table = most_recent_inspection_results.merge(df_violations, on=['business_id','date'])
violation_table.head()
Out[31]:
In [32]:
# Let us see how the violations are distributed
figure(figsize=[18,7])
violation_hist = violation_table['description'].value_counts().plot(kind="bar")
In [33]:
# Let us see what violations a restaurant can have and still get a perfect score
figure(figsize=[18,7])
perfect_scores = violation_table[violation_table['Score'] == 100]
violation_hist = perfect_scores['description'].value_counts().plot(kind="bar")
perfect_scores
Out[33]:
In [41]:
violation_table['description'].value_counts()[:10]
Out[41]:
In [42]:
# Hmmm, apparently high-risk vermin infestations are minor violations.
# If that is a minor violation, what counts as a severe one?
df_violations['ViolationSeverity'].drop_duplicates()
# well, apparently there are only minor violations
Out[42]:
In [43]:
# Let us bin the health violations using the city's quantization
descriptions = ['Poor', 'Needs Improvement', 'Adequate', 'Good']
bins = [-1, 70, 85, 90, 100]
# copy the scores from our grouped DataFrame, DataFrames manipulate
# in place if we do not explicitly copy them.
scores = violation_table['Score'].copy()
violation_transform = violation_table.copy()
# built-in pandas function which assigns each data point in
# 'scores' to an interval in 'bins' with labels from 'descriptions'
discretized_scores = pd.cut(scores, bins, labels=descriptions)
violation_transform["Scores"] = discretized_scores
In [44]:
grouped = violation_transform.groupby('Scores')
In [45]:
# let us find the most common violations for each group
# a function that takes a DataFrame and returns the top violations
def common_offenses(df):
    return pd.DataFrame(df['description'].value_counts(normalize=True) * 100).head(10)
top_offenses = grouped.apply(common_offenses)
top_offenses
Out[45]:
In [46]:
f = figure(figsize=[18,7])
ax = f.add_subplot(1, 1, 1)
colors = ['r', 'b', 'y', 'g']
# overlay the violation counts for each score group on the same axes
for name, group in grouped:
    group['description'].value_counts().plot(kind="bar", ax=ax, alpha=0.5, color=colors.pop())
In [37]:
Image(url='http://assets.zipfianacademy.com/data/data-science-workflow/communicate.png', width=500)
Out[37]:
An important part of visualization that is often overlooked is informational accessibility, or rather, how interpretable the results are. Interactivity can go a long way toward making the content of your visualization much more digestible by allowing the consumer to 'discover' the data at their own rate.
Each restaurant is geographically binned using the D3.js hexbin plugin. The color gradient of each hexagon reflects the median inspection score of the bin, and the radius of the hexagon is proportional to the number of restaurants in the bin. Binning is first computed with a uniform hexagon radius over the map, and then the radius of each individual hexagon is adjusted for how many restaurants ended up in its bin.
Large blue hexagons represent many high-scoring restaurants, and small red ones represent a few very poorly scoring restaurants. The controls on the map allow users to adjust the radius (Bin:) of the hexagons used for the binning as well as the range (Score:) of scores to show on the map. The color of the Bin: slider represents the median color of the Score: range, and its size represents the radius of the hexagons. The colors of the Score: sliders represent the threshold colors for those scores; i.e., if the range is 40 - 100, the left slider's color corresponds to a score of 40 and the right slider's to a score of 100. The colors for every score in between are computed using a linear gradient.
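The interactive map itself is built with D3.js, but as a rough static approximation of the same encoding, matplotlib's hexbin can color hexagonal bins by the median score using the geolocated_rest.csv file we wrote earlier (note that, unlike the D3 version, it does not vary the hexagon radius by restaurant count):

# rough static sketch of the hexbin map: color = median inspection score per bin
geo = pd.read_csv('geolocated_rest.csv')
figure(figsize=[10, 10])
hexbin(geo['longitude'], geo['latitude'], C=geo['Score'],
       reduce_C_function=np.median, gridsize=40, cmap=cm.RdYlBu)
cb = colorbar()
cb.set_label('Median inspection score')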
Thank you to Yelp and Code for America for spearheading the initiative to make civic data more open. And this would not have been possible without the city of SF and all the health inspectors, thank you for keeping us safe.