Modified from an IPython Notebook created by Jonathan Dinu here
Now, let's use DataFrames to explore a dataset containing inspection records for San Francisco restaurants. We'll be using this publicly available dataset from the Department of Public Health. We will explore this data to map the cleanliness of the city, and get a better perspective on the relative meaning of these scores by looking at statistics of the data. This notebook can be downloaded (with associated data) from its repo.
The first step of the data analysis process is to define the problem we want to address. To do so let us review what we have set out to accomplish and begin exploring questions we want answered.
How clean are SF restaurants?
It is often best to arrive at a simple yet illuminating question to give you direction. Of course, there are a number of sub-questions that relate to our overarching problem; we can address these when we determine our goals for the analysis.
Now that we have a problem we hope to solve, let us begin to quantify our analysis. Since our Problem Statement is often qualitative and broad, we can ask further questions to better define what we hope to achieve.
How does an individual restaurant's score compare to the whole/aggregate of SF?
Are SF's inspections better or worse than other cities?
If a restaurant has not yet been inspected, can we approximate/predict what score it will receive?
To get a better understanding of the data, we can begin examining it statistically to get a macroscopic look at its distribution. This part of our tutorial will use much of the powerful built-in functionality of NumPy, SciPy, matplotlib, and pandas. If you want to get more experience with these, there are great resources and tutorials covering these libraries in much more depth than I will here. I highly recommend taking a look at these if this analysis interests you even in the least bit.
In [ ]:
%pylab inline
In [ ]:
# Import pylab to provide scientific Python libraries (NumPy, SciPy, Matplotlib)
%pylab --no-import-all
#import pylab as pl
# import the Image display module
from IPython.display import Image
First, load the data into pandas dataframes
In [ ]:
# import pandas library which provides an R like environment for python.
# if you do not have it installed: sudo easy_install pandas.
import pandas as pd
import scipy as sp
from scipy import stats
# store relevant file paths in variables since we may use them frequently
businesses = 'businesses.csv'
inspections = 'inspections.csv'
# load each file into a Pandas DataFrame, pandas automatically converts the first line into a header for the columns
df_business = pd.read_csv(businesses)
df_inspection = pd.read_csv(inspections)
Now, we can inspect the first 10 rows of the DataFrame to ensure that the data is loaded properly and get an idea of the schema
In [ ]:
df_inspection.head(10)
We can 'join' DataFrames just as we would database tables. By default, pandas' merge performs an inner join, so only businesses with at least one matching inspection appear in the result; passing how='left' would keep every business even if there is not a corresponding row in the inspections.
In [ ]:
# join the two DataFrames on the 'business_id' column
big_table = df_business.merge(df_inspection, on='business_id')
# the joined DataFrame columns: frame1 columns + frame2 columns
# in our case it is the concatenation of df_business and df_inspection columns
print 'Business:\t' + str(df_business.columns) + '\n'
print 'Inspection:\t' + str(df_inspection.columns) + '\n'
print 'Big Table:\t' + str(big_table.columns)
# allows for row and column indexing succinctly
big_table.iloc[:10, :4]
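Because the default inner join drops businesses that have never been inspected, here is a minimal sketch (not part of the original analysis, and assuming the 'Score' column comes from the inspections file) of how we could keep them with an explicit left join:
In [ ]:
# keep every business, even those with no matching inspection row
left_joined = df_business.merge(df_inspection, on='business_id', how='left')
# businesses that were never inspected now appear with NaN in the inspection columns
print(str(left_joined['Score'].isnull().sum()) + ' businesses have no inspection on record')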
Now that we have our joined data, we can start exploring it
We can start by grouping our data by business in order to find its most recent score for the inspections
In [ ]:
grouped_business = big_table.groupby('business_id')
Then we can define a new DataFrame by applying a function to the old DataFrame.
In [ ]:
# a function that takes a DataFrame and returns the row with the newest date
def most_recent(df, column='date'):
    return df.sort_values(by=column)[-1:]
# input to most_recent is the DataFrame of each group, in this case
# all of the rows and columns for each business (grouped on business_id).
most_recent_inspection_results = grouped_business.apply(most_recent)
# We applied the most_recent function to extract the row
# of the DataFrame with the most recent inspection.
most_recent_inspection_results.head()
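As a quick cross-check (an alternative sketch, not the notebook's original approach, and assuming the stored 'date' values sort chronologically), the same result can usually be obtained by sorting the whole table and keeping the last row per business:
In [ ]:
# equivalent idea: sort all rows by date, then keep only the last
# (most recent) row for each business_id
alt_most_recent = big_table.sort_values(by='date').drop_duplicates(subset='business_id', keep='last')
alt_most_recent.head()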
Another important aspect of the exploration process is removing bad data. Here, we filter out records without latitude/longitude coordinates so the restaurants can be mapped.
In [ ]:
r = most_recent_inspection_results
zero_filtered = r[(r['latitude'] != 0) & (r['longitude'] != 0)]
filtered = zero_filtered.dropna(subset=['latitude', 'longitude'])[['business_id','name', 'address', 'Score', 'date', 'latitude', 'longitude']]
filtered.to_csv('geolocated_rest.csv', index=False)
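It can be useful to know how much data this filtering throws away; a small illustrative check (not in the original notebook):
In [ ]:
# how many of the most recent inspection records lacked usable coordinates
dropped = len(r) - len(filtered)
print(str(dropped) + ' of ' + str(len(r)) + ' records were dropped for missing or zero coordinates')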
We can bin the restaurants by score to better understand the distribution of inspections. Here we create a histogram of the most recent scores.
In [ ]:
# create a matplotlib figure with size [15,7]
figure(figsize=[15,7])
# pandas built-in histogram function automatically distributes and counts bin values
h = most_recent_inspection_results['Score'].hist(bins=100)
# create x-axis ticks at even numbers from 40 to 98
xticks(np.arange(40, 100, 2))
# add a title to the current figure, our histogram
h.set_title("Histogram of Inspection Scores")
Now that we have a basic idea of the distribution, let us look at some more interesting statistics
In [ ]:
scores = most_recent_inspection_results['Score']
mean = scores.mean()
median = scores.median()
# compute descriptive summary statistics of the inspection scores
summary = scores.describe()
mode = sp.stats.mode(scores)
skew = scores.skew()
# compute quantiles
ninety = scores.quantile(0.9)
eighty = scores.quantile(0.8)
seventy = scores.quantile(0.7)
sixty = scores.quantile(0.6)
print "Skew: " + str(skew)
print "90%: " + str(ninety)
print "80%: " + str(eighty)
print "70%: " + str(seventy)
print "60%: " + str(sixty)
print summary
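To tie these aggregates back to our first question, here is a small illustrative sketch (the score of 88 is an arbitrary example value, not taken from the data) using scipy's percentileofscore to see where a single restaurant's score falls within the distribution:
In [ ]:
from scipy.stats import percentileofscore
# percentile rank of an example raw score of 88 among the most recent scores
print("An example score of 88 sits at roughly the %.1f percentile of the most recent scores" % percentileofscore(scores.dropna(), 88))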
Since we have explored our data and have a better idea of its nature, we can begin to devise a plan to answer our questions. This is usually the most iterative part of the entire process: as we learn more about our data we modify our approach, and as we modify our solutions we must re-examine our data.
How does an individual restaurant's score compare to the whole/aggregate of SF?
Are SF's inspections better or worse than other cities?
If a restaurant has not yet been inspected, can we approximate/predict what score it will receive?
Collect summary statistics (mean, median, standard deviation, etc.) about distribution of scores.
Acquire data on inspection scores for other cities, and compare the distributions across cities.
Perform a linear regression on historical inspection data, combined with scores from other 'similar' restaurants.
This is the step where derivative values are often calculated, including summary statistics, transformations on the data, and correlations. There also is a bit of traditional data mining involved as most machine learning occurs in the solutions and metrics stages (in our formulation). We could even go so far as to say that the results of predictive models are simply additional metrics: the probability of defaulting on a loan, the cluster a new product belongs in, or the score of a restaurant that hasn't been inspected yet.
The purpose of this part of the process is to calculate the information you need to begin evaluating and testing your solutions and hypotheses.
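For the prediction question, here is a very rough, purely illustrative sketch (not the notebook's method): the hypothetical helper naive_score_forecast fits a straight line through a single business's past scores and extrapolates one inspection ahead, assuming the stored 'date' values sort chronologically.
In [ ]:
# purely illustrative: fit a straight line through one business's past scores
# (in chronological order) and extrapolate one inspection into the future
def naive_score_forecast(df):
    history = df.dropna(subset=['Score']).sort_values(by='date')
    x = np.arange(len(history), dtype=float)
    slope, intercept = np.polyfit(x, history['Score'].values, 1)
    return slope * len(history) + intercept
# try it on an arbitrary business that has at least a few inspections on record
inspection_counts = big_table['business_id'].value_counts()
example_id = inspection_counts[inspection_counts >= 3].index[0]
naive_score_forecast(big_table[big_table['business_id'] == example_id])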
Given a mapping from score ranges to descriptions, we want to assign a description to each business in our grouped DataFrame. We will create description bins from the score ranges, then use a built-in pandas function to assign each business to a bin.
In [ ]:
# first we need to discretize the numerical values, this can be
# thought of as converting a continuous variable into a categorical one.
descriptions = ['Poor', 'Needs Improvement', 'Adequate', 'Good']
bins = [-1, 70, 85, 90, 100]
# copy the scores from our grouped DataFrame, DataFrames manipulate
# in place if we do not explicitly copy them.
scores = most_recent_inspection_results['Score'].copy()
score_transform = most_recent_inspection_results.copy()
# built-in pandas function which assigns each data point in
# 'scores' to an interval in 'bins' with labels of 'descriptions'
discretized_scores = pd.cut(scores, bins, labels=descriptions)
Now let's apply these transformations to the old DataFrame.
In [ ]:
# transform the original DataFrame's "Score" column with the new descriptions
score_transform['Score'] = discretized_scores
score_transform[['name', 'date','Score']].head(15)
By quantizing the scores of the restaurant inspections, we can get a better qualitative insight into the ratings. Let us compare this new distribution of quantized scores to the raw numeric values.
Let's plot a histogram of the discretized scores to visualize where the restaurant scores fall within our description bins.
In [ ]:
# create a figure with 2 subplots
fig = figure(figsize=[30,7])
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)
# count each occurrence of the descriptions in the 'Score' column,
# and reverse the result so 'Poor' is left-most and 'Good' right-most
counts = score_transform['Score'].value_counts()[::-1]
bar_ax = counts.plot(kind='bar', ax=ax2)
# decorate the plot and axis with text
ax2.set_title("Restaurant Inspections (%i total)" % sum(counts))
ax2.set_ylabel("Counts")
ax2.set_xlabel("Description")
# let us add some labels to each bar
for x, y in enumerate(counts):
    bar_ax.text(x + 0.5, y + 200, '%.f' % y, ha='left', va='top')
# plot the original raw scores (same graph as earlier)
most_recent_inspection_results['Score'].hist(bins=100, ax=ax1)
# create x-axis ticks at even numbers from 40 to 98
ax1.set_xticks(np.arange(40, 100, 2))
# add a title to the current figure, our histogram
ax1.set_title("Histogram of Inspection Scores")
ax1.set_ylabel("Counts")
ax1.set_xlabel("Score")
savefig('histograms.png', bbox_inches=0)
We can see that a majority of restaurants are adequate or good; according to the quantiles, only 25% have scores less than 88. While the histogram of the numeric scores gives us a more granular look at the data, it can be quite difficult to derive value from it. Is an 86 a filthy/unhealthy restaurant, or did it simply forget a few nuanced inspection rules? The Score Legend provides us a mapping from a raw score to a meaningful value, similar to the scaling of standardized test raw scores.
If we are not satisfied with our evaluation, we need to iterate on our approach:
We'll plot the histogram once again, this time marking the quantiles to provide a more meaningful representation of the scores.
In [ ]:
# create a matplotlib figure with size [15,7]
figure(figsize=[15,7])
# pandas built-in histogram function automatically distributes and counts bin values
h = most_recent_inspection_results['Score'].hist(bins=100)
# summary statistics vertical lines
axvline(x=mean,color='red',ls='solid', lw="3", label="mean")
axvline(x=median,color='green',ls='solid', lw="3", label="median")
axvline(x=mode[0][0],color='orange',ls='solid', lw="3", label="mode")
# 25th quantile
axvline(x=summary['25%'],color='maroon',ls='dashed', lw="3", label="25th")
axvspan(40, summary['25%'], facecolor="maroon", alpha=0.3)
# 75th quantile
axvline(x=summary['75%'],color='black',ls='dashed', lw="3", label="75th")
axvspan(40, summary['75%'], facecolor="yellow", alpha=0.3 )
# create x-axis ticks at even numbers from 40 to 102
xticks(np.arange(40, 104, 2))
# add legend to graph
legend(loc=2)
# add a title to the current figure, our histogram
h.set_title("Histogram of Inspection Scores with Quantiles")
savefig('quantiles.png', bbox_inches=0, transparent=True)
print summary
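Next, let's bring in the individual violation records and join them to the most recent inspection results, so we can see which specific violations lie behind the scores.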
In [ ]:
import re as re
import collections as c
import pprint as pp
# first let us form a 'big table' by joining the violations to the most recent inspection scores
file="violations_plus.csv"
df_violations = pd.read_csv(file)
violation_table = most_recent_inspection_results.merge(df_violations, on=['business_id','date'])
violation_table.head()
Let's plot a bar chart of violation counts to see how the violations are distributed across the different violation types.
In [ ]:
figure(figsize=[18,7])
violation_hist = violation_table['description'].value_counts().plot(kind="bar")
We might also be curious to see what violations a restaurant can have and still get a perfect score, so we'll filter the violation table for perfect scores.
In [ ]:
perfect_scores = violation_table[violation_table['Score'] == 100]
perfect_scores
Then we'll plot a new violation histogram based on the filtered dataset.
In [ ]:
figure(figsize=[18,7])
violation_hist = perfect_scores['description'].value_counts().plot(kind="bar")
Finally, let's look at how health violations are distributed across the different score description bins. We'll use the same discretizing process used earlier to create the score descriptions.
In [ ]:
# let us bin the health violations using the city's quantization
descriptions = ['Poor', 'Needs Improvement', 'Adequate', 'Good']
bins = [-1, 70, 85, 90, 100]
# copy the scores from our grouped DataFrame, DataFrames manipulate
# in place if we do not explicitly copy them.
scores = violation_table['Score'].copy()
violation_transform = violation_table.copy()
# built-in pandas function which assigns each data point in
# 'scores' to an interval in 'bins' with labels of 'descriptions'
discretized_scores = pd.cut(scores, bins, labels=descriptions)
violation_transform["Scores"] = discretized_scores
In [ ]:
grouped = violation_transform.groupby('Scores')
We can now find the most common violations for each group by applying a function that counts and normalizes the offense types across the description groups.
In [ ]:
# a function that takes a DataFrame and returns its top 10 violations as percentages
def common_offenses(df):
    return pd.DataFrame(df['description'].value_counts(normalize=True) * 100).head(10)
top_offenses = grouped.apply(common_offenses)
top_offenses
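Since top_offenses comes back with a two-level index (the score description, then the violation description), we can pull out a single group for a closer look. An illustrative usage, assuming the discretization above actually produced a 'Poor' group:
In [ ]:
# percentage breakdown of the top violations for 'Poor' restaurants only
top_offenses.loc['Poor']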