Modified from an IPython Notebook created by Jonathan Dinu.

Now, let's use DataFrames to explore a dataset containing inspection records for San Francisco restaurants. We'll be using a publicly available dataset from the Department of Public Health. We will explore this data to map the cleanliness of the city, and get a better perspective on the relative meaning of the scores by looking at the statistics of the data. This notebook can be downloaded (with associated data) from its repo.

The first step of the data analysis process is to define the problem we want to address. To do so let us review what we have set out to accomplish and begin exploring questions we want answered.

## How clean are SF restaurants?

It is often best to arrive at a simple yet illuminating question to give you direction. Of course, there are a number of sub-questions that relate to our overarching problem; we can address these when we determine our goals for the analysis.

Now that we have a problem we hope to solve, let us begin to quantify our analysis. Since our *Problem Statement* is often qualitative and broad, we can ask further questions to better define what we hope to achieve.

How does an individual restaurant's score compare to the whole/aggregate of SF?

Are SF's inspections better or worse than other cities?

If a restaurant has not yet been inspected, can we approximate/predict what score it will receive?

To get a better understanding of the data, we can begin examining it statistically to get a macroscopic look at its distribution. This part of the tutorial will use much of the powerful built-in functionality of NumPy, SciPy, matplotlib, and pandas. If you want to get more experience with these, there are great resources and tutorials covering these libraries in much more depth than I will here. I highly recommend taking a look at them if this analysis interests you even in the least bit.

In [ ]:

```python
%pylab inline
```

In [ ]:

```python
# import pylab to provide scientific Python libraries (NumPy, SciPy, Matplotlib)
%pylab --no-import-all
# import the Image display module
from IPython.display import Image
```

First, load the data into pandas dataframes

In [ ]:

```python
# import the pandas library, which provides an R-like DataFrame environment for Python.
# if you do not have it installed: pip install pandas
import pandas as pd
import scipy as sp
from scipy import stats

# store the relevant file paths in variables since we may use them frequently
businesses = 'businesses.csv'
inspections = 'inspections.csv'

# load each file into a pandas DataFrame; pandas automatically
# converts the first line into a header for the columns
df_business = pd.read_csv(businesses)
df_inspection = pd.read_csv(inspections)
```

In [ ]:

```python
df_inspection.head(10)
```

In [ ]:

```python
# join the two DataFrames on the 'business_id' column
big_table = df_business.merge(df_inspection, on='business_id')

# the joined DataFrame's columns are frame1 columns + frame2 columns --
# in our case, the concatenation of df_business and df_inspection columns
print('Business:\t' + str(df_business.columns) + '\n')
print('Inspection:\t' + str(df_inspection.columns) + '\n')
print('Big Table:\t' + str(big_table.columns))

# iloc allows for succinct row and column indexing
big_table.iloc[:10, :4]
```

**Now that we have our joined data, we can start exploring it**

In [ ]:

```python
grouped_business = big_table.groupby('business_id')
```

Then we can define a new DataFrame by applying a function to the old DataFrame.

In [ ]:

```python
# a function that takes a DataFrame and returns the row with the newest date
def most_recent(df, column='date'):
    return df.sort_values(by=column)[-1:]

# the input to most_recent is the DataFrame of each group, in this case
# all of the rows and columns for each business (grouped on business_id)
most_recent_inspection_results = grouped_business.apply(most_recent)

# we applied the most_recent function to extract the row of the
# DataFrame with the most recent inspection
most_recent_inspection_results.head()
```

In [ ]:

```python
r = most_recent_inspection_results
# drop rows with (0, 0) or missing coordinates before exporting
zero_filtered = r[(r['latitude'] != 0) & (r['longitude'] != 0)]
filtered = zero_filtered.dropna(subset=['latitude', 'longitude'])[['business_id', 'name', 'address', 'Score', 'date', 'latitude', 'longitude']]
filtered.to_csv('geolocated_rest.csv', index=False)
```

In [ ]:

```python
# create a matplotlib figure with size [15,7]
figure(figsize=[15,7])
# pandas' built-in histogram function automatically distributes and counts bin values
h = most_recent_inspection_results['Score'].hist(bins=100)
# create x-axis ticks at even numbers starting at 40
xticks(np.arange(40, 100, 2))
# add a title to the current figure, our histogram
h.set_title("Histogram of Inspection Scores")
```

Now that we have a basic idea of the distribution, let us look at some more interesting statistics

In [ ]:

```python
scores = most_recent_inspection_results['Score']
mean = scores.mean()
median = scores.median()

# compute descriptive summary statistics of the inspection scores
summary = scores.describe()
mode = sp.stats.mode(scores)
skew = scores.skew()

# compute quantiles
ninety = scores.quantile(0.9)
eighty = scores.quantile(0.8)
seventy = scores.quantile(0.7)
sixty = scores.quantile(0.6)

print("Skew: " + str(skew))
print("90%: " + str(ninety))
print("80%: " + str(eighty))
print("70%: " + str(seventy))
print("60%: " + str(sixty))
print(summary)
```

Since we have explored our data and have a better idea of its nature, we can begin to devise a plan to answer our questions. This is usually the most iterative part of the entire process: as we learn more about our data we modify our approach, and as we modify our solutions we must re-examine our data.

How does an individual restaurant's score compare to the whole/aggregate of SF?

Are SF's inspections better or worse than other cities?

If a restaurant has not yet been inspected, can we approximate/predict what score it will receive?

Collect summary statistics (mean, median, standard deviation, etc.) about distribution of scores.

Acquire data on inspection scores for other cities, compare distribution of cities.

Perform a linear regression on historic data on past inspections combined with scores from other 'similar' restaurants.
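The regression idea above can be sketched with ordinary least squares. Everything in this sketch is illustrative: the feature names (`past_score`, `violation_count`) and the data itself are invented stand-ins for what the real inspection history would supply.

```python
import numpy as np

# hypothetical features: a restaurant's previous score and its violation count
rng = np.random.RandomState(0)
n = 200
past_score = rng.uniform(70, 100, n)
violation_count = rng.poisson(3, n)

# design matrix with an intercept column
X = np.column_stack([np.ones(n), past_score, violation_count])

# synthetic target: the next score depends linearly on the features plus noise
y = 20 + 0.8 * past_score - 1.5 * violation_count + rng.normal(0, 2, n)

# least-squares fit and predictions for the same restaurants
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
predicted = X @ coef
```

With real data, `predicted` for a not-yet-inspected restaurant would come from the features of 'similar' restaurants rather than a synthetic target.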

This is the step where derivative values are often calculated, including **summary statistics**, **transformations** on the data, and **correlations**. There also is a bit of traditional **data mining** involved as most machine learning occurs in the solutions and metrics stages (in our formulation). We could even go so far as to say that the results of predictive models are simply additional metrics: the **probability** of defaulting on a loan, the **cluster** a new product belongs in, or the **score** of a restaurant that hasn't been inspected yet.

*The purpose of this part of the process is to calculate the information you need to begin evaluating and testing your solutions and hypotheses.*
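The city-comparison idea above could be prototyped with a two-sample Kolmogorov-Smirnov test. Since no other city's data ships with this notebook, both samples below are synthetic stand-ins for inspection scores:

```python
import numpy as np
from scipy import stats

# synthetic stand-ins: left-skewed scores clustered near 100,
# with the hypothetical "other city" skewed somewhat lower
rng = np.random.RandomState(1)
sf_scores = np.clip(100 - rng.gamma(shape=2.0, scale=4.0, size=500), 0, 100)
other_scores = np.clip(100 - rng.gamma(shape=2.0, scale=6.0, size=500), 0, 100)

# two-sample KS test: the statistic is the maximum distance between the two
# empirical CDFs; a small p-value suggests the distributions differ
ks_stat, p_value = stats.ks_2samp(sf_scores, other_scores)
```

The same call would work unchanged on a real second city's scores once acquired.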

In [ ]:

```python
# first we need to discretize the numerical values; this can be
# thought of as converting a continuous variable into a categorical one.
descriptions = ['Poor', 'Needs Improvement', 'Adequate', 'Good']
bins = [-1, 70, 85, 90, 100]

# copy the scores from our grouped DataFrame; DataFrames are manipulated
# in place if we do not explicitly copy them.
scores = most_recent_inspection_results['Score'].copy()
score_transform = most_recent_inspection_results.copy()

# built-in pandas function which assigns each data point in
# 'scores' to an interval in 'bins' with labels from 'descriptions'
discretized_scores = pd.cut(scores, bins, labels=descriptions)
```

Now let's apply these transformations to the old DataFrame.

In [ ]:

```python
# transform the original DataFrame's 'Score' column with the new descriptions
score_transform['Score'] = discretized_scores
score_transform[['name', 'date', 'Score']].head(15)
```

In [ ]:

```python
# create a figure with 2 subplots
fig = figure(figsize=[30,7])
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

# count each occurrence of the descriptions in the 'Score' column,
# and reverse the result so 'Poor' is leftmost and 'Good' rightmost
counts = score_transform['Score'].value_counts()[::-1]
plt = counts.plot(kind='bar', ax=ax2)

# decorate the plot and axis with text
ax2.set_title("Restaurant Inspections (%i total)" % sum(counts))
ax2.set_ylabel("Counts")
ax2.set_xlabel("Description")

# add a label to each bar
for x, y in enumerate(counts):
    plt.text(x + 0.5, y + 200, '%.f' % y, ha='left', va='top')

# plot the original raw scores (same graph as earlier)
most_recent_inspection_results['Score'].hist(bins=100, ax=ax1)
# create x-axis ticks at even numbers starting at 40
ax1.set_xticks(np.arange(40, 100, 2))

# add a title to the current figure, our histogram
ax1.set_title("Histogram of Inspection Scores")
ax1.set_ylabel("Counts")
ax1.set_xlabel("Score")
savefig('histograms.png', bbox_inches='tight')
```

The majority of restaurants are rated **adequate** or **good**; according to the quantiles, only **25%** have scores less than **88**. While the histogram of the numeric scores gives us a more granular look at the data, it can be quite difficult to derive value from it. Is an **86** a filthy/unhealthy restaurant, or did it simply forget a few nuanced inspection rules? The Score Legend provides us a mapping from a raw score to a meaningful value, similar to the scaling of standardized test raw scores.
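As a quick sanity check of that mapping, `pd.cut` with the notebook's bins answers the question about an 86 directly:

```python
import pandas as pd

# map a few raw scores through the same bins used earlier in the notebook
descriptions = ['Poor', 'Needs Improvement', 'Adequate', 'Good']
bins = [-1, 70, 85, 90, 100]
sample_scores = [66, 86, 95]
print(list(pd.cut(sample_scores, bins, labels=descriptions)))
# prints ['Poor', 'Adequate', 'Good']
```

So an 86 is merely 'Adequate' under the legend, one point above the 'Needs Improvement' boundary.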

If we are not satisfied with our evaluation, we need to iterate on our approach:

- Do I need more/better data?
- Do I need to try a different proposed solution?
- Do I need to calculate different metrics?

In [ ]:

```python
# create a matplotlib figure with size [15,7]
figure(figsize=[15,7])
# pandas' built-in histogram function automatically distributes and counts bin values
h = most_recent_inspection_results['Score'].hist(bins=100)

# summary statistic vertical lines
axvline(x=mean, color='red', ls='solid', lw=3, label="mean")
axvline(x=median, color='green', ls='solid', lw=3, label="median")
axvline(x=mode[0][0], color='orange', ls='solid', lw=3, label="mode")

# 25th percentile
axvline(x=summary['25%'], color='maroon', ls='dashed', lw=3, label="25th")
axvspan(40, summary['25%'], facecolor="maroon", alpha=0.3)

# 75th percentile
axvline(x=summary['75%'], color='black', ls='dashed', lw=3, label="75th")
axvspan(40, summary['75%'], facecolor="yellow", alpha=0.3)

# create x-axis ticks at even numbers starting at 40
xticks(np.arange(40, 104, 2))
# add a legend to the graph
legend(loc=2)
# add a title to the current figure, our histogram
h.set_title("Histogram of Inspection Scores with Quantiles")
savefig('quantiles.png', bbox_inches='tight', transparent=True)
print(summary)
```

In [ ]:

```python
import re
import collections as c
import pprint as pp

# form a 'big table' by joining the violations to the most recent inspection scores
violations_file = 'violations_plus.csv'
df_violations = pd.read_csv(violations_file)
violation_table = most_recent_inspection_results.merge(df_violations, on=['business_id', 'date'])
violation_table.head()
```

Let's plot a histogram for the violations data to see how the violations are distributed

In [ ]:

```python
figure(figsize=[18,7])
violation_hist = violation_table['description'].value_counts().plot(kind="bar")
```

In [ ]:

```python
figure(figsize=[18,7])
perfect_scores = violation_table[violation_table['Score'] == 100]
perfect_scores
```

Then we'll plot a new violation histogram based on the filtered dataset.

In [ ]:

```python
violation_hist = perfect_scores['description'].value_counts().plot(kind="bar")
```

In [ ]:

```python
# bin the health violations using the city's quantization
descriptions = ['Poor', 'Needs Improvement', 'Adequate', 'Good']
bins = [-1, 70, 85, 90, 100]

# copy the scores from our grouped DataFrame; DataFrames are manipulated
# in place if we do not explicitly copy them.
scores = violation_table['Score'].copy()
violation_transform = violation_table.copy()

# built-in pandas function which assigns each data point in
# 'scores' to an interval in 'bins' with labels from 'descriptions'
discretized_scores = pd.cut(scores, bins, labels=descriptions)
violation_transform["Scores"] = discretized_scores
```

In [ ]:

```python
grouped = violation_transform.groupby('Scores')
```

In [ ]:

```python
# a function that takes a DataFrame and returns its top ten violations
def common_offenses(df):
    return pd.DataFrame(df['description'].value_counts(normalize=True) * 100).head(10)

top_offenses = grouped.apply(common_offenses)
top_offenses
```