Load packages to support analysis.


In [95]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set(style="white", rc={"figure.figsize": (8, 4)})
import urllib2  # Python 2 standard library module, used to fetch the CSV over HTTP

Load the data from the URL, then add some additional information we'll need later. We'll code the day of the week the violation occurred as violation_date_weekday and the number of days elapsed between the date of the violation and the closure of the violation as issue_lag.


In [73]:
data = pd.read_csv(
    urllib2.urlopen('http://forever.codeforamerica.org/fellowship-2015-tech-interview/Violations-2012.csv'),
    parse_dates=['violation_date', 'violation_date_closed'])

# Get the day of the week (Monday=0 ... Sunday=6)
data['violation_date_weekday'] = data['violation_date'].apply(lambda x: x.weekday())

# Get the number of days elapsed from violation date to closure,
# converting from numpy.timedelta64 to a number of days
data['issue_lag'] = (data['violation_date_closed'] - data['violation_date']).apply(
    lambda x: x / np.timedelta64(1, 'D'))

# Summarize the first few rows
data.head()


Out[73]:
   violation_id  inspection_id     violation_category violation_date violation_date_closed                        violation_type  violation_date_weekday  issue_lag
0        204851         261019     Garbage and Refuse     2012-01-03            2012-02-02                   Refuse Accumulation                       1         30
1        204852         261019  Unsanitary Conditions     2012-01-03            2012-02-02  Unsanitary conditions, not specified                       1         30
2        204853         261023     Garbage and Refuse     2012-01-03            2012-01-17                   Refuse Accumulation                       1         14
3        204854         261023     Garbage and Refuse     2012-01-03            2012-01-17                   Refuse Accumulation                       1         14
4        204858         261029     Garbage and Refuse     2012-01-03            2012-03-12                   Refuse Accumulation                       1         69
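
As an aside, with a newer pandas (roughly 0.15 or later, where the .dt accessor exists) the same two derived columns can be computed without apply(). A minimal sketch under that assumption:


In [ ]:
# Equivalent, vectorized derivation of the two columns (assumes pandas >= 0.15)
data['violation_date_weekday'] = data['violation_date'].dt.weekday              # Monday=0 ... Sunday=6
data['issue_lag'] = (data['violation_date_closed'] - data['violation_date']).dt.days  # whole days to closure
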

Now let's plot the cumulative number of violations across the different violation_category types over time. The Animals and Pests category has the most total violations by the end of the year, followed by Garbage and Refuse.


In [96]:
# Count violations per day for each category, then accumulate over the year
data_gb_category = data.groupby('violation_category')
summary_df = pd.DataFrame(index=pd.date_range(start=data['violation_date'].min(),
                                              end=data['violation_date'].max()))
for group in data_gb_category.groups.keys():
    group_data = data_gb_category.get_group(group)
    summary_df[group] = group_data.groupby('violation_date')['violation_id'].count()

# Days with no violations count as zero; plot the running total for each category
summary_df.fillna(value=0).cumsum().plot(colormap='jet')


Out[96]:
<matplotlib.axes.AxesSubplot at 0x27098e80>
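
The per-category daily counts can also be built in a single step with pivot_table rather than a groupby loop. A minimal sketch, assuming the column names above and pandas 0.14+ keyword names (not run against the live data):


In [ ]:
# Sketch: daily violation counts per category via pivot_table, then the same cumulative plot
daily_counts = data.pivot_table(index='violation_date',
                                columns='violation_category',
                                values='violation_id',
                                aggfunc='count')
full_range = pd.date_range(start=data['violation_date'].min(),
                           end=data['violation_date'].max())
daily_counts.reindex(full_range).fillna(0).cumsum().plot(colormap='jet')
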

We can examine whether there are any differences in violation_category by day of the week. Each violation category has a set of 5 colored bars, one for each day of the work week (dark green (0) is Monday, red (4) is Friday). Violations are not evenly distributed throughout the week: several categories show mid-week peaks. This suggests several possible mechanisms. First, people may be more sensitive to violations in the middle of the week than at the beginning or end of it. Second, enforcement of regulations may drop off in the middle of the week, leading to more violations being reported. Third, we have no base rates of overall activity in the city, so observed violations may be a constant fraction of activity; if activity peaks mid-week, violations will as well.


In [105]:
# Plot the number of violations in each category, broken out by day of the week
sns.factorplot('violation_category', hue='violation_date_weekday', data=data, palette='Spectral_r')
plt.xticks(plt.xticks()[0], rotation=90)  # rotate the category labels so they stay legible
plt.xlabel('')
plt.ylabel('Violations')


Out[105]:
<matplotlib.text.Text at 0x2b6c7ef0>
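
The raw counts behind this plot can be checked with a simple crosstab of category against weekday; a minimal sketch:


In [ ]:
# Sketch: table of violation counts by category (rows) and weekday (columns)
pd.crosstab(data['violation_category'], data['violation_date_weekday'])
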

Finally, we look at the distribution of the issue_lag variable, which captures the time elapsed between the reporting of a violation and its closure. Some violations may be easier to address and so are closed in less time, but there may also be systematic biases toward resolving some kinds of violations quickly and others more slowly. Chemical Hazards take upwards of 80 days on average to close, while Biohazards take fewer than 20 days on average.


In [107]:
# Plot the average closure lag for each category, with category labels rotated for legibility
sns.factorplot('violation_category', 'issue_lag', data=data, kind='bar')
plt.xticks(plt.xticks()[0], rotation=90)
plt.ylabel('Time lag (days)')
plt.xlabel('')


Out[107]:
<matplotlib.text.Text at 0x2bb02390>
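
The averages behind this plot can also be pulled out directly with a groupby; a minimal sketch:


In [ ]:
# Sketch: mean closure lag (in days) for each violation category
data.groupby('violation_category')['issue_lag'].mean()
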
