Load packages to support analysis.
In [95]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set(style="white",rc={"figure.figsize": (8, 4)})
import urllib2
Load the data from the URL, then add some additional information we'll need later. We'll code the day of the week on which the violation occurred as violation_date_weekday, and the number of days elapsed between the date of the violation and its closure as issue_lag.
In [73]:
data = pd.read_csv(urllib2.urlopen('http://forever.codeforamerica.org/fellowship-2015-tech-interview/Violations-2012.csv'),parse_dates=['violation_date','violation_date_closed'])
# Get the day of the week
data['violation_date_weekday'] = data['violation_date'].apply(lambda x:x.weekday())
# Get the number of days elapsed from date to closure, convert from numpy.timedelta64 to int
data['issue_lag'] = (data['violation_date_closed'] - data['violation_date']).apply(lambda x:x/np.timedelta64(1,'D'))
# Summarize
data.head()
Out[73]:
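As an aside, the same two columns can be derived without apply by using pandas' vectorized .dt accessor. This is only a sketch of an alternative to the cell above (it assumes pandas 0.15 or later, where the .dt accessor is available) and should produce equivalent values.
In [ ]:
# Vectorized alternatives to the apply/lambda versions above (assumes pandas >= 0.15)
data['violation_date_weekday'] = data['violation_date'].dt.weekday
data['issue_lag'] = (data['violation_date_closed'] - data['violation_date']).dt.days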
Now let's plot the cumulative number of violations in each violation_category over time. The Animals and Pests category has the most total violations by the end of the year, followed by Garbage and Refuse.
In [96]:
data_gb_category = data.groupby('violation_category')
summary_df = pd.DataFrame(index=pd.date_range(start=data['violation_date'].min(),end=data['violation_date'].max()))
for group in data_gb_category.groups.keys():
    group_data = data_gb_category.get_group(group)
    # Count the violations on each date for this category (a Series indexed by date)
    summary_df[group] = group_data.groupby('violation_date').size()
summary_df.fillna(value=0).cumsum().plot(colormap='jet')
Out[96]:
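For reference, the same cumulative counts can be built without the explicit loop by counting violations per date within each category and unstacking the categories into columns. This is a sketch that assumes only the columns loaded above.
In [ ]:
# Count violations per (date, category), pivot categories into columns,
# fill days with no violations with zero, and plot the running totals.
full_range = pd.date_range(start=data['violation_date'].min(), end=data['violation_date'].max())
daily_counts = (data.groupby(['violation_date', 'violation_category'])['violation_id']
                    .count()
                    .unstack('violation_category')
                    .reindex(full_range)
                    .fillna(0))
daily_counts.cumsum().plot(colormap='jet')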
We can examine whether there are any differences in violation_category by day of the week. Each violation category has a set of five colored bars, one for each day of the work week (dark green (0) is Monday, red (4) is Friday). Violations are not distributed evenly across the week: several categories peak mid-week. This suggests several possible mechanisms. First, people may be more sensitive to violations in the middle of the week than at the beginning or end. Second, enforcement of regulations may drop off in the middle of the week, leading to more reporting of violations. Third, we have no base rates of overall activity in the city, so observed violations may be a constant fraction of activity; if activity peaks mid-week, violations will as well.
In [105]:
sns.factorplot('violation_category',hue='violation_date_weekday',data=data,palette='Spectral_r')
plt.xticks(plt.xticks()[0],rotation=90)
plt.xlabel('')
plt.ylabel('Violations')
Out[105]:
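To put rough numbers on the mid-week pattern, we can tabulate the share of each category's violations that falls on each weekday. This is a quick check rather than part of the figure above, and it relies only on the columns created earlier.
In [ ]:
# Fraction of each category's violations occurring on each weekday (0 = Monday, 4 = Friday)
weekday_counts = pd.crosstab(data['violation_category'], data['violation_date_weekday'])
weekday_share = weekday_counts.div(weekday_counts.sum(axis=1), axis=0)
weekday_share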
Finally, we look at the distribution of the issue_lag variable, which captures the time elapsed between the reporting of a violation and its closure. Some violations may be easier to address and so are resolved in less time, but there may also be systematic biases toward closing some kinds of violations quickly and others slowly. Chemical Hazards take upwards of 80 days on average to close, while Biohazards take less than 20 days on average to close.
In [107]:
sns.factorplot('violation_category','issue_lag',data=data,kind='bar')
plt.xticks(plt.xticks()[0],rotation=90)
plt.ylabel('Time lag (days)')
plt.xlabel('')
Out[107]:
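The averages behind the bar chart can also be tabulated directly. A sketch is below; sort_values assumes pandas 0.17 or later, and violations without a closure date, if any, are dropped from these statistics because their issue_lag is missing.
In [ ]:
# Mean, median, and count of days-to-close per category
lag_summary = data.groupby('violation_category')['issue_lag'].agg(['mean', 'median', 'count'])
lag_summary.sort_values('mean', ascending=False)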