Load packages to support analysis.


In [95]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set(style="white", rc={"figure.figsize": (8, 4)})
import urllib2  # Python 2 standard library module, used to fetch the CSV over HTTP

Load the data from the URL, then add some additional information we'll need later. We'll code the day of the week the violation occurred as violation_date_weekday and the number of days elapsed between the date of the violation and the closure of the violation as issue_lag.


In [73]:
data = pd.read_csv(
    urllib2.urlopen('http://forever.codeforamerica.org/fellowship-2015-tech-interview/Violations-2012.csv'),
    parse_dates=['violation_date', 'violation_date_closed'])

# Get the day of the week (Monday=0 ... Sunday=6)
data['violation_date_weekday'] = data['violation_date'].apply(lambda x: x.weekday())

# Get the number of days elapsed from violation date to closure,
# converting from numpy.timedelta64 to a number of days
data['issue_lag'] = (data['violation_date_closed'] - data['violation_date']).apply(
    lambda x: x / np.timedelta64(1, 'D'))

# Summarize the first few rows
data.head()


Out[73]:
   violation_id  inspection_id     violation_category violation_date violation_date_closed                        violation_type  violation_date_weekday  issue_lag
0        204851         261019     Garbage and Refuse     2012-01-03            2012-02-02                   Refuse Accumulation                       1         30
1        204852         261019  Unsanitary Conditions     2012-01-03            2012-02-02  Unsanitary conditions, not specified                       1         30
2        204853         261023     Garbage and Refuse     2012-01-03            2012-01-17                   Refuse Accumulation                       1         14
3        204854         261023     Garbage and Refuse     2012-01-03            2012-01-17                   Refuse Accumulation                       1         14
4        204858         261029     Garbage and Refuse     2012-01-03            2012-03-12                   Refuse Accumulation                       1         69
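
As an aside, with a newer pandas (roughly 0.15 or later, where the .dt accessor exists) the same two derived columns can be computed without apply(). A minimal sketch under that assumption:


In [ ]:
# Equivalent, vectorized derivation of the two columns (assumes pandas >= 0.15)
data['violation_date_weekday'] = data['violation_date'].dt.weekday              # Monday=0 ... Sunday=6
data['issue_lag'] = (data['violation_date_closed'] - data['violation_date']).dt.days  # whole days to closure
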

Now let's plot the cumulative number of violations across the different violation_category types over time. The Animals and Pests category has the most total violations by the end of the year, followed by Garbage and Refuse.


In [96]:
# Count violations per day for each category, then accumulate over the year
data_gb_category = data.groupby('violation_category')
summary_df = pd.DataFrame(index=pd.date_range(start=data['violation_date'].min(),
                                              end=data['violation_date'].max()))
for group in data_gb_category.groups.keys():
    group_data = data_gb_category.get_group(group)
    summary_df[group] = group_data.groupby('violation_date')['violation_id'].count()

# Days with no violations count as zero; plot the running total for each category
summary_df.fillna(value=0).cumsum().plot(colormap='jet')


Out[96]:
<matplotlib.axes.AxesSubplot at 0x27098e80>
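
The per-category daily counts can also be built in a single step with pivot_table rather than a groupby loop. A minimal sketch, assuming the column names above and pandas 0.14+ keyword names (not run against the live data):


In [ ]:
# Sketch: daily violation counts per category via pivot_table, then the same cumulative plot
daily_counts = data.pivot_table(index='violation_date',
                                columns='violation_category',
                                values='violation_id',
                                aggfunc='count')
full_range = pd.date_range(start=data['violation_date'].min(),
                           end=data['violation_date'].max())
daily_counts.reindex(full_range).fillna(0).cumsum().plot(colormap='jet')
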

We can examine whether there are any differences in violation_category by day of the week. Each violation category has a set of 5 colored bars, one for each day of the work week (dark green (0) is Monday, red (4) is Friday). Violations are not evenly distributed throughout the week: several categories show mid-week peaks. This suggests several possible mechanisms. First, people may be more sensitive to violations in the middle of the week than at the beginning or end of it. Second, enforcement of regulations may drop off in the middle of the week, leading to more violations being reported. Third, we have no base rates of overall activity in the city, so observed violations may be a constant fraction of activity; if activity peaks mid-week, violations will as well.


In [105]:
# Plot the number of violations in each category, broken out by day of the week
sns.factorplot('violation_category', hue='violation_date_weekday', data=data, palette='Spectral_r')
plt.xticks(plt.xticks()[0], rotation=90)  # rotate the category labels so they stay legible
plt.xlabel('')
plt.ylabel('Violations')


Out[105]:
<matplotlib.text.Text at 0x2b6c7ef0>
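
The raw counts behind this plot can be checked with a simple crosstab of category against weekday; a minimal sketch:


In [ ]:
# Sketch: table of violation counts by category (rows) and weekday (columns)
pd.crosstab(data['violation_category'], data['violation_date_weekday'])
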

Finally, we look at the distribution of the issue_lag variable, which captures the time elapsed between the reporting of a violation and its closure. Some violations may be easier to address and so are closed in less time, but there may also be systematic biases toward resolving some kinds of violations quickly and others more slowly. Chemical Hazards take upwards of 80 days on average to close, while Biohazards take fewer than 20 days on average.


In [107]:
# Plot the average closure lag for each category, with category labels rotated for legibility
sns.factorplot('violation_category', 'issue_lag', data=data, kind='bar')
plt.xticks(plt.xticks()[0], rotation=90)
plt.ylabel('Time lag (days)')
plt.xlabel('')


Out[107]:
<matplotlib.text.Text at 0x2bb02390>
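
The averages behind this plot can also be pulled out directly with a groupby; a minimal sketch:


In [ ]:
# Sketch: mean closure lag (in days) for each violation category
data.groupby('violation_category')['issue_lag'].mean()
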
