Heat and Violence in Chicago

Can we attribute a fall in violent crime in Chicago to its new conceal and carry laws? This was the argument many conservative blogs and news outlets were making in early April after statistics were released showing a marked drop in the murder rate in Chicago. These articles attributed "Chicago's first-quarter murder total [hitting] its lowest number since 1958" to the deterrent effects of Chicago's concealed and carry permits being issued in late February (RedState). Other conservative outlets latched onto the news to argue the policy "is partly responsible for Chicago's across-the-board drop in the crime" (TheBlaze) or that the policy contributed to the "murder rate promptly [falling] to 1958 levels" (TownHall).

Several articles hedged about the causal direction of any relationship and pointed out that this change is hard to separate from falling general crime rates as well as the atrocious winter weather this season (PJMedia, Wonkette, HuffPo) The central claim here is whether the adoption of the conceal and carry policy in March 2014 contributed to significant changes in crime rates rather than other social, historical, or environmental factors.

However, an April 7 feature story by David Bernstein and Noah Isackson in Chicago magazine found substantial evidence of violent crimes like homicides, robberies, burglaries, and assaults being reclassified, downgraded to more minor crimes, and even closed as noncriminal incidents. They argue that after Police Superintendent Garry McCarthy arrived in May 2011, the drop in crime has improbably plummeted in spite of high unemployment and significant contraction in the Chicago Police Department's beat cops. An audit by Chicago's inspector general into these crime numbers suggests assaults and batteries may have been underreported by more than 24%. This raises a second question: can we attribute the fall in violent crime in Chicago to systematic underreporting of criminal statistics?

In this post, I do four things:

First, I demonstate the relationship crime has with environmental factors like temperature as well as temporal factors like the hour of the day and day of the week. I use a common technique in signal processing to identify that criminal activity not only follows an annual pattern, but also patterns by day of the week.



In [381]:

    
Image(filename='homicides_month_hour_heatmap.png')









    Out[381]:

Second, I estimate a simple statistical model based on the findings above. This model combines temperature, the day of the week, the week of the year, and longer-term historical trends and despite its simplicity (relative to more advanced types of time series models that could be estimated), does a very good job explaining the dynamics of crime in Chicago over the past 13 years.



In [382]:

    
Image(filename='personal_model_comparison.png')









    Out[382]:

Third, I use this statistical model to make predictions about crime rates for the rest of 2014. If there's a significant fall-off in violent crime following the introduction of the conceal and carry policy in March 2014, this could be evidence of its success as a deterrent (or that this is a bad model). But if the actual crime data matches the model's forecasted trends, it suggests the new conceal and carry policy has had no effect. There are no findings here as yet, but I expect as the data comes in there will be no significant changes after March 2014.



In [383]:

    
Image(filename='2014_homicide_predictions.png')









    Out[383]:

Fourth, I find evidence of substantial discrepancies in the reporting some crime data since 2013. This obviously imperils the findings of the analyses done above, but also replicates the findings reported by Bernstein and Isackson. The statistical model above expected that property crimes such as arson, burglary, theft, and robbery should follow a particular pattern, which the observed data significantly deviates from after 2013. I perform some additional analyses to uncover which crimes and reporting districts are driving this discrepancy as well as how severe this discrepancy is.



In [384]:

    
Image(filename='personal_crime_rates_down.png')









    Out[384]:

Start your kernels!



In [271]:

    
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import seaborn as sb
from collections import Counter
from IPython.core.display import Image
import scipy.stats as stats

Scrape data

This will scrape historical weather data from Weather Underground between 2001 and 2014 and save each year's data as a CSV files.



In [ ]:

    
for i in range(2001,2015):
    with open('{0}.csv'.format(str(i)),'wb') as f:
        data = urllib2.urlopen('http://www.wunderground.com/history/airport/ORD/{0}/1/1/CustomHistory.html?dayend=1&monthend=1&yearend={1}&format=1'.format(str(i),str(i+1))).read()
        f.write(data)

Next we want to combine each year's CSV file into a master CSV file for all years. We do this by loading each CSV and saving it into the weather list of DataFrames and then using pd.concat to combine them all together. Then we save the resulting data as weather.csv to load up later.



In [ ]:

    
weather = list()

for i in range(2001,2015):
    weather.append(pd.read_csv('{0}.csv'.format(str(i)),skiprows=1))
    
weather_df = pd.concat(weather)
weather_df.set_index('CST',inplace=True)
weather_df.index = pd.to_datetime(weather_df.index,unit='D')
weather_df.to_csv('weather.csv')

Read data

Weather

Read in weather.csv and clean up the data so that it's indexed by a proper DateTimeIndex and junk in the rainfall data is removed (in case we wanted to use it later).



In [2]:

    
weather_df = pd.read_csv('weather.csv',low_memory=False,index_col=0)
weather_df.index = pd.to_datetime(weather_df.index,unit='D')
weather_df['PrecipitationIn'] = weather_df['PrecipitationIn'].replace('T',np.nan)
weather_df['PrecipitationIn'] = weather_df['PrecipitationIn'].dropna().astype(float)
weather_df.tail()









    Out[2]:






  
    
      
      Max TemperatureF
      Mean TemperatureF
      Min TemperatureF
      Max Dew PointF
      MeanDew PointF
      Min DewpointF
      Max Humidity
       Mean Humidity
       Min Humidity
       Max Sea Level PressureIn
       Mean Sea Level PressureIn
       Min Sea Level PressureIn
       Max VisibilityMiles
       Mean VisibilityMiles
       Min VisibilityMiles
       Max Wind SpeedMPH
       Mean Wind SpeedMPH
       Max Gust SpeedMPH
      PrecipitationIn
       CloudCover
      
    
  
  
    
      2014-04-02
       45
       40
       34
       29
       26
       24
       70
       60
       49
       30.17
       30.12
       30.05
       10
       10
       10
       17
       10
       29
        NaN
       7
      ...
    
    
      2014-04-03
       43
       39
       35
       36
       32
       28
       89
       75
       60
       30.04
       29.86
       29.67
       10
        6
        0
       29
       14
       35
       0.53
       8
      ...
    
    
      2014-04-04
       48
       41
       34
       41
       33
       21
       92
       73
       54
       30.00
       29.60
       29.45
       10
        6
        0
       30
       17
       40
        NaN
       8
      ...
    
    
      2014-04-05
       51
       41
       31
       25
       22
       17
       69
       49
       29
       30.20
       30.13
       30.02
       10
       10
       10
       13
        6
       16
       0.00
       3
      ...
    
    
      2014-04-06
       57
       44
       31
       29
       25
       19
       63
       49
       35
       30.17
       30.08
       29.96
       10
       10
       10
       14
        7
       20
       0.00
       4
      ...
    
  

5 rows × 22 columns

List out all the different variables in the weather data.



In [3]:

    
weather_df.columns









    Out[3]:





Index([u'Max TemperatureF', u'Mean TemperatureF', u'Min TemperatureF', u'Max Dew PointF', u'MeanDew PointF', u'Min DewpointF', u'Max Humidity', u' Mean Humidity', u' Min Humidity', u' Max Sea Level PressureIn', u' Mean Sea Level PressureIn', u' Min Sea Level PressureIn', u' Max VisibilityMiles', u' Mean VisibilityMiles', u' Min VisibilityMiles', u' Max Wind SpeedMPH', u' Mean Wind SpeedMPH', u' Max Gust SpeedMPH', u'PrecipitationIn', u' CloudCover', u' Events', u' WindDirDegrees<br />'], dtype='object')

All crime data



In [4]:

    
crime_df = pd.read_csv('all_crime.csv')
crime_df['Datetime'] = pd.to_datetime(crime_df['Date'],format="%m/%d/%Y %I:%M:%S %p")
crime_df['Date'] = crime_df['Datetime'].apply(lambda x:x.date())
crime_df['Weekday'] = crime_df['Datetime'].apply(lambda x:x.weekday())
crime_df['Hour'] = crime_df['Datetime'].apply(lambda x:x.hour)
crime_df['Day'] = crime_df['Datetime'].apply(lambda x:x.day)
crime_df['Week'] = crime_df['Datetime'].apply(lambda x:x.week)
crime_df['Month'] = crime_df['Datetime'].apply(lambda x:x.month)

crime_df.head()









    



C:\Anaconda\lib\site-packages\pandas\io\parsers.py:1070: DtypeWarning: Columns (11,13) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)






    Out[4]:






  
    
      
      ID
      Case Number
      Date
      Block
      IUCR
      Primary Type
      Description
      Location Description
      Arrest
      Domestic
      Beat
      District
      Ward
      Community Area
      FBI Code
      X Coordinate
      Y Coordinate
      Year
      Updated On
      Latitude
      
    
  
  
    
      0
       9558116
       HX209604
       2014-04-03
          026XX W LOGAN BLVD
       0610
            BURGLARY
                       FORCIBLE ENTRY
           OTHER
       False
       False
       1411
       14
       35
       22
        05
       1158555
       1917229
       2014
       04/06/2014 12:38:43 AM
       41.928617
      ...
    
    
      1
       9558320
       HX209887
       2014-04-03
            008XX N LONG AVE
       0610
            BURGLARY
                       FORCIBLE ENTRY
       APARTMENT
       False
       False
       1524
       15
       37
       25
        05
       1140117
       1905254
       2014
       04/06/2014 12:38:43 AM
       41.896114
      ...
    
    
      2
       9557107
       HX208940
       2014-04-03
          070XX S RACINE AVE
       0495
             BATTERY
       AGGRAVATED OF A SENIOR CITIZEN
       APARTMENT
        True
        True
        734
        7
       17
       67
       04B
       1169494
       1858212
       2014
       04/06/2014 12:38:43 AM
       41.766438
      ...
    
    
      3
       9557134
       HX208947
       2014-04-03
       025XX N MILWAUKEE AVE
       0810
               THEFT
                            OVER $500
          STREET
       False
       False
       1414
       14
       35
       22
        06
       1155439
       1916718
       2014
       04/06/2014 12:38:43 AM
       41.927278
      ...
    
    
      4
       9557154
       HX208991
       2014-04-03
         070XX S MERRILL AVE
       2820
       OTHER OFFENSE
                     TELEPHONE THREAT
       RESIDENCE
       False
        True
        331
        3
        5
       43
        26
       1191720
       1858787
       2014
       04/05/2014 12:39:46 AM
       41.767505
      ...
    
  

5 rows × 28 columns



In [5]:

    
dict(Counter(crime_df['Primary Type']))









    Out[5]:





{'ARSON': 9065,
 'ASSAULT': 330270,
 'BATTERY': 1006149,
 'BURGLARY': 325403,
 'CONCEALED CARRY LICENSE VIOLATION': 3,
 'CRIM SEXUAL ASSAULT': 19481,
 'CRIMINAL DAMAGE': 636302,
 'CRIMINAL TRESPASS': 162053,
 'DECEPTIVE PRACTICE': 176914,
 'DOMESTIC VIOLENCE': 1,
 'GAMBLING': 13142,
 'HOMICIDE': 6581,
 'INTERFERE WITH PUBLIC OFFICER': 6172,
 'INTERFERENCE WITH PUBLIC OFFICER': 3365,
 'INTIMIDATION': 3315,
 'KIDNAPPING': 5802,
 'LIQUOR LAW VIOLATION': 12786,
 'MOTOR VEHICLE THEFT': 264724,
 'NARCOTICS': 628009,
 'NON - CRIMINAL': 2,
 'NON-CRIMINAL': 14,
 'NON-CRIMINAL (SUBJECT SPECIFIED)': 3,
 'OBSCENITY': 276,
 'OFFENSE INVOLVING CHILDREN': 33480,
 'OFFENSES INVOLVING CHILDREN': 439,
 'OTHER NARCOTIC VIOLATION': 95,
 'OTHER OFFENSE': 338382,
 'OTHER OFFENSE ': 7,
 'PROSTITUTION': 63493,
 'PUBLIC INDECENCY': 110,
 'PUBLIC PEACE VIOLATION': 38695,
 'RITUALISM': 23,
 'ROBBERY': 205745,
 'SEX OFFENSE': 20011,
 'STALKING': 2566,
 'THEFT': 1123449,
 'WEAPONS VIOLATION': 51793}



In [10]:

    
personal_crimes = ['ASSAULT','BATTERY','CRIM SEXUAL ASSAULT','HOMICIDE']
property_crimes = ['ARSON','BURGLARY','MOTOR VEHICLE THEFT','ROBBERY','THEFT']

Group data by date and join crime and weather data together

Join the temperature and crime data together based on their sharing a common DateTimeIndex.



In [19]:

    
arson_gb = crime_df[crime_df['Primary Type'] == 'ARSON'].groupby('Date')['ID'].agg(len)
assault_gb = crime_df[crime_df['Primary Type'] == 'ASSAULT'].groupby('Date')['ID'].agg(len)
battery_gb = crime_df[crime_df['Primary Type'] == 'BATTERY'].groupby('Date')['ID'].agg(len)
burglary_gb = crime_df[crime_df['Primary Type'] == 'BURGLARY'].groupby('Date')['ID'].agg(len)
homicide_gb = crime_df[crime_df['Primary Type'] == 'HOMICIDE'].groupby('Date')['ID'].agg(len)
sexual_assault_gb = crime_df[crime_df['Primary Type'] == 'CRIM SEXUAL ASSAULT'].groupby('Date')['ID'].agg(len)
robbery_gb = crime_df[crime_df['Primary Type'] == 'ROBBERY'].groupby('Date')['ID'].agg(len)
theft_gb = crime_df[crime_df['Primary Type'] == 'THEFT'].groupby('Date')['ID'].agg(len)
vehicle_theft_gb = crime_df[crime_df['Primary Type'] == 'MOTOR VEHICLE THEFT'].groupby('Date')['ID'].agg(len)
personal_gb = crime_df[crime_df['Primary Type'].isin(personal_crimes)].groupby('Date')['ID'].agg(len)
property_gb = crime_df[crime_df['Primary Type'].isin(property_crimes)].groupby('Date')['ID'].agg(len)

arson_gb.index = pd.to_datetime(arson_gb.index,unit='D')
assault_gb.index = pd.to_datetime(assault_gb.index,unit='D')
battery_gb.index = pd.to_datetime(battery_gb.index,unit='D')
burglary_gb.index = pd.to_datetime(burglary_gb.index,unit='D')
homicide_gb.index = pd.to_datetime(homicide_gb.index,unit='D')
sexual_assault_gb.index = pd.to_datetime(sexual_assault_gb.index,unit='D')
robbery_gb.index = pd.to_datetime(robbery_gb.index,unit='D')
theft_gb.index = pd.to_datetime(theft_gb.index,unit='D')
vehicle_theft_gb.index = pd.to_datetime(vehicle_theft_gb.index,unit='D')
personal_gb.index = pd.to_datetime(personal_gb.index,unit='D')
property_gb.index = pd.to_datetime(property_gb.index,unit='D')


ts = pd.DataFrame({'Arson':arson_gb.ix[:'2014-3-31'],
                   'Assault':assault_gb.ix[:'2014-3-31'],
                   'Battery':battery_gb.ix[:'2014-3-31'],
                   'Burglary':burglary_gb.ix[:'2014-3-31'],
                   'Homicide':homicide_gb.ix[:'2014-3-31'],
                   'Sexual_assault':sexual_assault_gb.ix[:'2014-3-31'],
                   'Robbery':robbery_gb.ix[:'2014-3-31'],
                   'Vehicle_theft':vehicle_theft_gb.ix[:'2014-3-31'],
                   'Theft':theft_gb.ix[:'2014-3-31'],
                   'Personal':personal_gb.ix[:'2014-3-31'],
                   'Property':property_gb.ix[:'2014-3-31'],
                   'Temperature':weather_df['Mean TemperatureF'].ix[:'2014-3-31'],
                   'Binned temperature':weather_df['Mean TemperatureF'].ix[:'2014-3-31']//10.*10,
                   'Humidity':weather_df[' Mean Humidity'].ix[:'2014-3-31'],
                   'Precipitation':weather_df['PrecipitationIn'].ix[:'2014-3-31']
                   })

ts['Time'] = range((max(ts.index)-min(ts.index)).days+1)

ts.reset_index(inplace=True)
ts.set_index('index',drop=False,inplace=True)
ts['Weekday'] = ts['index'].apply(lambda x:x.weekday())
ts['Hour'] = ts['index'].apply(lambda x:x.hour)
ts['Week'] = ts['index'].apply(lambda x:x.week)
ts['Month'] = ts['index'].apply(lambda x:x.month)
ts['Year'] = ts['index'].apply(lambda x:x.year)
ts['Weekend'] = ts['Weekday'].isin([5,6]).astype(int)

PART 1: The dynamics of crime, time, and weather

Define a helper function that we'll be using later on.



In [20]:

    
# adapted from http://matplotlib.org/examples/api/barchart_demo.html
def autolabel(rects):
    max_height = max([rect.get_height() for rect in rects if hasattr(rect,'get_height') and not np.isnan(rect.get_height())])
    for rect in rects:
        if hasattr(rect,'get_height'):
            height = rect.get_height()
            if not np.isnan(height):
                ax.text(rect.get_x()+rect.get_width()/2., height-.05*max_height, '%d'%int(height),
                ha='center', va='bottom',color='w')

Plot the occurrence of crimes over time. There are several apparent features:

Many crimes have very strong annual seasonality: rates are higher in the summer months and lower in the winter months.
There's a decreasing trend in many types of crimes over time.

I've marked March 1, 2014 with a vertical black dotted line to indicate when the gun policy went into effect.



In [25]:

    
figsize(12,6)
ts2 = pd.DataFrame({'Arson':arson_gb/float(arson_gb.max()),
                   'Assault':assault_gb/float(assault_gb.max()),
                   'Battery':battery_gb/float(battery_gb.max()),
                   'Burglary':burglary_gb/float(burglary_gb.max()),
                   'Homicide':homicide_gb/float(homicide_gb.max()),
                   'Sexual assault':sexual_assault_gb/float(sexual_assault_gb.max()),
                   'Robbery':robbery_gb/float(robbery_gb.max()),
                   'Theft':theft_gb/float(theft_gb.max()),})
ax = ts2.resample('M').plot(lw=4,alpha=.75,colormap='hsv')
ax.set_ylabel('Incidents (normalized)',fontsize=18)
#ax.right_ax.set_ylabel('Temperature  (F)',fontsize=18)

ax.set_xlabel('Time',fontsize=18)
ax.set_ylim((0,1))
#ax.set_yscale('log')
ax.grid(False,which='minor')

plt.axvline('2014-3-1',c='k',ls='--')
ax.legend(loc='upper center',ncol=4)









    Out[25]:





<matplotlib.legend.Legend at 0x8875f080>

We can plot the strength of the correlations between these crime statistics and some other variables as well. We add in Temperature,Humidity, and Precipitation. Redder colors are stronger correlations, bluer colors are weaker or negative correlations. The strongest correlation we find (darkest red) is between Battery and Assault: when batteries are high assaults are also high. The weakest correlation we find is between Humidity and Temperature: when temperature is high, humidity is low. Note that I've set the diagonal to zero (Arson is perfectly correlated with Arson).



In [28]:

    
figsize(8,6)
#ts.corr().columns
a = np.array(ts[['Arson','Burglary','Robbery','Theft','Assault','Battery','Homicide','Sexual_assault','Temperature','Humidity','Precipitation']].resample('M').corr())
np.fill_diagonal(a,0)
plt.pcolor(a,cmap='RdBu_r',edgecolors='k')

plt.xlim((0,11))
plt.ylim((0,11))

plt.xticks(arange(.5,11.5),['Arson','Burglary','Robbery','Theft','Assault','Battery','Homicide','Sexual assault','Temperature','Humidity','Precipitation'],rotation=90,fontsize=15)
plt.yticks(arange(.5,11.5),['Arson','Burglary','Robbery','Theft','Assault','Battery','Homicide','Sexual assault','Temperature','Humidity','Precipitation'],fontsize=15)

plt.title('Correlation between crime occurences',fontsize=20)
plt.colorbar()
plt.grid(b=True,which='major',alpha=.5)



In [29]:

    
figsize(12,6)
ts3 = pd.DataFrame({'Personal':personal_gb/float(personal_gb.max()),
                    'Property':property_gb/float(property_gb.max())
                    })
ax = ts3.resample('M').plot(lw=4,alpha=.75,colormap='jet')
ax.set_ylabel('Incidents (normalized)',fontsize=18)
#ax.right_ax.set_ylabel('Temperature  (F)',fontsize=18)

ax.set_xlabel('Time',fontsize=18)
ax.set_ylim((0,1))
#ax.set_yscale('log')
ax.grid(False,which='minor')
ax.legend(loc='upper center',ncol=4,fontsize=18)









    Out[29]:





<matplotlib.legend.Legend at 0x4a961400>

Crime as a function of temperature

Next we explore the correlation between temperature and crime above. The figure below plots the



In [393]:

    
ax = ts.boxplot(['Personal','Property'],by='Binned temperature')
ax[0].set_ylabel('Number of crimes',fontsize=18)
ax[0].set_xlabel('Temperature (F)',fontsize=18)
ax[0].set_title('Personal crimes',fontsize=15)

ax[1].set_xlabel('Temperature (F)',fontsize=18)
ax[1].set_title('Property crimes',fontsize=15)

plt.suptitle('Crime increases with temperature',fontsize=20)

#plt.xticks(plt.xticks()[0],arange(-10,100,10),fontsize=15)









    Out[393]:





<matplotlib.text.Text at 0x1234e1780>

We can also examine the relationship between Temperature, Humidity, and number of crimes. First, we extract the total number of crimes for each observed combination of Temperature and Humidity and store this in array1. Then we extract the total observations of Temperature and Humidity and store this in array2. The latter is to normalize the former in the event there are some combinations of temperature occur frequently and result in us overcounting the crime statistics.

Moving from bottom to top we can see the effect we saw above that temperature increases the frequency of crimes (bluer is less crime, redder is more crime). Moving from left to right we see crime doesn't vary substantially as function of humidity, with the exception that very high levels of humidity (above 60%) might have higher rates of crime



In [31]:

    
var = 'Personal'
ct1 = ts.groupby(['Temperature','Humidity'])[var].agg(np.sum).reset_index()
array1 = np.array(pd.pivot_table(ct1,values=var,rows='Temperature',cols='Humidity').fillna(0))

ct2 = ts.groupby(['Temperature','Humidity'])[var].agg(len).reset_index()
array2 = np.array(pd.pivot_table(ct2,values=var,rows='Temperature',cols='Humidity').fillna(0))

normalized_crime = array1/array2

plt.imshow(normalized_crime,cmap='RdBu_r',origin='lower',label='Count')
#plt.legend(loc='upper right',bbox_to_anchor=(1,.5))
plt.xlabel('Humidity',fontsize=18)
plt.ylabel('Temperature',fontsize=18)
plt.title('{0} crimes by temperature and humidity'.format(var),fontsize=24)
plt.colorbar()









    Out[31]:





<matplotlib.colorbar.Colorbar instance at 0x0000000031055888>



In [32]:

    
var = 'Property'
ct1 = ts.groupby(['Temperature','Humidity'])[var].agg(np.sum).reset_index()
array1 = np.array(pd.pivot_table(ct1,values=var,rows='Temperature',cols='Humidity').fillna(0))

ct2 = ts.groupby(['Temperature','Humidity'])[var].agg(len).reset_index()
array2 = np.array(pd.pivot_table(ct2,values=var,rows='Temperature',cols='Humidity').fillna(0))

normalized_crime = array1/array2

plt.imshow(normalized_crime,cmap='RdBu_r',origin='lower',label='Count')
#plt.legend(loc='upper right',bbox_to_anchor=(1,.5))
plt.xlabel('Humidity',fontsize=18)
plt.ylabel('Temperature',fontsize=18)
plt.title('{0} crimes by temperature and humidity'.format(var),fontsize=24)
plt.colorbar()









    Out[32]:





<matplotlib.colorbar.Colorbar instance at 0x000000006EF148C8>

Crime as a function of time

We can also examine Assaults as a function of time -- we saw that they peak when temperatures are high, so they should likewise peak in the summer months.



In [33]:

    
crime_df_gb_personal = crime_df[crime_df['Primary Type'].isin(personal_crimes)]
crime_df_gb_property = crime_df[crime_df['Primary Type'].isin(property_crimes)]



In [34]:

    
bar_df = pd.DataFrame()
bar_df['Personal'] = crime_df_gb_personal.groupby('Month')['ID'].agg(len)/len(crime_df_gb_personal)*100
bar_df['Property'] = crime_df_gb_property.groupby('Month')['ID'].agg(len)/len(crime_df_gb_property)*100

ax = bar_df.plot(kind='bar')

plt.title('Crimes peak in summer',fontsize=20)
plt.xlabel('Month',fontsize=18)
plt.xticks(plt.xticks()[0],['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],fontsize=15,rotation=0)
plt.ylabel('Percentage of crimes',fontsize=18)
plt.yticks(fontsize=15)
plt.ylim((0,12))
plt.legend(fontsize=15,ncol=2,loc='upper center')
autolabel(ax.patches)

We can also examine what time of day assaults occurred. We see assaults peak between 2 and 3pm and crater around 5am.



In [35]:

    
bar_df = pd.DataFrame()

bar_df['Personal'] = crime_df_gb_personal.groupby('Hour')['ID'].agg(len)/len(crime_df_gb_personal)*100
bar_df['Property'] = crime_df_gb_property.groupby('Hour')['ID'].agg(len)/len(crime_df_gb_property)*100

ax = bar_df.plot(kind='bar')

plt.title('Crimes peak in afternoon',fontsize=20)
plt.xlabel('Hour of day',fontsize=18)
plt.xticks(fontsize=15,rotation=0)
plt.ylabel('Percentage of crimes',fontsize=18)
plt.yticks(fontsize=15)
plt.legend(fontsize=15,ncol=2,loc='upper center')
autolabel(ax.patches)



In [36]:

    
bar_df = pd.DataFrame()
bar_df['Personal'] = crime_df_gb_personal.groupby('Weekday')['ID'].agg(len)/len(crime_df_gb_personal)*100
bar_df['Property'] = crime_df_gb_property.groupby('Weekday')['ID'].agg(len)/len(crime_df_gb_property)*100

ax = bar_df.plot(kind='bar')

plt.title('Property crimes fall,\npersonal crimes rise over weekends',fontsize=20)
plt.xlabel('Day of week',fontsize=18)
plt.xticks(plt.xticks()[0],['Mon','Tue','Wed','Thu','Fri','Sat','Sun'],fontsize=15,rotation=0)
plt.ylabel('Percentage of crimes',fontsize=18)
plt.yticks(fontsize=15)
plt.ylim((0,18))
plt.legend(fontsize=15,ncol=2,loc='upper center')
autolabel(ax.patches)

As time of year and day together

We can combine the month of the year and hour of the day analyses to create a "fingerprint" of different criminal patterns. Plotting the distribution for assaults, we see that assaults peak between February and April around 3pm and in May through July, this shifts back to about 8pm, before returning back to about 3pm in August.



In [37]:

    
ct = crime_df[crime_df['Primary Type'] == 'ASSAULT'].groupby(['Month','Hour'])['ID'].agg(len).reset_index()
plt.imshow(np.array(pd.pivot_table(ct,values='ID',rows='Month',cols='Hour')),cmap='RdBu_r',label='Count')
#plt.legend(loc='upper right',bbox_to_anchor=(1,.5))
plt.xlabel('Hour of day',fontsize=18)
plt.ylabel('Month of year',fontsize=18)
plt.title('Assault by time of day and year',fontsize=24)
plt.colorbar()









    Out[37]:





<matplotlib.colorbar.Colorbar instance at 0x000000003EEE8308>

We can make a similar "fingerprint" for the hour of the day against the day of the week. We see that weekdays have a distinct peak around 3pm while weekends have a less-distinct peak but start and end later in the evening.



In [38]:

    
ct = crime_df[crime_df['Primary Type'] == 'ASSAULT'].groupby(['Weekday','Hour'])['ID'].agg(len).reset_index()
plt.imshow(np.array(pd.pivot_table(ct,values='ID',rows='Weekday',cols='Hour')),cmap='RdBu_r',label='Count')
#plt.legend(loc='upper right',bbox_to_anchor=(1,.5))
plt.xlabel('Hour of day',fontsize=18)
plt.ylabel('Day of week',fontsize=18)
plt.title('Assault by time of day and week',fontsize=24)
plt.yticks(arange(0,7),['Mon','Tue','Wed','Thu','Fri','Sat','Sun'])
plt.colorbar()









    Out[38]:





<matplotlib.colorbar.Colorbar instance at 0x0000000049F00148>

For other types of crimes

Batteries have a similar pattern of shifting from the afternoon in the spring, to the late evening in the summer, and back to the afternoon in the fall.



In [39]:

    
ct = crime_df[crime_df['Primary Type'] == 'BATTERY'].groupby(['Month','Hour'])['ID'].agg(len).reset_index()
plt.imshow(np.array(pd.pivot_table(ct,values='ID',rows='Month',cols='Hour')),cmap='RdBu_r',label='Count')
#plt.legend(loc='upper right',bbox_to_anchor=(1,.5))
plt.xlabel('Hour of day',fontsize=18)
plt.ylabel('Month of year',fontsize=18)
plt.title('Battery by time of day and year',fontsize=24)
plt.colorbar()









    Out[39]:





<matplotlib.colorbar.Colorbar instance at 0x0000000044B3CB48>

Some strange going on with reporting of sexual assaults since they peak at midnight and Jan 1.



In [45]:

    
ct = crime_df[crime_df['Primary Type'] == 'SEX OFFENSE'].groupby(['Month','Hour'])['ID'].agg(len).reset_index()
plt.imshow(np.array(pd.pivot_table(ct,values='ID',rows='Month',cols='Hour')),cmap='RdBu_r',label='Count')
#plt.legend(loc='upper right',bbox_to_anchor=(1,.5))
plt.xlabel('Hour of day',fontsize=18)
plt.ylabel('Month of year',fontsize=18)
plt.title('Sexual assault by time of day and year',fontsize=24)
plt.colorbar()









    Out[45]:





<matplotlib.colorbar.Colorbar instance at 0x000000004429A808>

Robberies show a more extreme pattern of variation in time of day over the course of the year. Robberies move later into the night as spring becomes summer and then move earlier into the evening as summer becomes fall and winter.



In [40]:

    
ct = crime_df[crime_df['Primary Type'] == 'ROBBERY'].groupby(['Month','Hour'])['ID'].agg(len).reset_index()
plt.imshow(np.array(pd.pivot_table(ct,values='ID',rows='Month',cols='Hour')),cmap='RdBu_r',label='Count')
#plt.legend(loc='upper right',bbox_to_anchor=(1,.5))
plt.xlabel('Hour of day',fontsize=18)
plt.ylabel('Month of year',fontsize=18)
plt.title('Robbery by time of day and year',fontsize=24)
plt.colorbar()









    Out[40]:





<matplotlib.colorbar.Colorbar instance at 0x00000000374C7BC8>

In contrast, buglaries have a consistent pattern of being reported around 8am and 12pm, neither of which change through the year.



In [41]:

    
ct = crime_df[crime_df['Primary Type'] == 'BURGLARY'].groupby(['Month','Hour'])['ID'].agg(len).reset_index()
plt.imshow(np.array(pd.pivot_table(ct,values='ID',rows='Month',cols='Hour')),cmap='RdBu_r',label='Count')
#plt.legend(loc='upper right',bbox_to_anchor=(1,.5))
plt.xlabel('Hour of day',fontsize=18)
plt.ylabel('Month of year',fontsize=18)
plt.title('Burglary by time of day and year',fontsize=24)
plt.colorbar()









    Out[41]:





<matplotlib.colorbar.Colorbar instance at 0x0000000069551AC8>

Finally, we see that homicides are rare during the winter months and then peak in the late evening and early morning of the summer months.



In [42]:

    
ct = crime_df[crime_df['Primary Type'] == 'HOMICIDE'].groupby(['Month','Hour'])['ID'].agg(len).reset_index()
plt.imshow(np.array(pd.pivot_table(ct,values='ID',rows='Month',cols='Hour')),cmap='RdBu_r',label='Count')
#plt.legend(loc='upper right',bbox_to_anchor=(1,.5))
plt.xlabel('Hour of day',fontsize=18)
plt.ylabel('Month of year',fontsize=18)
plt.title('Homicides by time of day and year',fontsize=24)
plt.colorbar()









    Out[42]:





<matplotlib.colorbar.Colorbar instance at 0x0000000042C608C8>

Plots

The plots above show the different ways in which temperature and crime are correlated in time and show strong seasonal effects. We can also plot Temperature against Assaults directly. The correlation is very apparent, but there's also a clear decreasing trend over time. This suggests that any changes in crime rates need to account for both seasonal effects as well as longer-term historical patterns that are reducing crime year-to-year.



In [51]:

    
ax = ts[['Temperature','Assault']].resample('M').plot(secondary_y=['Assault'],lw=4)
ax.set_ylabel(u'Temperature (\N{DEGREE SIGN}F)',fontsize=18)
ax.right_ax.set_ylabel('Assaults',fontsize=18)
ax.set_xlabel('Time',fontsize=18)
ax.set_ylim((0,100))
plt.axvspan(xmin='2014-3-1',xmax=ts.index.max(),color='k',alpha=.1)
plt.title(u'Assaults, 2001\N{EN DASH}present',fontsize=24)
plt.autoscale()

Zooming into the last year of data (January 1, 2013 to early April 2014), there's still an obvious seasonality to the assaults that corresponds with the temperature as well. The grey hatching on the right corresponds to the time the gun assault has been in effect -- if the argument that the liberalized gun policies should deter crime, this is where we should see the changes in crime rates (green line). Assaults fell in the coldest part of the winter but have climed again as the temperature recovered.



In [52]:

    
ax = ts[['Temperature','Assault']].resample('W').ix['2013-1-1':].plot(secondary_y=['Assault'],lw=4)
ax.set_ylabel(u'Temperature  (\N{DEGREE SIGN}F)',fontsize=18)
ax.right_ax.set_ylabel('Incidents',fontsize=18)
ax.set_xlabel('Time',fontsize=18)
#ax.grid(False,which='minor')
plt.axvspan(xmin='2014-3-1',xmax=ts.index.max(),color='k',alpha=.1)
plt.title(u'Assaults, 2013\N{EN DASH}present',fontsize=24)
plt.autoscale()

Because the homicides are so much more rare than the assaults, it was difficult to tell whether there was a close correlation. Aggregating the data to a montly level and plotting only murders against temperature, we see there is a pattern here as well.



In [53]:

    
ax = ts[['Temperature','Homicide']].resample('M').plot(secondary_y=['Homicide'],lw=4,alpha=.75)
ax.set_ylabel(u'Temperature  (\N{DEGREE SIGN}F)',fontsize=18)
ax.right_ax.set_ylabel('Homicides',fontsize=18)
ax.set_xlabel('Time',fontsize=18)
#ax.set_yticklabels(plt.yticks()[0],range(0,90,10))
plt.axvspan(xmin='2014-3-1',xmax=ts.index.max(),color='k',alpha=.1)
plt.title(u'Homicides, 2001\N{EN DASH}present',fontsize=24)









    Out[53]:





<matplotlib.text.Text at 0x3a935ef0>

Homicides



In [54]:

    
ax = ts[['Temperature','Homicide']].resample('W').ix['2013-1-1':].plot(secondary_y=['Homicide'],lw=4,alpha=.75)
ax.set_ylabel(u'Temperature  (\N{DEGREE SIGN}F)',fontsize=18)
ax.right_ax.set_ylabel('Homicides',fontsize=18)
ax.set_xlabel('Time',fontsize=18)
#ax.set_yticklabels(plt.yticks()[0],range(0,90,10))
plt.axvspan(xmin='2014-3-1',xmax=ts.index.max(),color='k',alpha=.1)
plt.title(u'Homicides, 2013\N{EN DASH}present',fontsize=24)









    Out[54]:





<matplotlib.text.Text at 0x2e7977b8>

Power spectrum

We saw above that there are regular cyclical patterns at the annual level in all three data sets. In other words, the temperature, number of assaults, or number of murders in a given day, week, or month last year is similar to the number this year. But there may also be other signals hiding in there as well that are harder to see. We borrow an analytical technique from signal processing called the "Fast Fourier Transform" that is literally used to separate the signals from the noise. This method is basically an "x-ray" on temporal data that lets us see the "skeleton" of different underlying temporal patterns hidden within the data. The resulting plots are called power spectra and we can visualize them for each of the temperature, assault, and murder data to see what other patterns are hiding in there in addition to the annual cycles we saw above.

Special thanks to my colleague, Navid Dianati for his help in interpreting these.

Temperature

Based on the plots above as well as our own intuition, there should be a signal at a one year frequency at a minimum. Marking that "once every 365 days" frequency with a vertical dashed cyan line and then plotting the power spectrum, we see a peak: there's definitely an annual cycle in the temperature data. No surprise.



In [55]:

    
d = ts['Temperature'].dropna()
n = len(d)
ps = np.abs(np.fft.fft(d)**2)
plt.plot(ps,lw=4)
plt.yscale('log')
plt.xscale('log')
plt.xlim((10,100))
x = plt.xticks()[0]

plt.xticks(x,(n/x).round(1))
plt.axvline(n/365,ls='--',c='c') # Annual signal

plt.grid(False,which='Minor')
plt.xlabel('Frequency (days)',fontsize=18)
plt.ylabel('Power',fontsize=18)
plt.title('Cycles of temperatures',fontsize=24)









    Out[55]:





<matplotlib.text.Text at 0x488b6748>

Assaults

We also saw a clear annual cycle in the assault data, so I mark that "once every 365" frequency again with a vertical dashed cyan line and then plot the power spectrum for the assault data. We see there's a definite peak there as well, as we expected. However, we also see there are two other noticable peaks in this spectra that didn't occur in the temperature data.

Obviously nature doesn't follow a 7-day week cycle, but this cycle structures a significant amount of human social activity. Plotting a vertical dashed yellow line at the "one a week" frequency, we see that this signal corresponds perfectly with the observed peak. In other words, in addition to the regular annual cycle of gun violence there's also a regular weekly cycle of gun violence. This isn't a revelation to any criminologists or urban sociologists, but it's illustrative of using interdisciplinary method for exploratory analysis.

Finally there's at least one more peak in between these the weekly and the annual cycle. Plotting a vertical dashed magenta line at the "once every 90 days" frequency, this corresponds to another signal. Thus there's at least a three substantive signals in the assault data --- once a year, once a week, and now once a quarter. The power of this signal is weaker than the other two, but it's still about 10x stronger than the background noise. This last finding is somewhat unusual as I don't expect that criminals file a quarterly statement and then promptly go assault someone afterwards. But this strong and regular quarterly signal strongly suggests there's some bureaucratic process at work here that's driving reporting on a ~91-day cycle.



In [56]:

    
d = ts['Assault'].dropna()
n = len(d)
ps = np.abs(np.fft.fft(d)**2)
plt.plot(ps,lw=4,c='g',alpha=.75)
plt.yscale('log')
plt.xscale('log')
plt.xlim((10,100))
x = plt.xticks()[0]

plt.xticks(x,(n/x).round(1))
plt.axvline(n/365,ls='--',c='c') # Annual signal
plt.axvline(n/182,ls='--',c='m') # Quarterly signal
plt.axvline(n/7,ls='--',c='y') # Weekly signal
plt.grid(False,which='Minor')
plt.xlabel('Frequency (days)',fontsize=18)
plt.ylabel('Power',fontsize=18)
plt.title('Cycles of assaults',fontsize=24)









    Out[56]:





<matplotlib.text.Text at 0x5189b710>

Just to double-check we're not reading too much into a weird statistical fluke of this method, I'll aggregate the data into weekly-level data and then repeat the analysis. Again we find signal peaks at the "once every 52 weeks", "once every 13 weeks", and "once every week" frequencies (marked in cyan, magenta, and yellow, respectively).



In [57]:

    
d = ts['Assault'].resample('W',how=np.sum).dropna()
n = len(d)
ps = np.abs(np.fft.fft(d)**2)
plt.plot(ps,lw=4,c='g',alpha=.75)
plt.yscale('log')
plt.xscale('log')
plt.xlim((10,100))
x = plt.xticks()[0]

plt.xticks(x,(n/x).round(1))
plt.axvline(n/52.,ls='--',c='c') # Annual signal
plt.axvline(n/26,ls='--',c='m') # Quarterly signal
plt.axvline(n,ls='--',c='y') # Weekly signal

plt.grid(False,which='Minor')
plt.xlabel('Frequency (weeks)',fontsize=18)
plt.ylabel('Power',fontsize=18)
plt.title('Cycles of assaults',fontsize=24)









    Out[57]:





<matplotlib.text.Text at 0x37ef62b0>

Homicides

Looking at the power spectrum for murders, let's try to find signals at the same frequencies we found for assaults. Interestingly enough, there don't appear to be any, even when there should have been an annual peak as we saw above. Strangely, there's a peak (marked with the vertical dotted black line) at ~250 days, or approximately 9 months.



In [58]:

    
d = ts['Homicide'].dropna()
n = len(d)
ps = np.abs(np.fft.fft(d)**2)
plt.plot(ps,lw=4,c='r',alpha=.75)
plt.yscale('log')
plt.xscale('log')
plt.xlim((10,100))
x = plt.xticks()[0]

plt.xticks(x,(n/x).round(1))
plt.axvline(n/365,ls='--',c='c') # Annual signal
plt.axvline(n/90,ls='--',c='m') # Quarterly signal
plt.axvline(n/7,ls='--',c='y') # Weekly signal
plt.axvline(n/255,ls='--',c='k') # ??? signal

plt.grid(False,which='Minor')
plt.xlabel('Frequency (days)',fontsize=18)
plt.ylabel('Power',fontsize=18)
plt.title('Cycles of homicides',fontsize=24)









    Out[58]:





<matplotlib.text.Text at 0x6a7cc940>

This doesn't really make sense (unless there was some 9-month process like people celebrating murders by conceiving babies and then celebrating babies by murdering people), so let's double-check by aggregating the data into weeks. Sure enough, the weird "one every 9 months" frequency goes away and we find the annual signal again at the "once every 52 weeks" frequency and a weekly signal at the "once every week" frequency. So murder patterns also follow an annual and weekly cycles, as we might expect --- but unlike the there's no quarterly cycle either.



In [394]:

    
d = ts['Homicide'].resample('W').dropna()
n = len(d)
ps = np.abs(np.fft.fft(d)**2)
plt.plot(ps,lw=4,c='r',alpha=.75)
plt.yscale('log')
plt.xscale('log')
plt.xlim((10,100))
x = plt.xticks()[0]

plt.xticks(x,(n/x).round(1))
plt.axvline(n/52.,ls='--',c='c') # Annual signal
#plt.axvline(n/13.,ls='--',c='m') # Quarterly signal
plt.axvline(n/1.,ls='--',c='y') # Weekly signal

plt.grid(False,which='Minor')
plt.xlabel('Frequency (weeks)',fontsize=18)
plt.ylabel('Power',fontsize=18)
plt.title('Cycles of homicides',fontsize=24)









    Out[394]:





<matplotlib.text.Text at 0x123c92710>

Curve fitting

The analysis above showed there are clear cyclical patterns in the data. We'll want to use temperature as an explanatory variable, so we need to come up with a function that describes it well. We'll use curve_fit to identify the parameters of a single sin wave function with an amplitude a, a period b, and an intercept c: $a \times sin(t+b) + c$

Curve fit should return the estimates for these parameters for a given year's cycle and we can plot them against the observed data.



In [63]:

    
# from http://nbviewer.ipython.org/gist/keflavich/4042018
from scipy.optimize import curve_fit

def sinfunc(x,a,b,c):
    return a*np.sin(2*np.pi/365.*x+b)+c

y2001 = np.array(ts['Temperature']['2001-1-1':'2001-12-31'])
x = arange(len(y2001))
fitpars,covmat =  curve_fit(f=sinfunc,xdata=x,ydata=y2001)

plt.scatter(x,y2001)
y = sinfunc(x,*fitpars)
plt.plot(x,y,'r-',lw=8,alpha=.5)
plt.ylabel('Temperature',fontsize=18)
plt.xlabel('Day',fontsize=18)
plt.xlim((0,365))









    Out[63]:





(0, 365)

As this is only for a single year's data, we should estimate parameters for each year in the data. Note that curve_fit can't estimate parameters for some years because there's too much noise in the data, so we'll have to skip those. We'll average all of these parameter estimates together to get a single function that attempts to fit the data across years.



In [64]:

    
for y in range(2002,2014):
    y_i = np.array(ts['Temperature']['{0}-1-1'.format(y):'{0}-12-31'.format(y)])
    x = arange(len(y_i))
    try:
        fitpars_i,covmat =  curve_fit(f=sinfunc,xdata=x,ydata=y_i)
    except RuntimeError:
        print "Couldn't find estimates for {0}".format(y)
        pass
    fitpars = np.vstack([fitpars,fitpars_i])
    
avg_params = fitpars.mean(axis=0)









    



Couldn't find estimates for 2008
Couldn't find estimates for 2009
Couldn't find estimates for 2010

We can plot this function, the observed data, and the error between the function and the observed data below. The final chart shows the distributions of errors over time (in degrees F). There are some major errors here -- these are unseasonably warm or cold days -- but these are large enough to make the average error 0.56°F and appear to be distributed mostly around 0. So this averaged model seems to be an adequate fit for all the temperature data.



In [65]:

    
sin_df = pd.DataFrame(index=ts.index)
t = len(sin_df)
sin_df['sin'] = sinfunc(arange(t),*avg_params)
sin_df['Temp'] = ts['Temperature']
sin_df['error'] = sin_df['sin'] - sin_df['Temp']
ax = sin_df.plot(legend=False,subplots=True)
print u"The average error across all observations is: {0}\N{DEGREE SIGN}F".format(round(sin_df['error'].mean(),2))









    



The average error across all observations is: 0.56°F

PART 2: A statistical model of crime

Assaults



In [66]:

    
lm_assault = smf.ols('Assault ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()

lm_assault_x = pd.DataFrame({'Date':pd.date_range('2001-01-01','2014-12-31')})
lm_assault_x['Temperature'] = sinfunc(arange(len(lm_assault_x)),*avg_params)
lm_assault_x['Month'] = lm_assault_x['Date'].apply(lambda x:x.month)
lm_assault_x['Week'] = lm_assault_x['Date'].apply(lambda x:x.week)
lm_assault_x['Weekday'] = lm_assault_x['Date'].apply(lambda x:x.weekday())
lm_assault_x['Time'] = arange(len(lm_assault_x))
lm_assault_x = lm_assault_x.set_index('Date')

std,lower,upper = wls_prediction_std(lm_assault)

start_date = '2013-01-01'
assault_pred = pd.DataFrame(index = pd.date_range(start_date,'2014-12-31'))
assault_pred['Predictions'] = lm_assault.predict(lm_assault_x[start_date:])
assault_pred['Observations'] = ts['Assault'].ix[start_date:]
assault_pred['Lower CI'] = assault_pred['Predictions']-2*std.mean()
assault_pred['Upper CI'] = assault_pred['Predictions']+2*std.mean()



In [68]:

    
assault_pred[['Lower CI','Upper CI']].resample('W').plot(c='g',lw=1,alpha=.75)
assault_pred['Predictions'].resample('W').plot(c='g',lw=4,alpha=.75,label='Predictions')
#plt.plot(assault_pred.index,assault_pred['Lower_CI'])
plt.scatter(assault_pred.index,assault_pred['Observations'],c='g',alpha=.25,label='Observations')
plt.ylabel('Number of assaults',fontsize=18)
plt.xlabel('Time',fontsize=18)
plt.axvspan(xmin='2014-3-1',xmax=assault_pred.index.max(),color='k',alpha=.1)
plt.legend()









    Out[68]:





<matplotlib.legend.Legend at 0x7a36a8d0>

Homicides

Homicides are very rare relative to other crimes: in any given day, the 95% confidence interval of the model estimates negative deaths.

I'm going to do something different from the models above and put my thumb on the scale with the estimates. If the model estimates 60.9 crimes on a given, that fractional crime is small relative to the absolute magnitude of crimes. We can add up these fractional remainders, and they may tell us something substantive. However, if the model estimates 1.9 crimes per day, this fraction is large relative to the absolute magnitude of the crimes. Adding up all these fractions results in some significant over-estimation. I'm going to round these fractions back down to the next integer: 0.9 crimes becomes 0, 1.9 crimes becomes 1, etc. This has the result of biasing the errors to be low, but making the absolute counts more accurate.



In [69]:

    
lm_homicide = smf.ols('Homicide ~ Temperature + Time + C(Week) + C(Weekday)',data=ts).fit()

lm_homicide_x = pd.DataFrame({'Date':pd.date_range('2001-01-01','2014-12-31')})
lm_homicide_x['Temperature'] = sinfunc(arange(len(lm_homicide_x)),*avg_params)
lm_homicide_x['Month'] = lm_homicide_x['Date'].apply(lambda x:x.month)
lm_homicide_x['Week'] = lm_homicide_x['Date'].apply(lambda x:x.week)
lm_homicide_x['Weekday'] = lm_homicide_x['Date'].apply(lambda x:x.weekday())
lm_homicide_x['Time'] = arange(len(lm_homicide_x))
lm_homicide_x = lm_homicide_x.set_index('Date')

std,lower,upper = wls_prediction_std(lm_homicide)

start_date = '2001-01-01'
homicide_pred = pd.DataFrame(index = pd.date_range(start_date,'2014-12-31'))
homicide_pred['Predictions'] = np.floor(lm_homicide.predict(lm_homicide_x.ix[start_date:]))
homicide_pred['Observations'] = ts['Homicide'].ix[start_date:]
homicide_pred['Lower CI'] = homicide_pred['Predictions'] - 2*std.mean()
homicide_pred['Upper CI'] = homicide_pred['Predictions'] + 2*std.mean()



In [71]:

    
homicide_pred[['Lower CI','Upper CI']].resample('W').plot(c='r',lw=1,alpha=.75)
homicide_pred['Predictions'].resample('W').plot(c='r',lw=4,alpha=.75,label='Predictions')
#plt.plot(assault_pred.index,assault_pred['Lower_CI'])
plt.scatter(homicide_pred.index,homicide_pred['Observations'],c='r',alpha=.25,label='Observations')
plt.ylabel('Number of homicides',fontsize=18)
plt.xlabel('Time',fontsize=18)
plt.legend()
plt.axvspan(xmin='2014-3-1',xmax=assault_pred.index.max(),color='k',alpha=.1)
plt.axhline(0,c='k',lw=4)
plt.title('Comparison of model predictions and observed data',fontsize=24)









    Out[71]:





<matplotlib.text.Text at 0x7b66b3c8>



In [126]:

    
(homicide_pred['Observations'] - homicide_pred['Predictions']).resample('M').ix[:'2014-3-31'].plot(lw=2,c='r')
plt.axhline(y=0,c='k',lw=2,alpha=.75)
plt.ylabel('Observed error from predictions',fontsize=18)
plt.xlabel('Time',fontsize=18)
plt.title('Homicides following model since 2013',fontsize=24)









    Out[126]:





<matplotlib.text.Text at 0x54367390>



In [73]:

    
bar_df = pd.DataFrame()
bar_df['Predictions'] = homicide_pred['Predictions'].ix['2013-1-1':'2013-12-31'].groupby(lambda x:x.month).agg(np.sum)
bar_df['Observations'] = ts['Homicide'].ix['2013-1-1':'2013-12-31'].groupby(lambda x:x.month).agg(np.sum)

ax = bar_df.plot(kind='bar')
plt.xticks(plt.xticks()[0],['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],fontsize=15,rotation=0)
plt.title('2013 homicide estimates',fontsize=24)
plt.ylabel('Number of homicides',fontsize=18)
plt.xlabel('Month',fontsize=18)
ax.legend(fontsize=15,ncol=2,loc='upper center')
plt.ylim((0,60))
autolabel(ax.patches)

print "The model predicts {0} homicides. {1} homicides were reported.".format(round(bar_df['Predictions'].sum()), bar_df['Observations'].sum())









    



The model predicts 423.0 homicides. 415.0 homicides were reported.

Compare model estimates per month to observations from 2013. The model is still over-estimating across every month.

Personal crimes



In [75]:

    
lm_personal = smf.ols('Personal ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()

lm_personal_x = pd.DataFrame({'Date':pd.date_range('2001-01-01','2014-12-31')})
lm_personal_x['Temperature'] = sinfunc(arange(len(lm_personal_x)),*avg_params)
lm_personal_x['Month'] = lm_personal_x['Date'].apply(lambda x:x.month)
lm_personal_x['Week'] = lm_personal_x['Date'].apply(lambda x:x.week)
lm_personal_x['Weekday'] = lm_personal_x['Date'].apply(lambda x:x.weekday())
lm_personal_x['Time'] = arange(len(lm_personal_x))
lm_personal_x = lm_personal_x.set_index('Date')

std,lower,upper = wls_prediction_std(lm_personal)

start_date = '2001-01-01'
personal_pred = pd.DataFrame(index = pd.date_range(start_date,'2014-12-31'))
personal_pred['Predictions'] = lm_personal.predict(lm_personal_x[start_date:])
personal_pred['Observations'] = ts['Personal'].ix[start_date:]
personal_pred['Lower CI'] = personal_pred['Predictions']-2*std.mean()
personal_pred['Upper CI'] = personal_pred['Predictions']+2*std.mean()



In [76]:

    
personal_pred[['Lower CI','Upper CI']].resample('W').plot(c='b',lw=1,alpha=.75)
personal_pred['Predictions'].resample('W').plot(c='b',lw=4,alpha=.75,label='Predictions')
#plt.plot(assault_pred.index,assault_pred['Lower_CI'])
plt.scatter(personal_pred.index,personal_pred['Observations'],c='b',alpha=.25,label='Observations')
plt.ylabel('Number of personal crimes',fontsize=18)
plt.xlabel('Time',fontsize=18)
plt.axvspan(xmin='2014-3-1',xmax=personal_pred.index.max(),color='k',alpha=.1)
plt.legend()
plt.title('Comparison of model predictions and observed data',fontsize=24)









    Out[76]:





<matplotlib.text.Text at 0x5dd5abe0>



In [124]:

    
(personal_pred['Observations'] - personal_pred['Predictions']).resample('M').ix[:'2014-3-31'].plot(lw=2,c='b')
plt.axhline(y=0,c='k',lw=2,alpha=.75)
plt.ylabel('Observed error from predictions',fontsize=18)
plt.xlabel('Time',fontsize=18)
plt.title('Personal crimes below expectations since 2013',fontsize=24)









    Out[124]:





<matplotlib.text.Text at 0x7c0bae80>



In [78]:

    
figsize(12,6)
bar_df = pd.DataFrame()
bar_df['Predictions'] = personal_pred['Predictions'].ix['2013-1-1':'2013-12-31'].groupby(lambda x:x.month).agg(np.sum)
bar_df['Observations'] = ts['Personal'].ix['2013-1-1':'2013-12-31'].groupby(lambda x:x.month).agg(np.sum)

ax = bar_df.plot(kind='bar')
plt.xticks(plt.xticks()[0],['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],fontsize=15,rotation=0)
plt.title('2013 personal crime estimates',fontsize=24)
plt.ylabel('Number of crimes',fontsize=18)
plt.xlabel('Month',fontsize=18)
ax.legend(fontsize=15,ncol=2,loc='upper center')
plt.ylim((0,10000))
autolabel(ax.patches)

print "The model predicted {0} personal crimes in 2013. {1} personal crimes occurred in 2013.".format(round(bar_df['Predictions'].sum()), bar_df['Observations'].sum())









    



The model predicted 75915.0 personal crimes in 2013. 74629 personal crimes occurred in 2013.

Property crimes



In [119]:

    
lm_property = smf.ols('Property ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()

lm_property_x = pd.DataFrame({'Date':pd.date_range('2001-01-01','2014-12-31')})
lm_property_x['Temperature'] = sinfunc(arange(len(lm_property_x)),*avg_params)
lm_property_x['Month'] = lm_property_x['Date'].apply(lambda x:x.month)
lm_property_x['Week'] = lm_property_x['Date'].apply(lambda x:x.week)
lm_property_x['Weekday'] = lm_property_x['Date'].apply(lambda x:x.weekday())
lm_property_x['Time'] = arange(len(lm_property_x))
lm_property_x = lm_property_x.set_index('Date')

std,lower,upper = wls_prediction_std(lm_property)

start_date = '2001-01-01'
property_pred = pd.DataFrame(index = pd.date_range(start_date,'2014-12-31'))
property_pred['Predictions'] = lm_property.predict(lm_property_x[start_date:])
property_pred['Observations'] = ts['Property'].ix[start_date:]
property_pred['Lower CI'] = property_pred['Predictions'] - 2*std.mean()
property_pred['Upper CI'] = property_pred['Predictions'] + 2*std.mean()



In [120]:

    
property_pred[['Lower CI','Upper CI']].resample('W').plot(c='g',lw=1,alpha=.75)
property_pred['Predictions'].resample('W').plot(c='g',lw=4,alpha=.75,label='Predictions')
#plt.plot(assault_pred.index,assault_pred['Lower_CI'])
plt.scatter(property_pred.index,property_pred['Observations'],c='g',alpha=.25,label='Observations')
plt.ylabel('Number of property crimes',fontsize=18)
plt.xlabel('Time',fontsize=18)
plt.axvspan(xmin='2014-3-1',xmax=property_pred.index.max(),color='k',alpha=.1)
plt.legend()
plt.title('Comparison of model predictions and observed data',fontsize=24)









    Out[120]:





<matplotlib.text.Text at 0x22ea2c18>



In [122]:

    
(property_pred['Observations'] - property_pred['Predictions']).resample('M').ix[:'2014-3-31'].plot(lw=2,c='g')
plt.axhline(y=0,c='k',lw=2,alpha=.75)
plt.ylabel('Observed error from predictions',fontsize=18)
plt.xlabel('Time',fontsize=18)
plt.title('Property crimes plummet below expectations after 2013',fontsize=24)









    Out[122]:





<matplotlib.text.Text at 0x619091d0>



In [83]:

    
bar_df = pd.DataFrame()
bar_df['Predictions'] = property_pred['Predictions'].ix['2013-1-1':'2013-12-31'].groupby(lambda x:x.month).agg(np.sum)
bar_df['Observations'] = ts['Property'].ix['2013-1-1':'2013-12-31'].groupby(lambda x:x.month).agg(np.sum)

ax = bar_df.plot(kind='bar')
plt.xticks(plt.xticks()[0],['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],fontsize=15,rotation=0)
plt.title('2013 property crime estimates',fontsize=24)
plt.ylabel('Number of crimes',fontsize=18)
plt.xlabel('Month',fontsize=18)
ax.legend(fontsize=15,ncol=2,loc='upper center')
plt.ylim((0,14000))
autolabel(ax.patches)


print "The model predicted {0} property crimes in 2013. {1} property crimes occurred in 2013.".format(round(bar_df['Predictions'].sum()), bar_df['Observations'].sum())









    



The model predicted 120326.0 property crimes in 2013. 112670 property crimes occurred in 2013.

PART 3: What does this model predict about crime in 2014?



In [74]:

    
bar_df = pd.DataFrame()
bar_df['Predictions'] = homicide_pred['Predictions'].ix['2014-1-1':].groupby(lambda x:x.month).agg(np.sum)
bar_df['Observations'] = ts['Homicide'].ix['2014-1-1':'2014-3-31'].groupby(lambda x:x.month).agg(np.sum)

ax = bar_df.plot(kind='bar')
plt.xticks(plt.xticks()[0],['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],fontsize=15,rotation=0)
plt.axvspan(xmin=3,xmax=13,color='k',alpha=.1)
plt.title('2014 homicide estimates',fontsize=24)
plt.ylabel('Number of homicides',fontsize=18)
plt.xlabel('Month',fontsize=18)
ax.legend(fontsize=15,ncol=2,loc='upper center')
plt.ylim((0,60))
autolabel(ax.patches)

print "The model predicts {0} homicides in 2014.".format(round(bar_df['Predictions'].sum()))









    



The model predicts 416.0 homicides in 2014.



In [79]:

    
bar_df = pd.DataFrame()
bar_df['Predictions'] = personal_pred['Predictions'].ix['2014-1-1':].groupby(lambda x:x.month).agg(np.sum)
bar_df['Observations'] = ts['Personal'].ix['2014-1-1':'2014-3-31'].groupby(lambda x:x.month).agg(np.sum)

ax = bar_df.plot(kind='bar')
plt.xticks(plt.xticks()[0],['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],fontsize=15,rotation=0)
plt.axvspan(xmin=3,xmax=13,color='k',alpha=.1)
plt.title('2014 personal crime estimates',fontsize=24)
plt.ylabel('Number of crimes',fontsize=18)
plt.xlabel('Month',fontsize=18)
ax.legend(fontsize=15,ncol=2,loc='upper center')
plt.ylim((0,10000))
autolabel(ax.patches)

print "The model predicts {0} personal crimes in 2014.".format(round(bar_df['Predictions'].sum()))









    



The model predicts 71235.0 personal crimes in 2014.



In [85]:

    
bar_df = pd.DataFrame()
bar_df['Predictions'] = property_pred['Predictions'].ix['2014-1-1':].groupby(lambda x:x.month).agg(np.sum)
bar_df['Observations'] = ts['Property'].ix['2014-1-1':'2014-3-31'].groupby(lambda x:x.month).agg(np.sum)

ax = bar_df.plot(kind='bar')
plt.xticks(plt.xticks()[0],['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],fontsize=15,rotation=0)
plt.axvspan(xmin=3,xmax=13,color='k',alpha=.1)
plt.title('2014 property crime estimates',fontsize=24)
plt.ylabel('Number of crimes',fontsize=18)
plt.xlabel('Month',fontsize=18)
ax.legend(fontsize=15,ncol=2,loc='upper center')
plt.ylim((0,14000))
autolabel(ax.patches)


print "The model predicts {0} property crimes in 2014.".format(round(bar_df['Predictions'].sum()))









    



The model predicts 115874.0 property crimes in 2014.

PART 4: Is CPD cooking the books?

First, examine what specific crimes are driving this trend. Second, examine the precincts with abnormally lower crime rates.



In [128]:

    
property_crimes_ts = ['Arson','Burglary','Vehicle_theft','Robbery','Theft']

ax = ts[property_crimes_ts].resample('M').plot(lw=4)
ax.set_yscale('log')
ax.grid(False,which='minor')
ax.legend(loc='upper center',ncol=5,fontsize=14)
ax.set_ylabel('Number of crimes',fontsize=18)
ax.set_xlabel('Time',fontsize=18)









    Out[128]:





<matplotlib.text.Text at 0x88fda438>



In [131]:

    
lm_x = pd.DataFrame({'Date':pd.date_range('2001-01-01','2014-12-31')})
lm_x['Temperature'] = sinfunc(arange(len(lm_x)),*avg_params)
lm_x['Month'] = lm_x['Date'].apply(lambda x:x.month)
lm_x['Week'] = lm_x['Date'].apply(lambda x:x.week)
lm_x['Weekday'] = lm_x['Date'].apply(lambda x:x.weekday())
lm_x['Time'] = arange(len(lm_x))
lm_x = lm_x.set_index('Date')

lm_arson = smf.ols('Arson ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()
lm_burglary = smf.ols('Burglary ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()
lm_robbery = smf.ols('Robbery ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()
lm_theft = smf.ols('Theft ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()
lm_vehicle_theft = smf.ols('Vehicle_theft ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()

arson_std,lower,upper = wls_prediction_std(lm_arson)
burglary_std,lower,upper = wls_prediction_std(lm_burglary)
robbery_std,lower,upper = wls_prediction_std(lm_robbery)
theft_std,lower,upper = wls_prediction_std(lm_theft)
vehicle_theft_std,lower,upper = wls_prediction_std(lm_vehicle_theft)

start_date = '2001-01-01'
property_pred2 = pd.DataFrame(index = pd.date_range(start_date,'2014-12-31'))
property_pred2['Arson_prediction'] = lm_arson.predict(lm_x[start_date:])
property_pred2['Burglary_prediction'] = lm_burglary.predict(lm_x[start_date:])
property_pred2['Robbery_prediction'] = lm_robbery.predict(lm_x[start_date:])
property_pred2['Theft_prediction'] = lm_theft.predict(lm_x[start_date:])
property_pred2['Vehicle_theft_prediction'] = lm_vehicle_theft.predict(lm_x[start_date:])

property_pred2['Arson_observation'] = ts['Arson'].ix[start_date:]
property_pred2['Burglary_observation'] = ts['Burglary'].ix[start_date:]
property_pred2['Robbery_observation'] = ts['Robbery'].ix[start_date:]
property_pred2['Theft_observation'] = ts['Theft'].ix[start_date:]
property_pred2['Vehicle_theft_observation'] = ts['Vehicle_theft'].ix[start_date:]

property_pred2['Arson_error'] = property_pred2['Arson_observation'] - property_pred2['Arson_prediction']
property_pred2['Burglary_error'] = property_pred2['Burglary_observation'] - property_pred2['Burglary_prediction']
property_pred2['Robbery_error'] = property_pred2['Robbery_observation'] - property_pred2['Robbery_prediction']
property_pred2['Theft_error'] = property_pred2['Theft_observation'] - property_pred2['Theft_prediction']
property_pred2['Vehicle_theft_error'] = property_pred2['Vehicle_theft_observation'] - property_pred2['Vehicle_theft_prediction']



In [141]:

    
ax = property_pred2[['Arson_error','Burglary_error','Robbery_error','Theft_error','Vehicle_theft_error']].ix['2010-1-1':'2014-3-31'].resample('M',np.sum).plot(lw=4)
#ax.set_ylim((-40,40))
ax.set_ylabel('Observed error from predictions',fontsize=18)
ax.set_xlabel('Time',fontsize=18)
ax.set_title('Most property crimes below predictions since Q2 2013',fontsize=24)
ax.legend(['Arson','Burglary','Robbery','Theft','Vehicle theft'],fontsize=12,ncol=5,loc='upper center')









    Out[141]:





<matplotlib.legend.Legend at 0x1f181dd8>



In [375]:

    
personal_crimes_ts = ['Assault','Battery','Sexual_assault','Homicide']

ax = ts[personal_crimes_ts].resample('M',np.sum).plot(lw=4)
ax.set_yscale('log')
ax.grid(False,which='minor')
ax.legend(loc='upper center',ncol=5,fontsize=14)
ax.set_ylabel('Number of crimes',fontsize=18)
ax.set_xlabel('Time',fontsize=18)
ax.set_ylim((10e0,10e4))









    Out[375]:





(10.0, 100000.0)



In [378]:

    
lm_x = pd.DataFrame({'Date':pd.date_range('2001-01-01','2014-12-31')})
lm_x['Temperature'] = sinfunc(arange(len(lm_x)),*avg_params)
lm_x['Month'] = lm_x['Date'].apply(lambda x:x.month)
lm_x['Week'] = lm_x['Date'].apply(lambda x:x.week)
lm_x['Weekday'] = lm_x['Date'].apply(lambda x:x.weekday())
lm_x['Time'] = arange(len(lm_x))
lm_x = lm_x.set_index('Date')

lm_assault = smf.ols('Assault ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()
lm_battery = smf.ols('Battery ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()
lm_sexual_assault = smf.ols('Sexual_assault ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()
lm_homicide = smf.ols('Homicide ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()

start_date = '2001-01-01'
personal_pred2 = pd.DataFrame(index = pd.date_range(start_date,'2014-12-31'))
personal_pred2['Assault_prediction'] = lm_assault.predict(lm_x[start_date:])
personal_pred2['Battery_prediction'] = lm_battery.predict(lm_x[start_date:])
personal_pred2['Sexual_assault_prediction'] = lm_sexual_assault.predict(lm_x[start_date:])
personal_pred2['Homicide_prediction'] = np.floor(lm_homicide.predict(lm_x[start_date:])) # Minority Effect red ball

personal_pred2['Assault_observation'] = ts['Assault'].ix[start_date:]
personal_pred2['Battery_observation'] = ts['Battery'].ix[start_date:]
personal_pred2['Sexual_assault_observation'] = ts['Sexual_assault'].ix[start_date:]
personal_pred2['Homicide_observation'] = ts['Homicide'].ix[start_date:]

personal_pred2['Assault_error'] = personal_pred2['Assault_observation'] - personal_pred2['Assault_prediction']
personal_pred2['Battery_error'] = personal_pred2['Battery_observation'] - personal_pred2['Battery_prediction']
personal_pred2['Sexual_assault_error'] = personal_pred2['Sexual_assault_observation'] - personal_pred2['Sexual_assault_prediction']
personal_pred2['Homicide_error'] = personal_pred2['Homicide_observation'] - personal_pred2['Homicide_prediction']



In [379]:

    
ax = personal_pred2[['Assault_error','Battery_error','Sexual_assault_error','Homicide_error']].ix['2010-1-1':'2014-3-31'].resample('M',np.sum).plot(lw=4)
#ax.set_ylim((-40,40))
ax.set_ylabel('Observed error from predictions',fontsize=18)
ax.set_xlabel('Time',fontsize=18)
ax.set_title('Most personal crimes below predictions since Q2 2013',fontsize=24)
ax.legend(['Assault','Battery','Sexual assault','Homicide'],fontsize=12,ncol=5,loc='upper center')









    Out[379]:





<matplotlib.legend.Legend at 0x11fc68748>

Patterns by police district

Here we can replicate the claims Bernstein and Isaackson report about the



In [256]:

    
# From: http://gis.chicagopolice.org/pdfs/district_beat.pdf
Image(url='http://i.imgur.com/KMMkatm.png')









    Out[256]:



In [365]:

    
homicide_df = crime_df[crime_df['Primary Type'] == "HOMICIDE"]

homicide_gb_district_df = pd.DataFrame()

for year in range(2003,2014):
    homicide_gb_district_df[year] = homicide_df[homicide_df['Year'] == year].groupby('District')['ID'].agg(len).ix[:25]

district_homicide_change_df = pd.DataFrame()
district_homicide_change_df[u'2010\N{EN DASH}2012 change'] = (homicide_gb_district_df[2012]-homicide_gb_district_df[2010])/homicide_gb_district_df[2010]*100
district_homicide_change_df['2013 change'] = (homicide_gb_district_df[2013]-homicide_gb_district_df[2012])/homicide_gb_district_df[2012]*100
districts = [int(i) for i in list(homicide_gb_district_df.T.columns)]
ax = district_homicide_change_df.plot(kind='bar')
ax.set_xticklabels(districts,rotation=0)
ax.legend(ncol=3,fontsize=15,loc='upper center')
ax.set_ylim((-100,500))

ax.set_ylabel("Percentage change",fontsize=18)
ax.set_xlabel("CPD District",fontsize=18)
ax.set_title("Homicide rates across districts fall\nin 2013 despite significant increases in recent years",fontsize=24)









    Out[365]:





<matplotlib.text.Text at 0x11dc19e48>



In [278]:

    
property_gb_district_df = pd.DataFrame()

for year in range(2003,2014):
    property_gb_district_df[year] = crime_df_gb_property[crime_df_gb_property['Year'] == year].groupby('District')['ID'].agg(len).ix[:25]

district_property_change_df = pd.DataFrame()
district_property_change_df[u'2010\N{EN DASH}2012 change'] = (property_gb_district_df[2012]-property_gb_district_df[2010])/property_gb_district_df[2010]*100
district_property_change_df['2013 change'] = (property_gb_district_df[2013]-property_gb_district_df[2012])/property_gb_district_df[2012]*100
districts = [int(i) for i in list(property_gb_district_df.T.columns)]
ax = district_property_change_df.plot(kind='bar')
ax.set_xticklabels(districts,rotation=0)
ax.legend(ncol=3,fontsize=15,loc='upper center')
ax.set_ylim((-30,30))

ax.set_ylabel("Percentage change",fontsize=18)
ax.set_xlabel("CPD District",fontsize=18)
ax.set_title("Property crime rates across districts fall faster\nin 2013 than previous 3 years combined",fontsize=24)









    Out[278]:





<matplotlib.text.Text at 0xd5006390>



In [280]:

    
personal_gb_district_df = pd.DataFrame()

for year in range(2003,2014):
    personal_gb_district_df[year] = crime_df_gb_personal[crime_df_gb_personal['Year'] == year].groupby('District')['ID'].agg(len).ix[:25]

district_personal_change_df = pd.DataFrame()
district_personal_change_df[u'2010\N{EN DASH}2012 change'] = (personal_gb_district_df[2012]-personal_gb_district_df[2010])/personal_gb_district_df[2010]*100
district_personal_change_df['2013 change'] = (personal_gb_district_df[2013]-personal_gb_district_df[2012])/personal_gb_district_df[2012]*100
districts = [int(i) for i in list(district_personal_change_df.T.columns)]
ax = district_personal_change_df.plot(kind='bar')
ax.set_xticklabels(districts,rotation=0)
ax.legend(ncol=3,fontsize=15,loc='upper center')
ax.set_ylim((-30,20))

ax.set_ylabel("Percentage change",fontsize=18)
ax.set_xlabel("CPD District",fontsize=18)
ax.set_title("Personal crime rates across districts fall faster\nin 2013 than previous 3 years combined",fontsize=24)









    Out[280]:





<matplotlib.text.Text at 0xd53d6198>



In [368]:

    
(homicide_gb_district_df.T.pct_change().ix[2004:2012].dropna(axis=1)*100).plot(xlim=(-200,200),kind='kde',colormap='jet',alpha=.75,lw=2)
for i,district in enumerate(districts):
    plt.axvline(x=homicide_gb_district_df.T.pct_change().ix[2013][district]*100,c='k')
    
plt.legend(districts,title='District',ncol=2,fontsize=12)
plt.ylabel('Count of changes',fontsize=18)
plt.xlabel('Year-over-year percentage change in crime',fontsize=18)
plt.title('Homicide changes spread across district distributions',fontsize=24)









    Out[368]:





<matplotlib.text.Text at 0x11e7869e8>



In [358]:

    
(property_gb_district_df.T.pct_change().ix[2004:2012].dropna(axis=1)*100).plot(kind='kde',colormap='jet',alpha=.75,lw=2)
for i,district in enumerate(districts):
    plt.axvline(x=property_gb_district_df.T.pct_change().ix[2013][district]*100,c='k')
    
plt.legend(districts,title='District',ncol=2,fontsize=12)
plt.ylabel('Count of changes',fontsize=18)
plt.xlabel('Year-over-year percentage change in crime',fontsize=18)
plt.title('Decrease in 2013 property crime rates across districts\nis lower than normal',fontsize=24)









    Out[358]:





<matplotlib.text.Text at 0x11a8ddc88>



In [359]:

    
(personal_gb_district_df.T.pct_change().ix[2004:2012].dropna(axis=1)*100).plot(kind='kde',colormap='jet',alpha=.75,lw=2)
for i,district in enumerate(districts):
    plt.axvline(x=personal_gb_district_df.T.pct_change().ix[2013][district]*100,c='k')
    
plt.legend(districts,title='District',ncol=2,fontsize=12)
plt.ylabel('Count of changes',fontsize=18)
plt.xlabel('Year-over-year percentage change in crime',fontsize=18)
plt.title('Decrease in 2013 personal crime rates across districts\nis lower than normal',fontsize=24)









    Out[359]:





<matplotlib.text.Text at 0x11acb7080>



In [380]:

    
homicide_gb_district_df.T.pct_change().ix[2004:2012].dropna(axis=1).apply(np.std,axis=1).plot(kind='kde',colormap='jet',alpha=.75,lw=2)
plt.axvline(x=np.std(homicide_gb_district_df.T.pct_change().ix[2013]),c='k')
    
plt.ylabel('Count of changes',fontsize=18)
plt.xlabel('Spread of changes in property crime rates across districts',fontsize=18)
plt.title('The spread of changes in homicide rates\nis less than previous years',fontsize=24)









    Out[380]:





<matplotlib.text.Text at 0x1200e9780>



In [360]:

    
property_gb_district_df.T.pct_change().ix[2004:2012].dropna(axis=1).apply(np.std,axis=1).plot(kind='kde',colormap='jet',alpha=.75,lw=2)
plt.axvline(x=np.std(personal_gb_district_df.T.pct_change().ix[2013]),c='k')
    
plt.ylabel('Count of changes',fontsize=18)
plt.xlabel('Spread of changes in property crime rates across districts',fontsize=18)
plt.title('The spread of changes in property crime rates\nis less than previous years',fontsize=24)









    Out[360]:





<matplotlib.text.Text at 0x11bbafc88>



In [361]:

    
personal_gb_district_df.T.pct_change().ix[2004:2012].dropna(axis=1).apply(np.std,axis=1).plot(kind='kde',colormap='jet',alpha=.75,lw=2)
plt.axvline(x=np.std(personal_gb_district_df.T.pct_change().ix[2013]),c='k')
    
plt.ylabel('Count of changes',fontsize=18)
plt.xlabel('Spread of changes in personal crime rates across districts',fontsize=18)
plt.title('But the spread of changes in personal crime rates\ndoes not differ from previous years',fontsize=24)









    Out[361]:





<matplotlib.text.Text at 0x11befc1d0>

Miscellaneous

However extremely warm or cold temperatures can also inhibit these effects as people stay indoors and other temporal variables such as time of day, day of week, holidays, public school sessions, and even lunar cycles may also play a part in the frequency of violent and property crimes. Studying the frequency of crime and weather also invites questions of quality and quantity. Do people more regular crimes of assault escalate into homicide (quality), do some crimes become more prevalent (quantity), or a combination of both?



In [281]:

    
for district in districts:
    print district, stats.ttest_1samp(property_gb_district_df.T.pct_change().ix[2004:2012][district].values,property_gb_district_df.T.pct_change().ix[2013][district])









    



1 (array(2.49958818487028), 0.036965750610989556)
2 (array(1.9320619666217416), 0.089435604385737263)
3 (array(3.029995320845064), 0.016309532535531631)
4 (array(3.659662619335382), 0.0064043318278342387)
5 (array(2.988043547833051), 0.017385746993000887)
6 (array(2.1220541176474526), 0.066611874181757777)
7 (array(3.2008277917654024), 0.012596669894569134)
8 (array(2.5599321789598815), 0.033650521707074235)
9 (array(3.2233192017527212), 0.012178397593181151)
10 (array(3.6033030057254525), 0.0069488903957138351)
11 (array(4.500117179430549), 0.0020017821483030965)
12 (array(3.8137447956001562), 0.0051352002942768923)
14 (array(2.6812622289978223), 0.02787143113202083)
15 (array(3.3005706918430535), 0.010849615884006023)
16 (array(3.1951097761227194), 0.012705396548538738)
17 (array(3.192121931397842), 0.01276260142020974)
18 (array(3.558512695467989), 0.0074167840073363824)
19 (array(3.3744276527431034), 0.0097215779396347558)
20 (array(3.6846932467414577), 0.0061772362854679921)
22 (array(2.8041955455947374), 0.023048304001622125)
24 (array(2.3816270339460726), 0.044432698598261315)
25 (array(2.4612585369974416), 0.039241673447205083)



In [ ]:

	Max TemperatureF	Mean TemperatureF	Min TemperatureF	Max Dew PointF	MeanDew PointF	Min DewpointF	Max Humidity	Mean Humidity	Min Humidity	Max Sea Level PressureIn	Mean Sea Level PressureIn	Min Sea Level PressureIn	Max VisibilityMiles	Mean VisibilityMiles	Min VisibilityMiles	Max Wind SpeedMPH	Mean Wind SpeedMPH	Max Gust SpeedMPH	PrecipitationIn	CloudCover
2014-04-02	45	40	34	29	26	24	70	60	49	30.17	30.12	30.05	10	10	10	17	10	29	NaN	7	...
2014-04-03	43	39	35	36	32	28	89	75	60	30.04	29.86	29.67	10	6	0	29	14	35	0.53	8	...
2014-04-04	48	41	34	41	33	21	92	73	54	30.00	29.60	29.45	10	6	0	30	17	40	NaN	8	...
2014-04-05	51	41	31	25	22	17	69	49	29	30.20	30.13	30.02	10	10	10	13	6	16	0.00	3	...
2014-04-06	57	44	31	29	25	19	63	49	35	30.17	30.08	29.96	10	10	10	14	7	20	0.00	4	...

	ID	Case Number	Date	Block	IUCR	Primary Type	Description	Location Description	Arrest	Domestic	Beat	District	Ward	Community Area	FBI Code	X Coordinate	Y Coordinate	Year	Updated On	Latitude
0	9558116	HX209604	2014-04-03	026XX W LOGAN BLVD	0610	BURGLARY	FORCIBLE ENTRY	OTHER	False	False	1411	14	35	22	05	1158555	1917229	2014	04/06/2014 12:38:43 AM	41.928617	...
1	9558320	HX209887	2014-04-03	008XX N LONG AVE	0610	BURGLARY	FORCIBLE ENTRY	APARTMENT	False	False	1524	15	37	25	05	1140117	1905254	2014	04/06/2014 12:38:43 AM	41.896114	...
2	9557107	HX208940	2014-04-03	070XX S RACINE AVE	0495	BATTERY	AGGRAVATED OF A SENIOR CITIZEN	APARTMENT	True	True	734	7	17	67	04B	1169494	1858212	2014	04/06/2014 12:38:43 AM	41.766438	...
3	9557134	HX208947	2014-04-03	025XX N MILWAUKEE AVE	0810	THEFT	OVER $500	STREET	False	False	1414	14	35	22	06	1155439	1916718	2014	04/06/2014 12:38:43 AM	41.927278	...
4	9557154	HX208991	2014-04-03	070XX S MERRILL AVE	2820	OTHER OFFENSE	TELEPHONE THREAT	RESIDENCE	False	True	331	3	5	43	26	1191720	1858787	2014	04/05/2014 12:39:46 AM	41.767505	...

Heat and Violence in Chicago

Brian Keegan (@bkeegan), College of Social Sciences and Humanities, Northeastern University

Start your kernels!

Scrape data

Read data

Weather

All crime data

Group data by date and join crime and weather data together

PART 1: The dynamics of crime, time, and weather

Crime as a function of temperature

Crime as a function of time

As time of year and day together

For other types of crimes

Plots

Power spectrum

Temperature

Assaults

Homicides

Curve fitting

PART 2: A statistical model of crime

Assaults

Homicides

Personal crimes

Property crimes

PART 3: What does this model predict about crime in 2014?

PART 4: Is CPD cooking the books?

Patterns by police district

Miscellaneous