Can we attribute a fall in violent crime in Chicago to its new conceal and carry laws? This was the argument many conservative blogs and news outlets were making in early April after statistics were released showing a marked drop in the murder rate in Chicago. These articles attributed "Chicago's first-quarter murder total [hitting] its lowest number since 1958" to the deterrent effects of Chicago's concealed and carry permits being issued in late February (RedState). Other conservative outlets latched onto the news to argue the policy "is partly responsible for Chicago's across-the-board drop in the crime" (TheBlaze) or that the policy contributed to the "murder rate promptly [falling] to 1958 levels" (TownHall).
Several articles hedged about the causal direction of any relationship and pointed out that this change is hard to separate from falling general crime rates as well as the atrocious winter weather this season (PJMedia, Wonkette, HuffPo) The central claim here is whether the adoption of the conceal and carry policy in March 2014 contributed to significant changes in crime rates rather than other social, historical, or environmental factors.
However, an April 7 feature story by David Bernstein and Noah Isackson in Chicago magazine found substantial evidence of violent crimes like homicides, robberies, burglaries, and assaults being reclassified, downgraded to more minor crimes, and even closed as noncriminal incidents. They argue that after Police Superintendent Garry McCarthy arrived in May 2011, the drop in crime has improbably plummeted in spite of high unemployment and significant contraction in the Chicago Police Department's beat cops. An audit by Chicago's inspector general into these crime numbers suggests assaults and batteries may have been underreported by more than 24%. This raises a second question: can we attribute the fall in violent crime in Chicago to systematic underreporting of criminal statistics?
In this post, I do four things:
In [381]:
Image(filename='homicides_month_hour_heatmap.png')
Out[381]:
In [382]:
Image(filename='personal_model_comparison.png')
Out[382]:
In [383]:
Image(filename='2014_homicide_predictions.png')
Out[383]:
In [384]:
Image(filename='personal_crime_rates_down.png')
Out[384]:
In [271]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import seaborn as sb
from collections import Counter
from IPython.core.display import Image
import scipy.stats as stats
This will scrape historical weather data from Weather Underground between 2001 and 2014 and save each year's data as a CSV files.
In [ ]:
for i in range(2001,2015):
with open('{0}.csv'.format(str(i)),'wb') as f:
data = urllib2.urlopen('http://www.wunderground.com/history/airport/ORD/{0}/1/1/CustomHistory.html?dayend=1&monthend=1&yearend={1}&format=1'.format(str(i),str(i+1))).read()
f.write(data)
Next we want to combine each year's CSV file into a master CSV file for all years. We do this by loading each CSV and saving it into the weather
list of DataFrames and then using pd.concat
to combine them all together. Then we save the resulting data as weather.csv
to load up later.
In [ ]:
weather = list()
for i in range(2001,2015):
weather.append(pd.read_csv('{0}.csv'.format(str(i)),skiprows=1))
weather_df = pd.concat(weather)
weather_df.set_index('CST',inplace=True)
weather_df.index = pd.to_datetime(weather_df.index,unit='D')
weather_df.to_csv('weather.csv')
Read in weather.csv
and clean up the data so that it's indexed by a proper DateTimeIndex
and junk in the rainfall data is removed (in case we wanted to use it later).
In [2]:
weather_df = pd.read_csv('weather.csv',low_memory=False,index_col=0)
weather_df.index = pd.to_datetime(weather_df.index,unit='D')
weather_df['PrecipitationIn'] = weather_df['PrecipitationIn'].replace('T',np.nan)
weather_df['PrecipitationIn'] = weather_df['PrecipitationIn'].dropna().astype(float)
weather_df.tail()
Out[2]:
List out all the different variables in the weather data.
In [3]:
weather_df.columns
Out[3]:
In [4]:
crime_df = pd.read_csv('all_crime.csv')
crime_df['Datetime'] = pd.to_datetime(crime_df['Date'],format="%m/%d/%Y %I:%M:%S %p")
crime_df['Date'] = crime_df['Datetime'].apply(lambda x:x.date())
crime_df['Weekday'] = crime_df['Datetime'].apply(lambda x:x.weekday())
crime_df['Hour'] = crime_df['Datetime'].apply(lambda x:x.hour)
crime_df['Day'] = crime_df['Datetime'].apply(lambda x:x.day)
crime_df['Week'] = crime_df['Datetime'].apply(lambda x:x.week)
crime_df['Month'] = crime_df['Datetime'].apply(lambda x:x.month)
crime_df.head()
Out[4]:
In [5]:
dict(Counter(crime_df['Primary Type']))
Out[5]:
In [10]:
personal_crimes = ['ASSAULT','BATTERY','CRIM SEXUAL ASSAULT','HOMICIDE']
property_crimes = ['ARSON','BURGLARY','MOTOR VEHICLE THEFT','ROBBERY','THEFT']
Join the temperature and crime data together based on their sharing a common DateTimeIndex
.
In [19]:
arson_gb = crime_df[crime_df['Primary Type'] == 'ARSON'].groupby('Date')['ID'].agg(len)
assault_gb = crime_df[crime_df['Primary Type'] == 'ASSAULT'].groupby('Date')['ID'].agg(len)
battery_gb = crime_df[crime_df['Primary Type'] == 'BATTERY'].groupby('Date')['ID'].agg(len)
burglary_gb = crime_df[crime_df['Primary Type'] == 'BURGLARY'].groupby('Date')['ID'].agg(len)
homicide_gb = crime_df[crime_df['Primary Type'] == 'HOMICIDE'].groupby('Date')['ID'].agg(len)
sexual_assault_gb = crime_df[crime_df['Primary Type'] == 'CRIM SEXUAL ASSAULT'].groupby('Date')['ID'].agg(len)
robbery_gb = crime_df[crime_df['Primary Type'] == 'ROBBERY'].groupby('Date')['ID'].agg(len)
theft_gb = crime_df[crime_df['Primary Type'] == 'THEFT'].groupby('Date')['ID'].agg(len)
vehicle_theft_gb = crime_df[crime_df['Primary Type'] == 'MOTOR VEHICLE THEFT'].groupby('Date')['ID'].agg(len)
personal_gb = crime_df[crime_df['Primary Type'].isin(personal_crimes)].groupby('Date')['ID'].agg(len)
property_gb = crime_df[crime_df['Primary Type'].isin(property_crimes)].groupby('Date')['ID'].agg(len)
arson_gb.index = pd.to_datetime(arson_gb.index,unit='D')
assault_gb.index = pd.to_datetime(assault_gb.index,unit='D')
battery_gb.index = pd.to_datetime(battery_gb.index,unit='D')
burglary_gb.index = pd.to_datetime(burglary_gb.index,unit='D')
homicide_gb.index = pd.to_datetime(homicide_gb.index,unit='D')
sexual_assault_gb.index = pd.to_datetime(sexual_assault_gb.index,unit='D')
robbery_gb.index = pd.to_datetime(robbery_gb.index,unit='D')
theft_gb.index = pd.to_datetime(theft_gb.index,unit='D')
vehicle_theft_gb.index = pd.to_datetime(vehicle_theft_gb.index,unit='D')
personal_gb.index = pd.to_datetime(personal_gb.index,unit='D')
property_gb.index = pd.to_datetime(property_gb.index,unit='D')
ts = pd.DataFrame({'Arson':arson_gb.ix[:'2014-3-31'],
'Assault':assault_gb.ix[:'2014-3-31'],
'Battery':battery_gb.ix[:'2014-3-31'],
'Burglary':burglary_gb.ix[:'2014-3-31'],
'Homicide':homicide_gb.ix[:'2014-3-31'],
'Sexual_assault':sexual_assault_gb.ix[:'2014-3-31'],
'Robbery':robbery_gb.ix[:'2014-3-31'],
'Vehicle_theft':vehicle_theft_gb.ix[:'2014-3-31'],
'Theft':theft_gb.ix[:'2014-3-31'],
'Personal':personal_gb.ix[:'2014-3-31'],
'Property':property_gb.ix[:'2014-3-31'],
'Temperature':weather_df['Mean TemperatureF'].ix[:'2014-3-31'],
'Binned temperature':weather_df['Mean TemperatureF'].ix[:'2014-3-31']//10.*10,
'Humidity':weather_df[' Mean Humidity'].ix[:'2014-3-31'],
'Precipitation':weather_df['PrecipitationIn'].ix[:'2014-3-31']
})
ts['Time'] = range((max(ts.index)-min(ts.index)).days+1)
ts.reset_index(inplace=True)
ts.set_index('index',drop=False,inplace=True)
ts['Weekday'] = ts['index'].apply(lambda x:x.weekday())
ts['Hour'] = ts['index'].apply(lambda x:x.hour)
ts['Week'] = ts['index'].apply(lambda x:x.week)
ts['Month'] = ts['index'].apply(lambda x:x.month)
ts['Year'] = ts['index'].apply(lambda x:x.year)
ts['Weekend'] = ts['Weekday'].isin([5,6]).astype(int)
Define a helper function that we'll be using later on.
In [20]:
# adapted from http://matplotlib.org/examples/api/barchart_demo.html
def autolabel(rects):
max_height = max([rect.get_height() for rect in rects if hasattr(rect,'get_height') and not np.isnan(rect.get_height())])
for rect in rects:
if hasattr(rect,'get_height'):
height = rect.get_height()
if not np.isnan(height):
ax.text(rect.get_x()+rect.get_width()/2., height-.05*max_height, '%d'%int(height),
ha='center', va='bottom',color='w')
Plot the occurrence of crimes over time. There are several apparent features:
Many crimes have very strong annual seasonality: rates are higher in the summer months and lower in the winter months.
There's a decreasing trend in many types of crimes over time.
I've marked March 1, 2014 with a vertical black dotted line to indicate when the gun policy went into effect.
In [25]:
figsize(12,6)
ts2 = pd.DataFrame({'Arson':arson_gb/float(arson_gb.max()),
'Assault':assault_gb/float(assault_gb.max()),
'Battery':battery_gb/float(battery_gb.max()),
'Burglary':burglary_gb/float(burglary_gb.max()),
'Homicide':homicide_gb/float(homicide_gb.max()),
'Sexual assault':sexual_assault_gb/float(sexual_assault_gb.max()),
'Robbery':robbery_gb/float(robbery_gb.max()),
'Theft':theft_gb/float(theft_gb.max()),})
ax = ts2.resample('M').plot(lw=4,alpha=.75,colormap='hsv')
ax.set_ylabel('Incidents (normalized)',fontsize=18)
#ax.right_ax.set_ylabel('Temperature (F)',fontsize=18)
ax.set_xlabel('Time',fontsize=18)
ax.set_ylim((0,1))
#ax.set_yscale('log')
ax.grid(False,which='minor')
plt.axvline('2014-3-1',c='k',ls='--')
ax.legend(loc='upper center',ncol=4)
Out[25]:
We can plot the strength of the correlations between these crime statistics and some other variables as well. We add in Temperature
,Humidity
, and Precipitation
. Redder colors are stronger correlations, bluer colors are weaker or negative correlations. The strongest correlation we find (darkest red) is between Battery and Assault: when batteries are high assaults are also high. The weakest correlation we find is between Humidity and Temperature: when temperature is high, humidity is low. Note that I've set the diagonal to zero (Arson is perfectly correlated with Arson).
In [28]:
figsize(8,6)
#ts.corr().columns
a = np.array(ts[['Arson','Burglary','Robbery','Theft','Assault','Battery','Homicide','Sexual_assault','Temperature','Humidity','Precipitation']].resample('M').corr())
np.fill_diagonal(a,0)
plt.pcolor(a,cmap='RdBu_r',edgecolors='k')
plt.xlim((0,11))
plt.ylim((0,11))
plt.xticks(arange(.5,11.5),['Arson','Burglary','Robbery','Theft','Assault','Battery','Homicide','Sexual assault','Temperature','Humidity','Precipitation'],rotation=90,fontsize=15)
plt.yticks(arange(.5,11.5),['Arson','Burglary','Robbery','Theft','Assault','Battery','Homicide','Sexual assault','Temperature','Humidity','Precipitation'],fontsize=15)
plt.title('Correlation between crime occurences',fontsize=20)
plt.colorbar()
plt.grid(b=True,which='major',alpha=.5)
In [29]:
figsize(12,6)
ts3 = pd.DataFrame({'Personal':personal_gb/float(personal_gb.max()),
'Property':property_gb/float(property_gb.max())
})
ax = ts3.resample('M').plot(lw=4,alpha=.75,colormap='jet')
ax.set_ylabel('Incidents (normalized)',fontsize=18)
#ax.right_ax.set_ylabel('Temperature (F)',fontsize=18)
ax.set_xlabel('Time',fontsize=18)
ax.set_ylim((0,1))
#ax.set_yscale('log')
ax.grid(False,which='minor')
ax.legend(loc='upper center',ncol=4,fontsize=18)
Out[29]:
Next we explore the correlation between temperature and crime above. The figure below plots the
In [393]:
ax = ts.boxplot(['Personal','Property'],by='Binned temperature')
ax[0].set_ylabel('Number of crimes',fontsize=18)
ax[0].set_xlabel('Temperature (F)',fontsize=18)
ax[0].set_title('Personal crimes',fontsize=15)
ax[1].set_xlabel('Temperature (F)',fontsize=18)
ax[1].set_title('Property crimes',fontsize=15)
plt.suptitle('Crime increases with temperature',fontsize=20)
#plt.xticks(plt.xticks()[0],arange(-10,100,10),fontsize=15)
Out[393]:
We can also examine the relationship between Temperature
, Humidity
, and number of crimes. First, we extract the total number of crimes for each observed combination of Temperature
and Humidity
and store this in array1
. Then we extract the total observations of Temperature
and Humidity
and store this in array2
. The latter is to normalize the former in the event there are some combinations of temperature occur frequently and result in us overcounting the crime statistics.
Moving from bottom to top we can see the effect we saw above that temperature increases the frequency of crimes (bluer is less crime, redder is more crime). Moving from left to right we see crime doesn't vary substantially as function of humidity, with the exception that very high levels of humidity (above 60%) might have higher rates of crime
In [31]:
var = 'Personal'
ct1 = ts.groupby(['Temperature','Humidity'])[var].agg(np.sum).reset_index()
array1 = np.array(pd.pivot_table(ct1,values=var,rows='Temperature',cols='Humidity').fillna(0))
ct2 = ts.groupby(['Temperature','Humidity'])[var].agg(len).reset_index()
array2 = np.array(pd.pivot_table(ct2,values=var,rows='Temperature',cols='Humidity').fillna(0))
normalized_crime = array1/array2
plt.imshow(normalized_crime,cmap='RdBu_r',origin='lower',label='Count')
#plt.legend(loc='upper right',bbox_to_anchor=(1,.5))
plt.xlabel('Humidity',fontsize=18)
plt.ylabel('Temperature',fontsize=18)
plt.title('{0} crimes by temperature and humidity'.format(var),fontsize=24)
plt.colorbar()
Out[31]:
In [32]:
var = 'Property'
ct1 = ts.groupby(['Temperature','Humidity'])[var].agg(np.sum).reset_index()
array1 = np.array(pd.pivot_table(ct1,values=var,rows='Temperature',cols='Humidity').fillna(0))
ct2 = ts.groupby(['Temperature','Humidity'])[var].agg(len).reset_index()
array2 = np.array(pd.pivot_table(ct2,values=var,rows='Temperature',cols='Humidity').fillna(0))
normalized_crime = array1/array2
plt.imshow(normalized_crime,cmap='RdBu_r',origin='lower',label='Count')
#plt.legend(loc='upper right',bbox_to_anchor=(1,.5))
plt.xlabel('Humidity',fontsize=18)
plt.ylabel('Temperature',fontsize=18)
plt.title('{0} crimes by temperature and humidity'.format(var),fontsize=24)
plt.colorbar()
Out[32]:
We can also examine Assaults as a function of time -- we saw that they peak when temperatures are high, so they should likewise peak in the summer months.
In [33]:
crime_df_gb_personal = crime_df[crime_df['Primary Type'].isin(personal_crimes)]
crime_df_gb_property = crime_df[crime_df['Primary Type'].isin(property_crimes)]
In [34]:
bar_df = pd.DataFrame()
bar_df['Personal'] = crime_df_gb_personal.groupby('Month')['ID'].agg(len)/len(crime_df_gb_personal)*100
bar_df['Property'] = crime_df_gb_property.groupby('Month')['ID'].agg(len)/len(crime_df_gb_property)*100
ax = bar_df.plot(kind='bar')
plt.title('Crimes peak in summer',fontsize=20)
plt.xlabel('Month',fontsize=18)
plt.xticks(plt.xticks()[0],['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],fontsize=15,rotation=0)
plt.ylabel('Percentage of crimes',fontsize=18)
plt.yticks(fontsize=15)
plt.ylim((0,12))
plt.legend(fontsize=15,ncol=2,loc='upper center')
autolabel(ax.patches)
We can also examine what time of day assaults occurred. We see assaults peak between 2 and 3pm and crater around 5am.
In [35]:
bar_df = pd.DataFrame()
bar_df['Personal'] = crime_df_gb_personal.groupby('Hour')['ID'].agg(len)/len(crime_df_gb_personal)*100
bar_df['Property'] = crime_df_gb_property.groupby('Hour')['ID'].agg(len)/len(crime_df_gb_property)*100
ax = bar_df.plot(kind='bar')
plt.title('Crimes peak in afternoon',fontsize=20)
plt.xlabel('Hour of day',fontsize=18)
plt.xticks(fontsize=15,rotation=0)
plt.ylabel('Percentage of crimes',fontsize=18)
plt.yticks(fontsize=15)
plt.legend(fontsize=15,ncol=2,loc='upper center')
autolabel(ax.patches)
In [36]:
bar_df = pd.DataFrame()
bar_df['Personal'] = crime_df_gb_personal.groupby('Weekday')['ID'].agg(len)/len(crime_df_gb_personal)*100
bar_df['Property'] = crime_df_gb_property.groupby('Weekday')['ID'].agg(len)/len(crime_df_gb_property)*100
ax = bar_df.plot(kind='bar')
plt.title('Property crimes fall,\npersonal crimes rise over weekends',fontsize=20)
plt.xlabel('Day of week',fontsize=18)
plt.xticks(plt.xticks()[0],['Mon','Tue','Wed','Thu','Fri','Sat','Sun'],fontsize=15,rotation=0)
plt.ylabel('Percentage of crimes',fontsize=18)
plt.yticks(fontsize=15)
plt.ylim((0,18))
plt.legend(fontsize=15,ncol=2,loc='upper center')
autolabel(ax.patches)
We can combine the month of the year and hour of the day analyses to create a "fingerprint" of different criminal patterns. Plotting the distribution for assaults, we see that assaults peak between February and April around 3pm and in May through July, this shifts back to about 8pm, before returning back to about 3pm in August.
In [37]:
ct = crime_df[crime_df['Primary Type'] == 'ASSAULT'].groupby(['Month','Hour'])['ID'].agg(len).reset_index()
plt.imshow(np.array(pd.pivot_table(ct,values='ID',rows='Month',cols='Hour')),cmap='RdBu_r',label='Count')
#plt.legend(loc='upper right',bbox_to_anchor=(1,.5))
plt.xlabel('Hour of day',fontsize=18)
plt.ylabel('Month of year',fontsize=18)
plt.title('Assault by time of day and year',fontsize=24)
plt.colorbar()
Out[37]:
We can make a similar "fingerprint" for the hour of the day against the day of the week. We see that weekdays have a distinct peak around 3pm while weekends have a less-distinct peak but start and end later in the evening.
In [38]:
ct = crime_df[crime_df['Primary Type'] == 'ASSAULT'].groupby(['Weekday','Hour'])['ID'].agg(len).reset_index()
plt.imshow(np.array(pd.pivot_table(ct,values='ID',rows='Weekday',cols='Hour')),cmap='RdBu_r',label='Count')
#plt.legend(loc='upper right',bbox_to_anchor=(1,.5))
plt.xlabel('Hour of day',fontsize=18)
plt.ylabel('Day of week',fontsize=18)
plt.title('Assault by time of day and week',fontsize=24)
plt.yticks(arange(0,7),['Mon','Tue','Wed','Thu','Fri','Sat','Sun'])
plt.colorbar()
Out[38]:
Batteries have a similar pattern of shifting from the afternoon in the spring, to the late evening in the summer, and back to the afternoon in the fall.
In [39]:
ct = crime_df[crime_df['Primary Type'] == 'BATTERY'].groupby(['Month','Hour'])['ID'].agg(len).reset_index()
plt.imshow(np.array(pd.pivot_table(ct,values='ID',rows='Month',cols='Hour')),cmap='RdBu_r',label='Count')
#plt.legend(loc='upper right',bbox_to_anchor=(1,.5))
plt.xlabel('Hour of day',fontsize=18)
plt.ylabel('Month of year',fontsize=18)
plt.title('Battery by time of day and year',fontsize=24)
plt.colorbar()
Out[39]:
Some strange going on with reporting of sexual assaults since they peak at midnight and Jan 1.
In [45]:
ct = crime_df[crime_df['Primary Type'] == 'SEX OFFENSE'].groupby(['Month','Hour'])['ID'].agg(len).reset_index()
plt.imshow(np.array(pd.pivot_table(ct,values='ID',rows='Month',cols='Hour')),cmap='RdBu_r',label='Count')
#plt.legend(loc='upper right',bbox_to_anchor=(1,.5))
plt.xlabel('Hour of day',fontsize=18)
plt.ylabel('Month of year',fontsize=18)
plt.title('Sexual assault by time of day and year',fontsize=24)
plt.colorbar()
Out[45]:
Robberies show a more extreme pattern of variation in time of day over the course of the year. Robberies move later into the night as spring becomes summer and then move earlier into the evening as summer becomes fall and winter.
In [40]:
ct = crime_df[crime_df['Primary Type'] == 'ROBBERY'].groupby(['Month','Hour'])['ID'].agg(len).reset_index()
plt.imshow(np.array(pd.pivot_table(ct,values='ID',rows='Month',cols='Hour')),cmap='RdBu_r',label='Count')
#plt.legend(loc='upper right',bbox_to_anchor=(1,.5))
plt.xlabel('Hour of day',fontsize=18)
plt.ylabel('Month of year',fontsize=18)
plt.title('Robbery by time of day and year',fontsize=24)
plt.colorbar()
Out[40]:
In contrast, buglaries have a consistent pattern of being reported around 8am and 12pm, neither of which change through the year.
In [41]:
ct = crime_df[crime_df['Primary Type'] == 'BURGLARY'].groupby(['Month','Hour'])['ID'].agg(len).reset_index()
plt.imshow(np.array(pd.pivot_table(ct,values='ID',rows='Month',cols='Hour')),cmap='RdBu_r',label='Count')
#plt.legend(loc='upper right',bbox_to_anchor=(1,.5))
plt.xlabel('Hour of day',fontsize=18)
plt.ylabel('Month of year',fontsize=18)
plt.title('Burglary by time of day and year',fontsize=24)
plt.colorbar()
Out[41]:
Finally, we see that homicides are rare during the winter months and then peak in the late evening and early morning of the summer months.
In [42]:
ct = crime_df[crime_df['Primary Type'] == 'HOMICIDE'].groupby(['Month','Hour'])['ID'].agg(len).reset_index()
plt.imshow(np.array(pd.pivot_table(ct,values='ID',rows='Month',cols='Hour')),cmap='RdBu_r',label='Count')
#plt.legend(loc='upper right',bbox_to_anchor=(1,.5))
plt.xlabel('Hour of day',fontsize=18)
plt.ylabel('Month of year',fontsize=18)
plt.title('Homicides by time of day and year',fontsize=24)
plt.colorbar()
Out[42]:
The plots above show the different ways in which temperature and crime are correlated in time and show strong seasonal effects. We can also plot Temperature
against Assaults
directly. The correlation is very apparent, but there's also a clear decreasing trend over time. This suggests that any changes in crime rates need to account for both seasonal effects as well as longer-term historical patterns that are reducing crime year-to-year.
In [51]:
ax = ts[['Temperature','Assault']].resample('M').plot(secondary_y=['Assault'],lw=4)
ax.set_ylabel(u'Temperature (\N{DEGREE SIGN}F)',fontsize=18)
ax.right_ax.set_ylabel('Assaults',fontsize=18)
ax.set_xlabel('Time',fontsize=18)
ax.set_ylim((0,100))
plt.axvspan(xmin='2014-3-1',xmax=ts.index.max(),color='k',alpha=.1)
plt.title(u'Assaults, 2001\N{EN DASH}present',fontsize=24)
plt.autoscale()
Zooming into the last year of data (January 1, 2013 to early April 2014), there's still an obvious seasonality to the assaults that corresponds with the temperature as well. The grey hatching on the right corresponds to the time the gun assault has been in effect -- if the argument that the liberalized gun policies should deter crime, this is where we should see the changes in crime rates (green line). Assaults fell in the coldest part of the winter but have climed again as the temperature recovered.
In [52]:
ax = ts[['Temperature','Assault']].resample('W').ix['2013-1-1':].plot(secondary_y=['Assault'],lw=4)
ax.set_ylabel(u'Temperature (\N{DEGREE SIGN}F)',fontsize=18)
ax.right_ax.set_ylabel('Incidents',fontsize=18)
ax.set_xlabel('Time',fontsize=18)
#ax.grid(False,which='minor')
plt.axvspan(xmin='2014-3-1',xmax=ts.index.max(),color='k',alpha=.1)
plt.title(u'Assaults, 2013\N{EN DASH}present',fontsize=24)
plt.autoscale()
Because the homicides are so much more rare than the assaults, it was difficult to tell whether there was a close correlation. Aggregating the data to a montly level and plotting only murders against temperature, we see there is a pattern here as well.
In [53]:
ax = ts[['Temperature','Homicide']].resample('M').plot(secondary_y=['Homicide'],lw=4,alpha=.75)
ax.set_ylabel(u'Temperature (\N{DEGREE SIGN}F)',fontsize=18)
ax.right_ax.set_ylabel('Homicides',fontsize=18)
ax.set_xlabel('Time',fontsize=18)
#ax.set_yticklabels(plt.yticks()[0],range(0,90,10))
plt.axvspan(xmin='2014-3-1',xmax=ts.index.max(),color='k',alpha=.1)
plt.title(u'Homicides, 2001\N{EN DASH}present',fontsize=24)
Out[53]:
Homicides
In [54]:
ax = ts[['Temperature','Homicide']].resample('W').ix['2013-1-1':].plot(secondary_y=['Homicide'],lw=4,alpha=.75)
ax.set_ylabel(u'Temperature (\N{DEGREE SIGN}F)',fontsize=18)
ax.right_ax.set_ylabel('Homicides',fontsize=18)
ax.set_xlabel('Time',fontsize=18)
#ax.set_yticklabels(plt.yticks()[0],range(0,90,10))
plt.axvspan(xmin='2014-3-1',xmax=ts.index.max(),color='k',alpha=.1)
plt.title(u'Homicides, 2013\N{EN DASH}present',fontsize=24)
Out[54]:
We saw above that there are regular cyclical patterns at the annual level in all three data sets. In other words, the temperature, number of assaults, or number of murders in a given day, week, or month last year is similar to the number this year. But there may also be other signals hiding in there as well that are harder to see. We borrow an analytical technique from signal processing called the "Fast Fourier Transform" that is literally used to separate the signals from the noise. This method is basically an "x-ray" on temporal data that lets us see the "skeleton" of different underlying temporal patterns hidden within the data. The resulting plots are called power spectra and we can visualize them for each of the temperature, assault, and murder data to see what other patterns are hiding in there in addition to the annual cycles we saw above.
Special thanks to my colleague, Navid Dianati for his help in interpreting these.
Based on the plots above as well as our own intuition, there should be a signal at a one year frequency at a minimum. Marking that "once every 365 days" frequency with a vertical dashed cyan line and then plotting the power spectrum, we see a peak: there's definitely an annual cycle in the temperature data. No surprise.
In [55]:
d = ts['Temperature'].dropna()
n = len(d)
ps = np.abs(np.fft.fft(d)**2)
plt.plot(ps,lw=4)
plt.yscale('log')
plt.xscale('log')
plt.xlim((10,100))
x = plt.xticks()[0]
plt.xticks(x,(n/x).round(1))
plt.axvline(n/365,ls='--',c='c') # Annual signal
plt.grid(False,which='Minor')
plt.xlabel('Frequency (days)',fontsize=18)
plt.ylabel('Power',fontsize=18)
plt.title('Cycles of temperatures',fontsize=24)
Out[55]:
We also saw a clear annual cycle in the assault data, so I mark that "once every 365" frequency again with a vertical dashed cyan line and then plot the power spectrum for the assault data. We see there's a definite peak there as well, as we expected. However, we also see there are two other noticable peaks in this spectra that didn't occur in the temperature data.
Obviously nature doesn't follow a 7-day week cycle, but this cycle structures a significant amount of human social activity. Plotting a vertical dashed yellow line at the "one a week" frequency, we see that this signal corresponds perfectly with the observed peak. In other words, in addition to the regular annual cycle of gun violence there's also a regular weekly cycle of gun violence. This isn't a revelation to any criminologists or urban sociologists, but it's illustrative of using interdisciplinary method for exploratory analysis.
Finally there's at least one more peak in between these the weekly and the annual cycle. Plotting a vertical dashed magenta line at the "once every 90 days" frequency, this corresponds to another signal. Thus there's at least a three substantive signals in the assault data --- once a year, once a week, and now once a quarter. The power of this signal is weaker than the other two, but it's still about 10x stronger than the background noise. This last finding is somewhat unusual as I don't expect that criminals file a quarterly statement and then promptly go assault someone afterwards. But this strong and regular quarterly signal strongly suggests there's some bureaucratic process at work here that's driving reporting on a ~91-day cycle.
In [56]:
d = ts['Assault'].dropna()
n = len(d)
ps = np.abs(np.fft.fft(d)**2)
plt.plot(ps,lw=4,c='g',alpha=.75)
plt.yscale('log')
plt.xscale('log')
plt.xlim((10,100))
x = plt.xticks()[0]
plt.xticks(x,(n/x).round(1))
plt.axvline(n/365,ls='--',c='c') # Annual signal
plt.axvline(n/182,ls='--',c='m') # Quarterly signal
plt.axvline(n/7,ls='--',c='y') # Weekly signal
plt.grid(False,which='Minor')
plt.xlabel('Frequency (days)',fontsize=18)
plt.ylabel('Power',fontsize=18)
plt.title('Cycles of assaults',fontsize=24)
Out[56]:
Just to double-check we're not reading too much into a weird statistical fluke of this method, I'll aggregate the data into weekly-level data and then repeat the analysis. Again we find signal peaks at the "once every 52 weeks", "once every 13 weeks", and "once every week" frequencies (marked in cyan, magenta, and yellow, respectively).
In [57]:
d = ts['Assault'].resample('W',how=np.sum).dropna()
n = len(d)
ps = np.abs(np.fft.fft(d)**2)
plt.plot(ps,lw=4,c='g',alpha=.75)
plt.yscale('log')
plt.xscale('log')
plt.xlim((10,100))
x = plt.xticks()[0]
plt.xticks(x,(n/x).round(1))
plt.axvline(n/52.,ls='--',c='c') # Annual signal
plt.axvline(n/26,ls='--',c='m') # Quarterly signal
plt.axvline(n,ls='--',c='y') # Weekly signal
plt.grid(False,which='Minor')
plt.xlabel('Frequency (weeks)',fontsize=18)
plt.ylabel('Power',fontsize=18)
plt.title('Cycles of assaults',fontsize=24)
Out[57]:
Looking at the power spectrum for murders, let's try to find signals at the same frequencies we found for assaults. Interestingly enough, there don't appear to be any, even when there should have been an annual peak as we saw above. Strangely, there's a peak (marked with the vertical dotted black line) at ~250 days, or approximately 9 months.
In [58]:
d = ts['Homicide'].dropna()
n = len(d)
ps = np.abs(np.fft.fft(d)**2)
plt.plot(ps,lw=4,c='r',alpha=.75)
plt.yscale('log')
plt.xscale('log')
plt.xlim((10,100))
x = plt.xticks()[0]
plt.xticks(x,(n/x).round(1))
plt.axvline(n/365,ls='--',c='c') # Annual signal
plt.axvline(n/90,ls='--',c='m') # Quarterly signal
plt.axvline(n/7,ls='--',c='y') # Weekly signal
plt.axvline(n/255,ls='--',c='k') # ??? signal
plt.grid(False,which='Minor')
plt.xlabel('Frequency (days)',fontsize=18)
plt.ylabel('Power',fontsize=18)
plt.title('Cycles of homicides',fontsize=24)
Out[58]:
This doesn't really make sense (unless there was some 9-month process like people celebrating murders by conceiving babies and then celebrating babies by murdering people), so let's double-check by aggregating the data into weeks. Sure enough, the weird "one every 9 months" frequency goes away and we find the annual signal again at the "once every 52 weeks" frequency and a weekly signal at the "once every week" frequency. So murder patterns also follow an annual and weekly cycles, as we might expect --- but unlike the there's no quarterly cycle either.
In [394]:
d = ts['Homicide'].resample('W').dropna()
n = len(d)
ps = np.abs(np.fft.fft(d)**2)
plt.plot(ps,lw=4,c='r',alpha=.75)
plt.yscale('log')
plt.xscale('log')
plt.xlim((10,100))
x = plt.xticks()[0]
plt.xticks(x,(n/x).round(1))
plt.axvline(n/52.,ls='--',c='c') # Annual signal
#plt.axvline(n/13.,ls='--',c='m') # Quarterly signal
plt.axvline(n/1.,ls='--',c='y') # Weekly signal
plt.grid(False,which='Minor')
plt.xlabel('Frequency (weeks)',fontsize=18)
plt.ylabel('Power',fontsize=18)
plt.title('Cycles of homicides',fontsize=24)
Out[394]:
The analysis above showed there are clear cyclical patterns in the data. We'll want to use temperature as an explanatory variable, so we need to come up with a function that describes it well. We'll use curve_fit
to identify the parameters of a single sin wave function with an amplitude a
, a period b
, and an intercept c
: $a \times sin(t+b) + c$
Curve fit should return the estimates for these parameters for a given year's cycle and we can plot them against the observed data.
In [63]:
# from http://nbviewer.ipython.org/gist/keflavich/4042018
from scipy.optimize import curve_fit
def sinfunc(x,a,b,c):
return a*np.sin(2*np.pi/365.*x+b)+c
y2001 = np.array(ts['Temperature']['2001-1-1':'2001-12-31'])
x = arange(len(y2001))
fitpars,covmat = curve_fit(f=sinfunc,xdata=x,ydata=y2001)
plt.scatter(x,y2001)
y = sinfunc(x,*fitpars)
plt.plot(x,y,'r-',lw=8,alpha=.5)
plt.ylabel('Temperature',fontsize=18)
plt.xlabel('Day',fontsize=18)
plt.xlim((0,365))
Out[63]:
As this is only for a single year's data, we should estimate parameters for each year in the data. Note that curve_fit
can't estimate parameters for some years because there's too much noise in the data, so we'll have to skip those. We'll average all of these parameter estimates together to get a single function that attempts to fit the data across years.
In [64]:
for y in range(2002,2014):
y_i = np.array(ts['Temperature']['{0}-1-1'.format(y):'{0}-12-31'.format(y)])
x = arange(len(y_i))
try:
fitpars_i,covmat = curve_fit(f=sinfunc,xdata=x,ydata=y_i)
except RuntimeError:
print "Couldn't find estimates for {0}".format(y)
pass
fitpars = np.vstack([fitpars,fitpars_i])
avg_params = fitpars.mean(axis=0)
We can plot this function, the observed data, and the error between the function and the observed data below. The final chart shows the distributions of errors over time (in degrees F). There are some major errors here -- these are unseasonably warm or cold days -- but these are large enough to make the average error 0.56°F and appear to be distributed mostly around 0. So this averaged model seems to be an adequate fit for all the temperature data.
In [65]:
sin_df = pd.DataFrame(index=ts.index)
t = len(sin_df)
sin_df['sin'] = sinfunc(arange(t),*avg_params)
sin_df['Temp'] = ts['Temperature']
sin_df['error'] = sin_df['sin'] - sin_df['Temp']
ax = sin_df.plot(legend=False,subplots=True)
print u"The average error across all observations is: {0}\N{DEGREE SIGN}F".format(round(sin_df['error'].mean(),2))
In [66]:
lm_assault = smf.ols('Assault ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()
lm_assault_x = pd.DataFrame({'Date':pd.date_range('2001-01-01','2014-12-31')})
lm_assault_x['Temperature'] = sinfunc(arange(len(lm_assault_x)),*avg_params)
lm_assault_x['Month'] = lm_assault_x['Date'].apply(lambda x:x.month)
lm_assault_x['Week'] = lm_assault_x['Date'].apply(lambda x:x.week)
lm_assault_x['Weekday'] = lm_assault_x['Date'].apply(lambda x:x.weekday())
lm_assault_x['Time'] = arange(len(lm_assault_x))
lm_assault_x = lm_assault_x.set_index('Date')
std,lower,upper = wls_prediction_std(lm_assault)
start_date = '2013-01-01'
assault_pred = pd.DataFrame(index = pd.date_range(start_date,'2014-12-31'))
assault_pred['Predictions'] = lm_assault.predict(lm_assault_x[start_date:])
assault_pred['Observations'] = ts['Assault'].ix[start_date:]
assault_pred['Lower CI'] = assault_pred['Predictions']-2*std.mean()
assault_pred['Upper CI'] = assault_pred['Predictions']+2*std.mean()
In [68]:
assault_pred[['Lower CI','Upper CI']].resample('W').plot(c='g',lw=1,alpha=.75)
assault_pred['Predictions'].resample('W').plot(c='g',lw=4,alpha=.75,label='Predictions')
#plt.plot(assault_pred.index,assault_pred['Lower_CI'])
plt.scatter(assault_pred.index,assault_pred['Observations'],c='g',alpha=.25,label='Observations')
plt.ylabel('Number of assaults',fontsize=18)
plt.xlabel('Time',fontsize=18)
plt.axvspan(xmin='2014-3-1',xmax=assault_pred.index.max(),color='k',alpha=.1)
plt.legend()
Out[68]:
Homicides are very rare relative to other crimes: in any given day, the 95% confidence interval of the model estimates negative deaths.
I'm going to do something different from the models above and put my thumb on the scale with the estimates. If the model estimates 60.9 crimes on a given, that fractional crime is small relative to the absolute magnitude of crimes. We can add up these fractional remainders, and they may tell us something substantive. However, if the model estimates 1.9 crimes per day, this fraction is large relative to the absolute magnitude of the crimes. Adding up all these fractions results in some significant over-estimation. I'm going to round these fractions back down to the next integer: 0.9 crimes becomes 0, 1.9 crimes becomes 1, etc. This has the result of biasing the errors to be low, but making the absolute counts more accurate.
In [69]:
lm_homicide = smf.ols('Homicide ~ Temperature + Time + C(Week) + C(Weekday)',data=ts).fit()
lm_homicide_x = pd.DataFrame({'Date':pd.date_range('2001-01-01','2014-12-31')})
lm_homicide_x['Temperature'] = sinfunc(arange(len(lm_homicide_x)),*avg_params)
lm_homicide_x['Month'] = lm_homicide_x['Date'].apply(lambda x:x.month)
lm_homicide_x['Week'] = lm_homicide_x['Date'].apply(lambda x:x.week)
lm_homicide_x['Weekday'] = lm_homicide_x['Date'].apply(lambda x:x.weekday())
lm_homicide_x['Time'] = arange(len(lm_homicide_x))
lm_homicide_x = lm_homicide_x.set_index('Date')
std,lower,upper = wls_prediction_std(lm_homicide)
start_date = '2001-01-01'
homicide_pred = pd.DataFrame(index = pd.date_range(start_date,'2014-12-31'))
homicide_pred['Predictions'] = np.floor(lm_homicide.predict(lm_homicide_x.ix[start_date:]))
homicide_pred['Observations'] = ts['Homicide'].ix[start_date:]
homicide_pred['Lower CI'] = homicide_pred['Predictions'] - 2*std.mean()
homicide_pred['Upper CI'] = homicide_pred['Predictions'] + 2*std.mean()
In [71]:
homicide_pred[['Lower CI','Upper CI']].resample('W').plot(c='r',lw=1,alpha=.75)
homicide_pred['Predictions'].resample('W').plot(c='r',lw=4,alpha=.75,label='Predictions')
#plt.plot(assault_pred.index,assault_pred['Lower_CI'])
plt.scatter(homicide_pred.index,homicide_pred['Observations'],c='r',alpha=.25,label='Observations')
plt.ylabel('Number of homicides',fontsize=18)
plt.xlabel('Time',fontsize=18)
plt.legend()
plt.axvspan(xmin='2014-3-1',xmax=assault_pred.index.max(),color='k',alpha=.1)
plt.axhline(0,c='k',lw=4)
plt.title('Comparison of model predictions and observed data',fontsize=24)
Out[71]:
In [126]:
(homicide_pred['Observations'] - homicide_pred['Predictions']).resample('M').ix[:'2014-3-31'].plot(lw=2,c='r')
plt.axhline(y=0,c='k',lw=2,alpha=.75)
plt.ylabel('Observed error from predictions',fontsize=18)
plt.xlabel('Time',fontsize=18)
plt.title('Homicides following model since 2013',fontsize=24)
Out[126]:
In [73]:
bar_df = pd.DataFrame()
bar_df['Predictions'] = homicide_pred['Predictions'].ix['2013-1-1':'2013-12-31'].groupby(lambda x:x.month).agg(np.sum)
bar_df['Observations'] = ts['Homicide'].ix['2013-1-1':'2013-12-31'].groupby(lambda x:x.month).agg(np.sum)
ax = bar_df.plot(kind='bar')
plt.xticks(plt.xticks()[0],['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],fontsize=15,rotation=0)
plt.title('2013 homicide estimates',fontsize=24)
plt.ylabel('Number of homicides',fontsize=18)
plt.xlabel('Month',fontsize=18)
ax.legend(fontsize=15,ncol=2,loc='upper center')
plt.ylim((0,60))
autolabel(ax.patches)
print "The model predicts {0} homicides. {1} homicides were reported.".format(round(bar_df['Predictions'].sum()), bar_df['Observations'].sum())
Compare model estimates per month to observations from 2013. The model is still over-estimating across every month.
In [75]:
lm_personal = smf.ols('Personal ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()
lm_personal_x = pd.DataFrame({'Date':pd.date_range('2001-01-01','2014-12-31')})
lm_personal_x['Temperature'] = sinfunc(arange(len(lm_personal_x)),*avg_params)
lm_personal_x['Month'] = lm_personal_x['Date'].apply(lambda x:x.month)
lm_personal_x['Week'] = lm_personal_x['Date'].apply(lambda x:x.week)
lm_personal_x['Weekday'] = lm_personal_x['Date'].apply(lambda x:x.weekday())
lm_personal_x['Time'] = arange(len(lm_personal_x))
lm_personal_x = lm_personal_x.set_index('Date')
std,lower,upper = wls_prediction_std(lm_personal)
start_date = '2001-01-01'
personal_pred = pd.DataFrame(index = pd.date_range(start_date,'2014-12-31'))
personal_pred['Predictions'] = lm_personal.predict(lm_personal_x[start_date:])
personal_pred['Observations'] = ts['Personal'].ix[start_date:]
personal_pred['Lower CI'] = personal_pred['Predictions']-2*std.mean()
personal_pred['Upper CI'] = personal_pred['Predictions']+2*std.mean()
In [76]:
personal_pred[['Lower CI','Upper CI']].resample('W').plot(c='b',lw=1,alpha=.75)
personal_pred['Predictions'].resample('W').plot(c='b',lw=4,alpha=.75,label='Predictions')
#plt.plot(assault_pred.index,assault_pred['Lower_CI'])
plt.scatter(personal_pred.index,personal_pred['Observations'],c='b',alpha=.25,label='Observations')
plt.ylabel('Number of personal crimes',fontsize=18)
plt.xlabel('Time',fontsize=18)
plt.axvspan(xmin='2014-3-1',xmax=personal_pred.index.max(),color='k',alpha=.1)
plt.legend()
plt.title('Comparison of model predictions and observed data',fontsize=24)
Out[76]:
In [124]:
(personal_pred['Observations'] - personal_pred['Predictions']).resample('M').ix[:'2014-3-31'].plot(lw=2,c='b')
plt.axhline(y=0,c='k',lw=2,alpha=.75)
plt.ylabel('Observed error from predictions',fontsize=18)
plt.xlabel('Time',fontsize=18)
plt.title('Personal crimes below expectations since 2013',fontsize=24)
Out[124]:
In [78]:
figsize(12,6)
bar_df = pd.DataFrame()
bar_df['Predictions'] = personal_pred['Predictions'].ix['2013-1-1':'2013-12-31'].groupby(lambda x:x.month).agg(np.sum)
bar_df['Observations'] = ts['Personal'].ix['2013-1-1':'2013-12-31'].groupby(lambda x:x.month).agg(np.sum)
ax = bar_df.plot(kind='bar')
plt.xticks(plt.xticks()[0],['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],fontsize=15,rotation=0)
plt.title('2013 personal crime estimates',fontsize=24)
plt.ylabel('Number of crimes',fontsize=18)
plt.xlabel('Month',fontsize=18)
ax.legend(fontsize=15,ncol=2,loc='upper center')
plt.ylim((0,10000))
autolabel(ax.patches)
print "The model predicted {0} personal crimes in 2013. {1} personal crimes occurred in 2013.".format(round(bar_df['Predictions'].sum()), bar_df['Observations'].sum())
In [119]:
lm_property = smf.ols('Property ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()
lm_property_x = pd.DataFrame({'Date':pd.date_range('2001-01-01','2014-12-31')})
lm_property_x['Temperature'] = sinfunc(arange(len(lm_property_x)),*avg_params)
lm_property_x['Month'] = lm_property_x['Date'].apply(lambda x:x.month)
lm_property_x['Week'] = lm_property_x['Date'].apply(lambda x:x.week)
lm_property_x['Weekday'] = lm_property_x['Date'].apply(lambda x:x.weekday())
lm_property_x['Time'] = arange(len(lm_property_x))
lm_property_x = lm_property_x.set_index('Date')
std,lower,upper = wls_prediction_std(lm_property)
start_date = '2001-01-01'
property_pred = pd.DataFrame(index = pd.date_range(start_date,'2014-12-31'))
property_pred['Predictions'] = lm_property.predict(lm_property_x[start_date:])
property_pred['Observations'] = ts['Property'].ix[start_date:]
property_pred['Lower CI'] = property_pred['Predictions'] - 2*std.mean()
property_pred['Upper CI'] = property_pred['Predictions'] + 2*std.mean()
In [120]:
property_pred[['Lower CI','Upper CI']].resample('W').plot(c='g',lw=1,alpha=.75)
property_pred['Predictions'].resample('W').plot(c='g',lw=4,alpha=.75,label='Predictions')
#plt.plot(assault_pred.index,assault_pred['Lower_CI'])
plt.scatter(property_pred.index,property_pred['Observations'],c='g',alpha=.25,label='Observations')
plt.ylabel('Number of property crimes',fontsize=18)
plt.xlabel('Time',fontsize=18)
plt.axvspan(xmin='2014-3-1',xmax=property_pred.index.max(),color='k',alpha=.1)
plt.legend()
plt.title('Comparison of model predictions and observed data',fontsize=24)
Out[120]:
In [122]:
(property_pred['Observations'] - property_pred['Predictions']).resample('M').ix[:'2014-3-31'].plot(lw=2,c='g')
plt.axhline(y=0,c='k',lw=2,alpha=.75)
plt.ylabel('Observed error from predictions',fontsize=18)
plt.xlabel('Time',fontsize=18)
plt.title('Property crimes plummet below expectations after 2013',fontsize=24)
Out[122]:
In [83]:
bar_df = pd.DataFrame()
bar_df['Predictions'] = property_pred['Predictions'].ix['2013-1-1':'2013-12-31'].groupby(lambda x:x.month).agg(np.sum)
bar_df['Observations'] = ts['Property'].ix['2013-1-1':'2013-12-31'].groupby(lambda x:x.month).agg(np.sum)
ax = bar_df.plot(kind='bar')
plt.xticks(plt.xticks()[0],['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],fontsize=15,rotation=0)
plt.title('2013 property crime estimates',fontsize=24)
plt.ylabel('Number of crimes',fontsize=18)
plt.xlabel('Month',fontsize=18)
ax.legend(fontsize=15,ncol=2,loc='upper center')
plt.ylim((0,14000))
autolabel(ax.patches)
print "The model predicted {0} property crimes in 2013. {1} property crimes occurred in 2013.".format(round(bar_df['Predictions'].sum()), bar_df['Observations'].sum())
In [74]:
bar_df = pd.DataFrame()
bar_df['Predictions'] = homicide_pred['Predictions'].ix['2014-1-1':].groupby(lambda x:x.month).agg(np.sum)
bar_df['Observations'] = ts['Homicide'].ix['2014-1-1':'2014-3-31'].groupby(lambda x:x.month).agg(np.sum)
ax = bar_df.plot(kind='bar')
plt.xticks(plt.xticks()[0],['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],fontsize=15,rotation=0)
plt.axvspan(xmin=3,xmax=13,color='k',alpha=.1)
plt.title('2014 homicide estimates',fontsize=24)
plt.ylabel('Number of homicides',fontsize=18)
plt.xlabel('Month',fontsize=18)
ax.legend(fontsize=15,ncol=2,loc='upper center')
plt.ylim((0,60))
autolabel(ax.patches)
print "The model predicts {0} homicides in 2014.".format(round(bar_df['Predictions'].sum()))
In [79]:
bar_df = pd.DataFrame()
bar_df['Predictions'] = personal_pred['Predictions'].ix['2014-1-1':].groupby(lambda x:x.month).agg(np.sum)
bar_df['Observations'] = ts['Personal'].ix['2014-1-1':'2014-3-31'].groupby(lambda x:x.month).agg(np.sum)
ax = bar_df.plot(kind='bar')
plt.xticks(plt.xticks()[0],['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],fontsize=15,rotation=0)
plt.axvspan(xmin=3,xmax=13,color='k',alpha=.1)
plt.title('2014 personal crime estimates',fontsize=24)
plt.ylabel('Number of crimes',fontsize=18)
plt.xlabel('Month',fontsize=18)
ax.legend(fontsize=15,ncol=2,loc='upper center')
plt.ylim((0,10000))
autolabel(ax.patches)
print "The model predicts {0} personal crimes in 2014.".format(round(bar_df['Predictions'].sum()))
In [85]:
bar_df = pd.DataFrame()
bar_df['Predictions'] = property_pred['Predictions'].ix['2014-1-1':].groupby(lambda x:x.month).agg(np.sum)
bar_df['Observations'] = ts['Property'].ix['2014-1-1':'2014-3-31'].groupby(lambda x:x.month).agg(np.sum)
ax = bar_df.plot(kind='bar')
plt.xticks(plt.xticks()[0],['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],fontsize=15,rotation=0)
plt.axvspan(xmin=3,xmax=13,color='k',alpha=.1)
plt.title('2014 property crime estimates',fontsize=24)
plt.ylabel('Number of crimes',fontsize=18)
plt.xlabel('Month',fontsize=18)
ax.legend(fontsize=15,ncol=2,loc='upper center')
plt.ylim((0,14000))
autolabel(ax.patches)
print "The model predicts {0} property crimes in 2014.".format(round(bar_df['Predictions'].sum()))
First, examine what specific crimes are driving this trend. Second, examine the precincts with abnormally lower crime rates.
In [128]:
property_crimes_ts = ['Arson','Burglary','Vehicle_theft','Robbery','Theft']
ax = ts[property_crimes_ts].resample('M').plot(lw=4)
ax.set_yscale('log')
ax.grid(False,which='minor')
ax.legend(loc='upper center',ncol=5,fontsize=14)
ax.set_ylabel('Number of crimes',fontsize=18)
ax.set_xlabel('Time',fontsize=18)
Out[128]:
In [131]:
lm_x = pd.DataFrame({'Date':pd.date_range('2001-01-01','2014-12-31')})
lm_x['Temperature'] = sinfunc(arange(len(lm_x)),*avg_params)
lm_x['Month'] = lm_x['Date'].apply(lambda x:x.month)
lm_x['Week'] = lm_x['Date'].apply(lambda x:x.week)
lm_x['Weekday'] = lm_x['Date'].apply(lambda x:x.weekday())
lm_x['Time'] = arange(len(lm_x))
lm_x = lm_x.set_index('Date')
lm_arson = smf.ols('Arson ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()
lm_burglary = smf.ols('Burglary ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()
lm_robbery = smf.ols('Robbery ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()
lm_theft = smf.ols('Theft ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()
lm_vehicle_theft = smf.ols('Vehicle_theft ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()
arson_std,lower,upper = wls_prediction_std(lm_arson)
burglary_std,lower,upper = wls_prediction_std(lm_burglary)
robbery_std,lower,upper = wls_prediction_std(lm_robbery)
theft_std,lower,upper = wls_prediction_std(lm_theft)
vehicle_theft_std,lower,upper = wls_prediction_std(lm_vehicle_theft)
start_date = '2001-01-01'
property_pred2 = pd.DataFrame(index = pd.date_range(start_date,'2014-12-31'))
property_pred2['Arson_prediction'] = lm_arson.predict(lm_x[start_date:])
property_pred2['Burglary_prediction'] = lm_burglary.predict(lm_x[start_date:])
property_pred2['Robbery_prediction'] = lm_robbery.predict(lm_x[start_date:])
property_pred2['Theft_prediction'] = lm_theft.predict(lm_x[start_date:])
property_pred2['Vehicle_theft_prediction'] = lm_vehicle_theft.predict(lm_x[start_date:])
property_pred2['Arson_observation'] = ts['Arson'].ix[start_date:]
property_pred2['Burglary_observation'] = ts['Burglary'].ix[start_date:]
property_pred2['Robbery_observation'] = ts['Robbery'].ix[start_date:]
property_pred2['Theft_observation'] = ts['Theft'].ix[start_date:]
property_pred2['Vehicle_theft_observation'] = ts['Vehicle_theft'].ix[start_date:]
property_pred2['Arson_error'] = property_pred2['Arson_observation'] - property_pred2['Arson_prediction']
property_pred2['Burglary_error'] = property_pred2['Burglary_observation'] - property_pred2['Burglary_prediction']
property_pred2['Robbery_error'] = property_pred2['Robbery_observation'] - property_pred2['Robbery_prediction']
property_pred2['Theft_error'] = property_pred2['Theft_observation'] - property_pred2['Theft_prediction']
property_pred2['Vehicle_theft_error'] = property_pred2['Vehicle_theft_observation'] - property_pred2['Vehicle_theft_prediction']
In [141]:
ax = property_pred2[['Arson_error','Burglary_error','Robbery_error','Theft_error','Vehicle_theft_error']].ix['2010-1-1':'2014-3-31'].resample('M',np.sum).plot(lw=4)
#ax.set_ylim((-40,40))
ax.set_ylabel('Observed error from predictions',fontsize=18)
ax.set_xlabel('Time',fontsize=18)
ax.set_title('Most property crimes below predictions since Q2 2013',fontsize=24)
ax.legend(['Arson','Burglary','Robbery','Theft','Vehicle theft'],fontsize=12,ncol=5,loc='upper center')
Out[141]:
In [375]:
personal_crimes_ts = ['Assault','Battery','Sexual_assault','Homicide']
ax = ts[personal_crimes_ts].resample('M',np.sum).plot(lw=4)
ax.set_yscale('log')
ax.grid(False,which='minor')
ax.legend(loc='upper center',ncol=5,fontsize=14)
ax.set_ylabel('Number of crimes',fontsize=18)
ax.set_xlabel('Time',fontsize=18)
ax.set_ylim((10e0,10e4))
Out[375]:
In [378]:
lm_x = pd.DataFrame({'Date':pd.date_range('2001-01-01','2014-12-31')})
lm_x['Temperature'] = sinfunc(arange(len(lm_x)),*avg_params)
lm_x['Month'] = lm_x['Date'].apply(lambda x:x.month)
lm_x['Week'] = lm_x['Date'].apply(lambda x:x.week)
lm_x['Weekday'] = lm_x['Date'].apply(lambda x:x.weekday())
lm_x['Time'] = arange(len(lm_x))
lm_x = lm_x.set_index('Date')
lm_assault = smf.ols('Assault ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()
lm_battery = smf.ols('Battery ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()
lm_sexual_assault = smf.ols('Sexual_assault ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()
lm_homicide = smf.ols('Homicide ~ Temperature + Time + C(Week) + C(Weekday)',data=ts.dropna()).fit()
start_date = '2001-01-01'
personal_pred2 = pd.DataFrame(index = pd.date_range(start_date,'2014-12-31'))
personal_pred2['Assault_prediction'] = lm_assault.predict(lm_x[start_date:])
personal_pred2['Battery_prediction'] = lm_battery.predict(lm_x[start_date:])
personal_pred2['Sexual_assault_prediction'] = lm_sexual_assault.predict(lm_x[start_date:])
personal_pred2['Homicide_prediction'] = np.floor(lm_homicide.predict(lm_x[start_date:])) # Minority Effect red ball
personal_pred2['Assault_observation'] = ts['Assault'].ix[start_date:]
personal_pred2['Battery_observation'] = ts['Battery'].ix[start_date:]
personal_pred2['Sexual_assault_observation'] = ts['Sexual_assault'].ix[start_date:]
personal_pred2['Homicide_observation'] = ts['Homicide'].ix[start_date:]
personal_pred2['Assault_error'] = personal_pred2['Assault_observation'] - personal_pred2['Assault_prediction']
personal_pred2['Battery_error'] = personal_pred2['Battery_observation'] - personal_pred2['Battery_prediction']
personal_pred2['Sexual_assault_error'] = personal_pred2['Sexual_assault_observation'] - personal_pred2['Sexual_assault_prediction']
personal_pred2['Homicide_error'] = personal_pred2['Homicide_observation'] - personal_pred2['Homicide_prediction']
In [379]:
ax = personal_pred2[['Assault_error','Battery_error','Sexual_assault_error','Homicide_error']].ix['2010-1-1':'2014-3-31'].resample('M',np.sum).plot(lw=4)
#ax.set_ylim((-40,40))
ax.set_ylabel('Observed error from predictions',fontsize=18)
ax.set_xlabel('Time',fontsize=18)
ax.set_title('Most personal crimes below predictions since Q2 2013',fontsize=24)
ax.legend(['Assault','Battery','Sexual assault','Homicide'],fontsize=12,ncol=5,loc='upper center')
Out[379]:
Here we can replicate the claims Bernstein and Isaackson report about the
In [256]:
# From: http://gis.chicagopolice.org/pdfs/district_beat.pdf
Image(url='http://i.imgur.com/KMMkatm.png')
Out[256]:
In [365]:
homicide_df = crime_df[crime_df['Primary Type'] == "HOMICIDE"]
homicide_gb_district_df = pd.DataFrame()
for year in range(2003,2014):
homicide_gb_district_df[year] = homicide_df[homicide_df['Year'] == year].groupby('District')['ID'].agg(len).ix[:25]
district_homicide_change_df = pd.DataFrame()
district_homicide_change_df[u'2010\N{EN DASH}2012 change'] = (homicide_gb_district_df[2012]-homicide_gb_district_df[2010])/homicide_gb_district_df[2010]*100
district_homicide_change_df['2013 change'] = (homicide_gb_district_df[2013]-homicide_gb_district_df[2012])/homicide_gb_district_df[2012]*100
districts = [int(i) for i in list(homicide_gb_district_df.T.columns)]
ax = district_homicide_change_df.plot(kind='bar')
ax.set_xticklabels(districts,rotation=0)
ax.legend(ncol=3,fontsize=15,loc='upper center')
ax.set_ylim((-100,500))
ax.set_ylabel("Percentage change",fontsize=18)
ax.set_xlabel("CPD District",fontsize=18)
ax.set_title("Homicide rates across districts fall\nin 2013 despite significant increases in recent years",fontsize=24)
Out[365]:
In [278]:
property_gb_district_df = pd.DataFrame()
for year in range(2003,2014):
property_gb_district_df[year] = crime_df_gb_property[crime_df_gb_property['Year'] == year].groupby('District')['ID'].agg(len).ix[:25]
district_property_change_df = pd.DataFrame()
district_property_change_df[u'2010\N{EN DASH}2012 change'] = (property_gb_district_df[2012]-property_gb_district_df[2010])/property_gb_district_df[2010]*100
district_property_change_df['2013 change'] = (property_gb_district_df[2013]-property_gb_district_df[2012])/property_gb_district_df[2012]*100
districts = [int(i) for i in list(property_gb_district_df.T.columns)]
ax = district_property_change_df.plot(kind='bar')
ax.set_xticklabels(districts,rotation=0)
ax.legend(ncol=3,fontsize=15,loc='upper center')
ax.set_ylim((-30,30))
ax.set_ylabel("Percentage change",fontsize=18)
ax.set_xlabel("CPD District",fontsize=18)
ax.set_title("Property crime rates across districts fall faster\nin 2013 than previous 3 years combined",fontsize=24)
Out[278]:
In [280]:
personal_gb_district_df = pd.DataFrame()
for year in range(2003,2014):
personal_gb_district_df[year] = crime_df_gb_personal[crime_df_gb_personal['Year'] == year].groupby('District')['ID'].agg(len).ix[:25]
district_personal_change_df = pd.DataFrame()
district_personal_change_df[u'2010\N{EN DASH}2012 change'] = (personal_gb_district_df[2012]-personal_gb_district_df[2010])/personal_gb_district_df[2010]*100
district_personal_change_df['2013 change'] = (personal_gb_district_df[2013]-personal_gb_district_df[2012])/personal_gb_district_df[2012]*100
districts = [int(i) for i in list(district_personal_change_df.T.columns)]
ax = district_personal_change_df.plot(kind='bar')
ax.set_xticklabels(districts,rotation=0)
ax.legend(ncol=3,fontsize=15,loc='upper center')
ax.set_ylim((-30,20))
ax.set_ylabel("Percentage change",fontsize=18)
ax.set_xlabel("CPD District",fontsize=18)
ax.set_title("Personal crime rates across districts fall faster\nin 2013 than previous 3 years combined",fontsize=24)
Out[280]:
In [368]:
(homicide_gb_district_df.T.pct_change().ix[2004:2012].dropna(axis=1)*100).plot(xlim=(-200,200),kind='kde',colormap='jet',alpha=.75,lw=2)
for i,district in enumerate(districts):
plt.axvline(x=homicide_gb_district_df.T.pct_change().ix[2013][district]*100,c='k')
plt.legend(districts,title='District',ncol=2,fontsize=12)
plt.ylabel('Count of changes',fontsize=18)
plt.xlabel('Year-over-year percentage change in crime',fontsize=18)
plt.title('Homicide changes spread across district distributions',fontsize=24)
Out[368]:
In [358]:
(property_gb_district_df.T.pct_change().ix[2004:2012].dropna(axis=1)*100).plot(kind='kde',colormap='jet',alpha=.75,lw=2)
for i,district in enumerate(districts):
plt.axvline(x=property_gb_district_df.T.pct_change().ix[2013][district]*100,c='k')
plt.legend(districts,title='District',ncol=2,fontsize=12)
plt.ylabel('Count of changes',fontsize=18)
plt.xlabel('Year-over-year percentage change in crime',fontsize=18)
plt.title('Decrease in 2013 property crime rates across districts\nis lower than normal',fontsize=24)
Out[358]:
In [359]:
(personal_gb_district_df.T.pct_change().ix[2004:2012].dropna(axis=1)*100).plot(kind='kde',colormap='jet',alpha=.75,lw=2)
for i,district in enumerate(districts):
plt.axvline(x=personal_gb_district_df.T.pct_change().ix[2013][district]*100,c='k')
plt.legend(districts,title='District',ncol=2,fontsize=12)
plt.ylabel('Count of changes',fontsize=18)
plt.xlabel('Year-over-year percentage change in crime',fontsize=18)
plt.title('Decrease in 2013 personal crime rates across districts\nis lower than normal',fontsize=24)
Out[359]:
In [380]:
homicide_gb_district_df.T.pct_change().ix[2004:2012].dropna(axis=1).apply(np.std,axis=1).plot(kind='kde',colormap='jet',alpha=.75,lw=2)
plt.axvline(x=np.std(homicide_gb_district_df.T.pct_change().ix[2013]),c='k')
plt.ylabel('Count of changes',fontsize=18)
plt.xlabel('Spread of changes in property crime rates across districts',fontsize=18)
plt.title('The spread of changes in homicide rates\nis less than previous years',fontsize=24)
Out[380]:
In [360]:
property_gb_district_df.T.pct_change().ix[2004:2012].dropna(axis=1).apply(np.std,axis=1).plot(kind='kde',colormap='jet',alpha=.75,lw=2)
plt.axvline(x=np.std(personal_gb_district_df.T.pct_change().ix[2013]),c='k')
plt.ylabel('Count of changes',fontsize=18)
plt.xlabel('Spread of changes in property crime rates across districts',fontsize=18)
plt.title('The spread of changes in property crime rates\nis less than previous years',fontsize=24)
Out[360]:
In [361]:
personal_gb_district_df.T.pct_change().ix[2004:2012].dropna(axis=1).apply(np.std,axis=1).plot(kind='kde',colormap='jet',alpha=.75,lw=2)
plt.axvline(x=np.std(personal_gb_district_df.T.pct_change().ix[2013]),c='k')
plt.ylabel('Count of changes',fontsize=18)
plt.xlabel('Spread of changes in personal crime rates across districts',fontsize=18)
plt.title('But the spread of changes in personal crime rates\ndoes not differ from previous years',fontsize=24)
Out[361]:
However extremely warm or cold temperatures can also inhibit these effects as people stay indoors and other temporal variables such as time of day, day of week, holidays, public school sessions, and even lunar cycles may also play a part in the frequency of violent and property crimes. Studying the frequency of crime and weather also invites questions of quality and quantity. Do people more regular crimes of assault escalate into homicide (quality), do some crimes become more prevalent (quantity), or a combination of both?
In [281]:
for district in districts:
print district, stats.ttest_1samp(property_gb_district_df.T.pct_change().ix[2004:2012][district].values,property_gb_district_df.T.pct_change().ix[2013][district])
In [ ]: