In this notebook, we explore several components of crime, touching upon questions of both personal and business safety. Our analysis is divided into questions relevant to individuals, questions relevant to the police force, and questions relevant to business. Of course, all crime-related questions may be relevant to all groups.
Our personal safety questions are: How do the season, the weather, and the day of the week affect the number of crimes committed each day? How does crime vary by neighborhood, both overall and per capita? And what factors are associated with shootings?
Our enforcement question is: Can crimes be classified by neighborhood, time of day, or time of year?
Our business question is: Which streets in each of Boston's neighborhoods see the most business-relevant crime?
For each question, we follow a consistent pattern. We begin with data wrangling and exploratory data analysis (EDA). In most cases, we proceed to apply a statistical model. Finally, in some cases, we evaluate the model with residual charts or confusion matrices.
In [1]:
# Load modules, apply settings
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
import requests
import json
from statsmodels.formula.api import ols
from statsmodels.formula.api import logit
import datetime
import calendar
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation, removed in newer scikit-learn
from sklearn.cluster import KMeans
from math import sqrt
%matplotlib inline
mpl.style.use('fivethirtyeight')
pd.options.mode.chained_assignment = None
In [2]:
# Load the primary crime data
base_url = 'https://raw.githubusercontent.com/aisaacso/SafeBoss/master/'
crime_url = base_url + 'Crime_Incident_Reports.csv'
crime = pd.read_csv(crime_url, low_memory = False)
In [3]:
# Create column that is guaranteed to have no NANs, for pivot table counts throughout the notebook
crime['indexer'] = 1
In [4]:
# Date clean-up
# Converts FROMDATE from str to datetime
crime['FROMDATE'] = pd.to_datetime(crime.FROMDATE, format = '%m/%d/%Y %I:%M:%S %p')
# Original range is Jul-2012 to Aug-2015; because in some cases we analyze crime counts, exclude first month of data
crime = crime[crime.FROMDATE > '2012-08-10 00:00:00']
#Add a date column
crime['Date'] = crime.FROMDATE.dt.date
In [5]:
# Convert police district codes to neighborhoods
crime = crime[crime.REPTDISTRICT.notnull()]
crime = crime[crime.REPTDISTRICT != 'HTU']
def get_neighborhood(d):
    if d=='A1': return 'Downtown'
    elif d=='A15': return 'Charlestown'
    elif d=='A7': return 'EastBoston'
    elif d=='B2': return 'Roxbury'
    elif d=='B3': return 'Mattapan'
    elif d=='C6': return 'SouthBoston'
    elif d=='C11': return 'Dorchester'
    elif d=='D4': return 'SouthEnd'
    elif d=='D14': return 'Brighton'
    elif d=='E5': return 'WestRoxbury'
    elif d=='E13': return 'JamaicaPlain'
    elif d=='E18': return 'HydePark'
    else: return '???'
crime['Neighborhood'] = crime['REPTDISTRICT'].map(get_neighborhood)
In [6]:
# Load in weather data
weather_url = base_url + 'weather.csv' # From http://www.ncdc.noaa.gov/cdo-web/datasets
weather = pd.read_csv(weather_url)
In [7]:
# Prepare weather data for adding to crime data
# Include only Boston Logan weather station (has most consistent data)
weather = weather[weather.STATION == 'GHCND:USW00014739']
#Match date format to crime dataset's date format
weather['Date'] = pd.to_datetime(weather.DATE, format = '%Y%m%d')
weather['Date'] = weather.Date.dt.date
# Add temp categories
median = int(weather.TMAX.median())
lower = weather.TMAX.quantile(q = 0.25).astype(int)
upper = weather.TMAX.quantile(q = 0.75).astype(int)
def tmax_groups(t):
    if t<=lower: return 'Cold'
    elif (t>lower and t<=median): return 'SortaCold'
    elif (t>median and t<=upper): return 'SortaHot'
    else: return 'Hot'
def prcp_groups(p):
    if p > 0: return 1
    else: return 0
weather['TempGroups'] = weather['TMAX'].map(tmax_groups)
weather['Precip_Bool'] = weather['PRCP'].map(prcp_groups)
In this section, we analyze three questions: How do the season, the weather, and the day of the week affect the number of crimes committed each day? How does crime vary by neighborhood? And what factors are associated with shootings?
We expect that these questions will be especially relevant to Boston residents who want to understand their personal risk of encountering crime in the city.
In [8]:
# EDA for seasonal variation
dates = pd.pivot_table(crime, values = ['indexer'], index = ['Date', 'Month'], aggfunc = 'count')
dates.rename(columns={'indexer': 'CrimeCount'}, inplace=True) #Rename for more logical referencing
min_crimes = dates.CrimeCount.min()
dates = dates[dates.CrimeCount != min_crimes] # Removes an outlier in Aug 2015
dates.plot(xticks = None, title = 'Crimes per day vary by season')
Out[8]:
In [9]:
# EDA for season
def season_groups(m):
    if m in [12, 1, 2]: return 'Winter'
    elif m in [3, 4, 5]: return 'Spring'
    elif m in [6, 7, 8]: return 'Summer'
    else: return 'Fall'
dates = pd.DataFrame(dates)
dates['Month'] = dates.index.get_level_values(1)
dates['Season'] = dates['Month'].map(season_groups)
seasonal = pd.pivot_table(dates, index = 'Season', values = 'CrimeCount', aggfunc = 'sum')
seasonal.plot(kind = 'bar')
Out[9]:
In [10]:
# EDA for month
months = pd.pivot_table(dates, index = 'Month', values = 'CrimeCount', aggfunc = 'sum')
months.plot(kind = 'bar')
Out[10]:
In [11]:
# EDA for temp
dates['Date'] = dates.index.get_level_values(0)
add_weather = pd.merge(dates, weather, how = 'inner', on = 'Date') # inner join excludes 10 dates
add_weather.plot(kind = 'scatter', x = 'CrimeCount', y = 'TMAX', title = 'Crime increases with temp')
Out[11]:
In [12]:
# EDA for precipitation
add_weather['Raining_or_Snowing?'] = add_weather['Precip_Bool'].map({0:'No', 1:'Yes'})
crime_precip = pd.pivot_table(add_weather, index = 'Raining_or_Snowing?', values = 'CrimeCount', aggfunc = 'mean') # mean crimes per day, rather than a count of days
crime_precip.plot(kind = 'bar', title = 'Fewer crimes per day when raining or snowing')
Out[12]:
In [13]:
#EDA for day of the week
def get_week_day(d):
    daynum = d.weekday()
    days = ['Mon','Tues','Wed','Thurs','Fri','Sat','Sun']
    return days[daynum]
add_weather['DayWeek'] = add_weather['Date'].map(get_week_day)
weekdays = pd.pivot_table(add_weather, index = 'DayWeek', values = 'CrimeCount', aggfunc = 'sum')
weekdays.plot(kind = 'bar')
Out[13]:
In [14]:
# Build model
# Removed variables: Spring
season_dummies = pd.get_dummies(add_weather['Season']).iloc[:, 1:]
day_dummies = pd.get_dummies(add_weather['DayWeek']).iloc[:, 1:]
temp_dummies = pd.get_dummies(add_weather['TempGroups']).iloc[:, 1:]
dates_dummy_df = add_weather.join([day_dummies, season_dummies, temp_dummies])
train, test = train_test_split(dates_dummy_df, test_size = 0.2)
perday_model = ols(data=train, formula='CrimeCount ~ Summer + Winter + Hot + SortaCold + SortaHot +\
Mon + Sat + Sun + Thurs + Tues + Wed + Precip_Bool')
perday_result = perday_model.fit()
perday_result.summary()
Out[14]:
In [15]:
# Analyze the model
residuals = perday_result.resid
fig = sns.distplot(residuals)
In [16]:
# Create prediction for test data
test['Prediction'] = perday_result.predict(test)
In [17]:
# Plot the prediction against the actual
test.plot(kind = 'scatter', x='CrimeCount', y = 'Prediction')
Out[17]:
In [18]:
# Linear regression on correlation between prediction and actual
model_test = ols(data=test, formula = 'Prediction ~ CrimeCount')
test_result = model_test.fit()
In [19]:
# Checking residuals of test regression
test_resid = test_result.resid
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
sns.distplot(test_resid, ax=axes[0]);
sm.qqplot(test_resid, fit=True, line='s', ax=axes[1]);
First, we examine crime overall.
Null hypothesis: The number of crimes per day does not differ by neighborhood.
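For reference, the test statistic computed below is the standard z-score of each neighborhood's mean daily crime count against the citywide figures, with a one-sided p-value:
$$z = \frac{\bar{x}_{\text{neigh}} - \mu_{\text{pop}}}{\sigma_{\text{pop}} / \sqrt{n}}, \qquad p = P(Z > |z|)$$
where $n$ is the number of days observed per neighborhood.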
In [20]:
#EDA for crime overall
reptd = pd.pivot_table(crime, values = 'indexer', index = 'Neighborhood', aggfunc = 'count')
reptd.plot(kind = 'bar', sort_columns = True, title = 'Crime totals', figsize = (8,7))
Out[20]:
In [21]:
#Hypothesis testing
# Set up dataframes
date_by_neigh = pd.pivot_table(crime, index = 'Date', columns = 'Neighborhood', \
values = 'indexer', aggfunc = 'count')
date_by_neigh = date_by_neigh[date_by_neigh.SouthEnd != 167] # removes a South End outlier
date_by_neigh_melt = pd.melt(date_by_neigh).dropna()
# Pop standard deviation
pop_sd = date_by_neigh_melt.std()
# Pop average
pop_avg = date_by_neigh_melt.mean()
# Sample size
sample_size = len(date_by_neigh) # All neighborhoods have the same number of entries +/- 3
# Standard error
st_err = pop_sd / sqrt(sample_size)
date_by_neigh_p = pd.DataFrame(date_by_neigh.mean())
date_by_neigh_p['mean'] = date_by_neigh_p.loc[:,0]
date_by_neigh_p['zscore'] = (date_by_neigh.mean() - pop_avg[0])/st_err[0]
date_by_neigh_p['pscore'] = stats.norm.sf(abs((date_by_neigh_p['zscore'])))
print 'Population average crimes per day: ', pop_avg[0]
date_by_neigh_p
Out[21]:
The null hypothesis is rejected.
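As an added sanity check on the z-scores above (not part of the original analysis), the same comparison can be run as a one-sample t-test for each neighborhood using the scipy.stats module already imported. This is a minimal sketch assuming date_by_neigh and pop_avg exist as computed in the cell above.
# Added sketch: one-sample t-test of each neighborhood's daily crime counts
# against the citywide average computed above
checks = {}
for col in date_by_neigh.columns:
    sample = date_by_neigh[col].dropna()
    t_stat, p_val = stats.ttest_1samp(sample, pop_avg[0])  # two-sided p-value
    checks[col] = p_val
pd.Series(checks).sort_values()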
Next, we examine crime per capita. The chart below shows that the distribution of crime is very different when examined on a per capita basis.
In [22]:
# load in pop file
pop_url = base_url + 'pop.csv' # From our own web research
pop_df = pd.read_csv(pop_url)
In [23]:
# EDA for crime per capita
reptd = pd.DataFrame(reptd)
reptd.rename(columns={'indexer': 'CrimeCount'}, inplace=True) #Rename for more logical referencing
reptd['Neighborhood'] = reptd.index.get_level_values(0)
add_pop = pd.merge(reptd, pop_df, how = 'inner', on = 'Neighborhood')
add_pop['percapita'] = add_pop.CrimeCount / add_pop.Population
add_pop.plot(kind = 'bar', x='Neighborhood', y = 'percapita', sort_columns = True, title = 'Crime per capita', figsize = (8,7))
Out[23]:
In [24]:
# Dummy for shooting
crime['Shoot_Status']=crime['Shooting'].map({'No':0,'Yes':1}).astype(int)
In [25]:
# EDA for day of the week
shoot = crime[crime.Shoot_Status==1]
days = pd.pivot_table(shoot, values = 'indexer', index = 'DAY_WEEK', aggfunc = 'count')
days.plot(kind = 'bar')
Out[25]:
In [26]:
# EDA for month
months = pd.pivot_table(shoot, values = 'indexer', index = 'Month', aggfunc = 'count')
months.plot(kind = 'bar')
Out[26]:
In [27]:
# EDA for weather
weather_shoot = pd.merge(shoot, weather, how = 'inner', on = 'Date')
temps = pd.pivot_table(weather_shoot, index = 'TempGroups', values = 'indexer', aggfunc = 'count')
temps.plot(kind = 'bar')
Out[27]:
In [28]:
# Add in weather data to crime dataset
crime_weather = pd.merge(crime, weather, how = 'outer', on = 'Date')
In [29]:
#Add a column for the month name (regression can't handle numbers as col names)
def mo_as_name(mo):
    return calendar.month_name[mo]
crime_weather['MoName'] = crime_weather['Month'].map(mo_as_name)
In [30]:
# Data prep
week_dummies = pd.get_dummies(crime_weather['DAY_WEEK']).iloc[:, 1:]
month_dummies = pd.get_dummies(crime_weather['MoName']).iloc[:, 1:]
neigh_dummies = pd.get_dummies(crime_weather['Neighborhood']).iloc[:,1:]
temp_dummies = pd.get_dummies(crime_weather['TempGroups']).iloc[:,1:]
crtype_dummies = pd.get_dummies(crime_weather['MAIN_CRIMECODE']).iloc[:,1:]
shoot_df = crime_weather.join([week_dummies, month_dummies, neigh_dummies, temp_dummies, crtype_dummies])
In [31]:
#Regression
# Removed variables: + July + December + August + Downtown + Precip_Bool + Thursday + November + WestRoxbury + SortaCold
# + October + March + June + May + Wednesday + Tuesday
train, test = train_test_split(shoot_df, test_size = 0.2)
model_logistic = logit(data=train, formula='Shoot_Status ~ Monday + Sunday + February + Hot + SortaHot \
+ SouthEnd + Roxbury + HydePark + JamaicaPlain + SouthBoston + Dorchester + Mattapan + EastBoston + Charlestown'
)
result_logistic = model_logistic.fit()
# Function for analyzing p_values; used for removing p values one by one
def analyze_p(res):
    p = res.pvalues
    p.sort_values(ascending = False, inplace = True)
    print res.prsquared
    print p
#analyze_p(result_logistic)
result_logistic.summary()
Out[31]:
In [32]:
residuals = result_logistic.resid_dev
fig = sns.distplot(residuals)
Of these three questions, the results of Question 1 are most promising. The adjusted R-squared is high enough to make the model worthwhile, several of the variables show sufficiently low p-values, and the residuals are evenly distributed. The other models need further refinement in order to provide meaningful insights.
Our analysis of Question 1 reveals that, holding all else constant, summer, winter, temperature level, precipitation, and day of the week all have a statistically significant correlation with the number of crimes on a given day. Notably, when all these other factors are held constant, there are between 43 and 54 fewer crimes on Sundays. Hot days have between 27 and 40 more crimes, again holding all else constant. These figures all use a 95% confidence interval.
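The ranges quoted above are the 95% confidence intervals from the fitted model; as a quick reference, they can be pulled directly from the statsmodels result object. This is a minimal sketch assuming perday_result is the fitted OLS result from Question 1 above.
# Added sketch: extract 95% confidence intervals for each coefficient
conf_int = perday_result.conf_int(alpha = 0.05)  # columns 0 and 1 are the lower and upper bounds
conf_int.columns = ['lower', 'upper']
conf_int['coef'] = perday_result.params
conf_int.loc[['Sun', 'Hot']]  # e.g., the Sunday and Hot-day ranges cited above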
In this section, we analyze whether crimes can be classified by neighborhood, time of day, or time of year. We imagine that these questions could be particularly relevant to Boston's police department as it prepares for the city's varying enforcement needs. Throughout, we examine only the ten most common crime types.
In [33]:
# Data wrangling
# Find the most common crime types
cr_counts = pd.DataFrame(pd.pivot_table(crime, index = 'MAIN_CRIMECODE', values = 'DAY_WEEK', aggfunc = 'count'))
cr_counts.sort_values('DAY_WEEK', ascending = False, inplace = True)
cr_counts = cr_counts.head(10)
top_crimes = cr_counts.index.tolist()
# Prep the data
districts = neigh_dummies.columns.tolist()
dist_cols = ['Neighborhood'] + top_crimes
neigh_classes = shoot_df[dist_cols].dropna()
In [34]:
# EDA
districts_crimes = pd.pivot_table(neigh_classes, index = 'Neighborhood', values = top_crimes, aggfunc = 'sum')
districts_crimes.plot(kind = 'bar')
Out[34]:
In [35]:
# Build the model
model_distcr = KMeans( n_clusters = len(districts))
model_distcr = model_distcr.fit(neigh_classes.iloc[:,1:])
neigh_classes['kmeans_class'] = model_distcr.labels_
In [36]:
# Analyze the model
#Plot the classification
plt.figure(figsize=(7,6))
sns.stripplot(x='Neighborhood', y='kmeans_class', data=neigh_classes, jitter= True)
#Confusion matrix
pd.pivot_table(neigh_classes, index='Neighborhood', columns = 'kmeans_class', values = '11xx', aggfunc = 'count')
Out[36]:
In [37]:
# Data wrangling
# Adds Hour column for each crime
crime['Hour'] = crime.FROMDATE.dt.hour
# Removes the preponderance of rows for which the time is exactly 00:00:00 (i.e., no time was recorded)
crime_no_time = crime[(crime.FROMDATE.dt.hour == 0) & (crime.FROMDATE.dt.minute == 0)]
crime_no_time['no_time'] = 'indicator'
crime_time = crime.merge(crime_no_time, how='left')
crime_time = crime_time[crime_time.no_time != 'indicator']
def time_groups(t):
    if t in [0,1,2,3,4,23]: return "Night"
    elif t in [5,6,7,8,9,10]: return "Morning"
    elif t in [11,12,13,14,15,16]: return "Midday"
    else: return "Evening"
periods = crime_time.join(crtype_dummies)
periods['timegroup'] = periods['Hour'].map(time_groups)
time_cols = ['timegroup', 'Hour'] + top_crimes
periods_classes = periods[time_cols].dropna()
In [38]:
# EDA
hours_crimes = pd.pivot_table(periods_classes, index = 'Hour', values = top_crimes, aggfunc = 'sum')
hours_crimes.plot(kind = 'line')
Out[38]:
In [39]:
# Build the model
model_periocr = KMeans( n_clusters = 4)
model_periocr = model_periocr.fit(periods_classes.iloc[:,2:])
periods_classes['kmeans_class'] = model_periocr.labels_
In [40]:
# Analyze the model
#Plot the classification
plt.figure(figsize=(7,6))
sns.stripplot(x='timegroup', y='kmeans_class', data=periods_classes, jitter= True)
#Confusion matrix
pd.pivot_table(periods_classes, index='timegroup', columns = 'kmeans_class', values = '11xx', aggfunc = 'count')
Out[40]:
In [41]:
# Data wrangling
shoot_df['Season'] = shoot_df['Month'].map(season_groups)
seas_cols = ['Season'] + top_crimes
seasons_classes = shoot_df[seas_cols].dropna()
In [42]:
# EDA
seasons_crimes = pd.pivot_table(seasons_classes, index = 'Season', values = top_crimes, aggfunc = 'sum')
seasons_crimes.plot(kind = 'line')
Out[42]:
In [43]:
# Build the model
model_seascr = KMeans( n_clusters = 4)
model_seascr = model_seascr.fit(seasons_classes.iloc[:,1:]) # only column 0 ('Season') is non-numeric here
seasons_classes['kmeans_class'] = model_seascr.labels_
In [44]:
# Analyze the model
#Plot the classification
plt.figure(figsize=(7,6))
sns.stripplot(x='Season', y='kmeans_class', data=seasons_classes, jitter= True)
#Confusion matrix
pd.pivot_table(seasons_classes, index='Season', columns = 'kmeans_class', values = '11xx', aggfunc = 'count')
Out[44]:
In this section, we attempted to classify crimes by a variety of groupings. None of these variables proved to be an effective classifier for crime. However, as shown here, the same model could be applied across a number of variables to search for crime categories; a generalized sketch follows below. Again, we believe that, with further refinement, this approach could be useful to law enforcement.
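To make that reuse concrete, the cluster-and-crosstab steps above can be wrapped in a small helper that accepts any grouping column. This is a minimal sketch under our own naming (cluster_by is not from the original notebook), assuming shoot_df and top_crimes from the earlier cells.
# Added sketch: generalize the k-means classification to any grouping column
def cluster_by(df, group_col, feature_cols, k):
    data = df[[group_col] + feature_cols].dropna()
    km = KMeans(n_clusters = k)
    data['kmeans_class'] = km.fit_predict(data[feature_cols])
    # Cross-tabulate the cluster labels against the grouping column
    return pd.pivot_table(data, index = group_col, columns = 'kmeans_class',
                          values = feature_cols[0], aggfunc = 'count')

# Example: repeat the seasonal analysis with the same inputs as above
cluster_by(shoot_df, 'Season', top_crimes, k = 4)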
In this section, we list crimes that may be of particular concern to business owners, and we show which streets have the highest occurrences of these crimes in each of Boston's neighborhoods. We imagine that entrepreneurs could consult this data when siting new businesses.
In [45]:
# create a list of business crimes
business_crimes = ['COMMERCIAL BURGLARY', 'VANDALISM', 'ROBBERY', 'OTHER LARCENY', 'BurgTools', 'ARSON', 'Larceny',
                   'Other Burglary', 'PROSTITUTION CHARGES', 'PubDrink']
# Classify crimes based on whether or not they are business-relevant crimes
def is_bus_cr(c):
    if c in business_crimes:
        return 1
    else:
        return 0
crime['BusCr'] = crime['INCIDENT_TYPE_DESCRIPTION'].map(is_bus_cr)
dists = crime['Neighborhood'].unique().tolist()
# Create a chart of the top five streets in each district in Boston
for d in dists:
    var = crime.loc[crime.Neighborhood == d]
    streets = pd.DataFrame(pd.pivot_table(var, values = 'BusCr', index = 'STREETNAME', aggfunc = 'sum'))
    streets.sort_values('BusCr', ascending = False, inplace = True)
    top_five = streets.head(5)
    top_five.plot(kind = 'bar', title = d)
    print
In this notebook, we analyzed Boston's crime data from a variety of perspectives. We examined questions relevant to individuals, police officers, and business owners. These models and this approach could be extended to explore questions relevant to other groups of stakeholders, including youth, the elderly, minorities, and city administration and leadership. Each group has its own interests and questions with respect to crime. Other data, such as economic, unemployment, and demographic data, could also be incorporated into our models to provide further crime-related insights.