Crime Analysis

Team 6

Jeffrey Bohn, Nirmalya Chanda, Melissa Freeman, Anna Isaacson, Jong Geun Shin

Introduction

In this notebook, we explore several components of crime, touching upon questions of both personal and business safety. Our analysis is divided into questions relevant to individuals, questions relevant to the police force, and questions relevant to business. Of course, all crime-related questions may be relevant to all groups.

Our personal safety questions are:

  • What factors contribute to the number of crimes that occur on a given day?
  • Do some areas have more crime than others?
  • Given that there has been a crime, what factors contribute to whether or not a shooting occurs with the crime?

Our enforcement question is:

  • Can crimes be classified by neighborhood, time of day, or time of year?

Our business question is:

  • In each of Boston's neighborhoods, which streets are most likely to see crimes of paricular concern to businesses, such as burglary and vandalism?

For each question, we follow a consistent pattern. We begin with data wrangling and exploratory data analysis (EDA). In most cases, we proceed to apply a statistical model. Finally, in some cases, we evaluate the model with residual charts or confusion matrices.

Part I: Load and clean the data


In [1]:
# Load modules, apply settings
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
import requests
import json
from statsmodels.formula.api import ols
from statsmodels.formula.api import logit
import datetime
import calendar
from sklearn.cross_validation import train_test_split
from sklearn.cluster import KMeans
from math import sqrt
%matplotlib inline
mpl.style.use('fivethirtyeight')
pd.options.mode.chained_assignment = None

In [2]:
# Load the primary crime data
base_url = 'https://raw.githubusercontent.com/aisaacso/SafeBoss/master/'
crime_url = base_url + 'Crime_Incident_Reports.csv'
crime = pd.read_csv(crime_url, low_memory = False)

In [3]:
# Create column that is guaranteed to have no NANs, for pivot table counts throughout the notebook
crime['indexer'] = 1

In [4]:
# Date clean-up

# Converts FROMDATE from str to datetime
crime['FROMDATE'] = pd.to_datetime(crime.FROMDATE, format = '%m/%d/%Y %I:%M:%S %p')

# Original range is Jul-2012 to Aug-2015; because in some cases we analyze crime counts, exclude first month of data
crime = crime[crime.FROMDATE > '2012-08-10 00:00:00']

#Add a date column
crime['Date'] = crime.FROMDATE.dt.date

In [5]:
# Convert police district codes to neighborhoods

crime = crime[crime.REPTDISTRICT.notnull()]

crime = crime[crime.REPTDISTRICT <> 'HTU']
def get_neighborhood(d):
    if d=='A1': return 'Downtown'
    elif d=='A15': return 'Charlestown'
    elif d=='A7': return 'EastBoston'
    elif d=='B2': return 'Roxbury'
    elif d=='B3': return 'Mattapan'
    elif d=='C6': return 'SouthBoston'
    elif d=='C11': return 'Dorchester'
    elif d=='D4': return 'SouthEnd'
    elif d=='D14': return 'Brighton'
    elif d=='E5': return 'WestRoxbury'
    elif d=='E13': return 'JamaicaPlain'
    elif d=='E18': return 'HydePark'
    else: return '???'

crime['Neighborhood'] = crime['REPTDISTRICT'].map(get_neighborhood)

In [6]:
# Load in weather data
weather_url = base_url + 'weather.csv' # From http://www.ncdc.noaa.gov/cdo-web/datasets
weather = pd.read_csv(weather_url)

In [7]:
# Prepare weather data for adding to crime data

# Include only Boston Logan weather station (has most consistent data)
weather = weather[weather.STATION == 'GHCND:USW00014739']

#Match date format to crime dataset's date format
weather['Date'] = pd.to_datetime(weather.DATE, format = '%Y%m%d')
weather['Date'] = weather.Date.dt.date

# Add temp categories
median = int(weather.TMAX.median())
lower = weather.TMAX.quantile(q = 0.25).astype(int)
upper = weather.TMAX.quantile(q = 0.75).astype(int)

def tmax_groups(t):
    if t<=lower: return 'Cold'
    elif (t>lower and t<=median): return 'SortaCold'
    elif (t>median and t<=upper): return 'SortaHot'
    else: return 'Hot'

def prcp_groups(p):
    if p > 0: return 1
    else: return 0
    
weather['TempGroups'] = weather['TMAX'].map(tmax_groups)
weather['Precip_Bool'] = weather['PRCP'].map(prcp_groups)

Part II : Personal Safety

In this section, we analyze three questions:

  • What factors contribute to the number of crimes that occur on a given day?
  • Do some areas have more crime than others?
  • Given that there has been a crime, what factors contribute to whether or not a shooting occurs with the crime?

We expect that these questions will be especially relevant to Boston residents who wish to be aware of their personal risks of facing crime in the city.

Part II Question 1: What factors contribute to crime per day?


In [8]:
# EDA for seasonal variation
dates = pd.pivot_table(crime, values = ['indexer'], index = ['Date', 'Month'], aggfunc = 'count')
dates.rename(columns={'indexer': 'CrimeCount'}, inplace=True) #Rename for more logical referencing
min_crimes = dates.CrimeCount.min()
dates = dates[dates.CrimeCount != min_crimes] # Removes an outlier in Aug, 2105
dates.plot(xticks = None, title = 'Crimes per day varies by season')


Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a15b1d0>

In [9]:
# EDA for season

def season_groups(m):
    if m in [12, 1, 2]: return 'Winter'
    elif m in [3, 4, 5]: return 'Spring'
    elif m in [6, 7, 8]: return 'Summer'
    else: return 'Fall'
    
dates = pd.DataFrame(dates)
dates['Month'] = dates.index.get_level_values(1)
dates['Season'] = dates['Month'].map(season_groups)
seasonal = pd.pivot_table(dates, index = 'Season', values = 'CrimeCount', aggfunc = 'sum')
seasonal.plot(kind = 'bar')


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a15dbd0>

In [10]:
# EDA for month
months = pd.pivot_table(dates, index = 'Month', values = 'CrimeCount', aggfunc = 'sum')
months.plot(kind = 'bar')


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a15ff10>

In [11]:
# EDA for temp

dates['Date'] = dates.index.get_level_values(0)
add_weather = pd.merge(dates, weather, how = 'inner', on = 'Date') # inner join excludes 10 dates
add_weather.plot(kind = 'scatter', x = 'CrimeCount', y = 'TMAX', title = 'Crime increases with temp')


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a128910>

In [12]:
# EDA for precipitation
add_weather['Raining_or_Snowing?'] = add_weather['Precip_Bool'].map({0:'No', 1:'Yes'})
crime_precip = pd.pivot_table(add_weather, index = 'Raining_or_Snowing?', values = 'CrimeCount', aggfunc = 'count')
crime_precip.plot(kind = 'bar', title = 'Fewer crimes when raining or snowing')


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a155ad0>

In [13]:
#EDA for day of the week

def get_week_day(d):
    daynum = d.weekday()
    days = ['Mon','Tues','Wed','Thurs','Fri','Sat','Sun']
    return days[daynum]
add_weather['DayWeek'] = add_weather['Date'].map(get_week_day)
weekdays = pd.pivot_table(add_weather, index = 'DayWeek', values = 'CrimeCount', aggfunc = 'sum')
weekdays.plot(kind = 'bar')


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a15ff90>

In [14]:
# Build model
# Removed variables: Spring
season_dummies = pd.get_dummies(add_weather['Season']).iloc[:, 1:]
day_dummies = pd.get_dummies(add_weather['DayWeek']).iloc[:, 1:]
temp_dummies = pd.get_dummies(add_weather['TempGroups']).iloc[:, 1:]
dates_dummy_df = add_weather.join([day_dummies, season_dummies, temp_dummies])
train, test = train_test_split(dates_dummy_df, test_size = 0.2)
perday_model = ols(data=train, formula='CrimeCount ~  Summer + Winter + Hot + SortaCold + SortaHot +\
Mon + Sat + Sun + Thurs + Tues + Wed + Precip_Bool')
perday_result = perday_model.fit()
perday_result.summary()


Out[14]:
OLS Regression Results
Dep. Variable: CrimeCount R-squared: 0.375
Model: OLS Adj. R-squared: 0.367
Method: Least Squares F-statistic: 43.20
Date: Wed, 27 Apr 2016 Prob (F-statistic): 6.82e-80
Time: 18:55:06 Log-Likelihood: -4069.0
No. Observations: 876 AIC: 8164.
Df Residuals: 863 BIC: 8226.
Df Model: 12
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 247.2068 3.213 76.945 0.000 240.901 253.513
Summer -5.6797 2.763 -2.056 0.040 -11.102 -0.257
Winter -5.0118 2.615 -1.917 0.056 -10.144 0.121
Hot 33.8983 3.647 9.294 0.000 26.739 41.057
SortaCold 15.1661 2.610 5.811 0.000 10.043 20.289
SortaHot 28.6033 3.054 9.365 0.000 22.608 34.598
Mon -25.9612 3.207 -8.094 0.000 -32.256 -19.666
Sat -23.8517 3.179 -7.504 0.000 -30.090 -17.613
Sun -49.2581 3.213 -15.330 0.000 -55.565 -42.951
Thurs -19.4073 3.215 -6.037 0.000 -25.717 -13.097
Tues -29.0832 3.205 -9.074 0.000 -35.374 -22.793
Wed -22.0888 3.223 -6.854 0.000 -28.414 -15.764
Precip_Bool -5.5056 1.813 -3.037 0.002 -9.064 -1.947
Omnibus: 60.861 Durbin-Watson: 1.954
Prob(Omnibus): 0.000 Jarque-Bera (JB): 166.717
Skew: -0.339 Prob(JB): 6.28e-37
Kurtosis: 5.027 Cond. No. 9.46

In [15]:
# Analyze the model

residuals = perday_result.resid
fig = sns.distplot(residuals)



In [16]:
# Create prediction for test data
test['Prediction'] = perday_result.predict(test)

In [17]:
# Plot the prediction against the actual
test.plot(kind = 'scatter', x='CrimeCount', y = 'Prediction')


Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x11e074150>

In [18]:
# Linear regression on correlation between prediction and actual
model_test = ols(data=test, formula = 'Prediction ~ CrimeCount')
test_result = model_test.fit()

In [19]:
# Checking residuals of test regression
test_resid = test_result.resid
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
sns.distplot(test_resid, ax=axes[0]);
sm.qqplot(test_resid, fit=True, line='s', ax=axes[1]);


Part II Question 2: Do some areas have more crimes than others?

First, we examine crime overall.

Null hypothesis: Numbers of crime per day do not differ by neighborhood.


In [20]:
#EDA for crime overall
reptd = pd.pivot_table(crime, values = 'indexer', index = 'Neighborhood', aggfunc = 'count')
reptd.plot(kind = 'bar', sort_columns = True, title = 'Crime totals', figsize = (8,7))


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b9923d0>

In [21]:
#Hypothesis testing

# Set up dataframes
date_by_neigh = pd.pivot_table(crime, index = 'Date', columns = 'Neighborhood', \
                               values = 'indexer', aggfunc = 'count')
date_by_neigh = date_by_neigh[date_by_neigh.SouthEnd != 167] # removes southbos outlier
date_by_neigh_melt = pd.melt(date_by_neigh).dropna()

# Pop standard deviation
pop_sd = date_by_neigh_melt.std()

# Pop average
pop_avg = date_by_neigh_melt.mean()

# Sample size
sample_size = len(date_by_neigh) # All neighborhoods have the same number of entries +/- 3

# Standard error
st_err = pop_sd / sqrt(sample_size)

date_by_neigh_p = pd.DataFrame(date_by_neigh.mean())
date_by_neigh_p['mean'] = date_by_neigh_p.loc[:,0]
date_by_neigh_p['zscore'] = (date_by_neigh.mean() - pop_avg[0])/st_err[0]
date_by_neigh_p['pscore'] = stats.norm.sf(abs((date_by_neigh_p['zscore'])))
print 'Population average crimes per day: ', pop_avg[0]
date_by_neigh_p


Population average crimes per day:  19.7298079854
Out[21]:
0 mean zscore pscore
Neighborhood
Brighton 17.235616 17.235616 -7.481703 3.668292e-14
Charlestown 5.038781 5.038781 -44.067944 0.000000e+00
Dorchester 30.616438 30.616438 32.656085 3.284036e-234
Downtown 26.258447 26.258447 19.583636 1.066213e-85
EastBoston 11.250457 11.250457 -25.435088 5.161396e-143
HydePark 12.336073 12.336073 -22.178620 2.762404e-109
JamaicaPlain 13.405850 13.405850 -18.969663 1.519254e-80
Mattapan 21.995434 21.995434 6.796085 5.375009e-12
Roxbury 35.888584 35.888584 48.470680 0.000000e+00
SouthBoston 18.321461 18.321461 -4.224548 1.197104e-05
SouthEnd 33.404566 33.404566 41.019493 0.000000e+00
WestRoxbury 10.815188 10.815188 -26.740744 7.910729e-158

Null hypothesis is rejected.

Next, we examine crime per capita. The chart below shows that the distribution of crime is very different when examined on a per capita basis.


In [22]:
# load in pop file
pop_url = base_url + 'pop.csv' # From our own web research
pop_df = pd.read_csv(pop_url)

In [23]:
# EDA for crime per capita

reptd = pd.DataFrame(reptd)
reptd.rename(columns={'indexer': 'CrimeCount'}, inplace=True) #Rename for more logical referencing
reptd['Neighborhood'] = reptd.index.get_level_values(0)
add_pop = pd.merge(reptd, pop_df, how = 'inner', on = 'Neighborhood')
add_pop['percapita'] = add_pop.CrimeCount / add_pop.Population
add_pop.plot(kind = 'bar', x='Neighborhood', y = 'percapita', sort_columns = True, title = 'Crime per capita', figsize = (8,7))


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b915f90>

Part II Question 3: Given a crime, what factors contribute to crime including a shooting?


In [24]:
# Dummy for shooting
crime['Shoot_Status']=crime['Shooting'].map({'No':0,'Yes':1}).astype(int)

In [25]:
# EDA for day of the week
shoot = crime[crime.Shoot_Status==1]
days = pd.pivot_table(shoot, values = 'indexer', index = 'DAY_WEEK', aggfunc = 'count')
days.plot(kind = 'bar')


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x11d762e50>

In [26]:
# EDA for month
months = pd.pivot_table(shoot, values = 'indexer', index = 'Month', aggfunc = 'count')
months.plot(kind = 'bar')


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x11d7c4a50>

In [27]:
# EDA for weather
weather_shoot = pd.merge(shoot, weather, how = 'inner', on = 'Date')
temps = pd.pivot_table(weather_shoot, index = 'TempGroups', values = 'indexer', aggfunc = 'count')
temps.plot(kind = 'bar')


Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b915fd0>

In [28]:
# Add in weather data to crime dataset
crime_weather = pd.merge(crime, weather, how = 'outer', on = 'Date')

In [29]:
#Add a column for the month name (regression can't handle numbers as col names)
def mo_as_name(mo):
    return calendar.month_name[mo]

crime_weather['MoName'] =  crime_weather['Month'].map(mo_as_name)

In [30]:
# Data prep
week_dummies = pd.get_dummies(crime_weather['DAY_WEEK']).iloc[:, 1:]
month_dummies = pd.get_dummies(crime_weather['MoName']).iloc[:, 1:]
neigh_dummies = pd.get_dummies(crime_weather['Neighborhood']).iloc[:,1:]
temp_dummies = pd.get_dummies(crime_weather['TempGroups']).iloc[:,1:]
crtype_dummies = pd.get_dummies(crime_weather['MAIN_CRIMECODE']).iloc[:,1:]
shoot_df = crime_weather.join([week_dummies, month_dummies, neigh_dummies, temp_dummies, crtype_dummies])

In [31]:
#Regression
# Removed variables: + July + December + August + Downtown + Precip_Bool + Thursday + November + WestRoxbury + SortaCold
# + October + March + June + May + Wednesday + Tuesday   

train, test = train_test_split(shoot_df, test_size = 0.2)
model_logistic = logit(data=train, formula='Shoot_Status ~ Monday + Sunday + February + Hot + SortaHot \
+ SouthEnd + Roxbury + HydePark + JamaicaPlain + SouthBoston + Dorchester + Mattapan + EastBoston + Charlestown'
                     )
result_logistic = model_logistic.fit()

# Function for analyzing p_values; used for removing p values one by one
def analyze_p(res):
    p = res.pvalues
    p.sort_values(ascending = False, inplace = True)
    print res.prsquared
    print p
    
#analyze_p(result_logistic)
result_logistic.summary()


Optimization terminated successfully.
         Current function value: 0.016461
         Iterations 12
Out[31]:
Logit Regression Results
Dep. Variable: Shoot_Status No. Observations: 207424
Model: Logit Df Residuals: 207409
Method: MLE Df Model: 14
Date: Wed, 27 Apr 2016 Pseudo R-squ.: 0.05560
Time: 18:55:19 Log-Likelihood: -3414.5
converged: True LL-Null: -3615.5
LLR p-value: 4.662e-77
coef std err z P>|z| [95.0% Conf. Int.]
Intercept -7.8589 0.215 -36.554 0.000 -8.280 -7.438
Monday -0.1542 0.138 -1.118 0.264 -0.425 0.116
Sunday 0.4900 0.114 4.312 0.000 0.267 0.713
February -0.4560 0.242 -1.887 0.059 -0.930 0.018
Hot 0.5515 0.106 5.216 0.000 0.344 0.759
SortaHot 0.2461 0.114 2.165 0.030 0.023 0.469
SouthEnd 0.6716 0.276 2.433 0.015 0.130 1.213
Roxbury 2.4074 0.218 11.048 0.000 1.980 2.834
HydePark 1.3624 0.299 4.556 0.000 0.776 1.948
JamaicaPlain 1.5283 0.281 5.443 0.000 0.978 2.079
SouthBoston 0.2161 0.377 0.574 0.566 -0.522 0.954
Dorchester 2.0900 0.226 9.263 0.000 1.648 2.532
Mattapan 2.3584 0.227 10.387 0.000 1.913 2.803
EastBoston 1.1748 0.323 3.637 0.000 0.542 1.808
Charlestown 0.8175 0.492 1.662 0.097 -0.147 1.782

In [32]:
residuals = result_logistic.resid_dev
fig = sns.distplot(residuals)


Part II Conclusion

Of these three questions, the results of Question 1 are most promising. The adjusted r-squared is high enough to make the model worthwhile, several of the variables show sufficiently low p-values, and the resideuals are evenly distributed. The other models need further refinement in order to provide meaningful insights.

Our analysis of Question 1 reveals that, holding all else constant, summer, winter, temperature level, precipitation, and day of the week all have a statistically significant correlation with the number of crimes on a given day. Notably, when all these other factors are held constant, there are between 43 and 54 fewer crimes on Sundays. Hot days have between 27 and 40 more crimes, again holding all else constant. These figures all use a 95% confidence interval.

Part III: Enforcement

In this section, we analyze whether crimes be classified by neighborhood, time of day, or time of year. We imagine that these questions could be particularly relevant to Boston's police department, as they prepare for the city's varying enforcement needs. In this section, we examine only the ten most common crime types.

Part III Question 1: Do certain kinds of common crimes happen in certain neighborhoods?


In [33]:
# Data wrangling

# Find the most common crime types
cr_counts = pd.DataFrame(pd.pivot_table(crime, index = 'MAIN_CRIMECODE', values = 'DAY_WEEK', aggfunc = 'count'))
cr_counts.sort_values('DAY_WEEK', ascending = False, inplace = True)
cr_counts = cr_counts.head(10)
top_crimes = cr_counts.index.tolist()

# Prep the data
districts = neigh_dummies.columns.tolist()
dist_cols = ['Neighborhood'] + top_crimes
neigh_classes = shoot_df[dist_cols].dropna()

In [34]:
# EDA
districts_crimes = pd.pivot_table(neigh_classes, index = 'Neighborhood', values = top_crimes, aggfunc = 'sum')
districts_crimes.plot(kind = 'bar')


Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b270950>

In [35]:
# Build the model
model_distcr = KMeans( n_clusters =  len(districts))
model_distcr = model_distcr.fit(neigh_classes.iloc[:,1:])
neigh_classes['kmeans_class'] = model_distcr.labels_

In [36]:
# Analyze the model

#Plot the classification
plt.figure(figsize=(7,6))
sns.stripplot(x='Neighborhood', y='kmeans_class', data=neigh_classes, jitter= True)

#Confusion matrix
pd.pivot_table(neigh_classes, index='Neighborhood', columns = 'kmeans_class', values = '11xx', aggfunc = 'count')


Out[36]:
kmeans_class 0 1 2 3 4 5 6 7 8 9 10
Neighborhood
Brighton 1167 7476 1557 1243 873 2220 1118 763 1221 403 862
Charlestown 359 1772 415 398 291 788 365 380 369 173 150
Dorchester 2048 12572 2029 2711 1983 3736 2480 1416 1763 1770 1050
Downtown 981 10884 4987 1483 897 2545 2278 1570 1119 1126 906
EastBoston 900 4395 791 1071 620 1370 897 571 638 771 290
HydePark 719 4572 888 1248 810 1644 823 442 882 746 748
JamaicaPlain 738 5305 1175 1030 736 1755 843 840 1095 683 471
Mattapan 1412 9853 1041 1442 1655 2734 1940 626 1425 1133 841
Roxbury 2020 16017 2647 2098 2107 4469 2816 1483 2385 2210 1074
SouthBoston 992 7523 1733 1768 1125 1727 1236 986 1064 1322 601
SouthEnd 1468 13197 6910 1724 1454 2557 2318 3036 1837 1088 1156
WestRoxbury 667 3737 729 1156 743 1239 649 665 870 710 662

Part III Question 2: Do certain kinds of common crimes happen at certain times of day?


In [37]:
# Data wrangling

# Adds Hour column for each crime
crime['Hour'] = crime.FROMDATE.dt.hour

# Removes the preponderance of rows for which time is 00:00:00 00:00)
crime_no_time = crime[(crime.FROMDATE.dt.hour == 0) & (crime.FROMDATE.dt.minute == 0)]
crime_no_time['no_time'] = 'indicator'
crime_time = crime.merge(crime_no_time, how='left')
crime_time = crime_time[crime_time.no_time <> 'indicator']

def time_groups(t):
    if t in [0,1,2,3,4,23]: return "Night"
    elif t in [5,6,7,8,9,10]: return "Morning"
    elif t in [11,12,13,14,15,16]: return "Midday"
    else: return "Evening"

periods = crime_time.join(crtype_dummies)
periods['timegroup'] = periods['Hour'].map(time_groups)
time_cols = ['timegroup', 'Hour'] + top_crimes
periods_classes = periods[time_cols].dropna()

In [38]:
# EDA
hours_crimes = pd.pivot_table(periods_classes, index = 'Hour', values = top_crimes, aggfunc = 'sum')
hours_crimes.plot(kind = 'line')


Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a007fd0>

In [39]:
# Build the model
model_periocr = KMeans( n_clusters =  4)
model_periocr = model_periocr.fit(periods_classes.iloc[:,2:])
periods_classes['kmeans_class'] = model_periocr.labels_

In [40]:
# Analyze the model

#Plot the classification
plt.figure(figsize=(7,6))
sns.stripplot(x='timegroup', y='kmeans_class', data=periods_classes, jitter= True)

#Confusion matrix
pd.pivot_table(periods_classes, index='timegroup', columns = 'kmeans_class', values = '11xx', aggfunc = 'count')


Out[40]:
kmeans_class 0 1 2 3
timegroup
Evening 55258 5325 8525 7323
Midday 58432 5954 7764 10359
Morning 38215 2576 6073 3667
Night 28040 3307 4085 2210

Part III Question 3: Do certain types of crimes happen at certain times of year?


In [41]:
# Data wrangling
shoot_df['Season'] = shoot_df['Month'].map(season_groups)
seas_cols = ['Season'] + top_crimes
seasons_classes = shoot_df[seas_cols].dropna()

In [42]:
# EDA
seasons_crimes = pd.pivot_table(seasons_classes, index = 'Season', values = top_crimes, aggfunc = 'sum')
seasons_crimes.plot(kind = 'line')


Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b265c90>

In [43]:
# Build the model

model_seascr = KMeans( n_clusters =  4)
model_seascr = model_seascr.fit(seasons_classes.iloc[:,2:])
seasons_classes['kmeans_class'] = model_seascr.labels_

In [44]:
# Analyze the model

#Plot the classification
plt.figure(figsize=(7,6))
sns.stripplot(x='Season', y='kmeans_class', data=seasons_classes, jitter= True)

#Confusion matrix
pd.pivot_table(seasons_classes, index='Season', columns = 'kmeans_class', values = '11xx', aggfunc = 'count')


Out[44]:
kmeans_class 0 1 2 3
Season
Fall 4577 50915 4174 6707
Spring 4682 50214 4774 5843
Summer 4566 52238 4204 6995
Winter 3938 45877 4220 5357

Part III Conclusion

In this section, we attempted to classify crimes by a variety of metrics. None of these variables were effective classifiers for crime. However, as shown here, this model could be used across a number of variables to find crime categories. Again, we believe that, with further refinement, this model could be useful to law enforcement.

Part IV: Crimes of Concern to Business-Owners

In this section, we list crimes that may be of particular concern to business owners, and we show which streets have the highest occurances of these crimes in each of Boston's neighborhoods. We imagine that entrepreneurs could examine this data when siting new businesses.


In [45]:
# create a list of business crimes
business_crimes = ['COMMERCIAL BURGLARY', 'VANDALISM', 'ROBBERY', 'OTHER LARCENY', 'BurgTools', 'ARSON', 'Larceny'\
                  'Other Burglary', 'PROSTITUTION CHARGES', 'PubDrink']

# Classify crimes based on whether or not they are business-relevant crimes
def is_bus_cr(c):
    if c in business_crimes:
        return 1
    else:
        return 0

crime['BusCr'] =  crime['INCIDENT_TYPE_DESCRIPTION'].map(is_bus_cr)
dists = crime['Neighborhood'].unique().tolist()

# Create a chart of the top five streets in each district in Boston
for d in dists:
    var = crime.loc[crime.Neighborhood == d]
    streets = pd.DataFrame(pd.pivot_table(var, values = 'BusCr', index = 'STREETNAME', aggfunc = 'sum'))
    streets.sort_values('BusCr', ascending = False, inplace = True)
    top_five = streets.head(5)
    top_five.plot(kind = 'bar', title = d)
    print













Conclusion

In this notebook, we analyzed Boston's crime data from a variety of perspectives. We examined questions relevant to individuals, police officers, and business owners. These models and this approach could be extended to explore questions relevant to other groups of stakeholders, including youth, the elderly, minorities, and city administration and leadership. Each group has its own interests and questions with respect to crime. Other data, such as economic data, unemployment data, and demographic could also be incorporated into our models to provide further crime-related insights.