In this notebook, we explore several components of crime, touching upon questions of both personal and business safety. Our analysis is divided into questions relevant to individuals, questions relevant to the police force, and questions relevant to business. Of course, all crime-related questions may be relevant to all groups.
Our personal safety questions are: How do the season, the weather, and the day of the week affect the number of crimes committed each day? How does crime vary by neighborhood, both overall and per capita? And what factors are associated with shootings?
Our enforcement question is: Can crimes be classified by neighborhood, time of day, or time of year?
Our business question is: Which streets in each of Boston's neighborhoods see the most business-relevant crime?
For each question, we follow a consistent pattern. We begin with data wrangling and exploratory data analysis (EDA). In most cases, we proceed to apply a statistical model. Finally, in some cases, we evaluate the model with residual charts or confusion matrices.
In [1]:
# Load modules, apply settings
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
import requests
import json
from statsmodels.formula.api import ols
from statsmodels.formula.api import logit
import datetime
import calendar
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation, removed in newer scikit-learn
from sklearn.cluster import KMeans
from math import sqrt
%matplotlib inline
mpl.style.use('fivethirtyeight')
pd.options.mode.chained_assignment = None
In [2]:
# Load the primary crime data
base_url = 'https://raw.githubusercontent.com/aisaacso/SafeBoss/master/'
crime_url = base_url + 'Crime_Incident_Reports.csv'
crime = pd.read_csv(crime_url, low_memory = False)
In [3]:
# Create column that is guaranteed to have no NANs, for pivot table counts throughout the notebook
crime['indexer'] = 1
In [4]:
# Date clean-up
# Converts FROMDATE from str to datetime
crime['FROMDATE'] = pd.to_datetime(crime.FROMDATE, format = '%m/%d/%Y %I:%M:%S %p')
# Original range is Jul-2012 to Aug-2015; because in some cases we analyze crime counts, exclude first month of data
crime = crime[crime.FROMDATE > '2012-08-10 00:00:00']
#Add a date column
crime['Date'] = crime.FROMDATE.dt.date
In [5]:
# Convert police district codes to neighborhoods
crime = crime[crime.REPTDISTRICT.notnull()]
crime = crime[crime.REPTDISTRICT != 'HTU']
def get_neighborhood(d):
    if d=='A1': return 'Downtown'
    elif d=='A15': return 'Charlestown'
    elif d=='A7': return 'EastBoston'
    elif d=='B2': return 'Roxbury'
    elif d=='B3': return 'Mattapan'
    elif d=='C6': return 'SouthBoston'
    elif d=='C11': return 'Dorchester'
    elif d=='D4': return 'SouthEnd'
    elif d=='D14': return 'Brighton'
    elif d=='E5': return 'WestRoxbury'
    elif d=='E13': return 'JamaicaPlain'
    elif d=='E18': return 'HydePark'
    else: return '???'
crime['Neighborhood'] = crime['REPTDISTRICT'].map(get_neighborhood)
In [6]:
# Load in weather data
weather_url = base_url + 'weather.csv' # From http://www.ncdc.noaa.gov/cdo-web/datasets
weather = pd.read_csv(weather_url)
In [7]:
# Prepare weather data for adding to crime data
# Include only Boston Logan weather station (has most consistent data)
weather = weather[weather.STATION == 'GHCND:USW00014739']
#Match date format to crime dataset's date format
weather['Date'] = pd.to_datetime(weather.DATE, format = '%Y%m%d')
weather['Date'] = weather.Date.dt.date
# Add temp categories
median = int(weather.TMAX.median())
lower = weather.TMAX.quantile(q = 0.25).astype(int)
upper = weather.TMAX.quantile(q = 0.75).astype(int)
def tmax_groups(t):
    if t<=lower: return 'Cold'
    elif (t>lower and t<=median): return 'SortaCold'
    elif (t>median and t<=upper): return 'SortaHot'
    else: return 'Hot'
def prcp_groups(p):
    if p > 0: return 1
    else: return 0
weather['TempGroups'] = weather['TMAX'].map(tmax_groups)
weather['Precip_Bool'] = weather['PRCP'].map(prcp_groups)
In this section, we analyze three questions: How do the season, the weather, and the day of the week affect the number of crimes committed each day? How does crime vary by neighborhood? And what factors are associated with shootings?
We expect that these questions will be especially relevant to Boston residents who want to understand their personal risk of encountering crime in the city.
In [8]:
# EDA for seasonal variation
dates = pd.pivot_table(crime, values = ['indexer'], index = ['Date', 'Month'], aggfunc = 'count')
dates.rename(columns={'indexer': 'CrimeCount'}, inplace=True) #Rename for more logical referencing
min_crimes = dates.CrimeCount.min()
dates = dates[dates.CrimeCount != min_crimes] # Removes an outlier in Aug 2015
dates.plot(xticks = None, title = 'Crimes per day vary by season')
Out[8]:
In [9]:
# EDA for season
def season_groups(m):
    if m in [12, 1, 2]: return 'Winter'
    elif m in [3, 4, 5]: return 'Spring'
    elif m in [6, 7, 8]: return 'Summer'
    else: return 'Fall'
dates = pd.DataFrame(dates)
dates['Month'] = dates.index.get_level_values(1)
dates['Season'] = dates['Month'].map(season_groups)
seasonal = pd.pivot_table(dates, index = 'Season', values = 'CrimeCount', aggfunc = 'sum')
seasonal.plot(kind = 'bar')
Out[9]:
In [10]:
# EDA for month
months = pd.pivot_table(dates, index = 'Month', values = 'CrimeCount', aggfunc = 'sum')
months.plot(kind = 'bar')
Out[10]:
In [11]:
# EDA for temp
dates['Date'] = dates.index.get_level_values(0)
add_weather = pd.merge(dates, weather, how = 'inner', on = 'Date') # inner join excludes 10 dates
add_weather.plot(kind = 'scatter', x = 'CrimeCount', y = 'TMAX', title = 'Crime increases with temp')
Out[11]:
In [12]:
# EDA for precipitation
add_weather['Raining_or_Snowing?'] = add_weather['Precip_Bool'].map({0:'No', 1:'Yes'})
crime_precip = pd.pivot_table(add_weather, index = 'Raining_or_Snowing?', values = 'CrimeCount', aggfunc = 'mean') # mean crimes per day, rather than a count of days
crime_precip.plot(kind = 'bar', title = 'Fewer crimes per day when raining or snowing')
Out[12]:
In [13]:
#EDA for day of the week
def get_week_day(d):
    daynum = d.weekday()
    days = ['Mon','Tues','Wed','Thurs','Fri','Sat','Sun']
    return days[daynum]
add_weather['DayWeek'] = add_weather['Date'].map(get_week_day)
weekdays = pd.pivot_table(add_weather, index = 'DayWeek', values = 'CrimeCount', aggfunc = 'sum')
weekdays.plot(kind = 'bar')
Out[13]:
In [14]:
# Build model
# Removed variables: Spring
season_dummies = pd.get_dummies(add_weather['Season']).iloc[:, 1:]
day_dummies = pd.get_dummies(add_weather['DayWeek']).iloc[:, 1:]
temp_dummies = pd.get_dummies(add_weather['TempGroups']).iloc[:, 1:]
dates_dummy_df = add_weather.join([day_dummies, season_dummies, temp_dummies])
train, test = train_test_split(dates_dummy_df, test_size = 0.2)
perday_model = ols(data=train, formula='CrimeCount ~ Summer + Winter + Hot + SortaCold + SortaHot +\
Mon + Sat + Sun + Thurs + Tues + Wed + Precip_Bool')
perday_result = perday_model.fit()
perday_result.summary()
Out[14]:
In [15]:
# Analyze the model
residuals = perday_result.resid
fig = sns.distplot(residuals)
In [16]:
# Create prediction for test data
test['Prediction'] = perday_result.predict(test)
In [17]:
# Plot the prediction against the actual
test.plot(kind = 'scatter', x='CrimeCount', y = 'Prediction')
Out[17]:
In [18]:
# Linear regression on correlation between prediction and actual
model_test = ols(data=test, formula = 'Prediction ~ CrimeCount')
test_result = model_test.fit()
In [19]:
# Checking residuals of test regression
test_resid = test_result.resid
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
sns.distplot(test_resid, ax=axes[0]);
sm.qqplot(test_resid, fit=True, line='s', ax=axes[1]);
First, we examine crime overall.
Null hypothesis: The number of crimes per day does not differ by neighborhood.
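For reference, the test statistic computed below is the standard z-score of each neighborhood's mean daily crime count against the citywide figures, with a one-sided p-value:
$$z = \frac{\bar{x}_{\text{neigh}} - \mu_{\text{pop}}}{\sigma_{\text{pop}} / \sqrt{n}}, \qquad p = P(Z > |z|)$$
where $n$ is the number of days observed per neighborhood.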
In [20]:
#EDA for crime overall
reptd = pd.pivot_table(crime, values = 'indexer', index = 'Neighborhood', aggfunc = 'count')
reptd.plot(kind = 'bar', sort_columns = True, title = 'Crime totals', figsize = (8,7))
Out[20]:
In [21]:
#Hypothesis testing
# Set up dataframes
date_by_neigh = pd.pivot_table(crime, index = 'Date', columns = 'Neighborhood', \
values = 'indexer', aggfunc = 'count')
date_by_neigh = date_by_neigh[date_by_neigh.SouthEnd != 167] # removes a South End outlier
date_by_neigh_melt = pd.melt(date_by_neigh).dropna()
# Pop standard deviation
pop_sd = date_by_neigh_melt.std()
# Pop average
pop_avg = date_by_neigh_melt.mean()
# Sample size
sample_size = len(date_by_neigh) # All neighborhoods have the same number of entries +/- 3
# Standard error
st_err = pop_sd / sqrt(sample_size)
date_by_neigh_p = pd.DataFrame(date_by_neigh.mean())
date_by_neigh_p['mean'] = date_by_neigh_p.loc[:,0]
date_by_neigh_p['zscore'] = (date_by_neigh.mean() - pop_avg[0])/st_err[0]
date_by_neigh_p['pscore'] = stats.norm.sf(abs((date_by_neigh_p['zscore'])))
print 'Population average crimes per day: ', pop_avg[0]
date_by_neigh_p
Out[21]:
The null hypothesis is rejected.
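As an added sanity check on the z-scores above (not part of the original analysis), the same comparison can be run as a one-sample t-test for each neighborhood using the scipy.stats module already imported. This is a minimal sketch assuming date_by_neigh and pop_avg exist as computed in the cell above.
# Added sketch: one-sample t-test of each neighborhood's daily crime counts
# against the citywide average computed above
checks = {}
for col in date_by_neigh.columns:
    sample = date_by_neigh[col].dropna()
    t_stat, p_val = stats.ttest_1samp(sample, pop_avg[0])  # two-sided p-value
    checks[col] = p_val
pd.Series(checks).sort_values()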
Next, we examine crime per capita. The chart below shows that the distribution of crime is very different when examined on a per capita basis.
In [22]:
# load in pop file
pop_url = base_url + 'pop.csv' # From our own web research
pop_df = pd.read_csv(pop_url)
In [23]:
# EDA for crime per capita
reptd = pd.DataFrame(reptd)
reptd.rename(columns={'indexer': 'CrimeCount'}, inplace=True) #Rename for more logical referencing
reptd['Neighborhood'] = reptd.index.get_level_values(0)
add_pop = pd.merge(reptd, pop_df, how = 'inner', on = 'Neighborhood')
add_pop['percapita'] = add_pop.CrimeCount / add_pop.Population
add_pop.plot(kind = 'bar', x='Neighborhood', y = 'percapita', sort_columns = True, title = 'Crime per capita', figsize = (8,7))
Out[23]:
In [24]:
# Dummy for shooting
crime['Shoot_Status']=crime['Shooting'].map({'No':0,'Yes':1}).astype(int)
In [25]:
# EDA for day of the week
shoot = crime[crime.Shoot_Status==1]
days = pd.pivot_table(shoot, values = 'indexer', index = 'DAY_WEEK', aggfunc = 'count')
days.plot(kind = 'bar')
Out[25]:
In [26]:
# EDA for month
months = pd.pivot_table(shoot, values = 'indexer', index = 'Month', aggfunc = 'count')
months.plot(kind = 'bar')
Out[26]:
In [27]:
# EDA for weather
weather_shoot = pd.merge(shoot, weather, how = 'inner', on = 'Date')
temps = pd.pivot_table(weather_shoot, index = 'TempGroups', values = 'indexer', aggfunc = 'count')
temps.plot(kind = 'bar')
Out[27]:
In [28]:
# Add in weather data to crime dataset
crime_weather = pd.merge(crime, weather, how = 'outer', on = 'Date')
In [29]:
#Add a column for the month name (regression can't handle numbers as col names)
def mo_as_name(mo):
    return calendar.month_name[mo]
crime_weather['MoName'] = crime_weather['Month'].map(mo_as_name)
In [30]:
# Data prep
week_dummies = pd.get_dummies(crime_weather['DAY_WEEK']).iloc[:, 1:]
month_dummies = pd.get_dummies(crime_weather['MoName']).iloc[:, 1:]
neigh_dummies = pd.get_dummies(crime_weather['Neighborhood']).iloc[:,1:]
temp_dummies = pd.get_dummies(crime_weather['TempGroups']).iloc[:,1:]
crtype_dummies = pd.get_dummies(crime_weather['MAIN_CRIMECODE']).iloc[:,1:]
shoot_df = crime_weather.join([week_dummies, month_dummies, neigh_dummies, temp_dummies, crtype_dummies])
In [31]:
#Regression
# Removed variables: + July + December + August + Downtown + Precip_Bool + Thursday + November + WestRoxbury + SortaCold
# + October + March + June + May + Wednesday + Tuesday
train, test = train_test_split(shoot_df, test_size = 0.2)
model_logistic = logit(data=train, formula='Shoot_Status ~ Monday + Sunday + February + Hot + SortaHot \
+ SouthEnd + Roxbury + HydePark + JamaicaPlain + SouthBoston + Dorchester + Mattapan + EastBoston + Charlestown'
)
result_logistic = model_logistic.fit()
# Function for analyzing p_values; used for removing p values one by one
def analyze_p(res):
    p = res.pvalues
    p.sort_values(ascending = False, inplace = True)
    print res.prsquared
    print p
#analyze_p(result_logistic)
result_logistic.summary()
Out[31]:
In [32]:
residuals = result_logistic.resid_dev
fig = sns.distplot(residuals)
Of these three questions, the results of Question 1 are most promising. The adjusted R-squared is high enough to make the model worthwhile, several of the variables show sufficiently low p-values, and the residuals are evenly distributed. The other models need further refinement in order to provide meaningful insights.
Our analysis of Question 1 reveals that, holding all else constant, summer, winter, temperature level, precipitation, and day of the week all have a statistically significant correlation with the number of crimes on a given day. Notably, when all these other factors are held constant, there are between 43 and 54 fewer crimes on Sundays. Hot days have between 27 and 40 more crimes, again holding all else constant. These figures all use a 95% confidence interval.
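The ranges quoted above are the 95% confidence intervals from the fitted model; as a quick reference, they can be pulled directly from the statsmodels result object. This is a minimal sketch assuming perday_result is the fitted OLS result from Question 1 above.
# Added sketch: extract 95% confidence intervals for each coefficient
conf_int = perday_result.conf_int(alpha = 0.05)  # columns 0 and 1 are the lower and upper bounds
conf_int.columns = ['lower', 'upper']
conf_int['coef'] = perday_result.params
conf_int.loc[['Sun', 'Hot']]  # e.g., the Sunday and Hot-day ranges cited above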
In this section, we analyze whether crimes can be classified by neighborhood, time of day, or time of year. We imagine that these questions could be particularly relevant to Boston's police department as it prepares for the city's varying enforcement needs. Throughout, we examine only the ten most common crime types.
In [33]:
# Data wrangling
# Find the most common crime types
cr_counts = pd.DataFrame(pd.pivot_table(crime, index = 'MAIN_CRIMECODE', values = 'DAY_WEEK', aggfunc = 'count'))
cr_counts.sort_values('DAY_WEEK', ascending = False, inplace = True)
cr_counts = cr_counts.head(10)
top_crimes = cr_counts.index.tolist()
# Prep the data
districts = neigh_dummies.columns.tolist()
dist_cols = ['Neighborhood'] + top_crimes
neigh_classes = shoot_df[dist_cols].dropna()
In [34]:
# EDA
districts_crimes = pd.pivot_table(neigh_classes, index = 'Neighborhood', values = top_crimes, aggfunc = 'sum')
districts_crimes.plot(kind = 'bar')
Out[34]:
In [35]:
# Build the model
model_distcr = KMeans( n_clusters = len(districts))
model_distcr = model_distcr.fit(neigh_classes.iloc[:,1:])
neigh_classes['kmeans_class'] = model_distcr.labels_
In [36]:
# Analyze the model
#Plot the classification
plt.figure(figsize=(7,6))
sns.stripplot(x='Neighborhood', y='kmeans_class', data=neigh_classes, jitter= True)
#Confusion matrix
pd.pivot_table(neigh_classes, index='Neighborhood', columns = 'kmeans_class', values = '11xx', aggfunc = 'count')
Out[36]:
In [37]:
# Data wrangling
# Adds Hour column for each crime
crime['Hour'] = crime.FROMDATE.dt.hour
# Removes the preponderance of rows for which the time is exactly 00:00:00 (i.e., no time was recorded)
crime_no_time = crime[(crime.FROMDATE.dt.hour == 0) & (crime.FROMDATE.dt.minute == 0)]
crime_no_time['no_time'] = 'indicator'
crime_time = crime.merge(crime_no_time, how='left')
crime_time = crime_time[crime_time.no_time != 'indicator']
def time_groups(t):
    if t in [0,1,2,3,4,23]: return "Night"
    elif t in [5,6,7,8,9,10]: return "Morning"
    elif t in [11,12,13,14,15,16]: return "Midday"
    else: return "Evening"
periods = crime_time.join(crtype_dummies)
periods['timegroup'] = periods['Hour'].map(time_groups)
time_cols = ['timegroup', 'Hour'] + top_crimes
periods_classes = periods[time_cols].dropna()
In [38]:
# EDA
hours_crimes = pd.pivot_table(periods_classes, index = 'Hour', values = top_crimes, aggfunc = 'sum')
hours_crimes.plot(kind = 'line')
Out[38]:
In [39]:
# Build the model
model_periocr = KMeans( n_clusters = 4)
model_periocr = model_periocr.fit(periods_classes.iloc[:,2:])
periods_classes['kmeans_class'] = model_periocr.labels_
In [40]:
# Analyze the model
#Plot the classification
plt.figure(figsize=(7,6))
sns.stripplot(x='timegroup', y='kmeans_class', data=periods_classes, jitter= True)
#Confusion matrix
pd.pivot_table(periods_classes, index='timegroup', columns = 'kmeans_class', values = '11xx', aggfunc = 'count')
Out[40]:
In [41]:
# Data wrangling
shoot_df['Season'] = shoot_df['Month'].map(season_groups)
seas_cols = ['Season'] + top_crimes
seasons_classes = shoot_df[seas_cols].dropna()
In [42]:
# EDA
seasons_crimes = pd.pivot_table(seasons_classes, index = 'Season', values = top_crimes, aggfunc = 'sum')
seasons_crimes.plot(kind = 'line')
Out[42]:
In [43]:
# Build the model
model_seascr = KMeans( n_clusters = 4)
model_seascr = model_seascr.fit(seasons_classes.iloc[:,1:]) # only column 0 ('Season') is non-numeric here
seasons_classes['kmeans_class'] = model_seascr.labels_
In [44]:
# Analyze the model
#Plot the classification
plt.figure(figsize=(7,6))
sns.stripplot(x='Season', y='kmeans_class', data=seasons_classes, jitter= True)
#Confusion matrix
pd.pivot_table(seasons_classes, index='Season', columns = 'kmeans_class', values = '11xx', aggfunc = 'count')
Out[44]:
In this section, we attempted to classify crimes by a variety of groupings. None of these variables proved to be an effective classifier for crime. However, as shown here, the same model could be applied across a number of variables to search for crime categories; a generalized sketch follows below. Again, we believe that, with further refinement, this approach could be useful to law enforcement.
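To make that reuse concrete, the cluster-and-crosstab steps above can be wrapped in a small helper that accepts any grouping column. This is a minimal sketch under our own naming (cluster_by is not from the original notebook), assuming shoot_df and top_crimes from the earlier cells.
# Added sketch: generalize the k-means classification to any grouping column
def cluster_by(df, group_col, feature_cols, k):
    data = df[[group_col] + feature_cols].dropna()
    km = KMeans(n_clusters = k)
    data['kmeans_class'] = km.fit_predict(data[feature_cols])
    # Cross-tabulate the cluster labels against the grouping column
    return pd.pivot_table(data, index = group_col, columns = 'kmeans_class',
                          values = feature_cols[0], aggfunc = 'count')

# Example: repeat the seasonal analysis with the same inputs as above
cluster_by(shoot_df, 'Season', top_crimes, k = 4)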
In this section, we list crimes that may be of particular concern to business owners, and we show which streets have the highest occurrences of these crimes in each of Boston's neighborhoods. We imagine that entrepreneurs could consult this data when siting new businesses.
In [45]:
# create a list of business crimes
business_crimes = ['COMMERCIAL BURGLARY', 'VANDALISM', 'ROBBERY', 'OTHER LARCENY', 'BurgTools', 'ARSON', 'Larceny',
                   'Other Burglary', 'PROSTITUTION CHARGES', 'PubDrink']
# Classify crimes based on whether or not they are business-relevant crimes
def is_bus_cr(c):
    if c in business_crimes:
        return 1
    else:
        return 0
crime['BusCr'] = crime['INCIDENT_TYPE_DESCRIPTION'].map(is_bus_cr)
dists = crime['Neighborhood'].unique().tolist()
# Create a chart of the top five streets in each district in Boston
for d in dists:
    var = crime.loc[crime.Neighborhood == d]
    streets = pd.DataFrame(pd.pivot_table(var, values = 'BusCr', index = 'STREETNAME', aggfunc = 'sum'))
    streets.sort_values('BusCr', ascending = False, inplace = True)
    top_five = streets.head(5)
    top_five.plot(kind = 'bar', title = d)
    print
In this notebook, we analyzed Boston's crime data from a variety of perspectives. We examined questions relevant to individuals, police officers, and business owners. These models and this approach could be extended to explore questions relevant to other groups of stakeholders, including youth, the elderly, minorities, and city administration and leadership. Each group has its own interests and questions with respect to crime. Other data, such as economic, unemployment, and demographic data, could also be incorporated into our models to provide further crime-related insights.