DC Crimebusters set out to help users make informed decisions about their personal safety when taking Washington Metropolitan Area Transit Authority (WMATA) trains to their destinations. Riders have developed perceptions about Metro stops within the District based on anecdotal evidence, and this analysis aims to support or disprove those notions. The final product will be a mobile application: users enter their destination in Washington, DC and their time of travel, and the app reports the relative safety of the destination neighborhood at that time. It will tell them which crimes to be most aware of and recommend a taxi or private car hire if the probability of being victimized exceeds a specified threshold.
Our overall hypothesis is that crime increases the closer a commuter gets to a Metro station: a rider has a higher chance of encountering or being victimized by crime because of his or her proximity to a station. In addition to this high-level hypothesis, we formed the following hypotheses based on our data:
Distance from Metro Station: Crime is more likely to occur closer to a Metro station.
Percent of Vacant Homes: The more vacant homes, the more likely crime will occur.
Percent of Occupied Homes: The more occupied homes, the less likely crime will occur.
Percent below the Poverty Line: The higher the percentage of residents below the poverty line, the more likely crime will occur.
Median Home Value: The higher the median home value, the less likely crime will occur.
Data files:
Washington, DC crime information from the District of Columbia's Open Data Initiative: 2014 crime data that included the type of crime, the shift in which it occurred (Day, Evening, or Midnight), the date, and the location.
Location of Metro Stations: includes addresses and latitude/longitude information that would allow us to map the Metro stations.
Demographic/Census Information: this data was broken out by block groups and included the following: Percent of vacant homes, percent of occupied homes, percent of population below the poverty line, the median home value and the median household income.
Before conducting deeper analysis, such as regression or classification, we wanted to become familiar with the main features of the data set. To do this, we started by answering three questions: Which types of crime are most common? When do crimes occur? And which variables appear related to crime counts?
Out of our data set of about 38,000 events, roughly 14,500 are of the type "THEFT/OTHER", making it the leading crime type in DC in 2014. Another 11,000 events are of the type "THEFT F/AUTO", which is specifically theft from auto vehicles. This means that around 67.5% of the crimes that occurred in 2014 were some form of theft.
This graph shows the breakdown of each offense type by shift: Day, Evening, and Midnight. The leading type of crime, THEFT/OTHER, occurred most often during the Evening shift. Theft from auto vehicles, by contrast, occurred mainly during the Day shift. Another interesting observation is that HOMICIDE offenses occurred almost exclusively during the Midnight shift.
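These counts can be reproduced directly from the crime file. The sketch below is a minimal version of that tabulation; the OFFENSE and SHIFT column names are our assumptions about the extract's schema, not confirmed field names.
In [ ]:
import pandas as pd
#A minimal sketch of the counts above; OFFENSE and SHIFT are assumed
#column names in the 2014 crime extract.
crimes = pd.read_csv('CrimeEvents_CalculatedAttr.csv')
#leading offense types across 2014
print crimes['OFFENSE'].value_counts().head()
#share of all events that are some form of theft
print "Theft share: {0:.1%}".format(crimes['OFFENSE'].str.startswith('THEFT').mean())
#offense counts broken out by Day/Evening/Midnight shift
print pd.crosstab(crimes['OFFENSE'], crimes['SHIFT'])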
Mapping individual variables against the dependent variable begins to reveal relationships. Through our investigation of the two graphs below, we were able to identify two potential explanatory variables - distance from Metro stations and percentage of households below the poverty line.
In [4]:
import numpy as np
import pandas as pd
from bokeh import plotting
from bokeh.models import HoverTool
#plotting.output_file('crime_and_poverty.html')
df = pd.read_csv('CrimeEvents_CalculatedAttr.csv')
TOOLS = "save, hover"
poverty_percentage = u'Per Below Poverty Line'
distance = u'Distance from metro KM'
#drop infinities, zeros, and missing values, then convert to a percentage
pov_frame = df[poverty_percentage].replace([np.inf, -np.inf, 0], np.nan).dropna() * 100
hist1, edges1 = np.histogram(pov_frame, bins=20)
#the data source carries the bin counts so the hover tool can display them
source1 = plotting.ColumnDataSource(data=dict(count=hist1))
fig1 = plotting.figure(title="Poverty Percentage and Crime Count",
                       tools=TOOLS, background_fill="#E8DDCB")
fig1.quad(top=hist1, bottom=0, left=edges1[:-1], right=edges1[1:],
          fill_color="#036564", line_color="#033649", source=source1)
hover1 = fig1.select(dict(type=HoverTool))
hover1.tooltips = [("count", "@count"),]
fig1.xaxis.axis_label = 'Percentage Poverty Status'
fig1.yaxis.axis_label = 'Crime Count'
hist2, edges2 = np.histogram(df[distance], bins=20)
source2 = plotting.ColumnDataSource(data=dict(count=hist2))
fig2 = plotting.figure(title="Distance From Metro and Crime Count",
                       tools=TOOLS, background_fill="#E8DDCB")
fig2.quad(top=hist2, bottom=0, left=edges2[:-1], right=edges2[1:],
          fill_color="#036564", line_color="#033649", source=source2)
hover2 = fig2.select(dict(type=HoverTool))
hover2.tooltips = [("count", "@count"),]
fig2.xaxis.axis_label = 'Distance From Metro (KM)'
fig2.yaxis.axis_label = 'Crime Count'
plotting.output_notebook()
plotting.show(plotting.vplot(fig1, fig2))
To get a better sense of the crime environment, we calculated a yearly crime rate per Metro trip and aggregated the rate to the Metro neighborhoods. In the interactive map below, you can see higher rates of crime in the Stadium-Armory Metro neighborhood and much lower rates around Federal Triangle and the Smithsonian.
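The per-trip rate itself is a simple ratio of crime counts to ridership. A minimal sketch of the aggregation follows; the Neighborhood and Annual_Ridership column names are hypothetical, standing in for the joined data's actual schema.
In [ ]:
#Sketch of the per-trip rate aggregation; Neighborhood and
#Annual_Ridership are assumed column names, not the actual schema.
counts = df.groupby('Neighborhood').size()
ridership = df.groupby('Neighborhood')['Annual_Ridership'].first()
crime_per_rider = counts / ridership
print crime_per_rider.sort_values(ascending=False).head()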
We wanted to see whether crime clusters around Metro stations. Using k-means clustering with k equal to 40 (one cluster for each Metro station in the District), we grouped the crime events, as sketched below. The black dots on the map display the crime cluster centers and the red dots represent the Metro stations within the District. If you zoom in, the Metro neighborhoods disappear and the individual crime events appear, colored by cluster.
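A sketch of the clustering step using scikit-learn; the LONGITUDE and LATITUDE column names are assumptions about the crime file's schema.
In [ ]:
from sklearn.cluster import KMeans
#Cluster the 2014 crime events into 40 groups, one per District Metro
#station; LONGITUDE/LATITUDE are assumed column names.
coords = df[['LONGITUDE', 'LATITUDE']].dropna().values
kmeans = KMeans(n_clusters=40, random_state=0).fit(coords)
#the cluster centers are the black dots on the map below
print kmeans.cluster_centers_[:5]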
In [5]:
%%html
<iframe width="940" height="600" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://www.arcgis.com/apps/Embed/index.html?webmap=c84cfb5536f540ac810cd7d7860d99e5&extent=-77.2227,38.8097,-76.8182,38.9866&home=true&zoom=true&scale=true&legend=true&theme=light"></iframe>
While it is difficult to determine from this map which Metro station experiences the most crime, we can observe less dense regions, such as the NW areas of DC near the grey Rosslyn and maroon Court House Metros, or near the midsection of the map within the light blue Smithsonian Metro area. Indeed, if we check image 1, Smithsonian and Court House show some of the lowest crime counts.
We constructed multiple regression models to surface statistically significant relationships. In the code below, we fit linear models to the assault, homicide, robbery, and rape rates.
In [6]:
# Make sure you have run "pip install statsmodels" on the command line prior to running this code
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
crime_summary = pd.read_csv("CrimeEventMultReg.csv")
AssaultRate = crime_summary['AssaultRate']
HomicideRate = crime_summary['HomicideRate']
RobberyRate = crime_summary['RobberyRate']
RapeRate = crime_summary['RapeRate']
independent_vars = pd.DataFrame()
independent_vars['Avg_Dist_Metro_KM'] = crime_summary['Avg_Dist_Metro_KM']
independent_vars['Avg_Ridership'] = crime_summary['Avg_Ridership']/1000.0
independent_vars['Avg_Percent_Public_Transit'] = crime_summary['Avg_Percent_Public_Transit']
independent_vars['Avg_Percent_Vacant_Houses'] = crime_summary['Avg_Percent_Vacant_Houses']
independent_vars['Avg_Percent_Below_Poverty'] = crime_summary['Avg_Percent_Below_Poverty']
independent_vars['Avg_Median_Home_Value'] = crime_summary['Avg_Median_Home_Value']
#add a constant term, as R's lm() does by default
independent_vars = add_constant(independent_vars, prepend=False)
assault_model = sm.OLS(AssaultRate, independent_vars).fit()
homicide_model = sm.OLS(HomicideRate,independent_vars).fit()
robbery_model = sm.OLS(RobberyRate, independent_vars).fit()
rape_model = sm.OLS(RapeRate, independent_vars).fit()
The summary of the linear model for assaults shows that average distance to metro and average percent below the poverty line are related to the assault rate.
In [7]:
print assault_model.summary()
The summary of the linear model for homicides shows that the average percent below the poverty line is related to the homicide rate.
In [8]:
print homicide_model.summary()
The summary of the linear model for robberies shows that average distance to metro and average percent below the poverty line are related to the robbery rate.
In [9]:
print robbery_model.summary()
The summary of the linear model for rape shows that average distance to metro and average percent below the poverty line are related to the rape rate.
In [10]:
print rape_model.summary()
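Rather than reading each summary by eye, the significant terms can also be pulled out programmatically; a short sketch over the four fitted models:
In [ ]:
#Pull the statistically significant terms (p < 0.05) out of each model
#instead of scanning the four summary tables manually.
models = {'assault': assault_model, 'homicide': homicide_model,
          'robbery': robbery_model, 'rape': rape_model}
for name, model in models.items():
    significant = model.pvalues[model.pvalues < 0.05].index.tolist()
    print "{0}: significant terms: {1}".format(name, significant)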
To fit a model that considers spatial distribution, we turned to PySAL (the Python Spatial Analysis Library). PySAL lets a user supply a spatial weights matrix to represent geographic relationships; in our case, the neighborhoods. When we fit an OLS on the geographic data, we discovered that the residuals were not randomly distributed over the study area and showed nonstationarity (heteroskedasticity), indicating spatial dependency. This made the model untrustworthy, so we turned to a two-stage geographically weighted regression that accounts for spatial lag and heteroskedasticity. Below, we build a class to wrap the regression function. Using a shapefile (a geographic data format), we construct the class and fit the regression.
In [11]:
import os
import pysal
class GeographicallyWeightedRegression(object):
    """
    Object that loads and analyzes spatial data.
    """
    def __init__(self, in_shapefile, dependent_variable, independent_variables, **kwargs):
        self.infile = in_shapefile
        self.dependent_var = dependent_variable
        #check that the input is the correct type. This verification
        #can be applied to all variables
        if isinstance(independent_variables, list):
            self.independent_vars = independent_variables
        else:
            raise TypeError("independent variables must be a list of column headers.")
        #uses key word arg
        self.outfile_summary = kwargs.pop("out_summary", None)
        #we can include options. For now just setting one representation
        self.spat_weights = kwargs.pop("spatial_relationship", "queen")
        self.independent_array = None
        self.dependent_array = None
        self.weights = None
        self._summary = None

    def _get_shapefile_dbf(self):
        """
        Returns the dbf file path associated with the input shapefile.
        """
        #name of the file without dir
        file_path = os.path.basename(self.infile).split('.')
        #name of the dbf file without dir, with the dbf extension added
        name = "{0}.dbf".format(file_path[0])
        #return full filepath
        return os.path.join(os.path.dirname(self.infile), name)

    def _get_dependent_var_array(self, dbf):
        """
        Returns the dependent variable column as a numpy array.
        PySAL regression requires this.
        """
        #grab the single column from the dbf
        dependent = dbf.by_col(self.dependent_var)
        #turn it into a numpy type
        dependent_array = np.asfarray(dependent)
        #pysal requires an n x 1 array
        dependent_array.shape = (len(dependent), 1)
        return dependent_array

    def _get_independent_matrix(self, dbf):
        """
        Turns the independent columns into a numpy array for analysis.
        """
        #create a list of columns
        new_array = [dbf.by_col(col) for col in self.independent_vars]
        #turn the list of columns into an array, transpose, and return
        return np.asfarray(new_array).T

    def _save_summary(self, ols):
        """
        Writes the output of the ols to file.
        """
        with open(self.outfile_summary, "w") as target:
            target.write(ols.summary)

    def _get_spatial_weights(self):
        """
        Creates the spatial weights object for use in the regression.
        This structure defaults to a certain spatial relationship.
        We can add more key words and create different relationships.
        """
        #this matrix tells the algorithm how to look at neighboring features
        #queen looks at polygons with shared edges or corners
        #(queen b/c of the way the chess piece moves)
        if self.spat_weights == "queen":
            return pysal.queen_from_shapefile(self.infile)
        elif self.spat_weights == "rook":
            return pysal.rook_from_shapefile(self.infile)
        else:
            #won't use spatial weights
            return None

    def run(self):
        """
        This will run the spatial analysis.
        """
        #create the spatial weights matrix
        self.weights = self._get_spatial_weights()
        #all shapefiles come with a dbf file; this locates it
        dbf_file = self._get_shapefile_dbf()
        #open the dbf file for analysis
        open_dbf = pysal.open(dbf_file)
        #create the dependent array for model input
        self.dependent_array = self._get_dependent_var_array(open_dbf)
        #create the independent array for model input
        self.independent_array = self._get_independent_matrix(open_dbf)
        #run the two-stage spatial model, which accounts for spatial lag
        #and heteroskedasticity in the residuals
        ols = pysal.spreg.GM_Combo_Het(self.dependent_array, self.independent_array,
                                       w=self.weights, name_y=self.dependent_var,
                                       name_x=self.independent_vars, name_w=self.spat_weights,
                                       name_ds=os.path.basename(self.infile))
        if self.outfile_summary:
            self._save_summary(ols)
        self._summary = ols.summary
        open_dbf.close()
        return ols

    def print_summary(self):
        print self._summary

"""
StationA_4   Average of Per Below Poverty Line
StationA_7   Average of Distance from metro KM
StationA_8   CrimePerRider
"""
shapefile = "Metro_Neighborhoods_StatsJoin.shp"
dependent = "StationA_8"
independent = ['StationA_7', 'StationA_4']
In [15]:
gwr = GeographicallyWeightedRegression(shapefile, dependent, independent,
                                       spatial_relationship="queen")
gwr.run()
gwr.print_summary()
The summary output shows that the model performed with an R-squared of 0.67, and we fail to reject the null hypothesis for the lambda parameter, meaning we effectively handled the nonstationarity. It is important to note that the only significant variable in this regression is distance from Metro stations.
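One way to confirm that the spatial dependence has been handled is to run Moran's I on the model residuals, using the weights the class already built. A sketch against the mid-2010s PySAL API (pysal.Moran, and the .u residuals attribute on spreg models):
In [ ]:
#Check the residuals of the two-stage model for remaining spatial
#autocorrelation; in pysal.spreg models the residuals live in .u.
model = gwr.run()  #run() returns the fitted GM_Combo_Het object
mi = pysal.Moran(model.u.flatten(), gwr.weights)
print "Moran's I on residuals: {0:.3f} (p = {1:.3f})".format(mi.I, mi.p_norm)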
Now that we know more about the relationships and the crime environment within the District, we want to help riders plan their trips. To do this, we created a Support Vector Classifier (SVC) that considers crime location, demographic information, and time of day. The example below shows raw data being used to predict what crime you might encounter. The SVC was wrapped in a pipeline that optimized the build with a grid search and used an imputer to fix null values; the classifier performs at 43% accuracy. The class below loads the pickled SVC and classifies inputs.
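build.py itself is not shown in this notebook. A minimal sketch of the pipeline described above, using the mid-2010s scikit-learn API (sklearn.grid_search; newer releases use sklearn.model_selection) with an assumed feature subset and parameter grid:
In [ ]:
import dill
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
#Sketch of the build step; the feature subset, OFFENSE label column,
#and parameter grid are assumptions, not the actual contents of build.py.
df = pd.read_csv('CrimeEvents_CalculatedAttr.csv')
X = df[[u'Distance from metro KM', u'Per Below Poverty Line']].values
y = df['OFFENSE'].values
#impute nulls, then fit an SVC, tuning its parameters with a grid search
pipeline = Pipeline([('imputer', Imputer(strategy='median')),
                     ('svc', SVC(probability=True))])
grid = GridSearchCV(pipeline,
                    param_grid={'svc__C': [0.1, 1.0, 10.0],
                                'svc__gamma': [0.01, 0.1, 1.0]},
                    cv=5)
grid.fit(X, y)
#persist the best estimator for the EventClassifier below
with open('SVC_2015-29-04.pickle', 'wb') as pkl:
    dill.dump(grid.best_estimator_, pkl)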
In [13]:
import dill
class EventClassifier(object):
    """
    Class that will be used to classify user input
    events. The event classifier will use the model
    from build.py.
    """
    def __init__(self, model=None):
        """
        Initialize the event classifier with the model
        created in build.py.
        """
        ## Load the model from the pickle
        with open(model, 'rb') as pkl:
            self._classifier = dill.load(pkl)

    def classify(self, instance):
        """
        Takes the user input (instance) and returns the most
        likely crime the user will encounter on their trip.
        """
        ## Use the classifier to predict the most likely crime
        most_likely = self._classifier.predict(instance)
        return most_likely
In [14]:
#Create the classifier object
clf = EventClassifier(model='SVC_2015-29-04.pickle')
#Example data ready for the classifier
tests = np.asfarray([['-77.00589537', '38.90611936', '0.44015444', '0.505928854', '0.055153707', '0.944846293', '0.052631579', '0.295465729', '4.238', '7.7461', '8.412', '0', '1', '0'],
                     ['-77.03382744', '38.93071433', '0.499459459', '0.59562212', '0.178443114', '0.821556886', '0.048104956', '0.351140991', '4.028', '4.4688', '12.755', '0', '1', '0'],
                     ['-77.06697201', '38.90685672', '0.12755102', '0.452574526', '0.122916667', '0.877083333', '0', '1.583192993', '10.00001', '7.0388', '-999', '0', '1', '0'],
                     ['-77.0646675', '38.94619085', '0.638018937', '0.588370314', '0.11689008', '0.88310992', '0', '0.323109894', '3.451', '7.6532', '6.505', '0', '1', '0']])
#correct answers to the tests
answers = ['THEFT/OTHER',
           'ASSAULT W/DANGEROUS WEAPON',
           'THEFT F/AUTO',
           'THEFT/OTHER']
for i, test in enumerate(tests):
    print "Classifier predicts {0} and the actual answer is {1}".format(clf.classify(test)[0], answers[i])
DC-Crimebusters:
Georgetown University
School of Continuing Studies
Data Analytics Certificate 2015