DC Crimebusters - Georgetown Data Analytics Certificate, Spring 2015

Rana, David, Mike, Kathleen and Andrew

DC Crimebusters set out to help users make informed decisions about their personal safety when taking Washington Metropolitan Area Transit Authority (WMATA) trains to their destinations. Riders have developed perceptions about Metro stops within the District based on anecdotal evidence, and this analysis aims to support or disprove those notions. The final product will be a mobile application in which users may enter their destination in Washington, DC and the time of travel; the app will inform them of the relative safety of the destination neighborhood at that time, tell them which crimes to be most aware of, and recommend taking a taxi or private car hire if the probability of being victimized exceeds a specified threshold.

Introduction

Metro Usage Polls

Please take a few minutes to answer this poll question:

pollev.com/davidculver948

Hypothesis

Overall, we expect more crime the closer a commuter gets to a Metro station: a Metro rider has a higher chance of encountering or being victimized by crime because of his or her proximity to a station. In addition to this high-level hypothesis, we formed the following sub-hypotheses based on our data:

Distance from Metro Station: Crime occurs more frequently closer to a Metro station.

Percent of Vacant Homes: The more vacant homes, the more likely crime will occur.

Percent of Occupied Homes: The more occupied homes, the less likely crime will occur.

Percent below the Poverty Line: The more households below the poverty line, the more likely crime will occur.

Median Home Value: The higher the median home value, the less likely crime will occur.

Data

Data files:

Washington, DC Crime information from the District of Columbia’s Open Data Initiative: 2014 crime data that included the type of crime, the time it occurred (broken out by day, evening, and midnight shifts), the date, and the location.

Location of Metro Stations: includes addresses and latitude/longitude information that would allow us to map the Metro stations.

Demographic/Census Information: this data was broken out by block groups and included the percent of vacant homes, percent of occupied homes, percent of the population below the poverty line, median home value, and median household income.

Data Exploration:

Before conducting deeper data analysis, such as regression or classification, we wanted to become familiar with the main features of the data set. To do this, we started by answering three questions:

  • What is the most common type of crime?
  • When did crimes most frequently occur?
  • Which metro station experienced the most crime?

What is the most common type of crime?

Out of our data set of about 38,000 events, roughly 14,500 were of the type "THEFT/OTHER", making that the leading kind of crime in DC in 2014. Another 11,000 events were of the type "THEFT F/AUTO", which is specifically theft from automobiles. This means that around 67.5% of the crimes that occurred in 2014 were some form of theft.
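
These counts can be reproduced directly from the events table. A minimal sketch, assuming the offense type lives in a column named 'OFFENSE' (a hypothetical name):

#count events by offense type; 'OFFENSE' is an assumed column name
counts = df['OFFENSE'].value_counts()
print counts.head()

#share of all events that are some form of theft
theft = counts.get('THEFT/OTHER', 0) + counts.get('THEFT F/AUTO', 0)
print "Theft share: {0:.1%}".format(float(theft) / counts.sum())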

When did crime most frequently occur?

This graph shows the breakdown of each offense type by shift - Day, Evening, and Midnight. We can see that the leading type of crime, THEFT/OTHER, occurred most often during the Evening shift. Theft from automobiles, specifically, occurred mainly during the Day shift. Another interesting observation is that HOMICIDE crimes occurred almost exclusively during the Midnight shift.
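
The shift breakdown behind this graph can be tabulated with a cross-tab; a minimal sketch, again assuming hypothetical column names 'OFFENSE' and 'SHIFT':

#offense counts broken out by shift; column names are assumptions
by_shift = pd.crosstab(df['OFFENSE'], df['SHIFT'])
print by_shift.loc[['THEFT/OTHER', 'THEFT F/AUTO', 'HOMICIDE']]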

Which metro station experienced the most crime?

Finally, we can see that the metro station that experienced the most crimes in 2014 was Columbia Heights with 2,593 crime incidents. Within that, THEFT/OTHER is the leading type of crime experienced.
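
A per-station tally is a one-liner once each event has been tagged with its nearest station; the column name 'Nearest_Metro' below is hypothetical:

#value_counts sorts descending, so the busiest station comes first
print df['Nearest_Metro'].value_counts().head()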

Mapping individual variables against the dependent variable begins to reveal relationships. Through our investigation of the two graphs below, we were able to identify two potential explanatory variables - distance from Metro stations and percentage of households below the poverty line.


In [4]:
import numpy as np
import pandas as pd
from bokeh import plotting
from bokeh.models import HoverTool

#plotting.output_file('crime_and_poverty.html')

df = pd.read_csv('CrimeEvents_CalculatedAttr.csv')

TOOLS = "save, hover"

poverty_percentage = u'Per Below Poverty Line'

distance = u'Distance from metro KM'

#replace infinite and zero values with NaN, drop them, and convert to percent
pov_frame = df[poverty_percentage].replace([np.inf, -np.inf, 0], np.nan).dropna() * 100

hist1, edges1 = np.histogram(pov_frame, bins=20)

source1 = plotting.ColumnDataSource(data=dict(count=hist1))

fig1 = plotting.figure(title="Poverty Percentage and Crime Count",
                       tools=TOOLS, background_fill="#E8DDCB")

fig1.quad(top=hist1, bottom=0, left=edges1[:-1], right=edges1[1:],
          fill_color="#036564", line_color="#033649", source=source1)

hover1 = fig1.select(dict(type=HoverTool))
hover1.tooltips = [("count", "@count")]

fig1.xaxis.axis_label = 'Percentage Poverty Status'
fig1.yaxis.axis_label = 'Crime Count'

hist2, edges2 = np.histogram(df[distance], bins=20)

source2 = plotting.ColumnDataSource(data=dict(count=hist2))

fig2 = plotting.figure(title="Distance From Metro and Crime Count",
                       tools=TOOLS, background_fill="#E8DDCB")

fig2.quad(top=hist2, bottom=0, left=edges2[:-1], right=edges2[1:],
          fill_color="#036564", line_color="#033649", source=source2)

hover2 = fig2.select(dict(type=HoverTool))
hover2.tooltips = [("count", "@count")]

fig2.xaxis.axis_label = 'Distance From Metro (KM)'
fig2.yaxis.axis_label = 'Crime Count'

#output_notebook must be called before show
plotting.output_notebook()
plotting.show(plotting.vplot(fig1, fig2))


BokehJS successfully loaded.

To get a better sense of the crime environment, we calculated a yearly crime rate per Metro trip and aggregated the rate to the Metro neighborhoods. In the interactive map below, you can see higher rates of crime in the Stadium Armory Metro neighborhood and much lower rates in the Federal Triangle and Smithsonian neighborhoods.
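
The rate itself is a simple ratio of crime counts to ridership. A hedged sketch of the aggregation, with hypothetical file and column names:

#per-neighborhood aggregates; file and column names are assumptions
neighborhoods = pd.read_csv('MetroNeighborhoods.csv')
neighborhoods['CrimePerTrip'] = (neighborhoods['CrimeCount'] /
                                 neighborhoods['Annual_Ridership'])
print neighborhoods.sort('CrimePerTrip', ascending=False).head()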

We wanted to see whether crime clustered around Metro stations. Using k-means clustering with k = 40 (one cluster for each of the District's 40 Metro stations), we grouped the crime events, as sketched below. The black dots on the map display the crime cluster centers and the red dots represent the Metro stations within the District. If you zoom in, the Metro neighborhoods disappear and the individual crime events appear, colored by cluster.
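
The clustering step can be sketched with scikit-learn's KMeans; the longitude/latitude column names are assumptions, and the map itself was built in ArcGIS:

from sklearn.cluster import KMeans

#k = 40 matches the number of Metro stations within the District
coords = df[['Longitude', 'Latitude']].dropna().values
kmeans = KMeans(n_clusters=40, random_state=0).fit(coords)
centers = kmeans.cluster_centers_   #the black dots on the map
labels = kmeans.labels_             #cluster assignment for each crime event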


In [5]:
%%html
<iframe width="940" height="600" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://www.arcgis.com/apps/Embed/index.html?webmap=c84cfb5536f540ac810cd7d7860d99e5&amp;extent=-77.2227,38.8097,-76.8182,38.9866&amp;home=true&amp;zoom=true&amp;scale=true&amp;legend=true&amp;theme=light"></iframe>


While it is difficult to determine from this map which Metro station experiences the most crime, we can observe less dense regions, such as the NW areas of DC near the grey Rosslyn and maroon Court House Metro neighborhoods, or the middle of the map within the light blue Smithsonian Metro area. Indeed, if we check image 1, Smithsonian and Court House have some of the lowest crime counts.

We constructed multiple regression models to identify statistically significant relationships between the explanatory variables and crime. In the code below, we fit a linear model to each of the assault, homicide, robbery, and rape rates.


In [6]:
# Make sure you have run "pip install statsmodels" on the command line prior to running this code
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
crime_summary = pd.read_csv("CrimeEventMultReg.csv")

AssaultRate = crime_summary['AssaultRate']
HomicideRate = crime_summary['HomicideRate']
RobberyRate = crime_summary['RobberyRate']
RapeRate = crime_summary['RapeRate']

independent_vars = pd.DataFrame()

independent_vars['Avg_Dist_Metro_KM'] = crime_summary['Avg_Dist_Metro_KM']
independent_vars['Avg_Ridership'] = crime_summary['Avg_Ridership']/1000.0
independent_vars['Avg_Percent_Public_Transit'] = crime_summary['Avg_Percent_Public_Transit']
independent_vars['Avg_Percent_Vacant_Houses'] = crime_summary['Avg_Percent_Vacant_Houses']
independent_vars['Avg_Percent_Below_Poverty'] = crime_summary['Avg_Percent_Below_Poverty']
independent_vars['Avg_Median_Home_Value'] = crime_summary['Avg_Median_Home_Value']

#add a constant intercept term, as in R's lm()
independent_vars = add_constant(independent_vars, prepend=False)

assault_model = sm.OLS(AssaultRate, independent_vars).fit()
homicide_model = sm.OLS(HomicideRate,independent_vars).fit()
robbery_model = sm.OLS(RobberyRate, independent_vars).fit()
rape_model = sm.OLS(RapeRate, independent_vars).fit()

The summary of the linear model for assaults shows that average distance to Metro and average percent below the poverty line are significantly related to the assault rate (p < 0.01 for both).


In [7]:
print assault_model.summary()


                            OLS Regression Results                            
==============================================================================
Dep. Variable:            AssaultRate   R-squared:                       0.749
Model:                            OLS   Adj. R-squared:                  0.703
Method:                 Least Squares   F-statistic:                     16.39
Date:                Sat, 02 May 2015   Prob (F-statistic):           1.19e-08
Time:                        00:21:27   Log-Likelihood:                 113.26
No. Observations:                  40   AIC:                            -212.5
Df Residuals:                      33   BIC:                            -200.7
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================================
                                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------------
Avg_Dist_Metro_KM              0.0398      0.011      3.786      0.001         0.018     0.061
Avg_Ridership                  0.0003      0.000      0.715      0.480        -0.001     0.001
Avg_Percent_Public_Transit     0.0431      0.054      0.792      0.434        -0.068     0.154
Avg_Percent_Vacant_Houses      0.0988      0.066      1.501      0.143        -0.035     0.233
Avg_Percent_Below_Poverty      0.2449      0.081      3.039      0.005         0.081     0.409
Avg_Median_Home_Value         -0.0015      0.002     -0.636      0.529        -0.006     0.003
const                         -0.0407      0.034     -1.181      0.246        -0.111     0.029
==============================================================================
Omnibus:                        0.539   Durbin-Watson:                   2.416
Prob(Omnibus):                  0.764   Jarque-Bera (JB):                0.670
Skew:                           0.207   Prob(JB):                        0.715
Kurtosis:                       2.521   Cond. No.                         460.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
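
Rather than reading the full table, the significant predictors can also be pulled straight off the fitted results object; a minimal sketch:

#predictors significant at the 5% level
print assault_model.pvalues[assault_model.pvalues < 0.05]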

The summary of the linear model for homicides shows that the average percent below the poverty line is significantly related to the homicide rate (p < 0.001).


In [8]:
print homicide_model.summary()


                            OLS Regression Results                            
==============================================================================
Dep. Variable:           HomicideRate   R-squared:                       0.805
Model:                            OLS   Adj. R-squared:                  0.769
Method:                 Least Squares   F-statistic:                     22.65
Date:                Sat, 02 May 2015   Prob (F-statistic):           2.15e-10
Time:                        00:21:29   Log-Likelihood:                 111.48
No. Observations:                  40   AIC:                            -209.0
Df Residuals:                      33   BIC:                            -197.1
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================================
                                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------------
Avg_Dist_Metro_KM              0.0105      0.011      0.957      0.346        -0.012     0.033
Avg_Ridership                  0.0004      0.000      0.977      0.336        -0.000     0.001
Avg_Percent_Public_Transit     0.0664      0.057      1.167      0.252        -0.049     0.182
Avg_Percent_Vacant_Houses      0.1106      0.069      1.606      0.118        -0.029     0.251
Avg_Percent_Below_Poverty      0.5851      0.084      6.942      0.000         0.414     0.757
Avg_Median_Home_Value          0.0031      0.002      1.259      0.217        -0.002     0.008
const                         -0.0694      0.036     -1.924      0.063        -0.143     0.004
==============================================================================
Omnibus:                        3.273   Durbin-Watson:                   2.072
Prob(Omnibus):                  0.195   Jarque-Bera (JB):                2.302
Skew:                           0.572   Prob(JB):                        0.316
Kurtosis:                       3.272   Cond. No.                         460.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The summary of the linear model for robberies shows that average distance to Metro and average percent below the poverty line are related to the robbery rate (p < 0.05 for both).


In [9]:
print robbery_model.summary()


                            OLS Regression Results                            
==============================================================================
Dep. Variable:            RobberyRate   R-squared:                       0.612
Model:                            OLS   Adj. R-squared:                  0.541
Method:                 Least Squares   F-statistic:                     8.660
Date:                Sat, 02 May 2015   Prob (F-statistic):           1.09e-05
Time:                        00:21:38   Log-Likelihood:                 111.15
No. Observations:                  40   AIC:                            -208.3
Df Residuals:                      33   BIC:                            -196.5
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================================
                                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------------
Avg_Dist_Metro_KM              0.0270      0.011      2.439      0.020         0.004     0.050
Avg_Ridership                  0.0002      0.000      0.395      0.696        -0.001     0.001
Avg_Percent_Public_Transit     0.0530      0.057      0.925      0.362        -0.064     0.170
Avg_Percent_Vacant_Houses      0.0453      0.069      0.652      0.519        -0.096     0.186
Avg_Percent_Below_Poverty      0.1996      0.085      2.348      0.025         0.027     0.372
Avg_Median_Home_Value         -0.0007      0.002     -0.275      0.785        -0.006     0.004
const                         -0.0283      0.036     -0.777      0.443        -0.102     0.046
==============================================================================
Omnibus:                        3.583   Durbin-Watson:                   2.257
Prob(Omnibus):                  0.167   Jarque-Bera (JB):                2.720
Skew:                           0.634   Prob(JB):                        0.257
Kurtosis:                       3.163   Cond. No.                         460.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The summary of the linear model for rape shows that average distance to Metro and average percent below the poverty line are related to the rape rate (p < 0.01 for both).


In [10]:
print rape_model.summary()


                            OLS Regression Results                            
==============================================================================
Dep. Variable:               RapeRate   R-squared:                       0.659
Model:                            OLS   Adj. R-squared:                  0.597
Method:                 Least Squares   F-statistic:                     10.62
Date:                Sat, 02 May 2015   Prob (F-statistic):           1.47e-06
Time:                        00:21:39   Log-Likelihood:                 114.77
No. Observations:                  40   AIC:                            -215.5
Df Residuals:                      33   BIC:                            -203.7
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================================
                                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------------
Avg_Dist_Metro_KM              0.0292      0.010      2.884      0.007         0.009     0.050
Avg_Ridership                  0.0004      0.000      1.036      0.308        -0.000     0.001
Avg_Percent_Public_Transit     0.0044      0.052      0.083      0.934        -0.102     0.111
Avg_Percent_Vacant_Houses      0.0154      0.063      0.243      0.809        -0.114     0.144
Avg_Percent_Below_Poverty      0.2473      0.078      3.186      0.003         0.089     0.405
Avg_Median_Home_Value         -0.0006      0.002     -0.272      0.787        -0.005     0.004
const                         -0.0124      0.033     -0.374      0.711        -0.080     0.055
==============================================================================
Omnibus:                        1.809   Durbin-Watson:                   2.003
Prob(Omnibus):                  0.405   Jarque-Bera (JB):                1.723
Skew:                           0.452   Prob(JB):                        0.422
Kurtosis:                       2.536   Cond. No.                         460.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

To fit a model that considers spatial distribution, we turned to PySal (the Python Spatial Analysis Library). PySal lets a user supply a spatial weights matrix to represent geographic relationships - in our case, the Metro neighborhoods. When we fit a geographic OLS, we discovered that the residuals were not randomly distributed over the study area, and we saw nonstationarity (heteroskedasticity) indicating spatial dependency. This made our model untrustworthy, so we turned to a two-stage geographically weighted regression that accounts for spatial lag and heteroskedasticity. Below, we build a class to wrap the regression function; using a shapefile (a geographic data format), we construct the class and fit the regression.
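
The residual diagnostic described above can be sketched with PySal's Moran's I statistic; here ols_residuals stands in for the residual array from the aspatial OLS (an assumption, not shown earlier):

import pysal

#queen contiguity weights from the neighborhood polygons
w = pysal.queen_from_shapefile("Metro_Neighborhoods_StatsJoin.shp")

#Moran's I on the OLS residuals; a small p-value means the residuals
#are spatially clustered rather than randomly distributed
mi = pysal.Moran(ols_residuals, w)
print mi.I, mi.p_norm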


In [11]:
import os
import pysal

class GeographicallyWeightedRegression(object):
    """
    object that loads and analyzes spatial
    data.
    """    
    def __init__(self, in_shapefile, dependent_variable, independent_variables, **kwargs):
        
        self.infile = in_shapefile

        self.dependent_var = dependent_variable
        
        #Checking if input is the correct object. This verification
        #can be applied to all variables        
        if isinstance(independent_variables, list):
            self.independent_vars = independent_variables
        else:
            raise TypeError("independent variables must be a list of column headers.")
        
        #uses key word arg            
        self.outfile_summary = kwargs.pop("out_summary", None)
        
        #we can include options. For now
        #just setting one representation
        self.spat_weights = kwargs.pop("spatial_relationship", "queen")
        
        self.independent_array = None
        self.dependent_array = None
        self.weights = None
        self._summary = None
        
    def _get_shapefile_dbf(self):
        """
        returns the dbf file path associated with the 
        input shapefile
        """
        
        #name of the file without dir
        file_path = os.path.basename(self.infile).split('.')
        
        #name of the dbf file without dir and add dbf extension
        name = "{0}.dbf".format(file_path[0])
        
        #return full filepath
        return os.path.join(os.path.dirname(self.infile), name)
        
    def _get_dependent_var_array(self, dbf):
        """
        returns the dependent variable column as a numpy
        array. PySal regression requires this
        """
        #grab the single column from the dbf
        dependent = dbf.by_col(self.dependent_var)
        
        #turn it into a numpy type
        dependent_array = np.asfarray(dependent)
        
        #pysal requires an n x 1 column vector
        dependent_array.shape = (len(dependent),1)
        
        return dependent_array
        
    def _get_independent_matrix(self, dbf):
        """
        turns the independent columns into a 
        numpy array for analysis.
        """
        #create a list of columns
        new_array = [dbf.by_col(col) for col in self.independent_vars]
        
        #turn the list of columns into an array, transpose, and return
        return np.asfarray(new_array).T
        
    def _save_summary(self, ols):
        """
        writes the output of the ols to file
        """
        with open(self.outfile_summary, "w") as target:
            target.write(ols.summary)   
            
    def _get_spatial_weights(self):
        """
        creates the spatial weights object for use in
        OLS. This structure defaults to a certain
        spatial relationship. We can add more key words
        and create different relationships
        """
        #this matrix tells the algorithm how to 
        #look at neighboring features
        #queen looks at polygons with shared edges
        #queen b/c the way the chess piece moves
        if self.spat_weights == "queen":
            return pysal.queen_from_shapefile(self.infile)
            
        elif self.spat_weights == "rook":
            return pysal.rook_from_shapefile(self.infile)
            
        else:
            #won't use spatial weights
            return None
            
    def run(self):
        """
        This will run the spatial analysis
        """
        

        #create the spatial weights matrix
        self.weights = self._get_spatial_weights()
        
        #all shapefiles come with a dbf file
        #this will open that
        dbf_file = self._get_shapefile_dbf()
        
        #open the dbf file for analysis
        open_dbf = pysal.open(dbf_file)
        
        #create the dependent array for OLS input
        self.dependent_array = self._get_dependent_var_array(open_dbf)
        
        #create the indepent array for OLS input
        self.independent_array = self._get_independent_matrix(open_dbf)
        
        #run the two-stage spatial regression (GM_Combo_Het), which
        #includes a spatially lagged dependent variable and corrects
        #for heteroskedasticity in the error term
        ols = pysal.spreg.GM_Combo_Het(self.dependent_array, self.independent_array,
                              w=self.weights, name_y=self.dependent_var, 
                              name_x=self.independent_vars, name_w=self.spat_weights,
                              name_ds=os.path.basename(self.infile))
                                
        if self.outfile_summary:
            self._save_summary(ols)
        self._summary = ols.summary
        
        open_dbf.close()
        return ols
        
    def print_summary(self):
        print self._summary
        
"""
StationA_4	Average of Per Below Poverty Line
StationA_7	Average of Distance from metro KM
StationA_8 CrimePerRider
"""
shapefile= "Metro_Neighborhoods_StatsJoin.shp"
dependent = "StationA_8"
independent = ['StationA_7', 'StationA_4']

In [15]:
gwr = GeographicallyWeightedRegression(shapefile, dependent, independent,
                                       spatial_relationship="queen")
gwr.run()
gwr.print_summary()


REGRESSION
----------
SUMMARY OF OUTPUT: SPATIALLY WEIGHTED TWO STAGE LEAST SQUARES (HET)
-------------------------------------------------------------------
Data set            :Metro_Neighborhoods_StatsJoin.shp
Weights matrix      :       queen
Dependent Variable  :  StationA_8               Number of Observations:          40
Mean dependent var  :      0.4360               Number of Variables   :           4
S.D. dependent var  :      0.5128               Degrees of Freedom    :          36
Pseudo R-squared    :      0.6743
Spatial Pseudo R-squared:  0.6567
N. of iterations    :           1               Step1c computed       :          No

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      -0.2461448       0.1021545      -2.4095361       0.0159728
          StationA_4       3.9211239       2.5649987       1.5287041       0.1263378
          StationA_7       0.5850807       0.2550175       2.2942763       0.0217746
        W_StationA_8       0.0569309       0.0464038       1.2268579       0.2198760
              lambda       0.0102242       0.0901727       0.1133842       0.9097260
------------------------------------------------------------------------------------
Instrumented: W_StationA_8
Instruments: W_StationA_4, W_StationA_7
================================ END OF REPORT =====================================

The summary output shows that the model performed with a pseudo R-squared of 0.67, and we fail to reject the null hypothesis for the lambda parameter (p = 0.91), meaning the model effectively handled the nonstationarity. It is important to note that the only statistically significant explanatory variable in this regression is distance from Metro stations.

Planning a trip

Now that we know more about the relationships and the crime environment within the District, we want to help riders plan their trips. To do this, we created a support vector classifier (SVC) that considers crime location, demographic information, and time of day. The SVC was wrapped in a pipeline that fixed null values with an imputer and tuned the model with grid search (sketched below); the classifier performs at 43% accuracy. The example further below shows raw data being used to predict which crime you might encounter, and the class below it is used to load the pickled SVC and classify inputs.
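
A hedged sketch of how build.py might assemble that pipeline; the parameter grid is an assumption, not the exact build, and the import paths reflect the 2015-era scikit-learn API:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV

#imputer fills null values before the SVC sees them
pipeline = Pipeline([('imputer', Imputer(strategy='mean')),
                     ('svc', SVC())])

#grid search over hypothetical SVC hyperparameters
param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': [0.01, 0.1, 1]}
search = GridSearchCV(pipeline, param_grid, cv=5)
#search.fit(X_train, y_train) would produce the classifier pickled below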


In [13]:
import dill

class EventClassifier(object):
    """
    Class that will be used to classify user input
    events. The event classifier will use the model
    from build.py
    """

    def __init__(self, model=None):
        """
        Initialize the event classifier with the model
        created in build.py
        """
        ## Load the pickled model from disk
        with open(model, 'rb') as pkl:
            self._classifier = dill.load(pkl)

    def classify(self, instance):
        """
        This is the function that will take the user
        input (instance) and return the most likely
        crime for that trip
        """
        ## Use the classifier to predict the most likely crime type
        most_likely = self._classifier.predict(instance)

        return most_likely

In [14]:
#Create the classifier object
clf = EventClassifier(model='SVC_2015-29-04.pickle')

#Example data ready for the classifier
tests = np.asfarray([['-77.00589537', '38.90611936', '0.44015444', '0.505928854', '0.055153707', '0.944846293', '0.052631579', '0.295465729', '4.238', '7.7461', '8.412', '0', '1', '0'],
                         ['-77.03382744', '38.93071433', '0.499459459', '0.59562212', '0.178443114', '0.821556886', '0.048104956', '0.351140991', '4.028', '4.4688', '12.755', '0', '1', '0'],
                         ['-77.06697201', '38.90685672', '0.12755102', '0.452574526', '0.122916667', '0.877083333', '0', '1.583192993', '10.00001', '7.0388', '-999', '0', '1', '0'],
                         ['-77.0646675', '38.94619085', '0.638018937', '0.588370314', '0.11689008', '0.88310992', '0', '0.323109894', '3.451', '7.6532', '6.505', '0', '1', '0']])
#correct answers to the tests
answers =  ['THEFT/OTHER',
                'ASSAULT W/DANGEROUS WEAPON', 
                'THEFT F/AUTO',
                'THEFT/OTHER']
for i,test in enumerate(tests):
    print "Classifier predicts {0} and the actual answer is {1}".format(clf.classify(test)[0], answers[i])


Classifier predicts THEFT/OTHER and the actual answer is THEFT/OTHER
Classifier predicts THEFT/OTHER and the actual answer is ASSAULT W/DANGEROUS WEAPON
Classifier predicts THEFT/OTHER and the actual answer is THEFT F/AUTO
Classifier predicts THEFT/OTHER and the actual answer is THEFT/OTHER

DC-Crimebusters:

Georgetown University
School of Continuing Studies
Data Analytics Certificate 2015