Multivariable Regression Model of FBI Property Crime Statistics

Using the FBI: UCR Crime dataset (Offenses Known to Law Enforcement, New York, 2013; loaded from the URL in the cell below), build a regression model to predict property crimes.

The FBI defines property crime as comprising the offenses of burglary, larceny-theft, motor vehicle theft, and arson. As a sanity check, one can "predict" property crime directly from these component features.


In [1]:
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn import linear_model


# Suppress harmless warnings (e.g. pandas parser warnings).
warnings.filterwarnings(action="ignore")

In [2]:
data_path = "https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/master/New_York_offenses/NEW_YORK-Offenses_Known_to_Law_Enforcement_by_City_2013%20-%2013tbl8ny.csv"
# skipfooter requires the python parser engine; thousands=',' parses comma-grouped numbers.
data = pd.read_csv(data_path, skiprows=4, header=0, skipfooter=3,
                   thousands=',', engine='python')

In [3]:
data.head()


Out[3]:
City | Population | Violent crime | Murder and nonnegligent manslaughter | Rape (revised definition)1 | Rape (legacy definition)2 | Robbery | Aggravated assault | Property crime | Burglary | Larceny-theft | Motor vehicle theft | Arson3
0  Adams Village | 1861 | 0 | 0 | NaN | 0 | 0 | 0 | 12 | 2 | 10 | 0 | 0.0
1  Addison Town and Village | 2577 | 3 | 0 | NaN | 0 | 0 | 3 | 24 | 3 | 20 | 1 | 0.0
2  Akron Village | 2846 | 3 | 0 | NaN | 0 | 0 | 3 | 16 | 1 | 15 | 0 | 0.0
3  Albany | 97956 | 791 | 8 | NaN | 30 | 227 | 526 | 4090 | 705 | 3243 | 142 | NaN
4  Albion Village | 6388 | 23 | 0 | NaN | 3 | 4 | 16 | 223 | 53 | 165 | 5 | NaN
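
Note that the multi-line CSV headers leave literal newline characters embedded in the column names, which is why the cells below index columns with keys like 'Property\ncrime'. A quick way to see this:

# Column names retain '\n' from the multi-line CSV header.
print(data.columns.tolist())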

In [4]:
# Instantiate and fit our model.
regr = linear_model.LinearRegression()
Y = data['Property\ncrime'].values.reshape(-1, 1)
X = data[["Larceny-\ntheft", "Motor\nvehicle\ntheft", "Burglary"]]
regr.fit(X, Y)

# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared:')
print(regr.score(X, Y))


Coefficients: 
 [[ 1.  1.  1.]]

Intercept: 
 [ -2.27373675e-13]

R-squared:
1.0

A perfect fit, as expected: in this table, property crime is exactly the sum of burglary, larceny-theft, and motor vehicle theft (arson is tabulated separately), so the regression recovers coefficients of 1 and a zero intercept.
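
We can verify that identity directly (a quick check):

# Verify: property crime equals burglary + larceny-theft + motor vehicle theft.
components = data[['Burglary', 'Larceny-\ntheft', 'Motor\nvehicle\ntheft']].sum(axis=1)
print((components == data['Property\ncrime']).all())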

However, predicting property crime without using its own component features is a more interesting question.

Building a Model to Predict Property Crimes (without using the Property Crime features)

To start, let's look at how each of the non-property crime features relates to property crime.


In [5]:
# sns.pairplot creates its own figure, so no plt.figure() call is needed.
sns.pairplot(data, vars=['Property\ncrime', 'Population', 'Violent\ncrime',
       'Murder and\nnonnegligent\nmanslaughter',
       'Rape\n(legacy\ndefinition)2',
       'Robbery', 'Aggravated\nassault'])
plt.show()


[Pairplot: property crime vs. population and each violent crime feature; one extreme point dominates every panel]

A single extreme outlier (New York City, whose counts dwarf every other city's) compresses the rest of the data and makes the relationships difficult to view. Let's remove it.
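
To identify the culprit first (a quick sketch; the 20,000 cutoff used below is an eyeballed threshold):

# The largest property crime counts; the top row is the outlier.
print(data.nlargest(3, 'Property\ncrime')[['City', 'Property\ncrime']])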


In [6]:
# .copy() avoids pandas' SettingWithCopyWarning when we add columns later.
dataCleaned = data[data["Property\ncrime"] < 20000].copy()

In [7]:
sns.pairplot(dataCleaned, vars=['Property\ncrime', 'Population', 'Violent\ncrime',
       'Murder and\nnonnegligent\nmanslaughter',
       'Rape\n(legacy\ndefinition)2',
       'Robbery', 'Aggravated\nassault'])
plt.show()


[Pairplot of the cleaned data: pairwise relationships among property crime, population, and the violent crime features are now visible]

In [8]:
plt.scatter(dataCleaned["Property\ncrime"], dataCleaned["Murder and\nnonnegligent\nmanslaughter"])
plt.title('Property crime vs. murder count')
plt.xlabel("Property Crime")
plt.ylabel("Murder")
plt.show()


Murder counts are overwhelmingly zero. Let's use a binary indicator instead: murder occurred vs. no murder occurred.
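
To quantify how sparse the murder counts are (a quick check):

# Fraction of cities reporting zero murders.
print((dataCleaned['Murder and\nnonnegligent\nmanslaughter'] == 0).mean())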


In [9]:
dataCleaned["Murder"] = dataCleaned['Murder and\nnonnegligent\nmanslaughter'].apply(lambda x: 0 if x == 0 else 1)

plt.scatter(dataCleaned["Property\ncrime"], dataCleaned["Murder"])
plt.title('Property crime vs. binary murder indicator')
plt.xlabel("Property Crime")
plt.ylabel("Murder")
plt.show()



In [10]:
dataCleaned.head()


Out[10]:
City | Population | Violent crime | Murder and nonnegligent manslaughter | Rape (revised definition)1 | Rape (legacy definition)2 | Robbery | Aggravated assault | Property crime | Burglary | Larceny-theft | Motor vehicle theft | Arson3 | Murder
0  Adams Village | 1861 | 0 | 0 | NaN | 0 | 0 | 0 | 12 | 2 | 10 | 0 | 0.0 | 0
1  Addison Town and Village | 2577 | 3 | 0 | NaN | 0 | 0 | 3 | 24 | 3 | 20 | 1 | 0.0 | 0
2  Akron Village | 2846 | 3 | 0 | NaN | 0 | 0 | 3 | 16 | 1 | 15 | 0 | 0.0 | 0
3  Albany | 97956 | 791 | 8 | NaN | 30 | 227 | 526 | 4090 | 705 | 3243 | 142 | NaN | 1
4  Albion Village | 6388 | 23 | 0 | NaN | 3 | 4 | 16 | 223 | 53 | 165 | 5 | NaN | 0

In [11]:
regr = linear_model.LinearRegression()
Y = dataCleaned['Property\ncrime'].values.reshape(-1, 1)
X = dataCleaned[['Population', 'Violent\ncrime',
       'Murder and\nnonnegligent\nmanslaughter',
       'Rape\n(legacy\ndefinition)2',
       'Robbery', 'Aggravated\nassault']]
regr.fit(X, Y)

# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared:')
print(regr.score(X, Y))


Coefficients: 
 [[  1.35211120e-02   2.03018615e+01   3.63790825e+01   2.47232760e+01
   -2.12521814e+01  -1.95483156e+01]]

Intercept: 
 [ 9.48025988]

R-squared:
0.950982619239

Not bad: an R-squared of about 0.95 without using any of the property crime components.

What about performance when the binary Murder feature is used?


In [12]:
regr = linear_model.LinearRegression()
Y = dataCleaned['Property\ncrime'].values.reshape(-1, 1)
X = dataCleaned[['Population', 'Violent\ncrime',
       'Murder', 'Rape\n(legacy\ndefinition)2',
       'Robbery', 'Aggravated\nassault']]
regr.fit(X, Y)

# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared:')
print(regr.score(X, Y))


Coefficients: 
 [[  1.39419257e-02   6.37574382e+01  -5.85043045e+01  -1.97448163e+01
   -6.52374519e+01  -6.27376441e+01]]

Intercept: 
 [ 10.74635836]

R-squared:
0.951222943644

There is a marginal improvement (R-squared 0.9510 vs. 0.9512) when the binary murder indicator is used.

Leave no man behind!

Let's reintroduce the outlier into the data and refit the model.


In [13]:
data["Murder"] = data['Murder and\nnonnegligent\nmanslaughter'].apply(lambda x: 0 if x == 0 else 1)


regr = linear_model.LinearRegression()
Y = data['Property\ncrime'].values.reshape(-1, 1)
X = data[['Population', 'Violent\ncrime',
       'Murder', 'Rape\n(legacy\ndefinition)2',
       'Robbery', 'Aggravated\nassault']]
regr.fit(X, Y)

# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared:')
print(regr.score(X, Y))


Coefficients: 
 [[  1.19440110e-02   7.04373894e+01   3.10866677e+01  -9.17864985e+00
   -7.07330299e+01  -7.18343944e+01]]

Intercept: 
 [ 31.38463798]

R-squared:
0.999003849335

Hmmm... the outlier dominates the fit, inflating the R-squared and pulling the coefficients around. The model that did not incorporate the outlier is likely a better indicator of overall trends.
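
One way to check this (a sketch): score the outlier-inclusive model on just the non-outlier cities. If the outlier really distorts the fit, this score should fall well below the 0.951 the outlier-free model achieves on the same cities.

# regr is currently fit on the full data, outlier included.
# Evaluate it on the non-outlier cities only.
X_sub = dataCleaned[['Population', 'Violent\ncrime',
       'Murder', 'Rape\n(legacy\ndefinition)2',
       'Robbery', 'Aggravated\nassault']]
y_sub = dataCleaned['Property\ncrime'].values.reshape(-1, 1)
print(regr.score(X_sub, y_sub))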

Best Model:


In [14]:
regr = linear_model.LinearRegression()
Y = dataCleaned['Property\ncrime'].values.reshape(-1, 1)
X = dataCleaned[['Population', 'Violent\ncrime',
       'Murder', 'Rape\n(legacy\ndefinition)2',
       'Robbery', 'Aggravated\nassault']]
regr.fit(X, Y)

# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared:')
print(regr.score(X, Y))


Coefficients: 
 [[  1.39419257e-02   6.37574382e+01  -5.85043045e+01  -1.97448163e+01
   -6.52374519e+01  -6.27376441e+01]]

Intercept: 
 [ 10.74635836]

R-squared:
0.951222943644

Validating Regression Models for Prediction

Now let's use cross-validation to get a more honest estimate of out-of-sample performance. (This cell uses the full dataset, outlier included.)


In [15]:
# sklearn.cross_validation was removed in newer versions of scikit-learn.
from sklearn.model_selection import cross_val_score

regr = linear_model.LinearRegression()

y = data['Property\ncrime'].values.reshape(-1, 1)
X = data[['Population', 'Violent\ncrime',
       'Murder', 'Rape\n(legacy\ndefinition)2',
       'Robbery', 'Aggravated\nassault']]

scores = cross_val_score(regr, X, y, cv=10)

# cross_val_score uses the regressor's default scorer, which is R-squared.
print("R-squared within each fold:\n")
print(scores)
print("\nMean R-squared:\n")
print(scores.mean())


R-squared within each fold:

[ 0.85113397  0.97672918  0.39041084  0.76138868  0.7549484   0.72489609
  0.96157164  0.9595789   0.97762802  0.85256133]

Mean R-squared:

0.821084705201

Test the Model with Data From Another State

Now let's try the same model specification on the 2013 offense dataset for California. Note that cross_val_score refits the model within each fold, so this cross-validates the specification on California data rather than scoring the New York-fitted model directly. Will the predictive power be similar?


In [24]:
data_path = "files/table_8_offenses_known_to_law_enforcement_california_by_city_2013.csv"
dataCA = pd.read_csv(data_path, skiprows=4, header=0, skipfooter=3,
                     thousands=',', engine='python')

In [25]:
dataCA.head()


Out[25]:
City | Population | Violent crime | Murder and nonnegligent manslaughter | Rape (revised definition)1 | Rape (legacy definition)2 | Robbery | Aggravated assault | Property crime | Burglary | Larceny-theft | Motor vehicle theft | Arson | Unnamed: 13
0  Adelanto | 31165 | 198 | 2 | NaN | 15 | 52 | 129 | 886 | 381 | 372 | 133 | 17 | NaN
1  Agoura Hills | 20762 | 19 | 0 | NaN | 2 | 10 | 7 | 306 | 109 | 185 | 12 | 7 | NaN
2  Alameda | 76206 | 158 | 0 | NaN | 10 | 85 | 63 | 1902 | 287 | 1285 | 330 | 17 | NaN
3  Albany | 19104 | 29 | 0 | NaN | 1 | 24 | 4 | 557 | 94 | 388 | 75 | 7 | NaN
4  Alhambra | 84710 | 163 | 1 | NaN | 9 | 81 | 72 | 1774 | 344 | 1196 | 234 | 7 | NaN

In [26]:
dataCA["Murder"] = dataCA['Murder and\nnonnegligent\nmanslaughter'].apply(lambda x: 0 if x == 0 else 1)

In [27]:
y = dataCA['Property\ncrime'].values.reshape(-1, 1)
X = dataCA[['Population', 'Violent\ncrime',
       'Murder', 'Rape\n(legacy\ndefinition)2',
       'Robbery', 'Aggravated\nassault']]

scores = cross_val_score(regr, X, y, cv=10)

print("R-squared within each fold:\n")
print(scores)
print("\nMean R-squared:\n")
print(scores.mean())


R-squared within each fold:

[ 0.88256872  0.83813945  0.63594188  0.82741147  0.94607165  0.89356858
  0.59922784  0.90741413  0.86736273  0.76805549]

Mean R-squared:

0.816576193827

Results

Cross-validated on the California dataset, the model specification achieved a mean R-squared (about 0.82) very close to the New York result. In both states there is a fair amount of fold-to-fold variability, so the mean score is a more reliable summary than any single fold.
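
A stricter transfer test would score the New York-fitted model directly on the California cities instead of refitting within each fold. A minimal sketch, assuming both dataframes are still in memory:

# Fit on New York (outlier removed), then score on California.
features = ['Population', 'Violent\ncrime', 'Murder',
            'Rape\n(legacy\ndefinition)2', 'Robbery', 'Aggravated\nassault']

regr = linear_model.LinearRegression()
regr.fit(dataCleaned[features], dataCleaned['Property\ncrime'])

# R-squared of the New York model's predictions on California cities.
print(regr.score(dataCA[features], dataCA['Property\ncrime']))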