Using the FBI:UCR Crime dataset (loaded from the URL below), build a regression model to predict property crimes.
The FBI defines property crime as including the offenses of burglary, larceny-theft, motor vehicle theft, and arson. As a first pass at predicting property crime, one can simply use these component offenses as features.
In [1]:
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
# Suppress harmless warnings raised below.
warnings.filterwarnings(action="ignore")
In [2]:
data_path = "https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/master/New_York_offenses/NEW_YORK-Offenses_Known_to_Law_Enforcement_by_City_2013%20-%2013tbl8ny.csv"
# skipfooter is only supported by the python engine; read_csv already returns a DataFrame.
data = pd.read_csv(data_path, skiprows=4, header=0, skipfooter=3,
                   thousands=',', engine='python')
In [3]:
data.head()
Out[3]:
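Note that this FBI export embeds newline characters in the column headers (e.g. `'Property\ncrime'`), which is why the awkward names appear throughout the code below. If preferred, the headers can be normalized up front; a minimal sketch on a toy frame (the notebook itself keeps the raw names):

```python
import pandas as pd

# Toy frame mimicking the FBI export, whose headers embed newlines.
df = pd.DataFrame({"Property\ncrime": [10], "Motor\nvehicle\ntheft": [2]})

# Replace embedded newlines with spaces for easier column access.
clean = df.rename(columns=lambda c: c.replace("\n", " "))
print(clean.columns.tolist())  # ['Property crime', 'Motor vehicle theft']
```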
In [4]:
# Instantiate and fit our model.
regr = linear_model.LinearRegression()
Y = data['Property\ncrime'].values.reshape(-1, 1)
X = data[["Larceny-\ntheft", "Motor\nvehicle\ntheft", "Burglary"]]
regr.fit(X, Y)
# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared:')
print(regr.score(X, Y))
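A near-perfect fit here is expected rather than impressive: property crime is defined as (roughly) the sum of these component offenses, so the regression is largely recovering an identity. A self-contained sketch of the same effect on synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data where the target is exactly the sum of its parts,
# mirroring how property crime is defined from its component offenses.
rng = np.random.default_rng(0)
parts = pd.DataFrame(rng.integers(0, 100, size=(50, 3)),
                     columns=["burglary", "larceny", "mvt"])
y = parts.sum(axis=1)

model = LinearRegression().fit(parts, y)
print(np.round(model.coef_, 6))  # each coefficient is ~1.0
print(model.score(parts, y))     # R-squared is ~1.0
```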
In [5]:
# sns.pairplot builds its own figure, so plt.figure(figsize=...) is ignored;
# size the grid through the height parameter instead.
sns.pairplot(data, height=2.5,
             vars=['Property\ncrime', 'Population', 'Violent\ncrime',
                   'Murder and\nnonnegligent\nmanslaughter',
                   'Rape\n(legacy\ndefinition)2',
                   'Robbery', 'Aggravated\nassault'])
plt.show()
A single extreme outlier makes the relationships difficult to see. Let's remove it.
In [6]:
# .copy() so that column assignments below modify a real frame, not a view.
dataCleaned = data[data["Property\ncrime"] < 20000].copy()
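Before hard-coding a cutoff like 20,000, it can help to confirm which rows it would drop. A toy sketch of the pattern (the city names and counts here are made up; the real table's `City` column is an assumption to verify against `data.head()`):

```python
import pandas as pd

# Toy frame standing in for the NY table (values are invented).
df = pd.DataFrame({"City": ["Albany", "Buffalo", "New York"],
                   "Property\ncrime": [4025, 12491, 135747]})

# Inspect the largest values before choosing a threshold.
print(df.nlargest(2, "Property\ncrime"))

# Filter, copying so later column assignments do not hit a view.
cleaned = df[df["Property\ncrime"] < 20000].copy()
print(cleaned["City"].tolist())  # ['Albany', 'Buffalo']
```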
In [7]:
# As above, size the pairplot grid via height rather than plt.figure.
sns.pairplot(dataCleaned, height=2.5,
             vars=['Property\ncrime', 'Population', 'Violent\ncrime',
                   'Murder and\nnonnegligent\nmanslaughter',
                   'Rape\n(legacy\ndefinition)2',
                   'Robbery', 'Aggravated\nassault'])
plt.show()
In [8]:
plt.scatter(dataCleaned["Property\ncrime"], dataCleaned["Murder and\nnonnegligent\nmanslaughter"])
plt.title('Raw values')
plt.xlabel("Property Crime")
plt.ylabel("Murder")
plt.show()
There are a large number of 0's for Murder. Let's try a binary indicator for whether any murder occurred instead of the raw count.
In [9]:
dataCleaned["Murder"] = dataCleaned['Murder and\nnonnegligent\nmanslaughter'].apply(lambda x: 0 if x == 0 else 1)
plt.scatter(dataCleaned["Property\ncrime"], dataCleaned["Murder"])
plt.title('Raw values')
plt.xlabel("Property Crime")
plt.ylabel("Murder")
plt.show()
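As an aside, the `.apply(lambda ...)` used above can be written as a vectorized comparison, which reads the same and avoids a Python-level loop. A sketch on a toy Series:

```python
import pandas as pd

murders = pd.Series([0, 3, 0, 1, 0])

# Equivalent to .apply(lambda x: 0 if x == 0 else 1):
binary = (murders > 0).astype(int)
print(binary.tolist())  # [0, 1, 0, 1, 0]
```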
In [10]:
dataCleaned.head()
Out[10]:
In [11]:
regr = linear_model.LinearRegression()
Y = dataCleaned['Property\ncrime'].values.reshape(-1, 1)
X = dataCleaned[['Population', 'Violent\ncrime',
'Murder and\nnonnegligent\nmanslaughter',
'Rape\n(legacy\ndefinition)2',
'Robbery', 'Aggravated\nassault']]
regr.fit(X, Y)
# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared:')
print(regr.score(X, Y))
In [12]:
regr = linear_model.LinearRegression()
Y = dataCleaned['Property\ncrime'].values.reshape(-1, 1)
X = dataCleaned[['Population', 'Violent\ncrime',
'Murder', 'Rape\n(legacy\ndefinition)2',
'Robbery', 'Aggravated\nassault']]
regr.fit(X, Y)
# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared:')
print(regr.score(X, Y))
There is a slight increase in performance when the binary indicator for murder is used.
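Comparing feature sets this way repeats a lot of fit boilerplate. A small helper makes the comparison one line per candidate set (a sketch on toy data; `fit_r2` is not from the original notebook):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def fit_r2(df, target, features):
    """Fit OLS on the given feature list and return in-sample R-squared."""
    X, y = df[features], df[target]
    return LinearRegression().fit(X, y).score(X, y)

# Toy frame standing in for dataCleaned.
rng = np.random.default_rng(3)
toy = pd.DataFrame(rng.normal(size=(60, 2)), columns=["a", "b"])
toy["y"] = 2 * toy["a"] - toy["b"] + rng.normal(scale=0.1, size=60)

print(fit_r2(toy, "y", ["a"]))       # one predictor
print(fit_r2(toy, "y", ["a", "b"]))  # both predictors fit better
```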
In [13]:
data["Murder"] = data['Murder and\nnonnegligent\nmanslaughter'].apply(lambda x: 0 if x == 0 else 1)
regr = linear_model.LinearRegression()
Y = data['Property\ncrime'].values.reshape(-1, 1)
X = data[['Population', 'Violent\ncrime',
'Murder', 'Rape\n(legacy\ndefinition)2',
'Robbery', 'Aggravated\nassault']]
regr.fit(X, Y)
# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared:')
print(regr.score(X, Y))
Hmmm... the outlier also heavily skews the R-squared and the coefficients when the full data are used. The model fit without the outlier is likely a better indicator of overall trends and accuracy.
In [14]:
regr = linear_model.LinearRegression()
Y = dataCleaned['Property\ncrime'].values.reshape(-1, 1)
X = dataCleaned[['Population', 'Violent\ncrime',
'Murder', 'Rape\n(legacy\ndefinition)2',
'Robbery', 'Aggravated\nassault']]
regr.fit(X, Y)
# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared:')
print(regr.score(X, Y))
In [15]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection.
from sklearn.model_selection import cross_val_score
regr = linear_model.LinearRegression()
y = data['Property\ncrime'].values.reshape(-1, 1)
X = data[['Population', 'Violent\ncrime',
'Murder', 'Rape\n(legacy\ndefinition)2',
'Robbery', 'Aggravated\nassault']]
scores = cross_val_score(regr, X, y, cv = 10)
print("R-squared within each fold:\n")
print(scores)
print("\nMean R-squared:\n")
print(scores.mean())
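One caveat: with an integer `cv`, scikit-learn uses an unshuffled `KFold` for regressors, so folds follow the table's row order (cities may be listed alphabetically). Passing an explicit shuffled splitter guards against that. A self-contained sketch with synthetic data standing in for the crime table:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data: a clean linear signal plus small noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Shuffle rows before splitting so folds are not tied to row order.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv)
print(scores.mean())
```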
In [24]:
data_path = "files/table_8_offenses_known_to_law_enforcement_california_by_city_2013.csv"
# Same read settings as the NY table; skipfooter again needs the python engine.
dataCA = pd.read_csv(data_path, skiprows=4, header=0, skipfooter=3,
                     thousands=',', engine='python')
In [25]:
dataCA.head()
Out[25]:
In [26]:
dataCA["Murder"] = dataCA['Murder and\nnonnegligent\nmanslaughter'].apply(lambda x: 0 if x == 0 else 1)
In [27]:
y = dataCA['Property\ncrime'].values.reshape(-1, 1)
X = dataCA[['Population', 'Violent\ncrime',
'Murder', 'Rape\n(legacy\ndefinition)2',
'Robbery', 'Aggravated\nassault']]
scores = cross_val_score(regr, X, y, cv = 10)
print("R-squared within each fold:\n")
print(scores)
print("\nMean R-squared:\n")
print(scores.mean())
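The cell above cross-validates within California alone. The CA table could also serve as a true holdout: fit once on New York, then score on California to see how well the model transfers. A self-contained sketch of that pattern, with synthetic stand-ins for the two states drawn from the same linear relationship:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic "NY" training data and "CA" holdout sharing one relationship.
rng = np.random.default_rng(7)
coef = np.array([1.5, 0.8])
X_ny = rng.normal(size=(80, 2))
y_ny = X_ny @ coef + rng.normal(scale=0.1, size=80)
X_ca = rng.normal(size=(40, 2))
y_ca = X_ca @ coef + rng.normal(scale=0.1, size=40)

# Fit on one "state", evaluate on the other.
model = LinearRegression().fit(X_ny, y_ny)
print(model.score(X_ca, y_ca))  # out-of-sample R-squared
```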