Using the FBI:UCR crime dataset (Offenses Known to Law Enforcement, by city, 2013; loaded below from Thinkful's GitHub mirror), build a regression model to predict property crimes.
The FBI defines property crime as comprising burglary, larceny-theft, motor vehicle theft, and arson. As a first pass, one can predict property crime directly from those component offenses, though as we verify below this borders on tautology.
In [1]:
    
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn import linear_model
# Suppress harmless warnings (e.g., pandas chained-assignment notices).
warnings.filterwarnings(action="ignore")
    
In [2]:
    
data_path = "https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/master/New_York_offenses/NEW_YORK-Offenses_Known_to_Law_Enforcement_by_City_2013%20-%2013tbl8ny.csv"
data = pd.read_csv(data_path, delimiter = ',', skiprows=4, header=0, skipfooter=3, thousands=',')
data = pd.DataFrame(data)
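A quick check that skiprows/skipfooter landed on the right rows and that the comma-grouped counts parsed as numbers (a minimal sketch; column names assume the Table 8 layout used throughout):

    print(data[['Population', 'Property\ncrime']].dtypes)  # both should be integer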
    
In [3]:
    
data.head()
    
    Out[3]: [table: first five rows of the New York offense data]
In [4]:
    
# Instantiate and fit our model.
regr = linear_model.LinearRegression()
Y = data['Property\ncrime'].values.reshape(-1, 1)
# Arson is excluded: Table 8 reports it separately, and the published
# property-crime total is burglary + larceny-theft + motor vehicle theft.
X = data[["Larceny-\ntheft", "Motor\nvehicle\ntheft", "Burglary"]]
regr.fit(X, Y)
# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared:')
print(regr.score(X, Y))
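The near-perfect fit is expected rather than impressive: in these UCR tables the property-crime total is the sum of exactly these three components (arson is tabulated separately). A quick check of that identity (a sketch, assuming the counts parsed cleanly):

    components = data[['Burglary', 'Larceny-\ntheft', 'Motor\nvehicle\ntheft']].sum(axis=1)
    # Fraction of cities where the published total equals the component sum.
    print((components == data['Property\ncrime']).mean())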
    
    
In [5]:
    
# pairplot creates its own figure, so plt.figure(figsize=...) has no
# effect; set the size with the height argument instead.
sns.pairplot(data, vars=['Property\ncrime', 'Population', 'Violent\ncrime',
       'Murder and\nnonnegligent\nmanslaughter',
       'Rape\n(legacy\ndefinition)2',
       'Robbery', 'Aggravated\nassault'], height=1.5)
plt.show()

[pairplot output: one extreme point compresses every panel]
That single outlier is making the relationships difficult to view. Let's remove the outlier.
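To confirm which observation is responsible (a sketch, assuming the city-name column is labeled 'City'; in the New York table the largest row should be New York City):

    # The three largest property-crime counts; the top one dwarfs the rest.
    print(data.nlargest(3, 'Property\ncrime')[['City', 'Population', 'Property\ncrime']])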
In [6]:
    
# Drop the extreme outlier; .copy() avoids pandas' SettingWithCopyWarning
# when we add columns to this subset later.
dataCleaned = data[data["Property\ncrime"] < 20000].copy()
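A quick confirmation that the filter dropped only that one extreme row (a sketch):

    print(data.shape[0] - dataCleaned.shape[0], 'row(s) removed')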
    
In [7]:
    
# pairplot creates its own figure, so set the size with height rather
# than plt.figure(figsize=...).
sns.pairplot(dataCleaned, vars=['Property\ncrime', 'Population', 'Violent\ncrime',
       'Murder and\nnonnegligent\nmanslaughter',
       'Rape\n(legacy\ndefinition)2',
       'Robbery', 'Aggravated\nassault'], height=1.5)
plt.show()

[pairplot output after removing the outlier]
In [8]:
    
plt.scatter(dataCleaned["Property\ncrime"], dataCleaned["Murder and\nnonnegligent\nmanslaughter"])
plt.title('Raw values')
plt.xlabel("Property Crime")
plt.ylabel("Murder")
plt.show()
    
    
There are a large number of zeros for murder. Let's recode it as a binary indicator: murder occurred vs. no murder occurred.
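To quantify how sparse the murder counts are first (a sketch):

    # Fraction of cities reporting zero murders.
    print((dataCleaned['Murder and\nnonnegligent\nmanslaughter'] == 0).mean())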
In [9]:
    
dataCleaned["Murder"] = dataCleaned['Murder and\nnonnegligent\nmanslaughter'].apply(lambda x: 0 if x == 0 else 1)
plt.scatter(dataCleaned["Property\ncrime"], dataCleaned["Murder"])
plt.title('Raw values')
plt.xlabel("Property Crime")
plt.ylabel("Murder")
plt.show()
    
    
In [10]:
    
dataCleaned.head()
    
    Out[10]: [table: first five rows, now including the binary Murder column]
In [11]:
    
regr = linear_model.LinearRegression()
Y = dataCleaned['Property\ncrime'].values.reshape(-1, 1)
X = dataCleaned[['Population', 'Violent\ncrime',
       'Murder and\nnonnegligent\nmanslaughter',
       'Rape\n(legacy\ndefinition)2',
       'Robbery', 'Aggravated\nassault']]
regr.fit(X, Y)
# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared:')
print(regr.score(X, Y))
    
    
In [12]:
    
regr = linear_model.LinearRegression()
Y = dataCleaned['Property\ncrime'].values.reshape(-1, 1)
X = dataCleaned[['Population', 'Violent\ncrime',
       'Murder', 'Rape\n(legacy\ndefinition)2',
       'Robbery', 'Aggravated\nassault']]
regr.fit(X, Y)
# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared:')
print(regr.score(X, Y))
    
    
There is a slight increase in performance when the binary murder indicator is used.
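A training-set difference this small may not survive out-of-sample evaluation, so a cross-validated comparison is more convincing. A minimal sketch using scikit-learn's model_selection API (5 folds is an arbitrary choice):

    from sklearn.model_selection import cross_val_score

    for murder_col in ['Murder and\nnonnegligent\nmanslaughter', 'Murder']:
        X = dataCleaned[['Population', 'Violent\ncrime', murder_col,
                         'Rape\n(legacy\ndefinition)2',
                         'Robbery', 'Aggravated\nassault']]
        y = dataCleaned['Property\ncrime']
        scores = cross_val_score(linear_model.LinearRegression(), X, y, cv=5)
        print(murder_col.replace('\n', ' '), ':', scores.mean())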
In [13]:
    
data["Murder"] = data['Murder and\nnonnegligent\nmanslaughter'].apply(lambda x: 0 if x == 0 else 1)
regr = linear_model.LinearRegression()
Y = data['Property\ncrime'].values.reshape(-1, 1)
X = data[['Population', 'Violent\ncrime',
       'Murder', 'Rape\n(legacy\ndefinition)2',
       'Robbery', 'Aggravated\nassault']]
regr.fit(X, Y)
# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared:')
print(regr.score(X, Y))
    
    
Hmmm... it seems that the outlier has also heavily skewed the R-squared and the coefficients here. The model fit without the outlier is likely a better indicator of overall trends and accuracy.
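An alternative to deleting the row is a robust estimator that down-weights extreme observations, such as scikit-learn's HuberRegressor; a minimal sketch with near-default settings, fit on the full data including the outlier:

    from sklearn.linear_model import HuberRegressor

    X = data[['Population', 'Violent\ncrime', 'Murder',
              'Rape\n(legacy\ndefinition)2',
              'Robbery', 'Aggravated\nassault']]
    y = data['Property\ncrime']  # HuberRegressor expects a 1-D target
    huber = HuberRegressor(max_iter=1000).fit(X, y)  # extra iterations; features are unscaled
    print(huber.score(X, y))  # R-squared, less dominated by the outlier

Below we simply keep the cleaned-data model instead.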
In [14]:
    
# Refit the preferred specification: cleaned data with the binary murder indicator.
regr = linear_model.LinearRegression()
Y = dataCleaned['Property\ncrime'].values.reshape(-1, 1)
X = dataCleaned[['Population', 'Violent\ncrime',
       'Murder', 'Rape\n(legacy\ndefinition)2',
       'Robbery', 'Aggravated\nassault']]
regr.fit(X, Y)
# Inspect the results.
print('\nCoefficients: \n', regr.coef_)
print('\nIntercept: \n', regr.intercept_)
print('\nR-squared:')
print(regr.score(X, Y))
    
    
In [15]:
    
# sklearn.cross_validation has been removed; cross_val_score now lives
# in sklearn.model_selection. The default scorer for a regressor is
# R-squared, not percent accuracy.
from sklearn.model_selection import cross_val_score
regr = linear_model.LinearRegression()
y = data['Property\ncrime'].values.reshape(-1, 1)
X = data[['Population', 'Violent\ncrime',
       'Murder', 'Rape\n(legacy\ndefinition)2',
       'Robbery', 'Aggravated\nassault']]
scores = cross_val_score(regr, X, y, cv=10)
print("R-squared within each fold:\n")
print(scores)
print("\nMean R-squared:\n")
print(scores.mean())
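R-squared can swing across folds when one fold happens to contain extreme cities; an error measured in crimes per city is easier to interpret. A minimal sketch using scikit-learn's built-in scorer name (sign-flipped because scorers are maximized):

    mae = -cross_val_score(regr, X, y, cv=10, scoring="neg_mean_absolute_error")
    print("Mean absolute error per fold:\n", mae)
    print("\nMean MAE:", mae.mean())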
    
    
In [24]:
    
data_path = "files/table_8_offenses_known_to_law_enforcement_california_by_city_2013.csv"
dataCA = pd.read_csv(data_path, delimiter = ',', skiprows=4, header=0, skipfooter=3, thousands=',')
dataCA = pd.DataFrame(dataCA)
    
In [25]:
    
dataCA.head()
    
    Out[25]: [table: first five rows of the California data]
In [26]:
    
dataCA["Murder"] = dataCA['Murder and\nnonnegligent\nmanslaughter'].apply(lambda x: 0 if x == 0 else 1)
    
In [27]:
    
y = dataCA['Property\ncrime'].values.reshape(-1, 1)
X = dataCA[['Population', 'Violent\ncrime',
       'Murder', 'Rape\n(legacy\ndefinition)2',
       'Robbery', 'Aggravated\nassault']]
scores = cross_val_score(regr, X, y, cv=10)
print("R-squared within each fold:\n")
print(scores)
print("\nMean R-squared:\n")
print(scores.mean())
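One caveat on this last step: cross_val_score clones and refits the estimator inside every fold, so the numbers above measure how well the same model specification fits California, not how the New York coefficients transfer. A minimal sketch of the direct transfer test, refitting on the cleaned New York data and scoring on the California table:

    regr = linear_model.LinearRegression()
    regr.fit(dataCleaned[['Population', 'Violent\ncrime', 'Murder',
                          'Rape\n(legacy\ndefinition)2',
                          'Robbery', 'Aggravated\nassault']],
             dataCleaned['Property\ncrime'])
    print(regr.score(X, y))  # R-squared of the NY model applied to California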