In [96]:
import urllib2
import numpy as np

# Download data from a URL using urllib2
def getTsv(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read()

# Transform the downloaded data from a tab-separated
# file and cast the values to floats.
def transform(data):
    return np.array([[float(x) for x in line.split('\t')] for line in data.split('\n')])

data = [
    getTsv('https://dl.dropboxusercontent.com/u/153071/teaching/02806_2016/data1.tsv'),
    getTsv('https://dl.dropboxusercontent.com/u/153071/teaching/02806_2016/data2.tsv'),
    getTsv('https://dl.dropboxusercontent.com/u/153071/teaching/02806_2016/data3.tsv'),
    getTsv('https://dl.dropboxusercontent.com/u/153071/teaching/02806_2016/data4.tsv')
]
data_t = [transform(d) for d in data]
data_t now represents four arrays, each containing an (x, y) pair for every row of the TSV files.
Next, we calculate all the mean values. We use numpy because it makes it easy to compute the mean of the values in an array: we call mean on each of the four arrays, averaging along axis 0 so that we get one mean for the x-values and one for the y-values.
See the numpy documentation for more details.
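As a quick illustration of the axis argument on a toy array (not the assignment data):
import numpy as np
a = np.array([[1.0, 10.0], [3.0, 20.0]])
print a.mean(axis=0)  # column means: [ 2. 15.]
print a.mean(axis=1)  # row means: [ 5.5 11.5]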
In [97]:
import numpy as np
# Calculate the mean of each dataset along axis 0, i.e. averaging
# over the rows to get one mean per column (x and y)
means = [dt.mean(axis=0) for dt in data_t]
In [98]:
for m in means:
    print "Mean: {:10.2f},{:10.2f}".format(*m)
As with the means, we can calculate the variance of the x- and y-values for each dataset in the same way.
In [99]:
variance = [dt.var(axis=0) for dt in data_t]
for v in variance:
    print "Variance: {:10.3f},{:10.3f}".format(*v)
We use numpy to calculate the Pearson correlation between the $x$- and $y$-values for all four datasets (also to three decimal places). As before, we use a list comprehension over the transformed data data_t, calling a small function that computes the Pearson correlation for each of the four datasets.
In [100]:
# Calculates the Pearson correlation for each dataset
def calc_pearson(data):
    data_xs = data[:,0]  # all xs across sublists/tuples
    data_ys = data[:,1]  # all ys across sublists/tuples
    return np.corrcoef(data_xs, data_ys)[0][1]

pears = [calc_pearson(x) for x in data_t]
for pear in pears:
    print "Pearson: {:10.3f}".format(pear)
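For reference, the off-diagonal entry of np.corrcoef is the Pearson correlation coefficient
$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$
which is why calc_pearson returns element [0][1] of the 2x2 correlation matrix.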
Before we start plotting, let's first generate the straight lines needed for each dataset.
Because we use a single list comprehension for all datasets, we return a dictionary from lin_reg() for easy lookup later on.
In [101]:
from scipy import stats

# Calculates the linear regression for each dataset
# and returns a dictionary containing lin reg values
def lin_reg(data):
    data_xs = data[:,0]
    data_ys = data[:,1]
    slope, intercept, r_value, p_value, std_err = stats.linregress(data_xs, data_ys)
    return {'slope': slope, 'intercept': intercept, 'r_value': r_value, 'p_value': p_value, 'std_err': std_err}

lin_regs = [lin_reg(x) for x in data_t]
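Each regression gives a straight line $\hat{y} = \text{intercept} + \text{slope} \cdot x$, which we will draw on top of each scatter plot below.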
We use matplotlib.pyplot with a two-by-two subplot to put all of the plots nicely in a grid, use the same $x$ and $y$ range for all four plots, and include the linear fit in all four plots.
Now that the regressions have been calculated, a plot for each dataset can be made. Because the datasets are kept in a single list, we can use a for loop to generate each subplot.
In [102]:
import matplotlib.pyplot as plt
%matplotlib inline

for i in range(0, len(data_t)):
    d = data_t[i]  # Get the individual dataset
    xs = d[:,0]
    ys = d[:,1]
    predict_y = lin_regs[i]["intercept"] + lin_regs[i]["slope"] * xs  # Straight-line function
    plt.subplot(221 + i)  # 2 rows, 2 columns, plot number 1 + i (i starts at zero)
    plt.scatter(xs, ys)
    plt.plot(xs, predict_y, 'r')
    plt.xlim(xmin=0, xmax=20)  # Standardize the x axis range
    plt.ylim(ymin=2, ymax=14)  # Standardize the y axis range
plt.show()
Apart from learning how to extract and transform data and how to use numpy and matplotlib, we think it is fair to say that, up until the plotting, the datasets look very similar. Indeed they do on the mean, variance, and Pearson correlation that we calculated.
However, looking at the plots it is easy to see how different each dataset really is. Visualization is a really important tool for understanding your data better.
We begin by loading the CSV file containing all the crime records and parsing each row into a dictionary for easy access.
Note that we ignore any crime that does not have a district.
In [103]:
from csv import reader

# Parse a CSV row into a dictionary keyed by column name
def parseCrimeCsv(row):
    return {
        'IncidntNum': row[0],
        'Category': row[1],
        'Descript': row[2],
        'DayOfWeek': row[3],
        'Date': row[4],
        'Time': row[5],
        'PdDistrict': row[6],
        'Resolution': row[7],
        'Address': row[8],
        'X': row[9],
        'Y': row[10],
        'Location': row[11],
        'PdId': row[12]
    }

with open('../SFPD_Incidents_-_from_1_January_2003.csv', 'r') as infile:
    data = [parseCrimeCsv(line) for line in list(reader(infile))[1:] if not line[6] == '']
Now that the file has been loaded into memory, we can extract all the districts. We do this by taking the PdDistrict value of every record and inserting it into a set.
In [104]:
PdDistrictSet = set([i['PdDistrict'] for i in data])
PdDistrictSet
Out[104]:
We begin by finding the district with the most crimes overall. We create a Counter object and feed it the district of every record in the dataset; this object counts the occurrences of the given values. Lastly we call most_common(1) on the Counter to extract the district with the most crimes overall.
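As a tiny toy example of what most_common(1) returns (a list with the single most frequent element and its count):
from collections import Counter
c = Counter(['a', 'b', 'a', 'a', 'c'])
print c.most_common(1)  # [('a', 3)]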
In [105]:
from collections import Counter
DistrictCrimeCounter = Counter([d['PdDistrict'] for d in data])
DistrictCrimeCounter.most_common(1)
Out[105]:
By looking at the output above, we can see that Southern is the district with the most crimes overall, with a total count of 335978 crimes from Jan 2003 to Jan 2016.
Which district has the most focus crimes? To answer this we first define the focus crimes; these are taken from the Week 3 exercises:
In [106]:
focuscrimes = set([
'WEAPON LAWS',
'PROSTITUTION',
'DRIVING UNDER THE INFLUENCE',
'ROBBERY',
'BURGLARY',
'ASSAULT',
'DRUNKENNESS',
'DRUG/NARCOTIC',
'TRESPASS',
'LARCENY/THEFT',
'VANDALISM',
'VEHICLE THEFT',
'STOLEN PROPERTY',
'DISORDERLY CONDUCT'
])
Now that we have defined the focus crimes, we can filter our dataset so that we get a new dataset that only contains the records for these crimes. We do this using the built-in filter function.
Then we perform the same operation as above to get the district with the most focus crimes.
In [107]:
data_filtered_for_focus_crimes = filter(lambda c: c['Category'] in list(focuscrimes), data)
most_common_focus_crime_by_dist = Counter([d['PdDistrict'] for d in data_filtered_for_focus_crimes])
most_common_focus_crime_by_dist.most_common(1)
Out[107]:
We see that it is again the Southern district that has the most focus crimes, just as it had for all crimes. However, this time we are down to 191521 crimes committed between Jan 2003 and Jan 2016.
First we create a Counter for the categories. Then we iterate over the output of most_common() to build a nicely structured dataset with the probability of each crime category.
In [108]:
from __future__ import division

CategoryCrimeCounter = Counter([d['Category'] for d in data])
p_crime = {}
for k, v in CategoryCrimeCounter.most_common():
    p_crime[k] = v / len(data)
p_crime
Out[108]:
Now we do the same, but per district. The result is a dictionary with the district as key; each value is another dictionary with the crime category as key and its probability within that district as value.
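In other words, for each district we compute
$$P(\text{category} \mid \text{district}) = \frac{\text{number of crimes of that category in the district}}{\text{total number of crimes in the district}}$$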
In [109]:
p_crime_district = {}
for district in list(PdDistrictSet):
    counter = Counter([d['Category'] for d in data if d['PdDistrict'] == district])
    p_crime_district[district] = {}
    for k, v in counter.most_common():
        p_crime_district[district][k] = v / len(list(counter.elements()))
p_crime_district
Out[109]:
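Below we compare the districts by computing, for each focus crime, the ratio $P(\text{crime} \mid \text{district}) \,/\, P(\text{crime})$, i.e. how over- or under-represented the crime is in the district compared to the city as a whole, and plot the ratios as one bar-chart subplot per district using pandas.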
In [110]:
import pandas as pd
%matplotlib inline

# For each focus crime, compute the ratio between the district's
# probability and the overall probability of that crime
def calcRatioForDistrict(districtCounter, overAllCounter, district):
    ys = []
    xs = []
    for fc in list(focuscrimes):
        ys.append(districtCounter[fc] / overAllCounter[fc])
        xs.append(fc)
    return pd.Series(ys, index=xs)

series = {}
for district in list(PdDistrictSet):
    series[district] = calcRatioForDistrict(p_crime_district[district], p_crime, district)
df = pd.DataFrame(series)
df.plot(kind="bar", subplots=True, figsize=(14,14), layout=(7,2), legend=False, sharey=True)
Out[110]:
Tenderloin
The top crime in Tenderloin is by far DRUG/NARCOTIC, as illustrated above. Looking through articles and Wikipedia, this does seem to be representative of current and past activity in the area.
From Wikipedia: "The first block of Turk Street, between Taylor and Mason, had one of the highest rates of violence and drug activity in San Francisco, according to a survey conducted by the Tenderloin Housing Clinic."
Mission
PROSTITUTION is the top crime in Mission by some margin. Wikipedia gives no indication about prostitution, or crime in general, in the district.
A quick Google search does, however, turn up articles mentioning that prostitution has been on the uptick recently.
Richmond
Richmond has the highest ratio of crimes related to DRIVING UNDER THE INFLUENCE (DUI) in San Francisco. Oddly, both DRUNKENNESS and DRUG/NARCOTIC are very low in the area.
We were not able to find any information that backs these statistics.
We use geoplotlib to plot all incidents of the three crime types on their own map using geoplotlib.kde(). This gives an idea of how the various crimes are distributed across the city.
Here we define the new focus crimes. Then we create a dictionary that contains all the crimes for the three specific focus crimes, so that under the key PROSTITUTION the value is a list containing all crimes of that category.
In [111]:
focusCrimes = ['PROSTITUTION', 'DRUG/NARCOTIC', 'DRIVING UNDER THE INFLUENCE']
filteredCrimes = {}
for f in focusCrimes:
    filteredCrimes[f] = filter(lambda c: c['Category'] == f and float(c['Y']) < 90.0, data)  # We remove some crimes located at the north pole
Then we create a dictionary that contains two lists with all the latitudes and longitudes; this is the structure that geoplotlib accepts.
We also construct a BoundingBox for San Francisco so that we have a nice view box.
In [112]:
from geoplotlib.utils import BoundingBox

def createLongAndLatList(data):
    return {'lat': [float(d['Y']) for d in data], 'lon': [float(d['X']) for d in data]}

# San Francisco
bbox_SF = BoundingBox(north=37.83, west=-122.52, south=37.67, east=-122.35)
In [113]:
import geoplotlib as glp
from geoplotlib.utils import BoundingBox, DataAccessObject

for fc in focusCrimes:
    print "Plotting " + fc
    glp.kde(createLongAndLatList(filteredCrimes[fc]), bw=5)
    glp.set_bbox(bbox_SF)
    glp.inline(width=900)
In [114]:
import numpy as np
import pylab as pl
from sklearn import neighbors, datasets

# Returns a list containing n copies of v
def times(n, v):
    return [v for _ in range(0, n)]
In [115]:
%matplotlib inline

def createKNNPlot(X, Y, h):
    # Create an instance of the Neighbours Classifier and fit the data.
    knn = neighbors.KNeighborsClassifier()
    knn.fit(X, Y)
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max] x [y_min, y_max].
    x_min, x_max = X[:,0].min() - 0.02, X[:,0].max() + 0.02
    y_min, y_max = X[:,1].min() - 0.02, X[:,1].max() + 0.02
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    pl.figure(1, figsize=(15, 15))
    pl.set_cmap(pl.cm.Paired)
    pl.pcolormesh(xx, yy, Z, cmap='PuBuGn')
    # Plot also the training points (X[:,0] holds latitude, X[:,1] longitude)
    pl.scatter(X[:,0], X[:,1], c=Y, cmap='PuBuGn')
    pl.xlabel('Latitude')
    pl.ylabel('Longitude')
    pl.xlim(xx.min(), xx.max())
    pl.ylim(yy.min(), yy.max())
    pl.xticks(())
    pl.yticks(())
    pl.show()
In [116]:
allLats = []
allLons = []
Y = []
for fc in focusCrimes:
    print "We have %d %s crimes" % (len(filteredCrimes[fc]), fc)
    ll = createLongAndLatList(filteredCrimes[fc])
    allLats.extend(ll['lat'])
    allLons.extend(ll['lon'])
    Y.extend(times(len(ll['lon']), focusCrimes.index(fc)))
X = np.array([[allLats[i], allLons[i]] for i in range(0, len(allLons))])
Y = np.array(Y)
h = .001  # step size in the mesh
# Call the magic method
createKNNPlot(X, Y, h)
In the plot:
`DRUG/NARCOTIC` is shown as cyan,
`PROSTITUTION` is shown as white,
`DRIVING UNDER THE INFLUENCE` is shown as green.
As the KNN plot above shows, the focus crime drawn in cyan, DRUG/NARCOTIC, is the dominant focus crime in SF, with 100,000+ more incidents than the others. The distribution of the different crimes should not affect the KNN plot in any negative way; in principle this distribution just reflects the real-world situation.
Because DRUG/NARCOTIC was over-represented in the last map, the result was an almost entirely cyan map. However, when we balance the dataset we expect this to change: the map should be divided into more zones, where we can see the spread of all three crimes, as DRUG/NARCOTIC will no longer dominate as it did in the previous map.
In [117]:
import sys

allLats = []
allLons = []
Y = []
# Balance by taking the same number of records from each focus crime
balance = sys.maxint
for fc in focusCrimes:
    if balance > len(filteredCrimes[fc]):
        balance = len(filteredCrimes[fc])
for fc in focusCrimes:
    print "We have %d %s crimes" % (len(filteredCrimes[fc][:balance]), fc)
    ll = createLongAndLatList(filteredCrimes[fc])
    allLats.extend(ll['lat'][:balance])
    allLons.extend(ll['lon'][:balance])
    Y.extend(times(len(ll['lon'][:balance]), focusCrimes.index(fc)))
X = np.array([[allLats[i], allLons[i]] for i in range(0, len(allLons))])
Y = np.array(Y)
h = .001  # step size in the mesh
# Call the magic method
createKNNPlot(X, Y, h)
Above is the result of plotting the balanced dataset. Note that we balanced by the age of the data, meaning we only took data from Jan 2003 and then 4918 records forward for each of the respective crime categories. This of course means that if we had balanced differently, the map could very well look a lot different.
What we can see from the map is that DRUG/NARCOTIC incidents are actually heavily centered around Tenderloin, whereas DRIVING UNDER THE INFLUENCE is far more distributed throughout the city.
As predicted, this map gave us a better understanding of how the different crime incidents are distributed across the city.
In [118]:
import geoplotlib as gpl
# We define a set of colors for the different focus crimes, where r is red, b is blue and g is green
colorDic = {'PROSTITUTION': 'r', 'DRUG/NARCOTIC': 'b', 'DRIVING UNDER THE INFLUENCE': 'g'}
First we define a function that helps us create a list of 100 evenly spaced steps between the min and max values provided.
We then use a set of latitudes and longitudes that bound San Francisco, the same values we used for the bounding box earlier.
Lastly we plot the grid as a dot plot with geoplotlib, and we can see below how the grid fits nicely over San Francisco.
In [119]:
from __future__ import division
import geoplotlib as gpl

# Create a list of 100 evenly spaced values from mi towards ma
def createDots(mi, ma):
    step = abs((ma - mi) / 100)
    return [(step * v) + mi for v in range(0, 100)]

gridDots = {'lat': [], 'lon': []}
for lon in createDots(-122.52, -122.35):
    for lat in createDots(37.67, 37.82):
        gridDots['lat'].append(lat)
        gridDots['lon'].append(lon)
gpl.dot(gridDots, color='w')
gpl.inline(width=900)
Grid over SF
We begin by pulling in some helper functions from the book. These will help us construct the map.
In [120]:
from __future__ import division
from collections import Counter
import re, math, random

def vector_add(v, w):
    """adds two vectors componentwise"""
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def vector_subtract(v, w):
    """subtracts two vectors componentwise"""
    return [v_i - w_i for v_i, w_i in zip(v, w)]

def vector_sum(vectors):
    return reduce(vector_add, vectors)

def scalar_multiply(c, v):
    return [c * v_i for v_i in v]

# this isn't right if you don't from __future__ import division
def vector_mean(vectors):
    """compute the vector whose i-th element is the mean of the
    i-th elements of the input vectors"""
    n = len(vectors)
    return scalar_multiply(1/n, vector_sum(vectors))

def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sum_of_squares(v):
    """v_1 * v_1 + ... + v_n * v_n"""
    return dot(v, v)

def squared_distance(v, w):
    return sum_of_squares(vector_subtract(v, w))

def distance(v, w):
    return math.sqrt(squared_distance(v, w))

## From the book
def majority_vote(labels):
    """assumes that labels are ordered from nearest to farthest"""
    vote_counts = Counter(labels)
    winner, winner_count = vote_counts.most_common(1)[0]
    num_winners = len([count
                       for count in vote_counts.values()
                       if count == winner_count])
    if num_winners == 1:
        return winner  # unique winner, so return it
    else:
        return majority_vote(labels[:-1])  # try again without the farthest

## From the book
def knn_classify(k, labeled_points, new_point):
    """each labeled point should be a pair (point, label)"""
    # order the labeled points from nearest to farthest
    by_distance = sorted(labeled_points, key=lambda (point, _): distance(point, new_point))
    # find the labels for the k closest
    k_nearest_labels = [label for _, label in by_distance[:k]]
    # and let them vote
    return majority_vote(k_nearest_labels)
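As a quick sanity check of knn_classify, here is a toy example with made-up points and labels (not the crime data):
toy = [([0.0, 0.0], 'A'), ([0.1, 0.0], 'A'), ([5.0, 5.0], 'B')]
print knn_classify(1, toy, [4.0, 4.0])  # 'B' - the single nearest point is labeled B
print knn_classify(3, toy, [4.0, 4.0])  # 'A' - two of the three neighbours are labeled A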
In [121]:
import geoplotlib as gpl

def plotKnnForGrid(k):
    plots = {
        "PROSTITUTION": {'lat': [], 'lon': []},
        "DRUG/NARCOTIC": {'lat': [], 'lon': []},
        "DRIVING UNDER THE INFLUENCE": {'lat': [], 'lon': []}}
    # Use the first 20 incidents of each focus crime as labeled training points
    crimes = []
    for fc in focusCrimes:
        for c in filteredCrimes[fc][:20]:
            crimes.append(([float(c['X']), float(c['Y'])], c['Category']))
    # Classify every grid point and collect it under the predicted crime
    for lon in createDots(-122.52, -122.35):
        for lat in createDots(37.67, 37.82):
            predicted_crime = knn_classify(k, crimes, [lon, lat])
            plots[predicted_crime]['lon'].append(lon)
            plots[predicted_crime]['lat'].append(lat)
    for fc in focusCrimes:
        gpl.dot(plots[fc], color=colorDic[fc])
    gpl.inline(width=900)
In [122]:
K = [5, 10, 30]
for k in K:
    print "For majority of its k=%d nearest neighbours" % k
    plotKnnForGrid(k)
K=5
K=10
K=30
As you increase K, the number of nearest neighbors we look at, you get much less granularity, for lack of a better word.
As an example, the green crime, DRIVING UNDER THE INFLUENCE (DUI), seems to change the most going from K=5 to K=30. DUI crimes are much more spread out in SF compared to DRUG/NARCOTIC and PROSTITUTION, which are generally concentrated in certain areas.
Because of this, when we only look at K=5, the classification favours DUI because of its geographical distribution. But when looking at K=30, for any given point the chance of its nearest neighbors being DRUG/NARCOTIC is higher because of the density and overall number of those incidents compared to the aforementioned crimes.
In [123]:
from csv import reader

# Parse a CSV row into a dictionary keyed by column name
def parseCsv(row):
    return {
        'IncidntNum': row[0],
        'Category': row[1],
        'Descript': row[2],
        'DayOfWeek': row[3],
        'Date': row[4],
        'Time': row[5],
        'PdDistrict': row[6],
        'Resolution': row[7],
        'Address': row[8],
        'X': row[9],
        'Y': row[10],
        'Location': row[11],
        'PdId': row[12]
    }

with open('../x-files.csv', 'r') as infile:
    data_all = [parseCsv(line) for line in list(reader(infile))[1:]]
data_rb = [line for line in data_all if line['Resolution'] == 'RED BARON']
print "Total count of incidents within the category of " + data_all[0]['Category'] + ": " + str(len(data_all))
print "Total count of incidents from Red Baron: " + str(len(data_rb))
In [124]:
import time

# Extract the hour from a "HH:MM" time string
def get_hour(time_str):
    return int(time_str.split(':')[0])

# Convert a weekday name to a number (Monday=1, ..., Sunday=7)
def convert_day_to_num(day):
    return time.strptime(day, "%A").tm_wday + 1
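For example:
print get_hour("13:45")             # 13
print convert_day_to_num("Monday")  # 1 (Monday=1, ..., Sunday=7)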
We find all incidents with Red Baron in the resolution field and use the day of the week to predict the hour of the day when he is attacking, i.e. we use linear regression to infer the hour of the day based on the weekday. Again, we take 4/5 of the data for training and then calculate goodness of fit using $R^2$ on the remaining 1/5. The exercise also reminds us to rescale the input variables. (Note 1: the exercise states that the goodness of fit after using only the weekdays is around 0.618. Note 2: for multivariate regression, one can simply re-use the code in the DSFS book, Chapters 14-15, or scikit-learn.)
We begin by dividing our filtered dataset into two sets: a training set and a test set. We then construct the x and y values, fit the linear regression model, and evaluate it on the test set. This results in an $R^2$ value and the plot seen in the output below.
In [125]:
from __future__ import division
from sklearn import linear_model
%matplotlib inline

training_rb = data_rb[:int(len(data_rb)*(4/5))]
test_rb = data_rb[int(len(data_rb)*(4/5)):]
data_train_xs = [[convert_day_to_num(d["DayOfWeek"])] for d in training_rb]
data_train_ys = [[get_hour(d["Time"])] for d in training_rb]
data_test_xs = [[convert_day_to_num(d["DayOfWeek"])] for d in test_rb]
data_test_ys = [[get_hour(d["Time"])] for d in test_rb]
regr = linear_model.LinearRegression()
# Train the model using the training set
regr.fit(data_train_xs, data_train_ys)
# R squared coefficient on the test set
print 'R squared coefficient: %.3f' % regr.score(data_test_xs, data_test_ys)
# Plot outputs
plt.scatter(data_test_xs, data_test_ys, color='black')
plt.plot(data_test_xs, regr.predict(data_test_xs), color='blue', linewidth=3)
plt.show()
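As mentioned above, the exercise suggests rescaling the input variables, which we did not do in the cell above. A minimal sketch of how it could be done, using scikit-learn's StandardScaler (this is our own addition, not part of the exercise code):
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaled_train_xs = scaler.fit_transform(data_train_xs)  # learn mean/std on the training set only
scaled_test_xs = scaler.transform(data_test_xs)        # apply the same scaling to the test set
regr_scaled = linear_model.LinearRegression()
regr_scaled.fit(scaled_train_xs, data_train_ys)
print 'R squared with scaled inputs: %.3f' % regr_scaled.score(scaled_test_xs, data_test_ys)
For plain least squares such a linear rescaling does not change $R^2$; it mainly keeps the coefficients on comparable scales.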
In [126]:
# Extract the year (the last component of the date string)
def getYear(d):
    return int(d['Date'].split('/')[-1])
To add another feature to the model, we add the year to each of the x vectors. This greatly improves the $R^2$ value, and we can see that the model now fits much better.
In [127]:
from __future__ import division
%matplotlib inline
training_rb = data_rb[:int(len(data_rb)*(4/5))]
test_rb = data_rb[int(len(data_rb)*(4/5)):]
data_train_xs = [[convert_day_to_num(d["DayOfWeek"]), getYear(d)] for d in training_rb]
data_train_ys = [[get_hour(d["Time"])] for d in training_rb]
data_test_xs = [[convert_day_to_num(d["DayOfWeek"]), getYear(d)] for d in test_rb]
data_test_ys = [[get_hour(d["Time"])] for d in test_rb]
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(data_train_xs, data_train_ys)
# R squared coefficient
print 'R squared coefficient: %.4f' % regr.score(data_test_xs, data_test_ys)
As Chief Suneman yells, we add the longitude to the model as well. As he predicted, the $R^2$ value did in fact improve. This shows that not everything seen in the movies is just special effects.
In [128]:
from __future__ import division
%matplotlib inline
training_rb = data_rb[:int(len(data_rb)*(4/5))]
test_rb = data_rb[int(len(data_rb)*(4/5)):]
data_train_xs = [[convert_day_to_num(d["DayOfWeek"]), getYear(d), float(d['X'])] for d in training_rb]
data_train_ys = [[get_hour(d["Time"])] for d in training_rb]
data_test_xs = [[convert_day_to_num(d["DayOfWeek"]), getYear(d), float(d['X'])] for d in test_rb]
data_test_ys = [[get_hour(d["Time"])] for d in test_rb]
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(data_train_xs, data_train_ys)
# R squared coefficient
print 'R squared coefficient: %.4f' % regr.score(data_test_xs, data_test_ys)
Even though we also added the latitude to the model, the model did not improve further.
In [129]:
from __future__ import division
%matplotlib inline
training_rb = data_rb[:int(len(data_rb)*(4/5))]
test_rb = data_rb[int(len(data_rb)*(4/5)):]
data_train_xs = [[convert_day_to_num(d["DayOfWeek"]), getYear(d), float(d['X']), float(d['Y'])] for d in training_rb]
data_train_ys = [[get_hour(d["Time"])] for d in training_rb]
data_test_xs = [[convert_day_to_num(d["DayOfWeek"]), getYear(d), float(d['X']), float(d['Y'])] for d in test_rb]
data_test_ys = [[get_hour(d["Time"])] for d in test_rb]
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(data_train_xs, data_train_ys)
# R squared coefficient
print 'R squared coefficient: %.4f' % regr.score(data_test_xs, data_test_ys)