In [96]:
import urllib2
import numpy as np
# Downloads data given a url using urllib2
def getTsv(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read()
# Transforms the input data from a tab-separated
# file and casts the values to floats.
def transform(data):
    return np.array([[float(x) for x in line.split('\t')] for line in data.split('\n')])
data = [
getTsv('https://dl.dropboxusercontent.com/u/153071/teaching/02806_2016/data1.tsv'),
getTsv('https://dl.dropboxusercontent.com/u/153071/teaching/02806_2016/data2.tsv'),
getTsv('https://dl.dropboxusercontent.com/u/153071/teaching/02806_2016/data3.tsv'),
getTsv('https://dl.dropboxusercontent.com/u/153071/teaching/02806_2016/data4.tsv')
]
data_t = [transform(d) for d in data]
data_t now holds the four datasets, each an array of (x, y) pairs, one per row of the TSV files.
Next, we calculate all the mean values. We use numpy because it makes computing means easy: calling mean(axis=0) on each of the four arrays gives the column-wise means, i.e. the mean x and the mean y.
See the documentation for more details.
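To make the axis convention concrete, here is a minimal sketch with a made-up toy array (the numbers are for illustration only, not from the datasets):

```python
import numpy as np

# Toy stand-in for one dataset: three (x, y) rows (made-up numbers)
arr = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [5.0, 6.0]])
col_means = arr.mean(axis=0)  # averages down the rows: one mean per column
# col_means is array([3., 4.]) -- the mean x and the mean y
```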
In [97]:
import numpy as np
# Calculates the mean of each dataset's columns:
# axis=0 averages down the rows (axis=1 would average across the columns)
means = [dt.mean(axis=0) for dt in data_t]
In [98]:
for m in means:
    print "Mean: {:10.2f},{:10.2f}".format(*m)
As we did with the means, we can calculate the variance of the x- and y-values the same way for each dataset.
In [99]:
variance = [dt.var(axis=0) for dt in data_t]
for v in variance:
    print "Variance: {:10.3f},{:10.3f}".format(*v)
We use numpy to calculate the Pearson correlation between $x$- and $y$-values for all four datasets (also to three decimal places). As before, we use a list comprehension over the transformed data data_t, calling a small function that calculates the Pearson correlation for each of the four datasets.
In [100]:
# Calculates the Pearson correlation for
# each dataset
def calc_pearson(data):
    data_xs = data[:,0] # all xs across sublists/tuples
    data_ys = data[:,1] # all ys across sublists/tuples
    return np.corrcoef(data_xs, data_ys)[0][1]
pears = [calc_pearson(x) for x in data_t]
for pear in pears:
    print "Pearson: {:10.3f}".format(pear)
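np.corrcoef returns the full 2×2 correlation matrix, which is why calc_pearson picks out the off-diagonal entry [0][1]. A quick sketch with made-up, perfectly linear data (so the correlation must be exactly 1):

```python
import numpy as np

# Made-up data lying exactly on a line; correlation is therefore 1
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 2 * xs + 1
r_matrix = np.corrcoef(xs, ys)  # full 2x2 correlation matrix
r = r_matrix[0][1]              # off-diagonal entry: corr(xs, ys)
```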
Before we start plotting, let's first compute the straight regression lines needed for each dataset.
Because we use a single list comprehension for all datasets, we return a dictionary from lin_reg() for easy lookup later on.
In [101]:
from scipy import stats
# Calculates the linear regression for each dataset
# and returns a dictionary containing lin reg values
def lin_reg(data):
    data_xs = data[:,0]
    data_ys = data[:,1]
    slope, intercept, r_value, p_value, std_err = stats.linregress(data_xs, data_ys)
    return {'slope': slope, 'intercept': intercept, 'r_value': r_value, 'p_value': p_value, 'std_err': std_err}
lin_regs = [lin_reg(x) for x in data_t]
We plot with matplotlib.pyplot, using a two-by-two subplot grid to put all of the plots nicely together, with the same $x$ and $y$ range for all four plots and the linear fit included in each. Now that the regressions have been calculated, a plot for each dataset can be made.
Because the datasets are kept in a single list, we can use a for loop to generate each subplot.
In [102]:
import matplotlib.pyplot as plt
%matplotlib inline
for i in range(0, len(data_t)):
    d = data_t[i] # Get the individual dataset
    xs = d[:,0]
    ys = d[:,1]
    predict_y = lin_regs[i]["intercept"] + lin_regs[i]["slope"] * xs # Straight-line function
    plt.subplot(221+i) # 2 rows, 2 columns, current plot + i (starts at zero)
    plt.scatter(xs,ys)
    plt.plot(xs,predict_y, 'r')
    plt.xlim(xmin=0, xmax=20) # Standardize x axis range
    plt.ylim(ymin=2, ymax=14) # Standardize y axis range
plt.show()
Other than learning how to extract and transform data and use numpy and matplotlib, we think it is fair to say that, up until the plotting, the datasets look very similar. And indeed they do on the mean, variance, and Pearson correlation we calculated.
However, looking at the plots it is easy to see how different each dataset really is. Visualization is a really important tool for understanding your data better.
We begin by loading the CSV file containing all the crime records and parsing it into dictionaries for easy access.
Note that we ignore any crime that does not have a district.
In [103]:
from csv import reader
def parseCrimeCsv(line):
    return {
        'IncidntNum': line[0],
        'Category': line[1],
        'Descript': line[2],
        'DayOfWeek': line[3],
        'Date': line[4],
        'Time': line[5],
        'PdDistrict': line[6],
        'Resolution': line[7],
        'Address': line[8],
        'X': line[9],
        'Y': line[10],
        'Location': line[11],
        'PdId': line[12]
    }
with open('../SFPD_Incidents_-_from_1_January_2003.csv', 'r') as infile:
    data = [parseCrimeCsv(line) for line in list(reader(infile))[1:] if not line[6] == '']
Now that the file has been loaded into memory we can extract all the districts.
We do this by taking the PdDistrict value of every record and inserting it into a set.
In [104]:
PdDistrictSet = set([i['PdDistrict'] for i in data])
PdDistrictSet
Out[104]:
We begin by finding the district with the most crimes overall.
First we create a Counter object, passing in the districts of the dataset.
This object helps us count the occurrences of the given values.
Lastly, we call most_common(1) on the Counter to extract the district with the most crimes overall.
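The Counter idiom can be sketched on a hypothetical miniature list of district labels (made-up values, not the real dataset):

```python
from collections import Counter

# Hypothetical district labels for illustration only
districts = ['SOUTHERN', 'MISSION', 'SOUTHERN', 'CENTRAL', 'SOUTHERN']
counts = Counter(districts)
top = counts.most_common(1)  # list holding the single (value, count) pair
# top is [('SOUTHERN', 3)]
```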
In [105]:
from collections import Counter
DistrictCrimeCounter = Counter([d['PdDistrict'] for d in data])
DistrictCrimeCounter.most_common(1)
Out[105]:
By looking at the output above, we can see that Southern is the district with the most crimes overall, with a total count of 335978 crimes from Jan 2003 to Jan 2016.
Which district has the most focus crimes? To answer this we first define the focus crimes; these are taken from the exercises from week 3:
In [106]:
focuscrimes = set([
    'WEAPON LAWS',
    'PROSTITUTION',
    'DRIVING UNDER THE INFLUENCE',
    'ROBBERY',
    'BURGLARY',
    'ASSAULT',
    'DRUNKENNESS',
    'DRUG/NARCOTIC',
    'TRESPASS',
    'LARCENY/THEFT',
    'VANDALISM',
    'VEHICLE THEFT',
    'STOLEN PROPERTY',
    'DISORDERLY CONDUCT'
])
Now that we have defined the focus crimes, we can filter our dataset so that we get a new dataset containing only the records for these crimes. We do this by utilizing the built-in filter function.
Then we do the same operations as above to get the district with the most focus crimes.
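As a minimal sketch with made-up records, filtering dicts by category membership looks like this (wrapping in list() keeps it working on Python 3, where filter returns an iterator):

```python
# Made-up records for illustration; not the real dataset
records = [{'Category': 'ASSAULT'}, {'Category': 'FRAUD'}, {'Category': 'ROBBERY'}]
focus = set(['ASSAULT', 'ROBBERY'])
kept = list(filter(lambda c: c['Category'] in focus, records))
# kept is [{'Category': 'ASSAULT'}, {'Category': 'ROBBERY'}]
```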
In [107]:
data_filtered_for_focus_crimes = filter(lambda c: c['Category'] in focuscrimes, data)
most_common_focus_crime_by_dist = Counter([d['PdDistrict'] for d in data_filtered_for_focus_crimes])
most_common_focus_crime_by_dist.most_common(1)
Out[107]:
We see that it is again the Southern district that has the most focus crimes, just as it did with all crimes. However, this time around we are down to 191521 crimes committed between Jan 2003 and Jan 2016.
First we create a Counter for the categories. Then we iterate over the output of most_common() to create a nicely structured dataset.
In [108]:
from __future__ import division
CategoryCrimeCounter = Counter([d['Category'] for d in data])
p_crime = {}
for k,v in CategoryCrimeCounter.most_common():
    p_crime[k] = v/len(data)
p_crime
Out[108]:
Now we do the same, but per district.
The result is a dictionary with the district as key and, as value, another dictionary mapping each crime category to its probability within that district.
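On a hypothetical miniature dataset (made-up records, for illustration only) the nested structure would look like this:

```python
from collections import Counter

# Made-up records standing in for the parsed CSV rows
records = [
    {'PdDistrict': 'SOUTHERN', 'Category': 'ASSAULT'},
    {'PdDistrict': 'SOUTHERN', 'Category': 'ASSAULT'},
    {'PdDistrict': 'SOUTHERN', 'Category': 'ROBBERY'},
]
counter = Counter([d['Category'] for d in records if d['PdDistrict'] == 'SOUTHERN'])
total = float(sum(counter.values()))  # crimes in this district
probs = {'SOUTHERN': {k: v / total for k, v in counter.items()}}
# probs['SOUTHERN']['ASSAULT'] is 2/3, probs['SOUTHERN']['ROBBERY'] is 1/3
```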
In [109]:
p_crime_district = {}
for district in PdDistrictSet:
    counter = Counter([d['Category'] for d in data if d['PdDistrict'] == district])
    total = sum(counter.values()) # number of crimes in this district
    p_crime_district[district] = {}
    for k,v in counter.most_common():
        p_crime_district[district][k] = v / total
p_crime_district
Out[109]:
In [110]:
import pandas as pd
%matplotlib inline
# Ratio of a crime's probability in one district to its overall probability
def calcRatioForDistrict(districtCounter, overAllCounter, district):
    ys = []
    xs = []
    for fc in focuscrimes:
        ys.append(districtCounter.get(fc, 0) / overAllCounter[fc]) # .get guards against districts with no record of a focus crime
        xs.append(fc)
    return pd.Series(ys, index=xs)
series = {}
for district in PdDistrictSet:
    series[district] = calcRatioForDistrict(p_crime_district[district], p_crime, district)
df = pd.DataFrame(series)
df.plot(kind="bar", subplots=True, figsize=(14,14), layout=(7,2), legend=False, sharey=True)
Out[110]:
Tenderloin
The top crime in Tenderloin is by far DRUG/NARCOTIC, as illustrated above. Looking through articles and Wikipedia, this does seem to be representative of current and past activity in the area.
From Wikipedia: "The first block of Turk Street, between Taylor and Mason, had one of the highest rates of violence and drug activity in San Francisco, according to a survey conducted by the Tenderloin Housing Clinic."
Mission
Prostitution is the top crime in Mission by some margin. Wikipedia does not give any indication about prostitution, or crime in general, for the district.
Searching on Google, however, there are articles mentioning that prostitution is on the uptick there recently.
Richmond
Richmond has the highest ratio of crimes related to DRIVING UNDER THE INFLUENCE (DUI) in San Francisco. Oddly, both DRUNKENNESS and DRUG/NARCOTIC are very low in the area.
We were not able to find any information that backs these statistics.
We use geoplotlib to plot all incidents of the three crime types on their own map using geoplotlib.kde(), which gives an idea of how the various crimes are distributed across the city. Here we define the new focus crimes.
Then we create a dictionary that contains all the crimes of the three specific focus crimes,
so that under the key PROSTITUTION the value is a list containing all of those crimes.
In [111]:
focusCrimes = ['PROSTITUTION', 'DRUG/NARCOTIC', 'DRIVING UNDER THE INFLUENCE']
filteredCrimes = {}
for f in focusCrimes:
    filteredCrimes[f] = filter(lambda c: c['Category'] == f and float(c['Y']) < 90.0, data) # We remove some crimes located at the north pole
Then we create a dictionary that contains two lists with all the latitudes and longitudes.
This is the structure that geoplotlib accepts.
We also construct a Bounding Box for San Francisco so that we have a nice view box.
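The dict-of-lists shape can be sketched with two made-up rows whose coordinates are strings, mirroring the CSV columns:

```python
# Two made-up records with string coordinates, as parsed from the CSV
rows = [{'X': '-122.42', 'Y': '37.77'},
        {'X': '-122.41', 'Y': '37.78'}]
coords = {'lat': [float(d['Y']) for d in rows],
          'lon': [float(d['X']) for d in rows]}
# coords is {'lat': [37.77, 37.78], 'lon': [-122.42, -122.41]}
```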
In [112]:
from geoplotlib.utils import BoundingBox
def createLongAndLatList(data):
    return {'lat': [float(d['Y']) for d in data], 'lon': [float(d['X']) for d in data]}
# San Francisco
bbox_SF = BoundingBox(north=37.83, west=-122.52, south=37.67, east=-122.35)
In [113]:
import geoplotlib as glp
from geoplotlib.utils import BoundingBox, DataAccessObject
for fc in focusCrimes:
    print "Plotting " + fc
    glp.kde(createLongAndLatList(filteredCrimes[fc]), bw=5)
    glp.set_bbox(bbox_SF)
    glp.inline(width=900)