Assignment 1

Assignment 1A: Anscombe's quartet

Using the numpy function mean, calculate the mean of both $x$-values and $y$-values for each dataset.

We start out by downloading the data into memory as data. Afterwards we transform it into data_t, a more appropriate format to work with in Python.


In [96]:
import urllib2
import numpy as np

# Downloads data given a url using urllib2
def getTsv(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read()

# Transforms the input data from a tab-separated
# file and casts the values to floats.
def transform(data):
    return np.array([[float(x) for x in line.split('\t')] for line in data.split('\n')])
    

data = [
    getTsv('https://dl.dropboxusercontent.com/u/153071/teaching/02806_2016/data1.tsv'),
    getTsv('https://dl.dropboxusercontent.com/u/153071/teaching/02806_2016/data2.tsv'),
    getTsv('https://dl.dropboxusercontent.com/u/153071/teaching/02806_2016/data3.tsv'),
    getTsv('https://dl.dropboxusercontent.com/u/153071/teaching/02806_2016/data4.tsv')
    ]

data_t = [transform(d) for d in data]

data_t now holds four arrays, each containing a two-element row (x, y) for every line of the corresponding TSV file.

Next, we calculate all the mean values. We use numpy as it has an easy way to calculate the mean of values in a numpy array. More specifically, we call a single function on each of the four arrays, computing the mean along an axis.

See the documentation here for more details.
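As a minimal illustration of the axis semantics on a toy array:

import numpy as np

np.array([[1.0, 2.0], [3.0, 4.0]]).mean(axis=0)  # -> array([ 2.,  3.]), one mean per column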


In [97]:
import numpy as np

# Calculates the mean of each dataset's columns along axis 0,
# i.e. down the rows, giving (mean x, mean y); axis=1 would
# average across the columns instead.

means = [dt.mean(axis=0) for dt in data_t]

Use python string formatting to print precisely two decimal places of these results to the output cell. Check out this stackoverflow page for help with the string formatting.
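For instance, the format spec {:10.2f} right-aligns a float in a field of width 10 with two decimals:

"{:10.2f}".format(3.14159)  # -> '      3.14'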


In [98]:
for m in means:
    print "Mean: {:10.2f},{:10.2f}".format(*m)


Mean:       9.00,      7.50
Mean:       9.00,      7.50
Mean:       9.00,      7.50
Mean:       9.00,      7.50

Now calculate the variance for all of the various sets of x- and y-values (to three decimal places).

As we did with the means we can calculate the variance in x- and y-values the same way for each dataset.


In [99]:
variance = [dt.var(axis=0) for dt in data_t]

for v in variance:
    print "Variance: {:10.3f},{:10.3f}".format(*v)


Variance:     10.000,     3.752
Variance:     10.000,     3.752
Variance:     10.000,     3.748
Variance:     10.000,     3.748
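Note that numpy's var computes the population variance (ddof=0) by default; passing ddof=1 would give the sample variance instead, e.g.

sample_variance = [dt.var(axis=0, ddof=1) for dt in data_t]  # the x-variance becomes 11.000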

Use numpy to calculate the Pearson correlation between $x$- and $y$-values for all four data sets (also to three decimal places).

As before, we use a list comprehension on the transformed data data_t, calling a simple function that calculates the Pearson correlation for each of the 4 datasets.


In [100]:
# Calculates the pearson correlation for 
# each dataset
def calc_pearson(data):
    data_xs = data[:,0] # all xs across sublists/tuples
    data_ys = data[:,1] # all ys across sublists/tuples
    return np.corrcoef(data_xs, data_ys)[0][1]
    
pears = [calc_pearson(x) for x in data_t]

for pear in pears:
    print "Pearson: {:10.3f}".format(pear)


Pearson:      0.816
Pearson:      0.816
Pearson:      0.816
Pearson:      0.817

The next step is use linear regression to fit a straight line $f(x) = a x + b$ through each dataset and report $a$ and $b$ (to two decimal places). An easy way to fit a straight line in Python is using scipy's linregress.

Before we start plotting, let's first generate the straight lines needed for each dataset.

Because we use a single list comprehension for all datasets we return a dictionary from lin_reg() for easy lookup later on.


In [101]:
from scipy import stats

# Calculates the linear regression for each dataset
# and returns a dictionary containing lin reg values
def lin_reg(data):
    data_xs = data[:,0]
    data_ys = data[:,1]
    slope, intercept, r_value, p_value, std_err = stats.linregress(data_xs,data_ys)
    
    return {'slope': slope, 'intercept': intercept, 'r_value': r_value, 'p_value': p_value, 'std_err': std_err}

lin_regs = [lin_reg(x) for x in data_t]

Finally, it's time to plot the four datasets using matplotlib.pyplot. Use a two-by-two subplot to put all of the plots nicely in a grid and use the same $x$ and $y$ range for all four plots. And include the linear fit in all four plots. (To get a sense of what I think the plot should look like, you can take a look at my version here.)

Now that the regressions have been calculated, a plot for each dataset can be made.

Because the datasets are kept in a single list we can use a for loop to generate each subplot.


In [102]:
import matplotlib.pyplot as plt
%matplotlib inline

for i in range(0, len(data_t)):
    d = data_t[i] # Get the individual dataset
    xs = d[:,0]
    ys = d[:,1]
    predict_y = lin_regs[i]["intercept"] + lin_regs[i]["slope"] * xs # Straight-line function

    plt.subplot(221+i) # 2 rows, 2 columns, current plot + i (starts at zero)
    plt.scatter(xs,ys)
    plt.plot(xs,predict_y, 'r')
    plt.xlim(xmin=0, xmax=20) # Standardize the x axis range
    plt.ylim(ymin=2, ymax=14) # Standardize the y axis range
    
plt.show()


Explain - in your own words - what you think my point with this exercise is.

Other than learning how to extract and transform data and how to use numpy and matplotlib, we think it is fair to say that, up until the plotting, the datasets look very similar. And indeed they do on the mean, variance and Pearson correlation which we have calculated.

However, looking at the plots it is easy to see how different each dataset really is. Visualization is an important tool for understanding your data beyond summary statistics.

Assignment 1B: Slicing data

We'll be combining information about PdDistrict and Category to explore differences between SF's neighborhoods. First, simply list the names of SF's 10 police districts.

We begin by loading the csv file containing all the crime records and parsing it into a dictionary for easy access.

Note that we ignore any crime that does not have a district.


In [103]:
from csv import reader

def parseCrimeCsv(row):
    return {
        'IncidntNum': row[0],
        'Category': row[1],
        'Descript': row[2],
        'DayOfWeek': row[3],
        'Date': row[4],
        'Time': row[5],
        'PdDistrict': row[6],
        'Resolution': row[7],
        'Address': row[8],
        'X': row[9],
        'Y': row[10],
        'Location': row[11],
        'PdId': row[12]
    }

with open('../SFPD_Incidents_-_from_1_January_2003.csv', 'r') as infile:    
    data = [parseCrimeCsv(line) for line in list(reader(infile))[1:] if not line[6] == '']
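As an aside, csv.DictReader could build the same dictionaries directly from the file's header row, assuming the header names match the keys above (a sketch, not what we run below):

from csv import DictReader

with open('../SFPD_Incidents_-_from_1_January_2003.csv', 'r') as infile:
    data_alt = [row for row in DictReader(infile) if row['PdDistrict'] != '']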

Now that the file has been loaded into memory we can extract all the districts. We do this by taking the PdDistrict value of every record and inserting it into a set.


In [104]:
PdDistrictSet = set([i['PdDistrict'] for i in data])
PdDistrictSet


Out[104]:
{'BAYVIEW',
 'CENTRAL',
 'INGLESIDE',
 'MISSION',
 'NORTHERN',
 'PARK',
 'RICHMOND',
 'SOUTHERN',
 'TARAVAL',
 'TENDERLOIN'}

Which has the most crimes? Which has the most focus crimes?

We begin by finding the district with the most crimes overall. We create a Counter object, to which we feed the district of every record in the dataset. This object counts the occurrences of the given values.

Lastly we call most_common(1) on the Counter to extract the district with the most crimes overall.
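For instance, on a toy list:

Counter(['a', 'b', 'a', 'a']).most_common(1)  # -> [('a', 3)]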


In [105]:
from collections import Counter

DistrictCrimeCounter = Counter([d['PdDistrict'] for d in data])

DistrictCrimeCounter.most_common(1)


Out[105]:
[('SOUTHERN', 335978)]

By looking at the output above, we can see that Southern is the district with the most crimes overall, with a total count of 335978 crimes from Jan 2003 to Jan 2016.

Which district has the most focus crimes? To answer this, we first define the focus crimes; these are taken from the exercises from week 3:


In [106]:
focuscrimes = set([
        'WEAPON LAWS',
        'PROSTITUTION',
        'DRIVING UNDER THE INFLUENCE',
        'ROBBERY',
        'BURGLARY',
        'ASSAULT',
        'DRUNKENNESS',
        'DRUG/NARCOTIC',
        'TRESPASS',
        'LARCENY/THEFT',
        'VANDALISM',
        'VEHICLE THEFT',
        'STOLEN PROPERTY',
        'DISORDERLY CONDUCT'
    ])

Now that we have defined the focus crimes, we can filter our dataset so that we get a new dataset containing only the records for these crimes. We do this by utilizing the built-in filter function.

Then we perform the same operations as above to get the district with the most focus crimes.


In [107]:
data_filtered_for_focus_crimes = filter(lambda c: c['Category'] in focuscrimes, data)

most_common_focus_crime_by_dist = Counter([d['PdDistrict'] for d in data_filtered_for_focus_crimes])
most_common_focus_crime_by_dist.most_common(1)


Out[107]:
[('SOUTHERN', 191521)]

We see that it is again the Southern district that has the most focus crimes, just as it had for all crimes. However, this time around we are down to 191521 crimes committed between Jan 2003 and Jan 2016.

Next, we want to generate a slightly more complicated graphic. I'm interested to know if there are certain crimes that happen much more in certain neighborhoods than what's typical. Below I describe how to get that plot going.

First, we need to calculate the relative probabilities of seeing each type of crime in the dataset as a whole. That's simply a normalized version of this plot. Let's call it P(crime).

Firstly we have to create a Counter for the categories. Then we iterate over the output of most_common to create a nicely structured dataset.


In [108]:
from __future__ import division

CategoryCrimeCounter = Counter([d['Category'] for d in data])

p_crime = {}
for k, v in CategoryCrimeCounter.most_common():
    p_crime[k] = v / len(data)

p_crime


Out[108]:
{'ARSON': 0.0017349745013750567,
 'ASSAULT': 0.0875342429177903,
 'BAD CHECKS': 0.0004571062398205751,
 'BRIBERY': 0.0003551117400475262,
 'BURGLARY': 0.04178837476303634,
 'DISORDERLY CONDUCT': 0.004770779376818946,
 'DRIVING UNDER THE INFLUENCE': 0.002626224868501856,
 'DRUG/NARCOTIC': 0.05924919232104237,
 'DRUNKENNESS': 0.004795343497183136,
 'EMBEZZLEMENT': 0.0013745227351614022,
 'EXTORTION': 0.00032574159613382107,
 'FAMILY OFFENSES': 0.0005751208180920087,
 'FORGERY/COUNTERFEITING': 0.011604944864229834,
 'FRAUD': 0.019067631431394012,
 'GAMBLING': 0.00016340480068352336,
 'KIDNAPPING': 0.0025781646330067017,
 'LARCENY/THEFT': 0.20380476864336636,
 'LIQUOR LAWS': 0.0020521720556430727,
 'LOITERING': 0.0012511681307238406,
 'MISSING PERSON': 0.029413932128267428,
 'NON-CRIMINAL': 0.10625637465623582,
 'OTHER OFFENSES': 0.1426449149600833,
 'PORNOGRAPHY/OBSCENE MAT': 2.4564120364189783e-05,
 'PROSTITUTION': 0.00831068272234534,
 'RECOVERED VEHICLE': 0.0033887806050249647,
 'ROBBERY': 0.02584999866499346,
 'RUNAWAY': 0.0021253304141190294,
 'SECONDARY CODES': 0.011447948094945665,
 'SEX OFFENSES, FORCIBLE': 0.005046324726991162,
 'SEX OFFENSES, NON FORCIBLE': 0.0001767548660988439,
 'STOLEN PROPERTY': 0.005342162176594665,
 'SUICIDE': 0.0005996849384561985,
 'SUSPICIOUS OCC': 0.03587482978666595,
 'TREA': 5.340026166128214e-06,
 'TRESPASS': 0.008350732918591302,
 'VANDALISM': 0.05101914399380557,
 'VEHICLE THEFT': 0.060621045043120714,
 'WARRANTS': 0.047586575174218354,
 'WEAPON LAWS': 0.009805890048861239}

Next, we calculate that same probability distribution but for each PD district, let's call that P(crime|district).

Now we do the same, however per district instead.

Our result is a dictionary with the district as key and, as value, another dictionary with the crime category as key and the probability as value.


In [109]:
p_crime_district = {}
for district in PdDistrictSet:
    counter = Counter([d['Category'] for d in data if d['PdDistrict'] == district])
    total = sum(counter.values())  # number of crimes in this district
    p_crime_district[district] = {}
    for k, v in counter.most_common():
        p_crime_district[district][k] = v / total

p_crime_district


Out[109]:
{'BAYVIEW': {'ARSON': 0.004446580305134691,
  'ASSAULT': 0.10975448750785545,
  'BAD CHECKS': 0.0003538252736864897,
  'BRIBERY': 0.0006337169080952054,
  'BURGLARY': 0.044397150386303263,
  'DISORDERLY CONDUCT': 0.002476776915805428,
  'DRIVING UNDER THE INFLUENCE': 0.002091265796714178,
  'DRUG/NARCOTIC': 0.04878035900062844,
  'DRUNKENNESS': 0.00274610660174589,
  'EMBEZZLEMENT': 0.0012938386873610444,
  'EXTORTION': 0.00020595799513094175,
  'FAMILY OFFENSES': 0.0008343939289920205,
  'FORGERY/COUNTERFEITING': 0.008502368516944006,
  'FRAUD': 0.010113065658352653,
  'GAMBLING': 0.0002693296859404623,
  'KIDNAPPING': 0.0037072439123569515,
  'LARCENY/THEFT': 0.11580120300593054,
  'LIQUOR LAWS': 0.0013255245327658046,
  'LOITERING': 0.0005016925522420376,
  'MISSING PERSON': 0.05471089306555273,
  'NON-CRIMINAL': 0.07021055244271464,
  'OTHER OFFENSES': 0.18928595947380372,
  'PORNOGRAPHY/OBSCENE MAT': 2.640487117063356e-05,
  'PROSTITUTION': 0.0006918076246705992,
  'RECOVERED VEHICLE': 0.007509545360928184,
  'ROBBERY': 0.029795256628942907,
  'RUNAWAY': 0.002724982704809383,
  'SECONDARY CODES': 0.01886892093853474,
  'SEX OFFENSES, FORCIBLE': 0.004678943171436267,
  'SEX OFFENSES, NON FORCIBLE': 0.000274610660174589,
  'STOLEN PROPERTY': 0.0044201754339640575,
  'SUICIDE': 0.00037494917062299654,
  'SUSPICIOUS OCC': 0.04386905296289059,
  'TREA': 1.5842922702380134e-05,
  'TRESPASS': 0.007182124958412328,
  'VANDALISM': 0.061502225930639684,
  'VEHICLE THEFT': 0.07777290754598408,
  'WARRANTS': 0.048743392180989546,
  'WEAPON LAWS': 0.019106564779070443},
 'CENTRAL': {'ARSON': 0.001202040219615997,
  'ASSAULT': 0.08095903316981255,
  'BAD CHECKS': 0.0006714098523981244,
  'BRIBERY': 0.0001949254410188103,
  'BURGLARY': 0.05192055705359367,
  'DISORDERLY CONDUCT': 0.00520884095166932,
  'DRIVING UNDER THE INFLUENCE': 0.0019005230499334004,
  'DRUG/NARCOTIC': 0.019719957116402977,
  'DRUNKENNESS': 0.006042688671583119,
  'EMBEZZLEMENT': 0.0021766674247100486,
  'EXTORTION': 0.0006822390435658361,
  'FAMILY OFFENSES': 0.0002869735659443596,
  'FORGERY/COUNTERFEITING': 0.012805518555819066,
  'FRAUD': 0.026753516779831715,
  'GAMBLING': 0.00033029033061520634,
  'KIDNAPPING': 0.0021441798512069135,
  'LARCENY/THEFT': 0.2985012399423887,
  'LIQUOR LAWS': 0.0013915510650509514,
  'LOITERING': 0.0007851163596590971,
  'MISSING PERSON': 0.015778131531355922,
  'NON-CRIMINAL': 0.127291727580867,
  'OTHER OFFENSES': 0.10426886715831195,
  'PORNOGRAPHY/OBSCENE MAT': 2.707297791927921e-05,
  'PROSTITUTION': 0.009242714661641923,
  'RECOVERED VEHICLE': 0.00218749661587776,
  'ROBBERY': 0.02342895509134423,
  'RUNAWAY': 0.00041692385995689983,
  'SECONDARY CODES': 0.008560475618076087,
  'SEX OFFENSES, FORCIBLE': 0.004521187312519628,
  'SEX OFFENSES, NON FORCIBLE': 5.414595583855842e-05,
  'STOLEN PROPERTY': 0.006075176245086254,
  'SUICIDE': 0.0006714098523981244,
  'SUSPICIOUS OCC': 0.03344054232589368,
  'TREA': 1.0829191167711683e-05,
  'TRESPASS': 0.010856264145630963,
  'VANDALISM': 0.05254323554573709,
  'VEHICLE THEFT': 0.04840106992408737,
  'WARRANTS': 0.032964057914514365,
  'WEAPON LAWS': 0.005582448046955373},
 'INGLESIDE': {'ARSON': 0.0023914763499646038,
  'ASSAULT': 0.11107277345818188,
  'BAD CHECKS': 0.000416426230093339,
  'BRIBERY': 0.0007733629987447724,
  'BURGLARY': 0.04158313354789199,
  'DISORDERLY CONDUCT': 0.0020285906351689797,
  'DRIVING UNDER THE INFLUENCE': 0.0026889236571741317,
  'DRUG/NARCOTIC': 0.02856089043825886,
  'DRUNKENNESS': 0.002546148949713558,
  'EMBEZZLEMENT': 0.0009518313830704891,
  'EXTORTION': 0.0003688346609398145,
  'FAMILY OFFENSES': 0.0009637292753588702,
  'FORGERY/COUNTERFEITING': 0.012040666995841687,
  'FRAUD': 0.015752809389816595,
  'GAMBLING': 0.0002082131150466695,
  'KIDNAPPING': 0.004021487593472817,
  'LARCENY/THEFT': 0.12866380720655335,
  'LIQUOR LAWS': 0.0012314318518474453,
  'LOITERING': 0.0002796004687769562,
  'MISSING PERSON': 0.040179182257863016,
  'NON-CRIMINAL': 0.08831210551050882,
  'OTHER OFFENSES': 0.16874780632610933,
  'PORNOGRAPHY/OBSCENE MAT': 2.9744730720952783e-05,
  'PROSTITUTION': 0.0003985793916607673,
  'RECOVERED VEHICLE': 0.007715783149015152,
  'ROBBERY': 0.03425403189824922,
  'RUNAWAY': 0.0019155606584293592,
  'SECONDARY CODES': 0.0170734754338269,
  'SEX OFFENSES, FORCIBLE': 0.006644972843060852,
  'SEX OFFENSES, NON FORCIBLE': 0.0003747836070840051,
  'STOLEN PROPERTY': 0.004669922723189587,
  'SUICIDE': 0.0008031077294657252,
  'SUSPICIOUS OCC': 0.03993527546595121,
  'TRESPASS': 0.0048900337305246375,
  'VANDALISM': 0.06823441227386569,
  'VEHICLE THEFT': 0.11246482685592248,
  'WARRANTS': 0.03230277756295472,
  'WEAPON LAWS': 0.014509479645680768},
 'MISSION': {'ARSON': 0.001294508599799509,
  'ASSAULT': 0.09371768661841201,
  'BAD CHECKS': 0.0003236271499498773,
  'BRIBERY': 0.0005170141054077307,
  'BURGLARY': 0.03131684676649117,
  'DISORDERLY CONDUCT': 0.009187853720528223,
  'DRIVING UNDER THE INFLUENCE': 0.002995524473316547,
  'DRUG/NARCOTIC': 0.0713084798206632,
  'DRUNKENNESS': 0.006409396238031715,
  'EMBEZZLEMENT': 0.0010142948480136398,
  'EXTORTION': 0.0002091736456993109,
  'FAMILY OFFENSES': 0.001160321732747121,
  'FORGERY/COUNTERFEITING': 0.010360015470956437,
  'FRAUD': 0.016524718010245562,
  'GAMBLING': 0.00016181357497493864,
  'KIDNAPPING': 0.002502190403271002,
  'LARCENY/THEFT': 0.15173771992832844,
  'LIQUOR LAWS': 0.0039545659054850855,
  'LOITERING': 0.001483948882696998,
  'MISSING PERSON': 0.027713534718878514,
  'NON-CRIMINAL': 0.10562479773303128,
  'OTHER OFFENSES': 0.16031778607456054,
  'PORNOGRAPHY/OBSCENE MAT': 2.368003536218614e-05,
  'PROSTITUTION': 0.028005588488345477,
  'RECOVERED VEHICLE': 0.0026363772703233905,
  'ROBBERY': 0.02990393798988073,
  'RUNAWAY': 0.0022259233240454973,
  'SECONDARY CODES': 0.011650577398195581,
  'SEX OFFENSES, FORCIBLE': 0.00601472898199528,
  'SEX OFFENSES, NON FORCIBLE': 0.000221013663380404,
  'STOLEN PROPERTY': 0.005359581336974797,
  'SUICIDE': 0.000619627591977204,
  'SUSPICIOUS OCC': 0.032603462021169954,
  'TREA': 3.9466725603643566e-06,
  'TRESPASS': 0.008564279455990654,
  'VANDALISM': 0.04462502664003978,
  'VEHICLE THEFT': 0.06077481075705073,
  'WARRANTS': 0.05559282968529233,
  'WEAPON LAWS': 0.011338790265926798},
 'NORTHERN': {'ARSON': 0.0014210061782877316,
  'ASSAULT': 0.07824801412180053,
  'BAD CHECKS': 0.0005692850838481906,
  'BRIBERY': 0.00017210944395410415,
  'BURGLARY': 0.05461165048543689,
  'DISORDERLY CONDUCT': 0.004329214474845543,
  'DRIVING UNDER THE INFLUENCE': 0.0023654015887025594,
  'DRUG/NARCOTIC': 0.04202559576345984,
  'DRUNKENNESS': 0.003212709620476611,
  'EMBEZZLEMENT': 0.0013018534863195057,
  'EXTORTION': 0.00024271844660194174,
  'FAMILY OFFENSES': 0.00022506619593998234,
  'FORGERY/COUNTERFEITING': 0.011712268314210062,
  'FRAUD': 0.019849955869373347,
  'GAMBLING': 9.70873786407767e-05,
  'KIDNAPPING': 0.0021624007060900264,
  'LARCENY/THEFT': 0.2780847308031774,
  'LIQUOR LAWS': 0.0011076787290379524,
  'LOITERING': 0.0017431597528684906,
  'MISSING PERSON': 0.018172992056487203,
  'NON-CRIMINAL': 0.09720211827007944,
  'OTHER OFFENSES': 0.11582082965578111,
  'PORNOGRAPHY/OBSCENE MAT': 2.64783759929391e-05,
  'PROSTITUTION': 0.01709179170344219,
  'RECOVERED VEHICLE': 0.0024271844660194173,
  'ROBBERY': 0.024761694616063548,
  'RUNAWAY': 0.0008473080317740512,
  'SECONDARY CODES': 0.00939982347749338,
  'SEX OFFENSES, FORCIBLE': 0.004342453662842013,
  'SEX OFFENSES, NON FORCIBLE': 0.00010150044130626655,
  'STOLEN PROPERTY': 0.006676963812886143,
  'SUICIDE': 0.000706090026478376,
  'SUSPICIOUS OCC': 0.031248896734333627,
  'TREA': 4.41306266548985e-06,
  'TRESPASS': 0.007546337157987643,
  'VANDALISM': 0.051403353927625774,
  'VEHICLE THEFT': 0.058667255075022066,
  'WARRANTS': 0.04258605472197705,
  'WEAPON LAWS': 0.007484554280670785},
 'PARK': {'ARSON': 0.0012030979772915257,
  'ASSAULT': 0.06954470260921874,
  'BAD CHECKS': 0.00028197608842770133,
  'BRIBERY': 0.00022558087074216106,
  'BURGLARY': 0.05825625986916309,
  'DISORDERLY CONDUCT': 0.005385743288969095,
  'DRIVING UNDER THE INFLUENCE': 0.0034025114670275963,
  'DRUG/NARCOTIC': 0.051075268817204304,
  'DRUNKENNESS': 0.007143394240168434,
  'EMBEZZLEMENT': 0.0009305210918114144,
  'EXTORTION': 0.0001503872471614407,
  'FAMILY OFFENSES': 0.00041356492969396195,
  'FORGERY/COUNTERFEITING': 0.009389803744642454,
  'FRAUD': 0.0190709827806602,
  'GAMBLING': 1.879840589518009e-05,
  'KIDNAPPING': 0.0016166629069854876,
  'LARCENY/THEFT': 0.1918283329573652,
  'LIQUOR LAWS': 0.002528385592901722,
  'LOITERING': 0.0003571697120084217,
  'MISSING PERSON': 0.05462816753139334,
  'NON-CRIMINAL': 0.12124971802391157,
  'OTHER OFFENSES': 0.1282145274080758,
  'PORNOGRAPHY/OBSCENE MAT': 1.879840589518009e-05,
  'PROSTITUTION': 0.00013158884126626062,
  'RECOVERED VEHICLE': 0.002293405519211971,
  'ROBBERY': 0.01857282502443793,
  'RUNAWAY': 0.009850364689074367,
  'SECONDARY CODES': 0.008957440409053313,
  'SEX OFFENSES, FORCIBLE': 0.004107451688096849,
  'SEX OFFENSES, NON FORCIBLE': 0.00011279043537108053,
  'STOLEN PROPERTY': 0.0039006692232498685,
  'SUICIDE': 0.0005169561621174524,
  'SUSPICIOUS OCC': 0.03292540792540793,
  'TRESPASS': 0.006259869163094969,
  'VANDALISM': 0.0515546281675314,
  'VEHICLE THEFT': 0.08038198360779006,
  'WARRANTS': 0.04617828408150989,
  'WEAPON LAWS': 0.007321979096172645},
 'RICHMOND': {'ARSON': 0.0021137267904509285,
  'ASSAULT': 0.07038502984084881,
  'BAD CHECKS': 0.000621684350132626,
  'BRIBERY': 0.00020722811671087534,
  'BURGLARY': 0.058365799071618034,
  'DISORDERLY CONDUCT': 0.002175895225464191,
  'DRIVING UNDER THE INFLUENCE': 0.00726334549071618,
  'DRUG/NARCOTIC': 0.020940401193633953,
  'DRUNKENNESS': 0.003284565649867374,
  'EMBEZZLEMENT': 0.0011190318302387269,
  'EXTORTION': 0.0005698773209549072,
  'FAMILY OFFENSES': 0.0005284316976127321,
  'FORGERY/COUNTERFEITING': 0.013283322281167109,
  'FRAUD': 0.02486737400530504,
  'GAMBLING': 0.0001243368700265252,
  'KIDNAPPING': 0.001999751326259947,
  'LARCENY/THEFT': 0.22721526856763927,
  'LIQUOR LAWS': 0.001823607427055703,
  'LOITERING': 0.0002486737400530504,
  'MISSING PERSON': 0.02619363395225464,
  'NON-CRIMINAL': 0.12841926392572944,
  'OTHER OFFENSES': 0.12326964522546419,
  'PORNOGRAPHY/OBSCENE MAT': 4.1445623342175064e-05,
  'PROSTITUTION': 0.0005387931034482759,
  'RECOVERED VEHICLE': 0.0027768567639257294,
  'ROBBERY': 0.017179210875331564,
  'RUNAWAY': 0.002082642572944297,
  'SECONDARY CODES': 0.011428630636604774,
  'SEX OFFENSES, FORCIBLE': 0.004662632625994695,
  'SEX OFFENSES, NON FORCIBLE': 0.00016578249336870026,
  'STOLEN PROPERTY': 0.004569379973474801,
  'SUICIDE': 0.0008185510610079575,
  'SUSPICIOUS OCC': 0.047330901856763925,
  'TRESPASS': 0.00533612400530504,
  'VANDALISM': 0.06925563660477453,
  'VEHICLE THEFT': 0.08979194297082228,
  'WARRANTS': 0.02216304708222812,
  'WEAPON LAWS': 0.006838527851458886},
 'SOUTHERN': {'ARSON': 0.0011310264362547548,
  'ASSAULT': 0.07692765597747471,
  'BAD CHECKS': 0.00048812719880468364,
  'BRIBERY': 0.0002738274529879933,
  'BURGLARY': 0.031320503128181014,
  'DISORDERLY CONDUCT': 0.003342480757668657,
  'DRIVING UNDER THE INFLUENCE': 0.0019465560245016042,
  'DRUG/NARCOTIC': 0.05639654977409235,
  'DRUNKENNESS': 0.005911101322110376,
  'EMBEZZLEMENT': 0.001672728571513611,
  'EXTORTION': 0.00024108721404377667,
  'FAMILY OFFENSES': 0.000330378774800731,
  'FORGERY/COUNTERFEITING': 0.014340224657566864,
  'FRAUD': 0.021921078165832285,
  'GAMBLING': 0.00010417348754978004,
  'KIDNAPPING': 0.002086446136354166,
  'LARCENY/THEFT': 0.27396436671448726,
  'LIQUOR LAWS': 0.0023215805796808126,
  'LOITERING': 0.0023840846722106806,
  'MISSING PERSON': 0.019906065278083685,
  'NON-CRIMINAL': 0.12540702069778378,
  'OTHER OFFENSES': 0.13278845638702533,
  'PORNOGRAPHY/OBSCENE MAT': 1.488192679282572e-05,
  'PROSTITUTION': 0.0011935305287846228,
  'RECOVERED VEHICLE': 0.001982272648804386,
  'ROBBERY': 0.0244986278863497,
  'RUNAWAY': 0.0006786158617528528,
  'SECONDARY CODES': 0.0075064438743012935,
  'SEX OFFENSES, FORCIBLE': 0.005306895094321652,
  'SEX OFFENSES, NON FORCIBLE': 0.0001607248093625178,
  'STOLEN PROPERTY': 0.006690914286054444,
  'SUICIDE': 0.00041669395019912017,
  'SUSPICIOUS OCC': 0.031641952746906045,
  'TREA': 2.9763853585651442e-06,
  'TRESPASS': 0.009342873640535987,
  'VANDALISM': 0.0416366547809678,
  'VEHICLE THEFT': 0.029853145146408397,
  'WARRANTS': 0.05685193673395282,
  'WEAPON LAWS': 0.007015340290138045},
 'TARAVAL': {'ARSON': 0.0020673927406807476,
  'ASSAULT': 0.08227654752516748,
  'BAD CHECKS': 0.0006536087015210612,
  'BRIBERY': 0.00027707325390566724,
  'BURGLARY': 0.05270075378133947,
  'DISORDERLY CONDUCT': 0.0023515704369942523,
  'DRIVING UNDER THE INFLUENCE': 0.0034172367981698958,
  'DRUG/NARCOTIC': 0.022258218063755265,
  'DRUNKENNESS': 0.004127681038953657,
  'EMBEZZLEMENT': 0.0014919329056459004,
  'EXTORTION': 0.0006891309135602492,
  'FAMILY OFFENSES': 0.000618086489481873,
  'FORGERY/COUNTERFEITING': 0.015551624430756552,
  'FRAUD': 0.023764359854216842,
  'GAMBLING': 0.00012787996334107717,
  'KIDNAPPING': 0.0029341347144369373,
  'LARCENY/THEFT': 0.18472971148859382,
  'LIQUOR LAWS': 0.0017192750626967043,
  'LOITERING': 0.0004902065261407958,
  'MISSING PERSON': 0.04983766349098091,
  'NON-CRIMINAL': 0.10807277790802589,
  'OTHER OFFENSES': 0.131737675568533,
  'PORNOGRAPHY/OBSCENE MAT': 3.5522212039188104e-05,
  'PROSTITUTION': 0.0017547972747358925,
  'RECOVERED VEHICLE': 0.003893234439495016,
  'ROBBERY': 0.021284909453881513,
  'RUNAWAY': 0.005797225004795499,
  'SECONDARY CODES': 0.015622668854834928,
  'SEX OFFENSES, FORCIBLE': 0.005548569520521182,
  'SEX OFFENSES, NON FORCIBLE': 0.00025575992668215434,
  'STOLEN PROPERTY': 0.0038221900154166402,
  'SUICIDE': 0.0009164730706110531,
  'SUSPICIOUS OCC': 0.04610072678445832,
  'TREA': 1.4208884815675242e-05,
  'TRESPASS': 0.0060600893738854906,
  'VANDALISM': 0.07208877711232833,
  'VEHICLE THEFT': 0.09173256036999936,
  'WARRANTS': 0.024794504003353295,
  'WEAPON LAWS': 0.008383242041248393},
 'TENDERLOIN': {'ARSON': 0.0007607188207687941,
  'ASSAULT': 0.0960787870630987,
  'BAD CHECKS': 0.00022236396299395522,
  'BRIBERY': 0.0001989572300472231,
  'BURGLARY': 0.018649314475308824,
  'DISORDERLY CONDUCT': 0.009456320110479779,
  'DRIVING UNDER THE INFLUENCE': 0.0013224804114903652,
  'DRUG/NARCOTIC': 0.21151494227314488,
  'DRUNKENNESS': 0.005383548577748389,
  'EMBEZZLEMENT': 0.0013809972438571954,
  'EXTORTION': 0.00015214376415375883,
  'FAMILY OFFENSES': 0.0003862110936210801,
  'FORGERY/COUNTERFEITING': 0.006823062653972415,
  'FRAUD': 0.013921154420068933,
  'GAMBLING': 0.00016969881386380792,
  'KIDNAPPING': 0.0026391091397440472,
  'LARCENY/THEFT': 0.12360510500845569,
  'LIQUOR LAWS': 0.002387486760566677,
  'LOITERING': 0.00206564418254911,
  'MISSING PERSON': 0.010948499335833953,
  'NON-CRIMINAL': 0.09305931851297025,
  'OTHER OFFENSES': 0.16441474390108315,
  'PORNOGRAPHY/OBSCENE MAT': 1.7555049710049096e-05,
  'PROSTITUTION': 0.011557074392448989,
  'RECOVERED VEHICLE': 0.0016326196230345659,
  'ROBBERY': 0.027104996752315803,
  'RUNAWAY': 9.362693178692851e-05,
  'SECONDARY CODES': 0.009099367433042115,
  'SEX OFFENSES, FORCIBLE': 0.00382114915355402,
  'SEX OFFENSES, NON FORCIBLE': 5.266514913014729e-05,
  'STOLEN PROPERTY': 0.004371207377802225,
  'SUICIDE': 0.00042717287627786133,
  'SUSPICIOUS OCC': 0.03190337700639589,
  'TRESPASS': 0.01403233640156591,
  'VANDALISM': 0.020252675682159972,
  'VEHICLE THEFT': 0.012276831430561,
  'WARRANTS': 0.08769332498493192,
  'WEAPON LAWS': 0.010123411999461646}}

Now we look at the ratio P(crime|district)/P(crime). That ratio is equal to 1 if the crime occurs at the same level within a district as in the city as a whole. If it's greater than one, it means that the crime occurs more frequently within that district. If it's smaller than one, it means that the crime is rarer within the district in question than in the city as a whole.
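For example, using the outputs above for Mission:

p_crime_district['MISSION']['PROSTITUTION'] / p_crime['PROSTITUTION']  # ~3.37

so prostitution is reported more than three times as often in Mission as in the city as a whole.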

For each district plot these ratios for the 14 focus crimes. My plot looks like this


In [110]:
import pandas as pd
%matplotlib inline

# Builds a Series of P(crime|district)/P(crime) ratios
# for the 14 focus crimes in a single district
def calcRatioForDistrict(districtCounter, overAllCounter):
    ys = []
    xs = []
    for fc in focuscrimes:
        ys.append(districtCounter[fc] / overAllCounter[fc])
        xs.append(fc)
    return pd.Series(ys, index=xs)

series = {}
for district in PdDistrictSet:
    series[district] = calcRatioForDistrict(p_crime_district[district], p_crime)

df = pd.DataFrame(series)
df.plot(kind="bar", subplots=True, figsize=(14,14),layout=(7,2), legend=False,sharey=True)


Out[110]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000000060E1D080>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000015D6DDA0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000000572B53C8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000025E5F940>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000000000A5C90F0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000010934B70>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000000083D726D8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000004A88CDA0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000000007BCE2550>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000000BCF9E10>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000000008243EA58>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000010A92E860>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000000000C6C98D0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000000F79CDC50>]], dtype=object)

Comment on the top crimes in Tenderloin, Mission, and Richmond. Does this fit with the impression you get of these neighborhoods on Wikipedia?

Tenderloin

The top crime in Tenderloin is by far DRUG/NARCOTIC, as illustrated above. Looking through articles and Wikipedia, this does seem to be representative of current and past activity in the area.

From Wikipedia: "The first block of Turk Street, between Taylor and Mason, had one of the highest rates of violence and drug activity in San Francisco, according to a survey conducted by the Tenderloin Housing Clinic."

Mission

Prostitution is the top crime in Mission by some margin. Wikipedia does not give any indication about prostitution, or crime in general, for the district.

Doing a leading search on Google, there do seem to be articles mentioning that prostitution is on the uptick recently.

Richmond

Richmond has the highest ratio of crimes related to DRIVING UNDER THE INFLUENCE (DUI) in San Francisco. Oddly, both DRUNKENNESS and DRUG/NARCOTIC are very low in the area.

We were not able to find any information that backs these statistics.

Even though we only plotted the ratios for our 14 focus crimes, I asked you to calculate the ratios based on all crime categories. Why do you think I wanted to include all crime types in the calculation?

If the other crimes were not included, P(crime) would be normalized over the focus crimes alone, so the focus crime ratios would inherently become larger. Including all crime types keeps P(crime) an honest city-wide baseline for the ratio.

Assignment 1C: KNN

Begin by using geoplotlib to plot all incidents of the three crime types on their own map using geoplotlib.kde(). This will give you an idea of how the various crimes are distributed across the city.

Here we define the new focus crimes. Then we create a dictionary containing all the incidents of the three specific focus crimes, so that e.g. under the key PROSTITUTION the value is a list of all incidents of that crime.


In [111]:
focusCrimes = ['PROSTITUTION', 'DRUG/NARCOTIC', 'DRIVING UNDER THE INFLUENCE']

filteredCrimes = {}
for f in focusCrimes:
    filteredCrimes[f] = filter(lambda c: c['Category'] == f and float(c['Y']) < 90.0, data) # We remove some crimes located at the north pole

Then we create a dictionary that contains two lists, with all the latitudes and longitudes. This is the structure that geoplotlib accepts. We also construct a bounding box for San Francisco so that we have a nice view box.


In [112]:
from geoplotlib.utils import BoundingBox

def createLongAndLatList(data):
    return {'lat': [float(d['Y']) for d in data], 'lon': [float(d['X']) for d in data]}

# San Francisco
bbox_SF = BoundingBox(north=37.83, west=-122.52, south=37.67, east=-122.35)

In [113]:
import geoplotlib as glp
from geoplotlib.utils import BoundingBox, DataAccessObject

for fc in focusCrimes:
    print "Plotting " + fc
    glp.kde(createLongAndLatList(filteredCrimes[fc]), bw=5)
    glp.set_bbox(bbox_SF)
    glp.inline(width=900)


Plotting PROSTITUTION
('smallest non-zero count', 1.4329573088768091e-09)
('max count:', 17.065636267483391)
Plotting DRUG/NARCOTIC
('smallest non-zero count', 7.1647865443840454e-10)
('max count:', 101.94473147756155)
Plotting DRIVING UNDER THE INFLUENCE
('smallest non-zero count', 7.1647865443840454e-10)
('max count:', 0.71292404868788073)

PROSTITUTION

DRUG/NARCOTIC

DRIVING UNDER THE INFLUENCE

Next, it's time to set up your model based on the actual data. You can use the code supplied in the book or try out scikit-learn's KNeighborsClassifier. If you end up using the latter (recommended), you may want to check out this example to get a sense of the usage.


In [114]:
import numpy as np
import pylab as pl
from sklearn import neighbors, datasets



# Builds a list that repeats the value v, n times (equivalent to [v] * n)
def times(n, v):
    return [v for _ in range(0, n)]

In [115]:
%matplotlib inline

def createKNNPlot(X,Y,h):
    knn = neighbors.KNeighborsClassifier()
    # We create an instance of the Neighbours Classifier and fit the data.
    knn.fit(X, Y)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:,0].min() - 0.02, X[:,0].max() + 0.02
    y_min, y_max = X[:,1].min() - 0.02, X[:,1].max() + 0.02
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    
    pl.figure(1, figsize=(15, 15))
    pl.set_cmap(pl.cm.Paired)
    pl.pcolormesh(xx, yy, Z, cmap='PuBuGn')

    # Plot also the training points
    pl.scatter(X[:,0], X[:,1], c=Y, cmap='PuBuGn')
    pl.xlabel('Latitude')  # note: X[:,0] holds the latitudes
    pl.ylabel('Longitude')

    pl.xlim(xx.min(), xx.max())
    pl.ylim(yy.min(), yy.max())
    pl.xticks(())
    pl.yticks(())
    
    pl.show()

In [116]:
allLats = []
allLons = []
Y = []

for fc in focusCrimes:
    print "We have %d number of %s crimes" % (len(filteredCrimes[fc]), fc)
    ll = createLongAndLatList(filteredCrimes[fc])
    allLats.extend(ll['lat'])
    allLons.extend(ll['lon'])
    Y.extend(times(len(ll['lon']), focusCrimes.index(fc)))

    
X = np.array([[allLats[i],allLons[i]] for i in range(0, len(allLons))])

Y = np.array(Y)

h = .001 # step size in the mesh

# Call the magic method
createKNNPlot(X,Y,h)


We have 15561 PROSTITUTION crimes
We have 110946 DRUG/NARCOTIC crimes
We have 4918 DRIVING UNDER THE INFLUENCE crimes

In the plot

`DRUG/NARCOTIC` is shown as cyan

`PROSTITUTION` is shown as white

`DRIVING UNDER THE INFLUENCE` is shown as green



As the KNN plot above shows, the focus crime represented in cyan, DRUG/NARCOTIC, is the dominant focus crime in SF, with over 100,000 incidents compared to the others.

The distribution of the different crimes should not affect the KNN plot in any negative way. In principle, this distribution just reflects the real-world situation.

You can make the dataset 'balanced' by grabbing an equal number of examples from each crime category. How do you expect that will change the KNN result? In which situations is the balanced map useful - and when is the map that shows data in proportion to occurrences useful? Choose which map you will work on in the following.

As DRUG/NARCOTIC was over-represented in the last map, it resulted in an overall cyan map. However, when we balance the dataset, we expect this to change. The result will be a map which we think will be divided into more zones, where we may be able to see the spread of all three crimes, as DRUG/NARCOTIC will not dominate as it did in the previous map.
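A random subsample per category would be an alternative to the age-based cut used below; a sketch using random.sample, where balance is the size of the smallest category as computed in the next cell and balancedCrimes is a hypothetical name:

import random

balancedCrimes = {fc: random.sample(filteredCrimes[fc], balance) for fc in focusCrimes}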


In [117]:
allLats = []
allLons = []
Y = []

# The size of the smallest category determines the balanced sample size
balance = min(len(filteredCrimes[fc]) for fc in focusCrimes)
        
for fc in focusCrimes:
    print "We have %d number of %s crimes" % (len(filteredCrimes[fc][:balance]), fc)
    ll = createLongAndLatList(filteredCrimes[fc])
    allLats.extend(ll['lat'][:balance])
    allLons.extend(ll['lon'][:balance])
    Y.extend(times(len(ll['lon'][:balance]), focusCrimes.index(fc)))

    
X = np.array([[allLats[i],allLons[i]] for i in range(0, len(allLons))])

Y = np.array(Y)

h = .001 # step size in the mesh

# Call the magic method
createKNNPlot(X,Y,h)


We have 4918 PROSTITUTION crimes
We have 4918 DRUG/NARCOTIC crimes
We have 4918 DRIVING UNDER THE INFLUENCE crimes

Above is our result of plotting the balanced dataset. Note that we balanced by the age of the data, meaning that we only took data from Jan 2003 and 4918 records forward for each of the respective crime categories. This of course means that if we balanced differently, the map could very well look a lot different.

What we can see from the map is that DRUG/NARCOTIC is actually quite centered around Tenderloin, whereas DRIVING UNDER THE INFLUENCE is far more distributed throughout the city.

As predicted, this map gave us a better understanding of how the different crime incidents are distributed across the city.

Now create an approximately square grid of points that runs over SF. You get to decide the grid-size, but I recommend somewhere between 50×50 and 100×100 points. I recommend plotting using geoplotlib.dot().


In [118]:
import geoplotlib as gpl

# We define a set of colors for the different focus crimes, where r is red, b is blue and g is green
colorDic = {'PROSTITUTION': 'r', 'DRUG/NARCOTIC': 'b', 'DRIVING UNDER THE INFLUENCE': 'g'}

First we define a function that helps us create a list of 100 steps between the min and max values provided. We then use a set of latitudes and longitudes that bound San Francisco: the same values that we used in the bounding box in an earlier assignment.

Lastly we plot the grid as a dot plot with geoplotlib, and we can see below how the grid fits nicely over San Francisco.


In [119]:
import geoplotlib as gpl
from __future__ import division

# Returns 100 evenly spaced values starting at mi, stepping towards ma (ma excluded)
def createDots(mi, ma):
    step = abs((ma - mi) / 100)
    return [(step * v) + mi for v in range(0, 100)]

gridDots = {'lat': [], 'lon': []}

for lon in createDots(-122.52, -122.35):
    for lat in createDots(37.67, 37.82):
        gridDots['lat'].append(lat)
        gridDots['lon'].append(lon)



gpl.dot(gridDots, color='w')
gpl.inline(width=900)


Grid over SF
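As an aside, numpy could produce the same spacing in one call; a sketch:

np.linspace(-122.52, -122.35, 100, endpoint=False)  # same 100 longitude steps as createDots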

Visualize your model by coloring the grid, coloring each grid point according to its category. Create a plot of this kind for models where each point is colored according to the majority of its 5, 10, and 30 nearest neighbors. Describe what happens to the map as you increase the number of neighbors, K.

We begin by pulling some helper functions in from the book. These will help us construct the map.

BELOW IS A SET OF FUNCTIONS FOUND IN THE BOOK (+RELATED GITHUB REPO)


In [120]:
from __future__ import division
from collections import Counter
import re, math, random


def vector_add(v, w):
    """adds two vectors componentwise"""
    return [v_i + w_i for v_i, w_i in zip(v,w)]

def vector_subtract(v, w):
    """subtracts two vectors componentwise"""
    return [v_i - w_i for v_i, w_i in zip(v,w)]

def vector_sum(vectors):
    return reduce(vector_add, vectors)

def scalar_multiply(c, v):
    return [c * v_i for v_i in v]

# this isn't right if you don't from __future__ import division
def vector_mean(vectors):
    """compute the vector whose i-th element is the mean of the
    i-th elements of the input vectors"""
    n = len(vectors)
    return scalar_multiply(1/n, vector_sum(vectors))

def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sum_of_squares(v):
    """v_1 * v_1 + ... + v_n * v_n"""
    return dot(v, v)

def squared_distance(v, w):
    return sum_of_squares(vector_subtract(v, w))

def distance(v, w):
    return math.sqrt(squared_distance(v, w))


## From the book

def majority_vote(labels):
    """assumes that labels are ordered from nearest to farthest"""
    vote_counts = Counter(labels)
    winner, winner_count = vote_counts.most_common(1)[0]
    num_winners = len([count 
                       for count in vote_counts.values()
                       if count == winner_count])

    if num_winners == 1:
        return winner                     # unique winner, so return it
    else:
        return majority_vote(labels[:-1]) # try again without the farthest

## From the book
def knn_classify(k, labeled_points, new_point):
    """each labeled point should be a pair (point, label)"""
    
    # order the labeled points from nearest to farthest
    by_distance = sorted(labeled_points, key=lambda (point, _): distance(point, new_point))

    # find the labels for the k closest
    k_nearest_labels = [label for _, label in by_distance[:k]]

    # and let them vote
    return majority_vote(k_nearest_labels)
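As a toy check of knn_classify on made-up points:

labeled = [([0.0, 0.0], 'a'), ([0.0, 0.1], 'a'), ([1.0, 1.0], 'b')]
knn_classify(2, labeled, [0.0, 0.05])  # -> 'a'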

Below is our code

Now we can create a function that, for a given K, predicts the crime at every grid point based on its K nearest neighbours, and plots the colored grid.


In [121]:
import geoplotlib as gpl


def plotKnnForGrid(k):
    plots = {
        "PROSTITUTION" : {'lat': [], 'lon':[]},
        "DRUG/NARCOTIC" : {'lat': [], 'lon':[]},
        "DRIVING UNDER THE INFLUENCE" : {'lat': [], 'lon':[]} }


    crimes = []

    # Build the labeled training points; only the first 20 incidents per
    # category are used, keeping the pure-Python KNN over the grid fast
    for fc in focusCrimes:
        for c in filteredCrimes[fc][:20]:
            crimes.append(([float(c['X']), float(c['Y'])], c['Category']))


    for lon in createDots(-122.52, -122.35):
        for lat in createDots(37.67, 37.82):
            predicted_crime = knn_classify(k, crimes, [lon, lat])
            plots[predicted_crime]['lon'].append(lon)
            plots[predicted_crime]['lat'].append(lat)

    for fc in focusCrimes:
        gpl.dot(plots[fc], color=colorDic[fc])

    gpl.inline(width=900)

In [122]:
K = [5,10,30]

for k in K:
    print "For majority of its k=%d nearest neighbours" % k
    plotKnnForGrid(k)


For majority of its k=5 nearest neighbours
For majority of its k=10 nearest neighbours
For majority of its k=30 nearest neighbours

K=5

K=10

K=30

As you increase K, the number of nearest neighbors we look at, you get much less granularity, for lack of a better word.

As an example, the green crime, DRIVING UNDER THE INFLUENCE (DUI), seems to make the biggest change going from K=5 to K=30. DUI crimes are much more spread out in SF compared to DRUG/NARCOTIC and PROSTITUTION, which are generally very concentrated in certain areas.

Because of this, when we only look at K=5, DUI is favoured because of its geographical spread. But when looking at K=30, any given point's nearest neighbors are more likely to be DRUG/NARCOTIC because of its density and overall amount compared to the aforementioned crimes.

Assignment 1D: Multiple regression and the Red Baron


In [123]:
from csv import reader

def parseCsv(row):
    return {
        'IncidntNum': row[0],
        'Category': row[1],
        'Descript': row[2],
        'DayOfWeek': row[3],
        'Date': row[4],
        'Time': row[5],
        'PdDistrict': row[6],
        'Resolution': row[7],
        'Address': row[8],
        'X': row[9],
        'Y': row[10],
        'Location': row[11],
        'PdId': row[12]
    }

with open('../x-files.csv', 'r') as infile:    
    data_all = [parseCsv(line) for line in list(reader(infile))[1:]]
    
data_rb = [line for line in data_all if line['Resolution'] == 'RED BARON']

print "Total count of incidents within the category of " + data_all[0]['Category'] + ": " + str(len(data_all))
print "Total count of incidents from Red Baron: " + str(len(data_rb))


Total count of incidents within the category of INDECENT EXPOSURE: 10000
Total count of incidents from Red Baron: 4985

In [124]:
import time

# Extracts the hour from a 'HH:MM' time string
# (the parameter is named time_str so it does not shadow the time module)
def get_hour(time_str):
    return int(time_str.split(':')[0])

# Converts a weekday name to a number: Monday -> 1, ..., Sunday -> 7
def convert_day_to_num(day):
    return time.strptime(day, "%A").tm_wday + 1
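A quick sanity check of the two helpers:

convert_day_to_num('Wednesday')  # -> 3
get_hour('18:45')  # -> 18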

Start from all cases having Red Baron in the resolution field and use the day of the week to predict the hour of the day when he is attacking, e.g. use linear regression to infer the hour of the day based on the weekday! Again, take 4/5 of the data for training and then calculate goodness of fit using $R^2$ on the rest 1/5. Don't forget to rescale your input variables! (Note 1: My goodness of fit after using the weekdays is only around 0.618). (Note 2: For multivariate regression, as always you can simply re-use the code in the DSFS book (Chapters 14-15) or scikit-learn).

We begin by dividing our filtered dataset into two sets: a training set and a test set. We then construct the x and y values, fit the linear regression model on the training set, and evaluate it on the test set. This results in the R squared value and the plot seen in the output below.


In [125]:
from __future__ import division
from sklearn import linear_model
%matplotlib inline

training_rb = data_rb[:int(len(data_rb)*(4/5))]
test_rb = data_rb[int(len(data_rb)*(4/5)):]

data_train_xs = [[convert_day_to_num(d["DayOfWeek"])] for d in training_rb]

data_train_ys = [[get_hour(d["Time"])] for d in training_rb]

data_test_xs = [[convert_day_to_num(d["DayOfWeek"])] for d in test_rb]

data_test_ys = [[get_hour(d["Time"])] for d in test_rb]



regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(data_train_xs, data_train_ys)

# R squared coefficient
print 'R squared coefficient: %.3f' % regr.score(data_test_xs, data_test_ys)


# Plot outputs
plt.scatter(data_test_xs, data_test_ys,  color='black')
plt.plot(data_test_xs, regr.predict(data_test_xs), color='blue', linewidth=3)


plt.show()


R squared coefficient: 0.618
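The assignment also says to rescale the input variables. For ordinary least squares the R squared score is unchanged by an affine rescaling of the features, so we omit it above; a sketch of how it could look with scikit-learn's StandardScaler (not applied in our runs):

from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(data_train_xs)
regr.fit(scaler.transform(data_train_xs), data_train_ys)
print 'R squared coefficient: %.3f' % regr.score(scaler.transform(data_test_xs), data_test_ys)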

Now, add the crime year as well to the input variables! Did the goodness of fit improve? (Note: Mine did to 0.809)


In [126]:
# Extracts the year, i.e. the last '/'-separated component of the Date field
def getYear(d):
    return int(d['Date'].split('/')[-1])

As we need to add another feature to the model, we append the year to each of the x vectors. This greatly improves the R squared value, and we can now see that the model fits much better.


In [127]:
from __future__ import division
%matplotlib inline

training_rb = data_rb[:int(len(data_rb)*(4/5))]
test_rb = data_rb[int(len(data_rb)*(4/5)):]

data_train_xs = [[convert_day_to_num(d["DayOfWeek"]), getYear(d)] for d in training_rb]

data_train_ys = [[get_hour(d["Time"])] for d in training_rb]

data_test_xs = [[convert_day_to_num(d["DayOfWeek"]), getYear(d)] for d in test_rb]

data_test_ys = [[get_hour(d["Time"])] for d in test_rb]



regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(data_train_xs, data_train_ys)

# R squared coefficient
print 'R squared coefficient: %.4f' % regr.score(data_test_xs, data_test_ys)


R squared coefficient: 0.8004

It is still low. Inspired by a movie he once watched, Chief Suneman yells: "Let's add the longitude of the crimes as well!" Is your prediction getting better? (It should, to around 0.993)

As Chief Suneman yells, we add the longitude to the model as well. As he predicted, the R squared value did in fact improve. This shows that not everything seen in movies is just special effects.


In [128]:
from __future__ import division
%matplotlib inline

training_rb = data_rb[:int(len(data_rb)*(4/5))]
test_rb = data_rb[int(len(data_rb)*(4/5)):]

data_train_xs = [[convert_day_to_num(d["DayOfWeek"]), getYear(d), float(d['X'])] for d in training_rb]

data_train_ys = [[get_hour(d["Time"])] for d in training_rb]

data_test_xs = [[convert_day_to_num(d["DayOfWeek"]), getYear(d), float(d['X'])] for d in test_rb]

data_test_ys = [[get_hour(d["Time"])] for d in test_rb]



regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(data_train_xs, data_train_ys)

# R squared coefficient
print 'R squared coefficient: %.4f' % regr.score(data_test_xs, data_test_ys)


R squared coefficient: 0.9918

Very nice! Why not add latitude as well? What do you find now?

Even though we also added the latitude to the model, the goodness of fit did not improve further, suggesting that in this constructed dataset the latitude carries no additional information about the hour of the attack.


In [129]:
from __future__ import division
%matplotlib inline

training_rb = data_rb[:int(len(data_rb)*(4/5))]
test_rb = data_rb[int(len(data_rb)*(4/5)):]

data_train_xs = [[convert_day_to_num(d["DayOfWeek"]), getYear(d), float(d['X']), float(d['Y'])] for d in training_rb]

data_train_ys = [[get_hour(d["Time"])] for d in training_rb]

data_test_xs = [[convert_day_to_num(d["DayOfWeek"]), getYear(d), float(d['X']), float(d['Y'])] for d in test_rb]

data_test_ys = [[get_hour(d["Time"])] for d in test_rb]



regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(data_train_xs, data_train_ys)

# R squared coefficient
print 'R squared coefficient: %.4f' % regr.score(data_test_xs, data_test_ys)


R squared coefficient: 0.9918