Assignment 1

Assignment 1A: Anscombe's quartet

Using the numpy function mean, calculate the mean of both $x$-values and $y$-values for each dataset.

We start out by downloading the data data to memory. Afterwards we transform the data data_t to a more appropriate format we can work with in Python.


In [96]:
import urllib2
import numpy as np

# Downloads data given a url using urllib2
def getTsv(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read()

# Transforms the input data from a tabular seperated 
# file and casts the values to floats.
def transform(data):
    return np.array([[float(x) for x in line.split('\t')] for line in data.split('\n')])
    

data = [
    getTsv('https://dl.dropboxusercontent.com/u/153071/teaching/02806_2016/data1.tsv'),
    getTsv('https://dl.dropboxusercontent.com/u/153071/teaching/02806_2016/data2.tsv'),
    getTsv('https://dl.dropboxusercontent.com/u/153071/teaching/02806_2016/data3.tsv'),
    getTsv('https://dl.dropboxusercontent.com/u/153071/teaching/02806_2016/data4.tsv')
    ]

data_t = [transform(d) for d in data]

data_t now represents 4 lists containing a list of two elements (x,y) for each row of the tsv files.

Next, we calculate all the mean values. We use numpy as it has an easy way to calculate the mean of values in a numpy array. More specifically we can call a function on each of the four lists getting the mean across indexes.

See the documentation here for more details.


In [97]:
import numpy as np

# Calculates the mean values of a lists sublists/tuples 
# on the axis 0, i.e. up/down (axis=1, right,left)

means = [dt.mean(axis=0) for dt in data_t]

Use python string formatting to print precisely two decimal places of these results to the output cell. Check out this stackoverflow page for help with the string formatting.


In [98]:
for m in means:
    print "Mean: {:10.2f},{:10.2f}".format(*m)


Mean:       9.00,      7.50
Mean:       9.00,      7.50
Mean:       9.00,      7.50
Mean:       9.00,      7.50

Now calculate the variance for all of the various sets of x- and y-values (to three decimal places).

As we did with the means we can calculate the variance in x- and y-values the same way for each dataset.


In [99]:
variance = [dt.var(axis=0) for dt in data_t]

for v in variance:
    print "Variance: {:10.3f},{:10.3f}".format(*v)


Variance:     10.000,     3.752
Variance:     10.000,     3.752
Variance:     10.000,     3.748
Variance:     10.000,     3.748

Use numpy to calculate the Pearson correlation between $x$- and $y$-values for all four data sets (also to three decimal places).

As before we use a list comprehension on the transformed data data_t calling a simple function that calculates the pearson correlation on each of the 4 datasets.


In [100]:
# Calculates the pearson correlation for 
# each dataset
def calc_pearson(data):
    data_xs = data[:,0] # all xs across sublists/tuples
    data_ys = data[:,1] # all xs across sublists/tuples
    return np.corrcoef(data_xs, data_ys)[0][1]
    
pears = [calc_pearson(x) for x in data_t]

for pear in pears:
    print "Pearson: {:10.3f}".format(pear)


Pearson:      0.816
Pearson:      0.816
Pearson:      0.816
Pearson:      0.817

The next step is use linear regression to fit a straight line $f(x) = a x + b$ through each dataset and report $a$ and $b$ (to two decimal places). An easy way to fit a straight line in Python is using scipy's linregress.

Before we start plotting, lets first generate the straigt lines needed for each dataset.

Because we use a single list comprehension for all datasets we return a dictionary from lin_reg() for easy lookup later on.


In [101]:
from scipy import stats

# Calculates the linear regression for each dataset
# and returns a dictionary containing lin reg values
def lin_reg(data):
    data_xs = data[:,0]
    data_ys = data[:,1]
    slope, intercept, r_value, p_value, std_err = stats.linregress(data_xs,data_ys)
    
    return {'slope': slope, 'intercept': intercept, 'r_value': r_value, 'p_value': p_value, 'std_err': std_err}

lin_regs = [lin_reg(x) for x in data_t]

Finally, it's time to plot the four datasets using matplotlib.pyplot. Use a two-by-two subplot to put all of the plots nicely in a grid and use the same $x$ and $y$ range for all four plots. And include the linear fit in all four plots. (To get a sense of what I think the plot should look like, you can take a look at my version here.)

Now that the regressions have been calculated, a plot for each dataset can be made.

Because the datasets are kept in a single list we can use a for loop to generate each subplot.


In [102]:
import matplotlib.pyplot as plt
%matplotlib inline

for i in range(0, len(data_t)):
    d = data_t[i] # Get the individual dataset
    xs = d[:,0]
    ys = d[:,1]
    predict_y = lin_regs[i]["intercept"] + lin_regs[i]["slope"] * xs # Straigt line function

    plt.subplot(221+i) # 2 rows, 2 columns, current plot + i (starts at zero)
    plt.scatter(xs,ys)
    plt.plot(xs,predict_y, 'r')
    plt.xlim(xmin=0, xmax=20) # Standadize x axis min
    plt.ylim(ymin=2, ymax=14) # Standadize y axis min
    
plt.show()


Explain - in your own words - what you think my point with this exercise is.

Other than learning how to, extract and transform data, use numpy and matlibplot, we think it is fair to say that up until the plotting the data looks very similiar. And indeed they do on the mean, variance, pearson correlation which we have calculated.

However, looking at the data plots it is easy to see how different each dataset really is. Visualization is a really important factor to help understand your data better.

Assignment 1B: Slicing data

We'll be combining information about PdDistrict and Category to explore differences between SF's neighborhoods. First, simply list the names of SF's 10 police districts.

We begin by loading the cvs file. Containing all the crime records, and parse it into a dictonary, for easy access.

Note that we ignore any crime that do not have a district.


In [103]:
from csv import reader

def parseCrimeCsv(row):
    return {
        'IncidntNum': line[0],
        'Category': line[1],
        'Descript': line[2],
        'DayOfWeek': line[3],
        'Date': line[4],
        'Time': line[5],
        'PdDistrict': line[6],
        'Resolution': line[7],
        'Address': line[8],
        'X': line[9],
        'Y': line[10],
        'Location': line[11],
        'PdId': line[12]
    }

with open('../SFPD_Incidents_-_from_1_January_2003.csv', 'r') as infile:    
    data = [parseCrimeCsv(line) for line in list(reader(infile))[1:] if not line[6] == '']

Now that the file have been loaded into memory we can extract all the districts. We do this by taking value PdDistrict of all records, and insert them into a set.


In [104]:
PdDistrictSet = set([i['PdDistrict'] for i in data])
PdDistrictSet


Out[104]:
{'BAYVIEW',
 'CENTRAL',
 'INGLESIDE',
 'MISSION',
 'NORTHERN',
 'PARK',
 'RICHMOND',
 'SOUTHERN',
 'TARAVAL',
 'TENDERLOIN'}

Which has the most crimes? Which has the most focus crimes?

If we being finding the district with the most overall crimes. We begin with creating a Counter object, were we input the districts of the dataset. This object, helps us counting the occorences of the given values.

Lastly we call the most_common(1) on the Counter to extract the district with the most crimes overall.


In [105]:
from collections import Counter

DistrictCrimeCounter = Counter([d['PdDistrict'] for d in data])

DistrictCrimeCounter.most_common(1)


Out[105]:
[('SOUTHERN', 335978)]

By looking at hthe output above, we can se that Southern is the district with the most crimes overall with a total count of 335978 crimes from Jan 2003 to Jan 2016.

Which has the most focus crimes? To do this we first define the focus crimes, these are taken from the exercises from week 3:


In [106]:
focuscrimes = set([
        'WEAPON LAWS',
        'PROSTITUTION',
        'DRIVING UNDER THE INFLUENCE',
        'ROBBERY',
        'BURGLARY',
        'ASSAULT',
        'DRUNKENNESS',
        'DRUG/NARCOTIC',
        'TRESPASS',
        'LARCENY/THEFT',
        'VANDALISM',
        'VEHICLE THEFT',
        'STOLEN PROPERTY',
        'DISORDERLY CONDUCT'
    ])

Now that we have defined the focus crimes, we can filter our dataset. So that we get a new data set that only contains the records for these crimes. We do this by utilizing the build in filter function.

Then we do the same operatios a above, to get the district with the most focus crimes.


In [107]:
data_filtered_for_focus_crimes = filter(lambda c: c['Category'] in list(focuscrimes), data)

       
most_common_focus_crime_by_dist = Counter([d['PdDistrict'] for d in data_filtered_for_focus_crimes])
most_common_focus_crime_by_dist.most_common(1)


Out[107]:
[('SOUTHERN', 191521)]

We see that it is again the district south that have the most focus crimes, just as it also had with all crimes. However this time around we are down to 191521 crimes commited between Jan 2003 and Jan 2016.

Next, we want to generate a slightly more complicated graphic. I'm interested to know if there are certain crimes that happen much more in certain neighborhoods than what's typical. Below I describe how to get that plot going

First, we need to calculate the relative probabilities of seeing each type of crime in the dataset as a whole. That's simply a normalized version of this plot. Let's call it P(crime).

Firstly we have to create a Counter for the categories. Then we can use list comprehention, over the output of most common, to create a nicely structred dataset.


In [108]:
from __future__ import division

# DistrictCrimeCounter

CatetoryCrimeCounter = Counter([d['Category'] for d in data])

p_crime = {}
for k,v in CatetoryCrimeCounter.most_common():
    p_crime[k] = v/len(data)

p_crime


Out[108]:
{'ARSON': 0.0017349745013750567,
 'ASSAULT': 0.0875342429177903,
 'BAD CHECKS': 0.0004571062398205751,
 'BRIBERY': 0.0003551117400475262,
 'BURGLARY': 0.04178837476303634,
 'DISORDERLY CONDUCT': 0.004770779376818946,
 'DRIVING UNDER THE INFLUENCE': 0.002626224868501856,
 'DRUG/NARCOTIC': 0.05924919232104237,
 'DRUNKENNESS': 0.004795343497183136,
 'EMBEZZLEMENT': 0.0013745227351614022,
 'EXTORTION': 0.00032574159613382107,
 'FAMILY OFFENSES': 0.0005751208180920087,
 'FORGERY/COUNTERFEITING': 0.011604944864229834,
 'FRAUD': 0.019067631431394012,
 'GAMBLING': 0.00016340480068352336,
 'KIDNAPPING': 0.0025781646330067017,
 'LARCENY/THEFT': 0.20380476864336636,
 'LIQUOR LAWS': 0.0020521720556430727,
 'LOITERING': 0.0012511681307238406,
 'MISSING PERSON': 0.029413932128267428,
 'NON-CRIMINAL': 0.10625637465623582,
 'OTHER OFFENSES': 0.1426449149600833,
 'PORNOGRAPHY/OBSCENE MAT': 2.4564120364189783e-05,
 'PROSTITUTION': 0.00831068272234534,
 'RECOVERED VEHICLE': 0.0033887806050249647,
 'ROBBERY': 0.02584999866499346,
 'RUNAWAY': 0.0021253304141190294,
 'SECONDARY CODES': 0.011447948094945665,
 'SEX OFFENSES, FORCIBLE': 0.005046324726991162,
 'SEX OFFENSES, NON FORCIBLE': 0.0001767548660988439,
 'STOLEN PROPERTY': 0.005342162176594665,
 'SUICIDE': 0.0005996849384561985,
 'SUSPICIOUS OCC': 0.03587482978666595,
 'TREA': 5.340026166128214e-06,
 'TRESPASS': 0.008350732918591302,
 'VANDALISM': 0.05101914399380557,
 'VEHICLE THEFT': 0.060621045043120714,
 'WARRANTS': 0.047586575174218354,
 'WEAPON LAWS': 0.009805890048861239}

Next, we calculate that same probability distribution but for each PD district, let's call that P(crime|district).

Now we do the same, however we do it per district instead.

Our result is a dictonary, with the district as key, followed by a list, containing dictionaries again, with the crime category as key and properbility as value.


In [109]:
p_crime_district = {}
l = len(data)
for district in list(PdDistrictSet):
    counter = Counter([d['Category'] for d in data if d['PdDistrict'] == district])
    p_crime_district[district] = {}
    for k,v in counter.most_common():
            p_crime_district[district][k] = v / len(list(counter.elements()));
    
    
p_crime_district


Out[109]:
{'BAYVIEW': {'ARSON': 0.004446580305134691,
  'ASSAULT': 0.10975448750785545,
  'BAD CHECKS': 0.0003538252736864897,
  'BRIBERY': 0.0006337169080952054,
  'BURGLARY': 0.044397150386303263,
  'DISORDERLY CONDUCT': 0.002476776915805428,
  'DRIVING UNDER THE INFLUENCE': 0.002091265796714178,
  'DRUG/NARCOTIC': 0.04878035900062844,
  'DRUNKENNESS': 0.00274610660174589,
  'EMBEZZLEMENT': 0.0012938386873610444,
  'EXTORTION': 0.00020595799513094175,
  'FAMILY OFFENSES': 0.0008343939289920205,
  'FORGERY/COUNTERFEITING': 0.008502368516944006,
  'FRAUD': 0.010113065658352653,
  'GAMBLING': 0.0002693296859404623,
  'KIDNAPPING': 0.0037072439123569515,
  'LARCENY/THEFT': 0.11580120300593054,
  'LIQUOR LAWS': 0.0013255245327658046,
  'LOITERING': 0.0005016925522420376,
  'MISSING PERSON': 0.05471089306555273,
  'NON-CRIMINAL': 0.07021055244271464,
  'OTHER OFFENSES': 0.18928595947380372,
  'PORNOGRAPHY/OBSCENE MAT': 2.640487117063356e-05,
  'PROSTITUTION': 0.0006918076246705992,
  'RECOVERED VEHICLE': 0.007509545360928184,
  'ROBBERY': 0.029795256628942907,
  'RUNAWAY': 0.002724982704809383,
  'SECONDARY CODES': 0.01886892093853474,
  'SEX OFFENSES, FORCIBLE': 0.004678943171436267,
  'SEX OFFENSES, NON FORCIBLE': 0.000274610660174589,
  'STOLEN PROPERTY': 0.0044201754339640575,
  'SUICIDE': 0.00037494917062299654,
  'SUSPICIOUS OCC': 0.04386905296289059,
  'TREA': 1.5842922702380134e-05,
  'TRESPASS': 0.007182124958412328,
  'VANDALISM': 0.061502225930639684,
  'VEHICLE THEFT': 0.07777290754598408,
  'WARRANTS': 0.048743392180989546,
  'WEAPON LAWS': 0.019106564779070443},
 'CENTRAL': {'ARSON': 0.001202040219615997,
  'ASSAULT': 0.08095903316981255,
  'BAD CHECKS': 0.0006714098523981244,
  'BRIBERY': 0.0001949254410188103,
  'BURGLARY': 0.05192055705359367,
  'DISORDERLY CONDUCT': 0.00520884095166932,
  'DRIVING UNDER THE INFLUENCE': 0.0019005230499334004,
  'DRUG/NARCOTIC': 0.019719957116402977,
  'DRUNKENNESS': 0.006042688671583119,
  'EMBEZZLEMENT': 0.0021766674247100486,
  'EXTORTION': 0.0006822390435658361,
  'FAMILY OFFENSES': 0.0002869735659443596,
  'FORGERY/COUNTERFEITING': 0.012805518555819066,
  'FRAUD': 0.026753516779831715,
  'GAMBLING': 0.00033029033061520634,
  'KIDNAPPING': 0.0021441798512069135,
  'LARCENY/THEFT': 0.2985012399423887,
  'LIQUOR LAWS': 0.0013915510650509514,
  'LOITERING': 0.0007851163596590971,
  'MISSING PERSON': 0.015778131531355922,
  'NON-CRIMINAL': 0.127291727580867,
  'OTHER OFFENSES': 0.10426886715831195,
  'PORNOGRAPHY/OBSCENE MAT': 2.707297791927921e-05,
  'PROSTITUTION': 0.009242714661641923,
  'RECOVERED VEHICLE': 0.00218749661587776,
  'ROBBERY': 0.02342895509134423,
  'RUNAWAY': 0.00041692385995689983,
  'SECONDARY CODES': 0.008560475618076087,
  'SEX OFFENSES, FORCIBLE': 0.004521187312519628,
  'SEX OFFENSES, NON FORCIBLE': 5.414595583855842e-05,
  'STOLEN PROPERTY': 0.006075176245086254,
  'SUICIDE': 0.0006714098523981244,
  'SUSPICIOUS OCC': 0.03344054232589368,
  'TREA': 1.0829191167711683e-05,
  'TRESPASS': 0.010856264145630963,
  'VANDALISM': 0.05254323554573709,
  'VEHICLE THEFT': 0.04840106992408737,
  'WARRANTS': 0.032964057914514365,
  'WEAPON LAWS': 0.005582448046955373},
 'INGLESIDE': {'ARSON': 0.0023914763499646038,
  'ASSAULT': 0.11107277345818188,
  'BAD CHECKS': 0.000416426230093339,
  'BRIBERY': 0.0007733629987447724,
  'BURGLARY': 0.04158313354789199,
  'DISORDERLY CONDUCT': 0.0020285906351689797,
  'DRIVING UNDER THE INFLUENCE': 0.0026889236571741317,
  'DRUG/NARCOTIC': 0.02856089043825886,
  'DRUNKENNESS': 0.002546148949713558,
  'EMBEZZLEMENT': 0.0009518313830704891,
  'EXTORTION': 0.0003688346609398145,
  'FAMILY OFFENSES': 0.0009637292753588702,
  'FORGERY/COUNTERFEITING': 0.012040666995841687,
  'FRAUD': 0.015752809389816595,
  'GAMBLING': 0.0002082131150466695,
  'KIDNAPPING': 0.004021487593472817,
  'LARCENY/THEFT': 0.12866380720655335,
  'LIQUOR LAWS': 0.0012314318518474453,
  'LOITERING': 0.0002796004687769562,
  'MISSING PERSON': 0.040179182257863016,
  'NON-CRIMINAL': 0.08831210551050882,
  'OTHER OFFENSES': 0.16874780632610933,
  'PORNOGRAPHY/OBSCENE MAT': 2.9744730720952783e-05,
  'PROSTITUTION': 0.0003985793916607673,
  'RECOVERED VEHICLE': 0.007715783149015152,
  'ROBBERY': 0.03425403189824922,
  'RUNAWAY': 0.0019155606584293592,
  'SECONDARY CODES': 0.0170734754338269,
  'SEX OFFENSES, FORCIBLE': 0.006644972843060852,
  'SEX OFFENSES, NON FORCIBLE': 0.0003747836070840051,
  'STOLEN PROPERTY': 0.004669922723189587,
  'SUICIDE': 0.0008031077294657252,
  'SUSPICIOUS OCC': 0.03993527546595121,
  'TRESPASS': 0.0048900337305246375,
  'VANDALISM': 0.06823441227386569,
  'VEHICLE THEFT': 0.11246482685592248,
  'WARRANTS': 0.03230277756295472,
  'WEAPON LAWS': 0.014509479645680768},
 'MISSION': {'ARSON': 0.001294508599799509,
  'ASSAULT': 0.09371768661841201,
  'BAD CHECKS': 0.0003236271499498773,
  'BRIBERY': 0.0005170141054077307,
  'BURGLARY': 0.03131684676649117,
  'DISORDERLY CONDUCT': 0.009187853720528223,
  'DRIVING UNDER THE INFLUENCE': 0.002995524473316547,
  'DRUG/NARCOTIC': 0.0713084798206632,
  'DRUNKENNESS': 0.006409396238031715,
  'EMBEZZLEMENT': 0.0010142948480136398,
  'EXTORTION': 0.0002091736456993109,
  'FAMILY OFFENSES': 0.001160321732747121,
  'FORGERY/COUNTERFEITING': 0.010360015470956437,
  'FRAUD': 0.016524718010245562,
  'GAMBLING': 0.00016181357497493864,
  'KIDNAPPING': 0.002502190403271002,
  'LARCENY/THEFT': 0.15173771992832844,
  'LIQUOR LAWS': 0.0039545659054850855,
  'LOITERING': 0.001483948882696998,
  'MISSING PERSON': 0.027713534718878514,
  'NON-CRIMINAL': 0.10562479773303128,
  'OTHER OFFENSES': 0.16031778607456054,
  'PORNOGRAPHY/OBSCENE MAT': 2.368003536218614e-05,
  'PROSTITUTION': 0.028005588488345477,
  'RECOVERED VEHICLE': 0.0026363772703233905,
  'ROBBERY': 0.02990393798988073,
  'RUNAWAY': 0.0022259233240454973,
  'SECONDARY CODES': 0.011650577398195581,
  'SEX OFFENSES, FORCIBLE': 0.00601472898199528,
  'SEX OFFENSES, NON FORCIBLE': 0.000221013663380404,
  'STOLEN PROPERTY': 0.005359581336974797,
  'SUICIDE': 0.000619627591977204,
  'SUSPICIOUS OCC': 0.032603462021169954,
  'TREA': 3.9466725603643566e-06,
  'TRESPASS': 0.008564279455990654,
  'VANDALISM': 0.04462502664003978,
  'VEHICLE THEFT': 0.06077481075705073,
  'WARRANTS': 0.05559282968529233,
  'WEAPON LAWS': 0.011338790265926798},
 'NORTHERN': {'ARSON': 0.0014210061782877316,
  'ASSAULT': 0.07824801412180053,
  'BAD CHECKS': 0.0005692850838481906,
  'BRIBERY': 0.00017210944395410415,
  'BURGLARY': 0.05461165048543689,
  'DISORDERLY CONDUCT': 0.004329214474845543,
  'DRIVING UNDER THE INFLUENCE': 0.0023654015887025594,
  'DRUG/NARCOTIC': 0.04202559576345984,
  'DRUNKENNESS': 0.003212709620476611,
  'EMBEZZLEMENT': 0.0013018534863195057,
  'EXTORTION': 0.00024271844660194174,
  'FAMILY OFFENSES': 0.00022506619593998234,
  'FORGERY/COUNTERFEITING': 0.011712268314210062,
  'FRAUD': 0.019849955869373347,
  'GAMBLING': 9.70873786407767e-05,
  'KIDNAPPING': 0.0021624007060900264,
  'LARCENY/THEFT': 0.2780847308031774,
  'LIQUOR LAWS': 0.0011076787290379524,
  'LOITERING': 0.0017431597528684906,
  'MISSING PERSON': 0.018172992056487203,
  'NON-CRIMINAL': 0.09720211827007944,
  'OTHER OFFENSES': 0.11582082965578111,
  'PORNOGRAPHY/OBSCENE MAT': 2.64783759929391e-05,
  'PROSTITUTION': 0.01709179170344219,
  'RECOVERED VEHICLE': 0.0024271844660194173,
  'ROBBERY': 0.024761694616063548,
  'RUNAWAY': 0.0008473080317740512,
  'SECONDARY CODES': 0.00939982347749338,
  'SEX OFFENSES, FORCIBLE': 0.004342453662842013,
  'SEX OFFENSES, NON FORCIBLE': 0.00010150044130626655,
  'STOLEN PROPERTY': 0.006676963812886143,
  'SUICIDE': 0.000706090026478376,
  'SUSPICIOUS OCC': 0.031248896734333627,
  'TREA': 4.41306266548985e-06,
  'TRESPASS': 0.007546337157987643,
  'VANDALISM': 0.051403353927625774,
  'VEHICLE THEFT': 0.058667255075022066,
  'WARRANTS': 0.04258605472197705,
  'WEAPON LAWS': 0.007484554280670785},
 'PARK': {'ARSON': 0.0012030979772915257,
  'ASSAULT': 0.06954470260921874,
  'BAD CHECKS': 0.00028197608842770133,
  'BRIBERY': 0.00022558087074216106,
  'BURGLARY': 0.05825625986916309,
  'DISORDERLY CONDUCT': 0.005385743288969095,
  'DRIVING UNDER THE INFLUENCE': 0.0034025114670275963,
  'DRUG/NARCOTIC': 0.051075268817204304,
  'DRUNKENNESS': 0.007143394240168434,
  'EMBEZZLEMENT': 0.0009305210918114144,
  'EXTORTION': 0.0001503872471614407,
  'FAMILY OFFENSES': 0.00041356492969396195,
  'FORGERY/COUNTERFEITING': 0.009389803744642454,
  'FRAUD': 0.0190709827806602,
  'GAMBLING': 1.879840589518009e-05,
  'KIDNAPPING': 0.0016166629069854876,
  'LARCENY/THEFT': 0.1918283329573652,
  'LIQUOR LAWS': 0.002528385592901722,
  'LOITERING': 0.0003571697120084217,
  'MISSING PERSON': 0.05462816753139334,
  'NON-CRIMINAL': 0.12124971802391157,
  'OTHER OFFENSES': 0.1282145274080758,
  'PORNOGRAPHY/OBSCENE MAT': 1.879840589518009e-05,
  'PROSTITUTION': 0.00013158884126626062,
  'RECOVERED VEHICLE': 0.002293405519211971,
  'ROBBERY': 0.01857282502443793,
  'RUNAWAY': 0.009850364689074367,
  'SECONDARY CODES': 0.008957440409053313,
  'SEX OFFENSES, FORCIBLE': 0.004107451688096849,
  'SEX OFFENSES, NON FORCIBLE': 0.00011279043537108053,
  'STOLEN PROPERTY': 0.0039006692232498685,
  'SUICIDE': 0.0005169561621174524,
  'SUSPICIOUS OCC': 0.03292540792540793,
  'TRESPASS': 0.006259869163094969,
  'VANDALISM': 0.0515546281675314,
  'VEHICLE THEFT': 0.08038198360779006,
  'WARRANTS': 0.04617828408150989,
  'WEAPON LAWS': 0.007321979096172645},
 'RICHMOND': {'ARSON': 0.0021137267904509285,
  'ASSAULT': 0.07038502984084881,
  'BAD CHECKS': 0.000621684350132626,
  'BRIBERY': 0.00020722811671087534,
  'BURGLARY': 0.058365799071618034,
  'DISORDERLY CONDUCT': 0.002175895225464191,
  'DRIVING UNDER THE INFLUENCE': 0.00726334549071618,
  'DRUG/NARCOTIC': 0.020940401193633953,
  'DRUNKENNESS': 0.003284565649867374,
  'EMBEZZLEMENT': 0.0011190318302387269,
  'EXTORTION': 0.0005698773209549072,
  'FAMILY OFFENSES': 0.0005284316976127321,
  'FORGERY/COUNTERFEITING': 0.013283322281167109,
  'FRAUD': 0.02486737400530504,
  'GAMBLING': 0.0001243368700265252,
  'KIDNAPPING': 0.001999751326259947,
  'LARCENY/THEFT': 0.22721526856763927,
  'LIQUOR LAWS': 0.001823607427055703,
  'LOITERING': 0.0002486737400530504,
  'MISSING PERSON': 0.02619363395225464,
  'NON-CRIMINAL': 0.12841926392572944,
  'OTHER OFFENSES': 0.12326964522546419,
  'PORNOGRAPHY/OBSCENE MAT': 4.1445623342175064e-05,
  'PROSTITUTION': 0.0005387931034482759,
  'RECOVERED VEHICLE': 0.0027768567639257294,
  'ROBBERY': 0.017179210875331564,
  'RUNAWAY': 0.002082642572944297,
  'SECONDARY CODES': 0.011428630636604774,
  'SEX OFFENSES, FORCIBLE': 0.004662632625994695,
  'SEX OFFENSES, NON FORCIBLE': 0.00016578249336870026,
  'STOLEN PROPERTY': 0.004569379973474801,
  'SUICIDE': 0.0008185510610079575,
  'SUSPICIOUS OCC': 0.047330901856763925,
  'TRESPASS': 0.00533612400530504,
  'VANDALISM': 0.06925563660477453,
  'VEHICLE THEFT': 0.08979194297082228,
  'WARRANTS': 0.02216304708222812,
  'WEAPON LAWS': 0.006838527851458886},
 'SOUTHERN': {'ARSON': 0.0011310264362547548,
  'ASSAULT': 0.07692765597747471,
  'BAD CHECKS': 0.00048812719880468364,
  'BRIBERY': 0.0002738274529879933,
  'BURGLARY': 0.031320503128181014,
  'DISORDERLY CONDUCT': 0.003342480757668657,
  'DRIVING UNDER THE INFLUENCE': 0.0019465560245016042,
  'DRUG/NARCOTIC': 0.05639654977409235,
  'DRUNKENNESS': 0.005911101322110376,
  'EMBEZZLEMENT': 0.001672728571513611,
  'EXTORTION': 0.00024108721404377667,
  'FAMILY OFFENSES': 0.000330378774800731,
  'FORGERY/COUNTERFEITING': 0.014340224657566864,
  'FRAUD': 0.021921078165832285,
  'GAMBLING': 0.00010417348754978004,
  'KIDNAPPING': 0.002086446136354166,
  'LARCENY/THEFT': 0.27396436671448726,
  'LIQUOR LAWS': 0.0023215805796808126,
  'LOITERING': 0.0023840846722106806,
  'MISSING PERSON': 0.019906065278083685,
  'NON-CRIMINAL': 0.12540702069778378,
  'OTHER OFFENSES': 0.13278845638702533,
  'PORNOGRAPHY/OBSCENE MAT': 1.488192679282572e-05,
  'PROSTITUTION': 0.0011935305287846228,
  'RECOVERED VEHICLE': 0.001982272648804386,
  'ROBBERY': 0.0244986278863497,
  'RUNAWAY': 0.0006786158617528528,
  'SECONDARY CODES': 0.0075064438743012935,
  'SEX OFFENSES, FORCIBLE': 0.005306895094321652,
  'SEX OFFENSES, NON FORCIBLE': 0.0001607248093625178,
  'STOLEN PROPERTY': 0.006690914286054444,
  'SUICIDE': 0.00041669395019912017,
  'SUSPICIOUS OCC': 0.031641952746906045,
  'TREA': 2.9763853585651442e-06,
  'TRESPASS': 0.009342873640535987,
  'VANDALISM': 0.0416366547809678,
  'VEHICLE THEFT': 0.029853145146408397,
  'WARRANTS': 0.05685193673395282,
  'WEAPON LAWS': 0.007015340290138045},
 'TARAVAL': {'ARSON': 0.0020673927406807476,
  'ASSAULT': 0.08227654752516748,
  'BAD CHECKS': 0.0006536087015210612,
  'BRIBERY': 0.00027707325390566724,
  'BURGLARY': 0.05270075378133947,
  'DISORDERLY CONDUCT': 0.0023515704369942523,
  'DRIVING UNDER THE INFLUENCE': 0.0034172367981698958,
  'DRUG/NARCOTIC': 0.022258218063755265,
  'DRUNKENNESS': 0.004127681038953657,
  'EMBEZZLEMENT': 0.0014919329056459004,
  'EXTORTION': 0.0006891309135602492,
  'FAMILY OFFENSES': 0.000618086489481873,
  'FORGERY/COUNTERFEITING': 0.015551624430756552,
  'FRAUD': 0.023764359854216842,
  'GAMBLING': 0.00012787996334107717,
  'KIDNAPPING': 0.0029341347144369373,
  'LARCENY/THEFT': 0.18472971148859382,
  'LIQUOR LAWS': 0.0017192750626967043,
  'LOITERING': 0.0004902065261407958,
  'MISSING PERSON': 0.04983766349098091,
  'NON-CRIMINAL': 0.10807277790802589,
  'OTHER OFFENSES': 0.131737675568533,
  'PORNOGRAPHY/OBSCENE MAT': 3.5522212039188104e-05,
  'PROSTITUTION': 0.0017547972747358925,
  'RECOVERED VEHICLE': 0.003893234439495016,
  'ROBBERY': 0.021284909453881513,
  'RUNAWAY': 0.005797225004795499,
  'SECONDARY CODES': 0.015622668854834928,
  'SEX OFFENSES, FORCIBLE': 0.005548569520521182,
  'SEX OFFENSES, NON FORCIBLE': 0.00025575992668215434,
  'STOLEN PROPERTY': 0.0038221900154166402,
  'SUICIDE': 0.0009164730706110531,
  'SUSPICIOUS OCC': 0.04610072678445832,
  'TREA': 1.4208884815675242e-05,
  'TRESPASS': 0.0060600893738854906,
  'VANDALISM': 0.07208877711232833,
  'VEHICLE THEFT': 0.09173256036999936,
  'WARRANTS': 0.024794504003353295,
  'WEAPON LAWS': 0.008383242041248393},
 'TENDERLOIN': {'ARSON': 0.0007607188207687941,
  'ASSAULT': 0.0960787870630987,
  'BAD CHECKS': 0.00022236396299395522,
  'BRIBERY': 0.0001989572300472231,
  'BURGLARY': 0.018649314475308824,
  'DISORDERLY CONDUCT': 0.009456320110479779,
  'DRIVING UNDER THE INFLUENCE': 0.0013224804114903652,
  'DRUG/NARCOTIC': 0.21151494227314488,
  'DRUNKENNESS': 0.005383548577748389,
  'EMBEZZLEMENT': 0.0013809972438571954,
  'EXTORTION': 0.00015214376415375883,
  'FAMILY OFFENSES': 0.0003862110936210801,
  'FORGERY/COUNTERFEITING': 0.006823062653972415,
  'FRAUD': 0.013921154420068933,
  'GAMBLING': 0.00016969881386380792,
  'KIDNAPPING': 0.0026391091397440472,
  'LARCENY/THEFT': 0.12360510500845569,
  'LIQUOR LAWS': 0.002387486760566677,
  'LOITERING': 0.00206564418254911,
  'MISSING PERSON': 0.010948499335833953,
  'NON-CRIMINAL': 0.09305931851297025,
  'OTHER OFFENSES': 0.16441474390108315,
  'PORNOGRAPHY/OBSCENE MAT': 1.7555049710049096e-05,
  'PROSTITUTION': 0.011557074392448989,
  'RECOVERED VEHICLE': 0.0016326196230345659,
  'ROBBERY': 0.027104996752315803,
  'RUNAWAY': 9.362693178692851e-05,
  'SECONDARY CODES': 0.009099367433042115,
  'SEX OFFENSES, FORCIBLE': 0.00382114915355402,
  'SEX OFFENSES, NON FORCIBLE': 5.266514913014729e-05,
  'STOLEN PROPERTY': 0.004371207377802225,
  'SUICIDE': 0.00042717287627786133,
  'SUSPICIOUS OCC': 0.03190337700639589,
  'TRESPASS': 0.01403233640156591,
  'VANDALISM': 0.020252675682159972,
  'VEHICLE THEFT': 0.012276831430561,
  'WARRANTS': 0.08769332498493192,
  'WEAPON LAWS': 0.010123411999461646}}

Now we look at the ratio P(crime|district)/P(crime). That ratio is equal to 1 if the crime occurs at the same level within a district as in the city as a whole. If it's greater than one, it means that the crime occurs more frequently within that district. If it's smaller than one, it means that the crime is rarer within the district in question than in the city as a whole.

For each district plot these ratios for the 14 focus crimes. My plot looks like this


In [110]:
import pandas as pd
%matplotlib inline

def calcRatioForDistrict(districtCounter, overAllCounter, district):
    ys = []
    xs = []
    for fc in list(focuscrimes):
        ys.append(districtCounter[fc] / overAllCounter[fc])
        xs.append(fc)
    return pd.Series(ys, index=xs)
    
series = {}
for district in list(PdDistrictSet):
    series[district] = calcRatioForDistrict(p_crime_district[district], p_crime, district)

df = pd.DataFrame(series)
df.plot(kind="bar", subplots=True, figsize=(14,14),layout=(7,2), legend=False,sharey=True)


Out[110]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000000060E1D080>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000015D6DDA0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000000572B53C8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000025E5F940>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000000000A5C90F0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000010934B70>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000000083D726D8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000004A88CDA0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000000007BCE2550>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000000BCF9E10>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000000008243EA58>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000010A92E860>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000000000C6C98D0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000000F79CDC50>]], dtype=object)

Comment on the top crimes in Tenderloin, Mission, and Richmond. Does this fit with the impression you get of these neighborhoods on Wikipedia?

Tenderloin

The top crime in Tenderloin is by far DRUG/NARCOTICS as illustrated above. Looking through articles and wikipedia, it does seem to be representative current/past activity in the area.

From Wikipedia: "The first block of Turk Street, between Taylor and Mason, had one of the highest rates of violence and drug activity in San Francisco, according to a survey conducted by the Tenderloin Housing Clinic."

Mission

Prostitution. The top crime in Mission with some margin. Wikipedia does not give any indication about prostitution or crime in general for the district.

Doing a leading search on google, there does seem to be articles that mentions prostitution is on the uptick recently.

Richmond

Richmond has the highest ratio of crimes related to DRIVING UNDER THE INFLUENCE (DUI) in San Francisco. Oddly, both DRUNKENESS and DRUG/NARCOTIC are very low in the area.

We were not able to find any information that backs these statistics.

Even though we only plotted the ratios for our 14 focus crimes, I asked you to calculate the ratios based on all crime categories. Why do you think I wanted to include all crime types in the calculation?

Not including the other crimes the focus crime ratios will inheritly become larger.

Assignment 1C: KNN

Begin by using geoplotlib to plot all incidents of the three crime types on their own map using geoplotlib.kde(). This will give you an idea of how the varioius crimes are distributed across the city.

Here we define the new focus crimes. Then we create a dictionary that contains all the crimes from the tree specific focus crimes. So that on the key PROSTITUTION the value is a list containing all of these crimes.


In [111]:
focusCrimes = ['PROSTITUTION', 'DRUG/NARCOTIC', 'DRIVING UNDER THE INFLUENCE']

filteredCrimes = {}
for f in focusCrimes:
    filteredCrimes[f] = filter(lambda c: c['Category'] == f and float(c['Y']) < 90.0, data) # We remove some crimes located at the north pole

Then we create a dictionary that cointains two lists, with all the latitutes and longtitues. This is the structure that geoplotlib accepts. We also construct a Bounding Box for San Francisco so that we have a nice view box.


In [112]:
from geoplotlib.utils import BoundingBox

def createLongAndLatList(data):
    return {'lat': [float(d['Y']) for d in data], 'lon': [float(d['X']) for d in data]}

# San Francisco
bbox_SF = BoundingBox(north=37.83, west=-122.52, south=37.67, east=-122.35)

In [113]:
import geoplotlib as glp
from geoplotlib.utils import BoundingBox, DataAccessObject

for fc in focusCrimes:
    print "Plotting " + fc
    glp.kde(createLongAndLatList(filteredCrimes[fc]), bw=5)
    glp.set_bbox(bbox_SF)
    glp.inline(width=900)


Plotting PROSTITUTION
('smallest non-zero count', 1.4329573088768091e-09)
('max count:', 17.065636267483391)
Plotting DRUG/NARCOTIC
('smallest non-zero count', 7.1647865443840454e-10)
('max count:', 101.94473147756155)