How We Munged

Tools

pandas is great for wrangling tabular data.


In [9]:
import pandas as pd

Data Source

Chicago publishes its crime data as a massive 1.4 GB CSV.

Here's a small sample.


In [3]:
sample = pd.read_csv('clearn/data/fixtures/tinyCrimeSample.csv')

Data Format

Lots of features. And lots of possible discrete values.


In [4]:
sample


Out[4]:
ID Case Number Date Block IUCR Primary Type Description Location Description Arrest Domestic ... Ward Community Area FBI Code X Coordinate Y Coordinate Year Updated On Latitude Longitude Location
0 9977403 HY166566 02/27/2015 11:58:00 PM 075XX S SOUTH CHICAGO AVE 4625 OTHER OFFENSE PAROLE VIOLATION STREET True False ... 5 43 26 1185557 1855546 2015 03/06/2015 12:43:50 PM 41.758759 -87.595507 (41.758759126, -87.59550678)
1 9977399 HY166552 02/27/2015 11:55:00 PM 111XX S INDIANA AVE 460 BATTERY SIMPLE RESIDENCE True False ... 9 49 08B 1179512 1831148 2015 03/06/2015 12:43:50 PM 41.691948 -87.618404 (41.691948085, -87.618403732)
2 9977419 HY166555 02/27/2015 11:53:00 PM 034XX W NORTH AVE 486 BATTERY DOMESTIC BATTERY SIMPLE APARTMENT True True ... 26 23 08B 1153350 1910385 2015 03/06/2015 12:43:50 PM 41.909942 -87.712090 (41.909941658, -87.712090397)
3 9977392 HY166560 02/27/2015 11:49:00 PM 007XX N HAMLIN AVE 2024 NARCOTICS POSS: HEROIN(WHITE) STREET True False ... 27 23 18 1150955 1904931 2015 03/06/2015 12:43:50 PM 41.895023 -87.721032 (41.89502261, -87.721031713)
4 9977845 HY167195 02/27/2015 11:45:00 PM 064XX S JUSTINE ST 1150 DECEPTIVE PRACTICE CREDIT CARD FRAUD OTHER False False ... 17 67 11 1167069 1862147 2015 03/06/2015 12:43:50 PM 41.777288 -87.663075 (41.777288287, -87.663075426)
5 9977548 HY166741 02/27/2015 11:40:00 PM 008XX E 49TH ST 820 THEFT $500 AND UNDER STREET False False ... 4 39 06 1182440 1872737 2015 03/06/2015 12:43:50 PM 41.806006 -87.606398 (41.806005596, -87.606397619)
6 9978595 HY168304 02/27/2015 11:30:00 PM 021XX S STATE ST 460 BATTERY SIMPLE BAR OR TAVERN False False ... 3 33 08B 1176690 1890182 2015 03/06/2015 12:43:50 PM 41.854008 -87.626960 (41.854007608, -87.626960226)
7 9977447 HY166548 02/27/2015 11:30:00 PM 055XX W WASHINGTON BLVD 820 THEFT $500 AND UNDER APARTMENT False True ... 29 25 06 1139102 1900154 2015 03/06/2015 12:43:50 PM 41.882138 -87.764681 (41.882137789, -87.764681486)
8 9977422 HY166526 02/27/2015 11:30:00 PM 034XX W PALMER ST 460 BATTERY SIMPLE APARTMENT False False ... 26 22 08B 1153107 1914370 2015 03/06/2015 12:43:50 PM 41.920882 -87.712877 (41.920881673, -87.712877177)
9 9977930 HY166969 02/27/2015 11:30:00 PM 034XX W 74TH ST 1310 CRIMINAL DAMAGE TO PROPERTY STREET False False ... 18 66 14 1154671 1855365 2015 03/06/2015 12:43:50 PM 41.758934 -87.708707 (41.758933601, -87.708707301)

10 rows × 22 columns
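
The full export is far too large to load carelessly. The project's own loading code isn't shown here, but pandas can stream a file this size in chunks; the path, chunk size, and column list below are illustrative assumptions, not the actual clearn code.

import pandas as pd

# Stream the full 1.4 GB export in 100k-row chunks (path is hypothetical)
chunks = pd.read_csv(
    'Crimes_-_2001_to_present.csv',
    chunksize=100_000,
)

# Keep only the columns the cleaning step needs, chunk by chunk
keep = ['Date', 'Primary Type', 'Community Area', 'Arrest', 'Domestic']
crimes = pd.concat(chunk[keep] for chunk in chunks)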

Cleaning up the Crimes

We wrote a munge module to tame the data.


In [ ]:
from clearn import munge

Bin, drop, and reindex

Bin crimes into four categories (Violent, Severe, Minor, and Petty), convert community area numbers to their names, and turn the timestamp strings into a pandas DatetimeIndex. A sketch of these steps follows the output below.


In [7]:
munge.make_clean_timestamps(sample)


Out[7]:
Primary Type Community Area Arrest Domestic
Date
2015-02-27 23:58:00 Petty South Shore True False
2015-02-27 23:55:00 Violent Roseland True False
2015-02-27 23:53:00 Violent Humboldt Park True True
2015-02-27 23:49:00 Petty Humboldt Park True False
2015-02-27 23:45:00 Minor West Englewood False False
2015-02-27 23:40:00 Severe Kenwood False False
2015-02-27 23:30:00 Violent Near South Side False False
2015-02-27 23:30:00 Severe Austin False True
2015-02-27 23:30:00 Violent Logan Square False False
2015-02-27 23:30:00 Severe Chicago Lawn False False
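
This is not the clearn.munge implementation, just a sketch of the three steps. The function name and both lookup tables are illustrative; the mappings only reproduce what the sample output above shows.

import pandas as pd

# Hypothetical lookup tables, covering only the values in the sample
CATEGORY = {
    'BATTERY': 'Violent',
    'THEFT': 'Severe',
    'CRIMINAL DAMAGE TO PROPERTY': 'Severe',
    'DECEPTIVE PRACTICE': 'Minor',
    'NARCOTICS': 'Petty',
    'OTHER OFFENSE': 'Petty',
}
AREA_NAME = {22: 'Logan Square', 23: 'Humboldt Park', 25: 'Austin',
             33: 'Near South Side', 39: 'Kenwood', 43: 'South Shore',
             49: 'Roseland', 66: 'Chicago Lawn', 67: 'West Englewood'}

def make_clean_timestamps_sketch(df):
    df = df.copy()
    # 1. Bin dozens of primary types into four categories
    df['Primary Type'] = df['Primary Type'].map(CATEGORY)
    # 2. Replace community area numbers with names
    df['Community Area'] = df['Community Area'].map(AREA_NAME)
    # 3. Parse the timestamp strings and use them as the index
    df.index = pd.to_datetime(df['Date'], format='%m/%d/%Y %I:%M:%S %p')
    # Keep only what the models need
    return df[['Primary Type', 'Community Area', 'Arrest', 'Domestic']]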

Group by community area and resample by day

For each community area, create a series of summaries of each day's criminal activity from 2001 to present.


In [10]:
every_community_area = munge.get_master_dict()

In [11]:
where_wills_sister_lives = every_community_area['Edgewater']

In [14]:
where_wills_sister_lives[-5:]


Out[14]:
Arrest Domestic Violent Crimes Severe Crimes Minor Crimes Petty Crimes Violent Crime Committed? Month Weekday
2015-03-29 1 1 2 3 0 1 True 3 6
2015-03-30 1 0 0 3 0 1 False 3 0
2015-03-31 3 4 4 3 1 1 True 3 1
2015-04-01 0 1 3 1 1 0 True 4 2
2015-04-02 1 2 1 7 2 0 True 4 3
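
Again a sketch rather than clearn.munge.get_master_dict itself: given the cleaned frame above, pandas' groupby and resample can produce one daily-summary frame per community area. The column arithmetic is inferred from the Out[14] columns; the helper name is illustrative.

import pandas as pd

def summarize_by_area_sketch(clean):
    summaries = {}
    for area, crimes in clean.groupby('Community Area'):
        category = crimes['Primary Type']
        # Count each day's crimes, arrests, and domestic incidents
        daily = pd.DataFrame({
            'Arrest': crimes['Arrest'].resample('D').sum(),
            'Domestic': crimes['Domestic'].resample('D').sum(),
            'Violent Crimes': (category == 'Violent').resample('D').sum(),
            'Severe Crimes': (category == 'Severe').resample('D').sum(),
            'Minor Crimes': (category == 'Minor').resample('D').sum(),
            'Petty Crimes': (category == 'Petty').resample('D').sum(),
        }).fillna(0)
        daily['Violent Crime Committed?'] = daily['Violent Crimes'] > 0
        daily['Month'] = daily.index.month
        daily['Weekday'] = daily.index.weekday
        summaries[area] = daily
    return summaries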

Extra preprocessing for each model

For nonsequential prediction, we added history features to each day: rolling counts of each crime category over the last week and month, both for the community area and for Chicago as a whole. A sketch follows the output below.


In [16]:
from clearn.predict import NonsequentialPredictor
with_history = NonsequentialPredictor.preprocess(every_community_area)
with_history['Edgewater'][-5:]


Out[16]:
Arrest Domestic Violent Crimes Severe Crimes Minor Crimes Petty Crimes Violent Crime Committed? Month Weekday Violent Crimes in Last Week ... Chicago Minor Crimes Chicago Petty Crimes Chicago Violent Crimes in Last Week Chicago Violent Crimes in Last Month Chicago Severe Crimes in Last Week Chicago Severe Crimes in Last Month Chicago Minor Crimes in Last Week Chicago Minor Crimes in Last Month Chicago Petty Crimes in Last Week Chicago Petty Crimes in Last Month
2015-03-29 1 1 2 3 0 1 True 3 6 6 ... 52 123 1176 5471 1870 8312 441 1980 932 4105
2015-03-30 1 0 0 3 0 1 False 3 0 6 ... 74 139 1187 5490 1924 8396 461 1994 960 4134
2015-03-31 3 4 4 3 1 1 True 3 1 10 ... 60 130 1232 5504 1938 8404 455 1999 954 4117
2015-04-01 0 1 3 1 1 0 True 4 2 13 ... 62 134 1273 5576 1984 8438 439 1995 937 4109
2015-04-02 1 2 1 7 2 0 True 4 3 14 ... 66 147 1297 5615 2010 8507 446 1999 949 4117

5 rows × 31 columns
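
The history columns suggest rolling sums over the previous week and month. This sketch covers only the local half of that idea (the "Chicago ..." columns would be built the same way from citywide totals), and the 7- and 30-day windows are an assumption.

def add_history_sketch(daily):
    out = daily.copy()
    for kind in ('Violent', 'Severe', 'Minor', 'Petty'):
        counts = daily[f'{kind} Crimes']
        # Sum the previous 7 and 30 days, shifted so today's count isn't included
        out[f'{kind} Crimes in Last Week'] = counts.shift(1).rolling(7).sum()
        out[f'{kind} Crimes in Last Month'] = counts.shift(1).rolling(30).sum()
    # Drop the early days that don't yet have a full month of history
    return out.dropna()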

Let's predict crime!


In [17]:
from datetime import date
log_reg_predictor = NonsequentialPredictor(with_history['Edgewater'])
log_reg_predictor.predict(date(2015, 4, 3))


Out[17]:
True
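
The variable name suggests logistic regression under the hood. The sketch below is not the clearn predictor, just the idea for a date already present in the frame: train on every earlier day, then classify the target day from its own feature row. (The real call above predicts 2015-04-03, one day past the last row, so it must also build that day's features from history; the sketch skips that step.)

import pandas as pd
from sklearn.linear_model import LogisticRegression

def predict_violent_crime_sketch(frame, day):
    """Train on every day before `day`, then predict `day` from its own feature row."""
    target = 'Violent Crime Committed?'
    X = frame.drop(columns=[target])
    y = frame[target]
    before = X.index < pd.Timestamp(day)
    model = LogisticRegression(max_iter=1000).fit(X[before], y[before])
    return bool(model.predict(X.loc[[pd.Timestamp(day)]])[0])

# e.g. predict_violent_crime_sketch(with_history['Edgewater'], '2015-04-02')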

Which algorithm performs best?


In [19]:
from clearn.evaluate import evaluate
# Generate a sample of 2500 days to predict
evaluate(2500)

... and come back in 9 hours
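
We won't show clearn.evaluate here, but a plausible sketch is: sample historical days, have each candidate model predict them, and score against what actually happened. The predictor dictionary and scoring below are assumptions; retraining a model for every sampled day is what turns 2,500 predictions into a multi-hour run.

import random

def evaluate_sketch(frame, predictors, n_days):
    # predictors maps a model name to a callable like predict_violent_crime_sketch above
    days = random.sample(list(frame.index), n_days)
    scores = {}
    for name, predict in predictors.items():
        correct = sum(
            predict(frame, day) == frame.loc[day, 'Violent Crime Committed?']
            for day in days
        )
        scores[name] = correct / n_days
    return scores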

