How We Munged

Tools

pandas is great for wrangling tabular data.


In [9]:
import pandas as pd

Data Source

Chicago publishes its crime data as a massive 1.4 GB CSV.

Here's a small sample.


In [3]:
sample = pd.read_csv('clearn/data/fixtures/tinyCrimeSample.csv')

Data Format

Lots of features. And lots of possible discrete values.


In [4]:
sample


Out[4]:
ID Case Number Date Block IUCR Primary Type Description Location Description Arrest Domestic ... Ward Community Area FBI Code X Coordinate Y Coordinate Year Updated On Latitude Longitude Location
0 9977403 HY166566 02/27/2015 11:58:00 PM 075XX S SOUTH CHICAGO AVE 4625 OTHER OFFENSE PAROLE VIOLATION STREET True False ... 5 43 26 1185557 1855546 2015 03/06/2015 12:43:50 PM 41.758759 -87.595507 (41.758759126, -87.59550678)
1 9977399 HY166552 02/27/2015 11:55:00 PM 111XX S INDIANA AVE 460 BATTERY SIMPLE RESIDENCE True False ... 9 49 08B 1179512 1831148 2015 03/06/2015 12:43:50 PM 41.691948 -87.618404 (41.691948085, -87.618403732)
2 9977419 HY166555 02/27/2015 11:53:00 PM 034XX W NORTH AVE 486 BATTERY DOMESTIC BATTERY SIMPLE APARTMENT True True ... 26 23 08B 1153350 1910385 2015 03/06/2015 12:43:50 PM 41.909942 -87.712090 (41.909941658, -87.712090397)
3 9977392 HY166560 02/27/2015 11:49:00 PM 007XX N HAMLIN AVE 2024 NARCOTICS POSS: HEROIN(WHITE) STREET True False ... 27 23 18 1150955 1904931 2015 03/06/2015 12:43:50 PM 41.895023 -87.721032 (41.89502261, -87.721031713)
4 9977845 HY167195 02/27/2015 11:45:00 PM 064XX S JUSTINE ST 1150 DECEPTIVE PRACTICE CREDIT CARD FRAUD OTHER False False ... 17 67 11 1167069 1862147 2015 03/06/2015 12:43:50 PM 41.777288 -87.663075 (41.777288287, -87.663075426)
5 9977548 HY166741 02/27/2015 11:40:00 PM 008XX E 49TH ST 820 THEFT $500 AND UNDER STREET False False ... 4 39 06 1182440 1872737 2015 03/06/2015 12:43:50 PM 41.806006 -87.606398 (41.806005596, -87.606397619)
6 9978595 HY168304 02/27/2015 11:30:00 PM 021XX S STATE ST 460 BATTERY SIMPLE BAR OR TAVERN False False ... 3 33 08B 1176690 1890182 2015 03/06/2015 12:43:50 PM 41.854008 -87.626960 (41.854007608, -87.626960226)
7 9977447 HY166548 02/27/2015 11:30:00 PM 055XX W WASHINGTON BLVD 820 THEFT $500 AND UNDER APARTMENT False True ... 29 25 06 1139102 1900154 2015 03/06/2015 12:43:50 PM 41.882138 -87.764681 (41.882137789, -87.764681486)
8 9977422 HY166526 02/27/2015 11:30:00 PM 034XX W PALMER ST 460 BATTERY SIMPLE APARTMENT False False ... 26 22 08B 1153107 1914370 2015 03/06/2015 12:43:50 PM 41.920882 -87.712877 (41.920881673, -87.712877177)
9 9977930 HY166969 02/27/2015 11:30:00 PM 034XX W 74TH ST 1310 CRIMINAL DAMAGE TO PROPERTY STREET False False ... 18 66 14 1154671 1855365 2015 03/06/2015 12:43:50 PM 41.758934 -87.708707 (41.758933601, -87.708707301)

10 rows × 22 columns
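
The full export is far too large to load carelessly. The project's own loading code isn't shown here, but pandas can stream a file this size in chunks; the path, chunk size, and column list below are illustrative assumptions, not the actual clearn code.

import pandas as pd

# Stream the full 1.4 GB export in 100k-row chunks (path is hypothetical)
chunks = pd.read_csv(
    'Crimes_-_2001_to_present.csv',
    chunksize=100_000,
)

# Keep only the columns the cleaning step needs, chunk by chunk
keep = ['Date', 'Primary Type', 'Community Area', 'Arrest', 'Domestic']
crimes = pd.concat(chunk[keep] for chunk in chunks)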

Cleaning up the Crimes

We wrote a munge module to tame the data.


In [ ]:
from clearn import munge

Bin, drop, and reindex

Bin crimes into four categories (Violent, Severe, Minor, and Petty), convert community area numbers to their names, and turn the timestamp strings into a pandas DatetimeIndex. A sketch of these steps follows the output below.


In [7]:
munge.make_clean_timestamps(sample)


Out[7]:
Primary Type Community Area Arrest Domestic
Date
2015-02-27 23:58:00 Petty South Shore True False
2015-02-27 23:55:00 Violent Roseland True False
2015-02-27 23:53:00 Violent Humboldt Park True True
2015-02-27 23:49:00 Petty Humboldt Park True False
2015-02-27 23:45:00 Minor West Englewood False False
2015-02-27 23:40:00 Severe Kenwood False False
2015-02-27 23:30:00 Violent Near South Side False False
2015-02-27 23:30:00 Severe Austin False True
2015-02-27 23:30:00 Violent Logan Square False False
2015-02-27 23:30:00 Severe Chicago Lawn False False
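
This is not the clearn.munge implementation, just a sketch of the three steps. The function name and both lookup tables are illustrative; the mappings only reproduce what the sample output above shows.

import pandas as pd

# Hypothetical lookup tables, covering only the values in the sample
CATEGORY = {
    'BATTERY': 'Violent',
    'THEFT': 'Severe',
    'CRIMINAL DAMAGE TO PROPERTY': 'Severe',
    'DECEPTIVE PRACTICE': 'Minor',
    'NARCOTICS': 'Petty',
    'OTHER OFFENSE': 'Petty',
}
AREA_NAME = {22: 'Logan Square', 23: 'Humboldt Park', 25: 'Austin',
             33: 'Near South Side', 39: 'Kenwood', 43: 'South Shore',
             49: 'Roseland', 66: 'Chicago Lawn', 67: 'West Englewood'}

def make_clean_timestamps_sketch(df):
    df = df.copy()
    # 1. Bin dozens of primary types into four categories
    df['Primary Type'] = df['Primary Type'].map(CATEGORY)
    # 2. Replace community area numbers with names
    df['Community Area'] = df['Community Area'].map(AREA_NAME)
    # 3. Parse the timestamp strings and use them as the index
    df.index = pd.to_datetime(df['Date'], format='%m/%d/%Y %I:%M:%S %p')
    # Keep only what the models need
    return df[['Primary Type', 'Community Area', 'Arrest', 'Domestic']]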

Group by community area and resample by day

For each community area, create a series of summaries of each day's criminal activity from 2001 to present.


In [10]:
every_community_area = munge.get_master_dict()

In [11]:
where_wills_sister_lives = every_community_area['Edgewater']

In [14]:
where_wills_sister_lives[-5:]


Out[14]:
Arrest Domestic Violent Crimes Severe Crimes Minor Crimes Petty Crimes Violent Crime Committed? Month Weekday
2015-03-29 1 1 2 3 0 1 True 3 6
2015-03-30 1 0 0 3 0 1 False 3 0
2015-03-31 3 4 4 3 1 1 True 3 1
2015-04-01 0 1 3 1 1 0 True 4 2
2015-04-02 1 2 1 7 2 0 True 4 3
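
Again a sketch rather than clearn.munge.get_master_dict itself: given the cleaned frame above, pandas' groupby and resample can produce one daily-summary frame per community area. The column arithmetic is inferred from the Out[14] columns; the helper name is illustrative.

import pandas as pd

def summarize_by_area_sketch(clean):
    summaries = {}
    for area, crimes in clean.groupby('Community Area'):
        category = crimes['Primary Type']
        # Count each day's crimes, arrests, and domestic incidents
        daily = pd.DataFrame({
            'Arrest': crimes['Arrest'].resample('D').sum(),
            'Domestic': crimes['Domestic'].resample('D').sum(),
            'Violent Crimes': (category == 'Violent').resample('D').sum(),
            'Severe Crimes': (category == 'Severe').resample('D').sum(),
            'Minor Crimes': (category == 'Minor').resample('D').sum(),
            'Petty Crimes': (category == 'Petty').resample('D').sum(),
        }).fillna(0)
        daily['Violent Crime Committed?'] = daily['Violent Crimes'] > 0
        daily['Month'] = daily.index.month
        daily['Weekday'] = daily.index.weekday
        summaries[area] = daily
    return summaries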

Extra preprocessing for each model

For nonsequential prediction, we added history features to each day: rolling counts of each crime category over the last week and month, both for the community area and for Chicago as a whole. A sketch follows the output below.


In [16]:
from clearn.predict import NonsequentialPredictor
with_history = NonsequentialPredictor.preprocess(every_community_area)
with_history['Edgewater'][-5:]


Out[16]:
Arrest Domestic Violent Crimes Severe Crimes Minor Crimes Petty Crimes Violent Crime Committed? Month Weekday Violent Crimes in Last Week ... Chicago Minor Crimes Chicago Petty Crimes Chicago Violent Crimes in Last Week Chicago Violent Crimes in Last Month Chicago Severe Crimes in Last Week Chicago Severe Crimes in Last Month Chicago Minor Crimes in Last Week Chicago Minor Crimes in Last Month Chicago Petty Crimes in Last Week Chicago Petty Crimes in Last Month
2015-03-29 1 1 2 3 0 1 True 3 6 6 ... 52 123 1176 5471 1870 8312 441 1980 932 4105
2015-03-30 1 0 0 3 0 1 False 3 0 6 ... 74 139 1187 5490 1924 8396 461 1994 960 4134
2015-03-31 3 4 4 3 1 1 True 3 1 10 ... 60 130 1232 5504 1938 8404 455 1999 954 4117
2015-04-01 0 1 3 1 1 0 True 4 2 13 ... 62 134 1273 5576 1984 8438 439 1995 937 4109
2015-04-02 1 2 1 7 2 0 True 4 3 14 ... 66 147 1297 5615 2010 8507 446 1999 949 4117

5 rows × 31 columns
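
The history columns suggest rolling sums over the previous week and month. This sketch covers only the local half of that idea (the "Chicago ..." columns would be built the same way from citywide totals), and the 7- and 30-day windows are an assumption.

def add_history_sketch(daily):
    out = daily.copy()
    for kind in ('Violent', 'Severe', 'Minor', 'Petty'):
        counts = daily[f'{kind} Crimes']
        # Sum the previous 7 and 30 days, shifted so today's count isn't included
        out[f'{kind} Crimes in Last Week'] = counts.shift(1).rolling(7).sum()
        out[f'{kind} Crimes in Last Month'] = counts.shift(1).rolling(30).sum()
    # Drop the early days that don't yet have a full month of history
    return out.dropna()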

Let's predict crime!


In [17]:
from datetime import date
log_reg_predictor = NonsequentialPredictor(with_history['Edgewater'])
log_reg_predictor.predict(date(2015, 4, 3))


Out[17]:
True
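
The variable name suggests logistic regression under the hood. The sketch below is not the clearn predictor, just the idea for a date already present in the frame: train on every earlier day, then classify the target day from its own feature row. (The real call above predicts 2015-04-03, one day past the last row, so it must also build that day's features from history; the sketch skips that step.)

import pandas as pd
from sklearn.linear_model import LogisticRegression

def predict_violent_crime_sketch(frame, day):
    """Train on every day before `day`, then predict `day` from its own feature row."""
    target = 'Violent Crime Committed?'
    X = frame.drop(columns=[target])
    y = frame[target]
    before = X.index < pd.Timestamp(day)
    model = LogisticRegression(max_iter=1000).fit(X[before], y[before])
    return bool(model.predict(X.loc[[pd.Timestamp(day)]])[0])

# e.g. predict_violent_crime_sketch(with_history['Edgewater'], '2015-04-02')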

Which algorithm performs best?


In [19]:
from clearn.evaluate import evaluate
# Generate a sample of 2500 days to predict
evaluate(2500)

... and come back in 9 hours
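
We won't show clearn.evaluate here, but a plausible sketch is: sample historical days, have each candidate model predict them, and score against what actually happened. The predictor dictionary and scoring below are assumptions; retraining a model for every sampled day is what turns 2,500 predictions into a multi-hour run.

import random

def evaluate_sketch(frame, predictors, n_days):
    # predictors maps a model name to a callable like predict_violent_crime_sketch above
    days = random.sample(list(frame.index), n_days)
    scores = {}
    for name, predict in predictors.items():
        correct = sum(
            predict(frame, day) == frame.loc[day, 'Violent Crime Committed?']
            for day in days
        )
        scores[name] = correct / n_days
    return scores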

