In [1]:
import pandas as pd
pd.options.display.max_rows = 6
%matplotlib inline

In [2]:
crimes = pd.read_csv('Crimes_-_2001_to_present.csv', parse_dates=['Date'])


















Group By

Given a key column, one can partition a table into groups:


In [3]:
crimes50_by_community = crimes.head(50).groupby('Community Area')
crimes50_by_community.groups


Out[3]:
{1: [26],
 6: [3, 13],
 8: [28, 38, 45, 49],
 13: [14],
 14: [1, 43],
 19: [30],
 22: [0, 27],
 23: [32],
 24: [31],
 25: [5, 24, 42],
 28: [10, 11, 22, 36],
 29: [33, 40],
 30: [16, 25],
 32: [4],
 33: [37],
 37: [48],
 41: [2],
 43: [7, 8, 20],
 49: [29],
 56: [23],
 58: [21, 46],
 61: [47],
 63: [9],
 67: [15, 19],
 69: [17],
 70: [41, 44],
 71: [6, 34],
 75: [12],
 76: [18],
 77: [35, 39]}

In [4]:
crimes_by_community = crimes.groupby('Community Area')


















Aggregate

We can aggregate a column over the groups using an aggregate function.

  • Built-in functions include 'sum', 'mean', 'median', 'count', 'std', and 'nunique'.
  • You can also pass an arbitrary function that takes a group's values as a vector and returns a scalar.
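For instance, here is a sketch on toy data (the column names are made up) showing both a built-in aggregate referenced by name and a custom function:

```python
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1.0, 3.0, 5.0]})

# Built-in aggregate, referenced by name:
means = df.groupby('g')['x'].agg('mean')

# Arbitrary function: takes the group's values, returns a scalar (here, the range).
spread = df.groupby('g')['x'].agg(lambda v: v.max() - v.min())
```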

In [5]:
community_crime_count = crimes_by_community['ID'].agg('count')
community_crime_count


Out[5]:
Community Area
1     259
2     263
3     263
     ... 
75    118
76    112
77    166
Name: ID, dtype: int64

In [6]:
community_crime_count.plot(kind='bar', figsize=(12,5))


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f199dd22e90>

[bar chart: crime count per community area]

Daily Timeseries


In [7]:
def to_day(timestamp):
    # Truncate a timestamp to midnight on the same day
    return timestamp.replace(hour=0, minute=0, second=0, microsecond=0)

crimes['Day'] = crimes['Date'].apply(to_day)
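An equivalent vectorized way to truncate timestamps, typically much faster than apply on a large frame, is the .dt accessor; a sketch on made-up timestamps:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2015-12-01 13:45:30', '2015-12-01 02:10:00']))

# Floor to day precision: zeroes out hours, minutes, seconds, microseconds
days = dates.dt.floor('D')
```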

In [8]:
crimes_by_day = crimes.groupby('Day')

In [36]:
crimes_by_day['ID'].agg('nunique').plot()


Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f199cc1ffd0>

[line plot: daily count of distinct crime IDs]

What about a time series for a single community area?


In [10]:
crimes[crimes['Community Area'] == 41].groupby('Day')['ID'].agg('count').plot()


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f199d396e50>

[line plot: daily crime count in community area 41]

Multiple Group By

To get a series for each community area we could loop over them, but there is a better way.

groupby can take a list of keys, in this case community area and day. The result has a (row) MultiIndex.


In [11]:
crimes_by_community_day = crimes.groupby(['Community Area', 'Day'])
crimes_by_community_day_count = crimes_by_community_day['ID'].agg('count')

In [12]:
crimes_by_community_day_count


Out[12]:
Community Area  Day       
1               2015-12-01    14
                2015-12-02    12
                2015-12-03     6
                              ..
77              2015-12-28     8
                2015-12-29     4
                2015-12-30     5
Name: ID, dtype: int64
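One convenience of the MultiIndex result: indexing .loc with a value of the first level returns a single area's daily series. A sketch on toy data (the names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({'area': [1, 1, 2],
                   'day': ['2015-12-01', '2015-12-02', '2015-12-01'],
                   'id': [10, 11, 12]})
counts = df.groupby(['area', 'day'])['id'].agg('count')

# Select the first index level to get one area's series, indexed by day
area1 = counts.loc[1]
```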


















We can aggregate arrests the same way (summing the boolean Arrest column counts arrests), then take the quotient to get an arrest proportion:


In [13]:
crimes_by_community_day_arrests = crimes_by_community_day['Arrest'].agg('sum')

In [14]:
community_day_arrest_prop = crimes_by_community_day_arrests / crimes_by_community_day_count
community_day_arrest_prop


Out[14]:
Community Area  Day       
1               2015-12-01    0.214286
                2015-12-02    0.166667
                2015-12-03    0.166667
                                ...   
77              2015-12-28    0.250000
                2015-12-29    0.500000
                2015-12-30    0.000000
dtype: float64


















How can we plot several of these time series together? By unstacking the MultiIndex into columns.
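unstack moves one index level into the columns; a minimal sketch with made-up keys:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('a', 'x'), ('a', 'y'), ('b', 'x')], names=['key', 'day'])
s = pd.Series([1, 2, 3], index=idx)

# 'key' values become columns; pairs absent from the index become NaN
wide = s.unstack('key')
```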


In [40]:
crimes.groupby(['Community Area', 'Primary Type'])['ID'].agg('count').unstack('Community Area')


Out[40]:
Community Area 1 2 3 4 5 6 7 8 9 10 ... 68 69 70 71 72 73 74 75 76 77
Primary Type
ARSON NaN 1 NaN NaN NaN 1 NaN NaN NaN NaN ... 1 1 NaN 2 NaN NaN NaN NaN NaN NaN
ASSAULT 19 11 12 10 4 11 3 37 2 4 ... 39 46 16 40 4 14 5 4 4 10
BATTERY 50 32 47 17 8 45 23 99 4 15 ... 122 139 23 129 6 40 4 21 9 32
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
STALKING NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
THEFT 60 62 75 42 40 172 152 392 NaN 29 ... 65 87 38 89 12 44 16 25 45 39
WEAPONS VIOLATION NaN 3 2 NaN NaN 1 NaN 2 NaN NaN ... 7 9 2 8 NaN NaN NaN 2 NaN 2

30 rows × 77 columns


In [15]:
community_day_arrest_prop.unstack('Community Area')


Out[15]:
Community Area 1 2 3 4 5 6 7 8 9 10 ... 68 69 70 71 72 73 74 75 76 77
Day
2015-12-01 0.214286 0.133333 0.100000 0.375 0.200000 0.076923 0.000000 0.000000 NaN 0.000000 ... 0.333333 0.318182 0.111111 0.285714 0.0 0.000000 0.000000 0.000000 0.400000 0.0
2015-12-02 0.166667 0.000000 0.100000 0.250 0.000000 0.000000 0.000000 0.173913 0 0.000000 ... 0.173913 0.307692 0.000000 0.347826 0.0 0.500000 0.500000 0.166667 0.166667 0.0
2015-12-03 0.166667 0.181818 0.333333 0.000 0.000000 0.166667 0.000000 0.031250 0 0.000000 ... 0.250000 0.210526 0.142857 0.125000 0.0 0.285714 0.000000 0.000000 0.000000 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2015-12-29 0.000000 0.285714 0.083333 0.000 0.333333 0.125000 0.000000 0.217391 NaN 0.000000 ... 0.277778 0.200000 0.142857 0.277778 NaN 0.125000 0.333333 0.000000 0.000000 0.5
2015-12-30 0.000000 0.250000 0.428571 0.000 0.000000 0.272727 0.133333 0.222222 NaN 0.333333 ... 0.181818 0.260870 0.000000 0.142857 0.5 0.125000 0.000000 0.000000 0.250000 0.0
2015-12-31 NaN NaN NaN NaN NaN 0.000000 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

31 rows × 77 columns


















There are missing values after reshaping: NaN means there were no crimes in that community area on that day, so it is reasonable to fill those entries with fillna(0).


In [42]:
community_arrest_timeseries = community_day_arrest_prop.unstack('Community Area')
#community_arrest_timeseries.fillna(0, inplace=True)
community_arrest_timeseries


Out[42]:
Community Area 1 2 3 4 5 6 7 8 9 10 ... 68 69 70 71 72 73 74 75 76 77
Day
2015-12-01 0.214286 0.133333 0.100000 0.375 0.200000 0.076923 0.000000 0.000000 NaN 0.000000 ... 0.333333 0.318182 0.111111 0.285714 0.0 0.000000 0.000000 0.000000 0.400000 0.0
2015-12-02 0.166667 0.000000 0.100000 0.250 0.000000 0.000000 0.000000 0.173913 0 0.000000 ... 0.173913 0.307692 0.000000 0.347826 0.0 0.500000 0.500000 0.166667 0.166667 0.0
2015-12-03 0.166667 0.181818 0.333333 0.000 0.000000 0.166667 0.000000 0.031250 0 0.000000 ... 0.250000 0.210526 0.142857 0.125000 0.0 0.285714 0.000000 0.000000 0.000000 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2015-12-29 0.000000 0.285714 0.083333 0.000 0.333333 0.125000 0.000000 0.217391 NaN 0.000000 ... 0.277778 0.200000 0.142857 0.277778 NaN 0.125000 0.333333 0.000000 0.000000 0.5
2015-12-30 0.000000 0.250000 0.428571 0.000 0.000000 0.272727 0.133333 0.222222 NaN 0.333333 ... 0.181818 0.260870 0.000000 0.142857 0.5 0.125000 0.000000 0.000000 0.250000 0.0
2015-12-31 NaN NaN NaN NaN NaN 0.000000 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

31 rows × 77 columns


















Now we can select and plot multiple community areas' time series:


In [43]:
community_arrest_timeseries[[40,41,42]]


Out[43]:
Community Area 40 41 42
Day
2015-12-01 0.500000 1.0 0.333333
2015-12-02 0.250000 0.0 0.187500
2015-12-03 0.000000 0.0 0.333333
... ... ... ...
2015-12-29 0.285714 0.2 0.125000
2015-12-30 0.285714 0.0 0.500000
2015-12-31 NaN 0.0 NaN

31 rows × 3 columns


In [55]:
community_arrest_timeseries[[1, 10]]


Out[55]:
Community Area 1 10
Day
2015-12-01 0.214286 0.000000
2015-12-02 0.166667 0.000000
2015-12-03 0.166667 0.000000
... ... ...
2015-12-29 0.000000 0.000000
2015-12-30 0.000000 0.333333
2015-12-31 NaN NaN

31 rows × 2 columns


In [44]:
community_arrest_timeseries[[40,41,42]].plot()


Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f199c9d67d0>

[line plot: daily arrest proportion for community areas 40, 41, and 42]

Another dataset

Let's look at Chicago's affordable rental housing dataset.


In [18]:
housing = pd.read_csv('Affordable_Rental_Housing_Developments.csv')

In [56]:
housing['Property Type'].value_counts()


Out[56]:
Multifamily                          135
Senior                                76
Supportive Housing                    22
                                    ... 
Supportive/Youth/Kinship Families      1
Supportive/HIV/AIDS                    1
Supportive/Teenage Moms                1
Name: Property Type, dtype: int64

In [19]:
housing


Out[19]:
Community Area Name Community Area Number Property Type Property Name Address Zip Code Phone Number Management Company Units X Coordinate Y Coordinate Latitude Longitude Location
0 Albany Park 14 Senior Mayfair Commons 4444 W. Lawrence Ave. 60630 773-205-7862 Metroplex, Inc. 97 1145674.753818 1931569.979045 41.968224 -87.739747 4444 W. Lawrence Ave.\n(41.9682242321, -87.739...
1 Albany Park 14 Senior Senior Suites of Ravenswood Manor 2800 W. Montrose Ave. 60618 773-509-9333 Senior Lifestyle Corp. 80 1156898.513945 1929148.460973 41.961359 -87.698544 2800 W. Montrose Ave.\n(41.9613586525, -87.698...
2 Albany Park 14 Senior Darul Amaan Senior Living 4814-58 N. Kedzie Ave. 60625 773-463-3004 East Lake Management & Development Corp. 56 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 Woodlawn 42 Multifamily The Grant at Woodlawn Pk Phase I 6227-29 S. Cottage Grove Ave. 60637 312-283-0030 Preservation Housing Management LLC 67 1186722.404223 1863775.146697 41.781408 -87.605807 6227-29 S. Cottage Grove Ave.\n(41.7814080037,...
262 Woodlawn 42 Multifamily The Grant at Woodlawn Pk Phase II 6129 S. Cottage Grove Ave. 60637 312-283-0030 Preservation Housing Management LLC 33 1186094.258818 1864209.572438 41.782505 -87.590962 6129 S. Cottage Grove Ave.\n(41.7825052194, -8...
263 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1864312.842731 41.782803 -87.593261 (41.7828034603, -87.5932613147)

264 rows × 14 columns


















Join

What if we want to know the crime count in the community area of each housing development?
We can find this using what's called a join, which pandas calls merge.
First let's turn the community_crime_count Series into a DataFrame:


In [20]:
community_area_crime = pd.DataFrame({'Crime Count': community_crime_count})
community_area_crime


Out[20]:
Crime Count
Community Area
1 259
2 263
3 263
... ...
75 118
76 112
77 166

77 rows × 1 columns
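As an aside, Series.to_frame does the same conversion in one call, naming the column as it goes; a sketch on made-up counts:

```python
import pandas as pd

counts = pd.Series([259, 263], index=pd.Index([1, 2], name='Community Area'))

# Equivalent to pd.DataFrame({'Crime Count': counts})
frame = counts.to_frame('Crime Count')
```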


In [21]:
housing_crime = housing.merge(community_area_crime, 
                              left_on='Community Area Number', right_index=True)
housing_crime


Out[21]:
Community Area Name Community Area Number Property Type Property Name Address Zip Code Phone Number Management Company Units X Coordinate Y Coordinate Latitude Longitude Location Crime Count
0 Albany Park 14 Senior Mayfair Commons 4444 W. Lawrence Ave. 60630 773-205-7862 Metroplex, Inc. 97 1145674.753818 1931569.979045 41.968224 -87.739747 4444 W. Lawrence Ave.\n(41.9682242321, -87.739... 195
1 Albany Park 14 Senior Senior Suites of Ravenswood Manor 2800 W. Montrose Ave. 60618 773-509-9333 Senior Lifestyle Corp. 80 1156898.513945 1929148.460973 41.961359 -87.698544 2800 W. Montrose Ave.\n(41.9613586525, -87.698... 195
2 Albany Park 14 Senior Darul Amaan Senior Living 4814-58 N. Kedzie Ave. 60625 773-463-3004 East Lake Management & Development Corp. 56 NaN NaN NaN NaN NaN 195
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
260 Woodlawn 42 Senior Butler Apartments 6146 S. Kenwood Ave. 60637 773-955-9290 Woodlawn Comm. Dev. Corp. 17 1182676.846895 1862140.114893 41.776909 -87.603954 6146 S. Kenwood Ave.\n(41.7769092523, -87.6039... 265
261 Woodlawn 42 Multifamily The Grant at Woodlawn Pk Phase I 6227-29 S. Cottage Grove Ave. 60637 312-283-0030 Preservation Housing Management LLC 67 1186722.404223 1863775.146697 41.781408 -87.605807 6227-29 S. Cottage Grove Ave.\n(41.7814080037,... 265
262 Woodlawn 42 Multifamily The Grant at Woodlawn Pk Phase II 6129 S. Cottage Grove Ave. 60637 312-283-0030 Preservation Housing Management LLC 33 1186094.258818 1864209.572438 41.782505 -87.590962 6129 S. Cottage Grove Ave.\n(41.7825052194, -8... 265

260 rows × 15 columns


















Another aggregate

We can pass other aggregate functions, for example 'sum':


In [22]:
community_housing = pd.DataFrame({
    'Affordable Housing Units': housing.groupby('Community Area Number')['Units'].agg('sum')
})
community_housing


Out[22]:
Affordable Housing Units
Community Area Number
1 517
2 307
3 940
... ...
71 352
73 142
77 504

53 rows × 1 columns
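As an aside, newer pandas versions (0.25+) support named aggregation, which builds the renamed DataFrame in one step instead of wrapping the result in pd.DataFrame; a sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'area': [1, 1, 2], 'units': [100, 50, 40]})

# Keyword is the output column name; the value is a (source column, function) pair
summary = df.groupby('area').agg(total_units=('units', 'sum'))
```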


In [57]:
community_housing.shape


Out[57]:
(53, 1)

In [58]:
housing_crime_aggregate = community_housing.merge(community_area_crime,
                                                  left_index=True, right_index=True)
housing_crime_aggregate.shape


Out[58]:
(53, 2)

In [59]:
housing_crime_aggregate


Out[59]:
Affordable Housing Units Crime Count
1 517 259
2 307 263
3 940 263
... ... ...
71 352 570
73 142 218
77 504 166

53 rows × 2 columns


















Join types

Why does the join above only have entries for 53 community areas?
Because merge defaults to an inner join, which keeps only keys present in both tables. There are four types of joins:

  • left
  • right
  • inner
  • outer

In this case we want an outer join.
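A minimal sketch of how the how= argument changes the result (toy frames with made-up values):

```python
import pandas as pd

left = pd.DataFrame({'units': [1, 2]}, index=[1, 2])
right = pd.DataFrame({'crimes': [10, 30]}, index=[1, 3])

# Inner keeps only key 1 (present in both); outer keeps keys 1, 2, and 3
inner = left.merge(right, left_index=True, right_index=True, how='inner')
outer = left.merge(right, left_index=True, right_index=True, how='outer')
```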


In [24]:
housing_crime_aggregate = community_housing.merge(
        community_area_crime, left_index=True, right_index=True, how='outer')
housing_crime_aggregate.shape


Out[24]:
(77, 2)

In [25]:
housing_crime_aggregate


Out[25]:
Affordable Housing Units Crime Count
1 517 259
2 307 263
3 940 263
... ... ...
75 NaN 118
76 NaN 112
77 504 166

77 rows × 2 columns


In [63]:
df = pd.DataFrame(columns=['column1', 'asdf'], index=range(10))

In [70]:
df.loc[0, 'column1'] = 10  # use .loc for assignment; chained indexing like df.iloc[0]['column1'] = 10 may silently fail

In [71]:
df


Out[71]:
column1 asdf
0 10 NaN
1 NaN NaN
2 NaN NaN
... ... ...
7 NaN NaN
8 NaN NaN
9 NaN NaN

10 rows × 2 columns


















Rows without a match on the other side of the merge are filled with missing values. In our case a missing value means zero housing units or zero crimes, so we can fillna(0).


In [26]:
housing_crime_aggregate.fillna(0, inplace=True)
housing_crime_aggregate.plot(kind='scatter', x='Affordable Housing Units', y='Crime Count')


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f199d028ad0>