Data setup

Use our function to read the data file


In [3]:
def read_311_data(datafile):
    import pandas as pd
    import numpy as np
    
    #Add the fix_zip function
    def fix_zip(input_zip):
        try:
            input_zip = int(float(input_zip))
        except (ValueError, TypeError):
            try:
                #Handle ZIP+4 values like '10027-1234'
                input_zip = int(input_zip.split('-')[0])
            except (ValueError, AttributeError):
                return np.nan
        #Keep only plausible NYC zip codes
        if input_zip < 10000 or input_zip > 19999:
            return np.nan
        return str(input_zip)
    
    #Read the file
    df = pd.read_csv(datafile,index_col='Unique Key')
    
    #fix the zip
    df['Incident Zip'] = df['Incident Zip'].apply(fix_zip)
    
    #drop all rows that have any nans in them (note the easier syntax!)
    
    df = df.dropna(how='any')
    
    #get rid of unspecified boroughs
    df = df[df['Borough'] != 'Unspecified']
    
    #Convert times to datetime and create a processing time column
    
    import datetime
    df['Created Date'] = df['Created Date'].apply(lambda x:datetime.datetime.strptime(x,'%m/%d/%y %H:%M'))
    df['Closed Date'] = df['Closed Date'].apply(lambda x:datetime.datetime.strptime(x,'%m/%d/%y %H:%M'))
    df['processing_time'] =  df['Closed Date'] - df['Created Date']
    
    #Finally, get rid of negative processing times and return the final data frame
    
    df = df[df['processing_time']>=datetime.timedelta(0,0,0)]
    
    return df
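
pandas can also parse the date columns at read time, which would let us drop the two strptime lines above. A minimal sketch, assuming the same column names as in this dataset:

import pandas as pd
#Parse both date columns while reading the file
df = pd.read_csv(datafile, index_col='Unique Key',
                 parse_dates=['Created Date', 'Closed Date'])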

In [4]:
datafile = "nyc_311_data_subset-2.csv"
data = read_311_data(datafile)


/Users/Beck/anaconda/envs/tflearn/lib/python3.5/site-packages/IPython/core/interactiveshell.py:2821: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
  if self.run_code(code, result):
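
The warning is harmless here (fix_zip normalizes the mixed-type zip column anyway), but it can be silenced by telling pandas how to read the offending column up front. A small sketch, assuming 'Incident Zip' is that column:

import pandas as pd
#Force the zip column to be read as strings so pandas doesn't have to guess
df = pd.read_csv(datafile, index_col='Unique Key', dtype={'Incident Zip': str})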

Plotting data on Google Maps

gmplot library

https://github.com/vgm64/gmplot


In [138]:
!pip install gmplot --upgrade


Requirement already up-to-date: gmplot in /Users/hardeepjohar/anaconda/lib/python3.5/site-packages
Requirement already up-to-date: requests in /Users/hardeepjohar/anaconda/lib/python3.5/site-packages (from gmplot)
Requirement already up-to-date: chardet<3.1.0,>=3.0.2 in /Users/hardeepjohar/anaconda/lib/python3.5/site-packages (from requests->gmplot)
Requirement already up-to-date: idna<2.6,>=2.5 in /Users/hardeepjohar/anaconda/lib/python3.5/site-packages (from requests->gmplot)
Requirement already up-to-date: certifi>=2017.4.17 in /Users/hardeepjohar/anaconda/lib/python3.5/site-packages (from requests->gmplot)
Requirement already up-to-date: urllib3<1.22,>=1.21.1 in /Users/hardeepjohar/anaconda/lib/python3.5/site-packages (from requests->gmplot)

Our data dataframe contains a latitude and a longitude for each complaint.

We can draw a heatmap that shows the relative concentration of complaints using these latitudes and longitudes

Set up the map

GoogleMapPlotter constructor

  • GoogleMapPlotter(center_lat, center_lng, zoom)
  • from_geocode(location_string, zoom)

    In [6]:
    import gmplot
    #gmap = gmplot.GoogleMapPlotter(40.7128, -74.0059, 8)
    
    
    gmap = gmplot.GoogleMapPlotter.from_geocode("New York",10)
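
    Note that newer releases of gmplot require a Google Maps API key; the exact signature varies by version, so treat the line below as an assumption to check against your installed version:

    #In recent gmplot versions an API key must be supplied (placeholder key below)
    #gmap = gmplot.GoogleMapPlotter(40.7128, -74.0059, 10, apikey='YOUR_API_KEY')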
    

    Then generate the heatmap by passing the two data series (latitude and longitude) to the heatmap function

    
    
    In [7]:
    #Then generate a heatmap using the latitudes and longitudes
    gmap.heatmap(data['Latitude'], data['Longitude'])
    

    Save the heatmap to an HTML file

    The HTML file can be viewed, printed, or included in another HTML page

    
    
    In [8]:
    gmap.draw('incidents3.html')
    

    Let's do some grouping operations

    
    

    Incidents by Borough

    
    
    In [9]:
    %matplotlib inline
    

    Group data by borough and plot a bar chart of the incident count

    
    
    In [10]:
    borough_group = data.groupby('Borough')
    borough_group.size().plot(kind='bar')
    #other kinds include 'barh', 'line', and 'pie'
    
    
    
    
    Out[10]:
    <matplotlib.axes._subplots.AxesSubplot at 0x1213bba58>
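
    The same grouped counts work with other plot kinds; for instance, a horizontal bar chart sorted by count (a small sketch):

    #Sort ascending so the largest borough lands at the top of the chart
    borough_group.size().sort_values().plot(kind='barh')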

    Incidents by Agency

    
    
    In [11]:
    agency_group = data.groupby('Agency')
    agency_group.size().plot(kind='bar')
    
    
    
    
    Out[11]:
    <matplotlib.axes._subplots.AxesSubplot at 0x12137e6d8>

    Let's combine the two in a single graph

    
    
    In [12]:
    agency_borough = data.groupby(['Agency','Borough'])
    agency_borough.size().plot(kind='bar')
    
    
    
    
    Out[12]:
    <matplotlib.axes._subplots.AxesSubplot at 0x121082b70>

    This is quite unreadable and pointless!

    We can unstack the groups so that we get borough by agency

    
    
    In [13]:
    agency_borough.size().unstack().plot(kind='bar')
    
    
    
    
    Out[13]:
    <matplotlib.axes._subplots.AxesSubplot at 0x120976b38>

    Increase the size of the image and add a title

    
    
    In [14]:
    agency_borough = data.groupby(['Agency','Borough'])
    agency_borough.size().unstack().plot(kind='bar',title="Incidents in each Agency by Borough",figsize=(15,15))
    
    
    
    
    Out[14]:
    <matplotlib.axes._subplots.AxesSubplot at 0x1206e5c50>

    Digression: The pandas groupby function

    You can use functions to group data

    
    
    In [15]:
    import pandas as pd
    writers = pd.DataFrame({'Author':['George Orwell','John Steinbeck',
                                      'Pearl Buck','Agatha Christie'],
                            'Country':['UK','USA','USA','UK'],
                            'Gender':['M','M','F','F'],
                            'Age':[46,66,80,85]})
    
    
    
    In [16]:
    writers
    
    
    
    
    Out[16]:
       Age           Author Country Gender
    0   46    George Orwell      UK      M
    1   66   John Steinbeck     USA      M
    2   80       Pearl Buck     USA      F
    3   85  Agatha Christie      UK      F

    Group by country

    
    
    In [23]:
    grouped = writers.groupby('Country')
    #grouped.first()
    #grouped.last()
    #grouped.sum()
    #grouped.mean()
    grouped.apply(sum)
    
    
    
    
    Out[23]:
             Age                        Author Country Gender
    Country
    UK       131  George OrwellAgatha Christie    UKUK     MF
    USA      146      John SteinbeckPearl Buck  USAUSA     MF
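
    Applying the built-in sum concatenates the string columns, which is rarely what we want. For a numeric-only aggregate, select the column first (a sketch):

    #Mean age per country; only the numeric column is aggregated
    grouped['Age'].mean()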
    
    
    In [24]:
    grouped.groups
    
    
    
    
    Out[24]:
    {'UK': Int64Index([0, 3], dtype='int64'),
     'USA': Int64Index([1, 2], dtype='int64')}

    Group by multiple columns

    
    
    In [25]:
    grouped = writers.groupby(['Country','Gender'])
    grouped.groups
    
    
    
    
    Out[25]:
    {('UK', 'F'): Int64Index([3], dtype='int64'),
     ('UK', 'M'): Int64Index([0], dtype='int64'),
     ('USA', 'F'): Int64Index([2], dtype='int64'),
     ('USA', 'M'): Int64Index([1], dtype='int64')}
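
    Multi-column groups aggregate like any others; for example, counting the writers in each (Country, Gender) group:

    #Returns a Series with a (Country, Gender) MultiIndex
    grouped.size()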

    Group by age groups

    
    
    In [32]:
    def age_groups(df,index,col):
        print(index,col)
        if df[col].iloc[index] < 30:
            return 'Young'
        if df[col].iloc[index] < 60:
            return 'Middle'
        else:
            return 'Old'
    
    
    
    In [44]:
    writers['Age'].iloc[0]
    
    
    
    
    Out[44]:
    46
    
    
    In [34]:
    grouped = writers.groupby(lambda x: age_groups(writers,x,'Age'))
    grouped.groups
    
    
    
    
    0 Age
    1 Age
    2 Age
    3 Age
    
    Out[34]:
    {'Middle': Int64Index([0], dtype='int64'),
     'Old': Int64Index([1, 2, 3], dtype='int64')}
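
    A more idiomatic way to bin a numeric column, assuming the same thresholds of 30 and 60, is pd.cut:

    #Bin ages into labeled intervals, then group on the resulting Series
    bins = pd.cut(writers['Age'], bins=[0, 30, 60, 120], labels=['Young', 'Middle', 'Old'])
    writers.groupby(bins).groups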

    Grouping by the values in a column

    For example, grouping rows according to whether the value in a given column is greater than or less than zero

    
    
    In [36]:
    import numpy as np
    people = pd.DataFrame(np.random.randn(5, 5), columns=['a', 'b', 'c', 'd', 'e'], index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
    people
    
    
    
    
    Out[36]:
                   a         b         c         d         e
    Joe    -0.249511  0.829799  1.782724  1.175463 -0.494082
    Steve   0.359809 -0.235219 -3.068322 -0.183199 -0.598801
    Wes     0.932746 -1.146177 -0.672963 -0.348038 -0.760377
    Jim     0.775882 -1.659168 -0.301639 -0.000548  1.163943
    Travis -0.254044 -1.601734 -0.024121 -0.250024 -0.266271

    Write a function that takes three arguments - a dataframe, an index, and a column name - and returns the group for that row

    
    
    In [41]:
    def GroupColFunc(df, ind, col):
        if df[col].loc[ind] > 0:
            return 'Group1'
        else:
            return 'Group2'
    
    
    
    In [42]:
    people.groupby(lambda x: GroupColFunc(people, x, 'a')).groups
    
    
    
    
    Out[42]:
    {'Group1': Index(['Steve', 'Wes', 'Jim'], dtype='object'),
     'Group2': Index(['Joe', 'Travis'], dtype='object')}

    Now we can compute stats on these groups

    
    
    In [45]:
    print(people.groupby(lambda x: GroupColFunc(people, x, 'a')).mean())
    print(people.groupby(lambda x: GroupColFunc(people, x, 'a')).std())
    
    
    
    
                   a         b         c         d         e
    Group1  0.689479 -1.013521 -1.347641 -0.177262 -0.065078
    Group2 -0.251777 -0.385968  0.879301  0.462719 -0.380177
                   a         b         c         d         e
    Group1  0.296080  0.721184  1.501675  0.173821  1.067425
    Group2  0.003205  1.719354  1.277633  1.007971  0.161087
    

    Incidents by time

    We know the creation date of each incident, so we can build a bar graph of the number of incidents by month

    Not particularly useful with only a few months of data, but with all the data since 2010 we could use this sort of analysis to eyeball trends and seasonality

    We're going to need to do some data manipulation for this

    We'll start by creating a new date field yyyymm

    
    
    In [46]:
    import datetime
    data['yyyymm'] = data['Created Date'].apply(lambda x:datetime.datetime.strftime(x,'%Y%m'))
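
    Since Created Date is already a datetime column, the .dt accessor gives an equivalent one-liner:

    #Format each timestamp as 'YYYYMM' directly via the datetime accessor
    data['yyyymm'] = data['Created Date'].dt.strftime('%Y%m')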
    
    
    
    In [47]:
    data['yyyymm']
    
    
    
    
    Out[47]:
    Unique Key
    11        201609
    12        201609
    13        201609
    14        201609
    15        201609
    16        201609
    17        201609
    18        201609
    19        201609
    20        201609
    21        201609
    22        201609
    23        201609
    24        201609
    25        201609
    26        201609
    27        201609
    28        201609
    30        201609
    31        201609
    32        201609
    33        201609
    34        201609
    35        201609
    36        201609
    37        201609
    38        201609
    39        201609
    40        201609
    41        201609
               ...  
    971012    201609
    971014    201609
    971018    201609
    971021    201609
    971022    201609
    971023    201609
    971025    201609
    971031    201609
    971032    201609
    971034    201609
    971035    201609
    971036    201609
    971038    201609
    971040    201609
    971041    201609
    971043    201609
    971044    201609
    971046    201609
    971047    201609
    971049    201609
    971052    201609
    971053    201609
    971055    201609
    971056    201609
    971057    201609
    971059    201609
    971060    201609
    971061    201609
    971062    201609
    971063    201609
    Name: yyyymm, Length: 799325, dtype: object
    
    
    In [48]:
    date_agency = data.groupby(['yyyymm','Agency'])
    date_agency.size().unstack().plot(kind='bar',figsize=(15,15))
    
    
    
    
    Out[48]:
    <matplotlib.axes._subplots.AxesSubplot at 0x115bddc50>

    Examining agencies

    We'll look at incident frequency by agency and sort to find the top agencies

    
    
    In [49]:
    data.groupby('Agency').size().sort_values(ascending=False)
    
    
    
    
    Out[49]:
    Agency
    NYPD      273683
    HPD       244815
    DOT        66180
    DSNY       63321
    DEP        60346
    DOB        34821
    DHS        15083
    DOHMH      14188
    DPR        10830
    TLC         7129
    DCA         5760
    FDNY        1676
    DOF          579
    DOE          454
    DOITT        134
    EDC          123
    3/1/01       102
    DFTA         101
    dtype: int64
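
    To report just the top 5 agencies, chain .head(5) onto the sorted counts:

    #Keep only the five most frequent agencies
    data.groupby('Agency').size().sort_values(ascending=False).head(5)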
    
    
    In [50]:
    data.groupby('Agency').size().sort_values(ascending=False).plot(kind='bar', figsize=(20,4))
    
    
    
    
    Out[50]:
    <matplotlib.axes._subplots.AxesSubplot at 0x115c3b198>

    We can drill down into complaints by agency and borough

    
    
    In [51]:
    agency_borough = data.groupby(['Agency', 'Borough']).size().unstack()
    
    
    
    In [52]:
    agency_borough
    
    
    
    
    Out[52]:
    Borough    BRONX  BROOKLYN  MANHATTAN   QUEENS  STATEN ISLAND
    Agency
    3/1/01      17.0      28.0       23.0     28.0            6.0
    DCA        958.0    1532.0     1529.0   1547.0          194.0
    DEP       5837.0   17917.0    17315.0  15216.0         4061.0
    DFTA        21.0      33.0       24.0     21.0            2.0
    DHS        397.0    1130.0    12767.0    734.0           55.0
    DOB       5160.0   10993.0     6507.0  10567.0         1594.0
    DOE        129.0     127.0       49.0    136.0           13.0
    DOF        143.0     161.0      153.0    112.0           10.0
    DOHMH     2406.0    4481.0     3759.0   2814.0          728.0
    DOITT        7.0      18.0       91.0     18.0            NaN
    DOT       8683.0   19177.0     9673.0  22096.0         6551.0
    DPR       1416.0    2929.0     1103.0   3897.0         1485.0
    DSNY      6406.0   22208.0     6079.0  18125.0        10503.0
    EDC          1.0      62.0       41.0     15.0            4.0
    FDNY        39.0     127.0     1344.0    158.0            8.0
    HPD      77774.0   81382.0    51017.0  31080.0         3562.0
    NYPD     48837.0   88973.0    55841.0  69931.0        10101.0
    TLC        318.0    1238.0     4393.0   1146.0           34.0

    We can create 'top 5 Agency' subplots for each borough

    
    
    In [53]:
    #We'll arrange the subplots in three rows and two columns. 
    #Since we have only 5 boroughs, one plot will be blank
    COL_NUM = 2
    ROW_NUM = 3
    import matplotlib.pyplot as plt
    fig, axes = plt.subplots(ROW_NUM, COL_NUM, figsize=(12,12))
    
    for i, (label, col) in enumerate(agency_borough.iteritems()): 
        ax = axes[int(i/COL_NUM), i%COL_NUM]
        col = col.sort_values(ascending=False)[:5] 
        col.plot(kind='barh', ax=ax)
        ax.set_title(label)
    
    plt.tight_layout()
    
    
    
    
    
    
    In [54]:
    for i, (label, col) in enumerate(agency_borough.iteritems()): 
        print(i,label,col)
    
    
    
    
    0 BRONX Agency
    3/1/01       17.0
    DCA         958.0
    DEP        5837.0
    DFTA         21.0
    DHS         397.0
    DOB        5160.0
    DOE         129.0
    DOF         143.0
    DOHMH      2406.0
    DOITT         7.0
    DOT        8683.0
    DPR        1416.0
    DSNY       6406.0
    EDC           1.0
    FDNY         39.0
    HPD       77774.0
    NYPD      48837.0
    TLC         318.0
    Name: BRONX, dtype: float64
    1 BROOKLYN Agency
    3/1/01       28.0
    DCA        1532.0
    DEP       17917.0
    DFTA         33.0
    DHS        1130.0
    DOB       10993.0
    DOE         127.0
    DOF         161.0
    DOHMH      4481.0
    DOITT        18.0
    DOT       19177.0
    DPR        2929.0
    DSNY      22208.0
    EDC          62.0
    FDNY        127.0
    HPD       81382.0
    NYPD      88973.0
    TLC        1238.0
    Name: BROOKLYN, dtype: float64
    2 MANHATTAN Agency
    3/1/01       23.0
    DCA        1529.0
    DEP       17315.0
    DFTA         24.0
    DHS       12767.0
    DOB        6507.0
    DOE          49.0
    DOF         153.0
    DOHMH      3759.0
    DOITT        91.0
    DOT        9673.0
    DPR        1103.0
    DSNY       6079.0
    EDC          41.0
    FDNY       1344.0
    HPD       51017.0
    NYPD      55841.0
    TLC        4393.0
    Name: MANHATTAN, dtype: float64
    3 QUEENS Agency
    3/1/01       28.0
    DCA        1547.0
    DEP       15216.0
    DFTA         21.0
    DHS         734.0
    DOB       10567.0
    DOE         136.0
    DOF         112.0
    DOHMH      2814.0
    DOITT        18.0
    DOT       22096.0
    DPR        3897.0
    DSNY      18125.0
    EDC          15.0
    FDNY        158.0
    HPD       31080.0
    NYPD      69931.0
    TLC        1146.0
    Name: QUEENS, dtype: float64
    4 STATEN ISLAND Agency
    3/1/01        6.0
    DCA         194.0
    DEP        4061.0
    DFTA          2.0
    DHS          55.0
    DOB        1594.0
    DOE          13.0
    DOF          10.0
    DOHMH       728.0
    DOITT         NaN
    DOT        6551.0
    DPR        1485.0
    DSNY      10503.0
    EDC           4.0
    FDNY          8.0
    HPD        3562.0
    NYPD      10101.0
    TLC          34.0
    Name: STATEN ISLAND, dtype: float64
    

    Processing time

    We can compute simple statistics on processing time

    
    
    In [ ]:
    grouped = data[['processing_time','Borough']].groupby('Borough')
    
    
    
    In [ ]:
    grouped.describe()
    

    But it is easier to convert the timedelta processing_time column into floats (fractional days) for calculation purposes

    
    
    In [ ]:
    import numpy as np
    #Convert each timedelta to a float number of days
    data['float_time'] = data['processing_time'].apply(lambda x: x/np.timedelta64(1, 'D'))
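
    An equivalent conversion uses the .dt accessor on the timedelta column:

    #Total seconds divided by seconds-per-day gives fractional days
    data['float_time'] = data['processing_time'].dt.total_seconds() / (24 * 3600)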
    
    
    
    In [ ]:
    data
    

    Now we can compute stats easily

    
    
    In [ ]:
    grouped = data[['float_time','Agency']].groupby('Agency')
    grouped.mean().sort_values('float_time',ascending=False)
    
    
    
    In [ ]:
    data['float_time'].hist(bins=50)
    

    Other useful visualization libraries

    seaborn: https://seaborn.pydata.org/

    bokeh: http://bokeh.pydata.org/en/latest/

    plotly: https://plot.ly/
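
    As a small taste, seaborn can restyle the histogram above with one extra line (a sketch, assuming seaborn is installed):

    import seaborn as sns
    sns.set()  #apply seaborn's default styling to all matplotlib plots
    data['float_time'].hist(bins=50)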

    
    