In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import dateutil.parser

First, I made a mistake naming the data set! It's 2015 data, not 2014 data. But yes, still use 311-2014.csv. You can rename it.

Importing and preparing your data

Import your data, but only the first 200,000 rows. You'll also want to change the index to be a datetime based on the Created Date column - you'll want to check if it's already a datetime, and parse it if not.


In [2]:
df = pd.read_csv("311-2015.csv", nrows=200000, low_memory=False)

#Why can't we use parse_dates=True? --> If the date is in a format that pandas recognizes as a date like YYYYMMDD
#Keep the data set small and make sure everything works before running everything through it

In [3]:
df.head()


Out[3]:
Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address ... Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Garage Lot Name Ferry Direction Ferry Terminal Name Latitude Longitude Location
0 31015465 07/06/2015 10:58:27 AM 07/22/2015 01:07:20 AM DCA Department of Consumer Affairs Consumer Complaint Demand for Cash NaN 11360 27-16 203 STREET ... NaN NaN NaN NaN NaN NaN NaN 40.773540 -73.788237 (40.773539552542, -73.78823697228408)
1 30997660 07/03/2015 01:26:29 PM 07/03/2015 02:08:20 PM NYPD New York City Police Department Vending In Prohibited Area Residential Building/House 10019 200 CENTRAL PARK SOUTH ... NaN NaN NaN NaN NaN NaN NaN 40.767021 -73.979448 (40.76702142171206, -73.97944780718524)
2 31950223 11/09/2015 03:55:09 AM 11/09/2015 08:08:57 AM NYPD New York City Police Department Blocked Driveway No Access Street/Sidewalk 10453 1993 GRAND AVENUE ... NaN NaN NaN NaN NaN NaN NaN 40.852671 -73.910608 (40.85267061877697, -73.91060771362552)
3 31000038 07/03/2015 02:18:32 AM 07/03/2015 07:54:48 AM NYPD New York City Police Department Noise - Commercial Loud Music/Party Club/Bar/Restaurant 11372 84-16 NORTHERN BOULEVARD ... NaN NaN NaN NaN NaN NaN NaN 40.755774 -73.883262 (40.755773786469966, -73.88326243225418)
4 30995614 07/04/2015 12:03:27 AM 07/04/2015 03:33:09 AM NYPD New York City Police Department Noise - Street/Sidewalk Loud Talking Street/Sidewalk 11216 1057 BERGEN STREET ... NaN NaN NaN NaN NaN NaN NaN 40.676175 -73.951269 (40.67617516102934, -73.9512690004692)

5 rows × 53 columns


In [4]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 53 columns):
Unique Key                        200000 non-null int64
Created Date                      200000 non-null object
Closed Date                       188913 non-null object
Agency                            200000 non-null object
Agency Name                       200000 non-null object
Complaint Type                    200000 non-null object
Descriptor                        198197 non-null object
Location Type                     179328 non-null object
Incident Zip                      181049 non-null object
Incident Address                  152173 non-null object
Street Name                       152152 non-null object
Cross Street 1                    108035 non-null object
Cross Street 2                    107583 non-null object
Intersection Street 1             24790 non-null object
Intersection Street 2             24530 non-null object
Address Type                      177091 non-null object
City                              181095 non-null object
Landmark                          127 non-null object
Facility Type                     80031 non-null object
Status                            199998 non-null object
Due Date                          152018 non-null object
Resolution Description            198936 non-null object
Resolution Action Updated Date    188529 non-null object
Community Board                   200000 non-null object
Borough                           200000 non-null object
X Coordinate (State Plane)        175825 non-null float64
Y Coordinate (State Plane)        175825 non-null float64
Park Facility Name                200000 non-null object
Park Borough                      200000 non-null object
School Name                       200000 non-null object
School Number                     199907 non-null object
School Region                     197128 non-null object
School Code                       197128 non-null object
School Phone Number               200000 non-null object
School Address                    200000 non-null object
School City                       200000 non-null object
School State                      200000 non-null object
School Zip                        199999 non-null object
School Not Found                  151897 non-null object
School or Citywide Complaint      0 non-null float64
Vehicle Type                      34 non-null object
Taxi Company Borough              434 non-null object
Taxi Pick Up Location             3680 non-null object
Bridge Highway Name               1960 non-null object
Bridge Highway Direction          1959 non-null object
Road Ramp                         1946 non-null object
Bridge Highway Segment            2134 non-null object
Garage Lot Name                   143 non-null object
Ferry Direction                   86 non-null object
Ferry Terminal Name               215 non-null object
Latitude                          175825 non-null float64
Longitude                         175825 non-null float64
Location                          175825 non-null object
dtypes: float64(5), int64(1), object(47)
memory usage: 80.9+ MB

In [8]:
def parse_date(str_date):
    return dateutil.parser.parse(str_date)

df['created_datetime'] = df['Created Date'].apply(parse_date)
df.head()


Out[8]:
Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address ... Bridge Highway Direction Road Ramp Bridge Highway Segment Garage Lot Name Ferry Direction Ferry Terminal Name Latitude Longitude Location created_datetime
0 31015465 07/06/2015 10:58:27 AM 07/22/2015 01:07:20 AM DCA Department of Consumer Affairs Consumer Complaint Demand for Cash NaN 11360 27-16 203 STREET ... NaN NaN NaN NaN NaN NaN 40.773540 -73.788237 (40.773539552542, -73.78823697228408) 2015-07-06 10:58:27
1 30997660 07/03/2015 01:26:29 PM 07/03/2015 02:08:20 PM NYPD New York City Police Department Vending In Prohibited Area Residential Building/House 10019 200 CENTRAL PARK SOUTH ... NaN NaN NaN NaN NaN NaN 40.767021 -73.979448 (40.76702142171206, -73.97944780718524) 2015-07-03 13:26:29
2 31950223 11/09/2015 03:55:09 AM 11/09/2015 08:08:57 AM NYPD New York City Police Department Blocked Driveway No Access Street/Sidewalk 10453 1993 GRAND AVENUE ... NaN NaN NaN NaN NaN NaN 40.852671 -73.910608 (40.85267061877697, -73.91060771362552) 2015-11-09 03:55:09
3 31000038 07/03/2015 02:18:32 AM 07/03/2015 07:54:48 AM NYPD New York City Police Department Noise - Commercial Loud Music/Party Club/Bar/Restaurant 11372 84-16 NORTHERN BOULEVARD ... NaN NaN NaN NaN NaN NaN 40.755774 -73.883262 (40.755773786469966, -73.88326243225418) 2015-07-03 02:18:32
4 30995614 07/04/2015 12:03:27 AM 07/04/2015 03:33:09 AM NYPD New York City Police Department Noise - Street/Sidewalk Loud Talking Street/Sidewalk 11216 1057 BERGEN STREET ... NaN NaN NaN NaN NaN NaN 40.676175 -73.951269 (40.67617516102934, -73.9512690004692) 2015-07-04 00:03:27

5 rows × 54 columns


In [9]:
df.index = df['created_datetime']

In [10]:
df.head()


Out[10]:
Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address ... Bridge Highway Direction Road Ramp Bridge Highway Segment Garage Lot Name Ferry Direction Ferry Terminal Name Latitude Longitude Location created_datetime
created_datetime
2015-07-06 10:58:27 31015465 07/06/2015 10:58:27 AM 07/22/2015 01:07:20 AM DCA Department of Consumer Affairs Consumer Complaint Demand for Cash NaN 11360 27-16 203 STREET ... NaN NaN NaN NaN NaN NaN 40.773540 -73.788237 (40.773539552542, -73.78823697228408) 2015-07-06 10:58:27
2015-07-03 13:26:29 30997660 07/03/2015 01:26:29 PM 07/03/2015 02:08:20 PM NYPD New York City Police Department Vending In Prohibited Area Residential Building/House 10019 200 CENTRAL PARK SOUTH ... NaN NaN NaN NaN NaN NaN 40.767021 -73.979448 (40.76702142171206, -73.97944780718524) 2015-07-03 13:26:29
2015-11-09 03:55:09 31950223 11/09/2015 03:55:09 AM 11/09/2015 08:08:57 AM NYPD New York City Police Department Blocked Driveway No Access Street/Sidewalk 10453 1993 GRAND AVENUE ... NaN NaN NaN NaN NaN NaN 40.852671 -73.910608 (40.85267061877697, -73.91060771362552) 2015-11-09 03:55:09
2015-07-03 02:18:32 31000038 07/03/2015 02:18:32 AM 07/03/2015 07:54:48 AM NYPD New York City Police Department Noise - Commercial Loud Music/Party Club/Bar/Restaurant 11372 84-16 NORTHERN BOULEVARD ... NaN NaN NaN NaN NaN NaN 40.755774 -73.883262 (40.755773786469966, -73.88326243225418) 2015-07-03 02:18:32
2015-07-04 00:03:27 30995614 07/04/2015 12:03:27 AM 07/04/2015 03:33:09 AM NYPD New York City Police Department Noise - Street/Sidewalk Loud Talking Street/Sidewalk 11216 1057 BERGEN STREET ... NaN NaN NaN NaN NaN NaN 40.676175 -73.951269 (40.67617516102934, -73.9512690004692) 2015-07-04 00:03:27

5 rows × 54 columns


In [ ]:

What was the most popular type of complaint, and how many times was it filed?


In [3]:
df['Complaint Type'].value_counts().head(1)


Out[3]:
Blocked Driveway    21779
Name: Complaint Type, dtype: int64

Make a horizontal bar graph of the top 5 most frequent complaint types.


In [6]:
df['Complaint Type'].value_counts().head().sort_values().plot(kind='barh')


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x108f9a5f8>

Which borough has the most complaints per capita? Since it's only 5 boroughs, you can do the math manually.


In [11]:
df['Borough'].value_counts()


Out[11]:
BROOKLYN         57129
QUEENS           46824
MANHATTAN        42050
BRONX            29610
Unspecified      17000
STATEN ISLAND     7387
Name: Borough, dtype: int64

In [12]:
brooklyn_population = 2636735
manhattan_population = 1644518
queens_population = 2339150
bronx_population = 1455444
staten_island_population = 474558

#http://www1.nyc.gov/site/planning/data-maps/nyc-population/current-future-populations.page

In [13]:
brooklyn_complaints_pc = 57129 / brooklyn_population
manhattan_complaints_pc = 42050 / manhattan_population
bronx_complaints_pc = 29610 / bronx_population
queens_complaints_pc = 46824 / queens_population
staten_island_complaints_pc = 7387 / staten_island_population

print(brooklyn_complaints_pc, manhattan_complaints_pc, bronx_complaints_pc, queens_complaints_pc, staten_island_complaints_pc)
print("Manhattan whines the most")


0.02166656869196184 0.0255698022156036 0.020344307304162854 0.02001752773443345 0.015566063579162083
Manhattan whines the most

In [7]:
#Here's how to convert the data into a dataframe

borough_counts = pd.DataFrame(df['Borough'].value_counts())
borough_counts['name'] = borough_counts.index
borough_counts


Out[7]:
Borough name
BROOKLYN 57129 BROOKLYN
QUEENS 46824 QUEENS
MANHATTAN 42050 MANHATTAN
BRONX 29610 BRONX
Unspecified 17000 Unspecified
STATEN ISLAND 7387 STATEN ISLAND

According to your selection of data, how many cases were filed in March? How about May?


In [11]:
df['2015-03']['Unique Key'].count()


Out[11]:
15025

In [12]:
df['2015-05']['Unique Key'].count()


Out[12]:
49715

I'd like to see all of the 311 complaints called in on April 1st.

Surprise! We couldn't do this in class, but it was just a limitation of our data set


In [13]:
df['2015-04-01']['Unique Key'].count()


Out[13]:
573

What was the most popular type of complaint on April 1st?

What were the most popular three types of complaint on April 1st


In [14]:
df['2015-04-01']['Complaint Type'].value_counts().head(3)


Out[14]:
Illegal Parking     67
Street Condition    64
Blocked Driveway    58
Name: Complaint Type, dtype: int64

In [ ]:

What month has the most reports filed? How many? Graph it.


In [18]:
ax = df.resample('M')['Unique Key'].count().plot(kind='barh')
ax.set_yticklabels(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
ax.set_ylabel("Month")


Out[18]:
<matplotlib.text.Text at 0x1083ceda0>

What week of the year has the most reports filed? How many? Graph the weekly complaints.


In [ ]:
df.index.week

In [17]:
ax = df.groupby(by=df.index.week)['Unique Key'].count().plot()
ax.set_xlabel("Week")


Out[17]:
<matplotlib.text.Text at 0x1098d0208>

In [19]:
df.resample('W')['Unique Key'].count().plot()


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a1cc160>

In [ ]:

Noise complaints are a big deal. Use .str.contains to select noise complaints, and make an chart of when they show up annually. Then make a chart about when they show up every day (cyclic).


In [20]:
is_noise_complaints = df['Complaint Type'].str.contains("Noise")
ax = df[is_noise_complaints].resample('M').count().plot(y='Unique Key')
ax = df[is_noise_complaints].resample('D').count().plot(y='Unique Key')



In [21]:
#Using a noise df group by the hour 

#noise_df.groupby(by=noise_df.index.hour)['Unique Key'].count.plot()

Which were the top five days of the year for filing complaints? How many on each of those days? Graph it.


In [22]:
ax = df['created_datetime'].value_counts().head().plot(kind='barh')
ax.set_ylabel("Top 5 Days For Complaints")


Out[22]:
<matplotlib.text.Text at 0x10baefc88>

What hour of the day are the most complaints? Graph a day of complaints.


In [ ]:
#df['created_datetime'].resample('H').count()

df.groupby(by=df.index.hour).count().plot(y='Unique Key')


Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b845a58>

One of the hours has an odd number of complaints. What are the most common complaints at that hour, and what are the most common complaints the hour before and after?


In [25]:
#df.sort_values(by=df.index.hour).head()

midnight_df = df[df.index.hour == 0]
midnight_df.head()


Out[25]:
Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address ... Bridge Highway Direction Road Ramp Bridge Highway Segment Garage Lot Name Ferry Direction Ferry Terminal Name Latitude Longitude Location created_datetime
created_datetime
2015-07-04 00:03:27 30995614 07/04/2015 12:03:27 AM 07/04/2015 03:33:09 AM NYPD New York City Police Department Noise - Street/Sidewalk Loud Talking Street/Sidewalk 11216 1057 BERGEN STREET ... NaN NaN NaN NaN NaN NaN 40.676175 -73.951269 (40.67617516102934, -73.9512690004692) 2015-07-04 00:03:27
2015-07-09 00:00:00 31042454 07/09/2015 12:00:00 AM 07/20/2015 12:00:00 AM DOHMH Department of Health and Mental Hygiene Standing Water Other - Explain Below Other NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2015-07-09 00:00:00
2015-07-09 00:00:00 31037751 07/09/2015 12:00:00 AM NaN DOHMH Department of Health and Mental Hygiene Standing Water Puddle in Ground 3+ Family Apartment Building 10016 379 THIRD AVENUE ... NaN NaN NaN NaN NaN NaN 40.741537 -73.981163 (40.741536747969185, -73.98116258383294) 2015-07-09 00:00:00
2015-06-29 00:26:39 30956584 06/29/2015 12:26:39 AM 06/29/2015 04:27:24 AM NYPD New York City Police Department Noise - Park Loud Talking Park/Playground 11377 NaN ... NaN NaN NaN NaN NaN NaN 40.741280 -73.902565 (40.741280237793646, -73.90256544457489) 2015-06-29 00:26:39
2015-07-02 00:09:59 30986795 07/02/2015 12:09:59 AM 07/02/2015 12:37:47 AM NYPD New York City Police Department Noise - Street/Sidewalk Loud Music/Party Street/Sidewalk 10035 20 PALADINO AVENUE ... NaN NaN NaN NaN NaN NaN 40.800365 -73.931212 (40.80036497064086, -73.9312115560449) 2015-07-02 00:09:59

5 rows × 54 columns


In [28]:
midnight_df['Complaint Type'].value_counts().head(3)


Out[28]:
HEAT/HOT WATER    4534
Rodent            2112
PAINT/PLASTER     1946
Name: Complaint Type, dtype: int64

In [29]:
df[df.index.hour == 23]['Complaint Type'].value_counts().head(3)


Out[29]:
Noise - Street/Sidewalk    1599
Noise - Commercial         1503
Blocked Driveway            973
Name: Complaint Type, dtype: int64

In [30]:
df[df.index.hour == 1]['Complaint Type'].value_counts().head(3)


Out[30]:
Noise - Commercial         1025
Noise - Street/Sidewalk     897
Blocked Driveway            479
Name: Complaint Type, dtype: int64

So odd. What's the per-minute breakdown of complaints between 12am and 1am? You don't need to include 1am.


In [31]:
#Wrong becuase we don't want every minute of every day
#midnight_df.resample('T')['Unique Key'].count().plot()

midnight_df.groupby(by=midnight_df.index.minute)['Unique Key'].count().plot()


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x10bb4bf28>

Looks like midnight is a little bit of an outlier. Why might that be? Take the 5 most common agencies and graph the times they file reports at (all day, not just midnight).


In [36]:
real_midnight = df[(df.index.hour == 0) & (df.index.minute == 0) & (df.index.second == 0)]
len(real_midnight)


Out[36]:
17015

In [39]:
twelve_and_one_sec = df[(df.index.hour == 0) & (df.index.minute == 0) & (df.index.second == 1)]
len(twelve_and_one_sec)


Out[39]:
3

In [40]:
twelve_o_one = df[(df.index.hour == 0) & (df.index.minute == 2) & (df.index.second == 0)]
len(twelve_o_one)


Out[40]:
5

In [41]:
real_midnight['Agency'].value_counts()


Out[41]:
HPD      14492
DOHMH     2505
DOT         18
Name: Agency, dtype: int64

In [43]:
df['Agency'].value_counts().head()


Out[43]:
NYPD     80000
HPD      39388
DOT      22308
DPR      15505
DOHMH     8250
Name: Agency, dtype: int64

In [ ]:

Graph those same agencies on an annual basis - make it weekly. When do people like to complain? When does the NYPD have an odd number of complaints?


In [ ]:

Maybe the NYPD deals with different issues at different times? Check the most popular complaints in July and August vs the month of May. Also check the most common complaints for the Housing Preservation Bureau (HPD) in winter vs. summer.


In [45]:
df[df['Agency'] == 'NYPD']['2015-7':'2015-8']['Complaint Type'].value_counts().head()


Out[45]:
Illegal Parking            3444
Blocked Driveway           3258
Noise - Street/Sidewalk    3165
Noise - Commercial         1201
Noise - Vehicle             942
Name: Complaint Type, dtype: int64

In [46]:
df[df['Agency'] == 'NYPD']['2015-5']['Complaint Type'].value_counts().head()


Out[46]:
Blocked Driveway           4114
Illegal Parking            3975
Noise - Street/Sidewalk    3385
Noise - Commercial         2263
Noise - Vehicle            1232
Name: Complaint Type, dtype: int64

In [47]:
df[df['Agency'] == 'HPD']['2015-01':'2015-02']['Complaint Type'].value_counts().head()


Out[47]:
Blocked Driveway           318
Illegal Parking            219
Noise - Commercial          80
Derelict Vehicle            51
Noise - Street/Sidewalk     43
Name: Complaint Type, dtype: int64

In [48]:
df[df['Agency'] == 'HPD']['2015-05':'2015-08']['Complaint Type'].value_counts().head()


Out[48]:
HEAT/HOT WATER            4200
HPD Literature Request    2424
UNSANITARY CONDITION      2063
PAINT/PLASTER             2015
PLUMBING                  1555
Name: Complaint Type, dtype: int64

In [ ]: