In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import dateutil.parser
First, I made a mistake naming the data set! It's 2015 data, not 2014 data. But yes, still use
311-2014.csv
. You can rename it.
Import your data, but only the first 200,000 rows. You'll also want to change the index to be a datetime based on the Created Date column - you'll want to check if it's already a datetime, and parse it if not.
In [2]:
df = pd.read_csv("311-2014.csv", nrows=200000, low_memory = False)
In [3]:
df.head(3)
df.columns
Out[3]:
In [4]:
type(df['Created Date'][0])
print(df['Created Date'][0])
dateutil.parser.parse(df['Created Date'][0])
Out[4]:
In [5]:
def str_to_time(str_date):
datetype_date = dateutil.parser.parse(str_date)
return datetype_date
In [6]:
df['created_date'] = df['Created Date'].apply(str_to_time)
In [7]:
df.index=df['created_date']
df.head(3)
Out[7]:
What was the most popular type of complaint, and how many times was it filed?
In [8]:
df.groupby("Complaint Type").count()['Unique Key'].sort_values(ascending=False)
Out[8]:
Make a horizontal bar graph of the top 5 most frequent complaint types.
In [9]:
plt.style.use('ggplot')
ax = df.groupby("Complaint Type").count()['Unique Key'].sort_values(ascending=True).tail(5).plot(kind='barh', figsize=(10,7))
ax.set_title("5 most frequent complaint types")
ax.set_xlabel("How many times was it filed?")
Out[9]:
Which borough has the most complaints per capita? Since it's only 5 boroughs, you can do the math manually.
In [10]:
df.groupby('Borough').count()['Unique Key'].sort_values(ascending=False)
Out[10]:
In [11]:
#https://en.wikipedia.org/wiki/Brooklyn
inhabitants=[{'state': "BROOKLYN", 'inhabitants':2621793},
{'state': "QUEENS", 'inhabitants': 2321580},
{'state': "MANHATTAN", 'inhabitants': 1636268},
{'state': "BRONX", 'inhabitants': 1438159},
{'state': "STATEN ISLAND", 'inhabitants': 473279},]
inhabitantsdf=pd.DataFrame(inhabitants)
inhabitantsdf
Out[11]:
In [12]:
print("Complaints per capita: Brooklyn", 57129/2621793)
print("Complaints per capita: Queens", 46824/2321580)
print("Complaints per capita: Manhattan", 42050/1636268)
print("Complaints per capita: Bronx", 29610/1438159)
print("Complaints per capita: Staten Island", 7387/473279)
Manhattan has the most complaints per capita.
According to your selection of data, how many cases were filed in March? How about May?
In [13]:
march_cases = df["2015-03"]['Unique Key'].count()
may_cases = df["2015-05"]['Unique Key'].count()
print("Cases filed in March:", march_cases)
print("Cases filed in May:", may_cases)
I'd like to see all of the 311 complaints called in on April 1st.
Surprise! We couldn't do this in class, but it was just a limitation of our data set
In [14]:
df["2015-04-01"]
Out[14]:
What was the most popular type of complaint on April 1st?
In [15]:
df["2015-04-01"].groupby('Complaint Type').count()["Unique Key"].sort_values(ascending=False).head(1)
Out[15]:
What were the most popular three types of complaint on April 1st
In [16]:
df["2015-04-01"].groupby('Complaint Type').count()["Unique Key"].sort_values(ascending=False).head(3)
Out[16]:
What month has the most reports filed? How many? Graph it.
In [17]:
ax = df.resample('m')['Unique Key'].count().plot(figsize=(10,5))
ax.set_title("Reports filed per month")
ax.set_ylabel("Number of complaints")
ax.set_xlabel("Month")
Out[17]:
In [18]:
df.resample('m')['Unique Key'].count().max()
print("The month with most cases is May with",df.resample('m')['Unique Key'].count().max(), "cases.")
What week of the year has the most reports filed? How many? Graph the weekly complaints.
In [19]:
ax = df.resample('W')['Unique Key'].count().plot(figsize=(10,5))
ax.set_title("Reports filed per week")
ax.set_ylabel("Number of complaints")
ax.set_xlabel("Week")
# weeknumbers= list(range(1,52))
# ax.set_xticks(weeknumbers)
Out[19]:
In [20]:
print("The week with most cases is the second in May with",df.resample('W')['Unique Key'].count().max(), "cases.")
Noise complaints are a big deal. Use .str.contains
to select noise complaints, and make an chart of when they show up annually. Then make a chart about when they show up every day (cyclic).
In [21]:
noise_df = df[df["Complaint Type"].str.contains("Noise ")]
In [22]:
noiseax = noise_df.resample('W')['Unique Key'].count().plot(figsize=(10,5))
noiseax.set_title("Noise complaints filed over the year")
noiseax.set_ylabel("Number of noise-related complaints")
noiseax.set_xlabel("Time of year")
Out[22]:
In [23]:
noisedayax = noise_df.groupby(by=noise_df.index.hour)['Unique Key'].count().plot(figsize=(10,5))
noisedayax.set_title("Hour when noise complaints are filed")
noisedayax.set_ylabel("Number of noise complaints")
noisedayax.set_xticks([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23])
noisedayax.set_xlabel("Hour of the day")
Out[23]:
Which were the top five days of the year for filing complaints? How many on each of those days? Graph it.
In [24]:
df.resample('D')['Unique Key'].count().sort_values(ascending=False).head(5)
Out[24]:
In [25]:
df.resample('D')['Unique Key'].count().sort_values().tail(5).plot(kind='barh')
Out[25]:
What hour of the day are the most complaints? Graph a day of complaints.
In [26]:
df.groupby(by=df.index.hour)['Unique Key'].count()
Out[26]:
In [27]:
ax = df.groupby(by=df.index.hour)['Unique Key'].count().plot(figsize=(10,5))
ax.set_title("Hour when complaints are filed")
ax.set_ylabel("Number of complaints")
ax.set_xticks([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23])
ax.set_xlabel("Hour of the day")
Out[27]:
One of the hours has an odd number of complaints. What are the most common complaints at that hour, and what are the most common complaints the hour before and after?
In [28]:
# Midnight is is an outlier, complaint-number-wise
hourly_df = pd.DataFrame(df.groupby(df.index.hour)['Complaint Type'].value_counts())
hourly_df
Out[28]:
In [29]:
#Most common complaint types at midnight
hourly_df['Complaint Type'][0].head(5)
Out[29]:
In [30]:
#Most common complaint types the hour before midnight
hourly_df['Complaint Type'][23].head(5)
Out[30]:
In [31]:
#Most common complaint types the hour after midnight
hourly_df['Complaint Type'][1].head(5)
Out[31]:
So odd. What's the per-minute breakdown of complaints between 12am and 1am? You don't need to include 1am.
In [32]:
midnight_df = df[df.index.hour==0]
minutely_df = midnight_df.groupby(by=midnight_df.index.minute)
minax = minutely_df['Unique Key'].count().plot(figsize=(15,5))
minax.set_title("Complaints filed per minute during midnight hour")
minax.set_xlabel("Minutes of the hour")
minax.set_ylabel("Numbers of complaints filed")
minax.set_xticks([0,5,10,15,20,25,30,35,40,45,50,55,60])
Out[32]:
Looks like midnight is a little bit of an outlier. Why might that be? Take the 5 most common agencies and graph the times they file reports at (all day, not just midnight).
In [33]:
midnight_df['Agency'].value_counts().sort_values(ascending=False).head(5)
Out[33]:
In [53]:
#Write test code for first agency
hpd_df = df[df['Agency'] == 'HPD']
ax = hpd_df.groupby(by=hpd_df.index.hour).count()['Unique Key'].plot(figsize=(12,7), label= 'HPD', legend=True, linewidth=2)
ax.set_title("Complaints per agency and time of the day")
ax.set_xlabel("Hour of the day")
ax.set_xticks([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23])
ax.set_ylabel("Number of complaints filed")
#Turn working code into a function:
def complaints_by_agency(agency_name):
complaints_agency = df[df['Agency'] == agency_name]
return complaints_agency.groupby(by=complaints_agency.index.hour).count()['Unique Key']
#run code for remaining agencies
for agency in ['NYPD', 'DOHMH', 'TLC', 'DOT']:
complaints_by_agency(agency).plot(ax=ax, label = agency, legend=True, linewidth=2)
Graph those same agencies on an annual basis - make it weekly. When do people like to complain? When does the NYPD have an odd number of complaints?
In [70]:
#Write test code for first agency
#Copied code from above replacing groupby by resample('W')
hpd_df = df[df['Agency'] == 'HPD']
ax = hpd_df.resample('W')['Unique Key'].count().plot(figsize=(12,7), label= 'HPD', legend=True, linewidth=2)
ax.set_title("Complaints per agency and time of the year")
ax.set_xlabel("Week of the year")
ax.set_ylabel("Number of complaints filed")
#Turn working code into a function:
def agency_complaints_weekly(agency_name):
complaints_weekly = df[df['Agency'] == agency_name]
return complaints_weekly.resample('W')['Unique Key'].count()
#run code for remaining agencies
for agency in ['NYPD', 'DOHMH', 'TLC', 'DOT']:
agency_complaints_weekly(agency).plot(ax=ax, label = agency, legend=True, linewidth=2)
NYPD has an odd number after the first week in May
Maybe the NYPD deals with different issues at different times? Check the most popular complaints in July and August vs the month of May. Also check the most common complaints for the Housing Preservation Bureau (HPD) in winter vs. summer.
In [105]:
def ag_complaints(agency_name, str_date_s, str_date_e):
newdf = df[df['Agency'] == agency_name]
newdf.resample('M')
return newdf[str_date_s : str_date_e].groupby("Complaint Type")['Unique Key'].count().sort_values(ascending=False).head(10)
In [108]:
print("NYPD most popular complaints in July and August:")
ag_complaints('NYPD', '2015-07', '2015-08')
Out[108]:
In [107]:
print("NYPD most popular complaints in May:")
ag_complaints('NYPD', '2015-05', '2015-05')
Out[107]:
In [106]:
print("Most common complaints filed at the Housing Preservation Bureau (HPD) in winter")
ag_complaints('HPD', '2015-11', '2016-02')
Out[106]:
In [104]:
print("Most common complaints filed at the Housing Preservation Bureau (HPD) in summer")
ag_complaints('HPD', '2015-05', '2016-09')
Out[104]: