In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import dateutil.parser
First, I made a mistake naming the data set! It's 2015 data, not 2014 data. But yes, still use
311-2014.csv
. You can rename it.
Import your data, but only the first 200,000 rows. You'll also want to change the index to be a datetime based on the Created Date column - you'll want to check if it's already a datetime, and parse it if not.
In [216]:
df = pd.read_csv('311-2015.csv', dtype = str)
In [217]:
df.head()
Out[217]:
In [133]:
import datetime
In [134]:
def created_date_to_datetime(date_str):
return datetime.datetime.strptime(date_str, '%m/%d/%Y %I:%M:%S %p')
In [218]:
df['created_datetime'] = df['Created Date'].apply(created_date_to_datetime)
In [219]:
df = df.set_index('created_datetime')
What was the most popular type of complaint, and how many times was it filed?
In [220]:
freq_complaints = df[['Unique Key', 'Complaint Type']].groupby('Complaint Type').count().sort_values('Unique Key', ascending=False).head()
In [221]:
freq_complaints
Out[221]:
Make a horizontal bar graph of the top 5 most frequent complaint types.
In [222]:
ax = freq_complaints.plot(kind = 'barh', legend = False)
ax.set_title('5 Most Frequent 311 Complaints')
ax.set_xlabel('Number of Complaints in 2015')
Out[222]:
Which borough has the most complaints per capita? Since it's only 5 boroughs, you can do the math manually.
In [223]:
df[['Unique Key', 'Borough']].groupby('Borough').count().sort_values('Unique Key', ascending = False)
Out[223]:
According to your selection of data, how many cases were filed in March? How about May?
In [224]:
cases_in_mar = df[df.index.month == 3]['Unique Key'].count()
print('There were', cases_in_mar, 'cases filed in March.')
In [225]:
cases_in_may = df[df.index.month == 5]['Unique Key'].count()
print('There were', cases_in_may, 'cases filed in May.')
I'd like to see all of the 311 complaints called in on April 1st.
Surprise! We couldn't do this in class, but it was just a limitation of our data set
In [226]:
april_1_complaints = df[(df.index.month == 4) & (df.index.day == 1)][['Unique Key', 'Created Date', 'Complaint Type', 'Descriptor']]
april_1_complaints
Out[226]:
What was the most popular type of complaint on April 1st?
In [227]:
april_1_complaints[['Unique Key', 'Complaint Type']].groupby('Complaint Type').count().sort_values('Unique Key', ascending = False).head(1)
Out[227]:
What were the most popular three types of complaint on April 1st
In [228]:
april_1_complaints[['Unique Key', 'Complaint Type']].groupby('Complaint Type').count().sort_values('Unique Key', ascending = False).head(3)
Out[228]:
What month has the most reports filed? How many? Graph it.
In [229]:
complaints_by_month = df['Unique Key'].groupby(df.index.month).count()
complaints_by_month
Out[229]:
In [230]:
ax = complaints_by_month.plot()
ax.set_xticks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
ax.set_xticklabels(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
ax.set_ylabel('Number of Complaints')
ax.set_title('311 Complaints by Month in 2015')
# x_values = df.groupby(df.index.month).median().index
min_values = 0
max_values = complaints_by_month
ax.fill_between(x_values, min_values, max_values, alpha = 0.4)
Out[230]:
What week of the year has the most reports filed? How many? Graph the weekly complaints.
In [231]:
complaints_by_week = df['Unique Key'].groupby(df.index.week).count()
complaints_by_week
Out[231]:
In [232]:
ax = complaints_by_week.plot()
ax.set_xticks(range(1,53))
ax.set_xticklabels(['', '', '', '', '5',
'', '', '', '', '10',
'', '', '', '', '15',
'', '', '', '', '20',
'', '', '', '', '25',
'', '', '', '', '30',
'', '', '', '', '35',
'', '', '', '', '40',
'', '', '', '', '45',
'', '', '', '', '50',
'', '',])
ax.set_ylabel('Number of Complaints')
ax.set_xlabel('Week of the Year')
ax.set_title('311 Complaints by Week in 2015')
Out[232]:
Noise complaints are a big deal. Use .str.contains
to select noise complaints, and make an chart of when they show up annually. Then make a chart about when they show up every day (cyclic).
In [233]:
noise_complaints = df[df['Complaint Type'].str.contains('Noise') == True]
In [234]:
noise_complaints_hour = noise_complaints['Unique Key'].groupby(noise_complaints.index.hour).count()
In [235]:
ax = noise_complaints_hour.plot()
ax.set_xticks(range(0,24))
ax.set_xticklabels(['Midnight', '', '', '', '', '',
'6 am', '', '', '', '', '',
'Noon', '', '', '', '', '',
'6 pm', '', '', '', '', ''])
ax.set_ylabel('Number of Complaints')
ax.set_xlabel('Time of Day')
ax.set_title('Noise Complaints by Time of Day in 2015')
Out[235]:
Which were the top five days of the year for filing complaints? How many on each of those days? Graph it.
In [236]:
top_complaining_days = df['Unique Key'].resample('D').count().sort_values(ascending = False).head()
top_complaining_days
Out[236]:
In [237]:
ax = top_complaining_days.plot(kind = 'barh')
ax.set_ylabel('Date')
ax.set_xlabel('Number of Complaints')
ax.set_title('The Top 5 Days for 311 Complaints in 2015')
Out[237]:
In [238]:
complaining_days = df['Unique Key'].resample('D').count()
ax = complaining_days.plot()
ax.set_ylabel('Number of Complaints')
ax.set_xlabel('Day of Year')
ax.set_title('Noise Complaints by Day in 2015')
Out[238]:
Interesting—it looks cyclical. Let's see what day of the week is most popular:
In [243]:
df.index[1]
Out[243]:
In [254]:
def get_day_of_wk(timestamp):
return datetime.datetime.strftime(timestamp, '%a')
In [255]:
df['datetime'] = df.index
In [256]:
df['day_of_wk'] = df['datetime'].apply(get_day_of_wk)
In [258]:
complaining_day_of_wk = df[['Unique Key', 'day_of_wk']].groupby('day_of_wk').count()
In [272]:
complaining_day_of_wk['number_of_day'] = [6, 2, 7, 1, 5, 3, 4]
In [275]:
complaining_day_of_wk_sorted = complaining_day_of_wk.sort_values('number_of_day')
In [277]:
complaining_day_of_wk_sorted
Out[277]:
In [282]:
ax = complaining_day_of_wk_sorted.plot(y = 'Unique Key', legend = False)
What hour of the day are the most complaints? Graph a day of complaints.
In [283]:
hourly_complaints = df['Unique Key'].groupby(df.index.hour).count()
In [284]:
ax = hourly_complaints.plot()
ax.set_xticks(range(0,24))
ax.set_xticklabels(['Midnight', '', '', '', '', '',
'6 am', '', '', '', '', '',
'Noon', '', '', '', '', '',
'6 pm', '', '', '', '', ''])
ax.set_ylabel('Number of Complaints')
ax.set_xlabel('Time of Day')
ax.set_title('311 Complaints by Time of Day in 2015')
Out[284]:
One of the hours has an odd number of complaints. What are the most common complaints at that hour, and what are the most common complaints the hour before and after?
In [285]:
# 11 pm
df[df.index.hour == 23][['Unique Key', 'Complaint Type', 'Descriptor']].groupby('Complaint Type').count().sort_values('Unique Key', ascending = False).head()
Out[285]:
In [286]:
# 12 am
df[df.index.hour == 0][['Unique Key', 'Complaint Type', 'Descriptor']].groupby('Complaint Type').count().sort_values('Unique Key', ascending = False).head()
Out[286]:
In [287]:
# 1 am
df[df.index.hour == 1][['Unique Key', 'Complaint Type', 'Descriptor']].groupby('Complaint Type').count().sort_values('Unique Key', ascending = False).head()
Out[287]:
So odd. What's the per-minute breakdown of complaints between 12am and 1am? You don't need to include 1am.
In [291]:
midnight_complaints = df[df.index.hour == 0][['Unique Key', 'Complaint Type']]
for minute in range(0, 60):
top_complaint = midnight_complaints[midnight_complaints.index.minute == minute].groupby('Complaint Type').count().sort_values('Unique Key', ascending = False).head(1)
if minute < 10:
minute = '0' + str(minute)
else:
minute = str(minute)
print('12:' + minute + '\'s top complaint was:', top_complaint)
print('')
# hourly_complaints = df['Unique Key'].groupby(df.index.hour).count()
Looks like midnight is a little bit of an outlier. Why might that be? Take the 5 most common agencies and graph the times they file reports at (all day, not just midnight).
5 Top Agencies:
In [292]:
df[['Unique Key', 'Agency']].groupby('Agency').count().sort_values('Unique Key', ascending = False).head()
Out[292]:
In [293]:
def agency_hourly_complaints(agency_name_str):
agency_complaints = df[df['Agency'] == agency_name_str]
return agency_complaints['Unique Key'].groupby(agency_complaints.index.hour).count()
In [294]:
ax = agency_hourly_complaints('HPD').plot(label = 'HPD', legend = True)
for x in ['NYPD', 'DOT', 'DEP', 'DSNY']:
agency_hourly_complaints(x).plot(ax = ax, label = x, legend = True)
Graph those same agencies on an annual basis - make it weekly. When do people like to complain? When does the NYPD have an odd number of complaints?
In [295]:
def agency_weekly_complaints(agency_name_str):
agency_complaints = df[df['Agency'] == agency_name_str]
return agency_complaints['Unique Key'].groupby(agency_complaints.index.week).count()
In [296]:
ax = agency_weekly_complaints('NYPD').plot(label = 'NYPD', legend = True)
for x in ['DOT', 'HPD', 'DPR', 'DSNY']:
agency_weekly_complaints(x).plot(ax = ax, label = x, legend = True)
In [299]:
NYPD_complaints = df[df['Agency'] == 'NYPD']
NYPD_weekly_complaints = NYPD_complaints['Unique Key'].groupby(NYPD_complaints.index.week).count()
NYPD_weekly_complaints[NYPD_weekly_complaints > 13000]
Out[299]:
In [300]:
# # a way to use the function agency_weekly_complaints that's actually longer than not using it.
# week_number = 0
# for week in agency_weekly_complaints('NYPD'):
# week_number += 1
# if week > 1500:
# print('In week', week_number)
# print('there were', week, 'complaints.')
# print('')
It looks like complaints are most popular in May, June, September—generally in the summer.
In [301]:
NYPD_weekly_complaints[NYPD_weekly_complaints < 6000]
Out[301]:
It looks like complaints are least popular in around Christmas and New Year's.
Maybe the NYPD deals with different issues at different times? Check the most popular complaints in July and August vs the month of May. Also check the most common complaints for the Housing Preservation Bureau (HPD) in winter vs. summer.
Most popular NYPD complaints in July and August:
In [302]:
NYPD_complaints = df[df['Agency'] == 'NYPD']
NYPD_jul_aug_complaints = NYPD_complaints[(NYPD_complaints.index.month == 7) | (NYPD_complaints.index.month == 8)][['Unique Key', 'Complaint Type']]
NYPD_jul_aug_complaints.groupby('Complaint Type').count().sort_values('Unique Key', ascending = False).head()
Out[302]:
Most popular NYPD complaints in May:
In [303]:
NYPD_complaints = df[df['Agency'] == 'NYPD']
NYPD_may_complaints = NYPD_complaints[(NYPD_complaints.index.month == 5)][['Unique Key', 'Complaint Type']]
NYPD_may_complaints.groupby('Complaint Type').count().sort_values('Unique Key', ascending = False).head()
Out[303]:
Most popular HPD complaints in June, July, and August:
In [304]:
HPD_complaints = df[df['Agency'] == 'HPD']
HPD_jun_jul_aug_complaints = HPD_complaints[(HPD_complaints.index.month == 6) |
(HPD_complaints.index.month == 7) |
(HPD_complaints.index.month == 8)][['Unique Key', 'Complaint Type']]
HPD_jun_jul_aug_complaints.groupby('Complaint Type').count().sort_values('Unique Key', ascending = False).head()
Out[304]:
Most popular HPD complaints in December, January, and February:
In [305]:
HPD_complaints = df[df['Agency'] == 'HPD']
HPD_dec_jan_feb_complaints = HPD_complaints[(HPD_complaints.index.month == 12) |
(HPD_complaints.index.month == 1) |
(HPD_complaints.index.month == 2)][['Unique Key', 'Complaint Type']]
HPD_dec_jan_feb_complaints.groupby('Complaint Type').count().sort_values('Unique Key', ascending = False).head()
Out[305]: