This notebook plays around with the 311 data from the Western Pennsylvania Regional Data Center.
I have taken the liberty of downloading the 311 data into the data/ directory.
In [ ]:
# use the %ls magic to list the files in the current directory.
%ls
In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [3]:
three11s = pd.read_csv("data/pgh-311.csv", parse_dates=['CREATED_ON'])
In [4]:
three11s.dtypes
Out[4]:
In [5]:
three11s.head()
Out[5]:
In [6]:
three11s.loc[0]
Out[6]:
In [7]:
# Plot the number of 311 requests per month
month_counts = three11s.groupby(three11s.CREATED_ON.dt.month)
y = month_counts.size()
x = month_counts.CREATED_ON.first()
axes = pd.Series(y.values, index=x).plot(figsize=(15,5))
plt.ylim(0)
plt.xlabel('Month')
plt.ylabel('Complaints')
Out[7]:
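One caveat: grouping on CREATED_ON.dt.month folds the same calendar month from different years into a single bucket. If the data spans more than one year, grouping by a monthly period keeps the years apart. A minimal sketch, assuming the three11s frame from above:
In [ ]:
# Group by calendar month (year and month) rather than month number,
# so e.g. January of different years stays separate.
per_month = three11s.groupby(three11s.CREATED_ON.dt.to_period('M')).size()
per_month.plot(figsize=(15,5))
plt.ylim(0)
plt.xlabel('Month')
plt.ylabel('Complaints')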
In [8]:
grouped_by_type = three11s.groupby(three11s.REQUEST_TYPE)
size = grouped_by_type.size()
size
#len(size)
#size[size > 200]
Out[8]:
There are too many request types (268) to make sense of at a glance. We need some higher-level categories to make this more comprehensible. Fortunately, there is an Issue and Category codebook that we can use to map the low-level request types onto higher-level categories.
In [9]:
codebook = pd.read_csv('data/codebook.csv')
codebook.head()
Out[9]:
In [10]:
merged_data = pd.merge(three11s,
                       codebook[['Category', 'Issue']],
                       how='left',
                       left_on="REQUEST_TYPE",
                       right_on="Issue")
In [11]:
merged_data.head()
Out[11]:
In [12]:
grouped_by_type = merged_data.groupby(merged_data.Category)
size = grouped_by_type.size()
size
Out[12]:
That is a more manageable list of categories for data visualization. Let's take a look at the distribution of requests per category in the dataset.
In [13]:
size.plot(kind='barh', figsize=(8,6))
Out[13]:
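By default the bars appear in the index's alphabetical order; sorting by count first makes the ranking easier to read. For example:
In [ ]:
# Sort by count so the longest bar ends up at the top
size.sort_values().plot(kind='barh', figsize=(8,6))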
In [15]:
merged_data.groupby(merged_data.NEIGHBORHOOD).size().sort_values(ascending=False)
Out[15]:
The same counts in graph form:
In [16]:
merged_data.groupby(merged_data.NEIGHBORHOOD).size().sort_values(
    ascending=True).plot(kind="barh", figsize=(5,20))
Out[16]:
So we can see from the graph above that Brookline, followed by the South Side Slopes, Carrick, and the South Side Flats, makes the most 311 requests. It would be interesting to get some neighborhood population data and compute the number of requests per capita.
I bet those data are available; maybe YOU could create that graph!
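Here is one way that per-capita calculation could look. The file name and column names below are hypothetical; adapt them to whatever population data you actually find:
In [ ]:
# A sketch of requests per capita. 'data/neighborhood-populations.csv'
# and its 'Neighborhood'/'Population' columns are hypothetical.
pop = pd.read_csv('data/neighborhood-populations.csv', index_col='Neighborhood')
requests = merged_data.groupby(merged_data.NEIGHBORHOOD).size()
per_capita = (requests / pop['Population']).dropna()
per_capita.sort_values().plot(kind='barh', figsize=(5,20))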
Jupyter Notebooks have a very powerful widget framework that allows you to easily add interactive components to live notebooks.
In [17]:
# create a function that generates a chart of requests per neighborhood
def issues_by_neighborhood(neighborhood):
    """Generate a plot of issue categories for one neighborhood."""
    subset = merged_data[merged_data['NEIGHBORHOOD'] == neighborhood]
    size = subset.groupby('Category').size()
    size.plot(kind='barh', figsize=(8,6))
In [18]:
issues_by_neighborhood('Greenfield')
In [19]:
issues_by_neighborhood('Brookline')
In [20]:
issues_by_neighborhood('Garfield')
In [ ]:
from ipywidgets import interact
@interact(hood=sorted(three11s.NEIGHBORHOOD.dropna().unique()))
def issues_by_neighborhood(hood):
    """Generate a plot of issue categories for the selected neighborhood."""
    subset = merged_data[merged_data['NEIGHBORHOOD'] == hood]
    size = subset.groupby('Category').size()
    size.plot(kind='barh', figsize=(8,6))
In [ ]: