The goals for this hack are:
First, we will import all the modules we will need.
I especially like pandas and seaborn.
In [1]:
%matplotlib inline
import pandas as pd
import seaborn as sns
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt # side-stepping mpl backend
import warnings
warnings.filterwarnings("ignore") #YOLO
sns.set_context('notebook', font_scale=1.5)
Next, we need to import the data. I accessed the 311 open data at Miami-Dade County’s Open Data Portal.
I saved it as a .csv file.
In [2]:
!ls data/
In [3]:
dat = pd.read_csv('data/311_Service_Requests_-_Miami-Dade_County.csv')
It has almost 400,000 rows!
In [4]:
len(dat.City)
Out[4]:
In [5]:
dat.head()
Out[5]:
In [6]:
dat.columns
Out[6]:
In [7]:
plt.plot(dat['Goal Days'], dat['Actual Completed Days'], '.', alpha=0.1)
plt.xlim(0, 30)
plt.ylim(0, 30)
plt.title('Does 311 meet its goals?')
plt.xlabel('Goal (Days)')
plt.ylabel('Actual (Days)')
Out[7]:
Yuck, that scatter plot is hard to read because so many points overlap. Let's make a 2D histogram (a Hess diagram) instead.
In [8]:
H, xbins, ybins = np.histogram2d(dat['Goal Days'], dat['Actual Completed Days'],
bins=(np.linspace(-60, 60, 120),
np.linspace(-60, 60, 120)))
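A note on the array that comes back from np.histogram2d: its first axis follows the first argument (x) and its second axis follows the second argument (y), which is why the array is transposed before it is handed to imshow() in the plotting cell. Here is a minimal sketch on made-up points; the names x, y, x_edges, y_edges and H_toy are for illustration only and are not part of the 311 data.

```python
import numpy as np

# Toy data (made up for illustration): three points in the plane.
x = np.array([0.5, 0.5, 2.5])
y = np.array([1.5, 1.5, 0.5])

# Three unit-width bins along x, two along y.
x_edges = np.array([0.0, 1.0, 2.0, 3.0])
y_edges = np.array([0.0, 1.0, 2.0])

H_toy, xe, ye = np.histogram2d(x, y, bins=(x_edges, y_edges))

print(H_toy.shape)    # first axis follows x, second axis follows y
print(H_toy[0, 1])    # count in the first x-bin and second y-bin
print(H_toy.T.shape)  # transposed so rows follow y, as imshow() draws them
```

The number of bins along each axis is always one less than the number of edges passed in.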
In [9]:
sns.set_style('dark')
In [10]:
# Create a black and white color map where bad data (NaNs) are white
cmap = plt.cm.bone
cmap.set_bad('w', 1.)
# Use the image display function imshow() to plot the result
fig, ax = plt.subplots(figsize=(7, 7))
H[H == 0] = 1 # prevent warnings in log10
ax.imshow(np.log10(H).T, origin='lower',
extent=[xbins[0], xbins[-1], ybins[0], ybins[-1]],
cmap=cmap, interpolation='nearest',
aspect='auto')
ax.plot([0,60],[0, 60], 'r--')
plt.xlim(0,60)
plt.ylim(0,60)
plt.xlabel('Goal (days)')
plt.ylabel('Actual (days)')
Out[10]:
That looks better. The red dashed line is the 1:1 line. Above this line, 311 took longer than its goal to complete the task; on or below it, 311 met its goal. It looks like most 311 tasks fall below the 1:1 line, but let's compute the fraction directly.
In [11]:
dat.columns
Out[11]:
In [12]:
diff = dat['Actual Completed Days'] - dat['Goal Days']
In [13]:
good_diff = diff[diff == diff] # NaN != NaN, so this keeps only rows where both columns are present
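The filtering step above drops rows with missing values. Here is a minimal sketch on made-up numbers (toy_diff is a hypothetical stand-in, not the real data) suggesting that a self-comparison mask and pandas' dropna() select the same rows, since NaN never compares equal to itself.

```python
import numpy as np
import pandas as pd

# Toy differences (made up for illustration), with two missing values.
toy_diff = pd.Series([-2.0, np.nan, 3.0, 0.0, np.nan])

# NaN is the only value that is not equal to itself,
# so this mask is False exactly where the value is missing.
mask = toy_diff == toy_diff

kept_by_mask = toy_diff[mask]
kept_by_dropna = toy_diff.dropna()

print(mask.tolist())
print(kept_by_mask.equals(kept_by_dropna))
```

Either form works here; dropna() simply states the intent more directly.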
In [18]:
plt.hist(good_diff, range=(-100, 100), bins=100)
plt.yscale('linear')
plt.xlabel('Actual - Goal')
plt.ylabel('$N$')
Out[18]:
In [15]:
n_good = np.sum(good_diff <= 0)
n_bad = np.sum(good_diff > 0)
n_tot = len(good_diff)
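The counts above work because a boolean comparison sums as 0s and 1s. Here is a small sketch on made-up numbers (toy, n_on_time, n_late and pct_on_time are illustrative names) showing the same counting, along with the equivalent shortcut of taking the mean of the mask to get a fraction.

```python
import numpy as np
import pandas as pd

# Toy "actual minus goal" values in days (made up for illustration).
toy = pd.Series([-3.0, 0.0, 2.0, -1.0, 5.0])

# True counts as 1 and False as 0, so summing a mask counts the matches.
n_on_time = np.sum(toy <= 0)
n_late = np.sum(toy > 0)

# The mean of the mask is the fraction of True values.
pct_on_time = 100.0 * (toy <= 0).mean()

print(n_on_time, n_late, pct_on_time)
```

Note that a difference of exactly zero is counted as meeting the goal, matching the `<= 0` test used on the real data.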
In [16]:
n_good, n_bad, n_tot
Out[16]:
In [17]:
print("Miami 311 meets their goal {:.1f}% of the time.".format(n_good*100.0/n_tot))
Miami 311 gets a B.
The end!