In this section, our main goal is to outline how to move from the trial-and-error exploratory data analysis we did this morning to a clean, linear, reproducible analysis.
In [1]:
import this
In [2]:
URL = "https://s3.amazonaws.com/pronto-data/open_data_year_one.zip"
In [3]:
import urllib.request
urllib.request.urlretrieve?
In [4]:
import os
os.path.exists('open_data_year_one.zip')
Out[4]:
In [5]:
# Python 2:
# from urllib import urlretrieve
# Python 3:
from urllib.request import urlretrieve
import os
def download_if_needed(url, filename, force_download=False):
    """Download url to filename unless the file already exists"""
    if force_download or not os.path.exists(filename):
        urlretrieve(url, filename)

download_if_needed(URL, 'open_data_year_one.zip')
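To see the caching behavior without touching the network, here is a minimal sketch that points the same function at a local file through a `file://` URL. The temporary file names (`source.txt`, `copy.txt`) are made up for illustration; only `download_if_needed` comes from the cell above.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def download_if_needed(url, filename, force_download=False):
    """Download url to filename unless the file already exists"""
    if force_download or not os.path.exists(filename):
        urlretrieve(url, filename)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the remote file: a local file exposed as a file:// URL
    source = Path(tmp) / "source.txt"
    source.write_text("version 1")
    target = Path(tmp) / "copy.txt"

    # First call: the target does not exist yet, so it is fetched
    download_if_needed(source.as_uri(), target)
    first = target.read_text()

    # The source changes, but the existing copy is left alone...
    source.write_text("version 2")
    download_if_needed(source.as_uri(), target)
    cached = target.read_text()

    # ...until a refresh is forced
    download_if_needed(source.as_uri(), target, force_download=True)
    refreshed = target.read_text()

print(first, cached, refreshed, sep=" | ")
```

The check is only on whether the file exists, so a stale or partially written file is treated as valid; `force_download=True` is the escape hatch for that case.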
In [6]:
!ls
(Use a text editor to edit pronto_utils.py)
In [7]:
from pronto_utils import download_if_needed
download_if_needed(URL, 'open_data_year_one.zip')
Use Python to unzip and load the data:
In [8]:
import zipfile
import pandas as pd
def load_trip_data(filename='open_data_year_one.zip'):
    """Load trip data from the zipfile; return as DataFrame"""
    download_if_needed(URL, filename)
    with zipfile.ZipFile(filename) as zf:
        return pd.read_csv(zf.open('2015_trip_data.csv'))

data = load_trip_data()
data.head()
Out[8]:
(paste the above function in pronto_utils.py)
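The function above works because `ZipFile.open` returns a file-like object that can be read without extracting the archive to disk. As a rough, standard-library-only sketch of the same pattern, the following builds a tiny archive in memory and reads one CSV member from it. The member name `trips.csv`, its columns, and the helper `read_csv_member` are hypothetical and have nothing to do with the real Pronto data.

```python
import csv
import io
import zipfile

# Build a tiny in-memory archive so the sketch needs no download.
# The member name and columns are invented for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("trips.csv", "trip_id,birthyear\n1,1980\n2,1975\n")


def read_csv_member(archive, member):
    """Return the rows of one CSV file inside a zip archive as dicts."""
    with zipfile.ZipFile(archive) as zf:
        # zf.open gives a binary file-like object; wrap it to decode text
        with io.TextIOWrapper(zf.open(member), encoding="utf-8", newline="") as text:
            return list(csv.DictReader(text))


rows = read_csv_member(buffer, "trips.csv")
print(rows[0]["birthyear"])
```

`pd.read_csv` accepts the object returned by `zf.open` directly, which is why `load_trip_data` needs no decoding step of its own.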
In [9]:
from pronto_utils import load_trip_data
data = load_trip_data()
data.head()
Out[9]:
Let's write a unit test to make sure our download script works properly. We will use pytest here.
In [10]:
import pandas as pd
from pronto_utils import load_trip_data
def test_trip_data():
    df = load_trip_data()
    assert isinstance(df, pd.DataFrame)
    assert df.shape == (142846, 12)

test_trip_data()
(paste the above function in pronto_utils.py)
In [11]:
!pytest pronto_utils.py
Working in pairs, do the following:
For step 3, you'll have to tell matplotlib not to invoke the graphical backend, which you can do by putting the following at the top of the test file:
import matplotlib as mpl
mpl.use('Agg') # Don't invoke graphical backend for plots
If you want to go further with testing the output of your plot, matplotlib has some useful plot testing tools that you can use.
In [12]:
%matplotlib inline
In [13]:
def plot_totals_by_birthyear():
    df = load_trip_data()
    totals_by_birthyear = df.birthyear.value_counts().sort_index()
    # drawstyle='steps' draws a step plot; recent matplotlib
    # no longer accepts linestyle='steps'
    return totals_by_birthyear.plot(drawstyle='steps')

plot_totals_by_birthyear()
Out[13]:
In [14]:
def test_plot_totals():
    ax = plot_totals_by_birthyear()
    assert len(ax.lines) == 1
In [15]:
import numpy as np
import matplotlib as mpl
def test_plot_totals_by_birthyear():
    ax = plot_totals_by_birthyear()
    # Some tests of the output that dig into the
    # matplotlib internals
    assert len(ax.lines) == 1
    line = ax.lines[0]
    x, y = line.get_data()
    assert np.all((x > 1935) & (x < 2000))
    assert y.mean() == 1456
In [16]:
test_plot_totals_by_birthyear()
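The last assertion in the test above compares a floating-point mean with `==`. That is fine when the value happens to be exactly representable, but in general float results can differ in the last few bits, so a tolerance-based comparison is often the safer choice. The sketch below uses only the standard library; `assert_close` is a hypothetical helper, not part of pytest or `pronto_utils.py`.

```python
import math


def assert_close(actual, expected, rel_tol=1e-9, abs_tol=0.0):
    """Hypothetical helper: raise if two floats differ beyond a tolerance."""
    if not math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=abs_tol):
        raise AssertionError(f"{actual!r} is not close to {expected!r}")


total = 0.1 + 0.2

# Exact comparison is False here because of binary rounding
exact_match = (total == 0.3)

# The tolerant comparison passes without raising
assert_close(total, 0.3)

print(exact_match)
```

In a real test file, `pytest.approx` and `numpy.isclose` serve the same purpose and would likely be preferable to a hand-written helper.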
In [17]:
!pytest pronto_utils.py