Procedural Python and Unit Tests

In this section, our main goal is to outline how to go from the kind of trial-and-error exploratory data analysis we did this morning to a clean, linear, reproducible analysis.



In [1]:

    
import this









    



The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Step 1: Downloading the Data

We want a function that will download the data automatically if it does not already exist.



In [2]:

    
URL = "https://s3.amazonaws.com/pronto-data/open_data_year_one.zip"



In [3]:

    
import urllib.request
urllib.request.urlretrieve?



In [4]:

    
import os
os.path.exists('open_data_year_one.zip')









    Out[4]:





True



In [5]:

    
# Python 2:
# from urllib import urlretrieve
# Python 3:
from urllib.request import urlretrieve
import os


def download_if_needed(url, filename, force_download=False):
    """Download url to filename unless the file already exists."""
    if force_download or not os.path.exists(filename):
        urlretrieve(url, filename)


download_if_needed(URL, 'open_data_year_one.zip')



In [6]:

    
!ls









    



ProceduralPython.ipynb google-home.html       pronto_utils.py
__pycache__            open_data_year_one.zip

Step 2: Make a Package

Now that this function works, let's move it into a Python package (here, a single module file) that we can import it from.

(Use a text editor to edit pronto_utils.py)
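As a rough guide, pronto_utils.py might look something like the sketch below at this point. The URL constant and the download logic come from the cells above. The printed message is an assumption, inferred from the output shown when the imported version is called, so the actual file may differ.

```python
"""pronto_utils.py: helper functions for the Pronto trip data (sketch)."""
import os
from urllib.request import urlretrieve

URL = "https://s3.amazonaws.com/pronto-data/open_data_year_one.zip"


def download_if_needed(url, filename, force_download=False):
    """Download url to filename unless the file already exists."""
    if force_download or not os.path.exists(filename):
        urlretrieve(url, filename)
    else:
        # Assumed message, modeled on the notebook output
        print("File {0} already exists; not downloading".format(filename))
```

Keeping URL in the module means later functions in the same file can use it without the notebook having to define it first.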



In [7]:

    
from pronto_utils import download_if_needed
download_if_needed(URL, 'open_data_year_one.zip')









    



File open_data_year_one.zip already exists; not downloading

Use Python to unzip and load the data:



In [8]:

    
import zipfile
import pandas as pd

def load_trip_data(filename='open_data_year_one.zip'):
    """Load trip data from the zipfile; return as DataFrame"""
    download_if_needed(URL, filename)
    # Close the archive once pandas has finished reading from it
    with zipfile.ZipFile(filename) as zf:
        return pd.read_csv(zf.open('2015_trip_data.csv'))

data = load_trip_data()
data.head()









    



File open_data_year_one.zip already exists; not downloading






    Out[8]:

   trip_id  starttime         stoptime          bikeid    tripduration  from_station_name    to_station_name                                    from_station_id  to_station_id  usertype       gender  birthyear
0  431      10/13/2014 10:31  10/13/2014 10:48  SEA00298  985.935       2nd Ave & Spring St  Occidental Park / Occidental Ave S & S Washing...  CBD-06           PS-04          Annual Member  Male    1960.0
1  432      10/13/2014 10:32  10/13/2014 10:48  SEA00195  926.375       2nd Ave & Spring St  Occidental Park / Occidental Ave S & S Washing...  CBD-06           PS-04          Annual Member  Male    1970.0
2  433      10/13/2014 10:33  10/13/2014 10:48  SEA00486  883.831       2nd Ave & Spring St  Occidental Park / Occidental Ave S & S Washing...  CBD-06           PS-04          Annual Member  Female  1988.0
3  434      10/13/2014 10:34  10/13/2014 10:48  SEA00333  865.937       2nd Ave & Spring St  Occidental Park / Occidental Ave S & S Washing...  CBD-06           PS-04          Annual Member  Female  1977.0
4  435      10/13/2014 10:34  10/13/2014 10:49  SEA00202  923.923       2nd Ave & Spring St  Occidental Park / Occidental Ave S & S Washing...  CBD-06           PS-04          Annual Member  Male    1971.0

(paste the above function in pronto_utils.py)



In [9]:

    
from pronto_utils import load_trip_data
data = load_trip_data()
data.head()









    



File open_data_year_one.zip already exists; not downloading






    Out[9]:

   trip_id  starttime         stoptime          bikeid    tripduration  from_station_name    to_station_name                                    from_station_id  to_station_id  usertype       gender  birthyear
0  431      10/13/2014 10:31  10/13/2014 10:48  SEA00298  985.935       2nd Ave & Spring St  Occidental Park / Occidental Ave S & S Washing...  CBD-06           PS-04          Annual Member  Male    1960.0
1  432      10/13/2014 10:32  10/13/2014 10:48  SEA00195  926.375       2nd Ave & Spring St  Occidental Park / Occidental Ave S & S Washing...  CBD-06           PS-04          Annual Member  Male    1970.0
2  433      10/13/2014 10:33  10/13/2014 10:48  SEA00486  883.831       2nd Ave & Spring St  Occidental Park / Occidental Ave S & S Washing...  CBD-06           PS-04          Annual Member  Female  1988.0
3  434      10/13/2014 10:34  10/13/2014 10:48  SEA00333  865.937       2nd Ave & Spring St  Occidental Park / Occidental Ave S & S Washing...  CBD-06           PS-04          Annual Member  Female  1977.0
4  435      10/13/2014 10:34  10/13/2014 10:49  SEA00202  923.923       2nd Ave & Spring St  Occidental Park / Occidental Ave S & S Washing...  CBD-06           PS-04          Annual Member  Male    1971.0
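The member name passed to zf.open() has to match a file inside the archive exactly. One way to check is to list the archive contents first. The sketch below does this on a tiny in-memory archive whose rows are made up for illustration. It uses the standard-library csv module instead of pandas so that it runs without extra dependencies, and the helper names list_csv_members and read_member_rows are hypothetical.

```python
import csv
import io
import zipfile

# Build a tiny in-memory archive; the rows are made up for illustration
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("2015_trip_data.csv", "trip_id,tripduration\n1,100.0\n2,250.5\n")
    zf.writestr("README.txt", "placeholder")


def list_csv_members(archive):
    """Return the names of the CSV files stored in a zip archive."""
    with zipfile.ZipFile(archive) as zf:
        return [name for name in zf.namelist() if name.endswith(".csv")]


def read_member_rows(archive, member):
    """Read one CSV member of a zip archive into a list of dicts."""
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as raw:
            # zf.open() yields bytes; wrap it so csv can read text
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


names = list_csv_members(buffer)
rows = read_member_rows(buffer, names[0])
print(names)
print(rows[0])
```

With the real download, the same zf.namelist() call on open_data_year_one.zip would show which CSV files are available to load.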

Step 3: Write a Unit Test

Let's write a unit test to make sure our download script works properly. We will use pytest here.



In [10]:

    
import pandas as pd
from pronto_utils import load_trip_data

def test_trip_data():
    df = load_trip_data()
    assert isinstance(df, pd.DataFrame)
    assert df.shape == (142846, 12)
    
test_trip_data()









    



File open_data_year_one.zip already exists; not downloading

(paste the above function in pronto_utils.py)



In [11]:

    
!py.test pronto_utils.py









    



============================= test session starts ==============================
platform darwin -- Python 3.5.3, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
rootdir: /Users/jakevdp/short-course/short-course/LectureNotes/ProceduralPython, inifile: 
collected 2 items 

pronto_utils.py ..

=========================== 2 passed in 1.64 seconds ===========================
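The test above depends on the real data file being present or downloadable. The download logic itself can also be checked without touching the network. The following is a minimal sketch, assuming the standard-library unittest.mock is acceptable alongside pytest. The function is repeated here (calling request.urlretrieve through the module so it can be patched) so that the block runs on its own, and the test names and the example.invalid URL are placeholders.

```python
import os
import tempfile
from unittest import mock
from urllib import request


def download_if_needed(url, filename, force_download=False):
    """Same logic as the notebook function, repeated so this block runs alone."""
    if force_download or not os.path.exists(filename):
        request.urlretrieve(url, filename)


def test_download_if_needed_skips_existing_file():
    # An existing file should not trigger a download
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "data.zip")
        open(path, "w").close()
        with mock.patch.object(request, "urlretrieve") as fake:
            download_if_needed("http://example.invalid/data.zip", path)
        assert not fake.called


def test_download_if_needed_fetches_missing_file():
    # A missing file should trigger exactly one download call
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "data.zip")
        with mock.patch.object(request, "urlretrieve") as fake:
            download_if_needed("http://example.invalid/data.zip", path)
        fake.assert_called_once_with("http://example.invalid/data.zip", path)


def test_force_download_overrides_existing_file():
    # force_download=True should download even when the file exists
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "data.zip")
        open(path, "w").close()
        with mock.patch.object(request, "urlretrieve") as fake:
            download_if_needed("http://example.invalid/data.zip", path,
                               force_download=True)
        assert fake.call_count == 1
```

Because the fake replaces urlretrieve, these tests run in milliseconds and behave the same whether or not the zip file has already been downloaded.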

Breakout: Add functionality

Working in pairs, do the following:

1. Create a function that plots an interesting aspect of this data.
2. Once you are happy with the function, copy it into the Python package you have created.
3. Write a "smoke test": a test that calls the function but doesn't necessarily validate the output. This can be useful for "testing" plotting functions, because it's generally difficult to programmatically evaluate the plot output itself.

For step 3, you'll have to tell matplotlib not to invoke the graphical backend, which you can do by putting the following at the top of the test file:

import matplotlib as mpl
mpl.use('Agg')  # Don't invoke graphical backend for plots

If you want to go further with testing the output of your plot, matplotlib has some useful plot testing tools that you can use.



In [12]:

    
%matplotlib inline



In [13]:

    
def plot_totals_by_birthyear():
    df = load_trip_data()
    totals_by_birthyear = df.birthyear.value_counts().sort_index()
    # drawstyle='steps' replaces linestyle='steps', which recent matplotlib versions no longer accept
    return totals_by_birthyear.plot(drawstyle='steps')

plot_totals_by_birthyear()









    



File open_data_year_one.zip already exists; not downloading






    Out[13]:





<matplotlib.axes._subplots.AxesSubplot at 0x1192aaa90>



In [14]:

    
def test_plot_totals():
    ax = plot_totals_by_birthyear()
    assert len(ax.lines) == 1



In [15]:

    
import numpy as np
import matplotlib as mpl

def test_plot_totals_by_birthyear():
    ax = plot_totals_by_birthyear()
    
    # Some tests of the output that dig into the
    # matplotlib internals
    assert len(ax.lines) == 1
    
    line = ax.lines[0]
    x, y = line.get_data()
    assert np.all((x > 1935) & (x < 2000))
    assert np.isclose(y.mean(), 1456)  # compare floats with a tolerance



In [16]:

    
test_plot_totals_by_birthyear()









    



File open_data_year_one.zip already exists; not downloading
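A check on a floating-point mean is usually safer with a tolerance than with exact equality, since rounding can make two mathematically equal values differ in the last digits. The sketch below illustrates the idea with the standard-library math.isclose. The helper name assert_mean_close and the sample values are made up for illustration.

```python
import math


def assert_mean_close(values, expected, rel_tol=1e-9):
    """Assert that the mean of values matches expected within a tolerance."""
    mean = sum(values) / len(values)
    assert math.isclose(mean, expected, rel_tol=rel_tol), (mean, expected)
    return mean


# Exact equality can fail on values that "should" be equal
exact_match = (0.1 + 0.2 == 0.3)
close_match = math.isclose(0.1 + 0.2, 0.3)
print(exact_match, close_match)

# Made-up counts whose mean is 1456
assert_mean_close([1455, 1457], 1456)
```

The same idea applies to NumPy arrays through np.isclose or np.allclose.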



In [17]:

    
!py.test pronto_utils.py









    



============================= test session starts ==============================
platform darwin -- Python 3.5.3, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
rootdir: /Users/jakevdp/short-course/short-course/LectureNotes/ProceduralPython, inifile: 
collected 2 items 

pronto_utils.py ..

=========================== 2 passed in 1.78 seconds ===========================


