Peter Bull
Data Scientist
DrivenData

Data Science is Software: Developer #lifehacks for the Jupyter Data Scientist

21 May 2016

1. This is my house

Environment reproducibility for Python

1.1 The watermark extension

Tell everyone when your notebook was run, and with which packages. This is especially useful for nbview, blog posts, and other media where you are not sharing the notebook as executable code.



In [ ]:

    
# install the watermark extension
!pip install watermark

# once it is installed, you'll just need this in future notebooks:
%load_ext watermark



In [ ]:

    
%watermark -a "Peter Bull" -d -v -p numpy,pandas -g

1.2 Laying the foundation

virtualenv and virtualenvwrapper give you a new foundation.

Start from "scratch" on each project
Choose Python 2 or 3 as appropriate
Packages are cached locally, so no need to wait for download/compile on every new env

Installation is as easy as:

pip install virtualenv
pip install virtualenvwrapper
Add the following lines to ~/.bashrc:

export WORKON_HOME=$HOME/.virtualenvs
export PROJECT_HOME=$HOME/Devel
source /usr/local/bin/virtualenvwrapper.sh

To create a virtual environment:

mkvirtualenv <name>

To work in a particular virtual environment:

workon <name>

To leave a virtual environment:

deactivate

#lifehack: create a new virtual environment for every project you work on

#lifehack: if you use anaconda to manage packages using mkvirtualenv --system-site-packages <name> means you don't have to recompile large packages

1.1 The `pip` requirements.txt file

Track your MRE, "Minimum reproducible environment" in a requirements.txt file

#lifehack: never again run pip install <package>. Instead, update requirements.txt and run pip install -r requirements.txt

#lifehack: for data science projects, favor package>=0.0.0 rather than package==0.0.0. This works well with the --system-site-packages flag so you don't have many versions of large packages with complex dependencies sitting around (e.g., numpy, scipy, pandas)



In [ ]:

    
!head -n 20 ../requirements.txt

2. The Life-Changing Magic of Tidying Up

2.1 Consistent project structure means

relative paths work
other collaborators know what to expect
order of scripts is self-documenting



In [ ]:

    
! tree ..

3. Edit-run-repeat: how to stop the cycle of pain

The goal: don't edit, execute and verify any more. How close can we get to code succeeding the first or second time you run it? It's a fine way to start a project, but it doesn't scale as code runs longer and gets more complex.

3.1 No more docs-guessing

Don't edit-run-repeat to try to remember the name of a function or argument.



In [ ]:

    
import pandas as pd



In [ ]:

    
df = pd.read_csv("../data/water-pumps.csv")
df.head(1)

## Try adding parameter index=0



In [ ]:

    
pd.read_csv?



In [ ]:

    
df = pd.read_csv("../data/water-pumps.csv",
                 index_col=0,
                 parse_dates=["date_recorded"])
df.head(1)

#lifehack: in addition to the ? operator, the Jupyter notebooks has great intelligent code completion; try tab when typing the name of a function, try shift+tab when inside a method call

3.2 No more copy pasta

Don't repeat yourself.



In [ ]:

    
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns



In [ ]:

    
plot_data = df['construction_year']
plot_data = plot_data[plot_data != 0]
sns.kdeplot(plot_data, bw=0.1)
plt.show()

plot_data = df['longitude']
plot_data = plot_data[plot_data != 0]
sns.kdeplot(plot_data, bw=0.1)
plt.show()

## Paste for 'amount_tsh' and plot
## Paste for 'latitude' and plot



In [ ]:

    
def kde_plot(dataframe, variable, upper=0.0, lower=0.0, bw=0.1):
    plot_data = dataframe[variable]
    plot_data = plot_data[(plot_data > lower) & (plot_data < upper)]
    sns.kdeplot(plot_data, bw=bw)
    plt.show()



In [ ]:

    
kde_plot(df, 'construction_year', upper=2016)
kde_plot(df, 'longitude', upper=42)



In [ ]:

    
kde_plot(df, 'amount_tsh', upper=400000)

3.3 No more guess-and-check

Interrupt execution with:

%debug magic: drops you out into the most recent error stacktrace in pdb
import q;q.d(): drops you into pdb, even outside of IPython

Interrupt execution on an Exception with %pdb magic. Use pdb the Python debugger to debug inside a notebook. Key commands for pdb are:

p: Evaluate and print Python code

w: Where in the stack trace am I?
u: Go up a frame in the stack trace.
d: Go down a frame in the stack trace.

c: Continue execution
q: Stop execution



In [ ]:

    
kde_plot(df, 'date_recorded')



In [ ]:

    
%debug



In [ ]:

    
# "1" turns pdb on, "0" turns pdb off
%pdb 1

kde_plot(df, 'date_recorded')



In [ ]:

    
# turn off debugger
%pdb 0

#lifehack: %debug and %pdb are great, but pdb can be clunky. Try the 'q' module. Adding the line import q;q.d() anywhere in a project gives you a normal python console at that point. This is great if you're running outside of IPython.

3.4 No more "Restart & Run All"

assert is the poor man's unit test: stops execution if condition is False, continues silently if True



In [ ]:

    
import numpy as np



In [ ]:

    
def gimme_the_mean(series):
    return np.mean(series)

assert gimme_the_mean([0.0]*10) == 0.0
assert gimme_the_mean(range(10)) == 5

3.5 No more copy-pasta between notebooks

Have a method that gets used in multiple notebooks? Refactor it into a separate .py file so it can live a happy life!

Note: In order to import your local modules, you must do three things:

put the .py file in a separate folder
add an empty __init__.py file to the folder
add that folder to the Python path with sys.path.append



In [ ]:

    
import os
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

# import my method from the source code
from preprocess.build_features import remove_invalid_data

df = remove_invalid_data("../data/water-pumps.csv")
df.shape



In [ ]:

    
# TRY ADDING print "lalalala" to the method
df = remove_invalid_data("../data/water-pumps.csv")

Restart the kernel, let's try this again....



In [ ]:

    
# Load the "autoreload" extension
%load_ext autoreload

# always reload modules marked with "%aimport"
%autoreload 1

import os
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

# import my method from the source code
%aimport preprocess.build_features
from preprocess.build_features import remove_invalid_data



In [ ]:

    
df = remove_invalid_data("../data/water-pumps.csv")
df.head()

#lifehack: reloading modules in a running kernel is tricky business. If you use %autoreload when developing, restart the kernel and run all cells when you're done.

3.6 I'm too good! Now this code is useful to other projects!

Importing local code is great if you want to use it in multiple notebooks, but once you want to use the code in multiple projects or repositories, it gets complicated. This is when we get serious about isolation!

We can build a python package to solve that! In fact, there is a cookiecutter to create Python packages.

Once we create this package, we can install it in "editable" mode, which means that as we change the code the changes will get picked up if the package is used. The process looks like

cookiecutter https://github.com/kragniz/cookiecutter-pypackage-minimal
cd package_name
pip install -e .

Now we can have a separate repository for this code and it can be used across projects without having to maintain code in multiple places.

3.7 No more letting other people (including future you) break your toys

unittest is a unit testing framework that is built in to Python. See src/preprocess/tests.py for an example.



In [ ]:

    
%run ../src/preprocess/tests.py

#lifehack: test your code.

3.8 Special treats for datascience testing

`numpy.testing`

Provides useful assertion methods for values that are numerically close and for numpy arrays.



In [ ]:

    
data = np.random.normal(0.0, 1.0, 1000000)
assert gimme_the_mean(data) == 0.0



In [ ]:

    
np.testing.assert_almost_equal(gimme_the_mean(data),
                               0.0,
                               decimal=1)



In [ ]:

    
a = np.random.normal(0, 0.0001, 10000)
b = np.random.normal(0, 0.0001, 10000)

np.testing.assert_array_equal(a, b)



In [ ]:

    
np.testing.assert_array_almost_equal(a, b, decimal=3)

engarde decorators

A new library that lets you practice defensive program--specifically with pandas DataFrame objects. It provides a set of decorators that check the return value of any function that returns a DataFrame and confirms that it conforms to the rules.



In [ ]:

    
import engarde.decorators as ed



In [ ]:

    
test_data = pd.DataFrame({'a': np.random.normal(0, 1, 100),
                          'b': np.random.normal(0, 1, 100)})

@ed.none_missing()
def process(dataframe):
    dataframe.loc[10, 'a'] = np.nan
    return dataframe

process(test_data).head()

engarde has an awesome set of decorators:

none_missing - no NaNs (great for machine learning--sklearn does not care for NaNs)
has_dtypes - make sure the dtypes are what you expect
verify - runs an arbitrary function on the dataframe
verify_all - makes sure every element returns true for a given function

More can be found in the docs.

#lifehack: test your data science code.

3.9 Keep your secrets to yourself

We've all seen secrets: passwords, database URLs, API keys checked in to GitHub. Don't do it! Even on a private repo. What's the easiest way to manage these secrets outside of source control? Store them as a .env file that lives in your repository, but is not in source control (e.g., add .env to your .gitignore file).

A package called python-dotenv manages this for you easily.



In [ ]:

    
!cat ../.env



In [ ]:

    
import os
from dotenv import load_dotenv, find_dotenv

# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv(usecwd=True)

# load up the entries as environment variables
load_dotenv(dotenv_path)

api_key = os.environ.get("API_KEY")
api_key

4. Next-level code inspection

4.1 Code coverage

coverage.py is an amazing tool for seeing what code gets executed when you run your test suite. You can run these commands to generate a code coverage report:

coverage run --source ../src/ ../src/preprocess/tests.py
coverage html
coverage report



In [ ]:

    
from IPython.display import IFrame
IFrame("htmlcov/index.html", 800, 300)

4.2 Code profiling

Sometimes your code is slow. See which functions are called, how many times, and how long they take!

The %prun magic reports these to you right in the Jupyter notebook!



In [ ]:

    
import numpy as np
from mcmc.hamiltonian import hamiltonian, run_diagnostics

f = lambda X: np.exp(-100*(np.sqrt(X[:,1]**2 + X[:,0]**2)- 1)**2 + (X[:,0]-1)**3 - X[:,1] - 5)

# potential and kinetic energies
U = lambda q: -np.log(f(q))
K = lambda p: p.dot(p.T) / 2

# gradient of the potential energy
def grad_U(X):
    x, y = X[0,:]

    xy_sqrt = np.sqrt(y**2 + x**2)
        
    mid_term = 100*2*(xy_sqrt - 1) 
    grad_x = 3*((x-1)**2) - mid_term * ((x) / (xy_sqrt))
    grad_y = -1 - mid_term * ((y) / (xy_sqrt))
    
    return -1*np.array([grad_x, grad_y]).reshape(-1, 2)

ham_samples, H = hamiltonian(5000, U, K, grad_U)
run_diagnostics(ham_samples)



In [ ]:

    
%prun ham_samples, H = hamiltonian(5000, U, K, grad_U)
run_diagnostics(ham_samples)

4.3 The world beyond Jupyter

Linting and Graphical Debugging (IDEs)

PyCharm is a fully-featured Python IDE. It has tons of integrations with the normal development flow. The features I use most are:

git integration
interactive graphical debugger
flake8 linting
smart refactoring/go to