Data Science is Software

Developer #lifehacks for the Jupyter Data Scientist

Section 3: Writing code for reusability



In [ ]:

    
%matplotlib inline
from __future__ import print_function

import os

import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

PROJ_ROOT = os.path.join(os.pardir, os.pardir)

3.1 No more docs-guessing

Don't edit-run-repeat to try to remember the name of a function or argument. Jupyter provides great docs integration and easy ways to remember the arguments to a function.



In [ ]:

    
## Guessing at the argument name: index=0 is wrong and raises a TypeError
pump_data_path = os.path.join(PROJ_ROOT,
                              "data",
                              "raw",
                              "pumps_train_values.csv")

df = pd.read_csv(pump_data_path, index=0)
df.head(1)



In [ ]:

    
pd.read_csv?



In [ ]:

    
# Tab completion for parsing dates in the date_recorded column
# Shift tab for documentation
df = pd.read_csv(pump_data_path, index_col=0)

df.head(1)
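
Outside a notebook, where ? and shift-tab are not available, the same lookup can be done with the standard library's inspect module. This is a minimal sketch: summarize_callable is a hypothetical helper name, and json.dumps stands in for pd.read_csv only so that the example runs without pandas installed.

```python
import inspect
import json


def summarize_callable(func):
    """Return the parameter names and the first docstring line of a callable."""
    signature = inspect.signature(func)
    doc = inspect.getdoc(func) or ""
    first_line = doc.splitlines()[0] if doc else ""
    return list(signature.parameters), first_line


# In the notebook you would pass pd.read_csv here instead
param_names, summary = summarize_callable(json.dumps)
print(param_names)
print(summary)
```

Checking a guessed name against param_names before running a cell is a quick way to catch mistakes like index versus index_col.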

3.2 No more copy-pasta

Don't repeat yourself.



In [ ]:

    
df.describe()



In [ ]:

    
## Paste for 'construction_year' and plot
## Paste for 'gps_height' and plot
plot_data = df['amount_tsh']
sns.kdeplot(plot_data, bw=1000)
plt.show()



In [ ]:

    
def kde_plot(dataframe, variable, upper=None, lower=None, bw=0.1):
    """ Plots a density plot for a variable with optional upper and
        lower bounds on the data (inclusive).
    """
    plot_data = dataframe[variable]
    
    if upper is not None:
        plot_data = plot_data[plot_data <= upper]
    if lower is not None:
        plot_data = plot_data[plot_data >= lower]

    # note: seaborn >= 0.11 deprecates `bw` in favor of `bw_method` / `bw_adjust`
    sns.kdeplot(plot_data, bw=bw)
    plt.show()



In [ ]:

    
kde_plot(df, 'amount_tsh', bw=1000, lower=0)
kde_plot(df, 'construction_year', bw=1, lower=1000, upper=2016)
kde_plot(df, 'gps_height', bw=100)
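
One further step, sketched here as an assumption rather than as part of the original notebook, is to pull the bounds filtering out of the plotting function so it can be reused and tested without drawing anything. filter_bounds is a hypothetical name, and it works on plain lists to keep the example free of dependencies.

```python
def filter_bounds(values, lower=None, upper=None):
    """Keep the values that fall within the optional inclusive bounds."""
    kept = []
    for value in values:
        if lower is not None and value < lower:
            continue
        if upper is not None and value > upper:
            continue
        kept.append(value)
    return kept


# Made-up construction years, including a 0 placeholder and a future year
years = [0, 1995, 2004, 2013, 2020]
print(filter_bounds(years, lower=1000, upper=2016))  # [1995, 2004, 2013]
```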

3.3 No more copy-pasta between notebooks

Have a method that gets used in multiple notebooks? Refactor it into a separate .py file so it can live a happy life!

Note: In order to import your local modules, you must do three things:

put the .py file in a separate folder (here, src/features)
add an empty __init__.py file to that folder
add the directory that contains the folder (here, src) to the Python path with sys.path.append
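
The three steps above can be sketched end to end with a throwaway layout in a temporary directory. The package name demo_features, the module cleaning.py and the function drop_negative are all hypothetical and exist only for this illustration.

```python
import importlib
import os
import sys
import tempfile

# Throwaway layout so the example runs anywhere:
#   <tmp>/src/demo_features/__init__.py
#   <tmp>/src/demo_features/cleaning.py
src_dir = os.path.join(tempfile.mkdtemp(), "src")
package_dir = os.path.join(src_dir, "demo_features")
os.makedirs(package_dir)

# The reusable code lives in a .py file inside its own folder
with open(os.path.join(package_dir, "cleaning.py"), "w") as handle:
    handle.write("def drop_negative(values):\n"
                 "    return [v for v in values if v >= 0]\n")

# An empty __init__.py marks that folder as a package
open(os.path.join(package_dir, "__init__.py"), "w").close()

# The directory that contains the package folder goes on the Python path
sys.path.append(src_dir)
importlib.invalidate_caches()

from demo_features.cleaning import drop_negative

print(drop_negative([3, -1, 0, 7]))  # [3, 0, 7]
```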



In [ ]:

    
# add local python functions
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(PROJ_ROOT, "src")
sys.path.append(src_dir)

# import my method from the source code
from features.build_features import remove_invalid_data

df = remove_invalid_data(pump_data_path)
df.shape



In [ ]:

    
# TRY ADDING print("lalalala") to the method
df = remove_invalid_data(pump_data_path)

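The print doesn't show up: Python caches imported modules, so edits to the .py file are ignored until the module is reloaded. Restart the kernel, and let's try this again...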



In [ ]:

    
# Load the "autoreload" extension
%load_ext autoreload

# always reload modules marked with "%aimport"
%autoreload 1

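import os
import sys

# the kernel restart cleared all variables, so redefine the paths
PROJ_ROOT = os.path.join(os.pardir, os.pardir)
pump_data_path = os.path.join(PROJ_ROOT, "data", "raw", "pumps_train_values.csv")

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(PROJ_ROOT, "src")
sys.path.append(src_dir)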

# import my method from the source code
%aimport features.build_features
from features.build_features import remove_invalid_data



In [ ]:

    
df = remove_invalid_data(pump_data_path)
df.head()
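
Roughly speaking, autoreload saves you from calling importlib.reload by hand after every edit. The sketch below shows that manual version on a throwaway module; demo_reload_target and write_module are hypothetical names, and the real extension does more bookkeeping than this.

```python
import importlib
import os
import sys
import tempfile

# Skip .pyc caching so the quick rewrite below is always re-read from source
sys.dont_write_bytecode = True

module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, "demo_reload_target.py")


def write_module(message):
    """(Re)write the demo module so that greet() returns `message`."""
    with open(module_path, "w") as handle:
        handle.write("def greet():\n    return {!r}\n".format(message))


write_module("first version")
sys.path.append(module_dir)
importlib.invalidate_caches()

import demo_reload_target

before_edit = demo_reload_target.greet()

# Editing the file does not change the module already loaded in memory
write_module("second version, after the edit")
stale = demo_reload_target.greet()

# reload() re-executes the source file inside the existing module object
importlib.reload(demo_reload_target)
fresh = demo_reload_target.greet()

print(before_edit, stale, fresh, sep=" | ")
```

Note that names imported with from module import name keep pointing at the old object after a manual reload, which is one reason to access functions through the module or to let autoreload handle it.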

3.4 I'm too good! Now this code is useful to other projects!

Importing local code is great if you want to use it in multiple notebooks, but once you want to use the code in multiple projects or repositories, it gets complicated. This is when we get serious about isolation!

We can build a Python package to solve that! In fact, there is a cookiecutter to create Python packages.

Once we create this package, we can install it in "editable" mode, which means that changes to the code are picked up wherever the package is used, without reinstalling. The process looks like:

cookiecutter https://github.com/wdm0006/cookiecutter-pipproject
cd package_name
pip install -e .

Now we can have a separate repository for this code and it can be used across projects without having to maintain code in multiple places.
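
For pip install -e . to work, the project directory needs packaging metadata next to the package folder. The sketch below writes a bare-bones skeleton by hand to show the idea. It is an assumption about the minimum needed, not the output of the cookiecutter template above, and pump_tools and write_package_skeleton are hypothetical names.

```python
import os
import tempfile
import textwrap

# Minimal setuptools configuration; real projects usually add more metadata
SETUP_PY = textwrap.dedent('''\
    from setuptools import setup, find_packages

    setup(
        name="{name}",
        version="0.1.0",
        packages=find_packages(),
    )
''')


def write_package_skeleton(root, name):
    """Create root/name/ holding setup.py and an importable package folder."""
    project_dir = os.path.join(root, name)
    os.makedirs(os.path.join(project_dir, name))
    with open(os.path.join(project_dir, "setup.py"), "w") as handle:
        handle.write(SETUP_PY.format(name=name))
    # An empty __init__.py makes the inner folder a package
    open(os.path.join(project_dir, name, "__init__.py"), "w").close()
    return project_dir


project_dir = write_package_skeleton(tempfile.mkdtemp(), "pump_tools")
print(sorted(os.listdir(project_dir)))  # ['pump_tools', 'setup.py']
```

Running pip install -e . inside project_dir would then make import pump_tools resolve to that folder from any project in the same environment.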

3.5 Sometimes things go wrong

Interrupt execution with:

%debug magic: drops you into pdb at the stack trace of the most recent error
import q;q.d(): drops you into an interactive Python console at that line, even outside of IPython

Interrupt execution on an exception with the %pdb magic. Use pdb, the Python debugger, to debug inside a notebook. Key commands for pdb are:

p: Evaluate and print a Python expression

w: Where in the stack trace am I?
u: Go up a frame in the stack trace.
d: Go down a frame in the stack trace.

c: Continue execution
q: Stop execution
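
Outside a notebook, the counterpart of %debug is pdb.post_mortem from the standard library. A minimal Python 3 sketch follows; mean_of and run_with_post_mortem are hypothetical helpers, and the interactive flag is off by default so the example runs unattended.

```python
import pdb
import sys


def mean_of(values):
    """Average a list of numbers; fails on an empty list."""
    return sum(values) / len(values)


def run_with_post_mortem(func, *args, interactive=False):
    """Call func(*args); on an exception, optionally open pdb at the failing frame."""
    try:
        return func(*args)
    except Exception:
        if interactive:
            # Opens the debugger where the exception was raised;
            # the p, w, u, d, c and q commands listed above work here
            pdb.post_mortem(sys.exc_info()[2])
        raise


try:
    run_with_post_mortem(mean_of, [])
except ZeroDivisionError as error:
    caught = error
    print("caught:", error)
```

Passing interactive=True in a terminal session stops at the division inside mean_of, where p values shows the empty list.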



In [ ]:

    
kde_plot(df,
         'date_recorded',
         upper=pd.to_datetime('2017-01-01'),
         lower=pd.to_datetime('1900-01-01'))



In [ ]:

    
%debug



In [ ]:

    
# "1" turns pdb on, "0" turns pdb off
%pdb 1

kde_plot(df, 'date_recorded')



In [ ]:

    
# turn off debugger
%pdb 0

#lifehack: %debug and %pdb are great, but pdb can be clunky. Try the 'q' module. Adding the line import q;q.d() anywhere in a project gives you a normal Python console at that point. This is especially handy when you're running outside of IPython.

3.6 Code profiling

Sometimes your code is slow. See which functions are called, how many times, and how long they take!

The %prun magic reports these to you right in the Jupyter notebook!



In [ ]:

    
import numpy as np
from mcmc.hamiltonian import hamiltonian, run_diagnostics

f = lambda X: np.exp(-100*(np.sqrt(X[:,1]**2 + X[:,0]**2)- 1)**2 + (X[:,0]-1)**3 - X[:,1] - 5)

# potential and kinetic energies
U = lambda q: -np.log(f(q))
K = lambda p: p.dot(p.T) / 2

# gradient of the potential energy
def grad_U(X):
    x, y = X[0,:]

    xy_sqrt = np.sqrt(y**2 + x**2)
        
    mid_term = 100*2*(xy_sqrt - 1) 
    grad_x = 3*((x-1)**2) - mid_term * ((x) / (xy_sqrt))
    grad_y = -1 - mid_term * ((y) / (xy_sqrt))
    
    return -1*np.array([grad_x, grad_y]).reshape(-1, 2)

ham_samples, H = hamiltonian(2500, U, K, grad_U)
run_diagnostics(ham_samples)



In [ ]:

    
%prun ham_samples, H = hamiltonian(2500, U, K, grad_U)
run_diagnostics(ham_samples)
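
A similar report can be produced outside a notebook with the standard library's cProfile and pstats modules. This is a minimal sketch on a toy workload; square and slow_sum_of_squares are hypothetical stand-ins for the sampler above.

```python
import cProfile
import io
import pstats


def square(x):
    return x * x


def slow_sum_of_squares(n):
    """Deliberately call a tiny function many times so it shows up in the profile."""
    total = 0
    for i in range(n):
        total += square(i)
    return total


profiler = cProfile.Profile()
result = profiler.runcall(slow_sum_of_squares, 10000)

# Collect the report in a string, sorted by cumulative time per function
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The ncalls column shows how many times each function ran, and cumtime shows the time spent in it including the functions it called.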

3.7 The world beyond Jupyter

Graphical Debugging (IDEs)

PyCharm is a fully-featured Python IDE. It has tons of integrations with the normal development flow. The features I use most are:

git integration
interactive graphical debugger
flake8 linting
smart refactoring/go to


