"Many machine learning algorithms have a curious property: they are robust against bugs. Since they’re designed to deal with noisy data, they can often deal pretty well with noise caused by math mistakes as well. If you make a math mistake in your implementation, the algorithm might still make sensible-looking predictions. This is bad news, not good news. It means bugs are subtle and hard to detect. Your algorithm might work well in some situations, such as small toy datasets you use for validation, and completely fail in other situations — high dimensions, large numbers of training examples, noisy observations, etc." — Roger Gross, "Testing MCMC code, part 1: unit tests", Harvard Intelligent Probabilistic Systems group
In [ ]:
from __future__ import print_function
import os
import numpy as np
import pandas as pd
PROJ_ROOT = os.path.abspath(os.path.join(os.pardir, os.pardir))
In [ ]:
# The sample mean of a million draws from N(0, 1) is close to zero, but an
# exact floating-point comparison like this one will almost certainly fail.
data = np.random.normal(0.0, 1.0, 1000000)
assert np.mean(data) == 0.0
In [ ]:
# Instead, assert that the mean is zero to within two decimal places.
np.testing.assert_almost_equal(np.mean(data), 0.0, decimal=2)
In [ ]:
# Two independently drawn random arrays will never match element for element,
# so this exact equality check fails.
a = np.random.normal(0, 0.0001, 10000)
b = np.random.normal(0, 0.0001, 10000)
np.testing.assert_array_equal(a, b)
In [ ]:
# With such a small standard deviation, though, every pair of elements
# agrees to three decimal places.
np.testing.assert_array_almost_equal(a, b, decimal=3)
engarde is a library that lets you practice defensive programming--specifically with pandas DataFrame objects. It provides a set of decorators that check the return value of any function that returns a DataFrame and confirm that it conforms to the rules you specify.
In [ ]:
import engarde.decorators as ed
In [ ]:
test_data = pd.DataFrame({'a': np.random.normal(0, 1, 100),
                          'b': np.random.normal(0, 1, 100)})

# none_missing checks that the DataFrame this function returns has no NaNs
@ed.none_missing()
def process(dataframe):
    dataframe.loc[10, 'a'] = 1.0
    return dataframe

process(test_data).head()
engarde has an awesome set of decorators:
- none_missing - no NaNs (great for machine learning--sklearn does not care for NaNs)
- has_dtypes - make sure the dtypes are what you expect
- verify - runs an arbitrary function on the dataframe
- verify_all - makes sure every element returns true for a given function

More can be found in the docs. A quick sketch combining a couple of these follows.
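Here is a rough sketch of stacking two of these decorators on the test_data defined above. It assumes, per the engarde docs, that has_dtypes accepts a column-to-dtype mapping and verify_all accepts a check function applied to the returned DataFrame; the standardize function itself is just an illustration.
In [ ]:
# Check both the dtypes and the value range of whatever the function returns.
@ed.has_dtypes(items={'a': np.float64, 'b': np.float64})
@ed.verify_all(lambda df: df.abs() < 10)
def standardize(dataframe):
    # center and scale each column; engarde then validates the result
    return (dataframe - dataframe.mean()) / dataframe.std()

standardize(test_data).head()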
#lifehack: test your data science code.
What are those tests actually getting up to? Sometimes you think you've written test cases that cover everything that might be interesting. But sometimes you're wrong.
coverage.py is an amazing tool for seeing what code gets executed when you run your test suite. You can run these commands to generate a code coverage report:
coverage run --source src -m pytest
coverage html
coverage report
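For example, a minimal test module might look like the sketch below (the file and function names are hypothetical; in a real project the function under test would be imported from a module under src/ so that --source src measures it):

# tests/test_features.py -- hypothetical test module that pytest would collect.
# In a real project, remove_invalid_rows would live in (and be imported from)
# a module under src/, so that `coverage run --source src -m pytest` measures it.
import numpy as np
import pandas as pd


def remove_invalid_rows(df):
    # stand-in for a real cleaning function that would live under src/
    return df.dropna()


def test_remove_invalid_rows_drops_nans():
    df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})
    cleaned = remove_invalid_rows(df)
    assert len(cleaned) == 2
    assert cleaned['a'].notnull().all()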
In [ ]: