"Many machine learning algorithms have a curious property: they are robust against bugs. Since they’re designed to deal with noisy data, they can often deal pretty well with noise caused by math mistakes as well. If you make a math mistake in your implementation, the algorithm might still make sensible-looking predictions. This is bad news, not good news. It means bugs are subtle and hard to detect. Your algorithm might work well in some situations, such as small toy datasets you use for validation, and completely fail in other situations — high dimensions, large numbers of training examples, noisy observations, etc." — Roger Gross, "Testing MCMC code, part 1: unit tests", Harvard Intelligent Probabilistic Systems group
In [ ]:
from __future__ import print_function
import os
import numpy as np
import pandas as pd
PROJ_ROOT = os.path.abspath(os.path.join(os.pardir, os.pardir))
In [ ]:
# The sample mean of a million draws from N(0, 1) is close to zero, but an
# exact floating-point comparison like this one will almost certainly fail.
data = np.random.normal(0.0, 1.0, 1000000)
assert np.mean(data) == 0.0
In [ ]:
# Instead, assert that the mean is zero to within two decimal places.
np.testing.assert_almost_equal(np.mean(data), 0.0, decimal=2)
In [ ]:
# Two independently drawn random arrays will never match element for element,
# so this exact equality check fails.
a = np.random.normal(0, 0.0001, 10000)
b = np.random.normal(0, 0.0001, 10000)
np.testing.assert_array_equal(a, b)
In [ ]:
# With such a small standard deviation, though, every pair of elements
# agrees to three decimal places.
np.testing.assert_array_almost_equal(a, b, decimal=3)
engarde is a library that lets you practice defensive programming--specifically with pandas DataFrame objects. It provides a set of decorators that check the return value of any function that returns a DataFrame and confirm that it conforms to the rules you specify.
In [ ]:
import engarde.decorators as ed
In [ ]:
test_data = pd.DataFrame({'a': np.random.normal(0, 1, 100),
                          'b': np.random.normal(0, 1, 100)})

# none_missing checks that the DataFrame this function returns has no NaNs
@ed.none_missing()
def process(dataframe):
    dataframe.loc[10, 'a'] = 1.0
    return dataframe

process(test_data).head()
engarde has an awesome set of decorators:
- none_missing - no NaNs (great for machine learning--sklearn does not care for NaNs)
- has_dtypes - make sure the dtypes are what you expect
- verify - runs an arbitrary function on the dataframe
- verify_all - makes sure every element returns true for a given function

More can be found in the docs. A quick sketch combining a couple of these follows.
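Here is a rough sketch of stacking two of these decorators on the test_data defined above. It assumes, per the engarde docs, that has_dtypes accepts a column-to-dtype mapping and verify_all accepts a check function applied to the returned DataFrame; the standardize function itself is just an illustration.
In [ ]:
# Check both the dtypes and the value range of whatever the function returns.
@ed.has_dtypes(items={'a': np.float64, 'b': np.float64})
@ed.verify_all(lambda df: df.abs() < 10)
def standardize(dataframe):
    # center and scale each column; engarde then validates the result
    return (dataframe - dataframe.mean()) / dataframe.std()

standardize(test_data).head()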
#lifehack: test your data science code.
What are those tests actually getting up to? Sometimes you think you've written test cases that cover everything that might be interesting. But sometimes you're wrong.
coverage.py is an amazing tool for seeing what code gets executed when you run your test suite. You can run these commands to generate a code coverage report:
coverage run --source src -m pytest
coverage html
coverage report
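For example, a minimal test module might look like the sketch below (the file and function names are hypothetical; in a real project the function under test would be imported from a module under src/ so that --source src measures it):

# tests/test_features.py -- hypothetical test module that pytest would collect.
# In a real project, remove_invalid_rows would live in (and be imported from)
# a module under src/, so that `coverage run --source src -m pytest` measures it.
import numpy as np
import pandas as pd


def remove_invalid_rows(df):
    # stand-in for a real cleaning function that would live under src/
    return df.dropna()


def test_remove_invalid_rows_drops_nans():
    df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})
    cleaned = remove_invalid_rows(df)
    assert len(cleaned) == 2
    assert cleaned['a'].notnull().all()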
In [ ]: