Data Science is Software

Developer #lifehacks for the Jupyter Data Scientist

Section 2: This is my house

Environment reproducibility for Python


In [ ]:
from __future__ import print_function

import os
import sys

PROJ_ROOT = os.path.join(os.pardir, os.pardir)  # two directories up: the project root

# add the project's local python modules to the import path
sys.path.append(os.path.join(PROJ_ROOT, "src"))

2.1 The watermark extension

Tell everyone when your notebook was run, and with which packages. This is especially useful for nbviewer, blog posts, and other media where you are not sharing the notebook as executable code.


In [ ]:
# install the watermark extension
!pip install watermark

# once it is installed, you'll just need this in future notebooks:
%load_ext watermark

In [ ]:
%watermark -a "Peter Bull" -d -t -v -p numpy,pandas -g
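
Here -a records the author, -d and -t the date and time the notebook was run, -v the Python and IPython versions, -p the versions of the listed packages (numpy and pandas), and -g the current git commit hash.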

2.2 Laying the foundation

Continuum's conda tool provides a way to create isolated environments. In fact, you've already seen this at work if you followed the pydata setup instructions to set up your machine for this tutorial. The conda env functionality lets you create an isolated environment on your machine so you can:

  • Start from "scratch" on each project
  • Choose Python 2 or 3 as appropriate

To create an empty environment:

  • conda create -n <name> python=3

Note: python=2 will create a Python 2 environment; python=3 will create a Python 3 environment.

To work in a particular virtual environment:

  • source activate <name>

To leave a virtual environment:

  • source deactivate

Note: on Windows, the commands are just activate and deactivate, no need to type source.
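
Putting these commands together, a typical terminal session might look like this (myproject is a hypothetical environment name used for illustration):

  conda create -n myproject python=3
  source activate myproject
  # ...install packages and work on the project...
  source deactivate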

There are other Python tools for environment isolation, such as virtualenv and pyenv, but none of them is perfect. There are sometimes compatibility issues between the Anaconda Python distribution and these packages, so if you've got Anaconda on your machine, it's simplest to use conda env to create and manage environments.


#lifehack: create a new environment for every project you work on

#lifehack: if you use Anaconda to manage packages, creating a virtualenv with mkvirtualenv --system-site-packages <name> (a virtualenvwrapper command) means you don't have to recompile large packages


2.3 The pip requirements.txt file

It's a convention in the Python ecosystem to track a project's dependencies in a file called requirements.txt. We recommend using this file to keep track of your MRE, "Minimum reproducible environment".



#lifehack: never again run pip install <package>. Instead, update requirements.txt and run pip install -r requirements.txt

#lifehack: for data science projects, favor package>=0.0.0 rather than package==0.0.0. This works well with the --system-site-packages flag so you don't have many versions of large packages with complex dependencies sitting around (e.g., numpy, scipy, pandas)



In [ ]:
# what does requirements.txt look like?
with open(os.path.join(PROJ_ROOT, 'requirements.txt')) as reqs:
    print(reqs.read())

The format for a line in the requirements file is:

Syntax                 Result
package_name           the latest version available on PyPI
package_name==X.X.X    an exact match of version X.X.X
package_name>=X.X.X    at least version X.X.X
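
For example, a small requirements.txt might look like this (the packages and version pins here are illustrative, not this project's actual dependencies):

  numpy>=1.10
  pandas>=0.18
  watermark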

Now, contributors can create a new virtual environment (using conda or any other tool) and install your dependencies just by running:

  • pip install -r requirements.txt

2.4 Separation of configuration from codebase

There are some things you don't want to be openly reproducible: your private database URL, your AWS credentials for downloading the data, your SSN (which you decided to use as a hash). These shouldn't live in source control, but they may be essential for collaborators or others reproducing your work.

This is a situation where we can learn from some software engineering best practices. The 12-factor app principles give a set of best practices for building web applications. Many of these principles are relevant as best practices for data science codebases as well.

Using a dependency manifest like requirements.txt satisfies II. Explicitly declare and isolate dependencies. The important principle here is III. Store config in the environment:

An app’s config is everything that is likely to vary between deploys (staging, production, developer environments, etc). Apps sometimes store config as constants in the code. This is a violation of twelve-factor, which requires strict separation of config from code. Config varies substantially across deploys, code does not. A litmus test for whether an app has all config correctly factored out of the code is whether the codebase could be made open source at any moment, without compromising any credentials.

The dotenv package allows you to easily store these variables in a file that is not in source control (as long as you keep the line .env in your .gitignore file!). You can then reference these variables as environment variables in your application with os.environ.get('VARIABLE_NAME').
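
As a minimal sketch, assuming the python-dotenv package is installed and your .env file defines the variables you need (DATABASE_URL below is a hypothetical name used for illustration):


In [ ]:
from dotenv import load_dotenv, find_dotenv

# find_dotenv walks up the directory tree until it finds a .env file;
# load_dotenv reads it and adds its entries to the process environment
load_dotenv(find_dotenv())

# now the value is available like any other environment variable
database_url = os.environ.get("DATABASE_URL")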


In [ ]:
# peek at the .env file -- remember, this file stays out of source control
with open(os.path.join(PROJ_ROOT, '.env')) as env_file:
    print(env_file.read())