Data Science is Software

Developer #lifehacks for the Jupyter Data Scientist

Section 2: This is my house

Environment reproducibility for Python

Question 1

Set up a new virtual environment using conda. We strongly recommend naming the environment after your project root folder so that it is easy to remember.

Once the environment is created, activate the environment.

When you're finished, your command-line prompt should look something like:

(water-pumps)machine:~ user$

Use the conda env list command to list the environments that are available and confirm that the water-pumps environment is marked with a * to indicate it is active.


In [ ]:
#SOLUTION
# AT THE COMMANDLINE:
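# Create an environment named after the project, using Python 3
> conda create -n water-pumps python=3
# Activate it (conda releases before 4.4 use `source activate water-pumps` instead)
> conda activate water-pumps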

> conda env list
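
If you would rather confirm the active environment from inside a notebook than by eye, the check can be sketched in Python. This is a minimal sketch, not part of the exercise: the function name `active_conda_env` and the sample listing (including its paths) are hypothetical, and it assumes the usual `conda env list` layout where the active row has `*` as its second column.

```python
def active_conda_env(listing):
    """Return the environment name marked with '*' in `conda env list` output, or None."""
    for line in listing.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' header lines that conda prints.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        # An active row looks like: <name> * <path>
        if len(parts) >= 3 and parts[1] == "*":
            return parts[0]
    return None


# Hypothetical sample output; the paths are placeholders.
sample = """\
# conda environments:
#
base                     /home/user/miniconda3
water-pumps           *  /home/user/miniconda3/envs/water-pumps
"""

print(active_conda_env(sample))
```

In a real session the listing could come from `subprocess.run(["conda", "env", "list"], capture_output=True, text=True).stdout`, assuming conda is on the PATH.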

Question 2

Now that you have a virtual environment, let's populate it with the packages we need. For convenience, we've listed some of the contents of your requirements.txt file below. If you see an ImportError later in the tutorial, you'll be expected to update requirements.txt with the missing dependency!

At this step you need to:

  • Ensure you have a requirements.txt file. It should have been created by the data-science-cookiecutter process.
  • Update that requirements.txt with the packages you need for analysis. Remember, the point is not to pre-load the environment with every package you could ever need, only to define your minimum reproducible environment.
  • Run the command that will install the packages listed in that file.

Here's a start for what your requirements.txt should contain.

click
Sphinx
coverage
awscli
flake8
watermark
python-dotenv>=0.5.1
pandas>=0.18.1
matplotlib>=1.5.1
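
To see which of the requirements listed above are still missing from the active environment, a check along the following lines may help. It is a rough sketch under stated assumptions: `parse_requirement` and `missing_packages` are hypothetical helper names, the parsing handles only simple `name` or `name>=version` lines, version specifiers are not verified, and `importlib.metadata` requires Python 3.8 or newer.

```python
import re
from importlib import metadata


def parse_requirement(line):
    """Split a simple line such as 'pandas>=0.18.1' into (name, specifier)."""
    line = line.split("#", 1)[0].strip()  # drop comments
    # Skip blank lines and pip options such as '-e .' or '-r other.txt'.
    if not line or line.startswith("-"):
        return None
    match = re.match(r"^([A-Za-z0-9._-]+)\s*(.*)$", line)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


def missing_packages(requirement_lines):
    """Return the names of required packages that are not installed here."""
    missing = []
    for line in requirement_lines:
        parsed = parse_requirement(line)
        if parsed is None:
            continue
        name, _specifier = parsed  # the specifier is parsed but not checked
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


requirements = ["click", "python-dotenv>=0.5.1", "pandas>=0.18.1", "matplotlib>=1.5.1"]
print(missing_packages(requirements))
```

An empty list means every named package is installed. To check the real file, pass `open("requirements.txt").read().splitlines()` instead of the hard-coded list.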

In [ ]:
#SOLUTION
# AT THE COMMANDLINE:
> cd water-pumps
# Open requirements.txt in your text editor of choice and add the packages you need
> edit requirements.txt

> pip install -r requirements.txt

Question 3

Awesome! Finally, let's add the watermark extension so we can keep track of the package versions every time we execute the notebook. This helps people who find the notebook in other contexts (nbviewer, as a blog post, as a rendered notebook) also know what versions of packages the code executes against. The watermark documentation may help here.

  • Load the watermark extension
  • Print out the author name, date, time, Python version, the package versions for numpy, pandas, and matplotlib, and the current git hash.

In [ ]:
#SOLUTION
%load_ext watermark
%watermark -a "Peter Bull" -d -t -v -p numpy,pandas,matplotlib -g
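
For a sense of what kind of information the watermark line records, the same details can be gathered by hand with the standard library. This is an illustrative sketch only, not how the watermark extension is implemented; `package_versions` and `git_hash` are hypothetical helper names, and `importlib.metadata` requires Python 3.8 or newer.

```python
import platform
import subprocess
from datetime import datetime
from importlib import metadata


def package_versions(names):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def git_hash():
    """Return the current commit hash, or None if git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


print("Date/time:", datetime.now().isoformat(timespec="seconds"))
print("Python:", platform.python_version())
for name, version in package_versions(["numpy", "pandas", "matplotlib"]).items():
    print(name, version)
print("Git hash:", git_hash())
```

The extension remains the more convenient option in a notebook, since a single magic line covers all of this.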