In [ ]:
from __future__ import print_function
import os
import sys
# project root is two levels up from this notebook
PROJ_ROOT = os.path.join(os.pardir, os.pardir)
# add local python functions
sys.path.append(os.path.join(PROJ_ROOT, "src"))
Tell everyone when your notebook was run, and with which packages. This is especially useful for nbviewer, blog posts, and other media where you are not sharing the notebook as executable code.
In [ ]:
# install the watermark extension
!pip install watermark
# once it is installed, you'll just need this in future notebooks:
%load_ext watermark
In [ ]:
%watermark -a "Peter Bull" -d -t -v -p numpy,pandas -g
Continuum's conda tool provides a way to create isolated environments. In fact, you've already seen this at work if you followed the pydata setup instructions to set up your machine for this tutorial. The conda env functionality lets you create an isolated environment on your machine for each project, so dependencies for different projects never conflict.
To create an empty environment:
conda create -n <name> python=3
Note: python=2 will create a Python 2 environment; python=3 will create a Python 3 environment.
To work in a particular virtual environment:
source activate <name>
To leave a virtual environment:
source deactivate
Note: on Windows, the commands are just activate and deactivate; there is no need to type source.
There are other Python tools for environment isolation, but none of them is perfect. If you're interested in the other options, virtualenv and pyenv both provide environment isolation. There are sometimes compatibility issues between the Anaconda Python distribution and these packages, so if you've got Anaconda on your machine you can use conda env to create and manage environments.
#lifehack: create a new environment for every project you work on
#lifehack: if you use Anaconda to manage packages, creating environments with mkvirtualenv --system-site-packages <name> means you don't have to recompile large packages
The pip requirements.txt file
It's a convention in the Python ecosystem to track a project's dependencies in a file called requirements.txt. We recommend using this file to keep track of your MRE, the "minimum reproducible environment".
#lifehack: never again run pip install <package>. Instead, update requirements.txt and run pip install -r requirements.txt
#lifehack: for data science projects, favor package>=0.0.0 rather than package==0.0.0. This works well with the --system-site-packages flag, so you don't have many versions of large packages with complex dependencies sitting around (e.g., numpy, scipy, pandas)
In [ ]:
# what does requirements.txt look like?
print(open(os.path.join(PROJ_ROOT, 'requirements.txt')).read())
The format for a line in the requirements file is:
| Syntax | Result |
|---|---|
| package_name | for whatever the latest version on PyPI is |
| package_name==X.X.X | for an exact match of version X.X.X |
| package_name>=X.X.X | for at least version X.X.X |
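For example, a small requirements.txt mixing these forms might look like this (the packages and version numbers here are illustrative, not this project's actual pins):
numpy>=1.10.0
pandas>=0.17.0
python-dotenv>=0.5.0
watermark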
Now, contributors can create a new virtual environment (using conda or any other tool) and install your dependencies just by running:
pip install -r requirements.txt
There are some things you don't want to be openly reproducible: your private database URL, your AWS credentials for downloading the data, your SSN, which you decided to use as a hash. These shouldn't live in source control, but may be essential for collaborators or others reproducing your work.
This is a situation where we can learn from software engineering best practices. The 12-factor app principles give a set of best practices for building web applications, and many of them are just as relevant to data science codebases.
Using a dependency manifest like requirements.txt satisfies II. Explicitly declare and isolate dependencies. The important principle here is III. Store config in the environment:
An app’s config is everything that is likely to vary between deploys (staging, production, developer environments, etc). Apps sometimes store config as constants in the code. This is a violation of twelve-factor, which requires strict separation of config from code. Config varies substantially across deploys, code does not. A litmus test for whether an app has all config correctly factored out of the code is whether the codebase could be made open source at any moment, without compromising any credentials.
The dotenv package allows you to easily store these variables in a file that is not in source control (as long as you keep the line .env in your .gitignore file!). You can then reference these variables as environment variables in your application with os.environ.get('VARIABLE_NAME').
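A .env file is just a set of KEY=value pairs, one per line. As a sketch (these names and values are made up for illustration):
DATABASE_URL=postgres://username:password@localhost:5432/mydb
AWS_ACCESS_KEY_ID=myaccesskeyid
AWS_SECRET_ACCESS_KEY=mysecretaccesskey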
In [ ]:
# what does the .env file look like?
print(open(os.path.join(PROJ_ROOT, '.env')).read())
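To pull those values into your program, here's a minimal sketch using the dotenv package (assuming it's installed, e.g. via requirements.txt, and that your .env defines a DATABASE_URL entry):
In [ ]:
import os

from dotenv import load_dotenv, find_dotenv

# walk up from the current directory until a .env file is found,
# then load its KEY=value pairs into os.environ
load_dotenv(find_dotenv())

# the secrets are now available as ordinary environment variables
database_url = os.environ.get('DATABASE_URL')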