Peter Bull, Data Scientist, DrivenData
Tell everyone when your notebook was run, and with which packages. This is especially useful for nbviewer, blog posts, and other media where you are not sharing the notebook as executable code.
In [ ]:
# install the watermark extension
!pip install watermark
# once it is installed, you'll just need this in future notebooks:
%load_ext watermark
In [ ]:
%watermark -a "Peter Bull" -d -v -p numpy,pandas -g
virtualenv and virtualenvwrapper give each project a clean, isolated foundation.
Installation is as easy as:
pip install virtualenv
pip install virtualenvwrapper
Then add the following to your ~/.bashrc:
export WORKON_HOME=$HOME/.virtualenvs
export PROJECT_HOME=$HOME/Devel
source /usr/local/bin/virtualenvwrapper.sh
To create a virtual environment:
mkvirtualenv <name>
To work in a particular virtual environment:
workon <name>
To leave a virtual environment:
deactivate
#lifehack: create a new virtual environment for every project you work on
#lifehack: if you use anaconda to manage packages, using mkvirtualenv --system-site-packages <name> means you don't have to recompile large packages
pip and a requirements.txt file
Track your MRE, your "minimum reproducible environment," in a requirements.txt file.
#lifehack: never again run pip install <package>. Instead, update requirements.txt and run pip install -r requirements.txt
#lifehack: for data science projects, favor package>=0.0.0 rather than package==0.0.0. This works well with the --system-site-packages flag so you don't have many versions of large packages with complex dependencies sitting around (e.g., numpy, scipy, pandas)
In [ ]:
!head -n 20 ../requirements.txt
In [ ]:
! tree ..
In [ ]:
import pandas as pd
In [ ]:
df = pd.read_csv("../data/water-pumps.csv")
df.head(1)
## Try adding parameter index_col=0
In [ ]:
pd.read_csv?
In [ ]:
df = pd.read_csv("../data/water-pumps.csv",
index_col=0,
parse_dates=["date_recorded"])
df.head(1)
#lifehack: in addition to the ? operator, Jupyter notebooks have great intelligent code completion; try tab when typing the name of a function, and shift+tab when inside a method call
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
In [ ]:
plot_data = df['construction_year']
plot_data = plot_data[plot_data != 0]
sns.kdeplot(plot_data, bw=0.1)
plt.show()
plot_data = df['longitude']
plot_data = plot_data[plot_data != 0]
sns.kdeplot(plot_data, bw=0.1)
plt.show()
## Paste for 'amount_tsh' and plot
## Paste for 'latitude' and plot
In [ ]:
def kde_plot(dataframe, variable, upper=0.0, lower=0.0, bw=0.1):
    plot_data = dataframe[variable]
    plot_data = plot_data[(plot_data > lower) & (plot_data < upper)]
    sns.kdeplot(plot_data, bw=bw)
    plt.show()
In [ ]:
kde_plot(df, 'construction_year', upper=2016)
kde_plot(df, 'longitude', upper=42)
In [ ]:
kde_plot(df, 'amount_tsh', upper=400000)
Interrupt execution with:
%debug magic: drops you out into the most recent error stacktrace in pdb
import q;q.d(): drops you into pdb, even outside of IPython
Interrupt execution on an Exception with the %pdb magic.
Use pdb, the Python debugger, to debug inside a notebook. Key commands for pdb are:
p: Evaluate and print Python code
w: Where in the stack trace am I?
u: Go up a frame in the stack trace.
d: Go down a frame in the stack trace.
c: Continue execution
q: Stop execution
In [ ]:
kde_plot(df, 'date_recorded')
In [ ]:
%debug
In [ ]:
# "1" turns pdb on, "0" turns pdb off
%pdb 1
kde_plot(df, 'date_recorded')
In [ ]:
# turn off debugger
%pdb 0
#lifehack: %debug and %pdb are great, but pdb can be clunky. Try the q module. Adding the line import q;q.d() anywhere in a project gives you a normal Python console at that point. This is great if you're running outside of IPython.
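As a sketch of how that looks outside of IPython (assuming the third-party q package is installed; the script and function below are hypothetical, not part of this repo):

# hypothetical standalone script
def clean(records):
    import q; q.d()   # opens an interactive console right here so you can inspect `records`
    return [r for r in records if r is not None]

clean([1, None, 3])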
In [ ]:
import numpy as np
In [ ]:
def gimme_the_mean(series):
    return np.mean(series)

assert gimme_the_mean([0.0]*10) == 0.0
assert gimme_the_mean(range(10)) == 4.5
Have a method that gets used in multiple notebooks? Refactor it into a separate .py file so it can live a happy life!
Note: in order to import your local modules, you must do three things:
1. add an __init__.py file to the folder containing the module
2. add that folder to the Python path with sys.path.append
3. import the module as usual
In [ ]:
import os
import sys
# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)
# import my method from the source code
from preprocess.build_features import remove_invalid_data
df = remove_invalid_data("../data/water-pumps.csv")
df.shape
In [ ]:
# TRY ADDING print "lalalala" to the method
df = remove_invalid_data("../data/water-pumps.csv")
Restart the kernel; let's try this again...
In [ ]:
# Load the "autoreload" extension
%load_ext autoreload
# always reload modules marked with "%aimport"
%autoreload 1
import os
import sys
# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)
# import my method from the source code
%aimport preprocess.build_features
from preprocess.build_features import remove_invalid_data
In [ ]:
df = remove_invalid_data("../data/water-pumps.csv")
df.head()
#lifehack: reloading modules in a running kernel is tricky business. If you use %autoreload when developing, restart the kernel and run all cells when you're done.
Importing local code is great if you want to use it in multiple notebooks, but once you want to use the code in multiple projects or repositories, it gets complicated. This is when we get serious about isolation!
We can build a python package to solve that! In fact, there is a cookiecutter to create Python packages.
Once we create this package, we can install it in "editable" mode, which means that as we change the code, the changes will get picked up wherever the package is used. The process looks like:
cookiecutter https://github.com/kragniz/cookiecutter-pypackage-minimal
cd package_name
pip install -e .
Now we can have a separate repository for this code and it can be used across projects without having to maintain code in multiple places.
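As a rough sketch, the piece that makes pip install -e . work is a setup.py (or equivalent) at the package root; the cookiecutter generates something along these lines (names below are placeholders, not the actual generated file):

from setuptools import setup, find_packages

setup(
    name='package_name',      # placeholder; the cookiecutter fills this in
    version='0.1.0',
    packages=find_packages(),
)

With that in place, the editable install puts a link to your working copy on the Python path, so edits to the source are picked up without reinstalling.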
In [ ]:
%run ../src/preprocess/tests.py
#lifehack: test your code.
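The actual contents of ../src/preprocess/tests.py aren't shown here, but a minimal test file in the same spirit might look like this (hypothetical sketch; it assumes the src directory is on the path as set up above):

from preprocess.build_features import remove_invalid_data

def test_remove_invalid_data_returns_rows():
    # hypothetical check: cleaning should still leave some rows behind
    df = remove_invalid_data("../data/water-pumps.csv")
    assert df.shape[0] > 0

if __name__ == "__main__":
    test_remove_invalid_data_returns_rows()
    print("tests passed")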
In [ ]:
data = np.random.normal(0.0, 1.0, 1000000)
assert gimme_the_mean(data) == 0.0
In [ ]:
np.testing.assert_almost_equal(gimme_the_mean(data),
0.0,
decimal=1)
In [ ]:
a = np.random.normal(0, 0.0001, 10000)
b = np.random.normal(0, 0.0001, 10000)
np.testing.assert_array_equal(a, b)
In [ ]:
np.testing.assert_array_almost_equal(a, b, decimal=3)
engarde is a library that lets you practice defensive programming--specifically with pandas DataFrame objects. It provides a set of decorators that check the return value of any function that returns a DataFrame and confirm that it conforms to the rules.
In [ ]:
import engarde.decorators as ed
In [ ]:
test_data = pd.DataFrame({'a': np.random.normal(0, 1, 100),
'b': np.random.normal(0, 1, 100)})
@ed.none_missing()
def process(dataframe):
    dataframe.loc[10, 'a'] = np.nan
    return dataframe

process(test_data).head()
engarde has an awesome set of decorators:
none_missing - no NaNs (great for machine learning--sklearn does not care for NaNs)
has_dtypes - make sure the dtypes are what you expect
verify - runs an arbitrary function on the dataframe
verify_all - makes sure every element returns true for a given function
More can be found in the docs.
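For instance, has_dtypes can guard a function the same way none_missing did above. This sketch reuses test_data from the cell above and assumes, per the engarde docs, that has_dtypes takes a dict mapping column names to expected dtypes:

@ed.has_dtypes(items={'a': np.float64, 'b': np.float64})
@ed.none_missing()
def standardize(dataframe):
    # the returned frame must keep float columns and contain no NaNs
    return (dataframe - dataframe.mean()) / dataframe.std()

standardize(test_data).head()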
#lifehack: test your data science code.
We've all seen secrets: passwords, database URLs, API keys checked in to GitHub. Don't do it! Even on a private repo. What's the easiest way to manage these secrets outside of source control? Store them in a .env file that lives in your repository but is not in source control (e.g., add .env to your .gitignore file).
A package called python-dotenv manages this for you easily.
In [ ]:
!cat ../.env
In [ ]:
import os
from dotenv import load_dotenv, find_dotenv
# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv(usecwd=True)
# load up the entries as environment variables
load_dotenv(dotenv_path)
api_key = os.environ.get("API_KEY")
api_key
In [ ]:
from IPython.display import IFrame
IFrame("htmlcov/index.html", 800, 300)
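The htmlcov/index.html report embedded above is the HTML output that the coverage tooling writes out. One common way to generate it (assuming pytest and pytest-cov are installed; this project may produce it differently) is:
pytest --cov=src --cov-report=html
or, with coverage directly:
coverage run -m pytest
coverage html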
In [ ]:
import numpy as np
from mcmc.hamiltonian import hamiltonian, run_diagnostics
f = lambda X: np.exp(-100*(np.sqrt(X[:,1]**2 + X[:,0]**2)- 1)**2 + (X[:,0]-1)**3 - X[:,1] - 5)
# potential and kinetic energies
U = lambda q: -np.log(f(q))
K = lambda p: p.dot(p.T) / 2
# gradient of the potential energy
def grad_U(X):
    x, y = X[0,:]
    xy_sqrt = np.sqrt(y**2 + x**2)
    mid_term = 100*2*(xy_sqrt - 1)
    grad_x = 3*((x-1)**2) - mid_term * ((x) / (xy_sqrt))
    grad_y = -1 - mid_term * ((y) / (xy_sqrt))
    return -1*np.array([grad_x, grad_y]).reshape(-1, 2)
ham_samples, H = hamiltonian(5000, U, K, grad_U)
run_diagnostics(ham_samples)
In [ ]:
%prun ham_samples, H = hamiltonian(5000, U, K, grad_U)
run_diagnostics(ham_samples)
PyCharm is a fully-featured Python IDE. It has tons of integrations with the normal development flow. The features I use most are:
git integration