In [ ]:
%matplotlib inline
from __future__ import print_function
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
PROJ_ROOT = os.path.join(os.pardir, os.pardir)
In [ ]:
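## Try adding parameter index=0
## (this raises a TypeError, since read_csv has no `index` argument;
## look up the right parameter name in the next cell)
pump_data_path = os.path.join(PROJ_ROOT,
                              "data",
                              "raw",
                              "pumps_train_values.csv")
df = pd.read_csv(pump_data_path, index=0)
df.head(1)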
In [ ]:
pd.read_csv?
In [ ]:
# Tab completion for parsing dates in the date_recorded column
# Shift tab for documentation
df = pd.read_csv("../data/water-pumps.csv", index_col=0)
df.head(1)
In [ ]:
df.describe()
In [ ]:
## Paste for 'construction_year' and plot
## Paste for 'gps_height' and plot
plot_data = df['amount_tsh']
sns.kdeplot(plot_data, bw=1000)
plt.show()
In [ ]:
def kde_plot(dataframe, variable, upper=None, lower=None, bw=0.1):
    """ Plots a density plot for a variable with optional upper and
        lower bounds on the data (inclusive).
    """
    plot_data = dataframe[variable]
    if upper is not None:
        plot_data = plot_data[plot_data <= upper]
    if lower is not None:
        plot_data = plot_data[plot_data >= lower]
    sns.kdeplot(plot_data, bw=bw)
    plt.show()
In [ ]:
kde_plot(df, 'amount_tsh', bw=1000, lower=0)
kde_plot(df, 'construction_year', bw=1, lower=1000, upper=2016)
kde_plot(df, 'gps_height', bw=100)
Have a method that gets used in multiple notebooks? Refactor it into a separate .py file so it can live a happy life!
Note: In order to import your local modules, you must do three things: put the .py file in a separate folder, add an __init__.py file to the folder, and add that folder to the Python path with sys.path.append.
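The three steps above can be sketched end to end. This is a minimal illustration, not the project's actual layout: the package name `demo_features`, the module `cleaning.py`, and the function `drop_negative` are all made up, and everything is written to a temporary directory so the example runs anywhere.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical layout, created in a temporary directory for illustration:
#   <tmp>/src/demo_features/__init__.py
#   <tmp>/src/demo_features/cleaning.py
project_root = tempfile.mkdtemp()
src_dir = os.path.join(project_root, "src")
package_dir = os.path.join(src_dir, "demo_features")
os.makedirs(package_dir)

# Step 1: put the function in its own .py file inside a separate folder.
with open(os.path.join(package_dir, "cleaning.py"), "w") as f:
    f.write(
        "def drop_negative(values):\n"
        "    return [v for v in values if v >= 0]\n"
    )

# Step 2: add an __init__.py file to the folder.
open(os.path.join(package_dir, "__init__.py"), "w").close()

# Step 3: add the parent folder to the Python path.
sys.path.append(src_dir)
importlib.invalidate_caches()  # make sure the new files are noticed

from demo_features.cleaning import drop_negative

cleaned = drop_negative([3, -1, 0, -7, 5])
print(cleaned)
```

The next cell does the same thing for the real project, pointing sys.path at the existing src directory instead of a temporary one.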
In [ ]:
# add local python functions
import sys
# add the 'src' directory as one where we can import modules
src_dir = os.path.join(PROJ_ROOT, "src")
sys.path.append(src_dir)
# import my method from the source code
from features.build_features import remove_invalid_data
df = remove_invalid_data(pump_data_path)
df.shape
In [ ]:
# TRY ADDING print("lalalala") to the method
df = remove_invalid_data(pump_data_path)
Restart the kernel, let's try this again....
In [ ]:
# Load the "autoreload" extension
%load_ext autoreload
# always reload modules marked with "%aimport"
%autoreload 1
import os
import sys
# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)
# import my method from the source code
%aimport features.build_features
from features.build_features import remove_invalid_data
In [ ]:
df = remove_invalid_data(pump_data_path)
df.head()
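Why the kernel restart was needed: Python caches a module the first time it is imported, so later edits to the file are not seen. The autoreload extension handles this inside IPython. As a rough sketch of the same idea in plain Python, the standard library's importlib.reload re-executes a module on request. The module name `demo_reload_target` and its `greet` function are hypothetical and are written to a temporary directory.

```python
import importlib
import os
import sys
import tempfile

# Skip bytecode caching so this demo always re-reads the edited source.
sys.dont_write_bytecode = True

module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, "demo_reload_target.py")


def write_module(message):
    """Write a tiny module whose greet() returns the given message."""
    with open(module_path, "w") as f:
        f.write("def greet():\n    return {!r}\n".format(message))


write_module("first version")
sys.path.append(module_dir)
importlib.invalidate_caches()

import demo_reload_target

before = demo_reload_target.greet()

# Edit the file on disk: the already-imported module does not change...
write_module("second version, edited on disk")
unchanged = demo_reload_target.greet()

# ...until it is explicitly reloaded.
importlib.reload(demo_reload_target)
after = demo_reload_target.greet()

print(before, "|", unchanged, "|", after)
```

Names pulled in with `from module import name` keep pointing at the old objects after a manual reload, which is part of what makes autoreload more convenient in a notebook.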
Importing local code is great if you want to use it in multiple notebooks, but once you want to use the code in multiple projects or repositories, it gets complicated. This is when we get serious about isolation!
We can build a Python package to solve that! In fact, there is a cookiecutter to create Python packages.
Once we create this package, we can install it in "editable" mode, which means that changes to the source code are picked up wherever the package is used, without reinstalling. The process looks like:
cookiecutter https://github.com/wdm0006/cookiecutter-pipproject
cd package_name
pip install -e .
Now we can have a separate repository for this code and it can be used across projects without having to maintain code in multiple places.
Interrupt execution with:
%debug magic: drops you out into the most recent error stacktrace in pdb
import q;q.d(): drops you into pdb, even outside of IPython
Interrupt execution on an Exception with %pdb magic. Use pdb the Python debugger to debug inside a notebook. Key commands for pdb are:
p: Evaluate and print Python code
w: Where in the stack trace am I?
u: Go up a frame in the stack trace.
d: Go down a frame in the stack trace.
c: Continue execution
q: Stop execution
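To see a few of these commands without typing them, here is a minimal sketch that scripts a post-mortem pdb session using only the standard library. It is roughly what %debug does interactively, except that the commands come from a string. The failing function `divide` and the helper `run_scripted_post_mortem` are hypothetical names used only for illustration.

```python
import io
import pdb
import sys


def divide(numerator, denominator):
    return numerator / denominator


def run_scripted_post_mortem(commands):
    """Trigger an error, then replay pdb commands against its traceback."""
    try:
        divide(42, 0)
    except ZeroDivisionError:
        tb = sys.exc_info()[2]

    output = io.StringIO()
    debugger = pdb.Pdb(stdin=io.StringIO(commands), stdout=output, readrc=False)
    debugger.reset()
    # Same call pdb.post_mortem makes, but with scripted input and output.
    debugger.interaction(None, tb)
    return output.getvalue()


# p: print a variable in the failing frame, w: show the stack,
# u: move up one frame, q: stop.
transcript = run_scripted_post_mortem("p numerator\nw\nu\nq\n")
print(transcript)
```

In normal use you would type the same commands at the interactive (Pdb) prompt instead of passing them in as a string.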
In [ ]:
kde_plot(df,
         'date_recorded',
         upper=pd.to_datetime('2017-01-01'),
         lower=pd.to_datetime('1900-01-01'))
In [ ]:
%debug
In [ ]:
# "1" turns pdb on, "0" turns pdb off
%pdb 1
kde_plot(df, 'date_recorded')
In [ ]:
# turn off debugger
%pdb 0
#lifehack: %debug and %pdb are great, but pdb can be clunky. Try the 'q' module. Adding the line import q;q.d() anywhere in a project gives you a normal python console at that point. This is great if you're running outside of IPython.
In [ ]:
import numpy as np
from mcmc.hamiltonian import hamiltonian, run_diagnostics
f = lambda X: np.exp(-100*(np.sqrt(X[:,1]**2 + X[:,0]**2)- 1)**2 + (X[:,0]-1)**3 - X[:,1] - 5)
# potential and kinetic energies
U = lambda q: -np.log(f(q))
K = lambda p: p.dot(p.T) / 2
# gradient of the potential energy
def grad_U(X):
    x, y = X[0,:]
    xy_sqrt = np.sqrt(y**2 + x**2)
    mid_term = 100*2*(xy_sqrt - 1)
    grad_x = 3*((x-1)**2) - mid_term * ((x) / (xy_sqrt))
    grad_y = -1 - mid_term * ((y) / (xy_sqrt))
    return -1*np.array([grad_x, grad_y]).reshape(-1, 2)
ham_samples, H = hamiltonian(2500, U, K, grad_U)
run_diagnostics(ham_samples)
In [ ]:
%prun ham_samples, H = hamiltonian(2500, U, K, grad_U)
run_diagnostics(ham_samples)
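%prun only exists inside IPython. As a hedged sketch of the equivalent in a plain script, the standard library's cProfile and pstats modules produce the same kind of report. The function `slow_sum_of_squares` is a made-up stand-in for the sampler call, since the mcmc module is local to this project.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive loop so the profiler has something to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
# runcall profiles a single function call and returns its result.
result = profiler.runcall(slow_sum_of_squares, 100000)

# Collect the report in a string, sorted by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(5)  # show only the five most expensive entries
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time puts the functions that dominate the total runtime at the top, which is usually the first thing to look at.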
PyCharm is a fully-featured Python IDE. It has tons of integrations with the normal development flow. The features I use most are:
git integration