... background material available at https://github.com/softecon/talks
I restarted a research project after a longer break, and now the code no longer works or yields different results?
I just added new components to my code, but for some reason the old ones have now stopped working?
I wanted to reproduce and then extend published research. Why is the first step already so hard?
...
$\Rightarrow$ There is a better way!
Verification How accurately does the computational model solve the underlying equations of the model for the quantities of interest?
Validation How accurately does the model represent reality for the quantities of interest?
Uncertainty Quantification How do the various sources of error and uncertainty feed into uncertainty in the model-based prediction of the quantities of interest?
Python is used by computer programmers and scientists alike. Thus, tools from software engineering are readily available.
Python is an open-source project. This ensures that all implementation details can be critically examined. There are no licence costs and low barriers to recomputability.
Python can be easily linked to high-performance languages such as C and Fortran. Python is ideal for prototyping with a focus on readability, design patterns, and ease of testing.
Python has numerous high-quality libraries for scientific computing under active development.
For more details, you can check out my lecture on Python for Scientific Computing in Economics from last year online.
ensure correctness of implementation
manage numerous numerical components
address performance constraints
guarantee recomputability and extensibility
...
$\Rightarrow$ building, documenting, and communicating expertise.
The online documentation of respy is available at http://respy.readthedocs.io.
Throughout this lecture we will work with a simple example. We are interested in building a code to calculate an agent's expected utility $EU$ from a lottery. The generic problem is
$$EU = \int_{-\infty}^{\infty} u(x) f(x)\,dx,$$ where $x$ is a possible outcome, $f(x)$ denotes the probability density function, and $u(x)$ the utility function.
We start with a simple utility function
$$ u(x) = \sqrt{x},$$where $x > 0$ and $x$ is drawn from a standard lognormal distribution with scale parameter $s$. To solve the integral, we opt for a naive Monte Carlo integration. See Skrainka & Judd (2013) for the importance of choosing more advanced integration strategies in serious research.
In [ ]:
from scipy.stats import lognorm

# Draw deviates from the standard lognormal distribution.
num_draws = 1000
deviates = lognorm.rvs(1, size=num_draws)

# Naive Monte Carlo integration: average the utility of the draws.
rslt = 0.0
for deviate in deviates:
    rslt += deviate ** 0.5

print('Expected Utility {}'.format(rslt / num_draws))
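As a quick sanity check, which is not part of the original example, we can compare the Monte Carlo estimate against a closed-form benchmark: for a standard lognormal deviate $x = \exp(z)$ with $z \sim N(0, 1)$ we have $E[\sqrt{x}] = \exp(1/8)$. The seed below is a hypothetical choice that only serves to make the check reproducible.
In [ ]:
import math

import numpy as np
from scipy.stats import lognorm

# Closed-form benchmark for the expected utility under u(x) = sqrt(x) and a
# standard lognormal distribution of outcomes.
closed_form = math.exp(1.0 / 8.0)

# Monte Carlo estimate with an arbitrary, but fixed, seed.
np.random.seed(123)
deviates = lognorm.rvs(1, size=100000)
monte_carlo = np.mean(np.sqrt(deviates))

print('Monte Carlo {:.4f} vs. closed form {:.4f}'.format(monte_carlo, closed_form))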
Version Control records changes to a file or set of files over time so that you can recall (and compare) specific versions later. I draw heavily on the free e-book Pro Git and on Blischak et al. (2016).
Tracking changes to your code over time has a variety of benefits:
tractability
clarity
flexibility
reduced duplication and error
$\Rightarrow$ transparent, reproducible, and collaborative research.
In our quick tour, we will do the following:
set up local repository
commit files
track changes to files
compare versions of files
We will use the eupy package as our baseline.
$ git clone https://github.com/softEcon/eupy.git; rm -rf eupy/.git
Let us initialize a repository ...
$ git init
Initialized empty Git repository in eupy/.git/
... and provide some information about ourselves.
$ git config user.name "Philipp Eisenhauer"
$ git config user.email "eisenhauer@policy-lab.org"
First of all, let us take stock and check the status of our local repository:
$ git status
On branch master
Initial commit
Untracked files:
(use "git add <file>..." to include in what will be committed)
.coveralls.yml
.gitignore
.travis.yml
LICENSE
MANIFEST
README.md
_travis/
dist/
eupy/
requirements.txt
run.py
setup.cfg
setup.py
nothing added to commit but untracked files present (use "git add" to track)
We now set up tracking of all our files and revisit the status:
$ git add *
$ git status
On branch master
Initial commit
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: eupy/__init__.py
new file: eupy/eu_calculations.py
new file: eupy/integration_rules.py
new file: eupy/testing/
new file: eupy/utility_functions.py
new file: requirements.txt
new file: run.py
Untracked files:
(use "git add <file>..." to include in what will be committed)
.coveralls.yml
.gitignore
.travis.yml
We are now ready for our initial commit.
$ git commit -a -m'Initial commit of directory content.'
19 files changed, 662 insertions(+)
create mode 100644 LICENSE
create mode 100644 MANIFEST
create mode 100644 README.md
create mode 100644 _travis/requirements.py
create mode 100644 eupy/__init__.py
create mode 100644 eupy/eu_calculations.py
create mode 100644 eupy/integration_rules.py
create mode 100644 eupy/testing/
create mode 100644 eupy/utility_functions.py
create mode 100644 requirements.txt
create mode 100755 run.py
create mode 100644 setup.cfg
create mode 100644 setup.py
Adding a message to our commit makes it much easier to return to it later.
$ git log
commit 49593b8c447a9f023a35ff724a3467db2fe10037
Author: Philipp Eisenhauer <eisenhauer@policy-lab.org>
Date: Mon Feb 1 21:11:58 2016 +0100
Initial commit of directory content.
This brings us back to the beginning of our repository lifecycle.
$ git status
On branch master
Untracked files:
(use "git add <file>..." to include in what will be committed)
.coveralls.yml
.gitignore
.travis.yml
nothing added to commit but untracked files present (use "git add" to track)
Let us modify the setup of the random seed in integration_rules.py using our favourite editor and check for the current status of our files.
$ git status
On branch master
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: eupy/integration_rules.py
What exactly is the difference?
$ git diff
diff --git a/eupy/integration_rules.py b/eupy/integration_rules.py
index 66aab9c..d6d7dbd 100644
--- a/eupy/integration_rules.py
+++ b/eupy/integration_rules.py
@@ -36,7 +36,7 @@ def naive_monte_carlo(func, bounds, num_draws, implementation, seed):
lower, upper = bounds
# Draw requested number of deviates.
- np.random.seed(seed)
+ np.random.seed(123)
deviates = np.random.uniform(lower, upper, size=num_draws)
# Implement native Monte Carlo approach.
In case we want to discard our changes now:
$ git checkout eupy/integration_rules.py
In this case, we discard our changes immediately. However, we can always return to any version of our file that was ever committed.
$ git checkout IDENTIFIER eupy/integration_rules.py
Git also provides a feature called Branching. It allows you to maintain parallel versions of your code and to switch easily between them. I usually create a new branch for each feature that I add to my research software. Once I am done, I merge my edits back into the master branch.
But first, let us check where we are now.
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean
Now we create and immediately switch to a new branch that we will use to extend the implemented integration strategies.
$ git checkout -b feature_romberg
Switched to a new branch 'feature_romberg'
However, we can also switch back and forth between branches.
$ git checkout master
Once we are done with our edits, we can simply merge our two branches.
$ git merge feature_romberg
Once you are familiar with the basic workflow for versioning your own code, the natural next step is to share your code online. GitHub is a popular hosting service. Among other benefits, you then have a backup copy of your files and can establish a transparent workflow with your co-authors.
Let us check out the repository for my current research on ambiguity here.
We will need to generate a host of random requests for our testing battery.
In [ ]:
# Imports used throughout the testing cells below.
import numpy as np
import pytest

from scipy.stats import lognorm

import eupy


def generate_random_request():
    """ Generate a random admissible request.
    """
    # Draw random deviates that honor the constraints for the utility
    # function and the distribution of returns.
    alpha, shape = np.random.uniform(low=0.0001, size=2)

    # Draw a random integration technique.
    technique = np.random.choice(['naive_mc', 'quad'])

    # Add options.
    int_options = dict()
    int_options['naive_mc'] = dict()

    int_options['naive_mc']['implementation'] = \
        np.random.choice(['slow', 'fast'])
    int_options['naive_mc']['num_draws'] = np.random.random_integers(10, 1000)
    int_options['naive_mc']['seed'] = np.random.random_integers(10, 1000)

    # Finishing
    return alpha, shape, technique, int_options
In [ ]:
def test_random_requests():
    """ Draw a whole host of random requests to ensure that the function
    works for all admissible values.
    """
    for _ in range(100):
        # Generate random request.
        alpha, shape, technique, int_options = generate_random_request()
        # Perform calculation.
        eupy.get_baseline_lognormal(alpha, shape, technique, int_options)
In [ ]:
def test_regression():
    """ This regression test ensures that the code does not change during
    refactoring without us noticing.
    """
    # Fix the seed so that the generated request is reproducible.
    np.random.seed(123)
    # Generate random request.
    alpha, shape, technique, int_options = generate_random_request()
    # Perform calculation.
    rslt = eupy.get_baseline_lognormal(alpha, shape, technique, int_options)
    # Ensure equivalence with the expected result up to numerical precision.
    np.testing.assert_almost_equal(rslt, 0.21990743996551923)
In [ ]:
def test_invalid_request():
    """ Test that an exception is raised in the case of a flawed request.
    """
    with pytest.raises(NotImplementedError):
        # Generate random request.
        alpha, shape, technique, int_options = generate_random_request()
        # Invalidate the request by asking for an integration technique that
        # is not implemented.
        technique = 'gaussian_quadrature'
        eupy.get_baseline_lognormal(alpha, shape, technique, int_options)
In [ ]:
def test_closed_form_naive():
    """ Test whether the results line up with a closed-form solution for the
    special case where alpha is set to zero. This function tests the naive
    Monte Carlo implementation.
    """
    for _ in range(10):
        # Generate random request.
        alpha, shape, _, int_options = generate_random_request()
        # Set favourable options.
        int_options['naive_mc']['num_draws'] = 1000
        int_options['naive_mc']['implementation'] = 'fast'
        # Restrict to the special case.
        alpha, shape = 0.0, 0.001
        # Calculate the closed-form solution and simulate the special case.
        closed_form = lognorm.mean(shape)
        simulated = eupy.get_baseline_lognormal(alpha, shape, 'naive_mc', int_options)
        # Test equality.
        np.testing.assert_almost_equal(closed_form, simulated, decimal=3)
In [ ]:
def test_feature_speed():
    """ Test whether the results from the fast and slow implementations of the
    naive Monte Carlo integration are identical.
    """
    technique = 'naive_mc'
    for _ in range(10):
        # Generate random request.
        alpha, shape, _, int_options = generate_random_request()
        # Loop over the alternative implementations.
        baseline = None
        for implementation in ['fast', 'slow']:
            int_options['naive_mc']['implementation'] = implementation
            args = (alpha, shape, technique, int_options)
            rslt = eupy.get_baseline_lognormal(*args)
            if baseline is None:
                baseline = rslt
            # Test equality.
            np.testing.assert_almost_equal(baseline, rslt)
In [ ]:
def test_feature_romberg():
    """ This test compares the output from the two alternative SciPy
    integration routines.
    """
    # Fix the seed so that the generated requests are reproducible.
    np.random.seed(123)
    for _ in range(5):
        # Generate random request.
        alpha, shape, _, _ = generate_random_request()
        # Integration according to the alternative rules.
        romb = eupy.get_baseline_lognormal(alpha, shape, 'romberg')
        quad = eupy.get_baseline_lognormal(alpha, shape, 'quad')
        # Test the equality of the results.
        np.testing.assert_almost_equal(romb, quad)
In [ ]:
''' Execute module as a script.
'''
if __name__ == "__main__":

    test_random_requests()
    test_invalid_request()
    test_closed_form_quad()
    test_naive_implementations()
    test_closed_form_naive()
    test_regression()
In principle, we can collect all our tests in a module and execute it as a script:
$ python eupy/testing/test_eupy.py
How much of our code base is covered by tests at this point?
$ py.test --cov=eupy
Name Stmts Miss Cover
-----------------------------------------------
eupy/__init__.py 2 0 100%
eupy/eu_calculations.py 29 2 93%
eupy/integration_rules.py 31 6 81%
eupy/testing/__init__.py 0 0 100%
eupy/testing/checks.py 87 15 83%
eupy/testing/test_eupy.py 67 6 91%
eupy/utility_functions.py 6 0 100%
-----------------------------------------------
TOTAL 222 29 87%
Coverage of integration rules is comparatively low. Why?
As usual, this basic output works just fine for projects with a low level of complexity. However, at some point more detail will be required to guide your development efforts. In my research, I use a tool called Codecov.
A description of the test suite is available as part of the online documentation.
The actual code is available in the GitHub repository here.
There are several tools out there that I have found useful in the past to improve the quality of my code along all the dimensions we discussed. Let us visit Codacy online and take a look around.
... premature optimization is the root of all evil. (Knuth, 1974)
In [ ]:
# Add the eupy package to the PYTHONPATH
import sys
sys.path.insert(0, '../../../eupy')
In [ ]:
# standard library
import cProfile
import time
# project library
from eupy import get_baseline_lognormal
# Specify request
alpha, shape, technique = 0.3, 0.1, 'naive_mc'
int_options = dict()
int_options['naive_mc'] = dict()
int_options['naive_mc']['implementation'] = 'slow'
int_options['naive_mc']['num_draws'] = 10000
int_options['naive_mc']['seed'] = 123
Let us first just measure the total execution time of our request.
In [ ]:
# Iterate over both types of implementations.
for implementation in ['fast', 'slow']:
    # Switch between the fast and slow implementation.
    int_options['naive_mc']['implementation'] = implementation
    # Get the computer time at the start.
    start_time = time.time()
    # Execute the request.
    get_baseline_lognormal(alpha, shape, technique, int_options)
    # Print the time difference.
    print(implementation, time.time() - start_time, "seconds")
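As an alternative, which is not part of the original lecture, the timeit module from the standard library averages the execution time over several repetitions and is therefore less sensitive to one-off fluctuations than a single time.time() measurement. The sketch below reuses the request specified above; the number of repetitions is an arbitrary choice.
In [ ]:
import timeit

# Time both implementations, averaging over five repetitions each.
for implementation in ['fast', 'slow']:
    int_options['naive_mc']['implementation'] = implementation
    total = timeit.timeit(
        lambda: get_baseline_lognormal(alpha, shape, technique, int_options),
        number=5)
    print(implementation, total / 5, 'seconds per run')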
Let us now investigate where most of the time is spent in a slow evaluation.
In [ ]:
# Profiler is part of the Python standard library.
int_options['naive_mc']['implementation'] = 'slow'
cProfile.run("get_baseline_lognormal(alpha, shape, technique, int_options)")
1260190 function calls in 1.068 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
40004 0.079 0.000 0.209 0.000 basic_checks
1 0.000 0.000 0.000 0.000 check_integration_options
1 0.000 0.000 1.068 1.068 get_baseline_lognormal
10000 0.026 0.000 1.056 0.000 _wrapper_baseline
1 0.000 0.000 1.067 1.067 naive_monte_carlo
1 0.011 0.011 1.067 1.067 slow_implementation
10000 0.012 0.000 0.130 0.000 baseline_utility
...
Luckily a graphical viewer for this output is available. We will use snakeviz.
Let us inspect our package using this visualization tool.
In [ ]:
# Profiler is part of the Python standard library.
int_options['naive_mc']['implementation'] = 'slow'
cProfile.run("get_baseline_lognormal(alpha, shape, technique, int_options)", 'eupy.prof')
Then we look at the report.
$ snakeviz eupy.prof
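If you prefer to stay in the terminal, the pstats module from the standard library can also sort and summarize the saved profile. This is an addition of mine and assumes the profile was written to eupy.prof as in the cell above.
In [ ]:
import pstats

# Load the saved profile and print the ten entries with the largest
# cumulative time.
stats = pstats.Stats('eupy.prof')
stats.sort_stats('cumulative').print_stats(10)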
Version Control
Testing and Coverage
Code Review
Build
...
using GitHub and its integrations as the center.
Check out more detailed lectures at the Software Engineering for Economists Initiative online.
Explore the material on Software Carpentry.
Sign up for a personal GitHub account and create a remote repository.
If you are using a Windows machine, download and install Ubuntu Desktop.
Set aside a day or two to establish a continuous integration workflow for your current research project.
Philipp Eisenhauer
Software Engineering for Economists Initiative
In [ ]:
from urllib.request import urlopen
from IPython.core.display import HTML

# Fetch and render the content of the linked page.
HTML(urlopen('http://bit.ly/1K5apRH').read().decode('utf-8'))