Software Engineering for Economists

... background material available at https://github.com/softecon/talks

Computer-Assisted Research in Practice

  • I restarted a research project after a longer break, and now the code no longer works or yields different results?

  • I just added new components to my code, but for some reason the old ones have now stopped working?

  • I wanted to reproduce and then extend published research, so why is the first step already so hard?

  • ...

$\Rightarrow$ There is a better way!

Software Engineering

... the systematic application of scientific and technological knowledge, methods, and experience to the design, implementation, testing, and documentation of software

Assessing the Reliability of Structural Econometric Models

  • Verification: How accurately does the computational model solve the underlying equations of the model for the quantities of interest?

  • Validation: How accurately does the model represent reality for the quantities of interest?

  • Uncertainty Quantification: How do the various sources of error and uncertainty feed into uncertainty in the model-based prediction of the quantities of interest?

Software Engineering

  • verify the correctness of computations

  • increase the transparency of implementations

  • ensure the recomputability of results

  • expand the set of economic questions we can address

Why Python for Scientific Computing?

  • Python is used by computer programmers and scientists alike. Thus, tools from software engineering are readily available.

  • Python is an open-source project. This ensures that all implementation details can be critically examined. There are no licence costs and low barriers to recomputability.

  • Python can be easily linked to high-performance languages such as C and Fortran. Python is ideal for prototyping with a focus on readability, design patterns, and ease of testing.

  • Python has numerous high-quality libraries for scientific computing under active development.

For more details, you can check out my lecture on Python for Scientific Computing in Economics from last year online.

Research Example

Robust Human Capital Investment under Risk and Ambiguity

  • Plausible

    • better description of agent decision problem
  • Meaningful

    • reinterpretation of economic phenomenon
    • reevaluation of policy interventions
  • Tractable

Computational Challenges and Software Engineering

  • ensure correctness of implementation

  • manage numerous numerical components

  • address performance constraints

  • guarantee recomputability and extensibility

  • ...

$\Rightarrow$ building, documenting, and communicating expertise.

The online documentation of respy is available at http://respy.readthedocs.io.

Example

Throughout this lecture we will work with a simple example. We are interested in writing code that calculates an agent's expected utility $EU$ from a lottery. The generic problem is

$$EU = \int_{-\infty}^{\infty} u(x) f(x)\,dx,$$

where $x$ is a possible outcome, $f(x)$ denotes the probability density function, and $u(x)$ the utility function.

We start with a simple utility function

$$ u(x) = \sqrt{x},$$

where $x > 0$ and $x$ is drawn from a lognormal distribution with shape parameter $s$. To solve the integral, we opt for a naive Monte Carlo integration. See Skrainka & Judd (2013) for the importance of choosing more advanced integration strategies in serious research.
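
A naive Monte Carlo integration approximates the integral by the sample average of the utility function evaluated at $N$ random draws $x_i$ from $f(x)$:

$$EU \approx \frac{1}{N}\sum_{i=1}^{N} u(x_i).$$

For our particular setup, a closed-form benchmark exists as well: for a lognormal distribution with location zero and shape parameter $s$, the standard moment formula gives $EU = E[\sqrt{x}] = \exp(s^2/8)$. We will exploit such special cases later when testing the implementation.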

Prototyping


In [ ]:
from scipy.stats import lognorm

deviates = lognorm.rvs(1, size=1000)

Rslt = 0.0
for deviate in deviates:
    Rslt += deviate ** 0.5

print('Expected Utility {}'.format(Rslt / 1000))

Problems

  • Hard-coded problem parameters are spread out across the code.

  • The naming of variables is inconsistent.

  • There is no obvious way to scale the code to a more generic problem.

  • The implementation is not tested.

What would a more mature implementation look like?


run.py

eupy/

... eu_calculations.py
... integration_rules.py
... utility_functions.py
... input_checks.py
... __init__.py

... testing/
       ... __init__.py
       ... test_eupy.py
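
A minimal sketch of the same computation in this structure (illustrative only, not the actual eupy implementation; here the Monte Carlo rule samples directly from the lognormal distribution):


In [ ]:
""" Illustrative refactoring of the prototype: parameters are collected
in one place and the logic is split into small, testable functions.
"""
import numpy as np

from scipy.stats import lognorm


def baseline_utility(x):
    """ Utility function u(x) = sqrt(x). """
    return np.sqrt(x)


def naive_monte_carlo(func, distribution, num_draws, seed):
    """ Approximate E[func(x)] by the sample mean over random deviates
    from the given frozen distribution.
    """
    np.random.seed(seed)
    deviates = distribution.rvs(size=num_draws)
    return np.mean(func(deviates))


def get_expected_utility(shape, num_draws=1000, seed=123):
    """ Expected utility of a lognormal lottery with the given shape. """
    return naive_monte_carlo(baseline_utility, lognorm(shape), num_draws, seed)


print('Expected Utility {}'.format(get_expected_utility(1.0)))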

Main Improvements

  • clear and extensible structure

  • testing infrastructure available

  • ...

Goal of the Lectures

Provide some guidance on how to get from a prototype to a more mature implementation of your research project in a systematic way.

Dimensions of Code Quality

  • correctness

  • maintainability

  • readability

  • scalability

Basic Tools

  • Version Control
  • Code Review
  • Testing
  • Profiling
  • Continuous Integration

Version Control

    Version Control records changes to a file or set of files over time so that you can recall (and compare) specific versions later. I draw heavily on the free e-book Pro Git and Blischak et al. (2016).

    Tracking changes to your code over time has a variety of benefits:

    • traceability

    • clarity

    • flexibility

    • reduced duplication and error

    $\Rightarrow$ transparent, reproducible, and collaborative research.

    In our quick tour, we will do the following:

    • set up local repository

    • commit files

    • track changes to files

    • compare versions of files

    We will use the eupy package as our baseline.

    $ git clone https://github.com/softEcon/eupy.git; rm -rf eupy/.git; cd eupy
    

    Let us initialize a repository ...

    $ git init
    Initialized empty Git repository in eupy/.git/
    

    ... and provide some information about ourselves.

    $ git config user.name "Philipp Eisenhauer"
    $ git config user.email "eisenhauer@policy-lab.org"
    

    First of all, let us take stock and check the status of our local repository:

    $ git status
    On branch master
    
    Initial commit
    
    Untracked files:
      (use "git add <file>..." to include in what will be committed)
    
            .coveralls.yml
            .gitignore
            .travis.yml
            LICENSE
            MANIFEST
            README.md
            _travis/
            dist/
            eupy/
            requirements.txt
            run.py
            setup.cfg
            setup.py
    
    nothing added to commit but untracked files present (use "git add" to track)
    

    We now set up tracking of all our files and revisit the status:

    $ git add *
    $ git status
    On branch master
    
    Initial commit
    
    Changes to be committed:
      (use "git rm --cached <file>..." to unstage)
    
    
            new file:   eupy/__init__.py
            new file:   eupy/eu_calculations.py
            new file:   eupy/integration_rules.py
            new file:   eupy/testing/
            new file:   eupy/utility_functions.py
            new file:   requirements.txt
            new file:   run.py
    
    Untracked files:
      (use "git add <file>..." to include in what will be committed)
    
            .coveralls.yml
            .gitignore
            .travis.yml
    

    We are now ready for our initial commit.

    $ git commit -a -m'Initial commit of directory content.'
     19 files changed, 662 insertions(+)
     create mode 100644 LICENSE
     create mode 100644 MANIFEST
     create mode 100644 README.md
     create mode 100644 _travis/requirements.py
     create mode 100644 eupy/__init__.py
     create mode 100644 eupy/eu_calculations.py
     create mode 100644 eupy/integration_rules.py
     create mode 100644 eupy/testing/
     create mode 100644 eupy/utility_functions.py
     create mode 100644 requirements.txt
     create mode 100755 run.py
     create mode 100644 setup.cfg
     create mode 100644 setup.py
    

    Adding a message to our commit makes it much easier to find and return to it later.

    $ git log
    commit 49593b8c447a9f023a35ff724a3467db2fe10037
    Author: Philipp Eisenhauer <eisenhauer@policy-lab.org>
    Date:   Mon Feb 1 21:11:58 2016 +0100
    
        Initial commit of directory content.
    

    This brings us back to the beginning of our repository lifecycle.

    $ git status
    On branch master
    Untracked files:
      (use "git add <file>..." to include in what will be committed)
    
            .coveralls.yml
            .gitignore
            .travis.yml
    
    nothing added to commit but untracked files present (use "git add" to track)
    

    Let us modify the setup of the random seed in integration_rules.py using our favourite editor and check the current status of our files.

    $ git status
    On branch master
    Changes not staged for commit:
      (use "git add <file>..." to update what will be committed)
      (use "git checkout -- <file>..." to discard changes in working directory)
    
            modified:   eupy/integration_rules.py
    

    What exactly is the difference?

    $ git diff
    diff --git a/eupy/integration_rules.py b/eupy/integration_rules.py
    index 66aab9c..d6d7dbd 100644
    --- a/eupy/integration_rules.py
    +++ b/eupy/integration_rules.py
    @@ -36,7 +36,7 @@ def naive_monte_carlo(func, bounds, num_draws, implementation, seed):
         lower, upper = bounds
    
         # Draw requested number of deviates.
    -    np.random.seed(seed)
    +    np.random.seed(123)
         deviates = np.random.uniform(lower, upper, size=num_draws)
    
         # Implement native Monte Carlo approach.
    

    In case we want to discard our changes now:

    $ git checkout eupy/integration_rules.py
    

    Here we discard our changes immediately. However, we can always return to any previously committed version of a file.

    $ git checkout IDENTIFIER eupy/integration_rules.py
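
    To find the IDENTIFIER of a particular commit, we can consult the log in its compact form:

    $ git log --oneline
    49593b8 Initial commit of directory content.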
    

    Git also provides a feature called Branching. It allows you to maintain parallel versions of your code and to switch easily between them. I usually create a new branch for each feature that I add to my research software. Once I am done, I merge my edits back into the master branch.

    But first, let us check where we are now.

    $ git status
    On branch master
    Your branch is up-to-date with 'origin/master'.
    nothing to commit, working directory clean
    

    Now we create and immediately switch to a new branch that we will use to extend the set of implemented integration strategies.

    $ git checkout -b feature_romberg
    Switched to a new branch 'feature_romberg'
    

    However, we can also switch back and forth between branches.

    $ git checkout master
    

    Once we are done with our edits, we can simply merge the feature branch back into master.

    $ git merge feature_romberg
    

    Once you are familiar with the basic workflow for versioning your own code, the natural next step is to share your code online. GitHub is a popular online hosting service. Among other benefits, you then have a backup copy of your files and can establish a transparent workflow with your co-authors.
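
    As a minimal sketch (USERNAME is a placeholder for your own account, and the remote repository must already exist on GitHub), publishing our local repository then takes two commands:

    $ git remote add origin https://github.com/USERNAME/eupy.git
    $ git push -u origin master
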

    Let us check out the repository for my current research on ambiguity here.

    Testing

    Key Challenges

    • How do I come up with good tests?

    • How can I make the process of testing as convenient as possible?

    Types of Tests

    • Unit Tests (see the sketch below)

    • Integration Tests

    • Regression Tests

    • ...
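
    To make the first category concrete, here is a minimal unit-test sketch for the naive_monte_carlo rule. Its signature is taken from the diff shown earlier; the expected return value rests on the assumption that the rule averages the integrand over uniform draws between the bounds.

    In [ ]:
    import numpy as np

    from eupy.integration_rules import naive_monte_carlo


    def test_constant_integrand():
        """ Unit test sketch: integrating a constant function over the
        unit interval should recover the constant exactly, for any seed
        (under the averaging assumption stated above).
        """
        # The expression 0.0 * x + 2.0 evaluates to 2.0 elementwise for
        # both scalar and array arguments.
        rslt = naive_monte_carlo(lambda x: 0.0 * x + 2.0, (0.0, 1.0),
                                 100, 'fast', 123)
        np.testing.assert_almost_equal(rslt, 2.0)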

    We will need to generate a host of random requests for our testing battery.

    
    
    In [ ]:
    import numpy as np


    def generate_random_request():
        """ Generate a random admissible request.
        """
        # Draw random deviates that honor the constraints for the utility
        # function and the distribution of returns.
        alpha, shape = np.random.uniform(low=0.0001, size=2)
        # Draw a random integration technique.
        technique = np.random.choice(['naive_mc', 'quad'])
        # Add integration options.
        int_options = dict()
        int_options['naive_mc'] = dict()
        int_options['naive_mc']['implementation'] = \
            np.random.choice(['slow', 'fast'])
        int_options['naive_mc']['num_draws'] = np.random.randint(10, 1001)
        int_options['naive_mc']['seed'] = np.random.randint(10, 1001)

        # Finishing
        return alpha, shape, technique, int_options
    

    Example Tests

    
    
    In [ ]:
    def test_random_requests():
        """ Draw a whole host of random requests to ensure that the function
        works for all admissible values.
        """
        for _ in range(100):
            # Generate random request.
            alpha, shape, technique, int_options = generate_random_request()
            # Perform calculation.
            eupy.get_baseline_lognormal(alpha, shape, technique, int_options)
    
    
    
    In [ ]:
    def test_regression():
        """ This regression test ensures that results do not change
        during refactoring without us noticing.
        """
        # Fix the seed so the generated request is deterministic.
        np.random.seed(123)
        # Generate random request.
        alpha, shape, technique, int_options = generate_random_request()
        # Perform calculation.
        rslt = eupy.get_baseline_lognormal(alpha, shape, technique, 
                                           int_options)
        # Ensure equivalence with expected results up to numerical precision.
        np.testing.assert_almost_equal(rslt, 0.21990743996551923)
    
    
    
    In [ ]:
    def test_invalid_request():
        """ Test that an error is raised in case of a flawed request.
        """
        with pytest.raises(NotImplementedError):
            # Generate random request.
            alpha, shape, technique, int_options = generate_random_request()
            # Invalidate request with an unknown integration technique.
            technique = 'gaussian_quadrature'
            # This call is expected to raise a NotImplementedError.
            eupy.get_baseline_lognormal(alpha, shape, technique, int_options)
    
    
    
    In [ ]:
    def test_closed_form_naive():
        """ Test whether the results line up with the closed-form solution
        for the special case where alpha is set to zero. This function
        tests the naive Monte Carlo implementation.
        """
        for _ in range(10):
            # Generate random request.
            alpha, shape, _, int_options = generate_random_request()
        # Choose favourable integration options.
            int_options['naive_mc']['num_draws'] = 1000
            int_options['naive_mc']['implementation'] = 'fast'
            # Restrict to special case.
            alpha, shape = 0.0, 0.001
            # Calculate closed form solution and simulate special case.
            closed_form = lognorm.mean(shape)
            simulated = eupy.get_baseline_lognormal(alpha, shape, 'naive_mc',
                                                    int_options)
            # Test equality.
            np.testing.assert_almost_equal(closed_form, simulated, decimal=3)
    
    
    
    In [ ]:
    def test_feature_speed():
        """ Test whether the results from the fast and slow implementations
        of the naive Monte Carlo integration are identical.
        """
        technique = 'naive_mc'
        for _ in range(10):
            # Generate random request.
            alpha, shape, _, int_options = generate_random_request()
            # Loop over alternative implementations.
            baseline = None
            for implementation in ['fast', 'slow']:
                int_options['naive_mc']['implementation'] = implementation
                args = (alpha, shape, technique, int_options)
                rslt = eupy.get_baseline_lognormal(*args)
                if baseline is None:
                    baseline = rslt
            # Test equality.
            np.testing.assert_almost_equal(baseline, rslt)
    
    
    
    In [ ]:
    def test_feature_romberg():
        """ This test compares the output from the two alternative SciPy
        integration rules.
        """
        # Fix the seed so the generated requests are deterministic.
        np.random.seed(123)
        for _ in range(5):
            # Generate random request.
            alpha, shape, _, _ = generate_random_request()
            # Integration according to alternative rules
            romb = eupy.get_baseline_lognormal(alpha, shape, 'romberg')
            quad = eupy.get_baseline_lognormal(alpha, shape, 'quad')
            # Testing the equality of results
            np.testing.assert_almost_equal(romb, quad)
    
    
    
    In [ ]:
    ''' Execute module as a script.
    '''

    if __name__ == "__main__":

        test_random_requests()

        test_invalid_request()

        test_regression()

        test_closed_form_naive()

        test_feature_speed()

        test_feature_romberg()
    

    In principle, we can collect all our tests in a module and execute it as a script:

    $ python eupy/testing/test_eupy.py
    

    Problems

    • execution stops immediately when a test fails

    • uninformative error messages

    • ...

    Let us use an automated testing tool such as py.test instead.

    $ py.test -v -s
    

    As usual, several alternatives exist: (1) nose, (2) unittest.

    How much of our code base is covered by tests at this point?

    $ py.test --cov=eupy
    
    Name                        Stmts   Miss  Cover
    -----------------------------------------------
    eupy/__init__.py                2      0   100%
    eupy/eu_calculations.py        29      2    93%
    eupy/integration_rules.py      31      6    81%
    eupy/testing/__init__.py        0      0   100%
    eupy/testing/checks.py         87     15    83%
    eupy/testing/test_eupy.py      67      6    91%
    eupy/utility_functions.py       6      0   100%
    -----------------------------------------------
    TOTAL                         222     29    87%
    

    Coverage of integration rules is comparatively low. Why?

    As usual, this basic output works just fine for projects with a low level of complexity. However, at some point more detail will be required to guide your development efforts. In my research, I use a tool called codecov.

    respy Test Suite

    • A description of the test suite is available as part of the online documentation.

    • The actual code is available in the GitHub repository here.

    Automated Code Review

    There are several tools out there that I have found useful for improving the quality of my code along all the dimensions we discussed. Let us visit Codacy online and take a look around.

    Using these tools helps you improve your own programming skills over time, as each issue is well explained and best practices are provided. More lightweight solutions are also available: (1) Pylint, (2) Pyflakes, and (3) pep8.
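
    For a quick local check without any online service, running one of these linters on the package takes a single command:

    $ pylint eupy
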

    Profiling

    ... premature optimization is the root of all evil. (Knuth, 1974)

    Motivation

    • increase the complexity of models we can analyse

    • save computational resources

    • allow for more experimentation

    • speed up feedback loop

    • ...

    
    
    In [ ]:
    # Add the eupy package to the PYTHONPATH
    import sys
    sys.path.insert(0, '../../../eupy')
    
    
    
    In [ ]:
    # standard library
    import cProfile
    import time
    
    # project library
    from eupy import get_baseline_lognormal
    
    # Specify request
    alpha, shape, technique = 0.3, 0.1, 'naive_mc'
    
    int_options = dict()
    int_options['naive_mc'] = dict()
    int_options['naive_mc']['implementation'] = 'slow'
    int_options['naive_mc']['num_draws'] = 10000
    int_options['naive_mc']['seed'] = 123
    

    Let us first just measure the total execution time of our request.

    
    
    In [ ]:
    # Iterate over both types of implementations
    for implementation in ['fast', 'slow']:
        # Switch between the fast and slow implementation.
        int_options['naive_mc']['implementation'] = implementation
        # Get the computer time at the start
        start_time = time.time()
        # Execute the request.
        get_baseline_lognormal(alpha, shape, technique, int_options)
        # Print the time difference.
        print(implementation, time.time() - start_time, "seconds")
    

    Let us now investigate where most of the time is spent in a slow evaluation.

    
    
    In [ ]:
    # Profiler is part of the Python standard library.
    int_options['naive_mc']['implementation'] = 'slow'
    cProfile.run("get_baseline_lognormal(alpha, shape, technique, int_options)")
    
    1260190 function calls in 1.068 seconds
    
       Ordered by: standard name
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        40004    0.079    0.000    0.209    0.000 basic_checks
            1    0.000    0.000    0.000    0.000 check_integration_options
            1    0.000    0.000    1.068    1.068 get_baseline_lognormal
        10000    0.026    0.000    1.056    0.000 _wrapper_baseline
            1    0.000    0.000    1.067    1.067 naive_monte_carlo
            1    0.011    0.011    1.067    1.067 slow_implementation
        10000    0.012    0.000    0.130    0.000 baseline_utility
    
    ...
    

    Luckily, a graphical viewer for this output is available. We will use snakeviz.

    Let us inspect our package using this visualization tool.

    
    
    In [ ]:
    # Profiler is part of the Python standard library.
    int_options['naive_mc']['implementation'] = 'slow'
    cProfile.run("get_baseline_lognormal(alpha, shape, technique, int_options)", 'eupy.prof')
    

    Then we look at the report.

    $ snakeviz eupy.prof
    

    Continuous Integration Workflow

    Automation Steps

    • Version Control

    • Testing and Coverage

    • Code Review

    • Build

    • ...

    with GitHub and its integrations at the center.
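
    The .travis.yml file in the repository wires the testing step into this workflow. As a hedged sketch (mirroring a typical minimal configuration of the time rather than the actual file's contents):

    language: python
    python:
      - "2.7"
    install:
      - pip install -r requirements.txt
    script:
      - py.test --cov=eupy
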

    Best Practices

    • Set up a scalable workflow right from the beginning.

    • Develop a testing harness as you implement and refine your code.

    • Build your model in a hierarchical way.

    • Initially focus on readability, testability, and scalability; only then tackle performance constraints.

    Next Steps

    • Check out more detailed lectures at the Software Engineering for Economists Initiative online.

      • Contribute a lecture on a special topic of your choice. See here for an example.
    • Explore the material on Software Carpentry.

    • Sign up for a personal GitHub account and create a remote repository.

    • If you are using a Windows machine, download and install Ubuntu Desktop.

    • Set aside a day or two to establish a continuous integration workflow for your current research project.

    Contact



    Philipp Eisenhauer



    Software Engineering for Economists Initiative

    
    
    In [ ]:
    # Fetch and render the HTML content behind the short link.
    from urllib.request import urlopen

    from IPython.core.display import HTML
    HTML(urlopen('http://bit.ly/1K5apRH').read().decode('utf-8'))