PyShop Session 5

Exercises


These exercises are loosely based on a few slides in a presentation by Michael Creel, available here.

The questions are in order of increasing difficulty: the first should take you less than a minute, while you may not be able to figure out the last. Good luck!

Note: I apologize for the solutions and questions not being next to each other; there is a numbering issue in the Markdown that generates this text. Sorry, but it is a known bug that has yet to be fixed!

This problem set uses most of what we have learned in the course, so it may take more time than the others. That being said, I encourage you to follow it through to the end.

Our goal in this problem set is to write a program that computes a maximum likelihood estimate using different programming techniques and packages, studying the benefits of each. What follows is a description of the problem; you can skip it if you like, as the exercises are self-contained.

Maximum Likelihood Estimation

Consider a sequence of binary outcome data (zeros and ones) that we know follows a Bernoulli distribution. For example, we have a coin that we know to be unfair and the results from flipping that coin $n$ times. More precisely, we have a series of observations $\{x_i\}_{i=1}^n$ ($x_i = 1$ if heads, $x_i = 0$ if tails) drawn from the distribution

$$ f(x_i; p) = p^{x_i}(1 - p)^{1 - x_i} $$

and we wish to estimate the value of $p$. This implies the likelihood function

$$ \begin{align*} L\left(\{x_i\}_{i=1}^n; p \right) &= \prod_i f(x_i; p) \\ &= \prod_i p^{x_i}(1 - p)^{1 - x_i} \\ &= p^{\sum_i x_i}(1 - p)^{n - \sum_i x_i} \end{align*} $$

A maximum likelihood estimator for $p$ maximizes this function. For some initial guess, $p_k$, we can take a second-order Taylor expansion of $L\left(\{x_i\}_{i=1}^n; p \right)$ around the guess to get

$$ L\left(\{x_i\}_{i=1}^n; p \right) \approx L\left(\{x_i\}_{i=1}^n; p_k \right) + L'\left(\{x_i\}_{i=1}^n; p_k \right)(p - p_k) \\ + \frac{1}{2}L''\left(\{x_i\}_{i=1}^n; p_k \right)(p - p_k)^2 $$

Optimizing only the part of the right-hand side that depends on $p$ is a simpler problem, since it is a quadratic function of $p$. So we seek to optimize

$$ \hat{L}\left(\{x_i\}_{i=1}^n; p, p_k \right) \approx L'\left(\{x_i\}_{i=1}^n; p_k \right)p + \frac{1}{2}L''\left(\{x_i\}_{i=1}^n; p_k \right)(p - p_k)^2 $$

which has a linear first-order condition with respect to $p$:

$$ L'\left(\{x_i\}_{i=1}^n; p_k \right) + L''\left(\{x_i\}_{i=1}^n; p_k \right)(p - p_k) = 0 $$

Thus the optimal next step solves this equation and is

$$ p_{k+1} = p_k - L''\left(\{x_i\}_{i=1}^n; p_k \right)^{-1}L'\left(\{x_i\}_{i=1}^n; p_k \right) $$

Substituting in the actual derivatives, we get the following iterative method for estimating $p$ (this is Newton's method):

$$ p_{k+1} = p_k - \frac{p_k^{s - 1}(1 - p_k)^{n - s - 1}\left(s(1 - p_k) - (n - s)p_k\right)}{p_k^{s - 1}(1 - p_k)^{n - s - 1}\left(s(s-1)\frac{1 - p_k}{p_k} - 2s(n-s) + (n - s)(n - s - 1)\frac{p_k}{1 - p_k}\right)} $$

where $s = \sum_i x_i$. Note that the common factor $p_k^{s - 1}(1 - p_k)^{n - s - 1}$ cancels from the numerator and denominator; cancelling it explicitly is worth doing in code, since for large $n$ that factor underflows to zero in floating point.

We will write a program that iterates this formula until the estimate of $p$ converges.
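
To make the formula concrete, here is a minimal sketch of one Newton step in plain Python, with the common factor cancelled as suggested above. The function name and signature are just one possible choice; the exercises below ask you to build your own version.

```python
def newton_update(p, s, n):
    """One step of the iteration: p_{k+1} = p_k - L'(p_k) / L''(p_k).

    p : current guess p_k, assumed to lie strictly between 0 and 1
    s : number of heads, i.e. the sum of the x_i
    n : sample size

    The common factor p**(s - 1) * (1 - p)**(n - s - 1) has been
    cancelled from both derivatives.
    """
    l1 = s * (1 - p) - (n - s) * p                 # L' up to the common factor
    l2 = (s * (s - 1) * (1 - p) / p
          - 2 * s * (n - s)
          + (n - s) * (n - s - 1) * p / (1 - p))   # L'' up to the common factor
    return p - l1 / l2
```

One caveat worth knowing: the raw likelihood is not globally concave, so the iteration is only guaranteed to behave well for starting guesses reasonably close to the maximizer $s/n$.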

  1. Write a function that takes a value of $n$ and a value of $p$ (the 'true' value), and returns a random sample $\{x_i\}_{i=1}^n$, where $x_i$ corresponds to the flipping of an unfair coin.

  2. Write a function (or a series of functions) that calculates the above equation, $p_{k+1}$, for a value of $p_k$ and a sample $\{x_i\}_{i=1}^n$.

  3. Use a loop and your functions to iterate the above equation. You can run a fixed number of iterations, or stop once the change in the estimate falls below some tolerance. (A sketch covering exercises 1-3 appears after this list.)

  4. Rewrite your functions to use NumPy if you have thus far used native Python.

  5. Create a function that plots the evolution of the estimates $p_k$. Do this using Matplotlib, Pandas, and Seaborn (just for practice; a Matplotlib sketch appears after this list).

  6. Use Numba to jit your functions (follow the link for installation instructions; a minimal example appears after this list). NOTE: If you used NumPy functions and would like to speed them up, find them in the source code on the NumPy website, paste them into your own code, then jit them!

  7. Combine your functions into a class object. Include in the class an option to output the sample and the series of estimates to a .csv file. (A possible skeleton appears after this list.)

  8. Can you think of any way to parallelize this problem? (NOTE: If you don't want to cheat, don't look at number 9 until you've fully answered this question).

  9. ANSWER QUESTION 8 FIRST! The calculation of $p_{k+1}$ depends on $p_k$, and the sample only needs to be drawn once, so the estimation itself cannot be parallelized outside of the summation. A basic summation is a pain to parallelize, and NumPy is already fast enough that doing so is rather pointless. However, we can parallelize drawing the sample! Create a function within your class object that uses multiprocessing to parallelize sample generation, and add the option as a keyword argument. (A sketch appears after this list.)

  10. Use %timeit to test for which values of $n$ your multiprocessing implementation is faster. (The sketch for exercise 9 ends with a %timeit example.)

  11. Write a short program that runs your estimation for different values of $n$ and plots the results for the single-process and multiprocess versions. Try to plot the execution time, the estimated value of $p$, and the variance of the estimators in each estimation procedure (i.e. $\mathbb{V}(p_k)$ as a function of $n$).

  12. Post your work to your GitHub account and marvel at how awesome you are. Congrats!
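
Sketches

If you get stuck, the sketches below outline one possible approach to several of the exercises. Try the exercises on your own first; all names and signatures here are illustrative choices, and your solutions may well look different. For exercises 1 through 3, a plain-Python version might look like this (it reuses `newton_update` from the sketch earlier in this document):

```python
import random


def draw_sample(n, p):
    """Exercise 1: n flips of an unfair coin with P(heads) = p."""
    return [1 if random.random() < p else 0 for _ in range(n)]


def estimate(sample, p0, tol=1e-10, max_iter=100):
    """Exercises 2-3: iterate newton_update (defined earlier) until the
    step size falls below tol; return the estimate and its history."""
    n, s = len(sample), sum(sample)
    p, history = p0, [p0]
    for _ in range(max_iter):
        p_next = newton_update(p, s, n)
        history.append(p_next)
        if abs(p_next - p) < tol:
            break
        p = p_next
    return p, history
```

As a sanity check, starting from the sample mean $s/n$ the step is already (numerically) zero; for a non-trivial run, perturb the starting guess slightly, e.g. `p0 = sum(sample) / len(sample) + 0.01`. Because the raw likelihood is concave only in a neighborhood of $s/n$ that shrinks as $n$ grows, guesses far from it can send the iteration the wrong way. For exercise 4, NumPy can draw the whole sample in one call, e.g. `np.random.binomial(1, p, size=n)`, and `sample.sum()` then replaces the Python-level sum.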
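
For exercise 5, the Matplotlib version can be as short as the sketch below; the Pandas and Seaborn variants are left to you. It assumes `history` is the list of estimates produced by `estimate` above.

```python
import matplotlib.pyplot as plt


def plot_history(history):
    """Plot the evolution of the estimates p_k against the iteration count."""
    plt.plot(range(len(history)), history, marker="o")
    plt.xlabel("iteration $k$")
    plt.ylabel("$p_k$")
    plt.title("Convergence of the estimates")
    plt.show()
```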
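
For exercise 6, jitting a plain-Python numerical function is usually just a decorator, as sketched below. The first call triggers compilation, so time the second call onwards.

```python
from numba import njit


@njit
def newton_update_jit(p, s, n):
    """The same update rule as before, compiled to machine code by Numba."""
    l1 = s * (1 - p) - (n - s) * p
    l2 = (s * (s - 1) * (1 - p) / p
          - 2 * s * (n - s)
          + (n - s) * (n - s - 1) * p / (1 - p))
    return p - l1 / l2
```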
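
For exercise 7, one possible skeleton is below. The class and method names are illustrative, and the standard library's csv module is only one of several ways to write the file.

```python
import csv
import random


class BernoulliMLE:
    """Bundle sampling, estimation, and output into one object."""

    def __init__(self, n, p_true):
        self.n = n
        self.p_true = p_true
        self.sample = [1 if random.random() < p_true else 0
                       for _ in range(n)]
        self.history = []

    def estimate(self, p0, tol=1e-10, max_iter=100):
        """Run the Newton iteration, storing the series of estimates."""
        s = sum(self.sample)
        p = p0
        self.history = [p0]
        for _ in range(max_iter):
            p_next = newton_update(p, s, self.n)  # from the sketch above
            self.history.append(p_next)
            if abs(p_next - p) < tol:
                break
            p = p_next
        return p

    def to_csv(self, path):
        """Write the sample and the series of estimates to a .csv file."""
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["sample"] + self.sample)
            writer.writerow(["estimates"] + self.history)
```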
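
For exercises 9 and 10, here is a sketch of parallel sample generation with multiprocessing.Pool; the chunking scheme and worker function are just one way to split the work. Exercise 9 asks you to fold this into your class behind a keyword argument.

```python
import random
from multiprocessing import Pool


def _draw_chunk(args):
    """Worker: draw one chunk of the sample (module-level so it can be pickled)."""
    n_chunk, p = args
    return [1 if random.random() < p else 0 for _ in range(n_chunk)]


def draw_sample_parallel(n, p, processes=4):
    """Split the n draws across worker processes and concatenate the results."""
    chunk = n // processes
    sizes = [chunk] * (processes - 1) + [n - chunk * (processes - 1)]
    # Reseed each worker: forked workers would otherwise inherit the
    # parent's RNG state and draw identical chunks.
    with Pool(processes, initializer=random.seed) as pool:
        chunks = pool.map(_draw_chunk, [(size, p) for size in sizes])
    return [x for c in chunks for x in c]
```

On platforms that spawn rather than fork worker processes (Windows, and macOS on recent Python versions), the calling code must live under an `if __name__ == "__main__":` guard.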
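
For exercise 10, you can then compare the two versions in IPython or Jupyter with something like the lines below (the sample sizes are just placeholders to experiment with):

```python
%timeit draw_sample(10_000_000, 0.3)
%timeit draw_sample_parallel(10_000_000, 0.3)
```

Process start-up and pickling overhead dominate for small $n$, so expect the parallel version to win only for large samples, if at all.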