This homework assignment will get you comfortable using NumPy and Matplotlib. Again, I fully expect that you will need some things we haven't discussed in class for elegant solutions. Read the docs, search online, and check Stack Overflow.
This assignment is worth ten points. You should complete it in this notebook. When complete, you should email the assignment to Dr. Johnson. The due date is Wednesday, June 8, by 9:00AM. Late assignments will be penalized 1 point per day (that's one full letter grade per day late). Submission anytime after 9:00AM on June 8 counts as 1 day late.
You may discuss and work on this assignment with your peers. However, you must submit your own work; copying is not permitted. You will be graded individually, and if your work appears to be a copy of someone else's, I may ask you to demonstrate in person and on the spot your ability to write the code to solve a similar problem.
Do all of your work in this notebook. I should be able to change the necessary filepaths in the first cell block, after which I should be able to restart the kernel and run all cells to see the correct output. I should not need to make any other changes to the file. You should verify that you've done the work correctly by doing this yourself (restart the kernel and run all using the 'kernel' dropdown menu above).
The function below is often used to measure the quality of probabilistic predictions in a classification setting. In the formula below, $n$ is the number of observations, $y_i$ is either a 0 or 1, $p_i$ is the predicted probability that $y_i = 1$, and the logarithm function is the natural logarithm (log base $e$: see https://en.wikipedia.org/wiki/Natural_logarithm to review). Note that as a probability, $0 \leq p_i \leq 1$.
Write a vectorized implementation of this function using NumPy. Note that for wildly wrong predictions (e.g., $p_i = 0$ when $y_i = 1$), this function assigns an infinite penalty, since $\log(0) = -\infty$. Write your function in such a way that infinite error is avoided.
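One possible vectorized sketch of this error function is below. The clipping constant `eps` is an arbitrary choice here; clipping the probabilities away from 0 and 1 is one way to cap the penalty, not the only one:

```python
import numpy as np

def log_loss(y, p, eps=1e-15):
    """Mean cross-entropy error for binary labels y and predicted probabilities p.

    Probabilities are clipped into [eps, 1 - eps] so that np.log never
    returns -inf for a wildly wrong prediction.
    """
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```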
$$Error = -\frac{1}{n}\sum_{i=1}^n\left[y_i\cdot\log(p_i) + (1-y_i)\cdot\log(1-p_i)\right].$$

The beta distribution is a probability distribution for probabilities. It is parameterized by two values, $\alpha$ and $\beta$. The distribution is implemented in the scipy.stats library. Using Matplotlib, plot the probability density function of the beta distribution for three different pairs of values of $\alpha$ and $\beta$. Put your plots side by side in a single figure, and give each panel a legend and a title. See the sample code below:
In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import norm
x = np.linspace(-5, 5, 100)
normal = norm(loc=0, scale=1) # instantiate the distribution with parameters
y = normal.pdf(x) # get the pdf for values x
plt.fill_between(x, 0, y, color='#6699FF', alpha=0.3);
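Extending the sample above to the beta distribution, a sketch of one way to lay out three panels side by side follows. The $(\alpha, \beta)$ pairs here are arbitrary illustrative choices, not required values:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

# Three arbitrary (alpha, beta) pairs chosen for illustration.
params = [(2, 2), (2, 5), (5, 2)]
x = np.linspace(0, 1, 200)

# One row of three side-by-side panels in a single figure.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (a, b) in zip(axes, params):
    dist = beta(a, b)  # freeze the distribution with these parameters
    ax.plot(x, dist.pdf(x), label=rf'$\alpha={a},\ \beta={b}$')
    ax.set_title(f'Beta({a}, {b})')
    ax.legend()
plt.tight_layout()
```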
Generate a list of 1000 values, each 0 or 1, chosen randomly using NumPy (look up NumPy's random module). Generate a second list of 1000 values from a beta distribution using the .rvs method (see sample code below). Use the error function you wrote in part 1 to calculate the error as if the first list were actual classifications and the second list were predicted probabilities. What is the error?
In [3]:
# example of the .rvs method - normal is the frozen norm object created above
normal.rvs(5)  # draws 5 random samples
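The whole part-3 workflow can be sketched as follows. This assumes a clipped error function like the one described in part 1; the beta parameters and the seed are arbitrary choices made here for reproducibility:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)  # seeded for reproducibility (arbitrary seed)

# 1000 random 0/1 "actual" classifications.
y = rng.integers(0, 2, size=1000)

# 1000 "predicted probabilities" drawn from an arbitrary beta distribution.
p = beta(2, 2).rvs(size=1000, random_state=rng)

# Clipped log loss, as in part 1: clipping keeps np.log finite.
eps = 1e-15
p_clipped = np.clip(p, eps, 1 - eps)
error = -np.mean(y * np.log(p_clipped) + (1 - y) * np.log(1 - p_clipped))
print(error)
```

Because the labels and the probabilities are unrelated here, the error should be fairly large; the exact value depends on the seed and parameters chosen.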