First things first

Click File -> Save a copy in Drive and click Open in new tab in the pop-up window to save your progress in Google Drive.

Using PyMC3

In this assignment, we will learn how to use a library for probabilistic programming and inference called PyMC3.

Setup

Loading auxiliary files and importing the necessary libraries.


In [1]:
pip install -U pymc3 arviz


Collecting pymc3
  Downloading https://files.pythonhosted.org/packages/32/19/6c94cbadb287745ac38ff1197b9fadd66500b6b9c468e79099b110c6a2e9/pymc3-3.8-py3-none-any.whl (908kB)
     |████████████████████████████████| 911kB 2.8MB/s 
Requirement already up-to-date: arviz in /usr/local/lib/python3.6/dist-packages (0.8.3)
Requirement already satisfied, skipping upgrade: scipy>=0.18.1 in /usr/local/lib/python3.6/dist-packages (from pymc3) (1.4.1)
Requirement already satisfied, skipping upgrade: numpy>=1.13.0 in /usr/local/lib/python3.6/dist-packages (from pymc3) (1.18.4)
Requirement already satisfied, skipping upgrade: tqdm>=4.8.4 in /usr/local/lib/python3.6/dist-packages (from pymc3) (4.41.1)
Requirement already satisfied, skipping upgrade: theano>=1.0.4 in /usr/local/lib/python3.6/dist-packages (from pymc3) (1.0.4)
Requirement already satisfied, skipping upgrade: pandas>=0.18.0 in /usr/local/lib/python3.6/dist-packages (from pymc3) (1.0.4)
Requirement already satisfied, skipping upgrade: patsy>=0.4.0 in /usr/local/lib/python3.6/dist-packages (from pymc3) (0.5.1)
Requirement already satisfied, skipping upgrade: h5py>=2.7.0 in /usr/local/lib/python3.6/dist-packages (from pymc3) (2.10.0)
Requirement already satisfied, skipping upgrade: packaging in /usr/local/lib/python3.6/dist-packages (from arviz) (20.4)
Requirement already satisfied, skipping upgrade: xarray>=0.11 in /usr/local/lib/python3.6/dist-packages (from arviz) (0.15.1)
Requirement already satisfied, skipping upgrade: netcdf4 in /usr/local/lib/python3.6/dist-packages (from arviz) (1.5.3)
Requirement already satisfied, skipping upgrade: matplotlib>=3.0 in /usr/local/lib/python3.6/dist-packages (from arviz) (3.2.1)
Requirement already satisfied, skipping upgrade: six>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from theano>=1.0.4->pymc3) (1.12.0)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.18.0->pymc3) (2.8.1)
Requirement already satisfied, skipping upgrade: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.18.0->pymc3) (2018.9)
Requirement already satisfied, skipping upgrade: pyparsing>=2.0.2 in /usr/local/lib/python3.6/dist-packages (from packaging->arviz) (2.4.7)
Requirement already satisfied, skipping upgrade: setuptools>=41.2 in /usr/local/lib/python3.6/dist-packages (from xarray>=0.11->arviz) (47.1.1)
Requirement already satisfied, skipping upgrade: cftime in /usr/local/lib/python3.6/dist-packages (from netcdf4->arviz) (1.1.3)
Requirement already satisfied, skipping upgrade: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0->arviz) (1.2.0)
Requirement already satisfied, skipping upgrade: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0->arviz) (0.10.0)
Installing collected packages: pymc3
  Found existing installation: pymc3 3.7
    Uninstalling pymc3-3.7:
      Successfully uninstalled pymc3-3.7
Successfully installed pymc3-3.8

In [2]:
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False
if IN_COLAB:
    print("Downloading Colab files")
    ! shred -u setup_google_colab.py
    ! wget https://raw.githubusercontent.com/hse-aml/bayesian-methods-for-ml/master/setup_google_colab.py -O setup_google_colab.py
    import setup_google_colab
    setup_google_colab.load_data_week4()


Downloading Colab files
--2020-06-07 01:23:18--  https://raw.githubusercontent.com/hse-aml/bayesian-methods-for-ml/master/setup_google_colab.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1232 (1.2K) [text/plain]
Saving to: ‘setup_google_colab.py’

setup_google_colab. 100%[===================>]   1.20K  --.-KB/s    in 0s      

2020-06-07 01:23:18 (75.5 MB/s) - ‘setup_google_colab.py’ saved [1232/1232]

https://raw.githubusercontent.com/hse-aml/bayesian-methods-for-ml/master/week4/w4_grader.py w4_grader.py
https://raw.githubusercontent.com/hse-aml/bayesian-methods-for-ml/master/week4/adult_us_postprocessed.csv adult_us_postprocessed.csv

In [3]:
import numpy as np
import pandas as pd
import numpy.random as rnd
import seaborn as sns
from matplotlib import animation
import pymc3 as pm
from w4_grader import MCMCGrader
%pylab inline


/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm
Populating the interactive namespace from numpy and matplotlib

Grading

We will create a grader instance below and use it to collect your answers. Note that the answers are stored locally inside the grader and are uploaded to the platform only when you run the submission cell in the last part of this assignment. If you want to make a partial submission, you can run that cell at any time.


In [0]:
grader = MCMCGrader()

Task 1. Alice and Bob

Alice and Bob are trading on the market. Both of them sell the Thing and want to make as much profit as possible. Every hour they check each other's prices and adjust their own to stay competitive, although they follow different price-setting strategies.

Alice: takes Bob's price from the previous hour, multiplies it by 0.6, adds \$90, and adds Gaussian noise from $N(0, 20^2)$.

Bob: takes Alice's price from the current hour, multiplies it by 1.2, subtracts \$20, and adds Gaussian noise from $N(0, 10^2)$.
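
In other words, writing $a_t$ and $b_t$ for Alice's and Bob's prices at hour $t$, the update rules are:

$$a_t = 0.6\, b_{t-1} + 90 + \varepsilon^{(a)}_t, \quad \varepsilon^{(a)}_t \sim N(0, 20^2)$$

$$b_t = 1.2\, a_t - 20 + \varepsilon^{(b)}_t, \quad \varepsilon^{(b)}_t \sim N(0, 10^2)$$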

The problem is to find the joint distribution of Alice and Bob's prices after many hours of such an experiment.

Task 1.1

Implement the run_simulation function according to the description above.


In [0]:
def run_simulation(alice_start_price=300.0, bob_start_price=300.0, seed=42, num_hours=10000, burnin=1000):
    """Simulates an evolution of prices set by Bob and Alice.
    
    The function should simulate Alice and Bob behavior for `burnin' hours, then ignore the obtained
    simulation results, and then simulate it for `num_hours' more.
    The initial burnin (also sometimes called warmup) is done to make sure that the distribution stabilized.
    
    Please don't change the signature of the function.
    
    Returns:
        two lists, with Alice and with Bob prices. Both lists should be of length num_hours.
    """
    np.random.seed(seed)

    alice_prices = [alice_start_price]
    bob_prices = [bob_start_price]
    
    #### YOUR CODE HERE ####
    for hour in range(burnin + num_hours - 1):
        # Alice: takes Bob's price from the previous hour, multiplies it by 0.6,
        # adds $90, and adds Gaussian noise from N(0, 20^2).
        alice_current = bob_prices[-1] * 0.6 + 90 + np.random.normal(loc=0, scale=20)
        # Bob: takes Alice's price from the current hour, multiplies it by 1.2,
        # subtracts $20, and adds Gaussian noise from N(0, 10^2).
        bob_current = alice_current * 1.2 - 20 + np.random.normal(loc=0, scale=10)

        alice_prices.append(alice_current)
        bob_prices.append(bob_current)
    ### END OF YOUR CODE ###
    return alice_prices[burnin:], bob_prices[burnin:]

In [31]:
alice_prices, bob_prices = run_simulation(alice_start_price=300, bob_start_price=300, seed=42, num_hours=3, burnin=1)
if len(alice_prices) != 3:
    raise RuntimeError('Make sure that the function returns `num_hours` data points.')
grader.submit_simulation_trajectory(alice_prices, bob_prices)


Current answer for task 1.1 (Alice trajectory) is: 279.93428306022463  291.67686875834846
Current answer for task 1.1 (Bob trajectory) is: 314.5384966605577  345.2425410740984

Task 1.2

What is the average price for Alice and Bob after the burn-in period? Whose prices are higher?


In [32]:
#### YOUR CODE HERE ####
alice_prices, bob_prices = run_simulation(alice_start_price=300, bob_start_price=300)

average_alice_price = np.mean(alice_prices)
average_bob_price = np.mean(bob_prices)
### END OF YOUR CODE ###
grader.submit_simulation_mean(average_alice_price, average_bob_price)


Current answer for task 1.2 (Alice mean) is: 278.85416992423353
Current answer for task 1.2 (Bob mean) is: 314.6064116545574

Task 1.3

Let's look at the 2-d histogram of prices, computed using kernel density estimation.


In [33]:
data = np.array(run_simulation())
sns.jointplot(data[0, :], data[1, :], stat_func=None, kind='kde')


Out[33]:
<seaborn.axisgrid.JointGrid at 0x7faf6e4ff198>

Clearly, Alice's and Bob's prices are highly correlated. What is the Pearson correlation coefficient of Alice's and Bob's prices?


In [34]:
#### YOUR CODE HERE ####
correlation = np.corrcoef(alice_prices, bob_prices)[0,1]
### END OF YOUR CODE ###
grader.submit_simulation_correlation(correlation)


Current answer for task 1.3 (Bob and Alice prices correlation) is: 0.9636099866766943

Task 1.4

We observe an interesting effect here: it seems that the joint distribution of Alice's and Bob's prices converges to a correlated bivariate Gaussian distribution.

Let's check whether the results change if we use a different random seed and different starting points.
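
One quick empirical check (a sketch using the run_simulation function defined above, not part of the graded task) is to compare summary statistics across a few seeds and starting prices:

# Compare means and correlation across different seeds and starting prices.
for seed in (0, 42, 2020):
    for start in (10.0, 300.0, 10000.0):
        a, b = run_simulation(alice_start_price=start, bob_start_price=start, seed=seed)
        print("seed=%4d start=%7.1f  mean_alice=%.1f  mean_bob=%.1f  corr=%.3f"
              % (seed, start, np.mean(a), np.mean(b), np.corrcoef(a, b)[0, 1]))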


In [35]:
# Pick different starting prices, e.g. 10, 1000, 10000, for Bob and Alice.
# Does the joint distribution of the two prices depend on these parameters?
POSSIBLE_ANSWERS = {
    0: 'Depends on random seed and starting prices', 
    1: 'Depends only on random seed',
    2: 'Depends only on starting prices',
    3: 'Does not depend on random seed and starting prices'
}

idx = 3  ### TYPE THE INDEX OF THE CORRECT ANSWER HERE ###
answer = POSSIBLE_ANSWERS[idx]
grader.submit_simulation_depends(answer)


Current answer for task 1.4 (depends on the random data or not) is: Does not depend on random seed and starting prices

Task 2. Logistic regression with PyMC3

Logistic regression is a powerful model that allows you to analyze how a set of features affects a binary target label. The posterior distribution over the weights gives us an estimate of the influence of each particular feature on the probability that the target equals one. Most importantly, the posterior distribution gives us interval estimates for each weight of the model. This matters for data analysis whenever you want not only a good model but also an estimate of the uncertainty of your conclusions.

In this task, we will learn how to use the PyMC3 library to perform approximate Bayesian inference for logistic regression.

This part of the assignment is based on the logistic regression tutorial by Peadar Coyle and J. Benjamin Cook.

Logistic regression.

The problem here is to model how the probability that a person has salary $\geq$ \$50K is affected by his/her age, education, sex and other features.

Let $y_i = 1$ if the $i$-th person's salary is $\geq$ \$50K and $y_i = 0$ otherwise. Let $x_{ij}$ be the $j$-th feature of the $i$-th person.

Logistic regression models this probability in the following way:

$$p(y_i = 1 \mid \beta) = \sigma (\beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} ), $$

where $\sigma(t) = \frac1{1 + e^{-t}}$
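
For intuition, $\sigma$ squashes any real-valued score into $(0, 1)$. A quick numeric check of the formula, using the NumPy import from the setup above:

# Quick numeric check of the sigmoid sigma(t) = 1 / (1 + exp(-t)).
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
print(sigmoid(0.0))   # 0.5: a score of zero maps to probability 0.5
print(sigmoid(2.0))   # ~0.881
print(sigmoid(-2.0))  # ~0.119, since sigma(-t) = 1 - sigma(t)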

Odds ratio.

Let's try to answer the following question: does the gender of a person affect his or her salary? To answer this, we will use the concept of odds.

If we have a binary random variable $y$ (which may indicate whether a person makes \$50K) and the probability of the positive outcome $p(y = 1)$ is, for example, 0.8, we say that the *odds* are 4 to 1 (or just 4 for short), because succeeding is 4 times more likely than failing: $\frac{p(y = 1)}{p(y = 0)} = \frac{0.8}{0.2} = 4$.

Now, let's return to the effect of gender on salary. Let's compute the ratio between the odds of a male having salary $\geq$ \$50K and the odds of a female (with the same level of education, experience, and everything else) having salary $\geq$ \$50K. The first feature of each person in the dataset is gender: $x_{i1} = 0$ if the person is female and $x_{i1} = 1$ otherwise. Consider two people $i$ and $j$ who have all features the same except that $x_{i1} \neq x_{j1}$.

If the logistic regression model above estimates the probabilities exactly, the odds for a male will be (check it!): $$ \frac{p(y_i = 1 \mid x_{i1}=1, x_{i2}, \ldots, x_{ik})}{p(y_i = 0 \mid x_{i1}=1, x_{i2}, \ldots, x_{ik})} = \frac{\sigma(\beta_1 + \beta_2 x_{i2} + \ldots)}{1 - \sigma(\beta_1 + \beta_2 x_{i2} + \ldots)} = \exp(\beta_1 + \beta_2 x_{i2} + \ldots) $$
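
The last equality follows directly from the definition of $\sigma$: with $t = \beta_1 + \beta_2 x_{i2} + \ldots$,

$$\frac{\sigma(t)}{1 - \sigma(t)} = \frac{\frac{1}{1 + e^{-t}}}{\frac{e^{-t}}{1 + e^{-t}}} = \frac{1}{e^{-t}} = e^{t}.$$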

Now the ratio of the male and female odds will be: $$ \frac{\exp(\beta_1 \cdot 1 + \beta_2 x_{i2} + \ldots)}{\exp(\beta_1 \cdot 0 + \beta_2 x_{i2} + \ldots)} = \exp(\beta_1) $$

So, given the correct logistic regression model, we can estimate the odds ratio for some feature (gender in this example) by just looking at the corresponding coefficient. But of course, even if all the logistic regression assumptions are met, we cannot estimate the coefficient exactly from real-world data; it is just too noisy. So it would be really nice to build an interval estimate, which would tell us something along the lines of "with probability 0.95 the odds ratio is greater than 0.8 and less than 1.2, so we cannot conclude that there is any gender discrimination in the salaries" (or, conversely, "with probability 0.95 the odds ratio is greater than 1.5 and less than 1.9, so discrimination takes place because a male has at least 1.5 times higher odds of getting >\$50K than a female with the same level of education, age, etc."). In Bayesian statistics, this interval estimate is called a credible interval.

Unfortunately, it's impossible to compute this credible interval analytically. So let's use MCMC for that!

Credible interval

A credible interval for the value of $\exp(\beta_1)$ is an interval $[a, b]$ such that $p(a \leq \exp(\beta_1) \leq b \mid X_{\text{train}}, y_{\text{train}})$ is $0.95$ (or some other predefined value). To compute the interval, we need access to the posterior distribution $p(\exp(\beta_1) \mid X_{\text{train}}, y_{\text{train}})$.

Let's, for simplicity, focus on the posterior over the parameter, $p(\beta_1 \mid X_{\text{train}}, y_{\text{train}})$, since if we compute it, we can always find $[a, b]$ such that $p(\log a \leq \beta_1 \leq \log b \mid X_{\text{train}}, y_{\text{train}}) = p(a \leq \exp(\beta_1) \leq b \mid X_{\text{train}}, y_{\text{train}}) = 0.95$.
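
Concretely, once MCMC gives us posterior samples of $\beta_1$, the central 95% credible interval for $\exp(\beta_1)$ is just the empirical 2.5% and 97.5% quantiles of $\exp(\beta_1)$ over those samples. A minimal sketch, where beta1_samples stands for a NumPy array of posterior samples (we will obtain such samples with PyMC3 below):

# Sketch: central 95% credible interval from posterior samples.
# `beta1_samples` is a placeholder for an array of MCMC samples of beta_1.
odds_ratio_samples = np.exp(beta1_samples)
a, b = np.percentile(odds_ratio_samples, [2.5, 97.5])
print("95%% credible interval for exp(beta_1): [%.3f, %.3f]" % (a, b))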

Task 2.1 MAP inference

Let's read the dataset. This is a post-processed version of the UCI Adult dataset.


In [11]:
data = pd.read_csv("adult_us_postprocessed.csv")
data.head()


Out[11]:
      sex  age  educ  hours  income_more_50K
0    Male   39    13     40                0
1    Male   50    13     13                0
2    Male   38     9     40                0
3    Male   53     7     40                0
4  Female   28    13     40                0

Each row of the dataset describes a person and his or her features. The last column is the target variable $y$: one indicates that the person's annual salary is more than \$50K.

First of all, let's set up a Bayesian logistic regression model (i.e., define priors on the parameters $\alpha$ and $\beta$ of the model) that predicts the value of "income_more_50K" based on a person's age and education:

$$ p(y = 1 \mid \alpha, \beta_1, \beta_2) = \sigma(\alpha + \beta_1 x_1 + \beta_2 x_2) \\ \alpha \sim N(0, 100^2) \\ \beta_1 \sim N(0, 100^2) \\ \beta_2 \sim N(0, 100^2), \\ $$

where $x_1$ is a person's age, $x_2$ is his/her level of education, $y$ indicates his/her level of income, and $\alpha$, $\beta_1$, and $\beta_2$ are parameters of the model.


In [12]:
with pm.Model() as manual_logistic_model:
    # Declare pymc random variables for logistic regression coefficients with uninformative 
    # prior distributions N(0, 100^2) on each weight using pm.Normal. 
    # Don't forget to give each variable a unique name.
    
    #### YOUR CODE HERE ####

    alpha = pm.Normal('alpha', mu=0, sigma=100)
    beta1 = pm.Normal('beta1', mu=0, sigma=100)
    beta2 = pm.Normal('beta2', mu=0, sigma=100)
    
    ### END OF YOUR CODE ###
    
    # Transform these random variables into vector of probabilities p(y_i=1) using logistic regression model specified 
    # above. PyMC random variables are theano shared variables and support simple mathematical operations.
    # For example:
    # z = pm.Normal('x', 0, 1) * np.array([1, 2, 3]) + pm.Normal('y', 0, 1) * np.array([4, 5, 6])
    # is a correct PyMC expression.
    # Use pm.invlogit for the sigmoid function.
    
    #### YOUR CODE HERE ####
    
    prob = pm.invlogit(alpha + beta1 * data['age'] + beta2 * data['educ'])
    ### END OF YOUR CODE ###
    
    # Declare PyMC Bernoulli random vector with probability of success equal to the corresponding value
    # given by the sigmoid function.
    # Supply target vector using "observed" argument in the constructor.

    #### YOUR CODE HERE ####
    likelihood = pm.Bernoulli('likelihood', p=prob, observed=data['income_more_50K'])
    ### END OF YOUR CODE ###
    
    # Use pm.find_MAP() to find the maximum a-posteriori estimate for the vector of logistic regression weights.
    map_estimate = pm.find_MAP()
    print(map_estimate)


logp = -18,844, ||grad|| = 57,293: 100%|██████████| 30/30 [00:00<00:00, 110.95it/s]  
{'alpha': array(-6.74811904), 'beta1': array(0.04348316), 'beta2': array(0.36210803)}

Submit the MAP estimates of the corresponding coefficients:


In [13]:
with pm.Model() as logistic_model:
    # There's a simpler interface for generalized linear models in pymc3. 
    # Try to train the same model using pm.glm.GLM.from_formula.
    # Do not forget to specify that the target variable is binary (and hence follows Binomial distribution).
    
    #### YOUR CODE HERE ####
    formula = 'income_more_50K ~ age + educ'
    likelihood = pm.glm.GLM.from_formula(formula, data, family=pm.glm.families.Binomial())
    ### END OF YOUR CODE ###
    map_estimate = pm.find_MAP()
    print(map_estimate)


WARNING (theano.tensor.blas): We did not find a dynamic library in the library_dir of the library we use for blas. If you use ATLAS, make sure to compile it with dynamics library.
logp = -15,131, ||grad|| = 0.024014: 100%|██████████| 32/32 [00:00<00:00, 123.80it/s]  
{'Intercept': array(-6.7480998), 'age': array(0.04348259), 'educ': array(0.36210894)}


In [14]:
beta_age_coefficient = 0.04348259  ### TYPE MAP ESTIMATE OF THE AGE COEFFICIENT HERE ###
beta_education_coefficient = 0.36210894  ### TYPE MAP ESTIMATE OF THE EDUCATION COEFFICIENT HERE ###
grader.submit_pymc_map_estimates(beta_age_coefficient, beta_education_coefficient)


Current answer for task 2.1 (MAP for age coef) is: 0.04348259
Current answer for task 2.1 (MAP for aducation coef) is: 0.36210894

Task 2.2 MCMC

To find credible regions let's perform MCMC inference.


In [0]:
# You will need the following function to visualize the sampling process.
# You don't need to change it.
def plot_traces(traces, burnin=200):
    ''' 
    Convenience function:
    Plot traces with overlaid means and values
    '''
    
    ax = pm.traceplot(traces[burnin:], figsize=(12,len(traces.varnames)*1.5),
        lines={k: v['mean'] for k, v in pm.summary(traces[burnin:]).iterrows()})

    for i, mn in enumerate(pm.summary(traces[burnin:])['mean']):
        ax[i,0].annotate('{:.2f}'.format(mn), xy=(mn,0), xycoords='data'
                    ,xytext=(5,10), textcoords='offset points', rotation=90
                    ,va='bottom', fontsize='large', color='#AA0022')

Metropolis-Hastings

Let's use the Metropolis-Hastings algorithm for finding the samples from the posterior distribution.

Once you have written the code, explore the hyperparameters of Metropolis-Hastings, such as the proposal distribution variance, to speed up convergence. You can use the plot_traces function in the next cell to visually inspect convergence.

You may also use the MAP estimate to initialize the sampling scheme to speed things up. This will make the warmup (burn-in) period shorter, since you will start from a probable point.
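
Both options can be expressed through the arguments of pm.Metropolis and pm.sample. A minimal sketch (assuming logistic_model already contains the regression; the scaling value is arbitrary and only illustrates where the proposal variance is controlled):

# Sketch: shrink the proposal variance and start the chain from the MAP point.
with logistic_model:
    start = pm.find_MAP()              # a probable starting point
    step = pm.Metropolis(scaling=0.5)  # smaller proposal scale than the default 1.0
    trace = pm.sample(400, step=step, start=start, chains=1)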


In [16]:
with pm.Model() as logistic_model:
    # Since it is unlikely that the dependency between the age and salary is linear, we will include age squared
    # into features so that we can model dependency that favors certain ages.
    # Train Bayesian logistic regression model on the following features: sex, age, age^2, educ, hours
    # Use pm.sample to run MCMC to train this model.
    # To specify the particular sampler method (Metropolis-Hastings) to pm.sample,
    # use `pm.Metropolis`.
    # Train your model for 400 samples.
    # Save the output of pm.sample to a variable: this is the trace of the sampling procedure and will be used
    # to estimate the statistics of the posterior distribution.
    
    #### YOUR CODE HERE ####
    data['age_squared'] = data['age'] ** 2
    formula = 'income_more_50K ~ sex + age + age_squared + educ + hours'
    likelihood = pm.glm.GLM.from_formula(formula, data, family=pm.glm.families.Binomial())
    trace = pm.sample(400, step=pm.Metropolis(), chains=1)
    ### END OF YOUR CODE ###


Only 400 samples in chain.
Sequential sampling (1 chains in 1 job)
CompoundStep
>Metropolis: [hours]
>Metropolis: [educ]
>Metropolis: [age_squared]
>Metropolis: [age]
>Metropolis: [sex[T. Male]]
>Metropolis: [Intercept]
Sampling chain 0, 0 divergences: 100%|██████████| 900/900 [01:02<00:00, 14.36it/s]
Only one chain was sampled, this makes it impossible to run some convergence checks

In [17]:
pm.__version__


Out[17]:
'3.8'

In [18]:
plot_traces(trace)


/usr/local/lib/python3.6/dist-packages/arviz/data/base.py:146: UserWarning: More chains (200) than draws (6). Passed array should have shape (chains, draws, *shape)
  UserWarning,
arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 200), minimum_shape: (chains=2, draws=4)
/usr/local/lib/python3.6/dist-packages/arviz/plots/backends/matplotlib/traceplot.py:152: UserWarning: A valid var_name should be provided, found {'a', 'h', 's', 'e', 'I'} expected from {'hours', 'Intercept', 'age_squared', 'sex[T. Male]', 'educ', 'age'}
  invalid_var_names, all_var_names

NUTS sampler

Use pm.sample without specifying a particular sampling method (PyMC3 will choose one automatically). In this case the sampling algorithm will be NUTS, a form of Hamiltonian Monte Carlo whose parameters are tuned automatically. This is an advanced method that we haven't covered in the lectures, but it usually converges faster and gives less correlated samples than vanilla Metropolis-Hastings.
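
In code this simply means dropping the step argument (a sketch, assuming logistic_model is defined as above):

# Sketch: PyMC3 assigns NUTS to the continuous variables and tunes it during warmup.
with logistic_model:
    trace = pm.sample(400, chains=1)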


In [19]:
with pm.Model() as logistic_model:
    # Train Bayesian logistic regression model on the following features: sex, age, age_squared, educ, hours
    # Use pm.sample to run MCMC to train this model.
    # Train your model for 400 samples.
    # Training can take a while, so relax and wait :)
    
    #### YOUR CODE HERE ####
    formula = 'income_more_50K ~ sex + age + age_squared + educ + hours'
    likelihood = pm.glm.GLM.from_formula(formula, data, family=pm.glm.families.Binomial())
    trace = pm.sample(400, step=pm.NUTS(), chains=1)
    ### END OF YOUR CODE ###


Only 400 samples in chain.
Sequential sampling (1 chains in 1 job)
NUTS: [hours, educ, age_squared, age, sex[T. Male], Intercept]
Sampling chain 0, 0 divergences: 100%|██████████| 900/900 [14:39<00:00,  1.02it/s]
The acceptance probability does not match the target. It is 0.9840827118135124, but should be close to 0.8. Try to increase the number of tuning steps.
Only one chain was sampled, this makes it impossible to run some convergence checks

In [20]:
plot_traces(trace)


arviz.stats.stats_utils - WARNING - Shape validation failed: input_shape: (1, 200), minimum_shape: (chains=2, draws=4)
/usr/local/lib/python3.6/dist-packages/arviz/plots/backends/matplotlib/traceplot.py:152: UserWarning: A valid var_name should be provided, found {'a', 'h', 's', 'e', 'I'} expected from {'hours', 'Intercept', 'age_squared', 'sex[T. Male]', 'educ', 'age'}
  invalid_var_names, all_var_names

Estimating the odds ratio

Now, let's build the posterior distribution on the odds ratio given the dataset (approximated by MCMC).


In [22]:
# We don't need to use a large burn-in here, since we initialize sampling
# from a good point (from our approximation of the most probable
# point (MAP) to be more precise).
burnin = 100
b = trace['sex[T. Male]'][burnin:]
plt.hist(np.exp(b), bins=20, density=True)
plt.xlabel("Odds Ratio")
plt.show()


Finally, we can find a credible interval (recall that credible intervals are Bayesian and confidence intervals are frequentist) for this quantity. This may be the best part of Bayesian statistics: we get to interpret credible intervals the way we have always wanted to interpret them. We are 95% sure that the odds ratio lies within our interval!


In [23]:
lb, ub = np.percentile(b, 2.5), np.percentile(b, 97.5)
print("P(%.3f < Odds Ratio < %.3f) = 0.95" % (np.exp(lb), np.exp(ub)))


P(3.023 < Odds Ratio < 3.460) = 0.95

In [24]:
# Submit the obtained credible interval.
grader.submit_pymc_odds_ratio_interval(np.exp(lb), np.exp(ub))


Current answer for task 2.2 (credible interval lower bound) is: 3.022648987035948
Current answer for task 2.2 (credible interval upper bound) is: 3.4597593844022723

Task 2.3 Interpreting the results


In [25]:
# Does gender affect salary in the provided dataset?
# (Note that the data is from 1996 and maybe not representative
# of the current situation in the world.)
POSSIBLE_ANSWERS = {
    0: 'No, there is certainly no discrimination',
    1: 'We cannot say for sure',
    2: 'Yes, we are 95% sure that a female is *less* likely to get >$50K than a male with the same age, level of education, etc.', 
    3: 'Yes, we are 95% sure that a female is *more* likely to get >$50K than a male with the same age, level of education, etc.', 
}

idx = 2  ### TYPE THE INDEX OF THE CORRECT ANSWER HERE ###
answer = POSSIBLE_ANSWERS[idx]
grader.submit_is_there_discrimination(answer)


Current answer for task 2.3 (does the data suggest gender discrimination?) is: Yes, we are 95% sure that a female is *less* likely to get >$50K than a male with the same age, level of education, etc.

Authorization & Submission

To submit assignment parts to the Coursera platform, please enter your e-mail and token into the variables below. You can generate a token on this programming assignment's page. Note: the token expires 30 minutes after generation.


In [36]:
STUDENT_EMAIL = 'saketkc@gmail.com'
STUDENT_TOKEN = '6r463miiML4NWB9M'
grader.status()


You want to submit these numbers:
Task 1.1 (Alice trajectory): 279.93428306022463  291.67686875834846
Task 1.1 (Bob trajectory): 314.5384966605577  345.2425410740984
Task 1.2 (Alice mean): 278.85416992423353
Task 1.2 (Bob mean): 314.6064116545574
Task 1.3 (Bob and Alice prices correlation): 0.9636099866766943
Task 1.4 (depends on the random data or not): Does not depend on random seed and starting prices
Task 2.1 (MAP for age coef): 0.04348259
Task 2.1 (MAP for aducation coef): 0.36210894
Task 2.2 (credible interval lower bound): 3.022648987035948
Task 2.2 (credible interval upper bound): 3.4597593844022723
Task 2.3 (does the data suggest gender discrimination?): Yes, we are 95% sure that a female is *less* likely to get >$50K than a male with the same age, level of education, etc.

If you want to submit these answers, run cell below


In [37]:
grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)


Submitted to Coursera platform. See results on assignment page!

(Optional) generating videos of sampling process

In this part you will generate videos showing the sampling process.

Setting things up

You don't need to modify the code below; it sets up the plotting functions. The code is based on the MCMC visualization tutorial.


In [28]:
from IPython.display import HTML

# Number of MCMC iterations to animate.
samples = 400

figsize(6, 6)
fig = plt.figure()
s_width = (0.81, 1.29)
a_width = (0.11, 0.39)
samples_width = (0, samples)
ax1 = fig.add_subplot(221, xlim=s_width, ylim=samples_width)
ax2 = fig.add_subplot(224, xlim=samples_width, ylim=a_width)
ax3 = fig.add_subplot(223, xlim=s_width, ylim=a_width,
                      xlabel='male coef',
                      ylabel='educ coef')
fig.subplots_adjust(wspace=0.0, hspace=0.0)
line1, = ax1.plot([], [], lw=1)
line2, = ax2.plot([], [], lw=1)
line3, = ax3.plot([], [], 'o', lw=2, alpha=.1)
line4, = ax3.plot([], [], lw=1, alpha=.3)
line5, = ax3.plot([], [], 'k', lw=1)
line6, = ax3.plot([], [], 'k', lw=1)
ax1.set_xticklabels([])
ax2.set_yticklabels([])
lines = [line1, line2, line3, line4, line5, line6]

def init():
    for line in lines:
        line.set_data([], [])
    return lines

def animate(i):
    with logistic_model:
        if i == 0:
            # Burnin
            for j in range(samples): iter_sample.__next__() 
        trace = iter_sample.__next__()
    line1.set_data(trace['sex[T. Male]'][::-1], range(len(trace['sex[T. Male]'])))
    line2.set_data(range(len(trace['educ'])), trace['educ'][::-1])
    line3.set_data(trace['sex[T. Male]'], trace['educ'])
    line4.set_data(trace['sex[T. Male]'], trace['educ'])
    male = trace['sex[T. Male]'][-1]
    educ = trace['educ'][-1]
    line5.set_data([male, male], [educ, a_width[1]])
    line6.set_data([male, s_width[1]], [educ, educ])
    return lines


Animating Metropolis-Hastings


In [0]:
with pm.Model() as logistic_model:
    # Again define Bayesian logistic regression model on the following features: sex, age, age_squared, educ, hours
    
    #### YOUR CODE HERE ####
    # One possible completion, mirroring the model trained above; the MAP estimate is
    # recomputed here so that `start=map_estimate` below matches this model's variables.
    formula = 'income_more_50K ~ sex + age + age_squared + educ + hours'
    likelihood = pm.glm.GLM.from_formula(formula, data, family=pm.glm.families.Binomial())
    map_estimate = pm.find_MAP()
    ### END OF YOUR CODE ###
    step = pm.Metropolis()
    iter_sample = pm.iter_sample(2 * samples, step, start=map_estimate)
anim = animation.FuncAnimation(fig, animate, init_func=init,
                               frames=samples, interval=5, blit=True)
HTML(anim.to_html5_video())
# Note that generating the video may take a while.

Animating NUTS

Now rerun the animation providing the NUTS sampling method as the step argument.


In [0]: