Home Assignment No. 3: Part 1

In this part of the homework you are to solve several problems related to machine learning algorithms.

For every separate problem you can get only 0 points or maximal points for this problem. There are NO INTERMEDIATE scores.
Your solution must me COMPLETE, i.e. contain all required formulas/proofs/detailed explanations.
You must write your solution for any problem just right after the words BEGIN SOLUTION. Attaching pictures of your handwriting is allowed, but highly discouraged.
If you want an easy life, you have to use BUILT-IN METHODS of sklearn library instead of writing tons of our yown code. There exists a class/method for almost everything you can imagine (related to this homework).
To do some tasks in this part of homework, you have to write CODE directly inside specified places inside notebook CELLS.
In some problems you may be asked to provide short discussion of the results. In this cases you have to create MARKDOWN cell with your comments right after the your code cell.
Your SOLUTION notebook MUST BE REPRODUCIBLE, i.e. if the reviewer decides to execute Kernel -> Restart Kernel and Run All Cells, after all the computation he will obtain exactly the same solution (with all the corresponding plots) as in your uploaded notebook. For this purpose, we suggest to fix random seed or (better) define random_state= inside every algorithm that uses some pseudorandomness.
Your code must be clear to the reviewer. For this purpose, try to include neccessary comments inside the code. But remember: GOOD CODE MUST BE SELF-EXPLANATORY without any additional comments.
The are problems with * mark - they are not obligatory. You can get EXTRA POINTS for solving them. ## $\LaTeX$ in Jupyter Jupyter has constantly improving $\LaTeX$ support. Below are the basic methods to write neat, tidy, and well typeset equations in your notebooks:
to write an inline equation use
```
$ you latex equation here $
```
to write an equation, that is displayed on a separate line use
```
$$ you latex equation here $$
```

to write a block of equations use

\begin{align}
  left-hand-side
      &= right-hand-side on line 1
      \\
      &= right-hand-side on line 2
      \\
      &= right-hand-side on the last line
\end{align}

The ampersand (&) aligns the equations horizontally and the double backslash (\\) creates a new line.

Write your theoretical derivations within such blocks:

**BEGIN Solution**

<!-- >>> your derivation here <<< -->

**END Solution**

Please, write your implementation within the designated blocks:

...
### BEGIN Solution

# >>> your solution here <<<

### END Solution
...



In [2]:

    
import numpy as np
import pandas as pd

import torch

%matplotlib inline
import matplotlib.pyplot as plt

Bayesian Models. GLM

Task 1 (1 pt.)

Consider a univariate Gaussian distribution $\mathcal{N}(x; \mu, \tau^{-1})$. Let's define Gaussian-Gamma prior for parameters $(\mu, \tau)$:

\begin{equation} p(\mu, \tau) = \mathcal{N}(\mu; \mu_0, (\beta \tau)^{-1}) \otimes \text{Gamma}(\tau; a, b) \,. \end{equation}

Find the posterior distribution of $(\mu, \tau)$ after observing $X = (x_1, \dots, x_n)$.

BEGIN Solution

$$ \mathbb{P}(\mu, \tau | X) \propto p(\mu, \tau) \mathbb{P}(X | \mu, \tau) $$

By condition: $$ \mathbb{P}(X | \mu, \tau) = \prod_{i=1}^n \mathbb{P}(x_i | \mu, \tau) $$ As we know distribution of each datasample: $$ \mathbb{P}(X | \mu, \tau) = \prod_{i=1}^n \frac{\tau^{\frac{1}{2}}}{\sqrt{2 \pi}} \exp{ \Big[-\frac{\tau (x_i - \mu)^2 }{2}\Big]} = \frac{\tau^{\frac{n}{2}}}{(2 \pi)^{\frac{n}{2}}} \exp{ \Big[-\frac{\tau}{2} \sum_{i=1}^n(x_i - \mu)^2 \Big]} $$

$$ p(\mu, \tau) = \mathcal{N}(\mu; \mu_0, (\beta \tau)^{-1}) \otimes \text{Gamma}(\tau; a, b) $$$$ p(\mu, \tau) = \frac{b^a \beta ^{\frac{1}{2}}}{(2\pi)^{\frac{1}{2}}\Gamma(a)} \tau^{a-\frac{1}{2}} e^{-b\tau} \exp{\Big( - \frac{\beta \tau}{2} (\mu - \mu_0)^2 \Big)} $$$$ p(\mu, \tau | X) \propto \tau^{\frac{n}{2} + a - \frac{1}{2}} e^{-b\tau} \exp{\Big( - \frac{\tau}{2} \Big[ \beta(\mu - \mu_0)^2 + \sum_{i=1}^{n} (x_i - \mu)^2 \Big] \Big)} $$$$ \sum_{i=1}^n (x_i - \mu)^2 = ns + n(\overline{x} - \mu)^2, \, \overline{x} = \frac{1}{n} \sum_{i=1}^n x_i, \, s=\frac{1}{n} \sum_{i=1}^n (x_i - \overline{x})^2 $$$$ \exp{\Big[ -\frac{\tau}{2} \Big[ \beta (\mu - \mu_0)^2 +ns + n(\overline{x} - \mu)^2 \Big] \Big]} \exp{(-b\tau)} = \exp{\Big[ -\tau \Big( \frac{1}{2} ns + b \Big) \Big]} \exp{\Big[ - \frac{\tau}{2} \Big( \beta(\mu - \mu_0)^2 + n (\overline{x} - \mu)^2 \Big) \Big]} $$

After simple regrouping: $$ \beta (\mu - \mu_0)^2 + n(\overline{x} - \mu)^2 = (\beta + n) \Big( \mu - \frac{\beta \mu_0 + n \overline{x}}{\beta + n} \Big)^2 + \frac{\beta n ( \overline{x} - \mu_0)^2}{\beta + n} $$

$$ p(\mu, \tau | X) \propto \tau^{\frac{n}{2} + a - \frac{1}{2}} \exp{\Big[ -\tau \Big( \frac{1}{2} ns + b \Big) \Big]} \exp{\Big[ - \frac{\tau}{2} \Big[ (\beta + n) \Big( \mu - \frac{\beta \mu_0 + n \overline{x}}{\beta + n} \Big)^2 + \frac{\beta n ( \overline{x} - \mu_0)^2}{\beta + n} \Big] \Big]} $$

Again regrouping: $$ p(\mu, \tau | X) \propto \tau^{\frac{n}{2} + a - \frac{1}{2}} \exp{\Big[ -\tau \Big[ \frac{1}{2} ns + b + \frac{\beta n ( \overline{x} - \mu_0)^2}{2(\beta + n)} \Big] \Big]} \exp{\Big[ - \frac{\tau}{2} (\beta + n) \Big( \mu - \frac{\beta \mu_0 + n \overline{x}}{\beta + n} \Big)^2 \Big]} $$ Finally: $$ \boxed{ p(\mu, \tau | X) \propto \Gamma\Big(\tau, \frac{n}{2} + a, \frac{1}{2} ns +b + \frac{\beta n ( \overline{x} - \mu_0)^2}{2(\beta + n)} \Big) \mathcal{N}\Big( \mu, \frac{\beta \mu_0 + n \overline{x}}{\beta + n}, (\tau(\beta + n))^{-1}\Big) } $$ END Solution

Task 2 (1 + 1 + 1 = 3 pt.)

Evaluate the following integral using the Laplace approximation: \begin{equation} x \mapsto \int \sigma(w^T x) \mathcal{N}(w; 0, \Sigma) dw \,, \end{equation} for $x = \bigl(\tfrac23, \tfrac16, \tfrac16\bigr)\in \mathbb{R}^3$ and \begin{equation} \Sigma = \begin{pmatrix} 1 & -0.25 & 0.75 \\ -0.25 & 1 & 0.5 \\ 0.75 & 0.5 & 2 \end{pmatrix} \,. \end{equation}

Task 2.1 (1 pt.)

Use the Hessian matrix computed numericaly via finite differences. (Check out Numdifftools)



In [17]:

    
import numdifftools as nd
from scipy.optimize import minimize
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal



In [82]:

    
### BEGIN Solution

def p_star_w(w):
    x = np.array([2/3, 1/6, 1/6], dtype=np.float64)
    E = np.array([[1, -0.25, 0.75], [-0.25, 1, 0.5], [0.75, 0.5, 2]], dtype=np.float64)
    left_part = 1/(1+np.exp(- w.T @ x))
    right_part = multivariate_normal(mean=[0,0,0], cov=E).pdf(w)
    return left_part * right_part

def log_star_w(w):
    return -np.log(p_star_w(w))

w_0 = minimize(log_star_w, np.array([1,2,1], dtype=np.float64)).x
Hessian = nd.Hessian(log_star_w)
A = Hessian(w_0)
Z_p = p_star_w(w_0) * np.sqrt((2*np.pi)**3/np.linalg.det(A))

print("The value of intergral:", Z_p)

### END Solution









    



The value of intergral: 0.49791972000790274

Task 2.2 (1 pt.)

Use the diagonal approximation of the Hessian computed by autodifferentiation in pytorch.



In [31]:

    
import torch
from torch.autograd import Variable, grad



In [108]:

    
### BEGIN Solution

def pt_p_star_w(w):
    x = np.array([2/3, 1/6, 1/6], dtype=np.float64)
    E = np.array([[1, -0.25, 0.75], [-0.25, 1, 0.5], [0.75, 0.5, 2]], dtype=np.float64)
    left_part = torch.sigmoid(torch.dot(w, Variable(torch.from_numpy(x).type(torch.FloatTensor))))
    right_part = 1 / (( 2 * np.pi )**(3/2) * np.linalg.det(E)**(1/2)) *\
                 torch.exp(-0.5 * w @ Variable(torch.from_numpy(np.linalg.inv(E)).type(torch.FloatTensor))@w)
    return left_part * right_part

def pt_log_star_w(w):
    return -torch.log(pt_p_star_w(w))

def hessian_diag(func, w):
    w = Variable(torch.FloatTensor(w), requires_grad=True)
    grad_params = torch.autograd.grad(func(w), w, create_graph=True)
    hessian = [torch.autograd.grad(grad_params[0][i], w, create_graph=True)[0].data.numpy() \
               for i in range(3)]
    return np.diagonal(hessian)*np.eye(3)

A = hessian_diag(pt_log_star_w, w_0)

pt_Z_p = (np.sqrt((2*np.pi)**3 / np.linalg.det(A)) *\
       pt_p_star_w(Variable(torch.from_numpy(w_0).type(torch.FloatTensor)))).data.numpy()

print('Integral value is', pt_Z_p)

### END Solution









    



Integral value is 0.29643095

Task 2.3 (1 pt.)

Compare the results comparing the absolute errors of the results (this is possible with Monte-Carlo estimate of the integral). Write 1-2 sentences in the results discussion.



In [112]:

    
from scipy.integrate import tplquad

### BEGIN Solution

def p_star_w_adapter(x, y, z):
    return p_star_w(np.array([x,y,z]))

acc_Z_p = tplquad(p_star_w_adapter, -10, 10, -10, 10, -10, 10)

print("Laplace method: %.05f" % abs(acc_Z_p[0] - Z_p))
print("Diag. Hessian Approx: %.05f" % abs(acc_Z_p[0] - pt_Z_p))

### END Solution









    



Laplace method: 0.00208
Diag. Hessian Approx: 0.20357

BEGIN Solution

So, we have got big absolute error in the second line due to the fact that we used Hessian diagonal approximation which neglects a lot of values off the diagonal

END Solution

Gaussian Processes

Task 3 (1 + 2 = 3 pt.)

Task 3.1 (1 pt.)

Assuimng the matrices $A \in \mathbb{R}^{n \times n}$ and $D \in \mathbb{R}^{d \times d}$ are invertible, using gaussian elimination find the inverse matrix for the following block matrix: \begin{equation} \begin{pmatrix} A & B \\ C & D \end{pmatrix} \,, \end{equation} where $C \in \mathbb{R}^{d \times n}$ and $B \in \mathbb{R}^{n \times d}$.

BEGIN Solution

$$ \Bigg( \begin{array}{cc|cc} A & B & I_n & 0\\ C & D & 0 & I_m\\ \end{array} \Bigg) \sim \Bigg( \begin{array}{cc|cc} I_n & B A^{-1} & A^{-1} & 0\\ C & D & 0 & I_m\\ \end{array} \Bigg) \sim \Bigg( \begin{array}{cc|cc} I_n & B A^{-1} & A^{-1} & 0\\ 0 & D - B A^{-1} C & - A^{-1} C & I_m\\ \end{array} \Bigg) \sim $$$$ \sim \Bigg( \begin{array}{cc|cc} I_n & B A^{-1} & A^{-1} & 0\\ 0 & I_m & - A^{-1} C (D - B A^{-1}C)^{-1} & (D - B A^{-1} C)^{-1}\\ \end{array} \Bigg) \sim \Bigg( \begin{array}{cc|cc} I_n & 0 & A^{-1} + A^{-1}C (D - B A^{-1}C)^{-1} B A^{-1} & -(D - B A^{-1}C)^{-1} BA^{-1} \\ 0 & I_m & - A^{-1} C (D - B A^{-1}C)^{-1} & (D - B A^{-1}C)^{-1}\\ \end{array} \Bigg) $$

Finally, $$

\boxed {\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1}

\Bigg( \begin{array}{cc} A^{-1} + A^{-1}C (D - B A^{-1}C)^{-1} B A^{-1} & - (D - B A^{-1}C)^{-1} BA^{-1} \

A^{-1} C (D - B A^{-1}C)^{-1} & (D - B A^{-1}C)^{-1}\ \end{array} \Bigg) } $$

END Solution

Task 3.2 (2 pt.)

Assume that the function $y(x)$, $x \in \mathbb{R}^d$, is a realization of the Gaussian Process $GP\bigl(0; K(\cdot, \cdot)\bigr)$ with $K(a, b) = \exp({- \gamma \|a - b\|_2^2}))$.

Suppose two datasets were observed: noiseless ${D_0}$ and noisy ${D_1}$ \begin{aligned} & D_0 = \bigl(x_i, y(x_i) \bigr)_{i=1}^{n} \,, \\ & D_1 = \bigl(x^\prime_i, y(x^\prime_i) + \varepsilon_i \bigr)_{i=1}^{m} \,, \end{aligned} where $\varepsilon_i \sim \text{ iid } \mathcal{N}(0, \sigma^2)$, independent of process $y$.

Derive the conditional distribution of $y(x) \big\vert_{D_0, D_1}$ at a new $x$.

BEGIN Solution

END Solution

Task 4 (1 + 2 = 3 pt.)

Task 4.1 (1 pt.)

In the late 1950’s Charles Keeling invented an accurate way to measure atmospheric $CO_2$ concentration and began taking regular measurements at the Mauna Loa observatory.

Take monthly_co2_mlo.csv file, load it and prepare the data.

Load the CO2 [ppm] time series
Replace $-99.99$ with NaN and drop the missing observations
Split the time series into train and test
Normalize the target value by fitting a transformation on the train
Plot the resulting target against the time index



In [196]:

    
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

### BEGIN Solution

df = pd.read_csv('data/monthly_co2_mlo.csv')
df = df.replace(-99.99, np.nan).dropna()
df.head(10)
y = df['CO2 [ppm]']
X = df.drop(['CO2 [ppm]'], axis=1)
X['year'] -= 1958
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, shuffle=False, test_size=0.25)
X.head(10)

### END Solution









    Out[196]:







  
    
      
      year
      seasonally adjusted [ppm]
      fit [ppm]
      seasonally adjusted fit [ppm]
      CO2 filled [ppm]
      seasonally adjusted filled [ppm]
    
  
  
    
      2
      0.166667
      314.43
      316.19
      314.90
      315.70
      314.43
    
    
      3
      0.250000
      315.15
      317.30
      314.98
      317.45
      315.15
    
    
      4
      0.333333
      314.72
      317.85
      315.06
      317.51
      314.72
    
    
      6
      0.500000
      315.18
      315.87
      315.22
      315.86
      315.18
    
    
      7
      0.583333
      316.18
      314.00
      315.29
      314.93
      316.18
    
    
      8
      0.666667
      316.08
      312.46
      315.35
      313.21
      316.08
    
    
      10
      0.833333
      315.20
      313.62
      315.46
      313.33
      315.20
    
    
      11
      0.916667
      315.43
      314.77
      315.52
      314.67
      315.43
    
    
      12
      1.000000
      315.55
      315.61
      315.57
      315.58
      315.55
    
    
      13
      1.083333
      315.87
      316.26
      315.64
      316.48
      315.87



In [197]:

    
scaler = StandardScaler()
y_test_min = np.min(y_train.values)
y_test_abs = np.max(y_train.values) - np.min(y_train.values)
y_train_scaled = scaler.fit_transform(y_train.values.reshape(-1, 1))
y_test_scaled = scaler.transform(y_test.values.reshape(-1, 1))

plt.figure(figsize=(14, 5))
plt.plot(X_train['year'],  y_train_scaled)
plt.plot(X_test['year'],  y_test_scaled)
plt.axvline(x=0.75 * np.max([np.max(X_train['year'].values), np.max(X_test['year'].values)]), c='black', ls='-')
plt.grid()
plt.ylabel(r'${CO}_2$', size=18)
plt.xlabel('Train and test split', size=18)
plt.show()

Task 4.2 (2 pt.)

Use GPy library for training and prediction. Fit a GP and run the predict on the test. Useful kernels to combine: GPy.kern.RBF, GPy.kern.Poly, GPy.kern.StdPeriodic, GPy.kern.White, GPy.kern.Linear.

Plot mean and confidence interval of the prediction.
Inspect them on normality by scatter plot: plot predicted points/time series against true values.
Estimate the prediction error with r2_score. R2-score accepted > 0.83 on test sample.



In [198]:

    
from GPy.models import GPRegression
from GPy.kern import RBF, Poly, StdPeriodic, White, Linear
from sklearn.metrics import r2_score



In [199]:

    
### BEGIN Solution

kernels = RBF(input_dim=1, variance=1., lengthscale=10.) + \
          Poly(input_dim=1) + \
          StdPeriodic(input_dim=1) + \
          White(input_dim=1) + \
          Linear(input_dim=1)

gpr = GPRegression(X_train['year'].values.reshape(-1, 1), y_train_scaled, kernels)
gpr.plot(figsize=(13,4))
plt.show()

### END Solution









    



 /home/maxwell/anaconda3/envs/python3/lib/python3.6/site-packages/matplotlib/figure.py:2366: UserWarning:This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.



In [201]:

    
predicted = gpr.predict(X_test['year'].values.reshape(-1, 1))
plt.figure(figsize=(13,4))
plt.plot(scaler.inverse_transform(y_test_scaled), scaler.inverse_transform(y_test_scaled), label='x = y', c='r')
plt.scatter(scaler.inverse_transform(predicted[0]), scaler.inverse_transform(y_test_scaled), label="")
plt.title("QQ - plot", size=16)
plt.xlabel("True value", size=16)
plt.ylabel("Predicted values", size=16)
plt.legend()
plt.show()



In [195]:

    
r2_score(predicted[0], y_test_scaled)









    Out[195]:





0.9936073526456586

	year	seasonally adjusted [ppm]	fit [ppm]	seasonally adjusted fit [ppm]	CO2 filled [ppm]	seasonally adjusted filled [ppm]
2	0.166667	314.43	316.19	314.90	315.70	314.43
3	0.250000	315.15	317.30	314.98	317.45	315.15
4	0.333333	314.72	317.85	315.06	317.51	314.72
6	0.500000	315.18	315.87	315.22	315.86	315.18
7	0.583333	316.18	314.00	315.29	314.93	316.18
8	0.666667	316.08	312.46	315.35	313.21	316.08
10	0.833333	315.20	313.62	315.46	313.33	315.20
11	0.916667	315.43	314.77	315.52	314.67	315.43
12	1.000000	315.55	315.61	315.57	315.58	315.55
13	1.083333	315.87	316.26	315.64	316.48	315.87