In this part of the homework you are to solve several problems related to machine learning algorithms.
Use the sklearn library instead of writing tons of your own code: there exists a class/method for almost everything you can imagine (related to this homework). Your SOLUTION notebook MUST BE REPRODUCIBLE, i.e. if the reviewer decides to execute Kernel -> Restart Kernel and Run All Cells, after all the computation they will obtain exactly the same solution (with all the corresponding plots) as in your uploaded notebook. For this purpose, we suggest fixing the random seed or (better) setting random_state= inside every algorithm that uses pseudorandomness.
Your code must be clear to the reviewer. For this purpose, try to include necessary comments inside the code. But remember: GOOD CODE MUST BE SELF-EXPLANATORY without any additional comments.
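For example, a minimal reproducible pattern (purely illustrative values; any fixed seed works):
import numpy as np
from sklearn.model_selection import train_test_split
np.random.seed(42)  # fix the global NumPy seed
# every algorithm that uses pseudorandomness also gets an explicit random_state
data = np.arange(100)
train, test = train_test_split(data, test_size=0.25, random_state=42)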
$ your latex equation here $
$$ your latex equation here $$
\begin{align}
left-hand-side
&= right-hand-side on line 1
\\
&= right-hand-side on line 2
\\
&= right-hand-side on the last line
\end{align}
The ampersand (&) aligns the equations horizontally and the double backslash (\\) creates a new line.
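For instance, a toy derivation typeset in this format:
\begin{align}
(a + b)^2
&= (a + b)(a + b)
\\
&= a^2 + 2ab + b^2
\end{align}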
Write your theoretical derivations within such blocks:
**BEGIN Solution**
<!-- >>> your derivation here <<< -->
**END Solution**
Please, write your implementation within the designated blocks:
...
### BEGIN Solution
# >>> your solution here <<<
### END Solution
...
In [2]:
import numpy as np
import pandas as pd
import torch
%matplotlib inline
import matplotlib.pyplot as plt
Consider a univariate Gaussian distribution $\mathcal{N}(x; \mu, \tau^{-1})$. Let's define a Gaussian-Gamma prior for the parameters $(\mu, \tau)$:
\begin{equation} p(\mu, \tau) = \mathcal{N}(\mu; \mu_0, (\beta \tau)^{-1}) \otimes \text{Gamma}(\tau; a, b) \,. \end{equation}
Find the posterior distribution of $(\mu, \tau)$ after observing $X = (x_1, \dots, x_n)$.
BEGIN Solution
$$ p(\mu, \tau \mid X) \propto p(\mu, \tau)\, \mathbb{P}(X \mid \mu, \tau) $$
Since the samples are i.i.d.:
$$ \mathbb{P}(X \mid \mu, \tau) = \prod_{i=1}^n \mathbb{P}(x_i \mid \mu, \tau) $$
Each sample is Gaussian with mean $\mu$ and precision $\tau$, so
$$ \mathbb{P}(X \mid \mu, \tau) = \prod_{i=1}^n \frac{\tau^{\frac{1}{2}}}{\sqrt{2 \pi}} \exp{ \Big[-\frac{\tau (x_i - \mu)^2 }{2}\Big]} = \frac{\tau^{\frac{n}{2}}}{(2 \pi)^{\frac{n}{2}}} \exp{ \Big[-\frac{\tau}{2} \sum_{i=1}^n(x_i - \mu)^2 \Big]} $$
The prior density is
$$ p(\mu, \tau) = \mathcal{N}(\mu; \mu_0, (\beta \tau)^{-1}) \otimes \text{Gamma}(\tau; a, b) = \frac{b^a \beta^{\frac{1}{2}}}{(2\pi)^{\frac{1}{2}}\Gamma(a)} \tau^{a-\frac{1}{2}} e^{-b\tau} \exp{\Big( - \frac{\beta \tau}{2} (\mu - \mu_0)^2 \Big)} $$
Hence
$$ p(\mu, \tau \mid X) \propto \tau^{\frac{n}{2} + a - \frac{1}{2}} e^{-b\tau} \exp{\Big( - \frac{\tau}{2} \Big[ \beta(\mu - \mu_0)^2 + \sum_{i=1}^{n} (x_i - \mu)^2 \Big] \Big)} $$
Using
$$ \sum_{i=1}^n (x_i - \mu)^2 = ns + n(\overline{x} - \mu)^2, \qquad \overline{x} = \frac{1}{n} \sum_{i=1}^n x_i, \qquad s=\frac{1}{n} \sum_{i=1}^n (x_i - \overline{x})^2, $$
the exponential factors combine as
$$ \exp{\Big[ -\frac{\tau}{2} \Big( \beta (\mu - \mu_0)^2 + ns + n(\overline{x} - \mu)^2 \Big) \Big]} \exp{(-b\tau)} = \exp{\Big[ -\tau \Big( \frac{1}{2} ns + b \Big) \Big]} \exp{\Big[ - \frac{\tau}{2} \Big( \beta(\mu - \mu_0)^2 + n (\overline{x} - \mu)^2 \Big) \Big]} $$
After completing the square in $\mu$:
$$ \beta (\mu - \mu_0)^2 + n(\overline{x} - \mu)^2 = (\beta + n) \Big( \mu - \frac{\beta \mu_0 + n \overline{x}}{\beta + n} \Big)^2 + \frac{\beta n ( \overline{x} - \mu_0)^2}{\beta + n} $$
Therefore
$$ p(\mu, \tau \mid X) \propto \tau^{\frac{n}{2} + a - \frac{1}{2}} \exp{\Big[ -\tau \Big( \frac{1}{2} ns + b \Big) \Big]} \exp{\Big[ - \frac{\tau}{2} \Big[ (\beta + n) \Big( \mu - \frac{\beta \mu_0 + n \overline{x}}{\beta + n} \Big)^2 + \frac{\beta n ( \overline{x} - \mu_0)^2}{\beta + n} \Big] \Big]} $$
Regrouping once more:
$$ p(\mu, \tau \mid X) \propto \tau^{\frac{n}{2} + a - \frac{1}{2}} \exp{\Big[ -\tau \Big[ \frac{1}{2} ns + b + \frac{\beta n ( \overline{x} - \mu_0)^2}{2(\beta + n)} \Big] \Big]} \exp{\Big[ - \frac{\tau}{2} (\beta + n) \Big( \mu - \frac{\beta \mu_0 + n \overline{x}}{\beta + n} \Big)^2 \Big]} $$
Finally:
$$ \boxed{ p(\mu, \tau \mid X) \propto \text{Gamma}\Big(\tau;\ \frac{n}{2} + a,\ \frac{1}{2} ns + b + \frac{\beta n ( \overline{x} - \mu_0)^2}{2(\beta + n)} \Big) \otimes \mathcal{N}\Big( \mu;\ \frac{\beta \mu_0 + n \overline{x}}{\beta + n},\ \big(\tau(\beta + n)\big)^{-1}\Big) } $$
END Solution
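As an illustrative sanity check (not required by the assignment), the boxed hyperparameter updates can be evaluated on simulated data; the hyperparameter values below are arbitrary examples:
import numpy as np
np.random.seed(0)
mu0, beta, a, b = 0.0, 1.0, 2.0, 2.0                 # prior hyperparameters (example values)
X = np.random.normal(1.5, 2.0 ** -0.5, size=100)     # simulated sample with mu = 1.5, tau = 2
n, x_bar = len(X), X.mean()
s = np.mean((X - x_bar) ** 2)
# posterior hyperparameters from the boxed formula
mu_n = (beta * mu0 + n * x_bar) / (beta + n)
beta_n = beta + n
a_n = a + n / 2
b_n = b + 0.5 * n * s + beta * n * (x_bar - mu0) ** 2 / (2 * (beta + n))
print(mu_n, beta_n, a_n, b_n)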
Evaluate the following integral using the Laplace approximation: \begin{equation} x \mapsto \int \sigma(w^T x) \mathcal{N}(w; 0, \Sigma) dw \,, \end{equation} for $x = \bigl(\tfrac23, \tfrac16, \tfrac16\bigr)\in \mathbb{R}^3$ and \begin{equation} \Sigma = \begin{pmatrix} 1 & -0.25 & 0.75 \\ -0.25 & 1 & 0.5 \\ 0.75 & 0.5 & 2 \end{pmatrix} \,. \end{equation}
Use the Hessian matrix computed numerically via finite differences (check out Numdifftools).
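The computation below relies on the standard Laplace approximation: writing the integrand as $f(w) = \sigma(w^T x)\, \mathcal{N}(w; 0, \Sigma)$ and denoting its mode by $w_0$,
$$ \int f(w)\, dw \;\approx\; f(w_0)\, \frac{(2\pi)^{3/2}}{\sqrt{\det A}}, \qquad A = - \nabla \nabla \ln f(w) \big|_{w = w_0}. $$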
In [17]:
import numdifftools as nd
from scipy.optimize import minimize
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
In [82]:
### BEGIN Solution
def p_star_w(w):
    # unnormalized integrand: sigmoid(w^T x) * N(w; 0, Sigma)
    x = np.array([2/3, 1/6, 1/6], dtype=np.float64)
    E = np.array([[1, -0.25, 0.75], [-0.25, 1, 0.5], [0.75, 0.5, 2]], dtype=np.float64)
    left_part = 1 / (1 + np.exp(-w.T @ x))
    right_part = multivariate_normal(mean=[0, 0, 0], cov=E).pdf(w)
    return left_part * right_part

def log_star_w(w):
    # negative log of the integrand (minimized to find the mode)
    return -np.log(p_star_w(w))

# mode of the integrand
w_0 = minimize(log_star_w, np.array([1, 2, 1], dtype=np.float64)).x

# Hessian of -log p*(w) at the mode, computed numerically with numdifftools
Hessian = nd.Hessian(log_star_w)
A = Hessian(w_0)

# Laplace approximation: Z ~= p*(w_0) * sqrt((2*pi)^d / det(A)), here d = 3
Z_p = p_star_w(w_0) * np.sqrt((2 * np.pi)**3 / np.linalg.det(A))
print("The value of the integral:", Z_p)
### END Solution
In [31]:
import torch
from torch.autograd import Variable, grad
In [108]:
### BEGIN Solution
def pt_p_star_w(w):
    # same integrand as above, written with torch so that autograd can differentiate it
    x = np.array([2/3, 1/6, 1/6], dtype=np.float64)
    E = np.array([[1, -0.25, 0.75], [-0.25, 1, 0.5], [0.75, 0.5, 2]], dtype=np.float64)
    left_part = torch.sigmoid(torch.dot(w, Variable(torch.from_numpy(x).type(torch.FloatTensor))))
    # multivariate normal pdf N(w; 0, E) written explicitly
    right_part = 1 / ((2 * np.pi)**(3/2) * np.linalg.det(E)**(1/2)) * \
        torch.exp(-0.5 * w @ Variable(torch.from_numpy(np.linalg.inv(E)).type(torch.FloatTensor)) @ w)
    return left_part * right_part

def pt_log_star_w(w):
    return -torch.log(pt_p_star_w(w))

def hessian_diag(func, w):
    w = Variable(torch.FloatTensor(w), requires_grad=True)
    grad_params = torch.autograd.grad(func(w), w, create_graph=True)
    # build the Hessian row by row by differentiating each component of the gradient
    hessian = [torch.autograd.grad(grad_params[0][i], w, create_graph=True)[0].data.numpy()
               for i in range(3)]
    # keep only the diagonal entries, off-diagonal terms are discarded
    return np.diagonal(hessian) * np.eye(3)

# Laplace approximation with the diagonal Hessian approximation
A = hessian_diag(pt_log_star_w, w_0)
pt_Z_p = (np.sqrt((2 * np.pi)**3 / np.linalg.det(A)) *
          pt_p_star_w(Variable(torch.from_numpy(w_0).type(torch.FloatTensor)))).data.numpy()
print('Integral value is', pt_Z_p)
### END Solution
In [112]:
from scipy.integrate import tplquad
### BEGIN Solution
def p_star_w_adapter(x, y, z):
    # tplquad passes scalar coordinates; pack them into a vector for p_star_w
    return p_star_w(np.array([x, y, z]))

# reference value: brute-force integration over a large box
acc_Z_p = tplquad(p_star_w_adapter, -10, 10, -10, 10, -10, 10)

print("Abs. error of Laplace (full Hessian): %.05f" % abs(acc_Z_p[0] - Z_p))
print("Abs. error of diag. Hessian approx.:  %.05f" % abs(acc_Z_p[0] - pt_Z_p))
### END Solution
BEGIN Solution
The second error is much larger because the diagonal Hessian approximation discards all off-diagonal entries of the Hessian, which distorts the determinant used in the Laplace formula.
END Solution
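A quick way to see this effect is to compare the full and the diagonal Hessian directly (a sketch; it assumes torch.autograd.functional.hessian is available in the installed PyTorch version):
import torch
# full Hessian of -log p*(w) at the mode, via autograd
w0_t = torch.tensor(w_0, dtype=torch.float32)
H_full = torch.autograd.functional.hessian(pt_log_star_w, w0_t).numpy()
H_diag = np.diag(np.diag(H_full))        # what the diagonal approximation keeps
print("det(full Hessian):", np.linalg.det(H_full))
print("det(diag. Hessian):", np.linalg.det(H_diag))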
Assuming the matrices $A \in \mathbb{R}^{n \times n}$ and $D \in \mathbb{R}^{d \times d}$ are invertible, using Gaussian elimination find the inverse matrix for the following block matrix: \begin{equation} \begin{pmatrix} A & B \\ C & D \end{pmatrix} \,, \end{equation} where $C \in \mathbb{R}^{d \times n}$ and $B \in \mathbb{R}^{n \times d}$.
BEGIN Solution
$$ \Bigg( \begin{array}{cc|cc} A & B & I_n & 0\\ C & D & 0 & I_d\\ \end{array} \Bigg) \sim \Bigg( \begin{array}{cc|cc} I_n & A^{-1} B & A^{-1} & 0\\ C & D & 0 & I_d\\ \end{array} \Bigg) \sim \Bigg( \begin{array}{cc|cc} I_n & A^{-1} B & A^{-1} & 0\\ 0 & D - C A^{-1} B & - C A^{-1} & I_d\\ \end{array} \Bigg) \sim $$
$$ \sim \Bigg( \begin{array}{cc|cc} I_n & A^{-1} B & A^{-1} & 0\\ 0 & I_d & - (D - C A^{-1} B)^{-1} C A^{-1} & (D - C A^{-1} B)^{-1}\\ \end{array} \Bigg) \sim \Bigg( \begin{array}{cc|cc} I_n & 0 & A^{-1} + A^{-1} B (D - C A^{-1} B)^{-1} C A^{-1} & - A^{-1} B (D - C A^{-1} B)^{-1} \\ 0 & I_d & - (D - C A^{-1} B)^{-1} C A^{-1} & (D - C A^{-1} B)^{-1}\\ \end{array} \Bigg) $$
Finally,
$$ \begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} + A^{-1} B (D - C A^{-1} B)^{-1} C A^{-1} & - A^{-1} B (D - C A^{-1} B)^{-1} \\ - (D - C A^{-1} B)^{-1} C A^{-1} & (D - C A^{-1} B)^{-1} \end{pmatrix} $$
END Solution
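A numerical sanity check of the formula on random, well-conditioned blocks (purely illustrative, not part of the graded solution):
import numpy as np
rng = np.random.RandomState(0)
n, d = 3, 2
A = rng.randn(n, n) + n * np.eye(n)     # shift the diagonal to keep the blocks well conditioned
D = rng.randn(d, d) + d * np.eye(d)
B, C = rng.randn(n, d), rng.randn(d, n)
Ainv = np.linalg.inv(A)
S = np.linalg.inv(D - C @ Ainv @ B)     # inverse of the Schur complement of A
M_inv = np.block([[Ainv + Ainv @ B @ S @ C @ Ainv, -Ainv @ B @ S],
                  [-S @ C @ Ainv, S]])
print(np.allclose(np.block([[A, B], [C, D]]) @ M_inv, np.eye(n + d)))   # expect True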
Assume that the function $y(x)$, $x \in \mathbb{R}^d$, is a realization of the Gaussian Process $GP\bigl(0; K(\cdot, \cdot)\bigr)$ with $K(a, b) = \exp\bigl(-\gamma \|a - b\|_2^2\bigr)$.
Suppose two datasets were observed: noiseless ${D_0}$ and noisy ${D_1}$ \begin{align} & D_0 = \bigl(x_i, y(x_i) \bigr)_{i=1}^{n} \,, \\ & D_1 = \bigl(x^\prime_i, y(x^\prime_i) + \varepsilon_i \bigr)_{i=1}^{m} \,, \end{align} where $\varepsilon_i \sim \text{ iid } \mathcal{N}(0, \sigma^2)$, independent of the process $y$.
Derive the conditional distribution of $y(x) \big\vert_{D_0, D_1}$ at a new $x$.
BEGIN Solution
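A sketch of the standard Gaussian conditioning argument (using the joint Gaussianity of the process values and the noisy observations):

Stack the observed targets into $t = \bigl(y(x_1), \dots, y(x_n),\, y(x^\prime_1) + \varepsilon_1, \dots, y(x^\prime_m) + \varepsilon_m\bigr)^T$. Since $y$ is a zero-mean GP and the noise is independent Gaussian, the vector $\bigl(y(x), t\bigr)$ is jointly Gaussian with zero mean and covariance
$$ \begin{pmatrix} K(x, x) & k(x)^T \\ k(x) & C \end{pmatrix}, \qquad k(x) = \begin{pmatrix} K(x, x_i) \\ K(x, x^\prime_j) \end{pmatrix}_{i \le n,\ j \le m}, \qquad C = \begin{pmatrix} K_{00} & K_{01} \\ K_{01}^T & K_{11} + \sigma^2 I_m \end{pmatrix}, $$
where $(K_{00})_{ij} = K(x_i, x_j)$, $(K_{01})_{ij} = K(x_i, x^\prime_j)$, $(K_{11})_{ij} = K(x^\prime_i, x^\prime_j)$. The formula for conditioning a Gaussian vector then gives
$$ y(x) \,\big\vert\, D_0, D_1 \;\sim\; \mathcal{N}\Bigl( k(x)^T C^{-1} t,\; K(x, x) - k(x)^T C^{-1} k(x) \Bigr), $$
with $K(x, x) = 1$ for the squared-exponential kernel above.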
END Solution
In the late 1950’s Charles Keeling invented an accurate way to measure atmospheric $CO_2$ concentration and began taking regular measurements at the Mauna Loa observatory.
Take the monthly_co2_mlo.csv file, load it, and prepare the data.
CO2 [ppm] time series
In [196]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
### BEGIN Solution
df = pd.read_csv('data/monthly_co2_mlo.csv')
# missing values are encoded as -99.99 in the raw file
df = df.replace(-99.99, np.nan).dropna()

y = df['CO2 [ppm]']
X = df.drop(['CO2 [ppm]'], axis=1)
X['year'] -= 1958          # count years from the start of the observations

# keep the chronological order: the last 25% of the series becomes the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, shuffle=False, test_size=0.25)
X.head(10)
### END Solution
Out[196]:
In [197]:
scaler = StandardScaler()
y_test_min = np.min(y_train.values)
y_test_abs = np.max(y_train.values) - np.min(y_train.values)
# scale the target using the statistics of the training part only
y_train_scaled = scaler.fit_transform(y_train.values.reshape(-1, 1))
y_test_scaled = scaler.transform(y_test.values.reshape(-1, 1))

plt.figure(figsize=(14, 5))
plt.plot(X_train['year'], y_train_scaled, label='train')
plt.plot(X_test['year'], y_test_scaled, label='test')
# vertical line at the train/test boundary (the last 25% of the series is the test set)
plt.axvline(x=0.75 * np.max([np.max(X_train['year'].values), np.max(X_test['year'].values)]), c='black', ls='-')
plt.grid()
plt.ylabel(r'${CO}_2$ (scaled)', size=18)
plt.xlabel('Years since 1958', size=18)
plt.title('Train and test split', size=18)
plt.legend()
plt.show()
Use GPy library for training and prediction. Fit a GP and run the predict on the test. Useful kernels to combine: GPy.kern.RBF, GPy.kern.Poly, GPy.kern.StdPeriodic, GPy.kern.White, GPy.kern.Linear.
Measure the quality with r2_score. An R2-score > 0.83 on the test sample is accepted.
In [198]:
from GPy.models import GPRegression
from GPy.kern import RBF, Poly, StdPeriodic, White, Linear
from sklearn.metrics import r2_score
In [199]:
### BEGIN Solution
# sum of kernels: smooth trend (RBF + Poly + Linear), seasonality (StdPeriodic) and noise (White)
kernels = RBF(input_dim=1, variance=1., lengthscale=10.) + \
          Poly(input_dim=1) + \
          StdPeriodic(input_dim=1) + \
          White(input_dim=1) + \
          Linear(input_dim=1)

gpr = GPRegression(X_train['year'].values.reshape(-1, 1), y_train_scaled, kernels)
gpr.optimize()   # fit the kernel hyperparameters by maximizing the marginal likelihood

gpr.plot(figsize=(13, 4))
plt.show()
### END Solution
In [201]:
predicted = gpr.predict(X_test['year'].values.reshape(-1, 1))   # (mean, variance)

plt.figure(figsize=(13, 4))
plt.plot(scaler.inverse_transform(y_test_scaled), scaler.inverse_transform(y_test_scaled), label='x = y', c='r')
plt.scatter(scaler.inverse_transform(predicted[0]), scaler.inverse_transform(y_test_scaled), label='predictions')
plt.title("Predicted vs. true values", size=16)
plt.xlabel("Predicted values", size=16)
plt.ylabel("True values", size=16)
plt.legend()
plt.show()
In [195]:
r2_score(y_test_scaled, predicted[0])
Out[195]: