Home Assignment No. 2: Part 1 (Theory)

In this part of the homework you are to solve several theoretical problems related to machine learning algorithms.

  • For every separate problem you can get either 0 points or the maximal score for that problem. There are NO INTERMEDIATE scores.
  • Your solution must be COMPLETE, i.e. contain all required formulas/proofs/detailed explanations.
  • You must write your solution for any problem right after the words BEGIN SOLUTION. Attaching pictures of your handwriting is allowed, but highly discouraged.
  • There are problems marked with *; they are not obligatory. You can get EXTRA POINTS for solving them.

$\LaTeX$ in Jupyter

Jupyter has constantly improving $\LaTeX$ support. Below are the basic methods to write neat, tidy, and well typeset equations in your notebooks:
  • to write an inline equation use
    $ your LaTeX equation here $
    
  • to write an equation, that is displayed on a separate line use
    $$ your LaTeX equation here $$
    
  • to write a block of equations use
    \begin{align}
      left-hand-side
          &= right-hand-side on line 1
          \\
          &= right-hand-side on line 2
          \\
          &= right-hand-side on the last line
    \end{align}
    
    The ampersand (&) aligns the equations horizontally and the double backslash (\\) creates a new line.

Write your theoretical derivations within such blocks:

**BEGIN Solution**

<!-- >>> your derivation here <<< -->

**END Solution**


Model and feature selection

Task 1 (1 pt.): Information criteria

Assume that the regression model is $$y = \sum_{i=1}^k \beta_i x_i + \varepsilon,$$ where $\varepsilon$ is distributed normally: $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, and $\sigma^2$ is known.

Prove that the model with the highest Akaike information criterion is the model with the smallest Mallows' $C_p$.

BEGIN Solution

Let $\hat{R}_{tr}$ be the training error and $R$ the in-sample generalization error: $$\hat{R}_{tr} (J) = \frac{1}{m} \sum_{i=1}^{m} \Big( \hat{y}_i (J) - y_i \Big)^2$$ $$R (J) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E} \Big( \hat{y}_i (J) - y_i^* \Big)^2 $$ where $J \subseteq \{1, ..., d\}$ is the subset of selected features of $x$ used to construct the linear model, and $y_i^*$ are newly obtained output measurements (with newly generated noise values $\varepsilon_i^*$).

A theorem from the lecture states that: $$ \mbox{bias}( \hat{R}_{tr}(J)) = \mathbb{E}(\hat{R}_{tr} (J)) - R(J) = - \frac{2}{m} \sum_{i=1}^{m} \mbox{Cov} ( \hat{y}_i, y_i) $$ So, correcting for this bias, the generalization error can be estimated as: $$ R(J) \approx \hat{R}_{tr}(J) + \frac{2}{m} \sum_{i=1}^{m} \mbox{Cov} (\hat{y}_i, y_i) $$

It was also stated in the lecture that for a linear regression model: $$ 2 \sum_{i=1}^{m} \mbox{Cov} (\hat{y}_i, y_i) = 2 | J | \sigma^2 $$ Hence an unbiased estimate of the regression risk is $$\hat{R}(J) = \hat{R}_{tr}(J) + \frac{2\sigma^2}{m} |J|, $$ which is the $C_p$-type risk estimate (Mallows' $C_p$ up to a monotone affine rescaling, so both rank models identically). Here $\sigma^2$ is known; otherwise we could use an estimate $\hat{\sigma}^2$ based on the residuals on the training set.

Then we select some subset of features $J$ and minimize $\hat{R}(J)$: $$ \hat{R}(J) \to \min_{w_J, J} $$

The AIC (Akaike Information Criterion) has the form ($\mathcal{L}_J$ is the log-likelihood of the model): $$ \mathcal{L}_J - |J| \to \max_{w_J, J}$$

In our case: $$ \mathcal{L}_J (w) = m \log \frac{1}{\sqrt{2 \pi}\sigma} - \frac{1}{2\sigma^2} \sum^{m}_{i=1} \Big (y_i - \beta_J x_{i,J} \Big)^2$$ So AIC becomes: $$ \mathcal{L}_J (w) - |J| = - \frac{m}{2\sigma^2} \hat{R}_{tr} (J) - |J| + C \to \max_{w_J, J} $$

Both criteria are equivalent in the case of linear regression: multiplying the AIC by $-\frac{2\sigma^2}{m}$ gives $$ - \frac{2\sigma^2}{m} \Big(\mathcal{L}_J - |J| \Big) = \hat{R}_{tr}(J) + \frac{2\sigma^2}{m} |J| + \mbox{const} = \hat{R}(J) + \mbox{const}, $$ i.e. the rescaled and negated AIC equals the $C_p$-type risk estimate up to an additive constant. Since the multiplier $-\frac{2\sigma^2}{m}$ is negative, maximizing the AIC is the same as minimizing $\hat{R}(J)$, i.e. minimizing Mallows' $C_p$, q.e.d.

NOTE: the purpose of the multiplier is to remove the difference in sign and scale and to show that both criteria rank models identically.
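As a sanity check of the equivalence (illustrative only, not part of the proof), the following NumPy sketch enumerates feature subsets of a synthetic linear model with known $\sigma^2$ and confirms that the subset maximizing the AIC-style criterion is the one minimizing $\hat{R}(J)$; all data, names, and constants in it are made up.

```python
# Sanity check, not part of the proof: with known sigma^2 the subset that
# maximizes the AIC-style criterion also minimizes \hat{R}(J).
# All data and names here are illustrative.
import itertools
import numpy as np

rng = np.random.default_rng(0)
m, d, sigma = 200, 5, 1.0
X = rng.normal(size=(m, d))
beta = np.array([1.5, -2.0, 0.0, 0.0, 0.5])       # true model uses features 0, 1, 4
y = X @ beta + sigma * rng.normal(size=m)

scores = {}
for k in range(1, d + 1):
    for J in itertools.combinations(range(d), k):
        XJ = X[:, list(J)]
        beta_hat, *_ = np.linalg.lstsq(XJ, y, rcond=None)
        r_tr = np.mean((XJ @ beta_hat - y) ** 2)           # \hat{R}_tr(J)
        cp = r_tr + 2 * sigma**2 * len(J) / m              # \hat{R}(J), the C_p form
        aic = -m / (2 * sigma**2) * r_tr - len(J)          # log-likelihood minus |J|, up to a constant
        scores[J] = (cp, aic)

best_cp = min(scores, key=lambda J: scores[J][0])
best_aic = max(scores, key=lambda J: scores[J][1])
print(best_cp, best_aic)                                   # both criteria select the same subset
```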

END Solution


Boosting: gradient boosting, AdaBoost

Boosting and its theory

Minimization of the loss function is an optimization task, and "Gradient Boosting" is one of the many methods to perform it. It should be noted that gradient boosting uses a greedy approach and therefore, like other greedy algorithms, may produce results that are not globally optimal.

$$ \begin{aligned} & b_n(x) := \text{the best base model from the family of the algorithms $\mathcal{A}$} \\ & \gamma_n(x) := \text{scale or weight of the new model} \\ & a_N(x) = \sum_{n=0}^N \gamma_n b_n(x) := \text{the final composite model} \end{aligned} $$

Gradient Boosting Algorithm

Consider a loss function $L(y, z)$ for target $y$ and prediction $z$, and let $(x_i, y_i)_{i=1}^l$ be our training dataset for a regression task.

  1. Initialize $a_0(x) = \hat{z}$ with the flat constant prediction $\hat{z} = \arg\min\limits_{z \in \mathbb{R}} \sum_{i=1}^l L(y_i, z)$;
  2. For $n$ from 1 to n_boost_steps do:
    • Solve the current subproblem $G_n(b_n, \gamma_n) \to \min\limits_{b_{n}, \gamma_n}$ where $$ G_n(b, \gamma) = \sum_{i=1}^l L\bigl(y_i, a_{n-1}(x_i) + \gamma b(x_i)\bigr) $$ with the following method:
      \begin{align}
      & s_i = - \frac{\partial}{\partial z} L(y_i, z) \Big\vert_{z=a_{n-1}(x_i)}
      \\
      & b_n(x) = \arg\min\limits_{b\in\mathcal{A}}\sum_{i=1}^l \bigl(b(x_i) - s_i\bigr)^2
      \\
      & \gamma_n = \arg\min_{\gamma} G_n(b_n, \gamma)
      \\
      & a_n(x) = a_{n-1}(x) + \gamma_n b_n(x)
      \end{align}
  3. return $a_N(x) = a_0(x) + \sum_{n=1}^N \gamma_n b_n(x)$
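To make the pseudocode above concrete, below is a minimal sketch (our own illustration, not a reference implementation) of the loop for the MSE loss $L(y, z) = \tfrac12 (y - z)^2$, using depth-1 regression trees from scikit-learn as the base family $\mathcal{A}$; the helper names `fit_gradient_boosting` and `predict_boosting` are ours.

```python
# A minimal sketch of the boosting loop above for L(y, z) = 1/2 (y - z)^2,
# with depth-1 regression trees as the base family A.
# Helper names (fit_gradient_boosting, predict_boosting) are our own.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_boost_steps=50):
    a0 = y.mean()                                   # flat constant prediction minimizing MSE
    pred = np.full(len(y), a0, dtype=float)
    models, gammas = [], []
    for _ in range(n_boost_steps):
        s = y - pred                                # s_i = -dL/dz at z = a_{n-1}(x_i) for MSE
        b = DecisionTreeRegressor(max_depth=1).fit(X, s)
        b_pred = b.predict(X)
        # closed-form step size for MSE (see Task 2 below); epsilon guards a degenerate stump
        gamma = (s @ b_pred) / (b_pred @ b_pred + 1e-12)
        pred += gamma * b_pred
        models.append(b)
        gammas.append(gamma)
    return a0, models, gammas

def predict_boosting(X, a0, models, gammas):
    return a0 + sum(g * b.predict(X) for g, b in zip(gammas, models))
```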


Task 2 (1 pt.)

At the $n$-th step of Gradient Boosting ($n \geq 1$) we compute the "residuals" $(s_i)_{i=1}^l$ and learn $x\mapsto b_n(x)$ with a regression algorithm $\mathcal{A}$ applied to the dataset $(x_i, s_i)_{i=1}^l$. For the next two tasks assume that we have already performed these substeps.

Derive the optimal value of $\gamma_n$ for MSE loss function $L(y, z) = \tfrac12 (y - z)^2$.

BEGIN Solution

$$ G_n (b_n, \gamma_n ) = \sum_{i=1}^{l} L \Big(y_i, a_{n-1}(x_i) + \gamma_n b_n (x_i) \Big) $$

$$ \frac{\partial G_n(b_n, \gamma_n)}{\partial \gamma_n} = \sum_{i=1}^{l} \frac{\partial L}{\partial z} \frac{\partial z}{\partial \gamma_n}, \qquad \frac{\partial L}{\partial z} = - (y - z), \quad \frac{\partial z}{\partial \gamma_n} = b_n(x_i), \quad \mbox{where } z = a_{n-1}(x_i) + \gamma_n b_n(x_i) $$

$$ \frac{\partial G_n}{\partial \gamma_n} = - \sum_{i=1}^{l}\Big( y_i - a_{n-1} (x_i) - \gamma_n b_n(x_i) \Big) b_n(x_i) = 0 $$

$$ - \sum_{i=1}^{l} \Big(y_i - a_{n-1} (x_i) \Big)b_n(x_i) + \sum_{i=1}^{l} \gamma_n b_n^2(x_i) = 0 $$

$$ \boxed {\gamma_n = \frac{\sum_{i=1}^{l} \Big(y_i - a_{n-1} (x_i) \Big)b_n(x_i)}{ \sum_{i=1}^{l} b_n^2(x_i) }} $$
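A quick numerical sanity check (illustrative only, with random synthetic values) that the boxed $\gamma_n$ indeed minimizes $G_n(b_n, \gamma)$ for the MSE loss:

```python
# Quick numerical sanity check (illustrative, random synthetic values) that the
# boxed gamma_n minimizes G_n(b_n, gamma) for the MSE loss.
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=100)            # targets y_i
a_prev = rng.normal(size=100)       # current ensemble predictions a_{n-1}(x_i)
b_pred = rng.normal(size=100)       # base model predictions b_n(x_i)

def G(gamma):
    return 0.5 * np.sum((y - (a_prev + gamma * b_pred)) ** 2)

gamma_star = ((y - a_prev) @ b_pred) / (b_pred @ b_pred)      # the boxed formula
grid = np.linspace(gamma_star - 1.0, gamma_star + 1.0, 2001)
assert all(G(gamma_star) <= G(g) + 1e-9 for g in grid)        # no grid point does better
```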

END Solution


Task 3 (1+1+1 pt.)

Let $S = (x_i, y_i)_{i=1}^l$ be a sample for a classification task $y_i \in \{-1, +1\}$.

The AdaBoost algorithm can be regarded as a gradient boosting algorithm with the exponential loss $L(y,z) = e^{-y z}$. Consider the state of AdaBoost at the $(T-1)$-th step $$ G_{T-1}(b_T, \gamma_T) = \sum_{i=1}^l L\bigl(y_i, a_{T-1}(x_i) + \gamma_T b_T(x_i)\bigr) = \sum_{i=1}^l \underbrace{\exp(- y_i a_{T-1}(x_i))}_{w^{T-1}_i} \exp(- y_i \gamma_T b_T(x_i)) \,. $$

Task 3.1 (1 pt.)

Derive the next weights $w^T_i$ used in $G_T$ at the stage $T$ as a function of the learnt classifier $b_T$ and the current weights $w^{T-1}_i$;

BEGIN Solution

We know that: $$ a_T (x_i) = a_{T-1} (x_i) + \gamma_T b_T(x_i) $$

Taking into account that $w_i^T = \exp(-y_i a_{T} (x_i))$, we will get:

$$ a_T(x_i) = - \frac{\log(w_i^{T-1})}{y_i} + \gamma_T b_T (x_i) $$

Therefore:

$$ w^T_i = w_i^{T-1} \exp( -y_i \gamma_T b_T (x_i)) $$

From lecture 8 we know:

$$ \gamma_T = \frac{1}{2} \log \frac{1-N_T}{N_T}; \, N_T = \sum_{i=1}^l \tilde{w}^T_i \mathbb{1}_{\{ y_i b_T(x_i) \leq 0 \}}; \tilde{w}^T_i = \frac{w^{T-1}_i}{\sum_{j} w^{T-1}_j} $$

Finally:

$$ \boxed{ w_i^{T} = w_i^{T-1} \exp \Big(-\frac{1}{2} y_i b_T(x_i) \log \Big( \frac{1 - \sum_{j=1}^l \tilde{w}^T_j \mathbb{1}_{\{ y_j b_T(x_j) \leq 0\}}}{\sum_{j=1}^l \tilde{w}^T_j \mathbb{1}_{\{ y_j b_T(x_j) \leq 0\}}}\Big)\Big) } $$
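The same update as a short NumPy sketch (ours, for illustration; not the full AdaBoost loop):

```python
# The derived update as a short NumPy sketch (ours, for illustration only;
# not the full AdaBoost loop). y and b_pred are arrays with values in {-1, +1}.
import numpy as np

def adaboost_weight_update(w_prev, y, b_pred):
    w_norm = w_prev / w_prev.sum()                  # normalized weights \tilde{w}_i^T
    N_T = w_norm[y * b_pred <= 0].sum()             # weighted error of b_T
    gamma_T = 0.5 * np.log((1 - N_T) / N_T)         # step size from the lecture
    return w_prev * np.exp(-y * gamma_T * b_pred)   # w_i^T = w_i^{T-1} exp(-y_i gamma_T b_T(x_i))
```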

END Solution



Task 3.2 (1 pt.)

Compute the ratio of weights $(w^T_i)_{i=1}^l$ between the normal and outlier samples in $S$. Propose a formal definition of being an outlier, and reflect on the value of margin for both.

**HINT**: margin value.

BEGIN Solution

Outliers are extreme observations that lie far away from the rest of the sample. To derive a formal rule for detecting them, let's introduce the margin of a sample (here $\hat{y}_i = a_T(x_i)$ is the real-valued output of the ensemble):

$$ \rho_i = y_i \, \hat{y}_i = y_i \, a_T(x_i) $$

In the case of binary classification with $y_i \in \{-1, +1\}$ we call a point an outlier when its margin is negative (the ensemble confidently misclassifies it):

$$ \rho_i < 0 $$

So now let's derive the ratio of weights $(w^T_i)_{i=1}^l$ between an outlier sample $o$ and a normal sample $n$:

$$ \frac{w_o^{T}}{w_n^{T}} = \frac{\exp(-y_o a_T(x_o))}{\exp(-y_n a_T(x_n))} = \frac{\exp\Big(-y_o \sum_j \gamma_j b_j(x_o)\Big)}{\exp\Big(-y_n \sum_j \gamma_j b_j(x_n)\Big)} = \boxed{ \exp \big( \rho_n - \rho_o \big) } $$

Since the normal sample has a positive margin $\rho_n > 0$ and the outlier a negative margin $\rho_o < 0$, this ratio is exponentially large: the larger the margin gap, the more weight the outlier carries relative to the normal sample.
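For example (made-up numbers), with margins $\rho_n = 2$ and $\rho_o = -2$ the outlier receives roughly $e^{4} \approx 55$ times more weight:

```python
# Illustration with made-up margins: the weight ratio grows exponentially
# with the gap between the normal sample's and the outlier's margins.
import numpy as np

rho_normal, rho_outlier = 2.0, -2.0            # rho_i = y_i * a_T(x_i)
ratio = np.exp(rho_normal - rho_outlier)       # w_outlier / w_normal
print(ratio)                                   # ~54.6
```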

END Solution



Task 3.3 (1 pt.)

Conclude about sensitivity of Adaboost to outliers.

BEGIN Solution

Because of the exponential loss, the weight of a sample grows as $\exp(-\rho_i)$, i.e., exponentially in the magnitude of a negative margin. As shown in Task 3.2, outliers therefore receive exponentially larger weights than normal samples, and subsequent base classifiers are forced to concentrate on them. Hence AdaBoost is highly sensitive to outliers (and to label noise).

END Solution



Task 4 (2+1+2 pt.): Alternative objective functions.

This problem studies boosting-type algorithms defined with objective functions different from that of AdaBoost. We assume that the training data are given as $m$ labeled examples $(x_{1},y_{1}),\ldots,(x_{m},y_{m}) \in X \times \{-1,+1\}$. We further assume that $\Phi$ is a strictly increasing, convex, and differentiable function over $\mathbb{R}$ such that $\forall x \geqslant 0,\ \Phi(x)\geqslant 1$ and $\forall x < 0,\ \Phi(x) > 0$.

Task 4.1 (2 pt.)

Consider the loss function $L(a) =\sum_{i=1}^{m}\Phi \left( -y_{i}g(x_i) \right)$ where $g$ is a linear combination of base classifiers, i.e., $g = \sum_{t=1}^{T} a_t h_t$ (as in AdaBoost). Derive a new boosting algorithm using the objective function $L$. In particular, characterize the best base classifier $h_u$ to select at each round of boosting if we use coordinate descent.

BEGIN Solution

Coordinate descent selects, at each round, the direction $h_u \in \mathcal{A}$ along which $L$ decreases the fastest. The directional derivative of $L$ at the current $g$ in the direction $h_u$ is

$$ \frac{\partial}{\partial \eta} L(g + \eta h_u) \Big\vert_{\eta = 0} = -\sum_{i=1}^{m}y_{i}h_{u}\left(x_{i}\right)\Phi^{\prime}\left(-y_{i}g\left(x_{i}\right)\right). $$

Introduce the (unnormalized) sample weights $w_i = \Phi^{\prime}(-y_i g(x_i)) > 0$ (positive since $\Phi$ is strictly increasing). Because $y_i h_u(x_i) = 1 - 2\,\mathbb{1}_{\{y_i h_u(x_i) \leqslant 0\}}$ for $h_u(x_i) \in \{-1,+1\}$, minimizing this directional derivative over $h_u$ is equivalent to minimizing the weighted training error $\sum_{i=1}^{m} w_i \mathbb{1}_{\{y_i h_u(x_i) \leqslant 0\}}$. Therefore, the best base classifier $h_u$ at each round is the one minimizing the $\Phi^{\prime}$-weighted error on the training data; its coefficient is then found by a line search along $h_u$.
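A schematic sketch (ours) of the resulting selection rule, where `base_family`, `phi_prime`, and `g_pred` are assumed, illustrative inputs:

```python
# Schematic sketch (ours) of the selection rule: weights are Phi'(-y_i g(x_i))
# and h_u minimizes the weighted 0-1 error over the base family A.
# base_family, phi_prime and g_pred are assumed/illustrative inputs.
import numpy as np

def best_base_classifier(base_family, X, y, g_pred, phi_prime):
    w = phi_prime(-y * g_pred)                      # per-sample weights from the current g
    best_h, best_err = None, np.inf
    for h in base_family:                           # h.predict(X) is assumed to return values in {-1, +1}
        err = np.sum(w * (y * h.predict(X) <= 0))   # weighted training error of h
        if err < best_err:
            best_h, best_err = h, err
    return best_h
```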

END Solution



Task 4.2 (1 pt.)

Consider the following functions:

  1. zero-one loss $\Phi_1(−u) = 1_{u\leqslant0}$;
  2. least squared loss $\Phi_2(−u) = (1 − u)^2$;
  3. SVM loss $\Phi_3(−u) = \max\{0, 1 − u\}$;
  4. logistic loss $\Phi_4(−u) = \log(1 + e^{−u})$.

Which functions satisfy the assumptions on $\Phi$ stated earlier in this problem?

Compute the gradient for them.

BEGIN Solution

$\Phi_1(x) = \mathbb{1}_{\{x \geqslant 0\}}$: $\Phi_{1}(-1) = 0$, which violates $\Phi(x) > 0$ for $x < 0$ (and $\Phi_1$ is not strictly increasing), so it doesn't satisfy the assumptions.

$\Phi_2(x) = (1+x)^{2}$: it is not monotonically increasing over $\mathbb{R}$ (it decreases for $x < -1$), so it doesn't satisfy the assumptions.

$\Phi_3(x) = \max(0, 1 + x)$: $\Phi_{3}(-1) = 0$, which violates $\Phi(x) > 0$ for $x < 0$ (and $\Phi_3$ is neither strictly increasing nor differentiable at $x = -1$), so it doesn't satisfy the assumptions.

Only the logistic loss $\Phi_4$ satisfies the assumptions on $\Phi$. Its gradient is

$$\frac{\partial \Phi_4(-u)}{\partial u} = -\frac{e^{-u}}{1 + e^{-u}} $$

END Solution


Task 4.3* (2 pt.)

For each loss function satisfying the assumptions stated above, derive the corresponding boosting algorithm. How do the algorithm(s) differ from AdaBoost?

BEGIN Solution

END Solution


NNs

Task 5. (1 pt.)

Consider a two-layer network function of the form \begin{equation} y_k(x, \mathbf{w}) = \sigma \left( \sum_{j=1}^M w_{kj}^{(2)} \sigma \left(\sum_{i=1}^D w_{ji}^{(1)} x_i + w_{j0}^{(1)}\right) + w_{k0}^{(2)}\right) \end{equation}

in which the nonlinear activation functions are logistic sigmoid functions \begin{equation} \sigma(a) = (1 + \exp(−a))^{-1}. \end{equation} Show that there exists an equivalent network, which computes exactly the same function but with hidden unit activation function given by hyperbolic tangent ${\rm tanh}(a)$ \begin{equation} {\rm tanh}(a) = \frac{e^a - e^{-a}}{e^a + e^{-a}}. \end{equation}

BEGIN Solution

As we know, $\mbox{tanh}$ is a rescaled $\mbox{sigmoid}$: $$\mbox{tanh}(x) = 2\,\mbox{sigmoid}(2x) - 1$$

Or, equivalently, $\mbox{sigmoid}$ is a rescaled $\mbox{tanh}$: $$\mbox{sigmoid}(x) = \frac{\mbox{tanh}\Big(\frac{x}{2}\Big) + 1}{2} = \frac{1}{2}\mbox{tanh}\Big(\frac{x}{2}\Big) + \frac{1}{2} $$

So to construct an equivalent network with $\mbox{tanh}$ hidden-unit activations we can simply adjust the weights of the hidden and output layers (the output activation remains a sigmoid) to get the same output values:

$$ w^{(1)}_{ji} \leftarrow \frac{w^{(1)}_{ji}}{2}; \, w_{j0}^{(1)} \leftarrow \frac{w_{j0}^{(1)}}{2} $$$$ w^{(2)}_{kj} \leftarrow \frac{w^{(2)}_{kj}}{2}; \, w^{(2)}_{k0} \leftarrow w^{(2)}_{k0} + \frac{1}{2}\sum_{j=1}^{M} w_{kj}^{(2)} $$
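A quick numerical check (illustrative, with random weights) that these rescaled weights reproduce the original sigmoid-hidden-unit network exactly:

```python
# Numerical check (illustrative, random weights) that the rescaled weights
# reproduce the original sigmoid-hidden-unit network exactly.
import numpy as np

rng = np.random.default_rng(2)
D, M, K = 4, 6, 3
W1, b1 = rng.normal(size=(M, D)), rng.normal(size=M)    # w^{(1)}_{ji} and biases w^{(1)}_{j0}
W2, b2 = rng.normal(size=(K, M)), rng.normal(size=K)    # w^{(2)}_{kj} and biases w^{(2)}_{k0}
x = rng.normal(size=D)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

y_sigmoid = sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)                     # original network
y_tanh = sigmoid((W2 / 2) @ np.tanh((W1 / 2) @ x + b1 / 2)
                 + b2 + 0.5 * W2.sum(axis=1))                           # converted network
assert np.allclose(y_sigmoid, y_tanh)
```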

END Solution


Task 6*. Data augmentation (2 pt.)

One way to encourage invariance of a model w.r.t a set of transformations is to expand the training set using transformed versions of the original input patterns. Consider the framework for training with transformed data in the special case when the transformation is simply addition of random noise $x \rightarrow x + \xi$ where $\xi$ has a Gaussian distribution with zero mean and unit variance. The error function for untransformed inputs can be written (in the infinite data set limit) in the form \begin{equation} E = \frac12 \int \int (y(\mathbf{x}) - t)^2 p(t|\mathbf{x}) p(\mathbf{x}){\rm d}\mathbf{x} {\rm d}t. \end{equation} If we now consider an infinite number of copies of each data point perturbed by the transformation, then the error function can be written as \begin{equation} \widetilde{E} = \frac12 \int\int(y(x + \xi) - t)^2p(t | \mathbf{x})p(\mathbf{x}) p(\xi){\rm d}\mathbf{x}{\rm d}t {\rm d}\xi \end{equation} Using Taylor series expansion of $y(\mathbf{x} + \xi)$ show that \begin{equation} \widetilde{E} \simeq E + \lambda \Omega \end{equation} where $\Omega$ is a regularizer which takes the form of Tikhonov regularizer \begin{equation} \Omega = \frac12 \int \|\nabla y(\mathbf{x})\|^2 p(\mathbf{x}){\rm d} \mathbf{x}. \end{equation}

BEGIN Solution

END Solution