In this part of the homework you are to solve several theoretical problems related to machine learning algorithms.
You can typeset your math inline, as $ your latex equation here $, in display style, as $$ your latex equation here $$, or as a multi-line derivation:
\begin{align}
\text{left-hand-side}
&= \text{right-hand-side on line 1} \\
&= \text{right-hand-side on line 2} \\
&= \text{right-hand-side on the last line}
\end{align}
where the ampersand (&) aligns the equations horizontally and the double backslash (\\) creates a new line. Write your theoretical derivations within such blocks:
**BEGIN Solution**
<!-- >>> your derivation here <<< -->
**END Solution**
Assume that the regression model is $$y = \sum_{i=1}^k \beta_i x_i + \varepsilon,$$ where $\varepsilon$ is distributed normally: $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, with $\sigma^2$ known.
Prove that the model with the highest Akaike information criterion is the model with the smallest Mallows' $C_p$.
BEGIN Solution
Let $\hat{R}_{tr}$ be the training-sample error and $R$ the in-sample generalization error: $$\hat{R}_{tr} (J) = \frac{1}{m} \sum_{i=1}^{m} \Big( \hat{y}_i (J) - y_i \Big)^2$$ $$R (J) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E} \Big( \hat{y}_i (J) - y_i^* \Big)^2 $$ where $J \subseteq \{1, ..., d\}$ is a subset of the features of $x$ used to construct the linear model, and $y^*$ are newly obtained output measurements (with newly generated noise values $\varepsilon_i^*$).
There was a theorem in the lecture stating that: $$ \mbox{bias}( \hat{R}_{tr}(J)) = \mathbb{E}(\hat{R}_{tr} (J)) - R(J) = - \frac{2}{m} \sum_{i=1}^{m} \mbox{Cov} ( \hat{y}_i, y_i) $$ So the unbiased estimate of the generalization error is: $$ R(J) \approx \hat{R}_{tr}(J) + \frac{2}{m} \sum_{i=1}^{m} \mbox{Cov} (\hat{y}_i, y_i) $$
It was also stated in the lecture that for a linear regression model: $$ 2 \sum_{i=1}^{m} \mbox{Cov} (\hat{y}_i, y_i) = 2 | J | \sigma^2 $$ Hence the unbiased estimate of the regression risk is: $$\hat{R}(J) = \hat{R}_{tr}(J) + \frac{2\sigma^2}{m} |J| $$ Instead of the true $\sigma^2$ we can use the estimate $\hat{\sigma}^2$ based on the residuals on the training set.
Then we select some subset of features $J$ and minimize $\hat{R}(J)$: $$ \hat{R}(J) \to \min_{w_J, J} $$
The AIC (Akaike Information Criterion) has the form ($\mathcal{L}_J$ is the log-likelihood of the model): $$ \mathcal{L}_J - |J| \to \max_{w_J, J}$$
In our case: $$ \mathcal{L}_J (w) = m \log \frac{1}{\sqrt{2 \pi}\sigma} - \frac{1}{2\sigma^2} \sum^{m}_{i=1} \Big (y_i - \beta_J x_{i,J} \Big)^2$$ So AIC becomes: $$ \mathcal{L}_J (w) - |J| = - \frac{m}{2\sigma^2} \hat{R}_{tr} (J) - |J| + C \to \max_{w_J, J} $$
In fact the two criteria are equivalent in the case of linear regression: multiplying the AIC by $-\frac{2\sigma^2}{m}$ gives $$ - \frac{2\sigma^2}{m} \Big(\mathcal{L}_J - |J| \Big) = \hat{R}_{tr}(J) + \frac{2\sigma^2}{m} |J| + C' $$ Up to the constant $C'$, the left-hand side is exactly $\hat{R}(J)$, so the model that maximizes the AIC is the model that minimizes Mallows' $C_p$: the negative factor turns the maximization into a minimization, q.e.d.
NOTE: the purpose of the multiplicative factor is only to remove the difference in sign and scale and to show that the two objectives coincide.
END Solution
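To make the equivalence concrete, here is a minimal numerical sketch (illustrative only; it assumes numpy is available and uses synthetic data with a known $\sigma^2$) showing that ranking candidate feature subsets by $\hat{R}(J) = \hat{R}_{tr}(J) + \frac{2\sigma^2}{m}|J|$ and by the AIC $\mathcal{L}_J - |J|$ selects the same model:

```python
# Minimal sketch (synthetic data, known sigma^2): check that the subset J
# minimizing R_hat(J) = R_tr(J) + 2*sigma^2*|J|/m is the one maximizing AIC.
import itertools
import numpy as np

rng = np.random.default_rng(0)
m, d, sigma = 200, 5, 0.5
X = rng.normal(size=(m, d))
beta_true = np.array([1.5, -2.0, 0.0, 0.0, 0.7])       # only features 0, 1, 4 matter
y = X @ beta_true + rng.normal(scale=sigma, size=m)

def fit_and_score(J):
    XJ = X[:, J]
    beta = np.linalg.lstsq(XJ, y, rcond=None)[0]        # least-squares fit on subset J
    r_tr = np.mean((XJ @ beta - y) ** 2)                # training risk R_tr(J)
    r_hat = r_tr + 2 * sigma**2 * len(J) / m            # Mallows-style corrected risk
    loglik = (-m * np.log(np.sqrt(2 * np.pi) * sigma)
              - m * r_tr / (2 * sigma**2))              # Gaussian log-likelihood
    aic = loglik - len(J)                               # AIC in the form L_J - |J|
    return r_hat, aic

subsets = [list(J) for k in range(1, d + 1)
           for J in itertools.combinations(range(d), k)]
best_by_risk = min(subsets, key=lambda J: fit_and_score(J)[0])
best_by_aic = max(subsets, key=lambda J: fit_and_score(J)[1])
print(best_by_risk, best_by_aic)                        # both should pick the same subset
```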
Minimization of the loss function is an optimization task, and "Gradient Boosting" is one of the many methods to perform this optimization. It should be noted that it uses a greedy approach and therefore, like greedy algorithms in CS, may produce results that are not globally optimal.
$$ \begin{aligned} & b_n(x) := \text{the best base model from the family of the algorithms $\mathcal{A}$} \\ & \gamma_n(x) := \text{scale or weight of the new model} \\ & a_N(x) = \sum_{n=0}^N \gamma_n b_n(x) := \text{the final composite model} \end{aligned} $$
Consider a loss function $L(y, z)$ for target $y$ and prediction $z$, and let $(x_i, y_i)_{i=1}^l$ be our training dataset for a regression task.
\begin{align}
& \text{for } n = 1 \text{ to n\_boost\_steps do:} \\
& \quad b_n(x) = \arg\min\limits_{b\in\mathcal{A}}\sum_{i=1}^l \bigl(b(x_i) - s_i\bigr)^2 \\
& \quad \gamma_n = \arg\min_{\gamma} G_n(b_n, \gamma) \\
& \quad a_n(x) = a_{n-1}(x) + \gamma_n b_n(x)
\end{align}
At the $n$-th step of Gradient Boosting ($n \geq 1$) we compute the "residuals" $(s_i)_{i=1}^l$ and learn $x\mapsto b_n(x)$ with a regression algorithm $\mathcal{A}$ applied to the dataset $(x_i, s_i)_{i=1}^l$. For the next two tasks assume that we have already performed these substeps.
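As a small illustrative sketch of the loop above (not part of the assignment): it assumes scikit-learn's DecisionTreeRegressor as the base family $\mathcal{A}$, uses the MSE residuals $s_i = y_i - a_{n-1}(x_i)$, and plugs in the closed-form step size derived in the next task; the synthetic data and variable names are placeholders.

```python
# Sketch of the boosting loop above for the MSE loss, where the "residuals" are
# s_i = y_i - a_{n-1}(x_i); shallow regression trees stand in for the family A.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

n_boost_steps = 50
models, gammas = [], []
a = np.zeros_like(y)                                   # a_0(x) = 0

for n in range(n_boost_steps):
    s = y - a                                          # residuals for the MSE loss
    b = DecisionTreeRegressor(max_depth=2).fit(X, s)   # b_n = argmin_b sum (b(x_i) - s_i)^2
    pred = b.predict(X)
    gamma = (s @ pred) / (pred @ pred)                 # closed-form step size (next task)
    a = a + gamma * pred                               # a_n = a_{n-1} + gamma_n * b_n
    models.append(b)
    gammas.append(gamma)

print("train MSE:", np.mean((a - y) ** 2))
```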
Derive the optimal value of $\gamma_n$ for MSE loss function $L(y, z) = \tfrac12 (y - z)^2$.
BEGIN Solution
$$ G_n (b_n, \gamma_n ) = \sum_{i=1}^{l} L \Big(y_i, a_{n-1}(x_i) + \gamma_n b_n (x_i) \Big) $$
Denote $z_i = a_{n-1}(x_i) + \gamma_n b_n(x_i)$. Then
$$ \frac{\partial{G_n(b_n, \gamma_n)}}{\partial{\gamma_n}} = \sum_{i=1}^{l} \frac{\partial{L}}{\partial{z_i}} \frac{\partial{z_i}}{\partial{\gamma_n}}, \qquad \frac{\partial{L}}{\partial{z_i}} = - (y_i - z_i), \quad \frac{\partial{z_i}}{\partial{\gamma_n}} = b_n(x_i) $$
Setting the derivative to zero:
$$ \frac{\partial{G_n}}{\partial{\gamma_n}} = - \sum_{i=1}^{l}\Big( y_i - a_{n-1} (x_i) - \gamma_n b_n(x_i) \Big) b_n(x_i) = 0 $$
$$ - \sum_{i=1}^{l} \Big(y_i - a_{n-1} (x_i) \Big)b_n(x_i) + \sum_{i=1}^{l} \gamma_n b_n^2(x_i) = 0 $$
$$ \boxed {\gamma_n = \frac{\sum_{i=1}^{l} \Big(y_i - a_{n-1} (x_i) \Big)b_n(x_i)}{ \sum_{i=1}^{l} b_n^2(x_i) }} $$
END Solution
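As a sanity check of the boxed formula (a sketch only, assuming scipy is available; the random arrays stand in for $y_i$, $a_{n-1}(x_i)$ and $b_n(x_i)$), the closed-form $\gamma_n$ can be compared against a direct one-dimensional minimization of $G_n$:

```python
# Quick numerical check: for random data, the closed-form gamma_n should agree
# with a direct 1-D minimization of G_n over gamma.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
y = rng.normal(size=100)
a_prev = rng.normal(size=100)              # stands in for a_{n-1}(x_i)
b_pred = rng.normal(size=100)              # stands in for b_n(x_i)

def G(gamma):
    return 0.5 * np.sum((y - (a_prev + gamma * b_pred)) ** 2)

gamma_closed = ((y - a_prev) @ b_pred) / (b_pred @ b_pred)
gamma_numeric = minimize_scalar(G).x
print(gamma_closed, gamma_numeric)          # should coincide up to solver tolerance
```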
Let $S = (x_i, y_i)_{i=1}^l$ be a sample for a classification task $y_i \in \{-1, +1\}$.
The AdaBoost algorithm can be regarded as a gradient boosting algorithm with the exponential loss $L(y,z) = e^{-y z}$. Consider the state of AdaBoost at the $(T-1)$-th step $$ G_{T-1}(b_T, \gamma_T) = \sum_{i=1}^l L\bigl(y_i, a_{T-1}(x_i) + \gamma_T b_T(x_i)\bigr) = \sum_{i=1}^l \underbrace{\exp(- y_i a_{T-1}(x_i))}_{w^{T-1}_i} \exp(- y_i \gamma_T b_T(x_i)) \,. $$
BEGIN Solution
We know that: $$ a_T (x_i) = a_{T-1} (x_i) + \gamma_T b_T(x_i) $$
Taking into account that $w_i^T = \exp(-y_i a_{T} (x_i))$, we will get:
$$ a_T(x_i) = - \frac{\log(w_i^{T-1})}{y_i} + \gamma_T b_T (x_i) $$Therefore:
$$ w^T_i = w_i^{T-1} \exp( -y_i \gamma_T b_T (x_i)) $$From lecture 8 we know:
$$ \gamma_T = \frac{1}{2} \log \frac{1-N_T}{N_T}; \, N_T = \sum_{i=1}^l \tilde{w}^T_i \mathbb{1}_{\{ y_i b_T(x_i) \leq 0 \}}; \tilde{w}^T_i = \frac{w^{T-1}_i}{\sum_{j} w^{T-1}_j} $$Finally:
$$ \boxed{ w_i^{T} = w_i^{T-1} \exp \Big(-\frac{1}{2} y_i b_T(x_i) \log \Big( \frac{1 - \sum_{j=1}^l \tilde{w}^T_j \mathbb{1}_{\{ y_j b_T(x_j) \leq 0 \}}}{\sum_{j=1}^l \tilde{w}^T_j \mathbb{1}_{\{ y_j b_T(x_j) \leq 0 \}}}\Big)\Big) } $$
END Solution
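A short sketch of one AdaBoost round implementing exactly this update (assuming numpy; the random labels, base-classifier predictions and initial weights are placeholders for a real training history):

```python
# Sketch of one AdaBoost round: compute gamma_T from the normalized weighted
# error N_T and update the (unnormalized) weights w_i^T as derived above.
import numpy as np

rng = np.random.default_rng(2)
l = 20
y = rng.choice([-1, 1], size=l)              # true labels
b_T = rng.choice([-1, 1], size=l)            # predictions of the chosen base classifier b_T
w_prev = np.exp(rng.normal(size=l))          # w_i^{T-1}, any positive weights

w_tilde = w_prev / w_prev.sum()              # normalized weights
N_T = w_tilde[y * b_T <= 0].sum()            # weighted error of b_T
N_T = np.clip(N_T, 1e-12, 1 - 1e-12)         # guard against degenerate cases
gamma_T = 0.5 * np.log((1 - N_T) / N_T)      # step size from the lecture formula
w_new = w_prev * np.exp(-gamma_T * y * b_T)  # updated weights w_i^T

print(gamma_T, w_new[:5])
```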
BEGIN Solution
Outliers are extreme values that fall far outside the other observations. To derive a rule for detecting them, let's introduce the margin:
$$ \rho_i = y_i \hat{y}_i $$In the case of binary classification with $y_i \in \{-1, +1\}$, a point is treated as an outlier when:
$$ \rho_i < 0 $$Now let's derive the ratio of the weights $(w^T_i)_{i=1}^l$ between an outlier and a normal sample:
$$ \frac{w_i^{T, \mbox{outlier}}}{w_i^{T, \mbox{normal}}} = \frac{\exp(-y_i a_T(x_i^{\mbox{outlier}}))}{\exp(-y_i a_T(x_i^{\mbox{normal}}))} = \frac{\exp\Big(-y_i \sum_j \gamma_j b_j(x_i^{\mbox{outlier}})\Big)}{\exp\Big(-y_i \sum_j \gamma_j b_j(x_i^{\mbox{normal}})\Big)} = \boxed{ \exp \Big( -y_i \sum_j \gamma_j \Big( b_j(x_i^{\mbox{outlier}}) - b_j(x_i^{\mbox{normal}})\Big)\Big) } $$If the value of this ratio is large, the sample is an outlier.
END Solution
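A toy illustration of this effect (a sketch assuming scikit-learn decision stumps; the 1-D dataset and the single flipped label are synthetic): the weight of a mislabeled point grows rapidly relative to a correctly labeled one over AdaBoost rounds, consistent with the weight ratio above.

```python
# Run a few AdaBoost rounds with decision stumps on 1-D data containing one
# flipped label, and watch that point's weight blow up relative to a normal one.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

x = np.linspace(-1, 1, 50).reshape(-1, 1)
y = np.where(x[:, 0] > 0, 1, -1)
y[40] = -y[40]                             # flip one label deep inside the +1 region

l = len(y)
w = np.full(l, 1.0 / l)
a = np.zeros(l)                            # current ensemble scores a_T(x_i)

for T in range(10):
    stump = DecisionTreeClassifier(max_depth=1).fit(x, y, sample_weight=w)
    b = stump.predict(x)
    N = w[(y * b) <= 0].sum() / w.sum()
    N = np.clip(N, 1e-12, 1 - 1e-12)       # guard against a perfect/terrible stump
    gamma = 0.5 * np.log((1 - N) / N)
    a += gamma * b
    w = np.exp(-y * a)                     # w_i^T = exp(-y_i a_T(x_i))
    print(T, w[40] / w[0])                 # outlier weight vs a normal point's weight
```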
BEGIN Solution
Because the loss function is exponential, its value increases dramatically when the model is fed an outlier, which makes such a sample easy to distinguish: its weight quickly dominates the others. So AdaBoost is highly sensitive to outliers.
END Solution
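A tiny numeric illustration of how steeply the exponential loss grows as the margin $y_i a(x_i)$ becomes negative (the margin values are arbitrary, chosen only to show the scale difference):

```python
# Exponential loss exp(-margin) for a few illustrative margins y_i * a(x_i).
import numpy as np

margins = np.array([3.0, 1.0, 0.0, -1.0, -3.0])   # arbitrary margin values
print(np.exp(-margins))                           # ~[0.05, 0.37, 1.0, 2.72, 20.09]
```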
This problem studies boosting-type algorithms defined with objective functions different from that of AdaBoost. We assume that the training data are given as $m$ labeled examples $(x_{1},y_{1}),\ldots,(x_{m},y_{m}) \in X \times \{-1,+1\}$. We further assume that $\Phi$ is a strictly increasing, convex and differentiable function over $\mathbb{R}$ such that $\forall x \geqslant 0,\ \Phi(x)\geqslant 1$ and $\forall x < 0,\ \Phi(x) > 0$.
Consider the loss function $L(a) =\sum_{i=1}^{m}\Phi \left( -y_{i}g(x_i) \right)$ where $g$ is a linear combination of base classifiers, i.e., $g= \sum_{t=1}^{T} a_t h_t$ (as in AdaBoost). Derive a new boosting algorithm using the objective function $L$. In particular, characterize the best base classifier $h_u$ to select at each round of boosting if we use coordinate descent.
BEGIN Solution
The derivative of $L$ along the coordinate $a_u$ is $$\frac{\partial L}{\partial a_u} = -\sum_{i=1}^{m}y_{i}h_{u}\left(x_{i}\right)\Phi^{\prime}\left(-y_{i}g\left(x_{i}\right)\right).$$Coordinate descent picks the steepest-descent coordinate, i.e. the base classifier $h_u$ that maximizes $\sum_{i=1}^{m} y_i h_u(x_i)\,\Phi'(-y_i g(x_i))$. Since $\Phi$ is increasing, the weights $w_i \propto \Phi'(-y_i g(x_i))$ are non-negative, and for $h_u(x_i) \in \{-1,+1\}$ we have $y_i h_u(x_i) = 1 - 2\,\mathbb{1}_{\{y_i h_u(x_i) \leq 0\}}$. Therefore the best $h_u$ is the one that minimizes the weighted error $\sum_{i=1}^{m} w_i \mathbb{1}_{\{y_i h_u(x_i) \leq 0\}}$, i.e. the error on the training data reweighted by $w_i \propto \Phi'(-y_i g(x_i))$.
END Solution
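A sketch of how one round of this generalized boosting could look (assuming $\Phi(u) = \log(1 + e^u)$, i.e. a logistic loss consistent with the gradient computed in the next subproblem, scikit-learn stumps as base classifiers, and random placeholder scores $g(x_i)$): the base classifier is fit with sample weights $w_i = \Phi'(-y_i g(x_i))$, which is exactly the weighted-error criterion derived above.

```python
# One round of the generalized boosting scheme: choose the base classifier that
# minimizes the weighted training error with weights w_i = phi'(-y_i g(x_i)).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
g = 0.5 * rng.normal(size=len(y))          # placeholder for the current scores g(x_i)

def phi_prime(u):                          # derivative of phi(u) = log(1 + e^u)
    return 1.0 / (1.0 + np.exp(-u))

w = phi_prime(-y * g)                      # weights induced by the current g
h_u = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
print("weighted error:", w[h_u.predict(X) != y].sum() / w.sum())
```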
Consider the following functions:
Which functions satisfy the assumptions on $\Phi$ stated earlier in this problem?
Compute the gradient for them.
BEGIN Solution
$\Phi_1(x) = \mathbb{1}_{x \ge 0}$: does not satisfy the assumptions, e.g. $\Phi_{1}(-1) = 0$, which violates $\Phi(x) > 0$ for $x < 0$ (and $\Phi_1$ is not strictly increasing).
$\Phi_2(x) = (1+x)^{2}$: does not satisfy the assumptions, since it is not monotonically increasing over $\mathbb{R}$ (it decreases for $x < -1$), and $\Phi_2(-1) = 0$.
$\Phi_3(x) = \max(0, 1 + x)$: does not satisfy the assumptions, e.g. $\Phi_{3}(-1) = 0$, which violates $\Phi(x) > 0$ for $x < 0$.
Only the logistic loss $\Phi_4$ satisfies the assumptions on $\Phi$. Its gradient is
$$\frac{\partial \Phi_4(-u)}{\partial u} = -\frac{e^{-u}}{1 + e^{-u}} $$
END Solution
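A quick finite-difference check of this gradient (a sketch; it assumes $\Phi_4(v) = \log(1 + e^v)$, which is consistent with the derivative reported above):

```python
# Finite-difference check of d/du [phi_4(-u)] = -e^{-u} / (1 + e^{-u}),
# taking phi_4(v) = log(1 + e^v) as the logistic loss (assumption).
import numpy as np

def phi4_of_minus_u(u):
    return np.log(1.0 + np.exp(-u))

def analytic_grad(u):
    return -np.exp(-u) / (1.0 + np.exp(-u))

u = np.linspace(-4, 4, 9)
eps = 1e-6
numeric_grad = (phi4_of_minus_u(u + eps) - phi4_of_minus_u(u - eps)) / (2 * eps)
print(np.max(np.abs(numeric_grad - analytic_grad(u))))   # should be tiny (~1e-9 or less)
```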
Consider a two-layer network function of the form \begin{equation} y_k(x, \mathbf{w}) = \sigma \left ( \sum_{j=1}^M w_{kj}^{(2)} \sigma \left(\sum_{i=1}^D w_{ji}^{(1)} x_i + w_{j0}^{(1)}\right)
+ w_{k0}^{(2)}\right)
\end{equation}
in which the nonlinear activation functions are logistic sigmoid functions \begin{equation} \sigma(a) = (1 + \exp(-a))^{-1}. \end{equation} Show that there exists an equivalent network, which computes exactly the same function but with hidden unit activation function given by hyperbolic tangent ${\rm tanh}(a)$ \begin{equation} {\rm tanh}(a) = \frac{e^a - e^{-a}}{e^a + e^{-a}}. \end{equation}
BEGIN Solution
As we know $\mbox{tanh}$ is rescaled $\mbox{sigmoid}$: $$\mbox{tanh}(x) = 2\mbox{sigmoid}(2x) - 1$$
Equivalently, $\mbox{sigmoid}$ is a rescaled $\mbox{tanh}$: $$\mbox{sigmoid}(x) = \frac{1}{2}\mbox{tanh}\Big(\frac{x}{2}\Big) + \frac{1}{2} $$
So to construct an equivalent network with $\mbox{tanh}$ hidden units we can simply adjust the weights of the hidden and output layers to get the same output values:
$$ w^{(1)}_{ji} \leftarrow \frac{w^{(1)}_{ji}}{2}; \, w_{j0}^{(1)} \leftarrow \frac{w_{j0}^{(1)}}{2} $$ $$ w^{(2)}_{kj} \leftarrow \frac{w^{(2)}_{kj}}{2}; \, w^{(2)}_{k0} \leftarrow w^{(2)}_{k0} + \frac{1}{2}\sum_{j=1}^{M} w_{kj}^{(2)} $$
END Solution
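A numerical check of this weight transformation (a sketch with random weights; the matrix shapes and names are illustrative): the tanh network with the rescaled weights reproduces the sigmoid network's outputs exactly.

```python
# The tanh network with the rescaled weights should produce exactly the same
# outputs as the original sigmoid network (random weights, random inputs).
import numpy as np

rng = np.random.default_rng(5)
D, M, K = 4, 6, 3
W1 = rng.normal(size=(M, D)); b1 = rng.normal(size=M)    # w_ji^(1), w_j0^(1)
W2 = rng.normal(size=(K, M)); b2 = rng.normal(size=K)    # w_kj^(2), w_k0^(2)
x = rng.normal(size=(10, D))

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

y_sigmoid = sigmoid(sigmoid(x @ W1.T + b1) @ W2.T + b2)

# transformed weights for the tanh hidden layer
W1t, b1t = W1 / 2, b1 / 2
W2t, b2t = W2 / 2, b2 + 0.5 * W2.sum(axis=1)
y_tanh = sigmoid(np.tanh(x @ W1t.T + b1t) @ W2t.T + b2t)

print(np.max(np.abs(y_sigmoid - y_tanh)))                # ~1e-16, i.e. identical
```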
One way to encourage invariance of a model w.r.t a set of transformations is to expand the training set using transformed versions of the original input patterns. Consider the framework for training with transformed data in the special case when the transformation is simply addition of random noise $x \rightarrow x + \xi$ where $\xi$ has a Gaussian distribution with zero mean and unit variance. The error function for untransformed inputs can be written (in the infinite data set limit) in the form \begin{equation} E = \frac12 \int \int (y(\mathbf{x}) - t)^2 p(t|\mathbf{x}) p(\mathbf{x}){\rm d}\mathbf{x} {\rm d}t. \end{equation} If we now consider an infinite number of copies of each data point perturbed by the transformation, then the error function can be written as \begin{equation} \widetilde{E} = \frac12 \int\int(y(x + \xi) - t)^2p(t | \mathbf{x})p(\mathbf{x}) p(\xi){\rm d}\mathbf{x}{\rm d}t {\rm d}\xi \end{equation} Using Taylor series expansion of $y(\mathbf{x} + \xi)$ show that \begin{equation} \widetilde{E} \simeq E + \lambda \Omega \end{equation} where $\Omega$ is a regularizer which takes the form of Tikhonov regularizer \begin{equation} \Omega = \frac12 \int \|\nabla y(\mathbf{x})\|^2 p(\mathbf{x}){\rm d} \mathbf{x}. \end{equation}
BEGIN Solution
END Solution
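A Monte-Carlo illustration of the claimed relation $\widetilde{E} \simeq E + \lambda\Omega$ (a sketch only, with two simplifying assumptions not taken from the problem statement: $y(\mathbf{x})$ is set to the optimal regression function $\mathbb{E}[t|x] = \sin(x)$, and the input noise is given a small variance $\lambda$ so that the second-order Taylor expansion is accurate):

```python
# Compare E_tilde (error on noise-perturbed inputs) with E + lambda * Omega,
# where Omega = 0.5 * E[(dy/dx)^2] is the Tikhonov term, on a 1-D toy problem.
import numpy as np

rng = np.random.default_rng(6)
N, lam = 2_000_000, 0.01                        # lam = Var(xi), kept small (assumption)
x = rng.normal(size=N)
t = np.sin(x) + rng.normal(scale=0.3, size=N)   # targets around E[t|x] = sin(x)
xi = rng.normal(scale=np.sqrt(lam), size=N)     # additive input noise

y = np.sin                                      # the model y(x), assumed optimal here
E = 0.5 * np.mean((y(x) - t) ** 2)              # error on untransformed inputs
E_tilde = 0.5 * np.mean((y(x + xi) - t) ** 2)   # error on noise-perturbed inputs
Omega = 0.5 * np.mean(np.cos(x) ** 2)           # 0.5 * E[(dy/dx)^2]

print(E_tilde, E + lam * Omega)                 # the two should nearly coincide
```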