In this part of the homework you are to solve several simple theoretical problems related to machine learning algorithms.
Let us consider the problem of linear ridge regression for data $(x_{1},y_{1}),\dots,(x_{n},y_{n})\in\mathbb{R}^{d}\times\mathbb{R}$. Let the objects have positive sample weights $v_{i}>0$, i.e. the optimization problem is $$\sum_{i=1}^{n}v_{i}\cdot L(y_{i}, \hat{y}_{i})+\frac{\lambda}{2}\|w\|_{2}^{2}=\sum_{i=1}^{n}v_{i}\cdot (\langle\boldsymbol{w},\boldsymbol{x}_{i}\rangle-y_{i})^{2}+\frac{\lambda}{2}\|w\|_{2}^{2}\rightarrow \min_{\boldsymbol{w}}.$$ This problem reduces to classical linear ridge regression when $v_{i}\equiv 1$. The matrix form of the problem is $$(Xw-y)^{\top}V(Xw-y)+\frac{\lambda}{2}w^{\top}w\rightarrow\min_{w},$$ where $V=V^{\top}\in\mathbb{R}^{n\times n}$ is the diagonal matrix with diagonal elements $v_{1},\dots, v_{n}$. Note that the quadratic problem is strictly convex (w.r.t. $\boldsymbol{w}$, since $\lambda>0$), thus the solution is unique. Solve this problem for any (symmetric) positive-definite matrix $V$ (not just diagonal) and provide the answer in the matrix form.
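A sketch of one way to solve it: set the gradient of the matrix-form objective to zero,
$$\nabla_{w}\Big[(Xw-y)^{\top}V(Xw-y)+\frac{\lambda}{2}w^{\top}w\Big] = 2X^{\top}V(Xw-y)+\lambda w = 0,$$
and, since $V$ is positive-definite and $\lambda>0$, the matrix $X^{\top}VX+\frac{\lambda}{2}I$ is positive-definite and hence invertible, so
$$w^{*}=\Big(X^{\top}VX+\frac{\lambda}{2}I\Big)^{-1}X^{\top}Vy.$$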
Let's consider the sigmoid function: $$ S(\theta, x) = \frac{1}{1+\exp(-\theta x)} $$ The loss function in that case is the log-likelihood (to be maximized): $$ L(\theta) = \sum_i y_i \log(S(\theta, x_i)) + (1 - y_i) \log(1 - S(\theta, x_i)) $$ Let's consider the two possible classes. If the label is 0, the corresponding term is maximized as the predicted probability (i.e. the value of the sigmoid function) tends to zero, which requires $|\theta|$ to go to infinity. In the second case, if the label is 1 and the predicted probability tends to 1 (i.e. the sigmoid function approaches 1), then $|\theta|$ goes to infinity as well. So there is a need for regularization (an additional constraint), q.e.d.
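This divergence can be observed numerically. Below is a minimal sketch (the one-dimensional separable dataset, the step size and the iteration count are arbitrary choices, not part of the problem): gradient ascent on the unregularized log-likelihood keeps increasing $\theta$ instead of converging.

```python
import numpy as np

# Linearly separable toy data: negatives left of zero, positives right (made up).
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = 0.0
trace = []
for _ in range(5000):
    # Gradient of the log-likelihood with respect to theta.
    grad = np.sum((y - sigmoid(theta * x)) * x)
    theta += 1.0 * grad  # gradient ascent step
    trace.append(theta)
```

On separable data the gradient never vanishes, so `theta` grows without bound (roughly logarithmically in the number of steps).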
As we know, decision boundaries are determined by the surface where the probabilities of different classes are equal.
Or, using Bayes' theorem (this form is more suitable for our case):
$$\Pr(C_s)\Pr(\boldsymbol{x}|C_s) = \Pr(C_t)\Pr(\boldsymbol{x}|C_t)$$ In the given task the probability that an object belongs to a particular class (i.e. is composed of a given set of binary states) is the following (a simple product of probabilities of independent events): $$\Pr(\boldsymbol{x}|C_q) = \prod_{j = 1}^d p_{qj}^{x_j}(1 - p_{qj})^{1 - x_j}$$
After taking the logarithm of both sides of the boundary equation, we get (for brevity we introduce $a_j = \log p_{sj}$, $b_j = \log(1 - p_{sj})$, $a_j' = \log p_{tj}$, $b_j' = \log(1 - p_{tj})$):
$$\log(\Pr(C_s)) + \sum_{j = 1}^d \big(x_j a_j + (1 - x_j) b_j\big) = \log(\Pr(C_t)) + \sum_{j = 1}^d \big(x_j a_j' + (1 - x_j) b_j'\big)$$ Let's work it out a little to make the linearity explicit: $$\log(\Pr(C_s)) - \log(\Pr(C_t)) + \sum_{j = 1}^{d}\big(x_{j} A_j + B_j\big) = 0,$$ where $A_j = a_j - a_j' - b_j + b_j'$ and $B_j = b_j - b_j'$, so the boundary is linear in $\boldsymbol{x}$.
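A quick numerical sanity check of this linearity (the priors and Bernoulli parameters below are arbitrary made-up numbers): for a two-class Bernoulli naive Bayes model, the log-ratio of the class joint probabilities coincides with an affine function $\langle w, x\rangle + b$ of the binary feature vector.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
prior_s, prior_t = 0.3, 0.7            # class priors (made up)
p_s = rng.uniform(0.1, 0.9, d)         # Bernoulli parameters p_{sj}
p_t = rng.uniform(0.1, 0.9, d)         # Bernoulli parameters p_{tj}

def log_joint(x, prior, p):
    """log Pr(C) + log Pr(x | C) for the Bernoulli naive Bayes model."""
    return np.log(prior) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Claimed affine form of the boundary function g(x) = <w, x> + b.
w = np.log(p_s * (1 - p_t) / (p_t * (1 - p_s)))
b = np.log(prior_s / prior_t) + np.sum(np.log((1 - p_s) / (1 - p_t)))

# Compare the two on every binary vector of length d.
xs = np.array([[int(bit) for bit in np.binary_repr(v, d)] for v in range(2 ** d)])
gap = max(abs(log_joint(x, prior_s, p_s) - log_joint(x, prior_t, p_t) - (w @ x + b))
          for x in xs)
```

The discrepancy `gap` is zero up to floating-point error for every binary input.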
Consider the 1-nearest-neighbor classifier applied to a multiclass classification problem. Let's denote by $f_X(\cdot)$ the classifier fitted on data $X$.
The formula for complete leave-k-out cross-validation on data sample $X^{n}$ is defined as $$L_{k}OCV=\frac{1}{C_{n}^{k}}\bigg[\sum\limits_{X\in \mathcal{P}(X^{n}),\, |X|=k}\frac{1}{k}\bigg(\sum_{i\in X}[y_{i}\neq f_{X^{n}\setminus X}( x_{i})]\bigg)\bigg],$$ where $\mathcal{P}(X^{n})$ is the set of all subsets of $X^{n}$. For all $i=1,2,\dots,n$ denote the label of $m$-th closest neighbor of $x_{i}$ in $X^{n}\setminus \{x_{i}\}$ by $y_{i}^{m}$. Show that $$L_{k}OCV=\sum_{m=1}^{k}\underbrace{\frac{1}{n}\sum_{i=1}^{n}[y_{i}\neq y_{i}^{m}]}_{K_{m}(X^{n})}\frac{C_{n-1-m}^{n-k-1}}{C_{n-1}^{k-1}},$$ where $K_{m}(X^{n})$ is called the compactness profile of $X^{n}$, i.e. the fraction of objects whose $m$-th nearest neighbor has different label. For convenience assume that all the pairwise distances between objects are different.
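The identity can be verified by brute force on a small sample. The points, labels and $k$ below are arbitrary (chosen so that, from every point, distances to all others are distinct), and $C_a^b$ is read as $\binom{a}{b}$:

```python
import numpy as np
from itertools import combinations
from math import comb

# Toy 1-D sample with binary labels (made up).
x = np.array([0.0, 1.1, 2.5, 3.2, 4.9, 6.0, 7.8, 8.3])
y = np.array([0,   1,   0,   0,   1,   1,   0,   1])
n, k = len(x), 3

# Direct complete leave-k-out CV with the 1-NN classifier.
direct = 0.0
for S in combinations(range(n), k):
    train = [i for i in range(n) if i not in S]
    fold_err = 0
    for i in S:
        nn = min(train, key=lambda j: abs(x[j] - x[i]))  # nearest neighbor index
        fold_err += y[nn] != y[i]
    direct += fold_err / k
direct /= comb(n, k)

# Compactness-profile formula.
formula = 0.0
for m in range(1, k + 1):
    K_m = 0.0
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: abs(x[j] - x[i]))
        K_m += y[order[m - 1]] != y[i]   # m-th neighbor has a different label
    K_m /= n
    formula += K_m * comb(n - 1 - m, n - k - 1) / comb(n - 1, k - 1)
```

Both quantities agree up to floating-point error.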
Let the subsample $\hat{X}^{n}$ of size $n$ be generated from the sample $X^{n}=\{\boldsymbol{x}_{1},\dots,\boldsymbol{x}_{n}\}$ by the bootstrap procedure. Find the probability of object $x_{i}$ being present in $\hat{X}^{n}$ and compute the limit of this probability for $n\rightarrow\infty$.
To calculate the probability of object $x_{i}$ being present in $\hat{X}^{n}$, we can instead calculate the probability of $x_{i}$ not being present in $\hat{X}^{n}$, which is in fact much easier: each of the $n$ independent draws misses $x_{i}$ with probability $\frac{n-1}{n}$.
$$
\Pr\Big( \boldsymbol{x_i} \notin \hat{X}^n\Big) =
\Big(\frac{n-1}{n}\Big)^n = \Big(1 - \frac{1}{n}\Big)^n
$$
So
$$
\Pr\Big( \boldsymbol{x_i} \in \hat{X}^n\Big) =
1 - \Pr\Big( \boldsymbol{x_i} \notin \hat{X}^n\Big) =
\boxed{1 - \Big(1 - \frac{1}{n}\Big)^n}
$$
As $n$ tends to infinity:
$$
\lim_{n \rightarrow \infty} \Pr\Big( \boldsymbol{x_i} \in \hat{X}^n\Big) = \boxed{1 - \frac{1}{e}}
$$
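A quick Monte-Carlo check of the closed-form probability (the sample size and number of trials below are arbitrary):

```python
import numpy as np

n, trials = 100, 20000
rng = np.random.default_rng(0)

# Each row is one bootstrap sample: n draws with replacement from {0, ..., n-1}.
samples = rng.integers(0, n, size=(trials, n))

# Fraction of bootstrap samples that contain object 0.
estimate = np.mean((samples == 0).any(axis=1))

exact = 1 - (1 - 1 / n) ** n   # closed-form probability for this n
limit = 1 - np.exp(-1.0)       # limiting value 1 - 1/e
```

Already for $n=100$ the exact probability is within about $0.002$ of the limit $1-1/e\approx 0.632$.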
Consider a leaf of a binary decision tree which consists of object-label pairs $(x_{1},y_{1}),\dots,(x_{n},y_{n})$. The prediction $\hat{y}$ of this leaf is defined to minimize the loss on the training sample, i.e. $\frac{1}{n}\sum_{i=1}^{n}L(y_{i}, \hat{y})\rightarrow\min$. We consider three cases:
Suppose that instead of using the optimal prediction for this leaf we output a label from $y_{1},\dots,y_{n}$ at random. What happens to the (expected) loss on the training sample (will it increase/decrease/not change)? Prove your answer (separately for every case).
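A sketch for the squared-loss case (assuming $L(y,\hat{y})=(y-\hat{y})^2$ is one of the three cases): the optimal prediction is the mean $\bar{y}$, with training loss equal to the sample variance $s^2=\frac{1}{n}\sum_{i}(y_i-\bar{y})^2$, while outputting a uniformly random label $y_j$ gives expected loss
$$\mathbb{E}_{j}\Big[\frac{1}{n}\sum_{i=1}^{n}(y_i-y_j)^2\Big] = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(y_i-y_j)^2 = 2s^2,$$
i.e. the expected loss doubles (strictly increases unless all labels in the leaf coincide).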
Let objects $\boldsymbol{x}_{1},\dots,\boldsymbol{x}_{n}$ have binary labels $y_{1}, y_{2},\dots,y_{n}\in\{0,1\}$. Let the classifier $f$ assign to objects probabilities of being from class $1$. Without loss of generality assume that $f(\boldsymbol{x_{i}})<f(\boldsymbol{x_{j}})$ for all $i<j$. Define the fraction of concordant pairs by $$\text{FCP}(f, X^{n})=\frac{2}{n(n-1)}\sum_{i<j}[y_{i}\leq y_{j}].$$ Show that this value is equal to the Area Under the ROC Curve (AUC) of classifier $f$.
Recall that the prediction of Kernel Ridge Regression fitted on data $X^{n}$ with the kernel $K(\cdot, \cdot)$ has the form $\mathcal{K}(x)=\sum_{i=1}^{n}\alpha_{i}K(x, x_{i})$, where $\alpha=(K+\lambda I)^{-1}Y$ ($K_{ij}=K(x_{i},x_{j})$). The time complexity of computation of a prediction $\mathcal{K}(x)$ for any point $x$ is $O(n)$, i.e. grows linearly with the size of the training set.
Consider the bilinear kernel $K(x, x')=\langle x, x'\rangle$. For this kernel, Kernel Ridge Regression is known to turn into simple linear ridge regression. However, for linear regression the computation time of a prediction $\mathcal{R}(x)=\sum_{j=1}^{d}\beta_{j}x^{j}$ is $O(d)$, where $d$ is the dimension of the feature space and does not grow with the training set size, which seemingly contradicts the previous paragraph.
In this task, in order to show that kernel regression with the bilinear kernel is indeed linear ridge regression, you have to prove by direct comparison that the predictions match exactly.
For linear ridge regression the solution is the following: $$ \hat{f}(x) = x \Big( X^T X + \lambda I \Big)^{-1} X^T Y $$ With the kernel given in the problem, the kernel regression prediction is: $$ \hat{f}(x) = x X^T \Big(XX^T + \lambda I \Big)^{-1} Y $$ So let's show that $\Big( X^T X + \lambda I \Big)^{-1} X^T$ is equal to $X^T \Big(XX^T + \lambda I \Big)^{-1}$:
To do so we simply multiply both expressions by $\lambda I + X^T X$ on the left and by $\lambda I + X X^T$ on the right (both matrices are invertible since $\lambda > 0$, so this transformation is an equivalence):
$$ \Big(\lambda I + X^TX\Big) X^T \stackrel{?}{=} X^T \Big(\lambda I + X X^T\Big) \\ \, \\ \lambda X^T + X^T X X^T = \lambda X^T + X^T X X^T, $$q.e.d.
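The identity can also be checked numerically; a minimal sketch (the shapes and the value of $\lambda$ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 7, 3, 0.5
X = rng.normal(size=(n, d))

# (X^T X + lam I)^{-1} X^T  versus  X^T (X X^T + lam I)^{-1}
left = np.linalg.inv(X.T @ X + lam * np.eye(d)) @ X.T
right = X.T @ np.linalg.inv(X @ X.T + lam * np.eye(n))
```

The two $d \times n$ matrices coincide; this is sometimes called the push-through identity.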
Let's decompose the given kernel: $$ K(x,x') = \exp(-\|x-x'\|^2) = \exp\Big(-(x- x')^T(x-x')\Big) = \exp(-x^T x + 2x^T x' - {x'}^T x')$$ Let's collect the first and last factors into $f(x) = \exp(-x^Tx)$: $$ K(x,x') = f(x) \cdot \exp(2x^T x') \cdot f(x') $$ Now we can prove that $\exp(2x^T x')$ is a positive-semidefinite kernel: $$ \exp(2x^T x') = \sum_{k=0}^{\infty} \frac{(2x^T x')^k}{k!} $$ The matrix corresponding to the kernel $2x^T x'$ is (twice) a Gram matrix, which is positive-semidefinite. By the Schur product theorem, every power $(2x^T x')^k$ is then a positive-semidefinite kernel as well, and a convergent sum of positive-semidefinite kernels with nonnegative coefficients $\frac{1}{k!}$ is again positive-semidefinite, hence so is $\exp(2x^T x')$.
Now we can go back to the original kernel $K(x,x')$ and its matrix $\mathbf{K}$ and take any vector $c \in \mathbb{R}^{n \times 1}$: $$ c^T \mathbf{K} c = \sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) = \sum_{i=1}^n \sum_{j=1}^n (c_i f(x_i)) (c_j f(x_j)) \exp(2x_i^T x_j) = \sum_{i=1}^n \sum_{j=1}^n c_i'c_j' \exp(2x_i^T x_j), $$ where $c_i' = c_i f(x_i)$. As $\exp(2x_i^T x_j)$ is positive-semidefinite, the last quadratic form is nonnegative, so $K(x, x')$ is positive-semidefinite as well, q.e.d.
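A numerical sanity check of the conclusion (the sample of points below is arbitrary): the Gram matrix of $K(x,x')=\exp(-\|x-x'\|^2)$ has no negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # 20 random points in R^3

# Gram matrix of K(x, x') = exp(-||x - x'||^2) via broadcasting.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists)

# Eigenvalues of a symmetric PSD matrix are nonnegative (up to round-off).
eigvals = np.linalg.eigvalsh(K)
```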
Show that for the two-class SVM classifier the following upper bound on the leave-one-out cross-validation error estimate holds true: $$L_{1}OCV=\frac{1}{n}\sum_{i=1}^{n}[y_{i}\neq f_{i}(x_{i})]\leq \frac{|SV|}{n},$$ where for all $i=1,\dots,n$, $f_{i}$ is the SVM fitted on the entire data without the observation $(x_{i},y_{i})$, and $|SV|$ is the number of support vectors of the SVM fitted on the entire data.
The end.