Home Assignment No. 1: Part 2 (Theory)

In this part of the homework you are to solve several simple theoretical problems related to machine learning algorithms.

  • For each problem you can get either 0 points or the maximal number of points for it. There are NO INTERMEDIATE scores.
  • Your solution must be COMPLETE, i.e. contain all required formulas/proofs/detailed explanations.
  • Write your solution for each problem right after the words YOUR SOLUTION. Attaching pictures of your handwriting is allowed, but highly discouraged.
  • There are two problems marked with *; they are not obligatory. You can get EXTRA POINTS for solving them.

$\LaTeX$ in Jupyter

Jupyter has constantly improving $\LaTeX$ support. Below are the basic methods to write neat, tidy, and well typeset equations in your notebooks:
  • to write an inline equation use
    $ your latex equation here $
    
  • to write an equation, that is displayed on a separate line use
    $$ your latex equation here $$
    
  • to write a block of equations use
    \begin{align}
      left-hand-side
          &= right-hand-side on line 1
          \\
          &= right-hand-side on line 2
          \\
          &= right-hand-side on the last line
    \end{align}
    
    The ampersand (&) aligns the equations horizontally and the double backslash (\\) creates a new line.

Task 1. Linear Ridge Regression (1 point)

Let us consider the problem of linear ridge regression for data $(x_{1},y_{1}),\dots,(x_{n},y_{n})$, where $x_{i}\in\mathbb{R}^{d}$ and $y_{i}\in\mathbb{R}$. Let the objects have positive sample weights $v_{i}>0$, i.e. the optimization problem is $$\sum_{i=1}^{n}v_{i}\cdot L(y_{i}, \hat{y}_{i})+\frac{\lambda}{2}\|w\|_{2}^{2}=\sum_{i=1}^{n}v_{i}\cdot (\langle\boldsymbol{w},\boldsymbol{x}_{i}\rangle-y_{i})^{2}+\frac{\lambda}{2}\|w\|_{2}^{2}\rightarrow \min_{\boldsymbol{w}}.$$ This problem reduces to classical linear ridge regression when $v_{i}\equiv 1$. The matrix form of the problem is $$(Xw-y)^{\top}V(Xw-y)+\frac{\lambda}{2}w^{\top}w\rightarrow\min_{w},$$ where $V=V^{\top}\in\mathbb{R}^{n\times n}$ is the diagonal matrix with diagonal elements $v_{1},\dots, v_{n}$. Note that for $\lambda>0$ the quadratic problem is strictly convex (w.r.t. $\boldsymbol{w}$), thus the solution is unique. Solve this problem for any (symmetric) positive-definite matrix $V$ (not just diagonal) and provide the answer in matrix form.

Your solution:

$$ F(W) = \Big(XW - Y\Big)^T V \Big(XW - Y\Big) + \frac{\lambda}{2}W^T W \rightarrow \min_{\boldsymbol{W}}; \\ \, \\ \nabla F (W) = X^T\big(V + V^T\big)\big(XW - Y\big) + \lambda W = 2X^T V \Big( XW - Y \Big) + \lambda W = 0 \quad (\text{since } V = V^T); \\ \, \\ \Big(X^T V X + \frac{1}{2} \lambda I \Big) W = X^T V Y; \\ \, \\ \boxed{W = \Big(X^T V X + \frac{1}{2} \lambda I\Big)^{-1} X^T V Y} $$ The matrix $X^T V X + \frac{1}{2}\lambda I$ is positive definite for $\lambda > 0$, so the inverse exists.
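
A quick numerical sanity check of the boxed formula (a sketch; the random data, the `scipy` optimizer and all variable names are mine, not part of the task): the closed-form solution should coincide with the minimizer found by a generic optimizer.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, lam = 50, 4, 0.7
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)

# random symmetric positive-definite V (not necessarily diagonal)
A = rng.normal(size=(n, n))
V = A @ A.T + n * np.eye(n)

# closed-form solution derived above: W = (X^T V X + (lambda/2) I)^{-1} X^T V Y
w_closed = np.linalg.solve(X.T @ V @ X + 0.5 * lam * np.eye(d), X.T @ V @ Y)

# direct numerical minimization of the same objective
def objective(w):
    r = X @ w - Y
    return r @ V @ r + 0.5 * lam * w @ w

w_numeric = minimize(objective, np.zeros(d)).x
print(np.allclose(w_closed, w_numeric, atol=1e-4))  # expected: True
```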

Task 2. Logistic Regression (1 point)

Let us consider the case when in the problem of binary classification the training set is linearly separable. Show that in this case the optimization problem for logistic regression without L2-regularization does not have an optimum.

Your solution:

Let's consider the sigmoid function: $$ S(\theta, x) = \frac{1}{1+\exp(-\langle\theta, x\rangle)} $$ The loss to minimize (the negative log-likelihood) in that case is: $$ L(\theta) = -\sum_i \Big( y_i \log S(\theta, x_i) + (1 - y_i) \log\big(1 - S(\theta, x_i)\big) \Big) > 0. $$ Since the training set is linearly separable, there exists $\theta$ such that $\langle\theta, x_i\rangle > 0$ for every object with $y_i = 1$ and $\langle\theta, x_i\rangle < 0$ for every object with $y_i = 0$. Now scale this $\theta$ by a constant $c \rightarrow +\infty$: for every object the predicted probability $S(c\theta, x_i)$ tends to its true label ($1$ in the first case, $0$ in the second), so every term of the loss strictly decreases and $L(c\theta) \rightarrow 0$. Hence the infimum of the loss is $0$, but it is not attained at any finite $\theta$, i.e. the optimization problem has no optimum and the norm of $\theta$ diverges. That is why an additional constraint (regularization) is needed, q.e.d.
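
A tiny numerical illustration of this scaling argument (a sketch with made-up one-dimensional data): on separable data the loss keeps decreasing as the separating weight is scaled up, so no finite minimizer exists.

```python
import numpy as np

# 1-D linearly separable data: class 0 to the left of zero, class 1 to the right
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([0, 0, 0, 1, 1, 1])

def loss(theta):
    """Negative log-likelihood of logistic regression without regularization."""
    p = 1.0 / (1.0 + np.exp(-theta * x))                  # sigmoid scores
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# scaling a separating theta by larger and larger constants keeps decreasing the loss
for theta in [1.0, 2.0, 4.0, 8.0, 16.0]:
    print(theta, loss(theta))                             # strictly decreasing towards 0
```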

Task 3. Bayesian Naive Classifier (1 point)

Show that in the case of $d$ binary-valued features $x_{j}\in\{0, 1\}$ (for $j=1,2,\dots,d$) the Bayesian Naive Classifier's decision rule is linear.

Your solution:

As we know, the decision boundary between two classes $C_s$ and $C_t$ is the set of points where their posterior probabilities are equal; the classifier prefers $C_t$ over $C_s$ whenever

$$\Pr(C_s|\boldsymbol{x}) \leq \Pr(C_t|\boldsymbol{x})$$

Or, using Bayes' theorem and dropping the common denominator $\Pr(\boldsymbol{x})$ (this form is more suitable for our case):

$$\Pr(C_s)\Pr(\boldsymbol{x}|C_s) \leq \Pr(C_t)\Pr(\boldsymbol{x}|C_t)$$

In our task the class-conditional probability of an object (i.e. of its particular combination of binary feature values) is the following product, thanks to the naive assumption of feature independence: $$\Pr(\boldsymbol{x}|C_q) = \prod_{j = 1}^d p_{qj}^{x_j}(1 - p_{qj})^{1 - x_j}, \qquad p_{qj} = \Pr(x_j = 1 \mid C_q)$$

Taking the logarithm of both sides of the boundary equality and denoting $a_j = \log p_{sj}$, $b_j = \log(1 - p_{sj})$, $a_j' = \log p_{tj}$, $b_j' = \log(1 - p_{tj})$, we get:

$$\log(\Pr(C_s)) + \sum_{j = 1}^d x_j a_j + (1 - x_j) b_j = \log(\Pr(C_t)) + \sum_{j = 1}^d x_j a_j' + (1 - x_j) b_j'$$

Let's rearrange it a little to make the linearity explicit: $$\log\Pr(C_s) - \log\Pr(C_t) + \sum_{j = 1}^{d}\Big(A_j x_j + B_j\Big) = 0, \qquad A_j = a_j - a_j' - b_j + b_j', \quad B_j = b_j - b_j'.$$ This is an affine function of $\boldsymbol{x}$, so the decision rule (and hence the decision boundary) of the Naive Bayes classifier with binary features is linear.
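
A small numpy check of this linearity (a sketch; the Bernoulli parameters and class priors below are invented): the log-posterior difference computed term by term coincides with the explicit affine function $\langle A, x\rangle + B$.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
# invented Bernoulli parameters p_{sj}, p_{tj} and class priors
p_s, p_t = rng.uniform(0.1, 0.9, d), rng.uniform(0.1, 0.9, d)
prior_s, prior_t = 0.3, 0.7

def log_posterior_diff(x):
    """log(Pr(C_s) Pr(x|C_s)) - log(Pr(C_t) Pr(x|C_t)), computed feature by feature."""
    ll_s = np.log(prior_s) + np.sum(x * np.log(p_s) + (1 - x) * np.log(1 - p_s))
    ll_t = np.log(prior_t) + np.sum(x * np.log(p_t) + (1 - x) * np.log(1 - p_t))
    return ll_s - ll_t

# the same quantity as an explicit affine function <A, x> + B
A = np.log(p_s) - np.log(1 - p_s) - np.log(p_t) + np.log(1 - p_t)
B = np.log(prior_s) - np.log(prior_t) + np.sum(np.log(1 - p_s) - np.log(1 - p_t))

x = rng.integers(0, 2, size=d)                         # a random binary feature vector
print(np.isclose(log_posterior_diff(x), A @ x + B))    # expected: True
```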

Task 4. Nearest Neighbors (1 point)

Consider the 1-nearest-neighbor classifier applied to a multiclass classification problem. Let's denote the classifier fitted on data $X$ by $f_X(\cdot)$.

The formula for complete leave-k-out cross-validation on data sample $X^{n}$ is defined as $$L_{k}OCV=\frac{1}{C_{n}^{k}}\bigg[\sum\limits_{X\in \mathcal{P}(X^{n})\wedge |X|=k}\frac{1}{k}\bigg(\sum_{i\in X}[y_{i}\neq f_{X^{n}\setminus X}( x_{i})]\bigg)\bigg],$$ where $\mathcal{P}(X^{n})$ is the set of all subsets of $X^{n}$. For all $i=1,2,\dots,n$ denote the label of the $m$-th closest neighbor of $x_{i}$ in $X^{n}\setminus \{x_{i}\}$ by $y_{i}^{m}$. Show that $$L_{k}OCV=\sum_{m=1}^{k}\underbrace{\frac{1}{n}\sum_{i=1}^{n}[y_{i}\neq y_{i}^{m}]}_{K_{m}(X^{n})}\frac{C_{n-1-m}^{n-k-1}}{C_{n-1}^{k-1}},$$ where $K_{m}(X^{n})$ is called the compactness profile of $X^{n}$, i.e. the fraction of objects whose $m$-th nearest neighbor has a different label. For convenience assume that all the pairwise distances between objects are different.
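
Before proving the identity, it can be sanity-checked by brute force on a tiny sample (a sketch, not a proof; the data and helper names are mine): the left-hand side is computed by enumerating all held-out subsets of size $k$, the right-hand side via the compactness profile.

```python
import numpy as np
from itertools import combinations
from math import comb

rng = np.random.default_rng(2)
n, k = 8, 3
X = rng.normal(size=(n, 1))        # 1-D objects: pairwise distances are distinct a.s.
y = rng.integers(0, 3, size=n)     # labels from 3 classes

def one_nn(train, x):
    """1-NN prediction using only the objects with indices in `train`."""
    train = np.asarray(train)
    return y[train[np.argmin(np.abs(X[train, 0] - x))]]

# left-hand side: average error over all C(n, k) held-out subsets of size k
lhs = 0.0
for held_out in combinations(range(n), k):
    train = [i for i in range(n) if i not in held_out]
    lhs += np.mean([y[i] != one_nn(train, X[i, 0]) for i in held_out])
lhs /= comb(n, k)

# right-hand side: compactness profile K_m weighted by the combinatorial factor
rhs = 0.0
for m in range(1, k + 1):
    K_m = 0.0
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: abs(X[j, 0] - X[i, 0]))
        K_m += y[i] != y[others[m - 1]]   # label of the m-th nearest neighbour of x_i
    K_m /= n
    rhs += K_m * comb(n - 1 - m, n - k - 1) / comb(n - 1, k - 1)

print(lhs, rhs, np.isclose(lhs, rhs))     # the two values should coincide
```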

Your solution:

Task 5. Bootstrap (1 point)

Let the subsample $\hat{X}^{n}$ of size $n$ be generated from the sample $X^{n}=\{\boldsymbol{x}_{1},\dots,\boldsymbol{x}_{n}\}$ by the bootstrap procedure. Find the probability that object $x_{i}$ is present in $\hat{X}^{n}$ and compute the limit of this probability as $n\rightarrow\infty$.

Your solution:

To calculate the probability that object $x_{i}$ is present in $\hat{X}^{n}$, it is easier to first calculate the probability that $x_{i}$ is not present in $\hat{X}^{n}$. The bootstrap sample consists of $n$ independent draws with replacement, and at each draw the probability of not picking $x_{i}$ is $\frac{n-1}{n}$:
$$ \Pr\Big( \boldsymbol{x_i} \notin \hat{X}^n\Big) = \Big(\frac{n-1}{n}\Big)^n = \Big(1 - \frac{1}{n}\Big)^n $$ So $$ \Pr\Big( \boldsymbol{x_i} \in \hat{X}^n\Big) = 1 - \Pr\Big( \boldsymbol{x_i} \notin \hat{X}^n\Big) = \boxed{1 - \Big(1 - \frac{1}{n}\Big)^n} $$ Since $\big(1 - \frac{1}{n}\big)^n \rightarrow e^{-1}$ as $n$ tends to infinity: $$ \lim_{n \rightarrow \infty} \Pr\Big( \boldsymbol{x_i} \in \hat{X}^n\Big) = \boxed {1 - \frac{1}{e}} \approx 0.632 $$
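
A quick Monte-Carlo check of both the exact probability and the limit (a sketch; sample size and number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 100, 20000

# how often does object 0 appear in a bootstrap sample of size n drawn with replacement?
hits = sum(0 in rng.integers(0, n, size=n) for _ in range(trials)) / trials

print(hits)                       # Monte-Carlo estimate
print(1 - (1 - 1 / n) ** n)       # exact probability for this n
print(1 - np.exp(-1))             # the limit 1 - 1/e, roughly 0.632
```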

Task 6. Decision Tree Leaves (1+1+1=3 points)

Consider a leaf of a binary decision tree which consists of object-label pairs $(x_{1},y_{1}),\dots,(x_{n},y_{n})$. The prediction $\hat{y}$ of this leaf is defined to minimize the loss on the training sample, i.e. $\frac{1}{n}\sum_{i=1}^{n}L(y_{i}, \hat{y})\rightarrow\min$. We consider three cases:

  1. Regression tree ($y_{i}\in\mathbb{R}$), absolute loss function $L(y,\hat{y})=|y-\hat{y}|$. The optimal prediction that minimizes $\frac{1}{n}\sum_{i=1}^{n}|y_{i}-\hat{y}|$ is the median of labels: $$\hat{y}=\text{median}(y_{1},\dots,y_{n}).$$ In this case, for simplicity you may assume that $n$ is even (or odd, as you wish).
  2. Regression tree ($y_{i}\in\mathbb{R}$), squared loss function $L(y,\hat{y})=(y-\hat{y})^{2}$. The optimal prediction that minimizes $\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\hat{y})^{2}$ is the mean of labels: $$\hat{y}=\frac{1}{n}\sum_{i=1}^{n}y_i.$$
  3. Classification tree for K classes ($y_{i}\in\{1,2,\dots,K\}$), zero-one loss $L(y,\hat{y})=[y\neq \hat{y}]$. The optimal prediction that minimizes $\frac{1}{n}\sum_{i=1}^{n}[y_{i}\neq\hat{y}]$ is the most frequent label: $$\hat{y}=\underset{k=1,2,\dots,K}{\operatorname{argmax}}\frac{1}{n}\sum_{i=1}^{n}[y_{i}=k].$$ In this case, for simplicity you may assume that there is only one most frequent label.

Suppose that instead of using the optimal prediction for this leaf we output a label from $y_{1},\dots,y_{n}$ at random. What will happen with the (expected) loss on the training sample (will it increase/decrease/not change)? Prove your answer (separately for every case).
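
A quick numerical illustration (a sketch with invented labels, not the required proof; it assumes the random label is drawn uniformly from the leaf): compare the training loss of each optimal prediction with the expected loss of a randomly drawn label.

```python
import numpy as np

rng = np.random.default_rng(4)
y_reg = rng.normal(size=11)              # regression labels
y_clf = rng.integers(1, 4, size=11)      # classification labels from {1, 2, 3}

def expected_random_loss(y, loss):
    """Average training loss when the prediction is a uniformly random label from y."""
    return np.mean([np.mean(loss(y, y_j)) for y_j in y])

abs_loss = lambda y, p: np.abs(y - p)
sq_loss = lambda y, p: (y - p) ** 2
zo_loss = lambda y, p: (y != p).astype(float)

print(np.mean(abs_loss(y_reg, np.median(y_reg))), expected_random_loss(y_reg, abs_loss))
print(np.mean(sq_loss(y_reg, np.mean(y_reg))), expected_random_loss(y_reg, sq_loss))
most_frequent = np.bincount(y_clf).argmax()
print(np.mean(zo_loss(y_clf, most_frequent)), expected_random_loss(y_clf, zo_loss))
# in every pair the first number (optimal prediction) is <= the second (random label)
```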

Your solution:

Task 7*. Classification (1 point)

Let objects $\boldsymbol{x}_{1},\dots,\boldsymbol{x}_{n}$ have binary labels $y_{1}, y_{2},\dots,y_{n}\in\{0,1\}$. Let the classifier $f$ assign to objects probabilities of being from class $1$. Without loss of generality assume that $f(\boldsymbol{x_{i}})<f(\boldsymbol{x_{j}})$ for all $i<j$. Define the fraction of concordant pairs by $$\text{FCP}(f, X^{n})=\frac{2}{n(n-1)}\sum_{i<j}[y_{i}\leq y_{j}].$$ Show that this value is equal to the Area Under the ROC curve of classifier $f$.

Your solution:

Task 8. Kernel Regression (1 point)

Recall that the prediction of Kernel Ridge Regression fitted on data $X^{n}$ with the kernel $K(\cdot, \cdot)$ has the form $\mathcal{K}(x)=\sum_{i=1}^{n}\alpha_{i}K(x, x_{i})$, where $\alpha=(K+\lambda I)^{-1}Y$ ($K_{ij}=K(x_{i},x_{j})$). The time complexity of computation of a prediction $\mathcal{K}(x)$ for any point $x$ is $O(n)$, i.e. grows linearly with the size of the training set.

Consider the bilinear kernel $K(x, x')=\langle x, x'\rangle$. For this kernel, Kernel Regression is known to turn into simple linear ridge regression. However, for linear regression the computation time of the prediction $\mathcal{R}(x)=\sum_{j=1}^{d}\beta_{j}x^{j}$ is $O(d)$, where $d$ is the dimension of the feature space and does not grow with the training set size, which seems to contradict the previous paragraph.

In this task, in order to show that kernel regression with the bilinear kernel is indeed linear ridge regression, you have to prove by direct comparison that the predictions exactly match.

Your solution:

Linear ridge regression gives the following prediction: $$ \hat{f}(x) = x \Big( X^T X + \lambda I \Big)^{-1} X^T Y $$ With the kernel given in the problem, the kernel regression prediction is: $$ \hat{f}(x) = x X^T \Big(XX^T + \lambda I \Big)^{-1} Y $$ So let's show that $\Big( X^T X + \lambda I \Big)^{-1} X^T$ is equal to $X^T \Big(XX^T + \lambda I \Big)^{-1}$:

To do so, multiply both sides by $\lambda I + X^T X$ on the left and by $\lambda I + X X^T$ on the right; since both of these matrices are invertible (for $\lambda > 0$), the resulting identity is equivalent to the original one:

$$ \Big(\lambda I + X^TX\Big) X^T \overset{?}{=} X^T \Big(\lambda I + X X^T\Big) \\ \, \\ \lambda X^T + X^T X X^T = \lambda X^T + X^T X X^T $$

q.e.d.
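
As a numerical sanity check of this identity and of the matching predictions (a sketch with random data; all variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, lam = 30, 5, 0.9
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)
x_new = rng.normal(size=d)

# linear ridge regression prediction: x (X^T X + lambda I)^{-1} X^T Y
ridge_pred = x_new @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# kernel ridge prediction with the bilinear kernel K(x, x') = <x, x'>
K = X @ X.T
alpha = np.linalg.solve(K + lam * np.eye(n), Y)
kernel_pred = np.sum(alpha * (X @ x_new))

print(np.isclose(ridge_pred, kernel_pred))   # expected: True
```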

Task 9. Kernel Methods (1 point)

Prove that the function $K(x,x')=\exp(-\|x-x'\|^{2})$ is a positive definite kernel.

Your solution:

Let's decompose the given kernel: $$ K(x,x') = \exp(-\|x-x'\|^2) = \exp\Big(-(x- x')^T(x-x')\Big) = \exp(-x^T x + 2x^T x' - {x'}^T x')$$ Let's denote the left and right factors by $f(x) = \exp(-x^Tx)$: $$ K(x,x') = f(x) \cdot \exp(2x^T x') \cdot f(x') $$ Now let's show that $\exp(2x^T x')$ is a positive-semidefinite kernel: $$ \exp(2x^T x') = \sum_{k=0}^{\infty} \frac{(2x^T x')^k}{k!} $$ The matrix corresponding to the kernel $2x^T x'$ is a (scaled) Gram matrix, which is positive-semidefinite. Products of positive-semidefinite kernels are positive-semidefinite, so every term $(2x^T x')^k$ is a positive-semidefinite kernel; sums with non-negative coefficients $\frac{1}{k!}$ and the pointwise limit of the series preserve positive-semidefiniteness, hence the matrix corresponding to the kernel $\exp(2x^T x')$ is positive-semidefinite as well.

Now let's go back to the original kernel $K(x,x')$ and its matrix $\mathbf{K}$ and pick any vector $c \in \mathbb{R}^{n}$: $$ c^T \mathbf{K} c = \sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) = \sum_{i=1}^n \sum_{j=1}^n (c_i f(x_i)) (c_j f(x_j)) \exp(2x_i^T x_j) = \sum_{i=1}^n \sum_{j=1}^n c_i'c_j' \exp(2x_i^T x_j) \geq 0, $$ where $c_i' = c_i f(x_i)$. The last inequality holds because the kernel $\exp(2x^T x')$ is positive-semidefinite; therefore $c^T \mathbf{K} c \geq 0$ for any $c$ and $K(x, x')$ is positive-semidefinite, q.e.d.
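
A numerical sanity check (a sketch with random points): the Gram matrix of this kernel should have no negative eigenvalues, up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 3))

# Gram matrix of K(x, x') = exp(-||x - x'||^2) on 40 random points
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
G = np.exp(-sq_dists)

# eigenvalues of a positive-semidefinite symmetric matrix are non-negative
print(np.linalg.eigvalsh(G).min() >= -1e-10)   # expected: True
```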

Task 10*. Support Vector Machine (1 point)

Show that for the two-class SVM classifier the following upper bound on the leave-one-out cross-validation error estimate holds true: $$L_{1}OCV=\frac{1}{n}\sum_{i=1}^{n}[y_{i}\neq f_{i}(x_{i})]\leq \frac{|SV|}{n},$$ where for all $i=1,\dots,n$, $f_{i}$ is the SVM fitted on the entire data without observation $(x_{i},y_{i})$, and $|SV|$ is the number of support vectors of the SVM fitted on the entire data.
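
An empirical illustration of this bound with scikit-learn (a sketch, not a proof; the synthetic dataset and the SVM parameters are arbitrary choices of mine):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=60, n_features=5, random_state=0)

# leave-one-out error of the SVM
clf = SVC(kernel='linear', C=1.0)
loo_error = 1.0 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

# fraction of support vectors of the SVM fitted on the entire data
clf.fit(X, y)
sv_fraction = len(clf.support_) / len(y)

print(loo_error, sv_fraction, loo_error <= sv_fraction)   # the bound should hold: True
```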

Your solution:

The end.