Home Assignment No. 3: Part 2

In this part of the homework you are to solve several problems related to machine learning algorithms.

  • Your solution must be COMPLETE, i.e. contain all required formulas/proofs/detailed explanations.
  • You must write your solution for any problem right after the words BEGIN SOLUTION. Attaching pictures of your handwriting is allowed, but highly discouraged.
  • If you want an easy life, you have to use BUILT-IN METHODS of the sklearn library instead of writing tons of your own code. There exists a class/method for almost everything you can imagine (related to this homework).
  • To do some tasks in this part of the homework, you have to write CODE directly in the specified places inside notebook CELLS.
  • In some problems you may be asked to provide a short discussion of the results. In these cases you have to create a MARKDOWN cell with your comments right after your code cell.
  • Your SOLUTION notebook MUST BE REPRODUCIBLE, i.e. if the reviewer decides to execute Kernel -> Restart Kernel and Run All Cells, after all the computation they will obtain exactly the same solution (with all the corresponding plots) as in your uploaded notebook. For this purpose, we suggest fixing the random seed or (better) defining random_state= inside every algorithm that uses pseudorandomness.

  • Your code must be clear to the reviewer. For this purpose, try to include the necessary comments inside the code. But remember: GOOD CODE MUST BE SELF-EXPLANATORY without any additional comments.

  • There are problems marked with * - they are not obligatory. You can get EXTRA POINTS for solving them.

## $\LaTeX$ in Jupyter

Jupyter has constantly improving $\LaTeX$ support. Below are the basic methods to write neat, tidy, and well typeset equations in your notebooks:
  • to write an inline equation use
    $ your latex equation here $
    
  • to write an equation, that is displayed on a separate line use
    $$ your latex equation here $$
    
  • to write a block of equations use
    \begin{align}
      left-hand-side
          &= right-hand-side on line 1
          \\
          &= right-hand-side on line 2
          \\
          &= right-hand-side on the last line
    \end{align}
    
    The ampersand (&) aligns the equations horizontally and the double backslash (\\) creates a new line.
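
For example, a filled-in block with a concrete (purely illustrative) derivation would look like:

\begin{align}
  (a + b)^2
      &= (a + b)(a + b)
      \\
      &= a^2 + 2ab + b^2
\end{align}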

Write your theoretical derivations within such blocks:

**BEGIN Solution**

<!-- >>> your derivation here <<< -->

**END Solution**

Please, write your implementation within the designated blocks:

...
### BEGIN Solution

# >>> your solution here <<<

### END Solution
...


Clustering

Task 1 (1 + 2 = 3 pt.): Practice with Different Clustering Algorithms

Task 1.1 (1 pt.)

Try KMeans, Gaussian Mixture and two more clustering algorithms from sklearn. To evaluate clustering performance use two clustering metrics: silhouette score and adjusted mutual information.

For each algorithm your task is to try to find the parameters leading to the best performance for n_clusters=true_number_of_clusters:

  1. Apply the algorithm with the true number of clusters and at least two other settings for the number of clusters: a smaller and a larger number than the true one;
  2. For each number of clusters visualize the clustering result, calculate the clustering metrics and plot them;

In [20]:
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans, Birch, AgglomerativeClustering
from sklearn.metrics.cluster import adjusted_mutual_info_score
from sklearn.mixture import GaussianMixture

In [168]:
points, labels = np.loadtxt('data/clustering.txt'), np.loadtxt('data/clustering_labels.txt')
labels = labels.astype(int)
print("True number of clusters is {}".format(np.max(labels)))
plt.figure(figsize=(10, 10))
plt.scatter(points[:,0], points[:,1], c=labels, alpha=0.3, edgecolor='k')
plt.show()


True number of clusters is 15

In [169]:
### BEGIN Solution

min_num_of_clusters = 13
max_num_of_clusters = 17

f, axes = plt.subplots(1, 2, sharex=True, figsize=(15, 7))
axes[0].set_title('silhouette_score')
axes[1].set_title('mutual_info_score')

f2, axes2 = plt.subplots(4, max_num_of_clusters - min_num_of_clusters + 1, sharex=True, sharey=True, figsize=(15, 7))
for i in range(max_num_of_clusters - min_num_of_clusters + 1):
    axes2[0][i].set_title('KMeans, clusters = ' + str(i + min_num_of_clusters))
    axes2[0][i].axis('off')
    axes2[1][i].set_title('GaussMixt, clusters = ' + str(i + min_num_of_clusters))
    axes2[1][i].axis('off')
    axes2[2][i].set_title('Birch, clusters = ' + str(i + min_num_of_clusters))
    axes2[2][i].axis('off')
    axes2[3][i].set_title('AgglClust, clusters = ' + str(i + min_num_of_clusters))
    axes2[3][i].axis('off')
                  
silhouette_kmeans = []
silhouette_gaussianmixture = []
silhouette_birch = []
silhouette_agglomerativeclust = []
                  
mi_kmeans = []
mi_gaussianmixture = []
mi_birch = []
mi_agglomerativeclust = []
                  
for n_cluster in range(min_num_of_clusters, max_num_of_clusters + 1):
    # KMeans
    kmeans = KMeans(n_clusters=n_cluster, random_state=101, n_jobs=-1)
    kmeans_labels = kmeans.fit_predict(points)
    silhouette_kmeans.append(silhouette_score(points, kmeans_labels))
    mi_kmeans.append(adjusted_mutual_info_score(labels, kmeans_labels, average_method='arithmetic'))   
    axes2[0][n_cluster - min_num_of_clusters].scatter(points[:,0], points[:,1], c=kmeans_labels, alpha=0.3, edgecolor='k')
    
    # Gaussian Mixture
    gaussian_mixture = GaussianMixture(n_components=n_cluster, random_state=101)
    gm_labels = gaussian_mixture.fit_predict(points)
    silhouette_gaussianmixture.append(silhouette_score(points, gm_labels))
    mi_gaussianmixture.append(adjusted_mutual_info_score(labels, gm_labels, average_method='arithmetic'))  
    axes2[1][n_cluster- min_num_of_clusters].scatter(points[:,0], points[:,1], c=gm_labels, alpha=0.3, edgecolor='k')
    
    # Birch
    birch = Birch(n_clusters=n_cluster)
    birch_labels = birch.fit_predict(points)
    silhouette_birch.append(silhouette_score(points, birch_labels))
    mi_birch.append(adjusted_mutual_info_score(labels, birch_labels, average_method='arithmetic'))
    axes2[2][n_cluster - min_num_of_clusters].scatter(points[:,0], points[:,1], c=birch_labels, alpha=0.3, edgecolor='k')
    
    # Agglomerative Clustering
    aggl_clust = AgglomerativeClustering(n_clusters=n_cluster)
    aggl_clust_labels = aggl_clust.fit_predict(points)
    silhouette_agglomerativeclust.append(silhouette_score(points, aggl_clust_labels))
    mi_agglomerativeclust.append(adjusted_mutual_info_score(labels, aggl_clust_labels, average_method='arithmetic'))
    axes2[3][n_cluster - min_num_of_clusters].scatter(points[:,0], points[:,1], c=aggl_clust_labels, alpha=0.3, edgecolor='k')
                  
axes[0].plot(list(range(min_num_of_clusters, max_num_of_clusters + 1)), silhouette_kmeans, label="KMeans")
axes[1].plot(list(range(min_num_of_clusters, max_num_of_clusters + 1)), mi_kmeans, label="KMeans")      

axes[0].plot(list(range(min_num_of_clusters, max_num_of_clusters + 1)), silhouette_gaussianmixture, label="GaussianMixture")
axes[1].plot(list(range(min_num_of_clusters, max_num_of_clusters + 1)), mi_gaussianmixture, label="GaussianMixture")

axes[0].plot(list(range(min_num_of_clusters, max_num_of_clusters + 1)), silhouette_birch, label="Birch")
axes[1].plot(list(range(min_num_of_clusters, max_num_of_clusters + 1)), mi_birch, label="Birch")

axes[0].plot(list(range(min_num_of_clusters, max_num_of_clusters + 1)), silhouette_agglomerativeclust, label="AgglomerativeClustering")
axes[1].plot(list(range(min_num_of_clusters, max_num_of_clusters + 1)), mi_agglomerativeclust, label="AgglomerativeClustering")

axes[0].axvline(x=15, color='k', linestyle='--')
axes[1].axvline(x=15, color='k', linestyle='--')
axes[0].legend()
axes[1].legend()

plt.tight_layout()
plt.show()
                  
### END Solution



Task 1.2 (2 pt.): Finding the Number of Clusters with Bootstrap

For a chosen algorithm find the best number of clusters using bootstrap.

So, first, based on your observations made at the previous task, choose one algorithm assuming you do not have true labels and do not know the true number of clusters.

  1. (1 pt.) Estimate variance of the metric and construct normal 95% confidence intervals;
  2. (1 pt.) Plot the metric with the corresponding confidence intervals and choose the best number of clusters.
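
For reference, a normal-approximation 95% confidence interval built from $B$ bootstrap scores $s_1, \dots, s_B$ is

$$ \bar{s} \pm z_{0.975}\,\hat{\sigma}_{\bar{s}}, \qquad \bar{s} = \frac{1}{B}\sum_{b=1}^{B} s_b, \qquad \hat{\sigma}_{\bar{s}} = \sqrt{\frac{1}{B(B-1)}\sum_{b=1}^{B} (s_b - \bar{s})^2}, \qquad z_{0.975} \approx 1.96 $$

(the solution below uses the Student $t$ quantile instead of $z_{0.975}$, which gives a slightly wider interval for a small number of bootstrap samples).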

In [56]:
from tqdm import tqdm_notebook
from sklearn.utils import resample
from scipy.stats import sem, t
from numpy import mean

### BEGIN Solution

cluster_ns = [13,14,15,16,17]
sample_number = 10

score_clusters = []

for i_cn, n_clusters in tqdm_notebook(enumerate(cluster_ns)):
    score = []
    kmean = KMeans(n_clusters=n_clusters, random_state=101)

    for sample_index in tqdm_notebook(range(sample_number)):
        # use a different random_state per iteration so every bootstrap sample is different
        points_resampled = resample(points, replace=True, n_samples=points.shape[0],
                                    random_state=101 + sample_index)
        label_predicted = kmean.fit_predict(points_resampled)
        score.append(silhouette_score(points_resampled, label_predicted))
        
    score_clusters.append(score)

### END Solution




In [81]:
conf_interval_mean = []
conf_interval_low = []
conf_interval_upper = []
for score in score_clusters:
    m = mean(score)
    h = sem(score) * t.ppf((1 + 0.95) / 2, len(score) - 1)
    conf_interval_mean.append(m)
    conf_interval_low.append(m - h)
    conf_interval_upper.append(m + h)
    
fig = plt.figure(figsize=(10,6))
plt.fill_between(cluster_ns, conf_interval_low, conf_interval_upper,
                 color='b', alpha=.5)
plt.plot(cluster_ns, conf_interval_mean, 'b')
plt.title("95% confidence interval", size=16)
plt.xticks(cluster_ns)
plt.xlabel("Number of clusters", size=16)
plt.ylabel("Silhouette score", size=16)
plt.axvline(x=15, color='k', linestyle='--')
plt.show()



Dimensionality Reduction and Manifold Learning

Task 2 (1 + 1 + 2 + 1 = 5 pt.)

Assume that you have $n$ points in $D$ dimensional space:

$$u_1, ..., u_n \in \mathbb R^D.$$

There exists a linear data transformation $F(u): \mathbb R^D \rightarrow \mathbb R^d$, $D \gg d$, such that:

$$(1-\delta) \|u_i - u_j\|^2 \leq \|F(u_i) - F(u_j)\|^2\leq (1+\delta) \|u_i - u_j\|^2$$

with high probability.

The transformation $F(u)$ is: $F(u) = \tfrac{1}{\sqrt d}Xu$, and $X \in \mathbb R^{d \times D}$ is a random matrix, components of which are independent identically distributed $X_{ij} \sim \mathcal{N}(0, 1)$.

This statement means that there exists a matrix that reduces the original dimensionality such that pairwise distances are not distorted much. This is a theoretical probabilistic result that guarantees you such a transformation. We will obtain a bound on $d$ for which this result holds true.
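
As a side note, sklearn ships this construction in sklearn.random_projection. A minimal sketch on synthetic data (the sizes and the distortion level here are arbitrary illustrative choices, and sklearn's johnson_lindenstrauss_min_dim uses a slightly different constant than the bound we derive below):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection, johnson_lindenstrauss_min_dim
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.RandomState(0)
U = rng.randn(100, 10000)        # n = 100 points in D = 10000 dimensions

d = johnson_lindenstrauss_min_dim(n_samples=100, eps=0.3)        # bound on d for distortion 0.3
proj = GaussianRandomProjection(n_components=d, random_state=0)  # entries ~ N(0, 1/d), i.e. X/sqrt(d)
V = proj.fit_transform(U)

# ratio of squared pairwise distances after/before projection, zero distances (diagonal) excluded
mask = ~np.eye(U.shape[0], dtype=bool)
ratio = (euclidean_distances(V)**2)[mask] / (euclidean_distances(U)**2)[mask]
print(ratio.min(), ratio.max())  # expected to stay roughly within [1 - 0.3, 1 + 0.3]
```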

Chernoff inequality. This inequality states a bound on the distribution tail.

$$\mathbb P(X \geq t) \leq \frac{\mathbb E \exp(\lambda X)}{\exp(\lambda t)}$$

Definition. Random variable $X$ is called subexponential with parameters $(\sigma^2, b)$, if for all $\lambda: |\lambda| < \frac{1}{b}$ the following is true:

$$\mathbb E \exp(\lambda (X - \mathbb E X)) \leq \exp\left(\frac{\lambda^2\sigma^2}{2}\right)$$

Fact. $\chi^2$ distribution with $d$ degrees of freedom is a sum of squares of $d$ independent standard gaussian random variables.

Fact. $\chi^2$ distribution with $d$ degrees of freedom is subexponential with parameters $\sigma^2 = 4d, b = 4$


Task 2.1 (1 pt.):

Using the above information, prove that for $Y \sim \chi^2$ with $d$ degrees of freedom the following inequality holds true:

$$\mathbb P (\lvert Y - d \rvert \geq t) \leq 2\exp\left(\frac{-t^2}{8d}\right)$$

for $t \leq \frac{\sigma^2}{b}$.

Hint: you will need to optimise the power of exponential in order to get optimal $\lambda$.

BEGIN Solution

$$ \mathbb{P}(| Y - d| \geq t ) = \mathbb{P} ( Y - d \geq t ) + \mathbb{P} ( Y - d \leq -t )$$

Using the facts above, the Chernoff bound and subexponentiality:

$$ \mathbb{P} (Y -d \geq t ) \leq \frac{\mathbb{E} \exp (\lambda (Y -d))}{\exp(\lambda t)} = \exp \Big[ \frac{\lambda^2 \sigma^2 }{2} - \lambda t \Big] = f(\lambda)$$

To get the tightest upper bound we need to find the minimizer $\hat{\lambda}$:

$$ \hat{\lambda} = \arg \min_{\lambda} f(\lambda) $$

Here we go:

$$ \frac{\partial{f(\lambda)}}{\partial \lambda} = 0, \, \exp \Big[...\Big] \Big[ \lambda \sigma^2 - t \Big] = 0 \to \hat{\lambda} = \frac{t}{\sigma^2} $$

Recalling the subexponentiality constraint $|\lambda| < \frac{1}{b}$, the optimum is admissible when:

$$ \hat{\lambda} = \frac{t}{\sigma^2} \leq \frac{1}{b} \;\Longleftrightarrow\; t \leq \frac{\sigma^2}{b} $$

So:

$$ \mathbb{P} (Y - d \geq t) \leq \exp \Big[ \frac{\hat{\lambda}^2 \sigma^2}{2} - \hat{\lambda} t \Big] = \exp \Big[ -\frac{t^2}{2 \sigma^2}\Big] = \exp \Big[ - \frac{t^2}{8d} \Big]$$

Since the subexponential bound holds for all $|\lambda| < \frac{1}{b}$, i.e. for both signs of $\lambda$, the same argument applied to $-(Y - d)$ gives:

$$ \mathbb{P} (Y - d \leq -t) = \mathbb{P} (-(Y - d) \geq t) \leq \exp \Big[ \frac{\hat{\lambda}^2 \sigma^2}{2} - \hat{\lambda} t \Big] = \exp \Big[ -\frac{t^2}{2 \sigma^2}\Big] = \exp \Big[ - \frac{t^2}{8d} \Big]$$

Finally:

$$ \mathbb{P}(| Y - d| \geq t ) \leq 2 \exp \Big[ - \frac{t^2}{8d} \Big], \, t \leq \frac{\sigma^2} {b}$$

qed

END Solution


Task 2.2 (1 pt.):

Prove that $\frac{\|Xu\|^2}{\|u\|^2}$ is a $\chi^2$ random variable with $d$ degrees of freedom.

BEGIN Solution

We know that $\chi^2$ distribution with $d$ degrees of freedom is a sum of squares of $d$ independent standard gaussian random variables. So:

$$ \frac{\|Xu\|^2}{\|u\|^2} = \frac{u^T X^T X u}{\|u\|^2} = \sum^d_{i=1} \Big(\frac{X_i u}{\|u\|}\Big)^2, \quad \mbox{where } X_i \mbox{ is the } i\mbox{-th row of } X $$

So we need to show that every coordinate $\frac{X_i u}{\|u\|}$ of the vector $\frac{Xu}{\|u\|} \in \mathbb{R}^d$ is a standard normal random variable (the coordinates are independent because the rows of $X$ are independent).

Expanding one coordinate:

$$ \frac{X_i u}{\|u\|} = \sum_{j=1}^D \frac{X_{ij} u_j }{\|u\|} = \sum_{j=1}^D \frac{u_j}{\|u\|} X_{ij} $$

By assumption, $X_{ij} \sim \mathcal{N}(0, 1)$.

So we have to prove that this linear combination of the $X_{ij}$ is normally distributed as well, and compute its mean and variance.

Let's break this proof into two parts. In the first part we prove that:

  1. $ \mathbb{E} [aX + b] = a \mathbb{E}[X] + b $
  2. $ \mathrm{Var}[aX + b] = a^2 \mathrm{Var}[X] $

Proof:

  1. Let $g(X) = aX + b$. Then
     $$\begin{align}
     \mathbb{E}[g(X)]
         &= \int (ax + b) f_X (x)\, dx
         \\
         &= a \int x f_X (x)\, dx + b \int f_X (x)\, dx
         \\
         &= a\mathbb{E}[X] + b
     \end{align}$$

  2. Let $g(X) = aX + b$. Then
     $$\begin{align}
     \mathrm{Var}[g(X)]
         &= \mathbb{E}[g(X)^2] - \mathbb{E}[g(X)]^2
         \\
         &= \mathbb{E}[(aX + b)^2] - (a\mathbb{E}[X] + b)^2
         \\
         &= a^2 \mathbb{E}[X^2] + 2ab\,\mathbb{E}[X] + b^2 - (a\mathbb{E}[X] + b)^2
         \\
         &= a^2 (\mathbb{E}[X^2] - \mathbb{E}[X]^2)
         \\
         &= a^2 \mathrm{Var}[X]
     \end{align}$$

In the second part we prove:

  1. The following equation holds for arbitrary random variables $X_1, X_2, ..., X_n$:
$$ \mathbb{E} [X_1 + X_2 + ... + X_n] = \mathbb{E}[X_1] + \mathbb{E}[X_2] + ... + \mathbb{E}[X_n] $$
  2. If $X_1, X_2, ..., X_n$ are independent, then
$$ \mathrm{Var} [X_1 + X_2 + ... + X_n] = \mathrm{Var}[X_1] + \mathrm{Var}[X_2] + ... + \mathrm{Var}[X_n] $$

Proof by induction:

First we prove for two arbitrary random variables $X$, $Y$ that $\mathbb{E}[X+Y] = \mathbb{E}[X] + \mathbb{E}[Y]$ ($f(x,y)$ is the joint PDF of $X$, $Y$):

$$\begin{align} \mathbb{E} [X+Y] &= \int_y \int_x (x+y) f(x,y)\, dx\, dy \\ &= \int_x x \int_y f(x,y)\, dy\, dx + \int_y y \int_x f(x,y)\, dx\, dy \\ &= \int_x x f_X(x)\, dx + \int_y y f_Y (y)\, dy \\ &= \mathbb{E}[X] + \mathbb{E}[Y] \end{align} $$

Suppose $$ \mathbb{E} \Big[ \sum_{i=1}^{k-1} X_i \Big] = \sum_{i=1}^{k-1} \mathbb{E} [X_i] $$

For simplicity define R.V. $ Y_{k-1} = \sum_{i=1}^{k-1} X_i $, then:

$$\begin{align} \mathbb{E} \Big[ \sum_{i=1}^{k} X_i \Big] &= \mathbb{E} [Y_{k-1} + X_k] \\ &= \mathbb{E} [Y_{k-1}] + \mathbb{E}[X_k] \\ &= \sum_{i=1}^{k-1} \mathbb{E} [X_i] + \mathbb{E}[X_k] \\ &= \sum_{i=1}^{k} \mathbb{E} [X_i] \end{align} $$

The proof for the variance is analogous (also by induction), additionally using independence of the variables.

From these results we can conclude that:

$$ \mathbb{E} [c_0 + c_1 X_1 + c_2 X_2 + ... + c_n X_n] = c_0 + c_1 \mathbb{E} [X_1] + c_2 \mathbb{E}[X_2] + ... + c_n \mathbb{E} [X_n] $$

$$ \mathrm{Var} [c_0 + c_1 X_1 + c_2 X_2 + ... + c_n X_n] = c_1^2 \mathrm{Var} [X_1] + c_2^2 \mathrm{Var}[X_2] + ... + c_n^2 \mathrm{Var} [X_n] $$

So a linear combination of the independent normal $X_{ij}$ is normal as well. In particular, $\frac{X_i u}{\|u\|}$ has mean $0$ and variance $\sum_{j=1}^D \frac{u_j^2}{\|u\|^2} = 1$, i.e. it is $\mathcal{N}(0, 1)$, and the coordinates are independent across the rows of $X$. Hence $\frac{\|Xu\|^2}{\|u\|^2}$ is a sum of squares of $d$ independent standard Gaussian random variables, i.e. a $\chi^2$ random variable with $d$ degrees of freedom, qed

END Solution


Task 2.3 (2 pt.)

Having all of the previous results, we now may apply them to get the bound.

From the tail inequality obtained in Task 2.1 we get that:

$$\mathbb P \left(\left\lvert \frac{\|Xu\|^2}{\|u\|^2} - d \right\rvert \geq t\right) \leq 2\exp\left(\frac{-t^2}{8d}\right)$$

This means that the probability of the event that our distances change a lot is bounded.

  1. Show that if the probability above is small, then the probability of $$(1-\delta) \|u_i - u_j\|^2 \leq \left\|\frac{1}{\sqrt d} X(u_i - u_j)\right\|^2\leq (1+\delta) \|u_i - u_j\|^2$$ holding for all pairs $i, j$ simultaneously is large, namely at least $1 - n(n-1)\exp \left(-\frac{\delta^2d}{8}\right)$.

  2. From the previous step: if we want our inequalities to hold with high probability $1-\varepsilon$, we need to bound the failure probability from above by $\varepsilon$. Derive the resulting inequality for $d$.

Hint:

  • at some point you will want to take $\delta = \frac{t}{d}$. Note that this keeps $\delta$ in the range $(0, 1)$, since $t \leq \frac{\sigma^2}{b} = d$

BEGIN Solution

  1. We want to bound
     $$ \mathbb{P} \Big[ \forall i,j: (1-\delta) \| u_i -u_j \|^2 \leq \frac{1}{d} \| X(u_i - u_j) \|^2 \leq (1+\delta) \| u_i - u_j\|^2 \Big] $$
     Let's first rewrite the inner inequalities for a fixed pair $i,j$:
     $$ (1-\delta) \| u_i -u_j \|^2 \leq \frac{1}{d} \| X(u_i - u_j) \|^2 \leq (1+\delta) \| u_i - u_j\|^2 $$
     $$ d - d\delta \leq \frac{\|X(u_i - u_j) \|^2}{\|u_i - u_j \|^2} \leq d + d\delta $$
     $$ \Big| \frac{\|X(u_i - u_j) \|^2}{\|u_i - u_j \|^2} - d \Big| \leq d\delta $$
     Therefore:
     $$ \mathbb{P} \Big[ \forall i,j: (1-\delta) \| u_i -u_j \|^2 \leq \frac{1}{d} \| X(u_i - u_j) \|^2 \leq (1+\delta) \| u_i - u_j\|^2 \Big] = \mathbb{P} \Big[ \forall i,j: \Big| \frac{\|X(u_i - u_j) \|^2}{\|u_i - u_j \|^2} - d \Big| \leq d\delta \Big] $$
     $$ = 1 - \mathbb{P} \Big[ \exists i,j: \Big| \frac{\|X(u_i - u_j) \|^2}{\|u_i - u_j \|^2} - d \Big| \geq d\delta \Big] \geq 1 - \frac{n(n-1)}{2}\, \mathbb{P} \Big[ \mbox{for fixed } i,j: \Big| \frac{\|X(u_i - u_j) \|^2}{\|u_i - u_j \|^2} - d \Big| \geq d\delta \Big] $$
     where the last step is a union bound over the $\frac{n(n-1)}{2}$ pairs. By Task 2.2 the ratio is a $\chi^2$ random variable with $d$ degrees of freedom, so Task 2.1 with $t = d\delta$ gives:
     $$ \mathbb{P} \Big[ \mbox{for fixed } i,j: \Big| \frac{\|X(u_i - u_j) \|^2}{\|u_i - u_j \|^2} - d \Big| \geq d\delta \Big] \leq 2\exp \Big[ - \frac{\delta^2 d^2}{8d}\Big] = 2\exp \Big[ - \frac{\delta^2 d}{8}\Big] $$
     Finally:
     $$ \mathbb{P} \Big[ \forall i,j: (1-\delta) \| u_i -u_j \|^2 \leq \frac{1}{d} \| X(u_i - u_j) \|^2 \leq (1+\delta) \| u_i - u_j\|^2 \Big] \geq \boxed{ 1 - n(n-1)\exp \Big[ - \frac{\delta^2 d}{8}\Big]} $$
  2. Let's derive the condition on $d$ from the requirement
     $$ 1-\varepsilon \leq 1 - n(n-1)\exp \Big[ - \frac{\delta^2 d}{8}\Big] $$
     $$ \varepsilon \geq n(n-1)\exp \Big[ - \frac{\delta^2 d}{8} \Big] $$
     $$ \frac{\varepsilon}{n(n-1)} \geq \exp \Big[ -\frac{\delta^2 d}{8} \Big] $$
     $$ \ln\Big[ \frac{\varepsilon}{n(n-1)} \Big] \geq - \frac{\delta^2 d}{8} $$
     Hence:
     $$ \boxed{d \geq - \frac{8}{\delta^2} \ln \Big[ \frac{\varepsilon}{n(n-1)} \Big] = \frac{8}{\delta^2} \ln \Big[ \frac{n(n-1)}{\varepsilon} \Big]} $$

END Solution


Task 2.4 (1 pt.)

We have successfully obtained a lower bound on the dimensionality we can safely reduce to, such that pairwise distances do not change much:

$$d \geq \frac{8}{\delta^2}\log\left(\frac{n(n-1)}{\varepsilon}\right)$$

Note the beauty of this inequality: it does not depend on the original dimensionality $D$ at all, and the parameters $n$ (the number of samples) and $\varepsilon$ (the failure probability) enter only under the $\log$ function.

This bound is not very tight; using more advanced techniques you may improve it. That means our estimate of $d$ may be higher than necessary, but we can guarantee the result for it. Also remember that this approach is probabilistic: the guarantee holds with high probability over the random matrix, so it essentially depends on how lucky you are with the draw. It is very useful in the case of extremely large dimensionalities $D$, and allows reducing the dimensionality while preserving the structure of the data.
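
As a quick sanity check of the magnitude, plugging the setting used below ($n = 400$, $\delta = 0.15$, $\varepsilon = 0.01$) into the bound gives

$$ d \;\geq\; \frac{8}{0.15^2}\,\log\!\left(\frac{400 \cdot 399}{0.01}\right) \;\approx\; 355.6 \cdot 16.6 \;\approx\; 5.9 \cdot 10^{3}, $$

which is still far below the original dimensionality $D$ of the vectorized 20news subset used below.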

Now implement the obtained result and confirm that it holds true:

  1. Fetch 20news dataset
  2. Check the dimensionality of data
  3. Generate a random matrix of the corresponding size. Note that this might take a while (or you may run out of memory).
  4. Fix $\delta$ = 0.15 and $\varepsilon$ = 0.01
  5. Show that the distances of the transformed data are within the delta tube. Write down 1-2 sentences to point out the ratio. Remember that our result is not applicable when a distance is 0.
  6. Show that for a smaller $d$ this result does not hold. You will have to take a much smaller $d$ in order to show that (ten times smaller, for example). Write down 1-2 sentences describing the result.

In [3]:
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.metrics.pairwise import euclidean_distances
from numpy.random import standard_normal

X = fetch_20newsgroups_vectorized().data[:400]

In [9]:
### BEGIN Solution

n, D = X.shape
delta = 0.15
epsilon = 0.01
min_d = int(-np.ceil(8/delta**2 * np.log(epsilon/(n*(n-1)))))

print("Number of samples ", n)
print("Dimensionality of each sample", D)
print("Min dimensionaly of reduced space", min_d)

d = min_d + 50

np.random.seed(101)
RM = np.random.standard_normal(size=(d, D))
F = 1/np.sqrt(d) * X @ RM.T

# check that the squared pairwise distances stay within the delta tube
X_dist = euclidean_distances(X, X)**2
F_dist = euclidean_distances(F, F)**2
print('For d = ', d, 'left inequalities hold:', ((1-delta)*X_dist <= F_dist).all())
print('For d = ', d, 'right inequalities hold:', ((1+delta)*X_dist >= F_dist).all())

d = min_d // 10

RM = np.random.standard_normal(size=(d, D))
F = 1/np.sqrt(d) * X @ RM.T

X_dist = euclidean_distances(X, X)**2
F_dist = euclidean_distances(F, F)**2
print('For d = ', d, 'left inequalities hold:', ((1-delta)*X_dist <= F_dist).all())
print('For d = ', d, 'right inequalities hold:', ((1+delta)*X_dist >= F_dist).all())

### END Solution


Number of samples  400
Dimensionality of each sample 130107
Min dimensionaly of reduced space 5897
For d =  5947 left inequalities hold: True
For d =  5947 right inequalities hold: True
For d =  589 left inequalities hold: False
For d =  589 right inequalities hold: False
  1. We derived that for $d \geq \frac{8}{\delta^2}\log\left(\frac{n(n-1)}{\varepsilon}\right)$ there exists a linear data transformation $F(u): \mathbb R^D \rightarrow \mathbb R^d$, $D \gg d$, such that: $$(1-\delta) \|u_i - u_j\|^2 \leq \|F(u_i) - F(u_j)\|^2\leq (1+\delta) \|u_i - u_j\|^2$$ So we took $d = d_{\min} + 50$ and observed that the aforementioned inequalities indeed hold.
  2. Then we took $d = \frac{d_{\min}}{10}$ and observed that the aforementioned inequalities no longer hold.


Task 3 (2 + 3 = 6 pt.). MNIST principal component analysis

Task 3.1 (2 pt.)

Download the MNIST dataset. Here we will estimate the data's intrinsic dimensionality, decompose the dataset into principal components and estimate the cumulative variance explained by the components.

You can either use the method proposed here https://www.stat.berkeley.edu/~bickel/mldim.pdf for the sample intrinsic dimension estimation, or any other method known to you. For the first case, you can use implemented code from here https://gist.github.com/mehdidc/8a0bb21a31c43b0cbbdd31d75929b5e4
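
For reference, the estimator implemented in that gist (the Levina-Bickel MLE with a small-sample correction, dividing by $k-2$) computes, for every point $x$,

$$ \hat{m}_k(x) = \left[ \frac{1}{k-2} \sum_{j=1}^{k-1} \log \frac{T_k(x)}{T_j(x)} \right]^{-1}, $$

where $T_j(x)$ is the distance from $x$ to its $j$-th nearest neighbour; the per-point estimates are then averaged over the sample and inspected over a range of $k$.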

  1. Estimate the intrinsic dimensionality of the dataset. Provide the plot.
  2. Decompose the data into principal components (PCA). Plot the cumulative explained variance by each component.
  3. Plot an example of a reconstructed image with 6 different numbers of components and provide the corresponding explained variance. You can choose any digit from the dataset.

In [100]:
from sklearn import datasets
import matplotlib.pyplot as plt
from tqdm import tqdm
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()

data = StandardScaler().fit_transform(digits.data)

k1, k2 = 3, 70
### BEGIN Solution

def intrinsic_dim_sample_wise(X, k=5):
    neighb = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, ind = neighb.kneighbors(X)
    dist = dist[:, 1:]
    dist = dist[:, 0:k]
    assert dist.shape == (X.shape[0], k)
    assert np.all(dist > 0)
    d = np.log(dist[:, k - 1: k] / dist[:, 0:k-1])
    d = d.sum(axis=1) / (k - 2)
    d = 1. / d
    intdim_sample = d
    return intdim_sample

def intrinsic_dim_scale_interval(X, k1=10, k2=20):
    X = pd.DataFrame(X).drop_duplicates().values # remove duplicates in case you use bootstrapping
    intdim_k = []
    for k in range(k1, k2 + 1):
        m = intrinsic_dim_sample_wise(X, k).mean()
        intdim_k.append(m)
    return intdim_k

def repeated(func, X, nb_iter=100, random_state=None, verbose=0, mode='bootstrap', **func_kw):
    if random_state is None:
        rng = np.random
    else:
        rng = np.random.RandomState(random_state)
    nb_examples = X.shape[0]
    results = []

    iters = range(nb_iter)
    if verbose > 0:
        iters = tqdm(iters)    
    for i in iters:
        if mode == 'bootstrap':
            Xr = X[rng.randint(0, nb_examples, size=nb_examples)]
        elif mode == 'shuffle':
            ind = np.arange(nb_examples)
            rng.shuffle(ind)
            Xr = X[ind]
        elif mode == 'same':
            Xr = X
        else:
            raise ValueError('unknown mode : {}'.format(mode))
        results.append(func(Xr, **func_kw))
    return results

result = np.array(repeated(intrinsic_dim_scale_interval, data, nb_iter=20, random_state=101, k1=k1, k2=k2)).T

In [47]:
from scipy.stats import sem, t
from numpy import mean

conf_interval_mean = []
conf_interval_low = []
conf_interval_upper = []

for iterat in result:
    m = iterat.mean()
    h = sem(iterat) * t.ppf((1 + 0.95) / 2, len(iterat) - 1)
    conf_interval_mean.append(m)
    conf_interval_low.append(m - h)
    conf_interval_upper.append(m + h)
    
fig = plt.figure(figsize=(15,4))
plt.fill_between(range(k1,k2+1), conf_interval_low, conf_interval_upper,
                 color='g', alpha=.3, label="CI=95%")
plt.plot(range(k1,k2+1), conf_interval_mean, 'b', label="Mean")
plt.xticks(range(k1-3,k2+1, 10))
plt.yticks(range(6, 11, 1))
plt.xlabel("Nearest Neighbours", size=15)
plt.ylabel("Intrinsic Dimensionality", size=15)
plt.grid()
plt.legend()
plt.show()



In [128]:
from sklearn.decomposition import PCA

n_comp = 64
pca_model = PCA(n_components=n_comp)
pca_model.fit_transform(data)

cev = pca_model.explained_variance_ratio_.cumsum()
fig = plt.figure(figsize=(15,4))
plt.step(range(0, n_comp), cev, 'b', label="cumulative explained variance")
plt.xticks(range(0, n_comp, 10))
plt.title("Cumulative explained variance")
plt.xlabel("Num of principal components")
plt.ylabel("Cumulative explained variance")
plt.axhline(y=0.95, color='r', linestyle='--', label='95% explained variance')
plt.legend()
plt.show()



In [166]:
from matplotlib.colors import NoNorm

def plot_pca_components(x, coefficients, mean=0, components=None,
                        imshape=(8, 8), n_components=[]):
    mean = np.zeros_like(x) + mean
        
    fig = plt.figure(figsize=(4 * len(n_components), 4))
    g = plt.GridSpec(1, len(n_components), hspace=0.3)

    def show(n, x, title=None):
        ax = fig.add_subplot(g[n], xticks=[], yticks=[])
        ax.imshow(x.reshape(imshape), cmap="gray", norm=NoNorm())
        if title:
            ax.set_title(title, fontsize=12)
    
    approx = mean.copy()

    count = 0 
    for i in range(1, max(n_components.keys()) + 1):  # accumulate components one by one, starting from the mean
        approx = approx + coefficients[i-1] * components[i-1]
        if i in n_components.keys():
            if n_components[i] != "Original":
                show(count, approx, str(i) + ' Components \n' + n_components[i] + "% of Explained image")
            else:
                show(count, approx, str(i) + ' Components \n' + "Original image")
            count+=1
            
    return fig

pca_model = PCA(n_components=64)
data_fitted = pca_model.fit_transform(data)
fig = plot_pca_components(data[4], data_fitted[4], pca_model.mean_, pca_model.components_, n_components={13:"70", 18:"80", 23:"85", 29:"90", 39:"95", 64:"Original"})
plt.show()



Task 3.2 (3 pt.)

Here we will visualize the decompositions to observe how the data differs across different manifold learning methods. We are to compare PCA, ICA and the non-linear Isomap decomposition [see more over here https://scikit-learn.org/stable/modules/manifold.html], and check the classification accuracies on the new features.

  1. (1 pt.) Visualise ICA, PCA and Isomap 3D decomposition on 3 components in plotly. Be sure that the data is whitened (scaled).
  2. (1 pt.) Calculate new features of the data with ICA, PCA, and Isomap. Choose the number of components according to your estimate of the intrinsic dimension, see Task 3.1. Calculate the classification accuracy on these features with LogisticRegression on cv=5, 3 repeats. Use RepeatedKFold and fix the random_seed=0.
  3. (1 pt.) Show that Isomap gives statistically significant improvement on the classification accuracy with any appropriate statistical test (for example scipy.stats.ttest_ind). Provide corresponding p-values. Justify your results, write down 2-3 sentences.

In [101]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot,iplot
import plotly.graph_objs as go
import colorlover as cl
import pandas as pd
from sklearn.manifold import Isomap
from sklearn.decomposition import PCA
from sklearn.decomposition import FastICA

init_notebook_mode(connected=True)

### BEGIN Solution

iso = Isomap(n_neighbors=5, n_components=3, eigen_solver='dense')
digits_transformed = StandardScaler().fit_transform(iso.fit_transform(data))

pca = PCA(n_components=3)
digits_transformed2 = StandardScaler().fit_transform(pca.fit_transform(data))

ica = FastICA(n_components=3, random_state=0)
digits_transformed3 = StandardScaler().fit_transform(ica.fit_transform(data))

### END Solution



In [102]:
trace1 = go.Scatter3d(
    x=digits_transformed[:,0],
    y=digits_transformed[:,1],
    z=digits_transformed[:,2],
    mode='markers',
    marker=dict(
        color='rgb(255, 1, 1)',
        size=3,
        symbol='circle',
        line=dict(
            color='rgb(204, 1, 1)',
            width=1
        ),
        opacity=0.9
    )
)

trace2 = go.Scatter3d(
    x=digits_transformed2[:,0],
    y=digits_transformed2[:,1],
    z=digits_transformed2[:,2],
    mode='markers',
    marker=dict(
        color='rgb(1, 255, 1)',
        size=3,
        symbol='circle',
        line=dict(
            color='rgb(1, 204, 1)',
            width=1
        ),
        opacity=0.9
    )
)

trace3 = go.Scatter3d(
    x=digits_transformed3[:,0],
    y=digits_transformed3[:,1],
    z=digits_transformed3[:,2],
    mode='markers',
    marker=dict(
        color='rgb(1, 1, 255)',
        size=3,
        symbol='circle',
        line=dict(
            color='rgb(1, 1, 204)',
            width=1
        ),
        opacity=0.9
    )
)

layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    ),
    legend=dict(x=-.1, y=1.2)
)

fig = go.Figure(data=[trace1, trace2, trace3], layout=layout)

iplot(fig, show_link = False)



In [90]:
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(solver='newton-cg',multi_class='multinomial')
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)

### BEGIN Solution

report = [
    {'dimred_method': PCA(n_components=7), 'acc': [], 'mean_acc': 0, 'std_acc': 0.0},
    {'dimred_method': FastICA(n_components=7), 'acc': [], 'mean_acc': 0, 'std_acc': 0.0},
    {'dimred_method': Isomap(n_components=7, eigen_solver='dense'), 'acc': [], 'mean_acc': 0, 'std_acc' : 0.0},
]

pca_data = report[0]['dimred_method'].fit_transform(data)
ica_data = report[1]['dimred_method'].fit_transform(data)
isomap_data = report[2]['dimred_method'].fit_transform(data)
data_red=[pca_data, ica_data, isomap_data]

for i in range(len(report)):
    for train_index, test_index in rkf.split(data_red[i]):
        X_train, X_test = data_red[i][train_index], data_red[i][test_index]
        y_train, y_test = digits.target[train_index], digits.target[test_index]
        logreg.fit(X_train, y_train)
        report[i]['acc'].append(logreg.score(X_test, y_test))
    report[i]['std_acc'] = np.std(report[i]['acc'])
    report[i]['mean_acc'] = np.mean(report[i]['acc'])

# Printing
for i, elem in enumerate(report):
    report[i]['dimred_method'] = report[i]['dimred_method'].__class__.__name__
df = pd.DataFrame(report)
df.drop(['acc'], axis=1)

### END Solution


Out[90]:
dimred_method mean_acc std_acc
0 PCA 0.854755 0.021191
1 FastICA 0.686134 0.037035
2 Isomap 0.954553 0.010544

In [96]:
import scipy as sp

t_test, p_value = sp.stats.ttest_ind(report[0]['acc'], report[1]['acc'])

if (p_value < 0.005 and t_test > 0):
    print("Statistically significant improvement of PCA decomposition over ICA: p-value", p_value)


Statistically significant improvement of PCA decomposition over ICA: p-value 9.337081412362015e-15

In [97]:
t_test, p_value = sp.stats.ttest_ind(report[2]['acc'], report[0]['acc'])

if (p_value < 0.005 and t_test > 0):
    print("Statistically significant improvement of Isomap decomposition over PCA: p-value", p_value)


Statistically significant improvement of Isomap decomposition over PCA: p-value 1.835870430465974e-15

So, using a statistical test with the null hypothesis that the means are equal, we found that the null does not hold (the means are different): the p-values are extremely low, meaning that such large differences in accuracy would be very unlikely if the means were actually equal. The sign of the t-statistic tells us which method is better, and the p-value tells us how significant the difference is; in particular, Isomap gives a statistically significant improvement over PCA, which in turn significantly improves over ICA.


Task 3.3* (4 pt.). Bonus.

Explore KMNIST dataset https://github.com/rois-codh/kmnist (Kuzushiji-MNIST (10 classes, 28x28, 70k examples)). You are to engineer new features with any convenient method of Manifold Learning to increase the classification accuracy. Use the $k$NN classifier and validation code from here https://github.com/rois-codh/kmnist/blob/master/benchmarks/kuzushiji_mnist_knn.py. Your accuracy on the test should be more than 91.56%.

NOTE that the data is rather heavy, thus your decomposition will take a while.


In [2]:
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
from sklearn.decomposition import PCA


def load(f):
    return np.load(f)['arr_0']

# Load the data
x_train = load('kmnist-train-imgs.npz')
x_test = load('kmnist-test-imgs.npz')
y_train = load('kmnist-train-labels.npz')
y_test = load('kmnist-test-labels.npz')

In [112]:
n_comp = 200
pca_model = PCA(n_components=n_comp)
x_train_new = StandardScaler().fit_transform(x_train.reshape(-1, 28*28).astype(np.float64))  # flatten 28x28 images before scaling
pca_model.fit_transform(x_train_new)

cev = pca_model.explained_variance_ratio_.cumsum()
fig = plt.figure(figsize=(15,4))
plt.step(range(0, n_comp), cev, 'b', label="cumulative explained variance")
plt.xticks(range(0, n_comp, 10))
plt.title("Cumulative explained variance")
plt.xlabel("Num of principal components")
plt.ylabel("Cumulative explained variance")
plt.axhline(y=0.95, color='r', linestyle='--', label='95% explained variance')
plt.legend()
plt.show()



In [28]:
# Reshape the data
x_train_flatten = x_train.reshape(-1, 28*28)
x_test_flatten = x_test.reshape(-1, 28*28)

In [41]:
test_scores = []
n_comp_max = 200

for n_comp in range(10, n_comp_max, 10):
    print("Components", n_comp)
    pca_model = PCA(n_components=n_comp, random_state=101)
    x_train_red = pca_model.fit_transform(x_train_flatten)
    x_test_red = pca_model.transform(x_test_flatten)
    clf = KNeighborsClassifier(weights='distance', n_jobs=-1)
    clf.fit(x_train_red, y_train)
    test_scores.append(clf.score(x_test_red, y_test))


Components 10
Components 20
Components 30
Components 40
Components 50
Components 60
Components 70
Components 80
Components 90
Components 100
Components 110
Components 120
Components 130
Components 140
Components 150
Components 160
Components 170
Components 180
Components 190

In [42]:
fig = plt.figure(figsize=(14,7))
plt.plot(range(10, n_comp_max, 10), test_scores, 'b')
plt.xticks(range(0, n_comp_max, 10))
plt.title("Accuracy of kNN", size=18)
plt.xlabel("Num of principal components", size=18)
plt.ylabel("Accuracy", size=18)
plt.axhline(y=0.9156, color='r', linestyle='--', label='Required accuracy')
plt.grid()
plt.legend()
plt.show()