In this part of the homework you are to solve several problems related to machine learning algorithms.
We suggest using the sklearn library instead of writing tons of your own code: there exists a class/method for almost everything you can imagine (related to this homework). Your SOLUTION notebook MUST BE REPRODUCIBLE, i.e. if the reviewer decides to execute Kernel -> Restart Kernel and Run All Cells, after all the computation they will obtain exactly the same solution (with all the corresponding plots) as in your uploaded notebook. For this purpose, we suggest fixing the random seed or (better) setting random_state= inside every algorithm that uses pseudorandomness.
Your code must be clear to the reviewer. For this purpose, try to include the necessary comments inside the code. But remember: GOOD CODE MUST BE SELF-EXPLANATORY without any additional comments.
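For example, a minimal sketch of what we mean by fixing pseudorandomness (the seed value and the toy data below are arbitrary, purely illustrative):

import numpy as np
from sklearn.cluster import KMeans

np.random.seed(42)                             # fixes NumPy-level randomness
X_toy = np.random.rand(100, 2)                 # toy data, just for illustration
km = KMeans(n_clusters=3, random_state=42)     # fixes the algorithm's own pseudorandomness
toy_labels = km.fit_predict(X_toy)             # identical labels on every full notebook rerun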
$ your LaTeX equation here $
$$ your LaTeX equation here $$
\begin{align}
left-hand-side
&= right-hand-side on line 1
\\
&= right-hand-side on line 2
\\
&= right-hand-side on the last line
\end{align}
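For instance (an arbitrary identity, purely to illustrate the alignment):
\begin{align}
(a + b)^2
&= (a + b)(a + b)
\\
&= a^2 + ab + ba + b^2
\\
&= a^2 + 2ab + b^2
\end{align}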
The ampersand (&) aligns the equations horizontally and the double backslash (\\) creates a new line.
Write your theoretical derivations within such blocks:
**BEGIN Solution**
<!-- >>> your derivation here <<< -->
**END Solution**
Please, write your implementation within the designated blocks:
...
### BEGIN Solution
# >>> your solution here <<<
### END Solution
...
Try KMeans, Gaussian Mixture and two more clustering algorithms from sklearn.
To evaluate clustering performance use two clustering metrics: silhouette score
and adjusted mutual information.
For each algorithm, your task is to try to find the parameters that lead to the best performance with n_clusters=true_number_of_clusters:
In [20]:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans, Birch, AgglomerativeClustering
from sklearn.metrics.cluster import adjusted_mutual_info_score
from sklearn.mixture import GaussianMixture
In [168]:
points, labels = np.loadtxt('data/clustering.txt'), np.loadtxt('data/clustering_labels.txt')
labels = labels.astype(int)
print("True number of clusters is {}".format(np.max(labels)))
plt.figure(figsize=(10, 10))
plt.scatter(points[:,0], points[:,1], c=labels, alpha=0.3, edgecolor='k')
plt.show()
In [169]:
### BEGIN Solution
min_num_of_clusters = 13
max_num_of_clusters = 17
f, axes = plt.subplots(1, 2, sharex=True, figsize=(15, 7))
axes[0].set_title('silhouette_score')
axes[1].set_title('adjusted_mutual_info_score')
f2, axes2 = plt.subplots(4, max_num_of_clusters - min_num_of_clusters + 1, sharex=True, sharey=True, figsize=(15, 7))
for i in range(max_num_of_clusters - min_num_of_clusters + 1):
axes2[0][i].set_title('KMeans, clusters = ' + str(i + min_num_of_clusters))
axes2[0][i].axis('off')
axes2[1][i].set_title('GaussMixt, clusters = ' + str(i + min_num_of_clusters))
axes2[1][i].axis('off')
axes2[2][i].set_title('Birch, clusters = ' + str(i + min_num_of_clusters))
axes2[2][i].axis('off')
axes2[3][i].set_title('AgglClust, clusters = ' + str(i + min_num_of_clusters))
axes2[3][i].axis('off')
silhouette_kmeans = []
silhouette_gaussianmixture = []
silhouette_birch = []
silhouette_agglomerativeclust = []
mi_kmeans = []
mi_gaussianmixture = []
mi_birch = []
mi_agglomerativeclust = []
for n_cluster in range(min_num_of_clusters, max_num_of_clusters + 1):
# KMeans
kmeans = KMeans(n_clusters=n_cluster, random_state=101, n_jobs=-1)
kmeans_labels = kmeans.fit_predict(points)
silhouette_kmeans.append(silhouette_score(points, kmeans_labels))
mi_kmeans.append(adjusted_mutual_info_score(labels, kmeans_labels, average_method='arithmetic'))
axes2[0][n_cluster - min_num_of_clusters].scatter(points[:,0], points[:,1], c=kmeans_labels, alpha=0.3, edgecolor='k')
# Gaussian Mixture
gaussian_mixture = GaussianMixture(n_components=n_cluster, random_state=101)
gm_labels = gaussian_mixture.fit_predict(points)
silhouette_gaussianmixture.append(silhouette_score(points, gm_labels))
mi_gaussianmixture.append(adjusted_mutual_info_score(labels, gm_labels, average_method='arithmetic'))
axes2[1][n_cluster- min_num_of_clusters].scatter(points[:,0], points[:,1], c=gm_labels, alpha=0.3, edgecolor='k')
# Birch
birch = Birch(n_clusters=n_cluster)
birch_labels = birch.fit_predict(points)
silhouette_birch.append(silhouette_score(points, birch_labels))
mi_birch.append(adjusted_mutual_info_score(labels, birch_labels, average_method='arithmetic'))
axes2[2][n_cluster - min_num_of_clusters].scatter(points[:,0], points[:,1], c=birch_labels, alpha=0.3, edgecolor='k')
# Agglomerative Clustering
aggl_clust = AgglomerativeClustering(n_clusters=n_cluster)
aggl_clust_labels = aggl_clust.fit_predict(points)
silhouette_agglomerativeclust.append(silhouette_score(points, aggl_clust_labels))
mi_agglomerativeclust.append(adjusted_mutual_info_score(labels, aggl_clust_labels, average_method='arithmetic'))
axes2[3][n_cluster - min_num_of_clusters].scatter(points[:,0], points[:,1], c=aggl_clust_labels, alpha=0.3, edgecolor='k')
axes[0].plot(list(range(min_num_of_clusters, max_num_of_clusters + 1)), silhouette_kmeans, label="KMeans")
axes[1].plot(list(range(min_num_of_clusters, max_num_of_clusters + 1)), mi_kmeans, label="KMeans")
axes[0].plot(list(range(min_num_of_clusters, max_num_of_clusters + 1)), silhouette_gaussianmixture, label="GaussianMixture")
axes[1].plot(list(range(min_num_of_clusters, max_num_of_clusters + 1)), mi_gaussianmixture, label="GaussianMixture")
axes[0].plot(list(range(min_num_of_clusters, max_num_of_clusters + 1)), silhouette_birch, label="Birch")
axes[1].plot(list(range(min_num_of_clusters, max_num_of_clusters + 1)), mi_birch, label="Birch")
axes[0].plot(list(range(min_num_of_clusters, max_num_of_clusters + 1)), silhouette_agglomerativeclust, label="AgglomerativeClustering")
axes[1].plot(list(range(min_num_of_clusters, max_num_of_clusters + 1)), mi_agglomerativeclust, label="AgglomerativeClustering")
axes[0].axvline(x=15, color='k', linestyle='--')
axes[1].axvline(x=15, color='k', linestyle='--')
axes[0].legend()
axes[1].legend()
plt.tight_layout()
plt.show()
### END Solution
For a chosen algorithm find the best number of clusters using bootstrap.
So, first, based on your observations made at the previous task, choose one algorithm assuming you do not have true labels and do not know the true number of clusters.
In [56]:
from tqdm import tqdm_notebook
from sklearn.utils import resample
from scipy.stats import sem, t
from scipy import mean
### BEGIN Solution
cluster_ns = [13, 14, 15, 16, 17]
sample_number = 10
score_clusters = []
for n_clusters in tqdm_notebook(cluster_ns):
    score = []
    kmean = KMeans(n_clusters=n_clusters, random_state=101)
    for sample_index in tqdm_notebook(range(sample_number)):
        # vary the random_state per bootstrap sample; a fixed value would make every resample identical
        points_resampled = resample(points, replace=True, n_samples=points.shape[0],
                                    random_state=sample_index)
        label_predicted = kmean.fit_predict(points_resampled)
        score.append(silhouette_score(points_resampled, label_predicted))
    score_clusters.append(score)
### END Solution
In [81]:
conf_interval_mean = []
conf_interval_low = []
conf_interval_upper = []
for score in score_clusters:
m = mean(score)
h = sem(score) * t.ppf((1 + 0.95) / 2, len(score) - 1)
conf_interval_mean.append(m)
conf_interval_low.append(m - h)
conf_interval_upper.append(m + h)
fig = plt.figure(figsize=(10,6))
plt.fill_between(cluster_ns, conf_interval_low, conf_interval_upper,
color='b', alpha=.5)
plt.plot(cluster_ns, conf_interval_mean, 'b')
plt.title("95% confidence interval", size=16)
plt.xticks(cluster_ns)
plt.xlabel("Number of clusters", size=16)
plt.ylabel("Silhouette score", size=16)
plt.axvline(x=15, color='k', linestyle='--')
plt.show()
Assume that you have $n$ points in $D$ dimensional space:
$$u_1, ..., u_n \in \mathbb R^D.$$
There exists a linear data transformation $F(u): \mathbb R^D \rightarrow \mathbb R^d$, $D \gg d$, such that:
$$(1-\delta) \|u_i - u_j\|^2 \leq \|F(u_i) - F(u_j)\|^2\leq (1+\delta) \|u_i - u_j\|^2$$
with high probability.
The transformation $F(u)$ is: $F(u) = \tfrac{1}{\sqrt d}Xu$, and $X \in \mathbb R^{d \times D}$ is a random matrix, components of which are independent identically distributed $X_{ij} \sim \mathcal{N}(0, 1)$.
This statement means that there exists a matrix that reduces the original dimensionality such that pairwise distances are not distorted much. This is a theoretical probabilistic result that guarantees you such a transformation. We will obtain a bound on $d$ for which this result holds true.
Chernoff inequality. This inequality states a bound on the distribution tail.
$$\mathbb P(X \geq t) \leq \frac{\mathbb E \exp(\lambda X)}{\exp(\lambda t)}$$
Definition. Random variable $X$ is called subexponential with parameters $(\sigma^2, b)$, if for all $\lambda: |\lambda| < \frac{1}{b}$ the following is true:
$$\mathbb E \exp(\lambda (X - \mathbb E X)) \leq \exp\left(\frac{\lambda^2\sigma^2}{2}\right)$$
Fact. $\chi^2$ distribution with $d$ degrees of freedom is a sum of squares of $d$ independent standard gaussian random variables.
Fact. $\chi^2$ distribution with $d$ degrees of freedom is subexponential with parameters $\sigma^2 = 4d, b = 4$
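For reference: the Chernoff bound above is just Markov's inequality applied to the monotone transformation $\exp(\lambda X)$ with $\lambda > 0$:
$$\mathbb P(X \geq t) = \mathbb P\big(\exp(\lambda X) \geq \exp(\lambda t)\big) \leq \frac{\mathbb E \exp(\lambda X)}{\exp(\lambda t)}$$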
Using the above information, prove that for $Y \sim \chi^2$ with $d$ degrees of freedom the following inequality holds true:
$$\mathbb P (\lvert Y - d \rvert \geq t) \leq 2\exp\left(\frac{-t^2}{8d}\right)$$ for $t \leq \frac{\sigma^2}{b}$.
Hint: you will need to optimise the power of exponential in order to get optimal $\lambda$.
BEGIN Solution
$$ \mathbb{P}(| Y - d| \geq t ) = \mathbb{P} ( Y - d \geq t ) + \mathbb{P} ( Y - d \leq -t )$$
Using Fact 1 (which gives $\mathbb E Y = d$), the Chernoff bound and subexponentiality:
$$ \mathbb{P} (Y -d \geq t ) \leq \frac{\mathbb{E} \exp (\lambda (Y -d))}{\exp(\lambda t)} \leq \exp \Big[ \frac{\lambda^2 \sigma^2 }{2} - \lambda t \Big] = f(\lambda)$$
To get the lowest upper bound we need to find $\hat{\lambda}$:
$$ \hat{\lambda} = \arg \min_{\lambda} f(\lambda) $$
Here we go:
$$ \frac{\partial{f(\lambda)}}{\partial \lambda} = 0, \quad \exp \Big[\ldots\Big] \Big[ \lambda \sigma^2 - t \Big] = 0 \;\Rightarrow\; \hat{\lambda} = \frac{t}{\sigma^2} $$
Recalling the subexponentiality constraint:
$$ \hat{\lambda} \leq \frac{1}{b} \;\Rightarrow\; t \leq \frac{\sigma^2}{b} $$
So:
$$ \mathbb{P} (Y - d \geq t) \leq \exp \Big[ \frac{\hat{\lambda}^2 \sigma^2}{2} - \hat{\lambda} t \Big] = \exp \Big[ -\frac{t^2}{2 \sigma^2}\Big] = \exp \Big[ - \frac{t^2}{8d} \Big]$$
Since the subexponential condition holds for all $|\lambda| < \frac{1}{b}$, the same bound applies to $-(Y - d)$, so in the same way:
$$ \mathbb{P} (Y - d \leq -t) = \mathbb{P} (-(Y - d) \geq t) \leq \exp \Big[ \frac{\hat{\lambda}^2 \sigma^2}{2} - \hat{\lambda} t \Big] = \exp \Big[ -\frac{t^2}{2 \sigma^2}\Big] = \exp \Big[ - \frac{t^2}{8d} \Big]$$
Finally:
$$ \mathbb{P}(| Y - d| \geq t ) \leq 2 \exp \Big[ - \frac{t^2}{8d} \Big], \quad t \leq \frac{\sigma^2}{b} $$
QED
END Solution
BEGIN Solution
We know that $\chi^2$ distribution with $d$ degrees of freedom is a sum of squares of $d$ independent standard gaussian random variables. So:
$$ \frac{\|Xu\|^2}{\|u\|^2} = \frac{u^T X^T X u}{\|u\|^2} $$
Basically, we need to prove that each component of $\frac{Xu}{\|u\|}$ is a standard normal random variable. We know that:
$$ \frac{\|Xu\|^2}{\|u\|^2} = \sum^d_{i=1} \left(\frac{X_i u}{\|u\|}\right)^2, \quad \text{where } X_i \text{ is the } i\text{-th row of } X.$$
Furthermore:
$$ \frac{X_i u}{\|u\|} = \sum_{j=1}^D \frac{X_{ij} u_j }{\|u\|} = \sum_{j=1}^D \frac{u_j}{\|u\|} X_{ij}, \quad \text{where } X_{ij} \sim \mathcal{N}(0, 1) \text{ are independent.}$$
So we have to prove that this linear combination of the $X_{ij}$ is normally distributed as well, with zero mean and unit variance.
Let's break this proof into two parts. In the first part we show how an affine transformation changes the mean and the variance.
Proof:
Let $g(X) = aX + b $ $$\begin{align} E[g(X)]
&= \int (ax + b) f_X (x) dx
\\
&= a \int x f_X (x) dx + b \int f_X (x) dx
\\
&= aE[X] + b
\end{align} $$
Let $g(X) = aX + b $ $$\begin{align} \mathrm{Var} [g(X)]
&= \mathbb{E} [g(X)^2] - \mathbb{E} [g(X)]^2
\\
&= \mathbb{E} [(aX + b)^2] - (a\mathbb{E}[X] + b) ^2
\\
&= a^2 \mathbb{E} [X^2] + 2ab \mathbb{E}[X] + b^2 - (a\mathbb{E}[X] + b)^2
\\
&= a^2 (\mathbb{E} [X^2] - \mathbb{E}[X]^2)
\\
&= a^2 \mathrm{Var}[X]
\end{align} $$
In the second part we prove linearity of expectation for a sum of random variables.
Proof by induction:
First, prove for two arbitrary random variables $X$, $Y$ that $\mathbb{E}[X+Y] = \mathbb{E}[X] + \mathbb{E}[Y]$ ($f(x,y)$ is the joint PDF of $X$, $Y$):
$$\begin{align} \mathbb{E} [X+Y] &= \int_y \int_x (x+y) f(x,y)\, dx\, dy \\ &= \int_x x \int_y f(x,y)\, dy\, dx + \int_y y \int_x f(x,y)\, dx\, dy \\ &= \int_x x f_X(x)\, dx + \int_y y f_Y (y)\, dy \\ &= \mathbb{E}[X] + \mathbb{E}[Y] \end{align} $$
Now suppose
$$ \mathbb{E} \Big[ \sum_{i=1}^{k-1} X_i \Big] = \sum_{i=1}^{k-1} \mathbb{E} [X_i]. $$
For simplicity define R.V. $ Y_{k-1} = \sum_{i=1}^{k-1} X_i $, then:
$$\begin{align} \mathbb{E} \Big[ \sum_{i=1}^{k} X_i \Big] &= \mathbb{E} [Y_{k-1} + X_k] \\ &= \mathbb{E} [Y_{k-1}] + \mathbb{E}[X_k] \\ &= \sum_{i=1}^{k-1} \mathbb{E} [X_i] + \mathbb{E}[X_k] \\ &= \sum_{i=1}^{k} \mathbb{E} [X_i] \end{align} $$
The proof for the variance is analogous (also by induction), using the fact that the $X_i$ are independent, so the covariance terms vanish.
From these two results we obtain:
$$ \mathbb{E} [c_0 + c_1 X_1 + c_2 X_2 + ... + c_n X_n] = c_0 + c_1 \mathbb{E} [X_1] + c_2 \mathbb{E}[X_2] + ... + c_n \mathbb{E} [X_n] $$
$$ \mathrm{Var} [c_0 + c_1 X_1 + c_2 X_2 + ... + c_n X_n] = c_1^2 \mathrm{Var} [X_1] + c_2^2 \mathrm{Var}[X_2] + ... + c_n^2 \mathrm{Var} [X_n] $$
A linear combination of independent gaussian random variables is gaussian, so with $c_j = \frac{u_j}{\|u\|}$ each component $\frac{X_i u}{\|u\|}$ is gaussian with mean $0$ and variance $\sum_{j=1}^D \frac{u_j^2}{\|u\|^2} = 1$, i.e. $\mathcal{N}(0, 1)$. Hence $\frac{\|Xu\|^2}{\|u\|^2}$ is a sum of squares of $d$ independent standard gaussian variables, i.e. $\chi^2$ with $d$ degrees of freedom. QED
END Solution
Having all of the previous results, we now may apply them to get the bound.
From inequality for tails from Task 4.1 we get that:
$$\mathbb P \left(\left\lvert \frac{\|Xu\|^2}{\|u\|^2} - d \right\rvert \geq t\right) \leq 2\exp\left(\frac{-t^2}{8d}\right)$$
This means that the probability that a given pairwise distance is distorted a lot is bounded.
Show that if the probability above is small, then the probability of
$$(1-\delta) \|u_i - u_j\|^2 \leq \|\frac{1}{\sqrt d} X(u_i - u_j)\|^2\leq (1+\delta) \|u_i - u_j\|^2$$
holding simultaneously for all pairs is large, namely at least $1 - n(n-1)\exp \left(-\frac{\delta^2d}{8}\right)$.
From the previous step: if we want our inequalities to hold with high probability $1-\varepsilon$, we need to bound the failure probability from above by $\varepsilon$. Derive the inequality for $d$ from that.
Hint:
BEGIN Solution
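A sketch of the derivation, applying the tail bound above with $t = \delta d$ and a union bound over all pairs of points:
For a fixed pair $(i, j)$ put $u = u_i - u_j$, so that $\frac{\|Xu\|^2}{\|u\|^2} \sim \chi^2$ with $d$ degrees of freedom. Taking $t = \delta d$ (which satisfies $t \leq \frac{\sigma^2}{b} = d$ for $\delta \leq 1$):
$$\mathbb P\left(\left\lvert \frac{\|Xu\|^2}{\|u\|^2} - d \right\rvert \geq \delta d\right) \leq 2\exp\left(-\frac{\delta^2 d}{8}\right)$$
The complementary event is exactly $(1-\delta)\|u_i - u_j\|^2 \leq \|\tfrac{1}{\sqrt d}X(u_i - u_j)\|^2 \leq (1+\delta)\|u_i - u_j\|^2$ (divide by $d$ inside the absolute value). By the union bound over all $\frac{n(n-1)}{2}$ pairs:
$$\mathbb P(\text{all pairs preserved}) \geq 1 - \frac{n(n-1)}{2}\cdot 2\exp\left(-\frac{\delta^2 d}{8}\right) = 1 - n(n-1)\exp\left(-\frac{\delta^2 d}{8}\right)$$
Requiring the failure probability to be at most $\varepsilon$:
$$n(n-1)\exp\left(-\frac{\delta^2 d}{8}\right) \leq \varepsilon \;\Longleftrightarrow\; d \geq \frac{8}{\delta^2}\log\left(\frac{n(n-1)}{\varepsilon}\right)$$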
END Solution
We have successfully obtained a lower bound on the dimensionality we can safely reduce to such that pairwise distances do not change much:
$$d \geq \frac{8}{\delta^2}\log\left(\frac{n(n-1)}{\varepsilon}\right)$$
Note the beauty of this inequality: it does not depend on the original dimensionality $D$ at all, and both the number of samples $n$ and the failure probability $\varepsilon$ enter only under the $\log$ function.
This bound is not very tight; using more advanced techniques you may improve it. That means our estimate of $d$ may be higher than necessary, but we can guarantee the result for it. Also remember that this approach is probabilistic: with small probability the random projection may still distort some distances by more than $\delta$. It is very useful in the case of extremely large dimensionalities $D$, and allows one to reduce the dimensionality while preserving the structure of the data.
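As a quick sanity check of the formula (a hand computation using the values that appear in the implementation below: $n = 400$, $\delta = 0.15$, $\varepsilon = 0.01$):
$$d \geq \frac{8}{0.15^2}\log\left(\frac{400 \cdot 399}{0.01}\right) \approx 5.9 \cdot 10^{3}.$$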
Now implement the obtained result and confirm that it holds true:
In [3]:
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.metrics.pairwise import euclidean_distances
from numpy.random import standard_normal
X = fetch_20newsgroups_vectorized().data[:400]
In [9]:
### BEGIN Solution
n, D = X.shape
delta = 0.15
epsilon = 0.01
min_d = int(np.ceil(8 / delta**2 * np.log(n * (n - 1) / epsilon)))
print("Number of samples ", n)
print("Dimensionality of each sample", D)
print("Min dimensionality of reduced space", min_d)
d = min_d + 50
np.random.seed(101)
RM = np.random.standard_normal(size=(d, D))
F = 1/np.sqrt(d) * X @ RM.T
#check results
X_dist = euclidean_distances(X, X)**2
F_dist = euclidean_distances(F, F)**2
print('For d = ', d, 'left inequalities hold:', ((1-delta)*X_dist <= F_dist).all())
print('For d = ', d, 'right inequalities hold:', ((1+delta)*X_dist >= F_dist).all())
d = min_d // 10
RM = np.random.standard_normal(size=(d, D))
F = 1/np.sqrt(d) * X @ RM.T
X_dist = euclidean_distances(X, X)**2
F_dist = euclidean_distances(F, F)**2
print('For d = ', d, 'left inequalities hold:', ((1-delta)*X_dist <= F_dist).all())
print('For d = ', d, 'right inequalities hold:', ((1+delta)*X_dist >= F_dist).all())
### END Solution
Download MNIST dataset. Here we will estimate the data intrinsic dimensionality, decompose the dataset into principal components and estimate the cumulative explained variance by each component.
You can either use the method proposed here https://www.stat.berkeley.edu/~bickel/mldim.pdf for the sample intrinsic dimension estimation, or any other method known to you. For the first case, you can use implemented code from here https://gist.github.com/mehdidc/8a0bb21a31c43b0cbbdd31d75929b5e4
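The code below follows the MLE approach from the linked gist: for each point $x$, if $T_j(x)$ denotes the distance from $x$ to its $j$-th nearest neighbour, the per-point estimate (with the $k-2$ normalisation used in the gist) is
$$\hat d_k(x) = \left( \frac{1}{k-2} \sum_{j=1}^{k-1} \log \frac{T_k(x)}{T_j(x)} \right)^{-1},$$
and the estimate for a given $k$ is the average of $\hat d_k(x)$ over all points; we then look at how it behaves over a range of $k$ and over bootstrap resamples.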
In [100]:
from sklearn import datasets
import matplotlib.pyplot as plt
from tqdm import tqdm
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
digits = datasets.load_digits()
data = StandardScaler().fit_transform(digits.data)
k1, k2 = 3, 70
### BEGIN Solution
def intrinsic_dim_sample_wise(X, k=5):
neighb = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, ind = neighb.kneighbors(X)
dist = dist[:, 1:]
dist = dist[:, 0:k]
assert dist.shape == (X.shape[0], k)
assert np.all(dist > 0)
d = np.log(dist[:, k - 1: k] / dist[:, 0:k-1])
d = d.sum(axis=1) / (k - 2)
d = 1. / d
intdim_sample = d
return intdim_sample
def intrinsic_dim_scale_interval(X, k1=10, k2=20):
X = pd.DataFrame(X).drop_duplicates().values # remove duplicates in case you use bootstrapping
intdim_k = []
for k in range(k1, k2 + 1):
m = intrinsic_dim_sample_wise(X, k).mean()
intdim_k.append(m)
return intdim_k
def repeated(func, X, nb_iter=100, random_state=None, verbose=0, mode='bootstrap', **func_kw):
if random_state is None:
rng = np.random
else:
rng = np.random.RandomState(random_state)
nb_examples = X.shape[0]
results = []
iters = range(nb_iter)
if verbose > 0:
iters = tqdm(iters)
for i in iters:
if mode == 'bootstrap':
Xr = X[rng.randint(0, nb_examples, size=nb_examples)]
elif mode == 'shuffle':
ind = np.arange(nb_examples)
rng.shuffle(ind)
Xr = X[ind]
elif mode == 'same':
Xr = X
else:
raise ValueError('unknown mode : {}'.format(mode))
results.append(func(Xr, **func_kw))
return results
result = np.array(repeated(intrinsic_dim_scale_interval, data, nb_iter=20, random_state=101, k1=k1, k2=k2)).T
In [47]:
from scipy.stats import sem, t
from scipy import mean
conf_interval_mean = []
conf_interval_low = []
conf_interval_upper = []
for iterat in result:
m = iterat.mean()
h = sem(iterat) * t.ppf((1 + 0.95) / 2, len(iterat) - 1)
conf_interval_mean.append(m)
conf_interval_low.append(m - h)
conf_interval_upper.append(m + h)
fig = plt.figure(figsize=(15,4))
plt.fill_between(range(k1,k2+1), conf_interval_low, conf_interval_upper,
color='g', alpha=.3, label="CI=95%")
plt.plot(range(k1,k2+1), conf_interval_mean, 'b', label="Mean")
plt.xticks(range(k1-3,k2+1, 10))
plt.yticks(range(6, 11, 1))
plt.xlabel("Nearest Neighbours", size=15)
plt.ylabel("Intrinsic Dimensionality", size=15)
plt.grid()
plt.legend()
plt.show()
In [128]:
from sklearn.decomposition import PCA
n_comp = 64
pca_model = PCA(n_components=n_comp)
pca_model.fit_transform(data)
cev = pca_model.explained_variance_ratio_.cumsum()
fig = plt.figure(figsize=(15,4))
plt.step(range(0, n_comp), cev, 'b', label="cumulative explained variance")
plt.xticks(range(0, n_comp, 10))
plt.title("Cumulative explained variance")
plt.xlabel("Num of principal components")
plt.ylabel("Cumulative explained variance")
plt.axhline(y=0.95, color='r', linestyle='--', label='95% explained variance')
plt.legend()
plt.show()
In [166]:
from matplotlib.colors import NoNorm
def plot_pca_components(x, coefficients, mean=0, components=None,
                        imshape=(8, 8), n_components={}):
    mean = np.zeros_like(x) + mean
    fig = plt.figure(figsize=(4 * len(n_components), 4))
    g = plt.GridSpec(1, len(n_components), hspace=0.3)
    def show(n, x, title=None):
        ax = fig.add_subplot(g[n], xticks=[], yticks=[])
        ax.imshow(x.reshape(imshape), cmap="gray", norm=NoNorm())
        if title:
            ax.set_title(title, fontsize=12)
    approx = mean.copy()
    count = 0
    # start from 1 so that iteration i adds exactly components[0..i-1] to the reconstruction
    for i in range(1, max(n_components.keys()) + 1):
        approx = approx + coefficients[i - 1] * components[i - 1]
        if i in n_components.keys():
            if n_components[i] != "Original":
                show(count, approx, str(i) + ' Components \n' + n_components[i] + "% of explained variance")
            else:
                show(count, approx, str(i) + ' Components \n' + "Original image")
            count += 1
    return fig
pca_model = PCA(n_components=64)
data_fitted = pca_model.fit_transform(data)
fig = plot_pca_components(data[4], data_fitted[4], pca_model.mean_, pca_model.components_, n_components={13:"70", 18:"80", 23:"85", 29:"90", 39:"95", 64:"Original"})
plt.show()
Here we will visualize the decomposition, to observe how the data diverges across different manifold learning methods. We are to compare PCA, ICA and Isomap non-linear decompositions [see more over here https://scikit-learn.org/stable/modules/manifold.html], and check the classification accuracies on the new features.
Visualize the three decompositions with plotly. Be sure that the data is whitened (scaled). Evaluate LogisticRegression with cv=5, 3 repeats: use RepeatedKFold and fix random_state=0. Check whether Isomap gives a statistically significant improvement of the classification accuracy with any appropriate statistical test (for example scipy.stats.ttest_ind). Provide the corresponding p-values. Justify your results, write down 2-3 sentences.
In [101]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot,iplot
import plotly.graph_objs as go
import colorlover as cl
import pandas as pd
from sklearn.manifold import Isomap
from sklearn.decomposition import PCA
from sklearn.decomposition import FastICA
init_notebook_mode(connected=True)
### BEGIN Solution
iso = Isomap(n_neighbors=5, n_components=3, eigen_solver='dense')
digits_transformed = StandardScaler().fit_transform(iso.fit_transform(data))
pca = PCA(n_components=3)
digits_transformed2 = StandardScaler().fit_transform(pca.fit_transform(data))
ica = FastICA(n_components=3, random_state=0)
digits_transformed3 = StandardScaler().fit_transform(ica.fit_transform(data))
### END Solution
In [102]:
trace1 = go.Scatter3d(
    name='Isomap',
x=digits_transformed[:,0],
y=digits_transformed[:,1],
z=digits_transformed[:,2],
mode='markers',
marker=dict(
color='rgb(255, 1, 1)',
size=3,
symbol='circle',
line=dict(
color='rgb(204, 1, 1)',
width=1
),
opacity=0.9
)
)
trace2 = go.Scatter3d(
    name='PCA',
x=digits_transformed2[:,0],
y=digits_transformed2[:,1],
z=digits_transformed2[:,2],
mode='markers',
marker=dict(
color='rgb(1, 255, 1)',
size=3,
symbol='circle',
line=dict(
color='rgb(1, 204, 1)',
width=1
),
opacity=0.9
)
)
trace3 = go.Scatter3d(
    name='FastICA',
x=digits_transformed3[:,0],
y=digits_transformed3[:,1],
z=digits_transformed3[:,2],
mode='markers',
marker=dict(
color='rgb(1, 1, 255)',
size=3,
symbol='circle',
line=dict(
color='rgb(1, 1, 204)',
width=1
),
opacity=0.9
)
)
layout = go.Layout(
margin=dict(
l=0,
r=0,
b=0,
t=0
),
legend=dict(x=-.1, y=1.2)
)
fig = go.Figure(data=[trace1, trace2, trace3], layout=layout)
iplot(fig, show_link = False)
In [90]:
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='newton-cg',multi_class='multinomial')
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
### BEGIN Solution
report = [
{'dimred_method': PCA(n_components=7), 'acc': [], 'mean_acc': 0, 'std_acc': 0.0},
{'dimred_method': FastICA(n_components=7, random_state=0), 'acc': [], 'mean_acc': 0, 'std_acc': 0.0},
{'dimred_method': Isomap(n_components=7, eigen_solver='dense'), 'acc': [], 'mean_acc': 0, 'std_acc' : 0.0},
]
pca_data = report[0]['dimred_method'].fit_transform(data)
ica_data = report[1]['dimred_method'].fit_transform(data)
isomap_data = report[2]['dimred_method'].fit_transform(data)
data_red=[pca_data, ica_data, isomap_data]
for i in range(len(report)):
for train_index, test_index in rkf.split(data_red[i]):
X_train, X_test = data_red[i][train_index], data_red[i][test_index]
y_train, y_test = digits.target[train_index], digits.target[test_index]
logreg.fit(X_train, y_train)
report[i]['acc'].append(logreg.score(X_test, y_test))
report[i]['std_acc'] = np.std(report[i]['acc'])
report[i]['mean_acc'] = np.mean(report[i]['acc'])
# Printing
for i, elem in enumerate(report):
report[i]['dimred_method'] = report[i]['dimred_method'].__class__.__name__
df = pd.DataFrame(report)
df.drop(['acc'], axis=1)
### END Solution
Out[90]:
In [96]:
import scipy as sp
t_test, p_value = sp.stats.ttest_ind(report[0]['acc'], report[1]['acc'])
if (p_value < 0.005 and t_test > 0):
print("Statistically significant improvement of PCA decomposition over ICA: p-value", p_value)
In [97]:
t_test, p_value = sp.stats.ttest_ind(report[2]['acc'], report[0]['acc'])
if (p_value < 0.005 and t_test > 0):
print("Statistically significant improvement of Isomap decomposition over PCA: p-value", p_value)
So, using a statistical test with the null hypothesis that the mean accuracies are equal, we found that the null hypothesis does not hold (the means are different): the extremely low p-values mean that such differences in accuracy would be very unlikely if the means were actually equal. The sign of the t-statistic tells us the direction of the difference (which decomposition performs better), and the p-value tells us how unlikely the observed difference is under the null hypothesis.
Explore KMNIST dataset https://github.com/rois-codh/kmnist (Kuzushiji-MNIST (10 classes, 28x28, 70k examples)). You are to engineer new features with any convenient method of Manifold Learning to increase the classification accuracy. Use the $k$NN classifier and validation code from here https://github.com/rois-codh/kmnist/blob/master/benchmarks/kuzushiji_mnist_knn.py. Your accuracy on the test should be more than 91.56%.
NOTE that the data is rather heavy, thus your decomposition will take a while.
In [2]:
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
from sklearn.decomposition import PCA
def load(f):
return np.load(f)['arr_0']
# Load the data
x_train = load('kmnist-train-imgs.npz')
x_test = load('kmnist-test-imgs.npz')
y_train = load('kmnist-train-labels.npz')
y_test = load('kmnist-test-labels.npz')
In [112]:
n_comp = 200
pca_model = PCA(n_components=n_comp)
x_train_new = StandardScaler().fit_transform(x_train.reshape(-1, 28*28).astype(np.float64))
pca_model.fit_transform(x_train_new)
cev = pca_model.explained_variance_ratio_.cumsum()
fig = plt.figure(figsize=(15,4))
plt.step(range(0, n_comp), cev, 'b', label="cumulative explained variance")
plt.xticks(range(0, n_comp, 10))
plt.title("Cumulative explained variance")
plt.xlabel("Num of principal components")
plt.ylabel("Cumulative explained variance")
plt.axhline(y=0.95, color='r', linestyle='--', label='95% explained variance')
plt.legend()
plt.show()
In [28]:
# Reshape the data
x_train_flatten = x_train.reshape(-1, 28*28)
x_test_flatten = x_test.reshape(-1, 28*28)
In [41]:
test_scores = []
n_comp_max = 200
for n_comp in range(10, n_comp_max, 10):
print("Components", n_comp)
pca_model = PCA(n_components=n_comp, random_state=101)
x_train_red = pca_model.fit_transform(x_train_flatten)
x_test_red = pca_model.transform(x_test_flatten)
clf = KNeighborsClassifier(weights='distance', n_jobs=-1)
clf.fit(x_train_red, y_train)
test_scores.append(clf.score(x_test_red, y_test))
In [42]:
fig = plt.figure(figsize=(14,7))
plt.plot(range(10, n_comp_max, 10), test_scores, 'b')
plt.xticks(range(0, n_comp_max, 10))
plt.title("Accuracy of kNN", size=18)
plt.xlabel("Num of principal components", size=18)
plt.ylabel("Accuracy", size=18)
plt.axhline(y=0.9156, color='r', linestyle='--', label='Required accuracy')
plt.grid()
plt.legend()
plt.show()