**784**.

```
In [ ]:
```%matplotlib inline
from time import time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.offsetbox import AnnotationBbox, OffsetImage

As usual we provide the code that fetches the data:

```
In [ ]:
```import os
from sklearn.datasets import fetch_mldata
# Fetch MNIST dataset and create a local copy.
if os.path.exists('mnist.npz'):
with np.load('mnist.npz', 'r') as data:
X = data['X']
y = data['y']
else:
mnist = fetch_mldata("MNIST original")
X, y = mnist.data / 255.0, mnist.target
np.savez('mnist.npz', X=X, y=y)

```
In [ ]:
``````
# Preferably arrange images as a 10x10 matrix.
```

`train_indices`

):

```
In [ ]:
```n_train_samples = 5000
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
train_indices = indices[: n_train_samples]

Your task is to try three different dimensionality reduction techniques which may be helpful for the visualization of high-dimensional data. The first one should be familiar to you: it's **PCA** with 2 components. The other two are non-linear embedding approaches described in the lecture, namely, MDS and LLE. Plot all three embeddings and check if they are good enough for the dataset.

**NOTE:** MDS (default settings) may take some time (>10 min) to compute, so be patient. It's probably a good idea to leave the exploration of this method as a hometask.

```
In [ ]:
```from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import MDS, LocallyLinearEmbedding
# Your code goes here.

**2**'s using Isomap. First, you need to compute the embedding of the corresponding subset of MNIST.

```
In [ ]:
```from sklearn.manifold import Isomap
indices_of_2 = np.arange(X.shape[0])[y == 2]
np.random.shuffle(indices_of_2)
train_indices_of_2 = indices_of_2[: n_train_samples]
# Your code goes here.

** scipy.spatial.KDTree** for the fast nearest neighbour computation.

```
In [ ]:
```from scipy.spatial import KDTree
from scipy.spatial.distance import pdist, cdist, squareform
from scipy import linspace
def find_representatives(kdtree, from_point, to_point, n_points):
# Given two 2D points this function should return a sequence of the dataset representatives (indices) that
# we encounter nearby as we go from from_point to to_point. This can be done by taking a set points on
# the segment and finding corresponding nearest neighbours in the dataset.
# Your code goes here.
return representatives
n_points = 100
# Your code starts here and shoudld end with:
# representatives = find_representatives(kdtree, X_embedded[from_idx, :], X_embedded[to_idx, :], n_points)

** diplay_manifold_flythrough** function. First two arguments are array of images and their coordinates on the 2D plane.

```
In [2]:
```from IPython.display import HTML
from matplotlib import animation
VIDEO_TAG = """<video controls>
<source src="data:video/x-m4v;base64,{0}" type="video/mp4">
Your browser does not support the video tag.
</video>"""
def anim_to_html(anim):
if not hasattr(anim, '_encoded_video'):
with open('temp.mp4', 'wb') as f:
anim.save(f.name, fps=20, extra_args=['-vcodec', 'libx264'], writer='ffmpeg')
video = open('temp.mp4', "rb").read()
anim._encoded_video = video.encode("base64")
return VIDEO_TAG.format(anim._encoded_video)
def display_animation(anim):
plt.close(anim._fig)
return HTML(anim_to_html(anim))
def diplay_manifold_flythrough(X, coords, fig, ax):
imagebox = OffsetImage(X[0].reshape(28, 28), cmap=plt.cm.gray_r)
annbox = AnnotationBbox(imagebox, coords[0])
ax.add_artist(annbox)
def init():
return imagebox, annbox
def animate(i):
imagebox.set_data(X[i].reshape(28, 28))
annbox.xyann = coords[i]
return imagebox, annbox
anim = animation.FuncAnimation(fig, animate, init_func=init, frames=X.shape[0], interval=20, blit=True)
return display_animation(anim)

** diplay_manifold_flythrough** to display the manifold fly-through over the scatter plot of the subset:

```
In [ ]:
```fig = plt.figure(figsize=(10, 10))
ax = plt.axes(frameon=False)
plt.setp(ax, xticks=(), yticks=())
# Your code starts here and should end with:
# diplay_manifold_flythrough(..., ..., fig, ax)

```
In [4]:
```HTML("""<video controls>
<source src='./interpolation.mp4' type='video/mp4'>
Your browser does not support the video tag.
</video>""")

```
Out[4]:
```

```
In [ ]:
```def display_interpolation(X, steps, fig):
# NOTE: First argument corresponds to the sequence of images.
n_images = X.shape[0]
im = plt.imshow(X[0].reshape(28, 28), cmap=plt.cm.gray_r)
def init():
return im,
def animate(i):
# Your code goes here.
im.set_array(img)
return im,
anim = animation.FuncAnimation(fig, animate, init_func=init, frames=steps, interval=20, blit=True)
return display_animation(anim)
fig = plt.figure(figsize=(3, 3))
ax = plt.axes(frameon=False)
plt.setp(ax, xticks=(), yticks=())
# Your code starts here and should end with:
# display_interpolation(.., 500, fig)

If we need to visualize the whole set of digits, a much better embedding technique is t-SNE. This method has been proven to generate high-quality visualizations in various scenarios and has become a standard de-facto in analyzing high-dimensional data (like activities of neurons in an Artificial Neural Network). We are going to use a slower version of the algorithm supplied in the `scikit-learn`

(** sklearn.manifold.TSNE**). In practice, however, I'd suggest to use an approximate Barnes-Hut t-SNE which produces comparable results while working substantially faster.

Your task here is similar to the ones in the previous section. The differeces are:

- You should use all digits (a
`train_indices`

subset) - The fly-through should capture all classes. One way to achieve this would be to use picker capabilities of
`matplotlib`

(demo) in order to manually pick a sequence of keypoints on a scatter plot. Given a list of keypoints, it's easy to obtain intermediate representatives using thefunction. The rest of the code should be almost identical to the Isomap-case.`find_representatives`

**Don't forget about the interpolation animation :)**

**NOTE 1:** t-SNE is available in the latest version of `scikit-learn`

. There is a `python`

package called `tsne`

(`pip install tsne`

) which implements Barnes-Hut t-SNE. Feel free to use that one.

**NOTE 2:** It may be a good idea to apply PCA (with, for example, 50 components) to the data first.

```
In [ ]:
```from sklearn.manifold import TSNE
# Your code goes here.