In [1]:
%pylab inline
import pickle
from sklearn.datasets import fetch_mldata
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.utils import shuffle
Load the MNIST dataset.
In [2]:
mnist = fetch_mldata('MNIST original', data_home=".")
In [3]:
mnist["data"].shape
Out[3]:
Take 2500 random images - we will project them onto a 50 x 50 grid.
In [4]:
size = 50
N = size * size
data, target = shuffle(mnist["data"], mnist["target"], random_state=777, n_samples=N)
PCA them to 100 dimensions and run t-SNE. Normalize the results to [0, 1].
In [19]:
data_100 = PCA(n_components=100).fit_transform(data.astype(float32) / 255)
embeddings = TSNE(init="pca", random_state=777, verbose=2).fit_transform(data_100)
embeddings -= embeddings.min(axis=0)
embeddings /= embeddings.max(axis=0)
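The two normalization lines above are a per-axis min-max scaling; a minimal sketch on a toy array (hypothetical values, not the real embeddings) shows the effect:

```python
import numpy as np

# Toy 2-D "embedding"; subtract the per-column minimum, then divide by
# the per-column maximum - exactly the two lines applied to `embeddings`.
emb = np.array([[2.0, -1.0], [4.0, 3.0], [6.0, 1.0]])
emb -= emb.min(axis=0)
emb /= emb.max(axis=0)
print(emb.min(axis=0), emb.max(axis=0))  # [0. 0.] [1. 1.]
```

After this, each coordinate spans exactly [0, 1], which is what the grid projection below expects.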
Use a scatter plot to visualize the embedded samples. We see that t-SNE did its job well and there are clear digit clusters.
In [237]:
rcParams["figure.figsize"] = (17, 9)
scatter(embeddings[:, 0], embeddings[:, 1], c=target)
my_colorbar = colorbar(fraction=0.05, pad=0.0125)
xticks([]); yticks([])
Out[237]:
This is what the scatter plot looks like if we draw the original images with an alpha mask instead of dots.
In [191]:
from matplotlib.offsetbox import OffsetImage, AnnotationBbox
gca().set_facecolor("black")
rcParams["figure.figsize"] = (16, 9)
for pos, img in zip(embeddings, data):
    img = img.reshape(28, 28)
    img = stack((img, img, img, ones((28, 28), dtype=uint8) * 255), axis=-1)
    img[img[:, :, 0] == 0] = 0
    ab = AnnotationBbox(OffsetImage(img), 0.03 + pos * 0.94, xycoords="axes fraction", frameon=False)
    gca().add_artist(ab)
xticks([]); yticks([])
Out[191]:
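The alpha-mask trick in the loop above can be isolated on a tiny array: stack the grayscale image three times as RGB, append a fully opaque alpha channel, then zero out every pixel that was black in the source so the background turns transparent. A minimal sketch (toy 2x2 "image", hypothetical values):

```python
import numpy as np

# Grayscale source with two black (0) pixels.
img = np.array([[0, 128], [255, 0]], dtype=np.uint8)
# RGB + full-opacity alpha channel, shape (2, 2, 4).
rgba = np.stack((img, img, img, np.full_like(img, 255)), axis=-1)
# Make originally-black pixels fully transparent (all four channels -> 0).
rgba[img == 0] = 0
print(rgba[0, 0], rgba[0, 1])  # [0 0 0 0] [128 128 128 255]
```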
Let's define a regular grid with 2500 dots (50 by 50) onto which we will project the samples after t-SNE.
In [43]:
from scipy.spatial.distance import cdist
grid = dstack(meshgrid(linspace(0, 1, size), linspace(0, 1, size))).reshape(-1, 2)
rcParams["figure.figsize"] = (9, 9)
scatter(grid[:,0], grid[:,1])
Out[43]:
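The `dstack(meshgrid(...)).reshape(-1, 2)` idiom is worth unpacking: `meshgrid` produces two `size x size` coordinate arrays, `dstack` pairs them into `(x, y)` tuples, and the reshape flattens the result into one row per grid node. A minimal sketch with a 3 x 3 grid:

```python
import numpy as np

# 3x3 version of the grid cell above: one (x, y) row per node,
# covering the unit square corner to corner.
side = 3
g = np.dstack(np.meshgrid(np.linspace(0, 1, side),
                          np.linspace(0, 1, side))).reshape(-1, 2)
print(g.shape)       # (9, 2)
print(g[0], g[-1])   # [0. 0.] [1. 1.]
```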
We define the cost matrix - the squared L2 distance from every sample to each of the grid nodes. Thus the matrix size is 2500 by 2500. We rescale it so that the maximum entry is 100000 - this makes the Jonker-Volgenant algorithm more numerically stable. Finally, we run lapjv and obtain the solution of our linear assignment problem.
In [42]:
from lapjv import lapjv
cost_matrix = cdist(grid, embeddings, "sqeuclidean").astype(float32)
cost_matrix = cost_matrix * (100000 / cost_matrix.max())
%time row_asses, col_asses, _ = lapjv(cost_matrix)
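The same linear assignment problem can be solved with SciPy's `scipy.optimize.linear_sum_assignment`; lapjv is much faster at 2500 x 2500, but on a toy instance (hypothetical points and nodes below) both return an optimal matching, which makes the setup easy to sanity-check:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Three samples and three grid nodes; cost[i, j] is the squared L2
# distance from node i to sample j, mirroring the cell above.
points = np.array([[0.1, 0.1], [0.9, 0.2], [0.4, 0.8]])
nodes = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
cost = cdist(nodes, points, "sqeuclidean")
rows, cols = linear_sum_assignment(cost)
print(cols)  # [0 1 2] - each node is matched to its nearest sample
```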
This is a visualization of the found solution. Each arrow starts at an original sample and points to its assigned grid node.
In [240]:
grid_jv = grid[col_asses]
pp_cmap = cm.get_cmap(rcParams["image.cmap"])
rcParams["figure.figsize"] = (17, 9)
for start, end, t in zip(embeddings, grid_jv, target):
    arrow(start[0], start[1], end[0] - start[0], end[1] - start[1],
          head_length=0.005, head_width=0.005, color=pp_cmap(t / 9), alpha=0.5)
colorbar(my_colorbar.mappable, fraction=0.05, pad=0.0125)
xticks([]); yticks([])
Out[240]:
Show the final MNIST 2D map. grid_jv contains the found optimal coordinates; we iterate over it together with the corresponding images. Matplotlib seems to have a bug with AnnotationBbox - the actual OffsetImage size is double the requested one, so we would need to scale each image to (14, 14) to avoid overlaps. However, most MNIST images do not carry any useful information at the boundaries, so we "overcommit" and resize to (20, 20).
In [188]:
from PIL import Image
rcParams["figure.figsize"] = (16, 16)
gca().set_facecolor("black")
for pos, img in zip(grid_jv, data):
    img = Image.fromarray(255 - img.reshape(28, 28)).resize((20, 20), Image.ANTIALIAS)
    ab = AnnotationBbox(OffsetImage(img, cmap="Greys"),
                        pos * (size - 1) * 28, xycoords="data", frameon=False, box_alignment=(0, 0))
    gca().add_artist(ab)
xlim(0, (size - 1) * 28 + 16); ylim(0, (size - 1) * 28 + 16)
xticks([]); yticks([])
Out[188]:
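A portability note on the resize step: newer Pillow versions removed `Image.ANTIALIAS` in favor of the equivalent `Image.LANCZOS`. A minimal sketch of the same downscaling with the newer constant (assuming a Pillow version that exposes `Image.LANCZOS`):

```python
import numpy as np
from PIL import Image

# Downscale a 28x28 digit-sized array to 20x20, as in the loop above,
# using LANCZOS - the filter ANTIALIAS always aliased to.
img = Image.fromarray(np.zeros((28, 28), dtype=np.uint8))
small = img.resize((20, 20), Image.LANCZOS)
print(small.size)  # (20, 20)
```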