M11 Visualizing high dimensional data


In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
import scipy.stats as ss

sns.set_style('white')

Scatterplot matrix for low-high dimensional data

In many cases, the number of dimensions is not too large. For instance, the "Iris" dataset contains four dimensions of measurements on the three types of iris flower species. It's more than two dimensions, yet still manageable.


In [3]:
iris = sns.load_dataset('iris')
iris.head(2)


Out[3]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa

We get four dimensions (sepal_length, sepal_width, petal_length, petal_width). One direct way to visualize them is to have a scatter plot for each pair of dimensions. We can use the pairplot() function in seaborn to do this.


In [4]:
sns.pairplot(iris)


Out[4]:
<seaborn.axisgrid.PairGrid at 0x1a18469690>

By using colors, you can get a much more useful plot.


In [5]:
sns.pairplot(iris, hue='species')


Out[5]:
<seaborn.axisgrid.PairGrid at 0x1a1900eb50>

Seaborn also lets us to specify what to put in the diagonal. When hue is used, it defaults to KDE plot. We can change it back to histogram. See: https://seaborn.pydata.org/generated/seaborn.pairplot.html

Q: draw a pairplot with hue and histogram on the diagonal


In [8]:
# Implement


Out[8]:
<seaborn.axisgrid.PairGrid at 0x1a1a086f10>

We can use altair to create an interactive scatterplot matrix. Can you create a scatterplot matrix of iris dataset by consulting https://altair-viz.github.io/gallery/scatter_matrix.html?

Q: Draw an interactive scatterplot matrix for iris dataset in altair


In [6]:
# Implement
import altair as alt


Out[6]:

Parallel coordinates

Another useful visualization you can create with not-so-high-dimensional datasets is parallel coordinate visualization. Actually pandas supports parallel coordinate plots as well as "Andrews curve" (you can think of it as a smooth version of parallel coordinate.

Q: Can you draw a parallel coordinate plot and a andrews curve plot of iris dataset? (I'm using viridis and winter colormap btw)


In [7]:



Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f896b5cf210>

In [8]:



Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8961318a10>

We can also use altair.


In [9]:
iris_transformed = iris.reset_index().melt(['species', 'index'])
alt.Chart(iris_transformed).mark_line().encode(
    x='variable:N',
    y='value:Q',
    color='species:N',
    detail='index:N',
    opacity=alt.value(0.5),
).properties(width=500)


Out[9]:

Q: can you explain how iris_transformed is different from the original iris dataset and why do we need to transform in this way?

PCA

The principal component analysis (PCA) is the most basic dimensionality reduction method. For example, in the Iris dataset we have four variables (sepal_length, sepal_width, petal_length, petal_width). If we can reduce the number of variables to two, then we can easily visualize them in two dimensions.

PCA is already implemented in the scikit-learn package, a machine learning library in Python, which should have been included in Anaconda. If you don't have it, install it with:

conda install scikit-learn

or

pip install scikit-learn

In [10]:
iris.head(2)


Out[10]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa

This is a four dimensional data. To run the PCA we want to isolate only the numerical columns.


In [11]:
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
iris_only_features = iris[features]
iris_only_features.head()


Out[11]:
sepal_length sepal_width petal_length petal_width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

We should first create a PCA object and specify the number of components to obtain. Note that you can obtain more than two principal components.


In [12]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2) # set the number of components to 2

Now you can run fit() method to identify principal components.


In [13]:
pca_iris_fitted = pca.fit(iris_only_features)

An important set of numbers that you want to look at is the explained variance ratio.


In [14]:
pca_iris_fitted.explained_variance_ratio_


Out[14]:
array([0.92461872, 0.05306648])

It tells you how much of the variance in the original dataset is explained by the principal components that you obtained. It seems like the first two components capture more than 95% of the variance in original dataset. This means that the PCA is very effective on this dataset and just using two principal components is a very good approximation to use all dimensions. Now you can use the result to transform the original dataset.


In [15]:
iris_pca = pca_iris_fitted.transform(iris_only_features)
iris_pca[:5]


Out[15]:
array([[-2.68412563,  0.31939725],
       [-2.71414169, -0.17700123],
       [-2.88899057, -0.14494943],
       [-2.74534286, -0.31829898],
       [-2.72871654,  0.32675451]])

A convenient way to do both fitting and transforming is


In [16]:
iris_pca = pca.fit_transform(iris_only_features)
iris_pca[:5]


Out[16]:
array([[-2.68412563,  0.31939725],
       [-2.71414169, -0.17700123],
       [-2.88899057, -0.14494943],
       [-2.74534286, -0.31829898],
       [-2.72871654,  0.32675451]])

You can see that this transformed matrix has two columns. Each column corresponds to the "loading" for one of the principal components.


In [17]:
iris_pca_df = pd.DataFrame(data=iris_pca, columns=['PC1', 'PC2'])
iris_pca_df.head()


Out[17]:
PC1 PC2
0 -2.684126 0.319397
1 -2.714142 -0.177001
2 -2.888991 -0.144949
3 -2.745343 -0.318299
4 -2.728717 0.326755

Let's add the species information to the dataframe.

Q: add species column to iris_pca_df.


In [18]:
# Implement


Out[18]:
PC1 PC2 species
0 -2.684126 0.319397 setosa
1 -2.714142 -0.177001 setosa
2 -2.888991 -0.144949 setosa
3 -2.745343 -0.318299 setosa
4 -2.728717 0.326755 setosa

Now we can produce a scatterplot based on the two principal components. Well, let's just draw a pairplot.


In [19]:
# Implement


Out[19]:
<seaborn.axisgrid.PairGrid at 0x7f896b8b5050>

The PC1 seems to capture inter-species variation while PC2 seems to capture intra-species variation. 🧐 Interesting!

PCA with faces

Let's play with PCA with some faces. 🙄😬🤓


In [20]:
from sklearn.datasets import fetch_olivetti_faces

dataset = fetch_olivetti_faces(shuffle=True)
faces = dataset.data

In [21]:
n_samples, n_features = faces.shape
print(n_samples)
print(n_features)


400
4096

So, this dataset contains 400 faces, and each of them has 4096 features (=pixels). Let's look at the first face:


In [22]:
print(faces[0].shape)
faces[0]


(4096,)
Out[22]:
array([0.6694215 , 0.6363636 , 0.6487603 , ..., 0.08677686, 0.08264463,
       0.07438017], dtype=float32)

It's an one-dimensional array with 4096 numbers. But a face should be a two-dimensional picture, right? Use numpy's reshape() function as well as matplotlib's imshow() function, transform this one-dimensional array into an appropriate 2-D matrix and draw it to show the face. You probably want to use plt.cm.gray as colormap.

Be sure to play with different shapes (e.g. 2 x 2048, 1024 x 4, 128 x 32, and so on) and think about why they look like what they look like. What is the right shape of the array?

Q: reshape the one-dimensional array into an appropriate two dimensional array and show the face


In [23]:
# TODO: draw faces[0] with various shapes. Find the correct dimension.


Out[23]:
<matplotlib.image.AxesImage at 0x7f898185f150>

Let's perform PCA on this dataset.


In [24]:
from sklearn.decomposition import PCA

Set the number of components to 6:


In [25]:
n_components=6
pca = PCA(n_components=n_components)

Fit the faces data:


In [26]:
pca.fit(faces)


Out[26]:
PCA(copy=True, iterated_power='auto', n_components=6, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

PCA has an attribute called components_. It is a $\text{n_components} \times \text{n_features}$ matrix, in our case $6 \times 4096$. Each row is a component.


In [27]:
pca.components_


Out[27]:
array([[-0.00419104, -0.00710949, -0.00933609, ...,  0.00018516,
         0.00337965,  0.00318825],
       [-0.02859143, -0.03328834, -0.03784649, ...,  0.02962781,
         0.02721297,  0.02488898],
       [ 0.00135693, -0.00032557, -0.00019804, ..., -0.01541367,
        -0.01370979, -0.01188343],
       [ 0.00112445, -0.00179018, -0.01168211, ...,  0.02942991,
         0.02781909,  0.02521843],
       [-0.02384169, -0.02359037, -0.02216066, ..., -0.04243831,
        -0.04007326, -0.04110203],
       [ 0.02910092,  0.03130446,  0.02877684, ..., -0.01635963,
        -0.01637501, -0.01491011]], dtype=float32)

In [28]:
pca.components_.shape


Out[28]:
(6, 4096)

We can display the 6 components as images:


In [29]:
for i, comp in enumerate(pca.components_, 1):
    plt.subplot(2, 3, i)
    plt.imshow(comp.reshape(image_shape), cmap=plt.cm.gray, interpolation='gaussian')
    plt.xticks(())
    plt.yticks(())


😱 Looks a bit scary...

They are the "principal faces", which means that, by adding up these images with some appropriate weights, we can get a close approximation of the 400 images in the dataset!

We can get the coordinates of the 6 components to understand how each face is composed with the components.


In [30]:
faces_pca_transformed = pca.transform(faces)

In [31]:
faces_pca_transformed.shape


Out[31]:
(400, 6)

faces_r is a $400 \times 6$ matrix. Each row corresponds to one face, containing the coordinates of the 6 components. For instance, the coordinates for the first face is


In [32]:
faces_pca_transformed[0]


Out[32]:
array([ 0.8157951 , -4.144032  ,  2.483267  , -0.90308505,  0.83133876,
        0.8862674 ], dtype=float32)

It seems that the second component (with coordinate 4.14403343) contributes the most to the first face. Let's display them together and see how similar they are:


In [33]:
# display the first face image 
plt.subplot(1, 2, 1)
plt.imshow(faces[0].reshape(image_shape), cmap=plt.cm.gray, interpolation='gaussian')
plt.xticks(())
plt.yticks(())

# display the second component
plt.subplot(1, 2, 2)
plt.imshow(pca.components_[1].reshape(image_shape), cmap=plt.cm.gray, interpolation='gaussian')
plt.xticks(())
plt.yticks(())


Out[33]:
([], <a list of 0 Text yticklabel objects>)

We can display the composition of faces in an "equation" style:


In [34]:
from matplotlib import gridspec

def display_image(ax, image):
    ax.imshow(image, cmap=plt.cm.gray, interpolation='nearest')
    ax.set_xticks(())
    ax.set_yticks(())

def display_text(ax, text):
    ax.text(.5, .5, text, size=12)
    ax.axis('off')

face_idx = 0

plt.figure(figsize=(16,4))
gs = gridspec.GridSpec(2, 10, width_ratios=[5,1,1,5,1,1,5,1,1,5])

# display the face
ax = plt.subplot(gs[0])
display_image(ax, faces[face_idx].reshape(image_shape))

# display the equal sign
ax = plt.subplot(gs[1])
display_text(ax, r'$=$')

# display the 6 coordinates
for coord_i, gs_i in enumerate( [2,5,8,12,15,18] ):
    ax = plt.subplot(gs[gs_i])
    display_text( ax, r'$%.3f \times $' % faces_pca_transformed[face_idx][coord_i] )

# display the 6 components
for comp_i, gs_i in enumerate( [3,6,9,13,16,19] ):
    ax = plt.subplot(gs[gs_i])
    display_image( ax, pca.components_[comp_i].reshape(image_shape) )

# display the plus sign
for gs_i in [4,7,11,14,17]:
    ax = plt.subplot(gs[gs_i])
    display_text(ax, r'$+$')


We can directly see the results of this addition.


In [35]:
f, axes = plt.subplots(1, 6, figsize=(16,4))

faceid = 0

constructed_faces = []
for i in range(2, 10):
    constructed_faces.append(np.dot(faces_pca_transformed[faceid][:i], pca.components_[:i]))

# the face that we want to construct. 
display_image(axes[0], faces[0].reshape(image_shape))

for idx, ax in enumerate(axes[1:]):
    display_image(ax, constructed_faces[idx].reshape(image_shape))


It becomes more and more real, although quite far with only several components.

NMF

There is another pretty cool dimensionality reduction method called NMF (Non-negative matrix factorization). It is widely used in many domains, such as identifying topics in documents, identifying key components in images, and so on. The key idea is by forcing every element in the decomposed matrices positive, NMF breaks something into parts that we can add together.


In [36]:
from sklearn.decomposition import NMF
n_components=20
nmf = NMF(n_components=n_components)
nmf_fitted = nmf.fit(faces)

for i, comp in enumerate(nmf_fitted.components_, 1):
    plt.subplot(4, 5, i)
    plt.imshow(comp.reshape(image_shape), cmap=plt.cm.gray, interpolation='gaussian')
    plt.xticks(())
    plt.yticks(())


As you can see here, each 'component' of NMF picks up a certain part of the face (light area), such as eyes, chin, nose, and so on. Very cool.


In [37]:
faces_nmf_tranformed = nmf_fitted.transform(faces)

In [38]:
faces_nmf_tranformed[0]


Out[38]:
array([0.5448082 , 0.        , 0.32704465, 0.        , 0.32316211,
       0.47868305, 0.61156429, 0.54635072, 0.32697753, 0.16225584,
       0.        , 0.09397038, 0.0952292 , 0.        , 0.60943093,
       0.22976332, 0.        , 0.32904996, 0.21041442, 0.        ])

Can you show the reconstructed faces using the first n components, as we did for the PCA?


In [39]:
f, axes = plt.subplots(1, 8, figsize=(20,4))
faceid = 0
constructed_faces = []

# Implement


Unlike PCA that keeps superposing positive and negative images, NMF tends to gradually add multiple parts to the image. This is why it is widely used for many decomposing tasks such as detecting topics from documents.

t-SNE, Isomap, and MDS

Isomap, t-SNE, and MDS are nonlinear dimensionality reduction methods. Isomap preserves only the local relationships, MDS tries to preserve everything, and t-SNE is more flexible. t-SNE is very popular especially in machine learning.

Let's try t-SNE out with the iris data.

Q: Fit-transform the iris data with t-SNE and create a scatterplot of it.


In [40]:
from sklearn.manifold import TSNE
from sklearn.manifold import Isomap
from sklearn.manifold import MDS
from sklearn.datasets import load_iris

iris = load_iris()

# Implement


Out[40]:
<matplotlib.collections.PathCollection at 0x7f898106de90>

The hyperparameter perplexity determines how to balance attention between local and global aspects of your data. Changing this parameter (default is 30) may cause drastic changes in the output. Play with multiple values of perplexity.


In [41]:



Out[41]:
<matplotlib.collections.PathCollection at 0x7f8962c4b290>

Visualizing the Digits dataset

This is a classic dataset of images of handwritten digits. It contains 1797 images with (8*8=64) pixels each.


In [42]:
from sklearn.datasets import load_digits

digits = load_digits()
digits.data.shape


Out[42]:
(1797, 64)

digits.data stores the images:


In [43]:
digits.data[0]


Out[43]:
array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
       15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
       12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
        0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
       10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])

and digits.target is the classes (or labels) that the images belong to. There are 10 classes in total.


In [44]:
digits.target


Out[44]:
array([0, 1, 2, ..., 8, 9, 8])

Q: use imshow to display the first image.


In [45]:
# Implement


Out[45]:
<matplotlib.image.AxesImage at 0x7f8972605110>

Let's first reorder the data points according to the handwritten numbers. We can use np.vstack and np.hstack.


In [46]:
X = np.vstack([digits.data[digits.target==i]
               for i in range(10)])
y = np.hstack([digits.target[digits.target==i]
               for i in range(10)])

Then initialize a tsne model. For the meaning of the parameters see here.


In [47]:
tsne = TSNE(n_components=2, init='pca', random_state=0)

A models like tsne has a lot of parameters. By fitting it on the data, we are selecting the "best" parameters which minimize a certain objective function. For example, when fitting a linear regression model, we are selecting a line that minimizes the sum of squared errors. Here after we select the best parameters for tsne, we also obtain the clusters found by this model. calling fit_transform is equivilant to calling fit and then transform.


In [48]:
digits_proj = tsne.fit_transform(X)

Plot the results. Seaborn's hls palatte provides evenly spaced colors in HLS hue space, we can divide it into 10 colors.


In [49]:
palette = np.array(sns.color_palette("hls", 10))

Make a scatter plot of the first component against the second component, with color based on the numbers.


In [50]:
plt.figure(figsize = (6,6))
plt.scatter(digits_proj[:,0], digits_proj[:,1],c=palette[y])


Out[50]:
<matplotlib.collections.PathCollection at 0x7f8961a43bd0>

We can add some text for each cluster. The place of the text can be the center of the cluster. We can use np.median to find the centers. To simplify things, we can make the code into a function.


In [51]:
def plot_scatter(projection):
    plt.figure(figsize = (6,6))
    plt.scatter(projection[:,0], projection[:,1],c=palette[y])
    for i in range(10):
        # Position of each label.
        xtext, ytext = np.median(projection[y == i, :], axis=0)
        txt = plt.text(xtext, ytext, str(i), fontsize=24)

In [52]:
plot_scatter(digits_proj)


Comparison with Isomap and MDS

We talked about MDS and Isomap in class as two other manifold learning methods. Sklearn also has implementations for this two algorithms: MDS and Isomap, so the usage is very similar. Examples for using this methods can be found here.

Can you make another two plots with these two methods? You only need to change the models and call the plot_scatter function,


In [53]:
# Implement



In [54]:
# Implement



In [ ]:


In [ ]: