In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
import scipy.stats as ss
sns.set_style('white')
In many cases, the number of dimensions is not too large. For instance, the "Iris" dataset contains four dimensions of measurements on the three types of iris flower species. It's more than two dimensions, yet still manageable.
In [3]:
iris = sns.load_dataset('iris')
iris.head(2)
Out[3]:
We get four dimensions (sepal_length, sepal_width, petal_length, petal_width). One direct way to visualize them is to have a scatter plot for each pair of dimensions. We can use the pairplot()
function in seaborn to do this.
In [4]:
sns.pairplot(iris)
Out[4]:
By using colors, you can get a much more useful plot.
In [5]:
sns.pairplot(iris, hue='species')
Out[5]:
Seaborn also lets us to specify what to put in the diagonal. When hue
is used, it defaults to KDE plot. We can change it back to histogram. See: https://seaborn.pydata.org/generated/seaborn.pairplot.html
Q: draw a pairplot with hue and histogram on the diagonal
In [8]:
# Implement
Out[8]:
We can use altair to create an interactive scatterplot matrix. Can you create a scatterplot matrix of iris dataset by consulting https://altair-viz.github.io/gallery/scatter_matrix.html?
Q: Draw an interactive scatterplot matrix for iris dataset in altair
In [6]:
# Implement
import altair as alt
Out[6]:
Another useful visualization you can create with not-so-high-dimensional datasets is parallel coordinate visualization. Actually pandas supports parallel coordinate plots as well as "Andrews curve" (you can think of it as a smooth version of parallel coordinate.
Q: Can you draw a parallel coordinate plot and a andrews curve plot of iris dataset? (I'm using viridis
and winter
colormap btw)
In [7]:
Out[7]:
In [8]:
Out[8]:
We can also use altair.
In [9]:
iris_transformed = iris.reset_index().melt(['species', 'index'])
alt.Chart(iris_transformed).mark_line().encode(
x='variable:N',
y='value:Q',
color='species:N',
detail='index:N',
opacity=alt.value(0.5),
).properties(width=500)
Out[9]:
Q: can you explain how iris_transformed
is different from the original iris
dataset and why do we need to transform in this way?
The principal component analysis (PCA) is the most basic dimensionality reduction method. For example, in the Iris dataset we have four variables (sepal_length
, sepal_width
, petal_length
, petal_width
). If we can reduce the number of variables to two, then we can easily visualize them in two dimensions.
PCA is already implemented in the scikit-learn package, a machine learning library in Python, which should have been included in Anaconda. If you don't have it, install it with:
conda install scikit-learn
or
pip install scikit-learn
In [10]:
iris.head(2)
Out[10]:
This is a four dimensional data. To run the PCA we want to isolate only the numerical columns.
In [11]:
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
iris_only_features = iris[features]
iris_only_features.head()
Out[11]:
We should first create a PCA object and specify the number of components to obtain. Note that you can obtain more than two principal components.
In [12]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2) # set the number of components to 2
Now you can run fit()
method to identify principal components.
In [13]:
pca_iris_fitted = pca.fit(iris_only_features)
An important set of numbers that you want to look at is the explained variance ratio.
In [14]:
pca_iris_fitted.explained_variance_ratio_
Out[14]:
It tells you how much of the variance in the original dataset is explained by the principal components that you obtained. It seems like the first two components capture more than 95% of the variance in original dataset. This means that the PCA is very effective on this dataset and just using two principal components is a very good approximation to use all dimensions. Now you can use the result to transform the original dataset.
In [15]:
iris_pca = pca_iris_fitted.transform(iris_only_features)
iris_pca[:5]
Out[15]:
A convenient way to do both fitting and transforming is
In [16]:
iris_pca = pca.fit_transform(iris_only_features)
iris_pca[:5]
Out[16]:
You can see that this transformed matrix has two columns. Each column corresponds to the "loading" for one of the principal components.
In [17]:
iris_pca_df = pd.DataFrame(data=iris_pca, columns=['PC1', 'PC2'])
iris_pca_df.head()
Out[17]:
Let's add the species information to the dataframe.
Q: add species column to iris_pca_df
.
In [18]:
# Implement
Out[18]:
Now we can produce a scatterplot based on the two principal components. Well, let's just draw a pairplot.
In [19]:
# Implement
Out[19]:
The PC1 seems to capture inter-species variation while PC2 seems to capture intra-species variation. 🧐 Interesting!
In [20]:
from sklearn.datasets import fetch_olivetti_faces
dataset = fetch_olivetti_faces(shuffle=True)
faces = dataset.data
In [21]:
n_samples, n_features = faces.shape
print(n_samples)
print(n_features)
So, this dataset contains 400 faces, and each of them has 4096 features (=pixels). Let's look at the first face:
In [22]:
print(faces[0].shape)
faces[0]
Out[22]:
It's an one-dimensional array with 4096 numbers. But a face should be a two-dimensional picture, right? Use numpy
's reshape()
function as well as matplotlib
's imshow()
function, transform this one-dimensional array into an appropriate 2-D matrix and draw it to show the face. You probably want to use plt.cm.gray
as colormap.
Be sure to play with different shapes (e.g. 2 x 2048, 1024 x 4, 128 x 32, and so on) and think about why they look like what they look like. What is the right shape of the array?
Q: reshape the one-dimensional array into an appropriate two dimensional array and show the face
In [23]:
# TODO: draw faces[0] with various shapes. Find the correct dimension.
Out[23]:
Let's perform PCA on this dataset.
In [24]:
from sklearn.decomposition import PCA
Set the number of components to 6:
In [25]:
n_components=6
pca = PCA(n_components=n_components)
Fit the faces data:
In [26]:
pca.fit(faces)
Out[26]:
PCA has an attribute called components_
. It is a $\text{n_components} \times \text{n_features}$ matrix, in our case $6 \times 4096$. Each row is a component.
In [27]:
pca.components_
Out[27]:
In [28]:
pca.components_.shape
Out[28]:
We can display the 6 components as images:
In [29]:
for i, comp in enumerate(pca.components_, 1):
plt.subplot(2, 3, i)
plt.imshow(comp.reshape(image_shape), cmap=plt.cm.gray, interpolation='gaussian')
plt.xticks(())
plt.yticks(())
😱 Looks a bit scary...
They are the "principal faces", which means that, by adding up these images with some appropriate weights, we can get a close approximation of the 400 images in the dataset!
We can get the coordinates of the 6 components to understand how each face is composed with the components.
In [30]:
faces_pca_transformed = pca.transform(faces)
In [31]:
faces_pca_transformed.shape
Out[31]:
faces_r
is a $400 \times 6$ matrix. Each row corresponds to one face, containing the coordinates of the 6 components. For instance, the coordinates for the first face is
In [32]:
faces_pca_transformed[0]
Out[32]:
It seems that the second component (with coordinate 4.14403343) contributes the most to the first face. Let's display them together and see how similar they are:
In [33]:
# display the first face image
plt.subplot(1, 2, 1)
plt.imshow(faces[0].reshape(image_shape), cmap=plt.cm.gray, interpolation='gaussian')
plt.xticks(())
plt.yticks(())
# display the second component
plt.subplot(1, 2, 2)
plt.imshow(pca.components_[1].reshape(image_shape), cmap=plt.cm.gray, interpolation='gaussian')
plt.xticks(())
plt.yticks(())
Out[33]:
We can display the composition of faces in an "equation" style:
In [34]:
from matplotlib import gridspec
def display_image(ax, image):
ax.imshow(image, cmap=plt.cm.gray, interpolation='nearest')
ax.set_xticks(())
ax.set_yticks(())
def display_text(ax, text):
ax.text(.5, .5, text, size=12)
ax.axis('off')
face_idx = 0
plt.figure(figsize=(16,4))
gs = gridspec.GridSpec(2, 10, width_ratios=[5,1,1,5,1,1,5,1,1,5])
# display the face
ax = plt.subplot(gs[0])
display_image(ax, faces[face_idx].reshape(image_shape))
# display the equal sign
ax = plt.subplot(gs[1])
display_text(ax, r'$=$')
# display the 6 coordinates
for coord_i, gs_i in enumerate( [2,5,8,12,15,18] ):
ax = plt.subplot(gs[gs_i])
display_text( ax, r'$%.3f \times $' % faces_pca_transformed[face_idx][coord_i] )
# display the 6 components
for comp_i, gs_i in enumerate( [3,6,9,13,16,19] ):
ax = plt.subplot(gs[gs_i])
display_image( ax, pca.components_[comp_i].reshape(image_shape) )
# display the plus sign
for gs_i in [4,7,11,14,17]:
ax = plt.subplot(gs[gs_i])
display_text(ax, r'$+$')
We can directly see the results of this addition.
In [35]:
f, axes = plt.subplots(1, 6, figsize=(16,4))
faceid = 0
constructed_faces = []
for i in range(2, 10):
constructed_faces.append(np.dot(faces_pca_transformed[faceid][:i], pca.components_[:i]))
# the face that we want to construct.
display_image(axes[0], faces[0].reshape(image_shape))
for idx, ax in enumerate(axes[1:]):
display_image(ax, constructed_faces[idx].reshape(image_shape))
It becomes more and more real, although quite far with only several components.
There is another pretty cool dimensionality reduction method called NMF (Non-negative matrix factorization). It is widely used in many domains, such as identifying topics in documents, identifying key components in images, and so on. The key idea is by forcing every element in the decomposed matrices positive, NMF breaks something into parts that we can add together.
In [36]:
from sklearn.decomposition import NMF
n_components=20
nmf = NMF(n_components=n_components)
nmf_fitted = nmf.fit(faces)
for i, comp in enumerate(nmf_fitted.components_, 1):
plt.subplot(4, 5, i)
plt.imshow(comp.reshape(image_shape), cmap=plt.cm.gray, interpolation='gaussian')
plt.xticks(())
plt.yticks(())
As you can see here, each 'component' of NMF picks up a certain part of the face (light area), such as eyes, chin, nose, and so on. Very cool.
In [37]:
faces_nmf_tranformed = nmf_fitted.transform(faces)
In [38]:
faces_nmf_tranformed[0]
Out[38]:
Can you show the reconstructed faces using the first n components, as we did for the PCA?
In [39]:
f, axes = plt.subplots(1, 8, figsize=(20,4))
faceid = 0
constructed_faces = []
# Implement
Unlike PCA that keeps superposing positive and negative images, NMF tends to gradually add multiple parts to the image. This is why it is widely used for many decomposing tasks such as detecting topics from documents.
Isomap, t-SNE, and MDS are nonlinear dimensionality reduction methods. Isomap preserves only the local relationships, MDS tries to preserve everything, and t-SNE is more flexible. t-SNE is very popular especially in machine learning.
Let's try t-SNE out with the iris data.
Q: Fit-transform the iris data with t-SNE and create a scatterplot of it.
In [40]:
from sklearn.manifold import TSNE
from sklearn.manifold import Isomap
from sklearn.manifold import MDS
from sklearn.datasets import load_iris
iris = load_iris()
# Implement
Out[40]:
The hyperparameter perplexity
determines how to balance attention between local and global aspects of your data. Changing this parameter (default is 30) may cause drastic changes in the output. Play with multiple values of perplexity.
In [41]:
Out[41]:
If you want to learn more about t-SNE, play with https://distill.pub/2016/misread-tsne/ and https://experiments.withgoogle.com/visualizing-high-dimensional-space
This is a classic dataset of images of handwritten digits. It contains 1797 images with (8*8=64) pixels each.
In [42]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape
Out[42]:
digits.data stores the images:
In [43]:
digits.data[0]
Out[43]:
and digits.target is the classes (or labels) that the images belong to. There are 10 classes in total.
In [44]:
digits.target
Out[44]:
Q: use imshow
to display the first image.
In [45]:
# Implement
Out[45]:
In [46]:
X = np.vstack([digits.data[digits.target==i]
for i in range(10)])
y = np.hstack([digits.target[digits.target==i]
for i in range(10)])
Then initialize a tsne model. For the meaning of the parameters see here.
In [47]:
tsne = TSNE(n_components=2, init='pca', random_state=0)
A models like tsne has a lot of parameters. By fitting it on the data, we are selecting the "best" parameters which minimize a certain objective function. For example, when fitting a linear regression model, we are selecting a line that minimizes the sum of squared errors. Here after we select the best parameters for tsne, we also obtain the clusters found by this model. calling fit_transform
is equivilant to calling fit
and then transform
.
In [48]:
digits_proj = tsne.fit_transform(X)
Plot the results. Seaborn's hls palatte provides evenly spaced colors in HLS hue space, we can divide it into 10 colors.
In [49]:
palette = np.array(sns.color_palette("hls", 10))
Make a scatter plot of the first component against the second component, with color based on the numbers.
In [50]:
plt.figure(figsize = (6,6))
plt.scatter(digits_proj[:,0], digits_proj[:,1],c=palette[y])
Out[50]:
We can add some text for each cluster. The place of the text can be the center of the cluster. We can use np.median
to find the centers. To simplify things, we can make the code into a function.
In [51]:
def plot_scatter(projection):
plt.figure(figsize = (6,6))
plt.scatter(projection[:,0], projection[:,1],c=palette[y])
for i in range(10):
# Position of each label.
xtext, ytext = np.median(projection[y == i, :], axis=0)
txt = plt.text(xtext, ytext, str(i), fontsize=24)
In [52]:
plot_scatter(digits_proj)
We talked about MDS and Isomap in class as two other manifold learning methods. Sklearn also has implementations for this two algorithms: MDS and Isomap, so the usage is very similar. Examples for using this methods can be found here.
Can you make another two plots with these two methods? You only need to change the models and call the plot_scatter
function,
In [53]:
# Implement
In [54]:
# Implement
In [ ]:
In [ ]: