Unsupervised Learning - Clustering and PCA

Timothy Helton



NOTE:
This notebook uses code found in the k2datascience.cluster module. To execute all the cells, do one of the following:

  • Install the k2datascience package to the active Python interpreter.
  • Add k2datascience/k2datascience to the PYTHONPATH environment variable.
  • Create a link to the cluster.py file in the same directory as this notebook.


Imports


In [ ]:
from bokeh.plotting import figure, show
import bokeh.io as bkio
import pandas as pd

from k2datascience import cluster
from k2datascience import plotting

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
bkio.output_notebook()
%matplotlib inline

Load Data

US Arrests Dataset

In [ ]:
arrests = cluster.Arrests()
arrests.data.info()
arrests.data.head()
arrests.data.describe()

In [ ]:
plotting.correlation_heatmap_plot(arrests.data)

In [ ]:
plotting.correlation_pair_plot(arrests.data)

Genes Dataset

In [ ]:
genes = cluster.Genes()
genes.data.info()
genes.data.head()
genes.data.describe()

Exercise 1

We mentioned the use of correlation-based distance and Euclidean distance as dissimilarity measures for hierarchical clustering. It turns out that these two measures are almost equivalent: if each observation has been centered to have mean zero and standard deviation one, and if we let $r_{ij}$ denote the correlation between the ith and jth observations, then the quantity $1−r_{ij}$ is proportional to the squared Euclidean distance between the ith and jth observations. On the USArrests data, show that this proportionality holds.

Correlation
$$ r_{xy} = \frac{\sum_{i=1}^n (x_i - \overline{x}) (y_i - \overline{y})}{n \sigma_x \sigma_y} $$

Euclidean Distance
$$ d(x, y) = \sqrt{\sum_{i=1}^n (x_i - y_i)^2} $$
$$ d^2(x, y) = \sum_{i=1}^n (x_i - y_i)^2 = \sum_{i=1}^n x_i^2 - 2 \sum_{i=1}^n x_i y_i + \sum_{i=1}^n y_i^2 $$

When each observation is scaled to have mean zero and standard deviation one, $\sum_{i=1}^n x_i^2 = \sum_{i=1}^n y_i^2 = n$, so
$$ r_{xy} = \frac{1}{n} \sum_{i=1}^n x_i y_i $$
$$ d^2(x, y) = n - 2 \sum_{i=1}^n x_i y_i + n = 2n - 2 \sum_{i=1}^n x_i y_i = 2n \left( 1 - \frac{1}{n} \sum_{i=1}^n x_i y_i \right) = 2n \left( 1 - r_{xy} \right) $$
Therefore $1 - r_{xy}$ is proportional to the squared Euclidean distance, with constant of proportionality $2n$.
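
A quick numerical check of this identity (a sketch with hypothetical names; any row-standardized matrix works in place of USArrests):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                      # stand-in for USArrests
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

n = X.shape[1]
r = np.corrcoef(X)                                # observation-by-observation correlations
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # squared Euclidean distances

# d^2(x, y) = 2n * (1 - r_xy) should hold to machine precision.
assert np.allclose(d2, 2 * n * (1 - r))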

Exercise 2

A formula for calculating PVE is given below. PVE can also be obtained by accessing the explained_variance_ratio_ attribute of the PCA function. On the USArrests data, calculate PVE in two ways:
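
For reference, the equation in question is the standard PVE definition, stated for centered data $x_{ij}$ with loadings $\phi_{jm}$:
$$ PVE_m = \frac{\sum_{i=1}^n \left( \sum_{j=1}^p \phi_{jm} x_{ij} \right)^2}{\sum_{j=1}^p \sum_{i=1}^n x_{ij}^2} $$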

  1. By accessing the explained_variance_ratio_ attribute of the PCA function.
  2. By applying the Equation directly. That is, use the PCA function to compute the principal component loadings. Then, use those loadings in the Equation to obtain the PVE.

1) By accessing the explained_variance_ratio_ attribute of the PCA function.


In [ ]:
arrests.n_components = 4
arrests.calc_pca()
arrests.var_pct

In [ ]:
plotting.pca_variance(arrests.var_pct)

2) By applying the Equation directly. That is, use the PCA function to compute the principal component loadings. Then, use those loadings in the Equation to obtain the PVE.


In [ ]:
arrests.calc_pca_eq()
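
For comparison, a minimal standalone sketch of both computations with scikit-learn (assuming arrests.data is the numeric USArrests DataFrame; calc_pca_eq's internals are not shown here):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

X = scale(arrests.data)                    # center and scale each variable

pca = PCA(n_components=4).fit(X)
pve_attr = pca.explained_variance_ratio_   # way 1: the attribute

scores = X @ pca.components_.T             # way 2: scores from the loadings
pve_eq = (scores ** 2).sum(axis=0) / (X ** 2).sum()

# Both routes give the same proportions of variance explained.
assert np.allclose(pve_attr, pve_eq)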

Exercise 3

Consider the “USArrests” data. We will now perform hierarchical clustering on the states.

  1. Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
  2. Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
  3. Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
  4. What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.

1) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.


In [ ]:
plotting.agglomerative_dendrogram_plot(
    data=arrests.data,
    labels=arrests.data.index,
    title='US Arrests',
    method='complete',
    metric='euclidean',
)

2) Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?


In [ ]:
arrests.hierarchical_cluster(n_clusters=3,
                             criterion='maxclust',
                             method='complete',
                             metric='euclidean')
arrests.us_map_clusters()
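
The same cut can be reproduced with SciPy alone (a sketch; the k2datascience wrapper may differ in details):

import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

Z = linkage(arrests.data, method='complete', metric='euclidean')
clusters = fcluster(Z, t=3, criterion='maxclust')   # cut into 3 clusters

# List the states belonging to each cluster.
pd.Series(clusters, index=arrests.data.index).sort_values()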

3) Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.


In [ ]:
plotting.agglomerative_dendrogram_plot(
    data=arrests.std_x,
    labels=arrests.data.index,
    title='US Arrests',
    method='complete',
    metric='euclidean',
)

In [ ]:
# Temporarily swap the standardized values into arrests.data so the
# cluster map is built from the scaled variables, then restore the
# original data.
original_data = arrests.data
arrests.data = pd.DataFrame(arrests.std_x, index=arrests.data.index)
arrests.hierarchical_cluster(n_clusters=3,
                             criterion='maxclust',
                             method='complete',
                             metric='euclidean')
arrests.us_map_clusters()
arrests.data = original_data
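
A scaled variant of the SciPy sketch above, standardizing each variable to unit standard deviation before the dissimilarities are computed:

from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.preprocessing import scale

Z = linkage(scale(arrests.data), method='complete', metric='euclidean')
clusters_scaled = fcluster(Z, t=3, criterion='maxclust')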

4) What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.

FINDINGS
  • Once standardized, the data separates into more uniform clusters.
  • The variables should be standardized so that no single variable dominates the dissimilarities merely because it takes larger-magnitude values.

Exercise 4

In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.

  1. Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables.
  2. Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (3). If not, then return to part (1) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (3) until the three classes show at least some separation in the first two principal component score vectors.
  3. Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtained in K-means clustering compare to the true class labels?
  4. Perform K-means clustering with K = 2. Describe your results.
  5. Now perform K-means clustering with K = 4, and describe your results.
  6. Now perform K-means clustering with K = 3 on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the 60x2 matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.
  7. Using the scale() function, perform K-means clustering with K = 3 on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (2)? Explain.

1) Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables.


In [ ]:
sim = cluster.Simulated()
sim.data.head()
sim.data.describe()
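
One way such a data set can be built (a sketch; cluster.Simulated's actual construction is not shown here) is to shift each class's mean so the groups are separable:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
shifts = (0, 5, 10)                  # per-class mean offsets drive the separation
data = np.vstack([rng.normal(loc=s, size=(20, 50)) for s in shifts])
labels = np.repeat(range(3), 20)     # true class of each observation
sim_df = pd.DataFrame(data)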

2) Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (3). If not, then return to part (1) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (3) until the three classes show at least some separation in the first two principal component score vectors.


In [ ]:
sim.calc_pca()
plotting.pca_variance(sim.var_pct)

In [ ]:
sim.plot_pca()
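
A standalone version of this plot, using the data and labels from the simulation sketch above:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

scores = PCA(n_components=2).fit_transform(data)
plt.scatter(scores[:, 0], scores[:, 1], c=labels, cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('First Two Principal Component Score Vectors')
plt.show()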

3) Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtained in K-means clustering compare to the true class labels?


In [ ]:
sim.calc_kmeans(sim.data, n_clusters=3)
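
A scikit-learn equivalent, again using the simulated data and labels from above (K-means cluster numbers are arbitrary, so read the table up to a permutation of the columns):

import pandas as pd
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
pd.crosstab(labels, km.labels_, rownames=['true'], colnames=['kmeans'])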

4) Perform K-means clustering with K = 2. Describe your results.


In [ ]:
sim.calc_kmeans(sim.data, n_clusters=2)

5) Now perform K-means clustering with K = 4, and describe your results.


In [ ]:
sim.calc_kmeans(sim.data, n_clusters=4)

6) Now perform K-means clustering with K = 3 on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the 60x2 matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.


In [ ]:
sim.calc_kmeans(sim.trans[:, [0, 1]], n_clusters=3)

7) Using the scale() function, perform K-means clustering with K = 3 on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (2)? Explain.


In [ ]:
sim.calc_kmeans(sim.std_x, n_clusters=3)

Exercise 5

We will use a gene expression data set that consists of 40 tissue samples with measurements on 1000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group.

  1. Load the data.
  2. Apply hierarchical clustering to the samples using correlation-based distance, and plot the dendrogram. Do the genes separate the samples into two groups? Do your results depend on the type of linkage used?
  3. Your collaborator wants to know which genes differ the most across the two groups. Suggest a way to answer this question, and apply it here.

1) Load the data.

Completed Above

2) Apply hierarchical clustering to the samples using correlation-based distance, and plot the dendrogram. Do the genes separate the samples into two groups? Do your results depend on the type of linkage used?


In [ ]:
plotting.agglomerative_dendrogram_plot(
    data=genes.data,
    labels=list(range(1, 41)),
    title='Genes (Complete)',
    method='complete',
    metric='correlation',
)

In [ ]:
plotting.agglomerative_dendrogram_plot(
    data=genes.data,
    labels=list(range(1, 41)),
    title='Genes (Average)',
    method='average',
    metric='correlation',
)

In [ ]:
plotting.agglomerative_dendrogram_plot(
    data=genes.data,
    labels=list(range(1, 41)),
    title='Genes (Single)',
    method='single',
    metric='correlation',
)

In [ ]:
plotting.agglomerative_dendrogram_plot(
    data=genes.data,
    labels=list(range(1, 41)),
    title='Genes',
    method='ward',
    metric='euclidean',
)
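
The same correlation-distance dendrograms can be drawn directly with SciPy (a sketch, shown here for complete linkage; assumes the samples are the rows of genes.data):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(genes.data, method='complete', metric='correlation')
dendrogram(Z, labels=list(range(1, 41)))
plt.title('Genes (Complete, Correlation Distance)')
plt.show()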

3) Your collaborator wants to know which genes differ the most across the two groups. Suggest a way to answer this question, and apply it here.


In [ ]:
# Run PCA on the raw expression values by substituting them for the
# standardized matrix that calc_pca normally consumes.
genes.std_x = genes.data
genes.n_components = None
genes.calc_pca()

genes.var_pct.head()

In [ ]:
genes.unique_genes()
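
One concrete way to rank the genes (a sketch; unique_genes' exact criterion is not shown here) is by the magnitude of their loadings on the first principal component, since that component separates the two groups:

import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=1).fit(genes.data)
loadings = pca.components_[0]                        # one loading per gene
top_genes = np.argsort(np.abs(loadings))[::-1][:10]  # ten most influential genes
top_genes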