In [ ]:
# Week 10 - Dimensionality reduction and clustering

## Learning Objectives

* List options available for dimensionality reduction in scikit-learn
* Discuss different clustering algorithms
* Demonstrate clustering in scikit-learn

In [1]:
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

Dimensionality reduction

Many types of data can contain a massive number of features. Whether it is individual pixels in images, transcripts or proteins in -omics data, or word occurrences in text data, this bounty of features brings with it several challenges.

Visualizing more than 4 dimensions directly is difficult, which complicates data analysis and exploration. In machine learning models we run the risk of overfitting to the data and producing a model that does not generalize to new observations. There are two main approaches to handling this situation:

  • Identify important features and discard less important features
  • Transform the data into a lower dimensional space

Identify important features

Feature selection can be used to choose the most informative features. This can improve the performance of subsequent models, reduce overfitting and have practical advantages when the model is ready to be utilized. For example, RT-qPCR on a small number of transcripts will be faster and cheaper than RNAseq, and similarly targeted mass spectrometry such as MRM on a limited number of proteins will be cheaper, faster and more accurate than data independent acquisition mass spectrometry.

There are a variety of approaches for feature selection:

  • Remove uninformative features (same value for all, or nearly all, samples)
  • Remove features that perform poorly at the task when used alone
  • Iteratively remove the weakest features from a model until the desired number is reached

In [2]:
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])
print(X)


[[0 0 1]
 [0 1 0]
 [1 0 0]
 [0 1 1]
 [0 1 0]
 [0 1 1]]

In [3]:
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
X_selected = sel.fit_transform(X)
print(X_selected)


[[0 1]
 [1 0]
 [0 0]
 [1 1]
 [1 0]
 [1 1]]

In [3]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)
print(y)


(150, 4)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

In [5]:
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_new.shape)


(150, 2)

When iteratively removing weak features the choice of model is important. We will discuss the different models available for regression and classification next week but there are a few points relevant to feature selection we will cover here.

A linear model is useful and easily interpreted, and when used for feature selection L1 regularization should be applied. L1 regularization penalizes large coefficients based on their absolute values. This favors a sparse model, with weak features receiving coefficients close to zero. In contrast, L2 regularization penalizes large coefficients based on their squared values, which tends to favor many small coefficients rather than a smaller set of larger ones.

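As a quick illustration of this difference, the following sketch (synthetic data, not part of the original notebook) fits Lasso (L1) and Ridge (L2) models to the same regression problem and counts their non-zero coefficients.

In [ ]:
# Minimal sketch: L1 (Lasso) versus L2 (Ridge) coefficients on synthetic data
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X_demo, y_demo = make_regression(n_samples=200, n_features=20, n_informative=5,
                                 noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)
ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)

# L1 drives most coefficients to exactly zero; L2 shrinks them but keeps them non-zero
print("Non-zero Lasso coefficients:", np.sum(lasso.coef_ != 0))
print("Non-zero Ridge coefficients:", np.sum(ridge.coef_ != 0))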

In [6]:
from sklearn import linear_model
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE


# Load the digits dataset
digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

# Create the RFE object and rank each pixel
# Note: recent scikit-learn versions require a solver that supports the L1 penalty (e.g. liblinear)
clf = linear_model.LogisticRegression(C=1.0, penalty='l1', solver='liblinear', tol=1e-6)
rfe = RFE(estimator=clf, n_features_to_select=1, step=1)
rfe.fit(X, y)
ranking = rfe.ranking_.reshape(digits.images[0].shape)

# Plot pixel ranking
plt.matshow(ranking)
plt.colorbar()
plt.title("Ranking of pixels with RFE")
plt.show()


The disadvantage of L1 regularization is that if multiple features are correlated, only one of them will receive a high coefficient.

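A minimal sketch of this behaviour on synthetic data (not from the original notebook): two nearly identical features are created, and the L1-penalized model typically assigns weight to only one of them.

In [ ]:
# Minimal sketch: with two nearly identical features, an L1-penalized model
# tends to give weight to only one of them
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
x1 = rng.randn(200)
x2 = x1 + 0.01 * rng.randn(200)   # x2 is almost a copy of x1
x3 = rng.randn(200)
X_corr = np.column_stack([x1, x2, x3])
y_corr = 3 * x1 + 2 * x3 + 0.1 * rng.randn(200)

lasso = Lasso(alpha=0.1).fit(X_corr, y_corr)
print(lasso.coef_)   # typically one of the first two coefficients is (near) zero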

In [7]:
# Note: RandomizedLogisticRegression was deprecated and later removed from scikit-learn,
# so this cell only runs with older versions of the library
from sklearn.linear_model import RandomizedLogisticRegression
randomized_logistic = RandomizedLogisticRegression()

# Load the digits dataset
digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

randomized_logistic.fit(X, y)

ranking = randomized_logistic.scores_.reshape(digits.images[0].shape)


# Plot pixel ranking
plt.matshow(ranking)
plt.colorbar()
plt.title("Ranking of pixels")
plt.show()

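Since RandomizedLogisticRegression is no longer available in current scikit-learn releases, a rough sketch of the same stability-selection idea is shown below: refit an L1-penalized logistic regression on random subsamples and record how often each pixel receives a non-zero coefficient. The resample count and subsample size here are illustrative choices, not values from the original notebook.

In [ ]:
# Sketch of a stability-selection style approach using repeated subsampling
from sklearn.linear_model import LogisticRegression

digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

rng = np.random.RandomState(0)
n_resamples = 20
selection_counts = np.zeros(X.shape[1])

for _ in range(n_resamples):
    # fit an L1-penalized model on a random half of the samples
    idx = rng.choice(len(X), size=len(X) // 2, replace=False)
    clf = LogisticRegression(C=1.0, penalty='l1', solver='liblinear', tol=1e-3)
    clf.fit(X[idx], y[idx])
    # a pixel counts as selected if any class gives it a non-zero coefficient
    selection_counts += np.any(clf.coef_ != 0, axis=0)

# Plot the fraction of resamples in which each pixel was selected
plt.matshow((selection_counts / n_resamples).reshape(digits.images[0].shape))
plt.colorbar()
plt.title("Pixel selection frequency")
plt.show()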

It is also important to normalize the means and variances of the features before comparing the coefficients. The scaling approaches we covered last week are therefore crucial when selecting features from a linear model.
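
A minimal sketch of one way to do this, combining a scaler and a linear model in a pipeline so that the coefficient magnitudes become comparable (the pipeline shown is illustrative, not from the original notebook):

In [ ]:
# Minimal sketch: standardize features before comparing linear model coefficients
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X_iris, y_iris = iris.data, iris.target

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_iris, y_iris)

# With standardized features the coefficient magnitudes are directly comparable
print(pipe.named_steps['logisticregression'].coef_)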

A limitation of linear models is that any interactions must be hand coded. A feature that is poorly predictive overall may be very powerful within a limited subgroup. A linear model may miss this, discarding a feature we would prefer to keep.

Any model exposing a coef_ or feature_importances_ attribute can be used with the SelectFromModel class for feature selection. Forests of randomized decision trees handle interactions well and, unlike some of the other models, do not require careful tuning of parameters to achieve reasonable performance.
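
A minimal sketch of SelectFromModel with a random forest (the median threshold here is an arbitrary illustrative choice):

In [ ]:
# Minimal sketch: SelectFromModel keeps features whose importance exceeds a threshold
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

selector = SelectFromModel(RandomForestClassifier(n_estimators=100), threshold='median')
X_reduced = selector.fit_transform(X, y)
print(X.shape, '->', X_reduced.shape)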


In [8]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)

# Load the digits dataset
digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

clf.fit(X, y)

ranking = clf.feature_importances_.reshape(digits.images[0].shape)

# Plot pixel ranking
plt.matshow(ranking)
plt.colorbar()
plt.title("Ranking of pixels")
plt.show()


Transformation into lower dimensional space

An alternative approach is to transform the data in such a way that the variance observed in the features is retained while using a smaller number of dimensions. Because this approach still uses all of the features, it does not simplify the overall process from data acquisition onwards. It can, however, improve the performance of subsequent algorithms and is a very popular approach for visualization and the initial data analysis phase of a project.

The classical method is Principal Component Analysis (PCA), although other algorithms are also available. Given a set of features, usually at least some will be weakly correlated. By performing an orthogonal transformation, a reduced number of uncorrelated features can be chosen that retains as much of the variation in the original data as possible.

An orthogonal transformation simply means that the data is rotated and/or reflected; distances between points are preserved.
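
A small check of this property (a sketch, not from the original notebook): the rows of a fitted PCA's components_ matrix form an orthonormal set, so their product with their transpose is the identity matrix.

In [ ]:
# Minimal sketch: PCA components are orthonormal (a rotation/reflection of the axes)
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris = load_iris().data
pca = PCA().fit(X_iris)

# components_ @ components_.T should be (approximately) the identity matrix
print(np.round(pca.components_ @ pca.components_.T, 6))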

Scikit-learn has several different implementations of PCA available together with other techniques performing a similar function.

Most of these techniques are unsupervised. Linear Discriminant Analysis (LDA) is one algorithm that can include labels and will attempt to create features that account for the greatest amount of variance between classes.


In [9]:
# http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_vs_lda.html#example-decomposition-plot-pca-vs-lda-py

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()

X = iris.data
y = iris.target
target_names = iris.target_names

pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)

lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)

# Percentage of variance explained for each component
print('explained variance ratio (first two components): %s'
      % str(pca.explained_variance_ratio_))

plt.figure()
for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], c=c, label=target_name)
plt.legend(scatterpoints=1)
plt.title('PCA of IRIS dataset')

plt.figure()
for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
    plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], c=c, label=target_name)
plt.legend(scatterpoints=1)
plt.title('LDA of IRIS dataset')

plt.show()


explained variance ratio (first two components): [ 0.92461621  0.05301557]

Exercises

  1. Apply feature selection to the Olivetti faces dataset, identifying the most important 25% of features.
  2. Apply PCA and LDA to the digits dataset used above

In [10]:
from sklearn.datasets import fetch_olivetti_faces

faces = fetch_olivetti_faces()

X = faces.data
y = faces.target
print(X.shape, y.shape)

plt.matshow(X[0].reshape((64,64)))
plt.colorbar()
plt.title("Face")
plt.show()

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)

clf.fit(X, y)

rankings = clf.feature_importances_


plt.matshow(rankings.reshape((64,64)))
plt.colorbar()
plt.title("Pixel importances")
plt.show()

p75 = np.percentile(rankings, 75)

mask = rankings > p75

plt.matshow(mask.reshape((64,64)))
plt.colorbar()
plt.title("Pixel importances")
plt.show()


(400, 4096) (400,)

In [17]:
# http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_vs_lda.html#example-decomposition-plot-pca-vs-lda-py

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

digits = datasets.load_digits()

X = digits.data
y = digits.target
target_names = digits.target_names

pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)

lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)

# Percentage of variance explained for each component
print('explained variance ratio (first two components): %s'
      % str(pca.explained_variance_ratio_))

plt.figure()
for c, i, target_name in zip("rgbcmykrgbcmyk", range(10), target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], c=c, label=target_name)
plt.legend(scatterpoints=1)
plt.title('PCA of digits dataset')

plt.figure()
for c, i, target_name in zip("rgbcmykrgbcmyk", range(10), target_names):
    plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], c=c, label=target_name)
plt.legend(scatterpoints=1)
plt.title('LDA of digits dataset')

plt.show()


explained variance ratio (first two components): [ 0.14890594  0.13618771]
C:\Users\stree\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")

In [ ]:

Clustering

In clustering we attempt to group observations in such a way that observations assigned to the same cluster are more similar to each other than to observations in other clusters.

Although labels may be known, clustering is usually performed on unlabeled data as a step in exploratory data analysis.

Previously we looked at the Otsu thresholding method as a basic example of clustering. This is very closely related to k-means clustering. A variety of other methods are available with different characteristics.

The best method to use will vary depending on the particular problem.


In [12]:
import matplotlib
import matplotlib.pyplot as plt

from skimage.data import camera
from skimage.filters import threshold_otsu


matplotlib.rcParams['font.size'] = 9


image = camera()
thresh = threshold_otsu(image)
binary = image > thresh

#fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(8, 2.5))
fig = plt.figure(figsize=(8, 2.5))
ax1 = plt.subplot(1, 3, 1, adjustable='box')  # 'box-forced' was removed in newer matplotlib
ax2 = plt.subplot(1, 3, 2)
ax3 = plt.subplot(1, 3, 3, sharex=ax1, sharey=ax1, adjustable='box')

ax1.imshow(image, cmap=plt.cm.gray)
ax1.set_title('Original')
ax1.axis('off')

ax2.hist(image.ravel(), bins=256)  # flatten the image so hist sees pixel values, not columns
ax2.set_title('Histogram')
ax2.axvline(thresh, color='r')

ax3.imshow(binary, cmap=plt.cm.gray)
ax3.set_title('Thresholded')
ax3.axis('off')

plt.show()


Different clustering algorithms

A visual comparison of these algorithms on several toy datasets is available in the scikit-learn cluster comparison example (http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html).

The following algorithms are provided by scikit-learn:

  • K-means
  • Affinity propagation
  • Mean Shift
  • Spectral clustering
  • Ward
  • Agglomerative Clustering
  • DBSCAN
  • Birch

K-means clustering divides samples between clusters by attempting to minimize the within-cluster sum of squares. It is an iterative algorithm that repeatedly updates the positions of the centroids (cluster centers) and re-assigns samples to the nearest centroid, stopping when the assignments no longer change (a local optimum). The result depends on the starting positions of the centroids, so k-means is often run multiple times with random initialization and the best solution chosen.
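
A minimal numpy sketch of the two alternating steps (assignment and centroid update) on synthetic points; scikit-learn's KMeans adds smarter initialization (k-means++) and multiple restarts.

In [ ]:
# Minimal sketch of the k-means loop: assign points to the nearest centroid, then
# move each centroid to the mean of its assigned points
rng = np.random.RandomState(0)
points = np.vstack([rng.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
k = 3
centroids = points[rng.choice(len(points), k, replace=False)]

for _ in range(10):
    # assignment step: each point goes to its nearest centroid
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # update step: each centroid moves to the mean of its assigned points
    centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])

print(centroids)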

Affinity Propagation operates by passing messages between the samples, updating a record of the exemplar samples. These are samples that best represent other samples. The algorithm functions on an affinity matrix that can be either user supplied or computed by the algorithm. Two matrices are maintained. One matrix records how well each sample represents other samples in the dataset; when the algorithm finishes, the highest scoring samples are chosen to represent the clusters. The second matrix records which other samples best represent each sample, so that the entire dataset can be assigned to a cluster when the algorithm terminates.
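
A minimal usage sketch on synthetic blobs (parameter choices are illustrative): because the algorithm picks its own exemplars, the number of clusters does not need to be specified in advance.

In [ ]:
# Minimal sketch: AffinityPropagation chooses exemplar samples itself
from sklearn import cluster, datasets

blobs, _ = datasets.make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)
ap = cluster.AffinityPropagation()
ap_labels = ap.fit_predict(blobs)

print("Exemplar indices:", ap.cluster_centers_indices_)
print("Number of clusters found:", len(ap.cluster_centers_indices_))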

Mean Shift iteratively updates candidate centroids to represent the clusters. The algorithm attempts to find areas of higher density.
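
A minimal usage sketch, with the bandwidth (the parameter controlling how far the density search extends) estimated from the data; the quantile value is an illustrative choice.

In [ ]:
# Minimal sketch: MeanShift with a bandwidth estimated from the data
from sklearn import cluster, datasets

blobs, _ = datasets.make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)
bandwidth = cluster.estimate_bandwidth(blobs, quantile=0.2)
ms = cluster.MeanShift(bandwidth=bandwidth)
ms_labels = ms.fit_predict(blobs)

print("Number of clusters found:", len(np.unique(ms_labels)))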

Spectral clustering operates on an affinity matrix that can be user supplied or computed by the model. The algorithm functions by minimizing the value of the links cut in a graph created from the affinity matrix. By focusing on the relationships between samples this algorithm performs well for non-convex clusters.
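
A minimal sketch of this on the classic two-moons dataset (a non-convex example, not from the original notebook), comparing k-means with spectral clustering using a nearest-neighbors affinity.

In [ ]:
# Minimal sketch: spectral clustering handles interleaved, non-convex clusters
# that k-means splits incorrectly
from sklearn import cluster, datasets

moons, _ = datasets.make_moons(n_samples=200, noise=0.05, random_state=0)

km_labels = cluster.KMeans(n_clusters=2).fit_predict(moons)
sc_labels = cluster.SpectralClustering(n_clusters=2, affinity='nearest_neighbors').fit_predict(moons)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(moons[:, 0], moons[:, 1], c=km_labels)
ax1.set_title('K-means')
ax2.scatter(moons[:, 0], moons[:, 1], c=sc_labels)
ax2.set_title('Spectral clustering')
plt.show()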

Ward is a type of agglomerative clustering using minimization of the within-cluster sum of squares to join clusters together until the specified number of clusters remain.

Agglomerative clustering starts with each sample in its own cluster and then progressively joins clusters together, minimizing some linkage criterion. In addition to minimizing the variance, as with Ward, the other options are: 1) minimizing the average distance between samples in each pair of clusters, and 2) minimizing the maximum distance between observations in each pair of clusters.
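
A minimal sketch comparing the three linkage criteria on synthetic blobs (illustrative parameters):

In [ ]:
# Minimal sketch: the three linkage criteria described above
from sklearn import cluster, datasets

blobs, _ = datasets.make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)

for linkage in ('ward', 'average', 'complete'):
    agg = cluster.AgglomerativeClustering(n_clusters=3, linkage=linkage)
    agg_labels = agg.fit_predict(blobs)
    print(linkage, np.bincount(agg_labels))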

DBSCAN is another algorithm that attempts to find regions of high density and then expands the clusters from there.
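
A minimal usage sketch; eps and min_samples (the density parameters) are illustrative values that would normally need tuning.

In [ ]:
# Minimal sketch: DBSCAN finds the number of clusters itself and labels
# low-density points as noise (label -1)
from sklearn import cluster, datasets

blobs, _ = datasets.make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)
db = cluster.DBSCAN(eps=0.5, min_samples=5)
db_labels = db.fit_predict(blobs)

print("Clusters found:", len(set(db_labels)) - (1 if -1 in db_labels else 0))
print("Noise points:", np.sum(db_labels == -1))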

Birch is a tree-based clustering algorithm, assigning samples to nodes on a tree of cluster summaries.
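
A minimal usage sketch of Birch (the threshold value is an illustrative choice):

In [ ]:
# Minimal sketch: Birch clustering on synthetic blobs
from sklearn import cluster, datasets

blobs, _ = datasets.make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)
birch = cluster.Birch(n_clusters=3, threshold=0.5)
birch_labels = birch.fit_predict(blobs)

print(np.bincount(birch_labels))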


In [13]:
from sklearn import cluster, datasets

dataset, true_labels = datasets.make_blobs(n_samples=200, n_features=2, random_state=0,
                                           centers=3, cluster_std=0.1)

fig, ax = plt.subplots(1,1)
ax.scatter(dataset[:,0], dataset[:,1], c=true_labels)
plt.show()

# Clustering algorithm can be used as a class
means = cluster.KMeans(n_clusters=3)
prediction = means.fit_predict(dataset)

fig, ax = plt.subplots(1,1)
ax.scatter(dataset[:,0], dataset[:,1], c=prediction)
plt.show()


Model evaluation

Several approaches have been developed for evaluating clustering models, but most require the true cluster assignments to be known. In the typical use case for clustering, exploratory analysis of unlabeled data, this information is not available.

Ultimately, a model is just a tool to better understand the structure of our data. If we are able to gain insight from using a clustering algorithm then it has served its purpose.

The metrics available include the Adjusted Rand Index, mutual information based scores, homogeneity, completeness, V-measure, and the silhouette coefficient. Of these, only the silhouette coefficient does not require the true clusters to be known.

Although the silhouette coefficient can be useful, it takes a very similar approach to k-means, favoring convex clusters over more complex, but equally valid, cluster shapes.
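
A minimal sketch contrasting an external metric (the adjusted Rand index, which needs the true labels) with the silhouette coefficient (which needs only the data and the predicted labels); the blob parameters are illustrative.

In [ ]:
# Minimal sketch: external metric (needs true labels) vs silhouette coefficient
from sklearn import cluster, datasets, metrics

blobs, blob_labels = datasets.make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)
pred = cluster.KMeans(n_clusters=3).fit_predict(blobs)

print("Adjusted Rand index:", metrics.adjusted_rand_score(blob_labels, pred))
print("Silhouette coefficient:", metrics.silhouette_score(blobs, pred))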

How to determine number of clusters

One important use for the model evaluation algorithms is in choosing the number of clusters. The clustering algorithms take as parameters either the number of clusters to partition a dataset into or other scaling factors that ultimately determine the number of clusters. It is left to the user to determine the correct value for these parameters.

As the number of clusters increases the fit to the data will always improve, until each point is in a cluster by itself. As such, classical optimization algorithms searching for a minimum or maximum score will not work. Often, the goal is instead to find an inflection point (the 'elbow') in the curve.

If the cluster parameter is too low adding an additional cluster will have a large impact on the evaluation score. The gradient will be high at numbers of clusters less than the true value. If the cluster parameter is too high adding an additional cluster will have a small impact on the evaluation score. The gradient will be low at numbers of clusters higher than the true value.

At the correct number of clusters the gradient changes suddenly; this is the inflection point.


In [14]:
from sklearn import cluster, datasets, metrics


dataset, true_labels = datasets.make_blobs(n_samples=200, n_features=2, random_state=0,
                                           centers=[[x,y] for x in range(3) for y in range(3)], cluster_std=0.1)


fig, ax = plt.subplots(1,1)
ax.scatter(dataset[:,0], dataset[:,1], c=true_labels)
plt.show()



In [15]:
inertia = []
predictions = []
for i in range(1,20):
    means = cluster.KMeans(n_clusters=i)
    prediction = means.fit_predict(dataset)
    inertia.append(means.inertia_)
    predictions.append(prediction)


plt.plot(range(1,20), inertia)
plt.show()



In [16]:
plt.scatter(dataset[:,0], dataset[:,1], c=predictions[8])
plt.show()


This is an ideal case with clusters that can be clearly distinguished - convex clusters with similar distributions and large gaps between the clusters. Most real world datasets will not be as easy to work with and determining the correct number of clusters will be more challenging. As an example, compare the performance between challenges 1 and 2 (unknown number of clusters) and challenge 3 (known number of clusters) in table 2 of this report on automated FACS.

There is a pull request on the scikit-learn repository to add several automated algorithms to identify the correct number of clusters but it has not been integrated.

Exercises

  1. Using the grid of blobs sample dataset investigate how different cluster shapes and distances alter the plot of inertia with number of clusters. Is the plot still interpretable if the distances are 1/2, 1/4, 1/8, etc? If the variance in the first and second dimensions is unequal is the plot still interpretable?
  2. Does using a different algorithm improve performance?

In [39]:
from sklearn import cluster, datasets, metrics


fig, ax = plt.subplots(3,3, figsize=(15,15))
for i in range(3):
    print(i, 3/2**i, 1/2**i)
    dataset, true_labels = datasets.make_blobs(n_samples=200, n_features=2, random_state=0,
                                           centers=[[x,y] for x in np.arange(0,3/2**i,1/2**i) for y in np.arange(0,3/2**i,1/2**i)], 
                                           cluster_std=0.1)

    ax[0,i].scatter(dataset[:,0], dataset[:,1], c=true_labels)

    inertia = []
    predictions = []
    for j in range(1,20):
        means = cluster.KMeans(n_clusters=j)
        prediction = means.fit_predict(dataset)
        inertia.append(means.inertia_)
        predictions.append(prediction)

    ax[1,i].plot(range(1,20), inertia)


    ax[2,i].scatter(dataset[:,0], dataset[:,1], c=predictions[8])
plt.show()


0 3.0 1.0
1 1.5 0.5
2 0.75 0.25

As the distance between clusters is reduced, distinguishing individual clusters becomes increasingly difficult. This is reflected in the relationship between the number of clusters and inertia.


In [43]:
from sklearn import cluster, datasets, metrics


fig, ax = plt.subplots(3,3, figsize=(15,15))
for n, i in enumerate([1,5,10]):
    print(i)
    dataset, true_labels = datasets.make_blobs(n_samples=200, n_features=2, random_state=0,
                                           centers=[[x,y] for x in np.arange(0,3,1) for y in np.arange(0,3,1)], 
                                           cluster_std=0.1)
    dataset[:,1] *= i
    ax[0,n].scatter(dataset[:,0], dataset[:,1], c=true_labels)

    inertia = []
    predictions = []
    for j in range(1,20):
        means = cluster.KMeans(n_clusters=j)
        prediction = means.fit_predict(dataset)
        inertia.append(means.inertia_)
        predictions.append(prediction)

    ax[1,n].plot(range(1,20), inertia)


    ax[2,n].scatter(dataset[:,0], dataset[:,1], c=predictions[8])
plt.show()


1
5
10

In these plots the x and y axes are scaled differently, matching the different variances in those dimensions. The relationship between the number of clusters and inertia is still interpretable. Looking at the cluster assignments, however, indicates that k-means does not handle the unequal variances in the clusters well.


In [54]:
from sklearn import cluster, datasets, metrics


fig, ax = plt.subplots(3,3, figsize=(15,15))

for i in range(3):
    print(i, 3/2**i, 1/2**i)
    dataset, true_labels = datasets.make_blobs(n_samples=200, n_features=2, random_state=0,
                                           centers=[[x,y] for x in np.arange(0,3/2**i,1/2**i) for y in np.arange(0,3/2**i,1/2**i)], 
                                           cluster_std=0.1)

    ax[0,i].scatter(dataset[:,0], dataset[:,1], c=true_labels)

    inertia = []
    predictions = []
    for j in range(2,20):
        #means = cluster.KMeans(n_clusters=j)
        sc = cluster.SpectralClustering(n_clusters=j)
        prediction = sc.fit_predict(dataset)
        inertia.append(metrics.silhouette_score(dataset, prediction, metric='sqeuclidean'))
        predictions.append(prediction)

    ax[1,i].plot(range(2,20), inertia)


    ax[2,i].scatter(dataset[:,0], dataset[:,1], c=predictions[7])
plt.show()


0 3.0 1.0
1 1.5 0.5
2 0.75 0.25

Because SpectralClustering does not expose an inertia_ attribute, the silhouette score is used to evaluate the clusterings instead. In this case there is little difference between k-means and spectral clustering.


In [ ]: