In [1]:
%matplotlib inline
In [2]:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import InterclusterDistance, KElbowVisualizer, SilhouetteVisualizer
mpl.rcParams["figure.figsize"] = (9,6)
The Yellowbrick library is a diagnostic visualization platform for machine learning that allows data scientists to steer the model selection process. It extends the scikit-learn API with a new core object: the Visualizer. Visualizers allow models to be fit and transformed as part of the scikit-learn pipeline process, providing visual diagnostics throughout the transformation of high-dimensional data.
In machine learning, clustering models are unsupervised methods that attempt to detect patterns in unlabeled data. There are two primary classes of clustering algorithms: agglomerative clustering, which links similar data points together, and centroidal clustering, which attempts to find centers or partitions in the data.
Currently, Yellowbrick provides several visualizers for evaluating centroidal mechanisms, particularly K-Means clustering, that help users discover an optimal value for the $K$ parameter:
- KElbowVisualizer: visualizes the clusters according to a scoring function, looking for an "elbow" in the curve
- SilhouetteVisualizer: visualizes the silhouette scores of each cluster in a single model
- InterclusterDistance: visualizes the relative distance and size of clusters
In [3]:
# Generate synthetic dataset with 8 blobs
X, y = make_blobs(n_samples=1000, n_features=15, centers=8, random_state=42)
K-Means is a simple unsupervised machine learning algorithm that groups data into the number $K$ of clusters specified by the user, even if it is not the optimal number of clusters for the dataset.
Yellowbrick's KElbowVisualizer implements the "elbow" method of selecting the optimal number of clusters by fitting the K-Means model with a range of values for $K$. If the line chart looks like an arm, then the "elbow" (the point of inflection on the curve) is a good indication that the underlying model fits best at that point.
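The idea behind the elbow method can be sketched without Yellowbrick by fitting one K-Means model per candidate $K$ and recording how tightly the points sit around their centers. This is only an illustrative sketch: it uses scikit-learn's `inertia_` attribute (the sum of squared distances from each sample to its closest center) as the score, and the `n_init` and `random_state` values are assumptions chosen to make the run repeatable, not settings taken from this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic data as the notebook: 8 blobs in 15 dimensions
X, _ = make_blobs(n_samples=1000, n_features=15, centers=8, random_state=42)

# Fit one model per candidate K and record the sum of squared distances
distortions = {}
for k in range(4, 12):
    # n_init and random_state are assumptions, fixed only for repeatability
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distortions[k] = km.inertia_

# Print the curve as a table; the "elbow" is where the drop levels off
for k, d in distortions.items():
    print(f"K={k:2d}  distortion={d:,.0f}")
```

Plotting these values against $K$ gives the kind of curve in which one looks for the bend.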
In the following example, the KElbowVisualizer fits the model for a range of $K$ values from 4 to 11, which is set by the parameter k=(4,12). When the model is fit with 8 clusters we can see an "elbow" in the graph, which in this case we know to be the optimal number since we created our synthetic dataset with 8 clusters of points.
In [4]:
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))
visualizer.fit(X) # Fit the data to the visualizer
visualizer.show() # Finalize and render the figure
Out[4]:
By default, the scoring parameter metric is set to distortion, which computes the sum of squared distances from each point to its assigned center. However, two other metrics can also be used with the KElbowVisualizer: silhouette and calinski_harabasz. The silhouette score is the mean silhouette coefficient for all samples, while the calinski_harabasz score computes the ratio of dispersion between and within clusters.
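To make the three scores concrete, the sketch below computes each of them for a single fitted model using scikit-learn directly. It assumes that `KMeans.inertia_` and the `sklearn.metrics` functions are reasonable stand-ins for the quantities described above; it does not claim to reproduce Yellowbrick's internal implementation, and the `n_init` and `random_state` values are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Same synthetic data as the notebook: 8 blobs in 15 dimensions
X, _ = make_blobs(n_samples=1000, n_features=15, centers=8, random_state=42)

# n_init and random_state are assumptions, fixed only for repeatability
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
labels = km.labels_

scores = {
    # Sum of squared distances from each sample to its closest center
    "distortion": km.inertia_,
    # Mean silhouette coefficient over all samples, in [-1, 1]
    "silhouette": silhouette_score(X, labels),
    # Ratio of between-cluster to within-cluster dispersion
    "calinski_harabasz": calinski_harabasz_score(X, labels),
}

for name, value in scores.items():
    print(f"{name:>18}: {value:.3f}")
```

Note that lower is better for distortion, while higher is better for the other two scores.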
The KElbowVisualizer also displays the amount of time to fit the model per $K$, which can be hidden by setting timings=False. In the following example, we'll use the calinski_harabasz score and hide the time to fit the model.
In [5]:
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12), metric='calinski_harabasz', timings=False)
visualizer.fit(X) # Fit the data to the visualizer
visualizer.show() # Finalize and render the figure
Out[5]:
It is important to remember that the elbow method does not work well when the data has no clear cluster structure. In that case, you might see a smooth curve, and the optimal value of $K$ will be unclear.
You can learn more about the elbow method at Robert Gove's Block.
Silhouette analysis can be used to evaluate the density and separation between clusters. The score is calculated by averaging the silhouette coefficient for each sample, which is computed as the difference between the average intra-cluster distance and the mean nearest-cluster distance for each sample, normalized by the maximum value. This produces a score between -1 and +1, where scores near +1 indicate high separation and scores near -1 indicate that the samples may have been assigned to the wrong cluster.
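Written out, the coefficient for one sample is $s = (b - a) / \max(a, b)$, where $a$ is the mean distance to the other members of the sample's own cluster and $b$ is the mean distance to the members of the nearest other cluster. The sketch below computes this by hand for a few samples and compares the result with scikit-learn's `silhouette_samples`. The helper name `silhouette_of`, the smaller sample size, and the `n_init` and `random_state` values are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# A smaller dataset than the notebook's, to keep the manual loop cheap
X, _ = make_blobs(n_samples=300, n_features=15, centers=8, random_state=42)
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)


def silhouette_of(i, X, labels):
    """Silhouette coefficient of sample i, computed from its definition."""
    dists = np.linalg.norm(X - X[i], axis=1)  # Euclidean distance to every sample
    own = labels == labels[i]
    if own.sum() == 1:
        return 0.0  # convention for a cluster with a single member
    # a: mean distance to the other members of the sample's own cluster
    a = dists[own].sum() / (own.sum() - 1)
    # b: mean distance to the members of the nearest other cluster
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)


manual = np.array([silhouette_of(i, X, labels) for i in range(5)])
reference = silhouette_samples(X, labels)[:5]
print(np.allclose(manual, reference, atol=1e-6))
```

Averaging these per-sample values over the whole dataset gives the single silhouette score described above.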
The SilhouetteVisualizer displays the silhouette coefficient for each sample on a per-cluster basis, allowing users to visualize the density and separation of the clusters. This is particularly useful for determining cluster imbalance or for selecting a value for $K$ by comparing multiple visualizers.
Since we created the sample dataset for these examples, we already know that the data points are grouped into 8 clusters. So for the first SilhouetteVisualizer example, we'll set $K$ to 8 in order to show how the plot looks when using the optimal value of $K$.
Notice that the graph contains long, homogeneous silhouettes. In addition, the vertical dotted red line on the plot indicates the average silhouette score for all observations.
In [6]:
# Instantiate the clustering model and visualizer
model = KMeans(n_clusters=8)
visualizer = SilhouetteVisualizer(model)
visualizer.fit(X) # Fit the data to the visualizer
visualizer.show() # Finalize and render the figure
Out[6]:
For the next example, let's see what happens when using a non-optimal value for $K$, in this case, 6.
Now we see that clusters 1 and 2 have both become wider and that their silhouette coefficients have dropped. This occurs because the width of each silhouette is proportional to the number of samples assigned to the cluster. The model is trying to fit our data into fewer clusters than the optimal number, which makes two of the clusters larger (wider) but much less cohesive, as their below-average scores show.
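The link between silhouette width and cluster membership can also be checked numerically. The sketch below is an illustration only: the helper name `cluster_summary` and the `n_init` and `random_state` values are assumptions, and because K-Means labels are arbitrary, the cluster numbers it prints need not match those in the plots.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Same synthetic data as the notebook: 8 blobs in 15 dimensions
X, _ = make_blobs(n_samples=1000, n_features=15, centers=8, random_state=42)


def cluster_summary(X, k):
    """Return (cluster sizes, mean silhouette per cluster, overall mean)."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    coeffs = silhouette_samples(X, labels)
    sizes = np.bincount(labels, minlength=k)  # samples assigned to each cluster
    per_cluster = np.array([coeffs[labels == c].mean() for c in range(k)])
    return sizes, per_cluster, coeffs.mean()


for k in (8, 6):
    sizes, per_cluster, overall = cluster_summary(X, k)
    print(f"K={k}: overall mean silhouette = {overall:.3f}")
    for c in range(k):
        print(f"  cluster {c}: size={sizes[c]:4d}  mean silhouette={per_cluster[c]:.3f}")
```

With 1000 samples and only 6 clusters, at least one cluster has to hold more samples than any single generated blob, which is what appears as a wider silhouette in the plot.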
In [7]:
# Instantiate the clustering model and visualizer
model = KMeans(n_clusters=6)
visualizer = SilhouetteVisualizer(model)
visualizer.fit(X) # Fit the data to the visualizer
visualizer.show() # Finalize and render the figure
Out[7]:
Intercluster distance maps display an embedding of the cluster centers in two dimensions with the distance to other centers preserved, i.e. the closer two centers are in the visualization, the closer they are in the original feature space.
The clusters are sized according to a scoring metric. By default, they are sized by membership, i.e. the number of instances that belong to each center. This gives a sense of the relative importance of clusters. Note, however, that an overlap between two clusters in the 2D space does not imply that they overlap in the original feature space.
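The two ingredients of such a map, a 2D embedding of the centers and a size per cluster, can be sketched with scikit-learn alone. The example below assumes multidimensional scaling (MDS) as the embedding and membership counts as the sizing metric; it illustrates the idea and is not a reproduction of Yellowbrick's implementation. The `n_init` and `random_state` values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import MDS

# 12 blobs in 12 dimensions, clustered into 6 groups
X, _ = make_blobs(n_samples=1000, n_features=12, centers=12, random_state=42)
km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X)

# Project the 12-dimensional cluster centers into 2 dimensions.
# MDS tries to keep pairwise distances between centers close to the originals.
centers_2d = MDS(n_components=2, random_state=0).fit_transform(km.cluster_centers_)

# Size each cluster by membership: the number of samples assigned to it
membership = np.bincount(km.labels_, minlength=6)

for c, (xy, n) in enumerate(zip(centers_2d, membership)):
    print(f"cluster {c}: position=({xy[0]:7.2f}, {xy[1]:7.2f})  members={n}")
```

Because the embedding only approximates the original distances, circles drawn at these positions can overlap even when the clusters themselves do not.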
In [8]:
# Generate synthetic dataset with 12 random clusters
X, y = make_blobs(n_samples=1000, n_features=12, centers=12, random_state=42)
# Instantiate the clustering model and visualizer
model = KMeans(n_clusters=6)
visualizer = InterclusterDistance(model)
visualizer.fit(X) # Fit the data to the visualizer
visualizer.show() # Finalize and render the figure
Out[8]: