In [1]:
%matplotlib inline
In [2]:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
mpl.rcParams["figure.figsize"] = (9,6)
The Yellowbrick library is a diagnostic visualization platform for machine learning that allows data scientists to steer the model selection process. It extends the scikit-learn API with a new core object: the Visualizer
. Visualizers allow models to be fit and transformed as part of the scikit-learn pipeline process, providing visual diagnostics throughout the transformation of high-dimensional data.
In machine learning, clustering models are unsupervised methods that attempt to detect patterns in unlabeled data. There are two primary classes of clustering algorithms: agglomerative clustering which links similar data points together, and centroidal clustering which attempts to find centers or partitions in the data.
Currently, Yellowbrick provides two visualizers to evaluate centroidal mechanisms, particularly K-Means clustering, that help users discover an optimal $K$ parameter in the clustering metric:
KElbowVisualizer
visualizes the clusters according to a scoring function, looking for an "elbow" in the curve. SilhouetteVisualizer
visualizes the silhouette scores of each cluster in a single model.
In [3]:
# Generate synthetic dataset with 8 blobs
X, y = make_blobs(n_samples=1000, n_features=12, centers=8, shuffle=True, random_state=42)
K-Means is a simple unsupervised machine learning algorithm that groups data into the number $K$ of clusters specified by the user, even if it is not the optimal number of clusters for the dataset.
Yellowbrick's KElbowVisualizer
implements the “elbow” method of selecting the optimal number of clusters by fitting the K-Means model with a range of values for $K$. If the line chart looks like an arm, then the “elbow” (the point of inflection on the curve) is a good indication that the underlying model fits best at that point.
In the following example, the KElbowVisualizer
fits the model for a range of $K$ values from 4 to 11, which is set by the parameter k=(4,12)
. When the model is fit with 8 clusters we can see an "elbow" in the graph, which in this case we know to be the optimal number since we created our synthetic dataset with 8 clusters of points.
In [4]:
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof() # Draw/show/poof the data
By default, the scoring parameter metric
is set to distortion
, which computes the sum of squared distances from each point to its assigned center. However, two other metrics can also be used with the KElbowVisualizer
—silhouette
and calinski_harabaz
. The silhouette
score is the mean silhouette coefficient for all samples, while the calinski_harabaz
score computes the ratio of dispersion between and within clusters.
The KElbowVisualizer
also displays the amount of time to fit the model per $K$, which can be hidden by setting timings=False
. In the following example, we'll use the calinski_harabaz
score and hide the time to fit the model.
In [5]:
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12), metric='calinski_harabaz', timings=False)
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof() # Draw/show/poof the data
It is important to remember that the Elbow method does not work well if the data is not very clustered. In this case, you might see a smooth curve and the optimal value of $K$ will be unclear.
You can learn more about the Elbow method at Robert Grove's Blocks.
Silhouette analysis can be used to evaluate the density and separation between clusters. The score is calculated by averaging the silhouette coefficient for each sample, which is computed as the difference between the average intra-cluster distance and the mean nearest-cluster distance for each sample, normalized by the maximum value. This produces a score between -1 and +1, where scores near +1 indicate high separation and scores near -1 indicate that the samples may have been assigned to the wrong cluster.
The SilhouetteVisualizer
displays the silhouette coefficient for each sample on a per-cluster basis, allowing users to visualize the density and separation of the clusters. This is particularly useful for determining cluster imbalance or for selecting a value for $K$ by comparing multiple visualizers.
Since we created the sample dataset for these examples, we already know that the data points are grouped into 8 clusters. So for the first SilhouetteVisualizer
example, we'll set $K$ to 8 in order to show how the plot looks when using the optimal value of $K$.
Notice that graph contains homogeneous and long silhouettes. In addition, the vertical red-dotted line on the plot indicates the average silhouette score for all observations.
In [6]:
# Instantiate the clustering model and visualizer
model = KMeans(8)
visualizer = SilhouetteVisualizer(model)
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof() # Draw/show/poof the data
For the next example, let's see what happens when using a non-optimal value for $K$, in this case, 6.
Now we see that the width of clusters 1 and 2 have both increased and their silhouette coefficient scores have dropped. This occurs because the width of each silhouette is proportional to the number of samples assigned to the cluster. The model is trying to fit our data into a smaller than optimal number of clusters, making two of the clusters larger (wider) but much less cohesive (as we can see from their below-average scores).
In [7]:
# Instantiate the clustering model and visualizer
model = KMeans(6)
visualizer = SilhouetteVisualizer(model)
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof() # Draw/show/poof the data
In [ ]: