This notebook duplicates the process from Clustering Flag Data.ipynb, but implements a pipeline with OneHotEncoder instead.
Professor Bengfort put together a notebook using the UCI Machine Learning Repository flags dataset to predict the religion of a country based on the attributes of its flag.
What if we had the same data, without the religion column? Can we use unsupervised machine learning to draw some conclusions about the data?
🇦🇫🇦🇽🇦🇱🇩🇿🇦🇸🇦🇩🇦🇴🇦🇮🇦🇶🇦🇬🇦🇷🇦🇲🇦🇼🇦🇺🇦🇹🇦🇿🇧🇸🇧🇭🇧🇩🇧🇧🇧🇾🇧🇪🇧🇿🇧🇯🇧🇲🇧🇹🇧🇴🇧🇶🇧🇦🇧🇼🇧🇷🇮🇴
Here is some information about our dataset:
This data file contains details of various nations and their flags. In this file the fields are separated by spaces (not commas). With this data you can try things like predicting the religion of a country from its size and the colours in its flag.
10 attributes are numeric-valued. The remainder are either Boolean- or nominal-valued.
In [ ]:
import os
import requests
import numpy as np
import pandas as pd
import matplotlib.cm as cm
import matplotlib.pyplot as plt
from sklearn import manifold
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from time import time
In [ ]:
%matplotlib inline
pd.set_option('max_columns', 500)
Let's grab the data and set it up for analysis.
In [ ]:
# You should recognize this from the Wheat notebook
URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data"
def fetch_data(fname='flags.txt'):
    """
    Helper method to retrieve the ML Repository dataset.
    """
    response = requests.get(URL)
    outpath = os.path.abspath(fname)
    with open(outpath, 'wb') as f:
        f.write(response.content)
    return outpath
# Fetch the data if required
DATA = fetch_data()
In [ ]:
# Load data and do some simple data management
# We are going to define the names for the features and build a dictionary to convert our categorical color features.
FEATS = [
"name", "landmass", "zone", "area", "population", "language", "religion", "bars",
"stripes", "colours", "red", "green", "blue", "gold", "white", "black", "orange",
"mainhue", "circles", "crosses", "saltires", "quarters", "sunstars", "crescent",
"triangle", "icon", "animate", "text", "topleft", "botright",
]
COLOR_MAP = {"red": 1, "blue": 2, "green": 3, "white": 4, "gold": 5, "black": 6, "orange": 7, "brown": 8}
# Load Data
df = pd.read_csv(DATA, header=None, names=FEATS)
df.head()
In [ ]:
# Now we will use the dictionary to convert the categorical color columns into int values
for k, v in COLOR_MAP.items():
    df.loc[df.mainhue == k, 'mainhue'] = v
    df.loc[df.topleft == k, 'topleft'] = v
    df.loc[df.botright == k, 'botright'] = v

df.mainhue = df.mainhue.apply(int)
df.topleft = df.topleft.apply(int)
df.botright = df.botright.apply(int)
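For reference, the same conversion can be written more concisely with pandas `map`. This is just an equivalent alternative to the loops above (left commented out so it isn't run on already-converted integer columns).
In [ ]:
# Equivalent, more concise version of the conversion above (same result):
# for col in ('mainhue', 'topleft', 'botright'):
#     df[col] = df[col].map(COLOR_MAP).astype(int)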
In [ ]:
df.describe()
Clustering is an unsupervised machine learning method. This means we don't have to have a value we are predicting.
You can use clustering when you know this information as well. Scikit-learn provides a number of metrics you can employ with a "known ground truth" (i.e. the values you are predicting). We won't cover them in depth here, but you can add some cells to this notebook, create your "y" value, and explore the metrics described here; a minimal example is sketched after the next cell.
In the case of the flags data, we do have our "known ground truth". However, for the purpose of this exercise we are going to drop that information out of our data set. We will use it later with Agglomerative Clustering.
In [ ]:
feature_names = [
"landmass", "zone", "area", "population", "language", "bars",
"stripes", "colours", "red", "green", "blue", "gold", "white", "black", "orange",
"mainhue", "circles", "crosses", "saltires", "quarters", "sunstars", "crescent",
"triangle", "icon", "animate", "text", "topleft", "botright",
]
X = df[feature_names]
y = df.religion
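As suggested above, since we actually know each country's religion, we can sanity-check cluster assignments against that ground truth. This is a minimal sketch, not part of the original analysis: it assumes `X` and `y` as defined in the previous cell and uses standard scikit-learn ground-truth metrics (adjusted Rand index, homogeneity, completeness).
In [ ]:
# Hedged sketch: compare unsupervised cluster labels against the known religions.
# The cluster count of 8 matches the number of religion categories in the data description.
from sklearn.metrics import adjusted_rand_score, homogeneity_score, completeness_score

km = KMeans(n_clusters=8, random_state=1).fit(X)
print("Adjusted Rand index:", adjusted_rand_score(y, km.labels_))
print("Homogeneity:", homogeneity_score(y, km.labels_))
print("Completeness:", completeness_score(y, km.labels_))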
Let's look at KMeans clustering first.
"K-means is a simple unsupervised machine learning algorithm that groups a dataset into a user-specified number (k) of clusters. The algorithm is somewhat naive--it clusters the data into k clusters, even if k is not the right number of clusters to use. Therefore, when using k-means clustering, users need some way to determine whether they are using the right number of clusters."
One way to determine the number of clusters is the "elbow" method. Using this method, we try a range of values for k and evaluate the "variance explained as a function of the number of clusters".
In [ ]:
K = range(1, 10)
meandistortions = []

for k in K:
    clf = Pipeline([
        # one-hot encode the nominal columns: landmass, zone, language, colours, mainhue, topleft, botright
        ('encoder', OneHotEncoder(categorical_features=[0, 1, 4, 7, 15, 26, 27])),
        ('estimator', KMeans(n_clusters=k, n_jobs=-1, random_state=1))
    ])
    # KMeans.transform (the last step of fit_transform) returns each sample's
    # distance to every cluster center, so the distance to the nearest center
    # is simply the row-wise minimum.
    Z = clf.fit_transform(X)
    meandistortions.append(np.min(Z, axis=1).sum() / X.shape[0])
plt.plot(K, meandistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
plt.show()
If the line chart looks like an arm, then the "elbow" on the arm is the value of k that is the best. Our goal is to choose a small value of k that still has a low variance. The elbow usually represents where we start to have diminishing returns by increasing k.
However, the elbow method doesn't always work well, especially if the data is not very clustered.
Based on our plot, it looks like k=3 is worth looking at. But how do we measure which value of k is actually better? We can use the Silhouette Coefficient: a higher score corresponds to a model with better-defined clusters. Let's look at the silhouette score for a few different values of k.
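For intuition, here is a small illustrative sketch (not part of the original notebook) of what the score measures for a single sample: s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster.
In [ ]:
# Illustrative sketch of the per-sample silhouette value s_i = (b_i - a_i) / max(a_i, b_i).
# D is a precomputed pairwise distance matrix and `labels` are cluster assignments.
def silhouette_for_sample(D, labels, i):
    labels = np.asarray(labels)
    same = (labels == labels[i])
    same[i] = False                                  # exclude the sample itself
    a = D[i, same].mean()                            # mean intra-cluster distance
    b = min(D[i, labels == c].mean()                 # nearest other cluster's mean distance
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

# Example usage (hypothetical):
# D = euclidean_distances(X)
# silhouette_for_sample(D, labels, 0)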
In [ ]:
pipeline = Pipeline([
    ('encoder', OneHotEncoder(categorical_features=[0, 1, 4, 7, 15, 26, 27])),
    ('estimator', KMeans(n_clusters=3, n_jobs=-1, random_state=1))
])
pipeline.fit(X)
labels = pipeline.steps[1][1].labels_
silhouette_score(X, labels, metric='euclidean')
In [ ]:
pipeline = Pipeline([
    ('encoder', OneHotEncoder(categorical_features=[0, 1, 4, 7, 15, 26, 27])),
    ('estimator', KMeans(n_clusters=5, n_jobs=-1, random_state=1))
])
pipeline.fit(X)
labels = pipeline.steps[1][1].labels_
silhouette_score(X, labels, metric='euclidean')
In [ ]:
pipeline = Pipeline([
    ('encoder', OneHotEncoder(categorical_features=[0, 1, 4, 7, 15, 26, 27])),
    ('estimator', KMeans(n_clusters=4, n_jobs=-1, random_state=1))
])
pipeline.fit(X)
labels = pipeline.steps[1][1].labels_
silhouette_score(X, labels, metric='euclidean')
We can see above that k=3 has the best score.
As implemented in scikit-learn, KMeans will use 8 clusters by default. Given our data, it makes sense to try this out since our data actually has 8 potential labels (look at "religion" in the data description above). Based on the plot above, we should expect the silhouette score for k=8 to be lower than for k=4.
In [ ]:
pipeline = Pipeline([
    ('encoder', OneHotEncoder(categorical_features=[0, 1, 4, 7, 15, 26, 27])),
    ('estimator', KMeans(n_clusters=8, n_jobs=-1, random_state=1))
])
pipeline.fit(X)
labels = pipeline.steps[1][1].labels_
silhouette_score(X, labels, metric='euclidean')
We can also visualize what our clusters look like. The function below will plot the clusters and visualize their silhouette scores.
In [ ]:
# Code adapted from http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
def silhouette_plot(X, range_n_clusters=range(2, 12, 2)):
    for n_clusters in range_n_clusters:
        # Create a subplot with 1 row and 2 columns
        fig, (ax1, ax2) = plt.subplots(1, 2)
        fig.set_size_inches(18, 7)

        # The 1st subplot is the silhouette plot
        # The silhouette coefficient can range from -1, 1
        ax1.set_xlim([-.1, 1])
        # The (n_clusters+1)*10 is for inserting blank space between silhouette
        # plots of individual clusters, to demarcate them clearly.
        ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

        # Initialize the clustering pipeline with the n_clusters value and a
        # fixed random seed for reproducibility.
        clusterer = Pipeline([
            ('encoder', OneHotEncoder(categorical_features=[0, 1, 4, 7, 15, 26, 27])),
            ('estimator', KMeans(n_clusters=n_clusters, n_jobs=-1, random_state=1))
        ])
        cluster_labels = clusterer.fit_predict(X)

        # The silhouette_score gives the average value for all the samples.
        # This gives a perspective into the density and separation of the formed
        # clusters
        silhouette_avg = silhouette_score(X, cluster_labels)
        print("For n_clusters =", n_clusters,
              "The average silhouette_score is :", silhouette_avg)

        # Compute the silhouette scores for each sample
        sample_silhouette_values = silhouette_samples(X, cluster_labels)

        y_lower = 10
        for i in range(n_clusters):
            # Aggregate the silhouette scores for samples belonging to
            # cluster i, and sort them
            ith_cluster_silhouette_values = \
                sample_silhouette_values[cluster_labels == i]
            ith_cluster_silhouette_values.sort()

            size_cluster_i = ith_cluster_silhouette_values.shape[0]
            y_upper = y_lower + size_cluster_i

            color = cm.nipy_spectral(float(i) / n_clusters)
            ax1.fill_betweenx(np.arange(y_lower, y_upper),
                              0, ith_cluster_silhouette_values,
                              facecolor=color, edgecolor=color, alpha=0.7)

            # Label the silhouette plots with their cluster numbers at the middle
            ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

            # Compute the new y_lower for next plot
            y_lower = y_upper + 10  # 10 for the 0 samples

        ax1.set_title("The silhouette plot for the various clusters.")
        ax1.set_xlabel("The silhouette coefficient values")
        ax1.set_ylabel("Cluster label")

        # The vertical line for the average silhouette score of all the values
        ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

        ax1.set_yticks([])  # Clear the yaxis labels / ticks
        ax1.set_xticks([0, 0.2, 0.4, 0.6, 0.8, 1])

        # 2nd Plot showing the actual clusters formed, projected onto the first
        # two raw features
        colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
        ax2.scatter(X.iloc[:, 0], X.iloc[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                    c=colors)

        # Labeling the clusters (note: the centers live in the one-hot-encoded
        # feature space, so only their first two coordinates are shown)
        centers = clusterer.steps[1][1].cluster_centers_
        # Draw white circles at cluster centers
        ax2.scatter(centers[:, 0], centers[:, 1],
                    marker='o', c="white", alpha=1, s=200)
        for i, c in enumerate(centers):
            ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50)

        ax2.set_title("The visualization of the clustered data.")
        ax2.set_xlabel("Feature space for the 1st feature")
        ax2.set_ylabel("Feature space for the 2nd feature")

        plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                      "with n_clusters = %d" % n_clusters),
                     fontsize=14, fontweight='bold')
        plt.show()
In [ ]:
silhouette_plot(X)
If we had just used silhouette scores, we would have missed that a lot of our data is actually not clustering very well. The plots above should make us reevaluate whether clustering is the right thing to do on our data.
Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample. See the Wikipedia page for more details.
The AgglomerativeClustering object performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together.
The linkage criterion determines the metric used for the merge strategy:
- Ward minimizes the sum of squared differences within all clusters (a variance-minimizing approach, similar in spirit to the k-means objective).
- Maximum or complete linkage minimizes the maximum distance between observations of pairs of clusters.
- Average linkage minimizes the average of the distances between all observations of pairs of clusters.
AgglomerativeClustering can also scale to a large number of samples when it is used jointly with a connectivity matrix, but it is computationally expensive when no connectivity constraints are added between samples: it considers all the possible merges at each step.
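Since the hierarchy is a tree, it can also be helpful to look at a dendrogram directly. The following sketch is not part of the original notebook; it uses SciPy's `linkage` and `dendrogram` functions on the feature matrix `X` defined earlier.
In [ ]:
# Optional sketch: visualize the merge hierarchy as a dendrogram with SciPy.
# Assumes `X` is the feature DataFrame defined earlier in the notebook.
import scipy.cluster.hierarchy as sch

Z_link = sch.linkage(X, method='ward')   # bottom-up merge tree over all countries
plt.figure(figsize=(10, 4))
sch.dendrogram(Z_link, no_labels=True)   # leaves are individual countries
plt.title("Ward-linkage dendrogram of the flags data")
plt.show()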
In [ ]:
# Code adapted from http://scikit-learn.org/stable/auto_examples/cluster/plot_digits_linkage.html
# Visualize the clustering
def plot_clustering(X_red, X, labels, title=None):
    # Rescale the embedding to the unit square before plotting
    x_min, x_max = np.min(X_red, axis=0), np.max(X_red, axis=0)
    X_red = (X_red - x_min) / (x_max - x_min)

    plt.figure(figsize=(6, 4))
    for i in range(X_red.shape[0]):
        # Label each point with its true religion value, colored by cluster
        plt.text(X_red[i, 0], X_red[i, 1], str(y[i]),
                 color=plt.cm.nipy_spectral(labels[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 9})

    plt.xticks([])
    plt.yticks([])
    if title is not None:
        plt.title(title, size=17)
    plt.axis('off')
    plt.tight_layout()
In [ ]:
print("Computing embedding")
X_red = manifold.SpectralEmbedding(n_components=2).fit_transform(X)
print("Done.")
for linkage in ('ward', 'average', 'complete'):
clustering = AgglomerativeClustering(linkage=linkage, n_clusters=8)
t0 = time()
clustering.fit(X_red)
print("%s : %.2fs" % (linkage, time() - t0))
plot_clustering(X_red, X, clustering.labels_, "%s linkage" % linkage)
plt.show()
This notebook was compiled using several references.