This mini-project is based on this blog post by yhat. Please feel free to refer to the post for additional information and solutions.
In [13]:
%matplotlib inline
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
# Setup Seaborn
sns.set_style("whitegrid")
sns.set_context("poster")
The dataset contains information on marketing newsletters/e-mail campaigns (e-mail offers sent to customers) and transaction level data from customers. The transactional data shows which offer customers responded to, and what the customer ended up buying. The data is presented as an Excel workbook containing two worksheets. Each worksheet contains a different dataset.
In [14]:
df_offers = pd.read_excel("./WineKMC.xlsx", sheet_name=0)
df_offers.columns = ["offer_id", "campaign", "varietal", "min_qty", "discount", "origin", "past_peak"]
df_offers.head()
Out[14]:
We see that the first dataset contains information about each offer such as the month it is in effect and several attributes about the wine that the offer refers to: the variety, minimum quantity, discount, country of origin and whether or not it is past peak. The second dataset in the second worksheet contains transactional data -- which offer each customer responded to.
In [15]:
df_transactions = pd.read_excel("./WineKMC.xlsx", sheet_name=1)
df_transactions.columns = ["customer_name", "offer_id"]
df_transactions['n'] = 1
df_transactions.head()
Out[15]:
We're trying to learn more about how our customers behave, so we can use their behavior (whether or not they purchased something based on an offer) as a way to group similar minded customers together. We can then study those groups to look for patterns and trends which can help us formulate future offers.
The first thing we need is a way to compare customers. To do this, we're going to create a matrix that contains each customer and a 0/1 indicator for whether or not they responded to a given offer.
Exercise: Create a data frame where each row has a `customer_name` column plus one column per `offer_id` holding a 0/1 indicator for whether that customer responded to the offer. (Use the pandas [`merge`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) and [`pivot_table`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html) functions for this purpose.)
Make sure you also deal with any weird values such as `NaN`. Read the documentation to develop your solution.
In [16]:
#your turn
# merge the dataframes based on offer id
df_merged = pd.merge(df_transactions, df_offers, on='offer_id')
# pivot to a customer x offer matrix of 0/1 indicators; replace NaN with 0 and move customer_name back to a regular column
x_cols = pd.pivot_table(df_merged, values='n', index=['customer_name'], columns=['offer_id']).fillna(0).reset_index()
# create dataframe without customer name
X = x_cols[x_cols.columns[1:]]
Recall that in K-Means Clustering we want to maximize the distance between centroids and minimize the distance between data points and the respective centroid for the cluster they are in. True evaluation for unsupervised learning would require labeled data; however, we can use a variety of intuitive metrics to try to pick the number of clusters K. We will introduce three methods: the Elbow method, the Silhouette method, and the gap statistic.
The first method looks at the sum-of-squares error in each cluster against $K$. We compute the distance from each data point to the center of the cluster (centroid) to which the data point was assigned.
$$SS = \sum_k \sum_{x_i \in C_k} \left( x_i - \mu_k \right)^2 = \sum_k \frac{1}{2\,|C_k|} \sum_{x_i \in C_k} \sum_{x_j \in C_k} \left( x_i - x_j \right)^2$$ where $x_i$ is a point, $C_k$ represents cluster $k$ and $\mu_k$ is the centroid for cluster $k$. We can plot SS vs. $K$ and choose the elbow point in the plot as the best value for $K$. The elbow point is the point at which the plot starts descending much more slowly.
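For reference, a fitted scikit-learn `KMeans` model exposes this same within-cluster sum of squares as its `inertia_` attribute, so you can sanity-check the quantity directly; the sketch below assumes the 0/1 matrix `X` built above and an arbitrary `n_clusters=3` chosen purely for illustration.
In [ ]:
from sklearn.cluster import KMeans
import numpy as np
# SS for a single K: fit KMeans and read off the within-cluster sum of squares
km = KMeans(n_clusters=3, random_state=0).fit(X)
print("SS (inertia) for K=3:", km.inertia_)
# equivalently, sum the squared distances from each point to its assigned centroid
assigned_centroids = km.cluster_centers_[km.labels_]
print("manual SS:", np.sum((np.asarray(X, dtype=float) - assigned_centroids) ** 2))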
Exercise: For each $K$ from 2 to 10, compute SS, plot it against $K$, and identify the elbow point.
In [17]:
#your turn
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
import numpy as np
# fit KMeans for each candidate K and collect the centroids
K = range(2, 11)
KM = [KMeans(n_clusters=k).fit(X) for k in K]
centroids = [k.cluster_centers_ for k in KM]
# Euclidean distance from every point to every centroid
D_k = [cdist(X, mid, 'euclidean') for mid in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]
# total within-cluster sum of squares
tss = [sum(d**2) for d in dist]
# construct a plot showing SS for each K
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_xlim([1, 11])
ax.plot(K, tss, 'b*-')
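# highlight K[6] = 8, the elbow point chosen here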
ax.plot(K[6], tss[6], marker='o', markersize=12,
markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster sum of squares')
plt.title('Elbow for KMeans clustering')
Out[17]:
In [18]:
# set up KMeans with n_clusters = 8 (the elbow point found above)
cluster = KMeans(n_clusters=8)
# predict and assign to a cluster
x_cols['cluster'] = cluster.fit_predict(X)
y = x_cols.cluster.value_counts()
# the value_counts index holds the cluster labels
cluster_ids = y.index.values
x_lim = np.arange(len(y))
# plot bar chart
plt.bar(x_lim, y, align='center', alpha=0.5)
plt.xticks(x_lim, cluster_ids)
plt.ylabel('Counts')
plt.title('Number of points per cluster')
plt.show()
There exists another method that measures how well each datapoint $x_i$ "fits" its assigned cluster and also how poorly it fits into other clusters. This is a different way of looking at the same objective. Denote $a_{x_i}$ as the average distance from $x_i$ to all other points within its own cluster $k$. The lower the value, the better. On the other hand $b_{x_i}$ is the minimum average distance from $x_i$ to points in a different cluster, minimized over clusters. That is, compute separately for each cluster the average distance from $x_i$ to the points within that cluster, and then take the minimum. The silhouette $s(x_i)$ is defined as
$$s(x_i) = \frac{b_{x_i} - a_{x_i}}{\max{\left( a_{x_i}, b_{x_i}\right)}}$$ The silhouette score is computed on every datapoint in every cluster. The silhouette score ranges from -1 (a poor clustering) to +1 (a very dense clustering), with 0 denoting the situation where clusters overlap. Some criteria for interpreting the silhouette coefficient are provided in the table below.
Range | Interpretation |
---|---|
0.71 - 1.0 | A strong structure has been found. |
0.51 - 0.7 | A reasonable structure has been found. |
0.26 - 0.5 | The structure is weak and could be artificial. |
< 0.25 | No substantial structure has been found. |
Source: http://www.stat.berkeley.edu/~spector/s133/Clus.html
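To make the definition above concrete, here is a minimal sketch that computes $a_{x_i}$, $b_{x_i}$ and $s(x_i)$ by hand for a single point of a tiny, made-up toy dataset (the points and labels below are purely illustrative):
In [ ]:
import numpy as np
from scipy.spatial.distance import cdist
# toy data: two small clusters on a line (illustrative values only)
pts = np.array([[0.0], [0.5], [1.0], [5.0], [5.5]])
labels = np.array([0, 0, 0, 1, 1])
D = cdist(pts, pts)            # pairwise distances
i = 0                          # compute the silhouette for the first point
same = labels == labels[i]
same[i] = False                # exclude the point itself from its own cluster
a_i = D[i, same].mean()        # a(x_i): mean distance to points in its own cluster
b_i = min(D[i, labels == k].mean()   # b(x_i): smallest mean distance to another cluster
          for k in set(labels) if k != labels[i])
s_i = (b_i - a_i) / max(a_i, b_i)
print("a =", a_i, "b =", b_i, "s =", s_i)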
Fortunately, scikit-learn provides a function to compute this for us (phew!) called `sklearn.metrics.silhouette_score`. Take a look at this article on picking $K$ in scikit-learn, as it will help you in the next exercise set.
Exercise: Using the documentation for the `silhouette_score` function above, construct a series of silhouette plots like the ones in the article linked above.
Exercise: Compute the average silhouette score for each $K$ and plot it. What $K$ does the plot suggest we should choose? Does it differ from what we found using the Elbow method?
Based on the silhouette method, the value of $K$ with the maximum average score is $K=5$. This differs from the Elbow method, where the SSE appears to stabilize around $K=8$.
In [19]:
from __future__ import print_function
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
# Generating the sample data from make_blobs
# This particular setting has one distinct cluster and 3 clusters placed close
# together.
X, y = make_blobs(n_samples=500,
n_features=2,
centers=4,
cluster_std=1,
center_box=(-10.0, 10.0),
shuffle=True,
random_state=1) # For reproducibility
range_n_clusters = [2, 3, 4, 5, 6]
for n_clusters in range_n_clusters:
# Create a subplot with 1 row and 2 columns
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(18, 7)
# The 1st subplot is the silhouette plot
# The silhouette coefficient can range from -1, 1 but in this example all
# lie within [-0.1, 1]
ax1.set_xlim([-0.1, 1])
# The (n_clusters+1)*10 is for inserting blank space between silhouette
# plots of individual clusters, to demarcate them clearly.
ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])
# Initialize the clusterer with n_clusters value and a random generator
# seed of 10 for reproducibility.
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(X)
# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed
# clusters
silhouette_avg = silhouette_score(X, cluster_labels)
print("For n_clusters =", n_clusters,
"The average silhouette_score is :", silhouette_avg)
# Compute the silhouette scores for each sample
sample_silhouette_values = silhouette_samples(X, cluster_labels)
y_lower = 10
for i in range(n_clusters):
# Aggregate the silhouette scores for samples belonging to
# cluster i, and sort them
ith_cluster_silhouette_values = \
sample_silhouette_values[cluster_labels == i]
ith_cluster_silhouette_values.sort()
size_cluster_i = ith_cluster_silhouette_values.shape[0]
y_upper = y_lower + size_cluster_i
color = cm.nipy_spectral(float(i) / n_clusters)
ax1.fill_betweenx(np.arange(y_lower, y_upper),
0, ith_cluster_silhouette_values,
facecolor=color, edgecolor=color, alpha=0.7)
# Label the silhouette plots with their cluster numbers at the middle
ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
# Compute the new y_lower for next plot
y_lower = y_upper + 10 # 10 for the 0 samples
ax1.set_title("The silhouette plot for the various clusters.")
ax1.set_xlabel("The silhouette coefficient values")
ax1.set_ylabel("Cluster label")
# The vertical line for average silhouette score of all the values
ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
ax1.set_yticks([]) # Clear the yaxis labels / ticks
ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
# 2nd Plot showing the actual clusters formed
colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
c=colors)
# Labeling the clusters
centers = clusterer.cluster_centers_
# Draw white circles at cluster centers
ax2.scatter(centers[:, 0], centers[:, 1],
marker='o', c="white", alpha=1, s=200)
for i, c in enumerate(centers):
ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50)
ax2.set_title("The visualization of the clustered data.")
ax2.set_xlabel("Feature space for the 1st feature")
ax2.set_ylabel("Feature space for the 2nd feature")
plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
"with n_clusters = %d" % n_clusters),
fontsize=14, fontweight='bold')
plt.show()
In [20]:
# Your turn.
from sklearn.metrics import silhouette_samples, silhouette_score
df_sil=[]
for n_clusters in range(2,10):
# Initialize the clusterer with n_clusters value and a random generator
# seed of 10 for reproducibility.
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(X)
# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed
# clusters
silhouette_avg = silhouette_score(X, cluster_labels)
# add data to the list
df_sil.append([n_clusters, silhouette_avg])
# convert into a dataframe
df_sil=pd.DataFrame(df_sil, columns=['cluster', 'avg_score'])
# the cluster column holds the number of clusters used in each run
k_values = df_sil.cluster
x_lim = np.arange(len(df_sil))
y= df_sil.avg_score
# plot bar chart
plt.bar(x_lim, y, align='center', alpha=0.5)
plt.xticks(x_lim, k_values)
plt.ylabel('Silhouette score')
plt.title('Average silhouette score per number of clusters')
plt.show()
There is one last method worth covering for picking $K$, the so-called Gap statistic. The computation for the gap statistic builds on the sum-of-squares established in the Elbow method discussion, and compares it to the sum-of-squares of a "null distribution," that is, a random set of points with no clustering. The estimate for the optimal number of clusters $K$ is the value for which $\log{SS}$ falls the farthest below that of the reference distribution:
$$G_k = E_n^*\{\log SS_k\} - \log SS_k$$ In other words, a good clustering yields a much larger difference between the reference distribution and the clustered data. The reference distribution is generated by a Monte Carlo (randomization) procedure that constructs $B$ random distributions of points within the bounding box (limits) of the original data and then applies K-means to this synthetic distribution of data points. $E_n^*\{\log SS_k\}$ is just the average $\log SS_k$ over all $B$ replicates. We then compute the standard deviation $\sigma_{SS}$ of the values of $\log SS_k$ computed from the $B$ replicates of the reference distribution and compute
$$s_k = \sqrt{1+1/B}\,\sigma_{SS}$$ Finally, we choose $K=k$ such that $G_k \geq G_{k+1} - s_{k+1}$.
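The gap statistic is not built into scikit-learn, so below is a minimal sketch of the Monte Carlo procedure just described; the helper name `gap_statistic`, the choice of $B=10$ reference datasets, and `k_max=10` are assumptions made purely for illustration.
In [ ]:
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(data, k_max=10, B=10, random_state=0):
    """Return the gap values G_k and the s_k terms for k = 1..k_max (a sketch)."""
    rng = np.random.RandomState(random_state)
    data = np.asarray(data, dtype=float)
    mins, maxs = data.min(axis=0), data.max(axis=0)
    gaps, sks = [], []
    for k in range(1, k_max + 1):
        # log(SS_k) for the observed data (inertia_ is the within-cluster sum of squares)
        log_ss = np.log(KMeans(n_clusters=k, random_state=random_state).fit(data).inertia_)
        # log(SS_k) for B reference datasets drawn uniformly from the bounding box of the data
        ref_log_ss = np.empty(B)
        for b in range(B):
            ref = rng.uniform(mins, maxs, size=data.shape)
            ref_log_ss[b] = np.log(KMeans(n_clusters=k, random_state=random_state).fit(ref).inertia_)
        gaps.append(ref_log_ss.mean() - log_ss)
        sks.append(np.sqrt(1 + 1.0 / B) * ref_log_ss.std())
    return np.array(gaps), np.array(sks)

# choose the smallest k with G_k >= G_{k+1} - s_{k+1}
gaps, sks = gap_statistic(X)   # X is whatever feature matrix is in scope, e.g. the 0/1 offer matrix
ks = np.arange(1, len(gaps) + 1)
best_k = next((k for k, g, g1, s1 in zip(ks[:-1], gaps[:-1], gaps[1:], sks[1:]) if g >= g1 - s1), ks[-1])
print("Gap statistic suggests K =", best_k)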
Unsupervised learning expects that we do not have the labels. In some situations, we may wish to cluster data that is labeled. Computing the optimal number of clusters is much easier if we have access to labels. There are several methods available. We will not go into the math or details since it is rare to have access to the labels, but we provide the names and references of these measures.
See this article for more information about these metrics.
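Commonly used label-aware metrics in scikit-learn include the adjusted Rand index, adjusted mutual information, and the homogeneity/completeness/V-measure family. A minimal sketch of how they are called is below; since our wine data has no ground-truth labels, the example scores a KMeans fit against the labels returned by `make_blobs`, purely for illustration.
In [ ]:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn import metrics

# synthetic data with known labels, for illustration only
X_demo, y_true = make_blobs(n_samples=300, centers=4, random_state=0)
y_pred = KMeans(n_clusters=4, random_state=0).fit_predict(X_demo)

print("Adjusted Rand index:         ", metrics.adjusted_rand_score(y_true, y_pred))
print("Adjusted mutual information: ", metrics.adjusted_mutual_info_score(y_true, y_pred))
print("Homogeneity / completeness / V-measure:",
      metrics.homogeneity_completeness_v_measure(y_true, y_pred))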
How do we visualize clusters? If we only had two features, we could likely plot the data as is. But we have 100 data points each containing 32 features (dimensions). Principal Component Analysis (PCA) will help us reduce the dimensionality of our data from 32 to something lower. For a visualization on the coordinate plane, we will use 2 dimensions. In this exercise, we're going to use it to transform our multi-dimensional dataset into a 2 dimensional dataset.
This is only one use of PCA for dimension reduction. We can also use PCA when we want to perform regression but we have a set of highly correlated variables. PCA untangles these correlations into a smaller number of features/predictors all of which are orthogonal (not correlated). PCA is also used to reduce a large set of variables into a much smaller one.
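As a side note, here is a minimal sketch of that regression use case (often called principal component regression); the synthetic, deliberately correlated features and the choice of 2 components are made up purely for illustration.
In [ ]:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
z = rng.normal(size=(200, 2))                                        # two latent factors
features = np.column_stack([z[:, 0],
                            z[:, 0] + 0.01 * rng.normal(size=200),   # nearly identical to the first column
                            z[:, 1],
                            z[:, 1] + 0.01 * rng.normal(size=200)])  # nearly identical to the third column
target = z[:, 0] - 2 * z[:, 1] + 0.1 * rng.normal(size=200)

# project onto a few orthogonal components, then regress on those components
pcr = make_pipeline(PCA(n_components=2), LinearRegression()).fit(features, target)
print("R^2 on the training data:", pcr.score(features, target))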
Exercise: Use PCA to plot your clusters:
Exercise: Now look at both the original raw data about the offers and transactions and look at the fitted clusters. Tell a story about the clusters in context of the original data. For example, do the clusters correspond to wine variants or something else interesting?
Cluster 4 tends to buy in bulk. That segment has an average minimum quantity of 82, compared to 45 for customers outside cluster 4. Cluster 4 also corresponds mostly to buyers of Champagne.
In [49]:
#your turn
from sklearn.decomposition import PCA
cluster = KMeans(n_clusters=5)
# cluster on the 0/1 offer columns only (exclude customer_name and any columns added earlier)
offer_cols = [c for c in x_cols.columns if c not in ['customer_name', 'cluster', 'x', 'y']]
x_cols['cluster'] = cluster.fit_predict(x_cols[offer_cols])
# project the same 0/1 matrix down to 2 dimensions with PCA
pca = PCA(n_components=2)
xy = pca.fit_transform(x_cols[offer_cols])
x_cols['x'] = xy[:, 0]
x_cols['y'] = xy[:, 1]
customer_clusters = x_cols[['customer_name', 'cluster', 'x', 'y']]
df = pd.merge(df_transactions, customer_clusters)
df = pd.merge(df_offers, df)
sns.lmplot(x='x', y='y',
data=df,
fit_reg=False,
hue="cluster",
scatter_kws={"marker": "D",
"s": 100})
plt.title('Scatter plot of clustered data')
Out[49]:
In [70]:
df['is_4'] = df.cluster==4
print(df.groupby("is_4")[['min_qty', 'discount']].mean())
df.groupby("is_4").varietal.value_counts()
Out[70]:
What we've done is taken those columns of 0/1 indicator variables and transformed them into a 2-D dataset. We took one column and arbitrarily called it `x`, and then called the other `y`. Now we can throw each point into a scatterplot. We color-coded each point based on its cluster so it's easier to see them.
As we saw earlier, PCA has a lot of other uses. Since we wanted to visualize our data in 2 dimensions, we restricted the number of dimensions to 2 in PCA. But what is the true optimal number of dimensions?
Exercise: Using a new PCA object shown in the next cell, plot the `explained_variance_` field and look for the elbow point, the point where the curve's rate of descent seems to slow sharply. This value is one possible value for the optimal number of dimensions. What is it?
In [22]:
#your turn
# Initialize a new PCA model with a default number of components.
from sklearn.decomposition import PCA
# Do the rest on your own :)
pca = PCA()
pca.fit(X)
explained_var = pca.explained_variance_
n_components = np.arange(1, len(explained_var) + 1)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_xlim([0.5, len(explained_var) + 0.5])
ax.plot(n_components, explained_var, 'b*-')
plt.grid(True)
plt.xlabel('Number of dimensions')
plt.ylabel('PCA Explained variance')
plt.title('Elbow for PCA explained variance')
Out[22]:
k-means is only one of a ton of clustering algorithms. Below is a brief description of several of them; the scikit-learn clustering documentation provides references to the rest of the algorithms it implements.
Affinity Propagation does not require the number of clusters $K$ to be known in advance! AP uses a "message passing" paradigm to cluster points based on their similarity.
Spectral Clustering uses the eigenvalues of a similarity matrix to reduce the dimensionality of the data before clustering in a lower dimensional space. This is tangentially similar to what we did to visualize k-means clusters using PCA. The number of clusters must be known a priori.
Ward's Method applies to hierarchical clustering. Hierarchical clustering algorithms take a set of data and successively divide the observations into more and more clusters at each layer of the hierarchy. Ward's method is used to determine when two clusters in the hierarchy should be combined into one. In divisive hierarchical clustering, all observations are part of the same cluster at first, and at each successive iteration the clusters are made smaller and smaller. With hierarchical clustering, a hierarchy is constructed, and there is not really the concept of "number of clusters." The number of clusters simply determines how low or how high in the hierarchy we reference and can be determined empirically or by looking at the dendrogram.
Agglomerative Clustering is similar to hierarchical clustering but is not divisive, it is agglomerative. That is, every observation is placed into its own cluster, and at each iteration or level of the hierarchy, observations are merged into fewer and fewer clusters until convergence. As with divisive hierarchical clustering, the constructed hierarchy contains all possible numbers of clusters, and it is up to the analyst to pick the number by reviewing statistics or the dendrogram.
DBSCAN is based on point density rather than distance. It groups together points with many nearby neighbors. DBSCAN is one of the most cited algorithms in the literature. It does not require knowing the number of clusters a priori, but does require specifying the neighborhood size.
Exercise: Try clustering using the following algorithms: affinity propagation, spectral clustering, agglomerative clustering, and DBSCAN.
How do their results compare? Which performs the best? Tell a story why you think it performs the best.
Affinity propagation and DBSCAN suggest a number of clusters on their own, while spectral and agglomerative clustering require a pre-assigned number of clusters. Based on the silhouette coefficient, the best algorithm for this data is spectral clustering, with a silhouette value of 0.71. I think the best algorithm is DBSCAN because it gives a better idea of how the data can be grouped based on the distance between neighboring points. Affinity propagation tends to give a larger number of clusters than DBSCAN.
In [23]:
# your turn
# Affinity propagation
from sklearn.cluster import AffinityPropagation
from sklearn import metrics
af = AffinityPropagation().fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)
print('Estimated number of clusters: %d' % n_clusters_)
print("Silhouette Coefficient: %0.3f"
% metrics.silhouette_score(X, labels, metric='sqeuclidean'))
import matplotlib.pyplot as plt
from itertools import cycle
plt.close('all')
plt.figure(1)
plt.clf()
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
class_members = labels == k
cluster_center = X[cluster_centers_indices[k]]
plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
markeredgecolor='k', markersize=14)
for x in X[class_members]:
plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
In [33]:
# your turn
# Spectral Clustering
from sklearn import cluster
for n_clusters in range(2,3):
#n_clusters = 4
spectral = cluster.SpectralClustering(n_clusters=n_clusters,
eigen_solver='arpack',
affinity="nearest_neighbors")
spectral.fit(X)
labels = spectral.labels_
print('Assigned number of clusters: %d' % n_clusters)
print("Silhouette Coefficient: %0.3f"
% metrics.silhouette_score(X, labels))
plt.scatter(X[:, 0], X[:, 1], c=spectral.labels_, cmap=plt.cm.nipy_spectral)
plt.title('Assigned number of clusters: %d' % n_clusters)
Out[33]:
In [35]:
# AgglomerativeClustering
from sklearn.cluster import AgglomerativeClustering
for n_clusters in range(2,3):
#n_clusters = 4
linkage = 'ward'
model = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage)
model.fit(X)
labels = model.labels_
print("Silhouette Coefficient: %0.3f"
% metrics.silhouette_score(X, labels))
plt.scatter(X[:, 0], X[:, 1], c=model.labels_, cmap=plt.cm.nipy_spectral)
plt.title('linkage=%s' % (linkage), fontdict=dict(verticalalignment='top'))
plt.axis('equal')
plt.axis('off')
plt.subplots_adjust(bottom=0, top=.89, wspace=0, left=0, right=1)
plt.suptitle('n_cluster=%i' % (n_clusters), size=17)
plt.show()
In [42]:
# Your turn
# Using DBSCAN
from sklearn.cluster import DBSCAN
from sklearn import metrics
for eps in [.6]:
db = DBSCAN(eps=eps).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))
import matplotlib.pyplot as plt
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
for each in np.linspace(0, 1, len(unique_labels))]
for k, col in list(zip(unique_labels, colors)):
if k == -1:
# Black used for noise.
col = [0, 0, 0, 1]
class_member_mask = (labels == k)
xy = X[class_member_mask & core_samples_mask]
plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
markeredgecolor='k', markersize=14)
xy = X[class_member_mask & ~core_samples_mask]
plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
In [ ]: