# Customer Segmentation using Clustering

This mini-project is based on this blog post by yhat. Please feel free to refer to the post for additional information and solutions.



In :

%matplotlib inline
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

# Setup Seaborn
sns.set_style("whitegrid")
sns.set_context("poster")



## Data

The dataset contains information on marketing newsletters/e-mail campaigns (e-mail offers sent to customers) and transaction level data from customers. The transactional data shows which offer customers responded to, and what the customer ended up buying. The data is presented as an Excel workbook containing two worksheets. Each worksheet contains a different dataset.



In :

# Load the first worksheet (offer data). The workbook filename is assumed
# here; substitute the path to your copy of the data.
df_offers = pd.read_excel("./WineKMC.xlsx", sheet_name=0)
df_offers.columns = ["offer_id", "campaign", "varietal", "min_qty", "discount", "origin", "past_peak"]
df_offers.head()




Out:

|   | offer_id | campaign | varietal | min_qty | discount | origin | past_peak |
|---|----------|----------|----------|---------|----------|--------|-----------|
| 0 | 1 | January | Malbec | 72 | 56 | France | False |
| 1 | 2 | January | Pinot Noir | 72 | 17 | France | False |
| 2 | 3 | February | Espumante | 144 | 32 | Oregon | True |
| 3 | 4 | February | Champagne | 72 | 48 | France | True |
| 4 | 5 | February | Cabernet Sauvignon | 144 | 44 | New Zealand | True |


We see that the first dataset contains information about each offer such as the month it is in effect and several attributes about the wine that the offer refers to: the variety, minimum quantity, discount, country of origin and whether or not it is past peak. The second dataset in the second worksheet contains transactional data -- which offer each customer responded to.



In :

# Load the second worksheet (transaction data); filename assumed as above.
df_transactions = pd.read_excel("./WineKMC.xlsx", sheet_name=1)
df_transactions.columns = ["customer_name", "offer_id"]
df_transactions['n'] = 1
df_transactions.head()




Out:

|   | customer_name | offer_id | n |
|---|---------------|----------|---|
| 0 | Smith | 2 | 1 |
| 1 | Smith | 24 | 1 |
| 2 | Johnson | 17 | 1 |
| 3 | Johnson | 24 | 1 |
| 4 | Johnson | 26 | 1 |



## Data wrangling

We're trying to learn more about how our customers behave, so we can use their behavior (whether or not they purchased something based on an offer) as a way to group similar minded customers together. We can then study those groups to look for patterns and trends which can help us formulate future offers.

The first thing we need is a way to compare customers. To do this, we're going to create a matrix that contains each customer and a 0/1 indicator for whether or not they responded to a given offer.
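As a warm-up for the exercise below, the merge-then-pivot pattern can be sketched on toy data. The frames and values here are made up purely for illustration (the real worksheets are loaded earlier); only the column names mirror the actual data.

```python
import pandas as pd

# Toy versions of the two worksheets (values are made up).
offers = pd.DataFrame({"offer_id": [1, 2, 3],
                       "varietal": ["Malbec", "Pinot Noir", "Espumante"]})
transactions = pd.DataFrame({"customer_name": ["Smith", "Smith", "Johnson"],
                             "offer_id": [1, 3, 2]})
transactions["n"] = 1  # marker column: this customer responded to this offer

# Join each transaction to its offer, then pivot to a customer-by-offer
# matrix; fill_value=0 turns non-responses into 0s.
merged = offers.merge(transactions, on="offer_id", how="left")
matrix = pd.pivot_table(merged, values="n", index="customer_name",
                        columns="offer_id", fill_value=0)
```

Each row of `matrix` is now a customer, each column an offer, and each entry a 0/1 response indicator — exactly the shape the clustering below expects.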

### Checkup Exercise Set I

Exercise: Create a data frame where each row has the following columns (Use the pandas [merge](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) and [pivot_table](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html) functions for this purpose):

• customer_name
• One column for each offer, with a 1 if the customer responded to the offer

Make sure you also deal with any weird values such as NaN. Read the documentation to develop your solution.



In :

# Create a data frame where each row has a customer_name column and one column
# for each offer, with a 1 if the customer responded to the offer.
# Use pandas merge and pivot table functions for this purpose.
df_responses = pd.pivot_table(data=df_offers.merge(df_transactions, how='left', on='offer_id'),
                              values='n', index='customer_name', columns='offer_id',
                              fill_value=0, dropna=False)



## K-Means Clustering

Recall that in K-Means Clustering we want to maximize the distance between centroids and minimize the distance between data points and the respective centroid for the cluster they are in. True evaluation for unsupervised learning would require labeled data; however, we can use a variety of intuitive metrics to try to pick the number of clusters K. We will introduce three methods: the Elbow method, the Silhouette method, and the gap statistic.
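The quantity being minimized, the within-cluster sum of squares ($SS$), can be sketched directly in numpy. `within_cluster_ss` below is a hypothetical helper name; for a fitted model, scikit-learn exposes the same quantity as `KMeans.inertia_`.

```python
import numpy as np

def within_cluster_ss(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid.

    This is the K-Means objective ("SS"); scikit-learn reports it for a
    fitted model as KMeans.inertia_.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    return float(sum(((X[labels == k] - np.asarray(c, dtype=float)) ** 2).sum()
                     for k, c in enumerate(centroids)))

# Two tight clusters: each point sits 1 unit from its centroid,
# so SS = 1 + 1 + 1 + 1 = 4.
ss = within_cluster_ss([[0, 0], [2, 0], [10, 0], [12, 0]],
                       [0, 0, 1, 1],
                       [[1, 0], [11, 0]])
```

Adding clusters can only decrease $SS$ (more centroids means shorter distances), which is why the Elbow method looks for the $K$ where the decrease levels off rather than for a minimum.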

### Checkup Exercise Set II

Exercise:

• What values of $SS$ do you believe represent better clusterings? Why?
• Create a numpy matrix x_cols with only the columns representing the offers (i.e. the 0/1 columns)
• Write code that applies the [KMeans](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) clustering method from scikit-learn to this matrix.
• Construct a plot showing $SS$ for each $K$ and pick $K$ using this plot. For simplicity, test $2 \le K \le 10$.
• Make a bar chart showing the number of points in each cluster for k-means under the best $K$.
• What challenges did you experience using the Elbow method to pick $K$?


In :

# Create a numpy matrix with only the columns representing the offers.
from sklearn.cluster import KMeans
import numpy as np
x_cols = df_responses.values  # a plain ndarray; np.matrix is deprecated

# Set up a placeholder dataframe for the SS (inertia) scores.
SS_scores = pd.DataFrame({'K': range(2, 11)})
SS_scores['score'] = np.nan

# Apply the KMeans clustering method to the matrix above.
# The exercise suggests testing 2 <= K <= 10.
for K in range(2, 11):
    cluster = KMeans(n_clusters=K, random_state=0).fit(x_cols)
    # .loc avoids the SettingWithCopyWarning raised by chained indexing.
    SS_scores.loc[SS_scores.K == K, 'score'] = cluster.inertia_

# Plot SS for each K.
plt.scatter(data=SS_scores, x='K', y='score')

# Based on this SSE plot, there is no clear elbow where the SSE stops
# decreasing rapidly; the score is still dropping from K = 8 to K = 10.
# At this point, I'm choosing K = 10 as the best K.





Out:




In :

# Make a bar chart showing the number of points in each cluster for k-means
# under the best K, which I chose to be 10.

cluster = KMeans(n_clusters=10, random_state=0).fit(x_cols)

plt.hist(x=cluster.labels_, bins=10)

# What challenges did you experience using the Elbow method to pick K?
#     Answer: The Elbow method was hard to apply here: the SSE curve
#             decreases smoothly, with no clear inflection point ("elbow").




Out:

(array([  9.,  12.,  17.,  14.,   5.,   7.,   9.,  19.,   3.,   5.]),
array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ]),
<a list of 10 Patch objects>)



### Choosing K: The Silhouette Method

There exists another method that measures how well each datapoint $x_i$ "fits" its assigned cluster and also how poorly it fits into other clusters. This is a different way of looking at the same objective. Denote $a_{x_i}$ as the average distance from $x_i$ to all other points within its own cluster $k$. The lower the value, the better. On the other hand $b_{x_i}$ is the minimum average distance from $x_i$ to points in a different cluster, minimized over clusters. That is, compute separately for each cluster the average distance from $x_i$ to the points within that cluster, and then take the minimum. The silhouette $s(x_i)$ is defined as

$$s(x_i) = \frac{b_{x_i} - a_{x_i}}{\max{\left( a_{x_i}, b_{x_i}\right)}}$$

The silhouette score is computed for every datapoint in every cluster. It ranges from -1 (a poor clustering) to +1 (a very dense clustering), with 0 denoting the situation where clusters overlap. Some criteria for interpreting the silhouette coefficient are provided in the table below.

Fortunately, scikit-learn provides a function to compute this for us (phew!) called sklearn.metrics.silhouette_score. Take a look at this article on picking $K$ in scikit-learn, as it will help you in the next exercise set.
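To make the definition concrete, here is a minimal numpy sketch that computes $s(x_i)$ exactly as defined above. `silhouette_values` is a hypothetical helper name; `sklearn.metrics.silhouette_samples` computes the same per-point values.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouette s(x_i) = (b_i - a_i) / max(a_i, b_i).

    a_i: mean distance from x_i to the other points in its own cluster.
    b_i: smallest mean distance from x_i to the points of any other cluster.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)  # distances from x_i to all points
        same = labels == labels[i]
        same[i] = False                       # exclude x_i itself from a_i
        if not same.any():
            continue                          # singleton cluster: s(x_i) = 0
        a = d[same].mean()
        b = min(d[labels == k].mean()
                for k in np.unique(labels) if k != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Two tight, well-separated clusters: every point scores close to +1.
s = silhouette_values([[0], [1], [10], [11]], [0, 0, 1, 1])
```

For the first point, $a = 1$ and $b = (10 + 11)/2 = 10.5$, so $s = 9.5 / 10.5 \approx 0.90$ — a dense, well-separated clustering, as expected.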

### Checkup Exercise Set III

Exercise: Using the documentation for the silhouette_score function above, construct a series of silhouette plots like the ones in the article linked above.

Exercise: Compute the average silhouette score for each $K$ and plot it. What $K$ does the plot suggest we should choose? Does it differ from what we found using the Elbow method?



In :

# Using the documentation for the silhouette_score function above,
# construct a series of silhouette plots like the ones in the
# article linked above. Silhouette analysis lets us study the
# separation distance between the resulting clusters.
#
# Note: the scikit-learn example this is adapted from also generates
# toy data with make_blobs and draws a second panel of the clustered
# points; that panel is dropped here because x_cols is high-dimensional
# and cannot be scattered directly.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

import matplotlib.cm as cm
import numpy as np

range_n_clusters = range(2, 10)

for n_clusters in range_n_clusters:
    fig, ax1 = plt.subplots(1, 1)
    fig.set_size_inches(9, 7)

    # The silhouette coefficient can range from -1 to 1, but in this
    # example all values lie within [-0.1, 1.0].
    ax1.set_xlim([-0.1, 1.0])

    # The (n_clusters + 1) * 10 inserts blank space between the silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(x_cols) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(x_cols)

    # silhouette_score gives the average value over all samples.
    # This gives a perspective into the density and separation of the
    # formed clusters.
    silhouette_avg = silhouette_score(x_cols, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette score for each sample.
    sample_silhouette_values = silhouette_samples(x_cols, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them.
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        # cm.spectral in matplotlib < 2.2
        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle.
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for the next plot.
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line marks the average silhouette score of all values.
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the y-axis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    plt.suptitle(("Silhouette analysis for KMeans clustering on the offer "
                  "data with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()




For n_clusters = 2 The average silhouette_score is : 0.0936557328349

For n_clusters = 3 The average silhouette_score is : 0.118899428636

For n_clusters = 4 The average silhouette_score is : 0.123470539196

For n_clusters = 5 The average silhouette_score is : 0.14092516242

For n_clusters = 6 The average silhouette_score is : 0.133564655434

For n_clusters = 7 The average silhouette_score is : 0.123858079543

For n_clusters = 8 The average silhouette_score is : 0.118736028013