Customer Segmentation using Clustering

This mini-project is based on this blog post by yhat. Please feel free to refer to the post for additional information, and solutions.



In [1]:

    
%matplotlib inline
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

# Setup Seaborn
sns.set_style("whitegrid")
sns.set_context("poster")

Data

The dataset contains information on marketing newsletters/e-mail campaigns (e-mail offers sent to customers) and transaction level data from customers. The transactional data shows which offer customers responded to, and what the customer ended up buying. The data is presented as an Excel workbook containing two worksheets. Each worksheet contains a different dataset.



In [2]:

    
df_offers = pd.read_excel("./WineKMC.xlsx", sheetname=0)
df_offers.columns = ["offer_id", "campaign", "varietal", "min_qty", "discount", "origin", "past_peak"]
df_offers.head()









    Out[2]:






  
    
      
      offer_id
      campaign
      varietal
      min_qty
      discount
      origin
      past_peak
    
  
  
    
      0
      1
      January
      Malbec
      72
      56
      France
      False
    
    
      1
      2
      January
      Pinot Noir
      72
      17
      France
      False
    
    
      2
      3
      February
      Espumante
      144
      32
      Oregon
      True
    
    
      3
      4
      February
      Champagne
      72
      48
      France
      True
    
    
      4
      5
      February
      Cabernet Sauvignon
      144
      44
      New Zealand
      True

We see that the first dataset contains information about each offer such as the month it is in effect and several attributes about the wine that the offer refers to: the variety, minimum quantity, discount, country of origin and whether or not it is past peak. The second dataset in the second worksheet contains transactional data -- which offer each customer responded to.



In [3]:

    
df_transactions = pd.read_excel("./WineKMC.xlsx", sheetname=1)
df_transactions.columns = ["customer_name", "offer_id"]
df_transactions['n'] = 1
df_transactions.head()









    Out[3]:






  
    
      
      customer_name
      offer_id
      n
    
  
  
    
      0
      Smith
      2
      1
    
    
      1
      Smith
      24
      1
    
    
      2
      Johnson
      17
      1
    
    
      3
      Johnson
      24
      1
    
    
      4
      Johnson
      26
      1

Data wrangling

We're trying to learn more about how our customers behave, so we can use their behavior (whether or not they purchased something based on an offer) as a way to group similar minded customers together. We can then study those groups to look for patterns and trends which can help us formulate future offers.

The first thing we need is a way to compare customers. To do this, we're going to create a matrix that contains each customer and a 0/1 indicator for whether or not they responded to a given offer.

Checkup Exercise Set I

Exercise: Create a data frame where each row has the following columns (Use the pandas [`merge`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) and [`pivot_table`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html) functions for this purpose):

customer_name
One column for each offer, with a 1 if the customer responded to the offer

Make sure you also deal with any weird values such as `NaN`. Read the documentation to develop your solution.



In [4]:

    
# your turn

# Create a data frame where each row has a customer_name column and one column
# for each offer, with a 1 if the customer responded to the offer.
# Use pandas merge and pivot table functions for this purpose.
df_responses = pd.pivot_table(data=df_offers.merge(df_transactions, how='left', on='offer_id'),
                             values='n', index='customer_name', columns='offer_id', fill_value=0, dropna=False)

K-Means Clustering

Recall that in K-Means Clustering we want to maximize the distance between centroids and minimize the distance between data points and the respective centroid for the cluster they are in. True evaluation for unsupervised learning would require labeled data; however, we can use a variety of intuitive metrics to try to pick the number of clusters K. We will introduce two methods: the Elbow method, the Silhouette method and the gap statistic.

Checkup Exercise Set II

Exercise:

What values of $SS$ do you believe represent better clusterings? Why?
Create a numpy matrix `x_cols` with only the columns representing the offers (i.e. the 0/1 colums)
Write code that applies the [`KMeans`](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) clustering method from scikit-learn to this matrix.
Construct a plot showing $SS$ for each $K$ and pick $K$ using this plot. For simplicity, test $2 \le K \le 10$.
Make a bar chart showing the number of points in each cluster for k-means under the best $K$.
What challenges did you experience using the Elbow method to pick $K$?



In [5]:

    
# your turn

# Create numpy matrix with only the columns representing the offers.
from sklearn.cluster import KMeans
import numpy as np
x_cols = np.matrix(df_responses)

# Set-up a placeholder dataframe for SS scores.
# The scores will be populated by NaN for now.
SS_scores = pd.DataFrame(columns=['K', 'score'])
SS_scores['K'] = range(2, 11)

# Apply KMeans clustering method to above matrix.
# The exercise suggests testing for 2 <= K <= 10.
for K in range(2, 11):
    cluster = KMeans(n_clusters=K, random_state=0).fit(x_cols)
    SS_scores['score'][SS_scores.K == K] = cluster.inertia_

# Plot SS for each K.
plt.scatter(data=SS_scores, x='K', y='score')

# Based on this SSE plot, there is no clear elbow wherein the 
# the SSE decreases rapidly. The score is still dropping
# from K =8 to K =10.  So I guess at this point, I'm choosing
# K = 10 as the best K.









    



C:\Users\Noreena\Anaconda3\lib\site-packages\ipykernel\__main__.py:17: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy






    Out[5]:





<matplotlib.collections.PathCollection at 0x22ead1525c0>



In [6]:

    
# Make a bar chart showing the number of points in each cluster for k-means 
# under the best K, which I chose to be 10.
 
cluster = KMeans(n_clusters=10, random_state=0).fit(x_cols)

plt.hist(x=cluster.labels_, bins=10)

# What challenges did you experience using the Elbow method to pick K
#     Answer: The elbow method didn't really work since I couldn't find
#             that elbow inflection point.









    Out[6]:





(array([  9.,  12.,  17.,  14.,   5.,   7.,   9.,  19.,   3.,   5.]),
 array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ]),
 <a list of 10 Patch objects>)

Choosing K: The Silhouette Method

There exists another method that measures how well each datapoint $x_i$ "fits" its assigned cluster and also how poorly it fits into other clusters. This is a different way of looking at the same objective. Denote $a_{x_i}$ as the average distance from $x_i$ to all other points within its own cluster $k$. The lower the value, the better. On the other hand $b_{x_i}$ is the minimum average distance from $x_i$ to points in a different cluster, minimized over clusters. That is, compute separately for each cluster the average distance from $x_i$ to the points within that cluster, and then take the minimum. The silhouette $s(x_i)$ is defined as

$$s(x_i) = \frac{b_{x_i} - a_{x_i}}{\max{\left( a_{x_i}, b_{x_i}\right)}}$$

The silhouette score is computed on every datapoint in every cluster. The silhouette score ranges from -1 (a poor clustering) to +1 (a very dense clustering) with 0 denoting the situation where clusters overlap. Some criteria for the silhouette coefficient is provided in the table below.

Fortunately, scikit-learn provides a function to compute this for us (phew!) called sklearn.metrics.silhouette_score. Take a look at this article on picking $K$ in scikit-learn, as it will help you in the next exercise set.

Checkup Exercise Set III

Exercise: Using the documentation for the `silhouette_score` function above, construct a series of silhouette plots like the ones in the article linked above.

Exercise: Compute the average silhouette score for each $K$ and plot it. What $K$ does the plot suggest we should choose? Does it differ from what we found using the Elbow method?



In [7]:

    
# Your turn.

# Using the documentation for the silhoutte_score function above,
# construct a series of silhouetter plots like the ones in the
# article linked above. By doing silhouette analysis, we can
# study the separation distance between the resulting clusters.

from __future__ import print_function

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

import matplotlib.cm as cm
import numpy as np

print(__doc__)

# Generating the sample data from make_blobs
# This particular setting has one distinct cluster and 3 clusters placed close
# together.
X, y = make_blobs(n_samples=500,
                  n_features=2,
                  centers=4,
                  cluster_std=1,
                  center_box=(-10.0, 10.0),
                  shuffle=True,
                  random_state=1)  # For reproducibility

range_n_clusters = range(2, 10)

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns.
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot.
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.2, 0.5]
    ax1.set_xlim([-0.1, 1.0])
    
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    # ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(x_cols)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(x_cols, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(x_cols, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([])
    
    # 2nd Plot show the actual clusters formed
    colors = cm.spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors)

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1],
                marker='o', c="white", alpha=1, s=200)

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50)

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")


    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

    plt.show()









    



Automatically created module for IPython interactive environment
For n_clusters = 2 The average silhouette_score is : 0.0936557328349






    












    



For n_clusters = 3 The average silhouette_score is : 0.118899428636






    












    



For n_clusters = 4 The average silhouette_score is : 0.123470539196






    












    



For n_clusters = 5 The average silhouette_score is : 0.14092516242






    












    



For n_clusters = 6 The average silhouette_score is : 0.133564655434






    












    



For n_clusters = 7 The average silhouette_score is : 0.123858079543






    












    



For n_clusters = 8 The average silhouette_score is : 0.118736028013






    












    



For n_clusters = 9 The average silhouette_score is : 0.125328106462



In [8]:

    
# What K does the plot suggest we should choose? 
# Does it differ from what we found using the Elbow method?

#     The silhouette score for the different number of clusters is around 0.12
#     which according to the table criteria for silhouette scoring is that
#     no substantial structure has been found.  Like the elbow test, we can't
#     determine through this method the best K.

Aside: Choosing $K$ when we Have Labels

Unsupervised learning expects that we do not have the labels. In some situations, we may wish to cluster data that is labeled. Computing the optimal number of clusters is much easier if we have access to labels. There are several methods available. We will not go into the math or details since it is rare to have access to the labels, but we provide the names and references of these measures.

Adjusted Rand Index
Mutual Information
V-Measure
Fowlkes–Mallows index

See this article for more information about these metrics.

Visualizing Clusters using PCA

How do we visualize clusters? If we only had two features, we could likely plot the data as is. But we have 100 data points each containing 32 features (dimensions). Principal Component Analysis (PCA) will help us reduce the dimensionality of our data from 32 to something lower. For a visualization on the coordinate plane, we will use 2 dimensions. In this exercise, we're going to use it to transform our multi-dimensional dataset into a 2 dimensional dataset.

This is only one use of PCA for dimension reduction. We can also use PCA when we want to perform regression but we have a set of highly correlated variables. PCA untangles these correlations into a smaller number of features/predictors all of which are orthogonal (not correlated). PCA is also used to reduce a large set of variables into a much smaller one.

Checkup Exercise Set IV

Exercise: Use PCA to plot your clusters:

Use scikit-learn's [`PCA`](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) function to reduce the dimensionality of your clustering data to 2 components
Create a data frame with the following fields:
- customer name
- cluster id the customer belongs to
- the two PCA components (label them `x` and `y`)
Plot a scatterplot of the `x` vs `y` columns
Color-code points differently based on cluster ID
How do the clusters look?
Based on what you see, what seems to be the best value for $K$? Moreover, which method of choosing $K$ seems to have produced the optimal result visually?

Exercise: Now look at both the original raw data about the offers and transactions and look at the fitted clusters. Tell a story about the clusters in context of the original data. For example, do the clusters correspond to wine variants or something else interesting?



In [9]:

    
# your turn

# Use scikit-learn's PCA function to reduce the dimensionality of 
# your clustering data to 2 components.  But first, I will
# re-run the clustering program to only make 5 clusters.
# This is so that I may be able to see clearly the
# different clusters.
cluster = KMeans(n_clusters=5, random_state=0).fit(x_cols)
PCA2 = sklearn.decomposition.PCA(n_components=2)
PCA2.fit(df_responses)

df_responses_pca = pd.DataFrame(PCA2.transform(df_responses))
df_responses_pca.index = df_responses.index

# Create a data frame with the following fields: 
#     customer name 
#     the two PCA components (label them x and y) 
df_labels = pd.DataFrame(cluster.labels_)
df_labels.index = df_responses.index

df_pca = df_responses_pca.merge(df_labels, left_index=True, right_index=True)
df_pca.columns = ['x', 'y', 'cluster_id']
df_pca.head(10)









    Out[9]:






  
    
      
      x
      y
      cluster_id
    
    
      customer_name
      
      
      
    
  
  
    
      Adams
      -1.007580
      0.108215
      0
    
    
      Allen
      0.287539
      0.044715
      2
    
    
      Anderson
      0.392032
      1.038391
      1
    
    
      Bailey
      -0.699477
      -0.022542
      0
    
    
      Baker
      -0.088183
      -0.471695
      2
    
    
      Barnes
      0.485282
      -0.725503
      3
    
    
      Bell
      0.591941
      1.506500
      1
    
    
      Bennett
      -0.661732
      0.090308
      2
    
    
      Brooks
      0.118943
      -0.577499
      2
    
    
      Brown
      -1.079827
      -0.007488
      0



In [10]:

    
# Plot a scatterplot of the x vs y columns. 
# Color-code points differently based on cluster ID 
sns.lmplot(data=df_pca, x='x', y='y', hue='cluster_id', size=5, fit_reg=False)









    Out[10]:





<seaborn.axisgrid.FacetGrid at 0x22eacdb0160>



In [11]:

    
# your turn

# The below code is not asked by the exercise but I wanted to
# see a different plot with centroids so I patterned it
# after an example by scikit-learn.

# Use scikit-learn's PCA function to reduce the dimensionality of 
# your clustering data to 2 components.  But first, I will
# re-run the clustering program to only make 5 clusters.
# This is so that I may be able to see clearly the
# different clusters.

# Plot the PCA results showing the centroids, this is
# patterned from an example ny scikit-learn

import numpy as np
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

PCA2comp = PCA(n_components=2).fit_transform(x_cols)
kmeans = KMeans(n_clusters=5, n_init=100)
kmeans.fit(PCA2comp)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02     # point in the mesh [x_min, m_max]x[y_min, y_max].

# Plot the decision boundary. For that, we will assign a color to each
x_min, x_max = PCA2comp[:, 0].min() - 1, PCA2comp[:, 0].max() + 1
y_min, y_max = PCA2comp[:, 1].min() - 1, PCA2comp[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
df_PCA = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
df_PCA = df_PCA.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(df_PCA, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

plt.plot(PCA2comp[:, 0], PCA2comp[:, 1], 'k.', markersize=5)

# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
plt.title('K-means clustering on the digits dataset (PCA-reduced data)\n'
          'Centroids are marked with white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()



In [12]:

    
# How do the clusters look? 
#     It appears that there's a better separation
#     of variables in the 5 clusters.

# Based on what you see, what seems to be the best value for K?
# Moreover, which method of choosing K seems to have produced
# the optimal result visually?
#     Based on the PCA plot, it appears that K=2 is the best K.
#     With this sample data, I would choose this PCA method
#     as having shown optimal result visually.

Exercise Set V

As we saw earlier, PCA has a lot of other uses. Since we wanted to visualize our data in 2 dimensions, restricted the number of dimensions to 2 in PCA. But what is the true optimal number of dimensions?

Exercise: Using a new PCA object shown in the next cell, plot the `explained_variance_` field and look for the elbow point, the point where the curve's rate of descent seems to slow sharply. This value is one possible value for the optimal number of dimensions. What is it?



In [13]:

    
# your turn

# Initialize a new PCA model with a default number of components.
import sklearn.decomposition
pca = sklearn.decomposition.PCA()
pca.fit(df_responses)

# Do the rest on your own :)
# Plot the explained_variance_ field.
df_explained_variance = pd.DataFrame({"Dimension Number": range(1, len(pca.explained_variance_) + 1),
                                      "Explained Variance": pca.explained_variance_})

plt.scatter(data=df_explained_variance, x="Dimension Number", y="Explained Variance")

# Look for the elbow point, the point where the curve's rate of descent seems to slow sharply. 
# This value is one possible value for the optimal number of dimensions. What is it?
#      Based on this explained_variance_ graph, it appears that the elbow point is
#      the inflection point of K=4.









    Out[13]:





<matplotlib.collections.PathCollection at 0x22ead9adc50>

Other Clustering Algorithms

k-means is only one of a ton of clustering algorithms. Below is a brief description of several clustering algorithms, and the table provides references to the other clustering algorithms in scikit-learn.

Affinity Propagation does not require the number of clusters $K$ to be known in advance! AP uses a "message passing" paradigm to cluster points based on their similarity.
Spectral Clustering uses the eigenvalues of a similarity matrix to reduce the dimensionality of the data before clustering in a lower dimensional space. This is tangentially similar to what we did to visualize k-means clusters using PCA. The number of clusters must be known a priori.
Ward's Method applies to hierarchical clustering. Hierarchical clustering algorithms take a set of data and successively divide the observations into more and more clusters at each layer of the hierarchy. Ward's method is used to determine when two clusters in the hierarchy should be combined into one. It is basically an extension of hierarchical clustering. Hierarchical clustering is divisive, that is, all observations are part of the same cluster at first, and at each successive iteration, the clusters are made smaller and smaller. With hierarchical clustering, a hierarchy is constructed, and there is not really the concept of "number of clusters." The number of clusters simply determines how low or how high in the hierarchy we reference and can be determined empirically or by looking at the dendogram.
Agglomerative Clustering is similar to hierarchical clustering but but is not divisive, it is agglomerative. That is, every observation is placed into its own cluster and at each iteration or level or the hierarchy, observations are merged into fewer and fewer clusters until convergence. Similar to hierarchical clustering, the constructed hierarchy contains all possible numbers of clusters and it is up to the analyst to pick the number by reviewing statistics or the dendogram.
DBSCAN is based on point density rather than distance. It groups together points with many nearby neighbors. DBSCAN is one of the most cited algorithms in the literature. It does not require knowing the number of clusters a priori, but does require specifying the neighborhood size.

Clustering Algorithms in Scikit-learn

</colgroup>

Method name

Parameters

Scalability

Use Case

Geometry (metric used) </tr> </thead>

K-Means

number of clusters

Very largen_samples, medium n_clusters with MiniBatch code

General-purpose, even cluster size, flat geometry, not too many clusters

Distances between points </tr>

Affinity propagation

damping, sample preference

Not scalable with n_samples

Many clusters, uneven cluster size, non-flat geometry

Graph distance (e.g. nearest-neighbor graph) </tr>

Mean-shift

bandwidth

Not scalable with n_samples

Many clusters, uneven cluster size, non-flat geometry

Distances between points </tr>

Spectral clustering

number of clusters

Medium n_samples, small n_clusters

Few clusters, even cluster size, non-flat geometry

Graph distance (e.g. nearest-neighbor graph) </tr>

Ward hierarchical clustering

number of clusters

Large n_samples and n_clusters

Many clusters, possibly connectivity constraints

Distances between points </tr>

Agglomerative clustering

number of clusters, linkage type, distance

Large n_samples and n_clusters

Many clusters, possibly connectivity constraints, non Euclidean distances

Any pairwise distance </tr>

DBSCAN

neighborhood size

Very large n_samples, medium n_clusters

Non-flat geometry, uneven cluster sizes

Distances between nearest points </tr>

Gaussian mixtures

many

Not scalable

Flat geometry, good for density estimation

Mahalanobis distances to centers </tr>

Birch

branching factor, threshold, optional global clusterer.

Large n_clusters and n_samples

Large dataset, outlier removal, data reduction.

Euclidean distance between points </tr> </tbody> </table> Source: http://scikit-learn.org/stable/modules/clustering.html

Exercise Set VI

Exercise: Try clustering using the following algorithms.

Affinity propagation
Spectral clustering
Agglomerative clustering
DBSCAN

How do their results compare? Which performs the best? Tell a story why you think it performs the best.



In [14]:

    
# Your turn
# Try clustering using the Affinity Propagation algorithm.
# I patterned this after the scikit-learn demo on the algorithm.
from sklearn.cluster import AffinityPropagation
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
import matplotlib.pyplot as plt
from itertools import cycle

AP_cluster = AffinityPropagation(preference=10).fit(x_cols)

cluster_centers_indices = AP_cluster.cluster_centers_indices_
labels = AP_cluster.labels_

n_clusters_ = len(cluster_centers_indices)

# Plot the result, again this code is from the
# scikit learn example.
plt.close('all')
plt.figure(1)
plt.clf()

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in X[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()









    



C:\Users\Noreena\Anaconda3\lib\site-packages\ipykernel\__main__.py:27: VisibleDeprecationWarning: boolean index did not match indexed array along dimension 0; dimension is 500 but corresponding boolean dimension is 100
C:\Users\Noreena\Anaconda3\lib\site-packages\ipykernel\__main__.py:30: VisibleDeprecationWarning: boolean index did not match indexed array along dimension 0; dimension is 500 but corresponding boolean dimension is 100



In [15]:

    
# Your turn

# Try clustering using the Spectral Clustering algorithm.
# I patterned this after the scikit-learn demo.

from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.cluster import SpectralClustering

# normalize dataset for easier parameter selection
#x_cols_norm = StandardScaler().fit_transform(x_cols)

spectral_cluster = SpectralClustering(n_clusters=2,
                                      eigen_solver='arpack',
                                      affinity="nearest_neighbors").fit(x_cols)

# Get labels of clusters.
labels = spectral_cluster.labels_

# Plot the result, again this code is from the scikit-learn example.
plt.close('all')
plt.clf()

#plt.title('Estimated number of clusters: %d' % n_clusters_)

# Plot spectral clusters.
plt.scatter(x_cols[:, 0], x_cols[:, 1])

plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.xticks(())
plt.yticks(())
plt.show









    Out[15]:





<function matplotlib.pyplot.show>

	offer_id	campaign	varietal	min_qty	discount	origin	past_peak
0	1	January	Malbec	72	56	France	False
1	2	January	Pinot Noir	72	17	France	False
2	3	February	Espumante	144	32	Oregon	True
3	4	February	Champagne	72	48	France	True
4	5	February	Cabernet Sauvignon	144	44	New Zealand	True

	x	y	cluster_id
customer_name
Adams	-1.007580	0.108215	0
Allen	0.287539	0.044715	2
Anderson	0.392032	1.038391	1
Bailey	-0.699477	-0.022542	0
Baker	-0.088183	-0.471695	2
Barnes	0.485282	-0.725503	3
Bell	0.591941	1.506500	1
Bennett	-0.661732	0.090308	2
Brooks	0.118943	-0.577499	2
Brown	-1.079827	-0.007488	0