In this project, you will analyze a dataset of annual customer spending amounts in order to understand the internal structure of the data and the variation among the different types of customers that a wholesale distributor interacts with.
Instructions:
In [1]:
# Import libraries: NumPy, pandas, seaborn, matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import rc
# Tell IPython to include plots inline in the notebook
%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}
# Set styles for seaborn
rc_sns = {'lines.linewidth': 2,
          'axes.labelsize': 14,
          'axes.titlesize': 14,
          'axes.facecolor': '#DFDFE5'}
sns.set_context('notebook', font_scale=1.2, rc=rc_sns)
sns.set_style('darkgrid', rc=rc_sns)
# Read the dataset and take a first look at it
data = pd.read_csv("wholesale-customers.csv")
print("Dataset has {} rows, {} columns".format(*data.shape))
print(data.head())  # print the first 5 rows
data.describe()
In [2]:
from sklearn.preprocessing import StandardScaler
# Standardize the features and plot their correlation matrix
sc = StandardScaler()
X_std = sc.fit_transform(data)
cols = list(data.columns)
sns.heatmap(np.corrcoef(X_std.T),
            cbar=True, square=True, annot=True, fmt='.2f',
            xticklabels=cols, yticklabels=cols)
data.describe()
Out[2]:
1) In this section you will be using PCA and ICA to start to understand the structure of the data. Before doing any computations, what do you think will show up in your computations? List one or two ideas for what might show up as the first PCA dimensions, or what type of vectors will show up as ICA dimensions.
Answer:
PCA seeks to find orthogonal directions (principal components) that maximize the variance of the projected data. According to the table above, "Fresh" and "Grocery" have the largest variance and are thus more likely to dominate the first PCA dimension.
For ICA, the non-normality of the marginal densities is maximized in order to discover the latent variables (independent components) underlying the dataset.[1] The vectors returned by ICA would be the "sources" that generate the observed dataset. According to the correlation matrix generated above, "Fresh" and "Frozen" have little correlation with the other variables and are thus more likely to show up as their own ICA dimensions.
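As a quick check on the claim about feature variances, the following minimal sketch simply ranks the raw features of the data frame loaded above by variance:
In [ ]:
# Rank the (unstandardized) features by variance to see which ones
# are expected to dominate the first principal component
data.var().sort_values(ascending=False)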
In [3]:
# TODO: Apply PCA with the same number of dimensions as variables in the dataset
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Standardize input data
sc = StandardScaler()
X = data.values.astype(np.float64)
X_std = sc.fit_transform(X)
# Perform PCA analysis
pca = PCA(n_components=None)
X_std_pca = pca.fit_transform(X_std)
# Print the components and the amount of variance in the data contained in each dimension
print('Principal components (pc) :')
print(pca.components_)
print('\nExplained variance ratios (EVR):')
print(pca.explained_variance_ratio_)
print('\nEVR of the 1st and 2nd pc: %.3f' % sum(pca.explained_variance_ratio_[:2]))
print('EVR of the 3rd and 4th pc: %.3f' % sum(pca.explained_variance_ratio_[2:4]))
print('EVR of the first four pc: %.3f' % sum(pca.explained_variance_ratio_[:4]))
# Visualization of individual and cumulative explained variance ratios
# code adapted from p. 132 of S. Raschka, "Python Machine Learning", 2015
n_pca = pca.n_components_
evr = pca.explained_variance_ratio_
cum_evr = np.cumsum(evr)
plt.bar(range(1, n_pca+1), evr, alpha=0.75,
        align='center', label='explained variance ratio (individual)')
plt.step(range(1, n_pca+1), cum_evr,
         where='mid', label='explained variance ratio (cumulative)')
plt.legend(loc='best')
plt.xlabel('Number of principal components')
rc('font', weight='bold')
2) How quickly does the variance drop off by dimension? If you were to use PCA on this dataset, how many dimensions would you choose for your analysis? Why?
Answer:
The explained variance ratio drops off quickly after the first two principal components, which together account for 72.5% of the total variance, while the third and fourth account for around 21.7%.
It would be reasonable to choose a total of four dimensions for the analysis, since the first four principal components account for around 94.2% of the variance.
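The choice can also be made programmatically. Below is a minimal sketch (assuming the fitted pca object from the cell above) that picks the smallest number of components whose cumulative explained variance ratio reaches a chosen threshold, e.g. 90%:
In [ ]:
# Smallest number of principal components needed to reach a
# cumulative explained variance ratio of at least 90%
threshold = 0.90
n_keep = np.argmax(np.cumsum(pca.explained_variance_ratio_) >= threshold) + 1
print('Components needed for >= {:.0%} of the variance: {}'.format(threshold, n_keep))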
3) What do the dimensions seem to represent? How can you use this information?
In [4]:
pd.DataFrame(pca.components_, columns=cols, index=['PC %d' %idx for idx in range(1, len(cols)+1)])
Out[4]:
Answer:
The dimensions represent linear combinations of the original features that capture the variance between data points in decreasing order. For example, as shown in the table above, the first principal component (PC) projects the data onto an axis that assigns more absolute weight to features such as Milk, Grocery and Detergents_Paper than to Fresh and Frozen. For the second PC, more absolute weight is assigned to Fresh, Frozen and Delicatessen than to Grocery and Milk.
The transformed dataset might prove to be a better representation of the original dataset, thus potentially making clustering results more relevant. Throwing out the principal components with low variance might also help reduce the "curse of dimensionality".
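To make the "linear combination" interpretation concrete, the following minimal sketch (assuming the pca, X_std and X_std_pca objects from the cells above) reproduces the first PCA dimension by hand:
In [ ]:
# The first PCA dimension is the (mean-centered) standardized data
# projected onto the first component vector
manual_pc1 = (X_std - pca.mean_).dot(pca.components_[0])
print(np.allclose(manual_pc1, X_std_pca[:, 0]))  # expected: True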
In [10]:
# TODO: Fit an ICA model to the data
# Note: Adjust the data to have center at the origin first!
from sklearn.decomposition import FastICA
ica = FastICA(n_components=None, whiten=True, random_state=0)
X_ica = ica.fit_transform(X_std)
print('The unmixing matrix:')
pd.set_option('display.precision', 4)
pd.DataFrame(ica.components_, columns=cols, index=['IC %d' %idx for idx in range(1, len(cols)+1)])
Out[10]:
4) For each vector in the ICA decomposition, write a sentence or two explaining what sort of object or property it corresponds to. What could these components be used for?
Answer:
Each row vector in ica.components_ represents a row of the un-mixing matrix. After transforming the standardized dataset with X_std.dot(ica.components_.T), each new dimension (interpreted here as a customer behavior) is as statistically independent of the others as possible. The transformed data can then be used for clustering analysis and for identifying customer categories.
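As a sanity check on this description, the sketch below (assuming the fitted ica object, X_std and X_ica from the cell above) applies the un-mixing matrix by hand; note that FastICA also subtracts its fitted mean before un-mixing:
In [ ]:
# Reproduce the ICA transform manually: subtract the fitted mean,
# then apply the un-mixing matrix stored in ica.components_
manual_ics = (X_std - ica.mean_).dot(ica.components_.T)
print(np.allclose(manual_ics, X_ica))  # expected: True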
Description of independent components (IC) or customer behaviors:
Answer:
KMeans
Gaussian Mixture Model
For this case, GMM might be more appropriate since the cluster sizes might not be evenly distributed.
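One practical difference between the two is hard versus soft assignment, as the minimal sketch below shows (it reuses X_std from above and the same sklearn.mixture.GMM API used in the rest of this notebook):
In [ ]:
from sklearn.cluster import KMeans
from sklearn.mixture import GMM
# K-Means assigns a hard cluster label to each customer ...
km_labels = KMeans(n_clusters=2, random_state=0).fit_predict(X_std)
# ... while GMM yields posterior probabilities (soft assignments)
gmm = GMM(n_components=2, covariance_type='tied', random_state=0).fit(X_std)
gmm_probs = gmm.predict_proba(X_std)
print(km_labels[:5])
print(gmm_probs[:5].round(3))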
6) Below is some starter code to help you visualize some cluster data. The visualization is based on this demo from the sklearn documentation.
In [6]:
# Generate a plot for silhouette scores vs number of specified clusters
# in order to aid selection of optimal cluster size
# Import clustering modules
import itertools
import matplotlib.gridspec as gridspec
from mlxtend.evaluate import plot_decision_regions
from sklearn.cluster import KMeans
from sklearn.mixture import GMM
from sklearn.metrics import silhouette_score
# Use silhouette score to select the number of clusters for K-means and GMM
n_cluster_min = 2
n_cluster_max = 11
cluster_range = range(n_cluster_min, n_cluster_max+1)
# Compute silhouette scores
km_silhouette = []
cov_types = ['spherical', 'tied', 'diag', 'full']
empty_list = [[] for i in cov_types]
gmm_silhouette = dict(zip(cov_types, empty_list))
for i in cluster_range:
    y_km = KMeans(n_clusters=i, random_state=0).fit_predict(X_std)
    km_silhouette.append(silhouette_score(X_std, y_km, metric='euclidean'))
    for cov_type in cov_types:
        y_gmm = GMM(n_components=i, random_state=0, covariance_type=cov_type).fit_predict(X_std)
        gmm_silhouette[cov_type].append(silhouette_score(X_std, y_gmm, metric='euclidean'))
# Plot silhouette score vs number of clusters chosen
plt.figure(figsize=(8, 5), dpi=300)
plt.plot(cluster_range, km_silhouette, marker='o', label='K-Means')
for cov_type in cov_types:
    plt.plot(cluster_range, gmm_silhouette[cov_type], marker='s', label='GMM (%s)' % cov_type)
plt.xlabel('Number of clusters specified')
plt.ylabel('Silhouette score')
plt.xlim(n_cluster_min-1, n_cluster_max+1)
plt.ylim(0, 0.805)
plt.yticks(np.arange(0, 0.805, 0.2))
plt.legend(loc='upper right')
rc('font', weight='bold')
From the plot shown above, the GMM clustering algorithm with covariance_type='tied' and n_components=2 provides the best silhouette score of 0.733, which is around 19.3% higher than that of K-Means with n_clusters=2.
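The same conclusion can be read off programmatically; a minimal sketch, assuming the km_silhouette and gmm_silhouette containers filled in the cell above:
In [ ]:
# Locate the best silhouette score for each algorithm and
# compare the winning GMM configuration against K-Means
best_km = max(km_silhouette)
best_gmm_cov, best_gmm = max(((cov, max(scores)) for cov, scores in gmm_silhouette.items()),
                             key=lambda t: t[1])
print('Best K-Means silhouette score: %.3f' % best_km)
print('Best GMM silhouette score    : %.3f (covariance_type=%s)' % (best_gmm, best_gmm_cov))
print('Relative improvement         : %.1f%%' % (100 * (best_gmm - best_km) / best_km))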
In [7]:
# TODO: First we reduce the data to two dimensions using PCA to capture variation
pca_reduced = PCA(n_components=2)
X_std_pca_reduced = pca_reduced.fit_transform(X_std)
print('==============================================================================')
print('First 5 elements after transformation with the first two principal components:')
print(X_std_pca_reduced[:5])
print('==============================================================================')
# TODO: Implement your clustering algorithm here,
# and fit it to the reduced data for visualization
# Initialize the clustering algorithms (KMeans / GMM) and a dict to hold their centroids
clus = [KMeans(n_clusters=2, random_state=0),
        GMM(n_components=2, covariance_type='tied', random_state=0)]
centroids = {}
# plotting decision regions for KMeans and GMM
gs = gridspec.GridSpec(1, 2)
grds = itertools.product([0, 1], repeat=1)
fig = plt.figure(figsize=(12, 5))
for clu, grd in zip(clus, grds):
    # fit to the PCA-reduced data
    y_clu = clu.fit_predict(X_std_pca_reduced)
    clu_name = clu.__class__.__name__
    if clu_name == 'KMeans':
        centroids[clu_name] = clu.cluster_centers_
    elif clu_name == 'GMM':
        centroids[clu_name] = clu.means_
    print('%s:\n%s' % (clu_name, clu))
    print('==============================================================================')
    # plotting the decision boundaries with mlxtend.evaluate.plot_decision_regions
    # http://rasbt.github.io/mlxtend/user_guide/evaluate/plot_decision_regions/
    ax = plt.subplot(gs[grd[0]])
    ax = plot_decision_regions(X=X_std_pca_reduced, y=y_clu, clf=clu)
    ax.legend(loc='lower left', frameon=True)
    # plotting the centroids of the clusters
    plt.scatter(centroids[clu_name][:, 0], centroids[clu_name][:, 1],
                marker='x', s=150, linewidths=2, color='w', zorder=2)
    plt.title(clu_name)
    plt.xlabel('PC 1')
    plt.ylabel('PC 2')
7) What are the central objects in each cluster? Describe them as customers.
In [8]:
print('Centroids of KMeans:')
print(centroids['KMeans'])
print('\nMeans of GMM:')
print(centroids['GMM'])
pd.set_option('display.precision', 4)
print('\nCustomer types based on KMeans centroids:')
pd.DataFrame(sc.inverse_transform(pca_reduced.inverse_transform(centroids['KMeans'])),
             columns=cols, index=range(1, 3)).T.plot(kind='barh')
Out[8]:
In [9]:
print('Customer types based on GMM means:')
pd.DataFrame(sc.inverse_transform(pca_reduced.inverse_transform(centroids['GMM'])),
             columns=cols, index=range(1, 3)).T.plot(kind='barh')
Out[9]:
Answer:
The centroids returned by KMeans and the cluster means returned by GMM are listed above. For KMeans, the data points just to the left of the upper decision boundary might not be suitably classified. Visually, the decision boundary produced by GMM seems more plausible.
According to KMeans,
According to GMM,
In my opinion, PCA and GMM clustering provide a suitable framework for customer type discovery.
[5] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed., 2011, p. 463.
9) How would you use that technique to assist the company if it conducted an experiment?
With the GMM clustering model, we can separate the existing customer base into two types and perform targeted experiments. For each type of customer, we can first randomly divide them into two groups: one receives the existing service and serves as the control, while the other is the experimental group and receives a variation of the service. The responses/sales from each group of customers can then be recorded and analyzed, which can help the company decide whether to change the service for a particular type of customer.
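Below is a minimal sketch of this within-cluster control/treatment split; the cluster labels come from a GMM fitted as above, and the group assignment itself is purely hypothetical:
In [ ]:
# Hypothetical A/B assignment within each customer type (cluster):
# half of each cluster is held out as a control group, the rest is treated
rng = np.random.RandomState(0)
cluster_labels = GMM(n_components=2, covariance_type='tied', random_state=0).fit_predict(X_std)
assignment = pd.DataFrame({'cluster': cluster_labels}, index=data.index)
assignment['group'] = 'control'
for c in np.unique(cluster_labels):
    members = assignment.index[assignment['cluster'] == c]
    treated = rng.choice(members, size=len(members) // 2, replace=False)
    assignment.loc[treated, 'group'] = 'treatment'
print(assignment.groupby(['cluster', 'group']).size())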
10) How would you use that data to predict future customer needs?
After performing some experiments using the methodology outlined above, the responses/sales can be organized into a binary target variable (better/worse) or a continuous target variable (percent difference compared to the average of the matching control group).
For each case, we can use the existing data to train and validate a supervised classification/regression model with, for example, KNN or Logistic Regression; a sketch of this step follows below. When a new customer's data becomes available, we can use the model to predict whether the customer will respond favorably to the change in service, or to predict the magnitude of the response in terms of sales.
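As an illustration, here is a minimal sketch with a hypothetical binary response label (responded is made-up data standing in for the experiment outcome described above); it uses sklearn.model_selection.train_test_split, which lives in sklearn.cross_validation on older versions:
In [ ]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Hypothetical experiment outcome: 1 = responded favorably, 0 = did not
rng = np.random.RandomState(0)
responded = rng.randint(0, 2, size=len(data))
# Train and validate a simple classifier on the standardized spending features
X_train, X_test, y_train, y_test = train_test_split(X_std, responded,
                                                    test_size=0.3, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print('Validation accuracy: %.3f' % clf.score(X_test, y_test))
# Predict the response of a new customer (here: the first existing one)
print('Predicted response:', clf.predict(X_std[:1])[0])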