In this project, you will analyze a dataset of annual spending amounts for internal structure, to understand the variation in the different types of customers that a wholesale distributor interacts with.
Instructions:
In [1]:
import warnings
warnings.filterwarnings('ignore')
# Import libraries: NumPy, pandas, matplotlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Tell iPython to include plots inline in the notebook
%matplotlib inline
# Read dataset
data = pd.read_csv("wholesale-customers.csv")
print("Dataset has {} rows, {} columns".format(*data.shape))
print(data.head())  # print the first 5 rows
print(data.describe())
data.plot(kind="box")
Out[1]:
1) In this section you will be using PCA and ICA to start to understand the structure of the data. Before doing any computations, what do you think will show up in your computations? List one or two ideas for what might show up as the first PCA dimensions, or what type of vectors will show up as ICA dimensions.
Answer:
PCA reduces the dimensionality of the data by projecting the features onto a lower-dimensional space while retaining as much of the variance as possible. Because the first components follow the directions of largest variance, and judging from the box plot of the data above, PCA should return a first dimension corresponding to the category "Fresh" and a second dimension corresponding to the category "Grocery".
ICA helps to separate source signals from composite signals. In this problem, we are given the amount of money spent on different product categories by 440 customers of a wholesale distributor. The vectors resulting from ICA should therefore be indicators of the different types of customers of the distributor.
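To make the expectation about the first PCA dimension concrete, here is a minimal sketch on synthetic data, not on the wholesale dataset. It assumes three made-up spending columns where the first has by far the largest spread, and it computes the principal directions with a plain NumPy SVD instead of scikit-learn. The names `principal_directions` and `spend` are hypothetical.

```python
import numpy as np

def principal_directions(X):
    """Return principal directions (as rows) and explained-variance ratios of X."""
    Xc = X - X.mean(axis=0)                      # center each column
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2 / (X.shape[0] - 1)              # variance along each direction
    return Vt, var / var.sum()

rng = np.random.default_rng(0)
# Synthetic stand-in for three spending columns; means and spreads are made up.
spend = rng.normal(loc=[12000.0, 8000.0, 3000.0],
                   scale=[12000.0, 4000.0, 500.0],
                   size=(440, 3))

directions, ratios = principal_directions(spend)
dominant = int(np.argmax(np.abs(directions[0])))
print("first direction:", np.round(directions[0], 3))
print("explained variance ratios:", np.round(ratios, 3))
print("column with the largest weight in PC1:", dominant)
```

When no scaling is applied, the column with the largest spread tends to dominate the first direction, which is the reasoning behind the guess above.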
In [2]:
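# TODO: Apply PCA with the same number of dimensions as variables in the dataset
# (this cell keeps only the first 2 components so they can be drawn as a biplot;
# all 6 components are fitted in a later cell)
from sklearn.decomposition import PCA
pca = PCA(n_components=2, whiten=True)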
# we don't need scaling here because the input data represents
# annual spending in monetary units and hence are on same scale.
Z = pd.DataFrame(pca.fit_transform(data), columns=["PC1", "PC2"])
# Print the components and the amount of variance in the data contained in each dimension
print(pca.components_)
print(pca.explained_variance_ratio_)
ax = Z.plot(kind='scatter', x="PC1", y="PC2", figsize=(10, 8), s=1)
loading = pca.components_
# draw one arrow per original feature, using its weights in the first two components
for i, (x, y) in enumerate(zip(loading[0], loading[1])):
    ax.arrow(0, 0, x, y, width=0.001, fc='orange', ec='orange')
    ax.annotate(data.columns[i], (x, y))
ax.set_xlim([-2.5, 1])
ax.set_ylim([-1, 2.5])
Out[2]:
2) How quickly does the variance drop off by dimension? If you were to use PCA on this dataset, how many dimensions would you choose for your analysis? Why?
In [3]:
pca = PCA(n_components=6, whiten=True)
# we don't need scaling here because the input data represents
# annual spending in monetary units and hence are on same scale.
Z = pd.DataFrame(pca.fit_transform(data))
# print the amount of variance explained by each component
variances = pca.explained_variance_ratio_
print(variances)
print(np.cumsum(variances))  # cumulative share of variance explained
pd.DataFrame({'pca_1': Z[0], 'fresh': data["Fresh"]}).plot(x='pca_1', y='fresh', kind='scatter')
pd.DataFrame({'pca_2': Z[1], 'grocery': data["Grocery"]}).plot(x='pca_2', y='grocery', kind='scatter')
Out[3]:
Answer: Most of the variance is explained by the first 2 dimensions. For the third dimension, the explained variance drops by almost 98%, to 0.04. I will use 3 dimensions, because I would like the retained dimensions to capture at least 90% of the variance in the data.
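The "at least 90% of the variance" rule can be written down as a small helper. This is a sketch under stated assumptions: `components_for_threshold` is a hypothetical name, and the ratios below are made-up numbers for illustration, not the output of the cell above. In the notebook, `pca.explained_variance_ratio_` would be passed in instead.

```python
import numpy as np

def components_for_threshold(variance_ratios, threshold=0.90):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(variance_ratios)
    if cumulative[-1] < threshold:
        raise ValueError("threshold is not reachable with the given components")
    # index of the first cumulative value >= threshold, converted to a count
    return int(np.searchsorted(cumulative, threshold) + 1)

# Made-up ratios for illustration only.
example_ratios = np.array([0.50, 0.30, 0.15, 0.05])
print(np.cumsum(example_ratios))
print(components_for_threshold(example_ratios, 0.90))
```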
3) What do the dimensions seem to represent? How can you use this information?
Answer: The dimensions seem to represent the product groups in decreasing order of importance, i.e. starting with the product that is sold the most.
The first dimension output by the PCA algorithm is the loading vector of the first principal component; it shows that the variance is larger along the product category "Fresh" than along any other. The next one is the loading vector of the second principal component, which shows that the second-largest variance is along the feature "Grocery", with significant contributions from Milk and Detergents.
There also seems to be some correlation between Frozen foods and Delicatessen, i.e. a customer buying Frozen food is likely to buy Delicatessen as well. Milk, Groceries and Detergent_Paper form another group of correlated product categories. The product category Fresh does not seem to be correlated with any other product group. We can use this information to introduce discounted sales on the correlated product categories to maximize sales.
Looking at the biplot and the scatter plots above, we can see that the first principal component has a high negative correlation with Fresh foods (~-0.97), i.e. it captures the high variance in Fresh foods, and the volume of Fresh foods purchased decreases as the first principal component increases. It is therefore a measure of how Fresh foods are less favored among high volume customers.
Similarly, the second principal component shows a moderate positive correlation (~0.5) with the amount of Groceries bought from the distributor. Grocery is also a favourite item among the high volume buyers, unlike Fresh foods.
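The correlations quoted above compare a principal-component score with an original column. A minimal sketch of that computation follows, using synthetic columns as stand-ins for "Fresh" and "Grocery" and a NumPy SVD in place of scikit-learn. The function name `score_feature_correlations` is hypothetical. Note that the sign of a principal component is arbitrary, so only the magnitude of a correlation is comparable across runs.

```python
import numpy as np

def score_feature_correlations(X, n_components=2):
    """Correlation of each principal-component score with each original column."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T            # project onto the leading directions
    corr = np.empty((n_components, X.shape[1]))
    for i in range(n_components):
        for j in range(X.shape[1]):
            corr[i, j] = np.corrcoef(scores[:, i], X[:, j])[0, 1]
    return corr

rng = np.random.default_rng(1)
# Synthetic, independent columns; means and spreads are made up.
fresh = rng.normal(12000.0, 12000.0, size=440)
grocery = rng.normal(8000.0, 9000.0, size=440)
X = np.column_stack([fresh, grocery])

corr = score_feature_correlations(X)
print(np.round(corr, 2))   # rows: PC1, PC2; columns: fresh, grocery
```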
In [4]:
# TODO: Fit an ICA model to the data
# Note: Adjust the data to have center at the origin first!
from sklearn.decomposition import FastICA
from sklearn import preprocessing
# scaling the data to align the mean to 0 and to have unit variance
scaler = preprocessing.StandardScaler()
data_std = pd.DataFrame(scaler.fit_transform(data))
ica = FastICA(n_components=6, random_state=42)
ica.fit(data_std)
# Print the independent components
print(ica.components_)
4) For each vector in the ICA decomposition, write a sentence or two explaining what sort of object or property it corresponds to. What could these components be used for?
Answer:
The components can be used for separating out the source signals i.e. the type of customers of the wholesale distributor.
The components output from ICA indicate how each customer weighs the different product categories in terms of maybe ... space allocated to them, the kind of marketing done and general demand. One potential use of this idea for the distributor is to pre-allocate appropriate portions of the different product categories and pack them so that they are ready to be shipped on demand relatively quickly.
References:
http://whatwhy.in/featured/what-is-the-difference-between-a-hyper-market-super-market-departmental-store-and-a-general-store/457/
https://en.wikipedia.org/wiki/Appetizing_store
Note: The above conclusions were derived by looking for the feature with the maximum absolute value in each component and then comparing the remaining features relative to the magnitude and sign of that maximum value.
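The reading procedure described in the note can be sketched as a small helper. Everything here is illustrative: `describe_component` is a hypothetical name, the column names are assumed to match the CSV header, and the weights are made up rather than taken from `ica.components_`.

```python
import numpy as np

# Column names are an assumption; adjust them if the CSV header differs.
FEATURES = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicatessen"]

def describe_component(weights, names=FEATURES):
    """Return the dominant feature and every weight relative to it."""
    weights = np.asarray(weights, dtype=float)
    top = int(np.argmax(np.abs(weights)))        # feature with the largest |weight|
    relative = weights / weights[top]            # the dominant feature becomes +1.0
    return names[top], dict(zip(names, np.round(relative, 2)))

# Made-up weights for illustration; each row of ica.components_ would be passed in.
example = [0.02, -0.10, 0.50, 0.01, -0.25, 0.03]
name, relative = describe_component(example)
print(name)
print(relative)
```

A negative relative weight means the feature moves in the opposite direction to the dominant one within that component.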
Answer:
K-Means clustering or GMM lets us segregate data points into subgroups within which the data points are mostly similar. In a sense, the advantage is "finding" these subgroups within the data, which helps in exploratory analysis. These subgroups can then be used to study the effect of other variables on them, helping us make effective decisions using the data.
K-means clustering uses a notion of distance between the data points in order to segregate them. GMM assumes that the observed data is a mixture of probability distributions where each distribution is Gaussian (Normal). It then tries to identify subgroups of the data based on the probability that a particular data point belongs to each distribution.
Comparing the KMeans and GMM algorithms: KMeans is scalable, favors even cluster sizes, and works well with flat geometry and a medium number of clusters. GMM, on the other hand, is not scalable; it also works with flat geometry and is good for density estimation purposes. GMM is generally fast at learning mixture models but fails when the number of samples is really high. KMeans is generally good with large-scale data and so is going to be the model of choice here.
Looking at the results above, it seems that there are distinct groups of customers, so we can go with the hard cluster assignment of K-means. GMM does a soft assignment, where a particular data point belongs to different clusters with varying degrees of probability. I feel that k-means will work well in this case.
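The difference between hard and soft assignment can be illustrated with two fixed centers. This is a simplified sketch, not either library's implementation: it assumes equal mixture weights and one shared spherical variance `sigma` for the soft case, and the names `hard_assign` and `soft_assign` are hypothetical.

```python
import numpy as np

def hard_assign(points, centers):
    """K-means style: index of the nearest center for each point."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def soft_assign(points, centers, sigma=1.0):
    """GMM style responsibilities, assuming equal weights and a shared spherical variance."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)  # subtract the row maximum for numerical stability
    resp = np.exp(logits)
    return resp / resp.sum(axis=1, keepdims=True)

centers = np.array([[0.0, 0.0], [4.0, 0.0]])
points = np.array([[0.5, 0.0], [2.0, 0.0], [3.5, 0.0]])
labels = hard_assign(points, centers)
resp = soft_assign(points, centers)
print(labels)
print(np.round(resp, 3))
```

The middle point sits exactly between the two centers: the hard assignment still has to pick one cluster, while the soft assignment splits it evenly.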
6) Below is some starter code to help you visualize some cluster data. The visualization is based on this demo from the sklearn documentation.
In [5]:
# Import clustering modules
from sklearn.cluster import KMeans
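# GMM is the class name in older scikit-learn releases; newer releases provide GaussianMixture instead
from sklearn.mixture import GMM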
In [6]:
# TODO: First we reduce the data to two dimensions using PCA to capture variation
from sklearn.decomposition import PCA
pca = PCA(n_components = 2, whiten=True)
reduced_data = pca.fit_transform(data)
print(reduced_data[:10])  # print the first 10 rows
In [7]:
# TODO: Implement your clustering algorithm here, and fit it to the reduced data for visualization
# The visualizer below assumes your clustering object is named 'clusters'
kmeans = KMeans(n_clusters=4)
clusters = kmeans.fit(reduced_data)
print(clusters)
In [8]:
# Plot the decision boundary by building a mesh grid to populate a graph.
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
hx = (x_max-x_min)/1000.
hy = (y_max-y_min)/1000.
xx, yy = np.meshgrid(np.arange(x_min, x_max, hx), np.arange(y_min, y_max, hy))
# Obtain labels for each point in mesh. Use last trained model.
Z = clusters.predict(np.c_[xx.ravel(), yy.ravel()])
In [9]:
# TODO: Find the centroids for KMeans or the cluster means for GMM
centroids = kmeans.cluster_centers_
print(centroids)
pd.DataFrame(pca.inverse_transform(centroids)).plot(kind = 'bar')
Out[9]:
In [10]:
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')
plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
plt.title('Clustering on the wholesale grocery dataset (PCA-reduced data)\n'
          'Centroids are marked with white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
In [11]:
# using elbow chart to determine the #clusters
def fitAndScore(num_clusters):
    # fit k-means and assign every point to its nearest centroid
    kmeans = KMeans(n_clusters=num_clusters)
    labels = kmeans.fit_predict(reduced_data)
    mu = kmeans.cluster_centers_
    # within-cluster sum of squared distances to the assigned centroid
    Wk = sum(np.linalg.norm(d - mu[l])**2 for l, d in zip(labels, reduced_data))
    return Wk
cluster_sizes = range(2,16,1)
# sum of squared errors
sse = [fitAndScore(x) for x in cluster_sizes]
df = pd.DataFrame({'#clusters': cluster_sizes, 'SSE': sse})
df.head()
df.plot(x='#clusters', y='SSE', title='Elbow chart')
# From the chart below, the optimal number of clusters is 4
Out[11]:
7) What are the central objects in each cluster? Describe them as customers.
Answer:
The central object in the bottom-left cluster is an average convenience store, the one in the bottom-right cluster is a supermarket, and the one at the top is a grocery store. These customer types were identified by how much of the important product groups they buy relative to the average quantities.
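Comparing a centroid with the average quantities requires mapping it from the whitened 2-D space back to the original spending units, which is the role of the `pca.inverse_transform(centroids)` call earlier in the notebook. The sketch below mirrors my understanding of that step in plain NumPy on synthetic data. The function names and the centroid values are hypothetical.

```python
import numpy as np

def fit_whitened_pca(X, n_components):
    """Plain-NumPy PCA with whitening: returns the mean, directions and per-direction std."""
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    std = s[:n_components] / np.sqrt(X.shape[0] - 1)
    return mean, Vt[:n_components], std

def to_original_units(Z, mean, directions, std):
    """Undo whitening and projection: approximate rows of X from reduced rows Z."""
    return (Z * std) @ directions + mean

rng = np.random.default_rng(2)
# Synthetic spending table; means and spreads are made up.
X = rng.normal(loc=[12000.0, 8000.0, 3000.0],
               scale=[9000.0, 6000.0, 1000.0],
               size=(440, 3))
mean, directions, std = fit_whitened_pca(X, n_components=2)

# Hypothetical centroids in the whitened 2-D space, e.g. from a clustering step.
centroids = np.array([[0.0, 0.0], [1.5, -0.5]])
recovered = to_original_units(centroids, mean, directions, std)
relative_to_average = recovered / mean
print(np.round(recovered, 1))
print(np.round(relative_to_average, 2))
```

A centroid at the origin of the reduced space maps back to the average customer, so ratios above or below 1 in `relative_to_average` show which categories a cluster buys more or less of than average.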
8) Which of these techniques did you feel gave you the most insight into the data?
Answer: Principal Component Analysis. Visualizing the first two principal components told the story about which product category is most important for differentiating among customers, i.e. the amount of Fresh food that they typically buy. The next differentiating factor is the amount of Grocery bought. The correlation coefficients also told the story about what high volume customers like to buy compared with low volume customers. The other techniques help to reinforce the ideas derived from PCA.
9) How would you use that technique to help the company design new experiments?
Answer: Since the data helps us to identify the best-selling product groups and the not-so-well-selling ones, experiments can be designed around introducing discounted sales on product groups that go well together, stacking the items in the store, promoting the less favored product groups and so forth.
One A/B test could be the following: once we identify the important product group, randomly separate the customers into two groups, i.e. control and test sets, and introduce a sale offer on the most popular product to the test group only. This will help us to ascertain the importance of the product, which can inform a decision on whether efficient marketing of an already popular product improves sales or not. The experiment can then be repeated with slow-moving goods to evaluate what kind of marketing helps the sales of those products.
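One way to evaluate such an experiment is a permutation test on the difference in mean spending between the group that received the offer and the group that did not. The sketch below uses simulated spending with an invented uplift. The function name `permutation_p_value` and all numbers are assumptions for illustration, not results from the dataset.

```python
import numpy as np

def permutation_p_value(control, test, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in mean spending."""
    rng = np.random.default_rng(seed)
    control = np.asarray(control, dtype=float)
    test = np.asarray(test, dtype=float)
    observed = test.mean() - control.mean()
    pooled = np.concatenate([control, test])
    n_control = len(control)
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)       # random relabeling of the customers
        diff = shuffled[n_control:].mean() - shuffled[:n_control].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # add-one correction keeps the estimate strictly above zero
    return (count + 1) / (n_permutations + 1)

rng = np.random.default_rng(3)
# Simulated annual spending; the uplift for the offer group is invented.
control = rng.normal(10000.0, 2000.0, size=200)
test = rng.normal(11000.0, 2000.0, size=200)
p = permutation_p_value(control, test)
print("difference in means:", round(test.mean() - control.mean(), 1))
print("permutation p-value:", p)
```

A small p-value suggests the observed difference is unlikely to come from the random split alone. Random assignment to the two groups is what makes that reading valid.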
10) How would you use that data to help you predict future customer needs?
Answer: Now that we have looked at 440 distinct customers in distinct groups, I guess this information can be used to train a supervised learning classifier to identify the customer group any future customer may fall into. The management of the distributor can focus on setting up tailored product offerings that appeal to each customer group. Using focused and tailor-made product offers will result in increased turnaround and improved customer loyalty.
Also, the purchase history of the customers in each group can be used in the distributor's favor, as they can make informed decisions about the demands of a new customer. All these advantages result from classifying the customer into a group.