Homework 4:

  1. Follow the steps below to:
    • Read wine.csv in the data folder.
    • The first column contains the wine category. Don't use it in the models below; we treat this as an unsupervised learning problem and compare the results to the Wine column afterwards.
  2. Try KMeans where n_clusters = 3 and compare the clusters to the Wine column.
  3. Try PCA and see how much you can reduce the variable space.
    • How many components are needed to explain 99% of the variance in this dataset?
    • Plot the PCA variables to see if they bring out the clusters.
  4. Try KMeans and hierarchical clustering on the PCA data and again compare the clusters to the Wine column.

Dataset

wine.csv is in the data folder under homeworks


In [362]:
# Standard imports
from __future__ import print_function
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn import preprocessing
from time import time

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

pd.set_option('display.max_rows', 10)
%matplotlib inline

np.random.seed(123)

In [363]:
print('Pandas:', pd.__version__)
print('Numpy:', np.__version__)
print('Matplotlib:', mpl.__version__)


Pandas: 0.14.1
Numpy: 1.9.1
Matplotlib: 1.4.2

In [364]:
raw_data = pd.read_csv('../data/wine.csv')  # comma-separated by default
labels = raw_data.Wine                      # true categories, held out for comparison only
data = raw_data.drop('Wine', axis=1)        # features used for the unsupervised models

In [365]:
sample_size, features = raw_data.shape

Exploratory Analysis


In [366]:
data.describe()


Out[366]:
Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
count 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000
mean 13.000618 2.336348 2.366517 19.494944 99.741573 2.295112 2.029270 0.361854 1.590899 5.058090 0.957449 2.611685 746.893258
std 0.811827 1.117146 0.274344 3.339564 14.282484 0.625851 0.998859 0.124453 0.572359 2.318286 0.228572 0.709990 314.907474
min 11.030000 0.740000 1.360000 10.600000 70.000000 0.980000 0.340000 0.130000 0.410000 1.280000 0.480000 1.270000 278.000000
25% 12.362500 1.602500 2.210000 17.200000 88.000000 1.742500 1.205000 0.270000 1.250000 3.220000 0.782500 1.937500 500.500000
50% 13.050000 1.865000 2.360000 19.500000 98.000000 2.355000 2.135000 0.340000 1.555000 4.690000 0.965000 2.780000 673.500000
75% 13.677500 3.082500 2.557500 21.500000 107.000000 2.800000 2.875000 0.437500 1.950000 6.200000 1.120000 3.170000 985.000000
max 14.830000 5.800000 3.230000 30.000000 162.000000 3.880000 5.080000 0.660000 3.580000 13.000000 1.710000 4.000000 1680.000000

In [367]:
# standardize each feature to zero mean and unit variance
scaled_data = preprocessing.scale(data)
scaled_data = pd.DataFrame(scaled_data)

In [368]:
scaled_data.columns = data.columns

In [369]:
raw_data.groupby('Wine').describe()


Out[369]:
Acl Alcohol Ash Color.int Flavanoids Hue Malic.acid Mg Nonflavanoid.phenols OD Phenols Proanth Proline
Wine
1 count 59.000000 59.000000 59.000000 59.000000 59.000000 59.000000 59.000000 59.000000 59.000000 59.000000 59.000000 59.000000 59.000000
mean 17.037288 13.744746 2.455593 5.528305 2.982373 1.062034 2.010678 106.338983 0.290000 3.157797 2.840169 1.899322 1115.711864
std 2.546322 0.462125 0.227166 1.238573 0.397494 0.116483 0.688549 10.498949 0.070049 0.357077 0.338961 0.412109 221.520767
min 11.200000 12.850000 2.040000 3.520000 2.190000 0.820000 1.350000 89.000000 0.170000 2.510000 2.200000 1.250000 680.000000
25% 16.000000 13.400000 2.295000 4.550000 2.680000 0.995000 1.665000 98.000000 0.255000 2.870000 2.600000 1.640000 987.500000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3 min 17.500000 12.200000 2.100000 3.850000 0.340000 0.480000 1.240000 80.000000 0.170000 1.270000 0.980000 0.550000 415.000000
25% 20.000000 12.805000 2.300000 5.437500 0.580000 0.587500 2.587500 89.750000 0.397500 1.510000 1.407500 0.855000 545.000000
50% 21.000000 13.165000 2.380000 7.550000 0.685000 0.665000 3.265000 97.000000 0.470000 1.660000 1.635000 1.105000 627.500000
75% 23.000000 13.505000 2.602500 9.225000 0.920000 0.752500 3.957500 106.000000 0.530000 1.820000 1.807500 1.350000 695.000000
max 27.000000 14.340000 2.860000 13.000000 1.570000 0.960000 5.650000 123.000000 0.630000 2.470000 2.800000 2.700000 880.000000

24 rows × 13 columns


In [370]:
raw_data['Wine'].hist(bins=3)


Out[370]:
<matplotlib.axes._subplots.AxesSubplot at 0x2e5839e8>

Distance Matrix & Dendrogram with scaled data


In [371]:
# compute distance matrix
from scipy.spatial.distance import pdist, squareform

scaled_data_array = scaled_data.values

# the printed matrix is not pretty, but the values are correct
distx = squareform(pdist(scaled_data, metric='euclidean'))
print(distx)

from scipy.cluster.hierarchy import linkage, dendrogram

R = dendrogram(linkage(scaled_data, method='single'), color_threshold=10)

plt.xlabel('points')
plt.ylabel('Height')
plt.suptitle('Cluster Dendrogram', fontweight='bold', fontsize=14);


[[ 0.          3.49753522  3.02660794 ...,  6.4909413   6.07878091
   7.18442107]
 [ 3.49753522  0.          4.1429119  ...,  6.39689969  6.09492714
   7.36771922]
 [ 3.02660794  4.1429119   0.         ...,  6.25367723  5.85179331
   6.35388503]
 ..., 
 [ 6.4909413   6.39689969  6.25367723 ...,  0.          1.82621785
   3.39251526]
 [ 6.07878091  6.09492714  5.85179331 ...,  1.82621785  0.          3.32427633]
 [ 7.18442107  7.36771922  6.35388503 ...,  3.39251526  3.32427633  0.        ]]
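
As a quick sketch of how we could turn the dendrogram into flat clusters, we can cut the linkage tree at K=3 and cross-tabulate against the Wine column. This assumes scipy's fcluster; Z and hc_labels are names introduced here for illustration, and since single linkage tends to chain, Ward linkage is worth trying as well.

In [ ]:
# Hedged sketch: cut the linkage tree into 3 flat clusters and compare
# them to the true Wine labels. 'hc_labels' is introduced here for
# illustration; single linkage tends to chain, so 'ward' is an
# alternative worth trying.
from scipy.cluster.hierarchy import fcluster

Z = linkage(scaled_data, method='single')
hc_labels = fcluster(Z, t=3, criterion='maxclust')
print(pd.crosstab(labels, hc_labels, rownames=['Wine'], colnames=['cluster']))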

Determine whether K=3 is a reasonable number of clusters


In [372]:
##### cluster data into K=1..10 clusters #####
#K, KM, centroids,D_k,cIdx,dist,avgWithinSS = kmeans.run_kmeans(X,10)
from scipy.cluster.vq import kmeans,vq
from scipy.spatial.distance import cdist

K = range(1, 11)

# scipy.cluster.vq.kmeans
KM = [kmeans(scaled_data_array, k) for k in K]  # run kmeans for K = 1..10
centroids = [cent for (cent, var) in KM]        # cluster centroids

D_k = [cdist(scaled_data_array, cent, 'euclidean') for cent in centroids]

cIdx = [np.argmin(D, axis=1) for D in D_k]
dist = [np.min(D, axis=1) for D in D_k]
avgWithinSS = [sum(d) / scaled_data.shape[0] for d in dist]

kIdx = 2  # index of K=3 in the K list above (the candidate elbow)
# plot elbow curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(K, avgWithinSS, 'b*-')
ax.plot(K[kIdx], avgWithinSS[kIdx], marker='o', markersize=12, 
      markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
tt = plt.title('Elbow for K-Means clustering')
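
The elbow alone can be ambiguous, so here is a hedged second opinion: mean silhouette scores for a range of K (assuming sklearn.metrics.silhouette_score is available in this sklearn version; higher is better).

In [ ]:
# Sketch: mean silhouette score for each K as a cross-check on the elbow.
from sklearn.metrics import silhouette_score

for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=1).fit(scaled_data_array)
    print(k, silhouette_score(scaled_data_array, km.labels_))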


Awesome plotting function!


In [373]:
def plot_clusters(orig, pred, nx, ny, kmeans=None, legend=True):
    # KMeans numbers its clusters arbitrarily; the mapping below
    # (0 -> Wine 1, 2 -> Wine 2, 1 -> Wine 3) was chosen to match this run
    p0 = plt.plot(orig[pred == 0, nx], orig[pred == 0, ny], 'ro', label='Wine 1')
    p2 = plt.plot(orig[pred == 2, nx], orig[pred == 2, ny], 'go', label='Wine 2')
    p1 = plt.plot(orig[pred == 1, nx], orig[pred == 1, ny], 'bo', label='Wine 3')

    tt = plt.title('Wine Data set, KMeans clustering with K=3')

    if kmeans is not None:
        # mark the fitted cluster centroids
        centroids = kmeans.cluster_centers_
        plt.scatter(centroids[:, 0], centroids[:, 1],
                    marker='^', s=169, linewidths=3,
                    color='orange', zorder=10)

    if legend:
        ll = plt.legend()
    return (p0, p1, p2)

Predict with KMeans (Scaled Data)


In [374]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=1)
Y_hat_kmeans = kmeans.fit(scaled_data).labels_

(pl0, pl1, pl2) = plot_clusters(scaled_data.values, Y_hat_kmeans, 0, 5, kmeans)
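
The assignment asks us to compare the clusters to the Wine column. A minimal sketch: a cross-tabulation plus the adjusted Rand index (assuming metrics.adjusted_rand_score is available in this sklearn version).

In [ ]:
# Sketch: compare the KMeans clusters to the true Wine categories.
# Cluster numbers are arbitrary, so read the crosstab row by row.
print(pd.crosstab(labels, Y_hat_kmeans, rownames=['Wine'], colnames=['cluster']))
print('Adjusted Rand index:', metrics.adjusted_rand_score(labels, Y_hat_kmeans))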


With the original data and class markers


In [375]:
groups = raw_data.groupby('Wine')

# Plot
fig, ax = plt.subplots()
ax.margins(0.05)  # optional, just adds 5% padding to the autoscaling
for name, group in groups:
    ax.plot(group.Alcohol, group['Malic.acid'], marker='o', linestyle='', ms=6, label=name)

plt.ylim(0, 7)
plt.xlim(10, 16)
ax.legend()

plt.show()


Predict with KMeans (Raw Data)


In [376]:
kmeans = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=1)
Y_hat_kmeans = kmeans.fit(data).labels_
print(data.columns)
(pl0, pl1, pl2) = plot_clusters(data.values, Y_hat_kmeans, 0, 1, kmeans)


Index([u'Alcohol', u'Malic.acid', u'Ash', u'Acl', u'Mg', u'Phenols', u'Flavanoids', u'Nonflavanoid.phenols', u'Proanth', u'Color.int', u'Hue', u'OD', u'Proline'], dtype='object')
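
On the raw data, Proline (which ranges into the hundreds) dominates the Euclidean distances, so we should expect weaker agreement with the Wine labels than on the scaled data. A quick hedged check:

In [ ]:
# Sketch: agreement between the raw-data clustering and the Wine labels.
# Expect a different (likely worse) score than on the scaled data, since
# Proline's large scale dominates the distances here.
print('Adjusted Rand index (raw data):',
      metrics.adjusted_rand_score(labels, Y_hat_kmeans))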

Perform PCA on the scaled data (scaling matters here: unscaled, a large-variance feature like Proline would dominate the components)


In [377]:
# Check out the covariance matrix
# print(np.cov(data, rowvar=False))  # too large, so not printing it

from sklearn.decomposition import PCA
pca = PCA()
X_pca = pca.fit_transform(scaled_data)
# pca.components_
pca.mean_  # should be ~0 everywhere, since the data were standardized


Out[377]:
array([ -8.61982146e-16,  -8.35785872e-17,  -8.65724471e-16,
        -1.16012069e-16,  -1.99590656e-17,  -2.97202961e-16,
        -4.01676195e-16,   4.07913403e-16,  -1.69963918e-16,
        -1.24744160e-18,   3.71737597e-16,   2.91901335e-16,
        -7.48464960e-18])

Explained Variance Ratio

  • The first 5 components explain about 80% of the variance (the sum below covers pca.explained_variance_ratio_[0:5]).

In [378]:
plt.plot(pca.explained_variance_ratio_);
print('Explained Variance',sum(pca.explained_variance_ratio_[0:5]))


Explained Variance 0.801622927555
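
The homework also asks how many components are needed for 99% of the variance. A minimal sketch using the cumulative sum (cum_var is a name introduced here; the exact count depends on this dataset's spectrum):

In [ ]:
# Sketch: smallest number of components whose cumulative explained
# variance ratio reaches 99%.
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_var)
print('Components for 99% of the variance:', int(np.searchsorted(cum_var, 0.99)) + 1)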

Let's perform KMeans on the PCA results with 6 components


In [379]:
PCA_data = PCA(n_components=6).fit_transform(scaled_data)
kmeans = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=1)
Y_hat_kmeans = kmeans.fit(PCA_data).labels_  # fit on the PCA scores, not the scaled data

(pl0, pl1, pl2) = plot_clusters(PCA_data, Y_hat_kmeans, 0, 1, kmeans)



In [380]:
PCA_data = PCA(n_components=5).fit_transform(scaled_data)
kmeans = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=1)
Y_hat_kmeans = kmeans.fit(PCA_data).labels_

(pl0, pl1, pl2) = plot_clusters(PCA_data, Y_hat_kmeans, 0, 1, kmeans)
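

Finally, step 4 also asks for hierarchical clustering on the PCA data, compared to the Wine column. A hedged sketch using Ward linkage (assuming sklearn.cluster.AgglomerativeClustering is available in this sklearn version; scipy's linkage/fcluster would work equally well, and hier / Y_hat_hier are names introduced here):

In [ ]:
# Sketch: hierarchical (Ward) clustering on the 5-component PCA scores,
# compared against both KMeans and the true Wine labels.
from sklearn.cluster import AgglomerativeClustering

hier = AgglomerativeClustering(n_clusters=3, linkage='ward')
Y_hat_hier = hier.fit_predict(PCA_data)

print(pd.crosstab(labels, Y_hat_hier, rownames=['Wine'], colnames=['cluster']))
print('ARI (hierarchical vs Wine):', metrics.adjusted_rand_score(labels, Y_hat_hier))
print('ARI (KMeans vs Wine):', metrics.adjusted_rand_score(labels, Y_hat_kmeans))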