Homework 4:

  1. Follow the steps below to:
    • Read wine.csv in the data folder.
    • The first column contains the wine category. Don't use it in the models below; we treat this as an unsupervised learning problem and compare the results to the Wine column afterwards.
  2. Try KMeans where n_clusters = 3 and compare the clusters to the Wine column.
  3. Try PCA and see how much you can reduce the variable space.
    • How many components are needed to explain 99% of the variance in this dataset?
    • Plot the PCA variables to see if they bring out the clusters.
  4. Try KMeans and hierarchical clustering on the PCA data and again compare the clusters to the Wine column.

Dataset

wine.csv is in the data folder under homeworks


In [362]:
# Standard imports
from __future__ import print_function
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn import preprocessing
from time import time

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

pd.set_option('display.max_rows', 10)
%matplotlib inline

np.random.seed(123)

In [363]:
print('Pandas:', pd.__version__)
print('Numpy:', np.__version__)
print('Matplotlib:', mpl.__version__)


Pandas: 0.14.1
Numpy: 1.9.1
Matplotlib: 1.4.2

In [364]:
raw_data = pd.read_csv('../data/wine.csv')  # comma-separated by default
labels = raw_data.Wine                      # true categories, held out for comparison only
data = raw_data.drop('Wine', axis=1)        # features used for the unsupervised models

In [365]:
sample_size, features = raw_data.shape

Exploratory Analysis


In [366]:
data.describe()


Out[366]:
Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
count 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000
mean 13.000618 2.336348 2.366517 19.494944 99.741573 2.295112 2.029270 0.361854 1.590899 5.058090 0.957449 2.611685 746.893258
std 0.811827 1.117146 0.274344 3.339564 14.282484 0.625851 0.998859 0.124453 0.572359 2.318286 0.228572 0.709990 314.907474
min 11.030000 0.740000 1.360000 10.600000 70.000000 0.980000 0.340000 0.130000 0.410000 1.280000 0.480000 1.270000 278.000000
25% 12.362500 1.602500 2.210000 17.200000 88.000000 1.742500 1.205000 0.270000 1.250000 3.220000 0.782500 1.937500 500.500000
50% 13.050000 1.865000 2.360000 19.500000 98.000000 2.355000 2.135000 0.340000 1.555000 4.690000 0.965000 2.780000 673.500000
75% 13.677500 3.082500 2.557500 21.500000 107.000000 2.800000 2.875000 0.437500 1.950000 6.200000 1.120000 3.170000 985.000000
max 14.830000 5.800000 3.230000 30.000000 162.000000 3.880000 5.080000 0.660000 3.580000 13.000000 1.710000 4.000000 1680.000000

In [367]:
# standardize each feature to zero mean and unit variance
scaled_data = preprocessing.scale(data)
scaled_data = pd.DataFrame(scaled_data)

In [368]:
scaled_data.columns = data.columns

In [369]:
raw_data.groupby('Wine').describe()


Out[369]:
Acl Alcohol Ash Color.int Flavanoids Hue Malic.acid Mg Nonflavanoid.phenols OD Phenols Proanth Proline
Wine
1 count 59.000000 59.000000 59.000000 59.000000 59.000000 59.000000 59.000000 59.000000 59.000000 59.000000 59.000000 59.000000 59.000000
mean 17.037288 13.744746 2.455593 5.528305 2.982373 1.062034 2.010678 106.338983 0.290000 3.157797 2.840169 1.899322 1115.711864
std 2.546322 0.462125 0.227166 1.238573 0.397494 0.116483 0.688549 10.498949 0.070049 0.357077 0.338961 0.412109 221.520767
min 11.200000 12.850000 2.040000 3.520000 2.190000 0.820000 1.350000 89.000000 0.170000 2.510000 2.200000 1.250000 680.000000
25% 16.000000 13.400000 2.295000 4.550000 2.680000 0.995000 1.665000 98.000000 0.255000 2.870000 2.600000 1.640000 987.500000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3 min 17.500000 12.200000 2.100000 3.850000 0.340000 0.480000 1.240000 80.000000 0.170000 1.270000 0.980000 0.550000 415.000000
25% 20.000000 12.805000 2.300000 5.437500 0.580000 0.587500 2.587500 89.750000 0.397500 1.510000 1.407500 0.855000 545.000000
50% 21.000000 13.165000 2.380000 7.550000 0.685000 0.665000 3.265000 97.000000 0.470000 1.660000 1.635000 1.105000 627.500000
75% 23.000000 13.505000 2.602500 9.225000 0.920000 0.752500 3.957500 106.000000 0.530000 1.820000 1.807500 1.350000 695.000000
max 27.000000 14.340000 2.860000 13.000000 1.570000 0.960000 5.650000 123.000000 0.630000 2.470000 2.800000 2.700000 880.000000

24 rows × 13 columns


In [370]:
raw_data['Wine'].hist(bins=3)


Out[370]:
<matplotlib.axes._subplots.AxesSubplot at 0x2e5839e8>

Distance Matrix & Dendrogram with scaled data


In [371]:
# compute distance matrix
from scipy.spatial.distance import pdist, squareform

scaled_data_array = scaled_data.values

# the printed matrix is not pretty, but the values are correct
distx = squareform(pdist(scaled_data, metric='euclidean'))
print(distx)

from scipy.cluster.hierarchy import linkage, dendrogram

R = dendrogram(linkage(scaled_data, method='single'), color_threshold=10)

plt.xlabel('points')
plt.ylabel('Height')
plt.suptitle('Cluster Dendrogram', fontweight='bold', fontsize=14);


[[ 0.          3.49753522  3.02660794 ...,  6.4909413   6.07878091
   7.18442107]
 [ 3.49753522  0.          4.1429119  ...,  6.39689969  6.09492714
   7.36771922]
 [ 3.02660794  4.1429119   0.         ...,  6.25367723  5.85179331
   6.35388503]
 ..., 
 [ 6.4909413   6.39689969  6.25367723 ...,  0.          1.82621785
   3.39251526]
 [ 6.07878091  6.09492714  5.85179331 ...,  1.82621785  0.          3.32427633]
 [ 7.18442107  7.36771922  6.35388503 ...,  3.39251526  3.32427633  0.        ]]
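
As a quick sketch of how we could turn the dendrogram into flat clusters, we can cut the linkage tree at K=3 and cross-tabulate against the Wine column. This assumes scipy's fcluster; Z and hc_labels are names introduced here for illustration, and since single linkage tends to chain, Ward linkage is worth trying as well.

In [ ]:
# Hedged sketch: cut the linkage tree into 3 flat clusters and compare
# them to the true Wine labels. 'hc_labels' is introduced here for
# illustration; single linkage tends to chain, so 'ward' is an
# alternative worth trying.
from scipy.cluster.hierarchy import fcluster

Z = linkage(scaled_data, method='single')
hc_labels = fcluster(Z, t=3, criterion='maxclust')
print(pd.crosstab(labels, hc_labels, rownames=['Wine'], colnames=['cluster']))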

Determine whether K=3 is a reasonable number of clusters


In [372]:
##### cluster data into K=1..10 clusters #####
#K, KM, centroids,D_k,cIdx,dist,avgWithinSS = kmeans.run_kmeans(X,10)
from scipy.cluster.vq import kmeans,vq
from scipy.spatial.distance import cdist

K = range(1, 11)

# scipy.cluster.vq.kmeans
KM = [kmeans(scaled_data_array, k) for k in K]  # run kmeans for K = 1..10
centroids = [cent for (cent, var) in KM]        # cluster centroids

D_k = [cdist(scaled_data_array, cent, 'euclidean') for cent in centroids]

cIdx = [np.argmin(D, axis=1) for D in D_k]
dist = [np.min(D, axis=1) for D in D_k]
avgWithinSS = [sum(d) / scaled_data.shape[0] for d in dist]

kIdx = 2  # index of K=3 in the K list above (the candidate elbow)
# plot elbow curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(K, avgWithinSS, 'b*-')
ax.plot(K[kIdx], avgWithinSS[kIdx], marker='o', markersize=12, 
      markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
tt = plt.title('Elbow for K-Means clustering')
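
The elbow alone can be ambiguous, so here is a hedged second opinion: mean silhouette scores for a range of K (assuming sklearn.metrics.silhouette_score is available in this sklearn version; higher is better).

In [ ]:
# Sketch: mean silhouette score for each K as a cross-check on the elbow.
from sklearn.metrics import silhouette_score

for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=1).fit(scaled_data_array)
    print(k, silhouette_score(scaled_data_array, km.labels_))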


Awesome plotting function!


In [373]:
def plot_clusters(orig, pred, nx, ny, kmeans=None, legend=True):
    # KMeans numbers its clusters arbitrarily; the mapping below
    # (0 -> Wine 1, 2 -> Wine 2, 1 -> Wine 3) was chosen to match this run
    p0 = plt.plot(orig[pred == 0, nx], orig[pred == 0, ny], 'ro', label='Wine 1')
    p2 = plt.plot(orig[pred == 2, nx], orig[pred == 2, ny], 'go', label='Wine 2')
    p1 = plt.plot(orig[pred == 1, nx], orig[pred == 1, ny], 'bo', label='Wine 3')

    tt = plt.title('Wine Data set, KMeans clustering with K=3')

    if kmeans is not None:
        # mark the fitted cluster centroids
        centroids = kmeans.cluster_centers_
        plt.scatter(centroids[:, 0], centroids[:, 1],
                    marker='^', s=169, linewidths=3,
                    color='orange', zorder=10)

    if legend:
        ll = plt.legend()
    return (p0, p1, p2)

Predict with KMeans (Scaled Data)


In [374]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=1)
Y_hat_kmeans = kmeans.fit(scaled_data).labels_

(pl0, pl1, pl2) = plot_clusters(scaled_data.values, Y_hat_kmeans, 0, 5, kmeans)
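
The assignment asks us to compare the clusters to the Wine column. A minimal sketch: a cross-tabulation plus the adjusted Rand index (assuming metrics.adjusted_rand_score is available in this sklearn version).

In [ ]:
# Sketch: compare the KMeans clusters to the true Wine categories.
# Cluster numbers are arbitrary, so read the crosstab row by row.
print(pd.crosstab(labels, Y_hat_kmeans, rownames=['Wine'], colnames=['cluster']))
print('Adjusted Rand index:', metrics.adjusted_rand_score(labels, Y_hat_kmeans))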


With the original data and class markers


In [375]:
groups = raw_data.groupby('Wine')

# Plot
fig, ax = plt.subplots()
ax.margins(0.05)  # optional, just adds 5% padding to the autoscaling
for name, group in groups:
    ax.plot(group.Alcohol, group['Malic.acid'], marker='o', linestyle='', ms=6, label=name)

plt.ylim(0, 7)
plt.xlim(10, 16)
ax.legend()

plt.show()


Predict with KMeans (Raw Data)


In [376]:
kmeans = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=1)
Y_hat_kmeans = kmeans.fit(data).labels_
print(data.columns)
(pl0, pl1, pl2) = plot_clusters(data.values, Y_hat_kmeans, 0, 1, kmeans)


Index([u'Alcohol', u'Malic.acid', u'Ash', u'Acl', u'Mg', u'Phenols', u'Flavanoids', u'Nonflavanoid.phenols', u'Proanth', u'Color.int', u'Hue', u'OD', u'Proline'], dtype='object')
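
On the raw data, Proline (which ranges into the hundreds) dominates the Euclidean distances, so we should expect weaker agreement with the Wine labels than on the scaled data. A quick hedged check:

In [ ]:
# Sketch: agreement between the raw-data clustering and the Wine labels.
# Expect a different (likely worse) score than on the scaled data, since
# Proline's large scale dominates the distances here.
print('Adjusted Rand index (raw data):',
      metrics.adjusted_rand_score(labels, Y_hat_kmeans))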

Perform PCA on the scaled data (scaling matters here: unscaled, a large-variance feature like Proline would dominate the components)


In [377]:
# Check out the covariance matrix
# print(np.cov(data, rowvar=False))  # too large, so not printing it

from sklearn.decomposition import PCA
pca = PCA()
X_pca = pca.fit_transform(scaled_data)
# pca.components_
pca.mean_  # should be ~0 everywhere, since the data were standardized


Out[377]:
array([ -8.61982146e-16,  -8.35785872e-17,  -8.65724471e-16,
        -1.16012069e-16,  -1.99590656e-17,  -2.97202961e-16,
        -4.01676195e-16,   4.07913403e-16,  -1.69963918e-16,
        -1.24744160e-18,   3.71737597e-16,   2.91901335e-16,
        -7.48464960e-18])

Explained Variance Ratio

  • The first 5 components explain about 80% of the variance (the sum below covers pca.explained_variance_ratio_[0:5]).

In [378]:
plt.plot(pca.explained_variance_ratio_);
print('Explained Variance',sum(pca.explained_variance_ratio_[0:5]))


Explained Variance 0.801622927555
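
The homework also asks how many components are needed for 99% of the variance. A minimal sketch using the cumulative sum (cum_var is a name introduced here; the exact count depends on this dataset's spectrum):

In [ ]:
# Sketch: smallest number of components whose cumulative explained
# variance ratio reaches 99%.
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_var)
print('Components for 99% of the variance:', int(np.searchsorted(cum_var, 0.99)) + 1)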

Let's perform KMeans on the PCA results with 6 components


In [379]:
PCA_data = PCA(n_components=6).fit_transform(scaled_data)
kmeans = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=1)
Y_hat_kmeans = kmeans.fit(PCA_data).labels_  # fit on the PCA scores, not the scaled data

(pl0, pl1, pl2) = plot_clusters(PCA_data, Y_hat_kmeans, 0, 1, kmeans)



In [380]:
PCA_data = PCA(n_components=5).fit_transform(scaled_data)
kmeans = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=1)
Y_hat_kmeans = kmeans.fit(PCA_data).labels_

(pl0, pl1, pl2) = plot_clusters(PCA_data, Y_hat_kmeans, 0, 1, kmeans)
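

Finally, step 4 also asks for hierarchical clustering on the PCA data, compared to the Wine column. A hedged sketch using Ward linkage (assuming sklearn.cluster.AgglomerativeClustering is available in this sklearn version; scipy's linkage/fcluster would work equally well, and hier / Y_hat_hier are names introduced here):

In [ ]:
# Sketch: hierarchical (Ward) clustering on the 5-component PCA scores,
# compared against both KMeans and the true Wine labels.
from sklearn.cluster import AgglomerativeClustering

hier = AgglomerativeClustering(n_clusters=3, linkage='ward')
Y_hat_hier = hier.fit_predict(PCA_data)

print(pd.crosstab(labels, Y_hat_hier, rownames=['Wine'], colnames=['cluster']))
print('ARI (hierarchical vs Wine):', metrics.adjusted_rand_score(labels, Y_hat_hier))
print('ARI (KMeans vs Wine):', metrics.adjusted_rand_score(labels, Y_hat_kmeans))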