Homework 4:

  1. Follow the steps below to:
    • Read wine.csv in the data folder.
    • The first column contains the wine category. Don't use it in the models below; we will treat this as an unsupervised learning problem and compare the results to the Wine column.
  2. Try KMeans with n_clusters = 3 and compare the clusters to the Wine column.
  3. Try PCA and see how much you can reduce the variable space.
    • How many components do you need to explain 99% of the variance in this dataset?
    • Plot the PCA components to see if they bring out the clusters.
  4. Try KMeans and hierarchical clustering on the PCA-transformed data and again compare the clusters to the Wine column.

Dataset

wine.csv is in the data folder under homeworks.


In [209]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
%matplotlib inline

In [210]:
wine_all=pd.read_csv('../data/wine.csv')

In [211]:
wine_all.head()


Out[211]:
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735

In [212]:
wine_all.describe()


Out[212]:
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
count 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000
mean 1.938202 13.000618 2.336348 2.366517 19.494944 99.741573 2.295112 2.029270 0.361854 1.590899 5.058090 0.957449 2.611685 746.893258
std 0.775035 0.811827 1.117146 0.274344 3.339564 14.282484 0.625851 0.998859 0.124453 0.572359 2.318286 0.228572 0.709990 314.907474
min 1.000000 11.030000 0.740000 1.360000 10.600000 70.000000 0.980000 0.340000 0.130000 0.410000 1.280000 0.480000 1.270000 278.000000
25% 1.000000 12.362500 1.602500 2.210000 17.200000 88.000000 1.742500 1.205000 0.270000 1.250000 3.220000 0.782500 1.937500 500.500000
50% 2.000000 13.050000 1.865000 2.360000 19.500000 98.000000 2.355000 2.135000 0.340000 1.555000 4.690000 0.965000 2.780000 673.500000
75% 3.000000 13.677500 3.082500 2.557500 21.500000 107.000000 2.800000 2.875000 0.437500 1.950000 6.200000 1.120000 3.170000 985.000000
max 3.000000 14.830000 5.800000 3.230000 30.000000 162.000000 3.880000 5.080000 0.660000 3.580000 13.000000 1.710000 4.000000 1680.000000

In [213]:
wine_all.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 178 entries, 0 to 177
Data columns (total 14 columns):
Wine                    178 non-null int64
Alcohol                 178 non-null float64
Malic.acid              178 non-null float64
Ash                     178 non-null float64
Acl                     178 non-null float64
Mg                      178 non-null int64
Phenols                 178 non-null float64
Flavanoids              178 non-null float64
Nonflavanoid.phenols    178 non-null float64
Proanth                 178 non-null float64
Color.int               178 non-null float64
Hue                     178 non-null float64
OD                      178 non-null float64
Proline                 178 non-null int64
dtypes: float64(11), int64(3)
memory usage: 20.9 KB
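
All 14 columns have 178 non-null entries, so no missing-value handling is needed.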

In [214]:
# drop the Wine label column; .ix is deprecated, use positional indexing with .iloc
X = wine_all.iloc[:, 1:]

In [215]:
X=X.values
X


Out[215]:
array([[  1.42300000e+01,   1.71000000e+00,   2.43000000e+00, ...,
          1.04000000e+00,   3.92000000e+00,   1.06500000e+03],
       [  1.32000000e+01,   1.78000000e+00,   2.14000000e+00, ...,
          1.05000000e+00,   3.40000000e+00,   1.05000000e+03],
       [  1.31600000e+01,   2.36000000e+00,   2.67000000e+00, ...,
          1.03000000e+00,   3.17000000e+00,   1.18500000e+03],
       ..., 
       [  1.32700000e+01,   4.28000000e+00,   2.26000000e+00, ...,
          5.90000000e-01,   1.56000000e+00,   8.35000000e+02],
       [  1.31700000e+01,   2.59000000e+00,   2.37000000e+00, ...,
          6.00000000e-01,   1.62000000e+00,   8.40000000e+02],
       [  1.41300000e+01,   4.10000000e+00,   2.74000000e+00, ...,
          6.10000000e-01,   1.60000000e+00,   5.60000000e+02]])

In [216]:
from sklearn.preprocessing import StandardScaler

# the features span very different scales (Proline in the hundreds, Hue near 1),
# so standardize before running distance-based K-means
scale = StandardScaler()

In [217]:
X = scale.fit_transform(X)
X


Out[217]:
array([[ 1.51861254, -0.5622498 ,  0.23205254, ...,  0.36217728,
         1.84791957,  1.01300893],
       [ 0.24628963, -0.49941338, -0.82799632, ...,  0.40605066,
         1.1134493 ,  0.96524152],
       [ 0.19687903,  0.02123125,  1.10933436, ...,  0.31830389,
         0.78858745,  1.39514818],
       ..., 
       [ 0.33275817,  1.74474449, -0.38935541, ..., -1.61212515,
        -1.48544548,  0.28057537],
       [ 0.20923168,  0.22769377,  0.01273209, ..., -1.56825176,
        -1.40069891,  0.29649784],
       [ 1.39508604,  1.58316512,  1.36520822, ..., -1.52437837,
        -1.42894777, -0.59516041]])

In [218]:
kmeans = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=1)

In [219]:
Y_hat_kmeans = kmeans.fit(X).labels_

In [220]:
Y_hat_kmeans


Out[220]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)

In [221]:
plt.scatter(X[:,0], X[:,1], c=Y_hat_kmeans);

[Figure: scatter of the first two standardized features (Alcohol vs. Malic.acid), colored by KMeans cluster]

In [222]:
mu = kmeans.cluster_centers_
mu


Out[222]:
array([[-0.92607185, -0.39404154, -0.49451676,  0.17060184, -0.49171185,
        -0.07598265,  0.02081257, -0.03353357,  0.0582655 , -0.90191402,
         0.46180361,  0.27076419, -0.75384618],
       [ 0.83523208, -0.30380968,  0.36470604, -0.61019129,  0.5775868 ,
         0.88523736,  0.97781956, -0.56208965,  0.58028658,  0.17106348,
         0.47398365,  0.77924711,  1.12518529],
       [ 0.16490746,  0.87154706,  0.18689833,  0.52436746, -0.07547277,
        -0.97933029, -1.21524764,  0.72606354, -0.77970639,  0.94153874,
        -1.16478865, -1.29241163, -0.40708796]])
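
The centers above are in standardized units. As a quick sketch (reusing the scale object fitted earlier), StandardScaler.inverse_transform maps them back to the original feature scales for easier interpretation:

In [ ]:
# map the standardized cluster centers back to the original feature units
centers_orig = scale.inverse_transform(mu)
pd.DataFrame(centers_orig, columns=wine_all.columns[1:14])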

In [223]:
plt.scatter(X[:,0], X[:,1], c=Y_hat_kmeans, alpha=0.4)
plt.scatter(mu[:,0], mu[:,1], s=100, c=np.unique(Y_hat_kmeans));

[Figure: the same scatter with the three cluster centers overlaid as larger points]

In [224]:
# attach the cluster labels so they can be compared with the Wine column
wine_all['cluster'] = Y_hat_kmeans

In [225]:
wine_all.groupby(['cluster','Wine']).count()


Out[225]:
              Alcohol  Malic.acid  Ash  Acl  Mg  Phenols  Flavanoids  Nonflavanoid.phenols  Proanth  Color.int  Hue  OD  Proline
cluster Wine
0       2          65          65   65   65  65       65          65                    65       65         65   65  65       65
1       1          59          59   59   59  59       59          59                    59       59         59   59  59       59
        2           3           3    3    3   3        3           3                     3        3          3    3   3        3
2       2           3           3    3    3   3        3           3                     3        3          3    3   3        3
        3          48          48   48   48  48       48          48                    48       48         48   48  48       48
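
The raw cluster numbers are arbitrary, so the agreement is easier to read as a contingency table, and adjusted_rand_score summarizes it in a single permutation-invariant number. A minimal sketch using the objects defined above:

In [ ]:
# cross-tabulate KMeans clusters against the true Wine labels
from sklearn.metrics import adjusted_rand_score

print(pd.crosstab(wine_all['cluster'], wine_all['Wine']))

# adjusted Rand index: 1.0 is perfect agreement, ~0 is random labeling
print(adjusted_rand_score(wine_all['Wine'], Y_hat_kmeans))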

In [226]:
#PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=10)

In [227]:
X_pca = pca.fit_transform(X)

In [228]:
sum(pca.explained_variance_ratio_)


Out[228]:
0.96169716844506403
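
Ten components explain only about 96.2% of the variance, so answering the 99% question takes more. A sketch (new names pca_full, cumvar, n_99; recent scikit-learn versions also accept a float n_components as a variance target):

In [ ]:
# fit a full PCA and count how many components reach 99% cumulative variance
pca_full = PCA().fit(X)
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
n_99 = int(np.argmax(cumvar >= 0.99)) + 1
print(n_99)

# equivalent shortcut: let sklearn pick the component count for 99% variance
print(PCA(n_components=0.99).fit(X).n_components_)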

In [229]:
pca.components_


Out[229]:
array([[-0.1443294 ,  0.24518758,  0.00205106,  0.23932041, -0.14199204,
        -0.39466085, -0.4229343 ,  0.2985331 , -0.31342949,  0.0886167 ,
        -0.29671456, -0.37616741, -0.28675223],
       [ 0.48365155,  0.22493093,  0.31606881, -0.0105905 ,  0.299634  ,
         0.06503951, -0.00335981,  0.02877949,  0.03930172,  0.52999567,
        -0.27923515, -0.16449619,  0.36490283],
       [-0.20738262,  0.08901289,  0.6262239 ,  0.61208035,  0.13075693,
         0.14617896,  0.1506819 ,  0.17036816,  0.14945431, -0.13730621,
         0.08522192,  0.16600459, -0.12674592],
       [ 0.0178563 , -0.53689028,  0.21417556, -0.06085941,  0.35179658,
        -0.19806835, -0.15229479,  0.20330102, -0.39905653, -0.06592568,
         0.42777141, -0.18412074,  0.23207086],
       [-0.26566365,  0.03521363, -0.14302547,  0.06610294,  0.72704851,
        -0.14931841, -0.10902584, -0.50070298,  0.13685982, -0.07643678,
        -0.17361452, -0.10116099, -0.1578688 ],
       [ 0.21353865,  0.53681385,  0.15447466, -0.10082451,  0.03814394,
        -0.0841223 , -0.01892002, -0.25859401, -0.53379539, -0.41864414,
         0.10598274,  0.26585107,  0.11972557],
       [-0.05639636,  0.42052391, -0.14917061, -0.28696914,  0.3228833 ,
        -0.02792498, -0.06068521,  0.59544729,  0.37213935, -0.22771214,
         0.23207564, -0.0447637 ,  0.0768045 ],
       [ 0.39613926,  0.06582674, -0.17026002,  0.42797018, -0.15636143,
        -0.40593409, -0.18724536, -0.23328465,  0.36822675, -0.03379692,
         0.43662362, -0.07810789,  0.12002267],
       [-0.50861912,  0.07528304,  0.30769445, -0.20044931, -0.27140257,
        -0.28603452, -0.04957849, -0.19550132,  0.20914487, -0.05621752,
        -0.08582839, -0.1372269 ,  0.57578611],
       [ 0.21160473, -0.30907994, -0.02712539,  0.05279942,  0.06787022,
        -0.32013135, -0.16315051,  0.21553507,  0.1341839 , -0.29077518,
        -0.52239889,  0.52370587,  0.162116  ]])
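
The loadings are easier to scan as a DataFrame labeled with the feature names; a sketch (new name loadings):

In [ ]:
# label the component loadings with the original feature names
loadings = pd.DataFrame(pca.components_, columns=wine_all.columns[1:14])
loadings.head()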

In [230]:
pca.mean_


Out[230]:
array([  7.84141790e-15,   2.44498554e-16,  -4.05917497e-15,
        -7.11041712e-17,  -2.49488320e-17,  -1.95536471e-16,
         9.44313292e-16,  -4.17892936e-16,  -1.54059038e-15,
        -4.12903170e-16,   1.39838203e-15,   2.12688793e-15,
        -6.98567296e-17])
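
The fitted means are numerically zero, as expected: X was already centered by the StandardScaler.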

In [234]:
_ = plt.scatter(X_pca[:,0], X_pca[:,1], c=Y_hat_kmeans)

[Figure: scatter of the first two principal components, colored by the KMeans clusters fitted earlier]

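Task 4 asks for KMeans on the PCA-transformed data; the scatter above still uses the labels fitted on the original features. A sketch refitting on X_pca (new names kmeans_pca, cluster_pca):

In [ ]:
# refit KMeans on the PCA scores and compare against the Wine column
kmeans_pca = KMeans(n_clusters=3, n_init=10, random_state=1)
wine_all['cluster_pca'] = kmeans_pca.fit_predict(X_pca)
print(pd.crosstab(wine_all['cluster_pca'], wine_all['Wine']))
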
In [235]:
# compute distance matrix
from scipy.spatial.distance import pdist, squareform

# not printed as pretty, but the values are correct
distx = squareform(pdist(X_pca, metric='euclidean'))
distx


Out[235]:
array([[ 0.        ,  3.47982308,  2.97213486, ...,  6.41551514,
         6.02865089,  7.13919207],
       [ 3.47982308,  0.        ,  4.12218096, ...,  6.35531216,
         6.07859806,  7.33725989],
       [ 2.97213486,  4.12218096,  0.        , ...,  6.16277688,
         5.81227387,  6.34139433],
       ..., 
       [ 6.41551514,  6.35531216,  6.16277688, ...,  0.        ,
         1.77543674,  3.16438239],
       [ 6.02865089,  6.07859806,  5.81227387, ...,  1.77543674,
         0.        ,  3.22481763],
       [ 7.13919207,  7.33725989,  6.34139433, ...,  3.16438239,
         3.22481763,  0.        ]])

In [236]:
# perform clustering and plot the dendrogram
from scipy.cluster.hierarchy import linkage, dendrogram

# linkage expects a condensed distance vector (or raw observations); passing the
# squareform matrix would make scipy treat its rows as 178-dimensional points
R = dendrogram(linkage(pdist(X_pca, metric='euclidean'), method='ward'),
               color_threshold=100)  # threshold may need retuning for the corrected scale

plt.xlabel('points')
plt.ylabel('Height')
plt.suptitle('Cluster Dendrogram', fontweight='bold', fontsize=14);

[Figure: "Cluster Dendrogram", Ward dendrogram over the PCA scores (x: points, y: Height)]

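Task 4 also asks to compare the hierarchical clusters to the Wine column. fcluster cuts the tree into a flat 3-cluster labeling that can be cross-tabulated like the KMeans results; a sketch reusing the linkage call above (new names Z, hier_labels):

In [ ]:
# cut the hierarchy into 3 flat clusters and compare with the Wine labels
from scipy.cluster.hierarchy import fcluster

Z = linkage(pdist(X_pca, metric='euclidean'), method='ward')
hier_labels = fcluster(Z, t=3, criterion='maxclust')
print(pd.crosstab(hier_labels, wine_all['Wine']))
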
In [ ]: