Homework 4:

  1. Follow the steps below to:
    • Read wine.csv in the data folder.
    • The first column contains the wine category. Don't use it in the models below; we treat this as an unsupervised learning problem and compare the results to the Wine column.
  2. Try KMeans with n_clusters = 3 and compare the clusters to the Wine column.
  3. Try PCA and see how much you can reduce the variable space.
    • How many components do you need to explain 99% of the variance in this dataset?
    • Plot the PCA variables to see if they bring out the clusters.
  4. Try KMeans and hierarchical clustering using the data from PCA, and compare the clusters to the Wine column again.

Dataset

wine.csv is in the data folder under homeworks.


In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans

%matplotlib inline

In [3]:
wine = pd.read_csv('../data/wine.csv')

wine.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 178 entries, 0 to 177
Data columns (total 14 columns):
Wine                    178 non-null int64
Alcohol                 178 non-null float64
Malic.acid              178 non-null float64
Ash                     178 non-null float64
Acl                     178 non-null float64
Mg                      178 non-null int64
Phenols                 178 non-null float64
Flavanoids              178 non-null float64
Nonflavanoid.phenols    178 non-null float64
Proanth                 178 non-null float64
Color.int               178 non-null float64
Hue                     178 non-null float64
OD                      178 non-null float64
Proline                 178 non-null int64
dtypes: float64(11), int64(3)

Step 1 - Explore the Dataset


In [4]:
# Drop the first column (the Wine category) so the models never see it.
modified_wine = wine.drop("Wine", axis=1)
modified_wine


Out[4]:
Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
0 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.640000 1.04 3.92 1065
1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.380000 1.05 3.40 1050
2 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.680000 1.03 3.17 1185
3 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.800000 0.86 3.45 1480
4 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.320000 1.04 2.93 735
5 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.750000 1.05 2.85 1450
6 14.39 1.87 2.45 14.6 96 2.50 2.52 0.30 1.98 5.250000 1.02 3.58 1290
7 14.06 2.15 2.61 17.6 121 2.60 2.51 0.31 1.25 5.050000 1.06 3.58 1295
8 14.83 1.64 2.17 14.0 97 2.80 2.98 0.29 1.98 5.200000 1.08 2.85 1045
9 13.86 1.35 2.27 16.0 98 2.98 3.15 0.22 1.85 7.220000 1.01 3.55 1045
10 14.10 2.16 2.30 18.0 105 2.95 3.32 0.22 2.38 5.750000 1.25 3.17 1510
11 14.12 1.48 2.32 16.8 95 2.20 2.43 0.26 1.57 5.000000 1.17 2.82 1280
12 13.75 1.73 2.41 16.0 89 2.60 2.76 0.29 1.81 5.600000 1.15 2.90 1320
13 14.75 1.73 2.39 11.4 91 3.10 3.69 0.43 2.81 5.400000 1.25 2.73 1150
14 14.38 1.87 2.38 12.0 102 3.30 3.64 0.29 2.96 7.500000 1.20 3.00 1547
15 13.63 1.81 2.70 17.2 112 2.85 2.91 0.30 1.46 7.300000 1.28 2.88 1310
16 14.30 1.92 2.72 20.0 120 2.80 3.14 0.33 1.97 6.200000 1.07 2.65 1280
17 13.83 1.57 2.62 20.0 115 2.95 3.40 0.40 1.72 6.600000 1.13 2.57 1130
18 14.19 1.59 2.48 16.5 108 3.30 3.93 0.32 1.86 8.700000 1.23 2.82 1680
19 13.64 3.10 2.56 15.2 116 2.70 3.03 0.17 1.66 5.100000 0.96 3.36 845
20 14.06 1.63 2.28 16.0 126 3.00 3.17 0.24 2.10 5.650000 1.09 3.71 780
21 12.93 3.80 2.65 18.6 102 2.41 2.41 0.25 1.98 4.500000 1.03 3.52 770
22 13.71 1.86 2.36 16.6 101 2.61 2.88 0.27 1.69 3.800000 1.11 4.00 1035
23 12.85 1.60 2.52 17.8 95 2.48 2.37 0.26 1.46 3.930000 1.09 3.63 1015
24 13.50 1.81 2.61 20.0 96 2.53 2.61 0.28 1.66 3.520000 1.12 3.82 845
25 13.05 2.05 3.22 25.0 124 2.63 2.68 0.47 1.92 3.580000 1.13 3.20 830
26 13.39 1.77 2.62 16.1 93 2.85 2.94 0.34 1.45 4.800000 0.92 3.22 1195
27 13.30 1.72 2.14 17.0 94 2.40 2.19 0.27 1.35 3.950000 1.02 2.77 1285
28 13.87 1.90 2.80 19.4 107 2.95 2.97 0.37 1.76 4.500000 1.25 3.40 915
29 14.02 1.68 2.21 16.0 96 2.65 2.33 0.26 1.98 4.700000 1.04 3.59 1035
... ... ... ... ... ... ... ... ... ... ... ... ... ...
148 13.32 3.24 2.38 21.5 92 1.93 0.76 0.45 1.25 8.420000 0.55 1.62 650
149 13.08 3.90 2.36 21.5 113 1.41 1.39 0.34 1.14 9.400000 0.57 1.33 550
150 13.50 3.12 2.62 24.0 123 1.40 1.57 0.22 1.25 8.600000 0.59 1.30 500
151 12.79 2.67 2.48 22.0 112 1.48 1.36 0.24 1.26 10.800000 0.48 1.47 480
152 13.11 1.90 2.75 25.5 116 2.20 1.28 0.26 1.56 7.100000 0.61 1.33 425
153 13.23 3.30 2.28 18.5 98 1.80 0.83 0.61 1.87 10.520000 0.56 1.51 675
154 12.58 1.29 2.10 20.0 103 1.48 0.58 0.53 1.40 7.600000 0.58 1.55 640
155 13.17 5.19 2.32 22.0 93 1.74 0.63 0.61 1.55 7.900000 0.60 1.48 725
156 13.84 4.12 2.38 19.5 89 1.80 0.83 0.48 1.56 9.010000 0.57 1.64 480
157 12.45 3.03 2.64 27.0 97 1.90 0.58 0.63 1.14 7.500000 0.67 1.73 880
158 14.34 1.68 2.70 25.0 98 2.80 1.31 0.53 2.70 13.000000 0.57 1.96 660
159 13.48 1.67 2.64 22.5 89 2.60 1.10 0.52 2.29 11.750000 0.57 1.78 620
160 12.36 3.83 2.38 21.0 88 2.30 0.92 0.50 1.04 7.650000 0.56 1.58 520
161 13.69 3.26 2.54 20.0 107 1.83 0.56 0.50 0.80 5.880000 0.96 1.82 680
162 12.85 3.27 2.58 22.0 106 1.65 0.60 0.60 0.96 5.580000 0.87 2.11 570
163 12.96 3.45 2.35 18.5 106 1.39 0.70 0.40 0.94 5.280000 0.68 1.75 675
164 13.78 2.76 2.30 22.0 90 1.35 0.68 0.41 1.03 9.580000 0.70 1.68 615
165 13.73 4.36 2.26 22.5 88 1.28 0.47 0.52 1.15 6.620000 0.78 1.75 520
166 13.45 3.70 2.60 23.0 111 1.70 0.92 0.43 1.46 10.680000 0.85 1.56 695
167 12.82 3.37 2.30 19.5 88 1.48 0.66 0.40 0.97 10.260000 0.72 1.75 685
168 13.58 2.58 2.69 24.5 105 1.55 0.84 0.39 1.54 8.660000 0.74 1.80 750
169 13.40 4.60 2.86 25.0 112 1.98 0.96 0.27 1.11 8.500000 0.67 1.92 630
170 12.20 3.03 2.32 19.0 96 1.25 0.49 0.40 0.73 5.500000 0.66 1.83 510
171 12.77 2.39 2.28 19.5 86 1.39 0.51 0.48 0.64 9.899999 0.57 1.63 470
172 14.16 2.51 2.48 20.0 91 1.68 0.70 0.44 1.24 9.700000 0.62 1.71 660
173 13.71 5.65 2.45 20.5 95 1.68 0.61 0.52 1.06 7.700000 0.64 1.74 740
174 13.40 3.91 2.48 23.0 102 1.80 0.75 0.43 1.41 7.300000 0.70 1.56 750
175 13.27 4.28 2.26 20.0 120 1.59 0.69 0.43 1.35 10.200000 0.59 1.56 835
176 13.17 2.59 2.37 20.0 120 1.65 0.68 0.53 1.46 9.300000 0.60 1.62 840
177 14.13 4.10 2.74 24.5 96 2.05 0.76 0.56 1.35 9.200000 0.61 1.60 560

178 rows × 13 columns


In [5]:
# Scatter each feature against the Wine category to eyeball how well the
# individual variables separate the three classes.
for i, col in enumerate(modified_wine.columns):
    plt.figure(i)
    plt.scatter(wine["Wine"], wine[col])
    plt.ylabel(col)

In [6]:
first_half_wine = wine[["Wine","Alcohol","Malic.acid","Ash","Acl","Mg","Phenols"]]
_ = pd.plotting.scatter_matrix(first_half_wine, diagonal='kde')



In [7]:
second_half_wine = wine[["Wine","Flavanoids","Nonflavanoid.phenols","Proanth","Color.int","Hue","OD","Proline"]]
_ = pd.plotting.scatter_matrix(second_half_wine, diagonal='kde')



In [8]:
_ = pd.plotting.scatter_matrix(wine, diagonal='kde')


Step 2 - KMeans

Use KMeans to split the dataset into three clusters.


In [9]:
# Feature matrix from modified_wine (the Wine column was dropped above).
X = modified_wine.values

X


Out[9]:
array([[  1.42300000e+01,   1.71000000e+00,   2.43000000e+00, ...,
          1.04000000e+00,   3.92000000e+00,   1.06500000e+03],
       [  1.32000000e+01,   1.78000000e+00,   2.14000000e+00, ...,
          1.05000000e+00,   3.40000000e+00,   1.05000000e+03],
       [  1.31600000e+01,   2.36000000e+00,   2.67000000e+00, ...,
          1.03000000e+00,   3.17000000e+00,   1.18500000e+03],
       ..., 
       [  1.32700000e+01,   4.28000000e+00,   2.26000000e+00, ...,
          5.90000000e-01,   1.56000000e+00,   8.35000000e+02],
       [  1.31700000e+01,   2.59000000e+00,   2.37000000e+00, ...,
          6.00000000e-01,   1.62000000e+00,   8.40000000e+02],
       [  1.41300000e+01,   4.10000000e+00,   2.74000000e+00, ...,
          6.10000000e-01,   1.60000000e+00,   5.60000000e+02]])

In [10]:
# Second feature column (Malic.acid).
X[:,1]


Out[10]:
array([ 1.71,  1.78,  2.36,  1.95,  2.59,  1.76,  1.87,  2.15,  1.64,
        1.35,  2.16,  1.48,  1.73,  1.73,  1.87,  1.81,  1.92,  1.57,
        1.59,  3.1 ,  1.63,  3.8 ,  1.86,  1.6 ,  1.81,  2.05,  1.77,
        1.72,  1.9 ,  1.68,  1.5 ,  1.66,  1.83,  1.53,  1.8 ,  1.81,
        1.64,  1.65,  1.5 ,  3.99,  1.71,  3.84,  1.89,  3.98,  1.77,
        4.04,  3.59,  1.68,  2.02,  1.73,  1.73,  1.65,  1.75,  1.9 ,
        1.67,  1.73,  1.7 ,  1.97,  1.43,  0.94,  1.1 ,  1.36,  1.25,
        1.13,  1.45,  1.21,  1.01,  1.17,  0.94,  1.19,  1.61,  1.51,
        1.66,  1.67,  1.09,  1.88,  0.9 ,  2.89,  0.99,  3.87,  0.92,
        1.81,  1.13,  3.86,  0.89,  0.98,  1.61,  1.67,  2.06,  1.33,
        1.83,  1.51,  1.53,  2.83,  1.99,  1.52,  2.12,  1.41,  1.07,
        3.17,  2.08,  1.34,  2.45,  1.72,  1.73,  2.55,  1.73,  1.75,
        1.29,  1.35,  3.74,  2.43,  2.68,  0.74,  1.39,  1.51,  1.47,
        1.61,  3.43,  3.43,  2.4 ,  2.05,  4.43,  5.8 ,  4.31,  2.16,
        1.53,  2.13,  1.63,  4.3 ,  1.35,  2.99,  2.31,  3.55,  1.24,
        2.46,  4.72,  5.51,  3.59,  2.96,  2.81,  2.56,  3.17,  4.95,
        3.88,  3.57,  5.04,  4.61,  3.24,  3.9 ,  3.12,  2.67,  1.9 ,
        3.3 ,  1.29,  5.19,  4.12,  3.03,  1.68,  1.67,  3.83,  3.26,
        3.27,  3.45,  2.76,  4.36,  3.7 ,  3.37,  2.58,  4.6 ,  3.03,
        2.39,  2.51,  5.65,  3.91,  4.28,  2.59,  4.1 ])

In [11]:
# Malic.acid (column 1) vs Proline (column 12).
plt.scatter(X[:,1], X[:,12]);



In [12]:
# Cluster the raw features into three groups.
kmeans = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=1)
Y_hat_kmeans = kmeans.fit(X).labels_

plt.scatter(X[:,1], X[:,12], c=Y_hat_kmeans);



In [13]:
# Cluster centroids in the original feature space.
mu = kmeans.cluster_centers_
mu


Out[13]:
array([[  1.29298387e+01,   2.50403226e+00,   2.40806452e+00,
          1.98903226e+01,   1.03596774e+02,   2.11112903e+00,
          1.58403226e+00,   3.88387097e-01,   1.50338710e+00,
          5.65032258e+00,   8.83967742e-01,   2.36548387e+00,
          7.28338710e+02],
       [  1.25166667e+01,   2.49420290e+00,   2.28855072e+00,
          2.08231884e+01,   9.23478261e+01,   2.07072464e+00,
          1.75840580e+00,   3.90144928e-01,   1.45188406e+00,
          4.08695651e+00,   9.41159420e-01,   2.49072464e+00,
          4.58231884e+02],
       [  1.38044681e+01,   1.88340426e+00,   2.42617021e+00,
          1.70234043e+01,   1.05510638e+02,   2.86723404e+00,
          3.01425532e+00,   2.85319149e-01,   1.91042553e+00,
          5.70255319e+00,   1.07829787e+00,   3.11404255e+00,
          1.19514894e+03]])

In [14]:
# Replot the clusters (Malic.acid vs Proline) with their centroids overlaid.
plt.scatter(X[:,1], X[:,12], c=Y_hat_kmeans, alpha=0.6)
plt.scatter(mu[:,1], mu[:,12], s=100, c=np.unique(Y_hat_kmeans))
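
The assignment also asks how these clusters line up with the Wine column. A minimal sketch of that comparison (KMeans cluster numbers are arbitrary, so look for rows dominated by a single column):

# Cross-tabulate the true Wine category against the KMeans labels.
print(pd.crosstab(wine["Wine"], Y_hat_kmeans, rownames=["Wine"], colnames=["cluster"]))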

Step 3 - PCA

Apply PCA to the data.


In [15]:
from IPython.core.pylabtools import figsize
%matplotlib inline
figsize(12,5)

In [16]:
# Look at the covariance matrix of the raw features.
print(np.cov(X, rowvar=False))


[[  6.59062328e-01   8.56113090e-02   4.71151590e-02  -8.41092903e-01
    3.13987812e+00   1.46887218e-01   1.92033222e-01  -1.57542595e-02
    6.35175205e-02   1.02828254e+00  -1.33134432e-02   4.16978226e-02
    1.64567185e+02]
 [  8.56113090e-02   1.24801540e+00   5.02770393e-02   1.07633171e+00
   -8.70779534e-01  -2.34337723e-01  -4.58630366e-01   4.07333619e-02
   -1.41146982e-01   6.44838183e-01  -1.43325638e-01  -2.92447483e-01
   -6.75488666e+01]
 [  4.71151590e-02   5.02770393e-02   7.52646353e-02   4.06208278e-01
    1.12293658e+00   2.21455913e-02   3.15347299e-02   6.35847140e-03
    1.51557799e-03   1.64654327e-01  -4.68215451e-03   7.61835841e-04
    1.93197391e+01]
 [ -8.41092903e-01   1.07633171e+00   4.06208278e-01   1.11526862e+01
   -3.97476036e+00  -6.71149146e-01  -1.17208281e+00   1.50421856e-01
   -3.77176220e-01   1.45024186e-01  -2.09118054e-01  -6.56234368e-01
   -4.63355345e+02]
 [  3.13987812e+00  -8.70779534e-01   1.12293658e+00  -3.97476036e+00
    2.03989335e+02   1.91646988e+00   2.79308703e+00  -4.55563385e-01
    1.93283248e+00   6.62052061e+00   1.80851266e-01   6.69308068e-01
    1.76915870e+03]
 [  1.46887218e-01  -2.34337723e-01   2.21455913e-02  -6.71149146e-01
    1.91646988e+00   3.91689535e-01   5.40470422e-01  -3.50451247e-02
    2.19373345e-01  -7.99975192e-02   6.20388758e-02   3.11021278e-01
    9.81710573e+01]
 [  1.92033222e-01  -4.58630366e-01   3.15347299e-02  -1.17208281e+00
    2.79308703e+00   5.40470422e-01   9.97718673e-01  -6.68669999e-02
    3.73147553e-01  -3.99168626e-01   1.24081969e-01   5.58262255e-01
    1.55447492e+02]
 [ -1.57542595e-02   4.07333619e-02   6.35847140e-03   1.50421856e-01
   -4.55563385e-01  -3.50451247e-02  -6.68669999e-02   1.54886339e-02
   -2.60598680e-02   4.01205097e-02  -7.47117692e-03  -4.44692440e-02
   -1.22035863e+01]
 [  6.35175205e-02  -1.41146982e-01   1.51557799e-03  -3.77176220e-01
    1.93283248e+00   2.19373345e-01   3.73147553e-01  -2.60598680e-02
    3.27594668e-01  -3.35039177e-02   3.86645655e-02   2.10932940e-01
    5.95543338e+01]
 [  1.02828254e+00   6.44838183e-01   1.64654327e-01   1.45024186e-01
    6.62052061e+00  -7.99975192e-02  -3.99168626e-01   4.01205097e-02
   -3.35039177e-02   5.37444938e+00  -2.76505801e-01  -7.05812576e-01
    2.30767480e+02]
 [ -1.33134432e-02  -1.43325638e-01  -4.68215451e-03  -2.09118054e-01
    1.80851266e-01   6.20388758e-02   1.24081969e-01  -7.47117692e-03
    3.86645655e-02  -2.76505801e-01   5.22449607e-02   9.17662439e-02
    1.70002234e+01]
 [  4.16978226e-02  -2.92447483e-01   7.61835841e-04  -6.56234368e-01
    6.69308068e-01   3.11021278e-01   5.58262255e-01  -4.44692440e-02
    2.10932940e-01  -7.05812576e-01   9.17662439e-02   5.04086409e-01
    6.99275256e+01]
 [  1.64567185e+02  -6.75488666e+01   1.93197391e+01  -4.63355345e+02
    1.76915870e+03   9.81710573e+01   1.55447492e+02  -1.22035863e+01
    5.95543338e+01   2.30767480e+02   1.70002234e+01   6.99275256e+01
    9.91667174e+04]]
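
Note that the Proline variance (about 9.9e4, the bottom-right entry) dwarfs every other term in this matrix, so PCA on the raw features would be dominated by that single column. That is why the data is standardized before PCA in the next cell.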

In [34]:
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance so PCA is not
# dominated by large-scale features like Proline.
scale = StandardScaler()
scaled_X = scale.fit_transform(X)

In [43]:
from sklearn.decomposition import PCA

# Keep all 13 components so we can inspect the full variance spectrum.
pca = PCA()

In [44]:
X_pca = pca.fit_transform(scaled_X)

In [45]:
pca.components_


Out[45]:
array([[-0.1443294 ,  0.24518758,  0.00205106,  0.23932041, -0.14199204,
        -0.39466085, -0.4229343 ,  0.2985331 , -0.31342949,  0.0886167 ,
        -0.29671456, -0.37616741, -0.28675223],
       [ 0.48365155,  0.22493093,  0.31606881, -0.0105905 ,  0.299634  ,
         0.06503951, -0.00335981,  0.02877949,  0.03930172,  0.52999567,
        -0.27923515, -0.16449619,  0.36490283],
       [-0.20738262,  0.08901289,  0.6262239 ,  0.61208035,  0.13075693,
         0.14617896,  0.1506819 ,  0.17036816,  0.14945431, -0.13730621,
         0.08522192,  0.16600459, -0.12674592],
       [ 0.0178563 , -0.53689028,  0.21417556, -0.06085941,  0.35179658,
        -0.19806835, -0.15229479,  0.20330102, -0.39905653, -0.06592568,
         0.42777141, -0.18412074,  0.23207086],
       [-0.26566365,  0.03521363, -0.14302547,  0.06610294,  0.72704851,
        -0.14931841, -0.10902584, -0.50070298,  0.13685982, -0.07643678,
        -0.17361452, -0.10116099, -0.1578688 ],
       [ 0.21353865,  0.53681385,  0.15447466, -0.10082451,  0.03814394,
        -0.0841223 , -0.01892002, -0.25859401, -0.53379539, -0.41864414,
         0.10598274,  0.26585107,  0.11972557],
       [-0.05639636,  0.42052391, -0.14917061, -0.28696914,  0.3228833 ,
        -0.02792498, -0.06068521,  0.59544729,  0.37213935, -0.22771214,
         0.23207564, -0.0447637 ,  0.0768045 ],
       [ 0.39613926,  0.06582674, -0.17026002,  0.42797018, -0.15636143,
        -0.40593409, -0.18724536, -0.23328465,  0.36822675, -0.03379692,
         0.43662362, -0.07810789,  0.12002267],
       [-0.50861912,  0.07528304,  0.30769445, -0.20044931, -0.27140257,
        -0.28603452, -0.04957849, -0.19550132,  0.20914487, -0.05621752,
        -0.08582839, -0.1372269 ,  0.57578611],
       [ 0.21160473, -0.30907994, -0.02712539,  0.05279942,  0.06787022,
        -0.32013135, -0.16315051,  0.21553507,  0.1341839 , -0.29077518,
        -0.52239889,  0.52370587,  0.162116  ],
       [ 0.22591696, -0.07648554,  0.49869142, -0.47931378, -0.07128891,
        -0.30434119,  0.02569409, -0.11689586,  0.23736257, -0.0318388 ,
         0.04821201, -0.0464233 , -0.53926983],
       [-0.26628645,  0.12169604, -0.04962237, -0.05574287,  0.06222011,
        -0.30388245, -0.04289883,  0.04235219, -0.09555303,  0.60422163,
         0.259214  ,  0.60095872, -0.07940162],
       [ 0.01496997,  0.02596375, -0.14121803,  0.09168285,  0.05677422,
        -0.46390791,  0.83225706,  0.11403985, -0.11691707, -0.0119928 ,
        -0.08988884, -0.15671813,  0.01444734]])
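
Each row of components_ is a unit-length loading vector over the 13 standardized features. In the first row, the largest weights fall on Flavanoids (-0.42), Phenols (-0.39), and OD (-0.38), so the leading component is driven mostly by the phenolic measurements.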

In [46]:
# Component means are ~0 (floating-point noise) because the input was standardized.
pca.mean_


Out[46]:
array([  7.84141790e-15,   2.44498554e-16,  -4.05917497e-15,
        -7.11041712e-17,  -2.49488320e-17,  -1.95536471e-16,
         9.44313292e-16,  -4.17892936e-16,  -1.54059038e-15,
        -4.12903170e-16,   1.39838203e-15,   2.12688793e-15,
        -6.98567296e-17])

In [47]:
# Plot the first two principal components (columns 0 and 1) to see the clusters.
_ = plt.scatter(X_pca[:,0], X_pca[:,1])



In [48]:
# How many components did you need to explain 99% of variance in this dataset?
# Careful: the plot below shows each component's individual ratio, not a running
# total. On the standardized data the first component explains only about a third
# of the variance, so a single component is nowhere near 99%; the cumulative sum
# in the sketch below gives the real answer.
plt.plot(pca.explained_variance_ratio_);
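
A short sketch that computes the answer directly from the fitted PCA; on this standardized data it comes out to roughly 12 of the 13 components:

# Cumulative explained variance; the first index reaching 0.99 is the answer.
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components_99 = np.argmax(cum_var >= 0.99) + 1
plt.plot(range(1, len(cum_var) + 1), cum_var)
plt.axhline(0.99, linestyle='--')
print(n_components_99)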



In [49]:
# With all 13 components kept, the ratios necessarily sum to 1.
sum(pca.explained_variance_ratio_)


Out[49]:
0.99999999999999978

Step 4 - KMeans and Hierarchical Clustering after PCA


In [50]:
# Rerun KMeans on the PCA scores.
kmeans = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=1)
Y_hat_kmeans = kmeans.fit(X_pca).labels_
mu = kmeans.cluster_centers_

# Plot the first two principal components with the centroids overlaid.
plt.scatter(X_pca[:,0], X_pca[:,1], c=Y_hat_kmeans, alpha=0.4)
plt.scatter(mu[:,0], mu[:,1], s=100, c=np.unique(Y_hat_kmeans))


Out[50]:
<matplotlib.collections.PathCollection at 0x122f82210>
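
As in Step 2, the clusters should be compared to the Wine column. A quick cross-tabulation (same caveat about arbitrary cluster numbering) shows how the PCA-based KMeans labels line up with the true categories:

print(pd.crosstab(wine["Wine"], Y_hat_kmeans, rownames=["Wine"], colnames=["cluster"]))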

In [51]:
# Compute the pairwise Euclidean distance matrix (on the raw features here;
# a PCA-based hierarchical clustering follows the dendrogram below).
from scipy.spatial.distance import pdist, squareform

distx = squareform(pdist(X, metric='euclidean'))
distx


Out[51]:
array([[   0.        ,   31.26501239,  122.83115403, ...,  230.24002302,
         225.21518399,  506.05936766],
       [  31.26501239,    0.        ,  135.22469301, ...,  216.22123207,
         211.21353863,  490.23526821],
       [ 122.83115403,  135.22469301,    0.        , ...,  350.57118792,
         345.56265177,  625.07017782],
       ..., 
       [ 230.24002302,  216.22123207,  350.57118792, ...,    0.        ,
           5.35888981,  276.08601522],
       [ 225.21518399,  211.21353863,  345.56265177, ...,    5.35888981,
           0.        ,  281.06899242],
       [ 506.05936766,  490.23526821,  625.07017782, ...,  276.08601522,
         281.06899242,    0.        ]])

In [52]:
# Perform single-linkage clustering and plot the dendrogram.
from scipy.cluster.hierarchy import linkage, dendrogram

# linkage expects a condensed distance matrix (or the raw observations),
# not the square matrix above, so pass pdist(...) directly.
R = dendrogram(linkage(pdist(X, metric='euclidean'), method='single'), color_threshold=10)

plt.xlabel('points')
plt.ylabel('Height')
plt.suptitle('Cluster Dendrogram', fontweight='bold', fontsize=14);
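
Finally, the assignment asks for hierarchical clustering on the PCA data and another comparison to the Wine column. A minimal sketch, using Ward linkage on the PCA scores instead of the single linkage above (single linkage tends to chain on data like this, so the swap is deliberate), cuts the tree into three flat clusters:

from scipy.cluster.hierarchy import fcluster

# Ward linkage on the PCA scores, cut into three flat clusters.
Z = linkage(X_pca, method='ward')
hier_labels = fcluster(Z, t=3, criterion='maxclust')
print(pd.crosstab(wine["Wine"], hier_labels, rownames=["Wine"], colnames=["cluster"]))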


