Homework 4:

  1. Follow the steps below to:
    • Read wine.csv in the data folder.
    • The first column contains the wine category. Don't use it in the models below; we treat this as an unsupervised learning problem and compare the results to the Wine column.
  2. Try KMeans with n_clusters = 3 and compare the clusters to the Wine column.
  3. Try PCA and see how much you can reduce the variable space.
    • How many components do you need to explain 99% of the variance in this dataset?
    • Plot the PCA variables to see if they bring out the clusters.
  4. Try KMeans and hierarchical clustering using the data from PCA, and compare the clusters to the Wine column again.

Dataset

wine.csv is in the data folder under homeworks.


In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans

%matplotlib inline

In [3]:
wine = pd.read_csv('../data/wine.csv')

wine.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 178 entries, 0 to 177
Data columns (total 14 columns):
Wine                    178 non-null int64
Alcohol                 178 non-null float64
Malic.acid              178 non-null float64
Ash                     178 non-null float64
Acl                     178 non-null float64
Mg                      178 non-null int64
Phenols                 178 non-null float64
Flavanoids              178 non-null float64
Nonflavanoid.phenols    178 non-null float64
Proanth                 178 non-null float64
Color.int               178 non-null float64
Hue                     178 non-null float64
OD                      178 non-null float64
Proline                 178 non-null int64
dtypes: float64(11), int64(3)

Step 1 - Explore the Dataset


In [4]:
# Drop the first column (the Wine category) so the models never see it.
modified_wine = wine.drop("Wine", axis=1)
modified_wine


Out[4]:
Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
0 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.640000 1.04 3.92 1065
1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.380000 1.05 3.40 1050
2 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.680000 1.03 3.17 1185
3 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.800000 0.86 3.45 1480
4 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.320000 1.04 2.93 735
5 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.750000 1.05 2.85 1450
6 14.39 1.87 2.45 14.6 96 2.50 2.52 0.30 1.98 5.250000 1.02 3.58 1290
7 14.06 2.15 2.61 17.6 121 2.60 2.51 0.31 1.25 5.050000 1.06 3.58 1295
8 14.83 1.64 2.17 14.0 97 2.80 2.98 0.29 1.98 5.200000 1.08 2.85 1045
9 13.86 1.35 2.27 16.0 98 2.98 3.15 0.22 1.85 7.220000 1.01 3.55 1045
10 14.10 2.16 2.30 18.0 105 2.95 3.32 0.22 2.38 5.750000 1.25 3.17 1510
11 14.12 1.48 2.32 16.8 95 2.20 2.43 0.26 1.57 5.000000 1.17 2.82 1280
12 13.75 1.73 2.41 16.0 89 2.60 2.76 0.29 1.81 5.600000 1.15 2.90 1320
13 14.75 1.73 2.39 11.4 91 3.10 3.69 0.43 2.81 5.400000 1.25 2.73 1150
14 14.38 1.87 2.38 12.0 102 3.30 3.64 0.29 2.96 7.500000 1.20 3.00 1547
15 13.63 1.81 2.70 17.2 112 2.85 2.91 0.30 1.46 7.300000 1.28 2.88 1310
16 14.30 1.92 2.72 20.0 120 2.80 3.14 0.33 1.97 6.200000 1.07 2.65 1280
17 13.83 1.57 2.62 20.0 115 2.95 3.40 0.40 1.72 6.600000 1.13 2.57 1130
18 14.19 1.59 2.48 16.5 108 3.30 3.93 0.32 1.86 8.700000 1.23 2.82 1680
19 13.64 3.10 2.56 15.2 116 2.70 3.03 0.17 1.66 5.100000 0.96 3.36 845
20 14.06 1.63 2.28 16.0 126 3.00 3.17 0.24 2.10 5.650000 1.09 3.71 780
21 12.93 3.80 2.65 18.6 102 2.41 2.41 0.25 1.98 4.500000 1.03 3.52 770
22 13.71 1.86 2.36 16.6 101 2.61 2.88 0.27 1.69 3.800000 1.11 4.00 1035
23 12.85 1.60 2.52 17.8 95 2.48 2.37 0.26 1.46 3.930000 1.09 3.63 1015
24 13.50 1.81 2.61 20.0 96 2.53 2.61 0.28 1.66 3.520000 1.12 3.82 845
25 13.05 2.05 3.22 25.0 124 2.63 2.68 0.47 1.92 3.580000 1.13 3.20 830
26 13.39 1.77 2.62 16.1 93 2.85 2.94 0.34 1.45 4.800000 0.92 3.22 1195
27 13.30 1.72 2.14 17.0 94 2.40 2.19 0.27 1.35 3.950000 1.02 2.77 1285
28 13.87 1.90 2.80 19.4 107 2.95 2.97 0.37 1.76 4.500000 1.25 3.40 915
29 14.02 1.68 2.21 16.0 96 2.65 2.33 0.26 1.98 4.700000 1.04 3.59 1035
... ... ... ... ... ... ... ... ... ... ... ... ... ...
148 13.32 3.24 2.38 21.5 92 1.93 0.76 0.45 1.25 8.420000 0.55 1.62 650
149 13.08 3.90 2.36 21.5 113 1.41 1.39 0.34 1.14 9.400000 0.57 1.33 550
150 13.50 3.12 2.62 24.0 123 1.40 1.57 0.22 1.25 8.600000 0.59 1.30 500
151 12.79 2.67 2.48 22.0 112 1.48 1.36 0.24 1.26 10.800000 0.48 1.47 480
152 13.11 1.90 2.75 25.5 116 2.20 1.28 0.26 1.56 7.100000 0.61 1.33 425
153 13.23 3.30 2.28 18.5 98 1.80 0.83 0.61 1.87 10.520000 0.56 1.51 675
154 12.58 1.29 2.10 20.0 103 1.48 0.58 0.53 1.40 7.600000 0.58 1.55 640
155 13.17 5.19 2.32 22.0 93 1.74 0.63 0.61 1.55 7.900000 0.60 1.48 725
156 13.84 4.12 2.38 19.5 89 1.80 0.83 0.48 1.56 9.010000 0.57 1.64 480
157 12.45 3.03 2.64 27.0 97 1.90 0.58 0.63 1.14 7.500000 0.67 1.73 880
158 14.34 1.68 2.70 25.0 98 2.80 1.31 0.53 2.70 13.000000 0.57 1.96 660
159 13.48 1.67 2.64 22.5 89 2.60 1.10 0.52 2.29 11.750000 0.57 1.78 620
160 12.36 3.83 2.38 21.0 88 2.30 0.92 0.50 1.04 7.650000 0.56 1.58 520
161 13.69 3.26 2.54 20.0 107 1.83 0.56 0.50 0.80 5.880000 0.96 1.82 680
162 12.85 3.27 2.58 22.0 106 1.65 0.60 0.60 0.96 5.580000 0.87 2.11 570
163 12.96 3.45 2.35 18.5 106 1.39 0.70 0.40 0.94 5.280000 0.68 1.75 675
164 13.78 2.76 2.30 22.0 90 1.35 0.68 0.41 1.03 9.580000 0.70 1.68 615
165 13.73 4.36 2.26 22.5 88 1.28 0.47 0.52 1.15 6.620000 0.78 1.75 520
166 13.45 3.70 2.60 23.0 111 1.70 0.92 0.43 1.46 10.680000 0.85 1.56 695
167 12.82 3.37 2.30 19.5 88 1.48 0.66 0.40 0.97 10.260000 0.72 1.75 685
168 13.58 2.58 2.69 24.5 105 1.55 0.84 0.39 1.54 8.660000 0.74 1.80 750
169 13.40 4.60 2.86 25.0 112 1.98 0.96 0.27 1.11 8.500000 0.67 1.92 630
170 12.20 3.03 2.32 19.0 96 1.25 0.49 0.40 0.73 5.500000 0.66 1.83 510
171 12.77 2.39 2.28 19.5 86 1.39 0.51 0.48 0.64 9.899999 0.57 1.63 470
172 14.16 2.51 2.48 20.0 91 1.68 0.70 0.44 1.24 9.700000 0.62 1.71 660
173 13.71 5.65 2.45 20.5 95 1.68 0.61 0.52 1.06 7.700000 0.64 1.74 740
174 13.40 3.91 2.48 23.0 102 1.80 0.75 0.43 1.41 7.300000 0.70 1.56 750
175 13.27 4.28 2.26 20.0 120 1.59 0.69 0.43 1.35 10.200000 0.59 1.56 835
176 13.17 2.59 2.37 20.0 120 1.65 0.68 0.53 1.46 9.300000 0.60 1.62 840
177 14.13 4.10 2.74 24.5 96 2.05 0.76 0.56 1.35 9.200000 0.61 1.60 560

178 rows × 13 columns


In [5]:
# Scatter each feature against the Wine category to eyeball how well the
# individual variables separate the three classes.
for i, col in enumerate(modified_wine.columns):
    plt.figure(i)
    plt.scatter(wine["Wine"], wine[col])
    plt.ylabel(col)

In [6]:
first_half_wine = wine[["Wine","Alcohol","Malic.acid","Ash","Acl","Mg","Phenols"]]
_ = pd.plotting.scatter_matrix(first_half_wine, diagonal='kde')



In [7]:
second_half_wine = wine[["Wine","Flavanoids","Nonflavanoid.phenols","Proanth","Color.int","Hue","OD","Proline"]]
_ = pd.plotting.scatter_matrix(second_half_wine, diagonal='kde')



In [8]:
_ = pd.plotting.scatter_matrix(wine, diagonal='kde')


Step 2 - KMeans

Use KMeans to split the dataset into three clusters.


In [9]:
# Feature matrix from modified_wine (the Wine column was dropped above).
X = modified_wine.values

X


Out[9]:
array([[  1.42300000e+01,   1.71000000e+00,   2.43000000e+00, ...,
          1.04000000e+00,   3.92000000e+00,   1.06500000e+03],
       [  1.32000000e+01,   1.78000000e+00,   2.14000000e+00, ...,
          1.05000000e+00,   3.40000000e+00,   1.05000000e+03],
       [  1.31600000e+01,   2.36000000e+00,   2.67000000e+00, ...,
          1.03000000e+00,   3.17000000e+00,   1.18500000e+03],
       ..., 
       [  1.32700000e+01,   4.28000000e+00,   2.26000000e+00, ...,
          5.90000000e-01,   1.56000000e+00,   8.35000000e+02],
       [  1.31700000e+01,   2.59000000e+00,   2.37000000e+00, ...,
          6.00000000e-01,   1.62000000e+00,   8.40000000e+02],
       [  1.41300000e+01,   4.10000000e+00,   2.74000000e+00, ...,
          6.10000000e-01,   1.60000000e+00,   5.60000000e+02]])

In [10]:
# Second feature column (Malic.acid).
X[:,1]


Out[10]:
array([ 1.71,  1.78,  2.36,  1.95,  2.59,  1.76,  1.87,  2.15,  1.64,
        1.35,  2.16,  1.48,  1.73,  1.73,  1.87,  1.81,  1.92,  1.57,
        1.59,  3.1 ,  1.63,  3.8 ,  1.86,  1.6 ,  1.81,  2.05,  1.77,
        1.72,  1.9 ,  1.68,  1.5 ,  1.66,  1.83,  1.53,  1.8 ,  1.81,
        1.64,  1.65,  1.5 ,  3.99,  1.71,  3.84,  1.89,  3.98,  1.77,
        4.04,  3.59,  1.68,  2.02,  1.73,  1.73,  1.65,  1.75,  1.9 ,
        1.67,  1.73,  1.7 ,  1.97,  1.43,  0.94,  1.1 ,  1.36,  1.25,
        1.13,  1.45,  1.21,  1.01,  1.17,  0.94,  1.19,  1.61,  1.51,
        1.66,  1.67,  1.09,  1.88,  0.9 ,  2.89,  0.99,  3.87,  0.92,
        1.81,  1.13,  3.86,  0.89,  0.98,  1.61,  1.67,  2.06,  1.33,
        1.83,  1.51,  1.53,  2.83,  1.99,  1.52,  2.12,  1.41,  1.07,
        3.17,  2.08,  1.34,  2.45,  1.72,  1.73,  2.55,  1.73,  1.75,
        1.29,  1.35,  3.74,  2.43,  2.68,  0.74,  1.39,  1.51,  1.47,
        1.61,  3.43,  3.43,  2.4 ,  2.05,  4.43,  5.8 ,  4.31,  2.16,
        1.53,  2.13,  1.63,  4.3 ,  1.35,  2.99,  2.31,  3.55,  1.24,
        2.46,  4.72,  5.51,  3.59,  2.96,  2.81,  2.56,  3.17,  4.95,
        3.88,  3.57,  5.04,  4.61,  3.24,  3.9 ,  3.12,  2.67,  1.9 ,
        3.3 ,  1.29,  5.19,  4.12,  3.03,  1.68,  1.67,  3.83,  3.26,
        3.27,  3.45,  2.76,  4.36,  3.7 ,  3.37,  2.58,  4.6 ,  3.03,
        2.39,  2.51,  5.65,  3.91,  4.28,  2.59,  4.1 ])

In [11]:
# Malic.acid (column 1) vs Proline (column 12).
plt.scatter(X[:,1], X[:,12]);



In [12]:
# Cluster the raw features into three groups.
kmeans = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=1)
Y_hat_kmeans = kmeans.fit(X).labels_

plt.scatter(X[:,1], X[:,12], c=Y_hat_kmeans);



In [13]:
# Cluster centroids in the original feature space.
mu = kmeans.cluster_centers_
mu


Out[13]:
array([[  1.29298387e+01,   2.50403226e+00,   2.40806452e+00,
          1.98903226e+01,   1.03596774e+02,   2.11112903e+00,
          1.58403226e+00,   3.88387097e-01,   1.50338710e+00,
          5.65032258e+00,   8.83967742e-01,   2.36548387e+00,
          7.28338710e+02],
       [  1.25166667e+01,   2.49420290e+00,   2.28855072e+00,
          2.08231884e+01,   9.23478261e+01,   2.07072464e+00,
          1.75840580e+00,   3.90144928e-01,   1.45188406e+00,
          4.08695651e+00,   9.41159420e-01,   2.49072464e+00,
          4.58231884e+02],
       [  1.38044681e+01,   1.88340426e+00,   2.42617021e+00,
          1.70234043e+01,   1.05510638e+02,   2.86723404e+00,
          3.01425532e+00,   2.85319149e-01,   1.91042553e+00,
          5.70255319e+00,   1.07829787e+00,   3.11404255e+00,
          1.19514894e+03]])

In [14]:
# Replot the clusters (Malic.acid vs Proline) with their centroids overlaid.
plt.scatter(X[:,1], X[:,12], c=Y_hat_kmeans, alpha=0.6)
plt.scatter(mu[:,1], mu[:,12], s=100, c=np.unique(Y_hat_kmeans))
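
The assignment also asks how these clusters line up with the Wine column. A minimal sketch of that comparison (KMeans cluster numbers are arbitrary, so look for rows dominated by a single column):

# Cross-tabulate the true Wine category against the KMeans labels.
print(pd.crosstab(wine["Wine"], Y_hat_kmeans, rownames=["Wine"], colnames=["cluster"]))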

Step 3 - PCA

Apply PCA to the data.


In [15]:
from IPython.core.pylabtools import figsize
%matplotlib inline
figsize(12,5)

In [16]:
# Look at the covariance matrix of the raw features.
print(np.cov(X, rowvar=False))


[[  6.59062328e-01   8.56113090e-02   4.71151590e-02  -8.41092903e-01
    3.13987812e+00   1.46887218e-01   1.92033222e-01  -1.57542595e-02
    6.35175205e-02   1.02828254e+00  -1.33134432e-02   4.16978226e-02
    1.64567185e+02]
 [  8.56113090e-02   1.24801540e+00   5.02770393e-02   1.07633171e+00
   -8.70779534e-01  -2.34337723e-01  -4.58630366e-01   4.07333619e-02
   -1.41146982e-01   6.44838183e-01  -1.43325638e-01  -2.92447483e-01
   -6.75488666e+01]
 [  4.71151590e-02   5.02770393e-02   7.52646353e-02   4.06208278e-01
    1.12293658e+00   2.21455913e-02   3.15347299e-02   6.35847140e-03
    1.51557799e-03   1.64654327e-01  -4.68215451e-03   7.61835841e-04
    1.93197391e+01]
 [ -8.41092903e-01   1.07633171e+00   4.06208278e-01   1.11526862e+01
   -3.97476036e+00  -6.71149146e-01  -1.17208281e+00   1.50421856e-01
   -3.77176220e-01   1.45024186e-01  -2.09118054e-01  -6.56234368e-01
   -4.63355345e+02]
 [  3.13987812e+00  -8.70779534e-01   1.12293658e+00  -3.97476036e+00
    2.03989335e+02   1.91646988e+00   2.79308703e+00  -4.55563385e-01
    1.93283248e+00   6.62052061e+00   1.80851266e-01   6.69308068e-01
    1.76915870e+03]
 [  1.46887218e-01  -2.34337723e-01   2.21455913e-02  -6.71149146e-01
    1.91646988e+00   3.91689535e-01   5.40470422e-01  -3.50451247e-02
    2.19373345e-01  -7.99975192e-02   6.20388758e-02   3.11021278e-01
    9.81710573e+01]
 [  1.92033222e-01  -4.58630366e-01   3.15347299e-02  -1.17208281e+00
    2.79308703e+00   5.40470422e-01   9.97718673e-01  -6.68669999e-02
    3.73147553e-01  -3.99168626e-01   1.24081969e-01   5.58262255e-01
    1.55447492e+02]
 [ -1.57542595e-02   4.07333619e-02   6.35847140e-03   1.50421856e-01
   -4.55563385e-01  -3.50451247e-02  -6.68669999e-02   1.54886339e-02
   -2.60598680e-02   4.01205097e-02  -7.47117692e-03  -4.44692440e-02
   -1.22035863e+01]
 [  6.35175205e-02  -1.41146982e-01   1.51557799e-03  -3.77176220e-01
    1.93283248e+00   2.19373345e-01   3.73147553e-01  -2.60598680e-02
    3.27594668e-01  -3.35039177e-02   3.86645655e-02   2.10932940e-01
    5.95543338e+01]
 [  1.02828254e+00   6.44838183e-01   1.64654327e-01   1.45024186e-01
    6.62052061e+00  -7.99975192e-02  -3.99168626e-01   4.01205097e-02
   -3.35039177e-02   5.37444938e+00  -2.76505801e-01  -7.05812576e-01
    2.30767480e+02]
 [ -1.33134432e-02  -1.43325638e-01  -4.68215451e-03  -2.09118054e-01
    1.80851266e-01   6.20388758e-02   1.24081969e-01  -7.47117692e-03
    3.86645655e-02  -2.76505801e-01   5.22449607e-02   9.17662439e-02
    1.70002234e+01]
 [  4.16978226e-02  -2.92447483e-01   7.61835841e-04  -6.56234368e-01
    6.69308068e-01   3.11021278e-01   5.58262255e-01  -4.44692440e-02
    2.10932940e-01  -7.05812576e-01   9.17662439e-02   5.04086409e-01
    6.99275256e+01]
 [  1.64567185e+02  -6.75488666e+01   1.93197391e+01  -4.63355345e+02
    1.76915870e+03   9.81710573e+01   1.55447492e+02  -1.22035863e+01
    5.95543338e+01   2.30767480e+02   1.70002234e+01   6.99275256e+01
    9.91667174e+04]]
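
Note that the Proline variance (about 9.9e4, the bottom-right entry) dwarfs every other term in this matrix, so PCA on the raw features would be dominated by that single column. That is why the data is standardized before PCA in the next cell.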

In [34]:
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance so PCA is not
# dominated by large-scale features like Proline.
scale = StandardScaler()
scaled_X = scale.fit_transform(X)

In [43]:
from sklearn.decomposition import PCA

# Keep all 13 components so we can inspect the full variance spectrum.
pca = PCA()

In [44]:
X_pca = pca.fit_transform(scaled_X)

In [45]:
pca.components_


Out[45]:
array([[-0.1443294 ,  0.24518758,  0.00205106,  0.23932041, -0.14199204,
        -0.39466085, -0.4229343 ,  0.2985331 , -0.31342949,  0.0886167 ,
        -0.29671456, -0.37616741, -0.28675223],
       [ 0.48365155,  0.22493093,  0.31606881, -0.0105905 ,  0.299634  ,
         0.06503951, -0.00335981,  0.02877949,  0.03930172,  0.52999567,
        -0.27923515, -0.16449619,  0.36490283],
       [-0.20738262,  0.08901289,  0.6262239 ,  0.61208035,  0.13075693,
         0.14617896,  0.1506819 ,  0.17036816,  0.14945431, -0.13730621,
         0.08522192,  0.16600459, -0.12674592],
       [ 0.0178563 , -0.53689028,  0.21417556, -0.06085941,  0.35179658,
        -0.19806835, -0.15229479,  0.20330102, -0.39905653, -0.06592568,
         0.42777141, -0.18412074,  0.23207086],
       [-0.26566365,  0.03521363, -0.14302547,  0.06610294,  0.72704851,
        -0.14931841, -0.10902584, -0.50070298,  0.13685982, -0.07643678,
        -0.17361452, -0.10116099, -0.1578688 ],
       [ 0.21353865,  0.53681385,  0.15447466, -0.10082451,  0.03814394,
        -0.0841223 , -0.01892002, -0.25859401, -0.53379539, -0.41864414,
         0.10598274,  0.26585107,  0.11972557],
       [-0.05639636,  0.42052391, -0.14917061, -0.28696914,  0.3228833 ,
        -0.02792498, -0.06068521,  0.59544729,  0.37213935, -0.22771214,
         0.23207564, -0.0447637 ,  0.0768045 ],
       [ 0.39613926,  0.06582674, -0.17026002,  0.42797018, -0.15636143,
        -0.40593409, -0.18724536, -0.23328465,  0.36822675, -0.03379692,
         0.43662362, -0.07810789,  0.12002267],
       [-0.50861912,  0.07528304,  0.30769445, -0.20044931, -0.27140257,
        -0.28603452, -0.04957849, -0.19550132,  0.20914487, -0.05621752,
        -0.08582839, -0.1372269 ,  0.57578611],
       [ 0.21160473, -0.30907994, -0.02712539,  0.05279942,  0.06787022,
        -0.32013135, -0.16315051,  0.21553507,  0.1341839 , -0.29077518,
        -0.52239889,  0.52370587,  0.162116  ],
       [ 0.22591696, -0.07648554,  0.49869142, -0.47931378, -0.07128891,
        -0.30434119,  0.02569409, -0.11689586,  0.23736257, -0.0318388 ,
         0.04821201, -0.0464233 , -0.53926983],
       [-0.26628645,  0.12169604, -0.04962237, -0.05574287,  0.06222011,
        -0.30388245, -0.04289883,  0.04235219, -0.09555303,  0.60422163,
         0.259214  ,  0.60095872, -0.07940162],
       [ 0.01496997,  0.02596375, -0.14121803,  0.09168285,  0.05677422,
        -0.46390791,  0.83225706,  0.11403985, -0.11691707, -0.0119928 ,
        -0.08988884, -0.15671813,  0.01444734]])
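
Each row of components_ is a unit-length loading vector over the 13 standardized features. In the first row, the largest weights fall on Flavanoids (-0.42), Phenols (-0.39), and OD (-0.38), so the leading component is driven mostly by the phenolic measurements.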

In [46]:
# Component means are ~0 (floating-point noise) because the input was standardized.
pca.mean_


Out[46]:
array([  7.84141790e-15,   2.44498554e-16,  -4.05917497e-15,
        -7.11041712e-17,  -2.49488320e-17,  -1.95536471e-16,
         9.44313292e-16,  -4.17892936e-16,  -1.54059038e-15,
        -4.12903170e-16,   1.39838203e-15,   2.12688793e-15,
        -6.98567296e-17])

In [47]:
# Plot the first two principal components (columns 0 and 1) to see the clusters.
_ = plt.scatter(X_pca[:,0], X_pca[:,1])



In [48]:
# How many components did you need to explain 99% of variance in this dataset?
# Careful: the plot below shows each component's individual ratio, not a running
# total. On the standardized data the first component explains only about a third
# of the variance, so a single component is nowhere near 99%; the cumulative sum
# in the sketch below gives the real answer.
plt.plot(pca.explained_variance_ratio_);
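
A short sketch that computes the answer directly from the fitted PCA; on this standardized data it comes out to roughly 12 of the 13 components:

# Cumulative explained variance; the first index reaching 0.99 is the answer.
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components_99 = np.argmax(cum_var >= 0.99) + 1
plt.plot(range(1, len(cum_var) + 1), cum_var)
plt.axhline(0.99, linestyle='--')
print(n_components_99)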



In [49]:
# With all 13 components kept, the ratios necessarily sum to 1.
sum(pca.explained_variance_ratio_)


Out[49]:
0.99999999999999978

Step 4 - KMeans and Hierarchical Clustering after PCA


In [50]:
# Rerun KMeans on the PCA scores.
kmeans = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=1)
Y_hat_kmeans = kmeans.fit(X_pca).labels_
mu = kmeans.cluster_centers_

# Plot the first two principal components with the centroids overlaid.
plt.scatter(X_pca[:,0], X_pca[:,1], c=Y_hat_kmeans, alpha=0.4)
plt.scatter(mu[:,0], mu[:,1], s=100, c=np.unique(Y_hat_kmeans))


Out[50]:
<matplotlib.collections.PathCollection at 0x122f82210>
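
As in Step 2, the clusters should be compared to the Wine column. A quick cross-tabulation (same caveat about arbitrary cluster numbering) shows how the PCA-based KMeans labels line up with the true categories:

print(pd.crosstab(wine["Wine"], Y_hat_kmeans, rownames=["Wine"], colnames=["cluster"]))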

In [51]:
# Compute the pairwise Euclidean distance matrix (on the raw features here;
# a PCA-based hierarchical clustering follows the dendrogram below).
from scipy.spatial.distance import pdist, squareform

distx = squareform(pdist(X, metric='euclidean'))
distx


Out[51]:
array([[   0.        ,   31.26501239,  122.83115403, ...,  230.24002302,
         225.21518399,  506.05936766],
       [  31.26501239,    0.        ,  135.22469301, ...,  216.22123207,
         211.21353863,  490.23526821],
       [ 122.83115403,  135.22469301,    0.        , ...,  350.57118792,
         345.56265177,  625.07017782],
       ..., 
       [ 230.24002302,  216.22123207,  350.57118792, ...,    0.        ,
           5.35888981,  276.08601522],
       [ 225.21518399,  211.21353863,  345.56265177, ...,    5.35888981,
           0.        ,  281.06899242],
       [ 506.05936766,  490.23526821,  625.07017782, ...,  276.08601522,
         281.06899242,    0.        ]])

In [52]:
# Perform single-linkage clustering and plot the dendrogram.
from scipy.cluster.hierarchy import linkage, dendrogram

# linkage expects a condensed distance matrix (or the raw observations),
# not the square matrix above, so pass pdist(...) directly.
R = dendrogram(linkage(pdist(X, metric='euclidean'), method='single'), color_threshold=10)

plt.xlabel('points')
plt.ylabel('Height')
plt.suptitle('Cluster Dendrogram', fontweight='bold', fontsize=14);
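
Finally, the assignment asks for hierarchical clustering on the PCA data and another comparison to the Wine column. A minimal sketch, using Ward linkage on the PCA scores instead of the single linkage above (single linkage tends to chain on data like this, so the swap is deliberate), cuts the tree into three flat clusters:

from scipy.cluster.hierarchy import fcluster

# Ward linkage on the PCA scores, cut into three flat clusters.
Z = linkage(X_pca, method='ward')
hier_labels = fcluster(Z, t=3, criterion='maxclust')
print(pd.crosstab(wine["Wine"], hier_labels, rownames=["Wine"], colnames=["cluster"]))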


