Homework 4:

  1. Follow the steps below to:
    • Read wine.csv in the data folder.
    • The first column contains the wine category. Don't use it in the models below. We are going to treat this as an unsupervised learning problem and compare the results to the Wine column.
  2. Try KMeans with n_clusters = 3 and compare the clusters to the Wine column.
  3. Try PCA and see how much you can reduce the variable space.
    • How many components did you need to explain 99% of the variance in this dataset?
    • Plot the PCA variables to see if they bring out the clusters.
  4. Try KMeans and hierarchical clustering using the data from PCA, and again compare the clusters to the Wine column.

Dataset

wine.csv is in data folder under homeworks


In [1]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn.cross_validation as cv

import sys
import re
import os
import pprint
import random 

# from fastcluster import *
from scipy import stats 
from sklearn.cross_validation import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, matthews_corrcoef, silhouette_score
from sklearn.grid_search import GridSearchCV
from sklearn.svm import LinearSVC, SVC
from sklearn import preprocessing
from collections import Counter
from datetime import datetime
from fuzzywuzzy import fuzz
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, normalize
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist, squareform

pd.set_option('display.max_rows', 15)
pd.set_option('display.precision', 4)
np.set_printoptions(precision = 4, suppress = True)

%matplotlib inline

print 'Python version ' + sys.version
print 'Pandas version ' + pd.__version__
print 'Numpy version ' + np.__version__


Python version 2.7.9 |Anaconda 2.1.0 (x86_64)| (default, Dec 15 2014, 10:37:34) 
[GCC 4.2.1 (Apple Inc. build 5577)]
Pandas version 0.15.2
Numpy version 1.9.1

In [2]:
wine = pd.read_csv('./wine.csv', sep = ',')

In [3]:
wine.head()


Out[3]:
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735

In [4]:
wine.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 178 entries, 0 to 177
Data columns (total 14 columns):
Wine                    178 non-null int64
Alcohol                 178 non-null float64
Malic.acid              178 non-null float64
Ash                     178 non-null float64
Acl                     178 non-null float64
Mg                      178 non-null int64
Phenols                 178 non-null float64
Flavanoids              178 non-null float64
Nonflavanoid.phenols    178 non-null float64
Proanth                 178 non-null float64
Color.int               178 non-null float64
Hue                     178 non-null float64
OD                      178 non-null float64
Proline                 178 non-null int64
dtypes: float64(11), int64(3)
memory usage: 20.9 KB

In [5]:
wine.describe() # No odd-looking values or anything suggesting NaNs. Features are on very different scales, so standardize before clustering.


Out[5]:
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
count 178.000 178.000 178.000 178.000 178.000 178.000 178.000 178.000 178.000 178.000 178.000 178.000 178.000 178.000
mean 1.938 13.001 2.336 2.367 19.495 99.742 2.295 2.029 0.362 1.591 5.058 0.957 2.612 746.893
std 0.775 0.812 1.117 0.274 3.340 14.282 0.626 0.999 0.124 0.572 2.318 0.229 0.710 314.907
min 1.000 11.030 0.740 1.360 10.600 70.000 0.980 0.340 0.130 0.410 1.280 0.480 1.270 278.000
25% 1.000 12.362 1.603 2.210 17.200 88.000 1.742 1.205 0.270 1.250 3.220 0.782 1.938 500.500
50% 2.000 13.050 1.865 2.360 19.500 98.000 2.355 2.135 0.340 1.555 4.690 0.965 2.780 673.500
75% 3.000 13.678 3.083 2.558 21.500 107.000 2.800 2.875 0.438 1.950 6.200 1.120 3.170 985.000
max 3.000 14.830 5.800 3.230 30.000 162.000 3.880 5.080 0.660 3.580 13.000 1.710 4.000 1680.000

In [6]:
# Drop column that will not be used in models. 
wine_category = wine['Wine'].values
wine.drop('Wine', axis = 1, inplace = True)

In [7]:
wine.head()


Out[7]:
Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
0 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735

In [8]:
# Correlations - several features are strongly correlated (e.g. Phenols and Flavanoids), so PCA should be able to compress the feature space.
wine.corr()


Out[8]:
Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
Alcohol 1.000 0.094 0.212 -0.310 0.271 0.289 0.237 -0.156 0.137 0.546 -0.072 0.072 0.644
Malic.acid 0.094 1.000 0.164 0.289 -0.055 -0.335 -0.411 0.293 -0.221 0.249 -0.561 -0.369 -0.192
Ash 0.212 0.164 1.000 0.443 0.287 0.129 0.115 0.186 0.010 0.259 -0.075 0.004 0.224
Acl -0.310 0.289 0.443 1.000 -0.083 -0.321 -0.351 0.362 -0.197 0.019 -0.274 -0.277 -0.441
Mg 0.271 -0.055 0.287 -0.083 1.000 0.214 0.196 -0.256 0.236 0.200 0.055 0.066 0.393
Phenols 0.289 -0.335 0.129 -0.321 0.214 1.000 0.865 -0.450 0.612 -0.055 0.434 0.700 0.498
Flavanoids 0.237 -0.411 0.115 -0.351 0.196 0.865 1.000 -0.538 0.653 -0.172 0.543 0.787 0.494
Nonflavanoid.phenols -0.156 0.293 0.186 0.362 -0.256 -0.450 -0.538 1.000 -0.366 0.139 -0.263 -0.503 -0.311
Proanth 0.137 -0.221 0.010 -0.197 0.236 0.612 0.653 -0.366 1.000 -0.025 0.296 0.519 0.330
Color.int 0.546 0.249 0.259 0.019 0.200 -0.055 -0.172 0.139 -0.025 1.000 -0.522 -0.429 0.316
Hue -0.072 -0.561 -0.075 -0.274 0.055 0.434 0.543 -0.263 0.296 -0.522 1.000 0.565 0.236
OD 0.072 -0.369 0.004 -0.277 0.066 0.700 0.787 -0.503 0.519 -0.429 0.565 1.000 0.313
Proline 0.644 -0.192 0.224 -0.441 0.393 0.498 0.494 -0.311 0.330 0.316 0.236 0.313 1.000

In [9]:
# Scatter plot suggests that there might be 2 or 3 underlying components. 
pd.tools.plotting.scatter_matrix(wine, figsize = (15, 20), alpha = 0.2, diagonal = 'kde');



In [10]:
# Shift the Wine labels from 1-3 to 0-2 so they are comparable to the 0-based cluster labels from KMeans etc.
wine_category1 = wine_category - 1

In [11]:
# Need to scale data.
scaler = StandardScaler()

In [12]:
wine_std = scaler.fit_transform(wine)
wine_std


Out[12]:
array([[ 1.5186, -0.5622,  0.2321, ...,  0.3622,  1.8479,  1.013 ],
       [ 0.2463, -0.4994, -0.828 , ...,  0.4061,  1.1134,  0.9652],
       [ 0.1969,  0.0212,  1.1093, ...,  0.3183,  0.7886,  1.3951],
       ..., 
       [ 0.3328,  1.7447, -0.3894, ..., -1.6121, -1.4854,  0.2806],
       [ 0.2092,  0.2277,  0.0127, ..., -1.5683, -1.4007,  0.2965],
       [ 1.3951,  1.5832,  1.3652, ..., -1.5244, -1.4289, -0.5952]])

(1) KMeans with n_clusters = 3


In [13]:
kmeans = KMeans(n_clusters = 3, n_jobs = -1)

In [14]:
kmeans1 = kmeans.fit(wine_std)

In [15]:
kmeans1_pred = kmeans1.labels_
print kmeans1_pred


[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 2 0 0 1 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

In [16]:
print wine_category1 # Eyeballing against the true labels - the clusters line up well, up to the arbitrary cluster numbering.


[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

In [17]:
print kmeans1.cluster_centers_


[[-0.9261 -0.394  -0.4945  0.1706 -0.4917 -0.076   0.0208 -0.0335  0.0583
  -0.9019  0.4618  0.2708 -0.7538]
 [ 0.8352 -0.3038  0.3647 -0.6102  0.5776  0.8852  0.9778 -0.5621  0.5803
   0.1711  0.474   0.7792  1.1252]
 [ 0.1649  0.8715  0.1869  0.5244 -0.0755 -0.9793 -1.2152  0.7261 -0.7797
   0.9415 -1.1648 -1.2924 -0.4071]]
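
A cross-tabulation makes the comparison to the Wine column explicit instead of eyeballing the two label arrays. A minimal sketch, reusing wine_category1, kmeans1_pred and wine_std from above and the already-imported confusion_matrix and silhouette_score (the KMeans cluster ids are an arbitrary permutation of the Wine categories, so the row pattern matters rather than the diagonal):

In [ ]:
# Cross-tabulate KMeans clusters against the true Wine categories.
print confusion_matrix(wine_category1, kmeans1_pred)

# Label-free measure of cluster quality on the standardized data.
print silhouette_score(wine_std, kmeans1_pred)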

(2) PCA on scaled wine


In [18]:
# PCA. 
pca = PCA()

In [19]:
# Plot number of components and explained variance. 
explained_variance = []
components = np.arange(1, 14, 1)
for comp in components:
    pca = PCA(n_components = comp)
    pca.fit_transform(wine_std)
    explained_variance.append(pca.explained_variance_ratio_.sum())
plt.plot(components, explained_variance)
plt.tight_layout()
# Not much compression at that threshold - basically 11 of the 13 components are needed to cover 99% of the variance, since standardizing spreads the variance across all features.
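
Refitting PCA for each component count works but is redundant; a single full fit gives the same curve from the cumulative explained variance ratios. A minimal sketch, assuming wine_std from above (pca_full and cum_var are just illustrative names):

In [ ]:
# Fit PCA once on all 13 standardized features.
pca_full = PCA().fit(wine_std)

# Cumulative share of variance explained by the first k components.
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
print cum_var

# Smallest number of components reaching 99% of the variance.
print np.argmax(cum_var >= 0.99) + 1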


(3) KMeans and hierarchical clustering on the PCA data (11 components)


In [20]:
# Use scaled data, 11 components.
pca2 = PCA(n_components = 11)

In [21]:
wine_pca = pca2.fit_transform(wine_std)

In [22]:
kmeans2 = KMeans(n_clusters = 3, n_jobs = -1, random_state = 1)

In [23]:
kmeans2_pca = kmeans2.fit(wine_pca)

In [24]:
print 'predicted labels\n', kmeans2_pca.labels_


predicted labels
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 0 2 2 2 2 2 2 2 2 2 2 2 1
 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 0 2 2 1 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

In [25]:
print 'true labels\n', wine_category1 #Looks ok as well.


true labels
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
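
As above, a cross-tabulation is more convincing than eyeballing the arrays. A small sketch reusing kmeans2_pca and wine_category1; adjusted_rand_score is an extra import from sklearn.metrics (not used elsewhere in this notebook) that is invariant to the arbitrary cluster numbering:

In [ ]:
# Cross-tabulate the PCA-based KMeans clusters against the Wine categories.
print confusion_matrix(wine_category1, kmeans2_pca.labels_)

# Permutation-invariant agreement between clusters and true categories.
from sklearn.metrics import adjusted_rand_score
print adjusted_rand_score(wine_category1, kmeans2_pca.labels_)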

In [39]:
# Plot the first two PCA components, colored by the KMeans cluster labels.
plt.scatter(wine_pca[:, 0], wine_pca[:, 1], c = kmeans2_pca.labels_, s = 50, alpha = 0.5)
plt.scatter(kmeans2_pca.cluster_centers_[:,0], kmeans2_pca.cluster_centers_[:,1], s=100, c = np.unique(kmeans2_pca.labels_))


Out[39]:
<matplotlib.collections.PathCollection at 0x11722b6d0>

In [40]:
plt.scatter(wine_pca[:, 0], wine_pca[:, 2], c = kmeans2_pca.labels_, s = 50, alpha = 0.5)
plt.scatter(kmeans2_pca.cluster_centers_[:, 0], kmeans2_pca.cluster_centers_[:,2], s=100, c=np.unique(kmeans2_pca.labels_))


Out[40]:
<matplotlib.collections.PathCollection at 0x113fe6690>

In [41]:
plt.scatter(wine_pca[:, 3], wine_pca[:, 7], c = kmeans2_pca.labels_, s = 50, alpha = 0.5) # Higher-order components show much less cluster separation.
plt.scatter(kmeans2_pca.cluster_centers_[:,3], kmeans2_pca.cluster_centers_[:,7], s=100, c=np.unique(kmeans2_pca.labels_))


Out[41]:
<matplotlib.collections.PathCollection at 0x1121f8790>

In [ ]:
# Let's try hierarchical clustering. Use scaled PCA data.

In [33]:
distance_matrix = squareform(pdist(wine_pca, metric='euclidean'))
distance_matrix


Out[33]:
array([[ 0.    ,  3.4935,  2.9767, ...,  6.4875,  6.0753,  7.1426],
       [ 3.4935,  0.    ,  4.1247, ...,  6.389 ,  6.0947,  7.3378],
       [ 2.9767,  4.1247,  0.    , ...,  6.2144,  5.8418,  6.3416],
       ..., 
       [ 6.4875,  6.389 ,  6.2144, ...,  0.    ,  1.7881,  3.2503],
       [ 6.0753,  6.0947,  5.8418, ...,  1.7881,  0.    ,  3.268 ],
       [ 7.1426,  7.3378,  6.3416, ...,  3.2503,  3.268 ,  0.    ]])

In [34]:
# Perform Ward clustering on the PCA data and plot the dendrogram.
# Note: scipy's linkage() expects the raw observations or the condensed distances
# from pdist(), not the square matrix from squareform(), so pass wine_pca directly.
R = dendrogram(linkage(wine_pca, method='ward'), labels = wine_category1)

plt.xlabel('points')
plt.ylabel('Height')
plt.suptitle('Cluster Dendrogram', fontsize=14);
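
The dendrogram only visualizes the merge structure; to actually compare hierarchical clusters to the Wine column, the already-imported AgglomerativeClustering can cut the tree at three clusters. A minimal sketch, reusing wine_pca and wine_category1 from above (hc and hc_labels are illustrative names):

In [ ]:
# Ward hierarchical clustering with 3 clusters on the PCA data.
hc = AgglomerativeClustering(n_clusters = 3, linkage = 'ward')
hc_labels = hc.fit_predict(wine_pca)

# Cross-tabulate against the true Wine categories
# (cluster ids are arbitrary, so only the row pattern matters).
print confusion_matrix(wine_category1, hc_labels)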



In [ ]: