Homework 4:

  1. Follow the steps below to:
    • Read wine.csv in the data folder.
    • The first column contains the wine category. Don't use it in the models below. We are going to treat this as an unsupervised learning problem and compare the results to the Wine column.
  2. Try KMeans with n_clusters = 3 and compare the clusters to the Wine column.
  3. Try PCA and see how much you can reduce the variable space.
    • How many components did you need to explain 99% of the variance in this dataset?
    • Plot the PCA variables to see if they bring out the clusters.
  4. Try KMeans and hierarchical clustering using the data from PCA, and again compare the clusters to the Wine column.

Dataset

wine.csv is in data folder under homeworks


In [1]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn.cross_validation as cv

import sys
import re
import os
import pprint
import random 

# from fastcluster import *
from scipy import stats 
from sklearn.cross_validation import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, matthews_corrcoef, silhouette_score
from sklearn.grid_search import GridSearchCV
from sklearn.svm import LinearSVC, SVC
from sklearn import preprocessing
from collections import Counter
from datetime import datetime
from fuzzywuzzy import fuzz
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, normalize
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist, squareform

pd.set_option('display.max_rows', 15)
pd.set_option('display.precision', 4)
np.set_printoptions(precision = 4, suppress = True)

%matplotlib inline

print 'Python version ' + sys.version
print 'Pandas version ' + pd.__version__
print 'Numpy version ' + np.__version__


Python version 2.7.9 |Anaconda 2.1.0 (x86_64)| (default, Dec 15 2014, 10:37:34) 
[GCC 4.2.1 (Apple Inc. build 5577)]
Pandas version 0.15.2
Numpy version 1.9.1

In [2]:
wine = pd.read_csv('./wine.csv', sep = ',')

In [3]:
wine.head()


Out[3]:
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735

In [4]:
wine.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 178 entries, 0 to 177
Data columns (total 14 columns):
Wine                    178 non-null int64
Alcohol                 178 non-null float64
Malic.acid              178 non-null float64
Ash                     178 non-null float64
Acl                     178 non-null float64
Mg                      178 non-null int64
Phenols                 178 non-null float64
Flavanoids              178 non-null float64
Nonflavanoid.phenols    178 non-null float64
Proanth                 178 non-null float64
Color.int               178 non-null float64
Hue                     178 non-null float64
OD                      178 non-null float64
Proline                 178 non-null int64
dtypes: float64(11), int64(3)
memory usage: 20.9 KB

In [5]:
wine.describe() # No odd-looking values or anything suggesting NaNs. Features are on very different scales, so standardize before clustering.


Out[5]:
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
count 178.000 178.000 178.000 178.000 178.000 178.000 178.000 178.000 178.000 178.000 178.000 178.000 178.000 178.000
mean 1.938 13.001 2.336 2.367 19.495 99.742 2.295 2.029 0.362 1.591 5.058 0.957 2.612 746.893
std 0.775 0.812 1.117 0.274 3.340 14.282 0.626 0.999 0.124 0.572 2.318 0.229 0.710 314.907
min 1.000 11.030 0.740 1.360 10.600 70.000 0.980 0.340 0.130 0.410 1.280 0.480 1.270 278.000
25% 1.000 12.362 1.603 2.210 17.200 88.000 1.742 1.205 0.270 1.250 3.220 0.782 1.938 500.500
50% 2.000 13.050 1.865 2.360 19.500 98.000 2.355 2.135 0.340 1.555 4.690 0.965 2.780 673.500
75% 3.000 13.678 3.083 2.558 21.500 107.000 2.800 2.875 0.438 1.950 6.200 1.120 3.170 985.000
max 3.000 14.830 5.800 3.230 30.000 162.000 3.880 5.080 0.660 3.580 13.000 1.710 4.000 1680.000

In [6]:
# Drop column that will not be used in models. 
wine_category = wine['Wine'].values
wine.drop('Wine', axis = 1, inplace = True)

In [7]:
wine.head()


Out[7]:
Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
0 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735

In [8]:
# Correlations - several features are strongly correlated (e.g. Phenols and Flavanoids), so PCA should be able to compress the feature space.
wine.corr()


Out[8]:
Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
Alcohol 1.000 0.094 0.212 -0.310 0.271 0.289 0.237 -0.156 0.137 0.546 -0.072 0.072 0.644
Malic.acid 0.094 1.000 0.164 0.289 -0.055 -0.335 -0.411 0.293 -0.221 0.249 -0.561 -0.369 -0.192
Ash 0.212 0.164 1.000 0.443 0.287 0.129 0.115 0.186 0.010 0.259 -0.075 0.004 0.224
Acl -0.310 0.289 0.443 1.000 -0.083 -0.321 -0.351 0.362 -0.197 0.019 -0.274 -0.277 -0.441
Mg 0.271 -0.055 0.287 -0.083 1.000 0.214 0.196 -0.256 0.236 0.200 0.055 0.066 0.393
Phenols 0.289 -0.335 0.129 -0.321 0.214 1.000 0.865 -0.450 0.612 -0.055 0.434 0.700 0.498
Flavanoids 0.237 -0.411 0.115 -0.351 0.196 0.865 1.000 -0.538 0.653 -0.172 0.543 0.787 0.494
Nonflavanoid.phenols -0.156 0.293 0.186 0.362 -0.256 -0.450 -0.538 1.000 -0.366 0.139 -0.263 -0.503 -0.311
Proanth 0.137 -0.221 0.010 -0.197 0.236 0.612 0.653 -0.366 1.000 -0.025 0.296 0.519 0.330
Color.int 0.546 0.249 0.259 0.019 0.200 -0.055 -0.172 0.139 -0.025 1.000 -0.522 -0.429 0.316
Hue -0.072 -0.561 -0.075 -0.274 0.055 0.434 0.543 -0.263 0.296 -0.522 1.000 0.565 0.236
OD 0.072 -0.369 0.004 -0.277 0.066 0.700 0.787 -0.503 0.519 -0.429 0.565 1.000 0.313
Proline 0.644 -0.192 0.224 -0.441 0.393 0.498 0.494 -0.311 0.330 0.316 0.236 0.313 1.000

In [9]:
# Scatter plot suggests that there might be 2 or 3 underlying components. 
pd.tools.plotting.scatter_matrix(wine, figsize = (15, 20), alpha = 0.2, diagonal = 'kde');



In [10]:
# Shift the Wine labels from 1-3 to 0-2 so they are comparable to the 0-based cluster labels from KMeans etc.
wine_category1 = wine_category - 1

In [11]:
# Need to scale data.
scaler = StandardScaler()

In [12]:
wine_std = scaler.fit_transform(wine)
wine_std


Out[12]:
array([[ 1.5186, -0.5622,  0.2321, ...,  0.3622,  1.8479,  1.013 ],
       [ 0.2463, -0.4994, -0.828 , ...,  0.4061,  1.1134,  0.9652],
       [ 0.1969,  0.0212,  1.1093, ...,  0.3183,  0.7886,  1.3951],
       ..., 
       [ 0.3328,  1.7447, -0.3894, ..., -1.6121, -1.4854,  0.2806],
       [ 0.2092,  0.2277,  0.0127, ..., -1.5683, -1.4007,  0.2965],
       [ 1.3951,  1.5832,  1.3652, ..., -1.5244, -1.4289, -0.5952]])

(1) KMeans with n_clusters = 3


In [13]:
kmeans = KMeans(n_clusters = 3, n_jobs = -1)

In [14]:
kmeans1 = kmeans.fit(wine_std)

In [15]:
kmeans1_pred = kmeans1.labels_
print kmeans1_pred


[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 2 0 0 1 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

In [16]:
print wine_category1 # Eyeballing against the true labels - the clusters line up well, up to the arbitrary cluster numbering.


[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

In [17]:
print kmeans1.cluster_centers_


[[-0.9261 -0.394  -0.4945  0.1706 -0.4917 -0.076   0.0208 -0.0335  0.0583
  -0.9019  0.4618  0.2708 -0.7538]
 [ 0.8352 -0.3038  0.3647 -0.6102  0.5776  0.8852  0.9778 -0.5621  0.5803
   0.1711  0.474   0.7792  1.1252]
 [ 0.1649  0.8715  0.1869  0.5244 -0.0755 -0.9793 -1.2152  0.7261 -0.7797
   0.9415 -1.1648 -1.2924 -0.4071]]
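
A cross-tabulation makes the comparison to the Wine column explicit instead of eyeballing the two label arrays. A minimal sketch, reusing wine_category1, kmeans1_pred and wine_std from above and the already-imported confusion_matrix and silhouette_score (the KMeans cluster ids are an arbitrary permutation of the Wine categories, so the row pattern matters rather than the diagonal):

In [ ]:
# Cross-tabulate KMeans clusters against the true Wine categories.
print confusion_matrix(wine_category1, kmeans1_pred)

# Label-free measure of cluster quality on the standardized data.
print silhouette_score(wine_std, kmeans1_pred)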

(2) PCA on scaled wine


In [18]:
# PCA. 
pca = PCA()

In [19]:
# Plot number of components and explained variance. 
explained_variance = []
components = np.arange(1, 14, 1)
for comp in components:
    pca = PCA(n_components = comp)
    pca.fit_transform(wine_std)
    explained_variance.append(pca.explained_variance_ratio_.sum())
plt.plot(components, explained_variance)
plt.tight_layout()
# Not much compression at that threshold - basically 11 of the 13 components are needed to cover 99% of the variance, since standardizing spreads the variance across all features.
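
Refitting PCA for each component count works but is redundant; a single full fit gives the same curve from the cumulative explained variance ratios. A minimal sketch, assuming wine_std from above (pca_full and cum_var are just illustrative names):

In [ ]:
# Fit PCA once on all 13 standardized features.
pca_full = PCA().fit(wine_std)

# Cumulative share of variance explained by the first k components.
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
print cum_var

# Smallest number of components reaching 99% of the variance.
print np.argmax(cum_var >= 0.99) + 1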


(3) KMeans and hierarchical clustering on the PCA data (11 components)


In [20]:
# Use scaled data, 11 components.
pca2 = PCA(n_components = 11)

In [21]:
wine_pca = pca2.fit_transform(wine_std)

In [22]:
kmeans2 = KMeans(n_clusters = 3, n_jobs = -1, random_state = 1)

In [23]:
kmeans2_pca = kmeans2.fit(wine_pca)

In [24]:
print 'predicted labels\n', kmeans2_pca.labels_


predicted labels
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 0 2 2 2 2 2 2 2 2 2 2 2 1
 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 0 2 2 1 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

In [25]:
print 'true labels\n', wine_category1 #Looks ok as well.


true labels
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
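
As above, a cross-tabulation is more convincing than eyeballing the arrays. A small sketch reusing kmeans2_pca and wine_category1; adjusted_rand_score is an extra import from sklearn.metrics (not used elsewhere in this notebook) that is invariant to the arbitrary cluster numbering:

In [ ]:
# Cross-tabulate the PCA-based KMeans clusters against the Wine categories.
print confusion_matrix(wine_category1, kmeans2_pca.labels_)

# Permutation-invariant agreement between clusters and true categories.
from sklearn.metrics import adjusted_rand_score
print adjusted_rand_score(wine_category1, kmeans2_pca.labels_)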

In [39]:
# Plot the first two PCA components, colored by the KMeans cluster labels.
plt.scatter(wine_pca[:, 0], wine_pca[:, 1], c = kmeans2_pca.labels_, s = 50, alpha = 0.5)
plt.scatter(kmeans2_pca.cluster_centers_[:,0], kmeans2_pca.cluster_centers_[:,1], s=100, c = np.unique(kmeans2_pca.labels_))


Out[39]:
<matplotlib.collections.PathCollection at 0x11722b6d0>

In [40]:
plt.scatter(wine_pca[:, 0], wine_pca[:, 2], c = kmeans2_pca.labels_, s = 50, alpha = 0.5)
plt.scatter(kmeans2_pca.cluster_centers_[:, 0], kmeans2_pca.cluster_centers_[:,2], s=100, c=np.unique(kmeans2_pca.labels_))


Out[40]:
<matplotlib.collections.PathCollection at 0x113fe6690>

In [41]:
plt.scatter(wine_pca[:, 3], wine_pca[:, 7], c = kmeans2_pca.labels_, s = 50, alpha = 0.5) # Higher-order components show much less cluster separation.
plt.scatter(kmeans2_pca.cluster_centers_[:,3], kmeans2_pca.cluster_centers_[:,7], s=100, c=np.unique(kmeans2_pca.labels_))


Out[41]:
<matplotlib.collections.PathCollection at 0x1121f8790>

In [ ]:
# Let's try hierarchical clustering. Use scaled PCA data.

In [33]:
distance_matrix = squareform(pdist(wine_pca, metric='euclidean'))
distance_matrix


Out[33]:
array([[ 0.    ,  3.4935,  2.9767, ...,  6.4875,  6.0753,  7.1426],
       [ 3.4935,  0.    ,  4.1247, ...,  6.389 ,  6.0947,  7.3378],
       [ 2.9767,  4.1247,  0.    , ...,  6.2144,  5.8418,  6.3416],
       ..., 
       [ 6.4875,  6.389 ,  6.2144, ...,  0.    ,  1.7881,  3.2503],
       [ 6.0753,  6.0947,  5.8418, ...,  1.7881,  0.    ,  3.268 ],
       [ 7.1426,  7.3378,  6.3416, ...,  3.2503,  3.268 ,  0.    ]])

In [34]:
# Perform Ward clustering on the PCA data and plot the dendrogram.
# Note: scipy's linkage() expects the raw observations or the condensed distances
# from pdist(), not the square matrix from squareform(), so pass wine_pca directly.
R = dendrogram(linkage(wine_pca, method='ward'), labels = wine_category1)

plt.xlabel('points')
plt.ylabel('Height')
plt.suptitle('Cluster Dendrogram', fontsize=14);
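
The dendrogram only visualizes the merge structure; to actually compare hierarchical clusters to the Wine column, the already-imported AgglomerativeClustering can cut the tree at three clusters. A minimal sketch, reusing wine_pca and wine_category1 from above (hc and hc_labels are illustrative names):

In [ ]:
# Ward hierarchical clustering with 3 clusters on the PCA data.
hc = AgglomerativeClustering(n_clusters = 3, linkage = 'ward')
hc_labels = hc.fit_predict(wine_pca)

# Cross-tabulate against the true Wine categories
# (cluster ids are arbitrary, so only the row pattern matters).
print confusion_matrix(wine_category1, hc_labels)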



In [ ]: