Homework 4:

  1. Follow the steps below to:
    • Read wine.csv in the data folder.
    • The first column contains the wine category. Don't use it in the models below. We are going to treat this as an unsupervised learning problem and compare the results to the Wine column.
  2. Try KMeans where n_clusters = 3 and compare the clusters to the Wine column.
  3. Try PCA and see how much you can reduce the variable space.
    • How many components do you need to explain 99% of the variance in this dataset?
    • Plot the PCA components to see if they bring out the clusters.
  4. Try KMeans and Hierarchical Clustering using the data from PCA and compare the clusters again to the Wine column.

Dataset

wine.csv is in the data folder under homeworks

  1. Follow the steps below to:
    • Read wine.csv in the data folder.
    • The first column contains the wine category. Don't use it in the models below. We are going to treat this as an unsupervised learning problem and compare the results to the Wine column.

In [70]:
# Standard imports for data analysis packages in Python

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import preprocessing
import matplotlib.pyplot as plt

# This enables inline plots

%matplotlib inline

# Limit the number of rows displayed for a dataframe by including this line with your imports.

pd.set_option('display.max_rows', 10)

In [71]:
# Create a data frame from the wine dataset

wine = pd.read_csv('../data/wine.csv', delimiter=",", header=0)

In [72]:
# Check the first few rows of the file

print wine.head()


   Wine  Alcohol  Malic.acid   Ash   Acl   Mg  Phenols  Flavanoids  \
0     1    14.23        1.71  2.43  15.6  127     2.80        3.06   
1     1    13.20        1.78  2.14  11.2  100     2.65        2.76   
2     1    13.16        2.36  2.67  18.6  101     2.80        3.24   
3     1    14.37        1.95  2.50  16.8  113     3.85        3.49   
4     1    13.24        2.59  2.87  21.0  118     2.80        2.69   

   Nonflavanoid.phenols  Proanth  Color.int   Hue    OD  Proline  
0                  0.28     2.29       5.64  1.04  3.92     1065  
1                  0.26     1.28       4.38  1.05  3.40     1050  
2                  0.30     2.81       5.68  1.03  3.17     1185  
3                  0.24     2.18       7.80  0.86  3.45     1480  
4                  0.39     1.82       4.32  1.04  2.93      735  

In [6]:
# Check for missing values

print wine.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 178 entries, 0 to 177
Data columns (total 14 columns):
Wine                    178 non-null int64
Alcohol                 178 non-null float64
Malic.acid              178 non-null float64
Ash                     178 non-null float64
Acl                     178 non-null float64
Mg                      178 non-null int64
Phenols                 178 non-null float64
Flavanoids              178 non-null float64
Nonflavanoid.phenols    178 non-null float64
Proanth                 178 non-null float64
Color.int               178 non-null float64
Hue                     178 non-null float64
OD                      178 non-null float64
Proline                 178 non-null int64
dtypes: float64(11), int64(3)
memory usage: 20.9 KB
None

In [7]:
# Check the class counts in the target variable

print wine.Wine.value_counts()


2    71
1    59
3    48
dtype: int64

  2. Try KMeans where n_clusters = 3 and compare the clusters to the Wine column.

In [86]:
# Import the KMeans package

from sklearn.cluster import KMeans

# Separate the feature matrix X from the target y

y = wine.Wine
X = wine.drop('Wine', axis=1)

# Standardize the features to zero mean and unit variance

X_scaled = preprocessing.scale(X)

In [76]:
# Create a bivariate scatterplot matrix

scatter_matrix = pd.scatter_matrix(X, diagonal="kde")



In [77]:
# Run KMeans with 3 clusters on the scaled X data

kmeans = KMeans(n_clusters=3, init='random', n_init=10 , max_iter = 300, random_state=1)
Y_hat_kmeans = kmeans.fit(X_scaled).labels_

In [78]:
# View the predictions made by the KMeans model

print Y_hat_kmeans


[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 2 0 0 1 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

In [79]:
# Import the metrics package from scikit-learn

from sklearn import metrics
from sklearn.metrics import confusion_matrix

# Print the confusion matrix

cm = confusion_matrix(y, Y_hat_kmeans)
print(cm)

plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()


[[ 0  0  0  0]
 [ 0 59  0  0]
 [65  3  3  0]
 [ 0  0 48  0]]

In [ ]:
# The confusion matrix reveals that 59 + 65 + 48 = 172 of the wines fall in a cluster that lines up with their Wine label
# That means that 178 - 172 = 6 of the wines were grouped inconsistently with the Wine column
# Note: the cluster numbers (0, 1, 2) do not match the actual Wine labels (1, 2, 3) because we were just grouping on the X values;
# a label-invariant score is sketched below
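
Because the cluster numbers that KMeans assigns are arbitrary, a label-invariant measure such as the adjusted Rand index is a convenient way to score the agreement with the Wine column without remapping labels by hand. A minimal sketch, reusing the y and Y_hat_kmeans variables defined above:

In [ ]:
# Score the agreement between the KMeans clusters and the Wine column with the
# adjusted Rand index (1.0 = identical grouping, values near 0 = random grouping);
# this ignores the arbitrary numbering of the cluster labels

from sklearn.metrics import adjusted_rand_score

print(adjusted_rand_score(y, Y_hat_kmeans))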

  3. Try PCA and see how much you can reduce the variable space.
    • How many components do you need to explain 99% of the variance in this dataset?
    • Plot the PCA components to see if they bring out the clusters.

In [87]:
# Import the PCA estimator

from sklearn.decomposition import PCA

# Fit PCA on the scaled X data and transform it

pca = PCA()
X_pca = pca.fit_transform(X_scaled)
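
As a side note, scikit-learn's PCA can also pick the number of components for you when n_components is given as a fraction of the variance to keep; a small sketch, assuming the installed scikit-learn version accepts a fractional n_components:

In [ ]:
# Alternative: ask PCA for however many components are needed to keep 99% of the variance
# (assumes a scikit-learn version that accepts a fractional n_components)

pca99 = PCA(n_components=0.99)
X_pca99 = pca99.fit_transform(X_scaled)
print(X_pca99.shape)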

In [88]:
# View the transformed dataset (the leading components are much larger in magnitude than the rest)

print X_pca[:10,:]


[[ -3.31675081e+00   1.44346263e+00  -1.65739045e-01   2.15631188e-01
    6.93042841e-01   2.23880128e-01   5.96426546e-01  -6.51390947e-02
   -6.41442706e-01   1.02095585e+00   4.51563395e-01   5.40810414e-01
   -6.62386309e-02]
 [ -2.20946492e+00  -3.33392887e-01  -2.02645737e+00   2.91358318e-01
   -2.57654635e-01   9.27120244e-01   5.37756128e-02  -1.02441595e+00
    3.08846753e-01   1.59701372e-01   1.42657306e-01   3.88237741e-01
    3.63650247e-03]
 [ -2.51674015e+00   1.03115130e+00   9.82818670e-01  -7.24902309e-01
   -2.51033118e-01  -5.49276047e-01   4.24205451e-01   3.44216131e-01
    1.17783447e+00   1.13360857e-01   2.86672847e-01   5.83573183e-04
    2.17165104e-02]
 [ -3.75706561e+00   2.75637191e+00  -1.76191842e-01  -5.67983308e-01
   -3.11841591e-01  -1.14431000e-01  -3.83337297e-01  -6.43593498e-01
   -5.25444215e-02   2.39412605e-01  -7.59584312e-01  -2.42019563e-01
   -3.69483531e-01]
 [ -1.00890849e+00   8.69830821e-01   2.02668822e+00   4.09765788e-01
    2.98457503e-01   4.06519601e-01   4.44074463e-01  -4.16700470e-01
   -3.26819165e-01  -7.83664820e-02   5.25945083e-01  -2.16664158e-01
   -7.93635655e-02]
 [ -3.05025392e+00   2.12240111e+00  -6.29395827e-01   5.15637495e-01
   -6.32018734e-01  -1.23430557e-01   4.01653758e-01  -3.94893421e-01
    1.52146076e-01  -1.01995816e-01  -4.05585316e-01  -3.79432684e-01
    1.45155331e-01]
 [ -2.44908967e+00   1.17485013e+00  -9.77094891e-01   6.58305046e-02
   -1.02776191e+00   6.20120743e-01   5.28907285e-02   3.71933862e-01
    4.57015855e-01   1.01656346e+00   4.42433411e-01   1.41229844e-01
   -2.71778184e-01]
 [ -2.05943687e+00   1.60896307e+00   1.46281883e-01   1.19260801e+00
    7.69034938e-02   1.43980622e+00   3.23755923e-02  -2.32978954e-01
   -1.23370316e-01   7.35600047e-01  -2.93554859e-01   3.79663026e-01
   -1.10163787e-01]
 [ -2.51087430e+00   9.18070957e-01  -1.77096903e+00  -5.62703612e-02
   -8.92256977e-01   1.29181048e-01   1.25285071e-01   4.99577904e-01
   -6.06589198e-01   1.74106613e-01   5.08932893e-01  -6.35249336e-01
    1.42083536e-01]
 [ -2.75362819e+00   7.89437674e-01  -9.84247490e-01  -3.49381568e-01
   -4.68553076e-01  -1.63391650e-01  -8.74352245e-01  -1.50579503e-01
   -2.30489152e-01   1.79420103e-01  -1.24781710e-02   5.50326823e-01
   -4.24548533e-02]]

In [89]:
# Print and plot the explained variance ratio of each principal component

print pca.explained_variance_ratio_
plt.plot(pca.explained_variance_ratio_);


[ 0.36198848  0.1920749   0.11123631  0.0706903   0.06563294  0.04935823
  0.04238679  0.02680749  0.02222153  0.01930019  0.01736836  0.01298233
  0.00795215]

In [ ]:
# To explain 99% of the variance we need 12 of the 13 components (the cumulative ratio first passes 0.99 at component 12),
# so PCA offers very little compression at that threshold
# However, the top three components alone account for roughly 36% + 19% + 11% = 66% of the variance
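
The component count for any variance threshold can be read off the cumulative explained variance; a short sketch using the fitted pca object from above:

In [ ]:
# Cumulative explained variance, and the first component count that reaches 99%

cum_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_var)
print(np.argmax(cum_var >= 0.99) + 1)

plt.plot(range(1, len(cum_var) + 1), cum_var);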

In [90]:
# Plot the first two principal components, color-coded by the KMeans cluster labels found above

plt.scatter(X_pca[:,0], X_pca[:,1], c=Y_hat_kmeans);
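
To check whether PCA by itself brings out the known classes, the same two components can also be colored by the actual Wine column rather than the cluster labels; a quick sketch:

In [ ]:
# Same two principal components, color-coded by the true Wine category

plt.scatter(X_pca[:,0], X_pca[:,1], c=y);
plt.xlabel('PC1')
plt.ylabel('PC2');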


  4. Try KMeans and Hierarchical Clustering using the data from PCA and compare the clusters again to the Wine column.

In [91]:
# Run KMeans with 3 clusters on the PCA-transformed data

kmeans = KMeans(n_clusters=3, init='random', n_init=10 , max_iter = 300, random_state=1)
Y_hat_kmeans_pca = kmeans.fit(X_pca).labels_

In [92]:
# View the predictions made by the KMeans model

print Y_hat_kmeans_pca


[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 2 0 0 1 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

In [93]:
# Import the metrics package from scikit-learn

from sklearn import metrics
from sklearn.metrics import confusion_matrix

# Print the confusion matrix

cm = confusion_matrix(y, Y_hat_kmeans_pca)
print(cm)

plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()


[[ 0  0  0  0]
 [ 0 59  0  0]
 [65  3  3  0]
 [ 0  0 48  0]]

In [ ]:
# The confusion matrix is identical: with all 13 components kept, the PCA transform is just a rotation of the scaled data,
# so KMeans finds the same clusters and is exactly as accurate as on the original features
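
A quick way to confirm the rotation argument is to check that pairwise Euclidean distances are unchanged by the full 13-component PCA transform; a minimal sketch:

In [ ]:
# With all components kept (and the data already centered by scaling), PCA is an
# orthogonal rotation, so pairwise Euclidean distances should be preserved

from scipy.spatial.distance import pdist

print(np.allclose(pdist(X_scaled), pdist(X_pca)))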

In [94]:
# Compute the pairwise Euclidean distance matrix between observations
# (shown for inspection; the linkage call below works directly on X_pca)

from scipy.spatial.distance import pdist, squareform

distx = squareform(pdist(X_pca, metric='euclidean'))
distx


Out[94]:
array([[ 0.        ,  3.49753522,  3.02660794, ...,  6.4909413 ,
         6.07878091,  7.18442107],
       [ 3.49753522,  0.        ,  4.1429119 , ...,  6.39689969,
         6.09492714,  7.36771922],
       [ 3.02660794,  4.1429119 ,  0.        , ...,  6.25367723,
         5.85179331,  6.35388503],
       ..., 
       [ 6.4909413 ,  6.39689969,  6.25367723, ...,  0.        ,
         1.82621785,  3.39251526],
       [ 6.07878091,  6.09492714,  5.85179331, ...,  1.82621785,
         0.        ,  3.32427633],
       [ 7.18442107,  7.36771922,  6.35388503, ...,  3.39251526,
         3.32427633,  0.        ]])

In [95]:
# Use scipy.cluster.hierarchy.linkage to create the hierarchy and the dendrogram to plot it.

from scipy.cluster.hierarchy import linkage, dendrogram

R = dendrogram(linkage(X_pca, method='single'), color_threshold=10)

plt.xlabel('points')
plt.ylabel('Height')
plt.suptitle('Cluster Dendrogram', fontweight='bold', fontsize=14);
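
The dendrogram alone does not produce flat cluster labels to compare against the Wine column. One way to finish the comparison is to cut a linkage tree into three clusters with scipy's fcluster and build the same confusion matrix; the sketch below uses Ward linkage only as an example (the single-linkage tree plotted above can be cut the same way):

In [ ]:
# Cut a hierarchical clustering of the PCA data into three flat clusters
# and compare them to the Wine column (Ward linkage is an assumption here;
# the single-linkage tree plotted above can be cut in the same way)

from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import confusion_matrix

Z = linkage(X_pca, method='ward')
Y_hat_hier = fcluster(Z, t=3, criterion='maxclust')

print(confusion_matrix(y, Y_hat_hier))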


