Homework 4:

  1. Follow the steps below to:
    • Read wine.csv in the data folder.
    • The first column contains the Wine category. Don't use it in the models below. We are going to treat this as unsupervised learning and compare the results to the Wine column.
  2. Try KMeans where n_clusters = 3 and compare the clusters to the Wine column.
  3. Try PCA and see how much you can reduce the variable space.
    • How many components do you need to explain 99% of the variance in this dataset?
    • Plot the PCA variables to see if they bring out the clusters.
  4. Try KMeans and Hierarchical Clustering on the PCA-transformed data and again compare the clusters to the Wine column.

Dataset

wine.csv is in the data folder under homeworks


In [1]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# This enables inline Plots
%matplotlib inline

# Limit rows displayed in notebook
pd.set_option('display.max_rows', 10)
pd.set_option('display.precision', 2)

In [2]:
!pwd


/home/Sam/DAT_SF_11/homeworks/hw4

In [3]:
# dataset is in homeworks/data, so we back up a level
df = pd.read_csv('../data/wine.csv')

In [4]:
# do we need to fiddle with a header row or the delimiter (unlikely since csv, but...)?
df.head(5)


Out[4]:
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
0 1 14.2 1.7 2.4 15.6 127 2.8 3.1 0.3 2.3 5.6 1.0 3.9 1065
1 1 13.2 1.8 2.1 11.2 100 2.6 2.8 0.3 1.3 4.4 1.1 3.4 1050
2 1 13.2 2.4 2.7 18.6 101 2.8 3.2 0.3 2.8 5.7 1.0 3.2 1185
3 1 14.4 1.9 2.5 16.8 113 3.9 3.5 0.2 2.2 7.8 0.9 3.5 1480
4 1 13.2 2.6 2.9 21.0 118 2.8 2.7 0.4 1.8 4.3 1.0 2.9 735

In [5]:
df.tail(5)
# and maybe we'll see how many 'real' categories of wine there are


Out[5]:
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
173 3 13.7 5.7 2.5 20.5 95 1.7 0.6 0.5 1.1 7.7 0.6 1.7 740
174 3 13.4 3.9 2.5 23.0 102 1.8 0.8 0.4 1.4 7.3 0.7 1.6 750
175 3 13.3 4.3 2.3 20.0 120 1.6 0.7 0.4 1.4 10.2 0.6 1.6 835
176 3 13.2 2.6 2.4 20.0 120 1.6 0.7 0.5 1.5 9.3 0.6 1.6 840
177 3 14.1 4.1 2.7 24.5 96 2.0 0.8 0.6 1.4 9.2 0.6 1.6 560

In [6]:
# nope! it looks like it read correctly!

df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 178 entries, 0 to 177
Data columns (total 14 columns):
Wine                    178 non-null int64
Alcohol                 178 non-null float64
Malic.acid              178 non-null float64
Ash                     178 non-null float64
Acl                     178 non-null float64
Mg                      178 non-null int64
Phenols                 178 non-null float64
Flavanoids              178 non-null float64
Nonflavanoid.phenols    178 non-null float64
Proanth                 178 non-null float64
Color.int               178 non-null float64
Hue                     178 non-null float64
OD                      178 non-null float64
Proline                 178 non-null int64
dtypes: float64(11), int64(3)

In [7]:
# ok, so we have a category (Wine), which we'll be omitting for our unsupervised learning, and 13 characteristics per observation

# because we have 178 entries and every column shows 178 non-null numeric values, we don't appear to have any missing data -- nice!
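
For completeness, a quick sanity-check sketch that counts missing values per column (it should show zeros everywhere if the info() summary above is right):

In [ ]:
# sketch: explicit missing-value count per column; all zeros confirms the
# "178 non-null" reading of df.info() above
df.isnull().sum()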

Step 2. Try KMeans where n_clusters = 3 and compare the clusters to the Wine column.


In [8]:
from sklearn.preprocessing import scale

from sklearn.cluster import KMeans
from sklearn.metrics import classification_report

from sklearn.decomposition import PCA

In [9]:
X = df.drop('Wine', axis = 1)
y = df['Wine']

In [10]:
kmn = KMeans(n_clusters=3, n_init=20, random_state=24)
kmn.fit(X)


Out[10]:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=20,
    n_jobs=1, precompute_distances=True, random_state=24, tol=0.0001,
    verbose=0)

In [11]:
y_pred = kmn.predict(X)

In [12]:
print classification_report(y, y_pred)


             precision    recall  f1-score   support

          0       0.00      0.00      0.00         0
          1       0.00      0.00      0.00        59
          2       0.32      0.28      0.30        71
          3       0.00      0.00      0.00        48

avg / total       0.13      0.11      0.12       178

C:\Anaconda\lib\site-packages\sklearn\metrics\metrics.py:1771: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
C:\Anaconda\lib\site-packages\sklearn\metrics\metrics.py:1773: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples.
  'recall', 'true', average, warn_for)

In [13]:
# oof! looks like we didn't classify any samples as Wine=3?

# Most likely explanation: we DIDN'T SCALE OUR VARIABLES, so only a couple of them are doing any work in the K-Means
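
To check that hunch before rescaling, here is a small sketch comparing the spread of each feature; Proline and Mg sit on a much larger scale than everything else, so they dominate the Euclidean distances KMeans uses:

In [ ]:
# sketch: per-feature standard deviations on the raw data -- features measured
# on big scales (Proline, Mg) swamp the others in the distance calculation
X.std()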

In [14]:
X2 = scale(X.values)

In [15]:
kmn.fit(X2)
y_pred = kmn.predict(X2)

In [16]:
print classification_report(y, y_pred)


             precision    recall  f1-score   support

          0       0.00      0.00      0.00         0
          1       0.00      0.00      0.00        59
          2       0.06      0.04      0.05        71
          3       0.00      0.00      0.00        48

avg / total       0.02      0.02      0.02       178


In [18]:
# actually, that seems to have made things worse?
print y_pred
print y.values


[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2 1 1 1 1 1 1 1 1 1 1 1 0
 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 2 1 1 0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3]

In [19]:
# NOPE! It's because our predictions take values over the set [0, 1, 2] rather than [1, 2, 3],
# and, because it's unsupervised, the algorithm assigns those labels in an arbitrary order (it could just as easily be
# [2, 0, 1] with a different random seed)

# could use a label mapping (e.g. a dict with pandas Series.map -- 'applymap' is for DataFrames) to adjust y_pred to match y's categories...
# in this case, what each label is 'supposed' to be is obvious, so it wouldn't be too hard...

# [bunch of error-filled code trying to use applymap]

# hm, let's just set this aside for now, and say "when we know what to label the categories, it doesn't look so bad"

In [20]:
# one alternative may be a confusion matrix, without the expectation that the big values will be on the diagonal
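
A minimal sketch of those two ideas -- a crosstab works as a confusion matrix without assuming the big values land on the diagonal, and a permutation-invariant score like the adjusted Rand index sidesteps the labeling problem entirely (both use the scaled-data y_pred from above):

In [ ]:
# sketch: crosstab of true Wine categories vs. arbitrary cluster labels --
# a good clustering shows one dominant cluster label per Wine row
print(pd.crosstab(y, y_pred, rownames=['Wine'], colnames=['cluster']))

# adjusted Rand index: 1.0 for a perfect match, ~0 for random labeling,
# and it doesn't care how the cluster labels are numbered
from sklearn.metrics import adjusted_rand_score
print(adjusted_rand_score(y, y_pred))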

Step 3. Use PCA to reduce the dimensionality of the data


In [21]:
# we imported PCA from scikit-learn earlier, so we just need to apply it

# here, we create a few alternatives: one with 2 components, one with 3, and a third with "enough components to
# explain 99% of the variance" (per the documentation, passing a float between 0 and 1 as n_components does this)
pca2 = PCA(n_components=2)
pca3 = PCA(n_components=3)
pca99 = PCA(n_components=0.99)

In [22]:
# we use the scaled dataset from above
pca2.fit(X2)
pca3.fit(X2)
pca99.fit(X2)


Out[22]:
PCA(copy=True, n_components=0.99, whiten=False)

In [23]:
print pca99.explained_variance_ratio_
# So it looks like the 99% of variance PCA has 12 components (on a variable space of 13 dimensions)

print 'variance explained by pca2: ', sum(pca2.explained_variance_ratio_)
print 'variance explained by pca3: ', sum(pca3.explained_variance_ratio_)
print 'variance explained by pca99: ', sum(pca99.explained_variance_ratio_)


[ 0.36198848  0.1920749   0.11123631  0.0706903   0.06563294  0.04935823
  0.04238679  0.02680749  0.02222153  0.01930019  0.01736836  0.01298233]
variance explained by pca2:  0.554063383569
variance explained by pca3:  0.665299688932
variance explained by pca99:  0.992047851101

In [24]:
# so it looks like the first PCA component is the same no matter how many components we use in our model (variance
# explained is additive)

# from the chain of values for pca99, the 'best' cut point is probably either 2, 4, or 5 components -- since the dataset
# is fairly small (and the score for 2 is only a shade over 50%), I think we probably want to go with 5

pca5 = PCA(n_components=5)
pca5.fit(X2)
Xpca = pca5.fit_transform(X2)
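
To make that cut-point choice explicit, here is a sketch of the cumulative explained variance across all 13 components (from the ratios printed above, 5 components come out to roughly 80%):

In [ ]:
# sketch: scree-style plot of cumulative explained variance for a full PCA,
# so the knee / cut point can be read off directly
pca_full = PCA(n_components=13).fit(X2)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(range(1, 14), cum_var, marker='o')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');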

In [25]:
# plotting on PCA2 space to visualize clusters


# first: from scikit documentation -- we end up fitting KMeans on PCA2 so that we can draw pretty boundaries in the space
reduced_data = PCA(n_components=2).fit_transform(X2)
kmeans = KMeans(init='k-means++', n_clusters=3)
kmeans.fit(reduced_data)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02     # point in the mesh [x_min, x_max]x[y_min, y_max].

# Plot the decision boundary. For that, we will assign a color to each point in the mesh
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
plt.title('K-means clustering on the wine dataset (PCA-reduced data)\n'
          'Centroids are marked with white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()



In [26]:
# an alternative from Lab 15 -- we can plot in PCA2 space but use the PCA5 data (Xpca) to actually do the clustering

Y_hat_kmeans = kmeans.fit(Xpca).labels_

In [27]:
plt.scatter(Xpca[:, 0], Xpca[:, 1], c=Y_hat_kmeans)


Out[27]:
<matplotlib.collections.PathCollection at 0x16a55f60>
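
For comparison, a sketch that colors the same two PCA axes by the true Wine column and cross-tabulates the KMeans-on-PCA5 labels (Y_hat_kmeans from above) against it:

In [ ]:
# sketch: true Wine categories on the same axes, plus a crosstab against the
# KMeans-on-PCA5 labels -- cluster numbering is arbitrary, as before
plt.scatter(Xpca[:, 0], Xpca[:, 1], c=y.values)
print(pd.crosstab(y, Y_hat_kmeans, rownames=['Wine'], colnames=['cluster']))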

Step 4. Conduct KMeans and Hierarchical Clustering on the PCA5 data to categorize, and score each


In [28]:
# compute distance matrix
from scipy.spatial.distance import pdist, squareform

# not printed as pretty, but the values are correct
distx = squareform(pdist(Xpca, metric='euclidean'))
distx

# perform clustering and plot the dendrogram
from scipy.cluster.hierarchy import linkage, dendrogram

# note: linkage treats a 2-D array as raw observations, so passing the square distx here
# clusters the rows of the distance matrix; the more standard call is
# linkage(pdist(Xpca), method='ward'), which gives a differently scaled dendrogram
R = dendrogram(linkage(distx, method='ward'), color_threshold=100)

plt.xlabel('points')
plt.ylabel('Height')
plt.suptitle('Cluster Dendrogram', fontweight='bold', fontsize=14);



In [29]:
# so we can see from the dendrogram that we get three clusters when we set the threshold between about 90 and 180;
# if we go finer, by 70 we're already getting another couple of clusters popping up... so even without the a priori knowledge
# of the number of types of wine in the dataset, we may have ended up with three clusters anyway


# would love to plot these as labels on a scatter, although I guess it wouldn't look too different from the other one...
# (scipy's fcluster can cut the linkage into flat labels -- sketched below)
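
A sketch of that comparison, assuming we cut the Ward linkage into 3 flat clusters (3 chosen to match the number of Wine categories); the crosstab and the scatter on the first two PCA axes then mirror the KMeans comparison above:

In [ ]:
# sketch: flat cluster labels from the hierarchical clustering, compared to Wine
from scipy.cluster.hierarchy import fcluster

Z = linkage(pdist(Xpca, metric='euclidean'), method='ward')   # condensed distances
Y_hat_hier = fcluster(Z, t=3, criterion='maxclust')           # cut into 3 clusters

# labels are arbitrary again, so read this like the crosstab earlier
print(pd.crosstab(y, Y_hat_hier, rownames=['Wine'], colnames=['hier cluster']))

plt.scatter(Xpca[:, 0], Xpca[:, 1], c=Y_hat_hier);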

In [ ]: