Homework 4:

  1. Follow the steps below to:
    • Read wine.csv in the data folder.
    • The first column contains the Wine category. Don't use it in the models below. We are going to treat this as unsupervised learning and compare the results to the Wine column.
  2. Try KMeans where n_clusters = 3 and compare the clusters to the Wine column.
  3. Try PCA and see how much you can reduce the variable space.
    • How many components do you need to explain 99% of the variance in this dataset?
    • Plot the PCA variables to see if they bring out the clusters.
  4. Try KMeans and Hierarchical Clustering on the PCA-transformed data and again compare the clusters to the Wine column.

Dataset

wine.csv is in the data folder under homeworks


In [1]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# This enables inline Plots
%matplotlib inline

# Limit rows displayed in notebook
pd.set_option('display.max_rows', 10)
pd.set_option('display.precision', 2)

In [2]:
!pwd


/home/Sam/DAT_SF_11/homeworks/hw4

In [3]:
# dataset is in homeworks/data, so we back up a level
df = pd.read_csv('../data/wine.csv')

In [4]:
# do we need to fiddle with a header row or the delimiter (unlikely since csv, but...)?
df.head(5)


Out[4]:
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
0 1 14.2 1.7 2.4 15.6 127 2.8 3.1 0.3 2.3 5.6 1.0 3.9 1065
1 1 13.2 1.8 2.1 11.2 100 2.6 2.8 0.3 1.3 4.4 1.1 3.4 1050
2 1 13.2 2.4 2.7 18.6 101 2.8 3.2 0.3 2.8 5.7 1.0 3.2 1185
3 1 14.4 1.9 2.5 16.8 113 3.9 3.5 0.2 2.2 7.8 0.9 3.5 1480
4 1 13.2 2.6 2.9 21.0 118 2.8 2.7 0.4 1.8 4.3 1.0 2.9 735

In [5]:
df.tail(5)
# and maybe we'll see how many 'real' categories of wine there are


Out[5]:
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
173 3 13.7 5.7 2.5 20.5 95 1.7 0.6 0.5 1.1 7.7 0.6 1.7 740
174 3 13.4 3.9 2.5 23.0 102 1.8 0.8 0.4 1.4 7.3 0.7 1.6 750
175 3 13.3 4.3 2.3 20.0 120 1.6 0.7 0.4 1.4 10.2 0.6 1.6 835
176 3 13.2 2.6 2.4 20.0 120 1.6 0.7 0.5 1.5 9.3 0.6 1.6 840
177 3 14.1 4.1 2.7 24.5 96 2.0 0.8 0.6 1.4 9.2 0.6 1.6 560

In [6]:
# nope! it looks like it read correctly!

df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 178 entries, 0 to 177
Data columns (total 14 columns):
Wine                    178 non-null int64
Alcohol                 178 non-null float64
Malic.acid              178 non-null float64
Ash                     178 non-null float64
Acl                     178 non-null float64
Mg                      178 non-null int64
Phenols                 178 non-null float64
Flavanoids              178 non-null float64
Nonflavanoid.phenols    178 non-null float64
Proanth                 178 non-null float64
Color.int               178 non-null float64
Hue                     178 non-null float64
OD                      178 non-null float64
Proline                 178 non-null int64
dtypes: float64(11), int64(3)

In [7]:
# ok, so we have a category (Wine), which we'll be omitting for our unsupervised learning, and 13 characteristics per observation

# because we have 178 entries and every column shows 178 non-null numeric values, we don't appear to have any missing data -- nice!
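
For completeness, a quick sanity-check sketch that counts missing values per column (it should show zeros everywhere if the info() summary above is right):

In [ ]:
# sketch: explicit missing-value count per column; all zeros confirms the
# "178 non-null" reading of df.info() above
df.isnull().sum()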

Step 2. Try KMeans where n_clusters = 3 and compare the clusters to the Wine column.


In [8]:
from sklearn.preprocessing import scale

from sklearn.cluster import KMeans
from sklearn.metrics import classification_report

from sklearn.decomposition import PCA

In [9]:
X = df.drop('Wine', axis = 1)
y = df['Wine']

In [10]:
kmn = KMeans(n_clusters=3, n_init=20, random_state=24)
kmn.fit(X)


Out[10]:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=20,
    n_jobs=1, precompute_distances=True, random_state=24, tol=0.0001,
    verbose=0)

In [11]:
y_pred = kmn.predict(X)

In [12]:
print classification_report(y, y_pred)


             precision    recall  f1-score   support

          0       0.00      0.00      0.00         0
          1       0.00      0.00      0.00        59
          2       0.32      0.28      0.30        71
          3       0.00      0.00      0.00        48

avg / total       0.13      0.11      0.12       178

C:\Anaconda\lib\site-packages\sklearn\metrics\metrics.py:1771: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
C:\Anaconda\lib\site-packages\sklearn\metrics\metrics.py:1773: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples.
  'recall', 'true', average, warn_for)

In [13]:
# oof! looks like we didn't classify any samples as Wine=3?

# Most likely explanation: we DIDN'T SCALE OUR VARIABLES, so only a couple of them are doing any work in the K-Means
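
To check that hunch before rescaling, here is a small sketch comparing the spread of each feature; Proline and Mg sit on a much larger scale than everything else, so they dominate the Euclidean distances KMeans uses:

In [ ]:
# sketch: per-feature standard deviations on the raw data -- features measured
# on big scales (Proline, Mg) swamp the others in the distance calculation
X.std()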

In [14]:
X2 = scale(X.values)

In [15]:
kmn.fit(X2)
y_pred = kmn.predict(X2)

In [16]:
print classification_report(y, y_pred)


             precision    recall  f1-score   support

          0       0.00      0.00      0.00         0
          1       0.00      0.00      0.00        59
          2       0.06      0.04      0.05        71
          3       0.00      0.00      0.00        48

avg / total       0.02      0.02      0.02       178


In [18]:
# actually, that seems to have made things worse?
print y_pred
print y.values


[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2 1 1 1 1 1 1 1 1 1 1 1 0
 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 2 1 1 0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3]

In [19]:
# NOPE! It's because our predictions take values over the set [0, 1, 2] rather than [1, 2, 3],
# and, because it's unsupervised, the algorithm assigns those labels in an arbitrary order (it could just as easily be
# [2, 0, 1] with a different random seed)

# could use a label mapping (e.g. a dict with pandas Series.map -- 'applymap' is for DataFrames) to adjust y_pred to match y's categories...
# in this case, what each label is 'supposed' to be is obvious, so it wouldn't be too hard...

# [bunch of error-filled code trying to use applymap]

# hm, let's just set this aside for now, and say "when we know what to label the categories, it doesn't look so bad"

In [20]:
# one alternative may be a confusion matrix, without the expectation that the big values will be on the diagonal
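
A minimal sketch of those two ideas -- a crosstab works as a confusion matrix without assuming the big values land on the diagonal, and a permutation-invariant score like the adjusted Rand index sidesteps the labeling problem entirely (both use the scaled-data y_pred from above):

In [ ]:
# sketch: crosstab of true Wine categories vs. arbitrary cluster labels --
# a good clustering shows one dominant cluster label per Wine row
print(pd.crosstab(y, y_pred, rownames=['Wine'], colnames=['cluster']))

# adjusted Rand index: 1.0 for a perfect match, ~0 for random labeling,
# and it doesn't care how the cluster labels are numbered
from sklearn.metrics import adjusted_rand_score
print(adjusted_rand_score(y, y_pred))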

Step 3. Use PCA to reduce the dimensionality of the data


In [21]:
# we imported PCA from scikit-learn earlier, so we just need to apply it

# here, we create a few alternatives: one with 2 components, one with 3, and a third with "enough components to
# explain 99% of the variance" (per the documentation, passing a float between 0 and 1 as n_components does this)
pca2 = PCA(n_components=2)
pca3 = PCA(n_components=3)
pca99 = PCA(n_components=0.99)

In [22]:
# we use the scaled dataset from above
pca2.fit(X2)
pca3.fit(X2)
pca99.fit(X2)


Out[22]:
PCA(copy=True, n_components=0.99, whiten=False)

In [23]:
print pca99.explained_variance_ratio_
# So it looks like the 99% of variance PCA has 12 components (on a variable space of 13 dimensions)

print 'variance explained by pca2: ', sum(pca2.explained_variance_ratio_)
print 'variance explained by pca3: ', sum(pca3.explained_variance_ratio_)
print 'variance explained by pca99: ', sum(pca99.explained_variance_ratio_)


[ 0.36198848  0.1920749   0.11123631  0.0706903   0.06563294  0.04935823
  0.04238679  0.02680749  0.02222153  0.01930019  0.01736836  0.01298233]
variance explained by pca2:  0.554063383569
variance explained by pca3:  0.665299688932
variance explained by pca99:  0.992047851101

In [24]:
# so it looks like the first PCA component is the same no matter how many components we use in our model (variance
# explained is additive)

# from the chain of values for pca99, the 'best' cut point is probably either 2, 4, or 5 components -- since the dataset
# is fairly small (and the score for 2 is only a shade over 50%), I think we probably want to go with 5

pca5 = PCA(n_components=5)
pca5.fit(X2)
Xpca = pca5.fit_transform(X2)
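
To make that cut-point choice explicit, here is a sketch of the cumulative explained variance across all 13 components (from the ratios printed above, 5 components come out to roughly 80%):

In [ ]:
# sketch: scree-style plot of cumulative explained variance for a full PCA,
# so the knee / cut point can be read off directly
pca_full = PCA(n_components=13).fit(X2)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(range(1, 14), cum_var, marker='o')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');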

In [25]:
# plotting on PCA2 space to visualize clusters


# first: from scikit documentation -- we end up fitting KMeans on PCA2 so that we can draw pretty boundaries in the space
reduced_data = PCA(n_components=2).fit_transform(X2)
kmeans = KMeans(init='k-means++', n_clusters=3)
kmeans.fit(reduced_data)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02     # point in the mesh [x_min, x_max]x[y_min, y_max].

# Plot the decision boundary. For that, we will assign a color to each point in the mesh
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
plt.title('K-means clustering on the wine dataset (PCA-reduced data)\n'
          'Centroids are marked with white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()



In [26]:
# an alternative from Lab 15 -- we can plot in PCA2 space but use the PCA5 data (Xpca) to actually do the clustering

Y_hat_kmeans = kmeans.fit(Xpca).labels_

In [27]:
plt.scatter(Xpca[:, 0], Xpca[:, 1], c=Y_hat_kmeans)


Out[27]:
<matplotlib.collections.PathCollection at 0x16a55f60>
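
For comparison, a sketch that colors the same two PCA axes by the true Wine column and cross-tabulates the KMeans-on-PCA5 labels (Y_hat_kmeans from above) against it:

In [ ]:
# sketch: true Wine categories on the same axes, plus a crosstab against the
# KMeans-on-PCA5 labels -- cluster numbering is arbitrary, as before
plt.scatter(Xpca[:, 0], Xpca[:, 1], c=y.values)
print(pd.crosstab(y, Y_hat_kmeans, rownames=['Wine'], colnames=['cluster']))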

Step 4. Conduct KMeans and Hierarchical Clustering on the PCA5 data to categorize, and score each


In [28]:
# compute distance matrix
from scipy.spatial.distance import pdist, squareform

# not printed as pretty, but the values are correct
distx = squareform(pdist(Xpca, metric='euclidean'))
distx

# perform clustering and plot the dendrogram
from scipy.cluster.hierarchy import linkage, dendrogram

# note: linkage treats a 2-D array as raw observations, so passing the square distx here
# clusters the rows of the distance matrix; the more standard call is
# linkage(pdist(Xpca), method='ward'), which gives a differently scaled dendrogram
R = dendrogram(linkage(distx, method='ward'), color_threshold=100)

plt.xlabel('points')
plt.ylabel('Height')
plt.suptitle('Cluster Dendrogram', fontweight='bold', fontsize=14);



In [29]:
# so we can see from the dendrogram that we get three clusters when we set the threshold between about 90 and 180;
# if we go finer, by 70 we're already getting another couple of clusters popping up... so even without the a priori knowledge
# of the number of types of wine in the dataset, we may have ended up with three clusters anyway


# would love to plot these as labels on a scatter, although I guess it wouldn't look too different from the other one...
# (scipy's fcluster can cut the linkage into flat labels -- sketched below)
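
A sketch of that comparison, assuming we cut the Ward linkage into 3 flat clusters (3 chosen to match the number of Wine categories); the crosstab and the scatter on the first two PCA axes then mirror the KMeans comparison above:

In [ ]:
# sketch: flat cluster labels from the hierarchical clustering, compared to Wine
from scipy.cluster.hierarchy import fcluster

Z = linkage(pdist(Xpca, metric='euclidean'), method='ward')   # condensed distances
Y_hat_hier = fcluster(Z, t=3, criterion='maxclust')           # cut into 3 clusters

# labels are arbitrary again, so read this like the crosstab earlier
print(pd.crosstab(y, Y_hat_hier, rownames=['Wine'], colnames=['hier cluster']))

plt.scatter(Xpca[:, 0], Xpca[:, 1], c=Y_hat_hier);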

In [ ]: