Mini Project (Unsupervised Learning)


Principal Component Analysis of Wine Data


Steps involved

PCA involves the following broad-level steps; a compact NumPy sketch of the full pipeline follows the list.

1. Standardize the d-dimensional dataset.
2. Construct the covariance matrix.
3. Decompose the covariance matrix into its eigenvectors and eigenvalues.
4. Select the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k ≤ d).
5. Construct a projection matrix W from the "top" k eigenvectors.
6. Transform the d-dimensional input dataset x using the projection matrix W to obtain the new k-dimensional feature subspace.
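As a quick reference, the six steps above can be condensed into a few lines of NumPy. This is a minimal sketch only (it assumes a numeric matrix X of shape (n_samples, d) and a target dimensionality k); the notebook below carries out the same steps one at a time.

import numpy as np

def pca_sketch(X, k):
    # 1. Standardize each feature to zero mean and unit variance
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized features
    cov = np.cov(X_std.T)
    # 3. Eigendecomposition (eigh suits symmetric matrices; eigenvalues come back ascending)
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    # 4./5. Keep the k eigenvectors with the largest eigenvalues as the projection matrix W
    order = np.argsort(eig_vals)[::-1][:k]
    W = eig_vecs[:, order]
    # 6. Project the data onto the new k-dimensional feature subspace
    return X_std.dot(W)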

In [1]:
# Import the modules
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.stats import zscore

In [2]:
# Read the dataset
dataset = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", header=None)

In [34]:
# Descriptive analytics
print("Shape of the dataset: ", dataset.shape)
dataset.columns = ['class', 'alcohol', 'malic_acid', 'ash', 'alcalinity_ash',
                  'magnesium', 'total_phenol', 'flavanoids', 'nonflavanoid_phenols',
                  'proanthocyanins', 'color_intensity', 'hue', 'diluted_wines',
                  'proline']


Shape of the dataset:  (178, 14)

In [35]:
# Displaying the top 5 rows of the dataset
dataset.head(5)


Out[35]:
class alcohol malic_acid ash alcalinity_ash magnesium total_phenol flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue diluted_wines proline
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735

In [5]:
# Check for null values
dataset.isnull().values.sum()


Out[5]:
0

Attribute definition

The first attribute is the class identifier (1-3). The remaining 13 attributes are listed below:

  1. Alcohol
  2. Malic acid
  3. Ash
  4. Alcalinity of ash
  5. Magnesium
  6. Total phenols
  7. Flavanoids
  8. Nonflavanoid phenols
  9. Proanthocyanins
  10. Color intensity
  11. Hue
  12. OD280/OD315 of diluted wines
  13. Proline

So we will consider 13 attributes for PCA.


In [6]:
# Exclude the first attribute (the class label)
X = dataset.iloc[:, 1:].values

Standardizing the 13-dimensional dataset

We will use the StandardScaler class from the sklearn.preprocessing module to standardize the dataset.
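Under the hood, StandardScaler simply z-scores each column (it subtracts the column mean and divides by the column standard deviation). A minimal manual equivalent, shown only as an illustrative check and not part of the original notebook:

# Manual z-scoring; should match StandardScaler().fit_transform(X)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
# np.allclose(X_manual, X_std)  -> True once X_std is computed in the next cell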


In [7]:
# Standardize the dataset
sc_X = StandardScaler()
X_std = sc_X.fit_transform(X)
X_std.shape


Out[7]:
(178, 13)

In [8]:
# Display the standardized dataset
X_std[:3, :]


Out[8]:
array([[ 1.51861254, -0.5622498 ,  0.23205254, -1.16959318,  1.91390522,
         0.80899739,  1.03481896, -0.65956311,  1.22488398,  0.25171685,
         0.36217728,  1.84791957,  1.01300893],
       [ 0.24628963, -0.49941338, -0.82799632, -2.49084714,  0.01814502,
         0.56864766,  0.73362894, -0.82071924, -0.54472099, -0.29332133,
         0.40605066,  1.1134493 ,  0.96524152],
       [ 0.19687903,  0.02123125,  1.10933436, -0.2687382 ,  0.08835836,
         0.80899739,  1.21553297, -0.49840699,  2.13596773,  0.26901965,
         0.31830389,  0.78858745,  1.39514818]])

Constructing the covariance matrix

We will use the cov function from the numpy module to compute the covariance matrix.
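For standardized (zero-mean) data, the covariance matrix is just X_std.T @ X_std scaled by 1/(n-1), which is why the diagonal entries below come out as n/(n-1) ≈ 1.0056 rather than exactly 1. A small sanity-check sketch (illustrative only):

# Equivalent computation of the covariance matrix for standardized data
n = X_std.shape[0]
cov_manual = X_std.T.dot(X_std) / (n - 1)
# np.allclose(cov_manual, np.cov(X_std.T))  -> True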


In [9]:
cov_matrix = np.cov(X_std.transpose())
cov_matrix


Out[9]:
array([[ 1.00564972,  0.09493026,  0.21273976, -0.31198788,  0.27232816,
         0.29073446,  0.23815287, -0.15681042,  0.13747022,  0.549451  ,
        -0.07215255,  0.07275191,  0.64735687],
       [ 0.09493026,  1.00564972,  0.16497228,  0.29013035, -0.05488343,
        -0.3370606 , -0.41332866,  0.29463237, -0.22199334,  0.25039204,
        -0.56446685, -0.37079354, -0.19309537],
       [ 0.21273976,  0.16497228,  1.00564972,  0.44587209,  0.28820583,
         0.12970824,  0.11572743,  0.1872826 ,  0.00970647,  0.2603499 ,
        -0.07508874,  0.00393333,  0.22488969],
       [-0.31198788,  0.29013035,  0.44587209,  1.00564972, -0.0838039 ,
        -0.32292752, -0.353355  ,  0.36396647, -0.19844168,  0.01883781,
        -0.27550299, -0.27833221, -0.44308618],
       [ 0.27232816, -0.05488343,  0.28820583, -0.0838039 ,  1.00564972,
         0.21561254,  0.19688989, -0.25774204,  0.23777643,  0.20107967,
         0.05571118,  0.06637684,  0.39557317],
       [ 0.29073446, -0.3370606 ,  0.12970824, -0.32292752,  0.21561254,
         1.00564972,  0.86944804, -0.45247731,  0.61587304, -0.05544792,
         0.43613151,  0.70390388,  0.50092909],
       [ 0.23815287, -0.41332866,  0.11572743, -0.353355  ,  0.19688989,
         0.86944804,  1.00564972, -0.54093859,  0.65637929, -0.17335329,
         0.54654907,  0.79164133,  0.49698518],
       [-0.15681042,  0.29463237,  0.1872826 ,  0.36396647, -0.25774204,
        -0.45247731, -0.54093859,  1.00564972, -0.36791202,  0.13984265,
        -0.26412347, -0.50611293, -0.31314443],
       [ 0.13747022, -0.22199334,  0.00970647, -0.19844168,  0.23777643,
         0.61587304,  0.65637929, -0.36791202,  1.00564972, -0.02539259,
         0.29721399,  0.52199968,  0.33228346],
       [ 0.549451  ,  0.25039204,  0.2603499 ,  0.01883781,  0.20107967,
        -0.05544792, -0.17335329,  0.13984265, -0.02539259,  1.00564972,
        -0.52476129, -0.43123763,  0.31788599],
       [-0.07215255, -0.56446685, -0.07508874, -0.27550299,  0.05571118,
         0.43613151,  0.54654907, -0.26412347,  0.29721399, -0.52476129,
         1.00564972,  0.56866303,  0.23751782],
       [ 0.07275191, -0.37079354,  0.00393333, -0.27833221,  0.06637684,
         0.70390388,  0.79164133, -0.50611293,  0.52199968, -0.43123763,
         0.56866303,  1.00564972,  0.31452809],
       [ 0.64735687, -0.19309537,  0.22488969, -0.44308618,  0.39557317,
         0.50092909,  0.49698518, -0.31314443,  0.33228346,  0.31788599,
         0.23751782,  0.31452809,  1.00564972]])

In [10]:
# Visualize the covariance matrix as a heatmap
plt.figure(figsize=(15, 15))
sns.heatmap(cov_matrix, annot=True, cmap="Greens")


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f14620f3128>

In [11]:
# Pair plot for this dataset
sns.pairplot(pd.DataFrame(X_std))


Out[11]:
<seaborn.axisgrid.PairGrid at 0x7f145d1df208>

Decomposing the covariance matrix into its eigenvectors and eigenvalues

To do this we will use linalg.eig from the numpy module.
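As an aside, because a covariance matrix is symmetric, np.linalg.eigh is also an option: it is designed for symmetric/Hermitian matrices and returns real eigenvalues already sorted in ascending order. A hedged alternative sketch (the notebook below sticks with np.linalg.eig):

# Alternative decomposition for the symmetric covariance matrix
vals, vecs = np.linalg.eigh(cov_matrix)
vals = vals[::-1]        # largest eigenvalue first
vecs = vecs[:, ::-1]     # reorder the eigenvector columns to match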


In [12]:
# Decompose the covariance matrix into eigenvalues and eigenvectors
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)

In [13]:
eigen_pairs = [(np.abs(eig_vals[i]), eig_vecs[ :, i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eigen_pairs.sort()
eigen_pairs.reverse()

In [14]:
eigen_pairs


Out[14]:
[(4.7324369775835953,
  array([-0.1443294 ,  0.24518758,  0.00205106,  0.23932041, -0.14199204,
         -0.39466085, -0.4229343 ,  0.2985331 , -0.31342949,  0.0886167 ,
         -0.29671456, -0.37616741, -0.28675223])),
 (2.5110809296451233,
  array([ 0.48365155,  0.22493093,  0.31606881, -0.0105905 ,  0.299634  ,
          0.06503951, -0.00335981,  0.02877949,  0.03930172,  0.52999567,
         -0.27923515, -0.16449619,  0.36490283])),
 (1.4542418678464692,
  array([-0.20738262,  0.08901289,  0.6262239 ,  0.61208035,  0.13075693,
          0.14617896,  0.1506819 ,  0.17036816,  0.14945431, -0.13730621,
          0.08522192,  0.16600459, -0.12674592])),
 (0.92416586682487556,
  array([ 0.0178563 , -0.53689028,  0.21417556, -0.06085941,  0.35179658,
         -0.19806835, -0.15229479,  0.20330102, -0.39905653, -0.06592568,
          0.42777141, -0.18412074,  0.23207086])),
 (0.85804867653711181,
  array([-0.26566365,  0.03521363, -0.14302547,  0.06610294,  0.72704851,
         -0.14931841, -0.10902584, -0.50070298,  0.13685982, -0.07643678,
         -0.17361452, -0.10116099, -0.1578688 ])),
 (0.645282212467854,
  array([ 0.21353865,  0.53681385,  0.15447466, -0.10082451,  0.03814394,
         -0.0841223 , -0.01892002, -0.25859401, -0.53379539, -0.41864414,
          0.10598274,  0.26585107,  0.11972557])),
 (0.55414146624578486,
  array([ 0.05639636, -0.42052391,  0.14917061,  0.28696914, -0.3228833 ,
          0.02792498,  0.06068521, -0.59544729, -0.37213935,  0.22771214,
         -0.23207564,  0.0447637 , -0.0768045 ])),
 (0.35046627494625449,
  array([ 0.39613926,  0.06582674, -0.17026002,  0.42797018, -0.15636143,
         -0.40593409, -0.18724536, -0.23328465,  0.36822675, -0.03379692,
          0.43662362, -0.07810789,  0.12002267])),
 (0.29051203269397752,
  array([-0.50861912,  0.07528304,  0.30769445, -0.20044931, -0.27140257,
         -0.28603452, -0.04957849, -0.19550132,  0.20914487, -0.05621752,
         -0.08582839, -0.1372269 ,  0.57578611])),
 (0.25232001036082513,
  array([ 0.21160473, -0.30907994, -0.02712539,  0.05279942,  0.06787022,
         -0.32013135, -0.16315051,  0.21553507,  0.1341839 , -0.29077518,
         -0.52239889,  0.52370587,  0.162116  ])),
 (0.22706428173088539,
  array([-0.22591696,  0.07648554, -0.49869142,  0.47931378,  0.07128891,
          0.30434119, -0.02569409,  0.11689586, -0.23736257,  0.0318388 ,
         -0.04821201,  0.0464233 ,  0.53926983])),
 (0.1697237389801218,
  array([-0.26628645,  0.12169604, -0.04962237, -0.05574287,  0.06222011,
         -0.30388245, -0.04289883,  0.04235219, -0.09555303,  0.60422163,
          0.259214  ,  0.60095872, -0.07940162])),
 (0.10396199182075286,
  array([-0.01496997, -0.02596375,  0.14121803, -0.09168285, -0.05677422,
          0.46390791, -0.83225706, -0.11403985,  0.11691707,  0.0119928 ,
          0.08988884,  0.15671813, -0.01444734]))]

In [16]:
# Display the eigen Vectors
print("Eigen Vectors:")
pd.DataFrame(eig_vecs)


Eigen Vectors:
Out[16]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 -0.144329 0.483652 -0.207383 0.017856 -0.265664 0.213539 0.056396 -0.014970 0.396139 -0.266286 -0.508619 -0.225917 0.211605
1 0.245188 0.224931 0.089013 -0.536890 0.035214 0.536814 -0.420524 -0.025964 0.065827 0.121696 0.075283 0.076486 -0.309080
2 0.002051 0.316069 0.626224 0.214176 -0.143025 0.154475 0.149171 0.141218 -0.170260 -0.049622 0.307694 -0.498691 -0.027125
3 0.239320 -0.010591 0.612080 -0.060859 0.066103 -0.100825 0.286969 -0.091683 0.427970 -0.055743 -0.200449 0.479314 0.052799
4 -0.141992 0.299634 0.130757 0.351797 0.727049 0.038144 -0.322883 -0.056774 -0.156361 0.062220 -0.271403 0.071289 0.067870
5 -0.394661 0.065040 0.146179 -0.198068 -0.149318 -0.084122 0.027925 0.463908 -0.405934 -0.303882 -0.286035 0.304341 -0.320131
6 -0.422934 -0.003360 0.150682 -0.152295 -0.109026 -0.018920 0.060685 -0.832257 -0.187245 -0.042899 -0.049578 -0.025694 -0.163151
7 0.298533 0.028779 0.170368 0.203301 -0.500703 -0.258594 -0.595447 -0.114040 -0.233285 0.042352 -0.195501 0.116896 0.215535
8 -0.313429 0.039302 0.149454 -0.399057 0.136860 -0.533795 -0.372139 0.116917 0.368227 -0.095553 0.209145 -0.237363 0.134184
9 0.088617 0.529996 -0.137306 -0.065926 -0.076437 -0.418644 0.227712 0.011993 -0.033797 0.604222 -0.056218 0.031839 -0.290775
10 -0.296715 -0.279235 0.085222 0.427771 -0.173615 0.105983 -0.232076 0.089889 0.436624 0.259214 -0.085828 -0.048212 -0.522399
11 -0.376167 -0.164496 0.166005 -0.184121 -0.101161 0.265851 0.044764 0.156718 -0.078108 0.600959 -0.137227 0.046423 0.523706
12 -0.286752 0.364903 -0.126746 0.232071 -0.157869 0.119726 -0.076805 -0.014447 0.120023 -0.079402 0.575786 0.539270 0.162116

In [17]:
# Display the eigen values
print("Eigen Values:")
pd.DataFrame(eig_vals).transpose()


Eigen Values:
Out[17]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 4.732437 2.511081 1.454242 0.924166 0.858049 0.645282 0.554141 0.103962 0.350466 0.169724 0.290512 0.227064 0.25232

In [18]:
eig_vecs.shape


Out[18]:
(13, 13)

Selecting k eigenvectors

Select the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k ≤ d).
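A common way to choose k is to take the smallest number of components whose cumulative explained-variance ratio crosses a chosen threshold. A short sketch assuming, say, a 90% target (illustrative only; the notebook simply inspects the cumulative values below):

# Smallest k whose cumulative explained variance reaches the threshold
sorted_vals = np.sort(eig_vals)[::-1]
cum_ratio = np.cumsum(sorted_vals) / np.sum(sorted_vals)
k = int(np.argmax(cum_ratio >= 0.90)) + 1   # k = 8 for a 90% threshold on this data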


In [19]:
tot = sum(eig_vals)
var_exp = [( i /tot ) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)


Cumulative Variance Explained [  36.1988481    55.40633836   66.52996889   73.59899908   80.16229276
   85.09811607   89.3367954    92.01754435   94.23969775   96.16971684
   97.90655253   99.20478511  100.        ]

In [20]:
# Plot individual and cumulative explained variance
plt.figure(figsize=(10 , 5))
plt.bar(range(13), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(13), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()



In [21]:
# Select top 7 eigen vectors
eig_vecs_selected = eig_vecs[:, :7]

Insights

Let's select the first 7 eigenvectors, which give us 89.33% coverage of the variance in the original data.

Projection matrix W

Constructing a projection matrix W from the "top" 7 eigenvectors.

Here, we reduce the 13-dimensional feature space to a 7-dimensional feature subspace by choosing the 7 eigenvectors with the highest eigenvalues to construct our d×k-dimensional (13×7) eigenvector matrix W.
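One caveat: np.linalg.eig does not guarantee any particular ordering of its eigenvalues (notice above that the eigenvalue at index 7 is actually the smallest). In this run the first seven columns of eig_vecs do happen to carry the seven largest eigenvalues, so eig_vecs[:, :7] works, but a more robust construction builds W from the sorted eigen_pairs list. An illustrative alternative:

# Build W from the already-sorted (eigenvalue, eigenvector) pairs so the result
# does not depend on the column order returned by np.linalg.eig.
W_sorted = np.column_stack([vec for _, vec in eigen_pairs[:7]])
# Here W_sorted coincides with eig_vecs[:, :7].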


In [22]:
# Projection matrix
W = eig_vecs[:, :7]

In [23]:
# Display the eigen Vectors
print("First 7 Eigen Vectors:")
pd.DataFrame(W)


First 7 Eigen Vectors:
Out[23]:
0 1 2 3 4 5 6
0 -0.144329 0.483652 -0.207383 0.017856 -0.265664 0.213539 0.056396
1 0.245188 0.224931 0.089013 -0.536890 0.035214 0.536814 -0.420524
2 0.002051 0.316069 0.626224 0.214176 -0.143025 0.154475 0.149171
3 0.239320 -0.010591 0.612080 -0.060859 0.066103 -0.100825 0.286969
4 -0.141992 0.299634 0.130757 0.351797 0.727049 0.038144 -0.322883
5 -0.394661 0.065040 0.146179 -0.198068 -0.149318 -0.084122 0.027925
6 -0.422934 -0.003360 0.150682 -0.152295 -0.109026 -0.018920 0.060685
7 0.298533 0.028779 0.170368 0.203301 -0.500703 -0.258594 -0.595447
8 -0.313429 0.039302 0.149454 -0.399057 0.136860 -0.533795 -0.372139
9 0.088617 0.529996 -0.137306 -0.065926 -0.076437 -0.418644 0.227712
10 -0.296715 -0.279235 0.085222 0.427771 -0.173615 0.105983 -0.232076
11 -0.376167 -0.164496 0.166005 -0.184121 -0.101161 0.265851 0.044764
12 -0.286752 0.364903 -0.126746 0.232071 -0.157869 0.119726 -0.076805

Transform the dataset

Transforming the 13-dimensional standardized dataset X_std using the projection matrix W to obtain the new 7-dimensional feature subspace.


In [24]:
W.shape


Out[24]:
(13, 7)

In [25]:
X_std.shape


Out[25]:
(178, 13)

In [26]:
# Project the standardized data onto the new 7-dimensional subspace
new_features = np.dot(X_std, W)
pd.DataFrame(new_features)


Out[26]:
0 1 2 3 4 5 6
0 -3.316751 1.443463 -0.165739 0.215631 0.693043 0.223880 -0.596427
1 -2.209465 -0.333393 -2.026457 0.291358 -0.257655 0.927120 -0.053776
2 -2.516740 1.031151 0.982819 -0.724902 -0.251033 -0.549276 -0.424205
3 -3.757066 2.756372 -0.176192 -0.567983 -0.311842 -0.114431 0.383337
4 -1.008908 0.869831 2.026688 0.409766 0.298458 0.406520 -0.444074
5 -3.050254 2.122401 -0.629396 0.515637 -0.632019 -0.123431 -0.401654
6 -2.449090 1.174850 -0.977095 0.065831 -1.027762 0.620121 -0.052891
7 -2.059437 1.608963 0.146282 1.192608 0.076903 1.439806 -0.032376
8 -2.510874 0.918071 -1.770969 -0.056270 -0.892257 0.129181 -0.125285
9 -2.753628 0.789438 -0.984247 -0.349382 -0.468553 -0.163392 0.874352
10 -3.479737 1.302333 -0.422735 -0.026842 -0.338375 0.182902 -0.248162
11 -1.754753 0.611977 -1.190878 0.890164 -0.738573 0.553055 0.434266
12 -2.113462 0.675706 -0.865086 0.356438 -1.209929 0.215076 0.242597
13 -3.458157 1.130630 -1.204276 -0.162458 -2.023127 -0.745781 -1.475773
14 -4.312784 2.095976 -1.263913 -0.305773 -1.029693 -0.795643 -0.999971
15 -2.305188 1.662552 0.217903 1.440590 -0.469550 0.422213 0.180968
16 -2.171955 2.327305 0.831730 0.912601 -0.000115 0.066529 -0.109488
17 -1.898971 1.631369 0.794914 1.082380 -0.438705 -0.364931 -0.091647
18 -3.541985 2.518344 -0.485459 0.910323 -1.153079 -0.303877 0.033464
19 -2.084522 1.061138 -0.164747 -0.484997 0.882511 1.393018 0.102472
20 -3.124403 0.786897 -0.364887 0.025562 0.972414 0.106922 -0.264762
21 -1.086570 0.241744 0.936962 -1.029910 0.315972 1.211015 -0.296932
22 -2.535224 -0.091841 -0.311933 0.048391 -0.429582 1.014943 0.127770
23 -1.644988 -0.516279 0.143885 0.413720 -0.375720 0.784506 0.668402
24 -1.761576 -0.317149 0.890286 0.115116 -0.556668 0.898749 0.623551
25 -0.990079 0.940667 3.820908 1.321561 0.159005 0.265128 -0.481907
26 -1.775278 0.686175 -0.086700 0.232907 -1.142943 0.571474 0.458028
27 -1.235424 -0.089807 -1.386897 0.495683 -0.375941 0.608088 0.363010
28 -2.188406 0.689570 1.394567 0.777492 -0.810584 0.602072 -0.117933
29 -2.256109 0.191462 -1.092657 -0.286152 -0.483073 0.335234 0.158350
... ... ... ... ... ... ... ...
148 2.807064 1.570534 -0.472528 -0.627358 -0.260173 -0.553243 0.449464
149 2.899659 2.041057 -0.495960 -0.471156 1.438877 -0.178543 0.393520
150 2.320737 2.356366 0.437682 0.052260 2.234427 -0.076779 1.273439
151 2.549831 2.045283 -0.312268 -0.386972 1.847523 -0.980029 1.627192
152 1.812541 1.527646 1.362590 0.189396 1.749387 -0.970292 1.541851
153 2.760145 2.138932 -0.964629 -0.668386 -0.477042 -1.802113 -1.018406
154 2.737151 0.409886 -1.190405 0.663045 0.459037 -1.881890 -0.023338
155 3.604869 1.802384 -0.094037 -1.268840 -0.609465 -0.191933 -1.418423
156 2.889826 1.925219 -0.782323 -1.324725 -0.571243 -0.430606 -0.217856
157 3.392156 1.311876 1.602026 0.482842 -0.670871 -0.803565 -0.095837
158 1.048182 3.515090 1.160039 -0.935329 -0.899449 -3.284281 0.642463
159 1.609912 2.406638 0.548560 -0.754310 -0.995207 -2.919477 0.711056
160 3.143131 0.738161 -0.090999 -0.980648 -0.409814 -0.398922 0.074031
161 2.240157 1.175465 -0.101377 1.165279 -0.264449 0.801039 -0.539883
162 2.847674 0.556044 0.804215 0.897888 -0.254801 0.288568 -0.867762
163 2.597497 0.697966 -0.884940 0.274229 0.772235 0.719842 -0.272500
164 2.949299 1.555309 -0.983401 -0.015480 -0.364082 -0.491206 0.985935
165 3.530032 0.882527 -0.466029 -0.580790 -0.668960 0.458814 -0.522221
166 2.406111 2.592356 0.428226 0.184335 0.447661 -0.569506 -0.035802
167 2.929085 1.274447 -1.213358 -0.295316 -0.267350 -0.381213 0.644321
168 2.181413 2.077537 0.763783 0.389593 0.359874 -0.629568 0.753322
169 2.380928 2.588667 1.418044 -0.588502 1.127997 0.983645 0.930473
170 3.211617 -0.251249 -0.847129 0.217065 0.609095 0.395378 0.291849
171 3.677919 0.847748 -1.339420 0.125176 -0.486112 -0.857959 1.025640
172 2.465556 2.193798 -0.918781 -0.018025 -0.701210 -0.680855 0.829199
173 3.370524 2.216289 -0.342570 -1.058527 -0.574164 1.108788 -0.958416
174 2.601956 1.757229 0.207581 -0.349496 0.255063 0.026465 -0.146894
175 2.677839 2.760899 -0.940942 -0.312035 1.271355 -0.273068 -0.679235
176 2.387017 2.297347 -0.550696 0.688285 0.813955 -1.178783 -0.633975
177 3.208758 2.768920 1.013914 -0.596903 -0.895193 -0.296092 -0.005741

178 rows × 7 columns

Conclusion

Using PCA, we were able to reduce the 13-dimensional dataset to a 7-dimensional subspace while retaining 89.33% of the variance of the original dataset.

We followed the six broad steps outlined above to apply Principal Component Analysis to our dataset.

This can also be accomplished directly with the PCA implementation from scikit-learn, e.g. from sklearn.decomposition import PCA.
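For completeness, a minimal sketch of the scikit-learn route, as an illustrative equivalent of the manual steps above (sklearn's components may differ from ours in sign, which does not change the subspace):

# Equivalent dimensionality reduction with scikit-learn's PCA
from sklearn.decomposition import PCA

pca7 = PCA(n_components=7)
X_pca7 = pca7.fit_transform(X_std)               # shape (178, 7)
print(pca7.explained_variance_ratio_.sum())      # ~0.893, matching the 89.33% above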

Constructing a classification model


In [27]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", header=None)
X = dataset.iloc[:, 1:].values
y = dataset.iloc[:, 0].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_

# Fitting SVM to the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

In [28]:
# Model score
classifier.score(X_test, y_test)


Out[28]:
0.97777777777777775

In [29]:
# Explained Variance
explained_variance


Out[29]:
array([ 0.37281068,  0.18739996])

In [30]:
sns.heatmap(cm, annot=True)


Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f14485c5390>

In [32]:
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j)
plt.title('SVM (Test set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()


Insights


  • Used an SVM to classify the wine dataset into its 3 classes
  • Applied PCA to reduce the dimensionality to just 2 components
  • These 2 principal components together explain 56.02% of the total variance in the dataset (a quick check follows below)
  • The SVM model achieved an accuracy of 97.78% on the test set
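As a quick check of the variance figure (an illustrative one-liner using the explained_variance array computed above):

# Total variance captured by the first two principal components
print(explained_variance.sum())   # ~0.5602 -> 56.02%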