PCA involves the following broad steps (a compact code sketch follows this list):
1. Standardize the d-dimensional dataset.
2. Construct the covariance matrix.
3. Decompose the covariance matrix into its eigenvectors and eigenvalues.
4. Select the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k ≤ d).
5. Construct a projection matrix W from the "top" k eigenvectors.
6. Transform the d-dimensional input dataset X using the projection matrix W to obtain the new k-dimensional feature subspace.
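As a quick illustration, these six steps can be condensed into a few lines of NumPy. This is only a minimal sketch; the function name pca_project and its arguments are illustrative and not part of the notebook that follows:

import numpy as np

def pca_project(X, k):
    # 1-2. Standardize the data, then build the covariance matrix of the features
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    cov = np.cov(X_std.T)
    # 3-4. Eigendecompose (eigh, since the covariance matrix is symmetric) and keep the k largest
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    top_k = np.argsort(eig_vals)[::-1][:k]
    # 5-6. Projection matrix W (d x k), then project onto the new k-dimensional subspace
    W = eig_vecs[:, top_k]
    return X_std @ W

The notebook below carries out the same steps cell by cell on the UCI Wine dataset.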
In [1]:
# Import the modules
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.stats import zscore
In [2]:
# Read the dataset
dataset = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", header=None)
In [34]:
# Descriptive analytics
print("Shape of the dataset: ", dataset.shape)
dataset.columns = ['class', 'alcohol', 'malic_acid', 'ash', 'alcalinity_ash',
                   'magnesium', 'total_phenol', 'flavanoids', 'nonflavanoid_phenols',
                   'proanthocyanins', 'color_intensity', 'hue', 'diluted_wines',
                   'proline']
In [35]:
# Displaying the top 5 rows of the dataset
dataset.head(5)
Out[35]:
In [5]:
# Check for null values
dataset.isnull().values.sum()
Out[5]:
The 1st attribute is the class identifier (1-3); the remaining 13 attributes are the chemical measurements named in the columns above (alcohol through proline).
So we will consider these 13 attributes for PCA.
In [6]:
# Excluding first attribute
X = dataset.iloc[:, 1:].values
In [7]:
# Standardize the dataset
sc_X = StandardScaler()
X_std = sc_X.fit_transform(X)
X_std.shape
Out[7]:
In [8]:
# Display the standardized dataset
X_std[:3, :]
Out[8]:
In [9]:
# Compute the covariance matrix (transpose so that np.cov treats each feature as a variable)
cov_matrix = np.cov(X_std.transpose())
cov_matrix
Out[9]:
In [10]:
# Visualize the covariance matrix as a heatmap
plt.figure(figsize=(15, 15))
sns.heatmap(cov_matrix, annot=True, cmap="Greens")
Out[10]:
In [11]:
# Pair plot for this dataset
sns.pairplot(pd.DataFrame(X_std))
Out[11]:
In [12]:
# Eigendecomposition of the covariance matrix into eigenvalues and eigenvectors
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
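Note: since the covariance matrix is symmetric, np.linalg.eigh could be used instead of np.linalg.eig; it is numerically better suited to symmetric matrices and returns real eigenvalues in ascending order. A minimal alternative, reversing the order so the largest eigenvalue comes first:

# Illustrative variable names; the cells below continue with eig_vals / eig_vecs from np.linalg.eig
eig_vals_h, eig_vecs_h = np.linalg.eigh(cov_matrix)
eig_vals_h, eig_vecs_h = eig_vals_h[::-1], eig_vecs_h[:, ::-1]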
In [13]:
# Make a list of (|eigenvalue|, eigenvector) tuples
eigen_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
# Sort the (eigenvalue, eigenvector) tuples from high to low by eigenvalue;
# sorting on the eigenvalue only avoids comparing NumPy arrays if two eigenvalues tie
eigen_pairs.sort(key=lambda pair: pair[0], reverse=True)
In [14]:
eigen_pairs
Out[14]:
In [16]:
# Display the eigenvectors
print("Eigenvectors:")
pd.DataFrame(eig_vecs)
Out[16]:
In [17]:
# Display the eigenvalues
print("Eigenvalues:")
pd.DataFrame(eig_vals).transpose()
Out[17]:
In [18]:
eig_vecs.shape
Out[18]:
In [19]:
# Percentage of variance explained by each component, and the running (cumulative) total
tot = sum(eig_vals)
var_exp = [(i / tot) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)
In [20]:
# Plotting the individual and cumulative explained variance
plt.figure(figsize=(10 , 5))
plt.bar(range(13), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(13), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance (%)')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()
In [21]:
# Select the top 7 eigenvectors from the sorted eigen_pairs
# (np.linalg.eig does not guarantee any ordering of the columns of eig_vecs)
eig_vecs_selected = [eigen_pairs[i][1] for i in range(7)]
In [22]:
# Projection matrix W (13 x 7): one column per selected eigenvector
W = np.array(eig_vecs_selected).transpose()
In [23]:
# Display the first 7 eigenvectors (the columns of W)
print("First 7 Eigenvectors:")
pd.DataFrame(W)
Out[23]:
In [24]:
W.shape
Out[24]:
In [25]:
X_std.shape
Out[25]:
In [26]:
# Project the standardized data onto the new 7-dimensional subspace
new_features = np.dot(X_std, W)
pd.DataFrame(new_features)
Out[26]:
Using PCA, we reduced the 13-dimensional dataset to a 7-dimensional subspace while retaining 89.33% of the variance of the original dataset.
We applied Principal Component Analysis to our dataset in six steps.
The same result can also be obtained with the PCA implementation from scikit-learn, e.g.:
from sklearn.decomposition import PCA as sklearnPCA
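For example, the 7-component reduction above could be reproduced as follows (a minimal sketch; it assumes the standardized matrix X_std from the earlier cells is still in scope):

sk_pca = sklearnPCA(n_components = 7)
X_pca = sk_pca.fit_transform(X_std)
print(X_pca.shape)
print(sk_pca.explained_variance_ratio_.sum())   # should be close to the 89.33% found above

Below, the same scikit-learn PCA is used with only 2 components as a preprocessing step for an SVM classifier.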
In [27]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", header=None)
X = dataset.iloc[:, 1:].values
y = dataset.iloc[:, 0].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
# Fitting SVM to the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
In [28]:
# Model score
classifier.score(X_test, y_test)
Out[28]:
In [29]:
# Explained Variance
explained_variance
Out[29]:
In [30]:
sns.heatmap(cm, annot=True)
Out[30]:
In [32]:
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j)
plt.title('SVM (Test set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()