# Linear Discriminant Analysis



In [1]:

import numpy as np
import pandas as pd
%pylab inline
pylab.style.use('ggplot')




Populating the interactive namespace from numpy and matplotlib



The goal of an LDA is to project a feature space (a dataset n-dimensional samples) onto a smaller subspace k (where k≤n−1) while maintaining the class-discriminatory information.

In general, dimensionality reduction does not only help reducing computational costs for a given classification task, but it can also be helpful to avoid overfitting by minimizing the error in parameter estimation (“curse of dimensionality”).

Ronald A. Fisher formulated the Linear Discriminant in 1936 (The Use of Multiple Measurements in Taxonomic Problems), and it also has some practical uses as classifier. The original Linear discriminant was described for a 2-class problem, and it was then later generalized as “multi-class Linear Discriminant Analysis” or “Multiple Discriminant Analysis” by C. R. Rao in 1948 (The utilization of multiple measurements in problems of biological classification)

The general LDA approach is very similar to a Principal Component Analysis (for more information about the PCA, see the previous article Implementing a Principal Component Analysis (PCA) in Python step by step), but in addition to finding the component axes that maximize the variance of our data (PCA), we are additionally interested in the axes that maximize the separation between multiple classes (LDA).

## Difference between LDA and PCA

Both Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are linear transformation techniques that are commonly used for dimensionality reduction. PCA can be described as an “unsupervised” algorithm, since it “ignores” class labels and its goal is to find the directions (the so-called principal components) that maximize the variance in a dataset. In contrast to PCA, LDA is “supervised” and computes the directions (“linear discriminants”) that will represent the axes that that maximize the separation between multiple classes.

In the rest of this notebook, we demonstrate the LDA method using Fisher's IRIS dataset.



In [2]:

import seaborn as sns




In [3]:




In [4]:

iris_df = iris_df.set_index('species')




In [5]:




Out[5]:

sepal_length
sepal_width
petal_length
petal_width

species

setosa
5.1
3.5
1.4
0.2

setosa
4.9
3.0
1.4
0.2

setosa
4.7
3.2
1.3
0.2

setosa
4.6
3.1
1.5
0.2

setosa
5.0
3.6
1.4
0.2



## Normality assumptions

It should be mentioned that LDA assumes normal distributed data, features that are statistically independent, and identical covariance matrices for every class. However, this only applies for LDA as classifier and LDA for dimensionality reduction can also work reasonably well if those assumptions are violated. And even for classification tasks LDA seems can be quite robust to the distribution of the data.



In [6]:

sns.pairplot(hue="species", data=iris_df.reset_index())




Out[6]:

<seaborn.axisgrid.PairGrid at 0x218f70257f0>



## LDA Step 1: Computing the d-dimensional mean vectors



In [7]:

label_means = iris_df.groupby(iris_df.index).mean()




In [8]:

label_means




Out[8]:

sepal_length
sepal_width
petal_length
petal_width

species

setosa
5.006
3.428
1.462
0.246

versicolor
5.936
2.770
4.260
1.326

virginica
6.588
2.974
5.552
2.026



## LDA Step 2: Within class Scatter Matrix



In [9]:

within_class_cov = iris_df.groupby(iris_df.index).cov()




In [10]:

within_class_cov




Out[10]:

sepal_length
sepal_width
petal_length
petal_width

species

setosa
sepal_length
0.124249
0.099216
0.016355
0.010331

sepal_width
0.099216
0.143690
0.011698
0.009298

petal_length
0.016355
0.011698
0.030159
0.006069

petal_width
0.010331
0.009298
0.006069
0.011106

versicolor
sepal_length
0.266433
0.085184
0.182898
0.055780

sepal_width
0.085184
0.098469
0.082653
0.041204

petal_length
0.182898
0.082653
0.220816
0.073102

petal_width
0.055780
0.041204
0.073102
0.039106

virginica
sepal_length
0.404343
0.093763
0.303290
0.049094

sepal_width
0.093763
0.104004
0.071380
0.047629

petal_length
0.303290
0.071380
0.304588
0.048824

petal_width
0.049094
0.047629
0.048824
0.075433




In [11]:

label_counts = iris_df.index.value_counts()




In [12]:

label_counts




Out[12]:

virginica     50
versicolor    50
setosa        50
Name: species, dtype: int64



The within class scatter matrix is defined by:

$S_{within} = \sum_{i}^c (N_i - 1) \times cov(D_i)$

Where

• $c$ is the number of classes
• $N_i$ is the number of samples in class $i$
• $cov(D_i)$ is the covariance matrix of the subset of samples in class $i$.


In [13]:

within_class_cov.loc['setosa'] * (label_counts['setosa'] - 1)




Out[13]:

sepal_length
sepal_width
petal_length
petal_width

sepal_length
6.0882
4.8616
0.8014
0.5062

sepal_width
4.8616
7.0408
0.5732
0.4556

petal_length
0.8014
0.5732
1.4778
0.2974

petal_width
0.5062
0.4556
0.2974
0.5442




In [14]:

feature_names = within_class_cov.columns
reset_cov = within_class_cov.reset_index()

label_grouper = reset_cov.groupby(by='species')
s_within = label_grouper[feature_names].apply(lambda g: g * (label_counts[g.name]-1))

s_within.loc[:, 'level_1'] = reset_cov['level_1']




In [15]:

s_within = s_within.groupby(by='level_1').sum()
s_within.index.name = 'feature_name'




In [16]:

s_within




Out[16]:

sepal_length
sepal_width
petal_length
petal_width

feature_name

petal_length
24.6246
8.1208
27.2226
6.2718

petal_width
5.6450
4.8084
6.2718
6.1566

sepal_length
38.9562
13.6300
24.6246
5.6450

sepal_width
13.6300
16.9620
8.1208
4.8084



## LDA Step 3: Between class Scatter Matrix



In [17]:

overall_means = iris_df.mean()
overall_means.index.name = None




In [18]:

overall_means




Out[18]:

sepal_length    5.843333
sepal_width     3.057333
petal_length    3.758000
petal_width     1.199333
dtype: float64




In [19]:

label_means




Out[19]:

sepal_length
sepal_width
petal_length
petal_width

species

setosa
5.006
3.428
1.462
0.246

versicolor
5.936
2.770
4.260
1.326

virginica
6.588
2.974
5.552
2.026




In [20]:

mean_diff = label_means.sub(overall_means, axis=1)




In [21]:

s_between = mean_diff.T.dot(mean_diff.mul(label_counts, axis=0))
s_between.columns = mean_diff.columns.copy()
s_between.index.name = 'feature_name'




In [22]:

s_between




Out[22]:

sepal_length
sepal_width
petal_length
petal_width

feature_name

sepal_length
63.212133
-19.952667
165.2484
71.279333

sepal_width
-19.952667
11.344933
-57.2396
-22.932667

petal_length
165.248400
-57.239600
437.1028
186.774000

petal_width
71.279333
-22.932667
186.7740
80.413333



## LDA Step 4: EigenValues and EigenVectors of $S_{within}^{-1} S_{between}$



In [23]:

s_within_inv = pd.DataFrame(data=np.linalg.inv(s_within), columns=s_within.columns, index=s_within.index)




In [24]:

s_dot = s_within_inv.dot(s_between)




In [25]:

eig_vals, eig_vecs = np.linalg.eig(s_dot)

eig_vals = pd.Series(eig_vals, index=feature_names)
eig_vecs = pd.DataFrame(eig_vecs, index=feature_names, columns=feature_names)




In [26]:

eig_vals




Out[26]:

sepal_length   -1.010279e+01
sepal_width     3.407094e-01
petal_length    1.967844e-15
petal_width    -7.669689e-15
dtype: float64




In [27]:

eig_vecs




Out[27]:

sepal_length
sepal_width
petal_length
petal_width

sepal_length
0.454953
0.793436
0.745904
-0.270434

sepal_width
0.279078
0.454150
-0.064125
0.409473

petal_length
-0.105142
-0.321864
-0.007123
0.469501

petal_width
-0.839094
0.246192
-0.662921
-0.734007