In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('ggplot')
The goal of LDA is to project a feature space (a dataset of n-dimensional samples) onto a smaller subspace of dimension k (where k ≤ c−1, with c the number of classes) while preserving the class-discriminatory information.
In general, dimensionality reduction not only helps reduce the computational cost of a given classification task, but it can also help avoid overfitting by minimizing the error in parameter estimation (the “curse of dimensionality”).
Ronald A. Fisher formulated the Linear Discriminant in 1936 (The Use of Multiple Measurements in Taxonomic Problems), and it also has some practical uses as a classifier. The original linear discriminant was described for a 2-class problem; it was later generalized as “multi-class Linear Discriminant Analysis” or “Multiple Discriminant Analysis” by C. R. Rao in 1948 (The Utilization of Multiple Measurements in Problems of Biological Classification).
The general LDA approach is very similar to Principal Component Analysis (for more information about PCA, see the previous article, Implementing a Principal Component Analysis (PCA) in Python step by step), but in addition to finding the component axes that maximize the variance of our data (PCA), we are also interested in the axes that maximize the separation between multiple classes (LDA).
Both Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are linear transformation techniques commonly used for dimensionality reduction. PCA can be described as an “unsupervised” algorithm, since it “ignores” class labels and its goal is to find the directions (the so-called principal components) that maximize the variance in a dataset. In contrast, LDA is “supervised” and computes the directions (“linear discriminants”) that maximize the separation between multiple classes.
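To make the contrast concrete, here is a minimal sketch of the two transforms side by side (it uses scikit-learn, which is otherwise not needed in this notebook): PCA is fit on the feature matrix alone, while LDA also consumes the class labels.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
X_pca = PCA(n_components=2).fit_transform(X)                             # unsupervised: ignores y
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)   # supervised: uses y
print(X_pca.shape, X_lda.shape)                                          # (150, 2) (150, 2)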
In the rest of this notebook, we demonstrate the LDA method using Fisher's IRIS dataset.
In [2]:
import seaborn as sns
In [3]:
iris_df = sns.load_dataset('iris')
In [4]:
iris_df = iris_df.set_index('species')
In [5]:
iris_df.head()
Out[5]:
It should be mentioned that LDA assumes normally distributed data, statistically independent features, and identical covariance matrices for every class. However, these assumptions only matter for LDA as a classifier; LDA for dimensionality reduction can work reasonably well even when they are violated. And even for classification tasks, LDA can be quite robust to the distribution of the data.
In [6]:
sns.pairplot(hue="species", data=iris_df.reset_index())
Out[6]:
In [7]:
label_means = iris_df.groupby(iris_df.index).mean()
In [8]:
label_means
Out[8]:
In [9]:
within_class_cov = iris_df.groupby(iris_df.index).cov()
In [10]:
within_class_cov
Out[10]:
In [11]:
label_counts = iris_df.index.value_counts()
In [12]:
label_counts
Out[12]:
The within-class scatter matrix is defined by:
$S_{within} = \sum_{i=1}^{c} (N_i - 1) \times cov(D_i)$
where $c$ is the number of classes, $D_i$ is the set of $N_i$ samples belonging to class $i$, and $cov(D_i)$ is its sample covariance matrix. Since the sample covariance divides by $N_i - 1$, the term $(N_i - 1) \times cov(D_i)$ is exactly the textbook scatter $\sum_{x \in D_i} (x - m_i)(x - m_i)^T$, where $m_i$ is the mean vector of class $i$.
In [13]:
# scatter contribution of a single class (setosa): (N_i - 1) * cov(D_i)
within_class_cov.loc['setosa'] * (label_counts['setosa'] - 1)
Out[13]:
In [14]:
feature_names = within_class_cov.columns
reset_cov = within_class_cov.reset_index()
# group the stacked per-class covariance matrices by species; group_keys=False
# keeps the original row index so the feature-name column can be re-attached below
label_grouper = reset_cov.groupby(by='species', group_keys=False)
# scale each class's covariance matrix by (N_i - 1) to get its scatter contribution
s_within = label_grouper[feature_names].apply(lambda g: g * (label_counts[g.name] - 1))
s_within.loc[:, 'level_1'] = reset_cov['level_1']
In [15]:
# sum the per-class contributions feature by feature; groupby sorts its keys, so
# restore the original feature order to keep rows and columns of the matrix aligned
s_within = s_within.groupby(by='level_1').sum().loc[feature_names]
s_within.index.name = 'feature_name'
In [16]:
s_within
Out[16]:
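As a quick sanity check (a sketch, not part of the original computation), the same matrix can be obtained by summing the per-class scatter contributions directly with NumPy; it should match s_within up to floating-point error.

# direct NumPy computation of the within-class scatter, for comparison with s_within
s_w_check = np.zeros((len(feature_names), len(feature_names)))
for label in label_counts.index:
    class_data = iris_df.loc[label].values                       # samples of one class
    s_w_check += (len(class_data) - 1) * np.cov(class_data, rowvar=False)
print(np.allclose(s_w_check, s_within.values))                   # expected: True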
In [17]:
overall_means = iris_df.mean()
overall_means.index.name = None
In [18]:
overall_means
Out[18]:
In [19]:
label_means
Out[19]:
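Analogously, the between-class scatter matrix is defined by:
$S_{between} = \sum_{i=1}^{c} N_i \times (m_i - m)(m_i - m)^T$
where $m_i$ is the mean vector of class $i$ and $m$ is the overall mean of the data. The next two cells build it from the differences between the class means and the overall mean.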
In [20]:
# difference between each class mean and the overall mean: (m_i - m)
mean_diff = label_means.sub(overall_means, axis=1)
In [21]:
# S_between = sum_i N_i (m_i - m)(m_i - m)^T, expressed as a single matrix product
s_between = mean_diff.T.dot(mean_diff.mul(label_counts, axis=0))
s_between.columns = mean_diff.columns.copy()
s_between.index.name = 'feature_name'
In [22]:
s_between
Out[22]:
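The directions that maximize the between-class scatter relative to the within-class scatter are the eigenvectors of $S_{within}^{-1} S_{between}$; the size of each eigenvalue indicates how much class-discriminatory information the corresponding eigenvector (linear discriminant) captures. The following cells set up and solve this eigenvalue problem.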
In [23]:
s_within_inv = pd.DataFrame(data=np.linalg.inv(s_within), columns=s_within.columns, index=s_within.index)
In [24]:
s_dot = s_within_inv.dot(s_between)
In [25]:
# eigen-decomposition of S_within^{-1} S_between: column i of eig_vecs is the
# eigenvector (linear discriminant) belonging to eig_vals[i]; the feature names
# serve only as positional labels for the eigenvalues and eigenvector columns
eig_vals, eig_vecs = np.linalg.eig(s_dot)
eig_vals = pd.Series(eig_vals, index=feature_names)
eig_vecs = pd.DataFrame(eig_vecs, index=feature_names, columns=feature_names)
In [26]:
eig_vals
Out[26]:
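The relative size of each eigenvalue tells us how much of the class-discriminatory information the corresponding linear discriminant captures; a quick way to see this (a small sketch, assuming the eigenvalues came back as real numbers) is to normalize them:

# fraction of the total discriminatory information captured by each discriminant
(eig_vals / eig_vals.sum()).sort_values(ascending=False)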
In [27]:
eig_vecs
Out[27]:
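As a preview of how these results are used (a minimal sketch with hypothetical variable names, assuming real-valued eigenvalues and keeping the top two discriminants), the eigenvectors with the largest eigenvalues form the projection matrix onto the new subspace:

# order the discriminants by decreasing eigenvalue and keep the leading two
top = eig_vals.abs().sort_values(ascending=False).index[:2]
w = eig_vecs[top]                  # 4x2 projection matrix (one column per discriminant)
iris_lda = iris_df.dot(w)          # project the 4-D samples onto the 2-D subspace
iris_lda.columns = ['LD1', 'LD2']
iris_lda.head()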