Let's use dimensionality reduction in a real example. We will examine the MNIST database of handwritten digits, where each image is described by 28 x 28 = 784 pixel features. We first reduce the dimensionality of the dataset and then use our preferred algorithm to predict the digit class.
We will see how different dimensionality reduction methods work in practice.
Linear Discriminant Analysis (LDA): a supervised learning algorithm that takes the class labels into account and maximizes the separation between classes.
Principal Component Analysis (PCA): an unsupervised learning algorithm, i.e., it ignores the class labels; its goal is to find the principal components that maximize the variance.
For more information on LDA and PCA see Sebastian Raschka's article.
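As a minimal illustration of the difference (a standalone sketch on synthetic data, not part of the MNIST workflow below): PCA is fit on the features alone, while LDA also requires the class labels.
In [ ]:
# Sketch: PCA ignores labels, LDA requires them (synthetic data for illustration only).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 5))    # 100 samples, 5 features
y_demo = rng.randint(0, 3, size=100)  # 3 classes

X_pca = PCA(n_components=2).fit_transform(X_demo)           # unsupervised: X only
X_lda = LDA(n_components=2).fit_transform(X_demo, y_demo)   # supervised: X and y
print(X_pca.shape, X_lda.shape)       # both (100, 2)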
In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import seaborn as sns
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
# Import the dimensionality reduction methods
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA
#from sklearn.manifold import TSNE  # optional, not used in this notebook
We use the popular MNIST (Mixed National Institute of Standards and Technology) computer vision dataset. It contains a series of images of handwritten digits, each of 28x28 pixels; see a few examples below.
The datasets are large, please download them from: https://pjreddie.com/projects/mnist-in-csv/
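If you prefer not to download the CSV files, a similar copy of MNIST can be pulled directly with scikit-learn's fetch_openml. This is only a convenience alternative; the rest of the notebook assumes the CSV files above.
In [ ]:
# Alternative (sketch): load MNIST from OpenML instead of the CSV files. Not used below.
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X_all, y_all = mnist.data, mnist.target.astype(int)
print(X_all.shape, y_all.shape)   # (70000, 784) (70000,)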
In [2]:
# the CSV has no header row, so tell pandas not to consume the first image as one
train = pd.read_csv('./datasets/mnist_train.csv', header=None).head(2000)
# keep only the first 2000 rows to make things fast
columns = []
columns.append("label")
for ii in range(784):
    columns.append("pixel" + str(ii + 1))
train.columns = columns
print("Shape of train dataset: " + str(train.shape))
train.head()
Out[2]:
The full MNIST training set contains 60,000 rows; here we keep only the first 2,000 to speed things up. Each row has 785 columns: the 28 x 28 pixel values of a digit image (784 columns) plus one label column, the class label stating which digit (0 to 9) the row represents. A few pictures are shown below.
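A quick sanity check on the subsample (a short sketch; the label column should contain all ten digit classes):
In [ ]:
# Sketch: shape and class balance of the 2000-row subsample.
print(train.shape)                                   # (2000, 785)
print(train['label'].value_counts().sort_index())    # samples per digit 0-9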
In [3]:
# Copy the features and target columns to different arrays:
y_train= train['label']
# Drop the label feature
X_train = train.drop("label",axis=1)
# plot some of the digits
plt.figure(figsize=(7, 7))
for digit_num in range(0, 6):
    plt.subplot(2, 3, digit_num + 1)
    grid_data = X_train.iloc[digit_num, :].values.reshape(28, 28)  # reshape from 1d to 2d pixel array
    plt.imshow(grid_data, interpolation="none", cmap="afmhot")
    plt.xticks([])
    plt.yticks([])
plt.tight_layout()
Now we proceed to reduce the dimensionality of our dataset.
Three methods are considered, selected with the method variable below:
method=1: LDA alone.
method=2: PCA alone.
method=3: LDA followed by PCA.
The third method is sometimes used in practice, but here its results are comparable to using LDA alone. In this example I found that PCA combined with polynomial features and a linear SVM yields the best results. A Pipeline version of method 3 is sketched below for reference.
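Method 3 (LDA followed by PCA) can also be written as a scikit-learn Pipeline. This sketch is equivalent in spirit to the manual two-step code in the next cell; it is not what is executed below.
In [ ]:
# Sketch: method 3 (LDA -> PCA) expressed as a Pipeline.
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA

lda_then_pca = Pipeline([
    ('lda', LDA()),                 # keeps at most n_classes - 1 = 9 components
    ('pca', PCA(n_components=5)),   # then keep 5 principal components
])
# X_chain = lda_then_pca.fit_transform(X_train.values, y_train.values)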
In [4]:
# Note that n_components for LDA must be at most n_classes - 1 = 9
method = 2  # PCA, see above
if method == 1:
    dimensionality_reduction_method = "lda"
    n_components = 9
    reduction_method = LDA(n_components=n_components)
elif method == 2:
    dimensionality_reduction_method = "pca"
    n_components = 14
    reduction_method = PCA(n_components=n_components)
elif method == 3:
    dimensionality_reduction_method = "lda"
    n_components = 9  # LDA() keeps at most n_classes - 1 components by default
    reduction_method = LDA()
print("Reducing dimensionality to %d components\n" % n_components)
print(X_train.shape)
# Fit takes the target labels as the second argument (used by LDA, ignored by PCA)
reduction_method = reduction_method.fit(X_train.values, y_train.values)
X_train_red = reduction_method.transform(X_train.values)
print(X_train_red.shape)
if method == 3:
    dimensionality_reduction_method = "pca"
    n_components = 5
    reduction_method2 = PCA(n_components=n_components)
    # PCA ignores the labels; they are passed only for a uniform interface
    reduction_method2 = reduction_method2.fit(X_train_red, y_train.values)
    X_train_red1 = reduction_method2.transform(X_train_red)
    X_train_red = X_train_red1
    del X_train_red1
    print(X_train_red.shape)
In [5]:
# Using the Plotly library again
traceDIM = go.Scatter(
    x = X_train_red[:, 0],
    y = X_train_red[:, 1],
    mode = 'markers',
    text = y_train,           # digit label shown on hover
    showlegend = True,
    marker = dict(
        size = 8,
        color = y_train,      # color the points by digit class
        colorscale = 'Jet',
        showscale = False,
        line = dict(
            width = 2,
            color = 'rgb(255, 255, 255)'
        ),
        opacity = 0.8
    )
)
data = [traceDIM]
if dimensionality_reduction_method == "lda":
    title = 'Linear Discriminant Analysis (LDA)'
elif dimensionality_reduction_method == "pca":
    title = 'Principal Component Analysis (PCA)'
elif dimensionality_reduction_method == "tsne":
    title = 'TSNE (T-Distributed Stochastic Neighbour Embedding)'
layout = go.Layout(
    title = title,
    hovermode = 'closest',
    xaxis = dict(
        title = 'First component',
        ticklen = 5,
        zeroline = False,
        gridwidth = 2,
    ),
    yaxis = dict(
        title = 'Second component',
        ticklen = 5,
        gridwidth = 2,
    ),
    showlegend = False
)
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='styled-scatter')
The above plot shows how well (or not) the chosen method separates the classes. LDA is built to maximize class separation, while PCA maximizes the overall variance. Here I chose PCA, which does not give such clearly separated classes; to see the same plot with LDA, set method=1 above.
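One way to put a number on the visual impression (a sketch, using scikit-learn's silhouette_score on the two plotted components):
In [ ]:
# Sketch: quantify class separation in the plotted 2-D projection.
from sklearn.metrics import silhouette_score

# Higher values indicate better-separated classes; LDA projections typically
# score higher here than PCA projections of the same data.
print(silhouette_score(X_train_red[:, :2], y_train))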
In [6]:
# again, the CSV has no header row
test = pd.read_csv('./datasets/mnist_test.csv', header=None)
test.columns = columns
test.head()
Out[6]:
In [7]:
# save the labels to a Pandas series target
y_test = test['label']
# Drop the label feature
X_test = test.drop("label",axis=1)
Now we predict on the test set. Note that the LDA implementation in sklearn is itself a classifier and provides a predict method, which we try below; for the other methods we skip this step.
Remember the precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
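As a tiny worked example of that definition (toy binary labels, not MNIST data; the manual count in the comment matches what precision_score returns):
In [ ]:
# Sketch: precision on toy binary labels.
from sklearn.metrics import precision_score

y_true_toy = [1, 1, 0, 1, 0, 0]
y_pred_toy = [1, 0, 0, 1, 1, 0]
# tp = 2 (positions 0 and 3), fp = 1 (position 4): precision = 2 / (2 + 1) = 0.667
print(precision_score(y_true_toy, y_pred_toy))
With average='micro', as used below, the true and false positives are summed over all ten digit classes before taking the ratio.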
In [8]:
# Precision on the test set (LDA only)
from sklearn.metrics import precision_score
# Only LDA has a predict method:
if dimensionality_reduction_method == "lda":
    y_pred = reduction_method.predict(X_test.values)
    # Precision score for the test dataset:
    print("Precision score for test dataset: \n")
    print(precision_score(y_test, y_pred, average='micro'))
In [9]:
X_test_red = reduction_method.transform(X_test.values)
if method == 3:
    X_test_red1 = reduction_method2.transform(X_test_red)
    X_test_red = X_test_red1
    del X_test_red1
Now let's compare this performance to that of SVM classifiers: a linear SVM on the reduced features, and a linear SVM on polynomial features.
In [10]:
from sklearn import svm
from sklearn.metrics import classification_report
estimator = svm.LinearSVC(C=1.0)
estimator.fit(X_train_red, y_train)
y_pred = estimator.predict(X_test_red)
# Precision score for the test dataset:
print("Precision score for test dataset: \n")
precision_score(y_test, y_pred, average='micro')
Out[10]:
In [11]:
# Polynomial feature expansion before the linear SVM
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3)
X_train_poly = poly.fit_transform(X_train_red)
X_test_poly = poly.transform(X_test_red)   # reuse the transform fitted on the training set
estimator.fit(X_train_poly, y_train)
y_pred = estimator.predict(X_test_poly)
# Precision score for the test dataset:
print("Precision score for test dataset: \n")
precision_score(y_test, y_pred, average='micro')
Out[11]:
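For a per-class breakdown rather than a single number, the classification_report imported above can be applied to the same predictions (a short sketch):
In [ ]:
# Sketch: per-class precision, recall and F1 for the polynomial-feature SVM predictions.
print(classification_report(y_test, y_pred))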
For LDA: the linear SVM performs best, with a precision score of ~0.7.
For PCA: the linear SVM performs poorly, with a precision score of ~0.5. The accuracy is improved by adding polynomial features; for degree=5 the precision score is ~0.9. For degrees higher than 5 the accuracy decreases due to overfitting.
PCA on top of LDA performs about as well as LDA alone.
Note that these results are particular to this example.
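The degree comparison in the second point can be checked with a small sweep (a sketch; high degrees on 14 PCA components generate many features and can be slow):
In [ ]:
# Sketch: sweep the polynomial degree and report the test precision for each.
from sklearn.preprocessing import PolynomialFeatures
from sklearn import svm
from sklearn.metrics import precision_score

for degree in (1, 2, 3):   # extend to higher degrees if runtime allows
    poly = PolynomialFeatures(degree=degree)
    clf = svm.LinearSVC(C=1.0)
    clf.fit(poly.fit_transform(X_train_red), y_train)
    y_hat = clf.predict(poly.transform(X_test_red))
    print(degree, precision_score(y_test, y_hat, average='micro'))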
In [ ]: