Introduction

Let's use dimensionality reduction in a real example. We will examine the MNIST database of handwritten digits, where each image has a large number of features (28x28 = 784 pixels). We first reduce the dimensionality of the dataset and then use our preferred algorithm to classify the digits.

We will see how different dimensionality reduction methods work in practice.

Linear Discriminant Analysis (LDA): a supervised learning algorithm that takes the class labels into account and maximizes the separation between classes.

Principal Component Analysis (PCA): an unsupervised learning algorithm, i.e., it ignores the class labels; its goal is to find the principal components that maximize the variance.

For more information on LDA and PCA see Sebastian Raschka's article. A minimal sketch of the difference is shown below.
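As a quick illustration of this difference (using a tiny made-up array, not the MNIST data loaded below), note that LDA needs the labels in fit while PCA ignores them:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA

# Tiny made-up dataset: 6 samples, 3 features, 2 classes (illustration only)
X = np.array([[1.0, 2.0, 0.5],
              [1.1, 1.9, 0.4],
              [0.9, 2.1, 0.6],
              [3.0, 0.5, 2.0],
              [3.1, 0.4, 2.1],
              [2.9, 0.6, 1.9]])
y = np.array([0, 0, 0, 1, 1, 1])

# LDA is supervised: it uses y, and allows at most n_classes - 1 components.
X_lda = LDA(n_components=1).fit_transform(X, y)

# PCA is unsupervised: y is ignored; components maximize total variance.
X_pca = PCA(n_components=2).fit_transform(X)

print(X_lda.shape, X_pca.shape)  # (6, 1) (6, 2)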


In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import seaborn as sns
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

# Import the 3 dimensionality reduction methods
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA
#from sklearn.manifold import TSNE


MNIST Dataset

We choose the popular MNIST (Mixed National Institute of Standards and Technology) computer vision digit dataset. It contains a series of images of handwritten digits, each of 28x28 pixels; see a few examples below.

The datasets are large, please download them from: https://pjreddie.com/projects/mnist-in-csv/
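If you prefer to fetch the files from within Python, a sketch like the following can be used (the exact file URLs under pjreddie.com/media/files/ are an assumption based on that project page; verify and adjust the names if they differ):

import os
import urllib.request

# Assumed file locations on the project page; check them before running.
base_url = "https://pjreddie.com/media/files/"
os.makedirs("./datasets", exist_ok=True)
for name in ("mnist_train.csv", "mnist_test.csv"):
    dest = os.path.join("./datasets", name)
    if not os.path.exists(dest):
        urllib.request.urlretrieve(base_url + name, dest)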


In [2]:
train = pd.read_csv('./datasets/mnist_train.csv').head(2000) 
# keep only the first 2000 rows to make things fast

columns=[]
columns.append("label")
for ii in range(784):
    columns.append("pixel"+str(ii+1))
train.columns=columns
    
print("Shape of train dataset: "+str(train.shape))
train.head()


Shape of train dataset: (2000, 785)
Out[2]:
label pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 ... pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783 pixel784
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 4 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 9 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 2 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 785 columns

The full MNIST training set consists of 60,000 rows and 785 columns; here we keep only the first 2,000 rows. Each row is a 28 x 28 pixel image of a digit (contributing 784 pixel columns) plus one extra label column, a class label stating which digit (0 through 9) the row represents. A few examples are plotted below.


In [3]:
# Copy the features and target columns to different arrays: 
y_train= train['label']
# Drop the label feature
X_train = train.drop("label",axis=1)

# plot some of the numbers
plt.figure(figsize=(7,7))
for digit_num in range(0,6):
    plt.subplot(2,3,digit_num+1)
    grid_data = X_train.iloc[digit_num,:].values.reshape(28,28)  # reshape the flat 784-pixel row into a 2d 28x28 array
    plt.imshow(grid_data, interpolation = "none", cmap = "afmhot")
    plt.xticks([])
    plt.yticks([])
plt.tight_layout()


Now we proceed to reduce the dimensionality of our dataset.

Several methods are proposed:

  1. LDA
  2. PCA
  3. LDA followed by PCA

I found that the third method is used in practice, but in this case its results are comparable to doing just LDA. In this example, PCA combined with polynomial features and a linear SVM yields the best results.
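As a side note, the LDA-followed-by-PCA chain (method 3 below) can also be written with scikit-learn's Pipeline, which keeps the two fitted transformers in a single object. This is only an equivalent sketch of what the next cell does by hand, not what the rest of the notebook uses:

from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA

# LDA (at most n_classes - 1 = 9 components for the 10 digits),
# then PCA down to 5 components, matching method 3 in the next cell.
lda_then_pca = Pipeline([
    ("lda", LDA(n_components=9)),
    ("pca", PCA(n_components=5)),
])
# X_train_red = lda_then_pca.fit_transform(X_train.values, y_train.values)
# X_test_red  = lda_then_pca.transform(X_test.values)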


In [4]:
#Note that n_components for LDA must be smaller than n_classes, i.e. at most 9 for the 10 digit classes

method=2  # PCA; see the discussion above

if ( method == 1 ):
    dimensionality_reduction_method="lda"
    n_components=9
    reduction_method = LDA(n_components=n_components)
elif ( method == 2 ):
    dimensionality_reduction_method="pca"
    n_components=14
    reduction_method = PCA(n_components=n_components)
elif ( method == 3 ):
    dimensionality_reduction_method="lda"
    n_components=9   # LDA keeps at most n_classes - 1 = 9 components
    reduction_method = LDA()


print ( "Reducing dimensionality to %d components\n" %(n_components))
    
print(X_train.shape)

#del X_train_red
# Taking in as second argument the Target as labels
reduction_method = reduction_method.fit(X_train.values, y_train.values )
X_train_red = reduction_method.transform(X_train.values)
print(X_train_red.shape)

if ( method == 3 ):
    dimensionality_reduction_method="pca"
    n_components=5
    reduction_method2 = PCA(n_components=n_components)

# Taking in as second argument the Target as labels
    reduction_method2 = reduction_method2.fit(X_train_red, y_train.values )
    X_train_red1 = reduction_method2.transform(X_train_red)
    X_train_red = X_train_red1
    del X_train_red1
    print(X_train_red.shape)


Reducing dimensionality to 14 components

(2000, 784)
(2000, 14)

In [5]:
from IPython.display import display, Math, Latex

# Using the Plotly library again
traceDIM = go.Scatter(
    x = X_train_red[:,0],
    y = X_train_red[:,1],
    name = 'digits',
    mode = 'markers',
    text = y_train,
    showlegend = True,
    marker = dict(
        size = 8,
        color = y_train,
        colorscale ='Jet',
        showscale = False,
        line = dict(
            width = 2,
            color = 'rgb(255, 255, 255)'
        ),
        opacity = 0.8
    )
)
data = [traceDIM]

layout = go.Layout(
#    title= title,
    hovermode= 'closest',
    xaxis= dict(
        title= 'First Component',
        ticklen= 5,
        zeroline= False,
        gridwidth= 2,
    ),
    yaxis=dict(
        title= 'Second Component',
        ticklen= 5,
        gridwidth= 2,
    ),
    showlegend= False
)

if ( dimensionality_reduction_method == "lda"):
    title = 'Linear Discriminant Analysis (LDA)'
elif ( dimensionality_reduction_method == "pca" ):
    title = 'Principal Component Analysis (PCA)'
elif ( dimensionality_reduction_method == "tsne" ):
    title = 'TSNE (T-Distributed Stochastic Neighbour Embedding)'

layout.title=title

fig = dict(data=data, layout=layout )
py.iplot(fig, filename='styled-scatter')


The above plot shows whether or not the method results in well-separated classes. LDA is designed to maximize the class separation, while PCA maximizes the overall variance. Here I chose PCA, which results in classes that are not so clearly separated; to see this plot with LDA, set method=1 above.
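For PCA we can also quantify how much of the total variance the retained components capture, via the explained_variance_ratio_ attribute of the fitted estimator (only meaningful here when method=2, so that reduction_method is a PCA instance):

# Fraction of the total variance captured by each principal component.
if method == 2:
    print(reduction_method.explained_variance_ratio_)
    print("Total variance explained by %d components: %.3f"
          % (n_components, reduction_method.explained_variance_ratio_.sum()))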


In [6]:
test = pd.read_csv('./datasets/mnist_test.csv')
test.columns=columns
               
test.head()


Out[6]:
label pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 ... pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783 pixel784
0 2 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 4 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 785 columns


In [7]:
# save the labels to a Pandas series target
y_test = test['label']
# Drop the label feature
X_test = test.drop("label",axis=1)

Now we predict on the test set. Note that the LDA implementation in sklearn is also a classifier with a predict method, which we try below; if the reduction method is not LDA we skip this step.

Remember that precision is the ratio tp / (tp + fp), where tp is the number of true positives and fp the number of false positives. Precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
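As a tiny worked example with made-up labels (not the MNIST predictions): with micro averaging the per-class counts are pooled, so for a multi-class problem the micro-averaged precision equals the overall accuracy.

from sklearn.metrics import precision_score

# Made-up multi-class labels: 4 of the 6 predictions are correct.
y_true_toy = [0, 1, 2, 2, 1, 0]
y_pred_toy = [0, 1, 2, 1, 1, 1]

# Pooled counts: tp = 4, fp = 2, so precision = 4 / (4 + 2) = 0.667,
# which coincides with the accuracy in this multi-class case.
print(precision_score(y_true_toy, y_pred_toy, average='micro'))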


In [8]:
#Total of true positives:
from sklearn.metrics import precision_score

#Only LDA has the prediction method:
if ( dimensionality_reduction_method == "lda"):
    y_pred = reduction_method.predict(X_test.values) 

#Precision score for test dataset:
    print("Precision score for test dataset: \n")
    precision_score(y_test, y_pred, average='micro')

In [9]:
# Apply the fitted dimensionality reduction to the test set
X_test_red = reduction_method.transform(X_test.values) 

if ( method == 3 ):
    X_test_red1 = reduction_method2.transform(X_test_red) 
    X_test_red=X_test_red1
    del X_test_red1

Now let's compare this performance to that of SVM classifiers: a linear SVM, and a linear SVM on polynomial features.


In [10]:
from sklearn import svm

from sklearn.metrics import classification_report

estimator = svm.LinearSVC(C=1.0)

estimator.fit(X_train_red, y_train) 

y_pred = estimator.predict(X_test_red)

ntest=y_pred.size


#Precision score for test dataset:
print("Precision score for test dataset: \n")
precision_score(y_test, y_pred, average='micro')


Precision score for test dataset: 

Out[10]:
0.34233423342334235

In [11]:
# Import
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=3)
X_train_poly = poly.fit_transform(X_train_red)
X_test_poly = poly.transform(X_test_red)  # reuse the transformer fitted on the training data

estimator.fit(X_train_poly, y_train)
y_pred = estimator.predict(X_test_poly)

#Precision score for test dataset:
print("Precision score for test dataset: \n")
precision_score(y_test, y_pred, average='micro')


Precision score for test dataset: 

Out[11]:
0.87278727872787276

Results:

For LDA: the linear SVM performs best, with a precision score of ~0.7.

For PCA: the linear SVM performs poorly, with a precision score of ~0.5. The score improves with polynomial features; for degree=5 the precision score is ~0.9. For degrees higher than 5 the score decreases due to overfitting.

PCA on top of LDA performs about as well as LDA alone.

Note that these results are particular to this example.
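To check the degree dependence yourself, a sweep like the following can be run (it refits the linear SVM on polynomial features of increasing degree; the exact scores will vary with the sample size, the reduction method, and the solver, and higher degrees take noticeably longer):

from sklearn import svm
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import precision_score

# Sweep polynomial degrees to see where the precision peaks and where
# overfitting starts to hurt, using the reduced train/test sets from above.
for degree in (1, 2, 3, 4, 5):
    poly = PolynomialFeatures(degree=degree)
    Xtr = poly.fit_transform(X_train_red)
    Xte = poly.transform(X_test_red)
    clf = svm.LinearSVC(C=1.0)
    clf.fit(Xtr, y_train)
    score = precision_score(y_test, clf.predict(Xte), average='micro')
    print("degree=%d  precision=%.3f" % (degree, score))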

