Classification tools in MLlib

In this notebook we will learn to use the different classification tools available in MLlib. Furthermore, we will extend some of these tools to a multiclass classification scenario.

Throughout this lab session we will use the MNIST dataset, a benchmark widely used in machine learning for testing classification algorithms. The dataset has 60,000 training patterns plus an additional 10,000 observations for testing, and each pattern corresponds to a digit image described by 780 pixel values. The goal is to automatically classify a new image as one of the ten possible digits.

The outline of this notebook is:

1. Data reading and preprocessing
   1. Read data
   2. Data analysis
   3. Data normalization
   4. Split data for training, validation, and testing

2. Model training. Here, we will start by analyzing multiclass classification approaches:
   1. Decision trees
   2. Random Forest

3. Evaluating the model performance over a test dataset

4. Selecting the model parameters by cross-validation

5. Creating a multiclass classifier from a binary Support Vector Machine

6. Interpretability analysis: feature selection approaches

1. Data reading and preprocessing

Read data

Start by downloading the MNIST dataset from:

https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#mnist

After completing this notebook, you can analyze the scalability of the different approaches over a larger dataset by using the large version of the MNIST dataset (with 8,100,000 patterns):

https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#mnist8m

The standard method to load a text file is sc.textFile("file_name"), which automatically creates an RDD with as many elements as there are lines in the data file. Complete and run the following cell to analyze:

  1. The number of lines in the file
  2. The content of a line

How could we process this file to transform each line into a data structure that is easy to handle?


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

# You need to include the mnist file in your working directory
lines = sc.textFile("mnist")
# Examine dataset format
# 1. Number of lines
n_lines = #FILL 
print 'Number of lines: ' + str(n_lines)
# 2. Content of the first line
line = # FILL 
print 'A line content:'
print line
# 3. Data type of a line
type_data = # FILL 
print 'Data type:'
print type_data

In [ ]:
###########################################################
# TEST CELL
###########################################################

from test_helper import Test

Test.assertEquals(n_lines, 60000, 'incorrect result: number of file lines is incorrect')
Test.assertEquals(line[:10], '5 153:3 15', 'incorrect result: first line is incorrect')

MLlib includes different data types (see http://spark.apache.org/docs/latest/mllib-data-types.html), such as local vectors, labeled points, local matrices or distributed matrices. In supervised learning algorithms, the default data type is the “labeled point” (see http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point).

So we need to transform our data file into an RDD of LabeledPoint elements. Both the features and the label fields of a LabeledPoint are of type Double; however, the input dataset stores both the features and the label as strings. So, as you can imagine, the process of transforming the input data into numeric values could be a bit tedious...

However, this data file follows a specific format, known as the LIBSVM text file format, where each line is:

label index1:value1 index2:value2 ...

That is, each line represents a labeled sparse feature vector.
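
For instance, one could parse such a line by hand. The following is a minimal sketch (the helper parse_libsvm_line is hypothetical, not part of MLlib), assuming the usual LIBSVM convention of 1-based feature indices:

from pyspark.mllib.linalg import SparseVector

def parse_libsvm_line(line, num_features=784):
    # e.g. "5 153:3 154:18 ..." -> label 5.0 plus a sparse vector of pixel values
    parts = line.split()
    label = float(parts[0])
    indices = [int(p.split(':')[0]) - 1 for p in parts[1:]]  # LIBSVM indices start at 1
    values = [float(p.split(':')[1]) for p in parts[1:]]
    return (label, SparseVector(num_features, indices, values))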

Fortunately, MLlib already includes a specific function:

loadLibSVMFile(sc, path, numFeatures=-1, minPartitions=None, multiclass=None) 

which directly reads a data file in this format and returns an RDD of LabeledPoint elements.

Run the following cell to load the MNIST data as LabeledPoint elements.

Note: we have explicitly defined the number of features to make sure that the sparse feature vectors are created with all 784 dimensions.


In [ ]:
from pyspark.mllib.util import MLUtils
data = MLUtils.loadLibSVMFile(sc, "mnist", numFeatures= 784)

Complete the following code to examine the content of the RDD data:

1. Get the first pattern
2. Extract its label 
3. Extract its features

Note: take into account that the LabeledPoint type includes its own attributes to extract the label and the features.
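
For reference, here is a tiny standalone illustration of those attributes on a made-up LabeledPoint (the values are arbitrary):

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import SparseVector

lp = LabeledPoint(5.0, SparseVector(4, [1, 3], [2.0, 7.0]))
print(lp.label)      # 5.0
print(lp.features)   # (4,[1,3],[2.0,7.0])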


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

# 1. Get the first pattern
dat = data.first()
# 2. Extract its label 
label = # <FILL IN>
print label
# 3. Extract its features
features = # <FILL IN>
print features

In [ ]:
###########################################################
# TEST CELL
###########################################################

from test_helper import Test

Test.assertEquals(label, 5, 'incorrect result: label is incorrect')
Test.assertEquals(sum(features), 27525, 'features are incorrect')

Data analysis

To analyze the data, let's start by plotting some of the digit images. For this purpose, you can use the code provided in the following cell.


In [ ]:
import matplotlib.pyplot as plt
from pyspark.mllib.linalg import Vectors

def plot_data(images, h, w, n_row=1, n_col=10):
    """Plots the set of images provided in images

    Args:
        images (list of sparse vectors or numpy arrays): list of images where each image contains the 
            features corresponding to the pixels of an image.  
        h: height of the image (in number of pixels).
        w: width of the image (in number of pixels).
        n_row: Number of rows to use when plotting all the images
        n_col: Number of columns to use when plotting all the images

    """
    plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    
    for i in range(len(images)):
        plt.subplot(n_row, n_col, i + 1)      
        try:
            img = images[i].toArray()
        except:
            img = images[i]
            
        plt.imshow(img.reshape((h, w)), cmap=plt.cm.jet)
        plt.xticks(())
        plt.yticks(())

In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

# Define the height and width of the images
h = 28
w = 28

# From the data RDD, create a new RDD where each element only has the features (pixel values) of each image
features = #FILL

# Pick up 10 images and plot them with plot_data() function
images= #FILL
plot_data(images, h, w)

Now, let's compute some statistics of the data set.

For this purpose, we can use the Statistics MLlib library, which lets us compute (in a distributed manner) some statistical parameters of the features, such as the mean, the variance, the maximum and minimum values, or the number of times that a pixel is not zero.

Complete the following cell to compute all the above-mentioned statistics.

Note: Use the Statistics.colStats() function.
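
As a reference, a minimal toy usage of colStats() (on made-up two-dimensional data, so the printed values below are only indicative) could be:

from pyspark.mllib.stat import Statistics
import numpy as np

toy = sc.parallelize([np.array([1.0, 0.0]), np.array([3.0, 4.0])])
toy_stats = Statistics.colStats(toy)
print(toy_stats.mean())         # approx. [ 2.  2.]
print(toy_stats.variance())     # approx. [ 2.  8.]
print(toy_stats.max())          # [ 3.  4.]
print(toy_stats.min())          # [ 1.  0.]
print(toy_stats.numNonzeros())  # [ 2.  1.]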


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

from pyspark.mllib.stat import Statistics

# Compute summary statistics using the rdd of features as input.
stats = # FILL

# Extract the desired statistics
mean = # FILL
variance = #FILL
maximum = # FILL
minimum = # FILL
numNonzeros = # FILL

# Use the plot_data function to plot them
statistics = [mean, variance, maximum, minimum, numNonzeros]
plot_data(statistics, h, w)

In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test

Test.assertEquals(np.round(np.sum(mean.ravel()),0), 26122, 'incorrect result: mean is incorrect')
Test.assertEquals(np.round(np.sum(variance.ravel()),0), 3428503, 'incorrect result: variance is incorrect')
Test.assertEquals(np.round(np.sum(maximum.ravel()),0), 172093, 'incorrect result: maximum is incorrect')
Test.assertEquals(np.round(np.sum(minimum.ravel()),0), 0, 'incorrect result: minimum is incorrect')
Test.assertEquals(np.round(np.sum(numNonzeros.ravel()),0), 8994156, 'incorrect result: numNonzeros is incorrect')

Data normalization

Now, let's normalize the data. Usually, data normalization consists of two steps:

  1. Remove the mean of each feature
  2. Rescale each feature so that it has unit standard deviation

Since we are working with sparse data (most of the input features are zero), removing the mean would turn the zero entries into non-zero values, which would increase the size (in memory) of the dataset. To avoid this, here we are only going to rescale the data.

Complete the following cell to rescale the training data by making use of the StandardScaler method of MLlib (http://spark.apache.org/docs/latest/mllib-feature-extraction.html#standardscaler).

Note 1: so far, we have only loaded the training data, so use all the patterns in the data variable to fit the scaler.

Note 2: the StandardScaler method has two input parameters, 'withMean' and 'withStd', which let you select, respectively, whether the mean and the standard deviation are corrected or not. By default, 'withMean' is set to False and 'withStd' to True, that is, only the standard deviation is corrected.
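
As an illustration, a minimal toy usage of StandardScaler (on made-up dense vectors) might look like this; note that fit() learns the scaling parameters and transform() applies them:

from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.linalg import Vectors

toy_feats = sc.parallelize([Vectors.dense([1.0, 10.0]), Vectors.dense([3.0, 30.0])])
toy_scaler = StandardScaler(withMean=False, withStd=True).fit(toy_feats)
print(toy_scaler.transform(toy_feats).collect())  # each column now has unit standard deviation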


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint

# Create two new RDD by extracting the labels and features of the data
label = # FILL IN
features = # FILL IN

# Define the StandardScaler() object and fit it with the data features 
scaler = # FILL IN

# Normalize the data features
features_norm = # FILL IN

# Create a new RDD of LabeledPoint data using the normalized features
# 1. Construct an RDD of tuples (label, features): check the zip() method of RDD objects
data_norm = # FILL IN
# 2. Create the LabeledPoint RDD
data_LP =  # FILL IN

print data_LP.first()

In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test

Test.assertEquals(np.round(sum(features_norm.first()),0), 319, 'incorrect result: normalized features are incorrect')
Test.assertEquals(data_LP.first().label, 5, 'incorrect result: normalized LabeledPoint data are incorrect')
Test.assertEquals(np.round(sum(data_LP.first().features),0), 319, 'incorrect result: normalized LabeledPoint data are incorrect')

Create training, validation and test partitions

In this subsection, let’s split the normalized dataset into training and validation data. We will use 40% of the data for training a model and 60% for validating the hyperparameters of the different learning algorithms. To save computational time, cache both the normalized training and validation RDDs, since we will use them several times.

You can use the randomSplit() method for this purpose.

Note: when you call randomSplit, please set seed=0 for comparison purposes: randomSplit([....], seed=0)
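
For reference, a minimal toy usage of randomSplit() on made-up data could be:

toy_rdd = sc.parallelize(range(10))
part_a, part_b = toy_rdd.randomSplit([0.4, 0.6], seed=0)
print(part_a.count())  # roughly 40% of the elements
print(part_b.count())  # roughly 60% of the elements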


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
# Create training and validation partitions
(trainingData, valData) = # FILL

# Our learning algorithms will make several passes over these datasets, so let’s cache these RDDs in memory
trainingData.cache()
valData.cache()

In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test

Test.assertEquals(np.round(sum(trainingData.first().features),0), 355, 'incorrect result: training data are incorrect')
Test.assertEquals(np.round(sum(valData.first().features),0), 319, 'incorrect result: validation data are incorrect')

Finally, let's load the test data from the "mnist.t" file. To be able to use it for testing purposes, we will also have to normalize it using the normalization parameters learned from the training data.


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

# Load test data
data = MLUtils.loadLibSVMFile(sc, "mnist.t", numFeatures=784)

# Normalize test data:

# 1. Create two new RDD by extracting the labels and features of the data
label = # FILL IN
features = # FILL IN

# 2. Normalize the data features (use the scaler object fitted with the training data)
features_norm = # FILL IN

# 3. Create a new RDD of LabeledPoint data using the normalized features and cache it!
# 3.1 RDD with tuples (label, features)
test = # FILL IN
# 3.2 RDD with LabeledPoint data
testData = # FILL IN

testData.cache()
testData.first()

In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test

Test.assertEquals(np.round(sum(testData.first().features),0), 199, 'incorrect result: test data are incorrect')

2. Model training

MLlib includes several classification methods, but most of them are only implemented for solving binary problems. Since we intend to solve a multiclass problem, let's start with the classifiers that have multiclass implementations:

  • Decision trees
  • Random Forest

Decision trees

As we already know, a decision tree works by selecting the most discriminative features and setting thresholds over them, in such a way that each tree node splits the training data into subsets, aiming at node purity (each data partition belongs to a single class). For this purpose, a purity measure, such as the Gini index, is used to select both the most discriminative features and the thresholds to apply.
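
As a reminder, for a node whose samples belong to $K$ classes with proportions $p_1, \dots, p_K$, the Gini impurity is

$$\mathrm{Gini} = 1 - \sum_{k=1}^{K} p_k^2,$$

which is zero for a pure node and maximal when all classes are equally represented.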

Review http://spark.apache.org/docs/latest/mllib-decision-tree.html for implementation details.

The following cell contains the necessary code to train a DecisionTree, storing the resulting DecisionTreeModel in the variable model_tree. Note that all free parameters (such as the maximum depth of the tree) have been set to default values.


In [ ]:
from pyspark.mllib.tree import DecisionTree

#  Train a DecisionTree model
#  Empty categoricalFeaturesInfo indicates all features are continuous.
model_tree = DecisionTree.trainClassifier(trainingData, numClasses=10, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)

Random Forests

A Random Forest builds an ensemble of trees by training multiple trees in parallel (with different data and features) and combining their outputs. In the case of classification, the combination is carried out by a majority vote.

See https://spark.apache.org/docs/1.2.0/mllib-ensembles.html#random-forests for further details.

The next cell includes the code to train a Random Forest with some default parameters.


In [ ]:
from pyspark.mllib.tree import RandomForest, RandomForestModel

#  Train a RandomForest model
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model_RF = RandomForest.trainClassifier(trainingData, numClasses=10, categoricalFeaturesInfo={},
                                     numTrees=50, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

3. Evaluating the model performance over a test dataset

To be able to evaluate the performance of the different models that we have trained, let's create a function that, given an MLlib classification model and an RDD of LabeledPoint data, computes the classification error of the model over the given data. This function has to follow these steps:

  • Compute the model output over the data using the model.predict() method (MLlib classification models include this method). Note that this method only receives the data features as input (instead of the complete LabeledPoint).
  • Create an RDD with the same length as the data, containing tuples formed by the original label of each sample and its corresponding model output.
  • Compute the test error: the number of misclassified samples divided by the total number of samples (a toy sketch of this zip-and-count pattern is shown after this list).
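
As a reference, the zip-and-count pattern on two toy RDDs (made-up labels and predictions, unrelated to MNIST) could look like:

toy_labels = sc.parallelize([0.0, 1.0, 1.0, 0.0])
toy_preds = sc.parallelize([0.0, 1.0, 0.0, 0.0])
pairs = toy_labels.zip(toy_preds)
toy_err = pairs.filter(lambda lp: lp[0] != lp[1]).count() / float(pairs.count())
print(toy_err)  # 0.25, i.e., one mismatch out of four samples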

In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

from pyspark.mllib.util import MLUtils

def compute_classifier_error(model, Data):

    """ Compute the classification error of the model over the samples given in Data.

    Args:
       model: MLlib classification model
       Data: an RDD with a dataset of LabeledPoint elements
    
    Returns:
        Float: A single value, between 0 and 1, indicating the classification error. 
        A value of 1 indicates that all the samples are misclassified (100% error)
        and a value of 0 that all the samples are correctly classified (100% accuracy).
    """

    # Evaluate model on test instances and compute test error
    # 1. Compute the model output
    predictions = # FILL IN
    # 2. Create an RDD of tuples (label, output)
    labelsAndPredictions = # FILL IN
    # 3. Compute test error
    testErr = # FILL IN

    return testErr

Use the function to compute the test error of the tree classifier and the Random Forest model.


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
# Test error of the decision tree
tree_testErr = # FILL IN
print('Tree test error = ' + str(tree_testErr))

# Test error of the random forest 
RF_testErr = # FILL IN
print('Random Forest test error = ' + str(RF_testErr))

In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test

Test.assertEquals(np.round(100*tree_testErr,0), 31, 'incorrect result: decision tree error is incorrect')
Test.assertEquals(np.round(100*RF_testErr,0), 22, 'incorrect result: RF error is incorrect')

4. Selecting the model parameters by cross-validation

Until now, both the tree and the RF model have used predefined parameter values. Here, we are going to adjust the free parameters of these models by cross-validation.

Decision trees: adjusting the tree depth

The most critical parameter of a decision tree is its maximum depth. Note that deeper trees are more expressive (potentially allowing higher accuracy), but they are also more costly to train and are more likely to overfit.

The following cell explores different tree depths and evaluates each of them over the validation data in order to select the optimum tree depth as the value that provides the minimum classification error over the validation data. Complete the missing code lines.


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

from pyspark.mllib.tree import DecisionTree
import numpy as np

# Initialize variables 
bestModel = # FILL IN
best_error = # FILL IN
best_depth = # FILL IN
# Range of depth values to explore
depth_params = [5, 10, 15]

for depth_value in depth_params:
    # Train a decision tree fixing maxDepth to depth_value 
    model_tree = DecisionTree.trainClassifier(trainingData, numClasses=10, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=#FILL IN, maxBins=32)
    
    # Compute the model error over the validation data (use compute_classifier_error function)
    tree_valErr = # FILL IN
    
    print 'Tree depth is ' + str(depth_value) + ' and the validation error is '+ str(tree_valErr) 
    
    # If the error has decreased, save the model, the optimum depth and the error in the bestModel, best_depth and best_error variables
    if (tree_valErr < best_error):
            bestModel = # FILL IN
            best_depth = # FILL IN
            best_error = # FILL IN
            
print 'Optimum tree depth: ' + str(best_depth)

Analyze the validation error behaviour with the tree depth. Is this the expected behaviour?

Taking into account the trade-off computational cost vs. accuracy, which tree depth will you select?


In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test

Test.assertEquals(best_depth, 15, 'incorrect result: best_depth is incorrect')
Test.assertEquals(np.round(100*best_error,0), 15, 'incorrect result: best_error is incorrect')

Now, evaluate the error of the selected model (bestModel) over the test data.


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

# Finally, evaluate the test error over the best model
tree_finalErr = # FILL IN

print 'Final test error of the validated tree: ' + str(tree_finalErr)

In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test

Test.assertEquals(np.round(100*tree_finalErr,0), 15, 'incorrect result: decision tree test error is incorrect')

Random Forest: adjusting the number of trees

In this case, the most critical parameter is the number of trees in the forest ('numTrees'). Note that increasing the number of trees will decrease the variance of the predictions, improving the generalization capability of the ensemble.

Complete the following cell to adjust this parameter by cross-validation, that is, by selecting the value that provides the minimum classification error over the validation data.

Note: it is also quite common to adjust the tree depth. However, in this exercise, for computational reasons, we are going to fix its value to 5.


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

from pyspark.mllib.tree import RandomForest
import numpy as np

# Initialize variables 
bestModel = # FILL IN
best_error =  # FILL IN
best_depth = # FILL IN
best_ntrees =  # FILL IN

# Range of values to explore
ntrees_params = [20, 50, 100] 

for ntrees_value in ntrees_params:
    # Train a RandomForest model.
    model_RF = RandomForest.trainClassifier(trainingData, numClasses=10, categoricalFeaturesInfo={},
                                     numTrees= # FILL IN, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=5, maxBins=32)
    # Compute the model error over the validation data (use compute_classifier_error function)
    RF_valErr =  # FILL IN

    print 'Number of trees is ' + str(ntrees_value) + ' and the validation error is '+ str(RF_valErr) 
    # Check if the error has improved. If it has, save the model, the optimum number of trees and the error in the bestModel, best_ntrees and best_error variables
    if (RF_valErr < best_error):
            bestModel =  # FILL IN
            best_ntrees  =  # FILL IN
            best_error =  # FILL IN

print 'Optimum number of trees: ' + str(best_ntrees)

Analyze the validation error behaviour with the number of trees. Is this the expected behaviour?

Taking into account the trade-off computational cost vs. accuracy, which number of trees will you select?


In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test

Test.assertEquals(best_ntrees, 100, 'incorrect result: best_ntrees is incorrect')
Test.assertEquals(np.round(100*best_error,0), 15, 'incorrect result: best_error is incorrect')

Now, evaluate the error of the selected model (bestModel) over the test data.


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

# Finally, evaluate the test error over the best model
tree_finalErr =  # FILL IN

print 'Final test error of the validated Random Forest: ' + str(tree_finalErr)

In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test

Test.assertEquals(np.round(100*tree_finalErr,0), 14, 'incorrect result: final test error is incorrect')

5. Creating a multiclass classifier from a binary Support Vector Machine

MLlib includes a distributed SVM implementation, but it is only available for binary problems (see the MLlib documentation at: http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-support-vector-machines-svms)

Since we are working with a multiclass problem, let's adapt this implementation so that it can be used in a 1 vs. all fashion and apply it to our multiclass problem.

Solving a single 1 vs. all problem

Let's start by considering a single 1 vs. all problem; for instance, suppose that we want to distinguish the digit '0' from the remaining digits.

Then, we will proceed as follows:

  1. Convert the training labels to the 1 vs. all scheme
  2. Train a binary SVM model with the new labels
  3. Evaluate the model

1. Convert the training labels to the 1 vs. all scheme

Create a convert_label() function to generate the labels associated with the 1 vs. all problem.


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

def convert_label(label, label1):
    """Produce a 1 vs. all label encoding, given a single label and the label to be assigned to class 1.

    Args:
        label (int, str): the label to be coded 
        label1 (int, str): the label to be included in the class 1.
    
    Returns:
        Int: A single value indicating the label (0 or 1) of the 1 vs. all problem.
    """
    # <FILL IN>: use as many code lines as you need

Let's select the digit '0' as class 1 and transform the labels of both the training and test data.


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

label_1 = 0
# Transform the labels of training data
trainingData_1vsall= # FILL IN
# Transform the labels of test data
testData_1vsall= # FILL IN

In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test

Test.assertEquals(np.round(sum(trainingData_1vsall.first().features),0), 355, 'incorrect result: trainingData_1vsall are incorrect')
Test.assertEquals(np.round(sum(testData_1vsall.first().features),0), 199, 'incorrect result: testData_1vsall are incorrect')

2. Train a binary SVM

Note: SVMWithSGD.train() has several parameters related to the SGD search and others associated with the SVM itself. The next cell only includes the latter in the training call, since we will have to adjust some of them throughout this notebook. The remaining parameters are set to their default values.


In [ ]:
from pyspark.mllib.classification import SVMWithSGD, SVMModel

model = SVMWithSGD.train(trainingData_1vsall, regParam=0.01, regType='l2', intercept=True)

3. Compute the test error of this model

Here, you can use the compute_classifier_error() function from the previous section.


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

error_1vsall = # FILL IN

print("Test error of the 1 vs. all model (to classify class 0) is: " + str(error_1vsall))

In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test

Test.assertEquals(np.round(100*error_1vsall,0), 2, 'incorrect result: error_1vsall is incorrect')

Implementing a 1 vs. all multiclass SVM

Starting from the previous functions, let's create a 1 vs. all multiclass SVM. For this purpose, we need to implement:

  1. A function to train the set of 1 vs. all SVMs
  2. A function to compute the output of the 1 vs. all set of classifiers
  3. A function to compute the error over a set of data

Note: The training function includes the model.clearThreshold() command to transform the model outputs from discrete values (labels) into real values (smooth or continuous scores). This will be necessary later on to combine the outputs of the different SVMs in a 1 vs. all fashion.
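
As a quick illustration of what clearThreshold() does (using the binary model trained above; the exact score you obtain will differ):

x = testData_1vsall.first().features
print(model.predict(x))   # 0.0 or 1.0: a hard label
model.clearThreshold()
print(model.predict(x))   # now a real-valued score (roughly, the signed distance to the decision boundary)
model.setThreshold(0.0)   # restore the default hard 0/1 outputs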

1. Let's create a training function to build all 1 vs. all SVMs


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

def train_1vsall_SVM(trainingData, regParam=0.01, regType='l2'):
    """Produce a list of SVM models solving all the 1 vs. all problems

    Args:
        trainingData (RDD of labeled points): the training data to adjust the model
        regParam: The regularizer parameter (default: 0.01).
        regType: The type of regularizer used for training our model (default: “l2”). Allowed values:
                “l1” for using L1 regularization
                “l2” for using L2 regularization
                None for no regularization
        
    Returns:
        List of SVM models: A list whose length is the number of classes, where each element is a tuple (lab, model).
            The variable lab indicates the label treated as class 1 in the 1 vs. all problem and model is the SVM model solving it
    """
    # Get all possible labels from training data
    labels = # FILL IN
    # Initialize the list of models to be returned
    list_models = []
    for lab in labels:
        # Convert labels of trainingData to 1 vs. all format (use convert_label function)
        trainingData_1vsall=# FILL IN
        # Train the SVM model with 1 vs. all data
        model = # FILL IN

        # Modify the model to get smooth outputs
        model.clearThreshold()
        # Create the tuple (lab, model) and add it to the list of models
        # FILL IN
        
    return list_models

2. Let's create a function to compute the output for a single sample


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

import numpy as np
def output_1vsall_SVM(data, list_models):
    """Compute the output of a list of 1 vs. all SVM models for a test sample

    Args:
        data (labeled point): data to be evaluated over the 1 vs. all SVM model
        list_models: a list of length number of classes where each element is a tuple (lab, model). Variable 
            lab indicates the label 1 of the 1 vs. all problem and model is the SVM model solving the problem
    Returns:
        Output_label: The label estimated by the 1 vs. all model for the data
    """
    # Split the tuples of list_models into a list of labels and a list of models (you can use zip() method)
    labels, models = # FILL IN
    
    outputs = []
    # For each model...
    for model in models:
        # Compute the output over the test data
        out = # FILL IN
        # Add this output to outputs list
        # FILL IN
        
    # Get the test output label as the label associated to the model with the maximum output value
    pos = # FILL IN
    Output_label = # FILL IN
    return Output_label

3. Let's create a function to compute the error over a set of data

Note that, although each individual SVM now produces a real-valued score, output_1vsall_SVM() returns a class label, so you can compute the number of errors simply as the number of samples whose predicted label differs from the true label.


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

from pyspark.mllib.util import MLUtils

def compute_1vsall_SVM_error(list_models, Data):

    """ Compute the classification error of the 1 vs. all SVM model over the samples given in Data.

    Args:
       list_models: a list whose length is the number of classes, where each element is a tuple (lab, model). The variable 
            lab indicates the label treated as class 1 in the 1 vs. all problem and model is the SVM model solving it
       Data: an RDD with a dataset of LabeledPoint elements
    
    Returns:
        Float: A single value, between 0 and 1, indicating the classification error. 
        A value of 1 indicates that all the samples are misclassified and a value 
        of 0 that all the samples are correctly classified.
    """

    # Evaluate model on test instances and compute test error
    # 1. Compute the model output 
    predictions = # FILL IN
    # 2. Create an RDD of tuples (label, output)
    labelsAndPredictions = # FILL IN
    # 3. Compute test error 
    testErr = # FILL IN

    return testErr

Finally, let's use the above functions to train the multiclass model and evaluate it over all the test data.


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

# Train the 1 vs. all SVM models (use the train_1vsall_SVM() function with default parameters)
multiclass_SVM = # FILL IN
# Compute the test error
error_1vsall = # FILL IN
print("Test Error of the 1 vs. all SVM = " + str(error_1vsall))

In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test

Test.assertEquals(np.round(100*error_1vsall,0), 12, 'incorrect result: test error is incorrect')

Cross validating the regularization parameter

To really assess the SVM performance, we should cross-validate the regularization parameter (the C value), since it is critical for obtaining good performance. Complete the following cell to adjust this value.


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

# Initialize variables 
bestModel = # FILL IN
best_error = # FILL IN
best_C = # FILL IN
# Range of C values to explore
C_params = [0.01, 0.1, 1]

for C_value in C_params:
    # Train the 1 vs. all SVM models (set regularization parameter to C_value)
    multiclass_SVM = # FILL IN
    
    # Compute the model error over the validation data 
    error_1vsall = # FILL IN
    
    print 'C value is ' + str(C_value) + ' and the validation error is '+ str(error_1vsall) 
    
    # If the error has decreased, save the model, the optimum C value and the error in the bestModel, best_C and best_error variables
    if (error_1vsall < best_error):
            bestModel = # FILL IN
            best_C = # FILL IN
            best_error = # FILL IN
            
print 'Optimum C value is: ' + str(best_C)

In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test

Test.assertEquals(best_C, 0.1, 'incorrect result: best_C is incorrect')
Test.assertEquals(np.round(100*best_error,0), 12, 'incorrect result: best_error is incorrect')

6. Interpretability analysis: feature selection approaches

The MNIST dataset has many features (pixels) that are useless for classification; for instance, some of them are constant over all the data, so they have no discriminatory capability.

In this last section, let's implement some simple (but efficient) distributed feature selection approaches, so that we can extract the most relevant features.

6.1 Remove features with zero variance

Let's start by removing all the pixels that are zero over all the images (background pixels), that is, those whose variance is zero. For this purpose, follow these steps:

  1. Compute the number of non-zero values of each pixel
  2. Select the variables to keep
  3. Remove the undesired features from the data (a helper function is provided)
  4. Evaluate the classification performance after removing the useless pixels.

1. Compute the number of non-zero values of each pixel

Use the Statistics.colStats() function to compute the number of non-zero values of each pixel (review the first section of this notebook).

Note: we will compute this over the training data and, later, apply the selection to the training, validation and test datasets.


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

from pyspark.mllib.stat import Statistics
features = trainingData.map(lambda x: x.features)

# Compute column summary statistics.
stats = # FILL IN
plot_data([stats.numNonzeros()], h, w)

2. Select variables to keep


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
idx_keep = # FILL IN
print('Number of selected features = ' + str(len(idx_keep)))

In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test

Test.assertEquals(sum(idx_keep), 281576, 'incorrect result: idx_keep is incorrect')

3. Remove the undesired features from the data

Use the following function to remove the useless features.


In [ ]:
from pyspark.mllib.linalg import SparseVector

def remove_features(all_features, idx_keep):
    """ From the all_features vector, it selects the features given in idx_keep and returns them as a
    SparseVector

    Args:
       all_features: SparseVector with the feature values
       idx_keep: indexes with the positions to keep
    
    Returns:
        SparseVector with the selected features
    """
    values = all_features.toArray()[idx_keep]
    val_nonzero = np.where(values>0)[0]
        
    return SparseVector(len(idx_keep), val_nonzero, values[val_nonzero])
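
As a quick sanity check of this helper on a single made-up vector (the numbers are arbitrary):

import numpy as np
toy_vec = SparseVector(5, [1, 4], [3.0, 7.0])
print(remove_features(toy_vec, np.array([0, 1, 2])))   # keeps indices 0-2; only index 1 is non-zero (value 3.0)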

In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

import numpy as np
from pyspark.mllib.regression import LabeledPoint

# Remove features from training data
trainingData_sel = # FILL IN
# Remove features from validation data
valData_sel = # FILL IN
# Remove features from test data
testData_sel = # FILL IN

In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test

Test.assertEquals(np.round(sum(trainingData_sel.first().features),0), 355, 'incorrect result: trainingData_sel is incorrect')
Test.assertEquals(np.round(sum(valData_sel.first().features),0), 319, 'incorrect result: valData_sel is incorrect')
Test.assertEquals(np.round(sum(testData_sel.first().features),0), 199, 'incorrect result: testData_sel is incorrect')

4. Evaluate the classification performance after removing useless pixels

Here, let's use a tree with default parameters. For comparison purposes, remember that the test error using all the features is around 30%.


In [ ]:
from pyspark.mllib.tree import DecisionTree
#  Train a DecisionTree model with selected features
model_tree = DecisionTree.trainClassifier(trainingData_sel, numClasses=10, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)

# Test error of the decision tree
tree_testErr = compute_classifier_error(model_tree, testData_sel)
print('Tree test error = ' + str(tree_testErr))

In [ ]:
###########################################################
# TEST CELL
###########################################################

import numpy as np
from test_helper import Test

Test.assertEquals(np.round(100*tree_testErr,0), 31, 'incorrect result: test error is incorrect')

6.2 Remove features with a low variance

Now, let's modify the code of the previous section to remove the features whose variance is lower than a given threshold.

Complete the next cell following the indications.


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

import numpy as np
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.stat import Statistics
from pyspark.mllib.tree import DecisionTree

# Set a variance threshold
th_var = 0.5

# 1. Compute the variance with Statistics.colStats( )
variance = # FILL IN (you may need several code lines to compute this)

# 2. Get the positions of the features to keep
idx_keep = # FILL IN 

# 3. Remove features from training, validation and test data 
trainingData_sel = # FILL IN 
valData_sel = # FILL IN 
testData_sel = # FILL IN 

# 4. Evaluate performance with a decision tree
# Train the model
model_tree = DecisionTree.trainClassifier(trainingData_sel, numClasses=10, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)

# Compute its test error 
tree_testErr = # FILL IN 
print('Tree test error = ' + str(tree_testErr))
print('Number of selected features = ' + str(len(idx_keep)))

In [ ]:
###########################################################
# TEST CELL
###########################################################

import numpy as np
from test_helper import Test

Test.assertEquals(np.round(100*tree_testErr,0), 31, 'incorrect result: test error is incorrect')
Test.assertEquals(len(idx_keep), 686, 'incorrect result: number of selected features is incorrect')

As you can imagine, the final performance depends on the threshold over the variance, so we should select this value by cross-validation. Please complete the following code cell to select the optimum value of the threshold.


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

# Initialize variables 
best_val_error =# FILL IN 
final_test_error = # FILL IN 
best_th = # FILL IN 
best_num_var_sel = # FILL IN 

# Range of threshold values to explore
th_range = [0.25, 0.5, 0.75, 1]

# Compute the variance 
variance = # FILL IN (you may need several code lines to compute this)

for th_var in th_range:
    
    # 2. Get the positions of the features to keep
    idx_keep = # FILL IN
    # Compute the number of selected features
    num_var_sel = # FILL IN

    # 3. Remove features from training, validation and test data 
    trainingData_sel = # FILL IN
    valData_sel = # FILL IN
    testData_sel = # FILL IN
    
    # 4. Evaluate performance with a decision tree
    # Train the model
    model_tree = DecisionTree.trainClassifier(trainingData_sel, numClasses=10, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)
    # Compute its validation error 
    tree_valErr = # FILL IN
    # Compute its test error
    tree_testErr = # FILL IN
    
    print 'Threshold value is ' + str(th_var) + ', the number of selected features is ' + str(num_var_sel) + ' and the validation error is '+ str(tree_valErr) 
    
    # If the error has decreased, save the threshold, the errors and the number of selected features
    if (tree_valErr < best_val_error):
            best_th = # FILL IN
            best_val_error = # FILL IN
            best_num_var_sel = # FILL IN
            final_test_error = # FILL IN 
            
print 'Optimum threshold value is: ' + str(best_th)
print 'The test error is: ' + str(final_test_error)
print 'The number of selected features is: ' + str(best_num_var_sel)

In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test

Test.assertEquals(np.round(100*final_test_error,0), 31, 'incorrect result: final test error is incorrect')
Test.assertEquals(best_th, 0.5, 'incorrect result: best_th is incorrect')

6.3 Remove features with L1 regularization: L1-SVM

As we know, L1 regularization induces sparsity in the weight vector. With a linear model, this can therefore be used to perform feature selection.
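
Concretely, for a binary 1 vs. all problem with labels re-encoded as $y_i \in \{-1, +1\}$, the L1-regularized linear SVM roughly solves

$$\min_{w, b} \; \frac{1}{n}\sum_{i=1}^{n} \max\bigl(0,\, 1 - y_i (w^\top x_i + b)\bigr) + \lambda \lVert w \rVert_1,$$

and the $\lVert w \rVert_1$ penalty tends to drive many components of $w$ exactly to zero, so the features with non-zero weights can be interpreted as the selected ones.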

So, in this section, let's use the above multiclass SVM implementation with L1 regularization. In this way, we will obtain both a classifier and a feature selection.


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

# Initialize variables 
C_value = 0.1

# Train the 1 vs. all SVM models (set regularization parameter to C_value)
# Note that we have added the parameter regType='l1'
multiclass_SVM = train_1vsall_SVM(trainingData, regParam=C_value, regType='l1')
    
# Compute the model error over the test data 
error_1vsall = # FILL IN

# Analyze the number of zero weights: compute the feature positions that are selected, i.e., 
# those whose weight is non-zero in at least one of the SVMs. multiclass_SVM is a list of tuples 
# (class, model); through the attribute model.weights you can access the weight vector of each SVM.

pos_sel = # FILL IN (You may need several lines to compute this)

print 'The number of selected features is: ' + str(len(pos_sel))

In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test

Test.assertEquals(sum(pos_sel), 178643, 'incorrect result: pos_sel is incorrect')

In this case, the number of selected features depends on the C value. Complete the next cell to select this value by cross-validation.


In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################

# Initialize variables 
bestModel = # FILL IN
best_error = # FILL IN
best_C = # FILL IN
# Range of C values to explore
C_params = [0.01, 0.1, 1]

for C_value in C_params:
    # Train the 1 vs. all SVM models (set regularization parameter to C_value)
    multiclass_SVM = train_1vsall_SVM(trainingData, regParam=C_value, regType='l1')
    
    # Compute the model error over the validation data 
    error_1vsall = # FILL IN
    
    # Compute the number of selected features
    num_var_sel = # FILL IN

    print 'C value is ' + str(C_value) + ', the number of selected features is ' + str(num_var_sel) + ' and the validation error is '+ str(error_1vsall) 
    
    # If the error has decreased, save the model, the optimum C value and the error in the bestModel, best_C and best_error variables
    if (error_1vsall < best_error):
            bestModel = # FILL IN
            best_C = # FILL IN
            best_error = # FILL IN
            
print 'Optimum C value is: ' + str(best_C)