In this notebook we will learn to work with the different classification tools available in MLlib. Furthermore, we will extend some of these tools to a multiclass classification scenario.
Throughout this lab session we will use the MNIST dataset, which is widely used in machine learning for testing classification algorithms. The dataset has 60,000 training patterns and an additional 10,000 observations for testing purposes; each sample corresponds to a 28x28 digit image (784 pixels, stored as a sparse feature vector). The goal of the problem is to automatically classify a new image into one of the ten possible digits.
The outline of this notebook is:
1. Data reading and preprocessing
1. Read data.
2. Data analysis.
3. Data normalization.
4. Split data for training, validation, and testing.
2. Model training. Here, we will start analyzing multiclass classification approaches:
1. Decision trees
2. Random Forest
3. Evaluating the model performance over a test dataset
4. Selecting the model parameters by cross-validation
5. Creating a multiclass classifier from a binary Support Vector Machine
6. Interpretability analysis: feature selection approaches
Start by downloading the MNIST dataset from:
https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#mnist
After completing this notebook, you can analyze the scalability of the different approaches over a larger dataset by using the large version of the MNIST dataset (with 8,100,000 patterns):
https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#mnist8m
The standard method to load text files is sc.textFile("file_name"), which automatically creates an RDD with as many elements as lines in the data file. Run the following cell and consider:
How can we process this file to transform each line into data that are easy to handle?
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
# You need to include mnist file in your working directory
lines = sc.textFile("mnist")
# Examine dataset format
# 1. Number of lines
n_lines = #FILL
print 'Number of lines: ' + str(n_lines)
# 2. Content of the first line
line = # FILL
print 'A line content:'
print line
# 3. Data type of a line
type_data = # FILL
print 'Data type:'
print type_data
In [ ]:
###########################################################
# TEST CELL
###########################################################
from test_helper import Test
Test.assertEquals(n_lines, 60000, 'incorrect result: number of file lines is incorrect')
Test.assertEquals(line[:10], '5 153:3 15', 'incorrect result: first line is incorrect')
MLlib includes different data types (see http://spark.apache.org/docs/latest/mllib-data-types.html), such as local vectors, labeled points, local matrices or distributed matrices. In supervised learning algorithms, the default data type is the “labeled point” (see http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point).
So we need to transform our data file into an RDD of LabeledPoint. Both the features and label fields in a LabeledPoint are of type Double; however, the input dataset has both the features and the label in string format. So, as you can imagine, the process to transform the input data into numeric values can be a bit tedious...
However, this data file has a specific format, known as the LIBSVM text file format, where each line is:
label index1:value1 index2:value2 ...
That is, it is representing a labeled sparse feature vector.
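As an illustration, the following toy sketch (using a made-up line, just to make the format explicit; the MLlib helper introduced next does this work for us) splits one such line into its label and its sparse (index, value) pairs:
In [ ]:
# Toy sketch (not part of the exercise): parsing one LIBSVM-formatted line by hand
line_example = '5 153:3 160:254 161:87'   # hypothetical line: label 5 and three non-zero pixels
parts = line_example.split()
label_example = float(parts[0])
pairs = [p.split(':') for p in parts[1:]]
indices = [int(i) - 1 for i, v in pairs]   # LIBSVM indices are typically one-based
values = [float(v) for i, v in pairs]
print label_example, indices, values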
Furthermore, MLlib includes a specific function:
loadLibSVMFile(sc, path, numFeatures=-1, minPartitions=None, multiclass=None)
which directly reads this format data file and returns an RDD of LabeledPoint elements.
Run the following cell to load the MNIST data as LabeledPoint elements.
Note: we have explicitly defined the number of features to be sure that the sparse feature vectors are created with all the dimensions.
In [ ]:
from pyspark.mllib.util import MLUtils
data = MLUtils.loadLibSVMFile(sc, "mnist", numFeatures= 784)
Complete the following code to examine the content of the RDD data:
1. Get the first pattern
2. Extract its label
3. Extract its features
Note: take into account that the LabeledPoint type includes its own attributes to access the label and the features.
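As a reminder of this data type, here is a toy sketch (independent of the MNIST RDD) showing how the label and features fields are accessed:
In [ ]:
# Toy example of the LabeledPoint data type (independent of the MNIST data)
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
lp_example = LabeledPoint(1.0, Vectors.sparse(4, [0, 2], [3.0, 5.0]))
print lp_example.label      # 1.0
print lp_example.features   # SparseVector(4, {0: 3.0, 2: 5.0})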
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
# 1. Get the first pattern
dat = data.first()
# 2. Extract its label
label = # <FILL IN>
print label
# 3. Extract its features
features = # <FILL IN>
print features
In [ ]:
###########################################################
# TEST CELL
###########################################################
from test_helper import Test
Test.assertEquals(label, 5, 'incorrect result: label is incorrect')
Test.assertEquals(sum(features), 27525, 'features are incorrect')
In [ ]:
import matplotlib.pyplot as plt
from pyspark.mllib.linalg import Vectors
def plot_data(images, h, w, n_row=1, n_col=10):
"""Plots the set of images provided in images
Args:
images (list of sparse vectors or numpy arrays): list of images where each image contains the
features corresponding to the pixels of an image.
h: height of the image (in number of pixels).
w: width of the image (in number of pixels).
n_row: Number of rows to use when plotting all the images
n_col: Number of columns to use when plotting all the images
"""
plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
for i in range(len(images)):
plt.subplot(n_row, n_col, i + 1)
try:
img = images[i].toArray()
except:
img = images[i]
plt.imshow(img.reshape((h, w)), cmap=plt.cm.jet)
plt.xticks(())
plt.yticks(())
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
# Define the height and width of the images
h= 28
w =28
# From the data RDD, create a new RDD where each element only has the features (pixel values) of each image
features = #FILL
# Pick up 10 images and plot them with plot_data() function
images= #FILL
plot_data(images, h, w)
Now, let's compute some statistics of the data set.
For this purpose, we can use the MLlib Statistics library, which lets us compute (in a distributed way) some statistics of the features, such as the mean, the variance, the maximum and minimum values, or the number of times that a pixel is non-zero.
Complete the following cell to compute all the above statistics.
Note: Use the Statistics.colStats( ) function.
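For reference, Statistics.colStats( ) takes an RDD of vectors and returns a summary object with one method per statistic; a minimal sketch on toy data (not the MNIST features) could look like this:
In [ ]:
# Toy example of Statistics.colStats() on a small RDD of dense vectors
import numpy as np
from pyspark.mllib.stat import Statistics
toy_rdd = sc.parallelize([np.array([1.0, 0.0, 3.0]),
                          np.array([2.0, 0.0, 5.0])])
toy_stats = Statistics.colStats(toy_rdd)
print toy_stats.mean()         # column-wise means
print toy_stats.variance()     # column-wise variances
print toy_stats.max(), toy_stats.min()
print toy_stats.numNonzeros()  # non-zero count per column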
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
from pyspark.mllib.stat import Statistics
# Compute summary statistics using the rdd of features as input.
stats = # FILL
# Extract the desired statistics
mean = # FILL
variance = #FILL
maximum = # FILL
minimum = # FILL
numNonzeros = # FILL
# Use the plot_data function to plot them
statistics = [mean, variance, maximum, minimum, numNonzeros]
plot_data(statistics, h, w)
In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test
Test.assertEquals(np.round(np.sum(mean.ravel()),0), 26122, 'incorrect result: mean is incorrect')
Test.assertEquals(np.round(np.sum(variance.ravel()),0), 3428503, 'incorrect result: variance is incorrect')
Test.assertEquals(np.round(np.sum(maximum.ravel()),0), 172093, 'incorrect result: maximum is incorrect')
Test.assertEquals(np.round(np.sum(minimum.ravel()),0), 0, 'incorrect result: minimum is incorrect')
Test.assertEquals(np.round(np.sum(numNonzeros.ravel()),0), 8994156, 'incorrect result: numNonzeros is incorrect')
Now, let's normalize the data. Usually, data normalization consists of two steps: removing the mean of each feature and scaling it to have unit standard deviation.
Since we are working with sparse data (most of the input features are zero), removing the mean would turn the zero values into non-zero values, which would increase the size (in memory) of the dataset. To avoid this, here we are only going to rescale the data.
Complete the following cell to rescale the training data by making use of the StandardScaler method of MLlib (http://spark.apache.org/docs/latest/mllib-feature-extraction.html#standardscaler).
Note 1: until now, we have only loaded the training data, so use all the patterns in the data variable to fit the scaler.
Note 2: the StandardScaler method has two input parameters, 'withMean' and 'withStd', which let you select, respectively, whether the mean and the standard deviation are corrected or not. By default, 'withMean' is set to False and 'withStd' to True, that is, only the standard deviation is corrected.
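The usual StandardScaler pattern is fit-then-transform; here is a minimal sketch on toy data (keeping the defaults withMean=False, withStd=True), just to show the calls involved:
In [ ]:
# Toy example of StandardScaler usage (independent of the MNIST data)
import numpy as np
from pyspark.mllib.feature import StandardScaler
toy_features = sc.parallelize([np.array([1.0, 10.0]),
                               np.array([3.0, 30.0]),
                               np.array([5.0, 50.0])])
toy_scaler = StandardScaler(withMean=False, withStd=True).fit(toy_features)
toy_scaled = toy_scaler.transform(toy_features)
print toy_scaled.collect()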
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint
# Create two new RDD by extracting the labels and features of the data
label = # FILL IN
features = # FILL IN
# Define the StandardScaler() object and fit it with the data features
scaler = # FILL IN
# Normalize the data features
features_norm = # FILL IN
# Create a new RDD of LabeledPoint data using the normalized features
# 1. Construct an RDD of tuples (label, features): check the zip() method of RDD objects
data_norm = # FILL IN
# 2. Create the label point RDD
data_LP = # FILL IN
print data_LP.first()
In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test
Test.assertEquals(np.round(sum(features_norm.first()),0), 319, 'incorrect result: normalized features are incorrect')
Test.assertEquals(data_LP.first().label, 5, 'incorrect result: normalized LabeledPoint data are incorrect')
Test.assertEquals(np.round(sum(data_LP.first().features),0), 319, 'incorrect result: normalized LabeledPoint data are incorrect')
In this subsection, let's split the normalized dataset into training and validation data. We will use 40% of the data for training a model and 60% for validating the hyperparameters of the different learning algorithms. To save computational time, cache both the normalized training and validation RDDs, since we will use them several times.
You can use the randomSplit() method for this purpose.
Note: when you call randomSplit, please set seed=0 for comparison purposes: randomSplit([....], seed=0)
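For reference, randomSplit( ) receives a list of weights and returns one RDD per weight; a toy sketch (independent of the MNIST data):
In [ ]:
# Toy example of randomSplit()
toy_rdd = sc.parallelize(range(10))
part_a, part_b = toy_rdd.randomSplit([0.4, 0.6], seed=0)
print part_a.collect()
print part_b.collect()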
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
# Create training and validation partitions
(trainingData, valData) = # FILL
# Our learning algorithms will make several passes over these datasets, so let’s cache these RDD in memory
trainingData.cache()
valData.cache()
In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test
Test.assertEquals(np.round(sum(trainingData.first().features),0), 355, 'incorrect result: training data are incorrect')
Test.assertEquals(np.round(sum(valData.first().features),0), 319, 'incorrect result: validation data are incorrect')
Finally, let's load the test data from the "mnist.t" file. To be able to use it for testing purposes, we will also have to normalize it using the normalization parameters learned from the training data.
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
# Load test data
data = MLUtils.loadLibSVMFile(sc, "mnist.t", numFeatures=784)
# Normalize test data:
# 1. Create two new RDD by extracting the labels and features of the data
label = # FILL IN
features = # FILL IN
# 2. Normalize the data features (use the scaler object fitted with the training data)
features_norm = # FILL IN
# 3. Create a new RDD of LabeledPoint data using the normalized features and cache it!
# 3.1 RDD with tuples (label, features)
test = # FILL IN
# 3.2 RDD with Label Point data
testData = # FILL IN
testData.cache()
testData.first()
In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test
Test.assertEquals(np.round(sum(testData.first().features),0), 199, 'incorrect result: test data are incorrect')
As we already know, a decision tree works by selecting the most discriminative features and setting thresholds over them, in such a way that each tree node splits the training data into subsets, aiming at maximizing node purity (ideally, each resulting partition belongs to a single class). For this purpose, a purity measure, such as the Gini index, is used to select both the most discriminative features and the thresholds to apply.
Review http://spark.apache.org/docs/latest/mllib-decision-tree.html for implementation details.
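To make the purity criterion concrete: for a node whose data has class proportions $p_1, \ldots, p_K$, the Gini index is $1 - \sum_k p_k^2$, which is 0 for a pure node and grows as the classes mix. A small illustrative computation (not part of MLlib, shown only to fix ideas):
In [ ]:
# Illustrative computation of the Gini impurity for a toy class distribution
import numpy as np
def gini(class_counts):
    p = np.array(class_counts, dtype=float) / np.sum(class_counts)
    return 1.0 - np.sum(p ** 2)
print gini([10, 0, 0])   # pure node -> 0.0
print gini([5, 5])       # evenly mixed node -> 0.5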
The following cell contains the necessary code to train a DecisionTree and stores the resulting DecisionTreeModel in the variable model_tree. Note that all the free parameters (such as the maximum depth of the tree) have been left at their default values.
In [ ]:
from pyspark.mllib.tree import DecisionTree
# Train a DecisionTree model
# Empty categoricalFeaturesInfo indicates all features are continuous.
model_tree = DecisionTree.trainClassifier(trainingData, numClasses=10, categoricalFeaturesInfo={},
impurity='gini', maxDepth=5, maxBins=32)
A Random Forest builds an ensemble of trees by training multiple trees in parallel (with different data and features) and combining their outputs. In the case of classification, the combination is carried out by a majority vote.
See https://spark.apache.org/docs/1.2.0/mllib-ensembles.html#random-forests for further details.
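For intuition, the majority vote used at prediction time simply selects the most frequent label among the individual tree outputs; a toy sketch with made-up tree predictions:
In [ ]:
# Toy illustration of a majority vote over the outputs of several trees
from collections import Counter
tree_outputs = [3, 7, 3, 3, 9]   # hypothetical predictions of five trees for one sample
print Counter(tree_outputs).most_common(1)[0][0]   # majority label -> 3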
The next cell includes the code to train a Random Forest with some default parameters.
In [ ]:
from pyspark.mllib.tree import RandomForest, RandomForestModel
# Train a RandomForest model
# Empty categoricalFeaturesInfo indicates all features are continuous.
# Setting featureSubsetStrategy="auto" lets the algorithm choose.
model_RF = RandomForest.trainClassifier(trainingData, numClasses=10, categoricalFeaturesInfo={},
numTrees=50, featureSubsetStrategy="auto",
impurity='gini', maxDepth=4, maxBins=32)
To be able to evaluate the performance of the different models that we have trained, let's create a function that, given an MLlib classification model and an RDD of LabeledPoint data, computes the classification error of the model over the given data. This function has to follow these steps: (1) compute the model output for every sample, (2) build an RDD of (label, prediction) tuples, and (3) compute the fraction of misclassified samples.
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
from pyspark.mllib.util import MLUtils
def compute_classifier_error(model, Data):
""" Compute the classification error of the model over the samples given in Data.
Args:
model: MLLib classification model
Data: an RDD with a data set of LabeledPoint elements
Returns:
Float: A single value, between 0 and 1, indicating the classification error.
A value of 1 indicates that all the samples are misclassified (100% error)
and a value of 0 that all the samples are correctly classified (100% accuracy).
"""
# Evaluate model on test instances and compute test error
# 1. Compute the model output
predictions = # FILL IN
# 2. Create an RDD of tuples (label, output)
labelsAndPredictions = # FILL IN
# 3. Compute test error
testErr = # FILL IN
return testErr
Use the function to compute the test error of the tree classifier and the Random Forest model.
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
# Test error of the decision tree
tree_testErr = # FILL IN
print('Tree test error = ' + str(tree_testErr))
# Test error of the random forest
RF_testErr = # FILL IN
print('Random Forest test error = ' + str(RF_testErr))
In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test
Test.assertEquals(np.round(100*tree_testErr,0), 31, 'incorrect result: decision tree error is incorrect')
Test.assertEquals(np.round(100*RF_testErr,0), 22, 'incorrect result: RF error is incorrect')
The most critical parameter of a decision tree is its maximum depth. Note that deeper trees are more expressive (potentially allowing higher accuracy), but they are also more costly to train and are more likely to overfit.
The following cell explores different tree depths and evaluates them over the validation data, finally selecting the optimum tree depth as the value that provides the minimum classification error over the validation data. Complete the missing code lines.
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
from pyspark.mllib.tree import DecisionTree
import numpy as np
# Initialize variables
bestModel = # FILL IN
best_error = # FILL IN
best_depth = # FILL IN
# Range of depth values to explore
depth_params = [5, 10, 15]
for depth_value in depth_params:
# Train a decision tree fixing maxDepth to depth_value
model_tree = DecisionTree.trainClassifier(trainingData, numClasses=10, categoricalFeaturesInfo={},
impurity='gini', maxDepth=#FILL IN, maxBins=32)
# Compute the model error over the validation data (use compute_classifier_error function)
tree_valErr = # FILL IN
print 'Tree depth is ' + str(depth_value) + ' and the validation error is '+ str(tree_valErr)
# If the error has reduced, save the model, the optimum depth and error in bestModel, best_depth and best_error variables
if (tree_valErr < best_error):
bestModel = # FILL IN
best_depth = # FILL IN
best_error = # FILL IN
print 'Optimum tree depth: ' + str(best_depth)
Analyze the validation error behaviour with the tree depth. Is this the expected behaviour?
Taking into account the trade-off computational cost vs. accuracy, which tree depth will you select?
In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test
Test.assertEquals(best_depth, 15, 'incorrect result: best_depth is incorrect')
Test.assertEquals(np.round(100*best_error,0), 15, 'incorrect result: best_error is incorrect')
Now, evaluate the error of the selected model (bestModel) over the test data.
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
# Finally, evaluate the test error over the best model
tree_finalErr = # FILL IN
print 'Final test error of the validated tree: ' + str(tree_finalErr)
In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test
Test.assertEquals(np.round(100*tree_finalErr,0), 15, 'incorrect result: decision tree test error is incorrect')
In this case, the most critical parameter is the number of trees in the forest ('numTrees'). Note that increasing the number of trees decreases the variance of the predictions, improving the generalization capability of the ensemble.
Complete the following cell to adjust this parameter by cross validation, that is, selecting the value which provides the minimum classification error over the validation data.
Note: it is also quite common to adjust the tree depth; however, for computational reasons, in this exercise we fix its value to 5.
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
from pyspark.mllib.tree import RandomForest
import numpy as np
# Initialize variables
bestModel = # FILL IN
best_error = # FILL IN
best_depth = # FILL IN
best_ntrees = # FILL IN
# Range of values to explore
ntrees_params = [20, 50, 100]
for ntrees_value in ntrees_params:
# Train a RandomForest model.
model_RF = RandomForest.trainClassifier(trainingData, numClasses=10, categoricalFeaturesInfo={},
numTrees= # FILL IN, featureSubsetStrategy="auto",
impurity='gini', maxDepth=5, maxBins=32)
# Compute the model error over the validation data (use compute_classifier_error function)
RF_valErr = # FILL IN
print 'Number of trees is ' + str(ntrees_value) + ' and the validation error is '+ str(RF_valErr)
# Check if the error has improved. If it has, save the model, the optimum number of trees and the error in the bestModel, best_ntrees and best_error variables
if (RF_valErr < best_error):
bestModel = # FILL IN
best_ntrees = # FILL IN
best_error = # FILL IN
print 'Optimum number of trees: ' + str(best_ntrees)
Analyze the validation error behaviour with the number of trees. Is this the expected behaviour?
Taking into account the trade-off computational cost vs. accuracy, which number of trees will you select?
In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test
Test.assertEquals(best_ntrees, 100, 'incorrect result: best_ntrees is incorrect')
Test.assertEquals(np.round(100*best_error,0), 15, 'incorrect result: best_error is incorrect')
Now, evaluate the error of the selected model (bestModel) over the test data.
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
# Finally, evaluate the test error over the best model
tree_finalErr = # FILL IN
print 'Final test error of the validated Random Forest: ' + str(tree_finalErr)
In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test
Test.assertEquals(np.round(100*tree_finalErr,0), 14, 'incorrect result: final test error is incorrect')
MLlib includes a distributed SVM implementation, but it is only available for binary problems (see the MLlib documentation at: http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-support-vector-machines-svms).
As we are working with a multiclass problem, let's adapt this implementation to be used in a 1 vs. all fashion and apply it to our multiclass problem.
Let's start by considering a single 1 vs. all problem; for instance, suppose we want to separate the digit '0' from the remaining digits.
Then, we will proceed as follows:
1. Convert the training labels to the 1 vs. all scheme
Create a convert_label() function to generate the labels associated with the 1 vs. all problem.
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
def convert_label(label, label1):
"""Produce a 1 vs. all label encoding for a single label and the label to be included in the class 1.
Args:
label (int, str): the label to be coded
label1 (int, str): the label to be included in the class 1.
Returns:
Int: A single value indicating the label (0 or 1) of the 1 vs. all problem.
"""
# <FILL IN> : use all code lines that you need
Let's select the digit '0' as class 1 and transform the labels of both the training and test data.
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
label_1 = 0
# Transform the labels of training data
trainingData_1vsall= # FILL IN
# Transform the labels of test data
testData_1vsall= # FILL IN
In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test
Test.assertEquals(np.round(sum(trainingData_1vsall.first().features),0), 355, 'incorrect result: trainingData_1vsall are incorrect')
Test.assertEquals(np.round(sum(testData_1vsall.first().features),0), 199, 'incorrect result: testData_1vsall are incorrect')
2. Train a binary SVM
Note: SVMWithSGD.train( ) has several parameters related to the SGD search and others associated with the SVM itself. The next cell only includes the latter in the training call, since we will have to adjust some of them along this notebook. The remaining parameters take their default values.
In [ ]:
from pyspark.mllib.classification import SVMWithSGD, SVMModel
model = SVMWithSGD.train(trainingData_1vsall, regParam=0.01, regType='l2', intercept=True)
3. Compute the test error of this model
Here, you can use the function compute_classifier_error( ) of the previous section.
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
error_1vsall = # FILL IN
print("Test Error os 1 vs. all model (to classify class 0) is: " + str(error_1vsall))
In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test
Test.assertEquals(np.round(100*error_1vsall,0), 2, 'incorrect result: test error is incorrect')
Starting from the previous procedure, let's create a 1 vs. all multiclass SVM. For this purpose, we need to implement the three functions described below.
Note: The training function includes the model.clearThreshold() call to transform the model outputs from discrete values (labels) to real values (smooth or continuous outputs). This will be necessary to later combine the outputs of the different SVMs in a 1 vs. all fashion.
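To see the effect of clearThreshold( ) in isolation, here is a toy sketch (independent of MNIST and of the functions below): after clearing the threshold, predict( ) returns the raw real-valued margin instead of the hard 0/1 label.
In [ ]:
# Toy illustration of clearThreshold(): hard 0/1 labels vs. raw margins
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.regression import LabeledPoint
toy_data = sc.parallelize([LabeledPoint(0.0, [0.0]), LabeledPoint(0.0, [0.5]),
                           LabeledPoint(1.0, [2.0]), LabeledPoint(1.0, [3.0])])
toy_model = SVMWithSGD.train(toy_data, iterations=100)
print toy_model.predict([2.5])   # hard label: 0 or 1
toy_model.clearThreshold()
print toy_model.predict([2.5])   # raw real-valued output (margin)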
1. Let's create a training function to build all 1 vs. all SVMs
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
def train_1vsall_SVM(trainingData, regParam=0.01, regType='l2'):
"""Produce a list of SVM models solving all the 1 vs. all problems
Args:
trainingData (RDD of labeled points): the training data to adjust the model
regParam: The regularizer parameter (default: 0.01).
regType: The type of regularizer used for training our model (default: “l2”). Allowed values:
“l1” for using L1 regularization
“l2” for using L2 regularization
None for no regularization
Returns:
List of SVM models: A list of length number of classes where each element is a tuple (lab, model). Variable
lab indicates the label 1 of the 1 vs. all problem and model is the SVM model solving the problem
"""
# Get all possible labels from training data
labels = # FILL IN
# Initialize the list of models to be returned
list_models = []
for lab in labels:
# Convert labels of trainingData to 1 vs. all format (use convert_label function)
trainingData_1vsall=# FILL IN
# Train the SVM model with 1 vs. all data
model = # FILL IN
# Modify the model to get smooth outputs
model.clearThreshold()
# Create the tuple (lab, model) and add it to the list of models
# FILL IN
return list_models
2. Let's create a function to compute the output for a single data point
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
import numpy as np
def output_1vsall_SVM(data, list_models):
"""Compute the output of a list of 1 vs. all SVM models for a test data
Args:
data (labeled point): data to be evaluated over the 1 vs. all SVM model
list_models: a list of length number of classes where each element is a tuple (lab, model). Variable
lab indicates the label 1 of the 1 vs. all problem and model is the SVM model solving the problem
Returns:
Output_label: The label estimated by the 1 vs. all model for the data
"""
# Split the tuples of list_models into a list of labels and a list of models (you can use zip() method)
labels, models = # FILL IN
outputs = []
# For each model...
for model in models:
# Compute the output over the test data
out = # FILL IN
# Add this output to outputs list
# FILL IN
# Get the test output label as the label associated to the model with the maximum output value
pos = # FILL IN
Output_label = # FILL IN
return Output_label
3. Let's create a function to compute the error over a set of data
Note that output_1vsall_SVM( ) returns an estimated label, so you can compute the number of errors as the number of samples whose estimated label differs from the true one.
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
from pyspark.mllib.util import MLUtils
def compute_1vsall_SVM_error(list_models, Data):
""" Compute the classification error of the 1 vs. all SVM model over the samples given in Data.
Args:
list_models: a list of length number of classes where each element is a tuple (lab, model). Variable
lab indicates the label 1 of the 1 vs. all problem and model is the SVM model solving the problem
Data: an RDD with a data set of LabelPoints
Returns:
Float: A single value, between 0 and 1, indicating the classification error.
A value of 1 indicates that all the samples are misclassified and a value
of 0 that all the samples are correctly classified.
"""
# Evaluate model on test instances and compute test error
# 1. Compute the model output
predictions = # FILL IN
# 2. Create an RDD of tuples (label, output)
labelsAndPredictions = # FILL IN
# 3. Compute test error
testErr = # FILL IN
return testErr
Finally, let's use the above functions to train the multiclass model and evaluate it over all test data.
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
# Train the 1 vs. all SVM models (use the train_1vsall_SVM() function with default parameters)
multiclass_SVM = # FILL IN
# Compute the test error
error_1vsall = # FILL IN
print("Test Error of the 1 vs. all SVM = " + str(error_1vsall))
In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test
Test.assertEquals(np.round(100*error_1vsall,0), 12, 'incorrect result: test error is incorrect')
Cross validating the regularization parameter
To get the most out of the SVM, we should cross-validate the regularization parameter (C value), since its value is critical to obtain good performance. Complete the following cell to adjust this value.
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
# Initialize variables
bestModel = # FILL IN
best_error = # FILL IN
best_C = # FILL IN
# Range of C values to explore
C_params = [0.01, 0.1, 1]
for C_value in C_params:
# Train the 1 vs. all SVM models (set regularization parameter to C_value)
multiclass_SVM = # FILL IN
# Compute the model error over the validation data
error_1vsall = # FILL IN
print 'C value is ' + str(C_value) + ' and the validation error is '+ str(error_1vsall)
# If the error has been reduced, save the model, the optimum C value and the error in the bestModel, best_C and best_error variables
if (error_1vsall < best_error):
bestModel = # FILL IN
best_C = # FILL IN
best_error = # FILL IN
print 'Optimum C value is: ' + str(best_C)
In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test
Test.assertEquals(best_C, 0.1, 'incorrect result: best_C is incorrect')
Test.assertEquals(np.round(100*best_error,0), 12, 'incorrect result: best_error is incorrect')
The MNIST dataset has many features (pixels) that are useless for classification; for instance, some of them are constant across all the data, so they have no discriminatory capability.
In this last section, let's implement some simple (but efficient) distributed feature selection approaches, so that we can extract the most relevant features.
Let's start by removing all the pixels that are zero in every image (background pixels), that is, pixels whose variance is zero. For this purpose, follow these steps:
1. Compute the number of non-zeros in each pixel
Use the Statistics.colStats() function to compute the number of non-zeros in each pixel (review the first section of this notebook).
Note: we will compute this over the training data and, later, apply the selection over the training, validation and test data sets.
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
from pyspark.mllib.stat import Statistics
features = trainingData.map(lambda x: x.features)
# Compute column summary statistics.
stats = # FILL IN
plot_data([stats.numNonzeros()], h, w)
2. Select variables to keep
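As a hint on how to turn a per-column statistic into the column indices to keep, here is a toy numpy sketch with a made-up array (in the exercise the statistic would come from the previous cell):
In [ ]:
# Toy sketch: selecting the indices of the columns that satisfy a condition
import numpy as np
toy_counts = np.array([0, 12, 0, 3, 60000])   # hypothetical per-pixel non-zero counts
toy_idx_keep = np.where(toy_counts > 0)[0]
print toy_idx_keep   # -> [1 3 4]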
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
idx_keep = # FILL IN
print('Number of selected features = ' + str(len(idx_keep)))
In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test
Test.assertEquals(sum(idx_keep), 281576, 'incorrect result: idx_keep is incorrect')
3. Remove the undesired features from the data
Use the following function to remove the useless features
In [ ]:
import numpy as np
from pyspark.mllib.linalg import SparseVector
def remove_features(all_features, idx_keep):
""" From all_features vector it selects the features given in idx_keep and it returns them by means of a
Sparse Vector
Args:
all_features: SparseVector with the feature values
idx_keep: indexes with the positions to keep
Returns:
SparseVector with the selected features
"""
values = all_features.toArray()[idx_keep]
val_nonzero = np.where(values>0)[0]
return SparseVector(len(idx_keep), val_nonzero, values[val_nonzero])
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
import numpy as np
from pyspark.mllib.regression import LabeledPoint
# Remove features from training data
trainingData_sel = # FILL IN
# Remove features from validation data
valData_sel = # FILL IN
# Remove features from test data
testData_sel = # FILL IN
In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test
Test.assertEquals(np.round(sum(trainingData_sel.first().features),0), 355, 'incorrect result: trainingData_sel is incorrect')
Test.assertEquals(np.round(sum(valData_sel.first().features),0), 319, 'incorrect result: valData_sel is incorrect')
Test.assertEquals(np.round(sum(testData_sel.first().features),0), 199, 'incorrect result: testData_sel is incorrect')
4. Evaluate the classification performance after removing useless pixels
Here, let's use a tree with default parameters. For comparison purposes, remember that the test error using all the features is around 30%.
In [ ]:
from pyspark.mllib.tree import DecisionTree
# Train a DecisionTree model with selected features
model_tree = DecisionTree.trainClassifier(trainingData_sel, numClasses=10, categoricalFeaturesInfo={},
impurity='gini', maxDepth=5, maxBins=32)
# Test error of the decision tree
tree_testErr = compute_classifier_error(model_tree, testData_sel)
print('Tree test error = ' + str(tree_testErr))
In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test
Test.assertEquals(np.round(100*tree_testErr,0), 31, 'incorrect result: test error is incorrect')
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
import numpy as np
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.stat import Statistics
from pyspark.mllib.tree import DecisionTree
# Set a variance threshold
th_var = 0.5
# 1. Compute the variance with Statistics.colStats( )
variance = # FILL IN (you may need several code lines to compute this)
# 2. Get the positions of the features to keep
idx_keep = # FILL IN
# 3. Remove features from training, validation and test data
trainingData_sel = # FILL IN
valData_sel = # FILL IN
testData_sel = # FILL IN
# 4. Evaluate performance with a decision tree
# Train the model
model_tree = DecisionTree.trainClassifier(trainingData_sel, numClasses=10, categoricalFeaturesInfo={},
impurity='gini', maxDepth=5, maxBins=32)
# Compute its test error
tree_testErr = # FILL IN
print('Tree test error = ' + str(tree_testErr))
print('Number of selected features = ' + str(len(idx_keep)))
In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test
Test.assertEquals(np.round(100*tree_testErr,0), 31, 'incorrect result: test error is incorrect')
Test.assertEquals(len(idx_keep), 686, 'incorrect result: number of selected features is incorrect')
As you can imagine, the final performance depends on the threshold over the variance, so we should select this value by cross-validation. Please complete the following code cell to select the optimum value of the threshold.
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
# Initialize variables
best_val_error =# FILL IN
final_test_error = # FILL IN
best_th = # FILL IN
best_num_var_sel = # FILL IN
# Range of threshold values to explore
th_range = [0.25, 0.5, 0.75, 1]
# Compute the variance
variance = # FILL IN (you may need several code lines to compute this)
for th_var in th_range:
# 2. Get the positions of the features to keep
idx_keep = # FILL IN
# Compute the number of selected features
num_var_sel = # FILL IN
# 3. Remove features from training, validation and test data
trainingData_sel = # FILL IN
valData_sel = # FILL IN
testData_sel = # FILL IN
# 4. Evaluate performance with a decision tree
# Train the model
model_tree = DecisionTree.trainClassifier(trainingData_sel, numClasses=10, categoricalFeaturesInfo={},
impurity='gini', maxDepth=5, maxBins=32)
# Compute its validation error
tree_valErr = # FILL IN
# Compute its test error
tree_testErr = # FILL IN
print 'Threshold value is ' + str(th_var) + ', the number of selected features is ' + str(num_var_sel) + ' and the validation error is '+ str(tree_valErr)
# If the error has been reduced, save the best threshold, the validation and test errors, and the number of selected features
if (tree_valErr < best_val_error):
best_th = # FILL IN
best_val_error = # FILL IN
best_num_var_sel = # FILL IN
final_test_error = # FILL IN
print 'Optimum threshold value is: ' + str(best_th)
print 'The test error is: ' + str(final_test_error)
print 'The number of selected features is: ' + str(best_num_var_sel)
In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test
Test.assertEquals(np.round(100*final_test_error,0), 31, 'incorrect result: test error is incorrect')
Test.assertEquals(best_th, 0.5, 'incorrect result: best_th is incorrect')
As we know, L1 regularization induces sparsity in the weight vector. If we have a linear model, this can be used to perform feature selection.
So, in this section, let's use the above multiclass SVM implementation with L1 regularization. In this way, we will obtain both a classifier and a feature selection.
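As a hint for the last step of the next cell, here is a toy numpy sketch with made-up weight vectors (in the exercise each vector would come from the model.weights attribute of one SVM): stacking the vectors row-wise makes it easy to locate the columns where every model has a zero weight.
In [ ]:
# Toy sketch: finding the columns where all the weight vectors are zero
import numpy as np
toy_weights = np.array([[0.0, 1.2, 0.0, 0.3],
                        [0.0, 0.0, 0.0, 0.7]])   # one hypothetical weight vector per row
all_zero_cols = np.where(np.sum(np.abs(toy_weights), axis=0) == 0)[0]
print all_zero_cols   # columns that could be removed -> [0 2]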
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
# Initialize variables
C_value = 0.1
# Train the 1 vs. all SVM models (set regularization parameter to C_value)
# Note that we have added the parameter regType='l1'
multiclass_SVM = train_1vsall_SVM(trainingData, regParam=C_value, regType='l1')
# Compute the model error over the test data
error_1vsall = # FILL IN
# Analyze the number of zero weights: compute the positions where all the vectors are zero
# multiclass_SVM is a list of tuples (class, model), getting the parameter
# model.weights you can access to the vector weights of each SVM
pos_sel = # FILL IN (You may need several lines to compute this)
print 'Number of selected features: ' + str(len(pos_sel))
In [ ]:
###########################################################
# TEST CELL
###########################################################
import numpy as np
from test_helper import Test
Test.assertEquals(sum(pos_sel), 178643, 'incorrect result: pos_sel is incorrect')
In this case, the number of selected features depends on the C value. Complete the next cell to select this value by cross validation.
In [ ]:
#################################################
# TODO: Replace <FILL IN> with appropriate code
#################################################
# Initialize variables
bestModel = # FILL IN
best_error = # FILL IN
best_C = # FILL IN
# Range of C values to explore
C_params = [0.01, 0.1, 1]
for C_value in C_params:
# Train the 1 vs. all SVM models (set regularization parameter to C_value)
multiclass_SVM = train_1vsall_SVM(trainingData, regParam=C_value, regType='l1')
# Compute the model error over the validation data
error_1vsall = # FILL IN
# Compute the number of selected features
num_var_sel = # FILL IN
print 'C value is ' + str(C_value) + ', the number of selected features is ' + str(num_var_sel) + ' and the validation error is '+ str(error_1vsall)
# If the error has been reduced, save the model, the optimum C value and the error in the bestModel, best_C and best_error variables
if (error_1vsall < best_error):
bestModel = # FILL IN
best_C = # FILL IN
best_error = # FILL IN
print 'Optimum C value is: ' + str(best_C)