Last modification: 2017-10-16
In [1]:
# Imports
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn import preprocessing
from mltoolbox.model_selection.classification import MultiClassifier
In this example, the iris dataset is used; it contains 150 samples, 4 features, and 3 classes. Details of the dataset can be found here.
In [2]:
# Load data
iris = load_iris()
X = iris.data
y = iris.target
n_samples, n_features = X.shape
print("samples:{}, features:{}, labels:{}".format(n_samples, n_features, np.unique(y)))
In [3]:
# Preprocessing
std_scale = preprocessing.StandardScaler().fit(X)
X_std = std_scale.transform(X)
print('After standardization: mean={:.4f}, std={:.4f}'.format(X_std.mean(), X_std.std()))
This section configures the classifiers that are going to be used in the classification; in this case, the Support Vector Machine (SVC) and Random Forest (RFC) classifiers are applied. The dictionary models declares the classifiers and the parameters that are NOT going to be tuned in the cross-validation. The dictionary model_params specifies the parameters that WILL be tuned in the cross-validation. The dictionary cv_params configures how the grid cross-validation is going to be performed.
In [4]:
# Configuration
random_state = 2017 # seed used by the random number generator
models = {
# NOTE: SVC and RFC are the names that will be used to refer to the models after the training step.
'SVC': SVC(probability=True,
random_state=random_state),
'RFC': RandomForestClassifier(random_state=random_state)
}
model_params = {
'SVC': {'kernel':['linear', 'rbf', 'sigmoid']},
'RFC': {'n_estimators': [25, 50, 75, 100]}
}
cv_params = {
'cv': StratifiedKFold(n_splits=3, shuffle=False)  # random_state has no effect when shuffle=False
}
The MultiClassifier trains the multiple estimators configured above. First, the data is divided n_splits times, in this case into 5 folds using the StratifiedKFold class. In each of the 5 splits, four of the five blocks are used for training while the remaining one is used for testing. In addition, if the parameter shuffle=True, the data is rearranged before being split into blocks.
In [5]:
# Training
mc = MultiClassifier(n_splits=5, shuffle=True, random_state=random_state)
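To illustrate the outer split described above, the following sketch uses scikit-learn's StratifiedKFold directly; it is not part of the MultiClassifier pipeline and only prints the size of each of the 5 training/test partitions (assuming MultiClassifier performs an equivalent stratified split internally).
# Illustration only: an outer 5-fold stratified split like the one assumed inside MultiClassifier
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
for i, (train_idx, test_idx) in enumerate(outer_cv.split(X, y), start=1):
    print("fold_{}: {} training samples, {} test samples".format(i, len(train_idx), len(test_idx)))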
Second, the method train() receives the data and the dictionaries with the configurations needed to perform the training. Taking fold_1 as an example, its training block is divided into 3 parts to perform the cross-validation (as specified in the dictionary cv_params). In the cross-validation, two parts are used to tune the parameters of the classifiers and one to test them.
In [6]:
mc.train(X, y, models, model_params, cv_params=cv_params)
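For reference, the parameter tuning performed inside each fold is conceptually similar to scikit-learn's GridSearchCV. The sketch below runs a grid search on the full dataset only to show the mechanism; the actual MultiClassifier tunes on the training block of each fold, and its internals may differ.
# Conceptual sketch of the inner parameter search (assumption: MultiClassifier wraps a grid search)
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(models['SVC'], model_params['SVC'], cv=cv_params['cv'])
grid.fit(X, y)
print(grid.best_params_)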
Third, once the best parameters have been obtained, a model is fitted on the training data. Following the example, this model is then tested on the fold_1:test block. After training and testing have been performed for each fold, the results can be visualized in a report.
In [7]:
# Results
print('RFC\n{}\n'.format(mc.report_score_summary_by_classifier('RFC')))
print('SVC\n{}\n'.format(mc.report_score_summary_by_classifier('SVC')))
In order to analyze a specific fold, you can obtain the indices of the data used for training and testing, the trained model, as well as the predictions on the test data. The method best_estimator() has the parameter fold_key; if it is not set, the method returns the fold with the highest accuracy.
In [8]:
# Get the results of the partition that has the highest accuracy
fold, bm_model, bm_y_pred, bm_train_indices, bm_test_indices = mc.best_estimator('RFC')['RFC']
print(">>Best model in fold: {}".format(fold))
print(">>>Trained model \n{}".format(bm_model))
print(">>>Predicted labels: \n{}".format(bm_y_pred))
print(">>>Indices of the samples used for training: \n{}".format(bm_train_indices))
print(">>>Indices of samples used for predicting: \n{}".format(bm_test_indices))
If you need to train the model again using the data of a specific fold, you can use bm_train_indices and bm_test_indices.
In [9]:
# Recover the partition of the dataset based on the results of the best model
X_train_final, X_test_final = X[bm_train_indices], X[bm_test_indices]
y_train_final, y_test_final = y[bm_train_indices], y[bm_test_indices]
In [10]:
# Retrain the best model on the training partition and evaluate it on the test partition
bm_model.fit(X_train_final, y_train_final)
print("Final score {0:.4f}".format(bm_model.score(X_test_final, y_test_final)))
The feature importances can also be obtained if the algorithm provides them (in this case, the RandomForestClassifier).
In [11]:
importances = mc.feature_importances('RFC')
In [12]:
%matplotlib inline
import matplotlib.pyplot as plt
indices = range(n_features)
f, ax = plt.subplots(figsize=(11, 9))
plt.title("Feature importances", fontsize = 20)
plt.bar(indices, importances, color="black", align="center")
plt.xticks(indices)
plt.ylabel("Importance", fontsize = 18)
plt.xlabel("Index of the feature", fontsize = 18)
Out[12]: [Bar chart "Feature importances": importance per feature index]
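Since the dataset is iris, the feature indices can be mapped to the names stored in iris.feature_names for readability (assuming importances is a 1-D array of length n_features, as the bar chart implies).
# Print each importance next to its iris feature name
for idx, (name, imp) in enumerate(zip(iris.feature_names, importances)):
    print("{}: {} -> {:.4f}".format(idx, name, imp))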