Binary Classification

Last modification: 2017-10-16


In [1]:
# Imports: dataset utilities, classifiers, and model-selection tools
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold

from mltoolbox.model_selection.classification import MultiClassifier

In this example, the Breast Cancer Wisconsin dataset is used; it contains 569 samples and 30 features. Details of the dataset can be found in the scikit-learn documentation.


In [2]:
# Load data
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target
n_samples, n_features = X.shape
print("samples:{}, features:{}".format(n_samples, n_features))


samples:569, features:30

This section configures the classifiers used in the classification task, in this case Support Vector Machines (SVC) and Random Forests (RFC). The dictionary models declares the classifiers together with the parameters that are NOT going to be tuned during cross-validation. The dictionary model_params specifies the parameters that WILL be tuned during cross-validation. The dictionary cv_params configures how the grid cross-validation is performed.


In [3]:
# Configuration

random_state = 2017 # seed used by the random number generator

models = {
    # NOTE: 'SVC' and 'RFC' are the keys used to refer to the models after the training step.
    'SVC': SVC(probability=True,
               random_state=random_state),
    'RFC': RandomForestClassifier(random_state=random_state)
}

model_params = {
    'SVC': {'kernel':['linear', 'rbf', 'sigmoid']},
    'RFC': {'n_estimators': [25,50, 75, 100]}
}

cv_params = {
    # random_state only takes effect when shuffle=True; recent versions of
    # scikit-learn raise an error if random_state is set while shuffle=False
    'cv': StratifiedKFold(n_splits=3, shuffle=True, random_state=random_state)
}
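To see exactly which parameter combinations the grid cross-validation will evaluate, the dictionaries in model_params can be expanded with scikit-learn's ParameterGrid. This is a standalone sketch, not part of mltoolbox:

```python
# Enumerate the candidate parameter settings the grid search will try
from sklearn.model_selection import ParameterGrid

model_params = {
    'SVC': {'kernel': ['linear', 'rbf', 'sigmoid']},
    'RFC': {'n_estimators': [25, 50, 75, 100]}
}

for name, grid in model_params.items():
    candidates = list(ParameterGrid(grid))
    print("{}: {} candidates -> {}".format(name, len(candidates), candidates))
```

With these grids, SVC has 3 candidates and RFC has 4, so each inner cross-validation evaluates a small, fast search space.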

The MultiClassifier trains the multiple estimators previously configured. First, the data is split into n_splits folds, in this case 5, using the StratifiedKFold class. In each of the 5 splits, four of the five blocks are used for training while the remaining one is used for testing. In addition, if the parameter shuffle=True, the data is rearranged before being split into blocks.
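The outer 5-fold split can be illustrated directly with StratifiedKFold; this is plain scikit-learn, independent of MultiClassifier, and only shows how the blocks are formed:

```python
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold

cancer = datasets.load_breast_cancer()
X, y = cancer.data, cancer.target

# Same outer split parameters as the MultiClassifier below
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2017)
for i, (train_idx, test_idx) in enumerate(outer_cv.split(X, y), start=1):
    # each fold: roughly 4/5 of the samples for training, 1/5 for testing
    print("fold {}: train={}, test={}".format(i, len(train_idx), len(test_idx)))
```

Stratification keeps the class proportions of the full dataset in every block, which matters here because the two classes are imbalanced (212 vs. 357 samples).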


In [4]:
# Training
mc = MultiClassifier(n_splits=5, shuffle=True, random_state=random_state)

Second, the method train() receives the data and the dictionaries with the configurations and performs the training. Taking fold_1 as an example, its training block is divided into 3 parts to perform the inner cross-validation (as specified in the dictionary cv_params): two parts are used to tune the parameters of the classifiers, and one is used to validate them.
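mltoolbox's internals are not shown here, but this inner tuning loop can be approximated with scikit-learn's GridSearchCV; the following is a sketch of the assumed behaviour for the RFC model, not the library's actual implementation:

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

cancer = datasets.load_breast_cancer()
X, y = cancer.data, cancer.target

# 3-fold inner cross-validation, mirroring cv_params above
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=2017)
gs = GridSearchCV(RandomForestClassifier(random_state=2017),
                  {'n_estimators': [25, 50, 75, 100]},
                  cv=inner_cv)
# In MultiClassifier this would run on one outer fold's training block only
gs.fit(X, y)
print("best parameters:", gs.best_params_)
```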


In [5]:
mc.train(X, y, models, model_params, cv_params=cv_params)

Third, once the best parameters have been obtained, a model is fitted on the training data. Following the example, this model is then tested on the fold_1:test block. Once training and testing have been performed for each fold, the results can be visualized in a report.


In [6]:
# Results
print('RFC\n{}\n'.format(mc.report_score_summary_by_classifier('RFC')))
print('SVC\n{}\n'.format(mc.report_score_summary_by_classifier('SVC')))


RFC
               Accuracy Specificity  Precision     Recall   F1-score        AUC

          1      0.9478     1.0000     1.0000     0.9167     0.9565     0.9583
          2      0.9739     0.9302     0.9600     1.0000     0.9796     0.9651
          3      0.9735     0.9524     0.9722     0.9859     0.9790     0.9691
          4      0.9646     0.9048     0.9467     1.0000     0.9726     0.9524
          5      0.9469     0.8810     0.9333     0.9859     0.9589     0.9334

    Average      0.9613     0.9337     0.9624     0.9777     0.9693     0.9557


SVC
               Accuracy Specificity  Precision     Recall   F1-score        AUC

          1      0.9478     1.0000     1.0000     0.9167     0.9565     0.9583
          2      0.9652     0.9302     0.9595     0.9861     0.9726     0.9582
          3      0.9735     0.9286     0.9595     1.0000     0.9793     0.9643
          4      0.9735     0.9762     0.9857     0.9718     0.9787     0.9740
          5      0.9115     0.8095     0.8961     0.9718     0.9324     0.8907

    Average      0.9543     0.9289     0.9601     0.9693     0.9639     0.9491
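The reports include Specificity, which scikit-learn does not provide as a built-in scorer; per fold it can be derived from the confusion matrix as TN / (TN + FP). A minimal illustration with made-up labels (not data from the folds above):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / float(tn + fp)  # true-negative rate
recall = tp / float(tp + fn)       # sensitivity / true-positive rate
print("specificity={:.4f}, recall={:.4f}".format(specificity, recall))
# -> specificity=0.7500, recall=0.7500
```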


To analyze a specific fold, you can obtain the indices of the data used for training and testing, the trained model, and the prediction on the test data. The method best_estimator() has a fold_key parameter; if it is not set, the method returns the fold with the highest accuracy.

TODO: Use the measurement as a parameter to get the best estimator
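Until such a parameter exists, the same selection can be done by hand from the per-fold report, e.g. picking the fold with the best F1-score (values copied from the RFC report above):

```python
# F1-score per fold, taken from the RFC report printed above
f1_by_fold = {1: 0.9565, 2: 0.9796, 3: 0.9790, 4: 0.9726, 5: 0.9589}

# Pick the fold whose F1-score is highest
best_fold = max(f1_by_fold, key=f1_by_fold.get)
print("best fold by F1-score:", best_fold)  # -> 2
```

The chosen fold key could then be passed to best_estimator() via fold_key to retrieve that fold's model and indices.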


In [7]:
# Get the results of the partition with the highest accuracy

fold, bm_model, bm_y_pred, bm_train_indices, bm_test_indices = mc.best_estimator('RFC')['RFC']

print(">>Best model in fold: {}".format(fold))
print(">>>Trained model \n{}".format(bm_model))
print(">>>Predicted labels: \n{}".format(bm_y_pred))
print(">>>Indices of the samples used for training: \n{}".format(bm_train_indices))
print(">>>Indices of samples used for predicting: \n{}".format(bm_test_indices))


>>Best model in fold: 2
>>>Trained model 
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=25, n_jobs=1,
            oob_score=False, random_state=2017, verbose=0,
            warm_start=False)
>>>Predicted labels: 
[0 0 0 0 0 0 1 1 0 1 1 0 1 0 1 0 0 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0 0 0 0 1
 0 0 0 1 0 0 0 0 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 0 0 1 0 1 0
 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1
 1 1 0 0]
>>>Indices of the samples used for training: 
[  0   1   2   3   4   5   6   7   8   9  10  13  14  15  16  17  19  20
  21  22  23  24  25  26  27  28  29  30  31  33  34  36  37  38  39  40
  41  42  43  44  45  46  47  48  49  50  51  52  54  55  56  57  59  60
  62  63  65  66  67  68  70  74  76  77  78  80  81  82  83  84  85  86
  88  89  90  92  94  95  97  99 100 101 102 103 104 105 107 108 109 110
 113 114 115 117 118 119 120 121 122 124 125 127 128 129 130 131 133 134
 135 136 138 140 142 143 144 145 146 148 149 150 151 152 153 154 157 158
 159 160 162 163 165 166 167 168 169 170 172 173 175 176 178 179 181 182
 183 184 185 186 187 188 189 191 195 196 198 199 200 201 202 203 204 205
 206 207 208 209 210 211 212 213 215 217 218 220 221 222 223 224 225 226
 227 229 230 232 233 235 237 238 239 240 241 242 244 246 248 249 250 252
 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 271
 272 275 276 277 278 279 280 281 282 283 284 286 287 288 289 290 292 293
 294 295 296 297 298 299 300 301 302 305 307 308 309 310 311 312 313 315
 316 318 319 320 321 322 323 325 327 328 329 330 331 332 333 334 335 338
 340 341 343 344 345 346 347 348 349 350 351 353 354 356 357 358 359 360
 361 362 363 364 365 367 368 369 370 371 372 373 374 375 376 378 379 380
 381 382 385 386 387 388 390 391 392 393 394 395 396 397 399 400 401 402
 403 406 407 408 410 411 413 414 415 416 417 418 421 422 423 424 425 427
 428 430 431 432 433 434 435 436 437 438 440 442 443 444 446 448 449 450
 451 452 453 454 455 456 457 458 460 461 462 463 464 465 466 467 469 472
 473 474 475 476 477 478 479 480 482 483 484 485 486 487 488 489 490 491
 493 495 496 497 498 499 501 503 506 507 509 510 511 512 513 515 516 517
 518 519 520 521 523 524 525 526 529 530 531 533 534 537 538 539 540 542
 543 544 545 546 547 548 549 550 551 552 553 555 557 558 559 560 561 562
 563 566 567 568]
>>>Indices of samples used for predicting: 
[ 11  12  18  32  35  53  58  61  64  69  71  72  73  75  79  87  91  93
  96  98 106 111 112 116 123 126 132 137 139 141 147 155 156 161 164 171
 174 177 180 190 192 193 194 197 214 216 219 228 231 234 236 243 245 247
 251 270 273 274 285 291 303 304 306 314 317 324 326 336 337 339 342 352
 355 366 377 383 384 389 398 404 405 409 412 419 420 426 429 439 441 445
 447 459 468 470 471 481 492 494 500 502 504 505 508 514 522 527 528 532
 535 536 541 554 556 564 565]

If you need to train the model again using the data of a specific fold, you can use bm_train_indices and bm_test_indices.


In [8]:
# Recover the partition of the dataset based on the results of the best model
X_train_final, X_test_final = X[bm_train_indices], X[bm_test_indices]
y_train_final, y_test_final = y[bm_train_indices], y[bm_test_indices]

In [9]:
# Refit the best model on its training partition and test it
bm_model.fit(X_train_final, y_train_final)
print("Final score {0:.4f}".format(bm_model.score(X_test_final, y_test_final)))


Final score 0.9739

The feature importances can also be obtained, provided the underlying estimator exposes them (as RandomForestClassifier does).


In [10]:
importances = mc.feature_importances('RFC')

In [11]:
%matplotlib inline
import matplotlib.pyplot as plt

indices = range(n_features)
f, ax = plt.subplots(figsize=(11, 9))
plt.title("Feature importances", fontsize=20)
plt.bar(indices, importances, color="black", align="center")

plt.xticks(indices)
plt.ylabel("Importance", fontsize=18)
plt.xlabel("Index of the feature", fontsize=18)


Out[11]:
<matplotlib.text.Text at 0x1dfdaeb9358>
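To put names on the bars, the feature indices can be mapped back to cancer.feature_names. The following standalone sketch fits a RandomForestClassifier directly (so it runs without mltoolbox); the ranking it prints is therefore not necessarily identical to the mc.feature_importances('RFC') values plotted above:

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

cancer = datasets.load_breast_cancer()
rf = RandomForestClassifier(n_estimators=25, random_state=2017)
rf.fit(cancer.data, cancer.target)

# Five most important features by mean impurity decrease, named
top = np.argsort(rf.feature_importances_)[::-1][:5]
for i in top:
    print("{:>2}  {:<25} {:.4f}".format(i, cancer.feature_names[i],
                                        rf.feature_importances_[i]))
```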