Last modification: 2017-10-16
In [1]:
# Imports
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn import preprocessing
from mltoolbox.model_selection.classification import MultiClassifier
In this example, the iris dataset is used; it contains 150 samples, 4 features, and 3 classes. Details of the dataset can be found here.
In [2]:
# Load data
iris = load_iris()
X = iris.data
y = iris.target
n_samples, n_features = X.shape
print("samples:{}, features:{}, labels:{}".format(n_samples, n_features, np.unique(y)))
In [3]:
# Preprocessing
std_scale = preprocessing.StandardScaler().fit(X)
X_std = std_scale.transform(X)
print('After standardization: mean={:.4f}, std={:.4f}'.format(X_std.mean(), X_std.std()))
This section configures the classifiers that are going to be used in the classification; in this case, the Support Vector Machine (SVC) and Random Forest (RFC) classifiers are applied. The dictionary models declares the classifiers and the parameters that are NOT going to be tuned in the cross-validation. The dictionary model_params specifies the parameters that WILL be tuned in the cross-validation. The dictionary cv_params configures how the grid cross-validation is going to be performed.
In [4]:
# Configuration
random_state = 2017 # seed used by the random number generator
models = {
# NOTE: SVC and RFC are the names that will be used to refer to the models after the training step.
'SVC': SVC(probability=True,
random_state=random_state),
'RFC': RandomForestClassifier(random_state=random_state)
}
model_params = {
'SVC': {'kernel':['linear', 'rbf', 'sigmoid']},
'RFC': {'n_estimators': [25, 50, 75, 100]}
}
cv_params = {
'cv': StratifiedKFold(n_splits=3, shuffle=False)  # random_state has no effect when shuffle=False
}
The MultiClassifier trains the multiple estimators configured above. First, the data is divided n_splits times, in this case into 5 folds using the StratifiedKFold class. In each of the 5 splits, four of the five blocks are used for training while the remaining one is used for testing. In addition, if the parameter shuffle=True, the data is rearranged before being split into blocks.
In [5]:
# Training
mc = MultiClassifier(n_splits=5, shuffle=True, random_state=random_state)
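To illustrate the outer split described above, the following sketch uses scikit-learn's StratifiedKFold directly; it is not part of the MultiClassifier pipeline and only prints the size of each of the 5 training/test partitions (assuming MultiClassifier performs an equivalent stratified split internally).
# Illustration only: an outer 5-fold stratified split like the one assumed inside MultiClassifier
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
for i, (train_idx, test_idx) in enumerate(outer_cv.split(X, y), start=1):
    print("fold_{}: {} training samples, {} test samples".format(i, len(train_idx), len(test_idx)))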
Second, the method train() receives the data and the dictionaries with the configurations needed to perform the training. Taking fold_1 as an example, its training block is divided into 3 parts to perform the cross-validation (as specified in the dictionary cv_params). In the cross-validation, two parts are used to tune the parameters of the classifiers and one to test them.
In [6]:
mc.train(X, y, models, model_params, cv_params=cv_params)
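For reference, the parameter tuning performed inside each fold is conceptually similar to scikit-learn's GridSearchCV. The sketch below runs a grid search on the full dataset only to show the mechanism; the actual MultiClassifier tunes on the training block of each fold, and its internals may differ.
# Conceptual sketch of the inner parameter search (assumption: MultiClassifier wraps a grid search)
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(models['SVC'], model_params['SVC'], cv=cv_params['cv'])
grid.fit(X, y)
print(grid.best_params_)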
Third, once the best parameters have been obtained, a model is fitted on the training data. Following the example, this model is then tested on the fold_1:test block. After training and testing have been performed for each fold, the results can be visualized in a report.
In [7]:
# Results
print('RFC\n{}\n'.format(mc.report_score_summary_by_classifier('RFC')))
print('SVC\n{}\n'.format(mc.report_score_summary_by_classifier('SVC')))
In order to analyze a specific fold, you can obtain the indices of the data used for training and testing, the trained model, as well as the predictions on the test data. The method best_estimator() has the parameter fold_key; if it is not set, the method returns the fold with the highest accuracy.
In [8]:
# Get the results of the partition that has the highest accuracy
fold, bm_model, bm_y_pred, bm_train_indices, bm_test_indices = mc.best_estimator('RFC')['RFC']
print(">>Best model in fold: {}".format(fold))
print(">>>Trained model \n{}".format(bm_model))
print(">>>Predicted labels: \n{}".format(bm_y_pred))
print(">>>Indices of the samples used for training: \n{}".format(bm_train_indices))
print(">>>Indices of samples used for predicting: \n{}".format(bm_test_indices))
If you need to train the model again using the data of a specific fold, you can use bm_train_indices and bm_test_indices.
In [9]:
# Recover the partition of the dataset based on the results of the best model
X_train_final, X_test_final = X[bm_train_indices], X[bm_test_indices]
y_train_final, y_test_final = y[bm_train_indices], y[bm_test_indices]
In [10]:
# Retrain the best model on the training partition and evaluate it on the test partition
bm_model.fit(X_train_final, y_train_final)
print("Final score {0:.4f}".format(bm_model.score(X_test_final, y_test_final)))
The feature importances can also be obtained if the algorithm provides them (in this case, the RandomForestClassifier).
In [11]:
importances = mc.feature_importances('RFC')
In [12]:
%matplotlib inline
import matplotlib.pyplot as plt
indices = range(n_features)
f, ax = plt.subplots(figsize=(11, 9))
plt.title("Feature importances", fontsize = 20)
plt.bar(indices, importances, color="black", align="center")
plt.xticks(indices)
plt.ylabel("Importance", fontsize = 18)
plt.xlabel("Index of the feature", fontsize = 18)
Out[12]: [Bar chart "Feature importances": importance per feature index]
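Since the dataset is iris, the feature indices can be mapped to the names stored in iris.feature_names for readability (assuming importances is a 1-D array of length n_features, as the bar chart implies).
# Print each importance next to its iris feature name
for idx, (name, imp) in enumerate(zip(iris.feature_names, importances)):
    print("{}: {} -> {:.4f}".format(idx, name, imp))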