This document will guide you through the process of selecting a classifier for your problem.
Note that there is no established, scientifically proven rule set for selecting a classifier to solve a general multi-label classification problem. Successful approaches often come from mixing intuitions about which classifiers are worth considering, decomposition into subproblems, and experimental model selection.
There are two things you need to consider before choosing a classifier:
There are two ways to make the choice:
Let's load up a data set first so we have something to work on.
In [2]:
from skmultilearn.dataset import load_dataset
In [4]:
X_train, y_train, feature_names, label_names = load_dataset('emotions', 'train')
X_test, y_test, _, _ = load_dataset('emotions', 'test')
Usually a classifier's performance depends on three elements: the number of labels, the number of unique label combinations, and the number of features.
We can obtain the first two from the shape of our output space matrices:
In [8]:
y_train.shape, y_test.shape
Out[8]:
We can use numpy and the list of rows with non-zero values in the output matrices to get the number of unique label combinations.
In [9]:
import numpy as np
np.unique(y_train.rows).shape, np.unique(y_test.rows).shape
Out[9]:
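As a self-contained illustration of the same idea (on a hand-made toy matrix, not the emotions data), each distinct row of a dense label indicator matrix is one label combination, so `np.unique` over rows counts them:

```python
# Toy sketch: counting unique label combinations in a dense indicator matrix.
import numpy as np

y = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 0],
              [1, 1, 0]])

# Four rows, but only three distinct label combinations.
n_combinations = np.unique(y, axis=0).shape[0]
print(n_combinations)  # → 3
```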
Number of features can be found in the shape of the input matrix:
In [10]:
X_train.shape[1]
Out[10]:
There are several ways to measure a classifier's generalization quality, for example the Hamming loss. These measures are conveniently provided by sklearn:
In [ ]:
from skmultilearn.adapt import MLkNN
classifier = MLkNN(k=3)
prediction = classifier.fit(X_train, y_train).predict(X_test)
In [7]:
import sklearn.metrics as metrics
metrics.hamming_loss(y_test, prediction)
Out[7]:
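Hamming loss is only one option. A toy sketch (hand-made indicator matrices, not the emotions predictions above) of a few other sklearn metrics that accept multi-label indicator outputs:

```python
# Toy example: several sklearn metrics for multi-label indicator outputs.
import numpy as np
import sklearn.metrics as metrics

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0]])

hamming = metrics.hamming_loss(y_true, y_pred)          # fraction of misassigned labels
subset_acc = metrics.accuracy_score(y_true, y_pred)     # exact-match (subset) accuracy
f1 = metrics.f1_score(y_true, y_pred, average='macro')  # per-label F1, averaged
print(hamming, subset_acc, f1)
```

Note that `accuracy_score` is very strict for multi-label data: a row only counts if every label matches.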
Scikit-multilearn provides 11 classifiers that enable a strong variety of classification scenarios through label partitioning and ensemble classification. Let's look at the important factors influencing performance. $ g(x) $ denotes the performance of the base classifier in some of the classifiers.
Scikit-multilearn allows estimating parameters to select the best model for multi-label classification using scikit-learn's model selection GridSearchCV API. In the simplest version it can look for the best parameters of a scikit-multilearn classifier, which we'll show on the example case of estimating parameters for MLkNN; in the more complicated case of problem transformation methods it can estimate both the method's hyperparameters and the base classifier's parameters.
In the case of estimating the hyperparameters of a multi-label classifier, we first import the relevant classifier and scikit-learn's GridSearchCV class. Then we define the values of the parameters we want to evaluate. We are interested in which combination of k (the number of neighbours) and s (the smoothing parameter) works best. We also need to select a measure which we want to optimize - we've chosen the F1 macro score. After selecting the parameters we initialize and run the cross-validation grid search and print the best hyperparameters.
In [5]:
from skmultilearn.adapt import MLkNN
from sklearn.model_selection import GridSearchCV
parameters = {'k': range(1,3), 's': [0.5, 0.7, 1.0]}
clf = GridSearchCV(MLkNN(), parameters, scoring='f1_macro')
clf.fit(X_train, y_train)
print(clf.best_params_, clf.best_score_)
These values can be then used directly with the classifier.
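Since `best_params_` is just a dict, it can be unpacked straight into a fresh estimator. A minimal sketch of that pattern using plain scikit-learn (KNeighborsClassifier on synthetic multi-label data; scikit-multilearn classifiers follow the same estimator API):

```python
# Sketch using scikit-learn only (synthetic data): best_params_ is a
# plain dict, so the winning settings can be reused on a fresh estimator.
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_multilabel_classification(n_samples=100, n_labels=2, random_state=0)

search = GridSearchCV(KNeighborsClassifier(),
                      {'n_neighbors': [1, 3, 5]},
                      scoring='f1_macro')
search.fit(X, y)

# Re-instantiate with the best parameters found by the grid search.
best = KNeighborsClassifier(**search.best_params_)
best.fit(X, y)
```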
In problem transformation classifiers we often need to estimate not only a hyperparameter, but also the parameters of the base classifier, and perhaps even the problem transformation method itself. Let's take a look at this on a three-layer construction of an ensemble of problem transformation classifiers using label space partitioning. The parameters include:

classifier
: the classifier for transforming the multi-label classification problem to single-label classification; we will decide between Label Powerset and Classifier Chains

classifier__classifier
: the base classifier for the transformation strategy; we will use random forests here

classifier__classifier__n_estimators
: the number of trees to be used in the forest, passed to the random forest object

clusterer
: a label space partitioning class; we will decide between two approaches provided by the NetworkX library.
In [10]:
from skmultilearn.problem_transform import ClassifierChain, LabelPowerset
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from skmultilearn.cluster import NetworkXLabelGraphClusterer
from skmultilearn.cluster import LabelCooccurrenceGraphBuilder
from skmultilearn.ensemble import LabelSpacePartitioningClassifier
from sklearn.svm import SVC
parameters = {
'classifier': [LabelPowerset(), ClassifierChain()],
'classifier__classifier': [RandomForestClassifier()],
'classifier__classifier__n_estimators': [10, 20, 50],
'clusterer' : [
NetworkXLabelGraphClusterer(LabelCooccurrenceGraphBuilder(weighted=True, include_self_edges=False), 'louvain'),
NetworkXLabelGraphClusterer(LabelCooccurrenceGraphBuilder(weighted=True, include_self_edges=False), 'lpa')
]
}
clf = GridSearchCV(LabelSpacePartitioningClassifier(), parameters, scoring='f1_macro')
clf.fit(X_train, y_train)
print(clf.best_params_, clf.best_score_)