Scikit-multilearn is a BSD-licensed library for multi-label classification that is built on top of the well-known scikit-learn ecosystem.
To install it just run the command:
$ pip install scikit-multilearn
Scikit-multilearn works with Python 2 and 3 on Windows, Linux and OS X. The module name is `skmultilearn`.
In [1]:
from skmultilearn.dataset import load_dataset
Let's load some data. In this tutorial we will work with the `emotions` data set, introduced in the emotions paper.
In [2]:
X_train, y_train, feature_names, label_names = load_dataset('emotions', 'train')
X_test, y_test, _, _ = load_dataset('emotions', 'test')
The `feature_names` variable contains a list of pairs (feature name, type) that were provided in the original data set. For the `emotions` data, the authors write:

> The extracted features fall into two categories: rhythmic and timbre.

Let's take a look at the first few features:
In [3]:
feature_names[:10]
Out[3]:
The `label_names` variable contains a list of pairs (label name, type) for the labels that were used to annotate the music. The paper states that:

> The Tellegen-Watson-Clark model was employed for labeling the data with emotions. The sound clips were annotated by three male experts of age 20, 25 and 30 from the School of Music Studies

The label counts in the training data are as follows:
Label | Description | # Examples |
---|---|---|
L1 | amazed-surprised | 173 |
L2 | happy-pleased | 166 |
L3 | relaxing-calm | 264 |
L4 | quiet-still | 148 |
L5 | sad-lonely | 168 |
L6 | angry-fearful | 189 |
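
Per-label counts like the ones in the table can be derived from a label matrix by summing its columns. The sketch below uses a small toy matrix rather than the real data, and assumes the label matrix is a SciPy sparse matrix with one row per sample and one column per label. The names `y_toy` and `toy_names` are made up for this illustration.

```python
import numpy as np
from scipy.sparse import lil_matrix

# Toy stand-in for a label matrix: 4 samples (rows) x 3 labels (columns).
# Assumption: the real label matrix has the same samples-by-labels layout.
y_toy = lil_matrix(np.array([
    [1, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
]))

# Summing each column gives the number of samples that carry that label.
label_counts = np.asarray(y_toy.sum(axis=0)).ravel()

# Hypothetical label names, used only for this illustration.
toy_names = ["label_a", "label_b", "label_c"]
for name, count in zip(toy_names, label_counts):
    print(name, int(count))
```

The same column sum should work on any samples-by-labels matrix, dense or sparse.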
Let's see the contents of `label_names`:
In [4]:
label_names
Out[4]:
In [5]:
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.svm import SVC
In [6]:
clf = BinaryRelevance(
    classifier=SVC(),
    # [features, labels]: X may be passed to SVC as sparse, y is converted to dense
    require_dense=[False, True]
)
On a side note, Binary Relevance trains one classifier per label. We can see that the classifier hasn't been trained yet:
In [7]:
clf.classifiers_
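
To make the "one classifier per label" idea concrete, here is a minimal sketch of the binary relevance transformation written directly against scikit-learn. It is not the library's implementation: the helper names `fit_binary_relevance` and `predict_binary_relevance` are hypothetical, the data is a toy example, and a decision tree stands in for the base classifier.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier


def fit_binary_relevance(base, X, Y):
    """Train one independent copy of `base` per label column of Y."""
    return [clone(base).fit(X, Y[:, j]) for j in range(Y.shape[1])]


def predict_binary_relevance(models, X):
    """Stack the per-label predictions back into a samples-by-labels matrix."""
    return np.column_stack([model.predict(X) for model in models])


# Toy data: 4 samples, 2 features, 2 labels (label j simply mirrors feature j).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

models = fit_binary_relevance(DecisionTreeClassifier(random_state=0), X, Y)
Y_pred = predict_binary_relevance(models, X)

print(len(models))  # one fitted model per label
print(Y_pred)
```

Each model sees only its own label column, so correlations between labels are ignored. That independence is the defining trade-off of this approach.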
Scikit-learn establishes a convention for how classifiers are organized. The typical usage of a classifier is to train it with `fit(X, y)` and then call `predict(X)` on new samples.
Scikit-multilearn follows these conventions. Let's train a multi-label classifier:
In [8]:
clf.fit(X_train, y_train)
Out[8]:
The base classifiers have been trained now:
In [9]:
clf.classifiers_
Out[9]:
In [10]:
prediction = clf.predict(X_test)
In [11]:
prediction
Out[11]:
## Measure the quality
In [13]:
import sklearn.metrics as metrics
Scikit-learn provides a set of metrics useful for evaluating the quality of the model. They are most often used by providing the true assignment matrix/array as the first argument, and the prediction matrix/array as the second argument.
In [14]:
metrics.hamming_loss(y_test, prediction)
Out[14]:
In [15]:
metrics.accuracy_score(y_test, prediction)
Out[15]:
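
The two metrics measure different things on multi-label data. `hamming_loss` is the fraction of individual label assignments that are wrong, while `accuracy_score` on a label matrix is subset accuracy, where a sample counts as correct only if every one of its labels matches. The sketch below checks both against a hand computation on a small made-up example; the arrays are illustrative and are not taken from the emotions data.

```python
import numpy as np
import sklearn.metrics as metrics

# Toy ground truth and predictions: 3 samples x 3 labels.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],   # one label wrong in this row
                   [0, 1, 0],
                   [1, 1, 0]])

# Hamming loss: share of label cells that differ (1 of 9 here).
hamming_by_hand = np.mean(y_true != y_pred)

# Subset accuracy: share of rows that match exactly (2 of 3 here).
subset_accuracy_by_hand = np.mean(np.all(y_true == y_pred, axis=1))

hamming = metrics.hamming_loss(y_true, y_pred)
accuracy = metrics.accuracy_score(y_true, y_pred)

print(hamming, hamming_by_hand)
print(accuracy, subset_accuracy_by_hand)
```

Because a single wrong label makes the whole row count as a miss, subset accuracy is usually much harsher than one minus the Hamming loss.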