Getting started with scikit-multilearn

Scikit-multilearn is a BSD-licensed library for multi-label classification that is built on top of the well-known scikit-learn ecosystem.

To install it, run:

$ pip install scikit-multilearn

Scikit-multilearn works with Python 2 and 3 on Windows, Linux and OSX. The module name is skmultilearn.



In [1]:

    
from skmultilearn.dataset import load_dataset

Let's load some data. In this tutorial we will work with the emotions data set.



In [2]:

    
X_train, y_train, feature_names, label_names = load_dataset('emotions', 'train')
X_test, y_test, _, _ = load_dataset('emotions', 'test')

emotions:train - exists, not redownloading
emotions:test - exists, not redownloading

The feature_names variable contains a list of (feature name, type) pairs that were provided in the original data set. For the emotions data, the authors write:

The extracted features fall into two categories: rhythmic and timbre.

Let's take a look at the first few features:



In [3]:

    
feature_names[:10]

Out[3]:

[(u'Mean_Acc1298_Mean_Mem40_Centroid', u'NUMERIC'),
 (u'Mean_Acc1298_Mean_Mem40_Rolloff', u'NUMERIC'),
 (u'Mean_Acc1298_Mean_Mem40_Flux', u'NUMERIC'),
 (u'Mean_Acc1298_Mean_Mem40_MFCC_0', u'NUMERIC'),
 (u'Mean_Acc1298_Mean_Mem40_MFCC_1', u'NUMERIC'),
 (u'Mean_Acc1298_Mean_Mem40_MFCC_2', u'NUMERIC'),
 (u'Mean_Acc1298_Mean_Mem40_MFCC_3', u'NUMERIC'),
 (u'Mean_Acc1298_Mean_Mem40_MFCC_4', u'NUMERIC'),
 (u'Mean_Acc1298_Mean_Mem40_MFCC_5', u'NUMERIC'),
 (u'Mean_Acc1298_Mean_Mem40_MFCC_6', u'NUMERIC')]
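
Because each entry is a (name, type) pair, the list is easy to split into names and types. The following is a minimal sketch that uses a hand-copied excerpt of the output above instead of the real feature_names list; the variable names feature_pairs and type_counts are introduced here only for illustration.

```python
# Hand-copied excerpt shaped like feature_names: (name, type) pairs.
feature_pairs = [
    ("Mean_Acc1298_Mean_Mem40_Centroid", "NUMERIC"),
    ("Mean_Acc1298_Mean_Mem40_Rolloff", "NUMERIC"),
    ("Mean_Acc1298_Mean_Mem40_Flux", "NUMERIC"),
]

# Split the pairs into two parallel tuples.
names, types = zip(*feature_pairs)

# Count how many features share each declared type.
type_counts = {}
for feature_type in types:
    type_counts[feature_type] = type_counts.get(feature_type, 0) + 1

print(names[0])
print(type_counts)
```

The same unpacking should work on the full list returned by load_dataset, assuming it keeps the pair structure shown in Out[3].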

The label_names variable contains a list of (label name, type) pairs for the labels that were used to annotate the music. The paper states that:

The Tellegen-Watson-Clark model was employed for labeling the data with emotions. The sound clips were annotated by three male experts of age 20, 25 and 30 from the School of Music Studies

The label counts in the training data are as follows:

Label	Description	# Examples
L1	amazed-surprised	173
L2	happy-pleased	166
L3	relaxing-calm	264
L4	quiet-still	148
L5	sad-lonely	168
L6	angry-fearful	189
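
Counts of this kind are column sums of the label matrix. The sketch below illustrates the idea on a small made-up matrix; toy_labels and toy_label_names are hypothetical and are not taken from the emotions data.

```python
# Toy label matrix (made up for illustration): one row per example,
# one column per label, 1 meaning the label applies to that example.
toy_labels = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
]
toy_label_names = ["L1", "L2", "L3"]  # hypothetical label identifiers

# Sum each column to get the number of examples carrying that label.
counts = [sum(column) for column in zip(*toy_labels)]

for name, count in zip(toy_label_names, counts):
    print(name, count)
```

On the real training labels, which are stored as a sparse matrix, y_train.sum(axis=0) would be the likely equivalent.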

Let's see the contents of label_names:



In [4]:

    
label_names

Out[4]:

[(u'amazed-suprised', [u'0', u'1']),
 (u'happy-pleased', [u'0', u'1']),
 (u'relaxing-calm', [u'0', u'1']),
 (u'quiet-still', [u'0', u'1']),
 (u'sad-lonely', [u'0', u'1']),
 (u'angry-aggresive', [u'0', u'1'])]



In [5]:

    
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.svm import SVC



In [6]:

    
clf = BinaryRelevance(
    classifier=SVC(),
    require_dense=[False, True]
)

As a side note, Binary Relevance trains one classifier for each label. We can see that the classifiers haven't been trained yet:



In [7]:

    
clf.classifiers_

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-5aa82f5c3cc2> in <module>()
----> 1 clf.classifiers_

AttributeError: 'BinaryRelevance' object has no attribute 'classifiers_'

Scikit-learn establishes a convention for how classifiers are used. The typical usage of a classifier is:

fit it to the data (trains the classifier and returns self)
predict results on new data (returns predicted results)
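
The two steps above can be sketched with a toy class. MajorityClassifier is a hypothetical example written for this tutorial, not part of scikit-learn or scikit-multilearn; it only shows the shape of the fit/predict convention.

```python
class MajorityClassifier:
    """Hypothetical toy classifier: always predicts the most common training label."""

    def fit(self, X, y):
        # Count label frequencies; ties resolve to the smallest label.
        counts = {}
        for label in y:
            counts[label] = counts.get(label, 0) + 1
        self.majority_ = max(sorted(counts), key=counts.get)
        return self  # returning self allows fit(...).predict(...) chaining

    def predict(self, X):
        # One prediction per input row.
        return [self.majority_ for _ in X]


X_toy = [[0.1], [0.4], [0.9]]
y_toy = [1, 0, 1]
predictions = MajorityClassifier().fit(X_toy, y_toy).predict([[0.2], [0.7]])
print(predictions)
```

The trailing underscore on majority_ mirrors the scikit-learn habit of marking attributes that exist only after fitting.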

Scikit-multilearn follows these conventions. Let's train a multi-label classifier:



In [8]:

    
clf.fit(X_train, y_train)

Out[8]:

BinaryRelevance(classifier=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
        require_dense=[False, True])

The base classifiers have now been trained:



In [9]:

    
clf.classifiers_

Out[9]:

[SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False),
 SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False),
 SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False),
 SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False),
 SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False),
 SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False)]
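
There is one SVC per label because Binary Relevance fits an independent binary classifier on each column of the label matrix and then stacks the per-label predictions. The sketch below illustrates that idea in plain Python. ToyBinaryRelevance and OneNearestNeighbour are hypothetical stand-ins written for this tutorial; this is not how scikit-multilearn implements the transformation.

```python
import copy


class OneNearestNeighbour:
    """Hypothetical base classifier: 1-NN on squared Euclidean distance."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        def nearest(row):
            distances = [
                sum((a - b) ** 2 for a, b in zip(row, seen)) for seen in self.X_
            ]
            return self.y_[distances.index(min(distances))]

        return [nearest(row) for row in X]


class ToyBinaryRelevance:
    """Sketch of the idea only: one copy of the base classifier per label."""

    def __init__(self, classifier):
        self.classifier = classifier

    def fit(self, X, Y):
        self.classifiers_ = []
        for j in range(len(Y[0])):
            column = [row[j] for row in Y]  # binary target for label j
            self.classifiers_.append(copy.deepcopy(self.classifier).fit(X, column))
        return self

    def predict(self, X):
        per_label = [clf.predict(X) for clf in self.classifiers_]
        # Transpose so that each row holds all labels of one example.
        return [list(row) for row in zip(*per_label)]


X_toy = [[0.0], [1.0], [2.0], [3.0]]
Y_toy = [[1, 0], [1, 0], [0, 1], [0, 1]]
model = ToyBinaryRelevance(OneNearestNeighbour()).fit(X_toy, Y_toy)
toy_prediction = model.predict([[0.2], [2.8]])
print(len(model.classifiers_))
print(toy_prediction)
```

Because each label is learned separately, this approach ignores any correlation between labels.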



In [10]:

    
prediction = clf.predict(X_test)



In [11]:

    
prediction

Out[11]:

<202x6 sparse matrix of type '<type 'numpy.int64'>'
	with 246 stored elements in Compressed Sparse Column format>
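
The prediction is a sparse matrix, so only the assigned labels are stored. A minimal sketch of inspecting such a matrix follows, assuming NumPy and SciPy are installed; toy_prediction is a made-up 3x3 matrix, not the real output.

```python
import numpy as np
from scipy import sparse

# Made-up stand-in for a prediction matrix: rows are examples, columns are labels.
toy_prediction = sparse.csc_matrix(np.array([[1, 0, 0], [0, 0, 1], [0, 0, 0]]))

# Number of stored (non-zero) entries, i.e. assigned labels.
print(toy_prediction.nnz)

# Dense view, convenient for small matrices.
dense = toy_prediction.toarray()
print(dense)

# Label indices assigned to each example.
labels_per_example = [[] for _ in range(toy_prediction.shape[0])]
for r, c in zip(*toy_prediction.nonzero()):
    labels_per_example[r].append(int(c))
labels_per_example = [sorted(labels) for labels in labels_per_example]
print(labels_per_example)
```

Converting with toarray() is fine at this size but can be costly for large label spaces.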



Measure the quality



In [13]:

    
import sklearn.metrics as metrics

Scikit-learn provides a set of metrics for evaluating the quality of a model. Most of them take the true assignment matrix/array as the first argument and the prediction matrix/array as the second.



In [14]:

    
metrics.hamming_loss(y_test, prediction)

Out[14]:

0.26485148514851486
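
Hamming loss is the fraction of individual (example, label) cells where the prediction disagrees with the truth, so lower is better. The sketch below computes it by hand on made-up data; toy_hamming_loss is a hypothetical helper for illustration, not a scikit-learn function.

```python
def toy_hamming_loss(y_true, y_pred):
    """Fraction of label cells that differ between truth and prediction."""
    n_cells = len(y_true) * len(y_true[0])
    mismatches = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return mismatches / float(n_cells)


# Made-up label matrices: 2 examples, 3 labels.
y_true_toy = [[1, 0, 1], [0, 1, 0]]
y_pred_toy = [[1, 1, 1], [0, 1, 1]]

loss = toy_hamming_loss(y_true_toy, y_pred_toy)
print(loss)  # 2 wrong cells out of 6
```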



In [15]:

    
metrics.accuracy_score(y_test, prediction)

Out[15]:

0.14356435643564355
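
For multi-label input, accuracy_score reports subset accuracy: an example counts as correct only when every one of its labels is predicted correctly. That makes it much stricter than Hamming loss, which is why the score here is low. A hand-rolled sketch on made-up data follows; toy_subset_accuracy is a hypothetical helper, not a scikit-learn function.

```python
def toy_subset_accuracy(y_true, y_pred):
    """Fraction of examples whose whole label row is predicted exactly."""
    exact = sum(
        list(true_row) == list(pred_row)
        for true_row, pred_row in zip(y_true, y_pred)
    )
    return exact / float(len(y_true))


# Made-up label matrices: 3 examples, 3 labels.
y_true_toy = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred_toy = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]

score = toy_subset_accuracy(y_true_toy, y_pred_toy)
print(score)  # 2 exact rows out of 3
```

The second row has a single wrong label yet counts as a complete miss, which illustrates how one error per example drives this metric down.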