TextExplainer: debugging black-box text classifiers

While eli5 supports many classifiers and preprocessing methods, it can't support them all.

If a library is not supported by eli5 directly, or the text processing pipeline is too complex for eli5, eli5 can still help - it provides an implementation of the LIME (Ribeiro et al., 2016) algorithm, which makes it possible to explain predictions of arbitrary classifiers, including text classifiers. eli5.lime can also help when it is hard to get an exact mapping between model coefficients and text features, e.g. if there is dimension reduction involved.

Example problem: LSA+SVM for 20 Newsgroups dataset

Let's load "20 Newsgroups" dataset and create a text processing pipeline which is hard to debug using conventional methods: SVM with RBF kernel trained on LSA features.


In [1]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian', 
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=('headers', 'footers'),
)
twenty_test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=('headers', 'footers'),
)

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline, make_pipeline

vec = TfidfVectorizer(min_df=3, stop_words='english',
                      ngram_range=(1, 2))
svd = TruncatedSVD(n_components=100, n_iter=7, random_state=42)
lsa = make_pipeline(vec, svd)

clf = SVC(C=150, gamma=2e-2, probability=True)
pipe = make_pipeline(lsa, clf)
pipe.fit(twenty_train.data, twenty_train.target)
pipe.score(twenty_test.data, twenty_test.target)


Out[2]:
0.89014647137150471

The dimension of the input documents is reduced to 100, and then a kernel SVM is used to classify the documents.

This is what the pipeline returns for a document - it is pretty sure the first message in the test data belongs to sci.med:


In [3]:
def print_prediction(doc):
    y_pred = pipe.predict_proba([doc])[0]
    for target, prob in zip(twenty_train.target_names, y_pred):
        print("{:.3f} {}".format(prob, target))    

doc = twenty_test.data[0]
print_prediction(doc)


0.001 alt.atheism
0.001 comp.graphics
0.995 sci.med
0.004 soc.religion.christian

TextExplainer

Such pipelines are not supported by eli5 directly, but one can use eli5.lime.TextExplainer to debug the prediction - to check what in the document was important for this decision.

Create a TextExplainer instance, pass the document to explain and a black-box classifier (a function which returns probabilities) to the TextExplainer.fit method, and then check the explanation:


In [4]:
import eli5
from eli5.lime import TextExplainer

te = TextExplainer(random_state=42)
te.fit(doc, pipe.predict_proba)
te.show_prediction(target_names=twenty_train.target_names)


Out[4]:

y=alt.atheism (probability 0.000, score -9.663) top features

Contribution? Feature
-0.360 <BIAS>
-9.303 Highlighted in text (sum)

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

y=comp.graphics (probability 0.000, score -8.503) top features

Contribution? Feature
-0.210 <BIAS>
-8.293 Highlighted in text (sum)

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

y=sci.med (probability 0.996, score 5.826) top features

Contribution? Feature
+5.929 Highlighted in text (sum)
-0.103 <BIAS>

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

y=soc.religion.christian (probability 0.004, score -5.504) top features

Contribution? Feature
-0.342 <BIAS>
-5.162 Highlighted in text (sum)

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

Why it works

The explanation makes sense - we expect a reasonable classifier to take the highlighted words into account. But how can we be sure this is how the pipeline actually works, and not just a nice-looking lie? A simple sanity check is to remove or change the highlighted words and confirm that this changes the outcome:


In [5]:
import re
doc2 = re.sub(r'(recall|kidney|stones|medication|pain|tech)', '', doc, flags=re.I)
print_prediction(doc2)


0.065 alt.atheism
0.145 comp.graphics
0.376 sci.med
0.414 soc.religion.christian

Predicted probabilities changed a lot indeed.

And in fact, TextExplainer did something similar to get the explanation. TextExplainer generated a lot of texts similar to the document (by removing some of the words), and then trained a white-box classifier which predicts the output of the black-box classifier (not the true labels!). The explanation we saw is for this white-box classifier.

This approach follows the LIME algorithm; for text data the algorithm is actually pretty straightforward:

  1. generate distorted versions of the text;
  2. predict probabilities for these distorted texts using the black-box classifier;
  3. train another classifier (one of those eli5 supports) which tries to predict the output of the black-box classifier on these texts.

The algorithm works because even though it could be hard or impossible to approximate a black-box classifier globally (for every possible text), approximating it in a small neighbourhood near a given text often works well, even with simple white-box classifiers.
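To make these three steps concrete, here is a minimal sketch of the same idea done by hand with plain scikit-learn. It is not TextExplainer's actual implementation (which also weights the generated samples by their similarity to the original document and uses a masking-based sampler), just the same recipe in its simplest form:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(42)

def distort(text, drop_prob=0.5):
    # step 1: drop a random subset of words
    return ' '.join(t for t in text.split() if rng.rand() > drop_prob)

local_docs = [distort(doc) for _ in range(5000)]

# step 2: label the distorted texts with the black-box pipeline
y_proba = pipe.predict_proba(local_docs)
y_top = y_proba.argmax(axis=1)

# step 3: fit a simple white-box model on the black-box labels
vec_local = CountVectorizer(ngram_range=(1, 2))
X_local = vec_local.fit_transform(local_docs)
white_box = SGDClassifier(loss='log', penalty='elasticnet', alpha=1e-3,
                          random_state=42)
white_box.fit(X_local, y_top)

Inspecting white_box coefficients through the local vectorizer's vocabulary would then give an explanation of the black-box prediction near doc, which is essentially what TextExplainer.show_prediction displays.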

Generated samples (distorted texts) are available in the samples_ attribute:


In [6]:
print(te.samples_[0])


As    my   kidney ,  isn' any
  can        .

Either they ,     be    ,   
to   .

   ,  - tech  to mention  ' had kidney
 and ,     .

By default TextExplainer generates 5000 distorted texts (use the n_samples argument to change this number):


In [7]:
len(te.samples_)


Out[7]:
5000

The trained vectorizer and white-box classifier are available as the vec_ and clf_ attributes:


In [8]:
te.vec_, te.clf_


Out[8]:
(CountVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=1.0, max_features=None, min_df=1,
         ngram_range=(1, 2), preprocessor=None, stop_words=None,
         strip_accents=None, token_pattern='(?u)\\b\\w+\\b', tokenizer=None,
         vocabulary=None),
 SGDClassifier(alpha=0.001, average=False, class_weight=None, epsilon=0.1,
        eta0=0.0, fit_intercept=True, l1_ratio=0.15,
        learning_rate='optimal', loss='log', n_iter=5, n_jobs=1,
        penalty='elasticnet', power_t=0.5,
        random_state=<mtrand.RandomState object at 0x10e1dcf78>,
        shuffle=True, verbose=0, warm_start=False))

Should we trust the explanation?

Ok, this sounds fine, but how can we be sure that this simple text classification pipeline approximated the black-box classifier well?

One way to do that is to check the quality on a held-out dataset (which is also generated). TextExplainer does that by default and stores the metrics in the metrics_ attribute:


In [9]:
te.metrics_


Out[9]:
{'mean_KL_divergence': 0.020120624088861134, 'score': 0.98625304704899297}

  • 'score' is an accuracy score weighted by the cosine distance between a generated sample and the original document (i.e. texts which are closer to the example are more important). Accuracy shows how good the 'top 1' predictions are.
  • 'mean_KL_divergence' is a mean Kullback–Leibler divergence over all target classes; it is also weighted by distance. KL divergence shows how well the probabilities are approximated; 0.0 means a perfect match. A rough sketch of how such weighted metrics could be computed is shown below.
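
For reference, here is a rough sketch of how such distance-weighted metrics could be computed, assuming we already have the black-box probabilities, the white-box probabilities and similarity weights for the held-out generated texts. It illustrates the idea rather than reproducing eli5's exact code:

import numpy as np

def weighted_score(p_black, p_white, weights):
    # distance-weighted 'top 1' accuracy: how often the two classifiers
    # agree on the most likely class, with closer samples counting more
    agree = (p_black.argmax(axis=1) == p_white.argmax(axis=1))
    return np.average(agree, weights=weights)

def weighted_mean_kl(p_black, p_white, weights, eps=1e-9):
    # distance-weighted mean KL(black-box || white-box) over samples
    kl = np.sum(p_black * (np.log(p_black + eps) - np.log(p_white + eps)), axis=1)
    return np.average(kl, weights=weights)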

In this example both the accuracy and the KL divergence are good; it means our white-box classifier usually assigns the same labels as the black-box classifier on the dataset we generated, and its predicted probabilities are close to those predicted by our LSA+SVM pipeline. So it is likely (though not guaranteed - we'll discuss it later) that the explanation is correct and can be trusted.

When working with LIME (e.g. via TextExplainer) it is always a good idea to check these scores. If they are not good then you can tell that something is not right.

Let's make it fail

By default TextExplainer uses a very basic text processing pipeline: a logistic regression (an SGDClassifier with log loss) trained on bag-of-words and bag-of-bigrams features (see the te.clf_ and te.vec_ attributes). This limits the set of black-box classifiers it can explain: because the text is seen as a "bag of words/ngrams", the default white-box pipeline can't distinguish, for example, between the same word at the beginning of the document and at the end of the document. Bigrams help to alleviate the problem in practice, but not completely.

Black-box classifiers which use features like "text length" (not directly related to tokens) can also be hard to approximate using the default bag-of-words/ngrams model.

This kind of failure is usually detectable though - the scores (accuracy and KL divergence) will be low. Let's check it on a completely synthetic example - a black-box classifier which assigns a class based on whether the document length is odd or even and on the presence of the word 'medication'.


In [10]:
import numpy as np

def predict_proba_len(docs):
    # nasty predict_proba - the result is based on document length,
    # and also on the presence of the word "medication"
    proba = [
        [0, 0, 1.0, 0] if len(doc) % 2 or 'medication' in doc else [1.0, 0, 0, 0] 
        for doc in docs
    ]
    return np.array(proba)    

te3 = TextExplainer().fit(doc, predict_proba_len)
te3.show_prediction(target_names=twenty_train.target_names)


Out[10]:

y=sci.med (probability 0.989, score 4.466) top features

Contribution? Feature
+4.576 Highlighted in text (sum)
-0.110 <BIAS>

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

TextExplainer correctly figured out that 'medication' is important, but failed to account for the len(doc) % 2 condition, so the explanation is incomplete. We can detect this failure by looking at the metrics - they are low:


In [11]:
te3.metrics_


Out[11]:
{'mean_KL_divergence': 0.3312922355257879, 'score': 0.79050673156810314}

If (a big if...) we suspect that whether the document length is even or odd is important, it is possible to customize TextExplainer to check this hypothesis.

To do that, we need to create a vectorizer which returns both an "is odd" feature and the bag-of-words features, and pass this vectorizer to TextExplainer. This vectorizer should follow the scikit-learn API. The easiest way is to use FeatureUnion - just make sure all transformers joined by FeatureUnion have get_feature_names() methods.


In [12]:
from sklearn.pipeline import make_union
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import TransformerMixin

class DocLength(TransformerMixin):
    def fit(self, X, y=None):  # some boilerplate
        return self
    
    def transform(self, X):
        return [
            # note that we need both a positive and a negative
            # feature - otherwise, for a linear model, there would
            # be no feature to show in half of the cases
            [len(doc) % 2, not len(doc) % 2] 
            for doc in X
        ]
    
    def get_feature_names(self):
        return ['is_odd', 'is_even']

vec = make_union(DocLength(), CountVectorizer(ngram_range=(1,2)))
te4 = TextExplainer(vec=vec).fit(doc[:-1], predict_proba_len)

print(te4.metrics_)
te4.explain_prediction(target_names=twenty_train.target_names)


{'mean_KL_divergence': 0.024826114773734968, 'score': 1.0}
Out[12]:

y=sci.med (probability 0.996, score 5.511) top features

Contribution? Feature
+8.590 countvectorizer: Highlighted in text (sum)
-0.043 <BIAS>
-3.037 doclength__is_even

countvectorizer: as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less

Much better! It was a toy example, but the idea stands - if you think something could be important, add it to the mix as a feature for TextExplainer.

Let's make it fail, again

Another possible issue is the dataset generation method. Not only should the feature extraction be powerful enough, the auto-generated texts should also be diverse enough.

TextExplainer removes random words by default, so by default it can't, for example, provide a good explanation for a black-box classifier which works at the character level. Let's try to use TextExplainer to explain a classifier which uses char ngrams as features:


In [13]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vec_char = HashingVectorizer(analyzer='char_wb', ngram_range=(4,5))
clf_char = SGDClassifier(loss='log')

pipe_char = make_pipeline(vec_char, clf_char)
pipe_char.fit(twenty_train.data, twenty_train.target)
pipe_char.score(twenty_test.data, twenty_test.target)


Out[13]:
0.88082556591211714

This pipeline is supported by eli5 directly, so in practice there is no need to use TextExplainer for it. We're using this pipeline as an example - it is possible to check the "true" explanation first, without using TextExplainer, and then compare the results with TextExplainer results.


In [14]:
eli5.show_prediction(clf_char, doc, vec=vec_char,
                    targets=['sci.med'], target_names=twenty_train.target_names)


Out[14]:

y=sci.med (probability 0.565, score -0.037) top features

Contribution? Feature
+0.943 Highlighted in text (sum)
-0.980 <BIAS>

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

TextExplainer produces a different result:


In [15]:
te = TextExplainer(random_state=42).fit(doc, pipe_char.predict_proba)
print(te.metrics_)
te.show_prediction(targets=['sci.med'], target_names=twenty_train.target_names)


{'mean_KL_divergence': 0.020247299052285436, 'score': 0.92434669226497945}
Out[15]:

y=sci.med (probability 0.576, score 0.621) top features

Contribution? Feature
+0.972 Highlighted in text (sum)
-0.351 <BIAS>

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

The scores look OK but not great; the explanation kind of makes sense at first sight, but we know that the classifier actually works in a different way.

To explain such black-box classifiers we need to change both the dataset generation method (change/remove individual characters, not only words) and the feature extraction method (e.g. use char ngrams instead of words and word ngrams).

TextExplainer has an option (char_based=True) to use char-based sampling and a char-based classifier. If this makes for a more powerful explanation engine, why not always use it?


In [16]:
te = TextExplainer(char_based=True, random_state=42)
te.fit(doc, pipe_char.predict_proba)
print(te.metrics_)
te.show_prediction(targets=['sci.med'], target_names=twenty_train.target_names)


{'mean_KL_divergence': 0.22136004391576117, 'score': 0.55669450678688481}
Out[16]:

y=sci.med (probability 0.366, score -0.003) top features

Contribution? Feature
+0.199 Highlighted in text (sum)
-0.202 <BIAS>

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

Hm, the result looks worse. TextExplainer correctly detected that only the first part of the word "medication" is important, but the result is noisy overall, and the scores are bad. Let's try it with more samples:


In [17]:
te = TextExplainer(char_based=True, n_samples=50000, random_state=42)
te.fit(doc, pipe_char.predict_proba)
print(te.metrics_)
te.show_prediction(targets=['sci.med'], target_names=twenty_train.target_names)


{'mean_KL_divergence': 0.060019833958355841, 'score': 0.86048000626542609}
Out[17]:

y=sci.med (probability 0.630, score 0.800) top features

Contribution? Feature
+1.018 Highlighted in text (sum)
-0.219 <BIAS>

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

It is getting closer, but still not quite there. The problem is that this approach is much more resource intensive - you need a lot more samples to get non-noisy results. Here, explaining a single example took more time than training the original pipeline.

Generally speaking, to produce an efficient explanation we should make some assumptions about the black-box classifier, such as:

  1. it uses words as features and doesn't take word positions into account;
  2. it uses words as features and takes word positions into account;
  3. it uses word ngrams as features;
  4. it uses char ngrams as features, and positions don't matter (i.e. an ngram means the same everywhere);
  5. it uses arbitrary attention over the text characters, i.e. every part of the text could potentially be important for the classifier on its own;
  6. it is important to have a particular token at a particular position, e.g. "third token is X", and if we delete the 2nd token then the prediction changes not because the 2nd token changed, but because the 3rd token is shifted.

Depending on the assumptions, we should choose both the dataset generation method and the white-box classifier. There is a tradeoff between generality and speed.

Simple bag-of-words assumptions allow for fast sample generation, and just a few hundred samples may be enough to get OK quality if the assumption is correct. But such generation methods / models will fail to explain a more complex classifier properly (though they could still provide an explanation which is useful in practice).

On the other hand, allowing each character to be important is a more powerful method, but it can require a lot of samples (maybe hundreds of thousands) and a lot of CPU time to get non-noisy results.

What's bad about this kind of failure (a wrong assumption about the black-box pipeline) is that it could be impossible to detect it by looking at the scores alone. Scores could be high because the generated dataset is not diverse enough, not because our approximation is good.

The takeaway is that it is important to understand the "lenses" you're looking through when using LIME to explain a prediction.
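
To make this concrete, here is a rough mapping from the assumptions above to TextExplainer settings, using only the options demonstrated in this tutorial; treat it as a starting point rather than a recipe:

from eli5.lime import TextExplainer

# words / word ngrams, positions mostly irrelevant (assumptions 1-3):
# the default word-based sampling and bag-of-ngrams white-box model
te_words = TextExplainer(random_state=42)

# char ngrams or character-level "attention" (assumptions 4-5):
# char-based sampling and a char-based white-box model; expect to need
# many more samples to get non-noisy results
te_chars = TextExplainer(char_based=True, n_samples=50000, random_state=42)

# strongly position-dependent classifiers (assumption 6) call for a custom
# vectorizer and/or sampler, similar in spirit to the DocLength example above

Both are then fitted with a document and a predict_proba callable, exactly as before.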

Customizing TextExplainer: sampling

TextExplainer uses MaskingTextSampler or MaskingTextSamplers instances to generate the texts it trains on. MaskingTextSampler is the main text generation class; MaskingTextSamplers provides a way to combine multiple samplers into a single object with the same interface.

A custom sampler instance can be passed to TextExplainer if we want to experiment with sampling. For example, let's try a sampler which replaces no more than 3 characters in the text (the default is to replace a random number of characters):


In [18]:
from eli5.lime.samplers import MaskingTextSampler
sampler = MaskingTextSampler(
    # Regex to split text into tokens.
    # "." means any single character is a token, i.e.
    # we work on chars.
    token_pattern='.',

    # replace no more than 3 tokens
    max_replace=3,

    # by default (bow=True) all occurrences of the same token
    # are replaced together; with bow=False each token position
    # is replaced independently
    bow=False,
)
samples, similarity = sampler.sample_near(doc)
print(samples[0])


As I recal from my bout with kidney stones, there isn't any
medication that can do anything about them except relieve the ain.

Either thy pass, or they have to be broken up with sound, or they have
to be extracted surgically.

When I was in, the X-ray tech happened to mention that she'd had kidney
stones and children, and the childbirth hurt less.

In [19]:
te = TextExplainer(char_based=True, sampler=sampler, random_state=42)
te.fit(doc, pipe_char.predict_proba)
print(te.metrics_)
te.show_prediction(targets=['sci.med'], target_names=twenty_train.target_names)


{'mean_KL_divergence': 0.71042368337755823, 'score': 0.99933430578588944}
Out[19]:

y=sci.med (probability 0.958, score 2.434) top features

Contribution? Feature
+2.430 Highlighted in text (sum)
+0.005 <BIAS>

as i recall from my bout with kidney stones, there isn't any medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney stones and children, and the childbirth hurt less.

Note that the accuracy score is perfect, but the KL divergence is bad. It means this sampler was not very useful: most generated texts were "easy" in the sense that most (or all?) of them should still be classified as sci.med, so it was easy to get good accuracy. But because the generated texts were not diverse enough, the classifier hasn't learned anything useful; it has a hard time predicting the probability output of the black-box pipeline on the held-out dataset.

By default TextExplainer uses a mix of several sampling strategies which seems to work OK for token-based explanations. But a good sampling strategy which works for many real-world tasks could be a research topic in itself. If you've got some experience with it we'd love to hear from you - please share your findings in the eli5 issue tracker ( https://github.com/TeamHG-Memex/eli5/issues )!
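
For illustration, here is one way such a mix could be assembled by hand from two MaskingTextSampler instances with different settings; this is only a sketch of the idea - MaskingTextSamplers wraps this kind of combination behind the standard sampler interface:

from eli5.lime.samplers import MaskingTextSampler

# word-level sampler which masks all occurrences of a token together
word_sampler = MaskingTextSampler(bow=True, random_state=42)
# word-level sampler which masks individual token positions instead
pos_sampler = MaskingTextSampler(bow=False, random_state=42)

word_docs, word_sim = word_sampler.sample_near(doc, n_samples=250)
pos_docs, pos_sim = pos_sampler.sample_near(doc, n_samples=250)
mixed_docs = list(word_docs) + list(pos_docs)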

Customizing TextExplainer: classifier

In one of the previous examples we already changed the vectorizer TextExplainer uses (to take additional features into account). It is also possible to change the white-box classifier - for example, use a small decision tree:


In [20]:
from sklearn.tree import DecisionTreeClassifier

te5 = TextExplainer(clf=DecisionTreeClassifier(max_depth=2), random_state=0)
te5.fit(doc, pipe.predict_proba)
print(te5.metrics_)
te5.show_weights()


{'mean_KL_divergence': 0.037836554598348969, 'score': 0.9838155527960798}
Out[20]:
Weight Feature
0.5461 kidney
0.4539 pain



Tree:

kidney <= 0.5   (gini = 0.1561, samples = 100.0%, value = [0.01, 0.03, 0.92, 0.04])
  True  -> pain <= 0.5   (gini = 0.3834, samples = 38.9%, value = [0.03, 0.09, 0.77, 0.11])
    True  -> gini = 0.5185, samples = 28.4%, value = [0.04, 0.14, 0.66, 0.16]
    False -> gini = 0.0434, samples = 10.6%, value = [0.0, 0.0, 0.98, 0.02]
  False -> pain <= 0.5   (gini = 0.0456, samples = 61.1%, value = [0.0, 0.01, 0.98, 0.01])
    True  -> gini = 0.1153, samples = 22.8%, value = [0.01, 0.02, 0.94, 0.04]
    False -> gini = 0.0114, samples = 38.2%, value = [0.0, 0.0, 0.99, 0.0]

How to read it: "kidney <= 0.5" means "the word 'kidney' is not in the document" (we are explaining the original LSA+SVM pipeline again).

So according to this tree, if "kidney" is not in the document and "pain" is not in the document, the probability of the document belonging to sci.med drops to about 0.66. If at least one of these words remains, the sci.med probability stays above 0.9.


In [21]:
print("both words removed::")
print_prediction(re.sub(r"(kidney|pain)", "", doc, flags=re.I))
print("\nonly 'pain' removed:")
print_prediction(re.sub(r"pain", "", doc, flags=re.I))


both words removed:
0.013 alt.atheism
0.022 comp.graphics
0.894 sci.med
0.072 soc.religion.christian

only 'pain' removed:
0.002 alt.atheism
0.004 comp.graphics
0.979 sci.med
0.015 soc.religion.christian

As expected, after removing both words the probability of sci.med decreased, though not by as much as our simple decision tree predicted (to 0.89 instead of 0.66). Removing only 'pain' had close to the predicted effect - the probability of sci.med stayed high, at 0.98 (the tree predicted 0.94).