While eli5 supports many classifiers and preprocessing methods, it can't support them all.
If a library is not supported by eli5 directly, or the text processing pipeline is too complex for eli5, eli5 can still help - it provides an implementation of the LIME (Ribeiro et al., 2016) algorithm, which allows explaining the predictions of arbitrary classifiers, including text classifiers. eli5.lime can also help when it is hard to get an exact mapping between model coefficients and text features, e.g. if there is dimension reduction involved.
Let's load the "20 Newsgroups" dataset and create a text processing pipeline which is hard to debug using conventional methods: an SVM with an RBF kernel trained on LSA features.
In [1]:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=('headers', 'footers'),
)
twenty_test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=('headers', 'footers'),
)
In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline, make_pipeline
vec = TfidfVectorizer(min_df=3, stop_words='english',
                      ngram_range=(1, 2))
svd = TruncatedSVD(n_components=100, n_iter=7, random_state=42)
lsa = make_pipeline(vec, svd)
clf = SVC(C=150, gamma=2e-2, probability=True)
pipe = make_pipeline(lsa, clf)
pipe.fit(twenty_train.data, twenty_train.target)
pipe.score(twenty_test.data, twenty_test.target)
Out[2]:
The dimension of the input documents is reduced to 100, and then a kernel SVM is used to classify the documents.
This is what the pipeline returns for a document - it is pretty sure the first message in the test data belongs to sci.med:
In [3]:
def print_prediction(doc):
    y_pred = pipe.predict_proba([doc])[0]
    for target, prob in zip(twenty_train.target_names, y_pred):
        print("{:.3f} {}".format(prob, target))

doc = twenty_test.data[0]
print_prediction(doc)
Such pipelines are not supported by eli5 directly, but one can use eli5.lime.TextExplainer to debug the prediction - to check what was important in the document to make this decision.
Create a TextExplainer instance, then pass the document to explain and a black-box classifier (a function which returns probabilities) to the TextExplainer.fit method, then check the explanation:
In [4]:
import eli5
from eli5.lime import TextExplainer
te = TextExplainer(random_state=42)
te.fit(doc, pipe.predict_proba)
te.show_prediction(target_names=twenty_train.target_names)
Out[4]:
The explanation makes sense - we expect a reasonable classifier to take the highlighted words into account. But how can we be sure this is how the pipeline works, not just a nice-looking lie? A simple sanity check is to remove or change the highlighted words and confirm that they change the outcome:
In [5]:
import re
doc2 = re.sub(r'(recall|kidney|stones|medication|pain|tech)', '', doc, flags=re.I)
print_prediction(doc2)
Predicted probabilities changed a lot indeed.
And in fact, TextExplainer did something similar to get the explanation. TextExplainer generated a lot of texts similar to the document (by removing some of the words), and then trained a white-box classifier which predicts the output of the black-box classifier (not the true labels!). The explanation we saw is for this white-box classifier.
This approach follows the LIME algorithm; for text data the algorithm is actually pretty straightforward: generate distorted versions of the text, get black-box predictions for them, and train a white-box classifier to predict the black-box output on these texts.
The algorithm works because even though it could be hard or impossible to approximate a black-box classifier globally (for every possible text), approximating it in a small neighbourhood near a given text often works well, even with simple white-box classifiers.
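As a rough illustration of the approach, here is a minimal from-scratch version for the sci.med class. This is only a sketch of the idea, not how eli5 implements it (eli5 also weights samples by their similarity to the original document, evaluates quality on a held-out set, and so on); all names here are ad hoc.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

rng = np.random.RandomState(42)
words = doc.split()

# 1. generate distorted versions of the text by dropping random words
samples = [' '.join(w for w in words if rng.rand() > 0.5) for _ in range(500)]

# 2. ask the black-box pipeline for its output on these texts; here we take
#    the predicted probability of the class we want to explain (sci.med)
sci_med_idx = list(twenty_train.target_names).index('sci.med')
y_black_box = pipe.predict_proba(samples)[:, sci_med_idx]

# 3. fit a simple white-box model on bag-of-words features of the samples,
#    predicting the black-box output (not the true labels!)
vec_white = CountVectorizer()
clf_white = Ridge(alpha=1.0).fit(vec_white.fit_transform(samples), y_black_box)

# words with the largest positive coefficients form the explanation
top = np.argsort(clf_white.coef_)[::-1][:5]
print(np.asarray(vec_white.get_feature_names())[top])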
Generated samples (distorted texts) are available in the samples_ attribute:
In [6]:
print(te.samples_[0])
By default TextExplainer generates 5000 distorted texts (use the n_samples argument to change the amount):
In [7]:
len(te.samples_)
Out[7]:
The trained white-box classifier and vectorizer are available as the vec_ and clf_ attributes:
In [8]:
te.vec_, te.clf_
Out[8]:
Ok, this sounds fine, but how can we be sure that this simple text classification pipeline approximated the black-box classifier well? One way to do that is to check the quality on a held-out dataset (which is also generated). TextExplainer does that by default and stores the metrics in the metrics_ attribute:
In [9]:
te.metrics_
Out[9]:
In this example both accuracy and KL divergence are good; it means our white-box classifier usually assigns the same labels as the black-box classifier on the dataset we generated, and its predicted probabilities are close to those predicted by our LSA+SVM pipeline. So it is likely (though not guaranteed, we'll discuss it later) that the explanation is correct and can be trusted.
When working with LIME (e.g. via TextExplainer) it is always a good idea to check these scores. If they are not good then you can tell that something is not right.
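To make the KL-divergence number more concrete, here is a rough sketch of what it measures, recomputed on the stored samples. eli5 itself evaluates it on a held-out part of the generated data with sample weights, so the value will not match te.metrics_ exactly; the sketch also assumes te.clf_ exposes predict_proba and has seen all four classes.
import numpy as np

P = pipe.predict_proba(te.samples_)                         # black-box probabilities
Q = te.clf_.predict_proba(te.vec_.transform(te.samples_))   # white-box probabilities
eps = 1e-9
mean_kl = np.mean(np.sum(P * np.log((P + eps) / (Q + eps)), axis=1))
print(mean_kl)  # small values mean the white-box model tracks the black box closely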
By default TextExplainer uses a very basic text processing pipeline: Logistic Regression trained on bag-of-words and bag-of-bigrams features (see the te.clf_ and te.vec_ attributes). This limits the set of black-box classifiers it can explain: because the text is seen as a "bag of words/ngrams", the default white-box pipeline can't distinguish e.g. between the same word at the beginning of the document and at the end of the document. Bigrams help to alleviate the problem in practice, but not completely.
Black-box classifiers which use features like "text length" (not directly related to tokens) can also be hard to approximate using the default bag-of-words/ngrams model.
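A tiny illustration of this position-blindness, separate from the tutorial pipeline: a pure bag-of-words vectorizer maps word-order permutations of a text to identical vectors, so a bag-of-words white-box model simply cannot see positional information.
from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer()
X = v.fit_transform(["kidney stones cause pain", "pain cause stones kidney"])
print((X[0].toarray() == X[1].toarray()).all())  # True - the two documents look identical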
This kind of failure is usually detectable though - the scores (accuracy and KL divergence) will be low. Let's check it on a completely synthetic example - a black-box classifier which assigns a class based on whether the document length is odd or even and on the presence of the word 'medication'.
In [10]:
import numpy as np
def predict_proba_len(docs):
    # a nasty predict_proba - the result is based on document length,
    # and also on the presence of the word "medication"
    proba = [
        [0, 0, 1.0, 0] if len(doc) % 2 or 'medication' in doc else [1.0, 0, 0, 0]
        for doc in docs
    ]
    return np.array(proba)
te3 = TextExplainer().fit(doc, predict_proba_len)
te3.show_prediction(target_names=twenty_train.target_names)
Out[10]:
TextExplainer correctly figured out that 'medication' is important, but failed to account for the "len(doc) % 2" condition, so the explanation is incomplete. We can detect this failure by looking at the metrics - they are low:
In [11]:
te3.metrics_
Out[11]:
If (a big if...) we suspect that whether the document length is even or odd is important, it is possible to customize TextExplainer to check this hypothesis.
To do that, we need to create a vectorizer which returns both the "is odd" feature and bag-of-words features, and pass this vectorizer to TextExplainer. This vectorizer should follow the scikit-learn API. The easiest way is to use FeatureUnion - just make sure all transformers joined by FeatureUnion have get_feature_names() methods.
In [12]:
from sklearn.pipeline import make_union
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import TransformerMixin
class DocLength(TransformerMixin):
    def fit(self, X, y=None):  # some boilerplate
        return self

    def transform(self, X):
        return [
            # note that we need both a positive and a negative
            # feature - otherwise for a linear model there won't
            # be a feature to show in half of the cases
            [len(doc) % 2, not len(doc) % 2]
            for doc in X
        ]

    def get_feature_names(self):
        return ['is_odd', 'is_even']
vec = make_union(DocLength(), CountVectorizer(ngram_range=(1,2)))
te4 = TextExplainer(vec=vec).fit(doc[:-1], predict_proba_len)
print(te4.metrics_)
te4.explain_prediction(target_names=twenty_train.target_names)
Out[12]:
Much better! It was a toy example, but the idea stands - if you think something could be important, add it to the mix as a feature for TextExplainer.
Another possible issue is the dataset generation method. Not only should the feature extraction be powerful enough, the auto-generated texts should also be diverse enough.
TextExplainer removes random words by default, so out of the box it can't e.g. provide a good explanation for a black-box classifier which works at the character level. Let's try to use TextExplainer to explain a classifier which uses char ngrams as features:
In [13]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
vec_char = HashingVectorizer(analyzer='char_wb', ngram_range=(4,5))
clf_char = SGDClassifier(loss='log')
pipe_char = make_pipeline(vec_char, clf_char)
pipe_char.fit(twenty_train.data, twenty_train.target)
pipe_char.score(twenty_test.data, twenty_test.target)
Out[13]:
This pipeline is supported by eli5 directly, so in practice there is no need to use TextExplainer for it. We're using this pipeline as an example - it is possible to check the "true" explanation first, without using TextExplainer, and then compare the results with the TextExplainer results.
In [14]:
eli5.show_prediction(clf_char, doc, vec=vec_char,
                     targets=['sci.med'], target_names=twenty_train.target_names)
Out[14]:
TextExplainer produces a different result:
In [15]:
te = TextExplainer(random_state=42).fit(doc, pipe_char.predict_proba)
print(te.metrics_)
te.show_prediction(targets=['sci.med'], target_names=twenty_train.target_names)
Out[15]:
The scores look OK but not great; the explanation kind of makes sense at first sight, but we know that the classifier works in a different way.
To explain such black-box classifiers we need to change both dataset generation method (change/remove individual characters, not only words) and feature extraction method (e.g. use char ngrams instead of words and word ngrams).
TextExplainer has an option (char_based=True) to use char-based sampling and a char-based classifier. If this makes for a more powerful explanation engine, why not use it all the time?
In [16]:
te = TextExplainer(char_based=True, random_state=42)
te.fit(doc, pipe_char.predict_proba)
print(te.metrics_)
te.show_prediction(targets=['sci.med'], target_names=twenty_train.target_names)
Out[16]:
Hm, the result looks worse. TextExplainer correctly detected that only the first part of the word "medication" is important, but the result is noisy overall, and the scores are bad. Let's try it with more samples:
In [17]:
te = TextExplainer(char_based=True, n_samples=50000, random_state=42)
te.fit(doc, pipe_char.predict_proba)
print(te.metrics_)
te.show_prediction(targets=['sci.med'], target_names=twenty_train.target_names)
Out[17]:
It is getting closer, but still not quite there. The problem is that this is much more resource-intensive - you need a lot more samples to get non-noisy results. Here explaining a single example took more time than training the original pipeline.
Generally speaking, to do an efficient explanation we should make some assumptions about the black-box classifier - for example, whether it uses words or char ngrams as features, whether word positions matter, or whether every individual character could be important on its own.
Depending on these assumptions we should choose both the dataset generation method and the white-box classifier. There is a tradeoff between generality and speed.
Simple bag-of-words assumptions allow for fast sample generation, and just a few hundred samples could be enough to get OK quality if the assumption is correct. But such generation methods / models will fail to explain a more complex classifier properly (though they could still provide an explanation which is useful in practice).
On the other hand, allowing each character to be important is a more powerful method, but it can require a lot of samples (maybe hundreds of thousands) and a lot of CPU time to get non-noisy results.
What's bad about this kind of failure (a wrong assumption about the black-box pipeline) is that it could be impossible to detect the failure by looking at the scores. Scores could be high because the generated dataset is not diverse enough, not because our approximation is good.
The takeaway is that it is important to understand the "lenses" you're looking through when using LIME to explain a prediction.
TextExplainer uses MaskingTextSampler or MaskingTextSamplers instances to generate texts to train on. MaskingTextSampler is the main text generation class; MaskingTextSamplers provides a way to combine multiple samplers in a single object with the same interface.
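For instance, a gentle sampler and an aggressive one could be mixed. The sketch below assumes the eli5.lime.samplers constructor signature MaskingTextSamplers(sampler_params, token_pattern=None, random_state=None, weights=None), with per-sampler parameters passed as a list of dicts - check the API docs of your eli5 version before relying on it.
from eli5.lime.samplers import MaskingTextSamplers

# assumed API: a list of MaskingTextSampler parameter dicts, with optional weights
mixed_sampler = MaskingTextSamplers(
    sampler_params=[
        {'max_replace': 3},   # a sampler making only small perturbations
        {},                   # a sampler with default settings
    ],
    token_pattern='.',        # character-level tokens, as in the example below
    weights=[0.5, 0.5],
    random_state=42,
)
samples, similarity = mixed_sampler.sample_near(doc, n_samples=4)
print(samples[0])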
A custom sampler instance can be passed to TextExplainer if we want to experiment with sampling. For example, let's try a sampler which replaces no more than 3 characters in the text (the default is to replace a random number of characters):
In [18]:
from eli5.lime.samplers import MaskingTextSampler
sampler = MaskingTextSampler(
    # Regex to split text into tokens.
    # "." means any single character is a token, i.e.
    # we work on chars.
    token_pattern='.',
    # replace no more than 3 tokens
    max_replace=3,
    # by default all occurrences of a sampled token are replaced;
    # with bow=False only the token at a given position is replaced
    bow=False,
)
samples, similarity = sampler.sample_near(doc)
print(samples[0])
In [19]:
te = TextExplainer(char_based=True, sampler=sampler, random_state=42)
te.fit(doc, pipe_char.predict_proba)
print(te.metrics_)
te.show_prediction(targets=['sci.med'], target_names=twenty_train.target_names)
Out[19]:
Note that the accuracy score is perfect, but KL divergence is bad. It means this sampler was not very useful: most generated texts were "easy" in the sense that most (or all?) of them should still be classified as sci.med, so it was easy to get a good accuracy. But because the generated texts were not diverse enough, the classifier hasn't learned anything useful; it has a hard time predicting the probability output of the black-box pipeline on a held-out dataset.
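One way to see this lack of diversity directly is to look at the similarity values the sampler returns alongside the generated texts (a quick check reusing the sampler defined above; treating the returned similarities as a numpy-compatible array is an assumption):
import numpy as np

samples, similarity = sampler.sample_near(doc, n_samples=100)
similarity = np.asarray(similarity)
print(similarity.min(), similarity.mean())  # expected to be close to 1.0 - the texts barely change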
By default TextExplainer uses a mix of several sampling strategies which seems to work OK for token-based explanations. But a good sampling strategy which works for many real-world tasks could be a research topic in itself. If you've got some experience with it we'd love to hear from you - please share your findings in the eli5 issue tracker ( https://github.com/TeamHG-Memex/eli5/issues )!
We have already changed the vectorizer and the sampler; it is also possible to change the white-box classifier TextExplainer uses - for example, use a small decision tree:
In [20]:
from sklearn.tree import DecisionTreeClassifier
te5 = TextExplainer(clf=DecisionTreeClassifier(max_depth=2), random_state=0)
te5.fit(doc, pipe.predict_proba)
print(te5.metrics_)
te5.show_weights()
Out[20]:
How to read it: "kidney <= 0.5" means "the word 'kidney' is not in the document" (we're explaining the original LSA+SVM pipeline again).
So according to this tree, if "kidney" is not in the document and "pain" is not in the document, then the probability of the document belonging to sci.med drops to 0.65. If at least one of these words remains, the sci.med probability stays at 0.9+.
In [21]:
print("both words removed::")
print_prediction(re.sub(r"(kidney|pain)", "", doc, flags=re.I))
print("\nonly 'pain' removed:")
print_prediction(re.sub(r"pain", "", doc, flags=re.I))
As expected, after removing both words the probability of sci.med decreased, though not as much as our simple decision tree predicted (to 0.9 instead of 0.64). Removing pain had exactly the effect predicted - the probability of sci.med became 0.98.