This tutorial will show you how to identify features that help your models in a way that might just be too good to be true. This can happen if there was a problem with the way the dataset was put together, if the machine learning problem wasn't scoped properly, or even because of a bug in one of the feature generators. At times it is hard to understand what a model is really doing, behind the scenes. That's where MLDB's classifier.explain comes to the rescue. In particular, it can help discover that a model is cheating, or in other words, that it has learnt to use bits of information that won't be available when applying the model in real life.
To illustrate this, we are going to train a model on some data where we know a feature is biased. You can find the details here. Basically, the task is to predict whether a client will subscribe to a term deposit after receiving a call from the bank, given some information about the client (the employee who made the call, socioeconomic conditions at the time, etc.).
The notebook cells below use pymldb's Connection class to make REST API calls. You can check out the Using pymldb Tutorial for more details.
In [2]:
import pymldb
mldb = pymldb.Connection()
In [3]:
print mldb.put('/v1/procedures/_', {
    'type': 'import.text',
    'params': {
        'dataFileUrl':
            'archive+http://public.mldb.ai/datasets/bank-additional.zip#bank-additional/bank-additional-full.csv',
        'outputDataset': 'bank_raw',
        'delimiter': ';'
    }
})
Here is a sneak peek of the data.
In [4]:
mldb.query("""
SELECT *
FROM bank_raw
LIMIT 10
""")
Out[4]:
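Before training anything, it is also worth checking how the label is distributed. The query below is a small sketch, not part of the original walk-through; it assumes only the bank_raw dataset imported above and counts the examples for each value of the y column.

# Sketch: label distribution of the raw data (rows are grouped by the y column)
mldb.query("""
    SELECT count(*) AS examples
    FROM bank_raw
    GROUP BY y
""")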
In [6]:
print mldb.put('/v1/procedures/_', {
    'type': 'classifier.train',
    'params': {
        'trainingData': """
            SELECT {* EXCLUDING (y)} AS features,
                   y = 'yes' AS label
            FROM bank_raw
            WHERE rowHash() % 4 != 0
        """,
        'modelFileUrl': 'file://bank_model.cls',
        'algorithm': 'bbdt',
        'functionName': 'score',
        'mode': 'boolean'
    }
})
This creates a classifier function named "score". The higher the score it returns, the more likely the client is to subscribe. Let's try it on a few examples from our test set.
In [7]:
mldb.query("""
SELECT score({features: {* EXCLUDING (y)}}) AS *
FROM bank_raw
WHERE rowHash() % 4 = 0
LIMIT 10
""")
Out[7]:
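Since the score function is callable from SQL, it can also be used to rank prospects. The following query is a sketch that is not part of the original tutorial: it sorts the held-out rows by the model's score (repeating the scoring expression in the ORDER BY clause) to surface the clients the model considers most likely to subscribe, alongside the actual outcome y.

# Sketch: rank held-out clients by predicted score and compare with the actual outcome
mldb.query("""
    SELECT score({features: {* EXCLUDING (y)}})[score] AS score, y
    FROM bank_raw
    WHERE rowHash() % 4 = 0
    ORDER BY score({features: {* EXCLUDING (y)}})[score] DESC
    LIMIT 10
""")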
Now let's see how well our model does on the 25% of the data we didn't train on, to get a feel for how well it should perform in real life.
In [6]:
mldb.put('/v1/procedures/_', {
    'type': 'classifier.test',
    'params': {
        'testingData': """
            SELECT score: score({features: {* EXCLUDING (y)}})[score],
                   label: y = 'yes'
            FROM bank_raw
            WHERE rowHash() % 4 = 0
        """,
        'outputDataset': 'bank_test',
        'mode': 'boolean'
    }
})
Out[6]:
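The scored test set is also written to the bank_test output dataset, so it can be inspected directly. A quick sketch, assuming the score and label columns defined in testingData carry over to that dataset:

# Sketch: peek at the highest-scoring held-out examples stored by classifier.test
mldb.query("""
    SELECT score, label
    FROM bank_test
    ORDER BY score DESC
    LIMIT 10
""")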
As we can see by inspecting the different statistics returned by the classifier.test procedure, the model seems to be doing pretty well! The AUC is 0.95: let's ship this thing to production right now! ... Or let's be cautious!
To understand what's going on, let's use the classifier.explain function. This will give us an idea of how much each feature helps (or hurts) in making the predictions.
In [7]:
print mldb.put('/v1/functions/explain', {
    'type': 'classifier.explain',
    'params': {
        'modelFileUrl': 'file://bank_model.cls'
    }
})
You can "explain" every single example and see how much each feature influences its final score, like this:
In [8]:
mldb.query("""
SELECT explain({features: {* EXCLUDING (y)}, label: y = 'yes'}) AS *
FROM bank_raw
WHERE rowHash() % 4 = 0
LIMIT 10
""")
Out[8]:
Or you can average over all the examples. Here we also transpose the result and sort it by absolute value.
In [9]:
mldb.query("""
    SELECT *
    FROM transpose((
        SELECT avg({explain({features: {* EXCLUDING (y)}, label: y = 'yes'})[explanation] AS *}) AS *
        NAMED 'explanation'
        FROM bank_raw
        WHERE rowHash() % 4 = 0
    ))
    ORDER BY abs(explanation) DESC
""")
Out[9]:
Now what is striking here is that one feature really stands out: duration. This is the actual duration of the call. Clearly, that information would not be available in a real-life setting: you can't know the duration of a call before it's over, and once it's over you already know whether the client has subscribed or not. If you look at the detailed description of the data, you can in fact see a warning saying that using that piece of information is probably a bad idea for any realistic modeling. So let's retrain the model, this time excluding the duration column.
In [10]:
print mldb.put('/v1/procedures/_', {
    'type': 'classifier.train',
    'params': {
        'trainingData': """
            SELECT {* EXCLUDING (y, duration)} AS features,
                   y = 'yes' AS label
            FROM bank_raw
            WHERE rowHash() % 4 != 0
        """,
        'modelFileUrl': 'file://bank_model.cls',
        'algorithm': 'bbdt',
        'functionName': 'score',
        'mode': 'boolean'
    }
})
In [11]:
mldb.put('/v1/procedures/_', {
    'type': 'classifier.test',
    'params': {
        'testingData': """
            SELECT score: score({features: {* EXCLUDING (y)}})[score],
                   label: y = 'yes'
            FROM bank_raw
            WHERE rowHash() % 4 = 0
        """,
        'outputDataset': 'bank_test',
        'mode': 'boolean'
    }
})
Out[11]:
Now an AUC of 0.80 sounds more reasonable!
If we run the explanation again, the highest-ranking features seem more legitimate.
In [12]:
print mldb.put('/v1/functions/explain', {
    'type': 'classifier.explain',
    'params': {
        'modelFileUrl': 'file://bank_model.cls'
    }
})
In [13]:
mldb.query("""
    SELECT *
    FROM transpose((
        SELECT avg({explain({features: {* EXCLUDING (y)}, label: y = 'yes'})[explanation] AS *}) AS *
        NAMED 'explanation'
        FROM bank_raw
        WHERE rowHash() % 4 = 0
    ))
    ORDER BY abs(explanation) DESC
""")
Out[13]:
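To eyeball the relative importance of the remaining features, the averaged explanation values can also be charted. The cell below is a sketch that is not part of the original notebook: it assumes matplotlib is available in the environment and relies on mldb.query returning a pandas DataFrame indexed by row name.

%matplotlib inline

# Sketch: re-run the averaged-explanation query and plot the largest contributions
df = mldb.query("""
    SELECT *
    FROM transpose((
        SELECT avg({explain({features: {* EXCLUDING (y)}, label: y = 'yes'})[explanation] AS *}) AS *
        NAMED 'explanation'
        FROM bank_raw
        WHERE rowHash() % 4 = 0
    ))
    ORDER BY abs(explanation) DESC
""")
df['explanation'].head(15).plot(kind='barh', figsize=(8, 6), title='Average feature contribution')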
We have shown how to use MLDB to identify "too good to be true" features when training a model. Keep in mind that features that really help are not necessarily biased; they might just be really good features! Understanding your data is key, and the tool presented here makes that much simpler.
Check out the other Tutorials and Demos.