In this worksheet, we will explore the Baserate Fallacy and its importance in making functional predictions, or performing any other kind of (binary yes/no) classification.
This worksheet provides an interactive example to demonstrate how effector class (or any other binary) prediction accuracy varies with the following factors:

- the sensitivity of the classifier
- the false positive rate (FPR) of the classifier
- the base rate of occurrence of the positive class (here, the proportion of effectors in the genome)
Predicting whether a biological sequence belongs to a particular class is at the heart of nearly all functional prediction and annotation activities. The great variety and success of tools such as BLAST, and resources like Pfam, might lead one to believe that the problem has been solved for all cases, but this is not in fact true.
The diversity of plant pathogen effector sequences, our collective uncertainty about the relationships between function and sequence for these proteins, and the inherent difficulty of effective classification mean that their prediction is an ongoing area of research, decades after their identification.
We take as our example a paper from 2009 (Arnold et al.) that describes a software tool to identify bacterial type III secretion system (T3SS) effector proteins (EffectiveT3). In that paper, the authors use their tool to screen 739 completely sequenced bacterial genomes, finding a large number of type III secretion system effectors in all bacteria analysed - including those that do not possess a functional Type III Secretion System. The presence of effectors in these organisms would seem, on the face of it, to be implausible.
What is striking is the number of predicted effector proteins, and the proportion of the total proteome that they describe. Happily, in a success for open science and supplementary material, the paper includes a complete account of their predictions. As two examples:
- An isolate with a T3SS
- An isolate with no T3SS
Something is clearly awry. Despite wildly disparate CDS counts, both organisms are predicted to encode around 200 T3SS effectors, even though one of them has no mechanism to secrete them. The authors note:
The surprisingly high number of (false) positives in genomes without TTSS exceeds the expected false positive rate [...] and thus raised questions about their nature.
and propose:
The missing clear difference between Gram-negatives with and without TTSS may be explained by the noise caused by misannotations which seem to be present in all selected genomes (data not shown). Additionally, putative Type III effectors may not be a unique feature of species encoding a TTSS but could be ubiquitous in a broad range of phylogenetically diverse microbes.
There is, however, an alternative, simple statistical explanation for these observations that explains their results precisely: The Baserate Fallacy.
Imagine that you work in security at an airport, and have a test for smugglers. This test of yours identifies smugglers quite well: it correctly flags 99% of smugglers, and incorrectly flags only 1% of innocent passengers.

This means that your test has two important performance metrics: a sensitivity of 0.99, and a false positive rate of 0.01.
You may be quite confident about the ability of your method to classify smugglers correctly. But what happens when it is put in place at a real airport?
Let's imagine a busy airport. London's Heathrow airport handles around 200,000 passengers per day (2016). So, applying your test for smugglers will flag $200,000 \times 0.01 = 2,000$ passengers per day as smugglers, even if there are no smugglers at all. These are all false positives.
If there are 100 smugglers a day at Heathrow, your method will identify $100 \times 0.99 = 99$ of them, but these numbers are swamped 20:1 by the number of false positives.
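The arithmetic above is easy to check directly. A minimal sketch, using the figures from the example (the variable names are mine, for illustration):

```python
# Assumed figures from the worked example above
passengers = 200_000   # daily passengers at a Heathrow-sized airport
smugglers = 100        # actual smugglers among them
sensitivity = 0.99     # fraction of smugglers the test correctly flags
fpr = 0.01             # fraction of passengers the test wrongly flags

true_positives = smugglers * sensitivity   # 99 smugglers caught
false_positives = passengers * fpr         # 2,000 innocents flagged

# Of everyone flagged, fewer than 1 in 20 is actually a smuggler
precision = true_positives / (true_positives + false_positives)
print(f"{false_positives:.0f} false positives swamp "
      f"{true_positives:.0f} true positives (precision {precision:.3f})")
```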
This is the heart of the baserate fallacy:
Suppose that all objects in a group belong to exactly one of two classes: positive and negative.
Then the probability that a randomly-chosen member of the group belongs to the positive class is $P(\text{pos})$. We call this the base rate of 'positive' class members.
Note: $P(\text{pos}) = 1 - P(\overline{\text{pos}})$
Suppose also that we have a classifier method that identifies objects presented to it as either positive or negative.
The conditional probability that an object identified by the classifier as positive belongs to the positive class (a true positive prediction), $P(\text{pos}|\text{+ve})$, is then given by Bayes' Theorem:
$$P(\text{pos}\ |\ \text{+ve}) = \frac{P(\text{+ve}\ |\ \text{pos})P(\text{pos})}{P(\text{+ve}\ |\ \text{pos})P(\text{pos}) + P(\text{+ve}\ |\ \overline{\text{pos}})P(\overline{\text{pos}})}$$from which it follows that
$$P(\text{pos}) \rightarrow\ 0\ \implies\ P(\text{pos}\ |\ \text{+ve})\ \rightarrow\ 0$$and
$$P(\text{+ve}\ |\ \overline{\text{pos}})P(\overline{\text{pos}}) > 0 \implies P(\text{pos}\ |\ \text{+ve}) < 1$$which is another way to express the thought that:

when the base rate of the positive class is low, most positive predictions are false positives, no matter how good the classifier

and also reveals that:

so long as the classifier's false positive rate is nonzero, a positive prediction can never be completely certain
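The chain of implications above can be made concrete in a few lines of Python. This is a sketch; `p_pos_given_positive` is an illustrative name, not part of the worksheet's helper module:

```python
def p_pos_given_positive(sens: float, fpr: float, baserate: float) -> float:
    """P(pos | +ve) by Bayes' Theorem.

    sens     = P(+ve | pos), the sensitivity
    fpr      = P(+ve | not pos), the false positive rate
    baserate = P(pos), the base rate of the positive class
    """
    true_pos_mass = sens * baserate          # P(+ve | pos) P(pos)
    false_pos_mass = fpr * (1 - baserate)    # P(+ve | not pos) P(not pos)
    return true_pos_mass / (true_pos_mass + false_pos_mass)

# Even a classifier with 99% sensitivity and a 1% false positive rate
# is no better than a coin toss once the base rate falls to 1%:
for baserate in (0.5, 0.1, 0.01, 0.001):
    print(baserate, round(p_pos_given_positive(0.99, 0.01, baserate), 3))
```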
In [ ]:
%pylab inline
# The magic above lets us plot directly within the notebook.
# The helpers module contains code that plots the relationship between
# the effector baserate, and the probability that a positive prediction
# is correct, for given predictor performance metrics (sensitivity, FPR)
from helpers import ex03
# for interactive widgets, requires iPython notebook v2
from ipywidgets import interact
The example below uses the `plot_prob_effector()` function from the `ex03` helper module to plot a curve of the expected results from a hypothetical effector classifier tool.
Specifically, the function plots the value of $P(\text{effector}\ |\ \text{+ve})$ against baseline effector rate ($P(\text{pos})$). In the code below, we set a sensitivity of 0.95, and false positive rate of 0.05.
In [ ]:
# Plot a curve for 0.95 sensitivity, 0.05 FPR, with 30% of
# the genome being effectors
ex03.plot_prob_effector(0.95, 0.05, baserate=0.3)
As can be seen from the curve above, a test with these characteristics would perform very well, such that $P(\text{pos}\ |\ \text{+ve}) > 0.8$ for the majority of baseline effector rates - specifically all rates above around 20%.
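The "around 20%" figure can be checked by rearranging Bayes' Theorem to find the base rate at which $P(\text{pos}\ |\ \text{+ve})$ reaches a target value. A sketch, with an illustrative function name that is not part of `ex03`:

```python
def baserate_for_precision(sens: float, fpr: float, target: float) -> float:
    """Base rate P(pos) at which P(pos | +ve) equals `target`.

    Rearranged from: target = sens*b / (sens*b + fpr*(1 - b))
    """
    return target * fpr / (sens * (1 - target) + target * fpr)

# With sens=0.95 and fpr=0.05, P(pos | +ve) crosses 0.8 at a base
# rate of about 0.17, consistent with "around 20%" of CDS.
print(baserate_for_precision(0.95, 0.05, 0.8))
```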
However, a typical effector class constitutes 3% or less of the total CDS complement of any pathogen (see Kemen et al. (2011), and Boch & Bonas (2010)).
We redraw the plot, reducing the base rate accordingly:
In [ ]:
# Plot a curve for 0.95 sensitivity, 0.05 FPR, with 3% of
# the genome being effectors
ex03.plot_prob_effector(0.95, 0.05, baserate=0.03)
So, a test like this would actually be expected to have $P(\text{pos}\ |\ \text{+ve}) \approx 0.37$ or lower if applied to a complete pathogen genome CDS complement.
One approach to improve predictive performance in practice is to stratify the data to which the classifier is applied. This means, effectively, reducing the set of objects that the classifier sees to a group that is more likely to contain the class of interest: i.e. a greater proportion of effectors, in this example. This essentially increases the base rate of positive examples, $P(\text{pos})$.
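The effect of stratification can be seen with the same Bayes calculation: if pre-filtering enriches the input from 3% effectors to, say, 30% effectors (an illustrative figure), the probability that a positive prediction is correct jumps accordingly:

```python
def precision(sens: float, fpr: float, baserate: float) -> float:
    """P(pos | +ve) by Bayes' Theorem."""
    tp_mass = sens * baserate
    fp_mass = fpr * (1 - baserate)
    return tp_mass / (tp_mass + fp_mass)

sens, fpr = 0.95, 0.05   # classifier settings used in the plots above

whole_genome = precision(sens, fpr, 0.03)  # ~0.37: most hits are wrong
stratified = precision(sens, fpr, 0.30)    # ~0.89: most hits are right
print(f"whole genome: {whole_genome:.2f}, stratified: {stratified:.2f}")
```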
The interactive example below allows you to choose the sensitivity (`sens`) and false positive rate (`fpr`) for the effector classifier, and renders a plot of expected $P(\text{pos}|\text{+ve})$ against the base rate of effectors in the input set to the classifier.
By setting the value of `xmax` appropriately, you can zoom in on the plot to explore $P(\text{pos}|\text{+ve})$ values for low base rates of effector occurrence.
In [ ]:
# To see the curve for the T3E prediction method in the
# Arnold et al. paper, set sens=0.71, fpr=0.15
interact(ex03.plot_prob_effector, sens=(0.01, 0.99, 0.01),
fpr=(0.01, 0.99, 0.01), xmax=(0.1, 1, 0.01),
baserate=(0.01, 0.99, 0.01));