Effector Prediction and the Baserate Fallacy

Introduction

In this notebook, we will explore the baserate fallacy and its importance when making functional predictions, or when performing any other kind of binary (yes/no) classification.

The notebook provides an interactive example demonstrating how the accuracy of effector class (or any other binary) prediction varies with the following factors:

  • The performance of the prediction method (in particular, its sensitivity and false positive rate)
  • Base rate of occurrence of positive examples in the population being tested (e.g. the proportion of effectors in the predicted proteome)

Learning outcomes

  • Understand the baserate fallacy
  • Understand how Bayes' Theorem can be combined with a predictor's performance metrics to interpret the output of bioinformatics tools

Running cells in this notebook

To run a code cell, click on it to select it, then press Shift + Enter (or use the Run button in the notebook toolbar). If this is successful, you should see the input marker to the left of the cell change from

In [ ]:

to (for example)

In [1]:

and you may see output appear below the cell.

References

Requirements

To complete this worksheet, you will need:
  • the ipywidgets library
  • the accompanying helpers module (imported below as from helpers import baserate)

Motivation

Predicting whether a biological sequence belongs to a particular class is at the heart of nearly all functional prediction and annotation activities. The great variety and success of tools such as BLAST, and resources like Pfam, might lead one to believe that this problem has been solved routinely for all cases, but this is not in fact true.

The diversity of plant pathogen effector sequences, our collective uncertainty about the relationships between function and sequence for these proteins, and the inherent difficulty of effective classification mean that their prediction is an ongoing area of research, decades after their identification.

We take as our example a paper from 2009 (Arnold et al.) that describes a software tool, EffectiveT3, for identifying bacterial type III secretion system (T3SS) effector proteins. In that paper, the authors use their tool to screen 739 completely sequenced bacterial genomes, finding a large number of predicted T3SS effectors in all the bacteria analysed - including those that do not possess a functional T3SS. The presence of effectors in these organisms would seem, on the face of it, to be implausible.

What is striking is the number of predicted effector proteins, and the proportion of the total proteome that they describe. Happily, in a success for open science and supplementary material, the paper includes a complete account of their predictions. As two examples:

An isolate with T3SS

  • *Pseudomonas syringae* pv. *phaseolicola*
  • number of CDS tested: 5169
  • proportion (count) of CDS predicted as effectors: 3.8% (196)

An isolate with no T3SS

  • *Corynebacterium jeikeium*
  • number of CDS tested: 2119
  • proportion (count) of CDS predicted as effectors: 10.3% (219)

Something is clearly awry. Despite having wildly disparate CDS counts, both organisms are predicted to encode around 200 T3SS effectors, even though one of them has no mechanism to secrete them. The authors note:

The surprisingly high number of (false) positives in genomes without TTSS exceeds the expected false positive rate [...] and thus raised questions about their nature.

and propose:

The missing clear difference between Gram-negatives with and without TTSS may be explained by the noise caused by misannotations which seem to be present in all selected genomes (data not shown). Additionally, putative Type III effectors may not be a unique feature of species encoding a TTSS but could be ubiquitous in a broad range of phylogenetically diverse microbes.

There is, however, a simple alternative statistical explanation for these observations: the baserate fallacy.

Theory

In the text below, there are two explanations of the baserate fallacy - one in straightforward terms, and one technical.

Straightforward version

Imagine that you work at an airport, in security, and have a test for smugglers. This test of yours is able to identify smugglers quite well:

  • For every 100 smugglers subjected to your test, 99 of them are correctly called positive (as smugglers).
  • For every 100 non-smugglers subjected to your test, 99 of them are correctly called negative (as non-smugglers).

This means that your test has two important performance metrics:

  • The sensitivity (Sn) of your test (how many smugglers are correctly called as smugglers) is $99/100 = 0.99$
  • The false positive rate (FPR) of your test (how many non-smugglers are incorrectly called as smugglers) is $1 - 99/100 = 0.01$.
NOTE: This is actually a good set of performance metrics. In the Arnold et al. paper, the reported values were:
  • sensitivity: 0.71
  • FPR: 0.15
These are good values for a bioinformatics/biological classifier.

You may be quite confident about the ability of your method to classify smugglers correctly. But what happens when it is put in place at an airport?

Let's imagine a busy airport. London's Heathrow airport handles around 200,000 passengers per day (2016). So, applying your test for smugglers will flag $200,000 \times 0.01 = 2,000$ passengers per day as smugglers, even if there are no smugglers at all. These are all false positives.

If there are 100 smugglers a day at Heathrow, your method will identify $100 \times 0.99 = 99$ of them, but these numbers are swamped 20:1 by the number of false positives.
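
We can check this arithmetic with a short calculation. This is an illustrative sketch using the figures quoted above; the passenger and smuggler counts are hypothetical round numbers, not real data.

In [ ]:
# Illustrative check of the airport example, using the figures quoted above
sensitivity = 0.99       # P(called smuggler | smuggler)
fpr = 0.01               # P(called smuggler | non-smuggler)
passengers = 200_000     # approximate daily passenger count
smugglers = 100          # hypothetical number of smugglers per day

true_positives = smugglers * sensitivity            # 99 smugglers correctly flagged
false_positives = (passengers - smugglers) * fpr    # ~1,999 innocent passengers flagged (about 2,000)

print(f"True positives:  {true_positives:.0f}")
print(f"False positives: {false_positives:.0f}")
print(f"False:true ratio is roughly {false_positives / true_positives:.0f}:1")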

This is the heart of the baserate fallacy:

If the proportion of positive examples in the dataset being tested is small, then even an exceptionally good predictor/classifier will overwhelmingly produce false positive results.

Technical version

Suppose that all objects in a group belong to exactly one of two classes: positive and negative.

  • We define a member of the positive class as $\text{pos}$ (so members of the negative class are $\overline{\text{pos}}$).

Then the probability that a randomly-chosen member of the group belongs to the positive class is $P(\text{pos})$. We call this the base rate of 'positive' class members.

Note: $P(\text{pos}) = 1 - P(\overline{\text{pos}})$

Suppose also that we have a classifier method that identifies objects presented to it as either positive or negative.

  • We define a positive prediction by our classifier as $\text{+ve}$
  • The sensitivity of the test is then given by the conditional probability: $P(\text{+ve}\ |\ \text{pos})$
  • The FPR is then given by the conditional probability: $P(\text{+ve}\ |\ \overline{\text{pos}})$.

The conditional probability that an object identified by the classifier as positive belongs to the positive class (a true positive prediction), $P(\text{pos}|\text{+ve})$, is then given by Bayes' Theorem:

$$P(\text{pos}\ |\ \text{+ve}) = \frac{P(\text{+ve}\ |\ \text{pos})P(\text{pos})}{P(\text{+ve}\ |\ \text{pos})P(\text{pos}) + P(\text{+ve}\ |\ \overline{\text{pos}})P(\overline{\text{pos}})}$$

from which it follows that, whenever the FPR is greater than zero,

$$P(\text{pos}) \rightarrow\ 0\ \implies\ P(\text{pos}\ |\ \text{+ve})\ \rightarrow\ 0$$

and

$$P(\text{+ve}\ |\ \overline{\text{pos}})P(\overline{\text{pos}}) > 0 \implies P(\text{pos}\ |\ \text{+ve}) < 1$$

which is another way to express the thought that:

If the proportion of positive examples in the dataset being tested is small, then even an exceptionally good predictor/classifier will overwhelmingly produce false positive results.

and also reveals that

The presence of negative examples in the set of objects being tested by a classifier with a non-zero FPR means that the classifier can always be expected to produce false positives.
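
Before turning to the plotting helpers, here is a minimal standalone sketch of the calculation above. This is illustrative code only, not the implementation used in the helpers module.

In [ ]:
def p_pos_given_positive(sens, fpr, baserate):
    """Return P(pos | +ve) from Bayes' Theorem.

    sens     -- sensitivity, P(+ve | pos)
    fpr      -- false positive rate, P(+ve | not pos)
    baserate -- proportion of positives in the tested set, P(pos)
    """
    true_pos = sens * baserate
    false_pos = fpr * (1 - baserate)
    return true_pos / (true_pos + false_pos)

# As the base rate falls, so does the probability that a positive call is correct
for rate in (0.5, 0.1, 0.03, 0.01):
    print(f"P(pos) = {rate:.2f} -> P(pos | +ve) = {p_pos_given_positive(0.95, 0.05, rate):.2f}")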

General example in Python

Below, we will run through some illustrative plots of $P(\text{pos}\ |\ \text{+ve})$ in response to changes in sensitivity and FPR, and for a range of positive base rates in the set being tested.


In [ ]:
%pylab inline
# The magic above lets us plot directly within the notebook.
# The helpers module contains code that plots the relationship between
# the effector baserate, and the probability that a positive prediction
# is correct, for given predictor performance metrics (sensitivity, FPR)
from helpers import baserate
# for interactive widgets (requires the ipywidgets package)
from ipywidgets import interact

Single Plot Example

The example below uses the plot_prob_effector() function from the baserate helper module (imported above from helpers) to plot a curve of the expected results from a hypothetical effector classifier tool.

Specifically, the function plots the value of $P(\text{effector}\ |\ \text{+ve})$ against the baseline effector rate ($P(\text{pos})$). In the code below, we set a sensitivity of 0.95 and a false positive rate of 0.05.

NOTE: These values would be exceptionally good for a real-life effector classifier.

In [ ]:
# Plot a curve for 0.95 sensitivity, 0.05 FPR, with 30% of
# the genome being effectors
baserate.plot_prob_effector(0.95, 0.05, baserate=0.3)

As can be seen from the curve above, a test with these characteristics would perform very well, such that $P(\text{pos}\ |\ \text{+ve}) > 0.8$ for the majority of baseline effector rates - specifically all rates above around 20%.

However, a typical effector class constitutes 3% or less of the total CDS complement of any pathogen (see Kemen et al. (2011), and Boch & Bonas (2010)).

We redraw the plot, reducing the base rate accordingly:


In [ ]:
# Plot a curve for 0.95 sensitivity, 0.05 FPR, with 3% of
# the genome being effectors
baserate.plot_prob_effector(0.95, 0.05, baserate=0.03)

So, a test like this would actually be expected to have $P(\text{pos}\ |\ \text{+ve}) \approx 0.37$ or lower if applied to a complete pathogen genome CDS complement.
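
This figure can be checked directly against Bayes' Theorem, independently of the plotting helper:

In [ ]:
# Sanity check of the value read from the plot above
sens, fpr, baserate = 0.95, 0.05, 0.03
p = (sens * baserate) / (sens * baserate + fpr * (1 - baserate))
print(f"P(pos | +ve) = {p:.2f}")   # approximately 0.37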

Stratification of input

One approach to improving predictive performance in practice is to stratify the data to which the classifier is applied. This means, effectively, reducing the set of objects that the classifier sees to a group that is more likely to contain the class of interest: i.e. a greater proportion of effectors, in this example. This increases the base rate of positive examples, $P(\text{pos})$.

That is to say, if there are criteria that are necessary for membership of the effector class (e.g. presence of a signal peptide, or characteristic regulatory sequence), these can be used to exclude a large proportion of the genome, and effectively increase the base rate of effectors with respect to the set of sequences passed to the classifier.
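
As an illustration of the effect, suppose a pre-filter (such as a signal peptide requirement) discards 90% of non-effector CDS while retaining all true effectors. The 90% filter efficiency is a hypothetical figure chosen for illustration, not a value taken from the paper.

In [ ]:
def p_pos_given_positive(sens, fpr, baserate):
    """P(pos | +ve) by Bayes' Theorem (redefined here so this cell runs on its own)."""
    true_pos = sens * baserate
    false_pos = fpr * (1 - baserate)
    return true_pos / (true_pos + false_pos)

sens, fpr = 0.95, 0.05

# Whole CDS complement: 3% of sequences are effectors
print(f"Unstratified P(pos | +ve): {p_pos_given_positive(sens, fpr, 0.03):.2f}")

# Hypothetical pre-filter keeps all effectors but only 10% of non-effectors,
# so the base rate in the filtered set rises from 3% to about 24%
effectors, non_effectors = 0.03, 0.97 * 0.1
stratified_baserate = effectors / (effectors + non_effectors)
print(f"Stratified base rate:      {stratified_baserate:.2f}")
print(f"Stratified P(pos | +ve):   {p_pos_given_positive(sens, fpr, stratified_baserate):.2f}")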

Interactive Example

The interactive example below allows you to choose the sensitivity (sens) and false positive rate (fpr) for the effector classifier, and renders a plot of expected $P(\text{pos}|\text{+ve})$ against base rate of effectors in the input set to the classifier.

By setting the value of xmax appropriately, you can zoom in the plot to explore $P(\text{pos}|\text{+ve})$ values for low base rates of effector occurrence.


In [ ]:
# To see the curve for the T3E prediction method in the
# Arnold et al. paper, set sens=0.71, fpr=0.15
interact(baserate.plot_prob_effector, sens=(0.01, 0.99, 0.01), 
         fpr=(0.01, 0.99, 0.01), xmax=(0.1, 1, 0.01),
         baserate=(0.01, 0.99, 0.01));

Questions

In the Arnold et al. paper, the authors report a sensitivity of 0.71 and an FPR of 0.15 for EffectiveT3.
  • For a genome with 2119 CDS (all submitted to the predictor), how many CDS would you expect to be called positive, even in the absence of any true T3SS effectors?
  • Could this reasonably explain the results observed for bacteria with no T3SS in the Arnold et al. paper?
  • What is the expected proportion of true positive predictions of T3SS effectors for these performance metrics, with a base rate of 3% of the genome being T3SS effectors?
  • How many of the predicted 196 T3SS effectors in *P. syringae* pv. *phaseolicola* are likely to be true positives? Is this consistent with published estimates of functional T3SS effectors?