01b - Classifiers (15min)

1. Biological motivation

You are interested in a predicted gene coding region on a newly-sequenced genome, and want to know if its protein product belongs to a particular functional class of proteins (RxLR effector proteins).

You have access to a software tool that *classifies* proteins as `effector` or `not effector`.

QUESTION: (2min)

If the software tool says that the protein product is an effector, should you believe the prediction?

2. Performance Metrics and Contingency Tables

We consider *classifier tools* that decide whether or not an input belongs to a particular *class*. A tool that makes this two-way decision is a *binary classifier*.

The performance of a binary classifier tool is typically measured on a test set of data in which each example is known to be a member of the class (positive) or not a member (negative). Four values can then be calculated:

  • True Positives (TP): the number of positive examples that the classifier correctly assigns as positive
  • True Negatives (TN): the number of negative examples that the classifier correctly assigns as negative
  • False Positives (FP): the number of negative examples that the classifier incorrectly assigns as positive
  • False Negatives (FN): the number of positive examples that the classifier incorrectly assigns as negative

These are often represented as a contingency table, or confusion matrix:

                          Actually positive   Actually negative
  Classified positive            TP                  FP
  Classified negative            FN                  TN

These values can be combined into a number of useful performance metrics that summarise the classifier's ability to perform particular tasks.

  • Sensitivity (Sn, true positive rate, TPR): The proportion of positive examples that are correctly classified
$$\textrm{Sensitivity (Sn)} = \frac{\textrm{TP}}{(\textrm{TP} + \textrm{FN})}$$
  • Specificity (Sp, true negative rate, TNR): The proportion of negative examples that are correctly classified
$$\textrm{Specificity (Sp)} = \frac{\textrm{TN}}{(\textrm{FP} + \textrm{TN})}$$
  • False Positive Rate (FPR, $1 - \textrm{Sp}$): The proportion of negative examples incorrectly classed as positive.
$$\textrm{FPR} = \frac{\textrm{FP}}{(\textrm{FP} + \textrm{TN})}$$
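
As a concrete illustration, the cell below sketches these calculations in Python (the function name and example counts are ours, chosen so that the metrics match the fictional classifier used later in this notebook):

In [ ]:
# Compute Sn, Sp and FPR from the four confusion matrix counts
def classifier_metrics(tp, tn, fp, fn):
    """Return (sensitivity, specificity, FPR) for the given counts."""
    sensitivity = tp / (tp + fn)  # proportion of positives correctly classified
    specificity = tn / (fp + tn)  # proportion of negatives correctly classified
    fpr = fp / (fp + tn)          # equivalently, 1 - specificity
    return sensitivity, specificity, fpr

# Example: 95 of 100 positives and 99 of 100 negatives classified correctly
print(classifier_metrics(tp=95, tn=99, fp=1, fn=5))  # (0.95, 0.99, 0.01)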

If you do not have this information, you cannot interpret predictive results!

3. Classifier Performance

In our example, we assume that our functional classifier, which decides whether a protein sequence is likely to be an effector, has the following *performance metrics*:

  • Sensitivity: Sn = 0.95
  • False Positive Rate: FPR = 0.01

QUESTION: (2min)

Do you think these are good performance characteristics?

The classifier says your protein is an effector!

QUESTION: (2min)

What is the probability that your protein is really an effector: 0.01, 0.05, 0.50, 0.95, or 0.99?

4. Baseline Frequency

Unless you know the baseline occurrence of a class in the set of inputs, you cannot calculate the probability that the classifier is correct in any particular case.

Denoting the baseline occurrence, or baseline frequency (here, the proportion of all proteins that are effectors), as $f_{x}$:

$$f_{x} = 0.01 \implies P(\textrm{effector}|\textrm{+ve}) = 0.490 \approx 0.5$$

$$f_{x} = 0.8 \implies P(\textrm{effector}|\textrm{+ve}) = 0.997 \approx 1.0$$
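
To see where the first figure comes from, substitute the classifier's metrics (Sn = 0.95, FPR = 0.01) and $f_{x} = 0.01$ into Bayes' Theorem (derived in full in section 5):

$$P(\textrm{effector}|\textrm{+ve}) = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.01 \times 0.99} = \frac{0.0095}{0.0194} \approx 0.490$$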

If effectors are *rare*, a positive functional classification is more likely to be false.

QUESTION: (2min)

You run the classifier tool on 20,000 proteins from your favourite organism. You expect around 200 of these proteins to be effectors.

What is the approximate probability that any individual classification is correct?

BONUS: Approximately how many positive classifications would you expect?

5. Bayes' Theorem

The relatively low probability of any classification being correct, even when the classifier has excellent performance metrics, can be counterintuitive, but we can always understand the statistics using Bayes' Theorem.

With:

  • $P(\textrm{positive})$ = the expected proportion of positive examples (the baserate)
  • $P(\textrm{negative})$ = the expected proportion of negative examples
  • $P(+|\textrm{positive})$ = the probability the classifier calls positive, if the example is positive (TPR, Sn)
  • $P(+|\textrm{negative})$ = the probability the classifier calls positive, if the example is negative (FPR)

The probability that an example is positive, given that the classifier says it is positive, is $P(\textrm{positive}|+)$, and can be calculated as:

$$P(\textrm{positive}|+) = \frac{P(+|\textrm{positive})P(\textrm{positive})}{P(+|\textrm{positive})P(\textrm{positive}) + P(+|\textrm{negative})P(\textrm{negative})}$$
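
As a quick check, the cell below sketches this calculation in Python (a minimal implementation; the function name is ours) and reproduces the two figures quoted in section 4:

In [ ]:
# P(positive | classifier says positive), by Bayes' Theorem
def p_positive_given_plus(sn, fpr, baserate):
    """Probability that an example is truly positive, given a positive call."""
    return (sn * baserate) / (sn * baserate + fpr * (1 - baserate))

# Reproduce the section 4 figures for Sn = 0.95, FPR = 0.01
print(p_positive_given_plus(0.95, 0.01, 0.01))  # ~0.490
print(p_positive_given_plus(0.95, 0.01, 0.8))   # ~0.997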

We can visualise how the probability of a positive classification being correct ($P(\textrm{eff}|\textrm{pos})$) varies with baserate, using the Python code in the cell below:


In [ ]:
# Import Python libraries
%matplotlib inline
import seaborn as sns         # This produces pretty graphical output
import tools.classifier as tc # This lets us plot some classifier-specific visualisations

# Define sensitivity and FPR
sn = 0.90    # sensitivity
fpr = 0.05   # false positive rate

# Define baserate (frequency of positive examples)
baserate = 0.3

# Static plot of P(effector|+ve) as a function of baserate
tc.plot_prob_effector(sn, fpr, baserate);

In the plot above, we see the effector classifier's response curve (red line) as a function of baserate, assuming it has a 90% sensitivity, and a 5% false positive rate.

The black arrow marks a particular point on that curve: when the baserate of positives in the population is 30%, any positive classification has about an 89% probability of really being a positive example.

So long as at least about 20% of all proteins are effectors, positive predictions from the classifier are correct about 80% of the time or more.

As the baserate drops below about 5% (the break-even point is $\textrm{FPR}/(\textrm{Sn} + \textrm{FPR}) = 0.05/0.95 \approx 0.053$ for these settings), a positive prediction is more likely to be incorrect than correct.

QUESTION: (4min)

What is the probability that a positive result from our fictional classifier (Sn=0.95, FPR=0.01) is correct, with baserate 0.01?

What if the sensitivity falls to 90%, and the FPR increases to 5%?

In the cell below you can see code that renders an interactive version of the plot above, which allows you to vary the sensitivity, false positive rate, and baserate of positive examples using sliders. You can also zoom in to the left-hand region of the graph for clarity.

6. Real-World Example

In their 2009 paper, Arnold *et al.* describe `EffectiveT3`, a tool to predict bacterial Type III effector proteins. The tool is reported to have a sensitivity of 71% and a selectivity of 85% (implying an FPR of 15%).

In their paper, the authors identify hundreds of type III effectors in genomes that possess no annotated type III secretion system (over 10% of the complete protein complement, in some cases). They note:

"The surprisingly high number of (false) positives in genomes without TTSS exceeds the expected false positive rate (Table 1) and thus raised questions about their nature."

QUESTION: (4min)

What is the expected probability that a positive prediction from `EffectiveT3` is really a type III effector, given that *Pseudomonas syringae* possesses fewer than 100 effectors in a 5000-gene genome (a baserate of less than about 2%)?

Why do you think the authors saw so many likely false positives?

How might you improve the probability that a positive prediction/classification is a real positive example, when predicting/classifying all proteins in a genome?


In [ ]:
# Import Python libraries
from ipywidgets import interact, FloatSlider  # for interactive widgets

# Define sensitivity and FPR
sn = 0.90    # sensitivity
fpr = 0.05   # false positive rate

# Define baserate (frequency of positive examples)
baserate = 0.3

# Create an interactive plot with sliders for Sn, FPR, baserate and x-axis range
interact(tc.plot_prob_effector,
         sens=FloatSlider(min=0.01, max=0.99, step=0.01, value=sn), 
         fpr=FloatSlider(min=0.01, max=0.99, step=0.01, value=fpr),
         baserate=FloatSlider(min=0.01, max=0.99, step=0.01, value=baserate),
         xmax=FloatSlider(min=0.1, max=1, step=0.1, value=1));

7. Comments

1. Predictions/Classifiers Identify Groups, Not Individuals

Predictors and classifiers identify groups of positive/negative examples, not individual members of the group. For example, if a test for smugglers at an airport has $P(\textrm{smuggler}|+) = 0.9$ and 100 potential smugglers are identified, how do we tell which 10 of those 100 people are not really smugglers? We always need more evidence to distinguish between members of the predicted group.

2. Stratification Can Improve Classifier Performance

If there is a set of criteria that an example must meet in order to be a member of a class, then excluding all examples that do not meet those criteria reduces the scope for false positives and raises the baserate, increasing the probability that a positive classification implies a positive example.
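
As a hypothetical worked example (all numbers here are invented for illustration): suppose effectors make up 200 of 20,000 proteins (baserate 1%), every effector carries a secretion signal, and only a quarter of the remaining proteins do. Classifying only the proteins that pass this filter raises the baserate, and with it the probability that a positive call is correct:

In [ ]:
# Illustration: pre-filtering on a necessary criterion raises the baserate
# (p_positive_given_plus is repeated from the earlier cell so this cell is
# self-contained; all counts are invented for illustration)
def p_positive_given_plus(sn, fpr, baserate):
    return (sn * baserate) / (sn * baserate + fpr * (1 - baserate))

sn, fpr = 0.95, 0.01
print(p_positive_given_plus(sn, fpr, 200 / 20000))         # whole genome: ~0.49
print(p_positive_given_plus(sn, fpr, 200 / (200 + 4950)))  # filtered set: ~0.79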