On the normalization of genomic gene count data to gene length

by Luke Thompson - 2016/03/21

A common method in metagenomics and transcriptomics is normalization of gene count data by gene size.

In RPK, read counts (R) are divided directly by gene size (per kbp, or PK). In RPKM, they are further normalized by dividing by total counts and multiplying by one million (M).

RPK and RPKM normalizations assume a purely linear relationship of sequencing probability to gene size. It has occurred to me that in cases of small genes and/or long reads, this purely linear relationship does not hold up.

Given: A single bacterial genome containing many different sized genes is sampled randomly, generating a set of equally sized sequence reads.

The probability, $P$, of counting a gene by random, assuming completely random DNA fragmentation and sequencing, is given by

$P = \frac{g + 2 (r-o)}{d}$

where $g$ = gene length, $r$ = read length, $o$ = overlap required for match, and $d$ = genome or database size.

The "flap term" is given by $2 (r-o)$.

As gene size increases relative to the flap term, $P$ approaches $g/d$, which is the typical RPKM normalization.

However, when gene size is similar to the flap term, the flap term is significant and increases the probability of counting the gene.

Therefore, in circumstances where gene size is comparable to read length, with read length greater than ~200 bp, smaller genes will be observed more often than estimated by the $P = g/d$ formula.

Admittedly, given current short read technology, most genes are not short enough to make much difference. But it still is a small difference for all genes, and a big difference for the small ones. Further, as read lengths increase, the effect of the flap term will increase. Illumina is always increasing read lengths. We need to take the flap term into account!

When we do RPK normalization, we divide counts by gene length. That is, our correction factor is

$C = \frac{1}{g}$

whereas it really should be

$C = \frac{1}{g + 2(r-o)}$

Bokeh slider graph


In [ ]:
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show, output_notebook

import numpy as np

In [ ]:
output_notebook()

In [ ]:
read = 100
overlap = 100
gene = np.arange(overlap, 5001, dtype=float)
prob = gene + 2*(read - overlap)

In [ ]:
source = ColumnDataSource(data=dict(x=gene, y=prob, gene=gene, prob=prob))

p = figure(title="simple line example", plot_height=300, plot_width=600)
p.line(gene, prob, color="#2222aa", line_width=3, source=source, name="foo")

In [ ]:
def update(read=100, overlap=100):
    source.data['y'] = gene + 2*(read - overlap)
    source.push_notebook()

In [ ]:
show(p)

In [ ]:
from IPython.html.widgets import interact
interact(update, read=(0,1000), overlap=(0,200))

In [ ]: