Normalisation

Introduction

In the previous section, we looked at estimating transcript abundance with Kallisto. The abundances are reported as transcripts per million (TPM), but what does TPM mean and how is it calculated?

The objectives of this part of the tutorial are:

understand why RNA-Seq normalisation metrics are used
understand the difference between RPKM, FPKM and TPM
calculate RPKM and TPM for a gene of interest

There are many useful websites, publications and blog posts which go into much more detail about RNA-Seq normalisation methods. Here are just a couple (in no particular order):

Why do we use normalisation units instead of raw counts?

Raw reads counts are the number of reads originating from each transcript which can be affected by several factors:

sequencing depth (total number of reads)
The more we sequence a sample, the more reads we expect to be assigned.

gene/transcript length
The longer the gene or transcript, the more reads we expect to be assigned to it.

Figure 4. Effect of sequencing depth and gene length on raw read counts

Look at the top part of Figure 4. In which sample, X or Y, is the gene more highly expressed?

Neither, it's the same in both. What we didn't tell you was that the total number of reads generated for sample A was twice the number than for sample B. That meant almost twice the number of reads are assigned to the same gene in sample A than in sample B.

Look at the bottom part of Figure 4. Which gene, X or Y, has the greatest gene level expression?

Neither, they are both expressed at the same level. This time we didn't tell you that gene X is twice the length of gene Y. This meant that almost twice the number reads were assigned to gene X than gene Y.

In the top part of Figure 4, the gene in sample X has twice the number of reads assigned to it than the same gene in sample Y. What isn't shown is that sample X had twice the number or total reads than sample Y so we would expect more reads to be assigned in sample X. Thus, the gene is expressed at roughly the same level in both samples. In the bottom part of Figure 4, gene X has twice the number of reads assigned to it than gene Y. However, gene X is twice the length of gene Y and so we expect more reads to be assigned to gene X. Again, the expression level is roughly the same.

Reads per kilobase per million (RPKM)

Reads per kilobase (of exon) per million (reads mapped) or RPKM is a within sample normalisation method which takes into account sequencing depth and length biases.

To calculate RPKM, you first normalise by sequencing depth and then by gene/transcript length.

Get your per million scaling factor
Count up the total number of reads which have been assigned (mapped) in the sample. Divide this number by 1,000,000 (1 million) to get your per million scaling factor (N).
Normalise for sequencing depth
Divide the number of reads which have been assigned to the gene or transcript (C) by the per million scaling factor you calculated in step 1. This will give you your reads per million (RPM).
Get your per kilobase scaling factor
Divide the total length of the exons in your transcript or gene in base pairs by 1,000 (1 thousand) to get your per kilobase scaling factor (L).
Normalise for length
Divide your RPM value from step 2 by your per kilobase scaling factor (length of the gene/transcript in kilobases) from step 3. This will give you your reads per kilobase per million or RPKM.

This can be simplified into the following equation:

\begin{equation*} RPKM = \frac {C}{LN} \end{equation*}

Where:

C is number of reads mapped to the transcript or gene
L is the total exon length of the transcript or gene in kilobases
N is the total number of reads mapped in millions

Fragments per kilobase per million (FPKM)

Fragments per kilobase per million or FPKM is essentially the same as RPKM except that:

RPKM is designed for single-end RNA-Seq experiments
FPKM is designed for paired-end RNA-Seq experiments

In a paired-end RNA-Seq experiment, two reads may be assigned to a single fragment (in any orientation). Also, in some cases, only one of those reads will be assigned to a fragment (singleton). The only difference between RPKM and FPKM is that FPKM takes into consideration that two reads may be assigned to the same fragment.

Transcripts per million (TPM)

Calculating the transcripts per million or TPM is a similar process to RPKM and FPKM. The main difference is that you will first normalise for length bias and then for sequencing depth bias. In a nut shell, we are swapping the order of normalisations.

Get your per kilobase scaling factor
Divide the total length of the exons in your transcript in base pairs by 1,000 (1 thousand) to get your per kilobase scaling factor.
Normalise for length
Divide the number of reads which have been assigned to the transcript by the per kilobase scaling factor you calculated in step 1. This will give you your reads per kilobase (RPK).
Get the sum of all RPK values in your sample
Calculate the RPK value for all of the transcripts in your sample. Add all of these together to get your total RPK value.
Get your per million scaling factor
Divide your total RPK value from step 3 by 1,000,000 (1 million) to get your per million scaling factor.
Normalise for sequencing depth
Divide your RPK value calculated in step 2 by the per million scaling factor from step 4. You now have your transcripts per millions value or TPM.

Calculating RPKM and TPM values

To try and answer this, let's look at a worked example. Here, we have three genes (A-C) and three biological replicates (1-3).

Gene	Length	Replicate 1	Replicate 2	Replicate 3
A	2,000 bases	10	12	30
B	4,000 bases	20	25	60
C	1,000 bases	5	8	15

There are two things to notice in our dataset:

Gene B has twice number reads mapped than gene A, possibly as it’s twice the length

Replicate 3 has more reads mapped than any of the other replicates, regardless of which gene we look at

Calculating RPKM

Step 1: get your per million scaling factor

In the table below is the total number of reads which mapped for each of the replicates. To get our per million scaling factor, we divide each of these values by 1,000,000 (1 million).

Gene	Replicate 1	Replicate 2	Replicate 3
Total reads mapped	3,500,000	4,500,000	10,600,000
Per million reads	3.5	4.5	10.6

Step 2: normalise for sequencing depth

We now divide our read counts by the per million scaling factor to get our reads per million (RPM).

Before:

Gene	Replicate 1	Replicate 2	Replicate 3
A	10	12	30
B	2	25	60
C	5	8	15

After:

Gene	Replicate 1 RPM	Replicate 2 RPM	Replicate 3 RPM
A	2.857	2.667	2.830
B	5.714	5.556	5.660
C	1.429	1.778	1.415

Step 3: get your per kilobase scaling factor

Here we have our gene length in base pairs. For our per kilobase scaling factor we need to get our gene length in kilobases by dividing it by 1,000.

Gene	Length (base pairs)	Length (kilobases)
A	2,000	2
B	4,000	4
C	1,000	1

Step 4: normalise for length

Finally, we divide our RPM values from step 2 by our per kilobase scaling factor from step 3 to get our reads per kilobase per million (RPKM).

Before:

Gene	Replicate 1 RPM	Replicate 2 RPM	Replicate 3 RPM
A	2.857	2.667	2.830
B	5.714	5.556	5.660
C	1.429	1.778	1.415

After:

Gene	Replicate 1 RPKM	Replicate 2 RPKM	Replicate 3 RPKM
A	1.43	1.33	1.42
B	1.43	1.39	1.42
C	1.43	1.78	1.42

Notice that even though replicate 3 had more reads assigned than the other samples and a greater sequencing depth, its RPKM is quite similar. And, that although gene B had twice the number of reads assigned than gene A, its RPKM is the same. This is because we have normalised by both length and sequencing depth.

Calculating TPM

Now we're going to calculate the TPM values for the same example data. As a reminder, here are our three genes (A-C) and three biological replicates (1-3).

Gene	Length	Replicate 1	Replicate 2	Replicate 3
A	2,000 bases	10	12	30
B	4,000 bases	20	25	60
C	1,000 bases	5	8	15

Step 1: get your per kilobase scaling factor

Again, our gene lengths are in base pairs. For our per kilobase scaling factor we need to get our gene length in kilobases by dividing it by 1,000.

Gene	Length (base pairs)	Length (kilobases)
A	2,000	2
B	4,000	4
C	1,000	1

Step 2: normalise for length

Now we divide the number of reads which have been assigned to each gene by the per kilobase scaling factor we just calculated. This will give us our reads per kilobase (RPK).

Before:

Gene	Replicate 1	Replicate 2	Replicate 3
A	10	12	30
B	2	25	60
C	5	8	15

After:

Gene	Replicate 1	Replicate 2	Replicate 3
A	5	6	15
B	5	6.25	15
C	5	8	15

Step 3: get the sum of all RPK values in your sample

Next, we sum the RPK values for each of our replices. This will give use our total RPK value for each replicate. To make this example scalable, we assume there are other genes so the total RPK is made up.

Gene	Replicate 1	Replicate 2	Replicate 3
A	5	6	15
B	5	6.25	15
C	5	8	15
...	...	...	...
Total RPK	150,000	202,500	352,500

Step 4: get your per million scaling factor

Here, instead of dividing our total mapped reads by 1,000,000 (1 million) to get our per million scaling factor, we divide our total RPK values by 1,000,000 (1 million).

Gene	Replicate 1	Replicate 2	Replicate 3
Total RPK	150,000	202,500	352,500
Per million RPK	0.1500	0.2025	0.3525

Step 5: normalise for sequencing depth

Finally, we divide our individual RPK values from step 2 by the per million scaling factor in step 4 to give us our TPM values.

Before:

Gene	Replicate 1	Replicate 2	Replicate 3
A	5	6	15
B	5	6.25	15
C	5	8	15

After:

Gene	Replicate 1	Replicate 2	Replicate 3
A	33.33	29.63	31.21
B	33.33	30.86	31.91
C	33.33	39.51	36.88

Which normalisation unit should I use?

Well, there's a lot of debate around this, so let's look at our total normalised values for each replicate.

RPKM

Gene	Replicate 1 RPKM	Replicate 2 RPKM	Replicate 3 RPKM
A	1.43	1.33	1.42
B	1.43	1.39	1.42
C	1.43	1.78	1.42
Total RPKM	4.29	4.50	4.25

TPM

Gene	Replicate 1	Replicate 2	Replicate 3
A	33.33	29.63	31.21
B	33.33	30.86	31.91
C	33.33	39.51	36.88
Total TPM	100	100	100

Notice that that total TPM value for each of the replicates is the same. This is not true for RPKM and FPKM where the total values differ. With TPM, having the same total value for each replicate makes it easier to compare the proportion of reads mapping to each gene across replicates (although you shouldn't really compare across experiments). With RPKM and FPKM, the differing total values make it much harder to compare replicates.

Questions

Below is the information for each of the five samples. You will need this information to answer the questions. We have put all of commands used to get this information in the answers.

Sample	Total mapped reads	Transcript length	Assigned reads	Total RPK
MT1	2,353,750	3,697	2,541	293,431
MT2	2,292,271	3,709	3,392	675,190
SBP1	2,329,235	3,699	14,605	1,719,970
SBP2	2,187,718	3,696	17,302	1,429,540
SBP3	2,163,979	3,699	14,646	1,561,310

_Note: values have been rounded up to integers to make calculations easier. Assigned reads are the est_count from Kallisto for PCHAS_1402500. Transcript lengths are the est_length from Kallisto for PCHAS1402500.

Q1: Using the abundance.tsv files generated by Kallisto and the information above, calculate the RPKM for PCHAS_1402500 in each of our five samples.

Sample	Per million scaling factor	RPM	Per kilobase scaling factor	RPKM
MT1
MT2
SBP1
SBP2
SBP3

Q2: Using the abundance.tsv files generated by Kallisto and the information above, calculate the TPM for PCHAS_1402500 in each of our five samples.
Hint: don't forget to get your per million scaling factor.

Sample	Per kilobase scaling factor	Reads per kilobase (RPK)	TPM
MT1
MT2
SBP1
SBP2
SBP3

Q3: Do these match the TPM values from Kallisto?
Hint: look at the abundance.tsv files for each of your samples.

Q4: Do you think PCHAS_1402500 is differentially expressed between the MT and SBP samples?

What's next?

You can head back to transcript quantification with Kallisto or continue on to Identifying differentially expressed genes with Sleuth .