In the previous section, we looked at estimating transcript abundance with Kallisto. The abundances are reported as transcripts per million (TPM), but what does TPM mean and how is it calculated?
The objectives of this part of the tutorial are:
There are many useful websites, publications and blog posts which go into much more detail about RNA-Seq normalisation methods. Here are just a couple (in no particular order):
Raw reads counts are the number of reads originating from each transcript which can be affected by several factors:
Look at the top part of Figure 4. In which sample, X or Y, is the gene more highly expressed?
Neither, it's the same in both. What we didn't tell you was that the total number of reads generated for sample A was twice the number than for sample B. That meant almost twice the number of reads are assigned to the same gene in sample A than in sample B.
Look at the bottom part of Figure 4. Which gene, X or Y, has the greatest gene level expression?
Neither, they are both expressed at the same level. This time we didn't tell you that gene X is twice the length of gene Y. This meant that almost twice the number reads were assigned to gene X than gene Y.
In the top part of Figure 4, the gene in sample X has twice the number of reads assigned to it than the same gene in sample Y. What isn't shown is that sample X had twice the number or total reads than sample Y so we would expect more reads to be assigned in sample X. Thus, the gene is expressed at roughly the same level in both samples. In the bottom part of Figure 4, gene X has twice the number of reads assigned to it than gene Y. However, gene X is twice the length of gene Y and so we expect more reads to be assigned to gene X. Again, the expression level is roughly the same.
Reads per kilobase (of exon) per million (reads mapped) or RPKM is a within sample normalisation method which takes into account sequencing depth and length biases.
To calculate RPKM, you first normalise by sequencing depth and then by gene/transcript length.
Get your per million scaling factor
Count up the total number of reads which have been assigned (mapped) in the sample. Divide this number by 1,000,000 (1 million) to get your per million scaling factor (N).
Normalise for sequencing depth
Divide the number of reads which have been assigned to the gene or transcript (C) by the per million scaling factor you calculated in step 1. This will give you your reads per million (RPM).
Get your per kilobase scaling factor
Divide the total length of the exons in your transcript or gene in base pairs by 1,000 (1 thousand) to get your per kilobase scaling factor (L).
Normalise for length
Divide your RPM value from step 2 by your per kilobase scaling factor (length of the gene/transcript in kilobases) from step 3. This will give you your reads per kilobase per million or RPKM.
This can be simplified into the following equation:
\begin{equation*} RPKM = \frac {C}{LN} \end{equation*}Where:
Fragments per kilobase per million or FPKM is essentially the same as RPKM except that:
RPKM is designed for single-end RNA-Seq experiments
FPKM is designed for paired-end RNA-Seq experiments
In a paired-end RNA-Seq experiment, two reads may be assigned to a single fragment (in any orientation). Also, in some cases, only one of those reads will be assigned to a fragment (singleton). The only difference between RPKM and FPKM is that FPKM takes into consideration that two reads may be assigned to the same fragment.
Calculating the transcripts per million or TPM is a similar process to RPKM and FPKM. The main difference is that you will first normalise for length bias and then for sequencing depth bias. In a nut shell, we are swapping the order of normalisations.
Get your per kilobase scaling factor
Divide the total length of the exons in your transcript in base pairs by 1,000 (1 thousand) to get your per kilobase scaling factor.
Normalise for length
Divide the number of reads which have been assigned to the transcript by the per kilobase scaling factor you calculated in step 1. This will give you your reads per kilobase (RPK).
Get the sum of all RPK values in your sample
Calculate the RPK value for all of the transcripts in your sample. Add all of these together to get your total RPK value.
Get your per million scaling factor
Divide your total RPK value from step 3 by 1,000,000 (1 million) to get your per million scaling factor.
Normalise for sequencing depth
Divide your RPK value calculated in step 2 by the per million scaling factor from step 4. You now have your transcripts per millions value or TPM.
To try and answer this, let's look at a worked example. Here, we have three genes (A-C) and three biological replicates (1-3).
Gene | Length | Replicate 1 | Replicate 2 | Replicate 3 |
---|---|---|---|---|
A | 2,000 bases | 10 | 12 | 30 |
B | 4,000 bases | 20 | 25 | 60 |
C | 1,000 bases | 5 | 8 | 15 |
There are two things to notice in our dataset:
In the table below is the total number of reads which mapped for each of the replicates. To get our per million scaling factor, we divide each of these values by 1,000,000 (1 million).
Gene | Replicate 1 | Replicate 2 | Replicate 3 |
---|---|---|---|
Total reads mapped | 3,500,000 | 4,500,000 | 10,600,000 |
Per million reads | 3.5 | 4.5 | 10.6 |
We now divide our read counts by the per million scaling factor to get our reads per million (RPM).
Before:
Gene | Replicate 1 | Replicate 2 | Replicate 3 |
---|---|---|---|
A | 10 | 12 | 30 |
B | 2 | 25 | 60 |
C | 5 | 8 | 15 |
After:
Gene | Replicate 1 RPM | Replicate 2 RPM | Replicate 3 RPM |
---|---|---|---|
A | 2.857 | 2.667 | 2.830 |
B | 5.714 | 5.556 | 5.660 |
C | 1.429 | 1.778 | 1.415 |
Here we have our gene length in base pairs. For our per kilobase scaling factor we need to get our gene length in kilobases by dividing it by 1,000.
Gene | Length (base pairs) | Length (kilobases) |
---|---|---|
A | 2,000 | 2 |
B | 4,000 | 4 |
C | 1,000 | 1 |
Finally, we divide our RPM values from step 2 by our per kilobase scaling factor from step 3 to get our reads per kilobase per million (RPKM).
Before:
Gene | Replicate 1 RPM | Replicate 2 RPM | Replicate 3 RPM |
---|---|---|---|
A | 2.857 | 2.667 | 2.830 |
B | 5.714 | 5.556 | 5.660 |
C | 1.429 | 1.778 | 1.415 |
After:
Gene | Replicate 1 RPKM | Replicate 2 RPKM | Replicate 3 RPKM |
---|---|---|---|
A | 1.43 | 1.33 | 1.42 |
B | 1.43 | 1.39 | 1.42 |
C | 1.43 | 1.78 | 1.42 |
Notice that even though replicate 3 had more reads assigned than the other samples and a greater sequencing depth, its RPKM is quite similar. And, that although gene B had twice the number of reads assigned than gene A, its RPKM is the same. This is because we have normalised by both length and sequencing depth.
Now we're going to calculate the TPM values for the same example data. As a reminder, here are our three genes (A-C) and three biological replicates (1-3).
Gene | Length | Replicate 1 | Replicate 2 | Replicate 3 |
---|---|---|---|---|
A | 2,000 bases | 10 | 12 | 30 |
B | 4,000 bases | 20 | 25 | 60 |
C | 1,000 bases | 5 | 8 | 15 |
Again, our gene lengths are in base pairs. For our per kilobase scaling factor we need to get our gene length in kilobases by dividing it by 1,000.
Gene | Length (base pairs) | Length (kilobases) |
---|---|---|
A | 2,000 | 2 |
B | 4,000 | 4 |
C | 1,000 | 1 |
Now we divide the number of reads which have been assigned to each gene by the per kilobase scaling factor we just calculated. This will give us our reads per kilobase (RPK).
Before:
Gene | Replicate 1 | Replicate 2 | Replicate 3 |
---|---|---|---|
A | 10 | 12 | 30 |
B | 2 | 25 | 60 |
C | 5 | 8 | 15 |
After:
Gene | Replicate 1 | Replicate 2 | Replicate 3 |
---|---|---|---|
A | 5 | 6 | 15 |
B | 5 | 6.25 | 15 |
C | 5 | 8 | 15 |
Next, we sum the RPK values for each of our replices. This will give use our total RPK value for each replicate. To make this example scalable, we assume there are other genes so the total RPK is made up.
Gene | Replicate 1 | Replicate 2 | Replicate 3 |
---|---|---|---|
A | 5 | 6 | 15 |
B | 5 | 6.25 | 15 |
C | 5 | 8 | 15 |
... | ... | ... | ... |
Total RPK | 150,000 | 202,500 | 352,500 |
Here, instead of dividing our total mapped reads by 1,000,000 (1 million) to get our per million scaling factor, we divide our total RPK values by 1,000,000 (1 million).
Gene | Replicate 1 | Replicate 2 | Replicate 3 |
---|---|---|---|
Total RPK | 150,000 | 202,500 | 352,500 |
Per million RPK | 0.1500 | 0.2025 | 0.3525 |
Finally, we divide our individual RPK values from step 2 by the per million scaling factor in step 4 to give us our TPM values.
Before:
Gene | Replicate 1 | Replicate 2 | Replicate 3 |
---|---|---|---|
A | 5 | 6 | 15 |
B | 5 | 6.25 | 15 |
C | 5 | 8 | 15 |
After:
Gene | Replicate 1 | Replicate 2 | Replicate 3 |
---|---|---|---|
A | 33.33 | 29.63 | 31.21 |
B | 33.33 | 30.86 | 31.91 |
C | 33.33 | 39.51 | 36.88 |
Well, there's a lot of debate around this, so let's look at our total normalised values for each replicate.
Gene | Replicate 1 RPKM | Replicate 2 RPKM | Replicate 3 RPKM |
---|---|---|---|
A | 1.43 | 1.33 | 1.42 |
B | 1.43 | 1.39 | 1.42 |
C | 1.43 | 1.78 | 1.42 |
Total RPKM | 4.29 | 4.50 | 4.25 |
Gene | Replicate 1 | Replicate 2 | Replicate 3 |
---|---|---|---|
A | 33.33 | 29.63 | 31.21 |
B | 33.33 | 30.86 | 31.91 |
C | 33.33 | 39.51 | 36.88 |
Total TPM | 100 | 100 | 100 |
Notice that that total TPM value for each of the replicates is the same. This is not true for RPKM and FPKM where the total values differ. With TPM, having the same total value for each replicate makes it easier to compare the proportion of reads mapping to each gene across replicates (although you shouldn't really compare across experiments). With RPKM and FPKM, the differing total values make it much harder to compare replicates.
Below is the information for each of the five samples. You will need this information to answer the questions. We have put all of commands used to get this information in the answers.
Sample | Total mapped reads | Transcript length | Assigned reads | Total RPK |
---|---|---|---|---|
MT1 | 2,353,750 | 3,697 | 2,541 | 293,431 |
MT2 | 2,292,271 | 3,709 | 3,392 | 675,190 |
SBP1 | 2,329,235 | 3,699 | 14,605 | 1,719,970 |
SBP2 | 2,187,718 | 3,696 | 17,302 | 1,429,540 |
SBP3 | 2,163,979 | 3,699 | 14,646 | 1,561,310 |
_Note: values have been rounded up to integers to make calculations easier. Assigned reads are the est_count from Kallisto for PCHAS_1402500. Transcript lengths are the est_length from Kallisto for PCHAS1402500.
Q1: Using the abundance.tsv
files generated by Kallisto and the information above, calculate the RPKM for PCHAS_1402500 in each of our five samples.
Sample | Per million scaling factor | RPM | Per kilobase scaling factor | RPKM |
---|---|---|---|---|
MT1 | ||||
MT2 | ||||
SBP1 | ||||
SBP2 | ||||
SBP3 |
Q2: Using the abundance.tsv
files generated by Kallisto and the information above, calculate the TPM for PCHAS_1402500 in each of our five samples.
Hint: don't forget to get your per million scaling factor.
Sample | Per kilobase scaling factor | Reads per kilobase (RPK) | TPM |
---|---|---|---|
MT1 | |||
MT2 | |||
SBP1 | |||
SBP2 | |||
SBP3 |
Q3: Do these match the TPM values from Kallisto?
Hint: look at the abundance.tsv
files for each of your samples.
Q4: Do you think PCHAS_1402500 is differentially expressed between the MT and SBP samples?
You can head back to transcript quantification with Kallisto or continue on to Identifying differentially expressed genes with Sleuth .