Interpreting the results

Introduction

The main objective of this part of the tutorial is to use simple Unix commands to get a list of significantly differentially expressed genes. Using this gene list and the quantitative information from our analysis we can then start to make biological inferences about our dataset.

Using the R script (sleuth.R), we printed out a file of results describing the differentially expressed genes in our dataset. This file is called kallisto.results.

The file contains several columns, of which the most important are:

  • Column 1: target_id (gene id)
  • Column 2: description (some more useful description of the gene than its id)
  • Column 3: pval (p value)
  • Column 4: qval (p value corrected for multiple hypothesis testing)
  • Column 5: b (fold change)

With a little Linux magic we can get the list of differentially expressed genes with only the columns of interest as above.


 Exercise 6

Make sure you are in the data directory with the tutorial files.


In [ ]:
cd data

To get the genes which are most highly expressed in our SBP samples, we must first filter our results. There are two columns we want to filter our data on: b (column 5) and qval (column 4). These columns represent whether the gene is differentially expressed and whether that change is significant.

The following command will get those genes which have an adjusted p value (qval) less than 0.01 and a positive fold change. These genes are more highly expressed in the SBP samples.


In [ ]:
awk -F "\t" '$4 < 0.01 && $5 > 0' kallisto.results | cut -f1,2,3,4,5 | head

We used awk to filter the gene list and print only the lines which met our search criteria (qval > 0.01, b > 0). The option -F tells awk what delimiter is used to separate the columns. In this case, it was a tab or its regular expression "\t". We then use cut to only print out columns 1-5. You can also do that within the awk command. Finally, we use head to get the first 10 lines of the output.

Alternatively, we can look for the genes which are more highly expressed in the MT samples.


In [ ]:
awk -F "\t" '$4 < 0.01 && $5 < 0' kallisto.results | cut -f1,2,3,4,5 | head

It can be useful to have a quick look and compare gene lists. For example, whether a certain gene product is seen more often in the genes most highly expressed in one condition or another. A quick and dirty method would be to use the gene descriptions (or gene products).

You could extract the gene products (column 2) for genes which are more highly expressed in the SBP samples using sort and then uniq.


In [ ]:
awk -F "\t" '$4 < 0.01 && $5 < 0 {print $2}' kallisto.results | sort | uniq

We can count each time these unique gene products occur in the list using uniq -c.


In [ ]:
awk -F "\t" '$4 < 0.01 && $5 < 0 {print $2}' kallisto.results | \
sort | uniq -c

And, if we wanted to make it a bit easier to see commonly found gene products we can sort this again by the frequency count we got from the uniq command. The sort command will put these in ascending numerical (-n) order.


In [ ]:
awk -F "\t" '$4 < 0.01 && $5 < 0 {print $2}' kallisto.results | \
sort | uniq -c | sort -n

If you wanted to look for the frequency of a particular gene product you could also use grep.


In [ ]:
awk -F "\t" '$4 < 0.01 && $5 < 0 {print $2}' kallisto.results | grep -c CIR

Or building on the earlier command:


In [ ]:
awk -F "\t" '$4 < 0.01 && $5 < 0 {print $2}' kallisto.results | \
sort | uniq -c | grep CIR

If you want to read more about this work related to this data it is published:

Vector transmission regulates immune control of Plasmodium virulence
Philip J. Spence, William Jarra, Prisca Lévy, Adam J. Reid, Lia Chappell, Thibaut Brugat, Mandy Sanders, Matthew Berriman and Jean Langhorne
Nature. 2013 Jun 13; 498(7453): 228–231 doi:10.1038/nature12231


Questions

Q1: How many genes are more highly expressed in the SBP samples?

Hint: try replacing head in the earlier command with another unix command to count the number of lines

Q2: How many genes are more highly expressed in the MT samples?

Hint: try replacing head in the earlier command with another unix command to count the number of lines

Q3: Do you notice any particular genes that came up in the analysis?

Hint: look for gene products that are seen more often in genes more highly expressed in the SBP samples than those more highly expressed in the MT samples