The main objective of this part of the tutorial is to use simple Unix commands to get a list of significantly differentially expressed genes. Using this gene list and the quantitative information from our analysis we can then start to make biological inferences about our dataset.
Using the R script (sleuth.R
), we printed out a file of results describing the differentially expressed genes in our dataset. This file is called kallisto.results
.
The file contains several columns, of which the most important are:
With a little Linux magic we can get the list of differentially expressed genes with only the columns of interest as above.
Make sure you are in the data
directory with the tutorial files.
In [ ]:
cd data
To get the genes which are most highly expressed in our SBP samples, we must first filter our results. There are two columns we want to filter our data on: b (column 5) and qval (column 4). These columns represent whether the gene is differentially expressed and whether that change is significant.
The following command will get those genes which have an adjusted p value (qval) less than 0.01 and a positive fold change. These genes are more highly expressed in the SBP samples.
In [ ]:
awk -F "\t" '$4 < 0.01 && $5 > 0' kallisto.results | cut -f1,2,3,4,5 | head
We used awk
to filter the gene list and print only the lines which met our search criteria (qval > 0.01, b > 0). The option -F
tells awk what delimiter is used to separate the columns. In this case, it was a tab or its regular expression "\t". We then use cut to only print out columns 1-5. You can also do that within the awk
command. Finally, we use head
to get the first 10 lines of the output.
Alternatively, we can look for the genes which are more highly expressed in the MT samples.
In [ ]:
awk -F "\t" '$4 < 0.01 && $5 < 0' kallisto.results | cut -f1,2,3,4,5 | head
It can be useful to have a quick look and compare gene lists. For example, whether a certain gene product is seen more often in the genes most highly expressed in one condition or another. A quick and dirty method would be to use the gene descriptions (or gene products).
You could extract the gene products (column 2) for genes which are more highly expressed in the SBP samples using sort
and then uniq
.
In [ ]:
awk -F "\t" '$4 < 0.01 && $5 < 0 {print $2}' kallisto.results | sort | uniq
We can count each time these unique gene products occur in the list using uniq -c
.
In [ ]:
awk -F "\t" '$4 < 0.01 && $5 < 0 {print $2}' kallisto.results | \
sort | uniq -c
And, if we wanted to make it a bit easier to see commonly found gene products we can sort this again by the frequency count we got from the uniq
command. The sort
command will put these in ascending numerical (-n) order.
In [ ]:
awk -F "\t" '$4 < 0.01 && $5 < 0 {print $2}' kallisto.results | \
sort | uniq -c | sort -n
If you wanted to look for the frequency of a particular gene product you could also use grep
.
In [ ]:
awk -F "\t" '$4 < 0.01 && $5 < 0 {print $2}' kallisto.results | grep -c CIR
Or building on the earlier command:
In [ ]:
awk -F "\t" '$4 < 0.01 && $5 < 0 {print $2}' kallisto.results | \
sort | uniq -c | grep CIR
If you want to read more about this work related to this data it is published:
Vector transmission regulates immune control of Plasmodium virulence
Philip J. Spence, William Jarra, Prisca Lévy, Adam J. Reid, Lia Chappell, Thibaut Brugat, Mandy Sanders, Matthew Berriman and Jean Langhorne
Nature. 2013 Jun 13; 498(7453): 228–231 doi:10.1038/nature12231
Hint: try replacing head
in the earlier command with another unix command to count the number of lines
Hint: try replacing head
in the earlier command with another unix command to count the number of lines
Hint: look for gene products that are seen more often in genes more highly expressed in the SBP samples than those more highly expressed in the MT samples
You can head back to identifying differentially expressed genes with Sleuth or continue on to key aspects of differential expression analysis.