In this part of the tutorial we will look at preparing input data for use with DEAGO. The objectives of this part of the tutorial are:
We will be using three types of input data in this tutorial. Our gene counts and sample/condition mapping are required:
We can also use an annotation file which gives us details about the genes such as their gene names and the gene ontology (GO) terms they are associated with. An annotation file is optional for quality control (QC) and differential expression (DE) analyses, but required for a GO term enrichment analysis.
For more information on the input datatypes DEAGO uses and their formats see the user guide.
The data
directory contains all of our input files. Let's go there.
In [ ]:
cd data
In [ ]:
ls counts
You should see a list of 32 count files were generated using featureCounts, one file per sample. Each file contains gene counts i.e. the number or reads assigned to each gene or our gene expression profiles.
Let's take a look inside one of the count files:
In [ ]:
head counts/8380_3#1.390176.pe.markdup.bam.featurecounts.csv
There are several things to notice about the featureCounts file format:
The reason we're highlighting this is because DEAGO has an option called count_type
which we will be using in our analyses. By default, DEAGO is designed to take the expression counts which are the output from the Sanger pathogens RNA-Seq Expression pipeline. These have a different format from the featurecounts files. So, we'll be adding --count_type featurecounts
to our commands which will tell DEAGO to expect count files which were generated by featureCounts and have the above characteristics.
You can also use raw count data as input as long as you have one file per sample. There are several options you can use which will configure DEAGO to be able to use them:
In [ ]:
head targets.txt
In our targets file, each row corresponds to a sample. There are three columns which DEAGO always expects to find in our tab-delimited targets file:
The filename is the name of the file containing the count data for the sample. We don't need to put the directory name in the filename (e.g. count/
The condition is a short label which will be used to identify the condition. The condition label should be the same for all of the samples which share that condition. Here, our condition is a combination (e.g. WT_Ctrl) of the cell type (e.g. WT) and the treatment (e.g. Ctrl).
This dataset has 4 biological replicates and 2 technical replicates for each condition represented in the replicate column. For example, replicate 1.2 is in biological replicate group 1 and is the second technical replicate for that sample.
Finally, you may have noticed that, in addition to the three expected columns, we also have two extra columns, cell_type
and treatment
. That's because the tutorial dataset had two factors, cell type (WT/KO) and treatment (Ctrl/IL22). These are here just for our information and DEAGO will ignore them.
Because we want to see the gene symbols that are linked to each gene identifier and to perform gene ontology (GO) term enrichment analyses we also need an annotation file. You should see two annotation files in your tutorial directory:
The gene annotation for the mouse genome (mm10) from Ensembl BioMart, one line per annotation, is in ensembl_mm10.tsv.
Let's take a look at "ensembl_mm10.tsv":
In [ ]:
head ensembl_mm10.tsv
Here you can see that the GO terms for "ENSMUSG00000064370" are split over 39 lines, one per annotation.
In [ ]:
grep '^ENSMUSG00000064370' ensembl_mm10.tsv | wc -l
We'll look at how to get annotations from Ensembl BioMart in the exercise below.
The DEAGO-formatted annotation is in ensembl_mm10_deago_formatted.tsv. The difference between this and the previous file is that the DEAGO-formatted annotation has one line per gene. Essentially, we bring together all the gene names and GO terms associated to a gene together.
Let's take a look at "ensembl_mm10_deago_formatted.tsv":
In [ ]:
head ensembl_mm10_deago_formatted.tsv
Now, if we look for the same gene "ENSMUSG00000064370" there is only one line.
In [ ]:
grep '^ENSMUSG00000064370'
There are three tab-delimited columns:
gene identifier
gene name
GO terms
Where there are multiple gene names or GO terms associated with a gene, they will be concatenated together with a semi-colon (';').
Note: the gene identifiers must match those found in the count data files.
Let's see how many GO terms were associated with our gene.
In [ ]:
grep '^ENSMUSG00000064370' ensembl_mm10_deago_formatted.tsv | \
cut -f 3 | tr ';' '\n' | wc -l
There were 36 go terms associated with ENSMUSG00000064370.
Note: we used tr
here because grep -c "GO:"
wouldn't work as it doesn't count multiple occurences of the same phrase in a single line.
Let's take a quick look at how to get annotations for mm10 in Ensembl BioMart.
Either click on the link below or type the URL into a web browser to go the Ensembl page for the mm10 genome.
https://www.ensembl.org/Mus_musculus
Follow the steps below to download your Ensembl annotation file.
The downloaded file will be called mart_export.txt and is the same as ensembl_mm10.tsv.
While deago
is the main running command for DEAGO, there are also some other commands built in which perform the intermediate steps in a DEAGO analysis. One of those is mart_to_deago
which will convert a BioMart annotation into a DEAGO-formatted annotation.
Let's take a look at the usage for mart_to_deago
.
In [ ]:
mart_to_deago -h
So, we need to specify the annotation file we want to convert using the -a
option.
Let's try converting ensembl_mm10.tsv into a DEAGO-formatted annotation file using mart_to_deago
.
In [ ]:
mart_to_deago -a ensembl_mm10.tsv
This will create the file deago_annotation.tsv which is the same as ensembl_mm10_deago_formatted.tsv.
_Note: Sanger pathogens users can get associated GO terms with farm_interproscan
giving it the prokka annotation file (.gff) for the reference used to generate the count files from the RNA-Seq Expression pipeline._
Q1: How many genes are there in each of the count files?
Hint: use grep
and/or wc -l
to count the number of lines (minus headers) or gene identifiers in one of the count files
Q2: How many genes have associated annotations?
Hint: count the number of genes in the DEAGO-formatted annotation file
Q3: How many of those genes have associated gene names?
Hint: use awk
and its built-in variable NF
to count the number of lines where at least one of the three fields is missing and whose second columns contain GO terms as the gene name is missing
Q4: How many of those genes have associated GO terms?
Hint: use awk
and its built-in variable NF
to count the number of lines where at least one of the three fields is missing and whose second columns don't contain GO terms as they are missing
For a quick recap of what the tutorial covers head back to the Introduction.
If you want a reintroduction to the tutorial dataset, head back to introducing the tutorial dataset.
Otherwise, let's continue on to running a quality control (QC) analysis.