Input files

DEAGO requires the following for all analyses:

  • a count data directory
  • a targets file

If you want to perform GO enrichment analyses or show gene symbols, you will also need:

  • an annotation file

For bespoke analyses you will also need to pre-generate a config file and use this as an input for some of the component scripts.


Count data directory

DEAGO looks in a single folder for all of your count data, one file per sample. You will need to tell DEAGO the location or path for this folder.

To work with different types of expression count files, DEAGO uses several command line options:

Command line option   Description
--count_type          type of count file [expression or featurecounts]
--count_column        number of column containing count values
--skip_lines          number of lines to skip in count file
--count_delim         count file delimiter
--gene_ids            name of column containing gene identifiers

You don't always need to use these options. By default, DEAGO assumes the following:

  • --count_column 5
  • --skip_lines 0
  • --gene_ids 'GeneID'
  • --count_delim ","

These are the parameters for the expression count files which are produced for all organisms processed by the Sanger pathogens RNA-Seq expression pipeline.

An example of a default expression count file would be:

"Seq ID",GeneID,"Locus Tag","Feature Type","Total Reads Mapping","Total RPKM (Reads Mapped)",
"Total Reads Mapping (Reads Mapped to Gene Models)","Total RPKM (Reads Mapped to Gene Models)",
"Sense Reads Mapping","Sense RPKM (Reads Mapped)","Sense RPKM (Reads Mapped to Gene Models)",
"Antisense Reads Mapping","Antisense RPKM (Reads Mapped)","Antisense RPKM (Reads Mapped to Gene Models)"
1,ENSG00000266468,,,0,0,0,0,0,,,0,,
1,ENSG00000228682,,,0,0,0,0,0,,,0,,
1,ENSG00000202027,,,0,0,0,0,0,,,0,,
1,ENSG00000235777,,,0,0,0,0,0,,,0,,
1,ENSG00000230718,,,0,0,0,0,0,,,0,,

Gene identifiers are in the second column GeneID (e.g. ENSG00000266468) and the gene counts are in the fifth column (column 5). The first line is the header (i.e. no lines to skip) and the fields are comma-delimited (,).
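
The default parameters above can be read as instructions for parsing a count file. The following is a minimal Python sketch of that reading, not DEAGO's own implementation: the function name read_counts is hypothetical, its arguments simply mirror the command line options, and the example data is the first five columns of the file shown above.

```python
import csv


def read_counts(text, count_column=5, skip_lines=0, gene_ids="GeneID", count_delim=","):
    """Return {gene_id: count} from count-file text (hypothetical helper, not part of DEAGO)."""
    lines = text.splitlines()[skip_lines:]        # drop any leading comment lines
    reader = csv.reader(lines, delimiter=count_delim)
    header = next(reader)                         # first remaining line is the header
    id_index = header.index(gene_ids)             # gene identifier column, found by name
    count_index = count_column - 1                # count column is given as a 1-based number
    return {row[id_index]: int(row[count_index]) for row in reader if row}


# First five columns of the example expression count file
example = '''"Seq ID",GeneID,"Locus Tag","Feature Type","Total Reads Mapping"
1,ENSG00000266468,,,0
1,ENSG00000228682,,,0
'''

counts = read_counts(example)
print(counts)
```

With the defaults, the sketch looks up the GeneID column by name and takes the count from column 5, which is how the options are described above.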

There is also a preset for featureCounts files, --count_type featurecounts, which assumes:

  • --count_column 7
  • --skip_lines 1
  • --gene_ids 'Geneid'
  • --count_delim "\t"

Gene identifiers are in the first column Geneid (e.g. ENSMUSG00000090025) and the gene counts are in the last column (column 7). The first line is a comment giving the program and command used to generate the count file, so one line is skipped before the header. The fields are tab-delimited (\t).

# Program:featureCounts v1.4.5-p1; Command:"featureCounts" "-O" "-T" "1" "-t" "exon" "-g" "gene_id" "-a"
"/lustre/scratch118/infgen/pathogen/pathpipe/refs/Mus/musculus/Mus_musculus_mm10.gtf" "-o"
"390176.pe.markdup.bam.featurecounts.csv" "390176.pe.markdup.bam"
Geneid  Chr     Start   End     Strand  Length  390176.pe.markdup.bam
ENSMUSG00000090025      1       3054233 3054733 +       501     0
ENSMUSG00000064842      1       3102016 3102125 +       110     0

You can also use other count file formats with DEAGO as long as you configure the options in the table above accordingly.
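
One way to picture how the presets and the individual options fit together is as a lookup table with overrides. The sketch below is illustrative only: the names PRESETS and resolve_options are hypothetical, "expression" is taken as the default count type because its values are the documented defaults, and the rule that an explicitly supplied option replaces the preset value is an assumption rather than documented DEAGO behaviour.

```python
# Values copied from the two presets described above.
PRESETS = {
    "expression": {"count_column": 5, "skip_lines": 0, "gene_ids": "GeneID", "count_delim": ","},
    "featurecounts": {"count_column": 7, "skip_lines": 1, "gene_ids": "Geneid", "count_delim": "\t"},
}


def resolve_options(count_type="expression", **overrides):
    """Start from a preset, then apply explicitly supplied options (assumed precedence)."""
    unknown = set(overrides) - set(PRESETS[count_type])
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    return {**PRESETS[count_type], **overrides}


print(resolve_options("featurecounts"))
# A hypothetical custom format: tab-delimited, counts in column 2, no lines to skip
print(resolve_options(count_delim="\t", count_column=2))
```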


Targets file

DEAGO also requires a targets file, which tells it which count file belongs to each sample and which experimental conditions were applied.

Each row corresponds to a sample and there are three required columns:

  • filename name of the sample count file in the counts directory
  • condition experimental condition that was applied
  • replicate number or phrase representing a replicate group

Let's take a look at an example targets file:

condition   cell_type   treatment   replicate   filename
WT_Ctrl WT  Ctrl    2.1 8380_3#1.390176.pe.markdup.bam.featurecounts.csv
WT_Ctrl WT  Ctrl    2.2 8380_3#2.390269.pe.markdup.bam.featurecounts.csv
WT_IL22 WT  IL22    2.2 8380_3#4.389017.pe.markdup.bam.featurecounts.csv

Notice that in addition to the three expected columns we also have cell_type and treatment. That's because this example dataset had two factors, cell type (WT/KO) and treatment (Ctrl/IL22).

Currently, DEAGO can only perform single-factor analyses, so we concatenate the cell type and treatment values in the condition column (e.g. WT_Ctrl). See Bespoke analyses for how to adapt the DEAGO output for multi-factor analyses.

In this dataset we also had 4 biological replicates and 2 technical replicates for each condition. This is represented in the replicate column: for example, replicate 1.2 is biological replicate 1, technical replicate 2.
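
Before running an analysis it can help to check a targets file against the rules above. The following Python sketch is one possible check, not part of DEAGO: the helper names read_targets and split_replicate are hypothetical, and the tab delimiter is an assumption about the file layout. The rows are taken from the example targets file.

```python
import csv

REQUIRED = {"filename", "condition", "replicate"}


def read_targets(text, delimiter="\t"):
    """Parse targets text into a list of row dicts, checking the required columns."""
    reader = csv.DictReader(text.splitlines(), delimiter=delimiter)
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"targets file is missing column(s): {sorted(missing)}")
    return list(reader)


def split_replicate(label):
    """Split a label such as '1.2' into (biological, technical) replicate parts."""
    biological, _, technical = label.partition(".")
    return biological, technical


targets = (
    "condition\tcell_type\ttreatment\treplicate\tfilename\n"
    "WT_Ctrl\tWT\tCtrl\t2.1\t8380_3#1.390176.pe.markdup.bam.featurecounts.csv\n"
    "WT_IL22\tWT\tIL22\t2.2\t8380_3#4.389017.pe.markdup.bam.featurecounts.csv\n"
)

for row in read_targets(targets):
    # The single-factor condition is the two factors joined with an underscore
    assert row["condition"] == f"{row['cell_type']}_{row['treatment']}"
    print(row["condition"], split_replicate(row["replicate"]))
```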


Annotation file

An annotation file is required if you want to include gene symbols in the results tables or if you want to perform GO term enrichment analyses with DEAGO. The annotation file maps gene identifiers to gene symbols and/or GO terms.

Let's take a look at an example of a DEAGO-formatted annotation file:

ENSMUSG00000000001  Gnai3   GO:0000166;GO:0003924;GO:0004871;GO:0005515;GO:0005525;GO:0005624;GO:0005737;GO:0005794;GO:0005813;GO:0005834;GO:0005856;GO:0005886;GO:0006184;GO:0006906;GO:0007049;GO:0007186;GO:0019001;GO:0030496;GO:0031821;GO:0042588;GO:0045121;GO:0046872;GO:0050805;GO:0051301
ENSMUSG00000000003  Pbsn    GO:0005215;GO:0005488;GO:0005549;GO:0005576
ENSMUSG00000000028  Cdc45   GO:0005634;GO:0005813;GO:0006260;GO:0006270;GO:0007049
ENSMUSG00000000037  Scml2   GO:0005634;GO:0006355;GO:0031519
ENSMUSG00000000049  Apoh    GO:0001937;GO:0001948;GO:0005515;GO:0005543;GO:0005576;GO:0005615;GO:0006641;GO:0007596;GO:0007597;GO:0008201;GO:0008289;GO:0009986;GO:0010596;GO:0016525;GO:0030193;GO:0030195;GO:0031012;GO:0031100;GO:0031639;GO:0033033;GO:0034197;GO:0034361;GO:0034364;GO:0034392;GO:0042627;GO:0043499;GO:0051006;GO:0051917;GO:0051918;GO:0060230;

There is one row per gene, although not every gene needs to appear in the annotation file. The annotation file must contain at least 2 columns. The first column should always contain the gene identifiers (the Geneid column in featureCounts files). The second and/or third column should contain the gene symbols or GO terms associated with the gene. If multiple gene symbols or GO terms are associated with a gene, they should be concatenated (joined together) with semi-colons (e.g. GO:0005634;GO:0006355;GO:0031519).
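
The layout rules above can be sketched as a small parser. This is an illustration under stated assumptions, not DEAGO's own code: the function name parse_annotation is hypothetical, the columns are assumed to be tab-separated, and a value is treated as a GO term when it starts with "GO:". The rows come from the example annotation file.

```python
def parse_annotation(text, delimiter="\t"):
    """Return {gene_id: {"symbols": [...], "go_terms": [...]}} (hypothetical helper)."""
    annotation = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        gene_id, *fields = line.split(delimiter)
        entry = {"symbols": [], "go_terms": []}
        for field in fields:
            # Values are semicolon-joined; skip empties left by a trailing semicolon
            for value in filter(None, (v.strip() for v in field.split(";"))):
                key = "go_terms" if value.startswith("GO:") else "symbols"
                entry[key].append(value)
        annotation[gene_id] = entry
    return annotation


example = (
    "ENSMUSG00000000003\tPbsn\tGO:0005215;GO:0005488;GO:0005549;GO:0005576\n"
    "ENSMUSG00000000037\tScml2\tGO:0005634;GO:0006355;GO:0031519\n"
)

annotation = parse_annotation(example)
print(annotation["ENSMUSG00000000037"])
```

Because each value is classified by its prefix rather than by its column position, the sketch accepts files where the second column holds either gene symbols or GO terms.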

For more information on preparing an annotation file for DEAGO (e.g. from Ensembl BioMart output) see Preparing an annotation file.


Configuration file

DEAGO uses a config file containing tab-delimited key/value pairs to define the parameters for each analysis.

For a QC-only analysis of featureCounts files it will look something like this:

count_column    7
count_delim \t
count_type  featurecounts
counts_directory    /path/to/counts
gene_ids    Geneid
go_analysis 0
go_levels   all
keep_images 0
qc_only 1
qvalue  0.05
results_directory   /path/to/qc_results
skip_lines  1
targets_file    /path/to/targets.txt

We can ask DEAGO to generate this file using the --build_config option:


deago --build_config -c data/counts -t data/targets.txt

This will generate a file called deago.config as part of the analysis which is built using the command line parameters you provide and will look similar to the content above.
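
Because the config file is just tab-delimited key/value pairs, it is easy to inspect outside DEAGO. The sketch below is a minimal Python reader for that layout. The function name parse_config is hypothetical and is not one of the DEAGO component scripts, and every value is kept as a string because the file itself carries no type information.

```python
def parse_config(text):
    """Return a dict of key/value pairs from tab-delimited config text (hypothetical helper)."""
    config = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition("\t")      # split on the first tab only
        config[key.strip()] = value.strip()
    return config


# A few of the key/value pairs from the example config shown earlier
example = (
    "count_column\t7\n"
    "count_type\tfeaturecounts\n"
    "counts_directory\t/path/to/counts\n"
    "qc_only\t1\n"
    "targets_file\t/path/to/targets.txt\n"
)

config = parse_config(example)
print(config["count_type"], int(config["count_column"]), config["qc_only"] == "1")
```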

We can also use an existing config file if we want to re-run an analysis:


deago --config deago.config

DEAGO can also be run in stages (e.g. build_deago_config, build_deago_markdown, deago_markdown_to_html) if you want to do a bespoke analysis. See Bespoke analyses for more information.
