Preparing input data

Introduction

In this part of the tutorial we will look at preparing input data for use with DEAGO. The objectives of this part of the tutorial are:

understand the format of the three main input data types used by DEAGO
be able to prepare counts for use with DEAGO
be able to prepare a sample/condition mapping file
be able to prepare an annotation file for use with DEAGO

Input data

We will be using three types of input data in this tutorial. Our gene counts and sample/condition mapping are required:

counts
a directory containing count data files (one file per sample)

targets.txt
a sample/condition mapping file

We can also use an annotation file which gives us details about the genes such as their gene names and the gene ontology (GO) terms they are associated with. An annotation file is optional for quality control (QC) and differential expression (DE) analyses, but required for a GO term enrichment analysis.

ensembl_mm10_deago_formatted.tsv
an annotation file containing gene symbols and GO terms

For more information on the input datatypes DEAGO uses and their formats see the user guide.

The data directory contains all of our input files. Let's go there.



In [ ]:

    
cd data

Count data

DEAGO looks in a single folder for all of your count data, one file per sample. When we run an analysis we will need to tell DEAGO the location or path for this folder.

Let's take a look at our tutorial count data which can be found in the counts directory:



In [ ]:

    
ls counts

You should see a list of 32 count files were generated using featureCounts, one file per sample. Each file contains gene counts i.e. the number or reads assigned to each gene or our gene expression profiles.

Let's take a look inside one of the count files:



In [ ]:

    
head counts/8380_3#1.390176.pe.markdup.bam.featurecounts.csv

There are several things to notice about the featureCounts file format:

Gene identifiers are in the first column Geneid (e.g. ENSMUSG00000090025)

Gene counts are in the last column (7)

The first line of the file is a comment which gives the details of the program and command used to generate the count file

The fields are tab-delimited (\t)

The reason we're highlighting this is because DEAGO has an option called count_type which we will be using in our analyses. By default, DEAGO is designed to take the expression counts which are the output from the Sanger pathogens RNA-Seq Expression pipeline. These have a different format from the featurecounts files. So, we'll be adding --count_type featurecounts to our commands which will tell DEAGO to expect count files which were generated by featureCounts and have the above characteristics.

You can also use raw count data as input as long as you have one file per sample. There are several options you can use which will configure DEAGO to be able to use them:

count_column - number of column containing count values
skip_lines - number of lines to skip in count file
count_delim - count file delimiter
gene_ids - name of column containing gene ids

Sample/condition mappings

DEAGO also requires a targets file which tells it which counts file is associated with each sample and the experimental conditions that were applied.

Let's take a look at our tutorial targets file:



In [ ]:

    
head targets.txt

In our targets file, each row corresponds to a sample. There are three columns which DEAGO always expects to find in our tab-delimited targets file:

filename
name of the sample count file in the counts directory

condition
experimental condition that was applied

replicate
number or phrase representing a replicate group

The filename is the name of the file containing the count data for the sample. We don't need to put the directory name in the filename (e.g. count/). This is because we tell DEAGO the directory the count files are stored in and the filename is the path in relation to that directory.

The condition is a short label which will be used to identify the condition. The condition label should be the same for all of the samples which share that condition. Here, our condition is a combination (e.g. WT_Ctrl) of the cell type (e.g. WT) and the treatment (e.g. Ctrl).

This dataset has 4 biological replicates and 2 technical replicates for each condition represented in the replicate column. For example, replicate 1.2 is in biological replicate group 1 and is the second technical replicate for that sample.

Finally, you may have noticed that, in addition to the three expected columns, we also have two extra columns, cell_type and treatment. That's because the tutorial dataset had two factors, cell type (WT/KO) and treatment (Ctrl/IL22). These are here just for our information and DEAGO will ignore them.

Annotation file

Because we want to see the gene symbols that are linked to each gene identifier and to perform gene ontology (GO) term enrichment analyses we also need an annotation file. You should see two annotation files in your tutorial directory:

The gene annotation for the mouse genome (mm10) from Ensembl BioMart, one line per annotation, is in ensembl_mm10.tsv.

Let's take a look at "ensembl_mm10.tsv":



In [ ]:

    
head ensembl_mm10.tsv

Here you can see that the GO terms for "ENSMUSG00000064370" are split over 39 lines, one per annotation.



In [ ]:

    
grep '^ENSMUSG00000064370' ensembl_mm10.tsv | wc -l

We'll look at how to get annotations from Ensembl BioMart in the exercise below.

The DEAGO-formatted annotation is in ensembl_mm10_deago_formatted.tsv. The difference between this and the previous file is that the DEAGO-formatted annotation has one line per gene. Essentially, we bring together all the gene names and GO terms associated to a gene together.

Let's take a look at "ensembl_mm10_deago_formatted.tsv":



In [ ]:

    
head ensembl_mm10_deago_formatted.tsv

Now, if we look for the same gene "ENSMUSG00000064370" there is only one line.



In [ ]:

    
grep '^ENSMUSG00000064370'

There are three tab-delimited columns:

gene identifier
gene name
GO terms

Where there are multiple gene names or GO terms associated with a gene, they will be concatenated together with a semi-colon (';').

Note: the gene identifiers must match those found in the count data files.

Let's see how many GO terms were associated with our gene.



In [ ]:

    
grep '^ENSMUSG00000064370' ensembl_mm10_deago_formatted.tsv | \
    cut -f 3 | tr ';' '\n' | wc -l

There were 36 go terms associated with ENSMUSG00000064370.

Note: we used tr here because grep -c "GO:" wouldn't work as it doesn't count multiple occurences of the same phrase in a single line.

Exercise 2

Let's take a quick look at how to get annotations for mm10 in Ensembl BioMart.

Either click on the link below or type the URL into a web browser to go the Ensembl page for the mm10 genome.

https://www.ensembl.org/Mus_musculus

Follow the steps below to download your Ensembl annotation file.

Click BioMart on the top menu
Select Ensembl Genes 93 as the database (the version number may change with Ensembl updates)
Select Mus musculus genes (GRCm38.p5) as the dataset
Select Attributes from the left-hand menu
Click on the + symbol next to GENE: and select Ensembl Gene ID and Associated Gene Name
Click on the + symbol next to EXTERNAL: and select GO Term Accession
Click the Results button
Check Export all results to is set to File and TSV and click the Go button to download the annotations

The downloaded file will be called mart_export.txt and is the same as ensembl_mm10.tsv.

While deago is the main running command for DEAGO, there are also some other commands built in which perform the intermediate steps in a DEAGO analysis. One of those is mart_to_deago which will convert a BioMart annotation into a DEAGO-formatted annotation.

Let's take a look at the usage for mart_to_deago.



In [ ]:

    
mart_to_deago -h

So, we need to specify the annotation file we want to convert using the -a option.

Let's try converting ensembl_mm10.tsv into a DEAGO-formatted annotation file using mart_to_deago.



In [ ]:

    
mart_to_deago -a ensembl_mm10.tsv

This will create the file deago_annotation.tsv which is the same as ensembl_mm10_deago_formatted.tsv.

_Note: Sanger pathogens users can get associated GO terms with farm_interproscan giving it the prokka annotation file (.gff) for the reference used to generate the count files from the RNA-Seq Expression pipeline._

Questions

Q1: How many genes are there in each of the count files?
Hint: use grep and/or wc -l to count the number of lines (minus headers) or gene identifiers in one of the count files

Q2: How many genes have associated annotations?
Hint: count the number of genes in the DEAGO-formatted annotation file

Q3: How many of those genes have associated gene names?
Hint: use awk and its built-in variable NF to count the number of lines where at least one of the three fields is missing and whose second columns contain GO terms as the gene name is missing

Q4: How many of those genes have associated GO terms?
Hint: use awk and its built-in variable NF to count the number of lines where at least one of the three fields is missing and whose second columns don't contain GO terms as they are missing

What's next?

For a quick recap of what the tutorial covers head back to the Introduction.

If you want a reintroduction to the tutorial dataset, head back to introducing the tutorial dataset.

Otherwise, let's continue on to running a quality control (QC) analysis.