Sanger pathogen users

This section is for Sanger users who are wanting to prepare their input data from the pathogens RNA-Seq Expression Pipeline.

In this section, we'll go through how you could generate the files we use in the DEAGO tutorial using the output from the Sanger pathogen pipelines.

You will need to be logged in to either pcs5 or the farm to run the commands in this section.

Requesting the RNA-Seq Expression Pipeline for your data

For an overview of the RNA-Seq Expression Pipeline and for details on requesting this pipeline for your data, please see the Pathogen Informatics wiki. If you need help with this or have questions, please email path-help@sanger.ac.uk.


PathFind scripts

There are several commands available for accessing the results of the pathogen informatics analysis pipelines. These are referred to as the pathfind or pf scripts. Please see the Pathogen Informatics wiki for more information.

For usage instructions run:


In [ ]:
pf man

Pipeline status

Once you have requested the RNA-Seq Expression Pipeline, check your samples have finished going through the pipeline using pf status. The pf status script will return information about the pipeline status (Running, Done or Failed) for each of the lanes in the input data allowing you to see which pipelines have been run on the data.

The command to check the pipeline status of our tutorial pipeline data would be:


In [ ]:
pf status -t study -i 2319

This should give you the status of the 32 lanes in this study within all of the pathogens pipelines. Here's the first few lanes.

Name Import QC Mapping Archive Improve SNP call RNASeq Assemble Annotate
8380_3#1 Done Done Done Done - Done Done Done -
8380_3#2 Done Done Done Done - Done Done Done -
8380_3#4 Done Done Done - - Done Done Done -
8380_3#5 Done Done Done Done - Done Done Done -

Count data

We can use pf rnaseq to find the count files which are the output from the RNA-Seq Expression Pipeline.


In [ ]:
pf rnaseq -t study -i 2319

This will give you the location of the 32 expression count files.

/lustre/scratch118/infgen/pathogen/pathpipe/prokaryotes/seq-pipelines/Mus/musculus/TRACKING/2319/WT2xCtrl_1
/SLX/WT2xCtrl_1_5733492/8380_3#1/390176.pe.markdup.bam.expression.csv

...

Expression counts files are available for all organisms. For human and mouse, there are also featurecounts files which have been generated using featureCounts. To access these instead we can use the -f option to find a particular filetype.


In [ ]:
pf rnaseq -t study -i 2319 -f featurecounts

For DEAGO, you will need to have your count files in a single directory. It isn't efficient to copy the data from the pipelines to your working directory. Instead you should create a shortcut or reference to these files called a symlink.


In [ ]:
mkdir counts
pf rnaseq -t study -i 2319 -f featurecounts -l counts

Take a look in the counts directory:


In [ ]:
ls counts

And you should see:

8380_3#1.390176.pe.markdup.bam.featurecounts.csv   8380_5#12.389308.pe.markdup.bam.featurecounts.csv 
8380_8#11.390155.pe.markdup.bam.featurecounts.csv  8380_3#2.390269.pe.markdup.bam.featurecounts.csv  
8380_6#1.390254.pe.markdup.bam.featurecounts.csv   8380_8#12.390242.pe.markdup.bam.featurecounts.csv
...

While it might look like your files are in the counts directory, what the -l option has done is create a series of symlinks which point the locations of the counts files within the pipelines. You can use ls -al to see what we mean.


Sample/condition mapping

DEAGO also needs a targets file which maps the sample files to the experimental conditions that were applied.

DEAGO expects to see these three columns in this file:

  • filename - name of the sample count file in the counts directory

  • condition - experimental condition that was applied

  • replicate - number or phrase representing a replicate group

You can get your filenames by using ls to list the files in your counts directory.

You may be able to use pf info to get information on the experimental conditions and replicate numbers. The pf info script will return metadata information allowing you to match up the internal sample identifiers with metadata such as the supplier identifiers for the given input.


In [ ]:
pf info -t study -i 2319

Here we can see that the Sample column might be able to help us:

Lane Sample Supplier Name Public Name Strain
8380_3#1 WT2xCtrl_1 NA WT2xCtrl_1 C57BL/6
8380_3#2 WT2xCtrl_2 NA WT2xCtrl_2 C57BL/6
... ... ... ... ...
8380_6#3 KO1xIL22_1 NA KO1xIL22_1 C57BL/6
8380_6#4 KO1xIL22_2 NA KO1xIL22_2 C57BL/6
... ... ... ... ...
8380_7#9 KO3xCtrl_1 NA KO3xCtrl_1 C57BL/6
... ... ... ... ...
8380_8#15 KO4xIL22_1 NA KO4xIL22_1 C57BL/6

From this we can see that there are:

  • 2 cell types: WT and KO
  • 2 treatments: Ctrl and IL22
  • 4 biological replicates (e.g. KO4)
  • 2 technical replicates (e.g. _1 and _2)

However, for our targets file we also need our count filenames. You can use the -S option from pf rnaseq which will combine the output metadata from pf info with the count filename for each sample. Here's an example for our featurecounts file:


In [ ]:
pf rnaseq -t study -i 2319 -S

You should see a message telling you what the summary file has been called:

Wrote summary to "2319.rnaseqfind_summary.tsv"

We've got a copy of this in the data directory. Let's take a look:


In [ ]:
head data/2319.rnaseqfind_summary.tsv

Here are the first few lines from the summary file:

Lane Sample Supplier_Name Public_Name Strain Filename File_Path
8380_3#1 WT2xCtrl_1 NA WT2xCtrl_1 C57BL/6 8380_3#1.390176.pe.markdup.bam.expression.csv /lustre/scratch118/...
8380_3#2 WT2xCtrl_2 NA WT2xCtrl_2 C57BL/6 8380_3#2.390269.pe.markdup.bam.expression.csv /lustre/scratch118/...
8380_3#4 WT2xIL22_2 NA WT2xIL22_2 C57BL/6 8380_3#4.389017.pe.markdup.bam.expression.csv /lustre/scratch118/...
8380_3#5 WT3xCtrl_1 NA WT3xCtrl_1 C57BL/6 8380_3#5 390266.pe.markdup.bam.expression.csv /lustre/scratch118/...

Remember that in our targets file, we need to have three columns:

  • Filename
  • Condition
  • Replicate

The summary file from pf rnaseq already gives us our Filename column which corresponds to the symlinked filenames in our counts directory. If we're lucky, either the Sample or Public_Name column will give us the condition and replicate information for each sample.

Adding the condition and replicate columns manually

If you're not comfortable with the command line, you can open this file in a text editor and manually add the Condition and Replicate columns. Remember that the targets file needs to be tab-delimited! If you use a program Excel make sure to export as a tab-delimited file.

Adding the condition and replicate columns on the command line

Extracting the condition and replicate information on the command line is going to be advanced unix. If you're happy doing this manually, skip this section.

Note: these commands will only work for the sample format here, each experiment is different and the commands you use will depend on how the sample information is set out.

First, let's create our targets file targets.tsv. We'll use echo to print the header to the file. The -e option allows us to write tabs and newlines to the file as without it they would look like "\t" and "\n".


In [ ]:
echo -e "condition\treplicate\tfilename" > targets.tsv

Let's take a look:


In [ ]:
cat targets.tsv

Now for the tricky bit. We'll use awk to rearrange the information in our Sample column and give us our Condition and Replicate columns. For our condition we need to extract the cell type (e.g. WT) and treatment (e.g. Ctrl) and merge them (e.g. WT_Ctrl). For our replicate we need to extract and merge the biological and technical replicate numbers (e.g. 2.1). So, for the first three samples we would get:

Sample Condition Replicate
WT2xCtrl_1 WT_Ctrl 2.1
WT2xCtrl_2 WT_Ctrl 2.2
WT2xIL22_2 WT_IL22 2.2

The sample is in the second column of the summary file which we can get with:


In [ ]:
awk '{print $2}' data/2319.rnaseqfind_summary.tsv

We can use NR>1 so that we skip the first line of the summary file as we don't want the column headers.


In [ ]:
awk 'NR>1{print $2}' data/2319.rnaseqfind_summary.tsv

Next, we're going to use split to give us the cell type (e.g. WT2) and treatment (e.g. Ctrl_1).


In [ ]:
awk 'NR>1{
    split($2, sample, "x");
    print sample[1]"\t"sample[2];
}' data/2319.rnaseqfind_summary.tsv

We can use split again to get the treatment and treatment replicate information.


In [ ]:
awk 'NR>1{
    split($2, sample, "x");
    split(sample[2], treatment, "_");
    print treatment[1]"\t"treatment[2];
}' data/2319.rnaseqfind_summary.tsv

Things are a little more tricky for the cell type and cell type replicate information. We know that our cell type is either WT or KO. Each of these is 2 characters. This means we can use substr to extract the first two characters i.e. the cell type. We need to tell substr where to start (0) and how many characters to extract from that start point (2). Note that the start point is 0 and not 1.


In [ ]:
awk 'NR>1{
    split($2, sample, "x");
    split(sample[2], treatment, "_");
    cell_type=substr(sample[1],0,2);
    print cell_type;
}' data/2319.rnaseqfind_summary.tsv

By the same logic, we can assume that everything after the first two characters is the replicate information. We can use substr again, telling it to take everything from the third character.


In [ ]:
awk 'NR>1{
    split($2, sample, "x");
    split(sample[2], treatment, "_");
    cell_type=substr(sample[1],0,2);
    cell_type_rep=substr(sample[1],3);
    print cell_type_rep;
}' data/2319.rnaseqfind_summary.tsv

DEAGO expects a condition column. We have two conditions here, cell type (cell_type) and treatment (treatment[1]), but DEAGO can only perform single-factor analyses. So we must join these together for the condition i.e. WT_Ctrl.

We also only have one replicate column so we join the biological (biological_replicate) and technical replicates (treatment[2]) i.e. 2.1.


In [ ]:
awk 'NR>1{
    split($2, sample, "x");
    split(sample[2], treatment, "_");
    cell_type=substr(sample[1],0,2);
    biological_rep=substr(sample[1],3);
    condition=cell_type"_"treatment[1];
    replicate=cell_type_rep"."treatment[2];
    print condition"\t"replicate
}' data/2319.rnaseqfind_summary.tsv

But wait, we need to add our filename from column 6 ($6):


In [ ]:
awk 'NR>1{
    split($2, sample, "x");
    split(sample[2], treatment, "_");
    cell_type=substr(sample[1],0,2);
    biological_rep=substr(sample[1],3);
    condition=cell_type"_"treatment[1];
    replicate=cell_type_rep"."treatment[2];
    print condition"\t"replicate"\t"$6
}' data/2319.rnaseqfind_summary.tsv

And append it (add it to the bottom of) the targets file.


In [ ]:
awk 'NR>1{
    split($2, sample, "x");
    split(sample[2], treatment, "_");
    cell_type=substr(sample[1],0,2);
    biological_rep=substr(sample[1],3);
    condition=cell_type"_"treatment[1];
    replicate=cell_type_rep"."treatment[2];
    print condition"\t"replicate"\t"$6
}' data/2319.rnaseqfind_summary.tsv >> targets.tsv

Let's take a look at our finished targets file:


In [ ]:
head targets.tsv

This is all we need for DEAGO. However, DEAGO will ignore extra columns in your targets file so you can have other descriptive columns like cell_type and treatment.

Here's an example:


In [ ]:
echo -e "condition\tcell_type\ttreatment\treplicate\tfilename" > extended_targets.tsv

awk 'BEGIN{ OFS="\t" }
     NR>1{
        split($2, sample, "x");
        split(sample[2], treatment, "_");
        cell_type=substr(sample[1],0,2);
        biological_rep=substr(sample[1],3);
        condition=cell_type"_"treatment[1];
        replicate=cell_type_rep"."treatment[2];
        print condition,cell_type,treatment[1],replicate,$6
     }' data/2319.rnaseqfind_summary.tsv >> extended_targets.tsv

head extended_targets.tsv

Note: we have used BEGIN{ OFS="\t" } which sets the output seperator to "\t" so that we don't have to keep writing it in the print statement and can just use a comma (',') instead.


Annotations

To prepare an annotation file, it's often useful to know what reference was used to map your reads. The --details option in pf rnaseq will give you this information.


In [ ]:
pf rnaseq -t lane -i 8380_3#1 --details

Here we can see that Mus_musculus_mm10 was used as the reference when mapping this lane with BWA.

/lustre/scratch118/infgen/pathogen/pathpipe/prokaryotes/seq-pipelines/Mus/musculus/TRACKING/2319/WT2xCtrl_1/SLX/WT2xCtrl_1_5733492/8380_3#1/390176.pe.markdup.bam.expression.csv  Mus_musculus_mm10 bwa 2012-12-05T10:36:36

For organisms where gene symbol and GO terms are available in Ensembl, you can use BioMart to gather your annotation.

Annotations for organisms not in Ensembl

For other organisms, you may need to mine other databases or sources. The annotation will need to be formatted for use with DEAGO.

It's worth seeing if you can use farm_interproscan to generate the GO term annotations for your reference.

For example, if your reference was Staphylococcus_aureus_SR434_GCF_001986135_1, you could find the reference file using pf ref.


In [ ]:
pf ref -i Staphylococcus_aureus_SR434_GCF_001986135_1

This would give you the location of the reference fasta file, but what you need is the GFF-formatted annotation file which would have been generated by prokka when the reference was added to the pathogen databases.

We can find the prokka annotations in a folder called annotation which is in the same directory as the reference sequence.


In [ ]:
ls /lustre/scratch118/infgen/pathogen/pathpipe/refs/Staphylococcus/aureus_SR434_GCF_001986135_1/annotation

In the annotation directory will hopefully be a GFF file (Staphylococcus_aureus_SR434_GCF_001986135_1.gff) which can be used with farm_interproscan. You'll want to symlink this to your analysis directory.


In [ ]:
ln -s /lustre/scratch118/infgen/pathogen/pathpipe/refs/Staphylococcus/aureus_SR434_GCF_001986135_1/annotation/Staphylococcus_aureus_SR434_GCF_001986135_1.gff .

You can get the usage for farm_interproscan with:


In [ ]:
farm_interproscan -h

These are the example usage commands you're looking for:

# Run InterProScan using LSF with GFF input (standard genetic code for translation)
  farm_interproscan -a annotation.gff -g

# Run InterProScan using LSF with GFF input (bacterial code for translation)
  farm_interproscan -a annotation.gff -g -c 11

As our example here is a bacteria strain we need the second of these.


In [ ]:
farm_interproscan -a Staphylococcus_aureus_SR434_GCF_001986135_1.gff -g -c 11

When this has finished running you'll get an output file with the suffix .go.tsv. We've put the first 20 lines of this file as an example in the data directory.


In [60]:
head data/Staphylococcus_aureus_SR434_GCF_001986135_1.gff.iprscan.gff.go.tsv


Staphylococcus_aureus_SR434_GCF_001986135_1_02557	GO:0016491;GO:0008270;GO:0055114
Staphylococcus_aureus_SR434_GCF_001986135_1_01435	GO:0006355
Staphylococcus_aureus_SR434_GCF_001986135_1_00932	GO:0005488;GO:0015886;GO:0016020;GO:0020037
Staphylococcus_aureus_SR434_GCF_001986135_1_02609	GO:0006950;GO:0055114;GO:0016722;GO:0006879;GO:0008199
Staphylococcus_aureus_SR434_GCF_001986135_1_02722	GO:0005737;GO:0006457;GO:0005524
Staphylococcus_aureus_SR434_GCF_001986135_1_00371	GO:0000155;GO:0007165;GO:0004871;GO:0016021;GO:0016310;GO:0016772
Staphylococcus_aureus_SR434_GCF_001986135_1_01882	GO:0003824;GO:0006525;GO:0008483;GO:0030170;GO:0004587
Staphylococcus_aureus_SR434_GCF_001986135_1_01655	GO:0005576;GO:0009405
Staphylococcus_aureus_SR434_GCF_001986135_1_02101	GO:0031647
Staphylococcus_aureus_SR434_GCF_001986135_1_01633	GO:0005328;GO:0006836;GO:0016021

This will give you the GO terms associated with each unique gene identifier. The GO terms will be seperated by a semi-colon (';') as this is the format used by topGO and DEAGO.

Note: this will not give you gene names or descriptions although you may be able to find these in the GFF files and add them to the annotation.