This section is for Sanger users who are wanting to prepare their input data from the pathogens RNA-Seq Expression Pipeline.
In this section, we'll go through how you could generate the files we use in the DEAGO tutorial using the output from the Sanger pathogen pipelines.
You will need to be logged in to either pcs5
or the farm
to run the commands in this section.
For an overview of the RNA-Seq Expression Pipeline and for details on requesting this pipeline for your data, please see the Pathogen Informatics wiki. If you need help with this or have questions, please email path-help@sanger.ac.uk.
There are several commands available for accessing the results of the pathogen informatics analysis pipelines. These are referred to as the pathfind or pf scripts. Please see the Pathogen Informatics wiki for more information.
For usage instructions run:
In [ ]:
pf man
Once you have requested the RNA-Seq Expression Pipeline, check your samples have finished going through the pipeline using pf status
. The pf status script will return information about the pipeline status (Running, Done or Failed) for each of the lanes in the input data allowing you to see which pipelines have been run on the data.
The command to check the pipeline status of our tutorial pipeline data would be:
In [ ]:
pf status -t study -i 2319
This should give you the status of the 32 lanes in this study within all of the pathogens pipelines. Here's the first few lanes.
Name | Import | QC | Mapping | Archive | Improve | SNP call | RNASeq | Assemble | Annotate |
---|---|---|---|---|---|---|---|---|---|
8380_3#1 | Done | Done | Done | Done | - | Done | Done | Done | - |
8380_3#2 | Done | Done | Done | Done | - | Done | Done | Done | - |
8380_3#4 | Done | Done | Done | - | - | Done | Done | Done | - |
8380_3#5 | Done | Done | Done | Done | - | Done | Done | Done | - |
We can use pf rnaseq
to find the count files which are the output from the RNA-Seq Expression Pipeline.
In [ ]:
pf rnaseq -t study -i 2319
This will give you the location of the 32 expression count files.
/lustre/scratch118/infgen/pathogen/pathpipe/prokaryotes/seq-pipelines/Mus/musculus/TRACKING/2319/WT2xCtrl_1
/SLX/WT2xCtrl_1_5733492/8380_3#1/390176.pe.markdup.bam.expression.csv
...
Expression counts files are available for all organisms. For human and mouse, there are also featurecounts files which have been generated using featureCounts. To access these instead we can use the -f
option to find a particular filetype.
In [ ]:
pf rnaseq -t study -i 2319 -f featurecounts
For DEAGO, you will need to have your count files in a single directory. It isn't efficient to copy the data from the pipelines to your working directory. Instead you should create a shortcut or reference to these files called a symlink.
In [ ]:
mkdir counts
pf rnaseq -t study -i 2319 -f featurecounts -l counts
Take a look in the counts directory:
In [ ]:
ls counts
And you should see:
8380_3#1.390176.pe.markdup.bam.featurecounts.csv 8380_5#12.389308.pe.markdup.bam.featurecounts.csv
8380_8#11.390155.pe.markdup.bam.featurecounts.csv 8380_3#2.390269.pe.markdup.bam.featurecounts.csv
8380_6#1.390254.pe.markdup.bam.featurecounts.csv 8380_8#12.390242.pe.markdup.bam.featurecounts.csv
...
While it might look like your files are in the counts directory, what the -l
option has done is create a series of symlinks which point the locations of the counts files within the pipelines. You can use ls -al
to see what we mean.
DEAGO also needs a targets file which maps the sample files to the experimental conditions that were applied.
DEAGO expects to see these three columns in this file:
filename
- name of the sample count file in the counts directory
condition
- experimental condition that was applied
replicate
- number or phrase representing a replicate group
You can get your filenames by using ls
to list the files in your counts directory.
You may be able to use pf info
to get information on the experimental conditions and replicate numbers. The pf info script will return metadata information allowing you to match up the internal sample identifiers with metadata such as the supplier identifiers for the given input.
In [ ]:
pf info -t study -i 2319
Here we can see that the Sample column might be able to help us:
Lane | Sample | Supplier Name | Public Name | Strain |
---|---|---|---|---|
8380_3#1 | WT2xCtrl_1 | NA | WT2xCtrl_1 | C57BL/6 |
8380_3#2 | WT2xCtrl_2 | NA | WT2xCtrl_2 | C57BL/6 |
... | ... | ... | ... | ... |
8380_6#3 | KO1xIL22_1 | NA | KO1xIL22_1 | C57BL/6 |
8380_6#4 | KO1xIL22_2 | NA | KO1xIL22_2 | C57BL/6 |
... | ... | ... | ... | ... |
8380_7#9 | KO3xCtrl_1 | NA | KO3xCtrl_1 | C57BL/6 |
... | ... | ... | ... | ... |
8380_8#15 | KO4xIL22_1 | NA | KO4xIL22_1 | C57BL/6 |
From this we can see that there are:
However, for our targets file we also need our count filenames. You can use the -S
option from pf rnaseq
which will combine the output metadata from pf info
with the count filename for each sample. Here's an example for our featurecounts file:
In [ ]:
pf rnaseq -t study -i 2319 -S
You should see a message telling you what the summary file has been called:
Wrote summary to "2319.rnaseqfind_summary.tsv"
We've got a copy of this in the data
directory. Let's take a look:
In [ ]:
head data/2319.rnaseqfind_summary.tsv
Here are the first few lines from the summary file:
Lane | Sample | Supplier_Name | Public_Name | Strain | Filename | File_Path |
---|---|---|---|---|---|---|
8380_3#1 | WT2xCtrl_1 | NA | WT2xCtrl_1 | C57BL/6 | 8380_3#1.390176.pe.markdup.bam.expression.csv | /lustre/scratch118/... |
8380_3#2 | WT2xCtrl_2 | NA | WT2xCtrl_2 | C57BL/6 | 8380_3#2.390269.pe.markdup.bam.expression.csv | /lustre/scratch118/... |
8380_3#4 | WT2xIL22_2 | NA | WT2xIL22_2 | C57BL/6 | 8380_3#4.389017.pe.markdup.bam.expression.csv | /lustre/scratch118/... |
8380_3#5 | WT3xCtrl_1 | NA | WT3xCtrl_1 | C57BL/6 | 8380_3#5 390266.pe.markdup.bam.expression.csv | /lustre/scratch118/... |
Remember that in our targets file, we need to have three columns:
The summary file from pf rnaseq
already gives us our Filename column which corresponds to the symlinked filenames in our counts directory. If we're lucky, either the Sample or Public_Name column will give us the condition and replicate information for each sample.
If you're not comfortable with the command line, you can open this file in a text editor and manually add the Condition and Replicate columns. Remember that the targets file needs to be tab-delimited! If you use a program Excel make sure to export as a tab-delimited file.
Extracting the condition and replicate information on the command line is going to be advanced unix. If you're happy doing this manually, skip this section.
Note: these commands will only work for the sample format here, each experiment is different and the commands you use will depend on how the sample information is set out.
First, let's create our targets file targets.tsv. We'll use echo
to print the header to the file. The -e
option allows us to write tabs and newlines to the file as without it they would look like "\t" and "\n".
In [ ]:
echo -e "condition\treplicate\tfilename" > targets.tsv
Let's take a look:
In [ ]:
cat targets.tsv
Now for the tricky bit. We'll use awk
to rearrange the information in our Sample column and give us our Condition and Replicate columns. For our condition we need to extract the cell type (e.g. WT) and treatment (e.g. Ctrl) and merge them (e.g. WT_Ctrl). For our replicate we need to extract and merge the biological and technical replicate numbers (e.g. 2.1). So, for the first three samples we would get:
Sample | Condition | Replicate |
---|---|---|
WT2xCtrl_1 | WT_Ctrl | 2.1 |
WT2xCtrl_2 | WT_Ctrl | 2.2 |
WT2xIL22_2 | WT_IL22 | 2.2 |
The sample is in the second column of the summary file which we can get with:
In [ ]:
awk '{print $2}' data/2319.rnaseqfind_summary.tsv
We can use NR>1
so that we skip the first line of the summary file as we don't want the column headers.
In [ ]:
awk 'NR>1{print $2}' data/2319.rnaseqfind_summary.tsv
Next, we're going to use split
to give us the cell type (e.g. WT2) and treatment (e.g. Ctrl_1).
In [ ]:
awk 'NR>1{
split($2, sample, "x");
print sample[1]"\t"sample[2];
}' data/2319.rnaseqfind_summary.tsv
We can use split
again to get the treatment and treatment replicate information.
In [ ]:
awk 'NR>1{
split($2, sample, "x");
split(sample[2], treatment, "_");
print treatment[1]"\t"treatment[2];
}' data/2319.rnaseqfind_summary.tsv
Things are a little more tricky for the cell type and cell type replicate information. We know that our cell type is either WT or KO. Each of these is 2 characters. This means we can use substr
to extract the first two characters i.e. the cell type. We need to tell substr
where to start (0) and how many characters to extract from that start point (2). Note that the start point is 0 and not 1.
In [ ]:
awk 'NR>1{
split($2, sample, "x");
split(sample[2], treatment, "_");
cell_type=substr(sample[1],0,2);
print cell_type;
}' data/2319.rnaseqfind_summary.tsv
By the same logic, we can assume that everything after the first two characters is the replicate information. We can use substr
again, telling it to take everything from the third character.
In [ ]:
awk 'NR>1{
split($2, sample, "x");
split(sample[2], treatment, "_");
cell_type=substr(sample[1],0,2);
cell_type_rep=substr(sample[1],3);
print cell_type_rep;
}' data/2319.rnaseqfind_summary.tsv
DEAGO expects a condition column. We have two conditions here, cell type (cell_type
) and treatment (treatment[1]
), but DEAGO can only perform single-factor analyses. So we must join these together for the condition i.e. WT_Ctrl.
We also only have one replicate column so we join the biological (biological_replicate
) and technical replicates (treatment[2]
) i.e. 2.1.
In [ ]:
awk 'NR>1{
split($2, sample, "x");
split(sample[2], treatment, "_");
cell_type=substr(sample[1],0,2);
biological_rep=substr(sample[1],3);
condition=cell_type"_"treatment[1];
replicate=cell_type_rep"."treatment[2];
print condition"\t"replicate
}' data/2319.rnaseqfind_summary.tsv
But wait, we need to add our filename from column 6 ($6
):
In [ ]:
awk 'NR>1{
split($2, sample, "x");
split(sample[2], treatment, "_");
cell_type=substr(sample[1],0,2);
biological_rep=substr(sample[1],3);
condition=cell_type"_"treatment[1];
replicate=cell_type_rep"."treatment[2];
print condition"\t"replicate"\t"$6
}' data/2319.rnaseqfind_summary.tsv
And append it (add it to the bottom of) the targets file.
In [ ]:
awk 'NR>1{
split($2, sample, "x");
split(sample[2], treatment, "_");
cell_type=substr(sample[1],0,2);
biological_rep=substr(sample[1],3);
condition=cell_type"_"treatment[1];
replicate=cell_type_rep"."treatment[2];
print condition"\t"replicate"\t"$6
}' data/2319.rnaseqfind_summary.tsv >> targets.tsv
Let's take a look at our finished targets file:
In [ ]:
head targets.tsv
This is all we need for DEAGO. However, DEAGO will ignore extra columns in your targets file so you can have other descriptive columns like cell_type and treatment.
Here's an example:
In [ ]:
echo -e "condition\tcell_type\ttreatment\treplicate\tfilename" > extended_targets.tsv
awk 'BEGIN{ OFS="\t" }
NR>1{
split($2, sample, "x");
split(sample[2], treatment, "_");
cell_type=substr(sample[1],0,2);
biological_rep=substr(sample[1],3);
condition=cell_type"_"treatment[1];
replicate=cell_type_rep"."treatment[2];
print condition,cell_type,treatment[1],replicate,$6
}' data/2319.rnaseqfind_summary.tsv >> extended_targets.tsv
head extended_targets.tsv
Note: we have used BEGIN{ OFS="\t" }
which sets the output seperator to "\t" so that we don't have to keep writing it in the print
statement and can just use a comma (',') instead.
In [ ]:
pf rnaseq -t lane -i 8380_3#1 --details
Here we can see that Mus_musculus_mm10 was used as the reference when mapping this lane with BWA.
/lustre/scratch118/infgen/pathogen/pathpipe/prokaryotes/seq-pipelines/Mus/musculus/TRACKING/2319/WT2xCtrl_1/SLX/WT2xCtrl_1_5733492/8380_3#1/390176.pe.markdup.bam.expression.csv Mus_musculus_mm10 bwa 2012-12-05T10:36:36
For organisms where gene symbol and GO terms are available in Ensembl, you can use BioMart to gather your annotation.
For other organisms, you may need to mine other databases or sources. The annotation will need to be formatted for use with DEAGO.
It's worth seeing if you can use farm_interproscan
to generate the GO term annotations for your reference.
For example, if your reference was Staphylococcus_aureus_SR434_GCF_001986135_1, you could find the reference file using pf ref
.
In [ ]:
pf ref -i Staphylococcus_aureus_SR434_GCF_001986135_1
This would give you the location of the reference fasta file, but what you need is the GFF-formatted annotation file which would have been generated by prokka when the reference was added to the pathogen databases.
We can find the prokka annotations in a folder called annotation
which is in the same directory as the reference sequence.
In [ ]:
ls /lustre/scratch118/infgen/pathogen/pathpipe/refs/Staphylococcus/aureus_SR434_GCF_001986135_1/annotation
In the annotation directory will hopefully be a GFF file (Staphylococcus_aureus_SR434_GCF_001986135_1.gff
) which can be used with farm_interproscan
. You'll want to symlink this to your analysis directory.
In [ ]:
ln -s /lustre/scratch118/infgen/pathogen/pathpipe/refs/Staphylococcus/aureus_SR434_GCF_001986135_1/annotation/Staphylococcus_aureus_SR434_GCF_001986135_1.gff .
You can get the usage for farm_interproscan
with:
In [ ]:
farm_interproscan -h
These are the example usage commands you're looking for:
# Run InterProScan using LSF with GFF input (standard genetic code for translation)
farm_interproscan -a annotation.gff -g
# Run InterProScan using LSF with GFF input (bacterial code for translation)
farm_interproscan -a annotation.gff -g -c 11
As our example here is a bacteria strain we need the second of these.
In [ ]:
farm_interproscan -a Staphylococcus_aureus_SR434_GCF_001986135_1.gff -g -c 11
When this has finished running you'll get an output file with the suffix .go.tsv
. We've put the first 20 lines of this file as an example in the data
directory.
In [60]:
head data/Staphylococcus_aureus_SR434_GCF_001986135_1.gff.iprscan.gff.go.tsv
This will give you the GO terms associated with each unique gene identifier. The GO terms will be seperated by a semi-colon (';') as this is the format used by topGO and DEAGO.
Note: this will not give you gene names or descriptions although you may be able to find these in the GFF files and add them to the annotation.
See Preparing an annotation file for more information.