When your sample data is in the Pathogen Informatics databases, it becomes available to the automated analysis pipelines. After the analysis pipelines have been requested and run, you can use the pf scripts to return the results of each of the automated analysis pipelines.
The annotation pipeline prepares genomes for submission to EMBL/GenBank in a standardised manner, with all the annotation files tracked centrally and available with the pf annotation command.
In this section of the tutorial we will cover:
- using pf annotation to get GFF files generated by the annotation pipeline
- filtering pf annotation results by assembler
- using pf annotation to symlink files generated by the annotation pipeline
- using pf annotation to get annotation statistics
- using pf annotation to check whether a gene is found in the GFF files generated by the annotation pipeline

First, let's tell the system the location of our tutorial configuration file.
In [ ]:
export PF_CONFIG_FILE=$PWD/data/pathfind.conf
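Before running any pf command, it can help to confirm that the variable really points at a readable file. The following is a minimal sketch; check_pf_config is a hypothetical helper name used for illustration and is not part of the pf scripts.

```shell
# Hypothetical helper (not part of the pf scripts): succeeds only if
# PF_CONFIG_FILE is set and points at a readable file.
check_pf_config() {
  if [ -n "${PF_CONFIG_FILE:-}" ] && [ -r "$PF_CONFIG_FILE" ]; then
    echo "Using config: $PF_CONFIG_FILE"
  else
    echo "PF_CONFIG_FILE is unset or not readable" >&2
    return 1
  fi
}

# Example usage (uncomment to run):
# check_pf_config && pf annotation -h
```

If the check fails, make sure you ran the export command from the tutorial directory, since the path is built from $PWD.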
Let's take a look at the pf annotation usage.
In [ ]:
pf annotation -h
Now, let's get the annotation pipeline results for lane 5477_6#1.
In [ ]:
pf annotation -t lane -i 5477_6#1
This returns the locations of the GFF files which were produced by the annotation pipeline. Two paths were returned. This is because the assembly pipeline was run twice on this lane with two different assemblers: SPAdes and Velvet. Now, look closely at the annotation file paths. Did you notice that each annotation file that's returned is within an assembly directory? This is because your annotations and assemblies are linked.
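To make the link between annotations and assemblies easier to see, you could pull the assembly directory name out of each returned path. This is a rough sketch that assumes the assembly directories end in "_assembly" (as in "spades_assembly"); assembly_dir_of is a hypothetical helper name, not a pf command.

```shell
# Hypothetical helper: for each path read from standard input, print the
# first path component that ends in "_assembly", or "unknown" if none does.
assembly_dir_of() {
  while IFS= read -r path; do
    printf '%s\n' "$path" | tr '/' '\n' | grep -m1 '_assembly$' || echo "unknown"
  done
}

# Example usage (uncomment to run):
# pf annotation -t lane -i 5477_6#1 | assembly_dir_of
```

With two annotation paths returned, this should print one directory name per path, one for each assembler.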
We can filter the returned results, looking only for those results associated with a particular assembly program, e.g. SPAdes, Velvet or IVA. To do this, we can use the --program or -P option as we did with pf assembly.
Let's get all annotation pipeline results for lane 5477_6#1 which were generated using "spades".
In [ ]:
pf annotation -t lane -i 5477_6#1 -P spades
That leaves us with just one annotation file in the "spades_assembly" directory.
Several different file formats are generated by the annotation pipeline. By default, pf annotation returns the GFF file produced by the annotation pipeline. We can also ask pf annotation to return a different filetype, e.g. the protein FASTA file of the translated CDS sequences. To do this we use the --filetype or -f option.
Let's look for the protein FASTA file of the translated CDS sequences for lane 5477_6#1.
In [ ]:
pf annotation -t lane -i 5477_6#1 -P spades -f faa
This time the returned files have the extension ".faa" instead of the default ".gff". We can now symlink these FASTA files into a directory using the --symlink or -l option.
Let's symlink our protein FASTA files for our SPAdes assembly from lane 5477_6#1 to "my_protein_files".
In [ ]:
pf annotation -t lane -i 5477_6#1 -P spades \
-f faa -l my_protein_files
In [ ]:
ls my_protein_files
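As a quick sanity check on the symlinked files, you could count the protein sequences in each one. FASTA records start with a ">" header line, so counting those lines gives the number of sequences. This is a sketch; count_fasta_records is a hypothetical helper name.

```shell
# Hypothetical helper: print each file name and its number of FASTA
# records (lines starting with ">"), separated by a tab.
count_fasta_records() {
  for f in "$@"; do
    printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
  done
}

# Example usage (uncomment to run):
# count_fasta_records my_protein_files/*
```

Because grep follows symlinks given on the command line, this works on the links in "my_protein_files" without copying the underlying files.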
We can also get some statistics from our annotation results using the --stats or -s option.
Let's get some annotation statistics for our SPAdes assembly from lane 5477_6#1.
In [ ]:
pf annotation -t lane -i 5477_6#1 -P spades -s
This generated a new file called "5477_6_1.annotationfind_stats.csv" which contains our annotation statistics.
In [ ]:
cat 5477_6_1.annotationfind_stats.csv
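A wide CSV row can be hard to read in the terminal. One option is to print each value next to its column header. The sketch below assumes a simple comma-separated file with a single header line and no quoted fields containing commas, which may not hold for every statistics file; csv_as_pairs is a hypothetical helper name.

```shell
# Hypothetical helper: print every data row of a simple CSV as
# "header: value" lines. Does not handle quoted fields containing commas.
csv_as_pairs() {
  awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) header[i] = $i; next }  # remember column names
    { for (i = 1; i <= NF; i++) printf "%s: %s\n", header[i], $i }
  ' "$@"
}

# Example usage (uncomment to run):
# csv_as_pairs 5477_6_1.annotationfind_stats.csv
```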
Finally, we can check to see if a gene is present in our sample using the --gene or -g option.
Let's see if any of our assemblies for lane 5477_6#1 contain the gene "dnaG".
In [ ]:
pf annotation -t lane -i 5477_6#1 -g dnaG
Here we can see that both of our assemblies contain the dnaG gene. To check this, we can use grep.
Use grep to search for "dnaG" in the SPAdes annotation for lane 5477_6#1.
In [ ]:
pf annotation -t lane -i 5477_6#1 -P spades | xargs grep dnaG
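A plain grep matches "dnaG" anywhere on a line, including inside product descriptions or longer gene names. For a stricter check, you could compare against the gene attribute only. This sketch assumes the GFF files store gene names as a "gene=NAME" entry in the ninth, semicolon-separated attributes column, as is common in GFF3; the tutorial does not confirm the exact layout, and gff_has_gene is a hypothetical helper name.

```shell
# Hypothetical helper: exit 0 if the GFF file ($2) has a feature whose
# attributes column contains exactly "gene=<name>" ($1), else exit 1.
gff_has_gene() {
  awk -F'\t' -v gene="$1" '
    /^#/ { next }                      # skip comment and directive lines
    {
      n = split($9, attrs, ";")        # attributes are separated by ";"
      for (i = 1; i <= n; i++)
        if (attrs[i] == "gene=" gene) { found = 1; exit }
    }
    END { exit !found }
  ' "$2"
}

# Example usage (uncomment to run):
# pf annotation -t lane -i 5477_6#1 -P spades | while IFS= read -r gff; do
#   gff_has_gene dnaG "$gff" && echo "dnaG found in $gff"
# done
```

Note that an exact match would miss suffixed names such as a numbered copy of the gene, so loosen the comparison if your annotations use them.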
In [ ]:
# Enter your answer here
Q2: What is the location of the annotation for the SPAdes assembly of lane 10018_1#1?
In [ ]:
# Enter your answer here
Q3: What is the location of the translated CDS sequence file for the SPAdes assembly of lane 10018_1#1?
Hint: think about file extensions
In [ ]:
# Enter your answer here
Q4: How many of the assemblies for run 5477_6 contain the gene "dnaG"?
In [ ]:
# Enter your answer here
In [ ]:
# Enter your answer here
For a quick recap of how to get assembly pipeline results, head back to assembly pipeline results.
Otherwise, let's move on to how to get your RNA-Seq expression pipeline results.