Annotation pipeline results

Introduction

When your sample data is in the Pathogen Informatics databases, it becomes available to the automated analysis pipelines. After the analysis pipelines have been requested and run, you can use the pf scripts to return the results of each of the automated analysis pipelines.

The annotation pipeline prepares genomes for submission to EMBL/GenBank in a standardised manner, with all the annotation files tracked centrally and available with the pf annotation command.

In this section of the tutorial we will cover:

  • using pf annotation to get GFF files generated by the annotation pipeline
  • filtering pf annotation results by assembler
  • using pf annotation to symlink files generated by the annotation pipeline
  • using pf annotation to get annotation statistics
  • using pf annotation to check whether a gene is found in the GFF files generated by the annotation pipeline

Exercise 9

First, let's tell the system the location of our tutorial configuration file.


In [ ]:
export PF_CONFIG_FILE=$PWD/data/pathfind.conf

Let's take a look at the pf annotation usage.


In [ ]:
pf annotation -h

Now, let's get the annotation pipeline results for lane 5477_6#1.


In [ ]:
pf annotation -t lane -i 5477_6#1

This returns the locations of the GFF files which were produced by the annotation pipeline. Two paths were returned. This is because the assembly pipeline was run twice on this lane with two different assemblers: SPAdes and Velvet. Now, look closely at the annotation file paths. Did you notice that each annotation file that's returned is within an assembly directory? This is because your annotations and assemblies are linked.
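Because the assembler name is part of each returned path, you can pull it out with standard shell tools. The following is only a sketch: the two paths are hypothetical stand-ins for real pf annotation output, and the "velvet_assembly" directory name is an assumption modelled on the "spades_assembly" directory used in this tutorial.

```shell
# Hypothetical paths standing in for real pf annotation output.
# Only the "<assembler>_assembly" directory naming follows the tutorial.
paths='/hypothetical/5477_6#1/spades_assembly/annotation/5477_6#1.gff
/hypothetical/5477_6#1/velvet_assembly/annotation/5477_6#1.gff'

# Extract the assembly directory from each path, then strip the "_assembly" suffix
assemblers=$(printf '%s\n' "$paths" | grep -o '[a-z]*_assembly' | sed 's/_assembly$//')
echo "$assemblers"
```

With real data, you could pipe the output of pf annotation into the same grep and sed commands instead of using the paths variable.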

We can filter the returned results, looking only for those results associated with a particular assembly program e.g. SPAdes, Velvet or IVA. To do this, we can use the --program or -P option as we did with pf assembly.

Let's get all annotation pipeline results for lane 5477_6#1 which were generated using "spades".


In [ ]:
pf annotation -t lane -i 5477_6#1 -P spades

That leaves us with just one annotation file in the "spades_assembly" directory.

Several different file formats are generated by the annotation pipeline. By default, pf annotation returns the GFF file produced by the annotation pipeline. We can also ask pf annotation to return a different filetype e.g. the protein FASTA file of the translated CDS sequences. To do this we use the --filetype or -f option.

Let's look for the protein FASTA file of the translated CDS sequences for lane 5477_6#1.


In [ ]:
pf annotation -t lane -i 5477_6#1 -P spades -f faa

This time, the returned files have the extension ".faa" instead of the default ".gff". We can now symlink these FASTA files into a directory using the --symlink or -l option.
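As an aside, a quick sanity check on a protein FASTA file is to count its sequences. The sketch below uses a tiny made-up FASTA file (the sequence names and residues are invented for illustration), so it runs without access to the pipeline results.

```shell
# Tiny made-up protein FASTA file, used purely for illustration
faa_file=$(mktemp)
cat > "$faa_file" <<'EOF'
>example_00001 hypothetical protein
MKTAYIAKQR
>example_00002 hypothetical protein
MSLNEQ
EOF

# Each record starts with a ">" header line, so counting headers counts proteins
protein_count=$(grep -c '^>' "$faa_file")
echo "$protein_count"

# Remove the temporary file
rm -f "$faa_file"
```

To try this on real data, replace the temporary file with one of the ".faa" paths returned by pf annotation.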

Let's symlink our protein FASTA files for our SPAdes assembly from lane 5477_6#1 to "my_protein_files".


In [ ]:
pf annotation -t lane -i 5477_6#1 -P spades \
    -f faa -l my_protein_files

In [ ]:
ls my_protein_files
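If you want to confirm that the entries in a directory really are symlinks, and see where they point, the shell's -L test and the readlink command can help. This is a minimal sketch that builds its own temporary files rather than using "my_protein_files". The file names in it are hypothetical.

```shell
# Temporary files standing in for a pipeline output file and a link directory
workdir=$(mktemp -d)
echo ">example_00001" > "$workdir/original.faa"
mkdir "$workdir/links"
ln -s "$workdir/original.faa" "$workdir/links/original.faa"

# -L is true for symbolic links; readlink prints the path a link points to
for f in "$workdir"/links/*; do
    if [ -L "$f" ]; then
        target=$(readlink "$f")
        echo "$(basename "$f") -> $target"
    fi
done

# Remove the temporary directory
rm -rf "$workdir"
```

Running the same loop over "my_protein_files" should show each link pointing back to a file in the annotation pipeline output.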

We can also get some statistics from our annotation results using the --stats or -s option.

Let's get some annotation statistics for our SPAdes assembly from lane 5477_6#1.


In [ ]:
pf annotation -t lane -i 5477_6#1 -P spades -s

This generated a new file called "5477_6_1.annotationfind_stats.csv" which contains our annotation statistics.


In [ ]:
cat 5477_6_1.annotationfind_stats.csv
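Wide CSV files can be hard to read with cat alone. One option is to list the column names with their positions first. The sketch below uses a made-up CSV file. The real column names in the statistics file may well differ, so treat the header here as a placeholder.

```shell
# Made-up CSV file; the real column names in the stats file may differ
stats_file=$(mktemp)
cat > "$stats_file" <<'EOF'
Lane,Example metric A,Example metric B
5477_6#1,10,20
EOF

# Print each header field with its position, one per line
header_listing=$(head -n 1 "$stats_file" | tr ',' '\n' | awk '{ print NR ": " $0 }')
echo "$header_listing"

# Remove the temporary file
rm -f "$stats_file"
```

Note that splitting on commas like this assumes no field contains a quoted comma.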

Finally, we can check to see if a gene is present in our sample using the --gene or -g option.

Let's see if any of our assemblies for lane 5477_6#1 contain the gene "dnaG".


In [ ]:
pf annotation -t lane -i 5477_6#1 -g dnaG

Here we can see that both of our assemblies contain the dnaG gene. To check this, we can use grep.

Use grep to search for "dnaG" in the SPAdes annotation for lane 5477_6#1.


In [ ]:
pf annotation -t lane -i 5477_6#1 -P spades | xargs grep dnaG
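A plain grep matches "dnaG" anywhere on a line, including in product descriptions. For a stricter check, you can use the GFF column layout: the feature type is in column 3 and the attributes are in column 9. The sketch below assumes the gene name is stored as a "gene=" attribute and uses two made-up GFF lines, so the IDs and coordinates are illustrative only.

```shell
# Two made-up GFF lines; printf makes the required tab separators explicit
gff_file=$(mktemp)
printf 'contig1\texample\tCDS\t100\t400\t.\t+\t0\tID=ex_00001;gene=dnaG;product=DNA primase\n' > "$gff_file"
printf 'contig1\texample\tCDS\t500\t900\t.\t-\t0\tID=ex_00002;product=dnaG-like hypothetical\n' >> "$gff_file"

# Count only CDS rows whose attribute column contains exactly gene=dnaG
dnaG_hits=$(awk -F'\t' '$3 == "CDS" && $9 ~ /(^|;)gene=dnaG(;|$)/ { n++ } END { print n + 0 }' "$gff_file")
echo "$dnaG_hits"

# Remove the temporary file
rm -f "$gff_file"
```

Here grep dnaG would match both lines, whereas the awk filter counts only the line whose gene attribute is dnaG.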

Questions

Q1: How many GFF files are returned by default for lane 10018_1#1?


In [ ]:
# Enter your answer here

Q2: What is the location of the annotation for the SPAdes assembly of lane 10018_1#1?


In [ ]:
# Enter your answer here

Q3: What is the location of the translated CDS sequence file for the SPAdes assembly of lane 10018_1#1?
Hint: think about file extensions


In [ ]:
# Enter your answer here

Q4: How many of the assemblies for run 5477_6 contain the gene "dnaG"?


In [ ]:
# Enter your answer here

In [ ]:
# Enter your answer here

What's next?

For a quick recap of how to get assembly pipeline results, head back to assembly pipeline results.

Otherwise, let's move on to how to get your RNA-Seq expression pipeline results.