Assembly pipeline results

Introduction

When your sample data is in the Pathogen Informatics databases, it becomes available to the automated analysis pipelines. After the analysis pipelines have been requested and run, you can use the pf scripts to return the results of each of the automated analysis pipelines.

The genome assembly pipeline used depends on the sequence data and the organism.

We can use pf assembly to return the location of assembly pipeline results.

In this section of the tutorial we will cover:

  • using pf assembly to get assembly pipeline results
  • filtering pf assembly results by program
  • using pf assembly to symlink assembly pipeline results
  • using pf assembly to get assembly statistics

Exercise 8

First, let's tell the system the location of our tutorial configuration file.


In [ ]:
export PF_CONFIG_FILE=$PWD/data/pathfind.conf
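
Before going further, it can help to confirm that the variable is set and points at a readable file. The sketch below is one way to do that. `check_pf_config` is a hypothetical helper name introduced here for illustration; it is not part of the pf scripts.

```shell
# Hypothetical helper (not part of the pf scripts): report whether
# PF_CONFIG_FILE is set and points at a readable file.
check_pf_config() {
  if [ -z "${PF_CONFIG_FILE:-}" ]; then
    echo "PF_CONFIG_FILE is not set"
    return 1
  elif [ ! -r "$PF_CONFIG_FILE" ]; then
    echo "cannot read $PF_CONFIG_FILE"
    return 1
  fi
  echo "using $PF_CONFIG_FILE"
}

# Report the current state without stopping the session on failure.
check_pf_config || true
```

If the message says the file cannot be read, check that you ran the export from the tutorial directory, since the path is built from $PWD.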

Let's take a look at the pf assembly usage.


In [ ]:
pf assembly -h

Now, let's get the assembly pipeline results for run 5477_6#1.


In [ ]:
pf assembly -t lane -i 5477_6#1

This returns the locations of the FASTA-formatted contig files which were produced by the assembly pipeline.
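
Because the results are FASTA files, a quick sanity check is to count the header lines, which begin with ">". The sketch below shows the idea on a made-up file. `count_contigs` is a hypothetical helper name and the demonstration sequences are invented, not tutorial data.

```shell
# Hypothetical helper: count the sequences in a FASTA file by counting
# header lines, which begin with ">".
count_contigs() {
  grep -c '^>' "$1"
}

# Self-contained demonstration on a made-up two-contig FASTA file.
demo_fa=$(mktemp)
printf '%s\n' '>contig_1' 'ACGTACGT' '>contig_2' 'GGCCTTAA' > "$demo_fa"
count_contigs "$demo_fa"
rm -f "$demo_fa"
```

In the tutorial environment, the argument would be one of the paths printed by pf assembly.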

By default, pf assembly will return the scaffolded contigs. But what if you want to see all of the assembled contigs? To get these, we can use the --filetype or -f option.


In [ ]:
pf assembly -t lane -i 5477_6#1 -f all

This returns a third file, "unscaffolded_contigs.fa".

Notice that the results are located in directories which are named after the assembler that was used to generate the assembly, e.g. "spades_assembly". This tells us that SPAdes was the program used to generate the assembly. A quick way to filter assembly pipeline results by program is to use the --program or -P option.
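
As an aside, the directory naming convention can also be read with standard shell tools. The sketch below pulls the assembler name out of a result path, assuming the assembler directory always ends in "_assembly" as in the example above. `assembler_from_path` is a hypothetical helper name and the path prefix in the demonstration is invented.

```shell
# Hypothetical helper: print the assembler name from a result path,
# assuming one path component ends in "_assembly".
assembler_from_path() {
  printf '%s\n' "$1" | tr '/' '\n' | grep '_assembly$' | sed 's/_assembly$//'
}

# Demonstration on an invented path (only the last two components
# follow names mentioned in this tutorial).
assembler_from_path "/hypothetical/root/lane/spades_assembly/unscaffolded_contigs.fa"
```

This is only a convenience for reading paths; the -P option remains the way to filter results by program.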

Let's get all assembly pipeline results for run 5477_6 which were generated using "spades".


In [ ]:
pf assembly -t lane -i 5477_6 -P spades

Here we can see that SPAdes was used to generate assemblies for lanes 5477_6#1 and 5477_6#3. We can symlink these assemblies into a directory using the --symlink or -l option.

Let's symlink the assembly pipeline results for run 5477_6 which were generated with SPAdes to "5477_6_spades".


In [ ]:
pf assembly -t lane -i 5477_6 -P spades -l 5477_6_spades

In [ ]:
ls 5477_6_spades
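
To see where each symlink points, rather than just its name, you can print the link targets. The sketch below is one way to do that with standard tools. `show_links` is a hypothetical helper name, and the demonstration builds its own temporary directory instead of using tutorial data.

```shell
# Hypothetical helper: print "name -> target" for every symlink in a directory.
show_links() {
  for entry in "$1"/*; do
    if [ -L "$entry" ]; then
      printf '%s -> %s\n' "$(basename "$entry")" "$(readlink "$entry")"
    fi
  done
}

# Self-contained demonstration in a temporary directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/contigs.fa"
mkdir "$demo_dir/links"
ln -s "$demo_dir/contigs.fa" "$demo_dir/links/demo.contigs.fa"
show_links "$demo_dir/links"
rm -rf "$demo_dir"
```

In the tutorial environment, the argument would be the directory you passed to -l.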

We can also get some statistics from our assembly results using the --stats or -s option.

Let's get some assembly statistics for lane 5477_6#1.


In [ ]:
pf assembly -t lane -i 5477_6#1 -s

This generated a new file called "5477_6_1.assemblyfind_stats.csv" which contains our assembly statistics.


In [ ]:
cat 5477_6_1.assemblyfind_stats.csv
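
A CSV file with many columns can be hard to read with cat alone. The sketch below extracts a single column by its header name using awk. `csv_column` is a hypothetical helper name, and the column names and values in the demonstration are made up; check the header line of your own statistics file for the real names. It splits on every comma, so it assumes no field contains a quoted comma.

```shell
# Hypothetical helper: print the values of one named column from a
# simple CSV file (naive comma splitting, no quoted fields).
csv_column() {
  awk -F',' -v col="$2" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
    idx     { print $idx }
  ' "$1"
}

# Self-contained demonstration on made-up data (not real pipeline output).
demo_csv=$(mktemp)
printf '%s\n' 'sample,contigs,n50' 'demo_a,12,34567' 'demo_b,8,45678' > "$demo_csv"
csv_column "$demo_csv" n50
rm -f "$demo_csv"
```

If the requested column name is not in the header, the helper prints nothing.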

Questions

Q1: How many assembly files are returned by default for lane 10018_1#50?


In [ ]:
# Enter your answer here

Q2: Which program was used to generate the assembly for lane 10018_1#51?
Hint: look at the location path


In [ ]:
# Enter your answer here

Q3: Symlink the assembly/assemblies generated by "IVA" for run 10018_1 into a new directory called "iva_results".
Hint: don't forget to filter the results if more than one program has been used


In [ ]:
# Enter your answer here

Q4: How many contigs were assembled by Velvet for lane 5477_6#2, and what is the N50?
Hint: you'll need to get some statistics for this lane and filter by program


In [ ]:
# Enter your answer here

In [ ]:
# Enter your answer here

What's next?

For a quick recap of how to get SNP calling pipeline results, head back to SNP calling pipeline results.

Otherwise, let's move on to how to get your annotation pipeline results.