SNP pipeline results

Introduction

When your sample data is in the Pathogen Informatics databases, it becomes available to the automated analysis pipelines. After the analysis pipelines have been requested and run, you can use the pf scripts to return the results of each of the automated analysis pipelines.

The SNP calling pipeline is composed of two parts, SNP calling and pseudogenome construction. We can use pf snp to return the location of the variant calling format (VCF) files that were produced by the SNP calling pipeline.

In this section of the tutorial we will cover:

  • using pf snp to get VCF files generated by the SNP calling pipeline
  • filtering pf snp results by mapper and reference
  • using pf snp to symlink VCF files generated by the SNP calling pipeline
  • using pf snp to generate a pseudogenome

Exercise 7

First, let's tell the system the location of our tutorial configuration file.


In [ ]:
export PF_CONFIG_FILE=$PWD/data/pathfind.conf

Let's take a look at the pf snp usage.


In [ ]:
pf snp -h

Now, let's get the SNP calling pipeline results for lane 5477_6#1.


In [ ]:
pf snp -t lane -i 5477_6#1

This returns the locations of the gzipped VCF files (.vcf.gz) and their indices (.vcf.gz.tbi) which were produced by the SNP calling pipeline.

The mapping pipeline is run before the SNP calling pipeline. A quick way to get information about which mapper and reference were used by the mapping pipeline is to use the --details or -d option.

Let's get the SNP calling details for lane 5477_6#1.


In [ ]:
pf snp -t lane -i 5477_6#1 -d

Here we can see that "smalt" was use as the mapper for both the "Streptococcus_pneumoniae_ATCC_700669_v1" and "Streptococcus_pneumoniae_Taiwan19F-14_v1" references.

You can request the SNP calling pipeline be run more than once using different mappers or reference. To filter the output by mapper we can use the --mapper or -M option and the --reference or -R option to filter by reference.

Let's look for SNP calling pipeline results for lane 5477_6#1 which used the mapper "smalt".


In [ ]:
pf snp -t lane -i 5477_6#1 -M smalt

Here we got the same results as before. But, what if we try looking for results produced by a different mapper?

Let's look for SNP calling pipeline results for lane 5477_6#1 which used the mapper "bwa".


In [ ]:
pf snp -t lane -i 5477_6#1 -M bwa

This gave us "No data found" as BWA mapping hasn't been run on this lane.

Let's look for SNP calling pipeline results for lane 5477_6#1 which used the reference "Streptococcus_pneumoniae_Taiwan19F-14_v1".


In [ ]:
pf snp -t lane -i 5477_6#1 \
    -R "Streptococcus_pneumoniae_Taiwan19F-14_v1"

Notice that we only get one VCF file (.bam) and its index (.tbi) returned. This is because the SNP calling pipeline has been run twice on this lane, once using the reference "Streptococcus_pneumoniae_Taiwan19F-14_v1" and once with "Streptococcus_pneumoniae_ATCC_700669_v1".

We could then symlink the gzipped VCF files into a directory using the --symlink or -l option.

Let's symlink our gzipped VCF files for lane 5477_6#1 to "my_vcf_files".


In [ ]:
pf snp -t lane -i 5477_6#1 \
    -R "Streptococcus_pneumoniae_Taiwan19F-14_v1" \
    -l my_vcf_files

Finally, we can generate the pseudogenome, a custom genome that incorporates filtered variants with respect to the reference. To generate the pseudogenome, we use the --pseudogenome or -p option.

Let's get the pseudogenomes for lane 5477_6#1.


In [ ]:
pf snp -t lane -i 5477_6#1 -p

Notice that there are two pseudogenome files for lane 5477_6#1. This is because the SNP calling has been run twice.

Questions

Q1: How many lanes from run 10018_1 has the SNP calling pipeline been completed on?
Hint: is there another pf command you can use to check the status of run?


In [ ]:
# Enter your answer here

Q2: How many gzipped VCF files are returned by default for lane 10018_1#20?


In [ ]:
# Enter your answer here

Q3: Which mapper and reference was used by the SNP calling pipeline for lane 10018_1#20?


In [ ]:
# Enter your answer here

Q4: Generate the pseudogenome for lane 10018_1#20 excluding the reference.
Hint: look at the pf snp usage


In [ ]:
# Enter your answer here

Q5: Symlink the gzipped VCF files generated by the SNP calling pipeline for run 10018_1 to a new directory called "10010_1_vcfs".


In [ ]:
# Enter your answer here

What's next?

For a quick recap of how to get QC pipeline results, head back to mapping pipeline results.

Otherwise, let's move on to how to get your assembly pipeline results.