When your sample data is in the Pathogen Informatics databases, it becomes available to the automated analysis pipelines. After the analysis pipelines have been requested and run, you can use the pf
scripts to return the results of each of the automated analysis pipelines.
The SNP calling pipeline is composed of two parts, SNP calling and pseudogenome construction. We can use pf snp
to return the location of the variant calling format (VCF) files that were produced by the SNP calling pipeline.
In this section of the tutorial we will cover:
pf snp
to get VCF files generated by the SNP calling pipelinepf snp
results by mapper and reference pf snp
to symlink VCF files generated by the SNP calling pipeline pf snp
to generate a pseudogenomeFirst, let's tell the system the location of our tutorial configuration file.
In [ ]:
export PF_CONFIG_FILE=$PWD/data/pathfind.conf
Let's take a look at the pf snp
usage.
In [ ]:
pf snp -h
Now, let's get the SNP calling pipeline results for lane 5477_6#1.
In [ ]:
pf snp -t lane -i 5477_6#1
This returns the locations of the gzipped VCF files (.vcf.gz) and their indices (.vcf.gz.tbi) which were produced by the SNP calling pipeline.
The mapping pipeline is run before the SNP calling pipeline. A quick way to get information about which mapper and reference were used by the mapping pipeline is to use the --details
or -d
option.
Let's get the SNP calling details for lane 5477_6#1.
In [ ]:
pf snp -t lane -i 5477_6#1 -d
Here we can see that "smalt" was use as the mapper for both the "Streptococcus_pneumoniae_ATCC_700669_v1" and "Streptococcus_pneumoniae_Taiwan19F-14_v1" references.
You can request the SNP calling pipeline be run more than once using different mappers or reference. To filter the output by mapper we can use the --mapper
or -M
option and the --reference
or -R
option to filter by reference.
Let's look for SNP calling pipeline results for lane 5477_6#1 which used the mapper "smalt".
In [ ]:
pf snp -t lane -i 5477_6#1 -M smalt
Here we got the same results as before. But, what if we try looking for results produced by a different mapper?
Let's look for SNP calling pipeline results for lane 5477_6#1 which used the mapper "bwa".
In [ ]:
pf snp -t lane -i 5477_6#1 -M bwa
This gave us "No data found" as BWA mapping hasn't been run on this lane.
Let's look for SNP calling pipeline results for lane 5477_6#1 which used the reference "Streptococcus_pneumoniae_Taiwan19F-14_v1".
In [ ]:
pf snp -t lane -i 5477_6#1 \
-R "Streptococcus_pneumoniae_Taiwan19F-14_v1"
Notice that we only get one VCF file (.bam) and its index (.tbi) returned. This is because the SNP calling pipeline has been run twice on this lane, once using the reference "Streptococcus_pneumoniae_Taiwan19F-14_v1" and once with "Streptococcus_pneumoniae_ATCC_700669_v1".
We could then symlink the gzipped VCF files into a directory using the --symlink
or -l
option.
Let's symlink our gzipped VCF files for lane 5477_6#1 to "my_vcf_files".
In [ ]:
pf snp -t lane -i 5477_6#1 \
-R "Streptococcus_pneumoniae_Taiwan19F-14_v1" \
-l my_vcf_files
Finally, we can generate the pseudogenome, a custom genome that incorporates filtered variants with respect to the reference. To generate the pseudogenome, we use the --pseudogenome
or -p
option.
Let's get the pseudogenomes for lane 5477_6#1.
In [ ]:
pf snp -t lane -i 5477_6#1 -p
Notice that there are two pseudogenome files for lane 5477_6#1. This is because the SNP calling has been run twice.
In [ ]:
# Enter your answer here
Q2: How many gzipped VCF files are returned by default for lane 10018_1#20?
In [ ]:
# Enter your answer here
Q3: Which mapper and reference was used by the SNP calling pipeline for lane 10018_1#20?
In [ ]:
# Enter your answer here
Q4: Generate the pseudogenome for lane 10018_1#20 excluding the reference.
Hint: look at the pf snp
usage
In [ ]:
# Enter your answer here
Q5: Symlink the gzipped VCF files generated by the SNP calling pipeline for run 10018_1 to a new directory called "10010_1_vcfs".
In [ ]:
# Enter your answer here
For a quick recap of how to get QC pipeline results, head back to mapping pipeline results.
Otherwise, let's move on to how to get your assembly pipeline results.