Pathogen Informatics maintain a series of databases which track the progress of pathogen studies and samples. These samples may have been through the Sanger sequencing pipeline (internal) or imported from other sources (external).
Once the sample data is in the Pathogen Informatics databases, it is available to the automated analysis pipelines that Pathogen Informatics maintain, including the mapping, SNP calling, assembly, annotation and RNA-Seq analysis pipelines.
For more information on study registration, external data tracking and the automated analysis pipelines, please see the Pathogen Informatics wiki.
A series of scripts were developed so that users can access imported sequence data and the results of the analysis pipelines. These are referred to as the pathfind or pf scripts.
Command | Description |
---|---|
pf status | used to find the pipeline progress for a given study, sample or lane |
pf data | used to find the FASTQ or PacBio files for a given study, sample or lane |
pf info | used to match internal sample ids and supplier ids for a given study, sample or lane |
pf accession | used to obtain accession numbers for a given study, sample or lane |
pf supplementary | used to get supplementary information about a given study, sample or lane |
pf qc | used to find the Kraken results for a given study, sample or lane |
pf map | used to find the location of BAM files produced by the mapping pipeline |
pf snp | used to find the location of VCF files produced by the SNP calling pipeline |
pf assembly | used to find the location of the contig FASTA files produced by the assembly pipeline |
pf annotation | used to find the location of the GFF files produced by the annotation pipeline |
pf rnaseq | used to find the location of expression counts produced by the RNA-Seq analysis pipeline |
pf ref | used to find the location of a reference on pathogen disk |
The pf scripts return information or locations for each lane that is found. To run an individual script, type `pf` followed by the command you want to use (e.g. `data` or `status`) and then the options for that command:
pf <command> [options]
For example, to use `pf data` it would be:
pf data [options]
To specify which lanes we want to retrieve, we use the type (`-t` or `--type`) and id (`-i` or `--id`) options. These are required for all of the `pf` commands except `pf ref`.
There are four commonly used ID types (`-t`) you can use to search for information: study, lane, sample and file.
For `pf data` this would look like:
pf data -i <id> -t <ID type>
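Because every command except `pf ref` takes the same type and id options, the pattern can be wrapped in a small shell function. The following is only a sketch: `pf_lookup` is a hypothetical helper name, not part of pathfind, and it uses nothing beyond the `-t` and `-i` options described above. It prints the command it builds and runs it only if `pf` is found on your PATH.

```shell
# Hypothetical helper (not part of pathfind): build a pf command from a
# subcommand, an ID type and an ID, print it, and run it if pf is installed.
pf_lookup() {
    cmd=$1
    id_type=$2
    id=$3
    # Show the command line that is being built
    echo "pf $cmd -t $id_type -i $id"
    # Only call pf when it is actually available
    if command -v pf >/dev/null 2>&1; then
        pf "$cmd" -t "$id_type" -i "$id"
    fi
}

pf_lookup data study 664
```

The same function can then be reused with other commands, for example `pf_lookup status study 664`.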
You can look at an overview of all the pf commands using:
pf -h
Or, you can look at the usage and available options for a particular command using:
pf <command> -h
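If you want to read the usage for several commands in one go, a loop over the command names is one option. This is a minimal sketch that relies only on the `-h` option shown above; `show_help` is a hypothetical function name, and the fallback message is just there so the snippet still runs where `pf` is not installed.

```shell
# Hypothetical helper: print the usage for each pf command given as an argument.
show_help() {
    for cmd in "$@"; do
        if command -v pf >/dev/null 2>&1; then
            pf "$cmd" -h
        else
            # pf is not on the PATH, so just report what would have been run
            echo "pf not found; would run: pf $cmd -h"
        fi
    done
}

show_help status data qc
```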
In [ ]:
export PF_CONFIG_FILE=$PWD/data/pathfind.conf
In [ ]:
pf data -t study -i 664
We can also search for a study using its name. In Sequencescape we can see that the name of study 664 is "Streptococcus pneumoniae global lineages".
Let's try searching using the study name.
In [ ]:
pf data -t study -i "Streptococcus pneumoniae global lineages"
Finally, we can count the number of lanes that were returned using `wc -l`.
In [ ]:
pf data -t study -i 664 | wc -l
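It can be handy to keep the count in a variable, for instance to compare it with the number of lanes you expected. The sketch below assumes nothing about pathfind itself: `count_lanes` is a hypothetical helper, and the three input lines are placeholders standing in for `pf` output so that the snippet runs anywhere.

```shell
# Count the lines arriving on standard input, as wc -l does, and strip the
# padding some systems add so the result can be stored and compared.
count_lanes() {
    wc -l | tr -d ' '
}

# Placeholder input so the sketch runs without pf. With pathfind available,
# replace the printf with: pf data -t study -i 664
n_lanes=$(printf '%s\n' 'lane_a' 'lane_b' 'lane_c' | count_lanes)
echo "Lanes returned: $n_lanes"
```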
In [ ]:
pf data -t lane -i 5477_6#1
In [ ]:
pf data -t lane -i 5477_6
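The two lane searches above differ only in the `#1` suffix. If you have a full lane ID and want the shorter form, shell parameter expansion can split it at the `#`. This is a sketch only: the variable names are illustrative, and reading the part before `#` as the run and lane and the part after it as the multiplex tag is an assumption rather than something stated in this tutorial.

```shell
# Lane ID taken from the example above. Quote it so that '#' is never
# treated as the start of a comment.
lane_id='5477_6#1'

# Everything before the first '#'
run_lane=${lane_id%%#*}

# Everything after the first '#'
tag=${lane_id#*'#'}

echo "prefix: $run_lane  suffix: $tag"
```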
In [ ]:
pf data -t sample -i Tw01_0055
In [ ]:
cat data/lanes.txt
Here you can see we have one lane per line in our file. To use this file, we need to set the type (`-t`) to file and give the file name as the id (`-i`).
Let's try searching for information on the lanes in "data/lanes.txt".
In [ ]:
pf data -t file -i data/lanes.txt
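You can build a file like this yourself. The sketch below writes one lane per line and then passes the file to `pf data`. The file name `my_lanes.txt` is hypothetical, and the second lane ID is a placeholder added for illustration; only `5477_6#1` appears in this tutorial. The search itself is skipped if `pf` is not installed.

```shell
# Hypothetical file name; one lane ID per line
lanes_file=my_lanes.txt

# 5477_6#1 comes from the example above; 5477_6#2 is a placeholder
printf '%s\n' '5477_6#1' '5477_6#2' > "$lanes_file"

# Check the file holds the number of lanes we expect
wc -l < "$lanes_file"

# Search for every lane listed in the file, if pf is available
if command -v pf >/dev/null 2>&1; then
    pf data -t file -i "$lanes_file"
fi
```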
In [ ]:
# Enter your answer here
Q2: How many lanes are returned if you search using the file "data/lanes_to_search.txt"?
Hint: you can use `wc -l` to count the number of lines (lanes) returned by `pf data`.
In [ ]:
# Enter your answer here
For a quick recap of what the tutorial covers and the software you will need, head back to the tutorial overview.
Otherwise, let's get started with looking at finding your data.