Pathogen Informatics maintain a series of databases which track the progress of pathogen studies and samples. These samples may have been through the Sanger sequencing pipeline (internal) or imported from other sources (external).
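Once the sample data is in the Pathogen Informatics databases, it is available to the automated analysis pipelines. Pathogen Informatics maintain several automated analysis pipelines, including the mapping, SNP calling, assembly, annotation and RNA-Seq analysis pipelines whose results the pf commands listed below can locate.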
For more information on study registration, external data tracking and the automated analysis pipelines, please see the Pathogen Informatics wiki.
A series of scripts were developed so that users can access imported sequence data and the results of the analysis pipelines. These are referred to as the pathfind or pf scripts.
| Command | Description | 
|---|---|
| pf status | used to find the pipeline progress for a given study, sample or lane | 
| pf data | used to find the FASTQ or PacBio files for a given study, sample or lane | 
| pf info | used to match sample internal ids and supplier ids for a given study, sample or lane | 
| pf accession | used to obtain accession numbers for a given study, sample or lane | 
| pf supplementary | used to get supplementary information about a given study, sample or lane | 
| pf qc | used to find the Kraken results for a given study, sample or lane | 
| pf map | used to find the location of BAM files produced by the mapping pipeline | 
| pf snp | used to find the location of VCF files produced by the SNP calling pipeline | 
| pf assembly | used to find the location of the contig FASTA files produced by the assembly pipeline | 
| pf annotation | used to find the location of the GFF files produced by the annotation pipeline | 
| pf rnaseq | used to find the location of expression counts produced by the RNA-Seq analysis pipeline | 
| pf ref | used to find the location of a reference on pathogen disk | 
The pf scripts return information or locations for each lane that is found. To run an individual script, type pf followed by the command you want to use (e.g. data or status) and then the options for that command.
pf <command> [options]
For example, to use pf data it would be:
pf data [options]
To specify which lanes we want to retrieve, we use the --type (-t) and --id (-i) options. These options are required for all of the pf commands except pf ref.
There are four commonly used ID types (-t) you can use to search for information: study, lane, sample and file. Each of these is used in the examples in this tutorial.
For pf data this would look like:
pf data -i <id> -t <ID type>
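The pattern above can be wrapped in a small shell function. This is a minimal sketch rather than part of the pf scripts: `pf_find` and the `PF_DRY_RUN` variable are hypothetical names introduced here for illustration, and only the `pf <command> -t <ID type> -i <id>` form comes from the text above. With `PF_DRY_RUN=1` the function prints the command instead of running it, which is a convenient way to check the arguments before querying the databases.

```shell
# Hypothetical helper (not part of the pf scripts): run a pf command for one ID.
# Usage: pf_find <command> <ID type> <id>
pf_find() {
    if [ "$#" -ne 3 ]; then
        echo "usage: pf_find <command> <ID type> <id>" >&2
        return 2
    fi
    if [ "${PF_DRY_RUN:-0}" = "1" ]; then
        # Dry run: show the command that would be executed.
        printf 'pf %s -t %s -i "%s"\n' "$1" "$2" "$3"
        return 0
    fi
    # Real run: quote every argument so IDs containing spaces stay intact.
    pf "$1" -t "$2" -i "$3"
}

# Print, rather than run, the study search used later in this tutorial.
PF_DRY_RUN=1
pf_find data study 664
```

Unsetting `PF_DRY_RUN` makes the same call run pf itself.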
You can look at an overview of all the pf commands using:
pf -h
Or, you can look at the usage and available options for a particular command using:
pf <command> -h
export PF_CONFIG_FILE=$PWD/data/pathfind.conf
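Because the path above is built from `$PWD`, it only resolves when the command is run from the tutorial directory. A minimal sketch of a check is shown below; `check_pf_config` is a hypothetical helper, and it assumes that the pf scripts need this file to be readable.

```shell
# Hypothetical helper: succeed only if the pathfind config file is readable.
check_pf_config() {
    if [ -r "$1" ]; then
        echo "config found: $1"
    else
        echo "config not readable: $1" >&2
        return 1
    fi
}

export PF_CONFIG_FILE="$PWD/data/pathfind.conf"

# Warn early instead of waiting for a pf command to fail.
check_pf_config "$PF_CONFIG_FILE" || echo "check that you are in the tutorial directory"
```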
    
pf data -t study -i 664
    
We can also search for a study using its name. In Sequencescape we can see that the name of study 664 is "Streptococcus pneumoniae global lineages".
Let's try searching using the study name.
pf data -t study -i "Streptococcus pneumoniae global lineages"
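The quotes around the study name matter because the name contains spaces. The sketch below illustrates this without calling pf: `count_args` is a hypothetical helper that reports how many separate arguments the shell passes on.

```shell
# Hypothetical helper: report how many separate arguments the shell passes on.
count_args() { echo "$#"; }

# Quoted: the option and the whole study name arrive as 2 arguments.
count_args -i "Streptococcus pneumoniae global lineages"

# Unquoted: the name is split on spaces, giving 5 arguments.
count_args -i Streptococcus pneumoniae global lineages
```

Without the quotes, pf would receive only the first word as the value of -i.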
    
Finally, we can count the number of lanes that were returned using wc -l.
pf data -t study -i 664 | wc -l
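To reuse the count later in a script, it can be captured in a variable. The sketch below uses a hypothetical helper, `count_lanes`, and placeholder input lines in place of real pf output, so it runs without access to the databases.

```shell
# Hypothetical helper: count lines on standard input, one lane per line.
# tr strips the padding that some wc implementations add around the number.
count_lanes() { wc -l | tr -d '[:space:]'; }

# Stand-in for pf output (placeholder lines, not real results).
n=$(printf '%s\n' placeholder_1 placeholder_2 placeholder_3 | count_lanes)
echo "lanes returned: $n"

# With pathfind available, the same helper would be used as:
#   n=$(pf data -t study -i 664 | count_lanes)
```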
    
pf data -t lane -i 5477_6#1
    
pf data -t lane -i 5477_6
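The two lane searches above differ only in the `#1` suffix. The sketch below splits a lane ID into its parts with shell parameter expansion. It assumes lane IDs follow the pattern `<run>_<lane>#<tag>`, which is inferred from the examples rather than stated in this tutorial, and `split_lane_id` is a hypothetical helper.

```shell
# Assumption: lane IDs look like <run>_<lane>#<tag>, with the tag optional.
split_lane_id() {
    run_lane=${1%%\#*}      # drop "#<tag>" if present
    run=${run_lane%%_*}     # text before the underscore
    lane=${run_lane#*_}     # text after the underscore
    case $1 in
        *\#*) tag=${1#*\#} ;;   # text after the "#"
        *)    tag='(none)' ;;   # no tag given
    esac
    echo "run=$run lane=$lane tag=$tag"
}

split_lane_id '5477_6#1'
split_lane_id '5477_6'
```

The IDs are single-quoted here so that the `#` is always passed through literally.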
    
pf data -t sample -i Tw01_0055
    
cat data/lanes.txt
    
Here you can see we have one lane per line in our file. To use this file, we need to set the type (-t) to file and give the file name as the id (-i).
Let's try searching for information on the lanes in "data/lanes.txt".
pf data -t file -i data/lanes.txt
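A lanes file of your own can be prepared in the same one-lane-per-line layout. The following is a minimal sketch: the lane ID 5477_6#1 appears in this tutorial, while 5477_6#2 and 5477_6#3 are placeholders used only for illustration. It removes blank lines and duplicates before the file is handed to pf.

```shell
# Write a small example file, one lane ID per line.
# 5477_6#1 appears in this tutorial; the other IDs are placeholders.
lanes_file=$(mktemp)
printf '%s\n' '5477_6#1' '5477_6#2' '' '5477_6#1' '5477_6#3' > "$lanes_file"

# Drop blank lines and duplicates so each lane is listed only once.
clean_file=$(mktemp)
grep -v '^[[:space:]]*$' "$lanes_file" | sort -u > "$clean_file"

n_lanes=$(wc -l < "$clean_file" | tr -d '[:space:]')
echo "$n_lanes unique lanes in $clean_file"

# With pathfind available, the cleaned file would then be searched with:
#   pf data -t file -i "$clean_file"
```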
    
# Enter your answer here
    
Q2: How many lanes are returned if you search using the file "data/lanes_to_search.txt"?
Hint: you can use wc -l to count the number of lines (lanes) returned by pf data.
# Enter your answer here
    
For a quick recap of what the tutorial covers and the software you will need, head back to the tutorial overview.
Otherwise, let's get started with finding your data.