Introduction

Automated analysis pipelines

Pathogen Informatics maintain a series of databases which track the progress of pathogen studies and samples. These samples may have been through the Sanger sequencing pipeline (internal) or imported from other sources (external).

Once the sample data is in the Pathogen Informatics databases, it is available to the automated analysis pipelines. The pipelines referred to later in this tutorial include the mapping, SNP calling, assembly, annotation and RNA-Seq analysis pipelines.

For more information on study registration, external data tracking and the automated analysis pipelines, please see the Pathogen Informatics wiki.

Accessing pathogen data and analysis results

A series of scripts were developed so that users can access imported sequence data and the results of the analysis pipelines. These are referred to as the pathfind or pf scripts.

Command            Description
pf status          used to find the pipeline progress for a given study, sample or lane
pf data            used to find the FASTQ or PacBio files for a given study, sample or lane
pf info            used to match internal sample IDs and supplier IDs for a given study, sample or lane
pf accession       used to obtain accession numbers for a given study, sample or lane
pf supplementary   used to get supplementary information about a given study, sample or lane
pf qc              used to find the Kraken results for a given study, sample or lane
pf map             used to find the location of BAM files produced by the mapping pipeline
pf snp             used to find the location of VCF files produced by the SNP calling pipeline
pf assembly        used to find the location of the contig FASTA files produced by the assembly pipeline
pf annotation      used to find the location of the GFF files produced by the annotation pipeline
pf rnaseq          used to find the location of expression counts produced by the RNA-Seq analysis pipeline
pf ref             used to find the location of a reference on pathogen disk

The pf scripts return information or locations for each lane that is found. To run an individual script, type pf followed by the command you want to use (e.g. data or status) and then the options for that command.

pf <command> [options]

For example, to use pf data it would be:

pf data [options]

To specify which lanes we want to retrieve, we use the type (-t or --type) and ID (-i or --id) options. These options are required for all of the pf commands except pf ref.

There are four commonly used ID types (-t) you can use to search for information:

  • study
    retrieve all lanes associated with a study using a study ID
  • lane
    retrieve a single lane using a lane name
  • sample
    retrieve a single sample using a sample name
  • file
    retrieve all lanes which are listed in the file using the filename as the identifier

For pf data this would look like:

pf data -i <id> -t <ID type>
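
If you build pf commands inside your own scripts, it can help to catch a mistyped ID type before the command is run. The sketch below is only an illustration: is_common_id_type is a hypothetical helper that is not part of pf, and it only knows about the four ID types listed above, so it would reject any other type that pf itself may accept.

```shell
# Hypothetical helper (not part of pf): succeeds only for the four
# commonly used ID types listed in this tutorial.
is_common_id_type() {
    case "$1" in
        study|lane|sample|file) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: check the type before building a pf data command.
id_type="study"
if is_common_id_type "$id_type"; then
    echo "ok: $id_type"
else
    echo "unexpected ID type: $id_type" >&2
fi
```

Once the check passes, the type can be passed on unchanged, for example as pf data -t "$id_type" -i <id>.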

Getting help

You can look at an overview of all the pf commands using:

pf -h

Or, you can look at the usage and available options for a particular command using:

pf <command> -h

Exercise 1

This exercise uses pf data to walk you through using the four most commonly used ID types to search for information.

First, let's tell the system the location of our tutorial configuration file.


In [ ]:
export PF_CONFIG_FILE=$PWD/data/pathfind.conf
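
Because the path above is built from $PWD, it depends on the directory you run the export from. As a quick sanity check, we could confirm that the variable points at a readable file. This is a minimal sketch: pf_config_ok is a hypothetical helper name, and it only tests that the file exists and can be read, not that its contents are valid.

```shell
# Hypothetical helper (not part of pf): succeeds if PF_CONFIG_FILE is set
# and points at an existing, readable file.
pf_config_ok() {
    [ -n "${PF_CONFIG_FILE:-}" ] && [ -f "$PF_CONFIG_FILE" ] && [ -r "$PF_CONFIG_FILE" ]
}

if pf_config_ok; then
    echo "Using config: $PF_CONFIG_FILE"
else
    echo "PF_CONFIG_FILE is unset or does not point at a readable file" >&2
fi
```

If the check fails, make sure you are in the tutorial directory before running the export again.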

Retrieving all lanes associated with a study

To retrieve all of the lanes which are associated with a study we will set the type (-t) to study and the id (-i) to 664.

Let's try searching for lanes associated with study 664.


In [ ]:
pf data -t study -i 664

We can also search for a study using its name. In Sequencescape we can see that the name of study 664 is "Streptococcus pneumoniae global lineages".

Let's try searching using the study name.


In [ ]:
pf data -t study -i "Streptococcus pneumoniae global lineages"
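
The quotes around the study name matter because it contains spaces. Without them, the shell splits the name into separate words before the command ever sees it. The sketch below shows this with count_args, a hypothetical stand-in function that simply reports how many arguments it was given. It does not call pf.

```shell
# Hypothetical stand-in for a command: prints the number of arguments received.
count_args() {
    echo "$#"
}

# Quoted: the option and the whole study name arrive as two arguments.
count_args -i "Streptococcus pneumoniae global lineages"   # prints 2

# Unquoted: the name is split on spaces, giving five arguments.
count_args -i Streptococcus pneumoniae global lineages     # prints 5
```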

Finally, we can count the number of lanes that were returned using wc -l.


In [ ]:
pf data -t study -i 664 | wc -l
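
If you want to reuse the count later, for example to compare two searches, you can capture it in a variable. The sketch below uses fake_results, a hypothetical function that prints three made-up lines in place of real pf output, so the paths shown are placeholders and not real results.

```shell
# Hypothetical stand-in for pf output: three made-up result lines, one per lane.
fake_results() {
    printf '%s\n' \
        /hypothetical/path/lane_a \
        /hypothetical/path/lane_b \
        /hypothetical/path/lane_c
}

# Count the lines and keep the number in a variable.
n_lanes=$(fake_results | wc -l)

# $(( )) strips any padding that some versions of wc add around the number.
echo "lanes found: $((n_lanes))"   # prints: lanes found: 3
```

With the real command, the same pattern would be n_lanes=$(pf data -t study -i 664 | wc -l).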

Retrieving a single lane

When we don't want all the lanes from a study, we can search for individual lanes. To do this we need to set the type (-t) to lane and give the lane name as our identifier (-i).

Let's try searching for a lane using the lane name.


In [ ]:
pf data -t lane -i 5477_6#1

Retrieving a run

If you want to retrieve multiple lanes from a run, you can search for all lanes associated with that run. To do this, set the type (-t) to lane and give the run as the identifier (-i).

Let's try searching for lanes associated with run 5477_6.


In [ ]:
pf data -t lane -i 5477_6
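
In the examples above, the run identifier (5477_6) is the part of the lane name (5477_6#1) that comes before the # character. If you only have a lane name, you could derive the run from it with shell parameter expansion. This is a sketch that assumes lane names follow the pattern seen in this tutorial, and run_of_lane is a hypothetical helper name.

```shell
# Hypothetical helper (not part of pf): print the lane name with everything
# from the first '#' onwards removed. Names without '#' are printed unchanged.
run_of_lane() {
    printf '%s\n' "${1%%#*}"
}

run_of_lane '5477_6#1'   # prints 5477_6
```

The result could then be used as the identifier, for example pf data -t lane -i "$(run_of_lane '5477_6#1')".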

Retrieving a single sample

Perhaps you don't have the lane name but you do have the sample name. To search using a sample name, we need to set the type (-t) to sample and give the sample name as the id (-i).

Let's try searching for a sample using the sample name.


In [ ]:
pf data -t sample -i Tw01_0055

Retrieving multiple lanes using a file

Last, but not least, we can retrieve information for a list of lanes which are stored in a file. First, let's take a look at our file of lanes.


In [ ]:
cat data/lanes.txt

Here you can see we have one lane per line in our file. To use this file, we need to set the type (-t) to file and give the file name as the id (-i).

Let's try searching for information on the lanes in "data/lanes.txt".


In [ ]:
pf data -t file -i data/lanes.txt
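
When you write your own lanes file, it is worth checking how many lanes it actually lists, so that you can compare this with the number of results you get back. The sketch below writes a small temporary file and counts its non-blank lines. Apart from 5477_6#1, the lane names are made up for illustration, and the temporary file stands in for a file such as data/lanes.txt.

```shell
# Write a small lanes file, one lane name per line, with a stray blank line.
# Lane names other than 5477_6#1 are hypothetical.
lanes_file=$(mktemp)
printf '%s\n' '5477_6#1' '5477_6#2' '' '5477_6#3' > "$lanes_file"

# Count only the lines that contain a non-whitespace character.
n_requested=$(grep -c '[^[:space:]]' "$lanes_file")
echo "lanes requested: $n_requested"   # prints: lanes requested: 3

# Remove the temporary file.
rm -f "$lanes_file"
```

The count can then be compared with the output of pf data -t file -i <your file> | wc -l.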

Questions

Q1: How many lanes are associated with study 607?
Hint: you can use wc -l to count the number of lines (lanes) returned by pf data


In [ ]:
# Enter your answer here

Q2: How many lanes are returned if you search using the file "data/lanes_to_search.txt"?
Hint: you can use wc -l to count the number of lines (lanes) returned by pf data


In [ ]:
# Enter your answer here

What's next?

For a quick recap of what the tutorial covers and the software you will need, head back to the tutorial overview.

Otherwise, let's get started with finding your data.