Finding a reference

Introduction

For a reference to be used with the Pathogen Informatics analysis pipelines, the reference must first be in the pathogen databases. All complete bacterial genomes from RefSeq are automatically imported and updated. There are user-submitted references too.

We can look at the available reference sequences using pf ref. This differs from the other pf scripts that we've looked at in that it doesn't require a type (-t). It only needs a partial id (id or i).

In this section of the tutorial we will cover:

  • using pf ref to find a FASTA-formatted reference
  • using pf ref to find the GFF annotation for a reference
  • using pf ref to symlink a reference

Exercise 11

First, let's tell the system the location of our tutorial configuration file.


In [ ]:
export PF_CONFIG_FILE=$PWD/data/pathfind.conf

We can find a reference using part of the reference name using pf ref.

Let's take a look at the usage information for pf ref.


In [ ]:
pf ref -h

So, if we wanted to look which mouse (Mus_musculus) references are available we can run:

pf ref -i Mus_musculus

Notice that there are no spaces between the genus, species and strain. Instead, these are replaced with underscores!

The command above would give you:

No exact match for "Mus_musculus". Did you mean:
  [1] Mus_musculus_mm10
  [2] Mus_musculus_mm9
  [a] all references
Which reference?

You would then enter the number corresponding to the reference location you need. Say we want to find the reference for "Mus_musculus_mm9", we would enter 1 which would give us:

/path/to/refs/Mus/musculus/Mus_musculus_mm10.fa

We can't do that in this notebook as it will wait forever in the background for us to enter an option. So, instead we need to use the --all or -A option to list all of the available references that match our query.

Let's see which Salmonella references are available.


In [ ]:
pf ref -i Salmonella -A

This gives us the location of the reference FASTA file on disk. However, maybe we just want to see the reference names. We can do this using the --reference-names or -R option. These can be useful where you need to specify a reference name when requesting the analysis pipelines on the command line.

Now, let's get the Salmonella reference names.


In [ ]:
pf ref -i Salmonella -A -R

Notice the version numbers at the end of the reference name. There is usually a naming convention with the references based on their source:

  • RefSeq accession (e.g. GCF_001887015_1) - complete genome imported from RefSeq
  • version (v) >=1 (e.g. v1) - genome requested by user and imported from public repository (e.g. ENA/GenBank)
  • version (v) <1 (e.g. v0.1) - internal genome assembly requested by user

But, perhaps you don't want the FASTA file, perhaps you want the reference annotation (i.e. GFF file). To get this, we need to use the --filetype or -f option.

Let's get the annotation (GFF) locations for the available Salmonella references.


In [ ]:
pf ref -i Salmonella -A -f gff

Finally, you might want to use the reference files in an analysis. The simplest way is to symlink them using the --symlink or -l option.

Let's symlink our Salmonella reference genomes to a directory called "salmonella_refs".


In [ ]:
pf ref -i Salmonella -A -l salmonella_refs

In [ ]:
ls salmonella_refs

Questions

Don't forget to use the -A option for all of these questions if you're running the pf ref commands inside the notebook!

Q1: How many Streptococcus pneumoniae references are available?
Hint: you can use wc to count the number of references returned


In [ ]:
# Enter your answer here

Q2: How many of those Streptococcus pneumoniae references were imported from a public repository ?
Hint: think about the version in the suffix, using the -R option to get only the names might make things clearer


In [ ]:
# Enter your answer here

Q3: What is the location of the annotation (GFF) file for Streptococcus pneumoniae P1031.


In [ ]:
# Enter your answer here

Q4: Symlink the annotation (GFF) file for Streptococcus pneumoniae P1031 to your current directory.


In [ ]:
# Enter your answer here

What's next?

You can head back to RNA-Seq expression pipeline results.

Otherwise, let's move on to looking at troubleshooting.