Sample information and accessions

Introduction

Once your samples have been sequenced or imported, it can be useful to match up the internal lane identifiers with the sample and supplier identifiers. We can look at the relationship between lane and sample using pf info, which returns values for:

  • Lane name
  • Sample name
  • Supplier name
  • Public name
  • Strain

Alternatively, you might want to know the EBI sample and lane accession numbers for a particular lane or sample. To get these, you can use pf accession, which returns:

  • Sample name
  • Sample accession
  • Lane name
  • Lane accession

For more information about the EBI accession number format, please see www.ebi.ac.uk/ena/submit/read-data-format.

You can also use pf to generate a spreadsheet with supplementary data, which can be useful for publication. pf supplementary will return:

  • Sample name
  • Sample accession
  • Lane name
  • Lane accession
  • Supplier name
  • Public name
  • Strain
  • Study ID
  • Study accession

Optionally, pf supplementary can also return the sample description.

In this section of the tutorial we will cover:

  • using pf info to get sample metadata
  • using pf accession to get sample accessions
  • using pf supplementary to get supplementary data.

Exercise 3

First, let's tell the system the location of our tutorial configuration file.


In [ ]:
export PF_CONFIG_FILE=$PWD/data/pathfind.conf
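
Before running any pf command, it can help to confirm that the variable really points at a readable file. The check below is a minimal sketch using only standard shell tests; the helper name check_pf_config is hypothetical and is not part of pf.

```shell
# Hypothetical helper (not part of pf): report whether PF_CONFIG_FILE
# is set and points at a readable file.
check_pf_config() {
    if [ -z "${PF_CONFIG_FILE:-}" ]; then
        echo "PF_CONFIG_FILE is not set"
        return 1
    elif [ ! -r "$PF_CONFIG_FILE" ]; then
        echo "Cannot read config file: $PF_CONFIG_FILE"
        return 1
    fi
    echo "Using config file: $PF_CONFIG_FILE"
}

# Run the check; "|| true" stops a failed check from aborting the session.
check_pf_config || true
```

If the check reports a problem, make sure you ran the export command from the tutorial directory, since the path is built from $PWD.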

Metadata

We can get the metadata associated with our lanes using pf info.

Let's take a look at the usage information for pf info.


In [ ]:
pf info -h

Let's get the sample name that corresponds to lane 5477_6#1.


In [ ]:
pf info -t lane -i 5477_6#1

Here we can see that several pieces of metadata have been returned. One of these is the sample name: Tw01_0055.

Now, let's get the sample names for all lanes associated with study 664.


In [ ]:
pf info -t study -i 664

We can write this information to a file using the -o or --outfile option.

Let's write our lane metadata to a file.


In [ ]:
pf info -t study -i 664 -o

This has generated a new file "infofind.csv" which contains our comma-separated lane metadata.


In [ ]:
cat infofind.csv

We can also give the output file a different name.

Let's call the metadata file for study 664 "study_664_info.csv".


In [ ]:
pf info -t study -i 664 -o study_664_info.csv

This generates the file "study_664_info.csv" which contains our metadata.


In [ ]:
cat study_664_info.csv
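
Once the metadata is in a comma-separated file, standard shell tools can pull out a single column. The following is a sketch only: csv_column is a hypothetical helper, not part of pf, and it assumes that the first line of the file is a header, that header text is unquoted, and that no field contains an embedded comma.

```shell
# Hypothetical helper (not part of pf): print one column of a simple
# comma-separated file, selected by its header text.
# Assumes a header line and no quoted fields or embedded commas.
csv_column() {
    # $1 = header text to look for, $2 = CSV file
    awk -F',' -v want="$1" '
        NR == 1 {
            # Find the position of the requested header.
            for (i = 1; i <= NF; i++) if ($i == want) col = i
            if (!col) {
                print "column not found: " want > "/dev/stderr"
                exit 1
            }
            next
        }
        { print $col }
    ' "$2"
}

# Example call. The header text "Sample" is an assumption; check the real
# header first with: head -1 study_664_info.csv
# csv_column "Sample" study_664_info.csv
```

Selecting by header text rather than by column number means the same call keeps working if the column order differs between pf info, pf accession and pf supplementary output.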

Accessions

If available, we can also get the EBI raw sequence data and sample accessions for the lanes associated with study 664 using pf accession.

Let's take a look at the usage information for pf accession.


In [ ]:
pf accession -h

Let's get the EBI accessions for all lanes associated with study 664.


In [ ]:
pf accession -t study -i 664

As with pf info, we can also write the output of pf accession to a comma-delimited file.

Let's write the accessions associated with study 664 to a file called "study_664_accessions.csv".


In [ ]:
pf accession -t study -i 664 -o study_664_accessions.csv

This generates the file "study_664_accessions.csv" which contains our comma-separated accessions.


In [ ]:
cat study_664_accessions.csv

Finally, we can get the EBI URLs to download the raw data using the -f or --fastq option. By default, these will be written to a file called "fastq_urls.txt".

Let's get the URLs for downloading the FASTQ files for study 664 from the European Nucleotide Archive (ENA).


In [ ]:
pf accession -t study -i 664 -f

This generates a file called "fastq_urls.txt" which contains the URLs to download the raw sequencing data, one URL per file.


In [ ]:
cat fastq_urls.txt
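
With one URL per line, the file can be handed straight to a download tool. The sketch below assumes that wget is installed on your system; show_downloads is a hypothetical helper, not part of pf, that prints the commands without running them so you can check the list first.

```shell
# Hypothetical helper (not part of pf): print a wget command for every
# URL in a file without downloading anything. Blank lines are skipped.
show_downloads() {
    # $1 = file with one URL per line
    while IFS= read -r url || [ -n "$url" ]; do
        if [ -n "$url" ]; then
            echo "wget $url"
        fi
    done < "$1"
}

# Dry run first, then fetch for real once the list looks right.
# wget -i reads the URLs to download from the named file.
# show_downloads fastq_urls.txt
# wget -i fastq_urls.txt
```

FASTQ files can be large, so it is worth checking the available disk space before starting the real download.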

Supplementary data

We can get the supplementary data associated with our lanes using pf supplementary.

Let's take a look at the usage information for pf supplementary.


In [ ]:
pf supplementary -h

Let's get the supplementary data for all lanes associated with study 664.


In [ ]:
pf supplementary -t study -i 664

As with pf info and pf accession, we can also write the output of pf supplementary to a comma-delimited file.

Let's write the supplementary data associated with study 664 to a file called "study_664_supplementary.csv".


In [ ]:
pf supplementary -t study -i 664 -o study_664_supplementary.csv

This generates the file "study_664_supplementary.csv" which contains our comma-separated supplementary data.


In [ ]:
cat study_664_supplementary.csv

Finally, we can include the sample description in the supplementary information by using the -d or --description option.

Let's get the supplementary data for all lanes associated with study 664, including the sample description.


In [ ]:
pf supplementary -t study -i 664 -d

Questions

Q1: What is the sample name that corresponds with lane 10018_1#1?


In [ ]:
# Enter your answer here

Q2: What lane name(s) correspond with sample APP_T1_OP2?


In [ ]:
# Enter your answer here

Q3: What are the sample and lane names of the last lane in the file "data/lanes_to_search.txt"?
Hint: use tail -1 to get the last line of the output.


In [ ]:
# Enter your answer here

Q4: What are the sample and lane accessions for lane 5477_6#1?


In [ ]:
# Enter your answer here

Q5: What are the two URLs which can be used to download the FASTQ files for lane 5477_6#1 from the ENA?


In [ ]:
# Enter your answer here

In [ ]:
# Enter your answer here

What's next?

You can head back to finding your data.

Otherwise, let's move on to looking at analysis pipeline status.