Finding your data

Introduction

To search for the location(s) of data stored in the pathogen databases, we can use pf data. In the previous section, we looked at two options which are used by most of the pf scripts, type (-t) and id (-i).

In this section of the tutorial we will be looking at several other functions which pf data can perform that may be useful when finding, sharing or using your sequencing data.

By default, pf data will return a directory. It not only contains the imported sequence data, but also the results of any of the analysis pipelines which have been run on that data.

In this section of the tutorial we will cover:

  • the pf data command format
  • using pf data to find the top level directory where sequence data and analysis pipeline results are stored
  • using pf data to find sequence data files
  • using pf data to symlink files and directories
  • using pf data to compress files and directories
  • using pf data to generate sequencing data statistics

Filetypes

However, you might not want to know the top level directory location. You might want to know where the sequence data files are and what they are called so that you can use them in a downstream analysis. To do this, we ask pf data to find the sequence files using the filetype (--filetype or -f).

Symlinking

Pathogen Informatics asks users not to copy sequence data or results that are already in the pathogen databases. This is because copying data uses up precious disk space.

Instead we ask users to symlink the data. Symlinks contain no data, simply referencing the location of the original file or directory. To most commands, the symlink looks like the original file, but the operations the command performs (e.g. reading from the file) are directed to the original file which the symlink is pointed to.

You can symlink a file or directory that's returned by a pf data search by using the --symlink or -l option.

Archiving or compressing data

You may want to transfer or share some of your sequencing data. The simplest way to do this is to archive or compress the data you want to transfer. To compress data returned by pf data you can use the --archive or -a option. This will compress the returned data and return a file with the extension ".tar.gz" that is much smaller and easier to share or transport.

Getting general information and statistics

For some of the pf scripts, you can also get an overview of the data returned by pf data using the --stats or -s option. This will write a spreadsheet which contains statistics and general information.

These include, but are not limited to:

  • general information
    study ID, sample name, lane name...

  • sequencing information
    number of cycles, number of reads, number of bases...

  • quality control (QC) results
    reference used, percentage mapped, percentage paired, depth of coverage...

  • pipeline status
    QC, mapping, SNP calling, assembly, annotation, RNA-Seq...

Exercise 2

First, let's tell the system the location of our tutorial configuration file.


In [ ]:
export PF_CONFIG_FILE=$PWD/data/pathfind.conf

You can see the available options for pf data using the --help or -h option.

Let's take a look at the usage information for pf data.


In [ ]:
pf data -h

Here we can see that basic pf data command uses just the type (--type or -t) and id (--id or -i) options.

 pf data --id <id> --type <ID type> [options]

Let's search for the location of data associated with lane 5477_6#1.


In [ ]:
pf data -t lane -i 5477_6#1

The disk location pf data returned is the top level directory where all of data and results associated with lane 5477_6#1 are stored.

Filetypes

We may want to find the sequence data files which were imported so that we can use them for a subsequent analysis.

Let's find the FASTQ files which were imported for lane 5477_6#1.


In [ ]:
pf data -t lane -i 5477_6#1 -f fastq

As this is Illumina paired end data, there are two gzipped (.gz) FASTQ-formatted sequence data files returned which correspond to the left (_1) and right (_2) reads.

Symlinking

We don't want to copy these files to where we're running the analysis because this uses up disk space unnecessarily. Instead, we'll symlink them.

First, let's try symlinking our two FASTA files from lane 5477_6#1.


In [ ]:
pf data -t lane -i 5477_6#1 -f fastq -l

This should return a message like "Creating links in 'pathfind_5477_6_1'" which tells you where your files have been symlinked to. Here we can see that a new directory has been created with the prefix "pathfind_" and our lane name "5477_6_1". You'll also notice that the "#" in our lane name has been replcated by an underscore ("_").

Now, let's look in the new directory with ls.


In [ ]:
ls pathfind_5477_6_1

There we see our two files "5477_6#1_1.fastq.gz" and "5477_6#1_1.fastq.gz".

But, if we take a closer look using ls -l we can see that those files are symlinks to our tutorial data files.


In [ ]:
ls -l pathfind_5477_6_1

Now, let's try symlinking to a new directory called "my_lanes".


In [ ]:
pf data -t lane -i 5477_6#1 -f fastq -l my_lanes

We can now see that a new directory called "my_lanes" has been created.


In [ ]:
ls

And inside the "my lanes" directory are our two symlinked files.


In [ ]:
ls -l my_lanes

So, we've been symlinking our FASTQ files. But, what if we want to symlink all of the data and results associated with our lane.

Instead of symlinking just our sequence data, let's symlink all of the data and results for lane 5477_6#1 to a new directory called "my_lane_data".


In [ ]:
pf data -t lane -i 5477_6#1 -l my_lane_data

Looking inside "my_lane_data" we see a directory which has the same name as our lane, 5477_6#1. This directory is symlinked to the tutorial data directory for this lane.


In [ ]:
ls -l my_lane_data

Finally, let's try symlinking the data and results for all lanes associated with a study.


In [ ]:
pf data -t study -i 664 -l my_study_lanes

Here we see 11 symlinked directories which have the names of the 11 lanes associated with study 664.


In [ ]:
ls -l my_study_lanes

Archiving data

Sometimes, you may want to transfer or share your data. The simplest way to do this is to archive or compress the sequence data.

Let's archive the data for lane 5477_6#1.


In [ ]:
pf data -t lane -i 5477_6#1 -a

Here we see "pathfind_5477_6_1.tar.gz" has been created.


In [ ]:
ls

We can uncompress "pathfind_5477_6_1.tar.gz" using tar.


In [ ]:
tar xf pathfind_5477_6_1.tar.gz

This gives us a directory which shares the name of the lane we were looking for (with '#' replaced with an '_'). Inside that directory are our two sequence data files "5477_6#1_1.fastq.gz" and "5477_6#1_2.fastq.gz" as well as "stats.csv" which contains some general information and statistics.

Getting general information and statistics

We can get some general information and statistics about our sequence data using the -s or --stats option with pf data.

Let's try getting some statistics for lane 5477_6#1.


In [ ]:
pf data -t lane -i 5477_6#1 -s

You can see this has generated a new file called "5477_6_1.pathfind_stats.csv".


In [ ]:
ls

We can take a quick look at the contents of this file using cat.


In [ ]:
cat 5477_6_1.pathfind_stats.csv

Now, let's try getting some statistics for all lanes in our file "lanes.txt" and calling the output file "my_lane_stats.csv".


In [ ]:
pf data -t file -i data/lanes.txt -s my_lane_stats.csv

You should get a message which says your statistics have been written to "my_lane_stats.csv". We can take a look at this file. Perhaps just getting the first few columns using awk.

Note: we use '-F' with awk to tell it that the data we're parsing is comma-separated.


In [ ]:
awk -F',' '{print $1"\t"$2"\t"$3}' my_lane_stats.csv

Here we can see that there is one row per lane in the statistics file (see the "Lane Name" column).

Questions

Q1: What is the location of the top level directory for data and results associated with lane 10018_1#1?


In [ ]:
# Enter your answer here

Q2: What is the location of the FASTQ file(s) associated with lane 10018_1#1?


In [ ]:
# Enter your answer here

Q3: Symlink the FASTQ files from study 607 into a directory called "study_607_links". How many FASTQ files were symlinked to "study_607_links?
Hint: you can use wc -l to count the number of files in the directory


In [ ]:
# Enter your answer here

In [ ]:
# Enter your answer here

Q4: What reference was used to map lane 10018_1#1 during QC and what percentage of the reads were mapped to the reference?
Hint: you'll need to get some statistics


In [ ]:
# Enter your answer here

In [ ]:
# Enter your answer here

What's next?

For a quick recap of what the pf scripts are, head back to the introduction.

Otherwise, let's move on to sample information and accessions.