To search for the location(s) of data stored in the pathogen databases, we can use pf data. In the previous section, we looked at two options which are used by most of the pf scripts, type (-t) and id (-i).
In this section of the tutorial we will be looking at several other functions which pf data can perform that may be useful when finding, sharing or using your sequencing data.
By default, pf data will return the location of a directory. This directory contains not only the imported sequence data, but also the results of any analysis pipelines which have been run on that data.
In this section of the tutorial we will cover:
pf data command format
pf data to find the top level directory where sequence data and analysis pipeline results are stored
pf data to find sequence data files
pf data to symlink files and directories
pf data to compress files and directories
pf data to generate sequencing data statistics

However, you might not want to know the top level directory location. You might want to know where the sequence data files are and what they are called so that you can use them in a downstream analysis. To do this, we ask pf data to find the sequence files using the filetype (--filetype or -f).
Pathogen Informatics asks users not to copy sequence data or results that are already in the pathogen databases. This is because copying data uses up precious disk space.
Instead we ask users to symlink the data. Symlinks contain no data, simply referencing the location of the original file or directory. To most commands, the symlink looks like the original file, but the operations the command performs (e.g. reading from the file) are directed to the original file which the symlink is pointed to.
You can symlink a file or directory that's returned by a pf data search by using the --symlink or -l option.
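To see what a symlink does at the filesystem level, here is a minimal sketch using plain ln -s on a scratch file. The directory and file names are made up for illustration, and this is not necessarily how pf data creates its links internally. It only shows that a link holds a reference rather than a copy of the data, while reads through the link reach the original file.

```shell
# Scratch area; all names here are hypothetical, not real pathogen data.
workdir=$(mktemp -d)
mkdir "$workdir/original" "$workdir/links"
printf '@read1\nACGT\n+\nIIII\n' > "$workdir/original/reads.fastq"

# Create a symlink that points at the original file.
ln -s "$workdir/original/reads.fastq" "$workdir/links/reads.fastq"

# The link is only a reference: readlink reports the path it points to...
link_target=$(readlink "$workdir/links/reads.fastq")
echo "link points to: $link_target"

# ...while reading through the link returns the original file's contents.
first_line=$(head -n 1 "$workdir/links/reads.fastq")
echo "first line via link: $first_line"
```

Deleting the link leaves the original file untouched, which is why symlinking is a safe way to work with shared data.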
You may want to transfer or share some of your sequencing data. The simplest way to do this is to archive or compress the data you want to transfer. To compress data returned by pf data you can use the --archive or -a option. This will compress the returned data and return a file with the extension ".tar.gz" that is much smaller and easier to share or transport.
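For readers unfamiliar with ".tar.gz" files, the sketch below builds one by hand with tar and lists its contents. The directory and file names are hypothetical stand-ins, and the exact steps pf data takes when archiving may differ; this only illustrates what a gzip-compressed tar archive is.

```shell
# Scratch area with a hypothetical directory standing in for returned data.
workdir=$(mktemp -d)
mkdir "$workdir/example_lane"
printf '@read1\nACGT\n+\nIIII\n' > "$workdir/example_lane/reads_1.fastq"
printf '@read1\nTGCA\n+\nIIII\n' > "$workdir/example_lane/reads_2.fastq"

# Bundle the directory into a single gzip-compressed tar archive.
# -C changes into $workdir first so the archive stores relative paths.
tar -czf "$workdir/example_lane.tar.gz" -C "$workdir" example_lane

# List the archive contents without extracting anything.
archive_listing=$(tar -tzf "$workdir/example_lane.tar.gz")
echo "$archive_listing"
```

Listing with tar -tzf before extracting is a quick way to check what an archive you have been sent contains.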
For some of the pf scripts, you can also get an overview of the data returned by pf data using the --stats or -s option. This will write a spreadsheet which contains statistics and general information.
These include, but are not limited to:
general information: study ID, sample name, lane name...
sequencing information: number of cycles, number of reads, number of bases...
quality control (QC) results: reference used, percentage mapped, percentage paired, depth of coverage...
pipeline status: QC, mapping, SNP calling, assembly, annotation, RNA-Seq...
In [ ]:
export PF_CONFIG_FILE=$PWD/data/pathfind.conf
You can see the available options for pf data using the --help or -h option.
Let's take a look at the usage information for pf data.
In [ ]:
pf data -h
Here we can see that the basic pf data command uses just the type (--type or -t) and id (--id or -i) options.
pf data --id <id> --type <ID type> [options]
Let's search for the location of data associated with lane 5477_6#1.
In [ ]:
pf data -t lane -i 5477_6#1
The disk location pf data returned is the top level directory where all of the data and results associated with lane 5477_6#1 are stored.
We may want to find the sequence data files which were imported so that we can use them for a subsequent analysis.
Let's find the FASTQ files which were imported for lane 5477_6#1.
In [ ]:
pf data -t lane -i 5477_6#1 -f fastq
As this is Illumina paired end data, there are two gzipped (.gz) FASTQ-formatted sequence data files returned which correspond to the left (_1) and right (_2) reads.
In [ ]:
pf data -t lane -i 5477_6#1 -f fastq -l
This should return a message like "Creating links in 'pathfind_5477_6_1'" which tells you where your files have been symlinked to. Here we can see that a new directory has been created with the prefix "pathfind_" and our lane name "5477_6_1". You'll also notice that the "#" in our lane name has been replaced by an underscore ("_").
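The naming pattern described above can be reproduced with a couple of standard shell tools. This is only an illustration of the pattern (prefix plus lane name with "#" swapped for "_"), not a claim about how pf data builds the name internally; the variable names are our own.

```shell
# Lane name as used in this tutorial.
lane='5477_6#1'

# Replace every "#" with "_" and add the "pathfind_" prefix.
link_dir="pathfind_$(printf '%s' "$lane" | tr '#' '_')"
echo "$link_dir"
```

Working out the directory name this way can be handy in scripts that need to locate the default link directory for a given lane.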
Now, let's look in the new directory with ls.
In [ ]:
ls pathfind_5477_6_1
There we see our two files "5477_6#1_1.fastq.gz" and "5477_6#1_2.fastq.gz".
But, if we take a closer look using ls -l we can see that those files are symlinks to our tutorial data files.
In [ ]:
ls -l pathfind_5477_6_1
Now, let's try symlinking to a new directory called "my_lanes".
In [ ]:
pf data -t lane -i 5477_6#1 -f fastq -l my_lanes
We can now see that a new directory called "my_lanes" has been created.
In [ ]:
ls
And inside the "my_lanes" directory are our two symlinked files.
In [ ]:
ls -l my_lanes
So far, we've been symlinking our FASTQ files. But what if we want to symlink all of the data and results associated with our lane?
Instead of symlinking just our sequence data, let's symlink all of the data and results for lane 5477_6#1 to a new directory called "my_lane_data".
In [ ]:
pf data -t lane -i 5477_6#1 -l my_lane_data
Looking inside "my_lane_data" we see a directory which has the same name as our lane, 5477_6#1. This directory is symlinked to the tutorial data directory for this lane.
In [ ]:
ls -l my_lane_data
Finally, let's try symlinking the data and results for all lanes associated with a study.
In [ ]:
pf data -t study -i 664 -l my_study_lanes
Here we see 11 symlinked directories which have the names of the 11 lanes associated with study 664.
In [ ]:
ls -l my_study_lanes
In [ ]:
pf data -t lane -i 5477_6#1 -a
Here we see "pathfind_5477_6_1.tar.gz" has been created.
In [ ]:
ls
We can uncompress "pathfind_5477_6_1.tar.gz" using tar.
In [ ]:
tar xf pathfind_5477_6_1.tar.gz
This gives us a directory which shares the name of the lane we were looking for (with '#' replaced with an '_'). Inside that directory are our two sequence data files "5477_6#1_1.fastq.gz" and "5477_6#1_2.fastq.gz" as well as "stats.csv" which contains some general information and statistics.
We can get some general information and statistics about our sequence data using the -s or --stats option with pf data.
Let's try getting some statistics for lane 5477_6#1.
In [ ]:
pf data -t lane -i 5477_6#1 -s
You can see this has generated a new file called "5477_6_1.pathfind_stats.csv".
In [ ]:
ls
We can take a quick look at the contents of this file using cat.
In [ ]:
cat 5477_6_1.pathfind_stats.csv
Now, let's try getting some statistics for all lanes in our file "lanes.txt" and calling the output file "my_lane_stats.csv".
In [ ]:
pf data -t file -i data/lanes.txt -s my_lane_stats.csv
You should get a message which says your statistics have been written to "my_lane_stats.csv". We can take a look at this file, perhaps printing just the first few columns using awk.
Note: we use -F',' with awk to tell it that the fields in the data we're parsing are separated by commas.
In [ ]:
awk -F',' '{print $1"\t"$2"\t"$3}' my_lane_stats.csv
Here we can see that there is one row per lane in the statistics file (see the "Lane Name" column).
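If you'd like to experiment with the awk step on its own, the sketch below runs it on a small made-up CSV. The column headers and values are invented for illustration and will not match the real statistics file exactly. Note also that splitting on commas like this assumes no field contains a quoted comma.

```shell
# Made-up CSV standing in for a statistics file; headers and values are illustrative only.
stats_file=$(mktemp)
cat > "$stats_file" <<'EOF'
Study ID,Sample,Lane Name,Cycles
1,sample_a,0000_1#1,100
1,sample_b,0000_1#2,100
EOF

# -F',' sets the input field separator to a comma; print columns 1 to 3, tab-separated.
first_three=$(awk -F',' '{print $1"\t"$2"\t"$3}' "$stats_file")
echo "$first_three"

# Count the data rows (every line after the header): one row per lane.
lane_rows=$(awk -F',' 'NR > 1 {n++} END {print n+0}' "$stats_file")
echo "data rows: $lane_rows"
```

Skipping the header with NR > 1 is a simple way to count lanes without counting the column titles.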
In [ ]:
# Enter your answer here
Q2: What is the location of the FASTQ file(s) associated with lane 10018_1#1?
In [ ]:
# Enter your answer here
Q3: Symlink the FASTQ files from study 607 into a directory called "study_607_links". How many FASTQ files were symlinked to "study_607_links"?
Hint: you can use wc -l to count the number of files in the directory
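As a sketch of the counting technique in the hint, the example below counts entries in a scratch directory. The directory and file names are hypothetical and unrelated to study 607, so you will still need to run the command on your own link directory.

```shell
# Scratch directory with three made-up files standing in for symlinked FASTQ files.
count_dir=$(mktemp -d)
touch "$count_dir/a_1.fastq.gz" "$count_dir/a_2.fastq.gz" "$count_dir/b_1.fastq.gz"

# ls prints one entry per line when its output is piped, so wc -l counts the entries.
file_count=$(ls "$count_dir" | wc -l)
echo "files: $file_count"
```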
In [ ]:
# Enter your answer here
In [ ]:
# Enter your answer here
Q4: What reference was used to map lane 10018_1#1 during QC and what percentage of the reads were mapped to the reference?
Hint: you'll need to get some statistics
In [ ]:
# Enter your answer here
In [ ]:
# Enter your answer here
For a quick recap of what the pf scripts are, head back to the introduction.
Otherwise, let's move on to sample information and accessions.