Here are the answers to the questions from each of the tutorial sections.
First, let's tell the system the location of our tutorial configuration file.
In [ ]:
export PF_CONFIG_FILE=$PWD/data/pathfind.conf
In [ ]:
pf data -t study -i 607 | wc -l
Q2: How many lanes are returned if you search using the file "data/lanes_to_search.txt"?
10
For this search, you need to set the type (-t) to file and the id (-i) to the location of the file, "data/lanes_to_search.txt". You can then pipe the locations returned by pf data into wc -l to count the number of locations (lines) returned.
In [ ]:
pf data -t file -i data/lanes_to_search.txt | wc -l
You can check that all the lanes in the file have been found by counting the number of lanes in the file.
In [ ]:
wc -l data/lanes_to_search.txt
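When comparing the two counts, keep in mind that wc -l counts newline characters, so a lane file whose last line has no trailing newline is reported as one line short. A minimal sketch of this, using a throwaway file rather than the tutorial data (the helper name count_lanes is made up for illustration and is not part of pf):

```shell
# count_lanes is a hypothetical helper, not part of pf.
# grep -c '' counts every line, including a final line with no trailing newline.
count_lanes() {
    grep -c '' "$1"
}

# Throwaway example file: three lane names, no newline after the last one.
demo_file=$(mktemp)
printf '10018_1#1\n10018_1#2\n10018_1#3' > "$demo_file"

wc -l < "$demo_file"      # counts newline characters only
count_lanes "$demo_file"  # counts lines
```

If the two commands disagree for your own lane file, the file is probably missing its final newline.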
In [ ]:
pf data -t lane -i 10018_1#1
Q2: What is the location of the FASTQ file(s) associated with lane 10018_1#1?
The location of the FASTQ file(s) can be found by using the -f or --filetype option:
In [ ]:
pf data -t lane -i 10018_1#1 -f fastq
Q3: Symlink the FASTQ files from study 607 into a directory called "study_607_links". How many FASTQ files were symlinked to "study_607_links"?
50
First, we need to get the FASTQ files for study 607 using the -f or --filetype option in case there are any non-FASTQ files. We then add the -l or --symlink option with the directory we want to symlink to, "study_607_links".
In [ ]:
pf data -t study -i 607 -f fastq -l study_607_links
We then look at the contents of "study_607_links" with ls and count the number of files (lines) returned with wc -l.
In [ ]:
ls study_607_links | wc -l
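Note that ls | wc -l counts every entry in the directory, not just symlinks. As a sketch of a stricter count, the example below builds a scratch directory and counts only symbolic links with find. All file names and the helper count_symlinks are invented for illustration:

```shell
# count_symlinks is a hypothetical helper: counts symbolic links directly inside a directory.
count_symlinks() {
    find "$1" -maxdepth 1 -type l | wc -l
}

# Scratch directories standing in for "study_607_links" and its targets; all names are made up.
links_dir=$(mktemp -d)
target_dir=$(mktemp -d)
touch "$target_dir/a_1.fastq.gz" "$target_dir/a_2.fastq.gz"
ln -s "$target_dir/a_1.fastq.gz" "$links_dir/a_1.fastq.gz"
ln -s "$target_dir/a_2.fastq.gz" "$links_dir/a_2.fastq.gz"
touch "$links_dir/notes.txt"   # a regular file, not a symlink

ls "$links_dir" | wc -l        # counts all three entries
count_symlinks "$links_dir"    # counts only the two symlinks
```

The two counts agree whenever the directory contains nothing but the symlinks created by pf data.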
Q4: What reference was used to map lane 10018_1#1 during QC and what percentage of the reads were mapped to the reference?
Streptococcus_suis_P1_7_v1 and 0.00%
First, we need to get the statistics for lane 10018_1#1 using the -s or --stats option.
In [ ]:
pf data -t lane -i 10018_1#1 -s
Then, we need to find the "Reference" and "Mapped %" columns in the statistics file (10018_1_1.pathfind_stats.csv).
In [ ]:
cat 10018_1_1.pathfind_stats.csv
Q1: What is the sample name that corresponds with lane 10018_1#1?
APP_N2_OP1
We can use the default output from running pf info with the identifier type (-t or --type) set as "lane" and the identifier (-i or --id) as 10018_1#1 to get the sample name.
In [ ]:
pf info -t lane -i 10018_1#1
We could also have used pf accession.
In [ ]:
pf accession -t lane -i 10018_1#1
Q2: What lane name(s) correspond with sample APP_T1_OP2?
10018_1#3 and 10018_1#34
We can use the default output from running pf info with the identifier type (-t or --type) set as "sample" and the identifier (-i or --id) as APP_T1_OP2 to get the lane name(s).
In [ ]:
pf info -t sample -i APP_T1_OP2
Again, we could also have used pf accession.
In [ ]:
pf accession -t sample -i APP_T1_OP2
Q3: What are the sample and lane names of the last lane in the file "data/lanes_to_search.txt"?
10018_1#51 and APP_T5_OP2
We can use the default output from running pf info with the identifier type (-t or --type) set as "file" and the identifier (-i or --id) as "data/lanes_to_search.txt" to get the lane and sample names. To get the last line of output (analogous to the last line in the file) we can use tail -1.
In [ ]:
pf info -t file -i data/lanes_to_search.txt | tail -1
Again, we could also have used pf accession.
In [ ]:
pf accession -t file -i data/lanes_to_search.txt | tail -1
Q4: What are the sample and lane accessions for lane 5477_6#1?
ERS015862 and ERR028809
We can use the default output from running pf accession with the identifier type (-t or --type) set as "lane" and the identifier (-i or --id) as 5477_6#1 to get the lane and sample accessions.
In [ ]:
pf accession -t lane -i 5477_6#1
Q5: What are the two URLs which can be used to download the FASTQ files for lane 5477_6#1 from the ENA?
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR028/ERR028809/ERR028809_1.fastq.gz
and
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR028/ERR028809/ERR028809_2.fastq.gz
We can get the ENA download URLs by running pf accession with the identifier type (-t or --type) set as "lane" and the identifier (-i or --id) as 5477_6#1 with the option -f or --fastq.
In [ ]:
pf accession -t lane -i 5477_6#1 -f
This will generate "fastq_urls.txt", which contains the two URLs you're looking for.
In [ ]:
cat fastq_urls.txt
Note: if the file "fastq_urls.txt" already exists you will need to remove it before you can use pf accession to create it again.
Q1: Has the assembly pipeline been run on lane 10018_1#1? If so, what is the status?
No.
The status for the assembly pipeline for lane 10018_1#1 is '-' which means that the assembly pipeline has not been run for this data.
In [ ]:
pf status -t lane -i 10018_1#1
Q2: Which lanes in study 607 has the assembly pipeline been run on?
10018_1#2, 10018_1#2, 10018_1#2, 10018_1#2 and 10018_1#51
We can pipe the output of pf status for study 607 into awk. The assembly pipeline status is found in column 9 and we want to filter for values which are "Done". This should return five lanes.
In [ ]:
pf status -t study -i 607 | awk '$9 == "Done"'
Q3: How many lanes in study 607 has the mapping pipeline been run on?
41
The command structure here is similar to before, except we want to filter values for the mapping pipeline in column 4. We can then count the number of lines returned with wc -l.
In [ ]:
pf status -t study -i 607 | awk '$4 == "Done"' | wc -l
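The same awk filter works on any whitespace-delimited table. A minimal sketch on an invented four-column table (the header names and rows are placeholders, not real pf status output, and count_done is a hypothetical helper):

```shell
# Invented table: column 4 plays the role of a pipeline status column.
status_table=$(mktemp)
printf '%s\n' \
    'Name  Import  QC    Mapped' \
    'laneA Done    Done  Done' \
    'laneB Done    Done  -' \
    'laneC Done    -     Done' > "$status_table"

# count_done is a hypothetical helper: counts rows whose given column equals "Done".
# awk does the counting itself, so no separate wc -l is needed.
count_done() {
    awk -v col="$1" '$col == "Done" { n++ } END { print n + 0 }' "$2"
}

count_done 4 "$status_table"
```

Passing the column number as an argument makes it easy to reuse the same filter for a different pipeline column.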
Q1: What percentage of the reads from lane 10018_1#1 were "unclassified" by Kraken?
69.55
We can use the default output from running pf qc with the identifier type (-t or --type) set as "lane" and the identifier (-i or --id) as 10018_1#1 to get the location of the Kraken report. We then use xargs to pass this location to head so that we can see the first few lines of the report.
In [ ]:
pf qc -t lane -i 10018_1#1 | xargs head
Q2: What percentage of the reads from the lane 10018_1#1 were classified to the genus Actinobacillus by Kraken?
28.77%
We can write a summary of the Kraken report using the --summary or -s option. Here we called this file "qc_genus_summary.csv". To set the taxonomic level for the summary we use the --level or -L option. Genus is represented by a "G".
In [ ]:
pf qc -t lane -i 10018_1#1 -L G -s qc_genus_summary.csv
We then look at the summary file to see what percentage of reads were classified to the genus Actinobacillus.
In [ ]:
head qc_genus_summary.csv
Q1: How many BAM files are returned by default for lane 5477_6#10?
4
You can use grep -c to count the number of returned locations ending in .bam (".bam$"). Notice we use a dollar sign to signify the end, as we don't want to count the index files (.bam.bai).
In [ ]:
pf map -t lane -i 5477_6#10 | grep -c ".bam$"
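In a regular expression an unescaped dot matches any character, so ".bam$" would also match a name ending in, say, "abam". Escaping the dot and single-quoting the pattern makes the match exact. A small sketch on invented file names (these are not real pf map output):

```shell
# Invented paths standing in for the locations printed by pf map.
paths=$(printf '%s\n' 'a/one.bam' 'a/one.bam.bai' 'a/two.bam' 'a/notabam')

printf '%s\n' "$paths" | grep -c '.bam$'    # also counts "a/notabam"
printf '%s\n' "$paths" | grep -c '\.bam$'   # counts only names ending in ".bam"
```

For typical pipeline output the two patterns give the same count, but the escaped form is the safer habit.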
Q2: Which mappers have been used with the mapping pipeline for lane 5477_6#10?
bowtie2, smalt and tophat
We can use the --details or -d option to get information about which mapper and reference were used to generate each of the BAM files. Then we can use awk to get the 3rd column, which contains the mapper.
In [ ]:
pf map -t lane -i 5477_6#10 -d | awk '{print $3}'
If you want, you can also use sort and uniq to list only the unique mappers.
In [ ]:
pf map -t lane -i 5477_6#10 -d | awk '{print $3}' | sort | uniq
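uniq only collapses adjacent duplicates, which is why sort comes first in the pipeline. As a sketch on an invented list of values, sort -u gives the same result in one step and uniq -c adds a count per value:

```shell
# Invented column of values standing in for the awk output above.
mappers=$(printf '%s\n' smalt bowtie2 smalt tophat bowtie2 smalt)

printf '%s\n' "$mappers" | uniq | wc -l     # unsorted input: no adjacent duplicates, so nothing collapses
printf '%s\n' "$mappers" | sort | uniq      # unique values
printf '%s\n' "$mappers" | sort -u          # same result in one step
printf '%s\n' "$mappers" | sort | uniq -c   # unique values with a count for each
```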
Q3: Which references have been used with the mapping pipeline for lane 5477_6#10?
The references used were:
Streptococcus_pneumoniae_ATCC_700669_v1
Streptococcus_pneumoniae_OXC141_v1
Streptococcus_pneumoniae_Taiwan19F-14_v1
You can use the same command as before, except this time we are looking for the references in column 2 with awk.
In [ ]:
pf map -t lane -i 5477_6#10 -d | awk '{print $2}' | sort | uniq
Q4: What percentage of the reads from lane 5477_6#10 were mapped to "Streptococcus_pneumoniae_OXC141_v1"?
97.5%
First, we need to filter our returned mapping pipeline results by reference using the --reference or -R option. Then we write the comma-delimited statistics for the returned BAM files to file using the --stats or -s option.
In [ ]:
pf map -t lane -i 5477_6#10 -R "Streptococcus_pneumoniae_OXC141_v1" -s
In [ ]:
cat 5477_6_10.mapping_stats.csv
This generates "5477_6_10.mapping_stats.csv", which we can filter by reference (column 8) using awk to return only the mapping percentage (column 12).
In [ ]:
awk -F',' '$8=="Streptococcus_pneumoniae_OXC141_v1" {print $12}' 5477_6_10.mapping_stats.csv
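Hard-coded column numbers break if the CSV layout changes. As a sketch, the column positions can be looked up from the header row instead. The CSV below is invented for illustration: its header names ("Reference", "Mapped %") and values are assumptions rather than the real layout of the mapping stats file, csv_lookup is a hypothetical helper, and the approach assumes no field contains an embedded comma:

```shell
# Invented CSV; the real stats file has more columns and may use other header names.
stats_csv=$(mktemp)
printf '%s\n' \
    'Lane,Reference,Mapped %' \
    'laneA,ref_one,90.0' \
    'laneA,ref_two,80.0' > "$stats_csv"

# csv_lookup is a hypothetical helper:
# print column "$3" for rows where column "$1" equals "$2", reading file "$4".
csv_lookup() {
    awk -F',' -v key="$1" -v val="$2" -v want="$3" '
        NR == 1 { for (i = 1; i <= NF; i++) idx[$i] = i; next }  # map header name to position
        $(idx[key]) == val { print $(idx[want]) }
    ' "$4"
}

csv_lookup 'Reference' 'ref_one' 'Mapped %' "$stats_csv"
```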
Q1: How many lanes from run 10018_1 has the SNP calling pipeline been completed on?
3
You can use pf status to tell you which of the lanes in run 10018_1 the SNP calling pipeline has been completed on.
In [ ]:
pf status -t lane -i 10018_1
To count these, you can get all of the rows where the SNP calling is "Done" (column 7) with awk and then count the number of lines returned with wc -l.
In [ ]:
pf status -t lane -i 10018_1 | awk '$7=="Done"' | wc -l
Q2: How many gzipped VCF files are returned by default for lane 10018_1#20?
1
In [ ]:
pf snp -t lane -i 10018_1#20
Q3: Which mapper and reference were used by the SNP calling pipeline for lane 10018_1#20?
smalt and Streptococcus_suis_P1_7_v1
You can get the mapper and reference information using the --details or -d option.
In [ ]:
pf snp -t lane -i 10018_1#20 -d
Q4: Generate the pseudogenome for lane 10018_1#20 excluding the reference.
To generate the pseudogenome you can use the --pseudogenome or -p option and the --exclude-reference or -x option to exclude the reference.
In [ ]:
pf snp -t lane -i 10018_1#20 -p -x
Q5: Symlink the gzipped VCF files generated by the SNP calling pipeline for run 10018_1 to a new directory called "10010_1_vcfs".
You can symlink the VCF files using the --symlink or -l option, followed by the name of the directory you want to create.
In [ ]:
pf snp -t lane -i 10018_1 -l 10010_1_vcfs
In [ ]:
pf assembly -t lane -i 10018_1#50
Q2: Which program was used to generate the assembly for lane 10018_1#51?
velvet
Look at the end of the path - "10018_1#51/velvet_assembly/contigs.fa".
In [ ]:
pf assembly -t lane -i 10018_1#51
Q3: Symlink the assembly/assemblies generated by "IVA" for run 10018_1 into a new directory called "iva_results".
In [ ]:
pf assembly -t lane -i 10018_1 -P iva -l iva_results
Q4: How many contigs were assembled by velvet for lane 5477_6#1 and what is the N50?
66 contigs with an N50 of 61,250
First, you need to generate the statistics file using the --stats or -s option. We need to filter our results so that we only get the statistics for the velvet assembly. We can do this with the --program or -P option.
In [ ]:
pf assembly -t lane -i 5477_6#1 -s -P velvet
Then, you need to look at the contents.
In [ ]:
cat 5477_6_1.assemblyfind_stats.csv
In [ ]:
pf annotation -t lane -i 10018_1#1
Q2: What is the location of the annotation for the SPAdes assembly of lane 10018_1#1?
To get the location of the SPAdes annotation, you need to use the --program or -P option to filter the results by assembler:
In [ ]:
pf annotation -t lane -i 10018_1#1 -P spades
Q3: What is the location of the translated CDS sequence file for the SPAdes assembly of lane 10018_1#1?
To get the translated CDS sequence file you need to use the --filetype or -f option:
In [ ]:
pf annotation -t lane -i 10018_1#1 -P spades -f faa
Q4: How many of the assemblies for run 5477_6 contain the gene "dnaG"?
3
You need to use the --gene or -g option to search for a gene name.
In [ ]:
pf annotation -t lane -i 5477_6 -g dnaG
Q1: How many count files are returned by default for run 8479_4?
5
In [ ]:
pf rnaseq -t lane -i 8479_4
Q2: Which mappers have been used with the mapping pipeline for lane 8479_4#18?
bwa
You can get the mapper details using the --details or -d option.
In [ ]:
pf rnaseq -t lane -i 8479_4#18 -d
Q3: Which reference was used with the mapping pipeline for lane 8479_4#18?
Mus_musculus_mm10
You can get the reference details using the --details or -d option.
In [ ]:
pf rnaseq -t lane -i 8479_4#18 -d
Q4: What is the location or path of the featurecounts file for lane 8479_4#18?
You can get the location of the featurecounts file by using the --filetype or -f option:
In [ ]:
pf rnaseq -t lane -i 8479_4#18 -f featurecounts
Q5: Which of the lanes in run 8479_4 has the lowest percentage of mapped reads?
8479_4#17
You can get the mapping statistics for the run using the --statistics or -s option.
In [ ]:
pf rnaseq -t lane -i 8479_4 -s
This generates a new file called "8479_4.rnaseqfind_stats.csv". You can get the lane name and mapping percentage using awk to print the third and twelfth columns.
In [ ]:
awk -F',' '{print $3"\t"$12}' 8479_4.rnaseqfind_stats.csv
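Rather than scanning the printed columns by eye, the lowest value can be picked out with sort. A sketch on invented tab-separated data (the lane names and percentages are placeholders, and lowest_mapped is a hypothetical helper):

```shell
# Invented lane/percentage pairs, tab-separated like the awk output above.
mapped=$(printf '%s\t%s\n' laneA 91.2 laneB 47.9 laneC 88.0)

# lowest_mapped is a hypothetical helper:
# sort numerically on field 2 and keep the first (smallest) row.
lowest_mapped() {
    LC_ALL=C sort -k2,2n | head -n 1
}

printf '%s\n' "$mapped" | lowest_mapped
```

With the real stats file the header row would need to be skipped first (for example with tail -n +2), because sort -n treats non-numeric text as zero and would place the header at the top.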
Q6: What are the sample name and symlinked file name associated with lane 8479_4#18?
WT1xCtrl_2 and 8479_4#18.390152.pe.markdup.bam.expression.csv
You can use the --summary or -S option to get the relationship between lane, sample and symlinked file names.
In [ ]:
pf rnaseq -t lane -i 8479_4#18 -S
In [ ]:
cat 8479_4_18.rnaseqfind_summary.tsv
In [ ]:
pf ref -i Streptococcus_pneumoniae -A | wc -l
If you got more, that's because your search wasn't specific enough, e.g.:
In [ ]:
pf ref -i Streptococcus -AR
Q2: How many of the Streptococcus pneumoniae references were imported from a public repository?
5
One of the references, "Streptococcus_pneumoniae_str_110.58_v0.4", has a version < 1 which means it is an internal assembly and so hasn't been imported from a public repository.
Q3: What is the location of the annotation (GFF) file for Streptococcus pneumoniae P1031?
You need to use the --filetype or -f option to get the location.
In [ ]:
pf ref -i Streptococcus_pneumoniae_P1031 -f gff
Q4: Symlink the annotation (GFF) file for Streptococcus pneumoniae P1031 to your current directory.
In [ ]:
pf ref -i Streptococcus_pneumoniae_P1031 -f gff -l