Answers

Here are the answers to the questions from each of the tutorial sections.

First, let's tell the system the location of our tutorial configuration file.



In [ ]:

    
export PF_CONFIG_FILE=$PWD/data/pathfind.conf
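Before running any pf command, it can be worth confirming that the variable really points at a readable file. The sketch below is only an illustration: the helper name check_pf_config is hypothetical (it is not part of pf), and a temporary directory stands in for the tutorial directory so that the snippet runs anywhere.

```shell
# Hypothetical helper (not part of pf): succeeds only if PF_CONFIG_FILE
# is set and names a readable file.
check_pf_config() {
    [ -n "${PF_CONFIG_FILE:-}" ] && [ -r "$PF_CONFIG_FILE" ]
}

# Stand-in for the tutorial directory, so the sketch is self-contained.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/data"
: > "$demo_dir/data/pathfind.conf"

export PF_CONFIG_FILE="$demo_dir/data/pathfind.conf"
if check_pf_config; then
    config_status="ok"
else
    config_status="missing"
fi
echo "config: $config_status"
```

In the tutorial itself you would skip the temporary directory and simply call the check after the export above.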

Introduction

Q1: How many lanes are associated with study 607?

For this search, you need to set the type (-t) to study and the id (-i) to 607. You can then pipe the locations returned by pf data into wc -l to count the number of locations (lines) returned.



In [ ]:

    
pf data -t study -i 607 | wc -l

Q2: How many lanes are returned if you search using the file "data/lanes_to_search.txt"?

For this search, you need to set the type (-t) to file and the id (-i) to the location of the file, "data/lanes_to_search.txt". You can then pipe the locations returned by pf data into wc -l to count the number of locations (lines) returned.



In [ ]:

    
pf data -t file -i data/lanes_to_search.txt | wc -l

You can check that all the lanes in the file have been found by counting the number of lanes in the file.



In [ ]:

    
wc -l data/lanes_to_search.txt
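One caveat when comparing the two counts: wc -l counts newline characters, so a file whose last line has no trailing newline is reported as one line short. The sketch below shows the difference on a made-up lane list (the contents are placeholders, not the real "data/lanes_to_search.txt").

```shell
# Placeholder lane list written without a final newline.
lanes_file=$(mktemp)
printf '10018_1#1\n10018_1#3\n10018_1#51' > "$lanes_file"

# wc -l counts newline characters only, so the last lane is missed.
newline_count=$(wc -l < "$lanes_file" | tr -d ' ')

# grep -c . counts every line that contains at least one character.
lane_count=$(grep -c . "$lanes_file")

echo "wc -l: $newline_count, grep -c: $lane_count"
```

If the two numbers differ for your own file, the grep count is the one to compare against the pf data output.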

Finding your data

Q1: What is the location of the top level directory for data and results associated with lane 10018_1#1?

The location of the top directory can be found with:



In [ ]:

    
pf data -t lane -i 10018_1#1

Q2: What is the location of the FASTQ file(s) associated with lane 10018_1#1?

The location of the FASTQ file(s) can be found by setting the -f or --filetype option to fastq:



In [ ]:

    
pf data -t lane -i 10018_1#1 -f fastq

Q3: Symlink the FASTQ files from study 607 into a directory called "study_607_links". How many FASTQ files were symlinked to "study_607_links"?

First, we need to get the FASTQ files for study 607 using the -f or --filetype option, in case there are any non-FASTQ files. We then add the -l or --symlink option followed by the directory we want to symlink into, "study_607_links".



In [ ]:

    
pf data -t study -i 607 -f fastq -l study_607_links

We then look at the contents of "study_607_links" with ls and count the number of files (lines) returned with wc -l.



In [ ]:

    
ls study_607_links | wc -l
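Note that ls counts every entry in the directory, not only symlinks. If the directory might also hold ordinary files, find can restrict the count to links. The sketch below uses temporary directories and made-up file names as stand-ins for the real study 607 data.

```shell
# Temporary stand-ins; these are not the real study 607 files.
src_dir=$(mktemp -d)
link_dir=$(mktemp -d)
touch "$src_dir/lane_a_1.fastq.gz" "$src_dir/lane_a_2.fastq.gz"

# Symlink each FASTQ file into the link directory.
for f in "$src_dir"/*.fastq.gz; do
    ln -s "$f" "$link_dir/"
done

# A regular file that "ls | wc -l" would count as well.
touch "$link_dir/notes.txt"

entry_count=$(ls "$link_dir" | wc -l | tr -d ' ')
link_count=$(find "$link_dir" -maxdepth 1 -type l | wc -l | tr -d ' ')

echo "entries: $entry_count, symlinks: $link_count"
```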

Q4: What reference was used to map lane 10018_1#1 during QC and what percentage of the reads were mapped to the reference?

Streptococcus_suis_P1_7_v1 and 0.00%

First, we need to get the statistics for lane 10018_1#1 using the -s or --stats option.



In [ ]:

    
pf data -t lane -i 10018_1#1 -s

Then, we need to find the "Reference" and "Mapped %" columns in the statistics file (10018_1_1.pathfind_stats.csv).



In [ ]:

    
cat 10018_1_1.pathfind_stats.csv
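Rather than reading the whole file by eye, you could let awk look up the two columns by their header names. This is a minimal sketch on a made-up three-column CSV: the file contents and values are placeholders, and it assumes unquoted fields with no embedded commas, which may not hold for the real statistics file.

```shell
# Made-up statistics file; only the two header names come from the text above.
stats_file=$(mktemp)
cat > "$stats_file" <<'EOF'
Lane,Reference,Mapped %
example_lane,Example_reference_v1,12.3
EOF

# Find the column numbers from the header row, then print those columns.
summary=$(awk -F',' '
    NR == 1 {
        for (i = 1; i <= NF; i++) {
            if ($i == "Reference") ref = i
            if ($i == "Mapped %")  pct = i
        }
        next
    }
    { print $ref "\t" $pct }
' "$stats_file")

echo "$summary"
```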

Sample information and accessions

Q1: What is the sample name that corresponds with lane 10018_1#1?

APP_N2_OP1

We can use the default output from running pf info with the identifier type (-t or --type) set as "lane" and the identifier (-i or --id) as 10018_1#1 to get the sample name.



In [ ]:

    
pf info -t lane -i 10018_1#1

We could also have used pf accession.



In [ ]:

    
pf accession -t lane -i 10018_1#1

Q2: What lane name(s) correspond with sample APP_T1_OP2?

10018_1#3 and 10018_1#34

We can use the default output from running pf info with the identifier type (-t or --type) set as "sample" and the identifier (-i or --id) as APP_T1_OP2 to get the lane name(s).



In [ ]:

    
pf info -t sample -i APP_T1_OP2

Again, we could also have used pf accession.



In [ ]:

    
pf accession -t sample -i APP_T1_OP2

Q3: What are the sample and lane names of the last lane in the file "data/lanes_to_search.txt"?

Lane 10018_1#51 and sample APP_T5_OP2

We can use the default output from running pf info with the identifier type (-t or --type) set as "file" and the identifier (-i or --id) as "data/lanes_to_search.txt" to get the lane and sample names. To get the last line output (analogous to the last line in the file) we can use tail -1.



In [ ]:

    
pf info -t file -i data/lanes_to_search.txt | tail -1

Again, we could also have used pf accession.



In [ ]:

    
pf accession -t file -i data/lanes_to_search.txt | tail -1

Q4: What are the sample and lane accessions for lane 5477_6#1?

ERS015862 and ERR028809

We can use the default output from running pf accession with the identifier type (-t or --type) set as "lane" and the identifier (-i or --id) as 5477_6#1 to get the lane and sample accessions.



In [ ]:

    
pf accession -t lane -i 5477_6#1

Q5: What are the two URLs which can be used to download the FASTQ files for lane 5477_6#1 from the ENA?

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR028/ERR028809/ERR028809_1.fastq.gz

and

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR028/ERR028809/ERR028809_2.fastq.gz

We can get the ENA download URLs by running pf accession with the identifier type (-t or --type) set as "lane" and the identifier (-i or --id) as 5477_6#1 with the option -f or --fastq.



In [ ]:

    
pf accession -t lane -i 5477_6#1 -f

This will generate "fastq_urls.txt" which contains the two URLs you're looking for.



In [ ]:

    
cat fastq_urls.txt

Note: if the file "fastq_urls.txt" already exists you will need to remove it before you can use pf accession to create it again.
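One cautious way to handle an existing file is to rename it instead of deleting it, so the earlier URLs are not lost. The sketch below runs in a temporary directory, and the ".bak" suffix is just an illustrative choice.

```shell
# Temporary working directory standing in for your tutorial directory.
work_dir=$(mktemp -d)
urls_file="$work_dir/fastq_urls.txt"
echo "stale" > "$urls_file"        # pretend an earlier run left this behind

# Move the old copy aside so the name is free to be written again.
if [ -e "$urls_file" ]; then
    mv "$urls_file" "$urls_file.bak"
fi
```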

Analysis pipeline status

Q1: Has the assembly pipeline been run on lane 10018_1#1? If so, what is the status?

No.

The status for the assembly pipeline for lane 10018_1#1 is '-' which means that the assembly pipeline has not been run for this data.



In [ ]:

    
pf status -t lane -i 10018_1#1

Q2: Which lanes in study 607 has the assembly pipeline been run on?

10018_1#2, 10018_1#2, 10018_1#2, 10018_1#2 and 10018_1#51

We can pipe the output of pf status for study 607 into awk. The assembly pipeline status is found in column 9 and we want to filter for values which are "Done". This should return five lanes.



In [ ]:

    
pf status -t study -i 607 | awk '$9 == "Done"'

Q3: How many lanes in study 607 has the mapping pipeline been run on?

The command structure here is similar to before except we want to filter values for the mapping pipeline in column 4. We can then count the number of lines returned with wc -l.



In [ ]:

    
pf status -t study -i 607 | awk '$4 == "Done"' | wc -l
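The column filters used in the last two answers can be tried out without access to pf. The table below is entirely made up: only the column positions (mapping in column 4, assembly in column 9) follow the text above, and the lane names and the "Failed" status are assumptions for illustration.

```shell
# Made-up, whitespace-delimited status table with nine columns.
status_table=$(mktemp)
cat > "$status_table" <<'EOF'
lane_a Done Done Done Done - Done - Done
lane_b Done Done - Done - - - -
lane_c Done Done Done Done - Done - Failed
EOF

# Lanes whose ninth column is exactly "Done".
assembled=$(awk '$9 == "Done" {print $1}' "$status_table")

# Number of lanes whose fourth column is exactly "Done".
mapped_count=$(awk '$4 == "Done"' "$status_table" | wc -l | tr -d ' ')

echo "assembled: $assembled"
echo "mapped lanes: $mapped_count"
```

Because the comparison is an exact string match, rows with any other status in that column are left out.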

QC pipeline results

Q1: What percentage of the reads from lane 10018_1#1 were "unclassified" by Kraken?

69.55%

We can use the default output from running pf qc with the identifier type (-t or --type) set as "lane" and the identifier (-i or --id) as 10018_1#1 to get the location of the Kraken report. We then use xargs to pass this location to head so that we can see the first few lines of the report.



In [ ]:

    
pf qc -t lane -i 10018_1#1 | xargs head

Q2: What percentage of the reads from the lane 10018_1#1 were classified to the genus Actinobacillus by Kraken?

28.77%

We can write a summary of the Kraken report using the --summary or -s option. Here we called this file "qc_genus_summary.csv". To set the taxonomic level for the summary we use the --level or -L option. Genus is represented by a "G".



In [ ]:

    
pf qc -t lane -i 10018_1#1 -L G -s qc_genus_summary.csv

We then look at the summary file to see what percentage of reads were classified to the genus Actinobacillus.



In [ ]:

    
head qc_genus_summary.csv

Mapping pipeline results

Q1: How many BAM files are returned by default for lane 5477_6#10?

You can use grep -c to count the number of returned locations ending in .bam ('\.bam$'). The backslash makes the dot match a literal ".", and the dollar sign anchors the match to the end of the line so that the index files (.bam.bai) are not counted.



In [ ]:

    
pf map -t lane -i 5477_6#10 | grep -c '\.bam$'
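To see why the end-of-line anchor matters, here is a small sketch on made-up paths (they are placeholders, not real pf map output). Without the dollar sign, every index file is counted too.

```shell
# Placeholder paths: two BAM files, each with an index file.
paths=$(printf '%s\n' \
    '/example/lane/mapping_a.bam' \
    '/example/lane/mapping_a.bam.bai' \
    '/example/lane/mapping_b.bam' \
    '/example/lane/mapping_b.bam.bai')

# Anchored: only lines that end in ".bam".
anchored=$(printf '%s\n' "$paths" | grep -c '\.bam$')

# Unanchored: also matches the ".bam" inside ".bam.bai".
unanchored=$(printf '%s\n' "$paths" | grep -c '\.bam')

echo "anchored: $anchored, unanchored: $unanchored"
```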

Q2: Which mappers have been used with the mapping pipeline for lane 5477_6#10?

bowtie2, smalt and tophat

We can use the --details or -d option to get information about which mapper and reference were used to generate each of the BAM files. Then we can use awk to get the 3rd column which contains the mapper.



In [ ]:

    
pf map -t lane -i 5477_6#10 -d | awk '{print $3}'

If you want, you can also sort the output and pass it to uniq so that each mapper is listed only once.



In [ ]:

    
pf map -t lane -i 5477_6#10 -d | awk '{print $3}' | sort | uniq
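The sort step is needed because uniq only collapses duplicate lines that sit next to each other. The sketch below demonstrates this on a made-up list that reuses the three mapper names from the answer; the order and repetition are invented.

```shell
# Invented column of mapper names with non-adjacent repeats.
mappers=$(printf '%s\n' smalt bowtie2 smalt tophat bowtie2)

# Without sorting, no duplicates are adjacent, so nothing is removed.
unsorted_unique=$(printf '%s\n' "$mappers" | uniq | wc -l | tr -d ' ')

# Sorting first brings duplicates together so uniq can collapse them.
sorted_unique=$(printf '%s\n' "$mappers" | sort | uniq)

echo "lines after uniq alone: $unsorted_unique"
echo "$sorted_unique"
```

sort -u gives the same result as sort | uniq in a single step.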

Q3: Which references have been used with the mapping pipeline for lane 5477_6#10?

The references used were:

Streptococcus_pneumoniae_ATCC_700669_v1
Streptococcus_pneumoniae_OXC141_v1
Streptococcus_pneumoniae_Taiwan19F-14_v1

You can use the same command as before, except this time we are looking for the references in column 2 with awk.



In [ ]:

    
pf map -t lane -i 5477_6#10 -d | awk '{print $2}' | sort | uniq

Q4: What percentage of the reads from lane 5477_6#10 were mapped to "Streptococcus_pneumoniae_OXC141_v1"?

97.5%

First, we need to filter our returned mapping pipeline results by reference using the --reference or -R option. Then we write the comma-delimited statistics for the returned BAM files to file using the --stats or -s option.



In [ ]:

    
pf map -t lane -i 5477_6#10 -R "Streptococcus_pneumoniae_OXC141_v1" -s



In [ ]:

    
cat 5477_6_10.mapping_stats.csv

This generates "5477_6_10.mapping_stats.csv", which we can filter using awk. The command below keeps the rows whose reference (column 8) is Streptococcus_pneumoniae_OXC141_v1 and returns only the mapping percentage (column 12).



In [ ]:

    
awk -F',' '$8=="Streptococcus_pneumoniae_OXC141_v1" {print $12}' 5477_6_10.mapping_stats.csv

SNP pipeline results

Q1: How many lanes from run 10018_1 has the SNP calling pipeline been completed on?

You can use pf status to tell you which of the lanes in run 10018_1 the SNP calling pipeline has been completed on.



In [ ]:

    
pf status -t lane -i 10018_1

To count these you can get all of the rows where the SNP calling is "Done" (column 7) with awk and then count the number of lines returned with wc -l.



In [ ]:

    
pf status -t lane -i 10018_1 | awk '$7=="Done"' | wc -l

Q2: How many gzipped VCF files are returned by default for lane 10018_1#20?



In [ ]:

    
pf snp -t lane -i 10018_1#20

Q3: Which mapper and reference were used by the SNP calling pipeline for lane 10018_1#20?

smalt and Streptococcus_suis_P1_7_v1

You can get the mapper and reference information using the --details or -d option.



In [ ]:

    
pf snp -t lane -i 10018_1#20 -d

Q4: Generate the pseudogenome for lane 10018_1#20 excluding the reference.

To generate the pseudogenome you can use the --pseudogenome or -p option and --exclude-reference or -x option to exclude the reference.



In [ ]:

    
pf snp -t lane -i 10018_1#20 -p -x

Q5: Symlink the gzipped VCF files generated by the SNP calling pipeline for run 10018_1 to a new directory called "10010_1_vcfs".

You can symlink the VCF files using the --symlink or -l option, followed by the name of the directory you want to create.



In [ ]:

    
pf snp -t lane -i 10018_1 -l 10010_1_vcfs

Assembly pipeline results

Q1: How many assembly files are returned by default for lane 10018_1#50?

Assemblies have been generated using IVA and SPAdes (look at the result paths).



In [ ]:

    
pf assembly -t lane -i 10018_1#50

Q2: Which program was used to generate the assembly for lane 10018_1#51?

velvet

Look at the end of the path - "10018_1#51/velvet_assembly/contigs.fa".



In [ ]:

    
pf assembly -t lane -i 10018_1#51

Q3: Symlink the assembly/assemblies generated by "IVA" for run 10018_1 into a new directory called "iva_results".



In [ ]:

    
pf assembly -t lane -i 10018_1 -P iva -l iva_results

Q4: How many contigs were assembled by velvet for lane 5477_6#1 and what is the N50?

66 contigs with an N50 of 61,250

First, you need to generate the statistics file using the --stats or -s option. We need to filter our results so that we only get the statistics for the velvet assembly. We can do this with the --program or -P option.



In [ ]:

    
pf assembly -t lane -i 5477_6#1 -s -P velvet

Then, you need to look at the contents of the statistics file.



In [ ]:

    
cat 5477_6_1.assemblyfind_stats.csv

Annotation pipeline results

Q1: How many GFF files are returned by default for lane 10018_1#1?

There are two GFF files returned, one for an IVA assembly and one for a SPAdes assembly.



In [ ]:

    
pf annotation -t lane -i 10018_1#1

Q2: What is the location of the annotation for the SPAdes assembly of lane 10018_1#1?

To get the location of the SPAdes annotation, you need to use the --program or -P option to filter the results by assembler:



In [ ]:

    
pf annotation -t lane -i 10018_1#1 -P spades

Q3: What is the location of the translated CDS sequence file for the SPAdes assembly of lane 10018_1#1?

To get the translated CDS sequence file you need to use the --filetype or -f option:



In [ ]:

    
pf annotation -t lane -i 10018_1#1 -P spades -f faa

Q4: How many of the assemblies for run 5477_6 contain the gene "dnaG"?

You need to use the --gene or -g option to search for a gene name.



In [ ]:

    
pf annotation -t lane -i 5477_6 -g dnaG

RNA-Seq expression pipeline results

Q1: How many count files are returned by default for run 8479_4?



In [ ]:

    
pf rnaseq -t lane -i 8479_4

Q2: Which mappers have been used with the mapping pipeline for lane 8479_4#18?

bwa

You can get the mapper details using the --details or -d option.



In [ ]:

    
pf rnaseq -t lane -i 8479_4#18 -d

Q3: Which reference was used with the mapping pipeline for lane 8479_4#18?

Mus_musculus_mm10

You can get the reference details using the --details or -d option.



In [ ]:

    
pf rnaseq -t lane -i 8479_4#18 -d

Q4: What is the location or path of the featurecounts file for lane 8479_4#18?

You can get the location of the featurecounts file by using the --filetype or -f option:



In [ ]:

    
pf rnaseq -t lane -i 8479_4#18 -f featurecounts

Q5: Which of the lanes in run 8479_4 has the lowest percentage of mapped reads?

8479_4#17

You can get the mapping statistics for the run using the --stats or -s option.



In [ ]:

    
pf rnaseq -t lane -i 8479_4 -s

This generates a new file called "8479_4.rnaseqfind_stats.csv". You can get the lane name and mapping percentage using awk to print the third and twelfth columns.



In [ ]:

    
awk -F',' '{print $3"\t"$12}' 8479_4.rnaseqfind_stats.csv
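To pick out the lowest value without scanning the list by eye, the two columns can be sorted numerically. This sketch uses a made-up CSV: only the column positions (lane name in column 3, mapping percentage in column 12) follow the text above, while the header names, lane names and numbers are placeholders. It assumes unquoted fields, which may not match the real file.

```shell
# Made-up 12-column CSV with a header row.
stats_csv=$(mktemp)
{
    echo 'c1,c2,Lane,c4,c5,c6,c7,c8,c9,c10,c11,Mapped %'
    echo 'x,x,lane_a,x,x,x,x,x,x,x,x,91.2'
    echo 'x,x,lane_b,x,x,x,x,x,x,x,x,78.5'
    echo 'x,x,lane_c,x,x,x,x,x,x,x,x,85.0'
} > "$stats_csv"

tab=$(printf '\t')

# Skip the header, keep lane and percentage, sort ascending on the percentage.
lowest=$(awk -F',' 'NR > 1 {print $3 "\t" $12}' "$stats_csv" \
    | sort -t "$tab" -k2,2n \
    | head -n 1)

lowest_lane=$(printf '%s\n' "$lowest" | cut -f1)
echo "lowest: $lowest"
```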

Q6: What is the sample name and symlinked file name associated with lane 8479_4#18?

WT1xCtrl_2 and 8479_4#18.390152.pe.markdup.bam.expression.csv

You can use the --summary or -S option to get the relationship between lane, sample and symlinked file names.



In [ ]:

    
pf rnaseq -t lane -i 8479_4#18 -S



In [ ]:

    
cat 8479_4_18.rnaseqfind_summary.tsv

Finding a reference

Q1: How many Streptococcus pneumoniae references are available?

Don't forget that genus, species and strain should be separated by an underscore (not a space!) in your query.



In [ ]:

    
pf ref -i Streptococcus_pneumoniae -A | wc -l

If you got more, your search probably wasn't specific enough, e.g.:



In [ ]:

    
pf ref -i Streptococcus -AR

Q2: How many of the Streptococcus pneumoniae references were imported from a public repository?

One of the references, "Streptococcus_pneumoniae_str_110.58_v0.4", has a version < 1 which means it is an internal assembly and so hasn't been imported from a public repository.

Q3: What is the location of the annotation (GFF) file for Streptococcus pneumoniae P1031?

You need to use the --filetype or -f option to get the location.



In [ ]:

    
pf ref -i Streptococcus_pneumoniae_P1031 -f gff

Q4: Symlink the annotation (GFF) file for Streptococcus pneumoniae P1031 to your current directory.



In [ ]:

    
pf ref -i Streptococcus_pneumoniae_P1031 -f gff -l