This section describes how to use Phandango to view a summary of ARIBA results from many samples.
The most important output file from ARIBA is the report called report.tsv
. For this tutorial, we have all 1517 reports in the directory data/ARIBA_reports/
.
In [ ]:
ls data/ARIBA_reports | wc -l
See the previous section for how to generate a report file for each sample.
ARIBA has a functon called "summary" that can summarise presence/absence of sequences and/or SNPs across samples. It takes at least two ariba reports as input, and makes a CSV file that can be opened in your favourite spreadsheet program, and also makes input files for Phandango. The two Phandango files (a tree and a CSV file) can be dropped straight into the Phandango page for viewing.
The tree that ARIBA makes is based on the CSV file, which contains results of presence/absence of sequence and SNPs, and other information such as percent identity bewteen contigs and reference sequences. This means that it does not necessarily represent the real phylogeny of the samples. It is more accurate to provide a tree built from the sequencing data. For this reason, we will use a pre-computed tree file data/tree_for_phandango.tre
.
First, let's run ariba summary
using the default settings, except we will skip making the tree:
In [ ]:
ariba summary --no_tree out data/ARIBA_reports/*.tsv
We can see that this made two files:
In [ ]:
ls out.*
They are the same except for the first line, which has Phandango-specific information. ARIBA uses the filenames as sample names in the output:
In [ ]:
head -n 2 out.phandango.csv
The first name is "data/ARIBA_reports/ERR1067709.tsv", and the rest are named similarly. This is not ideal, as it will look ugly in Phandango. Further, the names must exactly match the names in the tree file for Phandango to work (have a look in the tree file data/tree_for_phandango.tre
). You could do a little hacking here using the Unix command sed
on the CSV file. Instead, we can supply ARIBA with a file of filenames that also tells ariba what to call the samples in its output CSV files. Instead of "data/ARIBA_reports/ERR1067709.tsv", we would like to simply use "ERR1067709", which is cleaner and matches the tree file. It also means we can (and will) repeatedly run ariba summary
with different options, and get output files that can be loaded straight into Phandango. This is one way to make the file with the naming information:
In [ ]:
ls data/ARIBA_reports/* | awk -F/ '{print $0,$NF}' | \
sed 's/.tsv$//' > data/filenames.fofn
The file is quite simple. Column 1 is the filename, and column 2 is the name we would like to use in the output.
In [ ]:
head data/filenames.fofn
Now we can rerun summary using this input file. Note the use of the new option --fofn
.
In [ ]:
ariba summary --no_tree --fofn data/filenames.fofn \
out data/ARIBA_reports/*.tsv
Check that the renaming worked:
In [ ]:
head -n 2 out.phandango.csv
Now go to Phandango and drag and drop the files out.phandango.csv
and data/tree_for_phandango.tre
into the window. The result should like this:
This a very high-level summary of the data. For each cluster, it is simply saying whether or not each sample has a 'match'. Green means a match, and pink means not a match. For presence/absence genes, this means that the gene must simply be there to count as a match. If it is a "variant only" gene, then the gene must be there and one of the variants that we told ARIBA about earlier when generating the ARIBA database.
In addition to a simple "yes" or "no" as to whether a sample "matches" a given cluster (as explained above), more columns can be output for each cluster. See the ARIBA summary wiki page for a full description of the options.
Adding more columns can result in a very wide plot, so we will just look at 23S from now on, using the option --only_clusters 23S
. Adding the option --preset all
will show all available columns for the 23S cluster:
In [ ]:
ariba summary --only_clusters 23S --preset cluster_all --no_tree \
--fofn data/filenames.fofn out data/ARIBA_reports/*.tsv
As before, drag and drop the files out.phandango.csv
and data/tree_for_phandango.tre
into the window. This time the result should look like this:
Now there are seven columns, showing various findings from ARIBA for 23S. These columns are described in the ARIBA summary wiki page.
You can control exactly which of the seven cluster columns are output using the option --cluster_cols
instead of --preset
. For example, this will show just the "match" and "pct_id" columns:
In [ ]:
ariba summary --only_clusters 23S --cluster_cols match,pct_id \
--no_tree --fofn data/filenames.fofn out \
data/ARIBA_reports/*.tsv
In the previous screenshot, where the option --preset cluster_all
, there are two variant columns: "known_var" and "novel_var". Green means "yes" and pink means "no". Here, we are considering a variant to be a difference bewteen the reference sequence and the assembly contig from the reads. The known_var column indicates whether each sample has any variant that is known to ARIBA, which means it was included when the original ARIBA database was generated. The novel_var column indicates whether or not a sample has any variant that is not already known to ARIBA.
We can view the calls for all the known variants by adding the --known_variants
option:
In [ ]:
ariba summary --only_clusters 23S --known_variants --no_tree \
--fofn data/filenames.fofn out data/ARIBA_reports/*.tsv
As before, drag and drop the files out.phandango.csv
and data/tree_for_phandango.tre
into the window. This time the result should like this:
Each known SNP and whether or not it is present is shown, but you may have noticed there are three colours. Green means "yes", orange means "heterozygous", and pink means "no". N. gonorrhoeae has four copies of 23S. A SNP could be present in none, some, or all copies. Where it is present in 1,2, or 3 copies, it is called "heterozygous" by ARIBA. Consider the SNP 2597CT in column 4, and column 5 to the right "2597CT.%". Samples either do not have this SNP, or have heterozygous calls. The 2597CT.% shows the percent of reads that have the SNP, when mapped to the contig. Hover over the coloured blocks in Phandango to see the percentage.
In [ ]:
ariba summary --only_clusters 23S --novel_variants --no_tree \
--fofn data/filenames.fofn out data/ARIBA_reports/*.tsv
Now Phandango shows all variants found in any of the samples (even if the variant is unique to one sample). This results in a lot of columns!
Now go to the next part of the tutorial where we Investigate MIC data in relation to variants in the samples.
You can also return to the index or revisit the previous section.