Let's have a look at the results. We will focus on the output from the second run as this will be the same as the first run but will also include the core gene alignment produced by MAFFT. We will start by looking at the most important output files and after this we will look at how you can query your pan genome and draw a simple tree form the core gene alignment.
The summary_statistics.txt file contains a summary of the number of genes in the pan, core and accessory genomes. It provides an overview of the genes and how frequently they occur in the input isolates. Usually, you can expect the total number of genes in this file to be about 1,000 genes per million bases of your species reference genome. In this case, the genomes are around 2 million bases, so we would expect a total number of genes to be in the order of 2,000. Let's have a look and see if this is the case.
In [ ]:
cat output_with_alignment/summary_statistics.txt
As you can see, we have around 2,500 genes which seems about right. If you get a lot fewer or many more genes than expected this could be an indication of an issue with your input data, for example contamination.
The gene_presence_absence files lists each gene and which samples it is present in. The .csv file can be opened in Excel.
Let's have a look at the first ten lines of the file:
In [ ]:
head output_with_alignment/gene_presence_absence.csv
The columns are tab separated and contains the following information:
The .Rtab file contains a tab delimited binary matrix with the precence and abscence of each gene in each sample. This makes it easy to load into R using the read.table function, giving you access to a number of useful tools. The first row is the header containing the name of each sample, and the first column contains the gene name. In the matrix, 1 indicates the gene is present in the sample and 0 indicates it is absent.
This fasta file contains a single nucleotide sequence from each of the clusters in the pan genome. The name of each sequence is the source sequence ID followed by the cluster it came from. This file can be of use for reference guided assembly, whole genome MLST or for mapping raw reads to it.
Roary comes packaged with a script called create_pan_genome_plots.R. It requires R and the R-package ggplot2, and can be used to generate graphs from the .Rtab files, showing how the pan genome varies as genomes are added.
This is a tree in newick format, created using the binary presence and absence of accessory genes. It can for example be viewed in FigTree. The tree is only a quick and dirty tree, generated to provide a rough overview of the data. To generate a more accurate tree, we will use the core gene alignment a bit further on.
This is the multi-FASTA alignment of core genes that we created in the second run, using MAFFT. We will soon use this as input to build a phylogenetic tree.
In this file each line lists the sequences in a cluster. We will use this later on in the tutorial to query the pan genome.
For more information about the different output files, including some that we haven't mentioned here, please feel free to have a look at the Roary web page.
Roary comes with a script called query_pan_genome that can be used to examine the gene differences between groups of isolates. To have a look at the usage options for this script, you can do:
query_pan_genome -h
This will show you the following usage options:
Usage: query_pan_genome [options] *.gff
Perform set operations on the pan genome to see the gene differences
between groups of isolates.
Options: -g STR groups filename [clustered_proteins]
-a STR action (union/intersection/complement/gene_multifasta/
difference) [union]
-c FLOAT percentage of isolates a gene must be in to be core [99]
-o STR output filename [pan_genome_results]
-n STR comma separated list of gene names for use with
gene_multifasta action
-i STR comma separated list of filenames, comparison set one
-t STR comma separated list of filenames, comparison set two
-v verbose output to STDOUT
-h this help message
Examples:
Union of genes found in isolates
query_pan_genome -a union *.gff
Intersection of genes found in isolates (core genes)
query_pan_genome -a intersection *.gff
Complement of genes found in isolates (accessory genes)
query_pan_genome -a complement *.gff
Extract the sequence of each gene listed and create multi-FASTA files
query_pan_genome -a gene_multifasta -n gryA,mecA,abc *.gff
Gene differences between sets of isolates
query_pan_genome -a difference --input_set_one 1.gff,2.gff --input_set_two 3.gff,4.gff,5.gff
For further info see: http://sanger-pathogens.github.io/Roary/
As you can see, this also shows us some examples uses. Give the first one a go, using the clustered_proteins file in the output_with_alignment:
In [ ]:
query_pan_genome -a union \
-g output_with_alignment/clustered_proteins *.gff
This will give us a file called pan_genome_results that contains a list of all genes in all samples, i.e. the pan genome. Have a look at the first ten lines of the newly generated file:
In [ ]:
head pan_genome_results
As you can see, the list contains the names of the clusters (this is usually the most frequently occurring gene name from the sequences in the cluster or, if there is no gene name, a generic unique name group_XXX). For each cluster, there is a tab separated list of each sample specific gene belonging in that cluster.
In a similar way, you can use query_pan_genome to get a list of the core genes:
query_pan_genome -a intersection \
-g output_with_alignment/clustered_proteins *.gff
and a list of the accessory genes:
query_pan_genome -a complement \
-g output_with_alignment/clustered_proteins *.gff
query_pan_genome can also be used to extract the protein sequences for genes you are particulatly interested in. Try extracting the sequences for three genes by specifying the -n option and a comma separated list of the cluster names:
In [ ]:
query_pan_genome -a gene_multifasta \
-g output_with_alignment/clustered_proteins \
-n patA_1,mnmG,hsdS_2 *.gff
You should have three new files, one for each gene you specified. Have a look at pan_genome_results_patA_1.fa:
In [ ]:
cat pan_genome_results_patA_1.fa
This multi-FASTA file contains the three protein sequences in the specified cluster (patA_1).
There is yet more functionality of query_pan_genome, but we won't go into that here. For further information, please feel free to visit the Roary web page.
The tree created by Roary (accessory_binary_genes.fa.newick) is just a quick tree to provide a rough insight into the data. To create a more accurate tree you can use the core gene alignment as input to a tree building software of your choice. RAxML is very accurate, however it is also fairly slow so in this tutorial we are going to use FastTree. You don't have to run this step, but if you want to you need to make sure FastTree is installed on your computer first.
To create a tree in Newick format from a nucleotide alignment using a generalized time-reversible model (the -gtr option), do:
In [ ]:
FastTree -nt -gtr output_with_alignment/core_gene_alignment.aln \
> tree.newick
The tree in this case will look like:
(sample1:0.006228253,sample2:0.002364375,sample3:0.002920483);
We can view this in FigTree, which will look something like:
In the event that you did not run this step, a copy of tree.newick has been placed in the ROARY/tree/ directory for the next section of this tutorial.
Q9: Approximately how many genes would you expect to see in the summary_statistics.txt file if you are working with a species with a genome size of 5,000,000 bases?
a) 500
b) 5000
c) 50,000
Q10: What does the accessory_binary_genes.fa.newick file provide?
a) A pylogenetic tree ready for publishing
b) Nothing, it is useless
c) A quick insight to the data
Q11: For query_pan_genome, what option should you use to get the accessory genome?
a) union
b) intersection
c) complement
The answers to these questions can be found here.
Now that you know a bit about the output from Roary, let's make use of them by visualising the results using Phandango.
You can also revisit the previous page, or go back to the index page.