Exploring the results

Output files

Let's have a look at the results. We will focus on the output from the second run as this will be the same as the first run but will also include the core gene alignment produced by MAFFT. We will start by looking at the most important output files and after this we will look at how you can query your pan genome and draw a simple tree form the core gene alignment.

summary_statistics.txt

The summary_statistics.txt file contains a summary of the number of genes in the pan, core and accessory genomes. It provides an overview of the genes and how frequently they occur in the input isolates. Usually, you can expect the total number of genes in this file to be about 1,000 genes per million bases of your species reference genome. In this case, the genomes are around 2 million bases, so we would expect a total number of genes to be in the order of 2,000. Let's have a look and see if this is the case.


In [ ]:
cat output_with_alignment/summary_statistics.txt

As you can see, we have around 2,500 genes which seems about right. If you get a lot fewer or many more genes than expected this could be an indication of an issue with your input data, for example contamination.

gene_presence_absence

The gene_presence_absence files lists each gene and which samples it is present in. The .csv file can be opened in Excel.

Let's have a look at the first ten lines of the file:


In [ ]:
head output_with_alignment/gene_presence_absence.csv

The columns are tab separated and contains the following information:

  1. The gene name (the most frequently occurring gene name from the sequences in the cluster)
  2. A non unique gene name
  3. Functional annotation (the most frequently occurring functional annotation from the cluster)
  4. Number of isolates in the cluster
  5. Number of sequences in the cluster
  6. Average number of sequences per isolate (normally 1)
  7. Genome fragment
  8. Order within fragment
  9. Accessory fragment
  10. Accessory order within fragment
  11. Comments on the quality of the cluster
  12. Minimum sequence length in nucleotides of the cluster
  13. Maximum sequence length in nucleotides of the cluster
  14. Average sequence length in nucleotides of the cluster
  15. Presence and absence of genes in each sample, with the corresponding source Gene ID

The .Rtab file contains a tab delimited binary matrix with the precence and abscence of each gene in each sample. This makes it easy to load into R using the read.table function, giving you access to a number of useful tools. The first row is the header containing the name of each sample, and the first column contains the gene name. In the matrix, 1 indicates the gene is present in the sample and 0 indicates it is absent.

pan_genome_reference.fa

This fasta file contains a single nucleotide sequence from each of the clusters in the pan genome. The name of each sequence is the source sequence ID followed by the cluster it came from. This file can be of use for reference guided assembly, whole genome MLST or for mapping raw reads to it.

.Rtab

Roary comes packaged with a script called create_pan_genome_plots.R. It requires R and the R-package ggplot2, and can be used to generate graphs from the .Rtab files, showing how the pan genome varies as genomes are added.

accessory_binary_genes.fa.newick

This is a tree in newick format, created using the binary presence and absence of accessory genes. It can for example be viewed in FigTree. The tree is only a quick and dirty tree, generated to provide a rough overview of the data. To generate a more accurate tree, we will use the core gene alignment a bit further on.

core_gene_alignment.aln

This is the multi-FASTA alignment of core genes that we created in the second run, using MAFFT. We will soon use this as input to build a phylogenetic tree.

clustered_proteins

In this file each line lists the sequences in a cluster. We will use this later on in the tutorial to query the pan genome.

For more information about the different output files, including some that we haven't mentioned here, please feel free to have a look at the Roary web page.

Query the pan genome

Roary comes with a script called query_pan_genome that can be used to examine the gene differences between groups of isolates. To have a look at the usage options for this script, you can do:

query_pan_genome -h

This will show you the following usage options:

Usage: query_pan_genome [options] *.gff
Perform set operations on the pan genome to see the gene differences 
between groups of isolates.

Options: -g STR    groups filename [clustered_proteins]
         -a STR    action (union/intersection/complement/gene_multifasta/
                     difference) [union]
         -c FLOAT  percentage of isolates a gene must be in to be core [99]
         -o STR    output filename [pan_genome_results]
         -n STR    comma separated list of gene names for use with 
                     gene_multifasta action
         -i STR    comma separated list of filenames, comparison set one
         -t STR    comma separated list of filenames, comparison set two
         -v        verbose output to STDOUT
         -h        this help message

Examples:
Union of genes found in isolates
         query_pan_genome -a union *.gff

Intersection of genes found in isolates (core genes)
         query_pan_genome -a intersection *.gff

Complement of genes found in isolates (accessory genes)
         query_pan_genome -a complement *.gff

Extract the sequence of each gene listed and create multi-FASTA files
         query_pan_genome -a gene_multifasta -n gryA,mecA,abc *.gff

Gene differences between sets of isolates
         query_pan_genome -a difference --input_set_one 1.gff,2.gff --input_set_two 3.gff,4.gff,5.gff

For further info see: http://sanger-pathogens.github.io/Roary/

As you can see, this also shows us some examples uses. Give the first one a go, using the clustered_proteins file in the output_with_alignment:


In [ ]:
query_pan_genome -a union \
    -g output_with_alignment/clustered_proteins *.gff

This will give us a file called pan_genome_results that contains a list of all genes in all samples, i.e. the pan genome. Have a look at the first ten lines of the newly generated file:


In [ ]:
head pan_genome_results

As you can see, the list contains the names of the clusters (this is usually the most frequently occurring gene name from the sequences in the cluster or, if there is no gene name, a generic unique name group_XXX). For each cluster, there is a tab separated list of each sample specific gene belonging in that cluster.

In a similar way, you can use query_pan_genome to get a list of the core genes:

query_pan_genome -a intersection \
    -g output_with_alignment/clustered_proteins *.gff

and a list of the accessory genes:

query_pan_genome -a complement \
    -g output_with_alignment/clustered_proteins *.gff

query_pan_genome can also be used to extract the protein sequences for genes you are particulatly interested in. Try extracting the sequences for three genes by specifying the -n option and a comma separated list of the cluster names:


In [ ]:
query_pan_genome -a gene_multifasta \
    -g output_with_alignment/clustered_proteins \
    -n patA_1,mnmG,hsdS_2 *.gff

You should have three new files, one for each gene you specified. Have a look at pan_genome_results_patA_1.fa:


In [ ]:
cat pan_genome_results_patA_1.fa

This multi-FASTA file contains the three protein sequences in the specified cluster (patA_1).

There is yet more functionality of query_pan_genome, but we won't go into that here. For further information, please feel free to visit the Roary web page.

Draw a tree form the core gene alignment

The tree created by Roary (accessory_binary_genes.fa.newick) is just a quick tree to provide a rough insight into the data. To create a more accurate tree you can use the core gene alignment as input to a tree building software of your choice. RAxML is very accurate, however it is also fairly slow so in this tutorial we are going to use FastTree. You don't have to run this step, but if you want to you need to make sure FastTree is installed on your computer first.

To create a tree in Newick format from a nucleotide alignment using a generalized time-reversible model (the -gtr option), do:


In [ ]:
FastTree -nt -gtr output_with_alignment/core_gene_alignment.aln \
    > tree.newick

The tree in this case will look like:

(sample1:0.006228253,sample2:0.002364375,sample3:0.002920483);

We can view this in FigTree, which will look something like:

In the event that you did not run this step, a copy of tree.newick has been placed in the ROARY/tree/ directory for the next section of this tutorial.

Check your understanding

Q9: Approximately how many genes would you expect to see in the summary_statistics.txt file if you are working with a species with a genome size of 5,000,000 bases?
a) 500
b) 5000
c) 50,000

Q10: What does the accessory_binary_genes.fa.newick file provide?
a) A pylogenetic tree ready for publishing
b) Nothing, it is useless
c) A quick insight to the data

Q11: For query_pan_genome, what option should you use to get the accessory genome?
a) union
b) intersection
c) complement

The answers to these questions can be found here.

Now that you know a bit about the output from Roary, let's make use of them by visualising the results using Phandango.

You can also revisit the previous page, or go back to the index page.