Exercise 01 - Whole Genome Comparisons

Introduction

The diversification of pathogen lineages can involve shifts in host adaptation and the acquisition and loss of virulence determinants. These processes occur in addition to the usual processes of environmental adaptation that apply to most organisms. In particular, gene gain and loss are likely to play significant roles in the adaptation of pathogens to their hosts, and in the acquisition of novel virulence functions.

An increasingly important approach to understanding the evolution of plant pathogens is comparative genomics. Sequencing several pathogens (and non-pathogenic close relatives) permits comparison of genome content, structure and organisation across those organisms. It is then possible to look for statistically significant associations between gene presence and some phenotypic feature, such as the presence or absence of virulence on all hosts, or a particular host.

The first step in these analyses is often to generate and inspect alignments of whole genomes (or regions of whole genomes), and of the predicted protein complements. In this exercise, you will generate and inspect pairwise alignments of bacterial plant pathogen genomes (Pseudomonas spp.), and in exercise 02 (02-cds_feature_comparisons) you will align and compare their proteomes.

Classification

With whole-genome sequencing now widely and inexpensively available as a service, taxonomic classification is moving from morphological and phenotypic methods, and single gene/MLST approaches, to multigene trees and larger comparisons. For bacteria, it is now feasible to organise many thousands of isolates into hierarchical classifications on the basis of whole genome similarity. In this exercise you will use an open-source tool to construct such a classification for six Pseudomonas genomes.

Learning outcomes

  • Use BLASTN at the command-line to compare prokaryotic genomes of related plant pathogens and environmental isolates.
  • Use ACT to visualise and interpret pairwise genome alignments between plant pathogens and environmental isolates.
  • Use ANIm to cluster and classify related pathogenic and non-pathogenic prokaryotic genomes.
  • Interpret ANIm/whole-genome classification in the context of existing taxonomy.

Requirements

To complete this exercise, you will need:
  • an active internet connection
  • a local installation of pyani
  • a local installation of MUMmer
  • a local installation of BLAST+
  • a local installation of the visualisation tool ACT

Downloading sequence data

To complete this exercise, we need to obtain Pseudomonas genome sequence data from NCBI. For the sake of time, and minimising unnecessary bandwidth use, the relevant sequences have already been downloaded and are part of this repository. The script used to download this dataset is included in the repository as scripts/01-download.sh for your information, and as an example.

The downloaded genome and CDS amino acid sequences (as well as some other useful files) can be found in the pseudomonas subdirectory:

$ tree ./pseudomonas
./pseudomonas
├── GCF_000012245.1_ASM1224v1_genomic.fna
├── GCF_000012245.1_ASM1224v1_genomic.gff
├── GCF_000012245.1_ASM1224v1_protein.faa
├── GCF_000263675.1_ASM26367v2_genomic.fna
├── GCF_000263675.1_ASM26367v2_genomic.gff
├── GCF_000263675.1_ASM26367v2_protein.faa
├── GCF_000293885.2_ASM29388v3_genomic.fna
├── GCF_000293885.2_ASM29388v3_genomic.gff
├── GCF_000293885.2_ASM29388v3_protein.faa
├── GCF_000473745.2_ASM47374v3_genomic.fna
├── GCF_000473745.2_ASM47374v3_genomic.gff
├── GCF_000473745.2_ASM47374v3_protein.faa
├── GCF_000626655.2_ASM62665v2_genomic.fna
├── GCF_000626655.2_ASM62665v2_genomic.gff
├── GCF_000626655.2_ASM62665v2_protein.faa
├── GCF_000988485.1_ASM98848v1_genomic.fna
├── GCF_000988485.1_ASM98848v1_genomic.gff
├── GCF_000988485.1_ASM98848v1_protein.faa
├── classes.tab
└── labels.tab

Note: the accompanying worksheet ../worksheets/01-downloading_data_biopython.ipynb is a worked example of how to download sequences from NCBI/Entrez programmatically, using Biopython.
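
A quick way to confirm that the dataset is complete is to check that every assembly has a genome (.fna), annotation (.gff) and protein (.faa) file. The following is a minimal sketch using only the Python standard library. The function names group_by_assembly and incomplete_assemblies are illustrative and are not part of the repository, and the short listing deliberately omits two files to show what a failure looks like.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Set

# File suffixes expected for each assembly, taken from the directory listing
EXPECTED = {"genomic.fna", "genomic.gff", "protein.faa"}


def group_by_assembly(filenames: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each assembly prefix (e.g. GCF_..._ASM1224v1) to its file suffixes."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for name in filenames:
        base = Path(name).name
        if not base.startswith("GCF_"):
            continue  # skip classes.tab, labels.tab and any other files
        prefix, _, suffix = base.rpartition("_")
        groups[prefix].add(suffix)
    return dict(groups)


def incomplete_assemblies(filenames: Iterable[str]) -> List[str]:
    """Return the assembly prefixes that lack one or more expected files."""
    groups = group_by_assembly(filenames)
    return sorted(p for p, found in groups.items() if not EXPECTED <= found)


# Illustrative listing: the second assembly is missing its .gff and .faa files
listing = [
    "GCF_000012245.1_ASM1224v1_genomic.fna",
    "GCF_000012245.1_ASM1224v1_genomic.gff",
    "GCF_000012245.1_ASM1224v1_protein.faa",
    "GCF_000263675.1_ASM26367v2_genomic.fna",
    "classes.tab",
]
print(incomplete_assemblies(listing))
```

To check the real directory, pass `[p.name for p in Path("pseudomonas").iterdir()]` instead of the illustrative listing; an empty result means every assembly has all three files.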

Pairwise genome sequence comparison

We will use BLASTN to conduct pairwise comparisons between the complete genome sequences of the plant pathogenic bacterial isolates P. syringae B301D and B728a, and the cyanide-using bacterium P. fluorescens NCIMB 11764. We will visualise these comparisons using the visualisation tool ACT.

We will make two comparisons:

  • P. syringae B301D against P. syringae B728a
  • P. syringae B728a against P. fluorescens NCIMB 11764

so that we can see how related plant pathogenic bacteria compare to each other, and to non-pathogenic relatives.

BLAST+ comparison

The three genomes we will compare have been chosen in part because they are complete, closed chromosomes that fully represent the isolate's genome (i.e. there are no associated plasmids). This makes the visualisation task easier.

The relevant genome sequence files are:

  • P. syringae B301D: pseudomonas/GCF_000012245.1_ASM1224v1_genomic.fna
  • P. syringae B728a: pseudomonas/GCF_000988485.1_ASM98848v1_genomic.fna
  • P. fluorescens NCIMB 11764: pseudomonas/GCF_000293885.2_ASM29388v3_genomic.fna

To run BLAST+ comparisons, we will use BLASTN at the command-line. This is a fast comparison, and we will use default settings. We will need to create a new directory to store our output, and then run two blastn commands, by issuing the commands below in the terminal:

$ mkdir -p pseudomonas_blastn
$ blastn -query pseudomonas/GCF_000012245.1_ASM1224v1_genomic.fna \
    -subject pseudomonas/GCF_000988485.1_ASM98848v1_genomic.fna \
    -outfmt 6 -out pseudomonas_blastn/B301D_vs_B728a.tab
$ blastn -query pseudomonas/GCF_000988485.1_ASM98848v1_genomic.fna \
    -subject pseudomonas/GCF_000293885.2_ASM29388v3_genomic.fna \
    -outfmt 6 -out pseudomonas_blastn/B728a_vs_NCIMB_11764.tab

This will create two new .tab files in the subdirectory pseudomonas_blastn.
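
When more comparisons are needed, the same commands can be generated programmatically. The sketch below only builds argument lists, using the blastn options already shown above (-query, -subject, -outfmt, -out). The helper name blastn_command and the genomes mapping are illustrative. Running the commands, for example with `subprocess.run(cmd, check=True)`, assumes that blastn is on your PATH and that the output directory exists.

```python
from typing import Dict, List

# Strain label -> genome FASTA path, as listed above
genomes: Dict[str, str] = {
    "B301D": "pseudomonas/GCF_000012245.1_ASM1224v1_genomic.fna",
    "B728a": "pseudomonas/GCF_000988485.1_ASM98848v1_genomic.fna",
    "NCIMB_11764": "pseudomonas/GCF_000293885.2_ASM29388v3_genomic.fna",
}


def blastn_command(query: str, subject: str,
                   outdir: str = "pseudomonas_blastn") -> List[str]:
    """Build a blastn argument list comparing two labelled genomes."""
    return [
        "blastn",
        "-query", genomes[query],
        "-subject", genomes[subject],
        "-outfmt", "6",
        "-out", f"{outdir}/{query}_vs_{subject}.tab",
    ]


# The two comparisons used in this exercise
commands = [
    blastn_command("B301D", "B728a"),
    blastn_command("B728a", "NCIMB_11764"),
]
for cmd in commands:
    print(" ".join(cmd))
```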

Note: the BLASTN output format was set to plain-text, tab-separated tabular using the option -outfmt 6. This format was chosen because ACT can read it directly. You can examine the contents of those files in a text editor or, for example, by using the command head pseudomonas_blastn/*.tab.
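
For analysis outside ACT, the tabular output is straightforward to parse. The sketch below assumes the twelve default -outfmt 6 columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name parse_blast_tab and the example row are illustrative only; the row is not real output from this exercise.

```python
from typing import Dict, Iterable, List, Union

# Default column order for BLAST+ tabular output (-outfmt 6)
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}

Value = Union[str, int, float]


def parse_blast_tab(lines: Iterable[str]) -> List[Dict[str, Value]]:
    """Parse tab-separated BLAST hits into dictionaries keyed by column name."""
    hits: List[Dict[str, Value]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit: Dict[str, Value] = {}
        for name, value in zip(BLAST_COLUMNS, line.split("\t")):
            if name in INT_COLUMNS:
                hit[name] = int(value)
            elif name in FLOAT_COLUMNS:
                hit[name] = float(value)
            else:
                hit[name] = value
        hits.append(hit)
    return hits


# Made-up example row; a real file would be read with, e.g.:
#   with open("pseudomonas_blastn/B301D_vs_B728a.tab") as handle:
#       hits = parse_blast_tab(handle)
example = ["query1\tsubject1\t98.50\t2000\t30\t0\t1\t2000\t5001\t7000\t0.0\t3500\n"]
hits = parse_blast_tab(example)
print(hits[0]["pident"], hits[0]["length"])
```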

Visualising the comparison

To inspect and interact with the BLASTN comparison output, we will use ACT (the Artemis Comparison Tool). This enables simultaneous visualisation of several genome comparisons. To start ACT, issue the following command at the terminal:

act &

Note: The ampersand (&) is used to make ACT run in the background. This means we can still use the terminal window to type commands.

To load the genome sequences and comparison files, we use ACT's menu.

First, select File -> Open to bring up the file selection dialog box.

At the moment, this is empty and does not have enough slots to include all our comparisons, so click on the more files... button once, to give us two more fields.

Now we can enter our genome sequences for comparison. We'll order these as follows:

  • P. fluorescens NCIMB 11764: pseudomonas/GCF_000293885.2_ASM29388v3_genomic.fna
  • P. syringae B728a: pseudomonas/GCF_000988485.1_ASM98848v1_genomic.fna
  • P. syringae B301D: pseudomonas/GCF_000012245.1_ASM1224v1_genomic.fna

For Sequence file 1, click on the Choose... button to navigate to the appropriate sequence file for P. fluorescens NCIMB 11764. Then repeat the process for Sequence file 2 and Sequence file 3 to enter the sequences for the two P. syringae genomes, in the order listed above.

Now, to load the first comparison file, we click on the Choose... button for Comparison file 1 and navigate to the B728a_vs_NCIMB_11764.tab BLASTN output file. Then we repeat this process for Comparison file 2, selecting B301D_vs_B728a.tab.

Once the fields are complete, clicking on Apply renders the visualisation in an interactive window.

Inspecting the comparison

What can we say about the comparisons between these three genomes?
  • How do the BLAST results between genomes of the same species compare to those between P. syringae and P. fluorescens? Are there any major rearrangements?
  • What kinds of differences are seen between the two P. syringae genomes?
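
Rearrangements can also be screened for numerically. In BLASTN tabular output, a match to the reverse strand of the subject is reported with sstart greater than send, so the proportion of aligned sequence in inverted matches gives a rough indication of how much inversion there is. The sketch below is illustrative: the function names, the 1000 bp minimum length and the example coordinates are assumptions, not values from this exercise.

```python
from typing import Iterable, Tuple

# One hit as (qstart, qend, sstart, send), following -outfmt 6 coordinates
Hit = Tuple[int, int, int, int]


def is_inverted(hit: Hit) -> bool:
    """True if query and subject coordinates run in opposite directions."""
    qstart, qend, sstart, send = hit
    return (qend - qstart) * (send - sstart) < 0


def inverted_fraction(hits: Iterable[Hit], min_length: int = 1000) -> float:
    """Fraction of aligned query bases (in hits >= min_length) that are inverted."""
    total = 0
    inverted = 0
    for hit in hits:
        length = abs(hit[1] - hit[0]) + 1
        if length < min_length:
            continue  # ignore short matches, which are often repeats
        total += length
        if is_inverted(hit):
            inverted += length
    return inverted / total if total else 0.0


# Made-up coordinates for illustration
example_hits = [
    (1, 5000, 1, 5000),          # collinear match, 5000 bp
    (6001, 9000, 20000, 17001),  # inverted match, 3000 bp
    (9500, 9600, 300, 400),      # short match, ignored
]
print(inverted_fraction(example_hits))
```

A high inverted fraction suggests large inversions, but it does not locate them; the ACT view remains the better tool for seeing where they occur.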

Whole-genome classification

Historically, microbial taxonomic classification has been based on morphology and phenotype. Now, with the advent of cheap whole genome sequencing, it has become possible to generate more accurate and precise classifications of microbes (and bacteria in particular) by direct comparison of their complete genome sequences.

Whole-genome classification has several advantages over traditional classification methods. Principally, a complete genome sequence represents the complete current state of an organism's hereditary material; no more or better information of this type could be obtained.

Genomic DNA also represents information that, unlike transcript profiles, phenotype or morphology, is not mediated by environmental states that could influence the measurement.

NOTE: To be sure, whole-genome similarity is not the only possible measure of similarity (and there is much to be said for grouping organisms by phenotype), but it represents a clear and precise complete description of an organism's hereditary material, and has great value.

Here, we will apply the Average Nucleotide Identity method ANIm (as described in Richter et al. (2009)) of whole-genome comparison to visualise the relationships between the six isolates of Pseudomonas listed above.

To conduct ANIm analysis of these genomes, we will use the Python package pyani. This can be run from the terminal with the following command:

average_nucleotide_identity.py -v -m ANIm \
  -i pseudomonas -o pseudomonas_ANIm \
  -g --gmethod seaborn --gformat pdf,png \
  --classes pseudomonas/classes.tab \
  --labels pseudomonas/labels.tab

This will conduct all required pairwise genome comparisons on the six genomes in the pseudomonas subdirectory, and produce useful graphical output (and tables that can be imported into Excel or R for further analysis).
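
As a rough illustration of what the percentage identity and alignment coverage values mean, the sketch below computes both from a list of aligned regions. It is a simplification under stated assumptions and is not pyani's implementation: each region is a hypothetical (aligned length, number of mismatched or gapped positions) pair, overlaps between regions are ignored, and the numbers are made up.

```python
from typing import Iterable, Tuple

# One aligned region as (aligned_length, similarity_errors)
Region = Tuple[int, int]


def ani_summary(regions: Iterable[Region], genome_length: int) -> Tuple[float, float]:
    """Return (identity, coverage) as fractions between 0 and 1.

    identity: 1 - (total errors / total aligned length)
    coverage: total aligned length / genome length
    """
    regions = list(regions)  # allow generators, since we iterate twice
    aligned = sum(length for length, _ in regions)
    errors = sum(err for _, err in regions)
    if aligned == 0:
        return 0.0, 0.0
    identity = 1.0 - errors / aligned
    coverage = aligned / genome_length
    return identity, coverage


# Hypothetical regions: 100,000 aligned bases with 2,000 errors in total,
# from a genome of 200,000 bases
regions = [(40000, 400), (60000, 1600)]
identity, coverage = ani_summary(regions, genome_length=200000)
print(f"identity={identity:.3f} coverage={coverage:.3f}")
```

The example shows why both values matter: the aligned regions are 98% identical, but they account for only half of the genome.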

Inspecting the comparison

Open the files pseudomonas_ANIm/ANIm_percentage_identity.pdf and pseudomonas_ANIm/ANIm_alignment_coverage.pdf using Acrobat or a similar package.

Previous publications (Richter et al. (2009) and Goris et al. (2007)) associate an ANIm percentage identity of around 95% with the boundary between species as determined by DNA-DNA hybridisation. The corresponding results for the Pseudomonas comparison are shown in the file pseudomonas_ANIm/ANIm_percentage_identity.pdf.

  • By this measure, how many species are identified here?
  • Are all the species assignments consistent with this classification?
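
The 95% boundary can be applied mechanically by placing two genomes in the same group whenever their identity meets the threshold. The sketch below does this with simple single-linkage grouping, which is an assumption made for illustration and is not the clustering pyani uses to draw its heatmaps. The genome labels and identity values are made up and are not results from this exercise.

```python
from typing import Dict, List, Set

Matrix = Dict[str, Dict[str, float]]


def group_by_threshold(identity: Matrix, threshold: float = 0.95) -> List[Set[str]]:
    """Group genomes linked, directly or indirectly, by identity >= threshold."""
    names = sorted(identity)
    unvisited = set(names)
    groups: List[Set[str]] = []
    for name in names:
        if name not in unvisited:
            continue
        unvisited.discard(name)
        group: Set[str] = set()
        stack = [name]
        while stack:
            current = stack.pop()
            group.add(current)
            for other in names:
                if other not in unvisited:
                    continue
                # use the lower of the two directions to be conservative
                score = min(identity[current][other], identity[other][current])
                if score >= threshold:
                    unvisited.discard(other)
                    stack.append(other)
        groups.append(group)
    return groups


# Hypothetical identities (as fractions): A and B are close, C is distinct
example = {
    "A": {"A": 1.00, "B": 0.98, "C": 0.88},
    "B": {"A": 0.98, "B": 1.00, "C": 0.87},
    "C": {"A": 0.88, "B": 0.87, "C": 1.00},
}
groups = group_by_threshold(example)
print(len(groups), groups)
```

Single-linkage grouping can chain genomes together (A close to B, B close to C, but A not close to C), so any group it produces should still be checked against the heatmap.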

One factor that might affect our interpretation of ANIm percentage identity results is the proportion of each genome that contributes to the aligned homologous regions. This information is shown in the file pseudomonas_ANIm/ANIm_alignment_coverage.pdf.

  • For genomes of the same species, how much of those genomes is aligned in the analysis?
  • For genomes of different species, how much of those genomes is aligned in the analysis?
  • If there are any questionable species assignments, how does the proportion of the genome that is aligned modify your interpretation?
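
Identity and coverage can also be examined together in code. The sketch below reads two square, tab-separated matrices and lists genome pairs that pass the identity threshold while aligning over only a small part of either genome. Several things here are assumptions: that the tables have a header row of genome names and one row per genome, that values are fractions between 0 and 1, and the 0.5 coverage cut-off, which is an arbitrary illustrative value. The function names and the inline example tables are made up.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple

Matrix = Dict[str, Dict[str, float]]


def read_matrix(handle: TextIO) -> Matrix:
    """Read a tab-separated square matrix with row and column labels."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)[1:]  # first cell is the (empty) corner label
    matrix: Matrix = {}
    for row in reader:
        if not row:
            continue
        matrix[row[0]] = {col: float(val) for col, val in zip(header, row[1:])}
    return matrix


def questionable_pairs(identity: Matrix, coverage: Matrix,
                       min_identity: float = 0.95,
                       min_coverage: float = 0.5) -> List[Tuple[str, str]]:
    """Pairs meeting the identity threshold but with low alignment coverage."""
    flagged: List[Tuple[str, str]] = []
    names = sorted(identity)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            same = min(identity[a][b], identity[b][a]) >= min_identity
            low = min(coverage[a][b], coverage[b][a]) < min_coverage
            if same and low:
                flagged.append((a, b))
    return flagged


# Made-up tables for illustration only
identity_text = "\tA\tB\nA\t1.0\t0.96\nB\t0.96\t1.0\n"
coverage_text = "\tA\tB\nA\t1.0\t0.35\nB\t0.40\t1.0\n"
identity = read_matrix(io.StringIO(identity_text))
coverage = read_matrix(io.StringIO(coverage_text))
print(questionable_pairs(identity, coverage))
```

A flagged pair is not necessarily misclassified; it indicates only that the high identity is supported by a limited part of the genomes and deserves a closer look.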