The diversification of pathogen lineages can involve shifts in host adaptation and the acquisition and loss of virulence determinants. These processes occur in addition to the usual processes of environmental adaptation that apply to most organisms. In particular, gene gain and loss are highly likely to play significant roles in the adaptation of pathogens to their hosts, and in the acquisition of novel virulence functions.
An increasingly important approach to understanding the evolution of plant pathogens is comparative genomics. Sequencing several pathogens (and non-pathogenic close relatives) permits comparison of genome content, structure and organisation across those organisms. It is then possible to look for statistically significant associations between gene presence and some phenotypic feature, such as the presence or absence of virulence on all hosts, or a particular host.
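As a rough illustration of what a "statistically significant association" between gene presence and a phenotype can mean, the sketch below applies a one-sided Fisher's exact test to a 2x2 table of gene presence against virulence. The function name and the counts are hypothetical and chosen only for illustration. A real analysis would need many more genomes, and would also need to account for multiple testing and shared ancestry.

```python
from math import comb


def fisher_enrichment_p(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for a 2x2 table.

    a: virulent isolates carrying the gene
    b: virulent isolates lacking the gene
    c: non-virulent isolates carrying the gene
    d: non-virulent isolates lacking the gene

    Returns the probability of seeing at least `a` virulent carriers
    if gene presence and virulence were independent (hypergeometric tail).
    """
    n = a + b + c + d
    virulent = a + b
    carriers = a + c
    total_ways = comb(n, virulent)
    tail = 0
    for k in range(a, min(virulent, carriers) + 1):
        tail += comb(carriers, k) * comb(n - carriers, virulent - k)
    return tail / total_ways


# Hypothetical counts, NOT taken from the exercise data
p_value = fisher_enrichment_p(a=3, b=0, c=0, d=3)
print(f"one-sided p = {p_value:.3f}")
```

Even a perfect association across six genomes only reaches p = 0.05 in this toy table, which is one reason larger genome collections are needed for this kind of analysis.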
The first step in these analyses is often to generate and inspect alignments of whole genomes (or regions of whole genomes) and of their predicted protein complements. In this exercise, you will generate and inspect pairwise alignments of bacterial plant pathogen genomes (Pseudomonas spp.), and in exercise 02 (02-cds_feature_comparisons) you will align and compare their proteomes.
With whole-genome sequencing now widely and inexpensively available as a service, taxonomic classification is moving from morphological and phenotypic methods, and single-gene/MLST approaches, to multigene trees and larger comparisons. For bacteria, it is now feasible to organise many thousands of isolates into hierarchical classifications on the basis of whole-genome similarity. In this exercise you will use an open-source tool to construct such a classification for six Pseudomonas genomes.
To complete this exercise, we need to obtain Pseudomonas genome sequence data from NCBI. To save time and avoid unnecessary bandwidth use, the relevant sequences have already been downloaded and are part of this repository. The script used to download this dataset is included in the repository as scripts/01-download.sh for your information, and as an example.
The downloaded genome and CDS amino acid sequences (as well as some other useful files) can be found in the pseudomonas subdirectory:
$ tree ./pseudomonas
./pseudomonas
├── GCF_000012245.1_ASM1224v1_genomic.fna
├── GCF_000012245.1_ASM1224v1_genomic.gff
├── GCF_000012245.1_ASM1224v1_protein.faa
├── GCF_000263675.1_ASM26367v2_genomic.fna
├── GCF_000263675.1_ASM26367v2_genomic.gff
├── GCF_000263675.1_ASM26367v2_protein.faa
├── GCF_000293885.2_ASM29388v3_genomic.fna
├── GCF_000293885.2_ASM29388v3_genomic.gff
├── GCF_000293885.2_ASM29388v3_protein.faa
├── GCF_000473745.2_ASM47374v3_genomic.fna
├── GCF_000473745.2_ASM47374v3_genomic.gff
├── GCF_000473745.2_ASM47374v3_protein.faa
├── GCF_000626655.2_ASM62665v2_genomic.fna
├── GCF_000626655.2_ASM62665v2_genomic.gff
├── GCF_000626655.2_ASM62665v2_protein.faa
├── GCF_000988485.1_ASM98848v1_genomic.fna
├── GCF_000988485.1_ASM98848v1_genomic.gff
├── GCF_000988485.1_ASM98848v1_protein.faa
├── classes.tab
└── labels.tab
We will use BLASTN to conduct pairwise comparisons between the complete genome sequences of the plant pathogenic bacterial isolates P. syringae B301D and B728a, and the cyanide-using bacterium P. fluorescens NCIMB 11764. We will visualise these comparisons using the visualisation tool ACT.
We will make two comparisons:
P. syringae B301D vs P. syringae B728a
P. syringae B728a vs P. fluorescens NCIMB 11764
so that we can see how related plant pathogenic bacteria compare to each other, and to non-pathogenic relatives.
The relevant genome sequence files are:
pseudomonas/GCF_000012245.1_ASM1224v1_genomic.fna
pseudomonas/GCF_000988485.1_ASM98848v1_genomic.fna
pseudomonas/GCF_000293885.2_ASM29388v3_genomic.fna
To run BLAST+ comparisons, we will use BLASTN at the command-line. This is a fast comparison, and we will use default settings. We will need to create a new directory to store our output, and then run two blastn commands, by issuing the commands below in the terminal:
$ mkdir -p pseudomonas_blastn
$ blastn -query pseudomonas/GCF_000012245.1_ASM1224v1_genomic.fna \
-subject pseudomonas/GCF_000988485.1_ASM98848v1_genomic.fna \
-outfmt 6 -out pseudomonas_blastn/B301D_vs_B728a.tab
$ blastn -query pseudomonas/GCF_000988485.1_ASM98848v1_genomic.fna \
-subject pseudomonas/GCF_000293885.2_ASM29388v3_genomic.fna \
-outfmt 6 -out pseudomonas_blastn/B728a_vs_NCIMB_11764.tab
This will create two new .tab files in the subdirectory pseudomonas_blastn.
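Before loading the .tab files into ACT, it can help to check what they contain. The sketch below parses the default twelve-column BLAST tabular format (-outfmt 6: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and summarises the longer hits. The Hit class, the function names, the 1000 bp length cut-off, and the example rows are all assumptions made for illustration, not part of the exercise.

```python
import io
from typing import NamedTuple


class Hit(NamedTuple):
    qseqid: str
    sseqid: str
    pident: float
    length: int
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


def parse_blast_tab(handle):
    """Read default -outfmt 6 lines from an open text handle."""
    hits = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]),
                        int(f[6]), int(f[7]), int(f[8]), int(f[9]),
                        float(f[10]), float(f[11])))
    return hits


def summarise(hits, min_length=1000):
    """Summarise hits at least min_length long (cut-off is arbitrary)."""
    kept = [h for h in hits if h.length >= min_length]
    aligned = sum(h.length for h in kept)
    mean_id = sum(h.pident * h.length for h in kept) / aligned if aligned else 0.0
    # For BLASTN, a reverse-strand match is reported with sstart > send
    reverse = sum(1 for h in kept if h.sstart > h.send)
    return {"n_hits": len(kept), "aligned_bases": aligned,
            "mean_identity": mean_id, "reverse_strand": reverse}


# Made-up example rows standing in for a real output file
rows = [
    ["query_1", "subject_1", "95.0", "2000", "100", "0", "1", "2000", "501", "2500", "0.0", "3000"],
    ["query_1", "subject_1", "85.0", "1000", "150", "0", "3001", "4000", "9000", "8001", "0.0", "1200"],
    ["query_1", "subject_1", "100.0", "50", "0", "0", "5000", "5049", "100", "149", "1e-20", "93"],
]
example = io.StringIO("\n".join("\t".join(r) for r in rows) + "\n")
summary = summarise(parse_blast_tab(example))
print(summary)
```

To run this on the real output, replace the io.StringIO object with an open file, for example open("pseudomonas_blastn/B301D_vs_B728a.tab").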
To load the genome sequences and comparison files, we use ACT's menu.
First, select File -> Open to bring up the file selection dialog box.
At the moment, this is empty and does not have enough slots to include all our comparisons, so click on the more files... button once, to give us two more fields.
Now we can enter our genome sequences for comparison. We'll order these as follows:
pseudomonas/GCF_000293885.2_ASM29388v3_genomic.fna
pseudomonas/GCF_000988485.1_ASM98848v1_genomic.fna
pseudomonas/GCF_000012245.1_ASM1224v1_genomic.fna
For Sequence file 1, click on the Choose... button to navigate to the appropriate sequence file for P. fluorescens NCIMB 11764. Then repeat the process for Sequence file 2 and Sequence file 3 to enter the sequences for the two P. syringae genomes.
Now, to load the first comparison file, we click on the Choose... button for Comparison file 1 and navigate to the B728a_vs_NCIMB_11764.tab BLASTN output file. Then we repeat this process for the other comparison output.
Once the fields are complete, clicking on Apply renders the visualisation in an interactive window.
Historically, microbial taxonomic classification has been based on morphology and phenotype. Now, with the advent of cheap whole genome sequencing it has become possible to generate more accurate and precise classifications of microbes (and bacteria in particular) by direct comparison of their complete genome sequences.
Whole-genome classification has several advantages over traditional classification methods. Principally, a complete genome sequence represents the complete current state of an organism's hereditary material; no more or better information of this type could be obtained.
Genomic DNA also represents information that, unlike transcript profiles, phenotype or morphology, is not mediated by environmental states that could influence the measurement.
Here, we will apply the Average Nucleotide Identity method ANIm (as described in Richter et al. (2009)) of whole-genome comparison to visualise the relationships between the six isolates of Pseudomonas listed above.
To conduct ANIm analysis of these genomes, we will use the Python package pyani. This can be run from the terminal with the following command:
average_nucleotide_identity -v -m ANIm \
-i pseudomonas -o pseudomonas_ANIm \
-g --gmethod seaborn --gformat pdf,png \
--classes pseudomonas/classes.tab \
--labels pseudomonas/labels.tab
This will conduct all required pairwise genome comparisons on the six genomes in the pseudomonas subdirectory, and produce useful graphical output (and tables that can be imported into Excel or R for further analysis).
Open the files pseudomonas_ANIm/ANIm_percentage_identity.pdf and pseudomonas_ANIm/ANIm_alignment_coverage.pdf using Acrobat or a similar package.
Previous publications (Richter et al. (2009) and Goris et al. (2007)) associate an ANIm percentage identity of 95% with the boundary between species as determined by DNA-DNA hybridisation. The corresponding results for the Pseudomonas comparison are shown in the file pseudomonas_ANIm/ANIm_percentage_identity.pdf.
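To make the 95% boundary concrete, the sketch below groups genomes into putative species whenever a chain of pairwise identities at or above the threshold connects them (single linkage). This is a simplification for illustration rather than the method pyani uses for its plots. The function name, the isolate names, and the identity values are made up, and identities are assumed to be expressed as fractions between 0 and 1.

```python
def species_groups(identity, threshold=0.95):
    """Group genomes joined by pairwise identity >= threshold.

    identity: dict mapping (genome_a, genome_b) -> identity as a fraction.
    Returns a sorted list of sorted name lists, one per group.
    """
    names = sorted({name for pair in identity for name in pair})
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the group representative, compressing the path
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), value in identity.items():
        if a != b and value >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Made-up identities, NOT results from the Pseudomonas comparison
example_identity = {
    ("isolate_A", "isolate_B"): 0.98,
    ("isolate_A", "isolate_C"): 0.87,
    ("isolate_B", "isolate_C"): 0.88,
    ("isolate_C", "isolate_D"): 0.96,
}
groups = species_groups(example_identity)
print(groups)
```

In practice the values would come from the identity table that pyani writes alongside the PDF, and the grouping should be checked against the heatmap rather than trusted on its own.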
One factor that might affect our interpretation of ANIm percentage identity results is the proportion of each genome that contributes to the aligned homologous regions. This information is shown in the file pseudomonas_ANIm/ANIm_alignment_coverage.pdf.
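One way to combine the two measures is to flag pairs whose identity is high but whose aligned regions cover only a small part of either genome. The sketch below does this for a single pair. The function name, the labels it returns, and especially the 0.5 coverage cut-off are arbitrary assumptions for illustration, not thresholds taken from the cited papers or from pyani. All values are assumed to be fractions between 0 and 1.

```python
def interpret_pair(identity, coverage_a, coverage_b,
                   min_identity=0.95, min_coverage=0.5):
    """Label one genome pair using identity and per-genome coverage.

    coverage_a / coverage_b: fraction of each genome in aligned regions.
    min_coverage is an arbitrary illustrative cut-off (assumption).
    """
    if identity < min_identity:
        return "below species boundary"
    if min(coverage_a, coverage_b) < min_coverage:
        return "high identity, low coverage: interpret with caution"
    return "above species boundary"


# Made-up values, NOT results from the Pseudomonas comparison
print(interpret_pair(0.97, 0.90, 0.88))
print(interpret_pair(0.97, 0.30, 0.85))
print(interpret_pair(0.88, 0.60, 0.55))
```

Coverage is passed separately for each genome because the aligned fraction generally differs between the two members of a pair, for example when one genome is much larger than the other.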