The diversification of pathogen lineages can involve shifts in host adaptation and the acquisition and loss of virulence determinants. These processes occur in addition to the usual processes of environmental adaptation that apply to most organisms. In particular, gene gain and loss are highly likely to play significant roles in the adaptation of pathogens to their hosts, and in the acquisition of novel virulence functions.
An increasingly important approach to understanding the evolution of plant pathogens is comparative genomics. Sequencing several pathogens (and non-pathogenic close relatives) permits comparison of genome content, structure and organisation across those organisms. It is then possible to look for statistically significant associations between gene presence and some phenotypic feature, such as the presence or absence of virulence on all hosts, or a particular host.
The first step in these analyses is often to generate and inspect alignments of whole genomes (or regions of whole genomes), and the predicted protein complements. In this exercise, you will align and inspect pairwise alignments of bacterial plant pathogen genomes (Pseudomonas spp.), and in exercise 02 (02-cds_feature_comparisons) you will align and compare their proteomes.
As whole-genome sequencing becomes more widely and cheaply available as a service, taxonomic classification is moving away from morphological and phenotypic methods, and single-gene/MLST approaches, towards multigene trees and larger comparisons. For bacteria, it is now feasible to organise many thousands of isolates into hierarchical classifications on the basis of whole-genome similarity. In this exercise you will use an open-source tool to construct such a classification for six Pseudomonas genomes.
To complete this exercise, we need Pseudomonas genome sequence data from NCBI. To save time and avoid unnecessary bandwidth use, the relevant sequences have already been downloaded and are part of this repository. The script used to download this dataset is included in the repository as scripts/01-download.sh, for your information and as an example.
The downloaded genome and CDS amino acid sequences (as well as some other useful files) can be found in the pseudomonas subdirectory:
$ tree ./pseudomonas
./pseudomonas
├── GCF_000012245.1_ASM1224v1_genomic.fna
├── GCF_000012245.1_ASM1224v1_genomic.gff
├── GCF_000012245.1_ASM1224v1_protein.faa
├── GCF_000263675.1_ASM26367v2_genomic.fna
├── GCF_000263675.1_ASM26367v2_genomic.gff
├── GCF_000263675.1_ASM26367v2_protein.faa
├── GCF_000293885.2_ASM29388v3_genomic.fna
├── GCF_000293885.2_ASM29388v3_genomic.gff
├── GCF_000293885.2_ASM29388v3_protein.faa
├── GCF_000473745.2_ASM47374v3_genomic.fna
├── GCF_000473745.2_ASM47374v3_genomic.gff
├── GCF_000473745.2_ASM47374v3_protein.faa
├── GCF_000626655.2_ASM62665v2_genomic.fna
├── GCF_000626655.2_ASM62665v2_genomic.gff
├── GCF_000626655.2_ASM62665v2_protein.faa
├── GCF_000988485.1_ASM98848v1_genomic.fna
├── GCF_000988485.1_ASM98848v1_genomic.gff
├── GCF_000988485.1_ASM98848v1_protein.faa
├── classes.tab
└── labels.tab
We will use BLASTN to conduct pairwise comparisons between the complete genome sequences of the plant pathogenic bacterial isolates P. syringae B301D and B728a, and the cyanide-using bacterium P. fluorescens NCIMB 11764. We will visualise these comparisons using the visualisation tool ACT.
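We will make two comparisons, P. syringae B301D against P. syringae B728a and P. syringae B728a against P. fluorescens NCIMB 11764, so that we can see how related plant pathogenic bacteria compare to each other, and to non-pathogenic relatives.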
The relevant genome sequence files are:
pseudomonas/GCF_000012245.1_ASM1224v1_genomic.fna
pseudomonas/GCF_000988485.1_ASM98848v1_genomic.fna
pseudomonas/GCF_000293885.2_ASM29388v3_genomic.fna
To run BLAST+ comparisons, we will use BLASTN at the command-line. This is a fast comparison, and we will use default settings. We will need to create a new directory to store our output, and then run two blastn commands, by issuing the commands below in the terminal:
$ mkdir -p pseudomonas_blastn
$ blastn -query pseudomonas/GCF_000012245.1_ASM1224v1_genomic.fna \
-subject pseudomonas/GCF_000988485.1_ASM98848v1_genomic.fna \
-outfmt 6 -out pseudomonas_blastn/B301D_vs_B728a.tab
$ blastn -query pseudomonas/GCF_000988485.1_ASM98848v1_genomic.fna \
-subject pseudomonas/GCF_000293885.2_ASM29388v3_genomic.fna \
-outfmt 6 -out pseudomonas_blastn/B728a_vs_NCIMB_11764.tab
This will create two new .tab files in the subdirectory pseudomonas_blastn.
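Before loading these files into ACT, it can help to check what they contain. The sketch below is one way to summarise a BLAST+ `-outfmt 6` table in Python, assuming the twelve standard columns in their default order. The function names and the two example hits are made up for illustration and are not real BLASTN output.

```python
import csv
import io

# Default column order for BLAST+ tabular output (-outfmt 6)
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def read_blast_tab(handle):
    """Parse -outfmt 6 lines from an open text handle into a list of dicts."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST_COLUMNS, row))
        for key in INT_COLUMNS:
            hit[key] = int(hit[key])
        for key in FLOAT_COLUMNS:
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits


def summarise_hits(hits):
    """Return hit count, total aligned length and length-weighted mean identity."""
    total_length = sum(h["length"] for h in hits)
    weighted = sum(h["pident"] * h["length"] for h in hits)
    mean_identity = weighted / total_length if total_length else 0.0
    return {
        "hits": len(hits),
        "aligned_length": total_length,
        "mean_identity": mean_identity,
    }


# Two hypothetical hits, for illustration only
EXAMPLE = (
    "seqA\tseqB\t90.0\t1000\t100\t0\t1\t1000\t501\t1500\t0.0\t1200\n"
    "seqA\tseqB\t80.0\t3000\t600\t0\t2001\t5000\t6000\t3001\t0.0\t2500\n"
)

summary = summarise_hits(read_blast_tab(io.StringIO(EXAMPLE)))
print(summary)
```

To apply this to the real output, pass an open file handle instead of the example string, for instance `open("pseudomonas_blastn/B301D_vs_B728a.tab")`.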
To load the genome sequences and comparison files, we use ACT's menu.
First, select File -> Open to bring up the file selection dialog box.
At the moment, this is empty and does not have enough slots to include all our comparisons, so click on the more files... button once, to give us two more fields.
Now we can enter our genome sequences for comparison. We'll order these as follows:
pseudomonas/GCF_000293885.2_ASM29388v3_genomic.fna
pseudomonas/GCF_000988485.1_ASM98848v1_genomic.fna
pseudomonas/GCF_000012245.1_ASM1224v1_genomic.fna
For Sequence file 1, click on the Choose... button to navigate to the appropriate sequence file for P. fluorescens NCIMB 11764. Then repeat the process for Sequence file 2 and Sequence file 3 to enter the sequences for the two P. syringae genomes.
Now, to load the first comparison file, we click on the Choose... button for Comparison file 1 and navigate to the B728a_vs_NCIMB_11764.tab BLASTN output file. Then we repeat this process for Comparison file 2, selecting the B301D_vs_B728a.tab output file.
Once the fields are complete, clicking on Apply renders the visualisation in an interactive window.
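Whole-genome BLASTN output usually contains many short matches that can clutter the display. One option is to filter the comparison file before loading it, as in the minimal sketch below. The function name, the thresholds (1000 bp, 80% identity) and the example lines are assumptions chosen for illustration, not recommended values.

```python
def filter_blast_lines(lines, min_length=1000, min_identity=80.0):
    """Keep only -outfmt 6 lines that meet both thresholds.

    Lines are returned unchanged, so the result is still valid tabular output.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete 12-column hit line
        pident = float(fields[2])   # column 3: percentage identity
        length = int(fields[3])     # column 4: alignment length
        if length >= min_length and pident >= min_identity:
            kept.append(line)
    return kept


# Hypothetical hits: one long and similar, one short, one long but divergent
lines = [
    "seqA\tseqB\t95.0\t5000\t250\t0\t1\t5000\t1\t5000\t0.0\t8000\n",
    "seqA\tseqB\t99.0\t200\t2\t0\t6000\t6199\t7000\t7199\t1e-90\t350\n",
    "seqA\tseqB\t70.0\t3000\t900\t0\t9000\t11999\t9500\t12499\t0.0\t1500\n",
]

kept = filter_blast_lines(lines)
print(len(kept), "of", len(lines), "hits kept")
```

The kept lines can be written to a new .tab file and selected in ACT in place of the unfiltered output.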
Historically, microbial taxonomic classification has been based on morphology and phenotype. Now, with the advent of cheap whole-genome sequencing, it has become possible to generate more accurate and precise classifications of microbes (and bacteria in particular) by direct comparison of their complete genome sequences.
Whole-genome classification has several advantages over traditional classification methods. Principally, a complete genome sequence represents the complete current state of an organism's hereditary material; no more or better information of this type could be obtained.
Genomic DNA also represents information that, unlike transcript profiles, phenotype or morphology, is not mediated by environmental states that could influence the measurement.
Here, we will apply ANIm, an Average Nucleotide Identity method of whole-genome comparison described in Richter et al. (2009), to visualise the relationships between the six isolates of Pseudomonas listed above.
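To make the two quantities reported later more concrete, the sketch below shows one simplified way to compute a percentage identity and an alignment coverage from a set of aligned regions. It is an illustration under stated assumptions, not the pyani implementation: the `AlignedRegion` class, the `anim_summary` function and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlignedRegion:
    length: int  # aligned bases on the query genome
    errors: int  # mismatches and indels within the region


def anim_summary(regions, genome_length):
    """Return (identity, coverage) as fractions between 0 and 1.

    identity: share of aligned bases that match, pooled over all regions
    coverage: share of the query genome that falls in aligned regions
    """
    aligned = sum(r.length for r in regions)
    errors = sum(r.errors for r in regions)
    identity = 1 - errors / aligned if aligned else 0.0
    coverage = aligned / genome_length if genome_length else 0.0
    return identity, coverage


# Hypothetical alignment of a 100 kb query genome
regions = [AlignedRegion(40000, 1000), AlignedRegion(10000, 1500)]
identity, coverage = anim_summary(regions, 100000)
print(f"identity={identity:.3f} coverage={coverage:.2f}")
```

In this made-up case the aligned regions are 95% identical but cover only half of the query genome, which is why both values need to be read together.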
To conduct ANIm analysis of these genomes, we will use the Python package pyani. This can be run from the terminal with the following command:
$ average_nucleotide_identity -v -m ANIm \
-i pseudomonas -o pseudomonas_ANIm \
-g --gmethod seaborn --gformat pdf,png \
--classes pseudomonas/classes.tab \
--labels pseudomonas/labels.tab
This will conduct all required pairwise genome comparisons on the six genomes in the pseudomonas subdirectory, and produce useful graphical output (and tables that can be imported into Excel or R for further analysis).
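The tabular output can also be read directly in Python. The sketch below assumes a tab-separated matrix with genome labels in the header row and in the first column; that layout, the suggested file name `pseudomonas_ANIm/ANIm_percentage_identity.tab`, the function names and the example values are assumptions, so check them against the files actually written to the output directory.

```python
import csv
import io


def read_matrix(handle):
    """Read a labelled, tab-separated matrix into a dict of dicts of floats."""
    reader = csv.reader(handle, delimiter="\t")
    columns = next(reader)[1:]  # first header cell is the empty corner cell
    matrix = {}
    for row in reader:
        if not row:
            continue
        matrix[row[0]] = {col: float(val) for col, val in zip(columns, row[1:])}
    return matrix


def most_similar_pair(matrix):
    """Return (row_label, column_label, value) for the largest off-diagonal value."""
    best = None
    for a, row in matrix.items():
        for b, value in row.items():
            if a != b and (best is None or value > best[2]):
                best = (a, b, value)
    return best


# Hypothetical 3x3 identity matrix, for illustration only
EXAMPLE = (
    "\tgenome_1\tgenome_2\tgenome_3\n"
    "genome_1\t1.0\t0.98\t0.85\n"
    "genome_2\t0.98\t1.0\t0.86\n"
    "genome_3\t0.85\t0.86\t1.0\n"
)

matrix = read_matrix(io.StringIO(EXAMPLE))
best = most_similar_pair(matrix)
print(best)
```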
Open the files pseudomonas_ANIm/ANIm_percentage_identity.pdf and pseudomonas_ANIm/ANIm_alignment_coverage.pdf using Acrobat or a similar package.
Previous publications (Richter et al. (2009) and Goris et al. (2007)) associate an ANIm percentage identity of around 95% with the boundary between species as determined by DNA-DNA hybridisation. The corresponding results for the Pseudomonas comparison are shown in the file pseudomonas_ANIm/ANIm_percentage_identity.pdf.
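As a rough illustration of how such a threshold can be applied, the sketch below groups genomes so that any pair at or above 95% identity ends up in the same group (single linkage). This is one simple choice among several possible clustering rules, and the function name, labels and identity values are hypothetical rather than results from this exercise.

```python
def group_by_threshold(identities, labels, threshold=0.95):
    """Single-linkage grouping: link any two genomes at or above the threshold."""
    parent = {label: label for label in labels}

    def find(x):
        # Follow parent links to the group representative, compressing the path
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), value in identities.items():
        if value >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for label in labels:
        groups.setdefault(find(label), []).append(label)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical pairwise identities (fractions), for illustration only
labels = ["isolate_A", "isolate_B", "isolate_C", "isolate_D"]
identities = {
    ("isolate_A", "isolate_B"): 0.98,
    ("isolate_A", "isolate_C"): 0.86,
    ("isolate_A", "isolate_D"): 0.85,
    ("isolate_B", "isolate_C"): 0.87,
    ("isolate_B", "isolate_D"): 0.86,
    ("isolate_C", "isolate_D"): 0.96,
}

groups = group_by_threshold(identities, labels)
print(groups)
```

Single linkage can chain distant genomes together through intermediates, so groups formed this way are best checked against the alignment coverage as well.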
One factor that might affect our interpretation of ANIm percentage identity results is the proportion of each genome that contributes to the aligned homologous regions. This information is shown in the file pseudomonas_ANIm/ANIm_alignment_coverage.pdf.