Effector proteins, which are translocated from the pathogen either into the host or into a space which both host and pathogen can access, play major roles in pathogen host range and host interaction. They are frequently the targets of resistance (R
) gene recognition, and many are thought to play significant roles in modulating host behaviour to the benefit of the pathogen.
Most pathogens contain multiple effectors representing different families and, when we obtain a new pathogen genome, we often wish to annotate the predicted protein complement with likely effectors, on the basis of sequence similarity to known effector proteins. In order to do this, we need an example of what features characterise the effector or effector family, so that we can screen the predicted protein complement for those features.
It is possible to conduct a straightforward sequence search using BLAST
or an equivalent tool to find effectors. However, methods such as profile hidden Markov models (HMMs) can capture statistical aspects of a diverse set of sequences, enabling a more sensitive and precise search for new members of the family. In this exercise, you will use profile HMMs to build a model representing the Pseudomonas syringae effector avrPto1
, and employ the model in a search for new family members in publicly-available sequence data.
HMMer
.If this is successful, you should see the input marker to the left of the cell change from
In [ ]:
to (for example)
In [1]:
and you may see output appear below the cell.
We will extract a subset of effector sequences from the set of sequences in PHI-Base 4. To do this, we will download all the sequences contained in PHI-Base, in FASTA format. This can be done using your web browser.
PHI-base 4
link in the top link bar on the main pageDownload
link in the top link bar.Continue
button.Submit
button.FASTA format
link.This will download a file called PHI_accessions
to a location on your machine, most likely your Downloads
directory, wherever that may be.
We will create a new directory to hold this data, by issuing the command below at the terminal:
mkdir -p phi_base
Now copy the downloaded file to the phi_base
subdirectory, renaming it to make the database version number explicit, and to assign the .fas
file extension. You can do this with your file manager, or at the terminal with the cp
(copy) command:
cp ~/Downloads/PHI_accessions ./phi_base/PHI_accessions_4_1.fas
From this point, we will assume that the location of this dataset - relative to this notebook - is ./phi_base/PHI_accessions_4_1.fas
.
avrPto1
effector sequences from the PHI-Base downloadWe will use Biopython
to help us extract all sequences from the PHI-Base file that refer to the Pseudomonas syringae effector avrPto1
. This will give us a starting set of sequences representative of early identification of a putative effector in a real project.
To load the PHI-Base sequences, we use Biopython
's SeqIO
library.
We want to filter these sequences to obtain all the avrPto1
protein sequences with records in PHI-Base. In the absence of any other criteria (GO annotation, PFam and filter sequences on the presence of the string avrPto1
in the description. This is equivalent to performing a search for avrPto1
at the PHI-Base site.
avrPto1
.To filter the local sequence file, execute the cell below:
In [ ]:
from Bio import SeqIO
# We create an empty list to hold avrPto1 sequences that we find
avrpto = []
# We read every sequence in turn from the input FASTA file, and
# if the description contains "avrPto1", we put that sequence in
# the avrpto list
for seqrecord in SeqIO.parse('phi_base/PHI_accessions_4_1.fas', 'fasta'):
if "avrPto1" in seqrecord.description:
avrpto.append(seqrecord)
# Show the sequences that we've collected in avrpto
for seqrecord in avrpto:
print(seqrecord)
After running the cell above, you should see that we have found three sequences:
They are all annotated as avrPto1
effectors. We will write these to the file data/avrpto1.fas
in FASTA format, using Biopython
, by running the cell below:
In [ ]:
# Write the contents of the list avrpto to the file data/avrpto1.fas in FASTA format
SeqIO.write(avrpto, "phi_base/avrpto1.fas", "fasta")
avrPto1.fas
sequences in other genomesWe will now use the three avrPto1
sequences you have downloaded as the seed or training set of sequences for finding new examples of these effectors in the Pseudomonas data you downloaded earlier.
It would be possible to search with each sequence individually, and attempt to match the results obtained from each individual search, but instead we will perform a profile search, by building a hidden Markov model from the downloaded sequences, and searching with the HMMer
search tool (instead of BLAST
).
avrPto1
sequencesTo build a sequence profile, we must first align the seed set of sequences, so that the equivalent positions in each input sequence line up in the profile. We will use the command-line tool Clustal Omega (clustalo
) to do this, by running the command below at the terminal:
clustalo -i phi_base/avrPto1.fas -o phi_base/avrpto1_aln.fas
This will create the new file phi_base/avrpto1_aln.fas
containing an alignment of the avrPto1
sequences.
HMMer
profileTo build the sequence profile from our alignment, we use the HMMer
package hmmbuild
. This takes a protein sequence alignment, and converts it to a hidden Markov model (sequence profile) that accounts for the frequency of each amino acid at each position, and also its dependence on the preceding amino acid. This a statistically sophisticated representation of our alignment, and very effective at describing the composition of a large sequence alignment.
To build this statistical model, we run the hmmbuild
command below, at the terminal:
hmmbuild --amino phi_base/avrpto1.hmm phi_base/avrpto1_aln.fas
This produces the file phi_base/avrpto1.hmm
, which describes our input sequence set statistically, and will be used to search against the Pseudomonas proteins you downloaded earlier.
avrPto1
sequencesWe can now use HMMer
's hmmsearch
tool to query the phi_base/avrpto1.hmm
file against the Pseudomonas protein sequences that you downloaded earlier. This is similar to running a BLAST
search at the command-line, and we will run the query against all the downloaded Pseudomonas proteins.
To collect all the Pseudomonas proteins together in one file, we concatenate them with the command cat
at the terminal, to create the file phi_base/pseudomonas.faa
:
cat pseudomonas/*.faa > phi_base/pseudomonas.faa
Now we can run hmmsearch
to query our avrPto1
profile against all the annotated proteins from the six Pseudomonas genomes in pseudomonas
, with the following command:
hmmsearch -o phi_base/avrpto1_vs_pseudomonas.out \
--tblout phi_base/avrpto1_vs_pseudomonas.tab \
phi_base/avrpto1.hmm phi_base/pseudomonas.faa
This writes two files to the phi_base
directory: a standard hmmsearch
output file (avrpto1_vs_pseudomonas.out
) and a computer-parseable table (avrpto1_vs_pseudomonas.tab
). You can inspect these with a text editor or by using the less
or more
commands in the terminal.
When we filtered the downloaded PHI-Base dataset for avrPto1
sequences, we did so trusting the annotation. We assumed that all avrPto1
sequences would be labelled with that string. But is that true?
We can see whether there are any other sequences we could have missed out on by running the same hmmsearch
query against the downloaded PHI-Base sequences, with the command:
hmmsearch -o phi_base/avrpto1_vs_phi_base.out \
--tblout phi_base/avrpto1_vs_phi_base.tab \
phi_base/avrpto1.hmm phi_base/PHI_accessions_4_1.fas