Exercise 03 - Finding Effector Sequences

Introduction

Effector proteins, which are translocated from the pathogen either into the host or into a space which both host and pathogen can access, play major roles in pathogen host range and host interaction. They are frequently the targets of resistance (R) gene recognition, and many are thought to play significant roles in modulating host behaviour to the benefit of the pathogen.

Most pathogens contain multiple effectors representing different families and, when we obtain a new pathogen genome, we often wish to annotate the predicted protein complement with likely effectors, on the basis of sequence similarity to known effector proteins. In order to do this, we need an example of what features characterise the effector or effector family, so that we can screen the predicted protein complement for those features.

It is possible to conduct a straightforward sequence search using BLAST or an equivalent tool to find effectors. However, methods such as profile hidden Markov models (HMMs) can capture statistical aspects of a diverse set of sequences, enabling a more sensitive and precise search for new members of the family. In this exercise, you will use profile HMMs to build a model representing the Pseudomonas syringae effector avrPto1, and employ the model in a search for new family members in publicly-available sequence data.

Learning outcomes

  • Extracting effector family sequences from a public sequence repository.
  • Building a representative sequence profile for the family using HMMer.
  • Identifying new family members in using a HMM sequence profile.

Running cells in this notebook

If this is successful, you should see the input marker to the left of the cell change from

In [ ]:

to (for example)

In [1]:

and you may see output appear below the cell.

avrPto1 UniProt entries

Requirements

To complete this exercise, you will need:

PHI-Base

Downloading PHI-Base

We will extract a subset of effector sequences from the set of sequences in PHI-Base 4. To do this, we will download all the sequences contained in PHI-Base, in FASTA format. This can be done using your web browser.

  • Navigate to http://www.phi-base.org/.
  • Click on the PHI-base 4 link in the top link bar on the main page
  • Click on the Download link in the top link bar.
  • Enter your personal details in the form to register, and click on the Continue button.
  • Accept the terms and conditions, and click on the Submit button.
  • Click on the FASTA format link.

This will download a file called PHI_accessions to a location on your machine, most likely your Downloads directory, wherever that may be.

We will create a new directory to hold this data, by issuing the command below at the terminal:

mkdir -p phi_base

Now copy the downloaded file to the phi_base subdirectory, renaming it to make the database version number explicit, and to assign the .fas file extension. You can do this with your file manager, or at the terminal with the cp (copy) command:

cp ~/Downloads/PHI_accessions ./phi_base/PHI_accessions_4_1.fas

From this point, we will assume that the location of this dataset - relative to this notebook - is ./phi_base/PHI_accessions_4_1.fas.

NOTE: version numbers: it is important that you record and report the version numbers of databases and software used in your analyses - this enables reproducibility for yourself and others.

Extracting avrPto1 effector sequences from the PHI-Base download

We will use Biopython to help us extract all sequences from the PHI-Base file that refer to the Pseudomonas syringae effector avrPto1. This will give us a starting set of sequences representative of early identification of a putative effector in a real project.

Note: there is nothing particularly special about P. syringae's avrPto1 - I've chosen it as an example to illustrate the principles in this worksheet. Also, while PHI-Base is a very useful resource for many reasons, it has been chosen here to provide an illustrative example of non-exhaustive sequence information - to emulate the incomplete information you might have in a sequencing project for a less well-studied organism.

To load the PHI-Base sequences, we use Biopython's SeqIO library.

We want to filter these sequences to obtain all the avrPto1 protein sequences with records in PHI-Base. In the absence of any other criteria (GO annotation, PFam and filter sequences on the presence of the string avrPto1 in the description. This is equivalent to performing a search for avrPto1 at the PHI-Base site.

  • Click this link to carry out a PHI-Base search for avrPto1.

To filter the local sequence file, execute the cell below:


In [ ]:
from Bio import SeqIO

# We create an empty list to hold avrPto1 sequences that we find
avrpto = []

# We read every sequence in turn from the input FASTA file, and
# if the description contains "avrPto1", we put that sequence in
# the avrpto list
for seqrecord in SeqIO.parse('phi_base/PHI_accessions_4_1.fas', 'fasta'):
    if "avrPto1" in seqrecord.description:
        avrpto.append(seqrecord)
        
# Show the sequences that we've collected in avrpto
for seqrecord in avrpto:
    print(seqrecord)

After running the cell above, you should see that we have found three sequences:

  • 974|PHI:974|avrPto1|AAA25728
  • 975|PHI:975|avrPto1|AAO57459
  • 976|PHI:976|avrPto1|AAY39946

They are all annotated as avrPto1 effectors. We will write these to the file data/avrpto1.fas in FASTA format, using Biopython, by running the cell below:


In [ ]:
# Write the contents of the list avrpto to the file data/avrpto1.fas in FASTA format
SeqIO.write(avrpto, "phi_base/avrpto1.fas", "fasta")
On running the cell, you should see the number 3 below the cell (this is the number of sequences that were written), and you can check for the presence of the file avrpto1.fas in the data subdirectory.

Finding avrPto1.fas sequences in other genomes

We will now use the three avrPto1 sequences you have downloaded as the seed or training set of sequences for finding new examples of these effectors in the Pseudomonas data you downloaded earlier.

It would be possible to search with each sequence individually, and attempt to match the results obtained from each individual search, but instead we will perform a profile search, by building a hidden Markov model from the downloaded sequences, and searching with the HMMer search tool (instead of BLAST).

Note: HMMer is one of several tools (such as `PSI-BLAST`) that builds a profile of similar sequences, and searches on the basis of an aggregated, statistical representation of the sequence set. This places more weight on features shared by the input sequences, and less weight on features that the sequences do not have in common. That can result in more sensitive sequence searches, with fewer false positives, but care must be taken when compiling the sequence set used to build the initial sequence profile.

Aligning the avrPto1 sequences

To build a sequence profile, we must first align the seed set of sequences, so that the equivalent positions in each input sequence line up in the profile. We will use the command-line tool Clustal Omega (clustalo) to do this, by running the command below at the terminal:

clustalo -i phi_base/avrPto1.fas -o phi_base/avrpto1_aln.fas

This will create the new file phi_base/avrpto1_aln.fas containing an alignment of the avrPto1 sequences.

Building the HMMer profile

To build the sequence profile from our alignment, we use the HMMer package hmmbuild. This takes a protein sequence alignment, and converts it to a hidden Markov model (sequence profile) that accounts for the frequency of each amino acid at each position, and also its dependence on the preceding amino acid. This a statistically sophisticated representation of our alignment, and very effective at describing the composition of a large sequence alignment.

To build this statistical model, we run the hmmbuild command below, at the terminal:

hmmbuild --amino phi_base/avrpto1.hmm phi_base/avrpto1_aln.fas

This produces the file phi_base/avrpto1.hmm, which describes our input sequence set statistically, and will be used to search against the Pseudomonas proteins you downloaded earlier.

NOTE: You can inspect the model that HMMer builds directly, by looking at the contents of the phi_base/avrpto1.hmm file, for example by using the command less phi_base/avrpto1.hmm.

Searching for new avrPto1 sequences

We can now use HMMer's hmmsearch tool to query the phi_base/avrpto1.hmm file against the Pseudomonas protein sequences that you downloaded earlier. This is similar to running a BLAST search at the command-line, and we will run the query against all the downloaded Pseudomonas proteins.

To collect all the Pseudomonas proteins together in one file, we concatenate them with the command cat at the terminal, to create the file phi_base/pseudomonas.faa:

cat pseudomonas/*.faa > phi_base/pseudomonas.faa

Now we can run hmmsearch to query our avrPto1 profile against all the annotated proteins from the six Pseudomonas genomes in pseudomonas, with the following command:

hmmsearch -o phi_base/avrpto1_vs_pseudomonas.out \
  --tblout phi_base/avrpto1_vs_pseudomonas.tab \
  phi_base/avrpto1.hmm phi_base/pseudomonas.faa

This writes two files to the phi_base directory: a standard hmmsearch output file (avrpto1_vs_pseudomonas.out) and a computer-parseable table (avrpto1_vs_pseudomonas.tab). You can inspect these with a text editor or by using the less or more commands in the terminal.

NOTE: The --tblout option writes a tabular file, which is easier for downstream processing; the -o option writes the hmmsearch output to a file, rather than displaying it via STDOUT
  • How many avrPto1 sequences does hmmsearch find in the Pseudomonas sequence set?

Searching against the PHI-Base dataset

When we filtered the downloaded PHI-Base dataset for avrPto1 sequences, we did so trusting the annotation. We assumed that all avrPto1 sequences would be labelled with that string. But is that true?

We can see whether there are any other sequences we could have missed out on by running the same hmmsearch query against the downloaded PHI-Base sequences, with the command:

hmmsearch -o phi_base/avrpto1_vs_phi_base.out \
  --tblout phi_base/avrpto1_vs_phi_base.tab \
  phi_base/avrpto1.hmm phi_base/PHI_accessions_4_1.fas
  • How many avrPto1 sequences did you find in the PHI-Base sequence file?
  • If this is not the same number of sequences you searched with, why do you think you found as many as you did?