Exercise 02 - CDS Feature Comparisons

Introduction

We often wish to establish an equivalence between the CDS features on two genomes - by which we mean some assertion that sequence A on genome 1 is the "same thing" (in some sense) as sequence B on genome 2. This equivalence can take many practical forms (same catalytic function, same binding interaction, same role in a pathway, and so on) but, given the volume of sequence data available today, is usually established on the basis of sequence similarity. This similarity is then taken as a proxy for the actual equivalence we're interested in.

When sequencing a new pathogen genome, or obtaining a novel transcriptome, we may want to annotate the coding sequences in that genome by determining orthologs - the equivalent sequences - in some other genome.

In this notebook, we will look at three methods (there are many others, but we are constrained by time!) of identifying equivalent sequence features in genomes, in bulk.

All three methods we will consider involve BLASTP comparisons between the protein complements of a plant pathogen genome and a related non-pathogenic isolate. They can be considered to fall under three categories, and all depend on initial BLASTP comparisons.

one-way pairwise comparison - best BLASTP match
two-way pairwise comparison - reciprocal best BLASTP match
clustering - Markov clustering (MCL) of BLASTP matches

We will also need to run some Python code to process and visualise the clustering output.

Learning outcomes

Conduct BLASTP comparisons between protein complements for prokaryotes
Using Python and Pandas to collect, examine and visualise tabular format data
Identify reciprocal best BLAST matches
Visualise and interpret genome-wide reciprocal best BLAST matches.

Running cells in this notebook

This is an interactive notebook, which means you are able to run the code that is written in each of the cells.

To run the code in a cell, you should:

Place your mouse cursor in the cell, and click (this gives the cell *focus*) to make it active
Hold down the Shift key, and press the Return key.

If this is successful, you should see the input marker to the left of the cell change from

In [ ]:

to (for example)

In [1]:

and you may see output appear below the cell.

Requirements

To complete this exercise, you will need:

an active internet connection
a local installation of BLAST+

Software

CRB-BLAST - conditional reciprocal best BLAST
OrthoMCL - a database of predicted orthologs obtained using MCL.
OrthoFinder - a program for finding orthologous protein sequence families

Publications

Aubrey et al. (2014) PLoS Genet. doi:10.1371/journal.pgen.1004365

Blogs

On Reciprocal Best Blast Hits

One-Way Best `BLAST` matches (BBH)

It is still common to see one-way matches used - even if only informally, or as a first attempt - as a means of identifying equivalent proteins/features in a genome. In this section, we'll carry out a one-way BLAST search between the protein complements of the plant pathogen P. syringae B728a and its non-pathogenic relative P. fluorescens NCIMB 11764, and inspect the results graphically.

Performing the `BLASTP` query

We will use the blastp command at the terminal to use every protein sequence in the P. syringae B728a annotation as a query against the predicted proteome of P. fluorescens NCIMB 11764.

The BLAST databases have already been created for you to save time (using the scripts/02-cds_feature_comparisons.sh script), and the results are in the pseudomonas_blastp directory:

$ tree ./pseudomonas_blastp
./pseudomonas_blastp
├── GCF_000012245.1_ASM1224v1_protein.phr
├── GCF_000012245.1_ASM1224v1_protein.pin
├── GCF_000012245.1_ASM1224v1_protein.psq
├── GCF_000293885.2_ASM29388v3_protein.phr
├── GCF_000293885.2_ASM29388v3_protein.pin
├── GCF_000293885.2_ASM29388v3_protein.psq
├── GCF_000988485.1_ASM98848v1_protein.phr
├── GCF_000988485.1_ASM98848v1_protein.pin
└── GCF_000988485.1_ASM98848v1_protein.psq

We will use some custom settings to make our analysis easier to carry out.

We will want to limit our matches to only the best hit, so we specify -max_target_seqs 1
We want our output in tab-separated tabular particular format so we can import it easily into other tools (like R and Python), so use -outfmt 6.
We want some specific non-standard columns (e.g. query sequence coverage) in that table so we can carry out some useful calculations and visualisation. We therefore specify -outfmt "6 qseqid sseqid qlen slen length nident pident qcovs evalue bitscore"
To make the comparisons quicker, we should create BLAST databases for each of the three proteomes, with the makeblastdb command.

To carry out the one-way BLASTP search of P. syringae B728a against P. fluorescens NCIMB 11764, we would execute the following command in the terminal:

blastp -query pseudomonas/GCF_000988485.1_ASM98848v1_protein.faa \
       -db pseudomonas_blastp/GCF_000293885.2_ASM29388v3_protein \
       -max_target_seqs 1 \
       -outfmt "6 qseqid sseqid qlen slen length nident pident qcovs evalue bitscore" \
       -out pseudomonas_blastp/B728a_vs_NCIMB_11764.tab

This will take a few minutes to complete, so to save time the comparison has already been made for you, with the result file being placed in pseudomonas_blastp/B728a_vs_NCIMB_11764.tab.

Importing and visualising the results

The Python module helpers is included in this directory, to provide useful helper functions so that we can read and view the BLASTP output generated above. To make the functions available, we import it by running the Python code cell below.

NOTE: The %pylab inline "magic" below allows us to see plots of the BLAST data we load, inline in this notebook.



In [ ]:

    
%pylab inline
# Import helper module
from helpers import ex02

The first thing we do is load in the BLASTP output we generated, so that we can plot some of the key features. We do that using the ex02.read_data() function in the cell below. This puts the data into a dataframe called data_fwd.



In [ ]:

    
# Load one-way BLAST results into a data frame called data_fwd
data_fwd = ex02.read_data("pseudomonas_blastp/B728a_vs_NCIMB_11764.tab")

NOTE: In the cell below, the data.head() function shows us the first few lines of the one-way BLASTP results, one per match; the data.describe() function shows us some summary data for the table.



In [ ]:

    
# Show first few lines of the loaded data
data_fwd.head()



In [ ]:

    
# Show descriptive statistics for the table
data_fwd.describe()

There are 5265 rows in this table, one for each of the query protein sequences in the P. syringae B728a annotation.

We can look at the distribution of values in the dataframe rows using the .hist() method for any column of interest. For example, data_fwd.subject_length.hist() plots a histogram of the values in the subject_length column.

NOTE: The bins=100 option sets the number of value bins used in the histogram



In [ ]:

    
# Plot a histogram of alignment lengths for the BLAST data
data_fwd.alignment_length.hist(bins=100)



In [ ]:

    
# Plot a histogram of percentage identity for the BLAST data
data_fwd.identity.hist(bins=100)



In [ ]:

    
# Plot a histogram of query_coverage for the BLAST data
data_fwd.query_coverage.hist(bins=100)



In [ ]:

    
# Plot a histogram of percentage coverage for the BLAST data
data_fwd.subject_coverage.hist(bins=100)

QUESTIONS:

What size are most one-way best `BLAST` alignments?
What is the typical query coverage?
What is the typical subject coverage?
What is the typical best `BLAST` match identity?

We can view the relationship between query coverage and subject coverage, and query coverage and match identity for these one-way best BLAST hits by plotting a 2D histogram, with the helper function ex02.plot_hist2d() in the cell below.



In [ ]:

    
# Plot 2D histogram of subject sequence (match) coverage against query
# sequence coverag
ex02.plot_hist2d(data_fwd.query_coverage, data_fwd.subject_coverage,
                "one-way query COV", "one-way subject COV", 
                "one-way coverage comparison")
ex02.plot_hist2d(data_fwd.query_coverage, data_fwd.identity,
                "one-way query COV", "one-way match PID", 
                "one-way coverage/identity comparison")

QUESTIONS:

**What is the query/subject coverage for most one-way best `BLAST` matches?**
**Why do some one-way `BLAST` matches not have the same coverage for query and subject?**
**What is the typical query coverage of a high percentage identity match?**
**What is the typical query coverage of a low percentage identity match?**

QUESTION:
Do one-way best `BLAST` matches always identify equivalent proteins (orthologs)?

Reciprocal (Two-Way) Best `BLAST` matches (RBBH)

To perform a reciprocal BLAST search between two sets of proteins S1 and S2 (say), we need to carry out the forward search of S1 vs S2, and the reverse search S2 vs S1.

Reciprocal best BLAST matches are those where the sequence G(S1) (a gene/CDS from sequence set S1) used as a query makes its best BLAST match to sequence G(S2) (a gene/CDS from sequence set S2), and when sequence G(S2) is used as a query it makes its best match to sequence G(S1) (see figure below).

We carried out the forward search above, for P. syringae B728a (our sequence set S1) against P. fluorescens NCIMB 11764 (our sequence set S2), and now we will carry out the corresponding reverse search by executing the command below at the terminal:

blastp -query pseudomonas/GCF_000293885.2_ASM29388v3_protein.faa \
       -db pseudomonas_blastp/GCF_000988485.1_ASM98848v1_protein \
       -max_target_seqs 1 \
       -outfmt "6 qseqid sseqid qlen slen length nident pident qcovs evalue bitscore" \
       -out pseudomonas_blastp/NCIMB_11764_vs_B728a.tab

As before, this would few minutes to complete, so to save some time the comparison has already been made for you, with the result file being placed in pseudomonas_blastp/NCIMB_11764_vs_B728a.tab.

We'll load the results into a dataframe called data_rev using the helper function ex02.read_data() in the cell below.



In [ ]:

    
# Load one-way BLAST results into a data frame called data_fwd
data_rev = ex02.read_data("pseudomonas_blastp/NCIMB_11764_vs_B728a.tab")

NOTE: You could inspect data_rev using the .head() and .describe() methods, just as you did for data_fwd

The ex02 module provides a function called find_rbbh() which calculates reciprocal best BLAST hits from forward and reverse BLAST searches. The calculation can be performed by executing the cell below.



In [ ]:

    
# Calculate RBBH for the two Pseudomonas datasets
# This returns three dataframes: df1 and df2 are the forward and reverse BLAST
# results (filtered, if any filters were used), and rbbh is the dataframe of
# reciprocal best BLAST hits
df1, df2, rbbh = ex02.find_rbbh(data_fwd, data_rev)

We can inspect the dataframe of RBBH using the .head() and .describe() methods, by executing the cells below.



In [ ]:

    
# Peek at the first few lines of the RBBH results
rbbh.head()



In [ ]:

    
# Show summary statistics for RBBH
rbbh.describe()

It is inevitable that the RBBH set will have the same or fewer protein pairs in it, than the number of proteins in the smallest of the forward and reverse protein sets. But how many proteins have been filtered in this comparison? We can find out by executing the cell below.



In [ ]:

    
# Report the size of each of the forward and reverse input, and rbbh output dataframes
s = '\n'.join(["Forward BLAST input: {0} proteins",
               "Reverse BLAST input: {1} proteins",
               "RBBH output: {2} proteins"])
print(s.format(len(data_fwd), len(data_rev), len(rbbh)))
print("(min difference = {0})".format(min(len(data_fwd), len(data_rev)) - len(rbbh)))

Approximately what proportion of best BLAST matches have been discarded?

Visualising RBBH output

We can get a better idea of what this processing has done by looking at a visual representation of the percentage identity and coverage of RBBH, compared to the (forward) one-way matches. We can do this by executing the cells below.

First, let's look at the percentage identity of best BLAST matches:



In [ ]:

    
# Histogram of forward match percentage identity (one-way)
data_fwd.identity.hist(bins=100)



In [ ]:

    
# Histogram of forward match percentage identity (RBBH)
rbbh.identity_x.hist(bins=100)

What has been the effect of excluding best matches that do not have an RBBH reverse match?

Next, we can inspect the query and subject coverage of RBBH results, compared to the one-way forward BLAST matches by executing the cell below.



In [ ]:

    
# Plot 2D histograms of query coverage against subject coverage for the 
# one-way forward matches, and those retained after calculating RBBH
ex02.plot_hist2d(data_fwd.query_coverage, data_fwd.subject_coverage,
                "one-way query COV", "one-way subject COV", 
                "one-way coverage comparison")
ex02.plot_hist2d(rbbh.query_coverage_x, rbbh.subject_coverage_x,
                "RBBH (fwd) query COV", "RBBH (fwd) subject COV", 
                "RBBH coverage comparison")

Which one-way matches have been excluded by carrying out RBBH?
What is the biological significance of excluding those matches?
What would be a reasonable filter to exclude the remaining suspect matches?

Filtering RBBH output

The find_rbbh() function allows us to apply cutoff filters on percentage identity or coverage (or both) for an RBBH match - this, and visualisation of the results is done in the cells below.

NOTE: There is a software tool (CRB-BLAST - Conditional Reciprocal Best BLAST) available that calculates reciprocal best matches, and statistically evaluates an 'optimal' E-value cutoff, in order to improve accuracy of ortholog assignment.



In [ ]:

    
# Calculate ID and coverage-filtered RBBH for the two Pseudomonas datasets
# This returns three dataframes: df1_filtered and df2_filtered are the 
# filtered forward and reverse BLAST results , and rbbh_filtered is the
# dataframe of reciprocal best BLAST hits
df1_filtered, df2_filtered, rbbh_filtered = ex02.find_rbbh(data_fwd, data_rev, pid=40, cov=70)

# Histogram of forward match percentage identity (RBBH, filtered)
rbbh_filtered.identity_x.hist(bins=100)



In [ ]:

    
# Plot 2D histograms of query coverage against subject coverage for the 
# one-way forward matches retained after calculating RBBH and
# filtering on percentage identity and coverage
ex02.plot_hist2d(rbbh_filtered.query_coverage_x, rbbh_filtered.subject_coverage_x,
                "filtered RBBH (fwd) query COV", "filtered_RBBH (fwd) subject COV", 
                "filtered RBBH coverage comparison")

Visualising RBBH with `ACT`

Finally for this exercise, we will visualise the RBBH between P. syringae B728a and P. fluorescens NCIMB 11764 using ACT (as in exercise 01), comparing the output to that obtained by a BLASTN comparison of the chromosomes.

First, we need to generate an output file describing our (filtered) RBBH that ACT can read. We do this by executing the cell below. This does two things:

Gets the locations of protein features on the chromosome of each organism from a .gbff file, using the helper function read_genbank(), putting them in a variable called features.
Writes the RBBH to a .crunch format file (pseudomonas_blastp/B728a_rbbh_NCIMB_11764.crunch), which ACT can read.



In [ ]:

    
# Read feature locations for each Pseudomonas file
features = ex02.read_genbank("pseudomonas/GCF_000988485.1_ASM98848v1_genomic.gbff",
                             "pseudomonas/GCF_000293885.2_ASM29388v3_genomic.gbff")

# Write a .crunch file of filtered RBBH for the Pseudomonas comparisons
ex02.write_crunch(rbbh_filtered, features,
                  fwd="GCF_000988485.1_ASM98848v1_genomic",
                  rev="GCF_000293885.2_ASM29388v3_genomic",
                  outdir="pseudomonas_blastp",
                  filename="B728a_rbbh_NCIMB_11764.crunch")

Now we can load the two genome sequences, the BLASTN comparison, and the RBBH comparisons into ACT.

If ACT is not already running, we start it at the terminal as before:

act &

and open the file dialog, expanding it to show five fields, as in the previous exercise. First, select File -> Open to bring up the file selection dialog box.

Then click on the more files... button once, to give us two more fields.

Now we add the sequence files, placing the P. syringae B728a genome sequence (pseudomonas/GCF_000988485.1_ASM98848v1_genomic.fna) in Sequence file 2, and the P. fluorescens NCIMB 11764 genome (pseudomonas/GCF_000293885.2_ASM29388v3_genomic.fna) in two slots: Sequence file 1 and Sequence file 3, as shown.

Next, we add the RBBH output file pseudomonas_blastp/B728a_rbbh_NCIMB_11764.crunch to the field Comparison file 1:

And finally, we use the BLASTN comparison file (pseudomonas_blastn/B728a_vs_NCIMB_11764.tab) we prepared in the earlier exercise, adding it to the field Comparison file 2:

and then click on the Apply button. This produces the initial alignment:

which we can zoom in on to inspect differences between the BLASTN and BLASTP comparisons.

How do the BLASTN and RBBH alignments compare to each other, in general?
What kinds of differences are there?
What sorts of biologically useful information can you get from each alignment that is not available from the other alignment?
When is RBBH an appropriate comparison, and when is whole-genome alignment an appropriate comparison?