Session 06 - Reciprocal Best BLAST Hits (RBBH)

Learning Outcomes

  • Conduct BLASTP comparisons between protein complements for prokaryotes
  • Using Python and Pandas to collect and examine tabular format data
  • Identify reciprocal best BLAST matches
  • Visualise and interpret reciprocal best BLAST matches using a range of tools

Introduction

Reciprocal best BLAST hits

As discussed in the lecture, reciprocal Best BLAST Hits (RBBH) are a common approximation to orthology for protein sequences, used in comparative genomics. An RBBH is found when two proteins, each in a different genome, find each other as the best scoring match in the other genome.

Python code

We will use the Biopython libraries to interact with and manipulate sequence data, and the Pandas data analysis libraries to manipulate numerical data and BLAST output.

Most of the code you will use is imported from the local bs32010 module in this directory, to avoid clutter in this notebook. You can inspect this module if you are interested in how the functions work.


In [ ]:
%pylab inline

from bs32010 import ex06  # Local functions and data

1. One-way BLASTP comparisons for RBBH

Two organisms

Reciprocal best hits are built from "one-way" best hits. For organisms A and B, the best hits to the proteins of A in B would be found by a one-way BLASTP search of the proteins of A against those of B. Likewise, the best hits to the proteins of B in A would be found by a one-way BLASTP search of the proteins of B against those of A. This would give us two BLAST output files:

  • A_vs_B.tab
  • B_vs_A.tab

Then, to identify the RBBHs for A and B, we would identify those proteins of A and B that find each other as the best-scoring match in both files.

Three or more organisms

The number of BLASTP comparisons we need to perform increases quickly with the number of additional organisms we include. For three organisms (X, Y, Z), we aim to produce three sets of RBBH, for each of the ways of combining the organisms in pairs:

  • X vs Y
  • X vs Z
  • Y vs Z

and we would need to run two one-way BLASTP searches for each of those outputs.

For $k$ organisms, we need to run $k^2 - k = k(k -1)$ BLASTP searches, in order to obtain $\frac{k(k -1)}{2}$ RBBH output files. This is, for even reasonably small numbers of organisms, time-consuming and best automated by a script.

There are alternative, faster, sequence alignment/search programs that could be used, but the choice of program affects the final results, as found in Ward and Moreno-Hagelsieb (2014).

 Coverage and identity

BLAST matches provide two parameters by which we can rank and filter results: coverage and identity.

  • identity: the percentage of the BLASTP alignment that is made up of identical amino acid sequence, in the two matching sequences.
  • coverage: There are two coverage values, one for the query sequence, and one for the subject sequence. They represent the percentage of the query (or subject) sequence that is covered by the aligned region.

These are useful parameters because they allow us to tune our BLAST output to reflect the degree of sequence similarity (and therefore presumed functional similarity) we want to have in our final RBBH set.

To calculate percentage identity ($PID$) of the BLAST match directly (though this value is provided for us to use in the BLASTP output), we would calculate:

$$ PID = \frac{\text{identical_sites}}{\text{alignment_length}} $$

and to calculate query and subject coverage ($COV_{q}$ and $COV_{s}$, which we can request from custom BLASTP output), we find:

$$ COV_q = \frac{\text{alignment_length}}{\text{query_length}} \\ COV_s = \frac{\text{alignment_length}}{\text{subject_length}} $$

Sequence and BLAST output data

You will be working with the chromosomal protein complements of related bacteria are provided in the rbbh_data directory:

  • NC_004547.faa: Pectobacterium atrosepticum SCRI1043
  • NC_013421.faa: Pectobacterium wasabiae WPP163
  • NC_010694.faa: Erwinia tasmaniensis ET1_99

and BLASTP comparisons of each protein sequence file that, to save time (this takes ≈15min on my laptop, without parallelisation), have already been conducted against both of the other protein sequence files:

  • NC_004547 vs NC_013421
  • NC_004547 vs NC_010694
  • NC_013421 vs NC_004547
  • NC_010694 vs NC_004547
  • NC_010694 vs NC_013421

using BLAST commands of the form:

blastp -query rbbh_data/NC_004547.faa -db rbbh_data/rbbh_output/NC_013421.faa \
       -outfmt "6 qseqid sseqid qlen slen length nident pident evalue bitscore" \
       -out rbbh_data/rbbh_output/NC_004547_vs_NC_013421.tab -max_target_seqs 1

If you would like to see how they were run, please examine the accompanying script file run_rbbh_blast.sh

The one-way BLAST output is found in the rbbh_data/rbbh_output directory:

rbbh_data/
├── NC_004547.faa
├── NC_004547.gbk
├── NC_010694.faa
├── NC_010694.gbk
├── NC_013421.faa
├── NC_013421.gbk
└── rbbh_output
    ├── NC_004547.faa.phr
    ├── NC_004547.faa.pin
    ├── NC_004547.faa.psq
    ├── NC_004547_vs_NC_010694.tab
    ├── NC_004547_vs_NC_013421.tab
    ├── NC_010694.faa.phr
    ├── NC_010694.faa.pin
    ├── NC_010694.faa.psq
    ├── NC_010694_vs_NC_004547.tab
    ├── NC_010694_vs_NC_013421.tab    
    ├── NC_013421.faa.phr
    ├── NC_013421.faa.pin
    ├── NC_013421.faa.psq
    └── NC_013421_vs_NC_004547.tab

One of the required comparison files is missing!

  • NC_013421 vs NC_010694

You will need to run the appropriate BLASTP command (see above) to complete the analysis.

Exercise 1 (5min): Run the BLASTP comparison, then inspect the rbbh_data/rbbh_output directory to see the one-way BLASTP output

 2. Parsing BLASTP output

The BLASTP output in rbbh_data/rbbh_output is in plain text, tab-separated format. BLASTP was run set to find only a single match for each protein query, and the output is not in standard BLASTP form. The columns in each of the *.tab files correspond to:

query_id, subject_id, query_length, subject_length, alignment_length, identical_sites, percentage_identity, E-value , bitscore

The function ex06.find_rbbh() will find the RBBH for any pair of the three files NC_004547, NC_013421, and NC_010694. It returns three Pandas dataframes:

  • df1: the forward direction one-way BLAST search data
  • df2: the reverse direction one-way BLAST search data
  • rbbh: the corresponding set of reciprocal best hits

For example, the code below shows the first few lines of each of the dataframes for the comparison between SCRI1043 and Erwinia tasmaniensis.

df1, df2, rbbh1 = ex06.find_rbbh('NC_004547', 'NC_010694')
df1.head()
df2.head()
rbbh1.head()

Exercise 2(5min): Run this code, and inspect the first few lines of each dataframe


In [ ]:
# Enter code here

Exercise 3 (5min): Perform the same analysis for the Pectobacterium chromosomes NC_004547 and NC_013421, using the variable names df3, df4, and rbbh2


In [ ]:
# Enter code here

You can find the number of rows in a dataframe using the function len(). For example, the dataframe df1 should have 4466 rows, which you can find by running the function len(df1).

Exercise 4 (5min): How many rows do the one-way hits for each analysis have, and how many RBBH result in each analysis?

  • What proportion of one-way hits is lost in each analysis?
  • Which comparison produces the most RBBH?
  • Why are the two results different?

In [ ]:
# Enter code here

By using the .describe() method of a dataframe, we can get more summary statistical information, such as the means and quantiles of a column of data. For example, to get a summary of the information in rbbh1 we could run

rbbh1.describe()

Exercise 5 (5min): Use the .describe() method for both RBBH results, and look at the data for percentage identity and percentage coverage.

  • How do these values differ between the two sets of results?
  • Why do you think they differ in this way?
  • Why do you think the maximum coverage can sometimes exceed 100%?

In [ ]:
# Enter code here

3. Interpreting RBBH output

In this section, you will use simple visualisation approaches to inspect the behaviour of the one-way and reciprocal BLAST output calculated above.

Visualising one-way BLAST output: Scatterplots

You will be viewing some of the basic relationships between BLAST's reported measures of match quality, using scatterplots.

  • E-values: Often, BLAST's E-values are taken to be reliable indicators of match quality, and are often (incorrectly) interpreted as probability values. However, E-values are influenced by query sequence length, and the size of the database being searched against. They are not simply a reflection of alignment quality, but alignment quality in the context of the exact search that was done. The same alignment/match can return multiple different E-values, depending on the database in which the match was found. This makes it unreliable as a comparator between different BLAST searches
  • (Normalised) Bitscores: BLAST's E-values are calculated from the alignment bitscore, modified to reflect the query sequence length and database composition. The (normalised) bitscore itself directly reflects the sequence alignment, and is more reliable for comparison across different BLAST searches.

To plot these relationships, you will use a helper function from ex06: ex06.plot_scatter(). This takes five arguments, in order:

  1. A dataframe column to plot on the x-axis
  2. A dataframe column to plot on the y-axis
  3. An x-axis label
  4. A y-axis label
  5. A plot title

For example,

ex06.plot_scatter(df1.bitscore, log10(df1.Evalue), 
                  "bitscore", "log_{10}(E-value)", 
                  "one-way E-value vs bitscore")

will plot log-transformed E-value against bitscore for the dataframe df1, with the indicated labels and title.

Exercise 6 (5min): What is the relationship between E-value and bitscore for BLAST hits? Use the code above to see if this is consistent for more than one set of comparisons.


In [ ]:
# Enter code here

Exercise 7 (5min): Use the plot_scatter() function to investigate how E-value and bitscore vary with query length

HINT: df1.columns will list the column names of the dataframe df1.


In [ ]:
# Enter code here

You should be able to see that:

  1. As query length increases, the 'best' (smallest) E-value that can be reported also falls.
  2. Bitscores show two relationships - a population for which bitscore increases with query length, and a population for which bitscore does not increase with query length. What do you think these populations represent?
  3. The largest query lengths correspond to the lowest E-values, but surprisingly low bitscores.

So, longer queries might make one-way matches look 'better' than they really are, if only E-values are considered.

Visualising one-way BLAST output: Heatmaps

You will now view some of the basic relationships between BLAST's reported measures of match quality, using 2D histograms, or heatmaps.

In this representation, the X and Y values define a plane, which is divided into a grid (or sometimes hexfield) of cells. These cells are coloured by the number of points that fill them, so that variation in the frequency of datapoints can easily be seen.

  • Percentage identity: BLAST reports percentage identity of the aligned region, not percentage identity between the query and subject sequence. Two sequences that share a common motif exactly may, in some circumstances, result in an alignment with 100% identity, even though the sequences are quite different - even if they differ over nearly all their length.
  • Percentage coverage: This can be reported directly by BLAST, but is not, by default. As calculated above, it reflects the proportion of the query or subject sequence covered by the BLAST alignment. It is useful therefore to distinguish between instances where a high percentage identity corresponds to a full-length protein match, and where it reflects only that a single domain is found to be in common between two proteins. Note that there are two possible coverage values: one for the query, and one for the subject.

You will be plotting one-way BLAST hit query coverage on the x-axis, and subject coverage on the y-axis. Cells are filled with colours from the default (jet) colourmap. The ex06.plot_hist2d function defines 100 bins in each axis by default, so there are approximately 10,000 cells in the resulting plots. In this case, cells with large numbers of points are coloured red, and those with few points are coloured blue (on a log scale). Empty cells are left white.

To use the ex06.plot_hist2d() function, you specify arguments just as for the scatterplot function:

  1. A dataframe column to plot on the x-axis
  2. A dataframe column to plot on the y-axis
  3. An x-axis label
  4. A y-axis label
  5. A plot title
ex06.plot_hist2d(df1.query_coverage, df1.subject_coverage, 
                "one-way COVq", "one-way COVs", 
                "one-way coverage comparison")

Exercise 8 (5min): What is the relationship between query coverage and subject for BLAST hits? Use the code above to see if this is consistent for more than one set of comparisons.

  • What kinds of sequence alignments might explain the different populations you see?

In [ ]:
# Enter code here

The relationship between query and subject sequence coverage is complex:

  1. The vast majority of datapoints (i.e. BLAST hits) are at or near 100% coverage of both query and subject sequences.
  2. There is a population of hits with approximately linear 1:1 relationship (bottom-left to top-right) of query to subject sequence coverage.
  3. There is a population of queries with ≈100% coverage, but whose subject sequence coverage varies across the full range from 15-100%
  4. There is a large population of query sequences with low coverage (<60%) whose best BLAST hit covers less than 20% of the subject sequence.

Exercise 9 (5min): What is the relationship between query coverage and percentage identity of a match for BLAST hits? Use the code above to see if this is consistent for more than one set of comparisons.

  • What kinds of sequence alignments might explain the different populations you see?

In [ ]:
# Enter code here

You should again see complex relationship. The bulk of the data is found in the upper right corner of the plot, indicating that this is a full-length match to the query sequence, and that the match has nearly 100% identity.

But there is a second population that is smeared across the lower half of the graph, where the query coverage varies widely, and the percentage identity rarely rises above 50%

Visualising RBBH output

Now that we know how our one-way BLAST matches behave, we can see how considering only reciprocal best BLAST matches affects our results. You will now compare the relationship between the RBBHs and the one-way BLAST matches for the analyses you have run above.

Exercise 10 (10min): Produce heatmaps of query coverage against subject coverage for the one-way BLAST hits, and the corresponding RBBH output, for one of your analyses.

HINT: You will want to use columns query_coverage_x and subject_coverage_x for your RBBH dataframe.

  • What are the main differences between the plots for one-way matches, and the corresponding RBBH analysis?
  • Why do you think these differences occur?

In [ ]:
# Enter code here

By retaining only RBBH, most of the complexity of the coverage plot has been discarded. In particular, the large cluster of low COVs sequences in the lower-left of the graph, corresponding to query sequences matching only a subdomain of the subject sequence, has disappeared.

Although there are still small numbers of RBBH where coverage is low in one, other or both query and subject sequences, the matches have been very efficiently reduced to the large set in the upper-right corner, indicative of full-length matches of both query and subject sequence, which are intuitively more likely to correspond to orthologues than matches in any other region of the plot.

Exercise 11 (10min): Produce heatmaps of query coverage against percentage identity for the one-way BLAST hits, and the corresponding RBBH output, for one of your analyses.

HINT: You will want to use columns query_coverage_x and percentage_identity_x for your RBBH dataframe.

  • What are the main differences between the plots for one-way matches, and the corresponding RBBH analysis?
  • Why do you think these differences occur?

In [ ]:
# Enter code here

The complexity seen in the one-way BLAST match plot has all but disappeared when only RBBH are considered. This has left, almost exclusively, BLAST matches to the full query sequence, with a level of sequence identity above 60% for the alignment.

There are, again, small numbers of RBBH for which there is not a (nearly) full-length query sequence alignment, or for which the alignment identity is low. These can be removed by imposing a coverage and/or sequence identity filter, e.g. using the pid and cov arguments of the find_rbbh() function.

Exercise 12 (5min): What threshold values of percentage identity and coverage do you think might be reasonable to restrict the RBBH results, to identify mostly orthologues, and why?

4. Writing RBBH output to file

To view your RBBH comparisons in ACT, you need to write the RBBH found above to a .crunch format output file. This job is made easier because your data is collected in a Pandas dataframe.

However, we do not yet have enough information to generate such a file, because you do not know the locations of your gene features, with respect to their source sequences. The read_genbank() function in the ex06 module allows us to get this data, and the the write_crunch() function will allow you to generate .crunch output:

features = ex06.read_genbank("rbbh_data/NC_004547.gbk", "rbbh_data/NC_010694.gbk")
ex06.write_crunch(rbbh1, features, filename="NC_004547_vs_NC_010694.crunch")

Exercise 13 (5min): Modify the code above to write the results of your RBBH analysis to a .crunch file for viewing with ACT


In [ ]:
# Enter code here

HINT: You can check the first few lines of the output file by executing the next cell


In [ ]:
!head NC_004547_vs_NC_010694.crunch