Reciprocal Best Blast CDS Feature Comparisons

Introduction

We often wish to establish an equivalence between the CDS features on two genomes - by which we mean some assertion that sequence A on genome 1 is the "same thing" (in some sense) as sequence B on genome 2. This equivalence can take many practical forms (same catalytic function, same binding interaction, same role in a pathway, and so on) but, given the volume of sequence data available today, is usually established on the basis of sequence similarity. This similarity is then taken as a proxy for the actual equivalence we're interested in.

When sequencing a new pathogen genome, or obtaining a novel transcriptome, we may want to annotate the coding sequences in that genome by determining orthologs - the equivalent sequences - in some other genome.

In this notebook, we will look at three methods (there are many others, but we are constrained by time!) of identifying equivalent sequence features in genomes, in bulk.

All three methods we will consider involve BLASTP comparisons between the protein complements of a plant pathogen genome and a related non-pathogenic isolate. They can be considered to fall under three categories, and all depend on initial BLASTP comparisons.

  • one-way pairwise comparison - best BLASTP match
  • two-way pairwise comparison - reciprocal best BLASTP match
  • clustering - Markov clustering (MCL) of BLASTP matches

We will also need to run some Python code to process and visualise the clustering output.

Learning outcomes

  • Conduct BLASTP comparisons between protein complements for prokaryotes
  • Using Python and Pandas to collect, examine and visualise tabular format data
  • Identify reciprocal best BLAST matches
  • Visualise and interpret genome-wide reciprocal best BLAST matches.

Running cells in this notebook

If this is successful, you should see the input marker to the left of the cell change from

In [ ]:

to (for example)

In [1]:

and you may see output appear below the cell.

Requirements

To complete this exercise, you will need:
  • an active internet connection
  • a local installation of BLAST+

Software

  • CRB-BLAST - conditional reciprocal best BLAST
  • OrthoMCL - a database of predicted orthologs obtained using MCL.
  • OrthoFinder - a program for finding orthologous protein sequence families

Publications

Blogs

One-Way Best BLAST matches (BBH)

It is still common to see one-way matches used - even if only informally, or as a first attempt - as a means of identifying equivalent proteins/features in a genome. In this section, we'll carry out a one-way BLAST search between the protein complements of the plant pathogen P. syringae B728a and its non-pathogenic relative P. fluorescens NCIMB 11764, and inspect the results graphically.

Performing the BLASTP query

We will use the blastp command at the terminal to use every protein sequence in the P. syringae B728a annotation as a query against the predicted proteome of P. fluorescens NCIMB 11764.

The BLAST databases have already been created for you to save time (using the scripts/02-cds_feature_comparisons.sh script), and the results are in the pseudomonas_blastp directory:

$ tree ./data/pseudomonas_blastp
./data/pseudomonas_blastp
├── GCF_000012245.1_ASM1224v1_protein.phr
├── GCF_000012245.1_ASM1224v1_protein.pin
├── GCF_000012245.1_ASM1224v1_protein.psq
├── GCF_000293885.2_ASM29388v3_protein.phr
├── GCF_000293885.2_ASM29388v3_protein.pin
├── GCF_000293885.2_ASM29388v3_protein.psq
├── GCF_000988485.1_ASM98848v1_protein.phr
├── GCF_000988485.1_ASM98848v1_protein.pin
└── GCF_000988485.1_ASM98848v1_protein.psq

We will use some custom settings to make our analysis easier to carry out.

  • We will want to limit our matches to only the best hit, so we specify -max_target_seqs 1
  • We want our output in tab-separated tabular particular format so we can import it easily into other tools (like R and Python), so use -outfmt 6.
  • We want some specific non-standard columns (e.g. query sequence coverage) in that table so we can carry out some useful calculations and visualisation. We therefore specify -outfmt "6 qseqid sseqid qlen slen length nident pident qcovs evalue bitscore"
  • To make the comparisons quicker, we should create BLAST databases for each of the three proteomes, with the makeblastdb command.

To carry out the one-way BLASTP search of P. syringae B728a against P. fluorescens NCIMB 11764, we would execute the following command in the terminal:

blastp -query data/pseudomonas/GCF_000988485.1_ASM98848v1_protein.faa \
       -db data/pseudomonas_blastp/GCF_000293885.2_ASM29388v3_protein \
       -max_target_seqs 1 \
       -outfmt "6 qseqid sseqid qlen slen length nident pident qcovs evalue bitscore" \
       -out data/pseudomonas_blastp/B728a_vs_NCIMB_11764.tab

This will take a few minutes to complete, so to save time the comparison has already been made for you, with the result file being placed in data/pseudomonas_blastp/B728a_vs_NCIMB_11764.tab.

Importing and visualising the results

The Python module helpers is included in this directory, to provide useful helper functions so that we can read and view the BLASTP output generated above. To make the functions available, we import it by running the Python code cell below.

NOTE: The %pylab inline "magic" below allows us to see plots of the BLAST data we load, inline in this notebook.

In [1]:
%pylab inline
# Import helper module
from helpers import rbbh


Populating the interactive namespace from numpy and matplotlib

The first thing we do is load in the BLASTP output we generated, so that we can plot some of the key features. We do that using the ex02.read_data() function in the cell below. This puts the data into a dataframe called data_fwd.


In [2]:
# Load one-way BLAST results into a data frame called data_fwd
data_fwd = rbbh.read_data("data/pseudomonas_blastp/B728a_vs_NCIMB_11764.tab")
NOTE: In the cell below, the data.head() function shows us the first few lines of the one-way BLASTP results, one per match; the data.describe() function shows us some summary data for the table.

In [3]:
# Show first few lines of the loaded data
data_fwd.head()


Out[3]:
query_id subject_id query_length subject_length alignment_length identical_sites identity qcovs Evalue bitscore query_coverage subject_coverage
0 WP_002551314.1 WP_031319201.1 133 133 133 122 91.73 100 3.000000e-85 246.0 100.000000 100.000000
1 WP_002551315.1 WP_017337743.1 44 257 37 9 24.32 84 2.200000e-01 25.4 84.090909 14.396887
2 WP_002551326.1 WP_003213601.1 315 317 317 299 94.32 100 0.000000e+00 625.0 100.634921 100.000000
3 WP_002551363.1 WP_017341421.1 197 193 186 141 75.81 93 2.000000e-82 244.0 94.416244 96.373057
4 WP_002551394.1 WP_017335870.1 290 290 275 265 96.36 95 0.000000e+00 511.0 94.827586 94.827586

In [4]:
# Show descriptive statistics for the table
data_fwd.describe()


Out[4]:
query_length subject_length alignment_length identical_sites identity qcovs Evalue bitscore query_coverage subject_coverage
count 5256.000000 5256.000000 5256.000000 5256.000000 5256.000000 5256.000000 5.256000e+03 5256.000000 5256.000000 5256.000000
mean 509.303463 478.840183 308.540145 199.322869 62.381001 84.300799 1.841770e-01 380.685921 82.287000 79.907095
std 1223.528191 676.488916 315.981863 188.971006 24.282847 26.393363 7.554328e-01 372.121884 29.150842 32.547460
min 28.000000 46.000000 12.000000 7.000000 19.690000 3.000000 0.000000e+00 21.600000 0.747526 1.154201
25% 181.000000 210.000000 122.000000 62.000000 36.247500 84.750000 0.000000e+00 88.100000 76.975386 69.477406
50% 296.000000 317.500000 249.000000 152.000000 69.060000 98.000000 5.000000e-100 292.000000 98.697917 98.837209
75% 441.000000 476.250000 388.250000 279.250000 84.432500 100.000000 1.000000e-21 552.000000 100.000000 100.000000
max 13536.000000 4332.000000 4342.000000 3399.000000 100.000000 100.000000 9.700000e+00 6872.000000 128.000000 122.368421

There are 5265 rows in this table, one for each of the query protein sequences in the P. syringae B728a annotation.

We can look at the distribution of values in the dataframe rows using the .hist() method for any column of interest. For example, data_fwd.subject_length.hist() plots a histogram of the values in the subject_length column.

NOTE: The bins=100 option sets the number of value bins used in the histogram

In [5]:
# Plot a histogram of alignment lengths for the BLAST data
data_fwd.alignment_length.hist(bins=100)


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x105272d68>

In [6]:
# Plot a histogram of percentage identity for the BLAST data
data_fwd.identity.hist(bins=100)


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x104fc8630>

In [7]:
# Plot a histogram of query_coverage for the BLAST data
data_fwd.query_coverage.hist(bins=100)


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x105452da0>

In [8]:
# Plot a histogram of percentage coverage for the BLAST data
data_fwd.subject_coverage.hist(bins=100)


Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x105693390>
QUESTIONS:
  • What size are most one-way best `BLAST` alignments?
  • What is the typical query coverage?
  • What is the typical subject coverage?
  • What is the typical best `BLAST` match identity?

We can view the relationship between query coverage and subject coverage, and query coverage and match identity for these one-way best BLAST hits by plotting a 2D histogram, with the helper function ex02.plot_hist2d() in the cell below.


In [9]:
# Plot 2D histogram of subject sequence (match) coverage against query
# sequence coverag
rbbh.plot_hist2d(data_fwd.query_coverage, data_fwd.subject_coverage,
                "one-way query COV", "one-way subject COV", 
                "one-way coverage comparison")
rbbh.plot_hist2d(data_fwd.query_coverage, data_fwd.identity,
                "one-way query COV", "one-way match PID", 
                "one-way coverage/identity comparison")


QUESTIONS:
  • **What is the query/subject coverage for most one-way best `BLAST` matches?**
  • **Why do some one-way `BLAST` matches not have the same coverage for query and subject?**
  • **What is the typical query coverage of a high percentage identity match?**
  • **What is the typical query coverage of a low percentage identity match?**

Reciprocal (Two-Way) Best BLAST matches (RBBH)

To perform a reciprocal BLAST search between two sets of proteins S1 and S2 (say), we need to carry out the forward search of S1 vs S2, and the reverse search S2 vs S1.

Reciprocal best BLAST matches are those where the sequence G(S1) (a gene/CDS from sequence set S1) used as a query makes its best BLAST match to sequence G(S2) (a gene/CDS from sequence set S2), and when sequence G(S2) is used as a query it makes its best match to sequence G(S1) (see figure below).

We carried out the forward search above, for P. syringae B728a (our sequence set S1) against P. fluorescens NCIMB 11764 (our sequence set S2), and now we will carry out the corresponding reverse search by executing the command below at the terminal:

blastp -query data/pseudomonas/GCF_000293885.2_ASM29388v3_protein.faa \
       -db data/pseudomonas_blastp/GCF_000988485.1_ASM98848v1_protein \
       -max_target_seqs 1 \
       -outfmt "6 qseqid sseqid qlen slen length nident pident qcovs evalue bitscore" \
       -out data/pseudomonas_blastp/NCIMB_11764_vs_B728a.tab

As before, this would few minutes to complete, so to save some time the comparison has already been made for you, with the result file being placed in pseudomonas_blastp/NCIMB_11764_vs_B728a.tab.

We'll load the results into a dataframe called data_rev using the helper function ex02.read_data() in the cell below.


In [10]:
# Load one-way BLAST results into a data frame called data_fwd
data_rev = rbbh.read_data("data/pseudomonas_blastp/NCIMB_11764_vs_B728a.tab")
NOTE: You could inspect data_rev using the .head() and .describe() methods, just as you did for data_fwd

The ex02 module provides a function called find_rbbh() which calculates reciprocal best BLAST hits from forward and reverse BLAST searches. The calculation can be performed by executing the cell below.


In [11]:
# Calculate RBBH for the two Pseudomonas datasets
# This returns three dataframes: df1 and df2 are the forward and reverse BLAST
# results (filtered, if any filters were used), and rbbh is the dataframe of
# reciprocal best BLAST hits
df1, df2, data_rbbh = rbbh.find_rbbh(data_fwd, data_rev)

We can inspect the dataframe of RBBH using the .head() and .describe() methods, by executing the cells below.


In [12]:
# Peek at the first few lines of the RBBH results
data_rbbh.head()


Out[12]:
query_id_x subject_id_x identity_x identity_y query_coverage_x query_coverage_y subject_coverage_x subject_coverage_y bitscore_x bitscore_y Evalue_x Evalue_y
0 WP_002551314.1 WP_031319201.1 91.73 91.73 100.000000 100.000000 100.000000 100.000000 246.0 246.0 3.000000e-85 3.000000e-85
1 WP_002551326.1 WP_003213601.1 94.32 94.32 100.634921 100.000000 100.000000 100.634921 625.0 625.0 0.000000e+00 0.000000e+00
2 WP_002551363.1 WP_017341421.1 75.81 75.81 94.416244 96.373057 96.373057 94.416244 244.0 258.0 2.000000e-82 8.000000e-88
3 WP_002551394.1 WP_017335870.1 96.36 96.36 94.827586 94.827586 94.827586 94.827586 511.0 511.0 0.000000e+00 0.000000e+00
4 WP_002551454.1 WP_003218037.1 94.42 94.42 100.000000 100.000000 100.000000 100.000000 380.0 380.0 1.000000e-135 9.000000e-136

In [13]:
# Show summary statistics for RBBH
data_rbbh.describe()


Out[13]:
identity_x identity_y query_coverage_x query_coverage_y subject_coverage_x subject_coverage_y bitscore_x bitscore_y Evalue_x Evalue_y
count 3535.000000 3535.000000 3535.000000 3535.000000 3535.000000 3535.000000 3535.000000 3535.000000 3.535000e+03 3.535000e+03
mean 75.608518 75.605519 96.491999 96.427486 96.406108 96.517690 497.937115 498.246704 6.220469e-03 6.569461e-03
std 17.037202 17.026326 11.426042 12.091670 12.046244 11.500334 375.018657 375.116295 9.138723e-02 1.097458e-01
min 20.970000 20.970000 5.509182 3.730018 3.730018 5.509182 23.100000 23.100000 0.000000e+00 0.000000e+00
25% 68.180000 68.105000 98.058232 98.380098 98.378378 98.097243 240.500000 241.000000 0.000000e+00 0.000000e+00
50% 80.310000 80.300000 100.000000 100.000000 100.000000 99.919290 424.000000 426.000000 3.000000e-150 4.000000e-151
75% 87.980000 87.990000 100.000000 100.000000 100.000000 100.000000 654.000000 655.000000 1.000000e-80 5.500000e-81
max 100.000000 100.000000 128.000000 119.727891 120.560748 128.000000 6872.000000 6873.000000 2.400000e+00 3.500000e+00

It is inevitable that the RBBH set will have the same or fewer protein pairs in it, than the number of proteins in the smallest of the forward and reverse protein sets. But how many proteins have been filtered in this comparison? We can find out by executing the cell below.


In [14]:
# Report the size of each of the forward and reverse input, and rbbh output dataframes
s = '\n'.join(["Forward BLAST input: {0} proteins",
               "Reverse BLAST input: {1} proteins",
               "RBBH output: {2} proteins"])
print(s.format(len(data_fwd), len(data_rev), len(data_rbbh)))
print("(min difference = {0})".format(min(len(data_fwd), len(data_rev)) - len(data_rbbh)))


Forward BLAST input: 5256 proteins
Reverse BLAST input: 6371 proteins
RBBH output: 3535 proteins
(min difference = 1721)
Approximately what proportion of best BLAST matches have been discarded?

Visualising RBBH output

We can get a better idea of what this processing has done by looking at a visual representation of the percentage identity and coverage of RBBH, compared to the (forward) one-way matches. We can do this by executing the cells below.

First, let's look at the percentage identity of best BLAST matches:


In [15]:
# Histogram of forward match percentage identity (one-way)
data_fwd.identity.hist(bins=100)


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x105e27630>

In [16]:
# Histogram of forward match percentage identity (RBBH)
data_rbbh.identity_x.hist(bins=100)


Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x10655f668>
What has been the effect of excluding best matches that do not have an RBBH reverse match?

Next, we can inspect the query and subject coverage of RBBH results, compared to the one-way forward BLAST matches by executing the cell below.


In [17]:
# Plot 2D histograms of query coverage against subject coverage for the 
# one-way forward matches, and those retained after calculating RBBH
rbbh.plot_hist2d(data_fwd.query_coverage, data_fwd.subject_coverage,
                      "one-way query COV", "one-way subject COV", 
                      "one-way coverage comparison")
rbbh.plot_hist2d(data_rbbh.query_coverage_x, data_rbbh.subject_coverage_x,
                      "RBBH (fwd) query COV", "RBBH (fwd) subject COV", 
                      "RBBH_comparisons.ipynbH coverage comparison")


  • Which one-way matches have been excluded by carrying out RBBH?
  • What is the biological significance of excluding those matches?
  • What would be a reasonable filter to exclude the remaining suspect matches?

Filtering RBBH output

The find_rbbh() function allows us to apply cutoff filters on percentage identity or coverage (or both) for an RBBH match - this, and visualisation of the results is done in the cells below.

NOTE: There is a software tool (CRB-BLAST - Conditional Reciprocal Best BLAST) available that calculates reciprocal best matches, and statistically evaluates an 'optimal' E-value cutoff, in order to improve accuracy of ortholog assignment.

In [18]:
# Calculate ID and coverage-filtered RBBH for the two Pseudomonas datasets
# This returns three dataframes: df1_filtered and df2_filtered are the 
# filtered forward and reverse BLAST results , and rbbh_filtered is the
# dataframe of reciprocal best BLAST hits
df1_filtered, df2_filtered, rbbh_filtered = rbbh.find_rbbh(data_fwd, data_rev, pid=40, cov=70)

# Histogram of forward match percentage identity (RBBH, filtered)
rbbh_filtered.identity_x.hist(bins=100)


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x105b412e8>

In [19]:
# Plot 2D histograms of query coverage against subject coverage for the 
# one-way forward matches retained after calculating RBBH and
# filtering on percentage identity and coverage
rbbh.plot_hist2d(rbbh_filtered.query_coverage_x, rbbh_filtered.subject_coverage_x,
                "filtered RBBH (fwd) query COV", "filtered_RBBH (fwd) subject COV", 
                "filtered RBBH coverage comparison")


Visualising RBBH with ACT

Finally for this exercise, we will visualise the RBBH between P. syringae B728a and P. fluorescens NCIMB 11764 using ACT (as in exercise 01), comparing the output to that obtained by a BLASTN comparison of the chromosomes.

First, we need to generate an output file describing our (filtered) RBBH that ACT can read. We do this by executing the cell below. This does two things:

  • Gets the locations of protein features on the chromosome of each organism from a .gbff file, using the helper function read_genbank(), putting them in a variable called features.
  • Writes the RBBH to a .crunch format file (pseudomonas_blastp/B728a_rbbh_NCIMB_11764.crunch), which ACT can read.

In [20]:
# Read feature locations for each Pseudomonas file
features = rbbh.read_genbank("data/pseudomonas/GCF_000988485.1_ASM98848v1_genomic.gbff",
                             "data/pseudomonas/GCF_000293885.2_ASM29388v3_genomic.gbff")

# Write a .crunch file of filtered RBBH for the Pseudomonas comparisons
rbbh.write_crunch(rbbh_filtered, features,
                  fwd="GCF_000988485.1_ASM98848v1_genomic",
                  rev="GCF_000293885.2_ASM29388v3_genomic",
                  outdir="data/pseudomonas_blastp",
                  filename="B728a_rbbh_NCIMB_11764.crunch")


Loaded 11197 features
Wrote file to data/pseudomonas_blastp/B728a_rbbh_NCIMB_11764.crunch

Now we can load the two genome sequences, a BLASTN comparison, and the RBBH comparisons into ACT.

If ACT is not already running, we start it at the terminal with:

act &

and open the file dialog, expanding it to show five fields, as in the previous exercise. First, select File -> Open to bring up the file selection dialog box.

Then click on the more files... button once, to give us two more fields.

Now we add the sequence files, placing the P. syringae B728a genome sequence (data/pseudomonas/GCF_000988485.1_ASM98848v1_genomic.fna) in Sequence file 2, and the P. fluorescens NCIMB 11764 genome (data/pseudomonas/GCF_000293885.2_ASM29388v3_genomic.fna) in two slots: Sequence file 1 and Sequence file 3, as shown.

Next, we add the RBBH output file data/pseudomonas_blastp/B728a_rbbh_NCIMB_11764.crunch to the field Comparison file 1:

And finally, we use a BLASTN comparison file (data/pseudomonas_blastn/B728a_blastn_NCIMB_11764.tab) comparing these two genomes, adding it to the field Comparison file 2:

and then click on the Apply button. This produces the initial alignment:

which we can zoom in on to inspect differences between the BLASTN and BLASTP comparisons.

  • How do the BLASTN and RBBH alignments compare to each other, in general?
  • What kinds of differences are there?
  • What sorts of biologically useful information can you get from each alignment that is not available from the other alignment?
  • When is RBBH an appropriate comparison, and when is whole-genome alignment an appropriate comparison?