BLAST
hitsAs discussed in the lecture, reciprocal Best BLAST
Hits (RBBH) are a common approximation to orthology for protein sequences, used in comparative genomics. An RBBH is found when two proteins, each in a different genome, find each other as the best scoring match in the other genome.
We will use the Biopython
libraries to interact with and manipulate sequence data, and the Pandas
data analysis libraries to manipulate numerical data and BLAST
output.
Most of the code you will use is imported from the local bs32010
module in this directory, to avoid clutter in this notebook. You can inspect this module if you are interested in how the functions work.
In [ ]:
%pylab inline
from bs32010 import ex06 # Local functions and data
BLASTP
comparisons for RBBHReciprocal best hits are built from "one-way" best hits. For organisms A
and B
, the best hits to the proteins of A
in B
would be found by a one-way BLASTP
search of the proteins of A
against those of B
. Likewise, the best hits to the proteins of B
in A
would be found by a one-way BLASTP
search of the proteins of B
against those of A
. This would give us two BLAST
output files:
A_vs_B.tab
B_vs_A.tab
Then, to identify the RBBHs for A
and B
, we would identify those proteins of A
and B
that find each other as the best-scoring match in both files.
The number of BLASTP
comparisons we need to perform increases quickly with the number of additional organisms we include. For three organisms (X
, Y
, Z
), we aim to produce three sets of RBBH, for each of the ways of combining the organisms in pairs:
X
vs Y
X
vs Z
Y
vs Z
and we would need to run two one-way BLASTP
searches for each of those outputs.
For $k$ organisms, we need to run $k^2 - k = k(k -1)$ BLASTP
searches, in order to obtain $\frac{k(k -1)}{2}$ RBBH output files. This is, for even reasonably small numbers of organisms, time-consuming and best automated by a script.
There are alternative, faster, sequence alignment/search programs that could be used, but the choice of program affects the final results, as found in Ward and Moreno-Hagelsieb (2014).
BLAST matches provide two parameters by which we can rank and filter results: coverage and identity.
These are useful parameters because they allow us to tune our BLAST
output to reflect the degree of sequence similarity (and therefore presumed functional similarity) we want to have in our final RBBH set.
To calculate percentage identity ($PID$) of the BLAST
match directly (though this value is provided for us to use in the BLASTP
output), we would calculate:
and to calculate query and subject coverage ($COV_{q}$ and $COV_{s}$, which we can request from custom BLASTP
output), we find:
BLAST
output dataYou will be working with the chromosomal protein complements of related bacteria are provided in the rbbh_data
directory:
NC_004547.faa
: Pectobacterium atrosepticum SCRI1043NC_013421.faa
: Pectobacterium wasabiae WPP163NC_010694.faa
: Erwinia tasmaniensis ET1_99and BLASTP comparisons of each protein sequence file that, to save time (this takes ≈15min on my laptop, without parallelisation), have already been conducted against both of the other protein sequence files:
NC_004547
vs NC_013421
NC_004547
vs NC_010694
NC_013421
vs NC_004547
NC_010694
vs NC_004547
NC_010694
vs NC_013421
using BLAST commands of the form:
blastp -query rbbh_data/NC_004547.faa -db rbbh_data/rbbh_output/NC_013421.faa \
-outfmt "6 qseqid sseqid qlen slen length nident pident evalue bitscore" \
-out rbbh_data/rbbh_output/NC_004547_vs_NC_013421.tab -max_target_seqs 1
If you would like to see how they were run, please examine the accompanying script file run_rbbh_blast.sh
The one-way BLAST
output is found in the rbbh_data/rbbh_output
directory:
rbbh_data/
├── NC_004547.faa
├── NC_004547.gbk
├── NC_010694.faa
├── NC_010694.gbk
├── NC_013421.faa
├── NC_013421.gbk
└── rbbh_output
├── NC_004547.faa.phr
├── NC_004547.faa.pin
├── NC_004547.faa.psq
├── NC_004547_vs_NC_010694.tab
├── NC_004547_vs_NC_013421.tab
├── NC_010694.faa.phr
├── NC_010694.faa.pin
├── NC_010694.faa.psq
├── NC_010694_vs_NC_004547.tab
├── NC_010694_vs_NC_013421.tab
├── NC_013421.faa.phr
├── NC_013421.faa.pin
├── NC_013421.faa.psq
└── NC_013421_vs_NC_004547.tab
One of the required comparison files is missing!
NC_013421
vs NC_010694
You will need to run the appropriate BLASTP
command (see above) to complete the analysis.
Exercise 1 (5min): Run the BLASTP
comparison, then inspect the rbbh_data/rbbh_output
directory to see the one-way BLASTP
output
BLASTP
outputThe BLASTP
output in rbbh_data/rbbh_output
is in plain text, tab-separated format. BLASTP
was run set to find only a single match for each protein query, and the output is not in standard BLASTP
form. The columns in each of the *.tab
files correspond to:
query_id, subject_id, query_length, subject_length, alignment_length, identical_sites, percentage_identity, E-value , bitscore
The function ex06.find_rbbh()
will find the RBBH for any pair of the three files NC_004547
, NC_013421
, and NC_010694
. It returns three Pandas
dataframes:
df1
: the forward direction one-way BLAST search datadf2
: the reverse direction one-way BLAST search datarbbh
: the corresponding set of reciprocal best hitsFor example, the code below shows the first few lines of each of the dataframes for the comparison between SCRI1043 and Erwinia tasmaniensis.
df1, df2, rbbh1 = ex06.find_rbbh('NC_004547', 'NC_010694')
df1.head()
df2.head()
rbbh1.head()
Exercise 2(5min): Run this code, and inspect the first few lines of each dataframe
In [ ]:
# Enter code here
Exercise 3 (5min): Perform the same analysis for the Pectobacterium chromosomes NC_004547
and NC_013421
, using the variable names df3
, df4
, and rbbh2
In [ ]:
# Enter code here
You can find the number of rows in a dataframe using the function len()
. For example, the dataframe df1
should have 4466 rows, which you can find by running the function len(df1)
.
Exercise 4 (5min): How many rows do the one-way hits for each analysis have, and how many RBBH result in each analysis?
In [ ]:
# Enter code here
By using the .describe()
method of a dataframe, we can get more summary statistical information, such as the means and quantiles of a column of data. For example, to get a summary of the information in rbbh1
we could run
rbbh1.describe()
Exercise 5 (5min): Use the .describe()
method for both RBBH results, and look at the data for percentage identity and percentage coverage.
In [ ]:
# Enter code here
In this section, you will use simple visualisation approaches to inspect the behaviour of the one-way and reciprocal BLAST output calculated above.
You will be viewing some of the basic relationships between BLAST's reported measures of match quality, using scatterplots.
To plot these relationships, you will use a helper function from ex06
: ex06.plot_scatter()
. This takes five arguments, in order:
For example,
ex06.plot_scatter(df1.bitscore, log10(df1.Evalue),
"bitscore", "log_{10}(E-value)",
"one-way E-value vs bitscore")
will plot log-transformed E-value against bitscore for the dataframe df1
, with the indicated labels and title.
Exercise 6 (5min): What is the relationship between E-value and bitscore for BLAST
hits? Use the code above to see if this is consistent for more than one set of comparisons.
In [ ]:
# Enter code here
Exercise 7 (5min): Use the plot_scatter()
function to investigate how E-value and bitscore vary with query length
HINT: df1.columns
will list the column names of the dataframe df1
.
In [ ]:
# Enter code here
You should be able to see that:
So, longer queries might make one-way matches look 'better' than they really are, if only E-values are considered.
You will now view some of the basic relationships between BLAST
's reported measures of match quality, using 2D histograms, or heatmaps.
In this representation, the X and Y values define a plane, which is divided into a grid (or sometimes hexfield) of cells. These cells are coloured by the number of points that fill them, so that variation in the frequency of datapoints can easily be seen.
BLAST
reports percentage identity of the aligned region, not percentage identity between the query and subject sequence. Two sequences that share a common motif exactly may, in some circumstances, result in an alignment with 100% identity, even though the sequences are quite different - even if they differ over nearly all their length.BLAST
, but is not, by default. As calculated above, it reflects the proportion of the query or subject sequence covered by the BLAST
alignment. It is useful therefore to distinguish between instances where a high percentage identity corresponds to a full-length protein match, and where it reflects only that a single domain is found to be in common between two proteins. Note that there are two possible coverage values: one for the query, and one for the subject.You will be plotting one-way BLAST
hit query coverage on the x-axis, and subject coverage on the y-axis. Cells are filled with colours from the default (jet
) colourmap. The ex06.plot_hist2d
function defines 100 bins in each axis by default, so there are approximately 10,000 cells in the resulting plots. In this case, cells with large numbers of points are coloured red, and those with few points are coloured blue (on a log scale). Empty cells are left white.
To use the ex06.plot_hist2d()
function, you specify arguments just as for the scatterplot function:
ex06.plot_hist2d(df1.query_coverage, df1.subject_coverage,
"one-way COVq", "one-way COVs",
"one-way coverage comparison")
Exercise 8 (5min): What is the relationship between query coverage and subject for BLAST
hits? Use the code above to see if this is consistent for more than one set of comparisons.
In [ ]:
# Enter code here
The relationship between query and subject sequence coverage is complex:
BLAST
hits) are at or near 100% coverage of both query and subject sequences.BLAST
hit covers less than 20% of the subject sequence.Exercise 9 (5min): What is the relationship between query coverage and percentage identity of a match for BLAST
hits? Use the code above to see if this is consistent for more than one set of comparisons.
In [ ]:
# Enter code here
You should again see complex relationship. The bulk of the data is found in the upper right corner of the plot, indicating that this is a full-length match to the query sequence, and that the match has nearly 100% identity.
But there is a second population that is smeared across the lower half of the graph, where the query coverage varies widely, and the percentage identity rarely rises above 50%
Now that we know how our one-way BLAST
matches behave, we can see how considering only reciprocal best BLAST
matches affects our results. You will now compare the relationship between the RBBHs and the one-way BLAST matches for the analyses you have run above.
Exercise 10 (10min): Produce heatmaps of query coverage against subject coverage for the one-way BLAST
hits, and the corresponding RBBH output, for one of your analyses.
HINT: You will want to use columns query_coverage_x
and subject_coverage_x
for your RBBH dataframe.
In [ ]:
# Enter code here
By retaining only RBBH, most of the complexity of the coverage plot has been discarded. In particular, the large cluster of low COVs sequences in the lower-left of the graph, corresponding to query sequences matching only a subdomain of the subject sequence, has disappeared.
Although there are still small numbers of RBBH where coverage is low in one, other or both query and subject sequences, the matches have been very efficiently reduced to the large set in the upper-right corner, indicative of full-length matches of both query and subject sequence, which are intuitively more likely to correspond to orthologues than matches in any other region of the plot.
Exercise 11 (10min): Produce heatmaps of query coverage against percentage identity for the one-way BLAST
hits, and the corresponding RBBH output, for one of your analyses.
HINT: You will want to use columns query_coverage_x
and percentage_identity_x
for your RBBH dataframe.
In [ ]:
# Enter code here
The complexity seen in the one-way BLAST match plot has all but disappeared when only RBBH are considered. This has left, almost exclusively, BLAST matches to the full query sequence, with a level of sequence identity above 60% for the alignment.
There are, again, small numbers of RBBH for which there is not a (nearly) full-length query sequence alignment, or for which the alignment identity is low. These can be removed by imposing a coverage and/or sequence identity filter, e.g. using the pid
and cov
arguments of the find_rbbh()
function.
Exercise 12 (5min): What threshold values of percentage identity and coverage do you think might be reasonable to restrict the RBBH results, to identify mostly orthologues, and why?
To view your RBBH comparisons in ACT
, you need to write the RBBH found above to a .crunch
format output file. This job is made easier because your data is collected in a Pandas
dataframe.
However, we do not yet have enough information to generate such a file, because you do not know the locations of your gene features, with respect to their source sequences. The read_genbank()
function in the ex06
module allows us to get this data, and the the write_crunch()
function will allow you to generate .crunch
output:
features = ex06.read_genbank("rbbh_data/NC_004547.gbk", "rbbh_data/NC_010694.gbk")
ex06.write_crunch(rbbh1, features, filename="NC_004547_vs_NC_010694.crunch")
Exercise 13 (5min): Modify the code above to write the results of your RBBH analysis to a .crunch
file for viewing with ACT
In [ ]:
# Enter code here
HINT: You can check the first few lines of the output file by executing the next cell
In [ ]:
!head NC_004547_vs_NC_010694.crunch