BLAST hitsAs discussed in the lecture, reciprocal Best BLAST Hits (RBBH) are a common approximation to orthology for protein sequences, used in comparative genomics. An RBBH is found when two proteins, each in a different genome, find each other as the best scoring match in the other genome.
We will use the Biopython libraries to interact with and manipulate sequence data, and the Pandas data analysis libraries to manipulate numerical data and BLAST output.
Most of the code you will use is imported from the local bs32010 module in this directory, to avoid clutter in this notebook. You can inspect this module if you are interested in how the functions work.
In [ ]:
%pylab inline
from bs32010 import ex06 # Local functions and data
BLASTP comparisons for RBBHReciprocal best hits are built from "one-way" best hits. For organisms A and B, the best hits to the proteins of A in B would be found by a one-way BLASTP search of the proteins of A against those of B. Likewise, the best hits to the proteins of B in A would be found by a one-way BLASTP search of the proteins of B against those of A. This would give us two BLAST output files:
A_vs_B.tabB_vs_A.tabThen, to identify the RBBHs for A and B, we would identify those proteins of A and B that find each other as the best-scoring match in both files.
The number of BLASTP comparisons we need to perform increases quickly with the number of additional organisms we include. For three organisms (X, Y, Z), we aim to produce three sets of RBBH, for each of the ways of combining the organisms in pairs:
X vs YX vs ZY vs Zand we would need to run two one-way BLASTP searches for each of those outputs.
For $k$ organisms, we need to run $k^2 - k = k(k -1)$ BLASTP searches, in order to obtain $\frac{k(k -1)}{2}$ RBBH output files. This is, for even reasonably small numbers of organisms, time-consuming and best automated by a script.
There are alternative, faster, sequence alignment/search programs that could be used, but the choice of program affects the final results, as found in Ward and Moreno-Hagelsieb (2014).
BLAST matches provide two parameters by which we can rank and filter results: coverage and identity.
These are useful parameters because they allow us to tune our BLAST output to reflect the degree of sequence similarity (and therefore presumed functional similarity) we want to have in our final RBBH set.
To calculate percentage identity ($PID$) of the BLAST match directly (though this value is provided for us to use in the BLASTP output), we would calculate:
and to calculate query and subject coverage ($COV_{q}$ and $COV_{s}$, which we can request from custom BLASTP output), we find:
BLAST output dataYou will be working with the chromosomal protein complements of related bacteria are provided in the rbbh_data directory:
NC_004547.faa: Pectobacterium atrosepticum SCRI1043NC_013421.faa: Pectobacterium wasabiae WPP163NC_010694.faa: Erwinia tasmaniensis ET1_99and BLASTP comparisons of each protein sequence file that, to save time (this takes ≈15min on my laptop, without parallelisation), have already been conducted against both of the other protein sequence files:
NC_004547 vs NC_013421NC_004547 vs NC_010694NC_013421 vs NC_004547NC_010694 vs NC_004547NC_010694 vs NC_013421using BLAST commands of the form:
blastp -query rbbh_data/NC_004547.faa -db rbbh_data/rbbh_output/NC_013421.faa \
-outfmt "6 qseqid sseqid qlen slen length nident pident evalue bitscore" \
-out rbbh_data/rbbh_output/NC_004547_vs_NC_013421.tab -max_target_seqs 1
If you would like to see how they were run, please examine the accompanying script file run_rbbh_blast.sh
The one-way BLAST output is found in the rbbh_data/rbbh_output directory:
rbbh_data/
├── NC_004547.faa
├── NC_004547.gbk
├── NC_010694.faa
├── NC_010694.gbk
├── NC_013421.faa
├── NC_013421.gbk
└── rbbh_output
├── NC_004547.faa.phr
├── NC_004547.faa.pin
├── NC_004547.faa.psq
├── NC_004547_vs_NC_010694.tab
├── NC_004547_vs_NC_013421.tab
├── NC_010694.faa.phr
├── NC_010694.faa.pin
├── NC_010694.faa.psq
├── NC_010694_vs_NC_004547.tab
├── NC_010694_vs_NC_013421.tab
├── NC_013421.faa.phr
├── NC_013421.faa.pin
├── NC_013421.faa.psq
└── NC_013421_vs_NC_004547.tab
One of the required comparison files is missing!
NC_013421 vs NC_010694You will need to run the appropriate BLASTP command (see above) to complete the analysis.
Exercise 1 (5min): Run the BLASTP comparison, then inspect the rbbh_data/rbbh_output directory to see the one-way BLASTP output
BLASTP outputThe BLASTP output in rbbh_data/rbbh_output is in plain text, tab-separated format. BLASTP was run set to find only a single match for each protein query, and the output is not in standard BLASTP form. The columns in each of the *.tab files correspond to:
query_id, subject_id, query_length, subject_length, alignment_length, identical_sites, percentage_identity, E-value , bitscore
The function ex06.find_rbbh() will find the RBBH for any pair of the three files NC_004547, NC_013421, and NC_010694. It returns three Pandas dataframes:
df1: the forward direction one-way BLAST search datadf2: the reverse direction one-way BLAST search datarbbh: the corresponding set of reciprocal best hitsFor example, the code below shows the first few lines of each of the dataframes for the comparison between SCRI1043 and Erwinia tasmaniensis.
df1, df2, rbbh1 = ex06.find_rbbh('NC_004547', 'NC_010694')
df1.head()
df2.head()
rbbh1.head()
Exercise 2(5min): Run this code, and inspect the first few lines of each dataframe
In [ ]:
# Enter code here
Exercise 3 (5min): Perform the same analysis for the Pectobacterium chromosomes NC_004547 and NC_013421, using the variable names df3, df4, and rbbh2
In [ ]:
# Enter code here
You can find the number of rows in a dataframe using the function len(). For example, the dataframe df1 should have 4466 rows, which you can find by running the function len(df1).
Exercise 4 (5min): How many rows do the one-way hits for each analysis have, and how many RBBH result in each analysis?
In [ ]:
# Enter code here
By using the .describe() method of a dataframe, we can get more summary statistical information, such as the means and quantiles of a column of data. For example, to get a summary of the information in rbbh1 we could run
rbbh1.describe()
Exercise 5 (5min): Use the .describe() method for both RBBH results, and look at the data for percentage identity and percentage coverage.
In [ ]:
# Enter code here
In this section, you will use simple visualisation approaches to inspect the behaviour of the one-way and reciprocal BLAST output calculated above.
You will be viewing some of the basic relationships between BLAST's reported measures of match quality, using scatterplots.
To plot these relationships, you will use a helper function from ex06: ex06.plot_scatter(). This takes five arguments, in order:
For example,
ex06.plot_scatter(df1.bitscore, log10(df1.Evalue),
"bitscore", "log_{10}(E-value)",
"one-way E-value vs bitscore")
will plot log-transformed E-value against bitscore for the dataframe df1, with the indicated labels and title.
Exercise 6 (5min): What is the relationship between E-value and bitscore for BLAST hits? Use the code above to see if this is consistent for more than one set of comparisons.
In [ ]:
# Enter code here
Exercise 7 (5min): Use the plot_scatter() function to investigate how E-value and bitscore vary with query length
HINT: df1.columns will list the column names of the dataframe df1.
In [ ]:
# Enter code here
You should be able to see that:
So, longer queries might make one-way matches look 'better' than they really are, if only E-values are considered.
You will now view some of the basic relationships between BLAST's reported measures of match quality, using 2D histograms, or heatmaps.
In this representation, the X and Y values define a plane, which is divided into a grid (or sometimes hexfield) of cells. These cells are coloured by the number of points that fill them, so that variation in the frequency of datapoints can easily be seen.
BLAST reports percentage identity of the aligned region, not percentage identity between the query and subject sequence. Two sequences that share a common motif exactly may, in some circumstances, result in an alignment with 100% identity, even though the sequences are quite different - even if they differ over nearly all their length.BLAST, but is not, by default. As calculated above, it reflects the proportion of the query or subject sequence covered by the BLAST alignment. It is useful therefore to distinguish between instances where a high percentage identity corresponds to a full-length protein match, and where it reflects only that a single domain is found to be in common between two proteins. Note that there are two possible coverage values: one for the query, and one for the subject.You will be plotting one-way BLAST hit query coverage on the x-axis, and subject coverage on the y-axis. Cells are filled with colours from the default (jet) colourmap. The ex06.plot_hist2d function defines 100 bins in each axis by default, so there are approximately 10,000 cells in the resulting plots. In this case, cells with large numbers of points are coloured red, and those with few points are coloured blue (on a log scale). Empty cells are left white.
To use the ex06.plot_hist2d() function, you specify arguments just as for the scatterplot function:
ex06.plot_hist2d(df1.query_coverage, df1.subject_coverage,
"one-way COVq", "one-way COVs",
"one-way coverage comparison")
Exercise 8 (5min): What is the relationship between query coverage and subject for BLAST hits? Use the code above to see if this is consistent for more than one set of comparisons.
In [ ]:
# Enter code here
The relationship between query and subject sequence coverage is complex:
BLAST hits) are at or near 100% coverage of both query and subject sequences.BLAST hit covers less than 20% of the subject sequence.Exercise 9 (5min): What is the relationship between query coverage and percentage identity of a match for BLAST hits? Use the code above to see if this is consistent for more than one set of comparisons.
In [ ]:
# Enter code here
You should again see complex relationship. The bulk of the data is found in the upper right corner of the plot, indicating that this is a full-length match to the query sequence, and that the match has nearly 100% identity.
But there is a second population that is smeared across the lower half of the graph, where the query coverage varies widely, and the percentage identity rarely rises above 50%
Now that we know how our one-way BLAST matches behave, we can see how considering only reciprocal best BLAST matches affects our results. You will now compare the relationship between the RBBHs and the one-way BLAST matches for the analyses you have run above.
Exercise 10 (10min): Produce heatmaps of query coverage against subject coverage for the one-way BLAST hits, and the corresponding RBBH output, for one of your analyses.
HINT: You will want to use columns query_coverage_x and subject_coverage_x for your RBBH dataframe.
In [ ]:
# Enter code here
By retaining only RBBH, most of the complexity of the coverage plot has been discarded. In particular, the large cluster of low COVs sequences in the lower-left of the graph, corresponding to query sequences matching only a subdomain of the subject sequence, has disappeared.
Although there are still small numbers of RBBH where coverage is low in one, other or both query and subject sequences, the matches have been very efficiently reduced to the large set in the upper-right corner, indicative of full-length matches of both query and subject sequence, which are intuitively more likely to correspond to orthologues than matches in any other region of the plot.
Exercise 11 (10min): Produce heatmaps of query coverage against percentage identity for the one-way BLAST hits, and the corresponding RBBH output, for one of your analyses.
HINT: You will want to use columns query_coverage_x and percentage_identity_x for your RBBH dataframe.
In [ ]:
# Enter code here
The complexity seen in the one-way BLAST match plot has all but disappeared when only RBBH are considered. This has left, almost exclusively, BLAST matches to the full query sequence, with a level of sequence identity above 60% for the alignment.
There are, again, small numbers of RBBH for which there is not a (nearly) full-length query sequence alignment, or for which the alignment identity is low. These can be removed by imposing a coverage and/or sequence identity filter, e.g. using the pid and cov arguments of the find_rbbh() function.
Exercise 12 (5min): What threshold values of percentage identity and coverage do you think might be reasonable to restrict the RBBH results, to identify mostly orthologues, and why?
To view your RBBH comparisons in ACT, you need to write the RBBH found above to a .crunch format output file. This job is made easier because your data is collected in a Pandas dataframe.
However, we do not yet have enough information to generate such a file, because you do not know the locations of your gene features, with respect to their source sequences. The read_genbank() function in the ex06 module allows us to get this data, and the the write_crunch() function will allow you to generate .crunch output:
features = ex06.read_genbank("rbbh_data/NC_004547.gbk", "rbbh_data/NC_010694.gbk")
ex06.write_crunch(rbbh1, features, filename="NC_004547_vs_NC_010694.crunch")
Exercise 13 (5min): Modify the code above to write the results of your RBBH analysis to a .crunch file for viewing with ACT
In [ ]:
# Enter code here
HINT: You can check the first few lines of the output file by executing the next cell
In [ ]:
!head NC_004547_vs_NC_010694.crunch