We often wish to establish an equivalence between the CDS features on two genomes - by which we mean some assertion that sequence A
on genome 1 is the "same thing" (in some sense) as sequence B
on genome 2. This equivalence can take many practical forms (same catalytic function, same binding interaction, same role in a pathway, and so on) but, given the volume of sequence data available today, is usually established on the basis of sequence similarity. This similarity is then taken as a proxy for the actual equivalence we're interested in.
When sequencing a new pathogen genome, or obtaining a novel transcriptome, we may want to annotate the coding sequences in that genome by determining orthologs - the equivalent sequences - in some other genome.
In this notebook, we will look at three methods (there are many others, but we are constrained by time!) of identifying equivalent sequence features in genomes, in bulk.
All three methods we will consider involve BLASTP
comparisons between the protein complements of a plant pathogen genome and a related non-pathogenic isolate. They can be considered to fall under three categories, and all depend on initial BLASTP
comparisons.
BLASTP
matchBLASTP
matchBLASTP
matchesWe will also need to run some Python
code to process and visualise the clustering output.
BLASTP
comparisons between protein complements for prokaryotesPython
and Pandas
to collect, examine and visualise tabular format dataBLAST
matchesBLAST
matches.If this is successful, you should see the input marker to the left of the cell change from
In [ ]:
to (for example)
In [1]:
and you may see output appear below the cell.
Software
Publications
Blogs
BLAST
matches (BBH)It is still common to see one-way matches used - even if only informally, or as a first attempt - as a means of identifying equivalent proteins/features in a genome. In this section, we'll carry out a one-way BLAST
search between the protein complements of the plant pathogen P. syringae B728a and its non-pathogenic relative P. fluorescens NCIMB 11764, and inspect the results graphically.
BLASTP
queryWe will use the blastp
command at the terminal to use every protein sequence in the P. syringae B728a annotation as a query against the predicted proteome of P. fluorescens NCIMB 11764.
The BLAST
databases have already been created for you to save time (using the scripts/02-cds_feature_comparisons.sh
script), and the results are in the pseudomonas_blastp
directory:
$ tree ./pseudomonas_blastp
./pseudomonas_blastp
├── GCF_000012245.1_ASM1224v1_protein.phr
├── GCF_000012245.1_ASM1224v1_protein.pin
├── GCF_000012245.1_ASM1224v1_protein.psq
├── GCF_000293885.2_ASM29388v3_protein.phr
├── GCF_000293885.2_ASM29388v3_protein.pin
├── GCF_000293885.2_ASM29388v3_protein.psq
├── GCF_000988485.1_ASM98848v1_protein.phr
├── GCF_000988485.1_ASM98848v1_protein.pin
└── GCF_000988485.1_ASM98848v1_protein.psq
We will use some custom settings to make our analysis easier to carry out.
To carry out the one-way BLASTP
search of P. syringae B728a against P. fluorescens NCIMB 11764, we would execute the following command in the terminal:
blastp -query pseudomonas/GCF_000988485.1_ASM98848v1_protein.faa \
-db pseudomonas_blastp/GCF_000293885.2_ASM29388v3_protein \
-max_target_seqs 1 \
-outfmt "6 qseqid sseqid qlen slen length nident pident qcovs evalue bitscore" \
-out pseudomonas_blastp/B728a_vs_NCIMB_11764.tab
This will take a few minutes to complete, so to save time the comparison has already been made for you, with the result file being placed in pseudomonas_blastp/B728a_vs_NCIMB_11764.tab
.
In [ ]:
%pylab inline
# Import helper module
from helpers import ex02
The first thing we do is load in the BLASTP
output we generated, so that we can plot some of the key features. We do that using the ex02.read_data()
function in the cell below. This puts the data into a dataframe called data_fwd
.
In [ ]:
# Load one-way BLAST results into a data frame called data_fwd
data_fwd = ex02.read_data("pseudomonas_blastp/B728a_vs_NCIMB_11764.tab")
In [ ]:
# Show first few lines of the loaded data
data_fwd.head()
In [ ]:
# Show descriptive statistics for the table
data_fwd.describe()
There are 5265 rows in this table, one for each of the query protein sequences in the P. syringae B728a annotation.
We can look at the distribution of values in the dataframe rows using the .hist()
method for any column of interest. For example, data_fwd.subject_length.hist()
plots a histogram of the values in the subject_length
column.
In [ ]:
# Plot a histogram of alignment lengths for the BLAST data
data_fwd.alignment_length.hist(bins=100)
In [ ]:
# Plot a histogram of percentage identity for the BLAST data
data_fwd.identity.hist(bins=100)
In [ ]:
# Plot a histogram of query_coverage for the BLAST data
data_fwd.query_coverage.hist(bins=100)
In [ ]:
# Plot a histogram of percentage coverage for the BLAST data
data_fwd.subject_coverage.hist(bins=100)
We can view the relationship between query coverage and subject coverage, and query coverage and match identity for these one-way best BLAST
hits by plotting a 2D histogram, with the helper function ex02.plot_hist2d()
in the cell below.
In [ ]:
# Plot 2D histogram of subject sequence (match) coverage against query
# sequence coverag
ex02.plot_hist2d(data_fwd.query_coverage, data_fwd.subject_coverage,
"one-way query COV", "one-way subject COV",
"one-way coverage comparison")
ex02.plot_hist2d(data_fwd.query_coverage, data_fwd.identity,
"one-way query COV", "one-way match PID",
"one-way coverage/identity comparison")
BLAST
matches (RBBH)To perform a reciprocal BLAST search between two sets of proteins S1
and S2
(say), we need to carry out the forward search of S1
vs S2
, and the reverse search S2
vs S1
.
Reciprocal best BLAST
matches are those where the sequence G(S1)
(a gene/CDS from sequence set S1
) used as a query makes its best BLAST
match to sequence G(S2)
(a gene/CDS from sequence set S2
), and when sequence G(S2)
is used as a query it makes its best match to sequence G(S1)
(see figure below).
We carried out the forward search above, for P. syringae B728a (our sequence set S1
) against P. fluorescens NCIMB 11764 (our sequence set S2
), and now we will carry out the corresponding reverse search by executing the command below at the terminal:
blastp -query pseudomonas/GCF_000293885.2_ASM29388v3_protein.faa \
-db pseudomonas_blastp/GCF_000988485.1_ASM98848v1_protein \
-max_target_seqs 1 \
-outfmt "6 qseqid sseqid qlen slen length nident pident qcovs evalue bitscore" \
-out pseudomonas_blastp/NCIMB_11764_vs_B728a.tab
As before, this would few minutes to complete, so to save some time the comparison has already been made for you, with the result file being placed in pseudomonas_blastp/NCIMB_11764_vs_B728a.tab
.
We'll load the results into a dataframe called data_rev
using the helper function ex02.read_data()
in the cell below.
In [ ]:
# Load one-way BLAST results into a data frame called data_fwd
data_rev = ex02.read_data("pseudomonas_blastp/NCIMB_11764_vs_B728a.tab")
The ex02
module provides a function called find_rbbh()
which calculates reciprocal best BLAST hits from forward and reverse BLAST
searches. The calculation can be performed by executing the cell below.
In [ ]:
# Calculate RBBH for the two Pseudomonas datasets
# This returns three dataframes: df1 and df2 are the forward and reverse BLAST
# results (filtered, if any filters were used), and rbbh is the dataframe of
# reciprocal best BLAST hits
df1, df2, rbbh = ex02.find_rbbh(data_fwd, data_rev)
We can inspect the dataframe of RBBH using the .head()
and .describe()
methods, by executing the cells below.
In [ ]:
# Peek at the first few lines of the RBBH results
rbbh.head()
In [ ]:
# Show summary statistics for RBBH
rbbh.describe()
It is inevitable that the RBBH set will have the same or fewer protein pairs in it, than the number of proteins in the smallest of the forward and reverse protein sets. But how many proteins have been filtered in this comparison? We can find out by executing the cell below.
In [ ]:
# Report the size of each of the forward and reverse input, and rbbh output dataframes
s = '\n'.join(["Forward BLAST input: {0} proteins",
"Reverse BLAST input: {1} proteins",
"RBBH output: {2} proteins"])
print(s.format(len(data_fwd), len(data_rev), len(rbbh)))
print("(min difference = {0})".format(min(len(data_fwd), len(data_rev)) - len(rbbh)))
We can get a better idea of what this processing has done by looking at a visual representation of the percentage identity and coverage of RBBH, compared to the (forward) one-way matches. We can do this by executing the cells below.
First, let's look at the percentage identity of best BLAST
matches:
In [ ]:
# Histogram of forward match percentage identity (one-way)
data_fwd.identity.hist(bins=100)
In [ ]:
# Histogram of forward match percentage identity (RBBH)
rbbh.identity_x.hist(bins=100)
Next, we can inspect the query and subject coverage of RBBH results, compared to the one-way forward BLAST
matches by executing the cell below.
In [ ]:
# Plot 2D histograms of query coverage against subject coverage for the
# one-way forward matches, and those retained after calculating RBBH
ex02.plot_hist2d(data_fwd.query_coverage, data_fwd.subject_coverage,
"one-way query COV", "one-way subject COV",
"one-way coverage comparison")
ex02.plot_hist2d(rbbh.query_coverage_x, rbbh.subject_coverage_x,
"RBBH (fwd) query COV", "RBBH (fwd) subject COV",
"RBBH coverage comparison")
The find_rbbh()
function allows us to apply cutoff filters on percentage identity or coverage (or both) for an RBBH match - this, and visualisation of the results is done in the cells below.
In [ ]:
# Calculate ID and coverage-filtered RBBH for the two Pseudomonas datasets
# This returns three dataframes: df1_filtered and df2_filtered are the
# filtered forward and reverse BLAST results , and rbbh_filtered is the
# dataframe of reciprocal best BLAST hits
df1_filtered, df2_filtered, rbbh_filtered = ex02.find_rbbh(data_fwd, data_rev, pid=40, cov=70)
# Histogram of forward match percentage identity (RBBH, filtered)
rbbh_filtered.identity_x.hist(bins=100)
In [ ]:
# Plot 2D histograms of query coverage against subject coverage for the
# one-way forward matches retained after calculating RBBH and
# filtering on percentage identity and coverage
ex02.plot_hist2d(rbbh_filtered.query_coverage_x, rbbh_filtered.subject_coverage_x,
"filtered RBBH (fwd) query COV", "filtered_RBBH (fwd) subject COV",
"filtered RBBH coverage comparison")
First, we need to generate an output file describing our (filtered) RBBH that ACT
can read. We do this by executing the cell below. This does two things:
.gbff
file, using the helper function read_genbank()
, putting them in a variable called features
..crunch
format file (pseudomonas_blastp/B728a_rbbh_NCIMB_11764.crunch
), which ACT
can read.
In [ ]:
# Read feature locations for each Pseudomonas file
features = ex02.read_genbank("pseudomonas/GCF_000988485.1_ASM98848v1_genomic.gbff",
"pseudomonas/GCF_000293885.2_ASM29388v3_genomic.gbff")
# Write a .crunch file of filtered RBBH for the Pseudomonas comparisons
ex02.write_crunch(rbbh_filtered, features,
fwd="GCF_000988485.1_ASM98848v1_genomic",
rev="GCF_000293885.2_ASM29388v3_genomic",
outdir="pseudomonas_blastp",
filename="B728a_rbbh_NCIMB_11764.crunch")
Now we can load the two genome sequences, the BLASTN
comparison, and the RBBH comparisons into ACT
.
If ACT
is not already running, we start it at the terminal as before:
act &
and open the file dialog, expanding it to show five fields, as in the previous exercise. First, select File -> Open
to bring up the file selection dialog box.
Then click on the more files...
button once, to give us two more fields.
Now we add the sequence files, placing the P. syringae B728a genome sequence (pseudomonas/GCF_000988485.1_ASM98848v1_genomic.fna
) in Sequence file 2
, and the P. fluorescens NCIMB 11764 genome (pseudomonas/GCF_000293885.2_ASM29388v3_genomic.fna
) in two slots: Sequence file 1
and Sequence file 3
, as shown.
Next, we add the RBBH output file pseudomonas_blastp/B728a_rbbh_NCIMB_11764.crunch
to the field Comparison file 1
:
And finally, we use the BLASTN
comparison file (pseudomonas_blastn/B728a_vs_NCIMB_11764.tab
) we prepared in the earlier exercise, adding it to the field Comparison file 2
:
and then click on the Apply
button. This produces the initial alignment:
which we can zoom in on to inspect differences between the BLASTN
and BLASTP
comparisons.