NCBI
remote BLAST+
using Python and Biopython in the Jupyter notebookBLAST+
query, and its connection to running a BLAST+
search on the serverBLAST+
output from the NCBI
server into a Python variableBLAST+
outputThere are advantages to using a programmatic interface for remote BLAST queries:
Where it is impractical to submit a large number of similar queries via a web form (because it is tiresome to point-and-click over and over again), this can be handled programmatically instead.
You have the opportunity to change custom options to help refine your query, just as in the website interface.
If you need to repeat a query, it can be trivial to get the same settings every time, if you use a programmatic approach.
We import these tools, and some standard library packages for working with files (os
) below, in the same way we imported Python libraries elsewhere in the workshop.
In [ ]:
# Import Python libraries
%matplotlib inline
import os # This lets us interact with the operating system
from Bio import SeqIO # This lets us handle reading/writing sequence data
from Bio.Blast import NCBIWWW # This lets us run remote BLAST searches
You will use Biopython
in the code blocks below to perform a BLASTX
search - this will query a penicillin-binding protein from Kitasatospora against a restricted subset of the nr
database - and write the results to file.
NCBIWWW.qblast()
method, specifying your query sequence, database, and BLAST
programTo run the remote job, you need the same kind of information as if you were running BLAST
via the web interface - these arguments are compulsory:
BLAST
program to usebut you can provide some extra choices when you run the remote job, including restricting the remote search on the basis of taxonomy - which we will do here.
In [ ]:
# Define path to data directory
datadir = os.path.join("data", "blast")
# Load sequence of penicillin-binding protein, and inspect the information
seqfile = os.path.join(datadir, "k_sp_CB01950_penicillin.fasta")
seq = SeqIO.read(seqfile, "fasta")
print(seq)
Doing this does not change the original sequence or its information in any way, but it creates a new presentation of that data, which we can use as our query:
In [ ]:
# We need the sequence as a string, so use the .format() method
print(seq.format("fasta"))
The remote NCBI BLAST+
service provides large, comprehensive databases such as nr
, nt
and refseq
, but it doesn't provide very many specialised databases. Searches against large databases - where you don't care about most of the sequences, and most of the sequences are unlikely to match - can be wasteful and take longer than necessary to get the useful answer you're looking for. But a smaller, specialised organism-centric database - which would be much quicker to search - may not exist separately.
If you were using the NCBI website to carry out this BLAST
query, you could use the Organism
field in the web browser interface to specify Kitasatospora (taxid: 2063)
when querying against the complete nr
database. This would have the effect of only querying against sequences in nr
that originated from Kitasatospora, reducing the overall search time and giving you the relevant hits you needed without producing a very large set of output.
NCBIWWW.qblast()
, you need to specify the argument entrez_query
in that function.organism
field.
In [ ]:
# Perform the BLAST search on the nr database, restricted to Kitasatospora spp.
result_handle = NCBIWWW.qblast("blastx", "nr", seq.format("fasta"),
entrez_query="kitasatospora[ORGN]",
format_type="Text")
In [ ]:
# Show human-readable results
output = result_handle.read()
print(output)
BLASTX
results to fileThis is a common operation in programmatic approaches to bioinformatics: once a result is obtained, we usually want to save it to a file. Most command-line programs will do this for you, but when working programmatically mostly you will need to save it explicitly, yourself.
The Python code for saving the contents of output
to the file output/kitasatospora/remote_blastx_query_01.txt
is given in the cell below:
In [ ]:
# Make directory to save output, if it doesn't exist
outdir = os.path.join("data", "blast", "output")
os.makedirs(outdir, exist_ok=True)
# Save output to file
outfilename = os.path.join(outdir, "remote_blastx_query_01.txt")
with open(outfilename, 'w') as outfh:
outfh.write(output)
This code does three main things:
outfilename
, with the path to the file we want to writeoutfh
output
into the outfh
handle (and then closes the handle)When this is done, the BLAST
search results we got from NCBI are written to the file data/blast/output/remote_blastx_query_01.txt
, just as though we did the search locally. You can inspect the contents of that file at the terminal using a command like:
cat data/blast/output/remote_blastx_query_01.txt
In [ ]:
# Run a command in the shell
!cat data/blast/output/remote_blastx_query_01.txt
BLASTX
searchTo do this, we can use the code in the cell below to create a list
of organism/taxon names, and loop over that list, performing a new BLASTX
query each time, and writing a new output file for each.
In [ ]:
# Make directory to save output, if it doesn't exist
outdir = os.path.join("data", "blast", "output")
os.makedirs(outdir, exist_ok=True)
# Define a list of taxa to search against
taxa = ['corynebacterium', 'streptomyces', 'erwinia']
# Loop over taxa and perform a BLASTX search for each
for taxon in taxa:
print("Performing BLASTX search against NCBI nr, restricted to %s" % taxon)
# Perform BLASTX search
result_handle = NCBIWWW.qblast("blastx", "nr", seq.format("fasta"),
entrez_query="%s[ORGN]" % taxon,
format_type="Text")
# Write output to file
outfilename = os.path.join(outdir, "remote_blastx_query_%s.txt" % taxon)
with open(outfilename, 'w') as outfh:
outfh.write(result_handle.read())