02b - Using NCBI BLAST+ Service with Biopython

Learning Outcomes

  • Use of NCBI remote BLAST+ using Python and Biopython in the Jupyter notebook
  • Creating a remote BLAST+ query, and its connection to running a BLAST+ search on the server
  • Reading BLAST+ output from the NCBI server into a Python variable
  • Using Entrez queries to customise your remote search
  • Computational analysis and visualisation of BLAST+ output

1. Introduction

This notebook presents examples of methods for using `BLAST` programmatically, with the webservice provided by NCBI. This can be used to automate your `BLAST` searches, so that you don't need to point-and-click your way through the `NCBI` website.

All calculations are run on NCBI's servers, using NCBI's databases (not a local `BLAST` installation), but you are controlling the search using `Python` code in this notebook.

There are advantages to using a programmatic interface for remote BLAST queries:

  • It is easy to set up repeatable searches for many sequences, or collections of sequences
  • It is easy to read in the search results and conduct downstream analyses that add value to your search

Where it is impractical to submit a large number of similar queries via a web form (because it is tiresome to point-and-click over and over again), this can be handled programmatically instead.

You have the opportunity to change custom options to help refine your query, just as in the website interface.

If you need to repeat a query, it can be trivial to get the same settings every time, if you use a programmatic approach.

2. Python imports

To interact with the NCBI's `BLAST` service, we will use the free `Biopython` programming tools. These provide an interface to interact with NCBI's `BLAST` server, run jobs, and to retrieve the output files.

We import these tools, and some standard library packages for working with files (os) below, in the same way we imported Python libraries elsewhere in the workshop.


In [ ]:
# Import Python libraries
%matplotlib inline

import os                      # This lets us interact with the operating system

from Bio import SeqIO          # This lets us handle reading/writing sequence data
from Bio.Blast import NCBIWWW  # This lets us run remote BLAST searches

You will run a `BLASTX` search, querying a nucleotide sequence against a protein database, to identify potential homologues. What is different about this search is that you will be conducting it at NCBI, using `Python` code running on your machine (or in the cloud).

You will use Biopython in the code blocks below to perform a BLASTX search - this will query a penicillin-binding protein from Kitasatospora against a restricted subset of the nr database - and write the results to file.

Running a remote `BLAST` search with `Biopython` is, in some ways, simpler than running a local `BLAST` query. The key steps are:
  1. Read the query sequence(s) from a source (possibly a local file, but maybe a remote database)
  2. Run a remote job with the NCBIWWW.qblast() method, specifying your query sequence, database, and BLAST program
  3. Parse the output you get back from NCBI

To run the remote job, you need the same kind of information as if you were running BLAST via the web interface - these arguments are compulsory:

  • the BLAST program to use
  • query sequence(s) to search with
  • the database to search in

but you can provide some extra choices when you run the remote job, including restricting the remote search on the basis of taxonomy - which we will do here.

A. Load Query Sequence

Firstly, you will need a query sequence for the search.

You will load the sequence for a penicillin-binding protein, reading it from a local `FASTA` file, using `Biopython's` `SeqIO()` module.

When data such as biological sequences are read in, their metadata - information on database IDs, and other features - follows them. `Biopython` does a nice job of showing us this information if we look at it with the `print()` function:

In [ ]:
# Define path to data directory
datadir = os.path.join("data", "blast")

# Load sequence of penicillin-binding protein, and inspect the information
seqfile = os.path.join(datadir, "k_sp_CB01950_penicillin.fasta")
seq = SeqIO.read(seqfile, "fasta")
print(seq)
However, the remote `BLAST` server requires us to present our sequence in `FASTA` format!

One of the clever things about `Biopython`'s sequence objects - and a big advantage of using programmatic approaches - is that we can readily convert our sequence information into a number of different formats.

To do this, we can use the sequence's `.format()` method to produce a `FASTA`-formatted string.

Doing this does not change the original sequence or its information in any way, but it creates a new presentation of that data, which we can use as our query:


In [ ]:
# We need the sequence as a string, so use the .format() method
print(seq.format("fasta"))
You are now almost ready to build your remote `BLAST` query

The last two things you need to do are to consider the database we're going to search against, and the format you want the data returned in.

B. Using Entrez queries to modify the remote database

`NCBI` provide a service called `Entrez E-Utilities` which allows for complex text-based searches to be defined. A detailed consideration of this very powerful tool is beyond the scope of this workshop, but we will use the service to replicate the `Organism` field of the NCBI web browser interface that allows us to restrict the `nr` database only to sequences from a particular taxon.

The remote NCBI BLAST+ service provides large, comprehensive databases such as nr, nt and refseq, but it doesn't provide very many specialised databases. Searches against large databases - where you don't care about most of the sequences, and most of the sequences are unlikely to match - can be wasteful and take longer than necessary to get the useful answer you're looking for. But a smaller, specialised organism-centric database - which would be much quicker to search - may not exist separately.

If you were using the NCBI website to carry out this BLAST query, you could use the Organism field in the web browser interface to specify Kitasatospora (taxid: 2063) when querying against the complete nr database. This would have the effect of only querying against sequences in nr that originated from Kitasatospora, reducing the overall search time and giving you the relevant hits you needed without producing a very large set of output.

  • To perform a similar filtering of the search using NCBIWWW.qblast(), you need to specify the argument entrez_query in that function.
  • The value to be passed is a string that describes the search field you want to filter on. *There are many search field options (see the list here), but we will use only one in this lesson: the organism field.

The `Entrez` query syntax is the same syntax that you would use in any browser-based NCBI database search, and is reported back to you when you refine searches at NCBI. Looking at that output is a good way to find interesting ways to slice large databases into smaller subsets of interest.

The organism field [ORGN]

The syntax for filtering on organism is <value>[ORGN] where <value> is the term you want to filter on. This can be the name of the organism, or an identifier such as the NCBI taxon ID.

C. Run BLASTX

In the cell below, the remote search is conducted on the `nr` database, but restricted only to the set of sequences originating from *Kitasatospora*, by specififying `entrez_query="kitasatospora[ORGN]"`:

Sometimes the remote `qblast` queries to NCBI do not run quickly!

In [ ]:
# Perform the BLAST search on the nr database, restricted to Kitasatospora spp.
result_handle = NCBIWWW.qblast("blastx", "nr", seq.format("fasta"),
                               entrez_query="kitasatospora[ORGN]",
                               format_type="Text")

In [ ]:
# Show human-readable results
output = result_handle.read()
print(output)

D. Save BLASTX results to file

The results you obtained above are human-readable, and similar to the default output type you saw from command-line/terminal `BLAST` in [notebook 02](02-blast_at_terminal.ipynb).

But, for now, they exist only in the variable called `output`. If we want to come back to these results at some time in the future, we will need to save them to a file somewhere.

This is a common operation in programmatic approaches to bioinformatics: once a result is obtained, we usually want to save it to a file. Most command-line programs will do this for you, but when working programmatically mostly you will need to save it explicitly, yourself.

The Python code for saving the contents of output to the file output/kitasatospora/remote_blastx_query_01.txt is given in the cell below:


In [ ]:
# Make directory to save output, if it doesn't exist
outdir = os.path.join("data", "blast", "output")
os.makedirs(outdir, exist_ok=True)

# Save output to file
outfilename = os.path.join(outdir, "remote_blastx_query_01.txt")
with open(outfilename, 'w') as outfh:
    outfh.write(output)

This code does three main things:

  1. It creates a variable called outfilename, with the path to the file we want to write
  2. It opens that file, ready for writing, as a handle called outfh
  3. It writes the contents of output into the outfh handle (and then closes the handle)

When this is done, the BLAST search results we got from NCBI are written to the file data/blast/output/remote_blastx_query_01.txt, just as though we did the search locally. You can inspect the contents of that file at the terminal using a command like:

cat data/blast/output/remote_blastx_query_01.txt
A neat feature of the Jupyter notebook is the ability to run commands in `code` cells, as if you were working at a command-line terminal.

To do this, prefix the command you want to run with an exclamation mark

In [ ]:
# Run a command in the shell
!cat data/blast/output/remote_blastx_query_01.txt

Biological Motivation

We would like to perform a `BLASTX` query with this penicillin-binding protein against a number of different taxa in `nr` to compare the results, using a definable E-value threshold, and inspect and compare the output.

To do this, we can use the code in the cell below to create a list of organism/taxon names, and loop over that list, performing a new BLASTX query each time, and writing a new output file for each.


In [ ]:
# Make directory to save output, if it doesn't exist
outdir = os.path.join("data", "blast", "output")
os.makedirs(outdir, exist_ok=True)

# Define a list of taxa to search against
taxa = ['corynebacterium', 'streptomyces', 'erwinia']

# Loop over taxa and perform a BLASTX search for each
for taxon in taxa:
    print("Performing BLASTX search against NCBI nr, restricted to %s" % taxon)
    # Perform BLASTX search
    result_handle = NCBIWWW.qblast("blastx", "nr", seq.format("fasta"),
                               entrez_query="%s[ORGN]" % taxon,
                               format_type="Text")
    # Write output to file
    outfilename = os.path.join(outdir, "remote_blastx_query_%s.txt" % taxon)
    with open(outfilename, 'w') as outfh:
        outfh.write(result_handle.read())