Downloading genome data from NCBI with Biopython and Entrez

Introduction

In this worksheet, you will use Biopython to download pathogen genome data from NCBI programmatically with Python.

It is possible to obtain the same data by point-and-click from a browser, at the terminal using a program like wget, or by other means, but scripting data downloads in this way has advantages, such as:

  • automation - only one script is required to download many sequences
  • reproducibility - the same data will be downloaded each time, and copy-paste errors will be avoided
  • self-documentation - the script itself describes exactly how the data was obtained
  • future adaptability (and reuse) - only minor changes to the script may be required for the next analysis or project
Note: large data sets: if you wish to download large datasets, then using wget, ftp or other methods can be better than programmatic access via Entrez. The Entrez interface may give errors partway through large downloads, and is not designed for large data transfers.

This Jupyter notebook provides some examples of scripting genome downloads from NCBI singly, and in groups. This method of obtaining genome data uses the Entrez interface that NCBI provides for automated querying of its data.

Running cells in this notebook

This is an interactive notebook, which means you are able to run the code that is written in each of the cells.

If this is successful, you should see the input marker to the left of the cell change from

In [ ]:

to (for example)

In [1]:

and you may see output appear below the cell.

Requirements

To complete this worksheet, you will need:
  • an active internet connection
  • the Biopython libraries

Entrez

Entrez is the name NCBI give to the tools they provide as a computational interface to the data they hold across their genomic and other databases (e.g. PubMed). Many scripts and programs that interact with NCBI to download data (e.g. from GenBank or RefSeq) will be using this set of tools.

Caveats
There are usage caps for this service, and it is possible to over-use Entrez. If this happens, you or your IP address may be blacklisted. In order to avoid this, you should keep to the following guidelines:
  • Make no more than three URL requests per second
  • Make large queries outwith the hours of 0900-1700 EST (1400-2200 GMT)
  • Provide your email address as an identifier when querying

Programming libraries, such as Biopython's Bio.Entrez module, will usually help you stay within those guidelines by limiting the frequency of queries, and insisting that you provide an email address.

Biopython and Bio.Entrez

Biopython is a widely-used library, providing bioinformatics tools for the popular Python programming language. Similar libraries exist for other programming languages.

Bio.Entrez is a module of Biopython that provides tools to make queries against the NCBI databases using the Entrez interface.

1. Connecting to NCBI

In order to use the Bio.Entrez module, you need to import it. This is how modules become available for use in Python.


In [ ]:
# This line imports the Bio.Entrez module, and makes it available
# as 'Entrez'.
from Bio import Entrez

# The line below imports the Bio.SeqIO module, which allows reading
# and writing of common bioinformatics sequence formats.
from Bio import SeqIO

# Create a new directory (if needed) for output/downloads
import os
outdir = "ncbi_downloads"
os.makedirs(outdir, exist_ok=True)

# This line sets the variable 'Entrez.email' to the specified
# email address. You should substitute your own address for the
# example address provided below. Please do not provide a
# fake name.
Entrez.email = "Fakey.McFakename@example.com"

# This line sets the name of the tool that is making the queries
Entrez.tool = "Biopython_NCBI_Entrez_downloads.ipynb"

2. Using Bio.Entrez to list available databases

When you send a query or request to NCBI using Bio.Entrez, the remote service will send back data in XML format. This is a file format designed to be easy for computers to read, but is very verbose and difficult to read for humans.

The Bio.Entrez module can read() this data so that you can extract useful information.

In the example below, you will ask NCBI for a list of the databases you can search by using the Entrez.einfo() function. This will return a handle containing the XML response from NCBI. This will be read into a record that you can inspect and manipulate, by the Entrez.read() function.


In [ ]:
# The line below uses the Entrez.einfo() function to
# ask NCBI what databases are available. The result is
# 'stored' in a variable called 'handle'
handle = Entrez.einfo()

# In the line below, the response from NCBI is read
# into a record, that organises NCBI's response into
# something you can work with.
record = Entrez.read(handle)

The variable record contains a list of the available databases at NCBI, which you can see by executing the cell below:


In [ ]:
print(record["DbList"])

You may recognise some of the database names, such as pubmed, nuccore, assembly, sra, and taxonomy.

Entrez allows you to query these databases using Entrez.esearch() in much the same way that you just obtained the list of databases with Entrez.einfo().

3. Using Bio.Entrez to find genome assemblies at NCBI

In the cells below, you will use Bio.Entrez to identify assemblies for the bacterial plant pathogen Ralstonia solanacearum. As our interest is genome data, we will query against the assembly database at NCBI. This database contains entries for all genome assemblies, whether complete or draft.

We are interested in Ralstonia solanacearum, so will search against the assembly database with the text "Ralstonia solanacearum" as a query. The function that allows us to do this is Entrez.esearch(). By default, searches are limited to 20 results (as on the NCBI webpage), but we can change this.


In [ ]:
# The line below carries out a search of the `assembly` database at NCBI,
# using the phrase `Ralstonia solanacearum` as the search query,
# and asks NCBI to return up to the first 100 results
handle = Entrez.esearch(db="assembly", term="Ralstonia solanacearum", retmax=100)

# This line converts the returned information from NCBI into a form we
# can use, as before.
record = Entrez.read(handle)

The returned information can be viewed by running the cell below.

The output may look confusing at first, but it simply describes the database identifiers that uniquely identify the assemblies present in the assembly database that correspond to the query we made, and a few other pieces of information (number of returned entries, total number of entries that could have been returned, how the query was processed) that we do not need, right now.


In [ ]:
# This line prints the downloaded information from NCBI, so
# we can read it.
print(record)

For now, we are interested in the list of database identifiers, in record['IdList']. We will use these to get information from the assembly database.

We will look at a single record first, and then consider how to get all the Ralstonia genomes at the same time.

4. Downloading a single genome from NCBI

In this section, you will use one of the database identifiers returned from your search at NCBI to identify and download the GenBank records corresponding to a single assembly of Ralstonia solanacearum.

To do this, we will select a single accession from the list in record["IdList"], using the code in the cell below.


In [ ]:
# The line below takes the first value in the list of 
# database accessions record["IdList"], and places it in
# the variable 'accession'
accession = record["IdList"][0]

# Show the contents of the variable 'accession'
print(accession)

Linking across databases

We need to specify the database for which we have the accession (or UID), and which database we want to query for related records (in this case, nucleotide).


In [ ]:
# The line below requests the identifiers (UIDs) for all
# records in the `nucleotide` database that correspond to the
# assembly UID that is stored in the variable 'accession'
handle = Entrez.elink(dbfrom="assembly", db="nucleotide",
                     from_uid=accession)

# We place the downloaded information in the variable 'links'
links = Entrez.read(handle)

The links variable may contain links to more than one version of the genome (NCBI keep third-party managed genome data in GenBank/INSDC records, and NCBI-'owned' data in RefSeq records).

The function below extracts only the INSDC information from the Elink() query. It is not important that you understand the code.


In [ ]:
# The code below provides a function that extracts nucleotide
# database accessions for INSDC data from the result of an
# Entrez.elink() query.
def extract_insdc(links):
    """Returns the link UIDs for RefSeq entries, from the
    passed Elink search results"""
    # Work only with INSDC accession UIDs
    linkset = [ls for ls in links[0]['LinkSetDb'] if
              ls['LinkName'] == 'assembly_nuccore_insdc']
    if 0 == len(linkset):  # There are no INSDC UIDs
        raise ValueError("Elink() output has no assembly_nuccore_insdc data")
    # Make a list of the INSDC UIDs
    uids = [i['Id'] for i in linkset[0]['Link']]
    return uids

You will use the extract_insdc() function to get the accession IDs for the sequences in this Ralstonia solanacearum genome, in the cell below.


In [ ]:
# The line below uses the extract_insdc() function to get INSDC/GenBank
# accession UIDs for the components of the genome/assembly referred to
# in the 'links' variable. These will be stored in the variable
# 'nuc_uids'
nuc_uids = extract_insdc(links)

# Show the contents of 'nuc_uids'
print(nuc_uids)

Fetching sequence records from NCBI

Now we have accession UIDs for the nucleotide sequences of the assembly, you will use Entrez.efetch as before to fetch each sequence record from NCBI.

We need to tell NCBI which database we want to use (in this case, nucleotide), and the identifiers for the records (the values in nuc_uids). To get all the data at the same time, we can join the accession ids into a single string, with commas to separate the individual UIDs.

We will also tell NCBI two further pieces of information:

  1. The format we want the data returned in. We will ask for GenBank format (gbwithparts) to obtain the genome sequence and feature annotations.
  2. How we want the data returned. We will ask for plain text (text).

In [ ]:
# The lines below retrieve (fetch) the GenBank records for
# each database entry specified in `nuc_uids`, in plain text
# format. These are parsed with Biopython's SeqIO module into
# SeqRecords, which structure the data into a usable format.
# The SeqRecords are placed in the variable 'records'.
records = []
for nuc_uid in nuc_uids:
    handle = Entrez.efetch(db="nucleotide", rettype="gbwithparts", retmode="text",
                          id=nuc_uid)
    records.append(SeqIO.read(handle, 'genbank'))

By running the cell below, you can see that each sequence in the Ralstonia solanacearum assembly has been downloaded into a SeqRecord, and that it contains useful metadata, describing the sequence assembly and properties of the annotation.


In [ ]:
# Show the contents of each downloaded `SeqRecord`.
for record in records:
    print(record, "\n")

Writing sequence data with Biopython

The SeqIO module can be used to write sequence data out to a file on your local hard drive. You will do this in the cells below, using the SeqIO.write() function.

Firstly, in the cell below, you will write GenBank format files that preserve both sequence and annotation data. For the SeqIO.write() function, we need to specify the list of SeqRecords (records), the output filename to which they will be written, and the format we wish to write (in this case "genbank").


In [ ]:
# The line below writes the sequence data in 'seqdata' to
# the local file "data/ralstonia.gbk", in GenBank format.
# The function returns the number of sequences that were written to file
SeqIO.write(records, os.path.join(outdir, "ralstonia.gbk"), "genbank")

If you inspect the newly-created ralstonia.gbk file, you should see that it contains complete GenBank records, describing this genome.

GenBank files are detailed and large, and sometimes we only want to consider the genome sequence itself, not its annotation. The FASTA sequence can be written out on its own by specifyinf the "fasta" format to SeqIO.write() instead. This time, we write the output to data/ralstonia.fasta.


In [ ]:
# The line below writes the sequence data in 'seqdata' to
# the local file "data/ralstonia.fasta", in FASTA format.
SeqIO.write(records, os.path.join(outdir, "ralstonia.fasta"), "fasta")