In this worksheet, you will use Biopython
to download pathogen genome data from NCBI
programmatically with Python.
It is possible to obtain the same data by point-and-click from a browser, at the terminal using a program like wget
, or by other means, but scripting data downloads in this way has advantages, such as:
This Jupyter notebook provides some examples of scripting genome downloads from NCBI
singly, and in groups. This method of obtaining genome data uses the Entrez
interface that NCBI provides for automated querying of its data.
This is an interactive notebook, which means you are able to run the code that is written in each of the cells.
If this is successful, you should see the input marker to the left of the cell change from
In [ ]:
to (for example)
In [1]:
and you may see output appear below the cell.
Biopython
tutorial for Entrez
: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc109Biopython
technical documentation for Bio.Entrez
: http://biopython.org/DIST/docs/api/Bio.Entrez-module.htmlEntrez
introductory documentation at NCBI
: http://www.ncbi.nlm.nih.gov/books/NBK25497/Entrez
help: http://www.ncbi.nlm.nih.gov/books/NBK3837/Entrez
Quick Start Guide: http://www.ncbi.nlm.nih.gov/books/NBK25500/Entrez
Entrez
is the name NCBI
give to the tools they provide as a computational interface to the data they hold across their genomic and other databases (e.g. PubMed
). Many scripts and programs that interact with NCBI
to download data (e.g. from GenBank or RefSeq) will be using this set of tools.
Biopython
and Bio.Entrez
Biopython
is a widely-used library, providing bioinformatics tools for the popular Python programming language. Similar libraries exist for other programming languages.
Bio.Entrez
is a module of Biopython
that provides tools to make queries against the NCBI
databases using the Entrez
interface.
In [ ]:
# This line imports the Bio.Entrez module, and makes it available
# as 'Entrez'.
from Bio import Entrez
# The line below imports the Bio.SeqIO module, which allows reading
# and writing of common bioinformatics sequence formats.
from Bio import SeqIO
# Create a new directory (if needed) for output/downloads
import os
outdir = "ncbi_downloads"
os.makedirs(outdir, exist_ok=True)
# This line sets the variable 'Entrez.email' to the specified
# email address. You should substitute your own address for the
# example address provided below. Please do not provide a
# fake name.
Entrez.email = "Fakey.McFakename@example.com"
# This line sets the name of the tool that is making the queries
Entrez.tool = "Biopython_NCBI_Entrez_downloads.ipynb"
Bio.Entrez
to list available databasesWhen you send a query or request to NCBI
using Bio.Entrez
, the remote service will send back data in XML format. This is a file format designed to be easy for computers to read, but is very verbose and difficult to read for humans.
The Bio.Entrez
module can read()
this data so that you can extract useful information.
In the example below, you will ask NCBI
for a list of the databases you can search by using the Entrez.einfo()
function. This will return a handle containing the XML response from NCBI
. This will be read into a record that you can inspect and manipulate, by the Entrez.read()
function.
In [ ]:
# The line below uses the Entrez.einfo() function to
# ask NCBI what databases are available. The result is
# 'stored' in a variable called 'handle'
handle = Entrez.einfo()
# In the line below, the response from NCBI is read
# into a record, that organises NCBI's response into
# something you can work with.
record = Entrez.read(handle)
The variable record
contains a list of the available databases at NCBI
, which you can see by executing the cell below:
In [ ]:
print(record["DbList"])
You may recognise some of the database names, such as pubmed
, nuccore
, assembly
, sra
, and taxonomy
.
Entrez
allows you to query these databases using Entrez.esearch()
in much the same way that you just obtained the list of databases with Entrez.einfo()
.
Bio.Entrez
to find genome assemblies at NCBI
In the cells below, you will use Bio.Entrez
to identify assemblies for the bacterial plant pathogen Ralstonia solanacearum. As our interest is genome data, we will query against the assembly
database at NCBI
. This database contains entries for all genome assemblies, whether complete or draft.
We are interested in Ralstonia solanacearum, so will search against the assembly
database with the text "Ralstonia solanacearum"
as a query. The function that allows us to do this is Entrez.esearch()
. By default, searches are limited to 20 results (as on the NCBI
webpage), but we can change this.
In [ ]:
# The line below carries out a search of the `assembly` database at NCBI,
# using the phrase `Ralstonia solanacearum` as the search query,
# and asks NCBI to return up to the first 100 results
handle = Entrez.esearch(db="assembly", term="Ralstonia solanacearum", retmax=100)
# This line converts the returned information from NCBI into a form we
# can use, as before.
record = Entrez.read(handle)
The returned information can be viewed by running the cell below.
The output may look confusing at first, but it simply describes the database identifiers that uniquely identify the assemblies present in the assembly
database that correspond to the query we made, and a few other pieces of information (number of returned entries, total number of entries that could have been returned, how the query was processed) that we do not need, right now.
In [ ]:
# This line prints the downloaded information from NCBI, so
# we can read it.
print(record)
For now, we are interested in the list of database identifiers, in record['IdList']
. We will use these to get information from the assembly
database.
We will look at a single record first, and then consider how to get all the Ralstonia genomes at the same time.
NCBI
In this section, you will use one of the database identifiers returned from your search at NCBI
to identify and download the GenBank records corresponding to a single assembly of Ralstonia solanacearum.
To do this, we will select a single accession from the list in record["IdList"]
, using the code in the cell below.
In [ ]:
# The line below takes the first value in the list of
# database accessions record["IdList"], and places it in
# the variable 'accession'
accession = record["IdList"][0]
# Show the contents of the variable 'accession'
print(accession)
We need to specify the database for which we have the accession (or UID
), and which database we want to query for related records (in this case, nucleotide
).
In [ ]:
# The line below requests the identifiers (UIDs) for all
# records in the `nucleotide` database that correspond to the
# assembly UID that is stored in the variable 'accession'
handle = Entrez.elink(dbfrom="assembly", db="nucleotide",
from_uid=accession)
# We place the downloaded information in the variable 'links'
links = Entrez.read(handle)
The links
variable may contain links to more than one version of the genome (NCBI
keep third-party managed genome data in GenBank/INSDC records, and NCBI
-'owned' data in RefSeq records).
The function below extracts only the INSDC information from the Elink()
query. It is not important that you understand the code.
In [ ]:
# The code below provides a function that extracts nucleotide
# database accessions for INSDC data from the result of an
# Entrez.elink() query.
def extract_insdc(links):
"""Returns the link UIDs for RefSeq entries, from the
passed Elink search results"""
# Work only with INSDC accession UIDs
linkset = [ls for ls in links[0]['LinkSetDb'] if
ls['LinkName'] == 'assembly_nuccore_insdc']
if 0 == len(linkset): # There are no INSDC UIDs
raise ValueError("Elink() output has no assembly_nuccore_insdc data")
# Make a list of the INSDC UIDs
uids = [i['Id'] for i in linkset[0]['Link']]
return uids
You will use the extract_insdc()
function to get the accession IDs for the sequences in this Ralstonia solanacearum genome, in the cell below.
In [ ]:
# The line below uses the extract_insdc() function to get INSDC/GenBank
# accession UIDs for the components of the genome/assembly referred to
# in the 'links' variable. These will be stored in the variable
# 'nuc_uids'
nuc_uids = extract_insdc(links)
# Show the contents of 'nuc_uids'
print(nuc_uids)
NCBI
Now we have accession UIDs for the nucleotide sequences of the assembly, you will use Entrez.efetch
as before to fetch each sequence record from NCBI.
We need to tell NCBI
which database we want to use (in this case, nucleotide
), and the identifiers for the records (the values in nuc_uids
). To get all the data at the same time, we can join the accession ids into a single string, with commas to separate the individual UIDs.
We will also tell NCBI
two further pieces of information:
gbwithparts
) to obtain the genome sequence and feature annotations.text
).
In [ ]:
# The lines below retrieve (fetch) the GenBank records for
# each database entry specified in `nuc_uids`, in plain text
# format. These are parsed with Biopython's SeqIO module into
# SeqRecords, which structure the data into a usable format.
# The SeqRecords are placed in the variable 'records'.
records = []
for nuc_uid in nuc_uids:
handle = Entrez.efetch(db="nucleotide", rettype="gbwithparts", retmode="text",
id=nuc_uid)
records.append(SeqIO.read(handle, 'genbank'))
By running the cell below, you can see that each sequence in the Ralstonia solanacearum assembly has been downloaded into a SeqRecord
, and that it contains useful metadata, describing the sequence assembly and properties of the annotation.
In [ ]:
# Show the contents of each downloaded `SeqRecord`.
for record in records:
print(record, "\n")
Biopython
The SeqIO
module can be used to write sequence data out to a file on your local hard drive. You will do this in the cells below, using the SeqIO.write()
function.
Firstly, in the cell below, you will write GenBank format files that preserve both sequence and annotation data. For the SeqIO.write()
function, we need to specify the list of SeqRecord
s (records
), the output filename to which they will be written, and the format we wish to write (in this case "genbank"
).
In [ ]:
# The line below writes the sequence data in 'seqdata' to
# the local file "data/ralstonia.gbk", in GenBank format.
# The function returns the number of sequences that were written to file
SeqIO.write(records, os.path.join(outdir, "ralstonia.gbk"), "genbank")
If you inspect the newly-created ralstonia.gbk
file, you should see that it contains complete GenBank records, describing this genome.
GenBank files are detailed and large, and sometimes we only want to consider the genome sequence itself, not its annotation. The FASTA sequence can be written out on its own by specifyinf the "fasta"
format to SeqIO.write()
instead. This time, we write the output to data/ralstonia.fasta
.
In [ ]:
# The line below writes the sequence data in 'seqdata' to
# the local file "data/ralstonia.fasta", in FASTA format.
SeqIO.write(records, os.path.join(outdir, "ralstonia.fasta"), "fasta")