Retrieving all GenBank genome sequences for a bacterial genus

Some tasks we are interested in need to be repeated for a set of genomes, whenever new isolate sequences become available. This notebook will take us through the process of collecting all genomes from GenBank that belong to a named genus.

Importing Biopython

We will use Biopython's interface to Entrez, the interface to NCBI's online resources. There are some guidelines to its use that need to be followed, so you are not banned by NCBI:

  1. For any series of more than 100 requests, conduct your work outwith US peak times.
  2. Make no more than three queries each second (enforced by Biopython).
  3. Set your Entrez.email parameter correctly.
  4. Set your Entrez.tool parameter correctly.
  5. Use session histories where appropriate.

We start by importing packages, including Biopython's Entrez module, and setting the appropriate parameters.


In [ ]:
import os
from Bio import Entrez, SeqIO

Entrez.email = ""  # Use your own real email
Entrez.tool = "Biopython_get_GenBank_genomes.ipynb"

Searching the database

We're using Bio.Entrez.esearch() with the genome database to look for our search term. In this case, it's Pectobacterium, a genus of plant pathogenic bacteria.

We know we can look in the genome database, because we checked by hand at http://www.ncbi.nlm.nih.gov/genome, and because it's one of the databases named if we look with Entrez.einfo().


In [ ]:
genus = "Pectobacterium"
query_text = "{0} AND bacteria[Organism]".format(genus)
handle = Entrez.esearch(db='genome', term=query_text)
record = Entrez.read(handle)

We can get the number of returned records by looking at record["Count"]:


In [ ]:
record["Count"]

But what are our records? We can see their GenBank identifiers by looking at record["IdList"]:


In [ ]:
record["IdList"]

But this isn't immediately informative. We're going to have to look at the assemblies associated with these identifiers in GenBank. We do this with Entrez.elink(), searching for associations between the genome database and the assembly database, compiling all the resulting Link UIDs in a single list.

What are the links we're allowed? Well, there's a big list at http://www.ncbi.nlm.nih.gov/corehtml/query/static/entrezlinks.html, but we can also inspect NCBI's web interface directly to see that http://www.ncbi.nlm.nih.gov/assembly is the likely prefix/database we're looking for.


In [ ]:
asm_links = []
for uid in record["IdList"]:
    links = Entrez.read(Entrez.elink(dbfrom="genome", db="assembly", retmode="text", from_uid=uid))
    [asm_links.append(d.values()[0]) for d in links[0]['LinkSetDb'][0]['Link']]
print("We find {0} genome entries: {1}".format(len(asm_links), asm_links))

Now we can recover links to the nucleotide database for each of these UIDs. There may be several such links, but as we are looking for the full assembly, we care only about the assembly_nuccore_insdc sequences, which are the contigs.

We collect these into a dictionary of contig UIDs, keyed by assembly UID, called sequid_links:


In [ ]:
sequid_links = {}
for uid in asm_links:
    links = Entrez.read(Entrez.elink(dbfrom="assembly", db="nucleotide", retmode="gb", from_uid=uid))
    contigs = [l for l in links[0]['LinkSetDb'] if l['LinkName'] == 'assembly_nuccore_insdc'][0]
    sequid_links[uid] = [e['Id'] for e in contigs['Link']]
expected_contigs = {}
print("There are {0} genomes identified for {1}:".format(len(sequid_links), genus))
for k, v in sorted(sequid_links.items()):
    print("Assembly UID {0}: {1} contigs".format(k, len(v)))
    expected_contigs[k] = len(v)

Once we have these nucleotide database identifiers, we can grab all the sequences and write them out as multi-FASTA files, with Entrez.efetch(). The assembly records themselves though, have to be obtained with Entrez.esummary(), and then a byzantine set of keywords navigated, to get the information we're interested in.

We use the assembly UID without version number as the filename, and write a labels.txt file suitable for use with pyani.


In [ ]:
# Make sure there's a relevant output directory
if not os.path.exists(genus):
    os.mkdir(genus)
if not os.path.exists("failures"):
    os.mkdir("failures")

# Write output
with open(os.path.join(genus, 'labels.txt'), 'w') as lfh:
    with open(os.path.join(genus, 'classes.txt'), 'w') as cfh:
        for asm_uid, contigs in sorted(sequid_links.items()):
            # Get assembly record information
            asm_record = Entrez.read(Entrez.esummary(db='assembly', id=asm_uid, rettype='text'))
            asm_organism = asm_record['DocumentSummarySet']['DocumentSummary'][0]['SpeciesName']
            try:
                asm_strain = asm_record['DocumentSummarySet']['DocumentSummary'][0]['Biosource']['InfraspeciesList'][0]['Sub_value']
            except:
                asm_strain = ""
            gname = asm_record['DocumentSummarySet']['DocumentSummary'][0]['AssemblyAccession'].split('.')[0]
            filestem = os.path.join(genus, gname)
            # Write a labels.txt and a classes.txt file suitable for pyani
            glab, species = asm_organism.split(' ', 1)
            glab = glab[0]
            labelstr = "{0}\t{1}. {2} {3}".format(gname, glab, species, asm_strain)
            print >> lfh, labelstr
            print >> cfh, "{0}\t{1}".format(gname, asm_organism)
            print(labelstr)
            # Get FASTA records for each of the contigs (we could do this with GenBank instead,
            # but sometimes these are not formatted correctly with sequences)
            query_uids = ','.join(contigs)
            tries, success = 0, False
            while success == False and tries < 20:
                # Also check for total sequence length errors?
                try:
                    print("UID:{0} download attempt {1}".format(asm_uid, tries + 1))
                    records = list(SeqIO.parse(Entrez.efetch(db='nucleotide', id=query_uids,
                                                             rettype="fasta", retmode='text'),
                                               'fasta'))
                    if len(records) == expected_contigs[asm_uid]: # No exceptions, num records = expected
                        success = True
                    else: # No exceptions, but not all contigs
                        print('{0} records downloaded, expected {1}'.format(len(records),
                                                                           expected_contigs[asm_uid]))
                        SeqIO.write(records, os.path.join("failures",
                                                          "{0}_{1}_failed.fasta".format(asm_uid, tries)),
                                                          'fasta')
                        tries += 1
                except:  # Catch any errors, incl. from SeqIO.parse and Entrez.efetch
                    tries += 1
            if tries >= 10:
                print("Download failed for {0}\n".format(labelstr))
            print("UID:{0} has {1} records, total length {2}\n".format(asm_uid, len(records),
                                                                       sum([len(r) for r in records])))
            SeqIO.write(records, "{0}.fasta".format(filestem), 'fasta')

In [ ]: