We will use Biopython's interface to Entrez, the interface to NCBI's online resources. There are some guidelines to its use that need to be followed, so you are not banned by NCBI:
Entrez.email
parameter correctly.Entrez.tool
parameter correctly.We start by importing packages, including Biopython's Entrez
module, and setting the appropriate parameters.
In [ ]:
import os
from Bio import Entrez, SeqIO
Entrez.email = "" # Use your own real email
Entrez.tool = "Biopython_get_GenBank_genomes.ipynb"
We're using Bio.Entrez.esearch()
with the genome
database to look for our search term. In this case, it's Pectobacterium, a genus of plant pathogenic bacteria.
We know we can look in the genome
database, because we checked by hand at http://www.ncbi.nlm.nih.gov/genome, and because it's one of the databases named if we look with Entrez.einfo()
.
In [ ]:
genus = "Pectobacterium"
query_text = "{0} AND bacteria[Organism]".format(genus)
handle = Entrez.esearch(db='genome', term=query_text)
record = Entrez.read(handle)
We can get the number of returned records by looking at record["Count"]
:
In [ ]:
record["Count"]
But what are our records? We can see their GenBank identifiers by looking at record["IdList"]
:
In [ ]:
record["IdList"]
But this isn't immediately informative. We're going to have to look at the assemblies associated with these identifiers in GenBank. We do this with Entrez.elink()
, searching for associations between the genome
database and the assembly
database, compiling all the resulting Link
UIDs in a single list.
What are the links we're allowed? Well, there's a big list at http://www.ncbi.nlm.nih.gov/corehtml/query/static/entrezlinks.html, but we can also inspect NCBI's web interface directly to see that http://www.ncbi.nlm.nih.gov/assembly is the likely prefix/database we're looking for.
In [ ]:
asm_links = []
for uid in record["IdList"]:
links = Entrez.read(Entrez.elink(dbfrom="genome", db="assembly", retmode="text", from_uid=uid))
[asm_links.append(d.values()[0]) for d in links[0]['LinkSetDb'][0]['Link']]
print("We find {0} genome entries: {1}".format(len(asm_links), asm_links))
Now we can recover links to the nucleotide
database for each of these UIDs. There may be several such links, but as we are looking for the full assembly, we care only about the assembly_nuccore_insdc
sequences, which are the contigs.
We collect these into a dictionary of contig UIDs, keyed by assembly UID, called sequid_links
:
In [ ]:
sequid_links = {}
for uid in asm_links:
links = Entrez.read(Entrez.elink(dbfrom="assembly", db="nucleotide", retmode="gb", from_uid=uid))
contigs = [l for l in links[0]['LinkSetDb'] if l['LinkName'] == 'assembly_nuccore_insdc'][0]
sequid_links[uid] = [e['Id'] for e in contigs['Link']]
expected_contigs = {}
print("There are {0} genomes identified for {1}:".format(len(sequid_links), genus))
for k, v in sorted(sequid_links.items()):
print("Assembly UID {0}: {1} contigs".format(k, len(v)))
expected_contigs[k] = len(v)
Once we have these nucleotide
database identifiers, we can grab all the sequences and write them out as multi-FASTA files, with Entrez.efetch()
. The assembly records themselves though, have to be obtained with Entrez.esummary()
, and then a byzantine set of keywords navigated, to get the information we're interested in.
We use the assembly UID without version number as the filename, and write a labels.txt
file suitable for use with pyani
.
In [ ]:
# Make sure there's a relevant output directory
if not os.path.exists(genus):
os.mkdir(genus)
if not os.path.exists("failures"):
os.mkdir("failures")
# Write output
with open(os.path.join(genus, 'labels.txt'), 'w') as lfh:
with open(os.path.join(genus, 'classes.txt'), 'w') as cfh:
for asm_uid, contigs in sorted(sequid_links.items()):
# Get assembly record information
asm_record = Entrez.read(Entrez.esummary(db='assembly', id=asm_uid, rettype='text'))
asm_organism = asm_record['DocumentSummarySet']['DocumentSummary'][0]['SpeciesName']
try:
asm_strain = asm_record['DocumentSummarySet']['DocumentSummary'][0]['Biosource']['InfraspeciesList'][0]['Sub_value']
except:
asm_strain = ""
gname = asm_record['DocumentSummarySet']['DocumentSummary'][0]['AssemblyAccession'].split('.')[0]
filestem = os.path.join(genus, gname)
# Write a labels.txt and a classes.txt file suitable for pyani
glab, species = asm_organism.split(' ', 1)
glab = glab[0]
labelstr = "{0}\t{1}. {2} {3}".format(gname, glab, species, asm_strain)
print >> lfh, labelstr
print >> cfh, "{0}\t{1}".format(gname, asm_organism)
print(labelstr)
# Get FASTA records for each of the contigs (we could do this with GenBank instead,
# but sometimes these are not formatted correctly with sequences)
query_uids = ','.join(contigs)
tries, success = 0, False
while success == False and tries < 20:
# Also check for total sequence length errors?
try:
print("UID:{0} download attempt {1}".format(asm_uid, tries + 1))
records = list(SeqIO.parse(Entrez.efetch(db='nucleotide', id=query_uids,
rettype="fasta", retmode='text'),
'fasta'))
if len(records) == expected_contigs[asm_uid]: # No exceptions, num records = expected
success = True
else: # No exceptions, but not all contigs
print('{0} records downloaded, expected {1}'.format(len(records),
expected_contigs[asm_uid]))
SeqIO.write(records, os.path.join("failures",
"{0}_{1}_failed.fasta".format(asm_uid, tries)),
'fasta')
tries += 1
except: # Catch any errors, incl. from SeqIO.parse and Entrez.efetch
tries += 1
if tries >= 10:
print("Download failed for {0}\n".format(labelstr))
print("UID:{0} has {1} records, total length {2}\n".format(asm_uid, len(records),
sum([len(r) for r in records])))
SeqIO.write(records, "{0}.fasta".format(filestem), 'fasta')
In [ ]: