Read the chapter 4 and the sections 5.1, 9.3, 9.6, 9.15, and 9.16 of the Biopython tutorial. It is also recommended to read the sections 5.4, 5.5, 9.10, and 9.12, which are touched by this notebook.
Sequence records are modelled by the SeqRecord
class in the Bio.SeqRecord
module. Sequence records contain a sequence and its associated metadata (such as annotations).
In [ ]:
import Bio.SeqRecord as BSR
SeqRecord objects contain the following attributes:
seq
: sequence, typically a Seq
objectid
: primary identifier, stringname
: common name, stringdescription
: short human-readable description, stringletter_annotations
: metadata pertaining to single lettersfeatures
: structured metadata as SeqFeature objectsannotations
: unstructured metadatadbxrefs
: cross-referencesSeqRecord objects can be easily created.
In [ ]:
import Bio.Seq as BS
import Bio.Alphabet as BA
# sequence
seq = BS.Seq('MDGEDVQALVIDNGSGMCKA', BA.generic_protein)
# sequence record
record = BSR.SeqRecord(seq)
print(record)
In [ ]:
# get sequence from record
print(record.seq)
The attributes can be modified as needed.
In [ ]:
# add identifier
record.id = "AC500001"
# add name
record.name = "DUMMY"
# add description
record.description = "Dummy sequence"
print(record)
Unstructured annotations are organised into a dictionary by their name (dictionary key) and can contain basically anything.
In [ ]:
# add annotation named 'source'
record.annotations['source'] = "Imaginary organism"
print(record)
In [ ]:
print(record.annotations['source'])
Per-letter annotations are also organised by their name into a dictionary. The annotations must be iterable objects of the same length at the sequence.
In [ ]:
# add letter annotation named 'structure'
record.letter_annotations['structure'] = '---EEE--EEEEE-EEEEEE'
print(record)
In [ ]:
print(record.letter_annotations['structure'])
Features are a list of SeqFeature
objects (see below).
In [ ]:
# no features yet
print(record.features)
Cross-references are a list of strings.
In [ ]:
# add cross-references
record.dbxrefs = ['DB:dummy', 'Project:imaginary']
print(record)
SeqRecord can be modified (slicing, concatenation, etc.) to produce new records. In general, the letter annotations and features are preserved but annotations and cross-references are not. You must add them manually if needed. Remember to change the id, name, and description of the new record, too.
See the Biopython tutorial for details.
Higher-level information about a sequence is modelled by the SeqFeature
class in the Bio.SeqFeature
module. A SeqFeature object specifies a region of the sequence and associates some information to the region. The objects contain the following attributes:
type
: feature type as stringlocation
: sequence regionqualifiers
: dictionary of unstructured information
In [ ]:
import Bio.SeqFeature as BSF
In [ ]:
feature = BSF.SeqFeature(BSF.FeatureLocation(0,10), type='domain')
feature.qualifiers['evidence'] = 'experimental'
print(feature)
In [ ]:
# type
print(feature.type)
In [ ]:
# location
print(feature.location)
In [ ]:
# qualifiers
for key, value in feature.qualifiers.items():
print(key, ':', value)
A single position of a sequence can be described by ExactPosition
objects (but there are also fuzzy variants). Continuous and discontinuous ranges can be described by FeatureLocation
and CompoundLocation
objects, respectively. See the Bio.SeqFeature
module for details.
In [ ]:
# position 5 of sequence (zero-based indexing)
pos = BSF.ExactPosition(5)
print(repr(pos))
In [ ]:
# positions can treated as integers
print(repr(pos + 1))
In [ ]:
# region 6-10 of sequence (one-based, inclusive)
# which is 5:10 in Python-style indexing (zero-based, end exclusive)
loc1 = BSF.FeatureLocation(5, 10)
print(loc1)
In [ ]:
# region 21-25 (inclusive)
loc2 = BSF.FeatureLocation(20, 25)
print(loc2)
In [ ]:
# join of two regions (such as gene with two exons)
loc3 = BSF.CompoundLocation([loc1, loc2])
print(loc3)
Locations can be queried for their details.
In [ ]:
# start position
print(loc3.start)
In [ ]:
# end position
print(loc3.end)
In [ ]:
# continuous sub-regions
for p in loc3.parts:
print(p)
In [ ]:
# is the position 4 in the region?
print(4 in loc3)
In [ ]:
# is the position 8 in the region?
print(8 in loc3)
In [ ]:
# is one region within another? (not supported)
try:
print(loc1 in loc3)
except ValueError as e:
print(e)
In [ ]:
# full sequence
seq = BS.Seq('GGACTCTTAGCGGCTCACGCACTTTCTTCCGAAGACGGAACCCG', BA.generic_dna)
# two-exon gene within sequence
feature = BSF.SeqFeature(loc3, type='gene', strand=1)
In [ ]:
# the nucleotides specified by the feature
print(repr(feature.extract(seq)))
Features (or locations, in fact) are also aware of the strand (if applicable).
In [ ]:
# feature in reverse strand
loc4 = BSF.FeatureLocation(0, 5, strand=-1)
print(loc4)
In [ ]:
# strand (1 == forward, -1 == reverse)
print(loc4.strand)
In [ ]:
feature = BSF.SeqFeature(loc4, type='gene')
# the five nucleotides specified by the feature
# (note that the output is a reverse complement)
print(repr(feature.extract(seq)))
The Bio.Entrez
contains tools to query Entrez databases and fetch sequence records. They make use of the Entrez Programming Utilities (also known as EUtils), described in detail here. The output of these tools is usually in the XML format, which can be parsed by the utility functions of the Bio.Entrez
module.
Important: Be aware that the Entrez utilities impose usage limits and that you can get yourself blacklisted if you violate them. Biopython takes care of keeping an eye on those for you. You should also fill in a non-fake email address. See the Frequency, Timing and Registration of E-utility URL Requests section of the manual.
In [ ]:
import Bio.Entrez as BE
# set email globally
BE.email = 'your.name@example.com'
# remember to set your real email address before continuing to the examples below
SET-YOUR-REAL-EMAIL-ADDRESS-BEFORE-COMMENTING-OUT-THIS-LINE
The esearch
function can be used to search and retrieve primary IDs of records. It requires the database and the query as input and accepts some optional arguments as well. See the ESearch documentation for details.
In [ ]:
# search "nucleotide" database for the "accD" gene of "opuntia" organism
handle = BE.esearch(db="nucleotide",
term="opuntia[ORGN] accD[gene]",
# return ID list in XML format
rettype="uilist",
retmode="xml")
The output can be stored to a file for later use, which is a cost-effective approach for sequences in particular.
In [ ]:
# save to file
with open('results.xml', 'w') as f:
f.write(handle.read())
The read
function in Bio.Entrez
can parse the output of EUtils tools into a dictionary. It expects input that contains exactly one record.
In this case, the output is the query result that contains the retrieved IDs (and some other information).
In [ ]:
# load from file and parse
with open('results.xml') as f:
results = BE.read(f)
# (you could also parse 'handle' directly)
# retrieved IDs
print(results['IdList'])
The efetch
function can be used to retrieve specific records. It requires the database and the list of IDs as input and accepts some optional arguments as well. See the EFetch documentation for details.
The output format can be specified. The available formats depend on the database.
In [ ]:
# the ids of three records from the previous search
ids = results['IdList']
# fetch records from "nucleotide" database
handle = BE.efetch(db="nucleotide",
id=ids,
# return sequences in FASTA format
rettype="fasta",
retmode="text")
# save to file
with open('sequences.fasta', 'w') as f:
f.write(handle.read())
In [ ]:
# show file content
with open('sequences.fasta') as f:
print(f.read())
Full records can be obtained in the GenBank format in plain text (and in XML).
In [ ]:
# fetch records from "nucleotide" database
handle = BE.efetch(db="nucleotide",
id=ids,
# return sequences in GenBank plain text format
rettype="gb",
retmode="text")
# save to file
with open('sequences.gb', 'w') as f:
f.write(handle.read())
In [ ]:
# show file content
with open('sequences.gb') as f:
print(f.read())
The Bio.SeqIO
module contains functions to parse the retrieved sequence records into SeqRecord
objects. The read
function parses an input that contains a single record, whereas the parse
function returns a generator that produces all records from the input sequentially. They require the source (filename or file-like object) and the format as input.
(Bio.Entrez
can parse the XML output of efetch
but the result is a dictionary rather than SeqRecord
because Bio.Entrez
is intended for all Entrez databases and is not specific to sequence data.)
In [ ]:
import Bio.SeqIO as BSIO
Since parse
returns a generator, you can process each record at the time without storing all records in memory simultaneously.
In [ ]:
# parse and process FASTA sequences one at the time
for record in BSIO.parse('sequences.fasta', 'fasta'):
print(record)
print()
The amount of information varies between formats. You should use the format that is the most appropriate for your application.
In [ ]:
# parse records in GenBank format, which contains more information than FASTA
for record in BSIO.parse('sequences.gb', 'gb'):
print(record)
print()
The SeqRecord
objects can also be written to a file by Bio.SeqIO
.
In [ ]:
# write 'record' to file named 'record.gb' in a GenBank format
# (note that the first argument is a list of records)
BSIO.write([record], 'record.gb', 'gb')
In [ ]:
# search "nucleotide" database for the "accD" gene of "opuntia" organism using history
handle = BE.esearch(db="nucleotide",
term="opuntia[ORGN] accD[gene]",
# use history
usehistory="y",
# return ID list in XML format
rettype="uilist",
retmode="xml")
results = BE.read(handle)
The result will contain WebEnv
and QueryKey
information, which replace the list of IDs in efetch
. These pieces of information identify the result set of your search.
In [ ]:
webenv = results['WebEnv']
print(webenv)
In [ ]:
querykey = results['QueryKey']
print(querykey)
In [ ]:
# fetch records from "nucleotide" database
handle = BE.efetch(db="nucleotide",
webenv=webenv,
query_key=querykey,
rettype="gb",
retmode="text")
# save to file
with open('sequences.gb', 'w') as f:
f.write(handle.read())
A large number of records should be retrieved in batches, which can be achieved with the retstart
and retmax
arguments. The former indicates how many records to skip from the beginning and the latter how many records to retrieve.
Note that the example below parses the records directly but, also in this case, the retrieved records could be stored to a file before parsing.
In [ ]:
# function to query "nucleotide" database and retrieve a specific number of records in batches of a given size
def retrieve(query, retrieve_size, batch_size):
# search using history
handle = BE.esearch(db="nucleotide",
term=query,
usehistory="y",
rettype="uilist",
retmode="xml")
results = BE.read(handle)
webenv = results['WebEnv']
querykey = results['QueryKey']
# adjust maximum size if less than maximum number of hits available
retrieve_size = min(int(results['Count']), retrieve_size)
# split results into batches
for i in range(0, retrieve_size, batch_size):
# fetch records
handle = BE.efetch(db="nucleotide",
webenv=webenv,
query_key=querykey,
# fetch N records starting from i
retstart=i,
retmax=batch_size,
rettype="gb",
retmode="text")
for record in BSIO.parse(handle, 'gb'):
# acts as an iterator and give records out as they come
yield record
# example call to function
# (retrieve 10 records in batches of 5; usually these numbers are much larger)
for record in retrieve("opuntia[ORGN] accD[gene]", 10, 5):
print(record)
print()