Sequence records, part 1

Instructions

Read the chapter 4 and the sections 5.1, 9.3, 9.6, 9.15, and 9.16 of the Biopython tutorial. It is also recommended to read the sections 5.4, 5.5, 9.10, and 9.12, which are touched by this notebook.

Objectives

  • create and manipulate SeqRecord and SeqFeature objects
  • search and fetch sequence records from Entrez databases
  • read sequence records from files

Summary

Sequence records are modelled by the SeqRecord class in the Bio.SeqRecord module. Sequence records contain a sequence and its associated metadata (such as annotations).


In [ ]:
import Bio.SeqRecord as BSR

SeqRecord objects contain metadata

SeqRecord objects contain the following attributes:

  • seq : sequence, typically a Seq object
  • id : primary identifier, string
  • name : common name, string
  • description : short human-readable description, string
  • letter_annotations : metadata pertaining to single letters
  • features : structured metadata as SeqFeature objects
  • annotations : unstructured metadata
  • dbxrefs : cross-references

SeqRecord objects can be easily created.


In [ ]:
import Bio.Seq as BS
import Bio.Alphabet as BA

# sequence
seq = BS.Seq('MDGEDVQALVIDNGSGMCKA', BA.generic_protein)
# sequence record
record = BSR.SeqRecord(seq)
print(record)

In [ ]:
# get sequence from record
print(record.seq)

The attributes can be modified as needed.


In [ ]:
# add identifier
record.id = "AC500001"
# add name
record.name = "DUMMY"
# add description
record.description = "Dummy sequence"
print(record)

Unstructured annotations are organised into a dictionary by their name (dictionary key) and can contain basically anything.


In [ ]:
# add annotation named 'source'
record.annotations['source'] = "Imaginary organism"
print(record)

In [ ]:
print(record.annotations['source'])

Per-letter annotations are also organised by their name into a dictionary. The annotations must be iterable objects of the same length at the sequence.


In [ ]:
# add letter annotation named 'structure'
record.letter_annotations['structure'] = '---EEE--EEEEE-EEEEEE'
print(record)

In [ ]:
print(record.letter_annotations['structure'])

Features are a list of SeqFeature objects (see below).


In [ ]:
# no features yet
print(record.features)

Cross-references are a list of strings.


In [ ]:
# add cross-references
record.dbxrefs = ['DB:dummy', 'Project:imaginary']
print(record)

SeqRecord can be modified (slicing, concatenation, etc.) to produce new records. In general, the letter annotations and features are preserved but annotations and cross-references are not. You must add them manually if needed. Remember to change the id, name, and description of the new record, too.

See the Biopython tutorial for details.

Features are SeqFeature objects

Higher-level information about a sequence is modelled by the SeqFeature class in the Bio.SeqFeature module. A SeqFeature object specifies a region of the sequence and associates some information to the region. The objects contain the following attributes:

  • type: feature type as string
  • location: sequence region
  • qualifiers: dictionary of unstructured information

In [ ]:
import Bio.SeqFeature as BSF

In [ ]:
feature = BSF.SeqFeature(BSF.FeatureLocation(0,10), type='domain')
feature.qualifiers['evidence'] = 'experimental'
print(feature)

In [ ]:
# type
print(feature.type)

In [ ]:
# location
print(feature.location)

In [ ]:
# qualifiers
for key, value in feature.qualifiers.items():
    print(key, ':', value)

Locations can be specified in various ways

A single position of a sequence can be described by ExactPosition objects (but there are also fuzzy variants). Continuous and discontinuous ranges can be described by FeatureLocation and CompoundLocation objects, respectively. See the Bio.SeqFeature module for details.


In [ ]:
# position 5 of sequence (zero-based indexing)
pos = BSF.ExactPosition(5)
print(repr(pos))

In [ ]:
# positions can treated as integers
print(repr(pos + 1))

In [ ]:
# region 6-10 of sequence (one-based, inclusive)
# which is 5:10 in Python-style indexing (zero-based, end exclusive)
loc1 = BSF.FeatureLocation(5, 10)
print(loc1)

In [ ]:
# region 21-25 (inclusive)
loc2 = BSF.FeatureLocation(20, 25)
print(loc2)

In [ ]:
# join of two regions (such as gene with two exons)
loc3 = BSF.CompoundLocation([loc1, loc2])
print(loc3)

Locations can be queried for their details.


In [ ]:
# start position
print(loc3.start)

In [ ]:
# end position
print(loc3.end)

In [ ]:
# continuous sub-regions
for p in loc3.parts:
    print(p)

In [ ]:
# is the position 4 in the region?
print(4 in loc3)

In [ ]:
# is the position 8 in the region?
print(8 in loc3)

In [ ]:
# is one region within another? (not supported)
try:
    print(loc1 in loc3)
except ValueError as e:
    print(e)

Specified region can be extracted from full sequence


In [ ]:
# full sequence
seq = BS.Seq('GGACTCTTAGCGGCTCACGCACTTTCTTCCGAAGACGGAACCCG', BA.generic_dna)
# two-exon gene within sequence
feature = BSF.SeqFeature(loc3, type='gene', strand=1)

In [ ]:
# the nucleotides specified by the feature
print(repr(feature.extract(seq)))

Features (or locations, in fact) are also aware of the strand (if applicable).


In [ ]:
# feature in reverse strand
loc4 = BSF.FeatureLocation(0, 5, strand=-1)
print(loc4)

In [ ]:
# strand (1 == forward, -1 == reverse)
print(loc4.strand)

In [ ]:
feature = BSF.SeqFeature(loc4, type='gene')
# the five nucleotides specified by the feature
# (note that the output is a reverse complement)
print(repr(feature.extract(seq)))

Sequence records can be obtained from NCBI's Entrez databases

The Bio.Entrez contains tools to query Entrez databases and fetch sequence records. They make use of the Entrez Programming Utilities (also known as EUtils), described in detail here. The output of these tools is usually in the XML format, which can be parsed by the utility functions of the Bio.Entrez module.

Important: Be aware that the Entrez utilities impose usage limits and that you can get yourself blacklisted if you violate them. Biopython takes care of keeping an eye on those for you. You should also fill in a non-fake email address. See the Frequency, Timing and Registration of E-utility URL Requests section of the manual.


In [ ]:
import Bio.Entrez as BE

# set email globally
BE.email = 'your.name@example.com'

# remember to set your real email address before continuing to the examples below
SET-YOUR-REAL-EMAIL-ADDRESS-BEFORE-COMMENTING-OUT-THIS-LINE

ESearch for searching databases

The esearch function can be used to search and retrieve primary IDs of records. It requires the database and the query as input and accepts some optional arguments as well. See the ESearch documentation for details.


In [ ]:
# search "nucleotide" database for the "accD" gene of "opuntia" organism
handle = BE.esearch(db="nucleotide",
                    term="opuntia[ORGN] accD[gene]",
                    # return ID list in XML format
                    rettype="uilist",
                    retmode="xml")

The output can be stored to a file for later use, which is a cost-effective approach for sequences in particular.


In [ ]:
# save to file
with open('results.xml', 'w') as f:
    f.write(handle.read())

The read function in Bio.Entrez can parse the output of EUtils tools into a dictionary. It expects input that contains exactly one record.

In this case, the output is the query result that contains the retrieved IDs (and some other information).


In [ ]:
# load from file and parse
with open('results.xml') as f:
    results = BE.read(f)
# (you could also parse 'handle' directly)

# retrieved IDs
print(results['IdList'])

EFetch for retrieving records

The efetch function can be used to retrieve specific records. It requires the database and the list of IDs as input and accepts some optional arguments as well. See the EFetch documentation for details.

The output format can be specified. The available formats depend on the database.


In [ ]:
# the ids of three records from the previous search
ids = results['IdList']
# fetch records from "nucleotide" database
handle = BE.efetch(db="nucleotide",
                   id=ids,
                   # return sequences in FASTA format
                   rettype="fasta",
                   retmode="text")
# save to file
with open('sequences.fasta', 'w') as f:
    f.write(handle.read())

In [ ]:
# show file content
with open('sequences.fasta') as f:
    print(f.read())

Full records can be obtained in the GenBank format in plain text (and in XML).


In [ ]:
# fetch records from "nucleotide" database
handle = BE.efetch(db="nucleotide",
                   id=ids,
                   # return sequences in GenBank plain text format
                   rettype="gb",
                   retmode="text")
# save to file
with open('sequences.gb', 'w') as f:
    f.write(handle.read())

In [ ]:
# show file content
with open('sequences.gb') as f:
    print(f.read())

SeqIO produces SeqRecord objects

The Bio.SeqIO module contains functions to parse the retrieved sequence records into SeqRecord objects. The read function parses an input that contains a single record, whereas the parse function returns a generator that produces all records from the input sequentially. They require the source (filename or file-like object) and the format as input.

(Bio.Entrez can parse the XML output of efetch but the result is a dictionary rather than SeqRecord because Bio.Entrez is intended for all Entrez databases and is not specific to sequence data.)


In [ ]:
import Bio.SeqIO as BSIO

Since parse returns a generator, you can process each record at the time without storing all records in memory simultaneously.


In [ ]:
# parse and process FASTA sequences one at the time
for record in BSIO.parse('sequences.fasta', 'fasta'):
    print(record)
    print()

The amount of information varies between formats. You should use the format that is the most appropriate for your application.


In [ ]:
# parse records in GenBank format, which contains more information than FASTA
for record in BSIO.parse('sequences.gb', 'gb'):
    print(record)
    print()

The SeqRecord objects can also be written to a file by Bio.SeqIO.


In [ ]:
# write 'record' to file named 'record.gb' in a GenBank format
# (note that the first argument is a list of records)
BSIO.write([record], 'record.gb', 'gb')

Large queries with history server and batch retrieve

NCBI expects that large queries are made by using their support for query history because it saves their computational resources. To use the history, add the usehistory="y" argument to your function call.


In [ ]:
# search "nucleotide" database for the "accD" gene of "opuntia" organism using history
handle = BE.esearch(db="nucleotide",
                    term="opuntia[ORGN] accD[gene]",
                    # use history
                    usehistory="y",
                    # return ID list in XML format
                    rettype="uilist",
                    retmode="xml")
results = BE.read(handle)

The result will contain WebEnv and QueryKey information, which replace the list of IDs in efetch. These pieces of information identify the result set of your search.


In [ ]:
webenv = results['WebEnv']
print(webenv)

In [ ]:
querykey = results['QueryKey']
print(querykey)

In [ ]:
# fetch records from "nucleotide" database
handle = BE.efetch(db="nucleotide",
                   webenv=webenv,
                   query_key=querykey,
                   rettype="gb",
                   retmode="text")
# save to file
with open('sequences.gb', 'w') as f:
    f.write(handle.read())

A large number of records should be retrieved in batches, which can be achieved with the retstart and retmax arguments. The former indicates how many records to skip from the beginning and the latter how many records to retrieve.

Note that the example below parses the records directly but, also in this case, the retrieved records could be stored to a file before parsing.


In [ ]:
# function to query "nucleotide" database and retrieve a specific number of records in batches of a given size
def retrieve(query, retrieve_size, batch_size):
    # search using history
    handle = BE.esearch(db="nucleotide",
                        term=query,
                        usehistory="y",
                        rettype="uilist",
                        retmode="xml")
    results = BE.read(handle)
    webenv = results['WebEnv']
    querykey = results['QueryKey']
    # adjust maximum size if less than maximum number of hits available
    retrieve_size = min(int(results['Count']), retrieve_size)
    # split results into batches
    for i in range(0, retrieve_size, batch_size):
        # fetch records
        handle = BE.efetch(db="nucleotide",
                           webenv=webenv,
                           query_key=querykey,
                           # fetch N records starting from i
                           retstart=i,
                           retmax=batch_size,
                           rettype="gb",
                           retmode="text")
        for record in BSIO.parse(handle, 'gb'):
            # acts as an iterator and give records out as they come
            yield record

# example call to function
# (retrieve 10 records in batches of 5; usually these numbers are much larger)
for record in retrieve("opuntia[ORGN] accD[gene]", 10, 5):
    print(record)
    print()