Gene Ontology and BLAST search

Instructions

Gene Ontology is not discussed by the Biopython tutorial because Biopython does not support Gene Ontology. You should read the Gene Ontology documentation to the extent you need to understand the purpose and overall content of Gene Ontology. The OBO Flat File Format Guide is another useful resource for this course.

For BLAST, read the sections 7.1, 8.1, 8.2, and 8.3 of the Biopython tutorial. If you are interested, you can also take a look at the sections 7.3 and 7.4 for alternative way of handling BLAST results.

Objectives

  • parse and use Gene Ontology terms
  • perform BLAST searches and process their results

Summary

Gene Ontology is an extensive ontology that contains terms for cellular components, molecular functions, and biological processes. The meanings of the terms are described and their relationships to other terms are defined. Among others, UniProt uses Gene Ontology to annotate protein sequences.

BLAST is a method to search for similar sequences from a database. NCBI provides an online BLAST service that can be used to query several databases. A programmatic interface to the NCBI BLAST is implemented in the Bio.Blast module of Biopython. The Bio.SearchIO module defines a generic interface to sequence search program outputs.

GeneOntology is a de facto resource for characterising proteins

Gene Ontology is available for download at http://purl.obolibrary.org/obo/go.obo in the OBO format. The OBO format is a simple, human-readable format. Gene Ontology is not supported by Biopython, but the syntax of the OBO format is easy to parse.

The OBO format is intended for representing ontologies. It is hence very expressive and not all of its features are used by all ontologies. The OBO Flat File Format Guide is a good source to start learning the details of the format. To deeply understand the OBO representation of an ontology, you should also know the basic principles of ontology construction. It is not within the scope of this course, however.

The terms are defined in [Term] stanzas. Each stanza contains at least one tag: value ! comment line.

The example below contains the terms membrane and ATPase activator activity. Note how the terms have basic information (such as id, name, namespace, and def) but also additional information (such as is_a and relationship).

[Term]
id: GO:0016020
name: membrane
namespace: cellular_component
def: "A lipid bilayer along with all the proteins and protein complexes embedded in it an attached to it." [GOC:dos, GOC:mah, ISBN:0815316194]
subset: goslim_aspergillus
subset: goslim_candida
subset: goslim_chembl
subset: goslim_flybase_ribbon
subset: goslim_metagenomics
subset: goslim_pir
subset: goslim_plant
subset: goslim_yeast
xref: Wikipedia:Biological_membrane
is_a: GO:0005575 ! cellular_component

[Term]
id: GO:0001671
name: ATPase activator activity
namespace: molecular_function
def: "Binds to and increases the ATP hydrolysis activity of an ATPase." [GOC:ajp]
synonym: "ATPase stimulator activity" EXACT []
xref: reactome:R-HSA-5251955 "HSP40s activate intrinsic ATPase activity of HSP70s in the nucleoplasm"
xref: reactome:R-HSA-5251959 "HSP40s activate intrinsic ATPase activity of HSP70s in the cytosol"
is_a: GO:0008047 ! enzyme activator activity
is_a: GO:0060590 ! ATPase regulator activity
relationship: part_of GO:0032781 ! positive regulation of ATPase activity
relationship: positively_regulates GO:0016887 ! ATPase activity

The relationship types are defined as [Typedef] stanzas. The syntax is the same as in [Term] stanzas, but the set of tags is different.

[Typedef]
id: positively_regulates
name: positively regulates
namespace: external
xref: RO:0002213
holds_over_chain: negatively_regulates negatively_regulates
is_a: regulates ! regulates
transitive_over: part_of ! part of

When you are implementing your own parser, you are free to only consider the tags that are relevant to your analysis and to collect the values into the data structure that best suits your needs. For simple analyses, it is enough to have a parser that reads the OBO file one line at the time and constructs named tuples from the parsed tag-value pairs.

NCBI online BLAST service is supported by Biopython

The Bio.Blast.NCBIWWW module has the function qblast, with which the NCBI BLAST service can be used programmatically. The workflow of using NCBI BLAST is similar to that of using EUtils.

The qblast function requires the BLAST variant, the database, and the query sequence as arguments. The sequence can be given as a plain sequence, as an ID or in FASTA format. The available databases are listed, among others, in the README of the NCBI BLAST FTP site.

There are also many optional arguments for fine-tuning the search. Make sure to check the default values as they may not be what you expect.

IMPORTANT: As with any other online service, do not overload the NCBI BLAST server with too many or too large queries. BLAST searches are slow and the NCBI BLAST (even with NCBI's massive server capacity) is no exception. Please wait patiently until you get the results and save your results to a file in order to save computational resources. (Biopython will handle the waiting for you.)


In [ ]:
import Bio.Blast.NCBIWWW as BBNW

In [ ]:
import Bio.Seq as BS
import Bio.Alphabet as BA

# BLAST program to use
prog = "blastp"
# database to search against
database = "swissprot"
# query sequence as a Seq object
query = BS.Seq("IRVEGNLRVEYLDDRNTFRHSVVVPYEPPE",
               alphabet=BA.IUPAC.protein)

# run NCBI BLAST
handle = BBNW.qblast(prog, database, query)

# save to file
# (particularly useful with BLAST, which is slow to run)
with open('blast-results.xml', 'w') as f:
    f.write(handle.read())

The NCBI BLAST will accept several query sequences simultaneously. In fact, it is preferred to send all query sequences at once, if possible.


In [ ]:
# BLAST program to use
prog = "blastp"
# database to search against
database = "swissprot"
# query sequences as a list of IDs
query = ['P01013', 'P12345']
# NCBI BLAST expects IDs as a string with one ID per line
query = "\n".join(query)

# run NCBI BLAST
handle = BBNW.qblast(prog, database, query)

# save to file
# (particularly useful with BLAST, which is slow to run)
with open('blast-results-many.xml', 'w') as f:
    f.write(handle.read())

As with the UniProt API, the qblast function reflects the behaviour of the corresponding website. It is therefore a good idea to take advantage of the graphical interface when designing the analysis and debugging the code. The NCBI BLAST and its descriptions are available at https://blast.ncbi.nlm.nih.gov/Blast.cgi.

SearchIO module provides an interface to search results

The XML output of BLAST could be parsed with the Bio.Blast.NCBIXML module, which is specifically intended for this purpose, but the Bio.SearchIO module provides a more convenient interface. The Bio.SearchIO module follows the typical organisation of search results:

  • An output file contains one or more query results (QueryResult).
  • A query result contains one or more hits (Hit).
  • A hit contains one or more high-scoring pairs (HSP).

The corresponding Biopython classes are shown in the parentheses.

These data structures work well with BLAST. The process with which BLAST finds matching sequences is as follows:

  • Create words (i.e. short segments) from the query sequence.
  • Select the similar words that have the highest scores against the words in the query sequence.
  • Scan the database for the exact matches of the selected words.
  • Extend the matches until the score between the query sequence and the database sequence starts to decrease. These extensions are called high-scoring segment pairs (HSPs).
  • Filter and evaluate the HSPs so that only the best (i.e. most significant) alignments are kept.

A single hit is therefore composed of at least one HSP.

Each search program uses their own output format which determines what kind of information is available in the results. The information provided by BLAST is documented in the Bio.SearchIO.BlastIO module, for example.

QueryResult object contains the hits to one query sequence

When running a search query, a query result is returned for each query sequence. A query result can be parsed into a QueryResult object, which contains the basic information about the query as well as the hits to the query. As always, the read function parses a single query and the parse function iterates over several of them. Since various file formats can be parsed with the Bio.SearchIO module, you must specify the file format.


In [ ]:
import Bio.SearchIO as BSIO

In [ ]:
# parse the result into a QueryResult object
with open('blast-results.xml') as f:
    # remember to specify the file format (here 'blast-xml')
    result = BSIO.read(f, 'blast-xml')

# a single QueryResult
print(result)

In [ ]:
# basic information about the search program
print(result.program)
print(result.version)
print()

# basic information about the query
print(result.id)
print(result.description)
# (query sequence length)
print(result.seq_len)
# (target database)
print(result.target)
# (substitution matrix)
print(result.param_matrix)
print()

# all available pieces of information (ignore keys starting with underscore)
for k in result.__dict__.keys():
    print(k)

The parse function will parse and iterate over several query results.


In [ ]:
# parse the results into QueryResult objects
with open('blast-results-many.xml') as f:
    # one QueryResult per query sequence
    for r in BSIO.parse(f, 'blast-xml'):
        print(r, end='\n\n')

QueryResult objects behave like lists and dictionaries

The hits contained in a QueryResult object can be accessed and sliced by indices (like lists). The hits can also be accessed by their ids (like dictionary).


In [ ]:
# number of hits
print(len(result), end='\n\n')

# select the first three hits
# (returns a new QueryResult object)
top_result = result[:3]

# iterate over hits
for hit in top_result:
    print(hit, end='\n\n')

In [ ]:
# iterate over the hit keys (i.e. their ids)
for uid in top_result.hit_keys:
    print(uid)
print()

# hit to specific database entry
print(top_result['sp|Q64662.1|'], end='\n\n')

# does entry exist among the hits?
print('sp|Q64662.1|' in top_result, end='\n\n')

The hits can be sorted in-place or into a new QueryResult object. Like with the built-in sort function in Python, you can specify the sort order by supplying the sort function. The hits can also be filtered much like the built-in filter function in Python.


In [ ]:
# function to sort by sequence length
fn = lambda hit: hit.seq_len

# sort by sequence length in reverse order
# and produce a new QueryResult object (in_place=False)
sorted_result = result.sort(key=fn, reverse=True, in_place=False)

# original top-3 ids
print(result.hit_keys[:3], end='\n\n')
# sorted top-3 ids
print(sorted_result.hit_keys[:3], end='\n\n')

In [ ]:
# filter function to get hits with sequence length > 500
fn = lambda hit: hit.seq_len > 500

# filter hits and produce a new QueryResult object
filtered_result = result.hit_filter(fn)

# number of original hits
print(len(result), end='\n\n')
# number of filtered hits
print(len(filtered_result), end='\n\n')

Hit objects contain the details of single database entries

A Hit object represents a single database entry that was matched to the query sequence. It contains the basic information about the entry as well as the HSPs that were found during the search as HSP objects. You can access, sort and filter HSP objects within Hit objects much like Hit objects within QueryResult objects.


In [ ]:
# (use the first hit as an example)
hit = result[0]

print(hit, end='\n\n')

# basic information about the query
print(hit.query_id)
print(hit.query_description)
print()

# basic information about the hit entry
# (these are about the database entry, not the query sequence)
print(hit.id)
print(hit.description)
print(hit.seq_len)
print(hit.accession)
print()

# all available pieces of information (ignore keys starting with underscore)
for k in hit.__dict__.keys():
    print(k)

In [ ]:
# number of HSPs
print(len(hit), end='\n\n')

# select the first three HSPs (there is only one HSP in this case)
# (returns a new Hit object)
top_hsp = hit[:3]

# iterate over HSPs
for hsp in hit:
    print(hsp, end='\n\n')

# (notice that, unlike Hits, HSPs do not have keys
#  by which you could access them)

In [ ]:
# function to sort by E-value
fn = lambda hsp: hsp.evalue

# sort by E-value in reverse order
# and produce a new Hit object (in_place=False)
sorted_hit = hit.sort(key=fn, reverse=True, in_place=False)

# original HSPs
print(hit, end='\n\n')
# sorted HSPs
print(sorted_hit, end='\n\n')

# filter function to get HSPs with E-value < 10^-6
fn = lambda hsp: hsp.evalue < 0.000001

# filter HSPs and produce a new Hit object
filtered_hit = hit.filter(fn)

# number of original hits
print(len(hit), end='\n\n')
# number of filtered hits
print(len(filtered_hit), end='\n\n')

HSP objects can also be filtered directly from QueryResult objects. This is convenient because you may need to filter your hits based on the values within their HSPs.


In [ ]:
# filter function to get HSPs with E-value < 10^-6
fn = lambda hsp: hsp.evalue < 0.000001

# filter hits and produce a new QueryResult object
hspfiltered_result = result.hsp_filter(fn)

# number of original hits
print(len(result), end='\n\n')
# E-values in original hits
for e in sorted([hsp.evalue for hsp in hit]
                for hit in result):
    print(e)
print()

# number of filtered hits
print(len(hspfiltered_result), end='\n\n')
# E-values in filtered hits
for e in sorted([hsp.evalue for hsp in hit]
                for hit in hspfiltered_result):
    print(e)
print()

HSPs contain the details of the matched segments of sequences

An HSP represents a match that was found between the segments of query and database sequences. It contains the basic information about the matched sequences as well as the match itself. The alignment of the matched segments is available along with its statistics.

The documentation of HSP contains a list of available pieces of information. Algorithm-specific information may also be available.


In [ ]:
# (use the first HSP as an example)
hsp = hit[0]

print(hsp, end='\n\n')

# basic information about the query sequence
print(hsp.query_id)
print(hsp.query_description)
print()

# basic information about the hit entry
print(hsp.hit_id)
print(hsp.hit_description)
print()

# matched query and hit sequence segments as Seq objects
print(hsp.query, end='\n\n')
print(hsp.hit, end='\n\n')

In [ ]:
# start and end positions of the match within the query sequence
print(hsp.query_start)
print(hsp.query_end)
print()

# start and end positions of the match within the hit sequence
print(hsp.hit_start)
print(hsp.hit_end)
print()

# the alignment of the matched segments
print(hsp.aln)

In [ ]:
# basic information about the match
# (E-value)
print(hsp.evalue)
# (bit score)
print(hsp.bitscore)
# (number of identical residues in alignment)
print(hsp.ident_num)
# (number of positive residues in alignment)
print(hsp.pos_num)
# (number of gaps in alignment)
print(hsp.gap_num)
print()