Computing basic semantic similarities between GO terms

Adapted from a book chapter written by Alex Warwick Vesztrocy and Christophe Dessimoz

This section shows how to compute the semantic similarity between GO terms. The simplest measure is based on the minimum number of branches connecting two GO terms, so we start by loading the ontology.


In [1]:
%load_ext autoreload
%autoreload 2

import sys

sys.path.insert(0, "..")
from goatools import obo_parser

go = obo_parser.GODag("../go-basic.obo")


load obo file ../go-basic.obo
../go-basic.obo: format-version(1.2) data-version(releases/2016-03-19)
46268 nodes imported

In [2]:
go_id3 = 'GO:0048364'
go_id4 = 'GO:0044707'
print(go[go_id3])
print(go[go_id4])


GO:0048364	level-03	depth-04	root development [biological_process] 
GO:0044707	level-02	depth-02	single-multicellular organism process [biological_process] 

Let's get all the annotations for Arabidopsis.


In [3]:
from goatools.associations import read_gaf

associations = read_gaf("http://geneontology.org/gene-associations/gene_association.tair.gz")


  READ 211727 items: http://geneontology.org/gene-associations/gene_association.tair.gz

Now we can calculate the semantic similarity between the two terms, which is derived from the semantic distance (the number of branches) between them:


In [4]:
from goatools.semantic import semantic_similarity

sim = semantic_similarity(go_id3, go_id4, go)
print('The semantic similarity between terms {} and {} is {}.'.format(go_id3, go_id4, sim))


The semantic similarity between terms GO:0048364 and GO:0044707 is 0.25.
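To make the idea of "minimum number of branches" concrete, here is a minimal sketch on a toy ontology. The term names, the `PARENTS` table, and the helper functions are all hypothetical and are not part of goatools. The sketch takes the shortest path through any common ancestor and assumes the similarity is the reciprocal of that distance; the library's own definition may differ in detail.

```python
# Toy ontology (hypothetical terms, not real GO IDs): term -> set of direct parents.
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A"},
    "D": {"A", "B"},
    "E": {"C"},
}


def ancestors_with_distance(term):
    """Map each ancestor of `term` (and the term itself) to the fewest edges needed to reach it."""
    dist = {term: 0}
    frontier = [term]
    while frontier:  # breadth-first, so the first visit is the shortest path
        next_frontier = []
        for node in frontier:
            for parent in PARENTS[node]:
                if parent not in dist:
                    dist[parent] = dist[node] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return dist


def min_branch_length(term1, term2):
    """Fewest edges connecting two terms via any common ancestor."""
    dist1 = ancestors_with_distance(term1)
    dist2 = ancestors_with_distance(term2)
    common = set(dist1) & set(dist2)
    return min(dist1[a] + dist2[a] for a in common)


def toy_similarity(term1, term2):
    """Assumed definition: reciprocal of the branch distance (1.0 for identical terms)."""
    dist = min_branch_length(term1, term2)
    return 1.0 if dist == 0 else 1.0 / dist


print(min_branch_length("E", "D"))  # E -> C -> A <- D, i.e. 3 branches
print(toy_similarity("E", "D"))
```

Under this assumed definition, a similarity of 0.25 corresponds to two terms that are four branches apart.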

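Next, we calculate the information content of a single term, GO:0048364. The information content reflects how rarely a term is used in the annotation set: the less frequent the term, the higher its information content.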


In [5]:
from goatools.semantic import TermCounts, get_info_content

# First get the counts of each GO term.
termcounts = TermCounts(go, associations)

# Calculate the information content
go_id = "GO:0048364"
infocontent = get_info_content(go_id, termcounts)
print('Information content ({}) = {}'.format(go_id, infocontent))


Information content (GO:0048364) = 7.75481392334
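As a rough illustration of how such a value can arise, the sketch below computes information content as the negative natural logarithm of a term's annotation frequency relative to the root. The counts and the `info_content` helper are made up for this example; they only mirror the general idea and are not the goatools implementation.

```python
import math

# Hypothetical annotation counts. Each count is assumed to include
# annotations to the term's descendants, so the root has the largest count.
COUNTS = {"root": 1000, "A": 400, "C": 50, "E": 5}


def info_content(term, counts, root="root"):
    """IC = -ln(count(term) / count(root)); unannotated terms get 0.0 by convention here."""
    count = counts.get(term, 0)
    if count == 0:
        return 0.0
    # ln(root / term) equals -ln(term / root)
    return math.log(counts[root] / count)


for term in ("root", "A", "C", "E"):
    print(term, info_content(term, COUNTS))
```

The root carries no information (IC of 0), while rarely annotated, specific terms such as the toy term `E` score highest.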

Resnik's similarity measure is defined as the information content of the most informative common ancestor, that is, the most specific parent term shared by both terms in the GO. We can calculate it as follows:


In [6]:
from goatools.semantic import resnik_sim

sim_r = resnik_sim(go_id3, go_id4, go, termcounts)
print('Resnik similarity score ({}, {}) = {}'.format(go_id3, go_id4, sim_r))


Resnik similarity score (GO:0048364, GO:0044707) = 4.0540784252
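The definition above can be sketched on a toy ontology: collect the ancestors shared by both terms, then take the highest information content among them. The ontology, the counts, and the function names below are hypothetical and serve only to illustrate the definition.

```python
import math

# Hypothetical toy ontology: term -> set of direct parents.
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A"},
    "D": {"A", "B"},
    "E": {"C"},
}
# Hypothetical annotation counts (assumed to include descendants).
COUNTS = {"root": 1000, "A": 400, "B": 300, "C": 50, "D": 20, "E": 5}


def ancestors(term):
    """Return the term itself plus every term reachable by following parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def info_content(term):
    """IC = -ln(frequency), with frequency taken relative to the root count."""
    return math.log(COUNTS["root"] / COUNTS[term])


def toy_resnik(term1, term2):
    """Information content of the most informative common ancestor."""
    common = ancestors(term1) & ancestors(term2)
    return max(info_content(a) for a in common)


print(toy_resnik("E", "D"))  # common ancestors are A and root; A is the more informative
print(toy_resnik("E", "B"))  # only the root is shared
```

When the only shared ancestor is the root, the score is 0, which is why Resnik's measure rewards pairs that meet deep in the ontology.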

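Lin's similarity measure normalizes the Resnik score by the information content of the two terms. As implemented here, it is defined as: $$ \textrm{sim}_{\textrm{Lin}}(t_{1}, t_{2}) = \frac{-2 \cdot \textrm{sim}_{\textrm{Resnik}}(t_1, t_2)}{IC(t_1) + IC(t_2)} $$ Lin's measure is more commonly written with a factor of $2$ rather than $-2$, which yields scores between 0 and 1; the sign convention above is why the score computed below is negative.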

We can calculate this as follows:


In [7]:
from goatools.semantic import lin_sim

sim_l = lin_sim(go_id3, go_id4, go, termcounts)
print('Lin similarity score ({}, {}) = {}'.format(go_id3, go_id4, sim_l))


Lin similarity score (GO:0048364, GO:0044707) = -0.607721957763
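For comparison, here is a minimal sketch of the normalization step using the commonly cited form of Lin's measure, with a positive factor of 2, so its sign differs from the score printed above. The function name and the input values are hypothetical; the inputs are information contents from a made-up toy ontology, not values taken from the GO.

```python
import math


def toy_lin(ic_mica, ic_term1, ic_term2):
    """Lin's measure in its common form: 2 * IC(MICA) / (IC(t1) + IC(t2)).

    Returns 0.0 when both terms have zero information content,
    which is a convention assumed for this sketch.
    """
    denominator = ic_term1 + ic_term2
    if denominator == 0:
        return 0.0
    return 2.0 * ic_mica / denominator


# Hypothetical information contents (natural log of inverse frequencies).
ic_t1 = math.log(200)      # a rare, specific term
ic_t2 = math.log(50)       # a somewhat more common term
ic_shared = math.log(2.5)  # their most informative common ancestor

print(toy_lin(ic_shared, ic_t1, ic_t2))
print(toy_lin(ic_t1, ic_t1, ic_t1))  # a term compared with itself scores 1.0
```

Because the most informative common ancestor can never be more informative than either term, this form stays between 0 and 1, which makes scores comparable across term pairs.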