Computing basic semantic similarities between GO terms

Adapted from book chapter written by Alex Warwick Vesztrocy and Christophe Dessimoz

In this section we look at how to compute semantic similarity between GO terms. First we need to write a function that calculates the minimum number of branches connecting two GO terms.


In [1]:
%load_ext autoreload
%autoreload 2

from goatools.obo_parser import GODag
godag = GODag("go-basic.obo")


go-basic.obo: fmt(1.2) rel(2019-04-17) 47,398 GO Terms

In [2]:
go_id3 = 'GO:0048364'
go_id4 = 'GO:0032501'
print(godag[go_id3])
print(godag[go_id4])


GO:0048364	level-04	depth-04	root development [biological_process]
GO:0032501	level-01	depth-01	multicellular organismal process [biological_process]

Let's get all the annotations from arabidopsis.


In [3]:
# from goatools.associations import read_gaf
# associations = read_gaf("tair.gaf")

import os
from goatools.associations import dnld_assc
fin_gaf = os.path.join(os.getcwd(), "tair.gaf")
associations = dnld_assc(fin_gaf, godag)


HMS:0:00:08.821464 249,902 annotations READ: /mnt/c/Users/note2/Data/git/tmp/goatools/notebooks/tair.gaf 
19891 IDs in association branch, BP

In [4]:
# Find deepest common ancestor
from goatools.semantic import deepest_common_ancestor
go_root = deepest_common_ancestor([go_id3, go_id4], godag)
print(go_root)


GO:0008150

Plot the two terms of interest and highlight their deepest common ancestor

color color GO Term Description
blue #d5ffff GO:0008150 deepest common ancestor
green #d1ffbd GO:0048364 User GO Term
green #d1ffbd GO:0032501 User GO Term
$ scripts/go_plot.py GO:0008150#d5ffff GO:0048364#d1ffbd GO:0032501#d1ffdb -o aaa_lin.png --gaf=tair.gaf

go-basic.obo: fmt(1.2) rel(2019-02-07) 47,387 GO Terms
  READ      236,943 associations: tair.gaf
#d5ffff GO:0008150 # BP 29699  3.30 L00 D00       biological_process
#f1fbfd GO:0032502 # BP  3220  5.02 L01 D01 A     developmental process
#d1ffdb GO:0032501 # BP  1003  5.48 L01 D01 B     multicellular organismal process
        GO:0048856 # BP  1040  5.46 L02 D02 A     anatomical structure development
        GO:0099402 # BP    17  6.90 L03 D03 A     plant organ development
#d1ffbd GO:0048364 # BP     4  7.56 L04 D04 A     root development

Now we can calculate the semantic distance and semantic similarity, as so:


In [5]:
from goatools.semantic import semantic_similarity

sim = semantic_similarity(go_id3, go_id4, godag)
print('The semantic similarity between terms {} and {} is {}.'.format(go_id3, go_id4, sim))


The semantic similarity between terms GO:0048364 and GO:0032501 is 0.2.

Then we can calculate the information content of the single term, GO:0048364.


In [6]:
from goatools.semantic import TermCounts, get_info_content

# First get the counts of each GO term.
termcounts = TermCounts(godag, associations)

# Calculate the information content
go_id = "GO:0048364"
infocontent = get_info_content(go_id, termcounts)
print('Information content ({}) = {}'.format(go_id, infocontent))


Information content (GO:0048364) = 7.653820637909629

Resnik's similarity measure is defined as the information content of the most informative common ancestor. That is, the most specific common parent-term in the GO. Then we can calculate this as follows:


In [7]:
from goatools.semantic import resnik_sim

sim_r = resnik_sim(go_id3, go_id4, godag, termcounts)
print('Resnik similarity score ({}, {}) = {}'.format(go_id3, go_id4, sim_r))


Resnik similarity score (GO:0048364, GO:0032501) = 3.2732508872692763

Lin's similarity measure is defined as: $$ \textrm{sim}_{\textrm{Lin}}(t_{1}, t_{2}) = \frac{2*\textrm{sim}_{\textrm{Resnik}}(t_1, t_2)}{IC(t_1) + IC(t_2)} $$

Then we can calculate this as


In [8]:
from goatools.semantic import lin_sim

sim_l = lin_sim(go_id3, go_id4, godag, termcounts)
print('Lin similarity score ({}, {}) = {}'.format(go_id3, go_id4, sim_l))


Lin similarity score (GO:0048364, GO:0032501) = 0.49819994053182504