Adapted from book chapter written by Alex Warwick Vesztrocy and Christophe Dessimoz
In this section we look at how to compute semantic similarity between GO terms. First we need to write a function that calculates the minimum number of branches connecting two GO terms.
In [1]:
%load_ext autoreload
%autoreload 2
from goatools.obo_parser import GODag
godag = GODag("go-basic.obo")
In [2]:
go_id3 = 'GO:0048364'
go_id4 = 'GO:0032501'
print(godag[go_id3])
print(godag[go_id4])
Let's get all the annotations from arabidopsis.
In [3]:
# from goatools.associations import read_gaf
# associations = read_gaf("tair.gaf")
import os
from goatools.associations import dnld_assc
fin_gaf = os.path.join(os.getcwd(), "tair.gaf")
associations = dnld_assc(fin_gaf, godag)
In [4]:
# Find deepest common ancestor
from goatools.semantic import deepest_common_ancestor
go_root = deepest_common_ancestor([go_id3, go_id4], godag)
print(go_root)
color | color | GO Term | Description |
---|---|---|---|
blue | #d5ffff | GO:0008150 | deepest common ancestor |
green | #d1ffbd | GO:0048364 | User GO Term |
green | #d1ffbd | GO:0032501 | User GO Term |
$ scripts/go_plot.py GO:0008150#d5ffff GO:0048364#d1ffbd GO:0032501#d1ffdb -o aaa_lin.png --gaf=tair.gaf
go-basic.obo: fmt(1.2) rel(2019-02-07) 47,387 GO Terms
READ 236,943 associations: tair.gaf
#d5ffff GO:0008150 # BP 29699 3.30 L00 D00 biological_process
#f1fbfd GO:0032502 # BP 3220 5.02 L01 D01 A developmental process
#d1ffdb GO:0032501 # BP 1003 5.48 L01 D01 B multicellular organismal process
GO:0048856 # BP 1040 5.46 L02 D02 A anatomical structure development
GO:0099402 # BP 17 6.90 L03 D03 A plant organ development
#d1ffbd GO:0048364 # BP 4 7.56 L04 D04 A root development
Now we can calculate the semantic distance and semantic similarity, as so:
In [5]:
from goatools.semantic import semantic_similarity
sim = semantic_similarity(go_id3, go_id4, godag)
print('The semantic similarity between terms {} and {} is {}.'.format(go_id3, go_id4, sim))
Then we can calculate the information content of the single term, GO:0048364
.
In [6]:
from goatools.semantic import TermCounts, get_info_content
# First get the counts of each GO term.
termcounts = TermCounts(godag, associations)
# Calculate the information content
go_id = "GO:0048364"
infocontent = get_info_content(go_id, termcounts)
print('Information content ({}) = {}'.format(go_id, infocontent))
Resnik's similarity measure is defined as the information content of the most informative common ancestor. That is, the most specific common parent-term in the GO. Then we can calculate this as follows:
In [7]:
from goatools.semantic import resnik_sim
sim_r = resnik_sim(go_id3, go_id4, godag, termcounts)
print('Resnik similarity score ({}, {}) = {}'.format(go_id3, go_id4, sim_r))
Lin's similarity measure is defined as: $$ \textrm{sim}_{\textrm{Lin}}(t_{1}, t_{2}) = \frac{2*\textrm{sim}_{\textrm{Resnik}}(t_1, t_2)}{IC(t_1) + IC(t_2)} $$
Then we can calculate this as
In [8]:
from goatools.semantic import lin_sim
sim_l = lin_sim(go_id3, go_id4, godag, termcounts)
print('Lin similarity score ({}, {}) = {}'.format(go_id3, go_id4, sim_l))