Adapted from a book chapter written by Alex Warwick Vesztrocy and Christophe Dessimoz.
In this section we look at how to compute the semantic similarity between GO terms. The simplest measure relies on a function that calculates the minimum number of branches connecting two GO terms. We start by loading the ontology.
In [1]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.insert(0, "..")
from goatools import obo_parser
go = obo_parser.GODag("../go-basic.obo")
In [2]:
go_id3 = 'GO:0048364'
go_id4 = 'GO:0044707'
print(go[go_id3])
print(go[go_id4])
Let's get all the annotations for Arabidopsis.
In [3]:
from goatools.associations import read_gaf
associations = read_gaf("http://geneontology.org/gene-associations/gene_association.tair.gz")
Now we can calculate the semantic similarity, which is based on the semantic distance between the two terms:
In [4]:
from goatools.semantic import semantic_similarity
sim = semantic_similarity(go_id3, go_id4, go)
print('The semantic similarity between terms {} and {} is {}.'.format(go_id3, go_id4, sim))
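To make the idea of a minimum number of branches concrete, here is a minimal sketch on a toy ontology. The `PARENTS` dictionary and the helper names are hypothetical and independent of goatools. The sketch also assumes that similarity is taken as the inverse of the distance, which is one simple choice rather than a statement about how any library defines it.

```python
from collections import deque

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A", "B"],
    "E": ["C"],
}


def ancestor_distances(term, parents):
    """Map each ancestor of `term` (and `term` itself) to the fewest edges needed to reach it."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents[current]:
            if parent not in dist:  # breadth-first: first visit is the shortest path
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist


def min_branch_length(term1, term2, parents):
    """Fewest branches connecting two terms through any common ancestor."""
    d1 = ancestor_distances(term1, parents)
    d2 = ancestor_distances(term2, parents)
    # Assumes the two terms share at least one ancestor (here, "root").
    return min(d1[t] + d2[t] for t in set(d1) & set(d2))


def toy_semantic_similarity(term1, term2, parents):
    """Assumed convention: similarity is the inverse of the distance (1.0 for identical terms)."""
    dist = min_branch_length(term1, term2, parents)
    return 1.0 / dist if dist else 1.0


print(min_branch_length("E", "D", PARENTS))
print(toy_semantic_similarity("E", "D", PARENTS))
```

In this toy graph, terms `E` and `D` are joined through their common ancestor `A`, which gives a shorter path than going through `root`.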
Then we can calculate the information content of a single term, GO:0048364.
In [5]:
from goatools.semantic import TermCounts, get_info_content
# First get the counts of each GO term.
termcounts = TermCounts(go, associations)
# Calculate the information content
go_id = "GO:0048364"
infocontent = get_info_content(go_id, termcounts)
print('Information content ({}) = {}'.format(go_id, infocontent))
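As a rough illustration of what information content measures, the sketch below computes $-\log p(t)$ from annotation counts. The counts and the function name `toy_info_content` are hypothetical. Returning 0 for a term with no annotations is an assumed convention for this sketch.

```python
import math

# Hypothetical counts: number of genes annotated to each term or to any of its descendants.
COUNTS = {"root": 100, "A": 40, "C": 10, "E": 1}
TOTAL = COUNTS["root"]


def toy_info_content(term, counts, total):
    """Information content as -log of the term's annotation frequency."""
    freq = counts.get(term, 0) / total
    # Assumed convention: terms without annotations get an information content of 0.
    return -math.log(freq) if freq > 0 else 0.0


for term in ("A", "C", "E"):
    print(term, toy_info_content(term, COUNTS, TOTAL))
```

Terms that are annotated less often are more specific, so their information content is higher.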
Resnik's similarity measure is defined as the information content of the most informative common ancestor, that is, the most specific term in the GO that is an ancestor of both terms. We can calculate this as follows:
In [6]:
from goatools.semantic import resnik_sim
sim_r = resnik_sim(go_id3, go_id4, go, termcounts)
print('Resnik similarity score ({}, {}) = {}'.format(go_id3, go_id4, sim_r))
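The definition above can be sketched on a toy ontology as follows. The graph, the frequencies, and the function names are hypothetical. The sketch picks the common ancestor with the highest information content, following the definition in the text, which is not necessarily how goatools selects the ancestor internally.

```python
import math

# Hypothetical toy ontology (term -> direct parents) and annotation frequencies.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A", "B"],
    "E": ["C"],
}
FREQ = {"root": 1.0, "A": 0.4, "B": 0.5, "C": 0.1, "D": 0.2, "E": 0.01}


def ancestors(term, parents):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def toy_info_content(term, freq):
    """Information content as -log of the term's frequency."""
    return -math.log(freq[term])


def toy_resnik(term1, term2, parents, freq):
    """Information content of the most informative common ancestor."""
    common = ancestors(term1, parents) & ancestors(term2, parents)
    return max(toy_info_content(t, freq) for t in common)


print(toy_resnik("E", "D", PARENTS, FREQ))
```

Here `E` and `D` share the ancestors `A` and `root`, and `A` is the more informative of the two, so the score is the information content of `A`.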
Lin's similarity measure is defined as: $$ \textrm{sim}_{\textrm{Lin}}(t_{1}, t_{2}) = \frac{2 \, \textrm{sim}_{\textrm{Resnik}}(t_1, t_2)}{\textrm{IC}(t_1) + \textrm{IC}(t_2)} $$ where $\textrm{IC}(t)$ is the information content of term $t$.
We can calculate this as follows:
In [7]:
from goatools.semantic import lin_sim
sim_l = lin_sim(go_id3, go_id4, go, termcounts)
print('Lin similarity score ({}, {}) = {}'.format(go_id3, go_id4, sim_l))
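As a small worked example, the sketch below evaluates Lin's ratio from three information content values, under the convention $\textrm{IC}(t) = -\log p(t)$ so that scores fall between 0 and 1. The frequencies and the function name `toy_lin` are hypothetical. Returning 0 when both terms have zero information content is an assumption made here to avoid dividing by zero.

```python
import math


def toy_lin(ic_ancestor, ic_term1, ic_term2):
    """Lin's ratio: twice the ancestor's information content over the sum of the terms'."""
    total = ic_term1 + ic_term2
    # Assumed convention: define the score as 0 when neither term carries any information.
    return 2.0 * ic_ancestor / total if total > 0 else 0.0


# Hypothetical frequencies for two terms and their most informative common ancestor.
ic_ancestor = -math.log(0.4)
ic_term1 = -math.log(0.01)
ic_term2 = -math.log(0.2)

print(toy_lin(ic_ancestor, ic_term1, ic_term2))
```

Unlike Resnik's measure, this ratio is normalized: comparing a term with itself gives 1, and terms whose only common ancestor is the root give 0.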