Search VariantAnnotations using SO term sets

This notebook demonstrates how to use named sets of SO terms to annotate variants or to search for variants. Possible uses include classification of variation by "impact" (e.g., à la SnpEff) and classification of variation by region.

Important definitions: An SO "term" refers to a concept. The proper primary key for an SO term is an SO "id" (e.g., SO:0001147). Each term also has a "name" (e.g., natural_variant_site), a "definition" ("Describes the natural sequence variants due to polymorphisms..."), and other fields. Names may change (and have changed) for a given SO id; therefore, developers should use SO ids internally.
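
For illustration only, a single SO term can be thought of as a record like the following (a sketch; values abbreviated from the SO:0001147 example above):

so_term = {
    "id": "SO:0001147",              # stable; the key to store and compare internally
    "name": "natural_variant_site",  # human-readable; may change between SO releases
    "definition": "Describes the natural sequence variants due to polymorphisms...",
}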

This notebook generates SO maps, each consisting of a set of annotation labels (as map keys), with each label mapped to a set of SO terms. Each map is named. For example, a SnpEff map might have labels "high", "moderate", and "low", each mapping to a set of SO terms, as sketched below.
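
As a minimal sketch (using a few of the SnpEff term names that appear in full below), such a map might look like:

snpeff_so_name_map = {
    "high":     ["stop_gained", "stop_lost", "start_lost"],
    "moderate": ["codon_change", "codon_insertion"],
    "low":      ["upstream", "downstream"],
}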

The results here may be compared with the survey of terms from the same region in Exploring SO terms.



In [1]:
import itertools
import pprint
import re

from IPython.display import HTML, display

import ga4gh.client
import prettytable
import requests

print(ga4gh.__version__)


0.1.dev632+ncb43455c1003

In [2]:
gc = ga4gh.client.HttpClient("http://localhost:8000")

region_constraints = dict(referenceName="1", start=0, end=int(1e10))
variant_set_id = 'YnJjYTE6T1I0Rg'
variant_annotation_sets = list(gc.searchVariantAnnotationSets(variant_set_id))
variant_annotation_set = variant_annotation_sets[0]
print("Using first variant annotation set (of {n} total) for variant set {vs_id}\nvas_id={vas.id}".format(
    n=len(variant_annotation_sets), vs_id=variant_set_id, vas=variant_annotation_set))


Using first variant annotation set (of 1 total) for variant set YnJjYTE6T1I0Rg
vas_id=YnJjYTE6T1I0Rjp2YXJpYW50YW5ub3RhdGlvbnM

In [3]:
# poor-man's SO name-to-id map
# so_name_id_map will look like this:
# {u'natural_variant_site': u'SO:0001147',
#  u'polypeptide_zinc_ion_contact_site': u'SO:0001103',
#  u'methylated_adenine': u'SO:0000161',
#  ...
id_name_re = re.compile(r"id: (?P<id>SO:\d+)\nname: (?P<name>\S+)")
url = "https://raw.githubusercontent.com/The-Sequence-Ontology/SO-Ontologies/master/so-xp-simple.obo"
so_name_id_map = {
    m.group('name'): m.group('id')
    for m in (id_name_re.search(s) for s in requests.get(url).text.split("\n\n")) if m is not None
    }
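
Assuming the OBO download succeeded, a quick spot check of the resulting map might look like this (expected ids taken from the outputs later in this notebook):

so_name_id_map["stop_gained"]         # u'SO:0001587'
so_name_id_map["synonymous_variant"]  # u'SO:0001819'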

In [4]:
def mk_effect_id_filter(so_ids=[]):
    """Return a list of OntologyTerm effect filters for the given list of SO ids.

    >>> print(mk_effect_id_filter("SO:123 SO:456".split()))
    [{'id': 'SO:123'}, {'id': 'SO:456'}]
    """
    return [{"id": id} for id in so_ids]

def remap_with_so_ids(so_name_map):
    """For a map of label => [so names], return a map of label => [so ids]."""
    def _map1(n):
        try:
            return so_name_id_map[n]
        except KeyError:
            # warn about (and drop) names that are not current SO term names
            print("SO term name '{n}' is not (currently) valid".format(n=n))
    return {label: list(filter(None, (_map1(n) for n in names))) for label, names in so_name_map.items()}
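
As a quick illustration of remap_with_so_ids before the real maps are built below ('no_such_term' is a deliberately bogus name):

remap_with_so_ids({"cds": ["stop_gained", "no_such_term"]})
# prints:  SO term name 'no_such_term' is not (currently) valid
# returns: {'cds': [u'SO:0001587']}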

In [5]:
snpeff_so_name_map = {
    "high": [
        "chromosome_large_deletion",
        "chromosome_large_inversion",
        "chromosome_large_duplication",
        "gene_rearrangement",
        "gene_deleted",
        "gene_fusion",
        "gene_fusion_reverese",
        "transcript_deleted",
        "exon_deleted",
        "exon_deleted_partial",
        "exon_duplication",
        "exon_duplication_partial",
        "exon_inversion",
        "exon_inversion_partial",
        "frame_shift",
        "stop_gained",
        "stop_lost",
        "start_lost",
        "splice_site_acceptor",
        "splice_site_donor",
        "rare_amino_acid",
        "protein_protein_interaction_locus",
        "protein_structural_interaction_locus",
    ], 

    "moderate": [
        "non_synonymous_coding",
        "codon_insertion",
        "codon_change_plus_codon_insertion",
        "codon_deletion",
        "codon_change_plus_codon_deletion",
        "utr_5_deleted",
        "utr_3_deleted",
        "splice_site_branch_u12",
        "splice_site_region",
        "splice_site_branch",
        "non_synonymous_stop",
        "non_synonymous_start",
        "synonymous_coding",
        "synonymous_start",
        "synonymous_stop",
        "codon_change",
    ], 

    "low": [
        "gene_inversion",
        "gene_duplication",
        "transcript_duplication",
        "transcript_inversion",
        "utr_5_prime",
        "utr_3_prime",
        "start_gained",
        "upstream",
        "downstream",
        "motif",
        "motif_deleted",
        "regulation",
        "micro_rna",
    ],

    "modifiers": [
        "custom",
        "next_prot",
        "intron_conserved",
        "intron",
        "intragenic",
        "intergenic_conserved",
        "intergenic",
        "cds",
        "exon",
        "transcript",
        "gene",
        "sequence",
        "chromosome_elongation",
        "chromosome",
        "genome",
        "none",
    ]
}

In [6]:
snpeff_so_id_map = remap_with_so_ids(snpeff_so_name_map)


SO term name 'chromosome_large_deletion' is not (currently) valid
SO term name 'chromosome_large_inversion' is not (currently) valid
SO term name 'chromosome_large_duplication' is not (currently) valid
SO term name 'gene_rearrangement' is not (currently) valid
SO term name 'gene_deleted' is not (currently) valid
SO term name 'gene_fusion_reverese' is not (currently) valid
SO term name 'transcript_deleted' is not (currently) valid
SO term name 'exon_deleted' is not (currently) valid
SO term name 'exon_deleted_partial' is not (currently) valid
SO term name 'exon_duplication' is not (currently) valid
SO term name 'exon_duplication_partial' is not (currently) valid
SO term name 'exon_inversion' is not (currently) valid
SO term name 'exon_inversion_partial' is not (currently) valid
SO term name 'frame_shift' is not (currently) valid
SO term name 'splice_site_acceptor' is not (currently) valid
SO term name 'splice_site_donor' is not (currently) valid
SO term name 'rare_amino_acid' is not (currently) valid
SO term name 'protein_protein_interaction_locus' is not (currently) valid
SO term name 'protein_structural_interaction_locus' is not (currently) valid
SO term name 'non_synonymous_coding' is not (currently) valid
SO term name 'codon_insertion' is not (currently) valid
SO term name 'codon_change_plus_codon_insertion' is not (currently) valid
SO term name 'codon_deletion' is not (currently) valid
SO term name 'codon_change_plus_codon_deletion' is not (currently) valid
SO term name 'utr_5_deleted' is not (currently) valid
SO term name 'utr_3_deleted' is not (currently) valid
SO term name 'splice_site_branch_u12' is not (currently) valid
SO term name 'splice_site_region' is not (currently) valid
SO term name 'splice_site_branch' is not (currently) valid
SO term name 'non_synonymous_stop' is not (currently) valid
SO term name 'non_synonymous_start' is not (currently) valid
SO term name 'synonymous_coding' is not (currently) valid
SO term name 'synonymous_start' is not (currently) valid
SO term name 'synonymous_stop' is not (currently) valid
SO term name 'codon_change' is not (currently) valid
SO term name 'gene_inversion' is not (currently) valid
SO term name 'gene_duplication' is not (currently) valid
SO term name 'transcript_duplication' is not (currently) valid
SO term name 'transcript_inversion' is not (currently) valid
SO term name 'utr_5_prime' is not (currently) valid
SO term name 'utr_3_prime' is not (currently) valid
SO term name 'start_gained' is not (currently) valid
SO term name 'upstream' is not (currently) valid
SO term name 'downstream' is not (currently) valid
SO term name 'motif' is not (currently) valid
SO term name 'motif_deleted' is not (currently) valid
SO term name 'regulation' is not (currently) valid
SO term name 'micro_rna' is not (currently) valid
SO term name 'custom' is not (currently) valid
SO term name 'next_prot' is not (currently) valid
SO term name 'intron_conserved' is not (currently) valid
SO term name 'intragenic' is not (currently) valid
SO term name 'intergenic_conserved' is not (currently) valid
SO term name 'intergenic' is not (currently) valid
SO term name 'cds' is not (currently) valid
SO term name 'chromosome_elongation' is not (currently) valid
SO term name 'none' is not (currently) valid

In [7]:
snpeff_so_id_map


Out[7]:
{'high': [u'SO:0001565', u'SO:0001587', u'SO:0001578', u'SO:0002012'],
 'low': [],
 'moderate': [],
 'modifiers': [u'SO:0000188',
  u'SO:0000147',
  u'SO:0000673',
  u'SO:0000704',
  u'SO:1000082',
  u'SO:0000340',
  u'SO:0001026']}

Region name map

This is a contrived example of another kind of SO map that someone might find useful: grouping SO terms by the region of a gene they describe.


In [8]:
region_so_name_map = {
    "locus": [
        "gene_fusion",
        "upstream_gene_variant",
    ],

    "cds": [
        "missense_variant",
        "start_lost",
        "stop_gained",
        "stop_lost",
        "synonymous_variant",
    ],
    
    # note that utr, upstream, and downstream sets overlap intentionally
    "utr": [
        "3_prime_UTR_variant",
        "5_prime_UTR_variant", 
    ],
    
    "upstream": [
        "5_prime_UTR_variant",
        "upstream_gene_variant",
    ],

    "downstream": [
        "3_prime_UTR_variant",
        "downstream_gene_variant",
    ],
}

In [9]:
region_so_id_map = remap_with_so_ids(region_so_name_map)

Meta maps

so_maps collects both maps, keyed by name ("snpeff" and "region").


In [10]:
so_maps = {"snpeff": snpeff_so_id_map, "region": region_so_id_map}
pprint.pprint(so_maps)


{'region': {'cds': [u'SO:0001583',
                    u'SO:0002012',
                    u'SO:0001587',
                    u'SO:0001578',
                    u'SO:0001819'],
            'downstream': [u'SO:0001624', u'SO:0001632'],
            'locus': [u'SO:0001565', u'SO:0001631'],
            'upstream': [u'SO:0001623', u'SO:0001631'],
            'utr': [u'SO:0001624', u'SO:0001623']},
 'snpeff': {'high': [u'SO:0001565',
                     u'SO:0001587',
                     u'SO:0001578',
                     u'SO:0002012'],
            'low': [],
            'moderate': [],
            'modifiers': [u'SO:0000188',
                          u'SO:0000147',
                          u'SO:0000673',
                          u'SO:0000704',
                          u'SO:1000082',
                          u'SO:0000340',
                          u'SO:0001026']}}

Search for variants by each SO map


In [11]:
field_names = "n_vars name:label n_so_ids so_ids".split()
pt = prettytable.PrettyTable(field_names=field_names)

for name, so_map in so_maps.items():
    for label, so_ids in so_map.items():
        vs = []
        # Searching with an empty filter means no filtering
        # This should be changed: searching should be by inclusion, not lack of exclusion.
        if so_ids:
            efilter = mk_effect_id_filter(so_ids)
            vs = list(gc.searchVariantAnnotations(variant_annotation_set.id,
                                                  effects=efilter,
                                                  **region_constraints))
        pt.add_row([
            len(vs),
            name + ":" + label,
            len(so_ids),
            " ".join(so_ids)
            ])

display(HTML(pt.get_html_string()))


n_vars | name:label        | n_so_ids | so_ids
------ | ----------------- | -------- | ------------------------------------------------------------------------------
     1 | snpeff:high       |        4 | SO:0001565 SO:0001587 SO:0001578 SO:0002012
     0 | snpeff:moderate   |        0 |
     0 | snpeff:low        |        0 |
     0 | snpeff:modifiers  |        7 | SO:0000188 SO:0000147 SO:0000673 SO:0000704 SO:1000082 SO:0000340 SO:0001026
     0 | region:utr        |        2 | SO:0001624 SO:0001623
    63 | region:locus      |        2 | SO:0001565 SO:0001631
    24 | region:cds        |        5 | SO:0001583 SO:0002012 SO:0001587 SO:0001578 SO:0001819
    56 | region:downstream |        2 | SO:0001624 SO:0001632
    63 | region:upstream   |        2 | SO:0001623 SO:0001631

Label variants with SO maps

This is essentially the inverse of the above: rather than searching for variants by SO ids, each returned variant is labeled with the map labels implied by its SO ids.


In [12]:
# invert each SO map: {label: [so_ids]} -> {so_id: [labels]}
def invert_so_map(so_map):
    """for a so_map of {label: [so_id]}, return the inverse {so_id: [labels]}.
    so_id:label is many:many
    """
    lmap = sorted((so, label) for label, so_ids in so_map.items() for so in so_ids)
    return {k: list(sl[1] for sl in sli) for k, sli in itertools.groupby(lmap, key=lambda e: e[0])}

def unique_labels_for_so_ids(so_labels_map, so_ids):
    """given a map of {so: [labels]} and a list of so_ids, return a list of unique labels"""
    uniq_labels = set(itertools.chain.from_iterable(so_labels_map.get(so_id, []) for so_id in so_ids))
    return list(uniq_labels)

def build_variant_record(v):
    """Summarize one variant annotation as a row dict; uses so_labels_maps, built in the next cell."""
    so_ids = list(set(eff.id for te in v.transcriptEffects for eff in te.effects))
    impacts = unique_labels_for_so_ids(so_labels_maps["snpeff"], so_ids)
    regions = unique_labels_for_so_ids(so_labels_maps["region"], so_ids)
    return dict(
        g = v.transcriptEffects[0].hgvsAnnotation.genomic,
        t = v.transcriptEffects[0].hgvsAnnotation.transcript,
        p = v.transcriptEffects[0].hgvsAnnotation.protein,
        so_ids = " ".join(so_ids),
        impacts = " ".join(impacts),
        regions = " ".join(regions)
        )
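
For a concrete sense of the inversion, applying invert_so_map to a reduced example drawn from the region map above would behave like this:

invert_so_map({"utr": ["SO:0001623", "SO:0001624"], "upstream": ["SO:0001623"]})
# -> {'SO:0001623': ['upstream', 'utr'], 'SO:0001624': ['utr']}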

In [13]:
so_labels_maps = {name: invert_so_map(so_map) for name, so_map in so_maps.items()}
pprint.pprint(so_labels_maps)


{'region': {u'SO:0001565': ['locus'],
            u'SO:0001578': ['cds'],
            u'SO:0001583': ['cds'],
            u'SO:0001587': ['cds'],
            u'SO:0001623': ['upstream', 'utr'],
            u'SO:0001624': ['downstream', 'utr'],
            u'SO:0001631': ['locus', 'upstream'],
            u'SO:0001632': ['downstream'],
            u'SO:0001819': ['cds'],
            u'SO:0002012': ['cds']},
 'snpeff': {u'SO:0000147': ['modifiers'],
            u'SO:0000188': ['modifiers'],
            u'SO:0000340': ['modifiers'],
            u'SO:0000673': ['modifiers'],
            u'SO:0000704': ['modifiers'],
            u'SO:0001026': ['modifiers'],
            u'SO:0001565': ['high'],
            u'SO:0001578': ['high'],
            u'SO:0001587': ['high'],
            u'SO:0002012': ['high'],
            u'SO:1000082': ['modifiers']}}

In [14]:
variants = list(gc.searchVariantAnnotations(
        variant_annotation_set.id,
        effects = mk_effect_id_filter("SO:0001587 SO:0001819".split()),
        **region_constraints))

field_names = "g t p so_ids impacts regions".split()
pt = prettytable.PrettyTable(field_names=field_names)

for v in variants:
    vrec = build_variant_record(v)
    pt.add_row([vrec[k] for k in field_names])
    
display(HTML(pt.get_html_string()))


g            | t                       | p                          | so_ids     | impacts | regions
------------ | ----------------------- | -------------------------- | ---------- | ------- | -------
1:g.69486C>T | NM_001005484.1:c.396C>T | NM_001005484.1:p.Asn132Asn | SO:0001819 |         | cds
1:g.69534T>C | NM_001005484.1:c.444T>C | NM_001005484.1:p.His148His | SO:0001819 |         | cds
1:g.69594T>C | NM_001005484.1:c.504T>C | NM_001005484.1:p.Asp168Asp | SO:0001819 |         | cds
1:g.69768G>A | NM_001005484.1:c.678G>A | NM_001005484.1:p.Ser226Ser | SO:0001819 |         | cds
1:g.69810T>A | NM_001005484.1:c.720T>A | NM_001005484.1:p.Val240Val | SO:0001819 |         | cds
1:g.69869T>A | NM_001005484.1:c.779T>A | NM_001005484.1:p.Leu260*   | SO:0001587 | high    | cds
1:g.69876A>G | NM_001005484.1:c.786A>G | NM_001005484.1:p.Lys262Lys | SO:0001819 |         | cds
1:g.69897T>C | NM_001005484.1:c.807T>C | NM_001005484.1:p.Ser269Ser | SO:0001819 |         | cds
