A variant annotation record has a json structure like the following:
{u'createDateTime': u'2015-11-18T00:00:00Z',
u'id': u'YnJjYTE6T1I0Rjp2YXJpYW50YW5ub3RhdGlvbnM6MTo2NDExNTo4NzFkNGM4OWE1Mzc0NjQwNjA2NDM0OTkzYWVmNGFmZQ',
u'info': {},
u'transcriptEffects': [{u'CDSLocation': None,
u'alternateBases': u'A',
u'analysisResults': [],
u'cDNALocation': None,
u'effects': [{u'id': u'SO:0001631',
u'sourceName': None,
u'sourceVersion': None,
u'term': u'upstream_gene_variant'}],
u'featureId': u'NM_001005484.1',
u'hgvsAnnotation': {u'genomic': u'1:g.64116C>A',
u'protein': u'',
u'transcript': u'NM_001005484.1:c.-4975C>A'},
u'id': u'2053be57055a40663aa02b2cdc9c7351',
u'proteinLocation': None},
{u'CDSLocation': None,
u'alternateBases': u'A',
u'analysisResults': [],
u'cDNALocation': None,
u'effects': [{u'id': u'SO:0000605',
u'sourceName': None,
u'sourceVersion': None,
u'term': u'intergenic_region'}],
u'featureId': u'FAM138A-OR4F5',
u'hgvsAnnotation': {u'genomic': u'1:g.64116C>A',
u'protein': u'',
u'transcript': u'n.64116C>A'},
u'id': u'6e6a547b0bdb446a78a3819bfcd6e06c',
u'proteinLocation': None}],
u'variantAnnotationSetId': u'YnJjYTE6T1I0Rjp2YXJpYW50YW5ub3RhdGlvbnM',
u'variantId': u'YnJjYTE6T1I0RjoxOjY0MTE1OmU4Y2MyOTg2MGJmOTJjZGVmOTEwY2IyMzllYWVkZDI0'}
That is: variant annotation —⪪ transcript effects —⪪ effects
In the sample data, there are many variants with multiple transcript effects, but all transcriptEffects have exactly one effect (see below for data).
searchVariantAnnotations accepts an list of json-formatted array of effect filters. It is unclear (to me) how this should behave when multiple filters are provided. Specifically, given a set F of SO ids provided as a filter filtering (e.g., {SO:1,SO:2}), and a set S of SO ids associated with all transcriptEffects of a variant annotation, does a variant annotation VA with S "match" F if:
* F ⋂ S ≠ {} -- at least one f ∈ F is in S
* F ⊂ S -- all f ∈ F are also in S
* S ⊂ F -- all s ∈ S are also in F
* F = S -- sets are identical (⇒ all of the above are true)
It's hard to know what we want without use cases. However, it seems clear that users are likely to have one of two expectations:
Let's test.
In [1]:
import itertools
import pandas as pd
from pivottablejs import pivot_ui
import ga4gh.client
print(ga4gh.__version__)
gc = ga4gh.client.HttpClient("http://localhost:8000")
In [2]:
region_constraints = dict(referenceName="1", start=0, end=int(1e10))
variant_set_id = 'YnJjYTE6T1I0Rg'
variant_annotation_sets = list(gc.searchVariantAnnotationSets(variant_set_id))
variant_annotation_set = variant_annotation_sets[0]
print("Using first variant annotation set (of {n} total) for variant set {vs_id}\nvas_id={vas.id}".format(
n=len(variant_annotation_sets), vs_id=variant_set_id, vas=variant_annotation_set))
In [3]:
variant_annotations = [{
'va':va,
'n_te': len(list(va.transcriptEffects)),
'n_ef': len(list(ef for te in va.transcriptEffects for ef in te.effects)),
'sos': ";".join(sorted(set("{ef.id}:{ef.term}".format(ef=ef)
for te in va.transcriptEffects
for ef in te.effects)))
}
for va in gc.searchVariantAnnotations(variant_annotation_set.id, **region_constraints)
]
variant_annotations_df = pd.DataFrame(variant_annotations)
The following is an inline graphic image. See instructions below it for reproducing it.
To regenerate this data:
In [21]:
pivot_ui(variant_annotations_df)
Out[21]:
Using the data above, we can search for single and multiple terms and compare to expectations.
We'll be using this function:
Signature: gc.searchVariantAnnotations(variantAnnotationSetId, referenceName=None, referenceId=None,
start=None, end=None, featureIds=[], effects=[])
Docstring:
Returns an iterator over the Annotations fulfilling the specified conditions from the specified
AnnotationSet.
The JSON string for an effect term must be specified on the command line :
`--effects '{"term": "exon_variant"}'`.
In [14]:
def _mk_effect_filter(so_ids=[]):
"""return list of so_id effect filters for the given list of so_ids
>>> print(_mk_effect_filter(so_ids="SO:1 SO:2 SO:3".split()))
['{"id":"SO:1"}', '{"id":"SO:2"}', '{"id":"SO:3"}']
"""
return [{"id": so_id} for so_id in so_ids]
def _fetch_variant_annotations(gc, so_ids=[], **args):
return gc.searchVariantAnnotations(variant_annotation_set.id,
effects=_mk_effect_filter(so_ids),
**args)
In [19]:
# expected:
#so_terms
#SO:0000605:intergenic_region 697
#SO:0000605:intergenic_region;SO:0001631:upstream_gene_variant 63
#SO:0000605:intergenic_region;SO:0001632:downstream_gene_variant 56
#SO:0001583:missense_variant 16
#SO:0001587:stop_gained 1
#SO:0001819:synonymous_variant 7
[(so_set,
len(list(_fetch_variant_annotations(gc, so_ids=so_set.split(), **region_constraints))))
for so_set in ["SO:0001819", "SO:0001632", "SO:0000605",
"SO:0000605 SO:0001632", "SO:0001632 SO:0000605",
"SO:9999999", "SO:0000605 SO:999999"]
]
Out[19]:
In [ ]: