Using hgvs

This notebook demonstrates major features of the hgvs package.



In [3]:

    
import hgvs
hgvs.__version__









    Out[3]:





'0.5.0a6.dev3+nf998c16a46b3.d20161012'

Variant I/O

Initialize the parser



In [4]:

    
# You only need to do this once per process
import hgvs.parser
hp = hgvsparser = hgvs.parser.Parser()

Parse a simple variant



In [5]:

    
v = hp.parse_hgvs_variant("NC_000007.13:g.21726874G>A")



In [6]:

    
v









    Out[6]:





SequenceVariant(ac=NC_000007.13, type=g, posedit=21726874G>A)



In [7]:

    
v.ac, v.type









    Out[7]:





('NC_000007.13', 'g')



In [6]:

    
v.posedit









    Out[6]:





PosEdit(pos=21726874, edit=G>A, uncertain=False)



In [7]:

    
v.posedit.pos









    Out[7]:





Interval(start=21726874, end=21726874, uncertain=False)



In [8]:

    
v.posedit.pos.start









    Out[8]:





SimplePosition(base=21726874, uncertain=False)

Parsing complex variants



In [9]:

    
v = hp.parse_hgvs_variant("NM_003777.3:c.13552_*36del57")



In [10]:

    
v.posedit.pos.start, v.posedit.pos.end









    Out[10]:





(BaseOffsetPosition(base=13552, offset=0, datum=1, uncertain=False),
 BaseOffsetPosition(base=36, offset=0, datum=2, uncertain=False))



In [11]:

    
v.posedit.edit









    Out[11]:





NARefAlt(ref=57, alt=None, uncertain=False)

Formatting variants

All objects may be formatted simply by "stringifying" or printing them using str, print(), or "".format().



In [12]:

    
str(v)









    Out[12]:





'NM_003777.3:c.13552_*36del57'



In [13]:

    
print(v)









    



NM_003777.3:c.13552_*36del57



In [14]:

    
"{v} spans the CDS end".format(v=v)









    Out[14]:





'NM_003777.3:c.13552_*36del57 spans the CDS end'

Projecting variants between sequences

Set up a dataprovider

Mapping variants requires exon structures, alignments, CDS bounds, and raw sequence. These are provided by a hgvs.dataprovider instance. The only dataprovider provided with hgvs uses UTA. You may write your own by subsclassing hgvs.dataproviders.interface.



In [12]:

    
import hgvs.dataproviders.uta
hdp = hgvs.dataproviders.uta.connect()

Initialize mapper classes

The VariantMapper class projects variants between two sequence accessions using alignments from a specified source. In order to use it, you must know that two sequences are aligned. VariantMapper isn't demonstrated here.

AssemblyMapper builds on VariantMapper and handles identifying appropriate sequences. It is configured for a particular genome assembly.



In [14]:

    
import hgvs.variantmapper
#vm = variantmapper = hgvs.variantmapper.VariantMapper(hdp)
am37 = easyvariantmapper = hgvs.variantmapper.AssemblyMapper(hdp, assembly_name='GRCh37')
am38 = easyvariantmapper = hgvs.variantmapper.AssemblyMapper(hdp, assembly_name='GRCh38')

c_to_g

This is the easiest case because there is typically only one alignment between a transcript and the genome. (Exceptions exist for pseudoautosomal regions.)



In [ ]:

    
var_c = hp.parse_hgvs_variant("NM_015120.4:c.35G>C")
var_g = am37.c_to_g(var_c)
var_g



In [ ]:

    
am38.c_to_g(var_c)

g_to_c

In order to project a genomic variant onto a transcript, you must tell the AssemblyMapper which transcript to use.



In [19]:

    
am37.relevant_transcripts(var_g)









    Out[19]:





['NM_015120.4']



In [20]:

    
am37.g_to_c(var_g, "NM_015120.4")









    Out[20]:





SequenceVariant(ac=NM_015120.4, type=c, posedit=35T>C)

c_to_p



In [21]:

    
var_p = am37.c_to_p(var_c)
str(var_p)









    Out[21]:





'NP_055935.4:p.(Leu12Pro)'



In [22]:

    
var_p.posedit.uncertain = False
str(var_p)









    Out[22]:





'NP_055935.4:p.Leu12Pro'

Projecting in the presence of a genome-transcript gap

As of Oct 2016, 1033 RefSeq transcripts in 433 genes have gapped alignments. These gaps require special handlingin order to maintain the correspondence of positions in an alignment. hgvs uses the precomputed alignments in UTA to correctly project variants in exons containing gapped alignments.

This example demonstrates projecting variants in the presence of a gap in the alignment of NM_015120.4 (ALMS1) with GRCh37 chromosome 2. (The alignment with GRCh38 is similarly gapped.) Specifically, the adjacent genomic positions 73613031 and 73613032 correspond to the non-adjacent CDS positions 35 and 39.

 NM_015120.4  c         15 >                           >       58
 NM_015120.4  n        126 > CCGGGCGAGCTGGAGGAGGAGGAG  >      169
                             |||||||||||   ||||||||||  21=3I20= 
 NC_000002.11 g   73613021 > CCGGGCGAGCT---GGAGGAGGAG  > 73613041
 NC_000002.11 g   73613021 < GGCCCGCTCGA---CCTCCTCCTC  < 73613041



In [35]:

    
str(am37.c_to_g(hp.parse_hgvs_variant("NM_015120.4:c.35G>C")))









    Out[35]:





'NC_000002.11:g.73613031T>C'



In [36]:

    
str(am37.c_to_g(hp.parse_hgvs_variant("NM_015120.4:c.39G>C")))









    Out[36]:





'NC_000002.11:g.73613032G>C'

Normalizing variants

In hgvs, normalization means shifting variants 3' (as requried by the HGVS nomenclature) as well as rewriting variants. The variant "NM_001166478.1:c.30_31insT" is in a poly-T run (on the transcript). It should be shifted 3' and is better written as dup, as shown below:

                                         *                       NC_000006.11:g.49917127dupA
   NC_000006.11 g   49917117 > AGAAAGAAAAATAAAACAAAG  > 49917137 
   NC_000006.11 g   49917117 < TCTTTCTTTTTATTTTGTTTC  < 49917137 
                               |||||||||||||||||||||  21= 
 NM_001166478.1 n         41 < TCTTTCTTTTTATTTTGTTTC  <       21 NM_001166478.1:n.35dupT
 NM_001166478.1 c         41 <                        <       21 NM_001166478.1:c.30_31insT



In [24]:

    
import hgvs.normalizer
hn = hgvs.normalizer.Normalizer(hdp)



In [25]:

    
v = hp.parse_hgvs_variant("NM_001166478.1:c.30_31insT")
str(hn.normalize(v))









    Out[25]:





'NM_001166478.1:c.35dupT'

A more complex normalization example

This example is based on https://github.com/biocommons/hgvs/issues/382/.

   NC_000001.11 g   27552104 > CTTCACACGCATCCTGACCTTG > 27552125
   NC_000001.11 g   27552104 < GAAGTGTGCGTAGGACTGGAAC < 27552125
                               |||||||||||||||||||||| 22= 
 NM_001029882.3 n        843 < GAAGTGTGCGTAGGACTGGAAC <      822 
 NM_001029882.3 c         12 <                        <      -10 
                                         ^^  
                                         NM_001029882.3:c.1_2del
                                         NM_001029882.3:n.832_833delAT
                                         NC_000001.11:g.27552114_27552115delAT



In [26]:

    
am38.c_to_g(hp.parse_hgvs_variant("NM_001029882.3:c.1A>G"))









    Out[26]:





SequenceVariant(ac=NC_000001.11, type=g, posedit=27552115T>C)



In [27]:

    
am38.c_to_g(hp.parse_hgvs_variant("NM_001029882.3:c.2T>G"))









    Out[27]:





SequenceVariant(ac=NC_000001.11, type=g, posedit=27552114A>C)



In [28]:

    
am38.c_to_g(hp.parse_hgvs_variant("NM_001029882.3:c.1_2del"))









    Out[28]:





SequenceVariant(ac=NC_000001.11, type=g, posedit=27552114_27552115delAT)

The genomic coordinates for the SNVs at c.1 and c.2 match those for the del at c.1_2. Good!

Now, notice what happens with c.1_3del, c.1_4del, and c.1_5del:



In [29]:

    
am38.c_to_g(hp.parse_hgvs_variant("NM_001029882.3:c.1_3del"))









    Out[29]:





SequenceVariant(ac=NC_000001.11, type=g, posedit=27552114_27552116delATC)



In [30]:

    
am38.c_to_g(hp.parse_hgvs_variant("NM_001029882.3:c.1_4del"))









    Out[30]:





SequenceVariant(ac=NC_000001.11, type=g, posedit=27552112_27552115delGCAT)



In [31]:

    
am38.c_to_g(hp.parse_hgvs_variant("NM_001029882.3:c.1_5del"))









    Out[31]:





SequenceVariant(ac=NC_000001.11, type=g, posedit=27552112_27552116delGCATC)

Explanation:

On the transcript, c.1_2delAT deletes AT from …AGGATGCG…, resulting in …AGGGCG…. There's no ambiguity about what sequence was actually deleted.

c.1_3delATG deletes ATG, resulting in …AGGCG…. Note that you could also get this result by deleting GAT. This is an example of an indel that is subject to normalization and hgvs does this.

c.1_4delATGC and 1_5delATGCG have similar behaviors.

Normalization is always 3' with respect to the reference sequence. So, after projecting from a - strand transcript to the genome, normalization will go in the opposite direction to the transcript. It will have roughly the same effect as being 5' shifted on the transcript (but revcomp'd).

For more precise control, see the normalize and replace_reference options of AssemblyMapper.

Validating variants

hgvs.validator.Validator is a composite of two classes, hgvs.validator.IntrinsicValidator and hgvs.validator.ExtrinsicValidator. Intrinsic validation evaluates a given variant for internal consistency, such as requiring that insertions specify adjacent positions. Extrinsic validation evaluates a variant using external data, such as ensuring that the reference nucleotide in the variant matches that implied by the reference sequence and position. Validation returns True if successful, and raises an exception otherwise.



In [32]:

    
import hgvs.validator
hv = hgvs.validator.Validator(hdp)



In [33]:

    
hv.validate(hp.parse_hgvs_variant("NM_001166478.1:c.30_31insT"))









    Out[33]:





True



In [34]:

    
from hgvs.exceptions import HGVSError

try:
    hv.validate(hp.parse_hgvs_variant("NM_001166478.1:c.30_32insT"))
except HGVSError as e:
    print(e)









    



insertion length must be 1



In [ ]: