In [3]:
import hgvs
hgvs.__version__
Out[3]:
In [4]:
# You only need to do this once per process
import hgvs.parser
hp = hgvsparser = hgvs.parser.Parser()
In [5]:
v = hp.parse_hgvs_variant("NC_000007.13:g.21726874G>A")
In [6]:
v
Out[6]:
In [7]:
v.ac, v.type
Out[7]:
In [6]:
v.posedit
Out[6]:
In [7]:
v.posedit.pos
Out[7]:
In [8]:
v.posedit.pos.start
Out[8]:
In [9]:
v = hp.parse_hgvs_variant("NM_003777.3:c.13552_*36del57")
In [10]:
v.posedit.pos.start, v.posedit.pos.end
Out[10]:
In [11]:
v.posedit.edit
Out[11]:
In [12]:
str(v)
Out[12]:
In [13]:
print(v)
In [14]:
"{v} spans the CDS end".format(v=v)
Out[14]:
In [12]:
import hgvs.dataproviders.uta
hdp = hgvs.dataproviders.uta.connect()
The VariantMapper class projects variants between two sequence accessions using alignments from a specified source. In order to use it, you must know that two sequences are aligned. VariantMapper isn't demonstrated here.
AssemblyMapper builds on VariantMapper and handles identifying appropriate sequences. It is configured for a particular genome assembly.
In [14]:
import hgvs.variantmapper
#vm = variantmapper = hgvs.variantmapper.VariantMapper(hdp)
am37 = easyvariantmapper = hgvs.variantmapper.AssemblyMapper(hdp, assembly_name='GRCh37')
am38 = easyvariantmapper = hgvs.variantmapper.AssemblyMapper(hdp, assembly_name='GRCh38')
In [ ]:
var_c = hp.parse_hgvs_variant("NM_015120.4:c.35G>C")
var_g = am37.c_to_g(var_c)
var_g
In [ ]:
am38.c_to_g(var_c)
In [19]:
am37.relevant_transcripts(var_g)
Out[19]:
In [20]:
am37.g_to_c(var_g, "NM_015120.4")
Out[20]:
In [21]:
var_p = am37.c_to_p(var_c)
str(var_p)
Out[21]:
In [22]:
var_p.posedit.uncertain = False
str(var_p)
Out[22]:
As of Oct 2016, 1033 RefSeq transcripts in 433 genes have gapped alignments. These gaps require special handlingin order to maintain the correspondence of positions in an alignment. hgvs uses the precomputed alignments in UTA to correctly project variants in exons containing gapped alignments.
This example demonstrates projecting variants in the presence of a gap in the alignment of NM_015120.4 (ALMS1) with GRCh37 chromosome 2. (The alignment with GRCh38 is similarly gapped.) Specifically, the adjacent genomic positions 73613031 and 73613032 correspond to the non-adjacent CDS positions 35 and 39.
NM_015120.4 c 15 > > 58
NM_015120.4 n 126 > CCGGGCGAGCTGGAGGAGGAGGAG > 169
||||||||||| |||||||||| 21=3I20=
NC_000002.11 g 73613021 > CCGGGCGAGCT---GGAGGAGGAG > 73613041
NC_000002.11 g 73613021 < GGCCCGCTCGA---CCTCCTCCTC < 73613041
In [35]:
str(am37.c_to_g(hp.parse_hgvs_variant("NM_015120.4:c.35G>C")))
Out[35]:
In [36]:
str(am37.c_to_g(hp.parse_hgvs_variant("NM_015120.4:c.39G>C")))
Out[36]:
In hgvs, normalization means shifting variants 3' (as requried by the HGVS nomenclature) as well as rewriting variants. The variant "NM_001166478.1:c.30_31insT" is in a poly-T run (on the transcript). It should be shifted 3' and is better written as dup, as shown below:
* NC_000006.11:g.49917127dupA
NC_000006.11 g 49917117 > AGAAAGAAAAATAAAACAAAG > 49917137
NC_000006.11 g 49917117 < TCTTTCTTTTTATTTTGTTTC < 49917137
||||||||||||||||||||| 21=
NM_001166478.1 n 41 < TCTTTCTTTTTATTTTGTTTC < 21 NM_001166478.1:n.35dupT
NM_001166478.1 c 41 < < 21 NM_001166478.1:c.30_31insT
In [24]:
import hgvs.normalizer
hn = hgvs.normalizer.Normalizer(hdp)
In [25]:
v = hp.parse_hgvs_variant("NM_001166478.1:c.30_31insT")
str(hn.normalize(v))
Out[25]:
This example is based on https://github.com/biocommons/hgvs/issues/382/.
NC_000001.11 g 27552104 > CTTCACACGCATCCTGACCTTG > 27552125
NC_000001.11 g 27552104 < GAAGTGTGCGTAGGACTGGAAC < 27552125
|||||||||||||||||||||| 22=
NM_001029882.3 n 843 < GAAGTGTGCGTAGGACTGGAAC < 822
NM_001029882.3 c 12 < < -10
^^
NM_001029882.3:c.1_2del
NM_001029882.3:n.832_833delAT
NC_000001.11:g.27552114_27552115delAT
In [26]:
am38.c_to_g(hp.parse_hgvs_variant("NM_001029882.3:c.1A>G"))
Out[26]:
In [27]:
am38.c_to_g(hp.parse_hgvs_variant("NM_001029882.3:c.2T>G"))
Out[27]:
In [28]:
am38.c_to_g(hp.parse_hgvs_variant("NM_001029882.3:c.1_2del"))
Out[28]:
The genomic coordinates for the SNVs at c.1 and c.2 match those for the del at c.1_2. Good!
Now, notice what happens with c.1_3del, c.1_4del, and c.1_5del:
In [29]:
am38.c_to_g(hp.parse_hgvs_variant("NM_001029882.3:c.1_3del"))
Out[29]:
In [30]:
am38.c_to_g(hp.parse_hgvs_variant("NM_001029882.3:c.1_4del"))
Out[30]:
In [31]:
am38.c_to_g(hp.parse_hgvs_variant("NM_001029882.3:c.1_5del"))
Out[31]:
Explanation:
On the transcript, c.1_2delAT deletes AT from …AGGATGCG…, resulting in …AGGGCG…. There's no ambiguity about what sequence was actually deleted.
c.1_3delATG deletes ATG, resulting in …AGGCG…. Note that you could also get this result by deleting GAT. This is an example of an indel that is subject to normalization and hgvs does this.
c.1_4delATGC and 1_5delATGCG have similar behaviors.
Normalization is always 3' with respect to the reference sequence. So, after projecting from a - strand transcript to the genome, normalization will go in the opposite direction to the transcript. It will have roughly the same effect as being 5' shifted on the transcript (but revcomp'd).
For more precise control, see the normalize
and replace_reference
options of AssemblyMapper
.
hgvs.validator.Validator
is a composite of two classes, hgvs.validator.IntrinsicValidator
and hgvs.validator.ExtrinsicValidator
. Intrinsic validation evaluates a given variant for internal consistency, such as requiring that insertions specify adjacent positions. Extrinsic validation evaluates a variant using external data, such as ensuring that the reference nucleotide in the variant matches that implied by the reference sequence and position. Validation returns True
if successful, and raises an exception otherwise.
In [32]:
import hgvs.validator
hv = hgvs.validator.Validator(hdp)
In [33]:
hv.validate(hp.parse_hgvs_variant("NM_001166478.1:c.30_31insT"))
Out[33]:
In [34]:
from hgvs.exceptions import HGVSError
try:
hv.validate(hp.parse_hgvs_variant("NM_001166478.1:c.30_32insT"))
except HGVSError as e:
print(e)
In [ ]: