596: Support for custom transcripts

See https://github.com/biocommons/hgvs/issues/596

UTA#220 added support for custom transcripts, which really means custom alignments associated with a new alt_aln_method, "splign-manual". The loading/data/splign-manual contains examples.

Although hgvs supports multiple alignment methods already (e.g., one can specify "blat", "splign" in VariantMapper::g_to_c()), several areas of code assume one default alt_aln_method.

This work was sponsored by Invitae, which uses the VariantMapper interface. Therefore, this new feature focuses on minimal changes to extend hgvs to better support using multiple alignment methods with VariantMapper. These are demonstrated below.

Setup

Initialize hgvs and create two variants from CYP2C19 to use as examples (VCV000634882.1)


In [1]:
from hgvs.easy import (__version__, parser, hdp, vm)
from hgvs.exceptions import HGVSDataNotAvailableError
__version__


Out[1]:
'1.5.0.post2.dev11+g0772088.d20200328'

In [2]:
# hgvs_g = "NC_000010.11:g.94762693G>A"  # GRCh38
hgvs_g = "NC_000010.10:g.96522450G>A"  # GRCh37
hgvs_c = "NM_000769.4:c.-13G>A"

var_c = parser.parse(hgvs_c)
var_g = parser.parse(hgvs_g)

Discovering available alignments

Alignments for a specified transcript

This approach identifies available alignments and then selects the desired one to use with VariantMapper::c_to_g() as above.


In [3]:
hdp.get_tx_mapping_options(var_c.ac)


Out[3]:
[['NM_000769.4', 'NC_000010.10', 'splign-manual']]

In [4]:
# or, for a more complete example with many options:
hdp.get_tx_mapping_options("NM_001807.4")


Out[4]:
[['NM_001807.4', 'AC_000141.1', 'splign'],
 ['NM_001807.4', 'NC_000009.11', 'blat'],
 ['NM_001807.4', 'NC_000009.11', 'splign'],
 ['NM_001807.4', 'NC_000009.11', 'splign-manual'],
 ['NM_001807.4', 'NC_000009.12', 'splign'],
 ['NM_001807.4', 'NC_018920.2', 'splign'],
 ['NM_001807.4', 'NG_016394.1', 'splign']]

Alignments for a gene


In [5]:
alignments = hdp.get_tx_for_gene("CYP2C19")
alignments.sort(key=lambda a: (a["tx_ac"], a["alt_ac"], a["alt_aln_method"]))
alignments


Out[5]:
[['CYP2C19', 82, 1555, 'ENST00000371321', 'NC_000010.10', 'genebuild'],
 ['CYP2C19', None, None, 'ENST00000464755', 'NC_000010.10', 'genebuild'],
 ['CYP2C19', None, None, 'ENST00000480405', 'NC_000010.10', 'genebuild'],
 ['CYP2C19', 0, 1473, 'NM_000769.1', 'AC_000142.1', 'splign'],
 ['CYP2C19', 0, 1473, 'NM_000769.1', 'NC_000010.10', 'blat'],
 ['CYP2C19', 0, 1473, 'NM_000769.1', 'NC_000010.10', 'splign'],
 ['CYP2C19', 0, 1473, 'NM_000769.1', 'NC_018921.2', 'splign'],
 ['CYP2C19', 0, 1473, 'NM_000769.1', 'NG_008384.2', 'splign'],
 ['CYP2C19', 25, 1498, 'NM_000769.2', 'NC_000010.10', 'blat'],
 ['CYP2C19', 25, 1498, 'NM_000769.2', 'NC_000010.10', 'splign'],
 ['CYP2C19', 25, 1498, 'NM_000769.2', 'NC_000010.11', 'splign'],
 ['CYP2C19', 25, 1498, 'NM_000769.3', 'NC_000010.10', 'splign'],
 ['CYP2C19', 25, 1498, 'NM_000769.3', 'NC_000010.11', 'splign'],
 ['CYP2C19', 25, 1498, 'NM_000769.4', 'NC_000010.10', 'splign-manual']]

Alignments for a genomic region (new method)

The hgvs data provider method get_tx_for_region used a supplied alignment method, which necessarily filtered alignments for a specific alignment method. A new method, get_alignments_for_region provides similar functionality with an optional alt_aln_method parameter.

When alt_aln_method is None (or not provided), all alignments are returned. When alt_aln_method is provided, it behaves exactly like get_tx_for_region.


In [6]:
start = var_g.posedit.pos.start.base
end = var_g.posedit.pos.end.base

alignments = hdp.get_alignments_for_region(var_g.ac, start, end)
alignments.sort(key=lambda a: (a["tx_ac"], a["alt_ac"], a["alt_aln_method"]))
alignments


Out[6]:
[['ENST00000371321', 'NC_000010.10', 1, 'genebuild', 96522380, 96613017],
 ['ENST00000464755', 'NC_000010.10', 1, 'genebuild', 96447910, 96612830],
 ['ENST00000480405', 'NC_000010.10', 1, 'genebuild', 96522437, 96536207],
 ['NM_000769.2', 'NC_000010.10', 1, 'blat', 96522437, 96612962],
 ['NM_000769.2', 'NC_000010.10', 1, 'splign', 96522437, 96612962],
 ['NM_000769.3', 'NC_000010.10', 1, 'splign', 96522437, 96615308],
 ['NM_000769.4', 'NC_000010.10', 1, 'splign-manual', 96522437, 96615304]]

Alternate method for transcript-to-genome projections: Using try...except

This approach follows the easier-to-ask-for-forgiveness-than-permission principle. Code tries "splign", which is expected to satisfy the majority of cases, and then falls back to "splign-manual" for failures. The advantage of this method is that there is only one database fetch for the most common case.


In [7]:
try:
    vm.c_to_g(var_c, "NC_000010.10")
except HGVSDataNotAvailableError as e:
    print(f"Got {e!r}")


Got HGVSDataNotAvailableError('No tx_info for (tx_ac=NM_000769.4,alt_ac=NC_000010.10,alt_aln_method=splign)')

In [8]:
vm.c_to_g(var_c, "NC_000010.10", alt_aln_method="splign-manual")


Out[8]:
SequenceVariant(ac=NC_000010.10, type=g, posedit=96522450G>A, gene=None)