Varcode - Quick Start

Varcode is a library for working with genomic variant data in Python and predicting the impact of those variants on protein sequences. What this means is that this library can help you annotate gene variants, i.e. changes in DNA with their potential effect on the protein/transcript that they encode for. This notebook highlights the basic functionality provided by varcode in a nut-shell. If you haven't installed varcode yet, please refer to the original README file for specific instructions on how to install the library.

Working with variants and their effects

Let's start with a really simple example to demonstrate what varcode can do for you. Imagine that the sequencing core facility has just sent you the sequencing results of that sample you submitted earlier to them. And now, they sent you a list of variants that they inferred using their analysis pipeline. You open the file just to see that they haven't annotated the variants, so you don't know whether these variants are within a gene and if so whether they affect the protein sequence or not. Enter varcode, which is specifically designed to solve this annotation problem.

For the sake of the example, let's simplify things and assume that we are interested in annotating this variant, a nucleotide change from an A into a T at the 1,404,553,136th base of chromosome 7. To annotate this variant, we first have to model it within varcode using the Variant class. We are going to assume that this coordinate is consistent with the human assembly GRCh37. Varcode stands on the shoulders of PyEnsembl, a Python interface to Ensembl reference genome metadata.

So to start things off, let's import the following classes that are relevant to our example:


In [1]:
from varcode import Variant
from pyensembl import ensembl_grch37

Now let's create a new Variant that will represent our variant of interest:


In [2]:
myVariant = Variant(contig=7, start=140453136, ref="A", alt="T", ensembl=ensembl_grch37)

Now that we defined this variant, we can start annotating it; but let's start with this trivial example, where we ask for a short descriptive description of the variant:


In [3]:
myVariant.short_description


Out[3]:
'chr7 g.140453136A>T'

this is our variation, but expressed using the offical variation nomenclature.

How about asking about the gene this variant is in:


In [4]:
myVariant.coding_genes


INFO:root:Cached file Homo_sapiens.GRCh37.75.gtf from URL ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
Out[4]:
[Gene(id=ENSG00000157764, name=BRAF, biotype=protein_coding, location=7:140419127-140624564)]

Looks like this variant lies within the BRAF gene; but what about the potential effects of this variant to the product of this gene?


In [5]:
myEffects = myVariant.effects()
myEffects


INFO:root:Cached file Homo_sapiens.GRCh37.75.cdna.all.fa from URL ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.75.cdna.all.fa.gz
INFO:root:Cached file Homo_sapiens.GRCh37.75.pep.all.fa from URL ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/pep/Homo_sapiens.GRCh37.75.pep.all.fa.gz
/usr/local/lib/python3.4/site-packages/Bio/Seq.py:151: BiopythonWarning: Biopython Seq objects now use string comparison. Older versions of Biopython used object comparison. During this transition, please use hash(id(my_seq)) or my_dict[id(my_seq)] if you want the old behaviour, or use hash(str(my_seq)) or my_dict[str(my_seq)] for the new string hashing behaviour.
  "the new string hashing behaviour.", BiopythonWarning)
Out[5]:
<EffectCollection with 4 elements>
  -- Substitution(variant=chr7 g.140453136A>T, transcript_name=BRAF-001, transcript_id=ENST00000288602, effect_description=p.V600E)
  -- IncompleteTranscript(variant=chr7 g.140453136A>T, transcript_name=BRAF-005, transcript_id=ENST00000479537)
  -- IncompleteTranscript(variant=chr7 g.140453136A>T, transcript_name=BRAF-003, transcript_id=ENST00000496384)
  -- IncompleteTranscript(variant=chr7 g.140453136A>T, transcript_name=BRAF-002, transcript_id=ENST00000497784)

What the above list tells us is that this variation can potentially affect four different BRAF transcripts and out of four, one will result in a Substitution, i.e. a coding mutation which causes simple substitution of one amino acid for another. For the other transcripts, IncompleteTranscript type implies that varcode can't determine effect since transcript annotation is incomplete (often missing either the start or stop codon).

That is all great, but dealing with multiple effects is hard and we usually want to pick the one that causes the most dramatic change in the gene product. Varcode provides an easy way to get to this worst effect:


In [6]:
topPriorityEffect = myEffects.top_priority_effect()
topPriorityEffect


Out[6]:
Substitution(variant=chr7 g.140453136A>T, transcript_name=BRAF-001, transcript_id=ENST00000288602, effect_description=p.V600E)

So overall, this what we have learned about the variant using varcode:


In [7]:
print ('The mutation %s leads to a %s in gene %s (%s)' % (myVariant.short_description, type(topPriorityEffect).__name__, topPriorityEffect.gene_name, topPriorityEffect.short_description))


The mutation chr7 g.140453136A>T leads to a Substitution in gene BRAF (p.V600E)

Moreover, varcode can provide you with the altered protein sequence, which becomes important especially for analyses that use the variation information within the context of a few aminoacids surrounding the mutated location:


In [8]:
variantLocation = topPriorityEffect.aa_mutation_start_offset
topPriorityEffect.original_protein_sequence[variantLocation-3:variantLocation+4]


Out[8]:
Seq('LATVKSR', SingleLetterAlphabet())

In [9]:
topPriorityEffect.mutant_protein_sequence[variantLocation-3:variantLocation+4]


Out[9]:
Seq('LATEKSR', SingleLetterAlphabet())

See that valenine (V) changing into a glutamine (E)? That is the effect of our variant within the aminoacid context. That was easy, right?

Importing variants from a MAF or VCF file

Now that we know how to work with a single variant and extract annotations for it, it is now time for us to learn how to work with collections of variants all together.

In the previous section, we assumed that there was a single variation that we wanted to annotate; but in the real world, we usually receive the variant calls from a sequencing analysis as a file that is formatted with either Mutation Annotation Format+Specification) (MAF) or Variant Call Format (VCF). Varcode has built-in methods to load variants described by either of the formats:


In [10]:
from varcode import load_maf, load_vcf

The following loads mutations from the MAF file distributed within our code repository:


In [11]:
mafVariants = load_maf("../test/data/tcga_ov.head.maf")
mafVariants  # should load 4 variants


Out[11]:
<VariantCollection from 'tcga_ov.head.maf' with 4 elements>
  -- Variant(contig=1, start=1650797, ref=A, alt=G, genome=GRCh37)
  -- Variant(contig=1, start=23836447, ref=C, alt=A, genome=GRCh37)
  -- Variant(contig=1, start=231401797, ref=A, alt=C, genome=GRCh37)
  -- Variant(contig=11, start=124617502, ref=C, alt=G, genome=GRCh37)

and this should load variants from one of the VCF files:


In [12]:
vcfVariants = load_vcf("../test/data/somatic_hg19_14muts.vcf")
vcfVariants  # should load 14 variants


Out[12]:
<VariantCollection from 'somatic_hg19_14muts.vcf' with 14 elements>
  -- Variant(contig=1, start=53513530, ref=A, alt=C, genome=GRCh37)
  -- Variant(contig=1, start=228295398, ref=G, alt=T, genome=GRCh37)
  -- Variant(contig=10, start=49658590, ref=T, alt=C, genome=GRCh37)
  -- Variant(contig=10, start=51585166, ref=G, alt=T, genome=GRCh37)
  -- Variant(contig=10, start=96709040, ref=A, alt=C, genome=GRCh37)
  -- Variant(contig=10, start=119134281, ref=G, alt=T, genome=GRCh37)
  -- Variant(contig=11, start=118244286, ref=., alt=., genome=GRCh37)
  -- Variant(contig=12, start=14794076, ref=C, alt=A, genome=GRCh37)
  -- Variant(contig=12, start=25398284, ref=C, alt=G, genome=GRCh37)
  -- Variant(contig=12, start=42778752, ref=T, alt=A, genome=GRCh37)
  -- Variant(contig=14, start=31144202, ref=A, alt=C, genome=GRCh37)
  -- Variant(contig=16, start=25704209, ref=G, alt=A, genome=GRCh37)
  -- Variant(contig=17, start=7577548, ref=., alt=A, genome=GRCh37)
  -- Variant(contig=17, start=36731197, ref=C, alt=AAT, genome=GRCh37)

You can even extract summary statistics from these variant collections if you would like to have a quick look at the overall variants and the altered genes:


In [13]:
vcfVariants.gene_counts()


Out[13]:
Counter({'SCP2': 1, 'HS3ST4': 1, 'SCFD1': 1, 'TP53': 1, 'GUCY2C': 1, 'CYP2C9': 1, 'PDZD8': 1, 'MRPL55': 1, 'NCOA4': 1, 'UBE4A': 1, 'SRCIN1': 1, 'ARHGAP22': 1, 'KRAS': 1, 'PPHLN1': 1})

In [14]:
mafVariants.gene_counts()


INFO:root:Cached file Homo_sapiens.GRCh37.75.gtf from URL ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
Out[14]:
Counter({'VSIG2': 1, 'GNPAT': 1, 'E2F2': 1, 'RP1-283E3.8': 1, 'CDK11A': 1})

Filtering Variants

Having a collection of variants is great, but you often need to filter them down to see, for example, whether any of the variants affect your gene of interest (TP53):


In [15]:
tp53Mutations = vcfVariants.groupby_gene_name()["TP53"].effects()
tp53Mutations


Out[15]:
<EffectCollection with 11 elements>
  -- FrameShift(variant=chr17 g.7577548_7577549insA, transcript_name=TP53-001, transcript_id=ENST00000269305, effect_description=p.G245fs)
  -- FrameShift(variant=chr17 g.7577548_7577549insA, transcript_name=TP53-019, transcript_id=ENST00000359597, effect_description=p.G245fs)
  -- FrameShift(variant=chr17 g.7577548_7577549insA, transcript_name=TP53-018, transcript_id=ENST00000413465, effect_description=p.G245fs)
  -- FrameShift(variant=chr17 g.7577548_7577549insA, transcript_name=TP53-005, transcript_id=ENST00000420246, effect_description=p.G245fs)
  -- FrameShift(variant=chr17 g.7577548_7577549insA, transcript_name=TP53-002, transcript_id=ENST00000445888, effect_description=p.G245fs)
  -- FrameShift(variant=chr17 g.7577548_7577549insA, transcript_name=TP53-004, transcript_id=ENST00000455263, effect_description=p.G245fs)
  -- NoncodingTranscript(variant=chr17 g.7577548_7577549insA, transcript_name=TP53-006, transcript_id=ENST00000504290)
  -- NoncodingTranscript(variant=chr17 g.7577548_7577549insA, transcript_name=TP53-008, transcript_id=ENST00000504937)
  -- IncompleteTranscript(variant=chr17 g.7577548_7577549insA, transcript_name=TP53-015, transcript_id=ENST00000509690)
  -- NoncodingTranscript(variant=chr17 g.7577548_7577549insA, transcript_name=TP53-007, transcript_id=ENST00000510385)
  -- IncompleteTranscript(variant=chr17 g.7577548_7577549insA, transcript_name=TP53-013, transcript_id=ENST00000514944)

or, for example, you might want to drop all mutations that do not affect a protein sequence or variants that fall in a non-coding genomic region:


In [16]:
vcfEffects = vcfVariants.effects()
nonSilentMutations = vcfEffects.drop_silent_and_noncoding()
nonSilentMutations.top_priority_effect_per_gene_id()


/usr/local/lib/python3.4/site-packages/Bio/Seq.py:151: BiopythonWarning: Biopython Seq objects now use string comparison. Older versions of Biopython used object comparison. During this transition, please use hash(id(my_seq)) or my_dict[id(my_seq)] if you want the old behaviour, or use hash(str(my_seq)) or my_dict[str(my_seq)] for the new string hashing behaviour.
  "the new string hashing behaviour.", BiopythonWarning)
Out[16]:
OrderedDict([('ENSG00000110344', Insertion(variant=chr11 g.118244286, transcript_name=UBE4A-002, transcript_id=ENST00000431736, effect_description=p.340ins)), ('ENSG00000165650', Substitution(variant=chr10 g.119134281G>T, transcript_name=PDZD8-001, transcript_id=ENST00000334464, effect_description=p.T153K)), ('ENSG00000162910', Substitution(variant=chr1 g.228295398G>T, transcript_name=MRPL55-201, transcript_id=ENST00000366731, effect_description=p.R103S)), ('ENSG00000017373', FrameShift(variant=chr17 g.36731197C>AAT, transcript_name=SRCIN1-002, transcript_id=ENST00000578925, effect_description=p.V117fs)), ('ENSG00000128805', Substitution(variant=chr10 g.49658590T>C, transcript_name=ARHGAP22-002, transcript_id=ENST00000417912, effect_description=p.S544G)), ('ENSG00000092108', Substitution(variant=chr14 g.31144202A>C, transcript_name=SCFD1-006, transcript_id=ENST00000458591, effect_description=p.E391D)), ('ENSG00000138293', Substitution(variant=chr10 g.51585166G>T, transcript_name=NCOA4-205, transcript_id=ENST00000452682, effect_description=p.C438F)), ('ENSG00000141510', FrameShift(variant=chr17 g.7577548_7577549insA, transcript_name=TP53-001, transcript_id=ENST00000269305, effect_description=p.G245fs)), ('ENSG00000070019', Substitution(variant=chr12 g.14794076C>A, transcript_name=GUCY2C-001, transcript_id=ENST00000261170, effect_description=p.A670S)), ('ENSG00000133703', Substitution(variant=chr12 g.25398284C>G, transcript_name=KRAS-004, transcript_id=ENST00000256078, effect_description=p.G12A))])

or, you might want to get all effects whose priority falls below an Insertion:


In [17]:
from varcode import Insertion
vcfEffects.filter_by_effect_priority(Insertion).top_priority_effect_per_gene_id()


Out[17]:
OrderedDict([('ENSG00000110344', Insertion(variant=chr11 g.118244286, transcript_name=UBE4A-002, transcript_id=ENST00000431736, effect_description=p.340ins)), ('ENSG00000138109', ExonicSpliceSite(exon=Exon(exon_id=ENSE00002485068, gene_name=CYP2C9, contig=10, start=96708865, end=96709041), alternate_effect=Substitution(variant=chr10 g.96709040A>C, transcript_name=CYP2C9-001, transcript_id=ENST00000260682, effect_description=p.K273T))), ('ENSG00000141510', FrameShift(variant=chr17 g.7577548_7577549insA, transcript_name=TP53-001, transcript_id=ENST00000269305, effect_description=p.G245fs)), ('ENSG00000017373', FrameShift(variant=chr17 g.36731197C>AAT, transcript_name=SRCIN1-002, transcript_id=ENST00000578925, effect_description=p.V117fs)), ('ENSG00000116171', ExonicSpliceSite(exon=Exon(exon_id=ENSE00003489518, gene_name=SCP2, contig=1, start=53513530, end=53513609), alternate_effect=Substitution(variant=chr1 g.53513530A>C, transcript_name=SCP2-001, transcript_id=ENST00000371514, effect_description=p.D490A)))])

Summary

Here, we went over how you can varcode to annotate variants (either one-by-one or as a batch from a collection) and filter their effects based on biological criteria. For more examples on methods provided by varcode, we suggest taking a look at our code tests. A significant portion of the varcode terminology is inherited from PyEnsembl; therefore, you can also take a look at its documentation for further clarification on internal workings of this library.