Varcode
is a library for working with genomic variant data in Python and predicting the impact of those variants on protein sequences.
What this means is that this library can help you annotate gene variants, i.e. changes in DNA with their potential effect on the protein/transcript that they encode for.
This notebook highlights the basic functionality provided by varcode
in a nut-shell.
If you haven't installed varcode
yet,
please refer to the original README file for specific instructions on how to install the library.
Let's start with a really simple example to demonstrate what varcode
can do for you.
Imagine that the sequencing core facility has just sent you the sequencing results of that sample you submitted earlier to them.
And now, they sent you a list of variants that they inferred using their analysis pipeline.
You open the file just to see that they haven't annotated the variants,
so you don't know whether these variants are within a gene
and if so whether they affect the protein sequence or not.
Enter varcode
, which is specifically designed to solve this annotation problem.
For the sake of the example,
let's simplify things and assume that we are interested in annotating this variant,
a nucleotide change from an A
into a T
at the 1,404,553,136
th base of chromosome 7
.
To annotate this variant, we first have to model it within varcode
using the Variant
class.
We are going to assume that this coordinate is consistent with the human assembly GRCh37.
Varcode
stands on the shoulders of PyEnsembl
, a Python interface to Ensembl reference genome metadata.
So to start things off, let's import the following classes that are relevant to our example:
In [1]:
from varcode import Variant
from pyensembl import ensembl_grch37
Now let's create a new Variant
that will represent our variant of interest:
In [2]:
myVariant = Variant(contig=7, start=140453136, ref="A", alt="T", ensembl=ensembl_grch37)
Now that we defined this variant, we can start annotating it; but let's start with this trivial example, where we ask for a short descriptive description of the variant:
In [3]:
myVariant.short_description
Out[3]:
this is our variation, but expressed using the offical variation nomenclature.
How about asking about the gene this variant is in:
In [4]:
myVariant.coding_genes
Out[4]:
Looks like this variant lies within the BRAF gene; but what about the potential effects of this variant to the product of this gene?
In [5]:
myEffects = myVariant.effects()
myEffects
Out[5]:
What the above list tells us is that this variation can potentially affect four different BRAF transcripts
and out of four, one will result in a Substitution
,
i.e. a coding mutation which causes simple substitution of one amino acid for another.
For the other transcripts, IncompleteTranscript
type implies that
varcode
can't determine effect since transcript annotation is incomplete (often missing either the start or stop codon).
That is all great, but dealing with multiple effects is hard and we usually want to pick the one that causes the most dramatic change in the gene product.
Varcode
provides an easy way to get to this worst effect:
In [6]:
topPriorityEffect = myEffects.top_priority_effect()
topPriorityEffect
Out[6]:
So overall, this what we have learned about the variant using varcode
:
In [7]:
print ('The mutation %s leads to a %s in gene %s (%s)' % (myVariant.short_description, type(topPriorityEffect).__name__, topPriorityEffect.gene_name, topPriorityEffect.short_description))
Moreover, varcode
can provide you with the altered protein sequence,
which becomes important especially for analyses that use the variation information within the context of a few aminoacids surrounding the mutated location:
In [8]:
variantLocation = topPriorityEffect.aa_mutation_start_offset
topPriorityEffect.original_protein_sequence[variantLocation-3:variantLocation+4]
Out[8]:
In [9]:
topPriorityEffect.mutant_protein_sequence[variantLocation-3:variantLocation+4]
Out[9]:
See that valenine (V
) changing into a glutamine (E
)?
That is the effect of our variant within the aminoacid context.
That was easy, right?
Now that we know how to work with a single variant and extract annotations for it, it is now time for us to learn how to work with collections of variants all together.
In the previous section,
we assumed that there was a single variation that we wanted to annotate;
but in the real world,
we usually receive the variant calls from a sequencing analysis as a file
that is formatted with either Mutation Annotation Format+Specification) (MAF) or Variant Call Format (VCF).
Varcode
has built-in methods to load variants described by either of the formats:
In [10]:
from varcode import load_maf, load_vcf
The following loads mutations from the MAF file distributed within our code repository:
In [11]:
mafVariants = load_maf("../test/data/tcga_ov.head.maf")
mafVariants # should load 4 variants
Out[11]:
and this should load variants from one of the VCF files:
In [12]:
vcfVariants = load_vcf("../test/data/somatic_hg19_14muts.vcf")
vcfVariants # should load 14 variants
Out[12]:
You can even extract summary statistics from these variant collections if you would like to have a quick look at the overall variants and the altered genes:
In [13]:
vcfVariants.gene_counts()
Out[13]:
In [14]:
mafVariants.gene_counts()
Out[14]:
In [15]:
tp53Mutations = vcfVariants.groupby_gene_name()["TP53"].effects()
tp53Mutations
Out[15]:
or, for example, you might want to drop all mutations that do not affect a protein sequence or variants that fall in a non-coding genomic region:
In [16]:
vcfEffects = vcfVariants.effects()
nonSilentMutations = vcfEffects.drop_silent_and_noncoding()
nonSilentMutations.top_priority_effect_per_gene_id()
Out[16]:
or, you might want to get all effects whose priority falls below an Insertion
:
In [17]:
from varcode import Insertion
vcfEffects.filter_by_effect_priority(Insertion).top_priority_effect_per_gene_id()
Out[17]:
Here, we went over how you can varcode
to annotate variants (either one-by-one or as a batch from a collection)
and filter their effects based on biological criteria.
For more examples on methods provided by varcode
,
we suggest taking a look at our code tests.
A significant portion of the varcode
terminology is inherited from PyEnsembl;
therefore, you can also take a look at its documentation for further clarification on internal workings of this library.