The VAPr package retrieves variant information from ANNOVAR and myvariant.info and consolidates it into a single local database, making variant findings easier to investigate and filter. The aggregated information is structured as lists of Python dictionaries, with each variant described by a multi-level dictionary. This approach flexibly accommodates the wide variety of annotation attributes available for different variants. This format also allows the data to be loaded into a MongoDB instance (dictionaries are the Python representation of JSON objects), which lets the user store, query, and filter the data efficiently.
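To make the structure described above concrete, the following minimal sketch shows one variant as a multi-level dictionary and its round trip through JSON. The field names and values are hypothetical placeholders chosen for illustration; they are not actual VAPr output.

```python
import json

# Hypothetical variant document; keys and values are illustrative only.
variant = {
    "hgvs_id": "chr1:g.12345A>G",
    "chr": "1",
    "start": 12345,
    "annotations": {
        "source_a": {"gene": "GENE1", "function": "exonic"},
        "source_b": {"allele_frequency": 0.01},
    },
}

# A list of such dictionaries maps directly onto a JSON array of objects,
# which is the shape a document database such as MongoDB stores.
variants = [variant]
as_json = json.dumps(variants, indent=2)

# Round-tripping through JSON preserves the nested structure.
restored = json.loads(as_json)
print(restored[0]["annotations"]["source_a"]["function"])
```

Because each variant is its own dictionary, two variants do not need to share the same set of keys.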
The package can also create csv and vcf files from MongoDB. The built-in filters let the user rapidly query data that meets certain criteria as a list of documents, which can then be transformed into more widely accepted formats such as vcf and csv files. The main advantage the package offers here is the ability to write these files while preserving all the annotation data. In the vcf files, for instance, outputs will have an 'Otherinfo' column where all the data coming from ANNOVAR and myvariant.info is condensed (while still preserving its structure).
Having the data stored in a database offers a variety of benefits. In particular, it enables the user to define custom queries, iterate rapidly over a specific procedure, and keep the analysis reproducible. It also allows data from different sources to be stored together and accessed quickly.
Notes on required software
The following libraries will be installed upon installing VAPr:
Other libraries that are needed, but should already be installed on most operating systems:
Further, a MongoDB database must be set up. Refer to the documentation page for more information. Similarly, ANNOVAR must be downloaded, along with its supporting databases (also listed on the documentation page).
In [1]:
import os
from IPython.display import Image, display, HTML
Image(filename=os.path.dirname(os.path.realpath('__file__')) + '/simpler.jpg')
Out[1]:
In [1]:
from VAPr import vapr_core
import os
In [2]:
IN_PATH = "/path/to/vcf"
OUT_PATH = "/path/to/out"
ANNOVAR_PATH = "/path/to/annovar"
MONGODB = 'VariantDatabase'
COLLECTION = 'CEU_trio_01012018'
In [3]:
annotator = vapr_core.VaprAnnotator(input_dir=IN_PATH,
                                    output_dir=OUT_PATH,
                                    mongo_db_name=MONGODB,
                                    mongo_collection_name=COLLECTION,
                                    build_ver='hg19',
                                    vcfs_gzipped=False,
                                    annovar_install_path=ANNOVAR_PATH)
If you plan to use ANNOVAR, make sure to download the necessary ANNOVAR databases; a fresh ANNOVAR installation does not include them by default. The VaprAnnotator object has a method, download_annovar_databases(), that downloads the necessary databases. Note: run this command only the first time you use VAPr.
In [ ]:
annotator.download_annovar_databases()
The following command runs ANNOVAR and processes its annotations, retrieves MyVariant.info annotations, merges the annotations by HGVS id into a JSON document, and uploads the documents to a MongoDB database. VAPr also automatically includes sample information from the VCF.
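The merge step mentioned above can be pictured with a small sketch. This is not VAPr's internal code: the helper `merge_by_hgvs_id`, the `hgvs_id` key, and the sample records are assumptions made for illustration, showing only the general idea of combining two annotation sources on a shared identifier.

```python
def merge_by_hgvs_id(annovar_records, myvariant_records):
    """Merge two lists of annotation dicts on their shared 'hgvs_id' key.

    Hypothetical helper for illustration; not part of the VAPr API.
    """
    merged = {}
    for record in annovar_records:
        # Copy so the input records are not modified.
        merged[record["hgvs_id"]] = dict(record)
    for record in myvariant_records:
        # Variants seen in only one source are kept as-is.
        merged.setdefault(record["hgvs_id"], {}).update(record)
    return list(merged.values())


# Illustrative records; the annotation values are made up.
annovar_records = [
    {"hgvs_id": "chr1:g.100A>G", "func_knowngene": "exonic"},
    {"hgvs_id": "chr2:g.200C>T", "func_knowngene": "intronic"},
]
myvariant_records = [
    {"hgvs_id": "chr1:g.100A>G", "cadd": {"phred": 20.1}},
]

documents = merge_by_hgvs_id(annovar_records, myvariant_records)
print(len(documents))
```

Each resulting dictionary corresponds to one document that could be uploaded to the database.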
In [ ]:
dataset = annotator.annotate(num_processes=8)
In [ ]:
dataset_light = annotator.annotate_lite(num_processes=8)
In [4]:
dataset = vapr_core.VaprDataset(MONGODB, COLLECTION)
In [3]:
rare_deleterious_variants = dataset.get_rare_deleterious_variants()
rare_deleterious_variants[0]
Out[3]:
Here we implement three different filters that allow for the retrieval of specific variants. The filters are implemented as MongoDB queries, and are designed to provide the user with a set of relevant variants. A template is provided for defining custom queries as well. The output of the queries is a list of dictionaries (JSON documents), where each dictionary contains a single variant document and its annotations.
Further, the package allows the user to parse these variants into an annotated csv or vcf file. If needed, annotated, unfiltered vcf and csv files can also be created. They will have the same length (number of variants) as the original files, but will contain much more complete annotation data coming from myvariant.info and ANNOVAR databases.
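As a rough illustration of how a list of variant documents can be written to csv without losing nested annotations, here is a minimal sketch using only the standard library. It is not the package's own writer: the function `variants_to_csv` and the sample documents are hypothetical, and storing nested values as JSON strings is just one possible design choice.

```python
import csv
import io
import json


def variants_to_csv(variants):
    """Return CSV text with one row per variant dict (hypothetical helper)."""
    # Collect every top-level key, since variants may carry different fields.
    columns = []
    for variant in variants:
        for key in variant:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="",
                            lineterminator="\n")
    writer.writeheader()
    for variant in variants:
        # Nested values are stored as JSON strings so their structure survives.
        row = {key: json.dumps(value) if isinstance(value, (dict, list)) else value
               for key, value in variant.items()}
        writer.writerow(row)
    return buffer.getvalue()


# Illustrative documents; the annotation values are made up.
variants = [
    {"hgvs_id": "chr1:g.100A>G", "cadd": {"phred": 20.1}},
    {"hgvs_id": "chr2:g.200C>T", "func_knowngene": "intronic"},
]
csv_text = variants_to_csv(variants)
print(csv_text)
```

The output has as many data rows as there are input variants, and fields missing from a given variant are left empty.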
In [21]:
# Apply filter.
rare_deleterious_variants = dataset.get_rare_deleterious_variants()
In [22]:
# Apply filter.
known_disease_variants = dataset.get_known_disease_variants()
In [23]:
len(known_disease_variants)
Out[23]:
In [ ]:
# Apply filter
deleterious_compound_heterozygous = dataset.get_deleterious_compound_heterozygous_variants()
In [5]:
# Apply filter.
denovo_variants = dataset.get_de_novo_variants(proband="NA12878",
                                               ancestor1="NA12891",
                                               ancestor2="NA12892")
denovo_variants[0]
Out[5]:
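The idea behind a de novo filter (a variant observed in the proband but in neither ancestor) can be sketched in plain Python. This is an illustration only, not the query VAPr runs: the function `find_de_novo` and the `samples` / `sample_id` field names are assumptions about the document layout.

```python
def find_de_novo(variants, proband, ancestor1, ancestor2):
    """Return variants seen in the proband but in neither ancestor.

    Hypothetical helper; 'samples' and 'sample_id' are assumed field names.
    """
    de_novo = []
    for variant in variants:
        sample_ids = {s["sample_id"] for s in variant.get("samples", [])}
        in_proband = proband in sample_ids
        in_ancestor = ancestor1 in sample_ids or ancestor2 in sample_ids
        if in_proband and not in_ancestor:
            de_novo.append(variant)
    return de_novo


# Illustrative documents; the variants are made up.
variants = [
    {"hgvs_id": "chr1:g.100A>G",
     "samples": [{"sample_id": "NA12878"}]},
    {"hgvs_id": "chr2:g.200C>T",
     "samples": [{"sample_id": "NA12878"}, {"sample_id": "NA12891"}]},
    {"hgvs_id": "chr3:g.300G>A",
     "samples": [{"sample_id": "NA12892"}]},
]

candidates = find_de_novo(variants, "NA12878", "NA12891", "NA12892")
print([v["hgvs_id"] for v in candidates])
```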
As long as you have a MongoDB instance running, filtering can be performed through pymongo, as shown by the code below. To turn the resulting cursor into a list, simply add: filtered = list(filtered)
If you'd like to customize your filters, a good starting point is to look at the fields available for filtering. The myvariant.info documentation lists all the fields that are available and can be used for filtering.
In [7]:
from pymongo import MongoClient
client = MongoClient()
db = client[MONGODB]
collection = db[COLLECTION]
filtered = collection.find({"$and": [
    {"$or": [{"func_knowngene": "exonic"},
             {"func_knowngene": "splicing"}]},
    {"cosmic70": {"$exists": True}},
    {"1000g2015aug_all": {"$lt": 0.05}}
]})
filtered = list(filtered)
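When several custom filters share the same shape, it may help to build the query document with a small function. The sketch below is one possible approach, not part of VAPr: `build_variant_query` is a hypothetical helper, and it uses MongoDB's `$in` operator in place of the `$or` over equal-field conditions used above. It only constructs the dictionary, so it runs without a database.

```python
def build_variant_query(functions, max_frequency, required_fields=()):
    """Build a MongoDB filter document for collection.find().

    Hypothetical helper; field names follow the query shown above.
    """
    clauses = [
        # Match any of the given functional classes.
        {"func_knowngene": {"$in": list(functions)}},
        # Keep variants below the given 1000 Genomes allele frequency.
        {"1000g2015aug_all": {"$lt": max_frequency}},
    ]
    for field in required_fields:
        # Require that the annotation field is present at all.
        clauses.append({field: {"$exists": True}})
    return {"$and": clauses}


query = build_variant_query(["exonic", "splicing"], 0.05,
                            required_fields=["cosmic70"])
print(query)

# With a running MongoDB instance and the `collection` object from above:
# filtered = list(collection.find(query))
```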