This notebook walks through how variants from a VCF file can be annotated efficiently and thoroughly using the package variantannotation. In particular, the package provides a way of retrieving variant information using ANNOVAR and myvariant.info and consolidating it in convenient formats. It is well-suited for bioinformaticians who want to aggregate variant information into a single database for ease of use and greater analysis capabilities.
The aggregation is performed by structuring the data as lists of Python dictionaries, with each variant described by a multi-level dictionary. This choice was made because data availability is inconsistent across sources, so the information must be stored in a flexible manner. Further, this format maps naturally onto a MongoDB instance (dictionaries are the Python representation of JSON objects), which lets the user efficiently store, query, and filter the data.
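For illustration, a single annotated variant might look like the hypothetical, abbreviated document below. The field names are the ones used in the filter examples later in this notebook; the values are made up, and real documents contain many more fields, including nested myvariant.info data.
In [ ]:
variant = {
    '_id': 'chr1:g.69511A>G',                     #HGVS id, the format myvariant.info uses
    'Func_knownGene': 'exonic',                   #ANNOVAR gene-level annotation
    'ExonicFunc_knownGene': 'nonsynonymous SNV',
    'ThousandGenomeAll': 0.002,                   #1000 Genomes allele frequency
    'ESP6500siv2_all': 0.001,                     #ESP6500 allele frequency
    'Genotype_Call': {'DP': 42},                  #nested, multi-level data
}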
Finally, the package can also create csv and vcf files from MongoDB. The class Filters allows the user to rapidly query data that meets certain criteria as a list of documents, and the class FileWriter can transform such a list into more widely accepted formats such as vcf and csv files. The main differentiator the package offers here is the ability to write these files while preserving all the annotation data. Output vcf files, for instance, will have an 'Otherinfo' column where all the data coming from ANNOVAR and myvariant.info is condensed (while still preserving its structure). Output csv files will have around 120-200 columns, depending on how much variant data can be retrieved from myvariant.info.
Notes on required software
The following libraries will be installed when you install variantannotation:
Other libraries that are needed, but should come natively installed on most operating systems:
Further, a MongoDB database must be set up; refer to the documentation page for more information. Similarly, ANNOVAR must be downloaded, along with its supporting databases (also listed on the documentation page).
In [19]:
os.getcwd()
Out[19]:
In [22]:
import os
import re
import sys
import vcf
import time
import pysam
import myvariant
import collections
import numpy as np
import pandas as pd
sys.path.append(os.getcwd().replace("notebooks/dnaSeq/VAPr_Variant_Annotation_Prioritization", "src/dnaSeq/VAPr"))
#variantannotation functions
from variantannotation import annotate_batch
from variantannotation import create_output_files
from variantannotation import myvariant_parsing_utils
from variantannotation import mongo_DB_export
from variantannotation import utilities
from variantannotation import MongoDB_querying
In [2]:
ANNOVAR_PATH = '/data/annovar/'
FILE_NAMES = ['Tumor_RNAseq_variants.vcf', 'Tumor_targeted_seq.vcf', 'normal_targeted_seq.vcf', 'normal_blood_WGS.vqsr.vcf', 'somatic_mutect_old.vcf']
IN_PATH = '/data/ccbb_internal/interns/Carlo/test_vcf/'
OUT_PATH = '/data/ccbb_internal/interns/Carlo/test_vcf_out/'
vcf_file = IN_PATH
In [3]:
#Check that the file paths correctly point to the specified files.
for name in FILE_NAMES:
    print IN_PATH + name, os.path.isfile(IN_PATH + name)
This will run ANNOVAR. A csv file named after the input (here, Tumor_RNAseq_variants.hg19_multianno.csv) will appear in the OUT_PATH specified. The csv file can then be processed and integrated with the data coming from myvariant.info. This command may take some time to run (5-30 minutes per file, depending on file size). To keep things simple, we start by looking at one file only and run ANNOVAR on it. If you have multiple files to work on, you can run them in parallel using the block after the next one.
In [ ]:
utilities.run_annovar(ANNOVAR_PATH, IN_PATH+FILE_NAMES[0], OUT_PATH)
In [ ]:
#ANNOVAR runs as a subprocess for every file; the jobs run in parallel for speedup.
for name in FILE_NAMES:
    utilities.run_annovar(ANNOVAR_PATH, IN_PATH + name, OUT_PATH)
#run_annovar gives real-time feedback on ANNOVAR's progress and status.
Specify the name and location of the csv file that ANNOVAR produces as output
In [4]:
filepath_out = '/data/ccbb_internal/interns/Carlo/test_vcf_out/'
filepath_in = '/data/ccbb_internal/interns/Carlo/test_vcf/'
#For safety, check the files in the directory. Either run '!ls' here in IPython, or go to the directory and check
#manually for existing files. There should be one csv file for every vcf file.
VCF_FILE_NAME = 'Tumor_RNAseq_variants.vcf'
CSV_FILE_NAME = 'Tumor_RNAseq_variants.hg19_multianno.csv'
The package offers 4 different methods to obtain variant data. Two of them require ANNOVAR, while the other two rely solely on the myvariant.info service. The latter can be used without having to download and install ANNOVAR and its databases, but it tends to return partial or no information for some variants.
The different methods also let the user decide how the data is parsed into MongoDB. Methods 1 and 3 parse the data in chunks: the user specifies a number of variants (usually 1000), and the data from the vcf and csv files is sent to the database as soon as that many variants have been processed and integrated. This enables huge files to be processed without holding them entirely in memory, which could otherwise cause an out-of-memory error.
Methods 2 and 4, on the other hand, process the files in their entirety and send them to MongoDB at once. They are well-suited for smaller files. See the docs for more info.
For this tutorial, we will use method #1. Data from ANNOVAR (as a csv file) will be read a chunk at a time (here, 10,000 lines), instead of attempting to parse and process the entire csv file at once.
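variantannotation performs this chunking internally; purely as an illustration of the idea, a chunked read of a csv with pandas might look like the sketch below (the file name is hypothetical).
In [ ]:
#Illustrative sketch only: stream a csv in fixed-size chunks so the whole
#file never has to sit in memory at once.
import pandas as pd
for chunk in pd.read_csv('example.hg19_multianno.csv', chunksize=10000):
    print len(chunk)    #each chunk is a DataFrame of up to 10,000 rows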
As soon as you run the scripts from variantannotation, variant data will be retrieved from myvariant.info and automatically integrated and stored in MongoDB. The database and collection names should be specified, and there must be a running MongoDB instance. The script will set up a client to communicate between Python (through pymongo) and the database.
In general, the shell command:
mongod --dbpath ../data/db
where ../data/db is the designated location where the data will be stored, will start MongoDB. After this, the script should store data to that directory automatically. For pymongo, and more information on how to set up a MongoDB database, see: https://docs.mongodb.com/getting-started/python/
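Once mongod is running, you can sanity-check the connection from Python before starting the annotation. A minimal sketch, assuming the default localhost:27017:
In [ ]:
from pymongo import MongoClient
#server_info() actually contacts the server and raises an error
#(ServerSelectionTimeoutError) if no mongod instance is reachable.
client = MongoClient()
print client.server_info()['version']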
In [17]:
chunksize = 10000
step = 0
#Get variant list. Should always be the first step after running ANNOVAR
open_file = myvariant_parsing_utils.VariantParsing()
#Name collections & DB. Change them to something appropriate. Each file should live in its own collection.
db_name = 'Variant_Prioritization_Workflow'
collection_name = 'Test_Tumor_RNAseq'
list_file = open_file.get_variants_from_vcf(filepath_in+VCF_FILE_NAME)
as_batch = annotate_batch.AnnotationMethods()
as_batch.by_chunks(list_file, chunksize, step, filepath_out+CSV_FILE_NAME, collection_name, db_name)
Out[17]:
Here we implement three different filters that allow for the retrieval of specific variants. The filters are implemented as MongoDB queries and are designed to provide the user with a set of relevant variants. In case the user would like to define their own query, a template is provided. The output of the queries is a list of dictionaries (JSON documents), where each dictionary contains data relative to one variant.
Further, the package allows the user to parse these variants into an annotated csv or vcf file. If needed, annotated, unfiltered vcf and csv files can also be created. They will have the same length (number of variants) as the original files, but will contain much more complete annotation data coming from myvariant.info and ANNOVAR databases.
To create a csv file, only the filtered output is needed. To create an annotated vcf file, a tabix index file (.tbi) is also needed (see the comments in the section Create unfiltered annotated vcf and csv files at the end of this page). This can be created using tabix.
First, the file needs to be compressed. From the command line, running:
bgzip -c input_file.vcf > input_file.vcf.gz
returns input_file.vcf.gz,
and running:
tabix -p vcf input_file.vcf.gz
will return input_file.vcf.gz.tbi
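Alternatively, since pysam is already imported in this notebook, both steps can be done from Python. A minimal sketch (the file name is an example; pysam.tabix_index bgzip-compresses an uncompressed input before indexing it):
In [ ]:
import pysam
#Compresses input_file.vcf to input_file.vcf.gz (if not already bgzipped)
#and writes the index input_file.vcf.gz.tbi alongside it.
pysam.tabix_index('input_file.vcf', preset='vcf')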
In [5]:
filepath = '/data/ccbb_internal/interns/Carlo'
In [9]:
#Create output files (if needed): specify name of files and path
rare_cancer_variants_csv = filepath + "/tumor_rna_rare_cancer_vars_csv.csv"
rare_cancer_variants_vcf = filepath + "/tumor_rna_rare_cancer_vars_vcf.vcf"
input_vcf_compressed = filepath + '/test_vcf/Tumor_RNAseq_variants.vcf.gz'
#Apply filter.
filter_collection = MongoDB_querying.Filters(db_name, collection_name)
rare_cancer_variants = filter_collection.rare_cancer_variant()
#Create writer object for filtered lists:
my_writer = create_output_files.FileWriter(db_name, collection_name)
#cancer variants filtered files
my_writer.generate_annotated_csv(rare_cancer_variants, rare_cancer_variants_csv)
my_writer.generate_annotated_vcf(rare_cancer_variants, input_vcf_compressed, rare_cancer_variants_vcf)
Out[9]:
In [13]:
#Apply filter.
filter_collection = MongoDB_querying.Filters(db_name, collection_name)
rare_disease_variants = filter_collection.rare_disease_variant()
Zero variants were found here, so writing a csv output would not make much sense. You can still customize the filters however you'd like, as shown below.
As long as you have a MongoDB instance running, filtering can be performed through pymongo as shown by the code below. If you want to turn the result into a list, simply add: filter2 = list(filter2)
If you'd like to customize your filters, a good idea is to look at the fields available for filtering. The myvariant.info documentation lists all the fields that are available and can be used for filtering.
In [ ]:
from pymongo import MongoClient
client = MongoClient()
db = client.My_Variant_Database            #use the database name you chose above
collection = db.ANNOVAR_MyVariant_chunks   #and the collection name
#Template query: rare (<5% allele frequency) non-synonymous exonic or
#splicing variants with read depth >= 10 that are present in COSMIC.
filter2 = collection.find({"$and": [
    {"$or": [{"ThousandGenomeAll": {"$lt": 0.05}}, {"ThousandGenomeAll": {"$exists": False}}]},
    {"$or": [{"ESP6500siv2_all": {"$lt": 0.05}}, {"ESP6500siv2_all": {"$exists": False}}]},
    {"$or": [{"Func_knownGene": "exonic"}, {"Func_knownGene": "splicing"}]},
    {"ExonicFunc_knownGene": {"$ne": "synonymous SNV"}},
    {"Genotype_Call.DP": {"$gte": 10}},
    {"cosmic70": {"$exists": True}}
]})
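To write such a custom result to a file, the cursor can be turned into a list and passed to the same FileWriter used above. A sketch reusing the objects defined earlier in this notebook (the output file name is an example):
In [ ]:
filter2 = list(filter2)    #materialize the cursor into a list of documents
my_writer = create_output_files.FileWriter(db_name, collection_name)
my_writer.generate_annotated_csv(filter2, filepath + "/custom_filter_csv.csv")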
In [15]:
#Create output files (if needed): specify name of files and path
out_unfiltered_vcf_file = filepath + "/out_unfiltered_rnaseq_vcf.vcf"
out_unfiltered_csv_file = filepath + "/out_unfiltered_rnaseq_csv.csv"
input_vcf_compressed = filepath + '/test_vcf/Tumor_RNAseq_variants.vcf.gz'
#Create writer object
#db and collection name must be specified since no list is given. The entire collection will be queried.
my_writer = create_output_files.FileWriter(db_name, collection_name)
#Write collection to csv and vcf
#The in_vcf_file must be the .vcf.gz file and it needs to have an associated .tbi file.
my_writer.generate_unfiltered_annotated_csv(out_unfiltered_csv_file)
my_writer.generate_unfiltered_annotated_vcf(input_vcf_compressed, out_unfiltered_vcf_file)
Out[15]:
...