I have been asked to help develop an in-house pipeline for DNA variant ranking. The request was:
Simply put, we need to develop a consistent ranking strategy for DNA variants - perhaps one that mirrors the protocol at [ http://www.semel.ucla.edu/research/variant-filtering-ranking-pipeline ]. For the most part, Guorong has the pipeline built already for generating fully annotated variant tables ... [snip /] ... It’s a matter of coming up with the filtering rules on the columns of data.
I read the link, which describes a four-tier filtering pipeline in which tier 1 is disruptive variants in coding exons (stop-gain, splice-site, short indels), tier 2 is damaging coding variants (predicted mis-sense), and tier 3 and 4 are variants conserved or accelerated in multiple lineages or just one lineage, respectively. I've been asked to produce a decision tree, akin to this one shown in figure 10.20 of Exploring Personal Genomics (2013) by Joel T. Dudley and Konrad J. Karczewski:
Information for making the scoring decisions is already being produced by our existing pipeline using the following call to annotate the .vcf variant file using ANNOVAR (see http://annovar.openbioinformatics.org/en/latest/ ):
I look up each of the pieces of this to ensure I understand what is going on:
Let's dig a bit more into what each of those different "protocols" is (quotes are from http://annovar.openbioinformatics.org/en/latest/user-guide/region/ unless noted otherwise):
ANNOVAR also provides useful tables summarizing the available and/or relevant values for some of these annotations:
from http://annovar.openbioinformatics.org/en/latest/user-guide/gene/:
Func.knownGene:
Value | Default precedence | Explanation |
---|---|---|
exonic | 1 | variant overlaps a coding exon |
splicing | 1 | variant is within 2-bp of a splicing junction (use -splicing_threshold to change this) |
ncRNA | 2 | variant overlaps a transcript without coding annotation in the gene definition |
UTR5 | 3 | variant overlaps a 5' untranslated region |
UTR3 | 3 | variant overlaps a 3' untranslated region |
intronic | 4 | variant overlaps an intron |
upstream | 5 | variant overlaps 1-kb region upstream of transcription start site |
downstream | 5 | variant overlaps 1-kb region downtream of transcription end site (use -neargene to change this) |
intergenic | 6 | variant is in intergenic region |
ExonicFunc.knownGene:
Annotation | Precedence | Explanation |
---|---|---|
frameshift insertion | 1 | an insertion of one or more nucleotides that cause frameshift changes in protein coding sequence |
frameshift deletion | 2 | a deletion of one or more nucleotides that cause frameshift changes in protein coding sequence |
frameshift block substitution | 3 | a block substitution of one or more nucleotides that cause frameshift changes in protein coding sequence |
stopgain | 4 | a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that lead to the immediate creation of stop codon at the variant site. For frameshift mutations, the creation of stop codon downstream of the variant will not be counted as "stopgain"! |
stoploss | 5 | a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that lead to the immediate elimination of stop codon at the variant site |
nonframeshift insertion | 6 | an insertion of 3 or multiples of 3 nucleotides that do not cause frameshift changes in protein coding sequence |
nonframeshift deletion | 7 | a deletion of 3 or mutliples of 3 nucleotides that do not cause frameshift changes in protein coding sequence |
nonframeshift block substitution | 8 | a block substitution of one or more nucleotides that do not cause frameshift changes in protein coding sequence |
nonsynonymous SNV | 9 | a single nucleotide change that cause an amino acid change |
synonymous SNV | 10 | a single nucleotide change that does not cause an amino acid change |
unknown | 11 | unknown function (due to various errors in the gene structure definition in the database file) |
from http://annovar.openbioinformatics.org/en/latest/user-guide/filter/:
Classifications of available scores:
Score (dbtype) | # variants in LJB23 build hg19 | Categorical Prediction |
---|---|---|
SIFT (sift) | 77593284 | D: Deleterious (sift<=0.05); T: tolerated (sift>0.05) |
PolyPhen 2 HDIV (pp2_hdiv) | 72533732 | D: Probably damaging (>=0.957), P: possibly damaging (0.453<=pp2_hdiv<=0.956); B: benign (pp2_hdiv<=0.452) |
PolyPhen 2 HVar (pp2_hvar) | 72533732 | D: Probably damaging (>=0.909), P: possibly damaging (0.447<=pp2_hdiv<=0.909); B: benign (pp2_hdiv<=0.446) |
LRT (lrt) | 68069321 | D: Deleterious; N: Neutral; U: Unknown |
MutationTaster (mt) | 88473874 | A" ("disease_causing_automatic"); "D" ("disease_causing"); "N" ("polymorphism"); "P" ("polymorphism_automatic" |
MutationAssessor (ma) | 74631375 | H: high; M: medium; L: low; N: neutral. H/M means functional and L/N means non-functional |
FATHMM (fathmm) | 70274896 | D: Deleterious; T: Tolerated |
MetaSVM (metasvm) | 82098217 | D: Deleterious; T: Tolerated |
MetaLR (metalr) | 82098217 | D: Deleterious; T: Tolerated |
GERP++ (gerp++) | 89076718 | higher scores are more deleterious |
PhyloP (phylop) | 89553090 | higher scores are more deleterious |
SiPhy (siphy) | 88269630 | higher scores are more deleterious |
I also find the following additional information on the http://annovar.openbioinformatics.org/en/latest/user-guide/filter/ page:
I can't find any info on the ANNOVAR website about VEST3.
Additionally, CCBB has added custom annotations to the ANNOVAR output that include the PharmGKB database's drug and phenotype information for each variant (when found).
The first step in this tree is to determine whether the subject's ethnicity is known. This is because we are only interested in rare variants (on the theory that most people aren't sick, so a variant that makes you sick must ipso facto be rare) but different ethnic groups have different allele frequencies. Therefore, if we know the subject's ethnicity, it seems logical to compare it to allele frequencies for that ethnicity; if not, using the combined population of all ethnicities is probably the best bet.
The next decision point looks at the maximum of all the relevent (as defined above) allele frequencies (from 1000 Genomes, Exome Sequencing Project, etc) and discards any variant whose maximum allele frequency across all those is not less than 0.05--e.g., 5%. Subsequently, any variants with a listing in the genomicSuperDups table (e.g., likely to occur in a segmental duplication) are also discarded; this is because the ANNOVAR documentation at http://annovar.openbioinformatics.org/en/latest/user-guide/region/ states "Genetic variants that are mapped to segmental duplications are most likely sequence alignment errors and should be treated with extreme caution. Sometimes they manifest as SNPs with high fold coverage and probably high confidence score, but they may actually represent two non-polymorphic sites in the genomes that happen to have the same flanking sequence.”
Remaining variants are checked for whether they have any known annotations in human disease (according to GWAS studies, ClinVar, COSMIC, or NCI-60) and, if so, routed straight to the top-tier results set; if not, we determine whether they are exonic. Exonic variants that are likely to disrupt or truncate the protein go into the top-tier bin, those that are likely to cause other protein mutations go mostly into either the Tier 2 or Tier 3 bins, depending on how likely the mutations are to be deleterious. Variants that are synonymous SNVs, variants of unknown function, or nonsynonymous SNVs that are nonetheless not predicted to be deleterious by any of the listed classifiers are further triaged (note that VEST3 is not included in the classifier list, as I don't know anything about it or its outputs, and BOTH polyphen predictions ARE included although I think probably only one is relevant because I don't know WHICH is relevant for any given study ... these are issues I'd like to address in the future.)
The last triage is a final check of the CADD score; since CADD is supposed to "assign a score to each possible mutation in the human genome" (see above), I expect that each variant will have one, and a CADD phred of 30 are variants in the top 0.1% of CADD scores, which might be interesting and are put into the lowest tier (Tier 4) while all remaining exonic variants are discarded. Variants not categorized as exonic but as "exonic;splicing" are likewise assessed by their various classifier calls and CADD phred score in the same way as the exonic variants, on the grounds that disrupting splicing is also a change likely to cause protein disruptions or mutations.
Variants that are not exonic or splicing are assessed for possible regulatory disruption. All variants called as "ncRNA_exonic", "ncRNA_splicing", or "ncRNA_exonic;splicing" go straight into Tier 3, as do those that fall in 3' UTRs and are in miRNA binding sites as predicted by TargetScanS, and those that are upstream of genes and are located in transcription factor binding sites as predicted by the Transfac Matrix Database. All remaining variants that do not meet any of these criteria are again triaged a final time by CADD phred score; those with a CADD phred > 30 are placed in Tier 4 while the rest are discarded.
I talked through the first draft with Dr. Aaron Chang, who made some suggestions for changes, leading to the following revision:
This version adds an additional decision point, right after the check for variants known in human disease, to put any variants that are known to have a drug association straight into the top tier (since PharmGKB drug data is available). The handling of nonsynonymous SNVs (and exonic;splicing variants) that aren't classified as deleterious by ALL classifiers has been made more forgiving, and more tiers have been added to reflect this. Variants that are categorized as deleterious by ANY classifier are placed into a new tier (renumbered as 3) representing likely (maybe ... ?) protein disruption, while those not classified as deleterious by even one classifier are now checked for CADD phred score and those with values > 30 are placed into a second new tier (renumbered as 4) of "possible protein disrupters". Even those that fail this CADD phred check are not discarded but are placed into the lowest tier (renumbered as 6) for the really desperate to examine.
In discussing the previous version, Aaron emphasized the high relevance of anything with information already known about it. I therefore went on to reorganize a bit further:
This version puts the variants with known drug or disease relationships into the top tier even if they are not rare in terms of allele frequency. However, this turns out to be better in theory than in practice: I try running through this decision tree on our GX HLI Tumor RNASeq pharmGKB v2 project as a test, executing the filtering by hand in Array Studio, and I get the following results:
I see that the top tier is un-usably large, containing almost 3000 items, 2700 of which are variants that have some known human disease/phenotype association. When I dig further into those, I see that most of those 2700 are simply variants that have a non-zero allele frequency in the NCI-60 data set, which probably doesn't merit such high prioritization. I also see that NOTHING passes either of the CADD phred filters ... perhaps top 0.1% is too strict? (Confusingly, I also see a lot of variants that have the "." of a "NA" for their CADD scores--both raw and phred--which confuses me as I thought CADD was supposed to be calculated for all possible variants? I need to suss this out later.) Finally, I see that the question of sample ethnicity (and thus what set of allele frequencies should be interrogated) is really a different sort of decision than the rest ... all the others are decisions made per variant, but that one is a decision at the level of the sample all the variants come from and determines what kind of information will be used in the per-variant decisions.
Given these findings, I revise the decision tree once again:
This time I am removing the ethnicity check (which should still be done, but as described above is more of a pre-analysis step), lowering the CADD phred threshold, and adding a step between "all classifiers call deleterious" and "only one classifier calls deleterious" for "more than half of classifiers call deleterious". I also make the drug/disease associations a last-ditch triage step rather than the highest priority step, so that anything that has a drug or disease association will never be discarded but also won't overwhelm more immediately relevant variants.
When I (again, manually through Array Studio) run this decision tree on the GX HLI Tumor RNASeq pharmGKB v2 project, which includes 422,710 variants identified by RNA-Seq, below is how the decision tree filters them. (Note that decision points are numbered here so that I can trace the provenance of each outcome by its path through the tree.)
This decision tree produces more reasonable results: generally smaller populations in higher tiers, widening out to larger sets at lower tiers. The bottom tier (Tier 6) still has, in my opinion, too many items in it, and again most of those (2425 out of 2729) are simply variants with an NCI-60 allele frequency assigned to them (any NCI-60 allele frequency). But I don't anticipate looking at those anyway, and since I kept track of the path through the tree for each variant, I could filter them out later if I actually did have reason to examine that tier.
I think version 4 is a solid, defensible decision tree based on the information we currently have, but there are several points I'd like to improve. There are a few pieces of info we currently have that I feel are not being used adequately:
I also think we there's additional information we don't have that would make the tree better, possibly including:
Finally, I think there's some information that we are currently getting that we don't need, and could take out to speed up the ANNOVAR processing/reduce the size of the output:
In [ ]: