Prioritizing Variants In Patient Genome Data

Amanda Birmingham, CCBB, UCSD (abirmingham@ucsd.edu)

Background
Decision Tree Development
Next Steps

Background

I have been asked to help develop an in-house pipeline for DNA variant ranking. The request was:

Simply put, we need to develop a consistent ranking strategy for DNA variants - perhaps one that mirrors the protocol at [ http://www.semel.ucla.edu/research/variant-filtering-ranking-pipeline ]. For the most part, Guorong has the pipeline built already for generating fully annotated variant tables ... [snip /] ... It’s a matter of coming up with the filtering rules on the columns of data.

I read the link, which describes a four-tier filtering pipeline in which tier 1 is disruptive variants in coding exons (stop-gain, splice-site, short indels), tier 2 is damaging coding variants (predicted mis-sense), and tier 3 and 4 are variants conserved or accelerated in multiple lineages or just one lineage, respectively. I've been asked to produce a decision tree, akin to this one shown in figure 10.20 of Exploring Personal Genomics (2013) by Joel T. Dudley and Konrad J. Karczewski:

Information for making the scoring decisions is already being produced by our existing pipeline using the following call to annotate the .vcf variant file using ANNOVAR (see http://annovar.openbioinformatics.org/en/latest/ ):

perl table_annovar.pl /mnt/data/workspace/data/normal_blood_wgs_hli_B1000000012_S1.final.vqsr.vcf.avinput humandb/ -buildver hg19 -out /mnt/data/workspace/data/normal_blood_wgs_hli_B1000000012_S1.vqsr_annotated -remove -protocol knownGene,tfbsConsSites,cytoBand,targetScanS,genomicSuperDups,gwasCatalog,esp6500siv2_all,1000g2014oct_all,snp138,ljb26_all,cg46,cg69,popfreq_all_20150413,clinvar_20150330,cosmic70,nci60 -operation g,r,r,r,r,r,f,f,f,f,f,f,f,f,f,f -nastring .

I look up each of the pieces of this to ensure I understand what is going on:

/mnt/data/workspace/data/normal_blood_wgs_hli_B1000000012_S1.final.vqsr.vcf.avinput: The .vcf information converted into the ANNOVAR input format
humandb/: The location of the ANNOVAR database information on the local system
-buildver hg19: A switch telling ANNOVAR to expect the input file coordinates to be against human genome build hg19 rather than the default hg18
-out /mnt/data/workspace/data/normal_blood_wgs_hli_B1000000012_S1.vqsr_annotated: The location of the output file
-remove: A switch telling ANNOVAR to remove all temporary files
-protocol knownGene,tfbsConsSites,cytoBand,targetScanS,genomicSuperDups,gwasCatalog,esp6500siv2_all,1000g2014oct_all,snp138, ljb26_all,cg46,cg69,popfreq_all_20150413,clinvar_20150330,cosmic70,nci60: A switch giving a (long) list of all the "protocols" (annotation types) to be included in the output
-operation g,r,r,r,r,r,f,f,f,f,f,f,f,f,f,f: A switch indicating which level applies to each of the protocols listed by the previous switch; according to http://annovar.openbioinformatics.org/en/latest/user-guide/startup/, "g means gene-based, r means region-based and f means filter-based"
-nastring .: Apparently, a switch indicating that when a particular field is not applicable ("na") for a given variant, a dot should be put in that field (note that I say "apparently" because I can't find an actual description of this switch in the ANNOVAR web content, but this matches with what I see in the output)

Let's dig a bit more into what each of those different "protocols" is (quotes are from http://annovar.openbioinformatics.org/en/latest/user-guide/region/ unless noted otherwise):

knownGene: annotates variants to UCSC's Known Gene collection rather than RefSeq (ANNOVAR's default); ANNOVAR documentation notes this is desirable since UCSC (and Ensembl) are more complete than RefSeq in terms of computationally predicted and poorly annotated genes. As one would expect, this annotation operates on the "gene" level. The knownGene annotation provides several designations for each variant, including Func.knownGene, which indicates the expected function of the gene (e.g., exonic, intergenic, etc--see below) and, for predicted-exonic variants, ExonicFunc.knownGene, which indicates the predicted function of that exonic variant (e.g., stop gain, nonsynonymous mutation, etc--again, see below).
tfbsConsSites: annotates with "the location and score of transcription factor binding sites conserved in the human/mouse/rat alignment, where score and threshold are computed with the Transfac Matrix Database".
cytoBand: annotates with "Giemsa-stained chromosomes bands" ... whatever that means?
targetScanS: "the TargetScanS annotation ... shows conserved mammalian microRNA regulatory target sites for conserved microRNA families in the 3' UTR regions of Refseq Genes, as predicted by TargetScanHuman”.
genomicSuperDups: annotates with "all genomic loci at least 1 kb in length with at least 90% similarity to at least one other locus" from "the 'genomicSuperDups' table ('Duplications of >1000 Bases of Non-RepeatMasked Sequence') from the UCSC genome browser" (from http://blog.goldenhelix.com/bchristensen/sequence-analysis-methods-not-just-for-sequence-data/ ).
gwasCatalog: annotates "variants that were previously reported to be associated with diseases or traits in genome-wide association studies", although I don't know the source of this info.
- Note that from here on down, all the quotes are from http://annovar.openbioinformatics.org/en/latest/user-guide/filter/ unless otherwise specified.
esp6500siv2_all: Annotates with the allele frequency data from "[t]he ESP [,] ... a NHLBI funded exome sequencing project aiming to identify genetic variants in exonic regions from over 6000 individuals, including healthy ones as well as subjects with different diseases". Note that "[a]s of March 2015, the most recent version of ESP is esp6500siv2, so whenever possible, users should use this database for annotation", and the 'all' means all ethnicity groups.
1000g2014oct_all: annotates with "alternative allele frequency data in 1000 Genomes Project for autosomes and sex chromosomes"; again, 'all' means all ethnicities.
snp138: annotates "the variant[s] that are already reported in dbSNP and also identif[ies] the corresponding rs identifiers".
ljb26_all: annotates with all scores from 'the LJB databases', so-called for the authors (Liu, Jian, Boerwinkle) and more commonly known as dbNSFP; these are "SIFT scores, PolyPhen2 HDIV scores, PolyPhen2 HVAR scores, LRT scores, MutationTaster scores, MutationAssessor score[s], FATHMM scores, GERP++ scores, PhyloP scores and SiPhy scores". Note that observation of ANNOVAR outputs shows this list is incomplete; outputs also include RadialSVM (referred to as MetaSVM in most of the documentation), LR (referred to as MetaLr in most of the documentation), VEST3, and CADD (more details on all of the scores below).
cg46: annotates with "allele frequency data from 46 unrelated subjects" on the Complete Genomics sequencing platform; note that the subjects were healthy.
cg69: annotates with "allele frequency data ... 69 related subjects (including the 46 unrelated subjects)" on the Complete Genomics sequencing platform; note that the subjects were healthy.
popfreq_all_20150413: annotates with "allele frequency from several population frequency databases, including 1000 Genomes Project (ALL+5 ethnicity groups), ESP6500 (ALL+2 ethnicity groups), ExAC (ALL+7 ethnicity groups), CG46 ... The name of the database is suffixed by the date when the database is made. Currently, 20150413 is the suffix."
clinvar_20150330: annotates with known information about relationships among variations and human health from the ClinVar database.
cosmic70: annotates with "somatic mutations reported in literature in various types of cancers" from COSMIC, the "Catalogue Of Somatic Mutations In Cancer".
nci60: annotates with "allele frequency information from these 60 cell lines ["The NCI-60 cell lines [,]... the most frequently studied human tumor cell lines in cancer research"] based on their exome sequencing data", compiled by ANNOVAR.

ANNOVAR also provides useful tables summarizing the available and/or relevant values for some of these annotations:

from http://annovar.openbioinformatics.org/en/latest/user-guide/gene/:
Func.knownGene:

Value	Default precedence	Explanation
exonic	1	variant overlaps a coding exon
splicing	1	variant is within 2-bp of a splicing junction (use -splicing_threshold to change this)
ncRNA	2	variant overlaps a transcript without coding annotation in the gene definition
UTR5	3	variant overlaps a 5' untranslated region
UTR3	3	variant overlaps a 3' untranslated region
intronic	4	variant overlaps an intron
upstream	5	variant overlaps 1-kb region upstream of transcription start site
downstream	5	variant overlaps 1-kb region downtream of transcription end site (use -neargene to change this)
intergenic	6	variant is in intergenic region

ExonicFunc.knownGene:

Annotation	Precedence	Explanation
frameshift insertion	1	an insertion of one or more nucleotides that cause frameshift changes in protein coding sequence
frameshift deletion	2	a deletion of one or more nucleotides that cause frameshift changes in protein coding sequence
frameshift block substitution	3	a block substitution of one or more nucleotides that cause frameshift changes in protein coding sequence
stopgain	4	a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that lead to the immediate creation of stop codon at the variant site. For frameshift mutations, the creation of stop codon downstream of the variant will not be counted as "stopgain"!
stoploss	5	a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that lead to the immediate elimination of stop codon at the variant site
nonframeshift insertion	6	an insertion of 3 or multiples of 3 nucleotides that do not cause frameshift changes in protein coding sequence
nonframeshift deletion	7	a deletion of 3 or mutliples of 3 nucleotides that do not cause frameshift changes in protein coding sequence
nonframeshift block substitution	8	a block substitution of one or more nucleotides that do not cause frameshift changes in protein coding sequence
nonsynonymous SNV	9	a single nucleotide change that cause an amino acid change
synonymous SNV	10	a single nucleotide change that does not cause an amino acid change
unknown	11	unknown function (due to various errors in the gene structure definition in the database file)

from http://annovar.openbioinformatics.org/en/latest/user-guide/filter/:
Classifications of available scores:

Score (dbtype)	# variants in LJB23 build hg19	Categorical Prediction
SIFT (sift)	77593284	D: Deleterious (sift<=0.05); T: tolerated (sift>0.05)
PolyPhen 2 HDIV (pp2_hdiv)	72533732	D: Probably damaging (>=0.957), P: possibly damaging (0.453<=pp2_hdiv<=0.956); B: benign (pp2_hdiv<=0.452)
PolyPhen 2 HVar (pp2_hvar)	72533732	D: Probably damaging (>=0.909), P: possibly damaging (0.447<=pp2_hdiv<=0.909); B: benign (pp2_hdiv<=0.446)
LRT (lrt)	68069321	D: Deleterious; N: Neutral; U: Unknown
MutationTaster (mt)	88473874	A" ("disease_causing_automatic"); "D" ("disease_causing"); "N" ("polymorphism"); "P" ("polymorphism_automatic"
MutationAssessor (ma)	74631375	H: high; M: medium; L: low; N: neutral. H/M means functional and L/N means non-functional
FATHMM (fathmm)	70274896	D: Deleterious; T: Tolerated
MetaSVM (metasvm)	82098217	D: Deleterious; T: Tolerated
MetaLR (metalr)	82098217	D: Deleterious; T: Tolerated
GERP++ (gerp++)	89076718	higher scores are more deleterious
PhyloP (phylop)	89553090	higher scores are more deleterious
SiPhy (siphy)	88269630	higher scores are more deleterious

I also find the following additional information on the http://annovar.openbioinformatics.org/en/latest/user-guide/filter/ page:

Polyphen2_HDIV: "pp2hdiv should be used when evaluating rare alleles at loci potentially involved in complex phenotypes, dense mapping of regions identified by genome-wide association studies, and analysis of natural selection from sequence data. The authors recommend calling "probably damaging" if the score is between 0.957 and 1, and "possibly damaging" if the score is between 0.453 and 0.956, and "benign" is the score is between 0 and 0.452."
Polyphen2_HVAR: "pp2hvar should be used for diagnostics of Mendelian diseases, which requires distinguishing mutations with drastic effects from all the remaining human variation, including abundant mildly deleterious alleles.The authors recommend calling "probably damaging" if the score is between 0.909 and 1, and "possibly damaging" if the score is between 0.447 and 0.908, and "benign" is the score is between 0 and 0.446."
FATHMM: "If a score is smaller than -1.5 the corresponding NS is predicted as "D(AMAGING)"; otherwise it is predicted as "T(OLERATED)". .... The method is less well known compared to SIFT and PolyPhen, but in our experience it works really well and better than SIFT/PolyPhen."
RadialSVM: "MetaSVM is developed by Coco Dong at my lab in collaboration with Dr. Xiaoming Liu. It is composed of two steps: (1) perform imputation for whole-exome variants and fill out missing scores for SIFT, PolyPhen, MutationAssessor and so on. (2) Normalize all scores to 0-1 range (3) use a radial SVM model to train prediction model using all available scores and some population genetics parameters, and then apply the model on whole-exome variants."
PhyloP/SiPhy/GERP++: "These three can be considered as competitors of each other (just like SIFT and PolyPhen are competitors of each other)."
PhyloP: "PhyloP score is based on multiple alignments of 46 genomes. … . The larger the score, the more conserved the site."
SiPhy: "SiPhy score is based on 29 mammals genomes. The larger the score, the more conserved the site."
GERP++: "The ljb23_gerp++ contains only annotation for coding variants!!! …. Generally the higher the score, the more conserved the site."
CADD: "CADD (Combined Annotation Dependent Depletion) is a score that is based on SVM on multiple other scores. One nice thing about it is that it assigns a score to each possible mutation in the human genome, therefore can evaluate non-coding variants as well as coding ones. …. The comma-delimited values are raw scores and phred-scaled scores. Basically, for scaled scores, 10 means 10% percentile highest scores, 20 means 1% percentile highest scores, and 30% means 0.1% percentile highest scores."

I can't find any info on the ANNOVAR website about VEST3.

Additionally, CCBB has added custom annotations to the ANNOVAR output that include the PharmGKB database's drug and phenotype information for each variant (when found).

Table of Contents

Decision Tree

Working from the stub example provided and the tiers suggested by the Semel Institute document, I mocked up a first version of a decision tree:

Version One

The first step in this tree is to determine whether the subject's ethnicity is known. This is because we are only interested in rare variants (on the theory that most people aren't sick, so a variant that makes you sick must ipso facto be rare) but different ethnic groups have different allele frequencies. Therefore, if we know the subject's ethnicity, it seems logical to compare it to allele frequencies for that ethnicity; if not, using the combined population of all ethnicities is probably the best bet.

The next decision point looks at the maximum of all the relevent (as defined above) allele frequencies (from 1000 Genomes, Exome Sequencing Project, etc) and discards any variant whose maximum allele frequency across all those is not less than 0.05--e.g., 5%. Subsequently, any variants with a listing in the genomicSuperDups table (e.g., likely to occur in a segmental duplication) are also discarded; this is because the ANNOVAR documentation at http://annovar.openbioinformatics.org/en/latest/user-guide/region/ states "Genetic variants that are mapped to segmental duplications are most likely sequence alignment errors and should be treated with extreme caution. Sometimes they manifest as SNPs with high fold coverage and probably high confidence score, but they may actually represent two non-polymorphic sites in the genomes that happen to have the same flanking sequence.”

Remaining variants are checked for whether they have any known annotations in human disease (according to GWAS studies, ClinVar, COSMIC, or NCI-60) and, if so, routed straight to the top-tier results set; if not, we determine whether they are exonic. Exonic variants that are likely to disrupt or truncate the protein go into the top-tier bin, those that are likely to cause other protein mutations go mostly into either the Tier 2 or Tier 3 bins, depending on how likely the mutations are to be deleterious. Variants that are synonymous SNVs, variants of unknown function, or nonsynonymous SNVs that are nonetheless not predicted to be deleterious by any of the listed classifiers are further triaged (note that VEST3 is not included in the classifier list, as I don't know anything about it or its outputs, and BOTH polyphen predictions ARE included although I think probably only one is relevant because I don't know WHICH is relevant for any given study ... these are issues I'd like to address in the future.)

The last triage is a final check of the CADD score; since CADD is supposed to "assign a score to each possible mutation in the human genome" (see above), I expect that each variant will have one, and a CADD phred of 30 are variants in the top 0.1% of CADD scores, which might be interesting and are put into the lowest tier (Tier 4) while all remaining exonic variants are discarded. Variants not categorized as exonic but as "exonic;splicing" are likewise assessed by their various classifier calls and CADD phred score in the same way as the exonic variants, on the grounds that disrupting splicing is also a change likely to cause protein disruptions or mutations.

Variants that are not exonic or splicing are assessed for possible regulatory disruption. All variants called as "ncRNA_exonic", "ncRNA_splicing", or "ncRNA_exonic;splicing" go straight into Tier 3, as do those that fall in 3' UTRs and are in miRNA binding sites as predicted by TargetScanS, and those that are upstream of genes and are located in transcription factor binding sites as predicted by the Transfac Matrix Database. All remaining variants that do not meet any of these criteria are again triaged a final time by CADD phred score; those with a CADD phred > 30 are placed in Tier 4 while the rest are discarded.

Table of Contents

Version Two

I talked through the first draft with Dr. Aaron Chang, who made some suggestions for changes, leading to the following revision:

This version adds an additional decision point, right after the check for variants known in human disease, to put any variants that are known to have a drug association straight into the top tier (since PharmGKB drug data is available). The handling of nonsynonymous SNVs (and exonic;splicing variants) that aren't classified as deleterious by ALL classifiers has been made more forgiving, and more tiers have been added to reflect this. Variants that are categorized as deleterious by ANY classifier are placed into a new tier (renumbered as 3) representing likely (maybe ... ?) protein disruption, while those not classified as deleterious by even one classifier are now checked for CADD phred score and those with values > 30 are placed into a second new tier (renumbered as 4) of "possible protein disrupters". Even those that fail this CADD phred check are not discarded but are placed into the lowest tier (renumbered as 6) for the really desperate to examine.

Table of Contents

Version Three

In discussing the previous version, Aaron emphasized the high relevance of anything with information already known about it. I therefore went on to reorganize a bit further:

This version puts the variants with known drug or disease relationships into the top tier even if they are not rare in terms of allele frequency. However, this turns out to be better in theory than in practice: I try running through this decision tree on our GX HLI Tumor RNASeq pharmGKB v2 project as a test, executing the filtering by hand in Array Studio, and I get the following results:

I see that the top tier is un-usably large, containing almost 3000 items, 2700 of which are variants that have some known human disease/phenotype association. When I dig further into those, I see that most of those 2700 are simply variants that have a non-zero allele frequency in the NCI-60 data set, which probably doesn't merit such high prioritization. I also see that NOTHING passes either of the CADD phred filters ... perhaps top 0.1% is too strict? (Confusingly, I also see a lot of variants that have the "." of a "NA" for their CADD scores--both raw and phred--which confuses me as I thought CADD was supposed to be calculated for all possible variants? I need to suss this out later.) Finally, I see that the question of sample ethnicity (and thus what set of allele frequencies should be interrogated) is really a different sort of decision than the rest ... all the others are decisions made per variant, but that one is a decision at the level of the sample all the variants come from and determines what kind of information will be used in the per-variant decisions.

Table of Contents

Version Four

Given these findings, I revise the decision tree once again:

This time I am removing the ethnicity check (which should still be done, but as described above is more of a pre-analysis step), lowering the CADD phred threshold, and adding a step between "all classifiers call deleterious" and "only one classifier calls deleterious" for "more than half of classifiers call deleterious". I also make the drug/disease associations a last-ditch triage step rather than the highest priority step, so that anything that has a drug or disease association will never be discarded but also won't overwhelm more immediately relevant variants.

When I (again, manually through Array Studio) run this decision tree on the GX HLI Tumor RNASeq pharmGKB v2 project, which includes 422,710 variants identified by RNA-Seq, below is how the decision tree filters them. (Note that decision points are numbered here so that I can trace the provenance of each outcome by its path through the tree.)

This decision tree produces more reasonable results: generally smaller populations in higher tiers, widening out to larger sets at lower tiers. The bottom tier (Tier 6) still has, in my opinion, too many items in it, and again most of those (2425 out of 2729) are simply variants with an NCI-60 allele frequency assigned to them (any NCI-60 allele frequency). But I don't anticipate looking at those anyway, and since I kept track of the path through the tree for each variant, I could filter them out later if I actually did have reason to examine that tier.

Table of Contents

Next Steps

I think version 4 is a solid, defensible decision tree based on the information we currently have, but there are several points I'd like to improve. There are a few pieces of info we currently have that I feel are not being used adequately:

Conservation metrics: This decision tree doesn't use PhyloP, SiPhy, or GERP++ values based on conservation of sequence at all. I'd like to learn more about those and figure out where in the tree they could add value; my guess would be they might be able to sieve out some potentially meaningful variants that aren't exonic/splicing/ncRNA, which currently are mostly ignored. It is perhaps worth noting that ANNOVAR can also annotate conserved genomic regions (unlike the scores above, which are at the level of conserved bases) from phastCons.
NCI-60 allele frequency: Conversely, as discussed above, I feel this tree may be using the NCI-60 allele frequencies too strongly. Can we be more sensible about them, to avoid having them take over Tier 6? Or should they have their own tier?

I also think we there's additional information we don't have that would make the tree better, possibly including:

Chromatin marks: If a variant fell in a region known to be an active promotor or enhancer, we'd probably upgrade it to Tier 5; if it fell into a known repressed region, we might want to discard or down-grade it even if it met the criteria for a tier above discard. I'm not sure that knowing that a variant fell in an active elongation region would affect our decision-making. Apparently we can get this info from ENCODE into ANNOVAR, but there is a caveat that "although UCSC may provide more ENCODE annotations for hg19 in the future, at least at end of 2013 rich annotation is only available for hg18. The examples below all used hg18. If you are interested to annotate your hg19 variants, some of these databases may not be available" (http://annovar.openbioinformatics.org/en/latest/user-guide/region/#identify-variants-in-encode-annotated-regions ), which could be a problem if it is still applicable. Another source might be Manolis Kellis' HaploReg database, which apparently includes chromatin information for many different tissue types, perhaps making it more relevant ... but I don't know if it is importable into ANNOVAR ...
DNA accessibility: The Semel Institute protocol states that "conservation is a poor predictor of functional regulatory variation itself, whereas it can accurately trace a protein-DNA binding event when considered in combination with nucleotide-level DNA accessibility". Knowing that a variant fell in an accessible region, combined with use of the conservation metrics (see above) could justify pulling something out of Tier 6 or the Discard pile and putting it into ... a new tier? The ANNOVAR documentation mentions that ENCODE provides Genome Browser tracks for DNase1 hypersensitivity that can be used in ANNOVAR.
tRNA rate: Knowing that a synonymous SNV strongly affected tRNA rate (either upwards or downwards) would probably put it into Tier 5 rather than Tier 6. Although we probably can't know the actual change in tRNA rate due to a synonymous mutation, we could calculate whether the created codon is more or less rare than the original and by how much (since rare codons are usually slower).
RNA stability: Knowing that a synonymous SNV (or ncRNA_* variant) is very destabilizing would probably put it into Tier 5 rather than Tier 6. However, basic googling doesn't inform me of any large-scale databases of this information, and it isn't something calculable.

Finally, I think there's some information that we are currently getting that we don't need, and could take out to speed up the ANNOVAR processing/reduce the size of the output:

cytoBand: Not currently being used (should it be?)
cg46 and cg69: According to the ANNOVAR documentation at http://annovar.openbioinformatics.org/en/latest/user-guide/filter/, "Each technical platform, such as Complete Genomics and Illumina HiSeq, may generate some platform specific sequencing artifacts. Complete genomics provides whole-genome data for a relatively small group of healthy subjects, but this data set can be quite useful to filter out technical artifacts for CG users." Since we're not working with Complete Genomics data, perhaps these aren't very relevant?
esp6500siv2_all and 1000g2014oct_all: These databases' allele frequencies are are apparently also included in the output of popfreq_all_20150413, so we are getting them twice, which is clearly unnecessary.



In [ ]: