Bio-IT Hackathon: FAIR ClinVar

The ClinVar database (https://www.ncbi.nlm.nih.gov/clinvar/) is a public repository of researcher submissions describing known genetic variants in the human genome and their associated diseases. The whole database can be downloaded as a single gzip file in several formats, including VCF and XML. While deeply informative, the database is currently best used only on the NCBI website, and the relationships between its metadata fields are unclear. The database is also updated continually (some portions daily), and new database files are released monthly. We therefore also wanted clear documentation of what we did and why, so that the method can be repeated with each new version of the database and can strengthen the argument for changing how the database is generated and released.

Goals:

  • Assess the FAIR qualities of the NCBI ClinVar database according to the 15 FAIR principles
  • Wrangle the database and process it using the FAIRifier (https://bioit.fair-dtls.surf-hosted.nl/fairifier/)
  • Correct deficiencies in the FAIRness of the database
  • Create a relational scheme for the subjects (variables) in the file

Pre-processing

We found that the main VCF file contains both the whole database (over 200,000 entries and 58 columns) and its metadata. Since the metadata is incorporated into the file, we needed to trim the file down to a proof-of-concept CSV for FAIRizing, using the metadata names as the column headers of the CSV file.
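As a rough illustration of how the metadata names can be recovered for use as column headers, the sketch below collects the INFO field IDs from the `##INFO=<ID=...>` header lines of a VCF. The function name `info_field_names` and the sample header lines are made up for this example; they are not taken from a ClinVar release or from the script we used.

```python
import re

# Matches VCF meta-information lines such as
# ##INFO=<ID=CLNSIG,Number=.,Type=String,Description="...">
INFO_HEADER_RE = re.compile(r'^##INFO=<ID=([^,>]+)')

def info_field_names(lines):
    """Collect INFO field IDs from the '##' header lines of a VCF, in file order."""
    names = []
    for line in lines:
        if not line.startswith("##"):
            break  # the header ends at the '#CHROM' line or the first record
        match = INFO_HEADER_RE.match(line)
        if match:
            names.append(match.group(1))
    return names

# Tiny illustrative header (hypothetical descriptions, not a real ClinVar file)
sample = [
    '##fileformat=VCFv4.0\n',
    '##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Clinical significance">\n',
    '##INFO=<ID=CLNACC,Number=.,Type=String,Description="Accession and version">\n',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n',
]
print(info_field_names(sample))
```

On a real file, the same function could be fed an open file handle, since it only iterates over lines and stops at the end of the header.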

Our initial FAIR assessment:

  1. No globally unique identifiers
  2. Metadata and data in same file, but this is a feature of the data
  3. No metadata access when data is no longer available
  4. Metadata doesn't use a broadly accessible language (assuming RDF is what is required)
  5. Metadata using FAIR vocabularies: we don't think so.
  6. Metadata doesn't have a complete versioning history but has some form of detailed provenance.
  7. We question whether the "metadata is richly described with a plurality of accurate and relevant attributes."

CSV proof-of-concept file made using Python


In [10]:
import os
import re

# Input VCF (already decompressed) and output CSV
filePath = 'clinvar.vcf'
# newline='' stops Python from translating the os.linesep written below,
# which would otherwise produce '\r\r\n' line endings on Windows
outputfile = open('clinvar.csv', 'w', newline='')

################################################
#			     Helper Methods                #
################################################
def extractInfoString( info ):
	# Pull the fields we need out of the INFO column; a field that is
	# missing from a record becomes an empty string instead of raising.
	result = []

	for regex in ( clinallele_re, disease_re, clinsig_re, clinrevstat_re, clinacc_re ):
		match = regex.search( info )
		# Commas inside a field are replaced with spaces so they cannot break the CSV
		value = " ".join( match.group(1).split(",") ) if match else ""
		result.append( value.replace('\n','') )

	gene_group = gene_re.search( info )
	result.append( gene_group.group(1) if gene_group else "" )

	return result

def listToCSVRow( dataList ):
	# Drop any remaining commas inside a value, then join the values with commas
	return ",".join( item.replace(',','') for item in dataList )



################################################
#			Fields in Info We Need             #
################################################
# Raw strings keep '\d' and '\w' from being treated as string escapes
clinallele_re = re.compile(r"CLNALLE=(-?\d+)")
disease_re = re.compile(r"CLNDBN=([^;]*)")
clinsig_re = re.compile(r"CLNSIG=([^;]*)")
clinrevstat_re = re.compile(r"CLNREVSTAT=([^;]*)")
clinacc_re = re.compile(r"CLNACC=([^;]*)")
gene_re = re.compile(r"GENEINFO=(\w+)")


# Header row: the fixed VCF columns followed by the INFO fields we extract
fixed_title = "CHROM,POS,ID,REF,ALT,QUAL,FILTER"
info_title = "CLNALLE,CLNDBN,CLNSIG,CLNREVSTAT,CLNACC,GENEINFO"

full_title = fixed_title + ',' + info_title

outputfile.write(full_title + os.linesep)


################################################
#			       Start Parsing               #
################################################
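with open( filePath ) as f:
	for line in f:
		# Skip the '##' metadata lines and the '#CHROM' column header
		if line.startswith("#"):
			continue
		fieldList = line.rstrip('\n').split('\t')
		fixedList = fieldList[0:7]
		infoString = fieldList[7]
		infoList = extractInfoString( infoString )
		row = listToCSVRow( fixedList + infoList )
		outputfile.write( row + os.linesep)

# Flush and close the CSV so all rows are written to disk
outputfile.close()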

FAIRification

We submitted the CSV file to the FAIRifier.

What did we do?

The CLNACC field, which holds the RCV accession number, was used to create a new column containing a persistent identifier, for example https://www.ncbi.nlm.nih.gov/clinvar/RCV000148988/
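A minimal sketch of that step in Python might look like the following. We did this inside the FAIRifier, so the code is only an equivalent illustration: the function names are hypothetical, and it assumes that a CLNACC value may carry a version suffix (such as `.1`) and may pack several accessions separated by `|`, both of which are dropped or split before building the URL.

```python
import re

CLINVAR_BASE = "https://www.ncbi.nlm.nih.gov/clinvar/"

def rcv_to_url(accession):
    """Return a ClinVar URL for one RCV accession, or '' if it is not an RCV ID."""
    # Assumption: any version suffix after the digits (e.g. '.1') is not part of the URL
    match = re.match(r"(RCV\d+)", accession.strip())
    return CLINVAR_BASE + match.group(1) + "/" if match else ""

def clnacc_to_urls(clnacc):
    """Split a CLNACC value into accessions and build one URL per accession."""
    # Assumption: multiple accessions are separated by '|' or whitespace
    parts = re.split(r"[|\s]+", clnacc.strip())
    return [url for url in (rcv_to_url(p) for p in parts) if url]

print(clnacc_to_urls("RCV000148988.1"))
```

Returning a list keeps records with more than one accession intact; a single-valued column could take the first element instead.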

Relational scheme

30,000 ft view

Using common terms

Using the metadata labels

Create RDF file


In [8]:
from io import StringIO
from rdflib import Graph, URIRef

contents = '''\
subject1\tpredicate1\tsubject2
subject2\tpredicate2\tobject2'''
tabfile = StringIO(contents)
graph = Graph()

for line in tabfile:
    triple = line.split()                      # triple is now a list of 3 strings
    triple = tuple(URIRef(t) for t in triple)  # wrap each term in a URIRef
    graph.add(triple)                          # and add to the graph
# serialize() returns str in rdflib 6+, bytes in older versions
print(graph.serialize(format='nt'))


<subject1> <predicate1> <subject2> .
<subject2> <predicate2> <object2> .
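The same idea can be applied to the rows of the proof-of-concept CSV. The sketch below is one possible approach rather than what we ran: it writes N-Triples lines by hand with the standard library instead of using rdflib, the predicate namespace `http://example.org/clinvar/` is a placeholder, the sample row is made up, and it assumes each row carries a single accession in CLNACC.

```python
import csv
import io

BASE = "http://example.org/clinvar/"              # hypothetical predicate namespace
RECORD = "https://www.ncbi.nlm.nih.gov/clinvar/"  # subject URLs built from the accession

def nt_literal(value):
    """Escape a string as an N-Triples literal."""
    escaped = (value.replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("\n", "\\n")
                    .replace("\r", "\\r"))
    return '"' + escaped + '"'

def row_to_ntriples(row, id_column="CLNACC"):
    """Yield one N-Triples line per non-empty column, keyed on the accession."""
    # Assumption: one accession per row; the version suffix is dropped
    accession = row[id_column].split(".")[0]
    subject = "<" + RECORD + accession + "/>"
    for column, value in row.items():
        if column == id_column or not value:
            continue
        yield subject + " <" + BASE + column + "> " + nt_literal(value) + " ."

# Made-up row for illustration only
sample_csv = "CHROM,POS,CLNSIG,CLNACC\n1,12345,5,RCV000148988.1\n"
for r in csv.DictReader(io.StringIO(sample_csv)):
    for triple in row_to_ntriples(r):
        print(triple)
```

Every value is emitted as a plain literal here; mapping columns such as CLNSIG to terms from a shared vocabulary would be the next step toward the relational scheme.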




Future directions

  • Create an RDF file with the complete set of metadata associations (~58), and include stakeholder engagement
  • Improve machine interoperability
  • Test ML classifiers based on the relations