The ClinVar database (https://www.ncbi.nlm.nih.gov/clinvar/) is a public repository of researcher submissions describing genetic variants known in the human genome and their associated diseases. The whole database can be downloaded as a single gzip file in several formats, including vcf and xml. While deeply informative, the database is currently best used only through the NCBI website, and the relationships between its meta-data fields are unclear. The database is also continually updated (some portions daily), and new database files are released monthly. We therefore also wanted clear documentation of what we did and why, so that the method can be repeated with each new version of the database and can strengthen the argument for changing how the database is generated and released.
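As a rough sketch of the download step mentioned above, the snippet below streams a gzip file to disk and expands it using only the standard library. The URL constant and both helper names (`download`, `decompress`) are assumptions made for illustration, so the location should be checked against the current ClinVar download listing before use.

```python
import gzip
import shutil
import urllib.request

# Assumed download location; confirm it against the ClinVar FTP listing.
CLINVAR_VCF_URL = "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar.vcf.gz"


def download(url, gz_path):
    """Stream a remote file to disk without loading it all into memory."""
    with urllib.request.urlopen(url) as response, open(gz_path, "wb") as out:
        shutil.copyfileobj(response, out)


def decompress(gz_path, out_path):
    """Expand a gzip archive into a plain file."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


# Typical use (requires network access):
# download(CLINVAR_VCF_URL, "clinvar.vcf.gz")
# decompress("clinvar.vcf.gz", "clinvar.vcf")
```

Keeping the two steps separate means the compressed file stays on disk, so a given monthly release can be re-processed without downloading it again.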
We found that the main vcf file contains both the whole database (over 200,000 entries and 58 columns) and its meta-data. Since the meta-data is incorporated into the file, we needed to trim the file down to a proof-of-concept csv for FAIRizing, using the meta-data names as header names in the csv file.
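To see which meta-data names the file itself declares, the `##INFO` lines of the VCF header can be read before any data rows. The sketch below is not part of the original workflow. It assumes the standard VCF layout `##INFO=<ID=...,...,Description="...">`, and the helper name `info_field_descriptions` is hypothetical.

```python
import re

# Matches VCF meta-data lines of the form (illustrative, abbreviated):
# ##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Variant Clinical Significance">
INFO_HEADER_RE = re.compile(r'^##INFO=<ID=([^,]+),.*Description="(.*)">\s*$')


def info_field_descriptions(lines):
    """Map each INFO field ID to its description, stopping at the first data line."""
    fields = {}
    for line in lines:
        if not line.startswith("#"):
            break  # the header is over once a data line appears
        match = INFO_HEADER_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


# Typical use:
# with open("clinvar.vcf") as f:
#     for name, description in info_field_descriptions(f).items():
#         print(name, "-", description)
```

Listing the declared fields first makes it easier to decide which ones to carry over as csv header names.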
In [10]:
import json
import re
import os
import urllib.request as request
import gzip
import argparse
import shutil
from collections import OrderedDict

# Input VCF and output CSV paths
filePath = 'clinvar.vcf'
outputfile = open('clinvar.csv','w')
################################################
# Helper Methods #
################################################
def extractInfoString( info ):
    """Return the six INFO sub-fields we keep, in header order."""
    def field( pattern ):
        # Return the captured value with commas turned into spaces,
        # or an empty string when the field is absent from this record.
        match = pattern.search( info )
        if not match:
            return ""
        return " ".join( match.group(1).split(",") ).replace('\n','')

    result = []
    result.append( field( clinallele_re ) )
    result.append( field( disease_re ) )
    result.append( field( clinsig_re ) )
    result.append( field( clinrevstat_re ) )
    result.append( field( clinacc_re ) )
    result.append( field( gene_re ) )
    return result

def listToCSVRow( dataList ):
    """Join values into one CSV row, dropping commas inside values."""
    return ",".join( item.replace(',','') for item in dataList )
################################################
# Fields in Info We Need #
################################################
# Raw strings keep \d and \w from being read as string escapes.
# CLNALLE captures a single, possibly negative, allele index.
clinallele_re = re.compile(r"CLNALLE=(-?\d+)")
disease_re = re.compile(r"CLNDBN=([^;]*)")
clinsig_re = re.compile(r"CLNSIG=([^;]*)")
clinrevstat_re = re.compile(r"CLNREVSTAT=([^;]*)")
clinacc_re = re.compile(r"CLNACC=([^;]*)")
gene_re = re.compile(r"GENEINFO=(\w+)")
fixed_title = "CHROM,POS,ID,REF,ALT,QUAL,FILTER"
info_title = "CLNALLE,CLNDBN,CLNSIG,CLNREVSTAT,CLNACC,GENEINFO"
full_title = fixed_title + ',' + info_title
# Text-mode files translate '\n' themselves, so os.linesep is not needed
outputfile.write(full_title + '\n')
################################################
# Start Parsing #
################################################
with open( filePath ) as f:
    for line in f:
        # Skip meta-data (##) and column-header (#) lines
        if line.startswith("#"):
            continue
        fieldList = line.rstrip('\n').split('\t')
        fixedList = fieldList[0:7]   # CHROM through FILTER
        infoString = fieldList[7]    # INFO column
        infoList = extractInfoString( infoString )
        row = listToCSVRow( fixedList + infoList )
        outputfile.write( row + '\n' )
# Flush and release the output file
outputfile.close()
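The parsing step above removes commas from values so that they cannot break the column layout. That is lossy for fields that legitimately contain commas (in VCF, ALT may list several alleles separated by commas). One alternative, which is not what this notebook does, is to let the standard `csv` module quote such values instead. A minimal sketch, where `write_rows` is a hypothetical helper and the record is made up:

```python
import csv
import io


def write_rows(handle, header, rows):
    """Write a header and data rows, quoting any value that contains a comma."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Demonstration on an in-memory buffer with a made-up record.
buffer = io.StringIO()
write_rows(buffer, ["ID", "ALT", "CLNDBN"], [["rs0", "A,G", "disease A, type 1"]])
print(buffer.getvalue())
```

Whether quoted values are acceptable depends on what the downstream FAIRifier import expects, so this is a design option to evaluate, not a drop-in replacement.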
We submitted the csv file to the FAIRifier.
What did we do?
The CLNACC field, which holds the RCV accession number, was used to make a new column containing a persistent ID, for example https://www.ncbi.nlm.nih.gov/clinvar/RCV000148988/
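The mapping from an accession to a persistent ID can be sketched as below. The base URL is taken from the example above. Everything else is an assumption made for illustration: the helper name `persistent_ids`, that one CLNACC value may hold several accessions separated by "|" or whitespace, and that each may carry a ".N" version suffix that is dropped.

```python
CLINVAR_BASE = "https://www.ncbi.nlm.nih.gov/clinvar/"


def persistent_ids(clnacc):
    """Turn a CLNACC value into ClinVar URLs, one per RCV accession."""
    urls = []
    # Assumption: accessions are separated by "|" or whitespace.
    for accession in clnacc.replace("|", " ").split():
        base = accession.split(".")[0]  # drop the version suffix, if any
        urls.append(CLINVAR_BASE + base + "/")
    return urls


print(persistent_ids("RCV000148988.1"))
```

Returning a list keeps the case of several accessions per record explicit, which leaves the choice between one column with joined URLs and one row per accession to the caller.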
In [8]:
from io import StringIO
from rdflib import Graph, URIRef

# Tab-separated toy triples: subject, predicate, object
contents = '''\
subject1\tpredicate1\tsubject2
subject2\tpredicate2\tobject2'''
tabfile = StringIO(contents)
graph = Graph()
for line in tabfile:
    triple = line.split() # triple is now a list of 3 strings
    triple = tuple(URIRef(t) for t in triple) # we have to wrap them in URIRef
    graph.add(triple) # and add to the graph
print(graph.serialize(format='nt'))