Scheduled Integration of ClinGen Gene-Disease Validity Data into Wikidata

ClinGen (Clinical Genome Resource) curates data on gene-disease associations
License: CC0 (https://clinicalgenome.org/docs/terms-of-use/)

This scheduled bot uses WikidataIntegrator (WDI) to integrate ClinGen Gene-Disease Validity Data into Wikidata
https://search.clinicalgenome.org/kb/gene-validity/
https://github.com/SuLab/GeneWikiCentral/issues/116
http://jenkins.sulab.org/

Python script contributions, in order: Sabah Ul-Hasan, Andra Waagmeester, Andrew Su, Ginger Tsueng

Checks

Login automatically uses the credentials of the given environment
For loop checks for both the HGNC Qid and the MONDO Qid in each row (i.e. if HGNC is absent or multiple, it still checks MONDO)
For loop handles the multiple-Qid case, tested using A2ML1 as a pseudo example
For loop records the correct Qid for HGNC and/or MONDO, if available
For loop only writes 'complete' in the output if the row was written to Wikidata
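
The absent / single / multiple check described above can be sketched as a small helper. This is only an illustration: the name `resolve_qid` and the hand-built bindings are assumptions, not part of the bot, and the binding layout mirrors what the loop further down reads from the SPARQL result.

```python
# Hypothetical helper: classify a list of SPARQL bindings as a single Qid,
# 'absent' (no match) or 'multiple' (more than one match).
ENTITY_PREFIX = "http://www.wikidata.org/entity/"

def resolve_qid(bindings, var):
    """Return (qid_or_flag, ok) for the bindings of one SPARQL variable."""
    if len(bindings) == 1:
        return bindings[0][var]["value"].replace(ENTITY_PREFIX, ""), True
    if len(bindings) == 0:
        return "absent", False
    return "multiple", False

# Hand-built example bindings (not fetched from Wikidata)
one = [{"gene": {"value": ENTITY_PREFIX + "Q14911754"}}]
print(resolve_qid(one, "gene"))        # ('Q14911754', True)
print(resolve_qid([], "gene"))         # ('absent', False)
print(resolve_qid(one + one, "gene"))  # ('multiple', False)
```

Returning the flag together with a boolean keeps the "write only when both lookups succeeded" decision in one place.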

Issues

create_reference() and update_retrieved_if_new_multiple_refs() add and/or update the reference on an existing HGNC or MONDO value in a genetic association statement, using a 180-day window (URLs from non-ClinGen sources are not overwritten)
Possible status values: updated, not updated, skipped (not definitive or mapping error)
Maybe get rid of the Definitive column, but keep Gene QID and Disease QID
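
To make the 180-day window concrete, here is a minimal sketch of an age check on a 'retrieved' time stamp. It is not WDI's implementation of update_retrieved_if_new_multiple_refs; the function name `retrieved_is_stale` is hypothetical and it only illustrates the date arithmetic.

```python
from datetime import datetime, timedelta

def retrieved_is_stale(retrieved, now, days=180):
    """True if a Wikidata-style 'retrieved' time stamp is older than `days` days."""
    parsed = datetime.strptime(retrieved, "+%Y-%m-%dT00:00:00Z")
    return (now - parsed) > timedelta(days=days)

now = datetime(2020, 7, 15)
print(retrieved_is_stale("+2020-01-13T00:00:00Z", now))  # True  (184 days old)
print(retrieved_is_stale("+2020-03-01T00:00:00Z", now))  # False (136 days old)
```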



In [15]:

    
# Relevant Modules and Libraries

## Installations by shell 
!pip install --upgrade pip # Upgrades pip to the latest version
!pip3 install tqdm # Progress-bar library
!pip3 install wikidataintegrator # WikidataIntegrator (WDI), for reading from and writing to Wikidata

## Imports
from wikidataintegrator import wdi_core, wdi_login # Core and login modules from wikidataintegrator
from wikidataintegrator.ref_handlers import update_retrieved_if_new_multiple_refs # Reference handler that decides whether to keep or update existing references
from datetime import datetime # For identifying the current date and time

import copy # Deep-copies the reference list so each statement gets its own copy
import time # For keeping track of total for loop run time

import os # Reads the Wikidata username and password from environment variables

import pandas as pd # Pandas for data organization, abbreviated to pd
import numpy as np # General-purpose numerical package
from termcolor import colored # Colored terminal output

# import ssl
# ssl._create_default_https_context = ssl._create_unverified_context









    



Requirement already up-to-date: pip in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (20.0.2)
Requirement already satisfied: tqdm in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (4.32.1)
Requirement already satisfied: wikidataintegrator in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (0.4.2)
Requirement already satisfied: simplejson in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from wikidataintegrator) (3.16.0)
Requirement already satisfied: sparql-slurper in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from wikidataintegrator) (0.2.2)
Requirement already satisfied: oauthlib in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from wikidataintegrator) (3.1.0)
Requirement already satisfied: requests in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from wikidataintegrator) (2.22.0)
Requirement already satisfied: python-dateutil in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from wikidataintegrator) (2.8.0)
Requirement already satisfied: jsonasobj in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from wikidataintegrator) (1.2.1)
Requirement already satisfied: ShExJSG in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from wikidataintegrator) (0.6.8)
Requirement already satisfied: pyshex in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from wikidataintegrator) (0.7.12)
Requirement already satisfied: mwoauth in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from wikidataintegrator) (0.3.5)
Requirement already satisfied: sparqlwrapper>=1.8.2 in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from sparql-slurper->wikidataintegrator) (1.8.4)
Requirement already satisfied: rdflib>=4.2.2 in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from sparql-slurper->wikidataintegrator) (4.2.2)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from requests->wikidataintegrator) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from requests->wikidataintegrator) (1.24.2)
Requirement already satisfied: certifi>=2017.4.17 in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from requests->wikidataintegrator) (2019.6.16)
Requirement already satisfied: idna<2.9,>=2.5 in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from requests->wikidataintegrator) (2.8)
Requirement already satisfied: six>=1.5 in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from python-dateutil->wikidataintegrator) (1.12.0)
Requirement already satisfied: pyshexc>=0.5.4 in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from ShExJSG->wikidataintegrator) (0.7.0)
Requirement already satisfied: pyjsg>=0.9.0 in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from ShExJSG->wikidataintegrator) (0.9.1)
Requirement already satisfied: rdflib-jsonld>=0.4.0 in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from pyshex->wikidataintegrator) (0.4.0)
Requirement already satisfied: cfgraph>=0.2.1 in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from pyshex->wikidataintegrator) (0.2.1)
Requirement already satisfied: requests-oauthlib in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from mwoauth->wikidataintegrator) (1.2.0)
Requirement already satisfied: PyJWT<2.0.0,>=1.0.1 in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from mwoauth->wikidataintegrator) (1.7.1)
Requirement already satisfied: isodate in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from rdflib>=4.2.2->sparql-slurper->wikidataintegrator) (0.6.0)
Requirement already satisfied: pyparsing in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from rdflib>=4.2.2->sparql-slurper->wikidataintegrator) (2.4.0)
Requirement already satisfied: antlr4-python3-runtime>=4.7 in /Users/sulhasan/anaconda3/lib/python3.7/site-packages (from pyshexc>=0.5.4->ShExJSG->wikidataintegrator) (4.7.2)



In [16]:

    
# Login for running WDI

print("Logging in...") 

## **remove lines when scheduling to Jenkins** Enter your own username and password 
os.environ["WDUSER"] = "username" # Sets the Wikidata username as an environment variable
os.environ["WDPASS"] = "password" # Sets the Wikidata password as an environment variable

## Raise an error if the credentials are not available in the environment
if "WDUSER" in os.environ and "WDPASS" in os.environ: 
    WDUSER = os.environ['WDUSER']
    WDPASS = os.environ['WDPASS']
else: 
    raise ValueError("WDUSER and WDPASS must be specified in local.py or as environment variables")

## Log in with the given username and password; the session is stored as 'login'
login = wdi_login.WDLogin(WDUSER, WDPASS)
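
For a scheduled run, where the two hard-coded lines are removed, the credential lookup could be wrapped as below. This is a sketch under assumptions: `get_credentials` is a hypothetical name, and the mapping argument exists only so the function can be tried without touching the real environment.

```python
import os

def get_credentials(env=os.environ):
    """Return (user, password) from a mapping, or raise naming the missing key."""
    try:
        return env["WDUSER"], env["WDPASS"]
    except KeyError as missing:
        raise ValueError(f"{missing.args[0]} must be set as an environment variable") from None

# Tried here with a plain dict instead of the real environment
print(get_credentials({"WDUSER": "username", "WDPASS": "password"}))
```

The returned pair can then be passed to wdi_login.WDLogin as in the cell above.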









    



Logging in...
https://www.wikidata.org/w/api.php
Successfully logged in as Sulhasan



In [11]:

    
# ClinGen gene-disease validity data

## Read as csv
df = pd.read_csv('https://search.clinicalgenome.org/kb/gene-validity.csv', skiprows=6, header=None)  

## Label column headings
df.columns = ['Gene', 'HGNC Gene ID', 'Disease', 'MONDO Disease ID','SOP','Classification','Report Reference URL','Report Date']

## Create time stamp of when downloaded (error if isoformat() used)
timeStringNow = datetime.now().strftime("+%Y-%m-%dT00:00:00Z")

## Create empty columns for output file (ignore warnings)
df['Status'] = "pending" # Starts as 'pending' for every row; the loop below sets 'error' or 'complete'
df['Definitive'] = "" # Replaced with 'yes' or 'no' in the loop
df['Gene QID'] = "" # Replaced with the gene Qid, or with 'absent' or 'multiple'
df['Disease QID'] = "" # Replaced with the disease Qid, or with 'absent' or 'multiple'

df.head(6)
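
The time stamp above is built with strftime because Wikidata time values carry a leading "+" and, at day precision, a zeroed time of day, which isoformat() does not produce. A small sketch of the difference (the function name `wikidata_day_timestamp` is hypothetical):

```python
from datetime import datetime

def wikidata_day_timestamp(moment):
    """Format a datetime as a day-precision Wikidata time string."""
    return moment.strftime("+%Y-%m-%dT00:00:00Z")

moment = datetime(2020, 1, 13, 19, 44, 34)
print(wikidata_day_timestamp(moment))  # +2020-01-13T00:00:00Z
print(moment.isoformat())              # 2020-01-13T19:44:34
```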









    Out[11]:







  
    
      
	Gene	HGNC Gene ID	Disease	MONDO Disease ID	SOP	Classification	Report Reference URL	Report Date
0	CDK4	HGNC:1773	melanoma, cutaneous malignant, susceptibility ...	MONDO_0012183	SOP7	Definitive	https://search.clinicalgenome.org/kb/gene-vali...	2020-01-13T19:44:34.667Z
1	F12	HGNC:3530	congenital factor XII deficiency	MONDO_0009315	SOP7	Definitive	https://search.clinicalgenome.org/kb/gene-vali...	2020-01-22T17:00:00.000Z
2	GABRG2	HGNC:4087	epilepsy	MONDO_0005027	SOP7	Definitive	https://search.clinicalgenome.org/kb/gene-vali...	2020-01-21T17:00:00.000Z
3	KIT	HGNC:6342	gastrointestinal stromal tumor	MONDO_0011719	SOP7	Definitive	https://search.clinicalgenome.org/kb/gene-vali...	2020-01-14T18:46:19.396Z
4	PDGFRA	HGNC:8803	gastrointestinal stromal tumor	MONDO_0011719	SOP7	Definitive	https://search.clinicalgenome.org/kb/gene-vali...	2020-01-13T19:55:09.878Z
5	PRRT2	HGNC:30500	infantile convulsions and choreoathetosis	MONDO_0011178	SOP7	Definitive	https://search.clinicalgenome.org/kb/gene-vali...	2020-01-21T21:38:14.191Z
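
The identifiers in this table are not in the form used for the Wikidata lookup: the loop further down strips the "HGNC:" prefix and turns "MONDO_" into "MONDO:" before building its SPARQL queries. A minimal sketch of that step, with hypothetical function names and the same query text the loop uses:

```python
def hgnc_query(hgnc_id):
    """Build the HGNC ID (P354) lookup query from a ClinGen value like 'HGNC:1773'."""
    number = hgnc_id.replace("HGNC:", "")
    return 'SELECT * WHERE {?gene wdt:P354 "' + number + '"}'

def mondo_query(mondo_id):
    """Build the MONDO ID (P5270) lookup query from a ClinGen value like 'MONDO_0012183'."""
    curie = mondo_id.replace("_", ":")
    return 'SELECT * WHERE {?disease wdt:P5270 "' + curie + '"}'

print(hgnc_query("HGNC:1773"))       # SELECT * WHERE {?gene wdt:P354 "1773"}
print(mondo_query("MONDO_0012183"))  # SELECT * WHERE {?disease wdt:P5270 "MONDO:0012183"}
```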



In [13]:

    
# Function that builds the reference block; called once per row inside the loop below

def create_reference(): # Takes no arguments; reads df and index from the enclosing scope, so call it only inside the loop
    refStatedIn = wdi_core.WDItemID(value="Q64403342", prop_nr="P248", is_reference=True) # ClinGen Qid = Q64403342, 'stated in' Pid = P248 
    timeStringNow = datetime.now().strftime("+%Y-%m-%dT00:00:00Z") # Create time stamp of when downloaded (error if isoformat() used)
    refRetrieved = wdi_core.WDTime(timeStringNow, prop_nr="P813", is_reference=True) # Uses the 'timeStringNow' string above, 'retrieved' Pid = P813
    refURL = wdi_core.WDUrl((df.loc[index, 'Report Reference URL']), prop_nr="P854", is_reference=True) # 'reference URL' Pid = P854
    return [refStatedIn, refRetrieved, refURL]
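
Because create_reference() depends on names defined outside it, one alternative is to pass the row's URL in explicitly. The sketch below is an assumption-laden illustration rather than the bot's code: `reference_fields` is a hypothetical name, it returns plain (property, value) pairs instead of WDI objects so it can be run without WDI, and the example URL is a placeholder.

```python
from datetime import datetime

CLINGEN_QID = "Q64403342"  # ClinGen item, as used in create_reference()

def reference_fields(report_url, retrieved=None):
    """Return the three reference parts as (property, value) pairs."""
    retrieved = retrieved or datetime.now()
    return [
        ("P248", CLINGEN_QID),                                # stated in
        ("P813", retrieved.strftime("+%Y-%m-%dT00:00:00Z")),  # retrieved
        ("P854", report_url),                                 # reference URL
    ]

# Placeholder URL, for illustration only
for prop, value in reference_fields("https://example.org/report", datetime(2020, 1, 22)):
    print(prop, value)
```

Each pair maps onto one of the wdi_core constructors used above (WDItemID, WDTime, WDUrl).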



In [14]:

    
# For loop that executes the following through each row of the dataframe 

start_time = time.time() # Keep track of how long it takes loop to run

for index, row in df.iterrows(): # Index is a row number, row is all variables and values for that row
        
    # Identify the string in the Gene or Disease column for a given row
    HGNC = df.loc[index, 'HGNC Gene ID'].replace("HGNC:", "") # .replace() strips the "HGNC:" prefix for the SPARQL query
    MONDO = df.loc[index, 'MONDO Disease ID'].replace("_", ":") # e.g. MONDO_0012183 becomes MONDO:0012183
    
    # SPARQL query to search for Gene or Disease in Wikidata based on HGNC ID (P354) or MONDO ID (P5270)
    sparqlQuery_HGNC = "SELECT * WHERE {?gene wdt:P354 \""+HGNC+"\"}" 
    result_HGNC = wdi_core.WDItemEngine.execute_sparql_query(sparqlQuery_HGNC) # Resultant query
    sparqlQuery_MONDO = "SELECT * WHERE {?disease wdt:P5270 \""+MONDO+"\"}" 
    result_MONDO = wdi_core.WDItemEngine.execute_sparql_query(sparqlQuery_MONDO)
    
    # Assign resultant length of dictionary for either Gene or Disease (number of Qid)
    HGNC_qlength = len(result_HGNC["results"]["bindings"]) 
    MONDO_qlength = len(result_MONDO["results"]["bindings"])
    
    # Conditional utilizing length value for output table, accounts for absent/present combos
    if HGNC_qlength == 1:
        HGNC_qid = result_HGNC["results"]["bindings"][0]["gene"]["value"].replace("http://www.wikidata.org/entity/", "")
        df.at[index, 'Gene QID'] = HGNC_qid # Input HGNC Qid in 'Gene QID' cell  
    if HGNC_qlength < 1: # If no Qid
        df.at[index, 'Status'] = "error" 
        df.at[index, 'Gene QID'] = "absent"  
    if HGNC_qlength > 1: # If multiple Qid
        df.at[index, 'Status'] = "error" 
        df.at[index, 'Gene QID'] = "multiple"
        
    if MONDO_qlength == 1:
        MONDO_qid = result_MONDO["results"]["bindings"][0]["disease"]["value"].replace("http://www.wikidata.org/entity/", "") 
        df.at[index, 'Disease QID'] = MONDO_qid  
    if MONDO_qlength < 1: 
        df.at[index, 'Status'] = "error" 
        df.at[index, 'Disease QID'] = "absent" 
    if MONDO_qlength > 1:
        df.at[index, 'Status'] = "error" 
        df.at[index, 'Disease QID'] = "multiple" 
        
    # Only rows with Classification = 'Definitive' are written; other rows are flagged as errors
    if row['Classification']!='Definitive': # If the string is NOT 'Definitive' for the Classification column
        df.at[index, 'Status'] = "error" # Then input "error" in the Status column
        df.at[index, 'Definitive'] = "no" # And 'no' for Definitive column
        continue # Skips rest and goes to next row
    else: # Otherwise
        df.at[index, 'Definitive'] = "yes" # Input 'yes' for Definitive column, go to next step
  
    # Conditional continues to write into Wikidata only if 1 Qid for each + Definitive classification 
    if HGNC_qlength == 1 and MONDO_qlength == 1: # 'and', not '&': '&' binds tighter than '==' and would compare the wrong values
        
        # Call upon create_reference() function created   
        reference = create_reference() 
        
        # Add disease value to gene item page, and gene value to disease item page (symmetry)
        
        # Creates the 'genetic association' statement (P2293), whether or not it already exists, and attaches the references
        statement_HGNC = [wdi_core.WDItemID(value=MONDO_qid, prop_nr="P2293", references=[copy.deepcopy(reference)])] 
        wikidata_HGNCitem = wdi_core.WDItemEngine(wd_item_id=HGNC_qid, 
                                                  data=statement_HGNC, 
                                                  global_ref_mode='CUSTOM', # Delegates reference handling to ref_handler, which applies the 180-day window
                                                  ref_handler=update_retrieved_if_new_multiple_refs, 
                                                  append_value=["P2293"])
        wikidata_HGNCitem.get_wd_json_representation() # Returns the JSON structure that is submitted to the API, helpful for debugging 
        wikidata_HGNCitem.write(login)
        
        statement_MONDO = [wdi_core.WDItemID(value=HGNC_qid, prop_nr="P2293", references=[copy.deepcopy(reference)])] 
        wikidata_MONDOitem = wdi_core.WDItemEngine(wd_item_id=MONDO_qid, 
                                                   data=statement_MONDO, 
                                                   global_ref_mode='CUSTOM',
                                                   ref_handler=update_retrieved_if_new_multiple_refs, 
                                                   append_value=["P2293"])
        wikidata_MONDOitem.get_wd_json_representation()
        wikidata_MONDOitem.write(login)
        
        HGNC_name = df.loc[index, 'Gene'] # To output gene name > HGNC ID
        MONDO_name = df.loc[index, 'Disease']
        df.at[index, 'Status'] = "complete" 
        
end_time = time.time() # Captures when loop run ends
print("The total time of this loop is:", end_time - start_time, "seconds, or", (end_time - start_time)/60, "minutes")

# Write output to a .csv file
now = datetime.now() # Retrieves current time and saves it as 'now'
# File name includes an ISO 8601 time stamp of the form YYYY-MM-DDTHH:MM:SS.ffffff (https://en.wikipedia.org/wiki/ISO_8601)
df.to_csv("ClinGenBot_Status-Output_" + now.isoformat() + ".csv")  # isoformat
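
The write step in the loop is meant to run only when both lookups return exactly one Qid. A minimal sketch of that guard as a helper (the name `ready_to_write` is hypothetical), which also shows why the bitwise `&` is risky in such a condition: `&` binds tighter than `==`, so `a == 1 & b == 1` is read as the chained comparison `a == (1 & b) == 1`.

```python
def ready_to_write(hgnc_count, mondo_count):
    """True only when each lookup returned exactly one Qid."""
    return hgnc_count == 1 and mondo_count == 1

print(ready_to_write(1, 1))  # True
print(ready_to_write(1, 3))  # False

# The same counts with '&': 1 & 3 evaluates to 1, so the chain 1 == 1 == 1 passes
hgnc_count, mondo_count = 1, 3
print(hgnc_count == 1 & mondo_count == 1)  # True, although MONDO matched three items
```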









    



The total time of this loop is: 35.3995099067688 seconds, or 0.58999183177948 minutes
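
The status file written at the end of the loop cell embeds now.isoformat() in its name, which contains colons; colons are not allowed in Windows file names. If the bot ever needs to run there, a colon-free stamp is one option. A sketch, assuming the hypothetical helper name `status_filename`:

```python
from datetime import datetime

def status_filename(moment):
    """Build the status CSV name with a colon-free time stamp."""
    stamp = moment.strftime("%Y-%m-%dT%H-%M-%S")
    return "ClinGenBot_Status-Output_" + stamp + ".csv"

print(status_filename(datetime(2020, 1, 22, 17, 0, 0)))
# ClinGenBot_Status-Output_2020-01-22T17-00-00.csv
```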






    Out[14]:







  
    
      
	Gene	HGNC Gene ID	Disease	MONDO Disease ID	SOP	Classification	Report Reference URL	Report Date	Status	Definitive	Gene QID	Disease QID
0	CDK4	HGNC:1773	melanoma, cutaneous malignant, susceptibility ...	MONDO_0012183	SOP7	Definitive	https://search.clinicalgenome.org/kb/gene-vali...	2020-01-13T19:44:34.667Z	complete	yes	Q14911754	Q55999819
1	F12	HGNC:3530	congenital factor XII deficiency	MONDO_0009315	SOP7	Definitive	https://search.clinicalgenome.org/kb/gene-vali...	2020-01-22T17:00:00.000Z	complete	yes	Q14862722	Q18555042
2	GABRG2	HGNC:4087	epilepsy	MONDO_0005027	SOP7	Definitive	https://search.clinicalgenome.org/kb/gene-vali...	2020-01-21T17:00:00.000Z	complete	yes	Q18025166	Q41571
3	KIT	HGNC:6342	gastrointestinal stromal tumor	MONDO_0011719	SOP7	Definitive	https://search.clinicalgenome.org/kb/gene-vali...	2020-01-14T18:46:19.396Z	complete	yes	Q20969938	Q1495661
4	PDGFRA	HGNC:8803	gastrointestinal stromal tumor	MONDO_0011719	SOP7	Definitive	https://search.clinicalgenome.org/kb/gene-vali...	2020-01-13T19:55:09.878Z	complete	yes	Q18030422	Q1495661
5	PRRT2	HGNC:30500	infantile convulsions and choreoathetosis	MONDO_0011178	SOP7	Definitive	https://search.clinicalgenome.org/kb/gene-vali...	2020-01-21T21:38:14.191Z	complete	yes	Q18048847	Q6029036
