GARD data loading with WDI

Example of how to use Wikidata Integrator (WDI) to add synonyms from GARD (the Genetic and Rare Diseases Information Center). GARD data has already been PARTIALLY loaded into Wikidata via Mix'n'match, so it is important not to overwrite those Mix'n'match results.

Import all necessary modules and data


In [1]:
from wikidataintegrator import wdi_core, wdi_login
from wikidataintegrator.ref_handlers import update_retrieved_if_new_multiple_refs
import pandas as pd
from pandas import read_csv
import requests
from tqdm.notebook import trange, tqdm
import ipywidgets 
import widgetsnbextension
import time

Login section


In [57]:
print("retrieving API credentials")
import wdi_user_config
api_dict = wdi_user_config.get_gard_credentials()
header_info = {api_dict['name']: api_dict['value']}


retrieving API credentials

In [59]:
print("Logging in...")
import wdi_user_config ## Credentials stored in a wdi_user_config file
login_dict = wdi_user_config.get_credentials()
login = wdi_login.WDLogin(login_dict['WDUSER'], login_dict['WDPASS'])


Logging in...
https://www.wikidata.org/w/api.php
Successfully logged in as Gtsulab

Pull all disease entities from GARD


In [58]:
gard_results = requests.get('https://api.rarediseases.info.nih.gov/api/diseases',
                           headers=header_info)
print(gard_results)


<Response [200]>

In [59]:
gard_df = pd.read_json(gard_results.text)
print(gard_df.head(n=2))


   diseaseId                        diseaseName  hasGardWebPage  \
0      13018  10q22.3q23 microdeletion syndrome            True   
1       5658     11-beta-hydroxylase deficiency            True   

                                         identifiers  isRare  \
0                                                 []    True   
1  [{'identifierType': 'OMIM', 'identifierId': '2...    True   

                                            synonyms  \
0  [Del(10)(q22.3q23.3), Deletion 10q22.3q23.3, M...   
1  [Congenital adrenal hyperplasia due to 11-beta...   

                                          websiteUrl  
0  https://rarediseases.info.nih.gov/diseases/130...  
1  https://rarediseases.info.nih.gov/diseases/565...  

Although we could easily pull the synonyms from this dataframe and upload them to Wikidata, we only have permission to upload data specifically generated by GARD. Hence we need to visit each disease's page in GARD to check the source of each synonym. While we're at it, we can also pull alternate identifiers, which will NOT be loaded to Wikidata but can be used for mapping. Since the Mix'n'match community has already done a lot of GARD ID mapping, we only need these alternate identifiers for items that don't yet have GARD IDs mapped.
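The per-disease source check can be sketched on a made-up payload: synonyms that come back without a `source` field are treated as GARD-native, which is what the `fillna("GARD")` trick in the cells below relies on. The sample entries here are illustrative, not real API records.

```python
import pandas as pd

# Hypothetical sample of a "synonyms-with-source" payload; entries without a
# "source" key are assumed to originate from GARD itself.
sample_synonyms = [
    {"name": "Adrenal hyperplasia IV"},
    {"name": "CYP11B1 deficiency", "source": "OrphaData.Org"},
]

# Missing sources become NaN in the DataFrame; fill them with "GARD",
# then keep only the GARD-native synonyms.
syn_df = pd.DataFrame(sample_synonyms).fillna("GARD")
gard_only = syn_df.loc[syn_df["source"] == "GARD"]
print(gard_only["name"].tolist())
```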


In [82]:
## The resulting json file has a key "mainPropery" which is where our desired data is stored.
## Since it looks like a misspelling, we'll store that key as a variable so that it's easy to
## update if the key is ever corrected upstream
key_of_interest = "mainPropery"
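Since the key looks like an upstream typo, a defensive lookup could tolerate a future spelling fix. This is only a sketch: `"mainProperty"` is our guess at a corrected name, not a documented key.

```python
def get_main_property(json_result):
    # Try the current (misspelled) key first, then the hypothetical corrected one.
    for key in ("mainPropery", "mainProperty"):
        if key in json_result:
            return json_result[key]
    return None

# Falls back to None when neither spelling is present.
print(get_main_property({"mainPropery": {"synonyms-with-source": []}}))
```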

In [68]:
"""
## Unit test: Request and parse a sample page
i=1
fail_list = []

sample_result = requests.get('https://api.rarediseases.info.nih.gov/api/diseases/'+str(gard_df.iloc[i]['diseaseId']),
                           headers=header_info)

json_result = sample_result.json()
data_of_interest = json_result.get(key_of_interest)

## Check if there are synonyms that don't have a source (ie- are by GARD)
sourced_syn = data_of_interest.get('synonyms-with-source')
identifier_results = data_of_interest.get('identifiers')

tmpdict = pd.DataFrame(sourced_syn).fillna("GARD")
tmpdict['diseaseId'] = gard_df.iloc[i]['diseaseId']
print(tmpdict)

## Check if there are identifiers that can be used for xrefs
identifier_dict = pd.DataFrame(identifier_results).fillna("None")
print(identifier_dict)
"""


                                                name         source  diseaseID
0  Congenital adrenal hyperplasia due to 11-beta-...           GARD       5658
1                             Adrenal hyperplasia IV           GARD       5658
2                              Adrenal hyperplasia 4           GARD       5658
3             Steroid 11-beta-hydroxylase deficiency           GARD       5658
4              Adrenal hyperplasia hypertensive form           GARD       5658
5                               P450c11b1 deficiency           GARD       5658
6          CAH due to 11-beta-hydroxylase deficiency  OrphaData.Org       5658
7                                 CYP11B1 deficiency  OrphaData.Org       5658
  identifierId identifierType
0       202010           OMIM
1        90795       ORPHANET
2     C0268292           UMLS

In [88]:
gard_id_list = gard_df['diseaseId'].unique().tolist()
#gard_id_list = [13018,5658,10095] ## Iteration test
fail_list = []
no_syns = []
no_idens = []
identifier_df = pd.DataFrame(columns=['diseaseId','identifierId','identifierType'])
synonyms_df = pd.DataFrame(columns=['diseaseId','name','source'])

for i in tqdm(range(len(gard_id_list))):
    disease_id = gard_id_list[i]
    try:
        sample_result = requests.get('https://api.rarediseases.info.nih.gov/api/diseases/'+str(disease_id),
                                     headers=header_info)
        json_result = sample_result.json()
        data_of_interest = json_result.get(key_of_interest)
        ## Check if there are synonyms that don't have a source (i.e. they come from GARD itself)
        sourced_syn = data_of_interest.get('synonyms-with-source')
        tmpdict = pd.DataFrame(sourced_syn).fillna("GARD")
        tmpdict['diseaseId'] = disease_id
        if len(tmpdict) == 0:
            no_syns.append(disease_id)
        else:
            synonyms_df = pd.concat((synonyms_df,tmpdict),ignore_index=True)

        ## Check if there are identifiers that can be used for xrefs
        identifier_results = data_of_interest.get('identifiers')
        identifier_dict = pd.DataFrame(identifier_results).fillna("None")
        identifier_dict['diseaseId'] = disease_id
        if len(identifier_dict) == 0:
            no_idens.append(disease_id)
        else:
            identifier_df = pd.concat((identifier_df,identifier_dict),ignore_index=True)

    except Exception:
        fail_list.append(disease_id)

print("Identifiers found: ", len(identifier_df))
print("Synonyms found: ", len(synonyms_df))
print("Requests failed: ",len(fail_list))
print("GARD IDs with no synonyms: ", len(no_syns))
print("GARD IDs with no xrefs: ", len(no_idens))


Identifiers found:  10542
Synonyms found:  15788
Requests failed:  0
GARD IDs with no synonyms:  1234
GARD IDs with no xrefs:  916

In [90]:
## Export results to avoid having to hit the API again
identifier_df.to_csv('data/identifier_df.tsv',sep='\t',header=True)
synonyms_df.to_csv('data/synonyms_df.tsv',sep='\t',header=True)

with open('data/no_syns.txt','w') as outwrite:
    for eachentry in no_syns:
        outwrite.write(str(eachentry)+'\n')

with open('data/no_idens.txt','w') as idenwrite:
    for eachiden in no_idens:
        idenwrite.write(str(eachiden)+'\n')

In [76]:
print(identifier_df)


  diseaseID identifierID identifierId identifierType
0      5658          NaN       202010           OMIM
1      5658          NaN        90795       ORPHANET
2      5658          NaN     C0268292           UMLS

Import any data that was exported


In [2]:
identifier_df = read_csv('data/identifier_df.tsv',delimiter='\t',header=0,index_col=0)
synonyms_df = read_csv('data/synonyms_df.tsv',delimiter='\t',header=0,index_col=0, encoding='latin-1')

no_syns=[]
with open('data/no_syns.txt','r') as syn_read:
    for line in syn_read:
        no_syns.append(line.strip('\n'))
no_idens=[]
with open('data/no_idens.txt','r') as iden_read:
    for line in iden_read:
        no_idens.append(line.strip('\n'))

Pull all WD entities with GARD IDs


In [3]:
# Retrieve all QIDs with GARD IDs

sparqlQuery = "SELECT * WHERE {?item wdt:P4317 ?GARD}"
result = wdi_core.WDItemEngine.execute_sparql_query(sparqlQuery)

In [4]:
gard_in_wd_list = []

for i in tqdm(range(len(result["results"]["bindings"]))):
    gard_id = result["results"]["bindings"][i]["GARD"]["value"]
    wdid = result["results"]["bindings"][i]["item"]["value"].replace("http://www.wikidata.org/entity/", "")
    gard_in_wd_list.append({'WDID':wdid,'diseaseId':gard_id})

gard_in_wd = pd.DataFrame(gard_in_wd_list)
print(gard_in_wd.head(n=3))


        WDID diseaseId
0   Q5514398         1
1  Q18553682     10346
2  Q32038811     10539

Identify GARD diseases not yet in Wikidata

Currently there is no bot that adds GARD IDs to Wikidata entities, so the GARD IDs already in Wikidata were added via Mix'n'match. Identify the GARD diseases not yet in Wikidata, and determine whether they can be mapped using one of the other identifiers available via GARD (e.g. Orphanet).
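The "not yet in Wikidata" check is an anti-join; on toy data, pandas' `merge(..., indicator=True)` is one way to sketch it (the IDs and QIDs here are made up):

```python
import pandas as pd

# Toy data: all GARD disease IDs vs. the subset already mapped in Wikidata.
gard_ids = pd.DataFrame({"diseaseId": [1, 2, 3]})
mapped_in_wd = pd.DataFrame({"diseaseId": [1], "WDID": ["Q10"]})

# A left merge with indicator=True tags rows found only on the left side,
# i.e. GARD IDs with no Wikidata mapping yet.
merged = gard_ids.merge(mapped_in_wd, on="diseaseId", how="left", indicator=True)
not_in_wd = merged.loc[merged["_merge"] == "left_only", "diseaseId"].tolist()
print(not_in_wd)
```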


In [14]:
gard_in_wd_id_list = gard_in_wd['diseaseId'].unique().tolist()

gard_not_in_wd = identifier_df.loc[~identifier_df['diseaseId'].isin(gard_in_wd_id_list)]
print(len(gard_not_in_wd))
print(len(gard_not_in_wd['diseaseId'].unique().tolist()))
print(gard_not_in_wd.head(n=2))
property_list = gard_not_in_wd['identifierType'].unique().tolist()
print(property_list)


3718
1880
   diseaseId identifierId identifierType
0       5658       202010           OMIM
1       5658        90795       ORPHANET
['OMIM', 'ORPHANET', 'UMLS', 'SNOMED CT', 'ICD 10', 'NCI Thesaurus', 'ICD 10-CM', 'MeSH']

Pull disease lists based on identifiers so that multiple merges can be used to determine best fit


In [16]:
prop_id_dict = {'OMIM':'P492', 'ORPHANET':'P1550', 'UMLS':'P2892',
                'SNOMED CT':'P5806', 'ICD 10':'P494', 'NCI Thesaurus':'P1748',
                'ICD 10-CM':'P4229', 'MeSH':'P486'}
print(prop_id_dict['OMIM'])


P492

In [26]:
sparql_start = 'SELECT * WHERE {?item wdt:'
sparql_end = '}'

identifier_megalist=[]

for eachidtype in property_list:
    sparqlQuery = sparql_start + prop_id_dict[eachidtype] + ' ?identifierId'+sparql_end
    result = wdi_core.WDItemEngine.execute_sparql_query(sparqlQuery)
    for i in tqdm(range(len(result["results"]["bindings"]))):
        id_id = result["results"]["bindings"][i]['identifierId']["value"]
        wdid = result["results"]["bindings"][i]["item"]["value"].replace("http://www.wikidata.org/entity/", "")
        identifier_megalist.append({'WDID':wdid,'identifierId':id_id, 'identifierType':eachidtype})
    print(len(identifier_megalist))
    time.sleep(2)
        
identifier_megadf = pd.DataFrame(identifier_megalist)
identifier_megadf.to_csv('data/identifier_megadf.tsv',sep='\t',header=True)



In [61]:
## For each GARD disease entry, check for multiple mappings to the same WDID
missing_gard_merge = gard_not_in_wd.merge(identifier_megadf,on=(['identifierId', 'identifierType']), how="inner")
still_missing = gard_not_in_wd.loc[~gard_not_in_wd['diseaseId'].isin(missing_gard_merge['diseaseId'].unique().tolist())]
print("Disease IDs for which identifiers couldn't be used to find a QID: ",len(still_missing))

## Determine the number of identifiers that support a merge
potential_gard = missing_gard_merge.groupby(['diseaseId','WDID']).size().reset_index(name='identifier_count')
mapping_check1 = potential_gard.groupby('diseaseId').size().reset_index(name='qid_count')
one_to_many = mapping_check1.loc[mapping_check1['qid_count']>1]
#print(len(one_to_many))
mapping_check2 = potential_gard.groupby('WDID').size().reset_index(name='gardid_count')
many_to_one = mapping_check2.loc[mapping_check2['gardid_count']>1]
#print(len(many_to_one))
gard_mapping_issue_ids = one_to_many['diseaseId'].unique().tolist() + many_to_one['WDID'].unique().tolist()

gard_to_add = potential_gard.loc[~potential_gard['diseaseId'].isin(gard_mapping_issue_ids) & 
                                 ~potential_gard['WDID'].isin(gard_mapping_issue_ids) &
                                 ~potential_gard['diseaseId'].isin(still_missing['diseaseId'].unique().tolist())]

gard_to_add_full = gard_to_add.merge(gard_df,on='diseaseId',how="left")

gard_to_auto_add = gard_to_add_full.loc[gard_to_add_full['identifier_count']>1]
gard_to_suggest = gard_to_add_full.loc[gard_to_add_full['identifier_count']==1]
print(gard_to_auto_add.head(n=2))


Disease IDs for which identifiers couldn't be used to find a QID:  276
   diseaseId       WDID  identifier_count                    diseaseName  \
0          1   Q5514398                 3               GRACILE syndrome   
1         16  Q55998700                 2  Oculomotor apraxia Cogan type   

   hasGardWebPage                                        identifiers  isRare  \
0            True  [{'identifierType': 'OMIM', 'identifierId': '6...    True   
1            True  [{'identifierType': 'ORPHANET', 'identifierId'...    True   

                                            synonyms  \
0  [FLNMS, Finnish lactic acidosis with hepatic h...   
1  [Congenital oculomotor apraxia, Cogan's syndro...   

                                          websiteUrl  
0  https://rarediseases.info.nih.gov/diseases/1/g...  
1  https://rarediseases.info.nih.gov/diseases/16/...  

Add the appropriate Wikidata statements

After removing entries with no alternative identifier by which the GARD entry can be mapped, GARD entries that map to multiple Wikidata entities, and multiple GARD entries that map to a single Wikidata entity (based entirely on the other identifiers GARD provides for each entry), we're left with entries we can either add automatically or suggest. Entries that map to a single QID based on MULTIPLE identifier mappings can be scripted. Entries that map to a single QID based on only a single identifier are probably best sent to Mix'n'match, to avoid complaints further down the line.
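The ambiguity filter described above can be sketched on toy candidate mappings: `groupby(...).size()` flags one-to-many and many-to-one pairs, and only unambiguous pairs survive. All IDs and QIDs below are invented for illustration.

```python
import pandas as pd

# Toy candidate mappings between GARD diseaseIds and Wikidata QIDs.
# diseaseId 1 maps to two QIDs (one-to-many); Q12 receives two diseaseIds (many-to-one).
candidates = pd.DataFrame({
    "diseaseId": [1, 1, 2, 3, 4],
    "WDID":      ["Q10", "Q11", "Q12", "Q12", "Q13"],
})

one_to_many = candidates.groupby("diseaseId").size().reset_index(name="qid_count")
many_to_one = candidates.groupby("WDID").size().reset_index(name="gard_count")
ambiguous = (one_to_many.loc[one_to_many["qid_count"] > 1, "diseaseId"].tolist()
             + many_to_one.loc[many_to_one["gard_count"] > 1, "WDID"].tolist())

# Keep only pairs where neither side is involved in an ambiguous mapping.
clean = candidates.loc[~candidates["diseaseId"].isin(ambiguous)
                       & ~candidates["WDID"].isin(ambiguous)]
print(clean)
```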


In [62]:
# GARD rare disease ID P4317

from datetime import datetime
import copy
def create_reference(gard_url):
    refStatedIn = wdi_core.WDItemID(value="Q47517289", prop_nr="P248", is_reference=True)
    timeStringNow = datetime.now().strftime("+%Y-%m-%dT00:00:00Z")
    refRetrieved = wdi_core.WDTime(timeStringNow, prop_nr="P813", is_reference=True)
    refURL = wdi_core.WDUrl(value=gard_url, prop_nr="P854", is_reference=True)

    return [refStatedIn, refRetrieved, refURL]

In [68]:
## Unit test --  write a statement
gard_qid = gard_to_auto_add.iloc[1]['WDID']
gard_url = gard_to_auto_add.iloc[1]['websiteUrl']
gard_id = str(gard_to_auto_add.iloc[1]['diseaseId'])
reference = create_reference(gard_url)
gard_prop = "P4317" 
statement = [wdi_core.WDString(value=gard_id, prop_nr=gard_prop, references=[copy.deepcopy(reference)])]
item = wdi_core.WDItemEngine(wd_item_id=gard_qid, data=statement, append_value=gard_prop,
                           global_ref_mode='CUSTOM', ref_handler=update_retrieved_if_new_multiple_refs)
item.write(login)
edit_id = item.lastrevid
print(gard_id, gard_qid, gard_url)


16 Q55998700 https://rarediseases.info.nih.gov/diseases/16/oculomotor-apraxia-cogan-type

In [70]:
## Test write with 10 items completed successfully

gard_map_revision_list = []

for i in tqdm(range(len(gard_to_auto_add))):
    gard_qid = gard_to_auto_add.iloc[i]['WDID']
    gard_url = gard_to_auto_add.iloc[i]['websiteUrl']
    gard_id = str(gard_to_auto_add.iloc[i]['diseaseId'])
    reference = create_reference(gard_url)
    gard_prop = "P4317" 
    statement = [wdi_core.WDString(value=gard_id, prop_nr=gard_prop, references=[copy.deepcopy(reference)])]
    item = wdi_core.WDItemEngine(wd_item_id=gard_qid, data=statement, append_value=gard_prop,
                               global_ref_mode='CUSTOM', ref_handler=update_retrieved_if_new_multiple_refs)
    item.write(login,edit_summary='added GARD ID')
    gard_map_revision_list.append(item.lastrevid)
    print(gard_id, gard_qid, gard_url)


1 Q5514398 https://rarediseases.info.nih.gov/diseases/1/gracile-syndrome
16 Q55998700 https://rarediseases.info.nih.gov/diseases/16/oculomotor-apraxia-cogan-type
79 Q2964433 https://rarediseases.info.nih.gov/diseases/79/jansen-type-metaphyseal-chondrodysplasia
92 Q4352925 https://rarediseases.info.nih.gov/diseases/92/meleda-disease
143 Q55780845 https://rarediseases.info.nih.gov/diseases/143/hairy-elbows
157 Q55781934 https://rarediseases.info.nih.gov/diseases/157/santos-mateus-leal-syndrome
172 Q55786312 https://rarediseases.info.nih.gov/diseases/172/macrocephaly-short-stature-paraplegia-syndrome
215 Q55999592 https://rarediseases.info.nih.gov/diseases/215/familial-caudal-dysgenesis
217 Q55782090 https://rarediseases.info.nih.gov/diseases/217/roy-maroteaux-kremp-syndrome


In [ ]:
## Export the revision list 
with open('data/mapping_revisions.txt','w') as outwritelog:
    for eachrevid in gard_map_revision_list:
        outwritelog.write(str(eachrevid)+'\n')

Identify synonyms in need of inclusion

Pull all the entities mapped via GARD and their corresponding English aliases. Determine whether a synonym is missing from the alias list and, if so, include it.
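The "is this synonym already there?" test boils down to a case-insensitive membership check against the label plus aliases, as a toy sketch (the strings below are illustrative, not pulled live from Wikidata):

```python
# Hypothetical Wikidata record for one disease item.
label = "chromosome 15q11.2 deletion syndrome"
aliases = ["15q11.2 microdeletion syndrome", "Del(15)(q11.2)"]

# Candidate GARD synonym, differing from the label only by capitalization.
synonym = "Chromosome 15q11.2 Deletion Syndrome"

# Lowercase everything so the comparison ignores case.
existing = {label.lower()} | {a.lower() for a in aliases}
needs_adding = synonym.lower() not in existing
print(needs_adding)
```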

Pull all labels and aliases from WD entities with GARD IDs


In [10]:
## pull aliases for all entries with GARD IDs
sparqlQuery = '''SELECT ?item ?itemLabel ?GARD ?alias WHERE {
  ?item wdt:P4317 ?GARD.
  OPTIONAL {?item skos:altLabel ?alias FILTER (LANG (?alias) = "en").}
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}'''
result = wdi_core.WDItemEngine.execute_sparql_query(sparqlQuery)


{'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q424242'}, 'alias': {'xml:lang': 'en', 'type': 'literal', 'value': 'ADEM'}, 'GARD': {'type': 'literal', 'value': '8639'}, 'itemLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'acute disseminated encephalomyelitis'}}

In [37]:
## Format the results from the Wikidata query into Pandas DF for easier manipulation

gard_alias_in_wd_list = []

for i in tqdm(range(len(result["results"]["bindings"]))):
    gard_id = result["results"]["bindings"][i]["GARD"]["value"]
    wdid = result["results"]["bindings"][i]["item"]["value"].replace("http://www.wikidata.org/entity/", "")
    label = result["results"]["bindings"][i]["itemLabel"]["value"]
    try:
        alias = result["results"]["bindings"][i]["alias"]["value"]
    except KeyError:
        alias = "No alias"
    gard_alias_in_wd_list.append({'WDID':wdid,'diseaseId':int(gard_id),'label':label,'alias':alias})
    ## Note that Wikidata stores GARD IDs as strings, while GARD stores them as ints. Convert to ensure matchability

gard_alias_in_wd = pd.DataFrame(gard_alias_in_wd_list)
print(gard_alias_in_wd.head(n=3))


      WDID                        alias  diseaseId  \
0  Q424242                         ADEM       8639   
1  Q424242  postinfectious encephalitis       8639   
2  Q424242   Postinfective encephalitis       8639   

                                  label  
0  acute disseminated encephalomyelitis  
1  acute disseminated encephalomyelitis  
2  acute disseminated encephalomyelitis  

Compare GARD synonyms with Wikidata aliases and labels


In [55]:
## Pull the aliases that are sourced from GARD
gard_alias = synonyms_df.loc[synonyms_df['source']=='GARD']

## Filter the Wikidata GARD alias table down to just the GARD IDs present in the GARD alias DF (i.e. those with allowable synonyms)
gard_wd_limited_df = gard_alias_in_wd.loc[gard_alias_in_wd['diseaseId'].isin(gard_alias['diseaseId'].unique().tolist())]
alias_check_df = gard_alias.merge(gard_wd_limited_df,on='diseaseId',how='inner').copy()

## Check if the GARD synonym matches anything in the corresponding Wikidata label or alias
alias_check_df['label_match?'] = alias_check_df['name'].str.lower()==alias_check_df['label'].str.lower()
alias_check_df['alias_match?'] = alias_check_df['name'].str.lower()==alias_check_df['alias'].str.lower()

## Identify the GARD synonyms that were found in Wikidata (label or aliases) for removal
synonyms_to_drop = alias_check_df['name'].loc[(alias_check_df['label_match?']==True) | 
                                              (alias_check_df['alias_match?']==True)].unique().tolist()

## Filter out GARD entries that were found in Wikidata
synonyms_to_inspect = alias_check_df.loc[~alias_check_df['name'].isin(synonyms_to_drop)]

## Identify the synonyms to add to wikidata as an alias
synonyms_to_add = synonyms_to_inspect.drop_duplicates(subset=['diseaseId','name','source','WDID','label'], keep='first')
print(synonyms_to_add.head(n=4))
print(len(synonyms_to_add))


    diseaseId                              name source       WDID  \
0       10525  Chromosome 15q11.2 microdeletion   GARD  Q21154057   
4       10525       Chromosome 15q11.2 deletion   GARD  Q21154057   
8       10831                Partial trisomy 1q   GARD  Q55786662   
22      10130         Deletion 22q13.3 syndrome   GARD   Q1926345   

                                   alias  \
0         15q11.2 microdeletion syndrome   
4         15q11.2 microdeletion syndrome   
8   Partial duplication of chromosome 1q   
22                        22q13 deletion   

                                                label  label_match?  \
0                chromosome 15q11.2 deletion syndrome         False   
4                chromosome 15q11.2 deletion syndrome         False   
8   partial duplication of the long arm of chromos...         False   
22                            22q13 deletion syndrome         False   

    alias_match?  
0          False  
4          False  
8          False  
22         False  
743

Write the GARD aliases to Wikidata

Since aliases don't allow for sourcing/referencing, include an edit summary noting that an alias from GARD was added.


In [63]:
disease_qid = synonyms_to_add.iloc[0]['WDID']
disease_alias = synonyms_to_add.iloc[0]['name']

In [57]:
print(disease_qid,disease_alias)


Q21154057 Chromosome 15q11.2 microdeletion

In [73]:
## Unit test --  write a statement
wikidata_item = wdi_core.WDItemEngine(wd_item_id=disease_qid)
wikidata_item.set_aliases([disease_alias],lang='en',append=True)
wikidata_item.write(login, edit_summary='added alias from GARD')
print(wikidata_item.get_aliases(lang='en'))
print(wikidata_item.lastrevid)
#wikidata_item.get_aliases(lang='en')


['15q11.2 microdeletion syndrome', 'CHROMOSOME 15q11.2 DELETION SYNDROME', 'Del(15)(q11.2)', 'Monosomy 15q11.2', 'Chromosome 15q11.2 microdeletion']
1047775450

In [ ]:
## Script to run the synonym updates
gard_alias_revision_list = []

for i in tqdm(range(len(synonyms_to_add))):
    disease_qid = synonyms_to_add.iloc[i]['WDID']
    disease_alias = synonyms_to_add.iloc[i]['name']
    wikidata_item = wdi_core.WDItemEngine(wd_item_id=disease_qid)
    wikidata_item.set_aliases([disease_alias],lang='en',append=True)
    wikidata_item.write(login, edit_summary='added alias from GARD')
    gard_alias_revision_list.append(wikidata_item.lastrevid)

In [ ]:
## Export the revision list 
with open('data/alias_revisions.txt','w') as aliaslog:
    for eachrevid in gard_alias_revision_list:
        aliaslog.write(str(eachrevid)+'\n')