Retrieving CAS registry numbers


In [1]:
import re
import pubchempy as pcp

Enable debug logging to make it easier to see what is going on:


In [2]:
import logging

logging.getLogger('pubchempy').setLevel(logging.DEBUG)

A function to get the CAS registry numbers for compounds with a particular SMILES substructure:


In [3]:
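def get_substructure_cas(smiles):
    """Return CAS registry numbers found in the synonyms of substructure matches."""
    cas_rns = []
    results = pcp.get_synonyms(smiles, 'smiles', searchtype='substructure')
    for result in results:
        for syn in result.get('Synonym', []):
            # Keep synonyms that start with a CAS-like pattern: 2-7 digits, 2 digits, 1 digit
            match = re.match(r'(\d{2,7}-\d\d-\d)', syn)
            if match:
                cas_rns.append(match.group(1))
    return cas_rns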
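
The pattern above only checks the shape of a synonym, so a string that merely looks like a CAS registry number would also be collected. One way to tighten this is to verify the check digit: the final digit of a CAS number equals the weighted sum of the preceding digits (weights 1, 2, 3, ... counted from the right) modulo 10. The sketch below is not part of PubChemPy; `is_valid_cas` and `CAS_RE` are hypothetical names introduced for illustration.

```python
import re

# Hypothetical helper, not part of PubChemPy: full-string CAS pattern
# split into its three hyphen-separated groups.
CAS_RE = re.compile(r'^(\d{2,7})-(\d\d)-(\d)$')


def is_valid_cas(cas_rn):
    """Return True if cas_rn has the CAS format and a correct check digit."""
    match = CAS_RE.match(cas_rn)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight each digit by its position counted from the right, starting at 1.
    total = sum(i * int(d) for i, d in enumerate(reversed(digits), start=1))
    return total % 10 == check_digit
```

With this in place, the results could be filtered with something like `[rn for rn in cas_rns if is_valid_cas(rn)]`.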

Test some inputs:


In [4]:
cas_rns = get_substructure_cas('[Pb]')
print(len(cas_rns))
print(cas_rns[:10])


DEBUG:pubchempy:Request URL: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/substructure/smiles/JSON
DEBUG:pubchempy:Request data: smiles=%5BPb%5D
DEBUG:pubchempy:Request URL: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/listkey/3178699647975629202/synonyms/JSON
DEBUG:pubchempy:Request data: None
2174
[u'7439-92-1', u'15875-18-0', u'54076-28-7', u'14701-27-0', u'15158-12-0', u'52229-97-7', u'724427-66-1', u'598-63-0', u'13427-42-4', u'17398-75-3']

In [5]:
cas_rns = get_substructure_cas('[Se]')
print(len(cas_rns))
print(cas_rns[:10])


DEBUG:pubchempy:Request URL: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/substructure/smiles/JSON
DEBUG:pubchempy:Request data: smiles=%5BSe%5D
DEBUG:pubchempy:Request URL: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/listkey/852509384123203131/synonyms/JSON
DEBUG:pubchempy:Request data: None
DEBUG:pubchempy:Request URL: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/listkey/852509384123203131/synonyms/JSON
DEBUG:pubchempy:Request data: None
14577
[u'10102-18-8', u'26970-82-1', u'15498-87-0', u'7782-82-3', u'14013-56-0', u'14013-56-0', u'29528-97-0', u'50647-14-8', u'7782-49-2', u'11125-23-8']
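
Note that '14013-56-0' appears twice in the output above, so the returned list can contain repeats and the printed count is not a count of distinct registry numbers. If only distinct numbers are wanted, an order-preserving de-duplication is one option. This is a minimal sketch; `unique_in_order` is a hypothetical helper name, and it relies on dictionaries preserving insertion order (guaranteed from Python 3.7).

```python
def unique_in_order(items):
    """Drop repeated items while keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


sample = ['14013-56-0', '14013-56-0', '29528-97-0']
print(unique_in_order(sample))
```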

In [6]:
cas_rns = get_substructure_cas('[Ti]')
print(len(cas_rns))
print(cas_rns[:10])


DEBUG:pubchempy:Request URL: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/substructure/smiles/JSON
DEBUG:pubchempy:Request data: smiles=%5BTi%5D
DEBUG:pubchempy:Request URL: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/listkey/1812119792714669902/synonyms/JSON
DEBUG:pubchempy:Request data: None
1630
[u'13463-67-7', u'1317-80-2', u'1317-70-0', u'98084-96-9', u'100292-32-8', u'101239-53-6', u'116788-85-3', u'12000-59-8', u'12036-20-3', u'12701-76-7']

In [7]:
cas_rns = get_substructure_cas('[Pd]')
print(len(cas_rns))
print(cas_rns[:10])


DEBUG:pubchempy:Request URL: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/substructure/smiles/JSON
DEBUG:pubchempy:Request data: smiles=%5BPd%5D
DEBUG:pubchempy:Request URL: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/listkey/2802290965277166497/synonyms/JSON
DEBUG:pubchempy:Request data: None
1401
[u'7440-05-3', u'17637-99-9', u'53092-86-7', u'7647-10-1', u'10038-97-8', u'10102-05-3', u'14846-30-1', u'884739-77-9', u'3375-31-3', u'19807-27-3']

If a search returns too many results, the combined request can fail with a TimeoutError. In that case, it may be better to run the substructure search first and fetch the synonyms separately:


In [8]:
cids = pcp.get_cids('[Pd]', 'smiles', searchtype='substructure')


DEBUG:pubchempy:Request URL: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/substructure/smiles/JSON
DEBUG:pubchempy:Request data: smiles=%5BPd%5D
DEBUG:pubchempy:Request URL: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/listkey/3838589667186536348/cids/JSON
DEBUG:pubchempy:Request data: None

You can then call pcp.get_synonyms(cids) with the resulting list of CIDs.
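
For a long list of CIDs, one option is to request the synonyms in batches rather than in a single call. The sketch below is an assumption about how that could be organised, not something PubChemPy provides: `chunked` and `collect_cas` are hypothetical helpers, the batch size of 100 is arbitrary, and the `fetch_synonyms` parameter stands in for a callable such as `pcp.get_synonyms` so that the logic can be tried without network access.

```python
import re


def chunked(items, size):
    """Yield consecutive slices of items, each with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def collect_cas(cids, fetch_synonyms, batch_size=100):
    """Collect CAS-like synonyms for cids, fetching one batch at a time.

    fetch_synonyms is assumed to accept a list of CIDs and return a list of
    dicts with a 'Synonym' key, as pcp.get_synonyms does in the cells above.
    """
    cas_rns = []
    for batch in chunked(cids, batch_size):
        for result in fetch_synonyms(batch):
            for syn in result.get('Synonym', []):
                # Same pattern as get_substructure_cas
                match = re.match(r'(\d{2,7}-\d\d-\d)', syn)
                if match:
                    cas_rns.append(match.group(1))
    return cas_rns
```

With the CIDs from the previous cell, this would be called as `collect_cas(cids, pcp.get_synonyms)`.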