Test querying ClassyFire

Example from Justin: http://classyfire.wishartlab.com/entities/HNDVDQJCIGZPNO-YFKPBYRVSA-N (histidine)

Fetch the result of the query above in JSON format.


In [2]:
import urllib2
import json
import jsonpickle

def get_json(url):
    response = urllib2.urlopen(url)
    data = json.load(response)   
    return data
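The helper above relies on urllib2, which exists only in Python 2. As a hedged sketch (not part of the original notebook), the variant below imports urlopen from whichever module is available and adds a timeout. The name get_json_compat and the 30-second default are assumptions made for illustration.

```python
import json

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def get_json_compat(url, timeout=30):
    """Fetch a URL and decode its body as JSON (hypothetical helper)."""
    response = urlopen(url, timeout=timeout)
    try:
        # read bytes and decode explicitly so this works on both versions
        return json.loads(response.read().decode('utf-8'))
    finally:
        response.close()
```

It is meant to be called in the same way as get_json, with the URL of a ClassyFire entity as its only required argument.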

Let's see what information ClassyFire returns.


In [3]:
url = 'http://classyfire.wishartlab.com/entities/HNDVDQJCIGZPNO-YFKPBYRVSA-N.json'
data = get_json(url)
for key in data:
    print key


kingdom
smiles
inchikey
classification_version
description
predicted_lipidmaps_terms
molecular_framework
alternative_parents
subclass
intermediate_nodes
superclass
substituents
external_descriptors
direct_parent
class
predicted_chebi_terms

We need the kingdom, superclass, class, subclass, intermediate_nodes and direct_parent entries to construct the taxonomy path of this document (InChIKey).

Wrap this up as a function: we pass in the InChIKey and get back the taxonomy path.


In [30]:
def get_taxa_path(inchikey):

    url = 'http://classyfire.wishartlab.com/entities/%s.json' % inchikey
    data = get_json(url)  # reuse the helper defined above
    
    # store the taxonomy path for this inchikey here
    taxa_path = []

    # add the top-4 taxa
    keys = ['kingdom', 'superclass', 'class', 'subclass']
    for key in keys:
        if data[key] is not None:
            taxa_path.append(data[key]['name'])

    # add all the intermediate taxa >level 4 but above the direct parent
    for entry in data['intermediate_nodes']:
        taxa_path.append(entry['name'])

    # add the direct parent (guarded like the levels above, in case it is missing)
    if data.get('direct_parent') is not None:
        taxa_path.append(data['direct_parent']['name'])

    return taxa_path

inchikey = 'HNDVDQJCIGZPNO-YFKPBYRVSA-N'
tp = get_taxa_path(inchikey)
print '\n'.join(tp)


Chemical entities
Organic compounds
Organic acids and derivatives
Carboxylic acids and derivatives
Amino acids, peptides, and analogues
Amino acids and derivatives
Alpha amino acids and derivatives
Histidine and derivatives
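In some of the Mass2Motif outputs in this notebook the direct parent has the same name as the class or subclass, so it is listed twice (for example "Purine nucleosides"). The sketch below is one possible way to build the path from an already decoded record while skipping missing levels and consecutive repeats. The function name taxa_path_from_record and the example record are hypothetical; the record is hand-written, not real ClassyFire output.

```python
def taxa_path_from_record(record):
    """Build a taxonomy path from a decoded ClassyFire-style dict."""
    path = []

    def add(node):
        # skip missing levels and do not repeat the previous name
        if node and node.get('name') and (not path or path[-1] != node['name']):
            path.append(node['name'])

    for key in ('kingdom', 'superclass', 'class', 'subclass'):
        add(record.get(key))
    for node in record.get('intermediate_nodes') or []:
        add(node)
    add(record.get('direct_parent'))
    return path


# Hand-written record that mimics the fields used above (an assumption)
example = {
    'kingdom': {'name': 'Organic compounds'},
    'superclass': {'name': 'Nucleosides, nucleotides, and analogues'},
    'class': {'name': 'Purine nucleosides'},
    'subclass': None,
    'intermediate_nodes': [],
    'direct_parent': {'name': 'Purine nucleosides'},
}
print('\n'.join(taxa_path_from_record(example)))
```

Whether repeats should be dropped is a design choice: keeping them, as get_taxa_path does, makes it explicit that the direct parent sits at the class or subclass level.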

A method to extract the substituents from a query


In [43]:
def get_substituents(inchikey):
    url = 'http://classyfire.wishartlab.com/entities/%s.json' % inchikey
    data = get_json(url)  # reuse the helper defined above
    # returns None if the entry has no 'substituents' field
    return data.get('substituents')

Now try this with some Mass2Motifs from MassBank. First get all the docs above the default doc-topic threshold (0.05), then retrieve the metadata (InChIKey) of each and pass it to ClassyFire.


In [31]:
def print_m2m_taxonomy(m2m_id):
    
    server = 'www.ms2lda.org'
    url = 'http://%s/basicviz/get_parents_metadata/%d' % (server, m2m_id)
    data = get_json(url)

    for metadata_str in data:
        doc = jsonpickle.decode(metadata_str)
        inchikey = doc['InChIKey']
        print doc['annotation'], inchikey
        for taxon in get_taxa_path(inchikey):
            print '-', taxon
        print

Print a list of the substituents found across all molecules of a Mass2Motif, ranked by how often they appear.


In [24]:
def get_all_substituents(m2m_id):
    server = 'www.ms2lda.org'
    url = 'http://%s/basicviz/get_parents_metadata/%d' % (server, m2m_id)
    data = get_json(url)
    substituents = {}
    for metadata_str in data:
        doc = jsonpickle.decode(metadata_str)
        inchikey = doc['InChIKey']
        substituents[inchikey] = get_substituents(inchikey)
        
    substituent_counts = {}
    for inchikey in substituents:
        # get_substituents() may return None, so fall back to an empty list
        for ss in substituents[inchikey] or []:
            if ss not in substituent_counts:
                substituent_counts[ss] = 1
            else:
                substituent_counts[ss] += 1
    # rank the terms from most to least frequent
    ss_c = sorted(substituent_counts.items(), key=lambda x: x[1], reverse=True)
    for ss, count in ss_c:
        # the denominator is the number of unique InChIKeys in this Mass2Motif
        print "{},{} (/{})".format(ss, count, len(substituents))
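The counting step can also be written with collections.Counter from the standard library. The following is a minimal sketch on toy data, not the notebook's own method. The function name rank_substituents and the toy keys are hypothetical, and converting each list to a set (so a term counts at most once per molecule) is an assumption about the intended semantics.

```python
from collections import Counter


def rank_substituents(substituents):
    """substituents: dict mapping InChIKey -> list of terms (or None)."""
    counts = Counter()
    for terms in substituents.values():
        # set(): count a term once per molecule; `or []` skips None values
        counts.update(set(terms or []))
    # most_common() returns (term, count) pairs, highest count first
    return counts.most_common()


# Toy input with made-up keys, only to show the shape of the data
toy = {
    'KEY-A': ['Imidazole', 'Amine'],
    'KEY-B': ['Imidazole'],
    'KEY-C': None,
}
for term, count in rank_substituents(toy):
    print('{},{} (/{})'.format(term, count, len(toy)))
```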

1. Get the Taxonomy of Documents in the Histidine Mass2Motif (MassBank)


In [5]:
print_m2m_taxonomy(1083)


washington_0873 KRBMQYPTDYSENE-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organic acids and derivatives
- Carboxylic acids and derivatives
- Amino acids, peptides, and analogues
- Peptides
- Dipeptides

washington_0886 KRBMQYPTDYSENE-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organic acids and derivatives
- Carboxylic acids and derivatives
- Amino acids, peptides, and analogues
- Peptides
- Dipeptides

washington_0859 KRBMQYPTDYSENE-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organic acids and derivatives
- Carboxylic acids and derivatives
- Amino acids, peptides, and analogues
- Peptides
- Dipeptides

riken_0373 DOUMFZQKYFQNTF-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Phenylpropanoids and polyketides
- Cinnamic acids and derivatives
- Hydroxycinnamic acids and derivatives
- Coumaric acids and derivatives

riken_0684 CQOVPNPJLQNMDC-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organic acids and derivatives
- Peptidomimetics
- Hybrid peptides

riken_0619 HNDVDQJCIGZPNO-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organic acids and derivatives
- Carboxylic acids and derivatives
- Amino acids, peptides, and analogues
- Amino acids and derivatives
- Alpha amino acids and derivatives
- Histidine and derivatives

washington_0860 AYMLQYFMYHISQO-UHFFFAOYSA-N
- Organic compounds
- Organic acids and derivatives
- Carboxylic acids and derivatives
- Amino acids, peptides, and analogues
- Amino acids and derivatives
- Alpha amino acids and derivatives
- Histidine and derivatives


In [27]:
get_all_substituents(1083)


Organic oxide,5 (/5)
Carbonyl group,5 (/5)
Organooxygen compound,5 (/5)
Hydrocarbon derivative,5 (/5)
Carboxylic acid,5 (/5)
Organic oxygen compound,5 (/5)
Azole,4 (/5)
Organoheterocyclic compound,4 (/5)
Monocarboxylic acid or derivatives,4 (/5)
Imidazole,4 (/5)
Histidine or derivatives,4 (/5)
Organonitrogen compound,4 (/5)
Organic nitrogen compound,4 (/5)
Heteroaromatic compound,4 (/5)
Azacycle,4 (/5)
Organopnictogen compound,4 (/5)
Aromatic heteromonocyclic compound,4 (/5)
Imidazolyl carboxylic acid derivative,4 (/5)
Primary amine,3 (/5)
Primary aliphatic amine,3 (/5)
Amino acid,3 (/5)
Amine,3 (/5)
Fatty acyl,2 (/5)
Secondary carboxylic acid amide,2 (/5)
Alpha-amino acid or derivatives,2 (/5)
Aralkylamine,2 (/5)
Carboxamide group,2 (/5)
Carboxylic acid derivative,2 (/5)
N-acyl-alpha amino acid or derivatives,2 (/5)
Amino acid or derivatives,2 (/5)
N-acyl-alpha-amino acid,2 (/5)
Aromatic homomonocyclic compound,1 (/5)
Alcohol,1 (/5)
Primary alcohol,1 (/5)
Enoate ester,1 (/5)
Alpha-amino acid,1 (/5)
Coumaric acid or derivatives,1 (/5)
Dicarboxylic acid or derivatives,1 (/5)
Alpha-amino acid amide,1 (/5)
Phenol,1 (/5)
Monocyclic benzene moiety,1 (/5)
Carboxylic acid ester,1 (/5)
Alpha-dipeptide,1 (/5)
Benzenoid,1 (/5)
Fatty amide,1 (/5)
Hybrid peptide,1 (/5)
Styrene,1 (/5)
Fatty acid ester,1 (/5)
Carbamic acid ester,1 (/5)
Beta-hydroxy acid,1 (/5)
1-hydroxy-4-unsubstituted benzenoid,1 (/5)
Catechol,1 (/5)
Hydroxy acid,1 (/5)
Alpha,beta-unsaturated carboxylic ester,1 (/5)
Beta amino acid or derivatives,1 (/5)
Cinnamic acid ester,1 (/5)
1-hydroxy-2-unsubstituted benzenoid,1 (/5)
Serine or derivatives,1 (/5)
3-phenylpropanoic-acid,1 (/5)

2. Get the Taxonomy of Documents in the Adenine Mass2Motif (MassBank)


In [6]:
print_m2m_taxonomy(1367)


washington_0559 UZKQTCBAMSWPJD-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organoheterocyclic compounds
- Imidazopyrimidines
- Purines and purine derivatives
- 6-aminopurines
- 6-alkylaminopurines

washington_0601 GOSWTRUMMSCNCW-UHFFFAOYSA-N
- Organic compounds
- Nucleosides, nucleotides, and analogues
- Purine nucleosides
- Purine nucleosides

metabolights_0044 UZKQTCBAMSWPJD-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organoheterocyclic compounds
- Imidazopyrimidines
- Purines and purine derivatives
- 6-aminopurines
- 6-alkylaminopurines

metabolights_0006 LNQVTSROQXJCDD-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Nucleosides, nucleotides, and analogues
- Ribonucleoside 3'-phosphates
- Ribonucleoside 3'-phosphates

ufz_0051 GFFGJBXGBJISGV-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organoheterocyclic compounds
- Imidazopyrimidines
- Purines and purine derivatives
- 6-aminopurines

ufz_0109 GFFGJBXGBJISGV-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organoheterocyclic compounds
- Imidazopyrimidines
- Purines and purine derivatives
- 6-aminopurines

riken_0621 UZKQTCBAMSWPJD-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organoheterocyclic compounds
- Imidazopyrimidines
- Purines and purine derivatives
- 6-aminopurines
- 6-alkylaminopurines

washington_0551 UZKQTCBAMSWPJD-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organoheterocyclic compounds
- Imidazopyrimidines
- Purines and purine derivatives
- 6-aminopurines
- 6-alkylaminopurines

mpi_0089 UZKQTCBAMSWPJD-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organoheterocyclic compounds
- Imidazopyrimidines
- Purines and purine derivatives
- 6-aminopurines
- 6-alkylaminopurines

mpi_0047 UZKQTCBAMSWPJD-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organoheterocyclic compounds
- Imidazopyrimidines
- Purines and purine derivatives
- 6-aminopurines
- 6-alkylaminopurines

ipb_0126 OIRDTQYFTABQOQ-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Nucleosides, nucleotides, and analogues
- Purine nucleosides
- Purine nucleosides

eawag_0608 OIRDTQYFTABQOQ-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Nucleosides, nucleotides, and analogues
- Purine nucleosides
- Purine nucleosides

metabolights_0014 WUUGFSXJNOTRMR-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Nucleosides, nucleotides, and analogues
- 5'-deoxyribonucleosides
- 5'-deoxy-5'-thionucleosides

metabolights_0016 GOSWTRUMMSCNCW-UHFFFAOYSA-N
- Organic compounds
- Nucleosides, nucleotides, and analogues
- Purine nucleosides
- Purine nucleosides

eawag_0206 MGOHCFMYLBAPRN-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organoheterocyclic compounds
- Azoles
- Pyrazoles
- Phenylpyrazoles

washington_0888 WVXRAFOPTSTNLL-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Nucleosides, nucleotides, and analogues
- Purine nucleosides
- Purine 2',3'-dideoxyribonucleosides

washington_0592 GOSWTRUMMSCNCW-UHFFFAOYSA-N
- Organic compounds
- Nucleosides, nucleotides, and analogues
- Purine nucleosides
- Purine nucleosides

washington_0732 MRWXACSTFXYYMV-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Nucleosides, nucleotides, and analogues
- Purine nucleosides
- Purine nucleosides

washington_0846 AVNJCDRLZOVEDM-UHFFFAOYSA-N
- Organic compounds
- Nucleosides, nucleotides, and analogues
- Purine nucleosides
- Purine nucleosides


In [26]:
get_all_substituents(1367)


Organonitrogen compound,10 (/10)
Organic nitrogen compound,10 (/10)
Heteroaromatic compound,10 (/10)
Azacycle,10 (/10)
Hydrocarbon derivative,10 (/10)
Aromatic heteropolycyclic compound,10 (/10)
Azole,9 (/10)
Imidazole,9 (/10)
Organopnictogen compound,9 (/10)
Organooxygen compound,9 (/10)
Pyrimidine,9 (/10)
Organic oxygen compound,9 (/10)
Alcohol,8 (/10)
Aminopyrimidine,8 (/10)
Imidolactam,8 (/10)
Oxacycle,8 (/10)
6-aminopurine,7 (/10)
Organoheterocyclic compound,7 (/10)
N-substituted imidazole,7 (/10)
Primary alcohol,7 (/10)
Imidazopyrimidine,7 (/10)
Purine,7 (/10)
Glycosyl compound,6 (/10)
Secondary alcohol,6 (/10)
Pentose monosaccharide,6 (/10)
N-glycosyl compound,6 (/10)
Amine,6 (/10)
Monosaccharide,6 (/10)
Oxolane,5 (/10)
Primary aromatic amine,5 (/10)
Primary amine,5 (/10)
Purine nucleoside,4 (/10)
6-alkylaminopurine,3 (/10)
Organic oxide,2 (/10)
Secondary aliphatic/aromatic amine,2 (/10)
Tetrahydrofuran,2 (/10)
5'-deoxy-5'-thionucleoside,1 (/10)
1,2-diol,1 (/10)
Organic phosphoric acid derivative,1 (/10)
Alkyl phosphate,1 (/10)
Purine 2',3'-dideoxyribonucleoside,1 (/10)
Monocyclic benzene moiety,1 (/10)
Pyrazolinone,1 (/10)
Monoalkyl phosphate,1 (/10)
Carbonyl group,1 (/10)
Ribonucleoside 3'-phosphate,1 (/10)
Vinylogous amide,1 (/10)
Phosphoric acid ester,1 (/10)
Thioether,1 (/10)
Carboxylic acid derivative,1 (/10)
Toluene,1 (/10)
Lactam,1 (/10)
Sulfenyl compound,1 (/10)
Dialkylthioether,1 (/10)
Carboxylic acid ester,1 (/10)
Dialkylarylamine,1 (/10)
Monosaccharide phosphate,1 (/10)
Organosulfur compound,1 (/10)
Phenylpyrazole,1 (/10)
Dialkyl ether,1 (/10)
Benzenoid,1 (/10)
Ether,1 (/10)
Monocarboxylic acid or derivatives,1 (/10)
Pentose phosphate,1 (/10)

3. Get the Taxonomy of Documents in the Ferulic Acid Mass2Motif (MassBank)


In [32]:
print_m2m_taxonomy(1430)


washington_0806 IRUHWRSITUYICV-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Phenylpropanoids and polyketides
- Coumarins and derivatives
- Hydroxycoumarins

washington_0627 ARQXEQLMMNGFDU-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Phenylpropanoids and polyketides
- Coumarins and derivatives
- Coumarin glycosides

washington_0222 ZKMLQDNHMSFULN-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Phenylpropanoids and polyketides
- Flavonoids
- Flavones

ipb_0145 ARQXEQLMMNGFDU-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Phenylpropanoids and polyketides
- Coumarins and derivatives
- Coumarin glycosides

riken_0389 HSHNITRMYYLLCV-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Phenylpropanoids and polyketides
- Coumarins and derivatives
- Hydroxycoumarins
- 7-hydroxycoumarins

riken_0404 HSHNITRMYYLLCV-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Phenylpropanoids and polyketides
- Coumarins and derivatives
- Hydroxycoumarins
- 7-hydroxycoumarins

ufz_0294 HSHNITRMYYLLCV-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Phenylpropanoids and polyketides
- Coumarins and derivatives
- Hydroxycoumarins
- 7-hydroxycoumarins

washington_1029 MXXWOMGUGJBKIW-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Alkaloids and derivatives
- Alkaloids and derivatives

eawag_0029 CXWYCAYNZXSHTF-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organoheterocyclic compounds
- Benzofurans
- Benzofurans

ufz_0303 YWWHKOHZGJFMIE-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Benzenoids
- Benzene and substituted derivatives
- Benzoic acids and derivatives
- Benzoic acid esters

washington_0408 ZCCUUQDIBDJBTK-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Phenylpropanoids and polyketides
- Coumarins and derivatives
- Furanocoumarins
- Linear furanocoumarins
- Psoralens

eawag_0763 FLKPEMZONWLCSK-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Benzenoids
- Benzene and substituted derivatives
- Benzoic acids and derivatives
- Benzoic acid esters

fiocruz_0024 AIONOLUJZLIMTK-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Phenylpropanoids and polyketides
- Flavonoids
- O-methylated flavonoids
- 4'-O-methylated flavonoids


In [33]:
get_all_substituents(1430)


Hydrocarbon derivative,10 (/10)
Organic oxide,10 (/10)
Organic oxygen compound,10 (/10)
Organooxygen compound,10 (/10)
Benzenoid,8 (/10)
Oxacycle,8 (/10)
Aromatic heteropolycyclic compound,8 (/10)
Organoheterocyclic compound,7 (/10)
1-benzopyran,6 (/10)
Benzopyran,6 (/10)
Carboxylic acid derivative,5 (/10)
Pyran,5 (/10)
Heteroaromatic compound,5 (/10)
Lactone,5 (/10)
Pyranone,5 (/10)
1-hydroxy-2-unsubstituted benzenoid,4 (/10)
Carbonyl group,3 (/10)
Carboxylic acid ester,3 (/10)
Chromone,2 (/10)
Phenol,2 (/10)
Benzoyl,2 (/10)
Aromatic homomonocyclic compound,2 (/10)
Benzofuran,2 (/10)
Acetal,2 (/10)
Monocarboxylic acid or derivatives,2 (/10)
1-hydroxy-4-unsubstituted benzenoid,2 (/10)
Carboxylic acid,2 (/10)
Benzoate ester,2 (/10)
Dicarboxylic acid or derivatives,2 (/10)
Monocyclic benzene moiety,2 (/10)
Glycosyl compound,1 (/10)
Phenol ether,1 (/10)
Styrene,1 (/10)
5-hydroxyflavonoid,1 (/10)
Flavan,1 (/10)
N-acyl-piperidine,1 (/10)
O-glucuronide,1 (/10)
Benzoic acid,1 (/10)
Naphthopyran,1 (/10)
Carboxamide group,1 (/10)
Chromane,1 (/10)
Methoxyphenol,1 (/10)
7-hydroxyflavonoid,1 (/10)
Hydroxycoumarin,1 (/10)
Alkaloid or derivatives,1 (/10)
Organosulfur compound,1 (/10)
Coumaran,1 (/10)
Phenolic glycoside,1 (/10)
Secondary alcohol,1 (/10)
Hydroxyflavonoid,1 (/10)
Coumarin-7-o-glycoside,1 (/10)
Aryl ketone,1 (/10)
Furan,1 (/10)
Naphthalene,1 (/10)
Hydroxy acid,1 (/10)
Organic sulfonic acid or derivatives,1 (/10)
Methanesulfonate,1 (/10)
Phenoxy compound,1 (/10)
Sulfonyl,1 (/10)
Ether,1 (/10)
Methoxybenzene,1 (/10)
Organonitrogen compound,1 (/10)
Anisole,1 (/10)
Flavanone,1 (/10)
Azacycle,1 (/10)
Tertiary carboxylic acid amide,1 (/10)
Coumarin o-glycoside,1 (/10)
Naphthopyranone,1 (/10)
Alkyl aryl ether,1 (/10)
Aryl alkyl ketone,1 (/10)
Vinylogous acid,1 (/10)
Psoralen,1 (/10)
Piperidine,1 (/10)
Beta-hydroxy acid,1 (/10)
Alcohol,1 (/10)
Sulfonic acid ester,1 (/10)
7-hydroxycoumarin,1 (/10)
Organosulfonic acid ester,1 (/10)
O-glycosyl compound,1 (/10)
3'-hydroxyflavonoid,1 (/10)
Benzodioxole,1 (/10)
1-o-glucuronide,1 (/10)
Organic nitrogen compound,1 (/10)
Oxane,1 (/10)
Organosulfonic acid or derivatives,1 (/10)
Ketone,1 (/10)
Organopnictogen compound,1 (/10)
Polyol,1 (/10)
Monosaccharide,1 (/10)
Glucuronic acid or derivatives,1 (/10)
Flavone,1 (/10)
4p-methoxyflavonoid-skeleton,1 (/10)

TODO List

It would be useful to:

  • Get the substituents for all the MassBank and GNPS spectra. UPDATE: happening below.
  • Be able to extract the docs linked to an M2M using the overlap score (maybe already possible, need to check the code). UPDATE: already done, just change the settings for an experiment.
  • Be able to get all of the motifs for a particular experiment through a URL.
  • Summarise how regularly each substituent term occurs across the whole datasets.

In [40]:
server = 'www.ms2lda.org'
exp_id = 3 # experiment id of massbank
url = 'http://%s/basicviz/get_all_parents_metadata/%d' % (server, exp_id)
data = get_json(url)
inchikeys = []
for metadata_str in data:
    metadata = jsonpickle.decode(metadata_str)
    inchikeys.append(metadata['InChIKey'])

In [45]:
substituents = {}
n_done = 0
for inchikey in inchikeys:
    try:
        substituents[inchikey] = get_substituents(inchikey)
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print "Failed on {}".format(inchikey)
    n_done += 1
    if n_done % 10 == 0:
        print n_done


10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
200
210
220
230
240
250
260
270
Failed on PHYFQTYBJUILEZ-UHFFFAOYSA-N
Failed on QHMTXANCGGJZRX-UHFFFAOYSA-N
280
290
300
310
320
330
340
350
360
370
380
390
400
410
420
430
440
450
460
470
480
490
500
510
520
Failed on ORFOPKXBNMVMKC-UHFFFAOYSA-N
530
540
550
560
570
580
590
600
610
620
630
640
650
660
670
680
690
700
710
720
730
740
750
760
770
780
790
800
810
820
830
840
850
860
870
880
890
900
910
920
930
940
950
960
970
980
990
1000
1010
1020
1030
1040
1050
1060
1070
1080
1090
1100
1110
1120
1130
1140
1150
1160
1170
1180
1190
1200
1210
1220
1230
1240
1250
1260
1270
1280
1290
1300
1310
1320
1330
1340
1350
1360
1370
1380
Failed on ZOTBXTZVPHCKPN-UHFFFAOYSA-N
1390
1400
1410
1420
1430
1440
1450
1460
Failed on WTJKGGKOPKCXLL-UHFFFAOYSA-N
1470
1480
1490
1500
1510
1520
1530
1540
1550
1560
1570
1580
1590
1600
1610
1620
1630
1640
1650
1660
1670
1680
1690
1700
1710
1720
1730
1740
1750
1760
1770
Failed on LSAMUAYPDHUBQD-UHFFFAOYSA-N
1780
1790
1800
1810
1820
1830
Failed on HBOMLICNUCNMMY-UHFFFAOYSA-N
1840
1850
1860
Failed on DKZBBWMURDFHNE-UHFFFAOYSA-N
1870
1880
1890
1900
1910
1920
1930
1940
Failed on LZPNXAULYJPXEH-UHFFFAOYSA-N
1950
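The loop above queries every entry of inchikeys in turn and only prints the keys that fail. A hedged sketch of a variant that queries each unique key once and keeps the failures for a later retry is given below. The names fetch_all and fake_fetch and the optional delay parameter are assumptions for illustration; fake_fetch stands in for get_substituents so that the example runs without network access.

```python
import time


def fetch_all(keys, fetch, delay=0.0):
    """Call fetch(key) once per unique key; return (results, failures)."""
    results = {}
    failures = []
    for key in sorted(set(keys)):
        try:
            results[key] = fetch(key)
        except Exception as err:
            # keep the key and the reason so failed lookups can be retried
            failures.append((key, str(err)))
        if delay:
            time.sleep(delay)  # optional pause between requests
    return results, failures


# Stand-in for get_substituents (hypothetical, no network involved)
def fake_fetch(key):
    if key == 'BAD':
        raise ValueError('no entry')
    return ['Amine']


results, failures = fetch_all(['A', 'BAD', 'A'], fake_fetch)
print(results)
print(failures)
```

With the real helper, the call would look like fetch_all(inchikeys, get_substituents), possibly with a small delay to avoid sending requests too quickly.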

In [49]:
import pickle
with open('massbank_substituents.dict','wb') as f:  # pickle files should be opened in binary mode
    pickle.dump(substituents,f)

Tally the individual terms


In [47]:
tally = {}
for inchikey in substituents:
    # skip entries for which no substituents were returned
    if substituents[inchikey] is not None:
        for ss in substituents[inchikey]:
            if ss not in tally:
                tally[ss] = 1
            else:
                tally[ss] += 1
# rank the terms from most to least frequent
ss_c = sorted(tally.items(), key=lambda x: x[1], reverse=True)

In [48]:
for ss,c in ss_c[:100]:
    print ss,c


Hydrocarbon derivative 1315
Organic oxygen compound 1106
Organooxygen compound 1052
Organic nitrogen compound 941
Organonitrogen compound 935
Organic oxide 926
Organopnictogen compound 926
Azacycle 607
Benzenoid 590
Heteroaromatic compound 580
Organoheterocyclic compound 505
Aromatic heteropolycyclic compound 484
Carbonyl group 459
Amine 428
Monocyclic benzene moiety 410
Carboxylic acid derivative 381
Oxacycle 368
Ether 358
Aromatic heteromonocyclic compound 302
Aromatic homomonocyclic compound 301
Alcohol 296
Alkyl aryl ether 284
Organohalogen compound 276
Monocarboxylic acid or derivatives 266
Secondary alcohol 230
Aryl halide 222
Carboxylic acid 219
1-hydroxy-2-unsubstituted benzenoid 200
Carboxamide group 196
Organochloride 191
Primary amine 186
Phenoxy compound 179
Anisole 178
1-benzopyran 169
Benzopyran 169
Aryl chloride 165
Aralkylamine 163
Tertiary amine 161
Phenol ether 156
Phenol 154
Azole 145
Tertiary aliphatic amine 143
Carboxylic acid ester 138
Organosulfur compound 138
Pyran 138
1-hydroxy-4-unsubstituted benzenoid 137
Primary alcohol 136
Glycosyl compound 133
Pyranone 133
Acetal 132
Polyol 128
Oxane 127
Halobenzene 124
Benzoyl 117
Organic 1,3-dipolar compound 117
Propargyl-type 1,3-dipolar organic compound 116
Chromone 116
Amino acid or derivatives 115
Vinylogous acid 110
Primary aliphatic amine 109
Ketone 108
Amino acid 107
Hydroxyflavonoid 103
O-glycosyl compound 103
Secondary amine 101
Monosaccharide 101
Secondary carboxylic acid amide 101
Tertiary carboxylic acid amide 100
Methoxybenzene 99
Chlorobenzene 93
Aliphatic acyclic compound 93
Fatty acyl 90
Pyrimidine 86
Organofluoride 85
5-hydroxyflavonoid 85
Secondary aliphatic amine 83
Primary aromatic amine 83
Lactam 81
Pyridine 79
Imidazole 77
Sulfonyl 77
Dialkyl ether 76
4'-hydroxyflavonoid 76
Flavone 73
Alpha-amino acid or derivatives 70
Vinylogous amide 70
Aniline or substituted anilines 70
Alkyl halide 64
Lactone 64
Organic sulfonic acid or derivatives 63
7-hydroxyflavonoid 63
Organosulfonic acid or derivatives 63
Aryl ketone 60
Carbonic acid derivative 58
Tertiary alcohol 56
Phenolic glycoside 56
Dicarboxylic acid or derivatives 56
Imidolactam 54
Sulfenyl compound 53
Benzenesulfonyl group 52

In [51]:
import plotly.offline
from plotly.graph_objs import Bar, Layout
plotly.offline.init_notebook_mode()


Make a bar plot of the prevalence of different terms


In [66]:
data = []
ss,c = zip(*ss_c)
x = ss
n_inchi_keys = len(substituents)  # unique InChIKeys fetched; the tally counts each of them once
y = [100.0*float(count)/float(n_inchi_keys) for count in c]
data.append(
    Bar(
        x = x,
        y = y,
    )
)
layout = Layout(
    xaxis = dict(
        title = 'substituent term',
    ),
    yaxis = dict(
        title = 'percentage of inchi keys',
        type = 'log',
    ),
)
plotly.offline.iplot({'data':data,'layout':layout})

print "There are {} unique terms in this dataset".format(len(x))


There are 1106 unique terms in this dataset
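The percentages in the plot depend on the denominator. Since the tally counts each unique InChIKey once, the sketch below takes the number of compounds as an explicit argument so that the choice is visible. The function name prevalence and the toy tally are hypothetical and do not come from the MassBank data.

```python
def prevalence(tally, n_compounds):
    """Return (term, percent) pairs sorted from most to least prevalent."""
    pairs = [(term, 100.0 * count / n_compounds)
             for term, count in tally.items()]
    return sorted(pairs, key=lambda p: p[1], reverse=True)


# Toy tally with made-up counts, assuming 4 compounds in total
toy_tally = {'Amine': 3, 'Imidazole': 1, 'Ether': 2}
for term, pct in prevalence(toy_tally, 4):
    print('{}: {:.1f}%'.format(term, pct))
```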
