Group the compounds into the BRITE hierarchy. The goal is to see how well the different K-means groups fit (or do not fit) the BRITE hierarchy.

Krista, 1 September 2015. Modified 30 November 2015 to get the pathway BRITE hierarchy.

This would probably make more sense as a script, but I did not bother to convert it. It writes out a pickled file for use elsewhere.


In [19]:
import pandas as pd
import re
import glob
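try:
    import cPickle as cpk  # Python 2
except ImportError:
    import pickle as cpk  # Python 3, where cPickle no longer exists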

from IPython.core.debugger import Tracer #used this to step into the function and debug it, also need line with Tracer()()

Now go in and read one BRITE file (originally written for the compounds).


In [101]:
textC = re.compile(r'\d+')  # raw string so \d is not treated as a string escape
mc = textC.search(line)
mc.group(0)
#so, now I have the map number...how do I get the rest of the line?


Out[101]:
'01100'

In [116]:
textC = re.compile(r'(\d+)\s*(.*)$')
mC = textC.search(line)
print(mC.group(1))  # map number
print(mC.group(2))  # rest of the line (the pathway name)


01100
Metabolic pathways
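
As a quick check of how the three line types are told apart, the sketch below runs the level A, B and C patterns over a few hand-written lines. The sample strings are assumptions modeled on the `wholeThing` column of the parsed output, not lines copied from the real file, so the exact spacing may differ.

```python
import re

# Patterns for the three BRITE levels (same expressions used in ReadBRITEfile)
textA = re.compile(r'(^A<b>)(.+)(</b>)\s*(.*)$')
textB = re.compile(r'(^B)\s*(.*)$')
textC = re.compile(r'(\d+)\s*(.*)$')

# Hypothetical sample lines; the real file may use different spacing
samples = [
    'A<b>Metabolism</b>\n',
    'B  Global and overview maps\n',
    'C    01100  Metabolic pathways\n',
]

parsed = []
for line in samples:
    mA = textA.search(line)
    mB = textB.search(line)
    mC = textC.search(line)
    # Test A, then B, then C: a heading that happens to contain digits
    # would also match textC, so the order of the checks matters
    if mA:
        parsed.append(('A', mA.group(2)))
    elif mB:
        parsed.append(('B', mB.group(2)))
    elif mC:
        parsed.append(('C', mC.group(1), mC.group(2)))
```

Each entry in `parsed` holds the level and the captured text; for the level C line it holds the map number and the pathway name separately.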

In [134]:
def ReadBRITEfile(briteFile):
    forBrite = pd.DataFrame(columns = ['map','A','B','C','wholeThing'])
    # set up the expressions to match each level in the BRITE hierarchy
    
    textA = re.compile(r'(^A<b>)(.+)(</b>)\s*(.*)$')
    textB = re.compile(r'(^B)\s*(.*)$')
    textC = re.compile(r'(\d+)\s*(.*)$')
    #this relies on the fact that the rows are in order: A, with B subheadings, then C subheadings
    setA = []
    idxA = []

    setB = []
    setC = []

    with open(briteFile) as f:
        for idx,line in enumerate(f):
            if not line.startswith('#'): #skip over the comments (compare by value, not identity)
                mA = textA.search(line) 
                mB = textB.search(line) 
                mC = textC.search(line) 
                if mA:
                    setA = mA.group(2)
                    #house cleaning (probably c)
                    idxA = idx
                    forBrite.loc[idx,'A'] = setA
                    forBrite.loc[idx,'wholeThing'] = line #using this as a double check for now
                    #forBrite.loc[idx,'map'] = mC.group(1)
                elif mB:
                    setB = mB.group(2)
                    forBrite.loc[idx,'A'] = setA
                    forBrite.loc[idx,'B'] = setB
                    forBrite.loc[idx,'wholeThing'] = line
                    #forBrite.loc[idx,'map'] = mC.group(1)
                elif mC:
                    #Tracer()()
                    setC = mC.group(2)
                    forBrite.loc[idx,'A'] = setA
                    forBrite.loc[idx,'B'] = setB
                    forBrite.loc[idx,'C'] = setC
                    forBrite.loc[idx,'wholeThing'] = line
                    forBrite.loc[idx,'map'] = mC.group(1)

        return forBrite
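
Filling the frame one cell at a time with `.loc` grows it row by row, which can be slow for long files. Below is a minimal sketch of the same parse, assuming it is fine to collect the rows in a plain list and build the frame once at the end. The name `parse_brite_lines` and the sample lines are hypothetical and not part of the original notebook.

```python
import re
import pandas as pd

def parse_brite_lines(lines):
    """Parse BRITE text lines into a DataFrame (hypothetical helper)."""
    textA = re.compile(r'(^A<b>)(.+)(</b>)\s*(.*)$')
    textB = re.compile(r'(^B)\s*(.*)$')
    textC = re.compile(r'(\d+)\s*(.*)$')
    setA = setB = None   # most recent A and B headings seen so far
    rows, index = [], []
    for idx, line in enumerate(lines):
        if line.startswith('#'):   # skip over the comments
            continue
        mA, mB, mC = textA.search(line), textB.search(line), textC.search(line)
        if mA:
            setA = mA.group(2)
            row = {'A': setA}
        elif mB:
            setB = mB.group(2)
            row = {'A': setA, 'B': setB}
        elif mC:
            row = {'map': mC.group(1), 'A': setA, 'B': setB, 'C': mC.group(2)}
        else:
            continue               # not an A, B or C line
        row['wholeThing'] = line
        rows.append(row)
        index.append(idx)          # keep the line number as the row label
    return pd.DataFrame(rows, index=index,
                        columns=['map', 'A', 'B', 'C', 'wholeThing'])

# Hypothetical sample input standing in for a real BRITE file
sample = [
    '#comment line\n',
    'A<b>Metabolism</b>\n',
    'B  Global and overview maps\n',
    'C    01100  Metabolic pathways\n',
]
demo = parse_brite_lines(sample)
```

To use it on a file, something like `parse_brite_lines(open(briteFile))` should work, since a file object iterates over its lines.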

In [153]:
D = glob.glob('*keg.txt')
# read each matching file and stack the results (row labels restart for each file)
allBRITE = pd.concat([ReadBRITEfile(nof) for nof in D])

In [138]:
type(allBRITE)


Out[138]:
pandas.core.frame.DataFrame

In [146]:
allBRITE.loc[allBRITE['map']=='01100']


Out[146]:
    map    A           B                         C                   wholeThing
10  01100  Metabolism  Global and overview maps  Metabolic pathways  C 01100 Metabolic pathways\n

In [147]:
allBRITE.loc[allBRITE['map']=='01100'].C


Out[147]:
10    Metabolic pathways
Name: C, dtype: object

In [148]:
type(allBRITE.loc[allBRITE['map']=='01100'].C)


Out[148]:
pandas.core.series.Series
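
Because a boolean selection returns a Series even when only one row matches, an extra step is needed to get the pathway name as a plain string. The sketch below shows two options on a small stand-in frame; `demo`, `hit` and `byMap` are hypothetical names, and the two rows are copied from the parsed output rather than read from the file.

```python
import pandas as pd

# Hypothetical two-row stand-in for allBRITE
demo = pd.DataFrame(
    {'map': ['01100', '00010'],
     'A': ['Metabolism', 'Metabolism'],
     'B': ['Global and overview maps', 'Carbohydrate metabolism'],
     'C': ['Metabolic pathways', 'Glycolysis / Gluconeogenesis']},
    index=[10, 20])

# Boolean selection returns a Series, even for a single matching row
hit = demo.loc[demo['map'] == '01100', 'C']

# .iloc[0] pulls out the first (here the only) value as a plain string
name = hit.iloc[0]

# For repeated lookups, a Series indexed by map number is convenient
byMap = demo.dropna(subset=['map']).set_index('map')['C']
name2 = byMap['00010']
```

The `dropna` step matters on the real frame, where the A and B heading rows have no map number.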

In [137]:
allBRITE


Out[137]:
map A B C wholeThing
8 NaN Metabolism NaN NaN A<b>Metabolism</b>\n
9 NaN Metabolism Global and overview maps NaN B Global and overview maps\n
10 01100 Metabolism Global and overview maps Metabolic pathways C 01100 Metabolic pathways\n
11 01110 Metabolism Global and overview maps Biosynthesis of secondary metabolites C 01110 Biosynthesis of secondary metaboli...
12 01120 Metabolism Global and overview maps Microbial metabolism in diverse environments C 01120 Microbial metabolism in diverse en...
13 01130 Metabolism Global and overview maps Biosynthesis of antibiotics C 01130 Biosynthesis of antibiotics\n
14 01200 Metabolism Global and overview maps Carbon metabolism C 01200 Carbon metabolism\n
15 01210 Metabolism Global and overview maps 2-Oxocarboxylic acid metabolism C 01210 2-Oxocarboxylic acid metabolism\n
16 01212 Metabolism Global and overview maps Fatty acid metabolism C 01212 Fatty acid metabolism\n
17 01230 Metabolism Global and overview maps Biosynthesis of amino acids C 01230 Biosynthesis of amino acids\n
18 01220 Metabolism Global and overview maps Degradation of aromatic compounds C 01220 Degradation of aromatic compounds\n
19 NaN Metabolism Carbohydrate metabolism NaN B Carbohydrate metabolism\n
20 00010 Metabolism Carbohydrate metabolism Glycolysis / Gluconeogenesis C 00010 Glycolysis / Gluconeogenesis\n
21 00020 Metabolism Carbohydrate metabolism Citrate cycle (TCA cycle) C 00020 Citrate cycle (TCA cycle)\n
22 00030 Metabolism Carbohydrate metabolism Pentose phosphate pathway C 00030 Pentose phosphate pathway\n
23 00040 Metabolism Carbohydrate metabolism Pentose and glucuronate interconversions C 00040 Pentose and glucuronate interconve...
24 00051 Metabolism Carbohydrate metabolism Fructose and mannose metabolism C 00051 Fructose and mannose metabolism\n
25 00052 Metabolism Carbohydrate metabolism Galactose metabolism C 00052 Galactose metabolism\n
26 00053 Metabolism Carbohydrate metabolism Ascorbate and aldarate metabolism C 00053 Ascorbate and aldarate metabolism\n
27 00500 Metabolism Carbohydrate metabolism Starch and sucrose metabolism C 00500 Starch and sucrose metabolism\n
28 00520 Metabolism Carbohydrate metabolism Amino sugar and nucleotide sugar metabolism C 00520 Amino sugar and nucleotide sugar m...
29 00620 Metabolism Carbohydrate metabolism Pyruvate metabolism C 00620 Pyruvate metabolism\n
30 00630 Metabolism Carbohydrate metabolism Glyoxylate and dicarboxylate metabolism C 00630 Glyoxylate and dicarboxylate metab...
31 00640 Metabolism Carbohydrate metabolism Propanoate metabolism C 00640 Propanoate metabolism\n
32 00650 Metabolism Carbohydrate metabolism Butanoate metabolism C 00650 Butanoate metabolism\n
33 00660 Metabolism Carbohydrate metabolism C5-Branched dibasic acid metabolism C 00660 C5-Branched dibasic acid metabolism\n
34 00562 Metabolism Carbohydrate metabolism Inositol phosphate metabolism C 00562 Inositol phosphate metabolism\n
35 NaN Metabolism Energy metabolism NaN B Energy metabolism\n
36 00190 Metabolism Energy metabolism Oxidative phosphorylation C 00190 Oxidative phosphorylation\n
37 00195 Metabolism Energy metabolism Photosynthesis C 00195 Photosynthesis\n
... ... ... ... ... ...
525 NaN Drug Development Target-based classification: Nuclear receptors NaN B Target-based classification: Nuclear recept...
526 07225 Drug Development Target-based classification: Nuclear receptors Glucocorticoid and meneralocorticoid receptor ... C 07225 Glucocorticoid and meneralocortico...
527 07226 Drug Development Target-based classification: Nuclear receptors Progesterone, androgen and estrogen receptor a... C 07226 Progesterone, androgen and estroge...
528 07223 Drug Development Target-based classification: Nuclear receptors Retinoic acid receptor (RAR) and retinoid X re... C 07223 Retinoic acid receptor (RAR) and r...
529 07222 Drug Development Target-based classification: Nuclear receptors Peroxisome proliferator-activated receptor (PP... C 07222 Peroxisome proliferator-activated ...
530 NaN Drug Development Target-based classification: Ion channels NaN B Target-based classification: Ion channels\n
531 07221 Drug Development Target-based classification: Ion channels Nicotinic cholinergic receptor antagonists C 07221 Nicotinic cholinergic receptor ant...
532 07230 Drug Development Target-based classification: Ion channels GABA-A receptor agonists/antagonists C 07230 GABA-A receptor agonists/antagonis...
533 07036 Drug Development Target-based classification: Ion channels Calcium channel blocking drugs C 07036 Calcium channel blocking drugs\n
534 07231 Drug Development Target-based classification: Ion channels Sodium channel blocking drugs C 07231 Sodium channel blocking drugs\n
535 07232 Drug Development Target-based classification: Ion channels Potassium channel blocking and opening drugs C 07232 Potassium channel blocking and ope...
536 07235 Drug Development Target-based classification: Ion channels N-Metyl-D-aspartic acid receptor antagonists C 07235 N-Metyl-D-aspartic acid receptor a...
537 NaN Drug Development Target-based classification: Transporters NaN B Target-based classification: Transporters\n
538 07233 Drug Development Target-based classification: Transporters Ion transporter inhibitors C 07233 Ion transporter inhibitors\n
539 07234 Drug Development Target-based classification: Transporters Neurotransmitter transporter inhibitors C 07234 Neurotransmitter transporter inhib...
540 NaN Drug Development Target-based classification: Enzymes NaN B Target-based classification: Enzymes\n
541 07216 Drug Development Target-based classification: Enzymes Catecholamine transferase inhibitors C 07216 Catecholamine transferase inhibito...
542 07219 Drug Development Target-based classification: Enzymes Cyclooxygenase inhibitors C 07219 Cyclooxygenase inhibitors\n
543 07024 Drug Development Target-based classification: Enzymes HMG-CoA reductase inhibitors C 07024 HMG-CoA reductase inhibitors\n
544 07217 Drug Development Target-based classification: Enzymes Renin-angiotensin system inhibitors C 07217 Renin-angiotensin system inhibitors\n
545 07218 Drug Development Target-based classification: Enzymes HIV protease inhibitors C 07218 HIV protease inhibitors\n
546 NaN Drug Development Structure-based classification NaN B Structure-based classification\n
547 07025 Drug Development Structure-based classification Quinolines C 07025 Quinolines\n
548 07034 Drug Development Structure-based classification Eicosanoids C 07034 Eicosanoids\n
549 07035 Drug Development Structure-based classification Prostaglandins C 07035 Prostaglandins\n
550 NaN Drug Development Skeleton-based classification NaN B Skeleton-based classification\n
551 07110 Drug Development Skeleton-based classification Benzoic acid family C 07110 Benzoic acid family\n
552 07112 Drug Development Skeleton-based classification 1,2-Diphenyl substitution family C 07112 1,2-Diphenyl substitution family\n
553 07114 Drug Development Skeleton-based classification Naphthalene family C 07114 Naphthalene family\n
554 07117 Drug Development Skeleton-based classification Benzodiazepine family C 07117 Benzodiazepine family\n

541 rows × 5 columns
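
The frame mixes heading rows (levels A and B, where `map` is NaN) with the actual pathway rows. For comparing against the K-means groups it is probably the pathway rows that matter, so one possible next step is to drop the headings and count pathways per A/B category. This is a sketch on a hypothetical six-row stand-in, not on the real `allBRITE`.

```python
import pandas as pd

# Hypothetical stand-in with the same column layout as allBRITE
demo = pd.DataFrame({
    'map': [None, None, '01100', '01110', None, '00010'],
    'A': ['Metabolism'] * 6,
    'B': [None,
          'Global and overview maps',
          'Global and overview maps',
          'Global and overview maps',
          'Carbohydrate metabolism',
          'Carbohydrate metabolism'],
    'C': [None, None,
          'Metabolic pathways',
          'Biosynthesis of secondary metabolites',
          None,
          'Glycolysis / Gluconeogenesis'],
})

# Keep only the level C rows, i.e. those that carry a map number
pathways = demo.dropna(subset=['map'])

# Number of pathways under each (A, B) pair
counts = pathways.groupby(['A', 'B']).size()
```

`counts` is a Series with a two-level index, so a single category can be looked up with a tuple such as `counts.loc[('Metabolism', 'Carbohydrate metabolism')]`.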


In [151]:
#now...save all that so I don't have to do this every time
with open('BRITE_pathwaysOnly.pickle', 'wb') as f:  # the with block closes the file once the dump finishes
    cpk.dump(allBRITE, f)

In [152]:
with open('BRITE_pathwaysOnly.pickle', 'rb') as f:
    reloaded = cpk.load(f)
reloaded  # display the reloaded frame to check the round trip


Out[152]:
(Same display as Out[137] above: 541 rows × 5 columns.)
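
Comparing two long printed tables by eye is error-prone, so a programmatic check of the round trip may be more reliable. The sketch below dumps a small hypothetical frame to a temporary file, loads it back and compares the two with `DataFrame.equals`. It uses the standard `pickle` module, which is an assumption on my part for running under Python 3; the temporary path and the `demo` frame are illustrative only.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical one-row stand-in for allBRITE
demo = pd.DataFrame(
    {'map': ['01100'],
     'A': ['Metabolism'],
     'B': ['Global and overview maps'],
     'C': ['Metabolic pathways']},
    index=[10])

# Write to a throwaway location rather than the real pickle file
path = os.path.join(tempfile.mkdtemp(), 'demo.pickle')

with open(path, 'wb') as f:
    pickle.dump(demo, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

# True when values, index and columns all match
same = demo.equals(restored)
```

pandas also offers `DataFrame.to_pickle` and `pd.read_pickle`, which wrap the same steps in one call each.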

