Group the compounds into the BRITE hierarchy. The goal is to see how well the different K-means groups do or do not fit with the BRITE hierarchy.

Krista, 1 September 2015. Modified 30 November 2015 to get the pathway BRITE hierarchy.

This would probably make more sense as a script, but I did not bother to convert it. Will spit out a pickled file for other use.


In [19]:
import pandas as pd
import re
import glob
import cPickle as cpk

from IPython.core.debugger import Tracer #used this to step into the function and debug it, also need line with Tracer()()

Now go in and read one BRITE file (originally for the compounds)


In [101]:
textC = re.compile(r'\d+')
mc = textC.search(line)
mc.group(0)
#so, now I have the map number...how do I get the rest of the line?


Out[101]:
'01100'

In [116]:
textC = re.compile(r'(\d+)\s*(.*)$')
mC = textC.search(line)
print mC.group(1)
print mC.group(2)


01100
Metabolic pathways
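
The two-group pattern above can also be written with named groups, which makes it clearer which piece is the map number and which is the pathway name. This is only a sketch: the sample line is a hypothetical stand-in modelled on the 'C' rows of the file, and the names `c_pattern`, `map_number`, and `pathway_name` are not from the notebook.

```python
import re

# Hypothetical sample line, modelled on the 'C' rows in the BRITE file
line = 'C    01100  Metabolic pathways\n'

# Same idea as textC, but with named groups instead of numbered ones
c_pattern = re.compile(r'(?P<map>\d+)\s*(?P<name>.*)$')
match = c_pattern.search(line)

map_number = match.group('map')      # first run of digits on the line
pathway_name = match.group('name')   # rest of the line, without the newline
print(map_number)
print(pathway_name)
```

Because `.` does not match a newline and `$` matches just before a trailing newline, the name comes back without the line ending.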

In [ ]:
#next: clean up the top of the data file (want to skip a few rows)...

In [129]:
line = '#DEFINITION  KEGG pathway maps'

In [130]:
line


Out[130]:
'#DEFINITION  KEGG pathway maps'

In [131]:
line[0]


Out[131]:
'#'

In [132]:
if line[0] != '#':
    print 'yes'

In [133]:
if line[0] == '#':
    print 'yes'


yes
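
A slightly more defensive way to run the same header test is `str.startswith`, which compares values rather than object identity and does not fail on an empty string, where `line[0]` would raise an `IndexError`. A minimal sketch, with `is_header` as a hypothetical helper name:

```python
def is_header(line):
    """True for rows that start with '#', False otherwise (including empty strings)."""
    return line.startswith('#')

print(is_header('#DEFINITION  KEGG pathway maps'))
print(is_header('C    01100  Metabolic pathways'))
```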

In [134]:
def ReadBRITEfile(briteFile):
    #read one BRITE .keg file into a DataFrame with one row per A, B, or C line
    forBrite = pd.DataFrame(columns = ['map','A','B','C','wholeThing'])
    # set up the expressions to match each level in the BRITE hierarchy

    textA = re.compile(r'(^A<b>)(.+)(</b>)\s*(.*)$')
    textB = re.compile(r'(^B)\s*(.*)$')
    textC = re.compile(r'(\d+)\s*(.*)$')
    #this relies on the fact that the rows are in order: A, with B subheadings, then C subheadings
    setA = []
    idxA = []

    setB = []
    setC = []

    with open(briteFile) as f:
        for idx,line in enumerate(f):
            #skip the header rows; use != because 'is not' tests identity, not equality
            if line[0] != '#':
                mA = textA.search(line)
                mB = textB.search(line)
                mC = textC.search(line)
                if mA:
                    setA = mA.group(2)
                    #house cleaning (probably c)
                    idxA = idx
                    forBrite.loc[idx,'A'] = setA
                    forBrite.loc[idx,'wholeThing'] = line #using this as a double check for now
                    #forBrite.loc[idx,'map'] = mC.group(1)
                elif mB:
                    setB = mB.group(2)
                    forBrite.loc[idx,'A'] = setA
                    forBrite.loc[idx,'B'] = setB
                    forBrite.loc[idx,'wholeThing'] = line
                    #forBrite.loc[idx,'map'] = mC.group(1)
                elif mC:
                    #Tracer()()
                    setC = mC.group(2)
                    forBrite.loc[idx,'A'] = setA
                    forBrite.loc[idx,'B'] = setB
                    forBrite.loc[idx,'C'] = setC
                    forBrite.loc[idx,'wholeThing'] = line
                    forBrite.loc[idx,'map'] = mC.group(1)

    #return once the file has been closed
    return forBrite
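
Assigning one cell at a time with `.loc` enlarges the DataFrame row by row, which gets slow on the larger BRITE files. One alternative is to collect plain dictionaries in a list and build the table once at the end. The sketch below is not the notebook's function: the names `parse_brite_lines`, `TEXT_A`, `TEXT_B`, and `TEXT_C` are hypothetical, the patterns are anchored on the level letter (an assumption about the file layout), the B level is reset whenever a new A heading appears, and the sample lines are made up to resemble the file.

```python
import re

# Patterns anchored at the level letter; this layout is an assumption
TEXT_A = re.compile(r'^A<b>(.+)</b>')
TEXT_B = re.compile(r'^B\s+(.*)$')
TEXT_C = re.compile(r'^C\s+(\d+)\s+(.*)$')

def parse_brite_lines(lines):
    """Return one dict per A, B, or C line, carrying the parent levels down."""
    rows = []
    level_a = None
    level_b = None
    for idx, line in enumerate(lines):
        if line.startswith('#'):
            continue  # header row
        m_a = TEXT_A.search(line)
        m_b = TEXT_B.search(line)
        m_c = TEXT_C.search(line)
        if m_a:
            level_a = m_a.group(1)
            level_b = None  # a new A heading starts a fresh set of B subheadings
            rows.append({'idx': idx, 'map': None, 'A': level_a, 'B': None, 'C': None})
        elif m_b:
            level_b = m_b.group(1)
            rows.append({'idx': idx, 'map': None, 'A': level_a, 'B': level_b, 'C': None})
        elif m_c:
            rows.append({'idx': idx, 'map': m_c.group(1), 'A': level_a,
                         'B': level_b, 'C': m_c.group(2)})
    return rows

# Hypothetical sample lines shaped like the start of the pathway file
sample = ['#DEFINITION  KEGG pathway maps\n',
          '!\n',
          'A<b>Metabolism</b>\n',
          'B  Global and overview maps\n',
          'C    01100  Metabolic pathways\n']
rows = parse_brite_lines(sample)
```

The resulting list can be handed to `pd.DataFrame(rows).set_index('idx')` to get a table with the same kind of index as the one built by `ReadBRITEfile`.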

In [135]:
D = glob.glob('*keg.txt')
allBRITE = []
for idx,nof in enumerate(D):
    print idx, nof #easy visible counter in Python
    allBRITE = ReadBRITEfile(nof) #overwritten on each pass; fine here because only one file matches


0 br08901.keg.txt
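
The loop keeps only the table from the last file it reads, which is fine with a single match. If several `*keg.txt` files were present, one option would be to keep every result in a dictionary keyed by filename. A minimal sketch, written for Python 3 (the notebook itself is Python 2), with `read_all` and its `reader` argument as hypothetical names:

```python
import glob

def read_all(pattern, reader):
    """Apply reader to every file matching pattern; return {filename: result}."""
    results = {}
    for idx, path in enumerate(sorted(glob.glob(pattern))):
        print(idx, path)  # visible progress counter
        results[path] = reader(path)
    return results
```

With the notebook's function this would be called as `read_all('*keg.txt', ReadBRITEfile)`. `pd.concat(results)` accepts such a dictionary of DataFrames and uses the filenames as an outer index level.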

In [138]:
type(allBRITE)


Out[138]:
pandas.core.frame.DataFrame

In [146]:
allBRITE.loc[allBRITE['map']=='01100']


Out[146]:
 | map | A | B | C | wholeThing
10 | 01100 | Metabolism | Global and overview maps | Metabolic pathways | C 01100 Metabolic pathways\n

In [147]:
allBRITE.loc[allBRITE['map']=='01100'].C


Out[147]:
10    Metabolic pathways
Name: C, dtype: object

In [148]:
type(allBRITE.loc[allBRITE['map']=='01100'].C)


Out[148]:
pandas.core.series.Series
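
Selecting with a boolean mask returns a Series even when only one row matches, so getting the pathway name as a plain string takes one more step. Two possible ways are sketched below on a small made-up table shaped like `allBRITE`; the variable names are hypothetical, and the second option assumes each map number appears only once.

```python
import pandas as pd

# Hypothetical three-row excerpt shaped like allBRITE
brite = pd.DataFrame(
    {'map': [None, '01100', '01110'],
     'C': [None, 'Metabolic pathways', 'Biosynthesis of secondary metabolites']},
    index=[9, 10, 11])

# Boolean-mask selection: still a Series, here of length one
as_series = brite.loc[brite['map'] == '01100', 'C']

# Option 1: take the first (only) element by position
name_from_series = as_series.iloc[0]

# Option 2: build a map-number -> name lookup once, then index it directly
map_to_name = brite.dropna(subset=['map']).set_index('map')['C']
name_from_lookup = map_to_name['01100']
```

The lookup Series is convenient when many map numbers need translating, since the mask does not have to be rebuilt for each one.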

In [137]:
allBRITE


Out[137]:
 | map | A | B | C | wholeThing
8 | NaN | Metabolism | NaN | NaN | A<b>Metabolism</b>\n
9 | NaN | Metabolism | Global and overview maps | NaN | B Global and overview maps\n
10 | 01100 | Metabolism | Global and overview maps | Metabolic pathways | C 01100 Metabolic pathways\n
11 | 01110 | Metabolism | Global and overview maps | Biosynthesis of secondary metabolites | C 01110 Biosynthesis of secondary metaboli...
12 | 01120 | Metabolism | Global and overview maps | Microbial metabolism in diverse environments | C 01120 Microbial metabolism in diverse en...
13 | 01130 | Metabolism | Global and overview maps | Biosynthesis of antibiotics | C 01130 Biosynthesis of antibiotics\n
14 | 01200 | Metabolism | Global and overview maps | Carbon metabolism | C 01200 Carbon metabolism\n
15 | 01210 | Metabolism | Global and overview maps | 2-Oxocarboxylic acid metabolism | C 01210 2-Oxocarboxylic acid metabolism\n
16 | 01212 | Metabolism | Global and overview maps | Fatty acid metabolism | C 01212 Fatty acid metabolism\n
17 | 01230 | Metabolism | Global and overview maps | Biosynthesis of amino acids | C 01230 Biosynthesis of amino acids\n
18 | 01220 | Metabolism | Global and overview maps | Degradation of aromatic compounds | C 01220 Degradation of aromatic compounds\n
19 | NaN | Metabolism | Carbohydrate metabolism | NaN | B Carbohydrate metabolism\n
20 | 00010 | Metabolism | Carbohydrate metabolism | Glycolysis / Gluconeogenesis | C 00010 Glycolysis / Gluconeogenesis\n
21 | 00020 | Metabolism | Carbohydrate metabolism | Citrate cycle (TCA cycle) | C 00020 Citrate cycle (TCA cycle)\n
22 | 00030 | Metabolism | Carbohydrate metabolism | Pentose phosphate pathway | C 00030 Pentose phosphate pathway\n
23 | 00040 | Metabolism | Carbohydrate metabolism | Pentose and glucuronate interconversions | C 00040 Pentose and glucuronate interconve...
24 | 00051 | Metabolism | Carbohydrate metabolism | Fructose and mannose metabolism | C 00051 Fructose and mannose metabolism\n
25 | 00052 | Metabolism | Carbohydrate metabolism | Galactose metabolism | C 00052 Galactose metabolism\n
26 | 00053 | Metabolism | Carbohydrate metabolism | Ascorbate and aldarate metabolism | C 00053 Ascorbate and aldarate metabolism\n
27 | 00500 | Metabolism | Carbohydrate metabolism | Starch and sucrose metabolism | C 00500 Starch and sucrose metabolism\n
28 | 00520 | Metabolism | Carbohydrate metabolism | Amino sugar and nucleotide sugar metabolism | C 00520 Amino sugar and nucleotide sugar m...
29 | 00620 | Metabolism | Carbohydrate metabolism | Pyruvate metabolism | C 00620 Pyruvate metabolism\n
30 | 00630 | Metabolism | Carbohydrate metabolism | Glyoxylate and dicarboxylate metabolism | C 00630 Glyoxylate and dicarboxylate metab...
31 | 00640 | Metabolism | Carbohydrate metabolism | Propanoate metabolism | C 00640 Propanoate metabolism\n
32 | 00650 | Metabolism | Carbohydrate metabolism | Butanoate metabolism | C 00650 Butanoate metabolism\n
33 | 00660 | Metabolism | Carbohydrate metabolism | C5-Branched dibasic acid metabolism | C 00660 C5-Branched dibasic acid metabolism\n
34 | 00562 | Metabolism | Carbohydrate metabolism | Inositol phosphate metabolism | C 00562 Inositol phosphate metabolism\n
35 | NaN | Metabolism | Energy metabolism | NaN | B Energy metabolism\n
36 | 00190 | Metabolism | Energy metabolism | Oxidative phosphorylation | C 00190 Oxidative phosphorylation\n
37 | 00195 | Metabolism | Energy metabolism | Photosynthesis | C 00195 Photosynthesis\n
... | ... | ... | ... | ... | ...
525 | NaN | Drug Development | Target-based classification: Nuclear receptors | NaN | B Target-based classification: Nuclear recept...
526 | 07225 | Drug Development | Target-based classification: Nuclear receptors | Glucocorticoid and meneralocorticoid receptor ... | C 07225 Glucocorticoid and meneralocortico...
527 | 07226 | Drug Development | Target-based classification: Nuclear receptors | Progesterone, androgen and estrogen receptor a... | C 07226 Progesterone, androgen and estroge...
528 | 07223 | Drug Development | Target-based classification: Nuclear receptors | Retinoic acid receptor (RAR) and retinoid X re... | C 07223 Retinoic acid receptor (RAR) and r...
529 | 07222 | Drug Development | Target-based classification: Nuclear receptors | Peroxisome proliferator-activated receptor (PP... | C 07222 Peroxisome proliferator-activated ...
530 | NaN | Drug Development | Target-based classification: Ion channels | NaN | B Target-based classification: Ion channels\n
531 | 07221 | Drug Development | Target-based classification: Ion channels | Nicotinic cholinergic receptor antagonists | C 07221 Nicotinic cholinergic receptor ant...
532 | 07230 | Drug Development | Target-based classification: Ion channels | GABA-A receptor agonists/antagonists | C 07230 GABA-A receptor agonists/antagonis...
533 | 07036 | Drug Development | Target-based classification: Ion channels | Calcium channel blocking drugs | C 07036 Calcium channel blocking drugs\n
534 | 07231 | Drug Development | Target-based classification: Ion channels | Sodium channel blocking drugs | C 07231 Sodium channel blocking drugs\n
535 | 07232 | Drug Development | Target-based classification: Ion channels | Potassium channel blocking and opening drugs | C 07232 Potassium channel blocking and ope...
536 | 07235 | Drug Development | Target-based classification: Ion channels | N-Metyl-D-aspartic acid receptor antagonists | C 07235 N-Metyl-D-aspartic acid receptor a...
537 | NaN | Drug Development | Target-based classification: Transporters | NaN | B Target-based classification: Transporters\n
538 | 07233 | Drug Development | Target-based classification: Transporters | Ion transporter inhibitors | C 07233 Ion transporter inhibitors\n
539 | 07234 | Drug Development | Target-based classification: Transporters | Neurotransmitter transporter inhibitors | C 07234 Neurotransmitter transporter inhib...
540 | NaN | Drug Development | Target-based classification: Enzymes | NaN | B Target-based classification: Enzymes\n
541 | 07216 | Drug Development | Target-based classification: Enzymes | Catecholamine transferase inhibitors | C 07216 Catecholamine transferase inhibito...
542 | 07219 | Drug Development | Target-based classification: Enzymes | Cyclooxygenase inhibitors | C 07219 Cyclooxygenase inhibitors\n
543 | 07024 | Drug Development | Target-based classification: Enzymes | HMG-CoA reductase inhibitors | C 07024 HMG-CoA reductase inhibitors\n
544 | 07217 | Drug Development | Target-based classification: Enzymes | Renin-angiotensin system inhibitors | C 07217 Renin-angiotensin system inhibitors\n
545 | 07218 | Drug Development | Target-based classification: Enzymes | HIV protease inhibitors | C 07218 HIV protease inhibitors\n
546 | NaN | Drug Development | Structure-based classification | NaN | B Structure-based classification\n
547 | 07025 | Drug Development | Structure-based classification | Quinolines | C 07025 Quinolines\n
548 | 07034 | Drug Development | Structure-based classification | Eicosanoids | C 07034 Eicosanoids\n
549 | 07035 | Drug Development | Structure-based classification | Prostaglandins | C 07035 Prostaglandins\n
550 | NaN | Drug Development | Skeleton-based classification | NaN | B Skeleton-based classification\n
551 | 07110 | Drug Development | Skeleton-based classification | Benzoic acid family | C 07110 Benzoic acid family\n
552 | 07112 | Drug Development | Skeleton-based classification | 1,2-Diphenyl substitution family | C 07112 1,2-Diphenyl substitution family\n
553 | 07114 | Drug Development | Skeleton-based classification | Naphthalene family | C 07114 Naphthalene family\n
554 | 07117 | Drug Development | Skeleton-based classification | Benzodiazepine family | C 07117 Benzodiazepine family\n

541 rows × 5 columns


In [9]:
#now...save all that so I don't have to do this every time
with open('BRITE_compoundsOnly.pickle', 'wb') as f:
    cpk.dump(allBRITE, f)

In [10]:
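#the saved output below appears to predate the pathway changes: it shows the compound table (cNumber column)
cpk.load(open('BRITE_compoundsOnly.pickle','rb'))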


Out[10]:
 | cNumber | A | B | C | D | wholeThing
8 | NaN | FA Fatty acyls | NaN | NaN | NaN | A<b>FA Fatty acyls</b>\n
9 | NaN | FA Fatty acyls | FA01 Fatty Acids and Conjugates | NaN | NaN | B FA01 Fatty Acids and Conjugates\n
10 | NaN | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | C FA0101 Straight chain fatty acids\n
11 | C00058 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C00058 Formic acid\n
12 | C00033 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C00033 Acetic acid\n
13 | C00163 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C00163 Propanoic acid\n
14 | C00246 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C00246 Butanoic acid\n
15 | C00803 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C00803 Pentanoic acid\n
16 | C01585 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C01585 Hexanoic acid\n
17 | C17714 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C17714 Heptanoic acid\n
18 | C06423 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C06423 Octanoic acid\n
19 | C01601 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C01601 Nonanoic acid\n
20 | C17715 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C17715 Undecanoic acid\n
21 | C01571 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C01571 Decanoic acid\n
22 | C02679 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C02679 Dodecanoic acid\n
23 | C17076 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C17076 Tridecanoic acid\n
24 | C06424 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C06424 Tetradecanoic acid\n
25 | C16537 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C16537 Pentadecanoic acid\n
26 | C00249 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C00249 Hexadecanoic acid\n
27 | C01530 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C01530 Octadecanoic acid\n
28 | C16535 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C16535 Nonadecanoic acid\n
29 | C06425 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C06425 Icosanoic acid\n
30 | C08281 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C08281 Docosanoic acid\n
31 | C08320 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0101 Straight chain fatty acids | NaN | D C08320 Tetracosanoic acid\n
32 | NaN | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0102 Branched fatty acids | NaN | C FA0102 Branched fatty acids\n
33 | C08262 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0102 Branched fatty acids | NaN | D C08262 3-Methylbutanoic acid\n
34 | C16665 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0102 Branched fatty acids | NaN | D C16665 12-Methyltetradecanoic acid\n
35 | C16462 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0102 Branched fatty acids | NaN | D C16462 3,7-Dimethyl-6-octenoic acid\n
36 | C13787 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0102 Branched fatty acids | NaN | D C13787 17-Methyl-6Z-octadecenoic acid\n
37 | C00141 | FA Fatty acyls | FA01 Fatty Acids and Conjugates | FA0102 Branched fatty acids | NaN | D C00141 3-Methyl-2-oxobutanoic acid\n
... | ... | ... | ... | ... | ... | ...
417 | C20104 | Venoms | Spider venoms | Latarcins | NaN | D C20104 Latarcin 2a\n
418 | C20105 | Venoms | Spider venoms | Latarcins | NaN | D C20105 Latarcin 3a\n
419 | NaN | Venoms | Spider venoms | Lycocitins | NaN | C Lycocitins\n
420 | C20106 | Venoms | Spider venoms | Lycocitins | NaN | D C20106 Lycocitin 1\n
421 | C20107 | Venoms | Spider venoms | Lycocitins | NaN | D C20107 Lycocitin 2\n
422 | C20108 | Venoms | Spider venoms | Lycocitins | NaN | D C20108 Lycocitin 3\n
423 | NaN | Venoms | Spider venoms | Lycotoxins | NaN | C Lycotoxins\n
424 | C20109 | Venoms | Spider venoms | Lycotoxins | NaN | D C20109 Lycotoxin I\n
425 | C20110 | Venoms | Spider venoms | Lycotoxins | NaN | D C20110 Lycotoxin II\n
426 | NaN | Venoms | Spider venoms | Oxyopinins | NaN | C Oxyopinins\n
427 | C20111 | Venoms | Spider venoms | Oxyopinins | NaN | D C20111 Oxyopinin 1\n
428 | C20112 | Venoms | Spider venoms | Oxyopinins | NaN | D C20112 Oxyopinin 2a\n
429 | NaN | Venoms | Spider venoms | Phrixotoxins | NaN | C Phrixotoxins\n
430 | C20113 | Venoms | Spider venoms | Phrixotoxins | NaN | D C20113 Phrixotoxin 1\n
431 | C20114 | Venoms | Spider venoms | Phrixotoxins | NaN | D C20114 Phrixotoxin 2\n
432 | NaN | Venoms | Spider venoms | Others | NaN | C Others\n
433 | C20115 | Venoms | Spider venoms | Others | NaN | D C20115 Agelenin\n
434 | C20116 | Venoms | Spider venoms | Others | NaN | D C20116 Hanatoxin 1\n
435 | C20117 | Venoms | Spider venoms | Others | NaN | D C20117 Huwentoxin-XI\n
436 | C20054 | Venoms | Spider venoms | Others | NaN | D C20054 alpha-Latrotoxin\n
437 | C20118 | Venoms | Spider venoms | Others | NaN | D C20118 Robustoxin\n
438 | C20067 | Venoms | Spider venoms | Others | NaN | D C20067 Stromatoxin 1\n
439 | NaN | Venoms | Others | NaN | NaN | B Others\n
440 | NaN | Venoms | Others | Salamandra venoms | NaN | C Salamandra venoms\n
441 | C19950 | Venoms | Others | Salamandra venoms | NaN | D C19950 Samandarine\n
442 | C19951 | Venoms | Others | Salamandra venoms | NaN | D C19951 Samandarone\n
443 | C20059 | Venoms | Others | Salamandra venoms | NaN | D C20059 Samandenone\n
444 | NaN | Venoms | Others | Others | NaN | C Others\n
445 | C20060 | Venoms | Others | Others | NaN | D C20060 Cephalostatin 1\n
446 | C20061 | Venoms | Others | Others | NaN | D C20061 Ritterazine A\n

8008 rows × 6 columns
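
The save and reload steps can also be written with `with` blocks so that both file handles are closed as soon as the work is done. The sketch below is written for Python 3, where the `pickle` module takes the place of `cPickle`. The `records` list is a made-up stand-in for the table, and the file goes into a temporary directory under the hypothetical name `BRITE_example.pickle` so that nothing real is overwritten.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the table; any picklable object works the same way
records = [{'map': '01100', 'C': 'Metabolic pathways'}]

# Write to a temporary directory so the sketch does not overwrite a real file
path = os.path.join(tempfile.mkdtemp(), 'BRITE_example.pickle')

with open(path, 'wb') as handle:
    pickle.dump(records, handle)

with open(path, 'rb') as handle:
    restored = pickle.load(handle)
```

For DataFrames specifically, pandas also offers `DataFrame.to_pickle` and `pd.read_pickle`, which take a path and handle the file themselves.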