Solution of Lahti et al. 2014

Write a function that takes as input a dictionary of constraints (i.e., selecting a specific group of records) and returns a dictionary tabulating the BMI group for all the records matching the constraints. For example, calling:

``get_BMI_count({'Age': '28', 'Sex': 'female'})``

should return:

``{'NA': 3, 'lean': 8, 'overweight': 2, 'underweight': 1}``
``````

In [3]:

import csv # Import the csv module for reading the file

``````

We start by reading the Metadata file: we set up a csv reader and print the header and the first few lines to get a feel for the data.

``````

In [5]:

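# Metadata path is an assumption (same folder as HITChip.tab); adjust if needed
with open("../data/Lahti2014/Metadata.tab", "r") as f:
    csvr = csv.DictReader(f, delimiter = '\t')
    # Print the header (column names)
    print(csvr.fieldnames)
    # For each row
    for i, row in enumerate(csvr):
        print(row)
        # Stop after the first four rows
        if i > 2:
            break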

``````
``````

['SampleID', 'Age', 'Sex', 'Nationality', 'DNA_extraction_method', 'ProjectID', 'Diversity', 'BMI_group', 'SubjectID', 'Time']
{'Age': '28', 'Diversity': '5.76', 'Sex': 'male', 'DNA_extraction_method': 'NA', 'BMI_group': 'severeobese', 'SubjectID': '1', 'Nationality': 'US', 'Time': '0', 'ProjectID': '1', 'SampleID': 'Sample-1'}
{'Age': '24', 'Diversity': '6.06', 'Sex': 'female', 'DNA_extraction_method': 'NA', 'BMI_group': 'obese', 'SubjectID': '2', 'Nationality': 'US', 'Time': '0', 'ProjectID': '1', 'SampleID': 'Sample-2'}
{'Age': '52', 'Diversity': '5.5', 'Sex': 'male', 'DNA_extraction_method': 'NA', 'BMI_group': 'lean', 'SubjectID': '3', 'Nationality': 'US', 'Time': '0', 'ProjectID': '1', 'SampleID': 'Sample-3'}
{'Age': '22', 'Diversity': '5.87', 'Sex': 'female', 'DNA_extraction_method': 'NA', 'BMI_group': 'underweight', 'SubjectID': '4', 'Nationality': 'US', 'Time': '0', 'ProjectID': '1', 'SampleID': 'Sample-4'}

``````
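One detail worth noticing in the output above is that every value is a string, including `Age`. The short sketch below reproduces this on two made-up records held in memory (the `toy_tsv` text and its values are invented for illustration and are not part of the Lahti et al. data), so it runs without the Metadata file:

```python
import csv
import io

# Two made-up records in a tab-separated layout (not real data)
toy_tsv = (
    "SampleID\tAge\tSex\tBMI_group\n"
    "Sample-A\t28\tfemale\tlean\n"
    "Sample-B\t31\tmale\tobese\n"
)

# io.StringIO lets csv.DictReader read from a string as if it were a file
reader = csv.DictReader(io.StringIO(toy_tsv), delimiter="\t")
header = reader.fieldnames   # column names, taken from the first line
rows = list(reader)          # one dictionary per remaining line

print(header)
print(rows[0]["Age"], type(rows[0]["Age"]))
```

This is why the constraints are written as `{'Age': '28'}` rather than `{'Age': 28}`: comparing the string `'28'` with the integer `28` would never succeed.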

It's time to decide on a data structure to record our result. For each row in the file, we want to check that the row satisfies all the constraints. If it does, we increment the count for its BMI group. A dictionary with the `BMI_group` values as keys and counts as values will work well:

``````

In [6]:

# Initiate an empty dictionary to keep track of counts per BMI_group
BMI_count = {}

``````
``````

In [10]:

# set up our dictionary of constraints for testing purposes
dict_constraints = {'Age': '28', 'Sex': 'female'}

``````

OK, now the tricky part: for each row, we want to test whether the constraints (a dictionary) match the data (which is itself a dictionary). We can do it element-wise: we take a key from the constraint dictionary and test if the corresponding value in the data dictionary (`row`) is NOT identical to the value in the constraint dictionary. We start out by setting the variable `matching` to `True` and set it to `False` if we encounter a discrepancy. This way, we stop immediately if one of the elements does not match and move on to the next row of data.

``````

In [30]:

# Metadata path is an assumption (same folder as HITChip.tab); adjust if needed
with open("../data/Lahti2014/Metadata.tab", "r") as f:
    csvr = csv.DictReader(f, delimiter = '\t')
    for i, row in enumerate(csvr):
        # check that all conditions are met
        matching = True
        for e in dict_constraints:
            if row[e] != dict_constraints[e]:
                # The constraint is not met. Move to the next record
                matching = False
                break
            # Reached only when the value for key e matched
            print("in row", i, "the key", e, "in data matches", e, "in constraints")
        if i > 5:
            break

``````
``````

in row 0 the key Age in data matches Age in constraints

``````
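As an aside, the same element-wise test can be packaged in a small helper. The sketch below is one possible formulation, not the one used in this solution: `row_matches` is a hypothetical name, and the toy record is made up. The built-in `all` stops at the first failed comparison, which mirrors the `break` in the loop above.

```python
def row_matches(row, constraints):
    """Return True if row agrees with every key/value pair in constraints."""
    return all(row[key] == value for key, value in constraints.items())

# A made-up record for illustration
toy_row = {'Age': '28', 'Sex': 'male', 'BMI_group': 'lean'}

print(row_matches(toy_row, {'Age': '28'}))                   # prints True
print(row_matches(toy_row, {'Age': '28', 'Sex': 'female'}))  # prints False
print(row_matches(toy_row, {}))                              # prints True
```

An empty constraint dictionary matches every record, which is the same behavior as the explicit loop: `matching` simply stays `True`.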

In some rows, all constraints will be fulfilled (i.e., our `matching` variable will still be `True` after checking all elements). In this case, we want to increase the count of that particular `BMI_group` in our result dictionary `BMI_count`. If we have seen that `BMI_group` before, we add one to its count; otherwise we initialize that key with a value of one:

``````

In [16]:

# Metadata path is an assumption (same folder as HITChip.tab); adjust if needed
with open("../data/Lahti2014/Metadata.tab", "r") as f:
    csvr = csv.DictReader(f, delimiter = '\t')
    for row in csvr:
        # check that all conditions are met
        matching = True
        for e in dict_constraints:
            if row[e] != dict_constraints[e]:
                # The constraint is not met. Move to the next record
                matching = False
                break
        # matching is True only if all the constraints have been met
        if matching:
            # extract the BMI_group
            my_BMI = row['BMI_group']
            if my_BMI in BMI_count:
                # If we've seen it before, add one record to the count
                BMI_count[my_BMI] = BMI_count[my_BMI] + 1
            else:
                # If not, initialize at 1
                BMI_count[my_BMI] = 1

``````
``````

In [17]:

BMI_count

``````
``````

Out[17]:

{'NA': 3, 'lean': 8, 'overweight': 2, 'underweight': 1}

``````
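The if/else bookkeeping can be shortened with tools from the standard library. As an illustration on a made-up list of BMI groups (not the real data), both variants below produce the same tally; the solution here keeps the explicit version, which makes each step visible.

```python
from collections import Counter

# Made-up sequence of BMI groups, standing in for the matching rows
toy_groups = ['lean', 'NA', 'lean', 'overweight', 'lean']

# Variant 1: dict.get returns 0 for a group we have not seen yet
counts = {}
for group in toy_groups:
    counts[group] = counts.get(group, 0) + 1

# Variant 2: Counter performs the same tally in one call
counter = Counter(toy_groups)

print(counts)                   # prints {'lean': 3, 'NA': 1, 'overweight': 1}
print(dict(counter) == counts)  # prints True
```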

Excellent! Now, we can put everything together and create a function that accepts our constraint dictionary. Remember to document everything nicely:

``````

In [5]:

def get_BMI_count(dict_constraints):
    """ Take as input a dictionary of constraints
        for example, {'Age': '28', 'Sex': 'female'}
        And return the count of the various groups of BMI
    """
    # We use a dictionary to store the results
    BMI_count = {}
    # Open the file, build a csv DictReader
    # (Metadata path is an assumption: same folder as HITChip.tab)
    with open("../data/Lahti2014/Metadata.tab", "r") as f:
        csvr = csv.DictReader(f, delimiter = '\t')
        # For each row
        for row in csvr:
            # check that all conditions are met
            matching = True
            for e in dict_constraints:
                if row[e] != dict_constraints[e]:
                    # The constraint is not met. Move to the next record
                    matching = False
                    break
            # matching is True only if all the constraints have been met
            if matching:
                # extract the BMI_group
                my_BMI = row['BMI_group']
                if my_BMI in BMI_count:
                    # If we've seen it before, add one record to the count
                    BMI_count[my_BMI] = BMI_count[my_BMI] + 1
                else:
                    # If not, initialize at 1
                    BMI_count[my_BMI] = 1
    return BMI_count

``````
``````

In [6]:

get_BMI_count({'Nationality': 'US', 'Sex': 'female'})

``````
``````

Out[6]:

{'lean': 12, 'obese': 3, 'overweight': 5, 'severeobese': 1, 'underweight': 3}

``````
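Because `get_BMI_count` is meant to read the Metadata file itself, it is hard to check on a small hand-made example. A possible refactoring, shown here only as a sketch, is to pass the rows in as an argument. The name `count_BMI_groups` and the three toy records are assumptions made for illustration.

```python
def count_BMI_groups(rows, dict_constraints):
    """Tally BMI_group over the rows that satisfy all constraints.

    rows is any iterable of dictionaries, for example a csv.DictReader.
    """
    BMI_count = {}
    for row in rows:
        # keep the row only if every constraint is met
        if all(row[key] == value for key, value in dict_constraints.items()):
            my_BMI = row['BMI_group']
            BMI_count[my_BMI] = BMI_count.get(my_BMI, 0) + 1
    return BMI_count

# Three made-up records (not taken from the data set)
toy_rows = [
    {'Age': '28', 'Sex': 'female', 'BMI_group': 'lean'},
    {'Age': '28', 'Sex': 'female', 'BMI_group': 'lean'},
    {'Age': '28', 'Sex': 'male', 'BMI_group': 'obese'},
]

print(count_BMI_groups(toy_rows, {'Age': '28', 'Sex': 'female'}))
# prints {'lean': 2}
```

With this split, the file-reading function would only need to open the file and hand the `DictReader` to the tallying function.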

Write a function that takes as input the constraints (as above), and a bacterial "genus". The function returns the average abundance (in logarithm base 10) of the genus for each group of BMI in the sub-population. For example, calling:

``get_abundance_by_BMI({'Time': '0', 'Nationality': 'US'}, 'Clostridium difficile et rel.')``

should return:

``````
____________________________________________________________________
Abundance of Clostridium difficile et rel. In sub-population:
____________________________________________________________________
Nationality -> US
Time -> 0
____________________________________________________________________
3.08     NA
3.31     underweight
3.84     lean
2.89     overweight
3.31     obese
3.45     severeobese
____________________________________________________________________
``````
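The phrase "average abundance (in logarithm base 10)" can be read in two ways: the logarithm of the mean abundance, or the mean of the log-transformed abundances. The two generally differ, as the toy calculation below shows on two made-up values. Which reading reproduces the numbers in the expected output is not settled by the task statement alone, so treat this purely as an illustration.

```python
import math

# Made-up abundances, not taken from HITChip.tab
toy_abundances = [100.0, 1000.0]

# Reading 1: average first, then take log10
log_of_mean = math.log10(sum(toy_abundances) / len(toy_abundances))

# Reading 2: take log10 first, then average
mean_of_logs = sum(math.log10(x) for x in toy_abundances) / len(toy_abundances)

print(round(log_of_mean, 2), round(mean_of_logs, 2))   # prints 2.74 2.5
```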

To solve this task, we can recycle quite a bit of the code that we just developed. However, instead of just counting occurrences of each `BMI_group`, we want to keep track of the records (i.e., `SampleID`s) that match our constraints, and then look up the abundance of a specific bacterial genus in the file `HITChip.tab`. First, we create a dictionary with all records that match our constraints:

``````

In [31]:

# We use a dictionary to store the results
BMI_IDs = {}
# Open the file, build a csv DictReader
# (Metadata path is an assumption: same folder as HITChip.tab)
with open("../data/Lahti2014/Metadata.tab", "r") as f:
    csvr = csv.DictReader(f, delimiter = '\t')
    for row in csvr:
        # check that all conditions are met
        matching = True
        for e in dict_constraints:
            if row[e] != dict_constraints[e]:
                # The constraint is not met. Move to the next record
                matching = False
                break
        # matching is True only if all the constraints have been met
        if matching:
            # extract the BMI_group
            my_BMI = row['BMI_group']
            if my_BMI in BMI_IDs:
                # If we've seen it before, add the SampleID
                BMI_IDs[my_BMI].append(row['SampleID'])
            else:
                # If not, initialize
                BMI_IDs[my_BMI] = [row['SampleID']]

``````
``````

In [32]:

BMI_IDs

``````
``````

Out[32]:

{'NA': ['Sample-889', 'Sample-892', 'Sample-893'],
'lean': ['Sample-17',
'Sample-22',
'Sample-153',
'Sample-208',
'Sample-753',
'Sample-754',
'Sample-785',
'Sample-963'],
'overweight': ['Sample-874', 'Sample-964'],
'underweight': ['Sample-93']}

``````
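For the next step, each row of the abundance file will have to be assigned to a BMI group through its `SampleID`. One way to make that assignment a single dictionary access, sketched here as an option rather than as the approach taken in this solution, is to invert `BMI_IDs`. The name `sample_to_group` is hypothetical, and only a few of the IDs from the output above are used.

```python
# A small excerpt of the BMI_IDs dictionary shown above
toy_BMI_IDs = {
    'lean': ['Sample-17', 'Sample-22'],
    'overweight': ['Sample-874', 'Sample-964'],
    'underweight': ['Sample-93'],
}

# Invert the mapping: SampleID -> BMI group
sample_to_group = {}
for group, sample_ids in toy_BMI_IDs.items():
    for sample_id in sample_ids:
        sample_to_group[sample_id] = group

print(sample_to_group.get('Sample-22'))   # prints lean
print(sample_to_group.get('Sample-1'))    # prints None (not in the sub-population)
```

Using `.get` returns `None` for samples outside the sub-population, so those rows of the abundance file can simply be skipped.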

Before moving on, let's have a look at the `HITChip` file:

``````

In [23]:

with open("../data/Lahti2014/HITChip.tab", "r") as HIT:
    csvr = csv.DictReader(HIT, delimiter = "\t")
    # Print the header (column names)
    print(csvr.fieldnames)
    for i, row in enumerate(csvr):
        print(row)
        # Stop after the first four rows
        if i > 2:
            break

``````
``````

['SampleID', 'Actinomycetaceae', 'Aerococcus', 'Aeromonas', 'Akkermansia', 'Alcaligenes faecalis et rel.', 'Allistipes et rel.', 'Anaerobiospirillum', 'Anaerofustis', 'Anaerostipes caccae et rel.', 'Anaerotruncus colihominis et rel.', 'Anaerovorax odorimutans et rel.', 'Aneurinibacillus', 'Aquabacterium', 'Asteroleplasma et rel.', 'Atopobium', 'Bacillus', 'Bacteroides fragilis et rel.', 'Bacteroides intestinalis et rel.', 'Bacteroides ovatus et rel.', 'Bacteroides plebeius et rel.', 'Bacteroides splachnicus et rel.', 'Bacteroides stercoris et rel.', 'Bacteroides uniformis et rel.', 'Bacteroides vulgatus et rel.', 'Bifidobacterium', 'Bilophila et rel.', 'Brachyspira', 'Bryantella formatexigens et rel.', 'Bulleidia moorei et rel.', 'Burkholderia', 'Butyrivibrio crossotus et rel.', 'Campylobacter', 'Catenibacterium mitsuokai et rel.', 'Clostridium (sensu stricto)', 'Clostridium cellulosi et rel.', 'Clostridium colinum et rel.', 'Clostridium difficile et rel.', 'Clostridium felsineum et rel.', 'Clostridium leptum et rel.', 'Clostridium nexile et rel.', 'Clostridium orbiscindens et rel.', 'Clostridium ramosum et rel.', 'Clostridium sphenoides et rel.', 'Clostridium stercorarium et rel.', 'Clostridium symbiosum et rel.', 'Clostridium thermocellum et rel.', 'Collinsella', 'Coprobacillus catenaformis et rel.', 'Coprococcus eutactus et rel.', 'Corynebacterium', 'Desulfovibrio et rel.', 'Dialister', 'Dorea formicigenerans et rel.', 'Eggerthella lenta et rel.', 'Enterobacter aerogenes et rel.', 'Enterococcus', 'Escherichia coli et rel.', 'Eubacterium biforme et rel.', 'Eubacterium cylindroides et rel.', 'Eubacterium hallii et rel.', 'Eubacterium limosum et rel.', 'Eubacterium rectale et rel.', 'Eubacterium siraeum et rel.', 'Eubacterium ventriosum et rel.', 'Faecalibacterium prausnitzii et rel.', 'Fusobacteria', 'Gemella', 'Granulicatella', 'Haemophilus', 'Helicobacter', 'Klebisiella pneumoniae et rel.', 'Lachnobacillus bovis et rel.', 'Lachnospira pectinoschiza et rel.', 
'Lactobacillus catenaformis et rel.', 'Lactobacillus gasseri et rel.', 'Lactobacillus plantarum et rel.', 'Lactobacillus salivarius et rel.', 'Lactococcus', 'Leminorella', 'Megamonas hypermegale et rel.', 'Megasphaera elsdenii et rel.', 'Methylobacterium', 'Micrococcaceae', 'Mitsuokella multiacida et rel.', 'Moraxellaceae', 'Novosphingobium', 'Oceanospirillum', 'Oscillospira guillermondii et rel.', 'Outgrouping clostridium cluster XIVa', 'Oxalobacter formigenes et rel.', 'Papillibacter cinnamivorans et rel.', 'Parabacteroides distasonis et rel.', 'Peptococcus niger et rel.', 'Peptostreptococcus anaerobius et rel.', 'Peptostreptococcus micros et rel.', 'Phascolarctobacterium faecium et rel.', 'Prevotella melaninogenica et rel.', 'Prevotella oralis et rel.', 'Prevotella ruminicola et rel.', 'Prevotella tannerae et rel.', 'Propionibacterium', 'Proteus et rel.', 'Pseudomonas', 'Roseburia intestinalis et rel.', 'Ruminococcus bromii et rel.', 'Ruminococcus callidus et rel.', 'Ruminococcus gnavus et rel.', 'Ruminococcus lactaris et rel.', 'Ruminococcus obeum et rel.', 'Serratia', 'Sporobacter termitidis et rel.', 'Staphylococcus', 'Streptococcus bovis et rel.', 'Streptococcus intermedius et rel.', 'Streptococcus mitis et rel.', 'Subdoligranulum variable at rel.', 'Sutterella wadsworthia et rel.', 'Tannerella et rel.', 'Uncultured Bacteroidetes', 'Uncultured Chroococcales', 'Uncultured Clostridium (sensu stricto)les I', 'Uncultured Clostridium (sensu stricto)les II', 'Uncultured Mollicutes', 'Uncultured Selenomonadaceae', 'Veillonella', 'Vibrio', 'Weissella et rel.', 'Wissella et rel.', 'Xanthomonadaceae', 'Yersinia et rel.']
{'Bacteroides stercoris et rel.': '2020.57101911791', 'Ruminococcus obeum et rel.': '77692.8399947947', 'Allistipes et rel.': '4593.31769688299', 'Ruminococcus bromii et rel.': '240.404463164125', 'Prevotella melaninogenica et rel.': '1189.31888563154', 'Collinsella': '487.131000368392', 'Prevotella ruminicola et rel.': '105.054372965389', 'Leminorella': '38.6715077233782', 'Bryantella formatexigens et rel.': '6528.59810400945', 'Klebisiella pneumoniae et rel.': '234.250546212356', 'Uncultured Bacteroidetes': '60.3133032571746', 'Lachnospira pectinoschiza et rel.': '14142.3135775986', 'Bacteroides vulgatus et rel.': '32268.6517568209', 'Dialister': '223.054152433315', 'Lactobacillus plantarum et rel.': '656.942824448475', 'Clostridium colinum et rel.': '383.832802397491', 'Bacteroides intestinalis et rel.': '190.944003539949', 'Wissella et rel.': '61.1071007925413', 'Staphylococcus': '74.7418326807986', 'Streptococcus bovis et rel.': '1231.97829925934', 'Eggerthella lenta et rel.': '255.750474733651', 'Asteroleplasma et rel.': '35.5669333395318', 'Clostridium difficile et rel.': '2335.15023422691', 'Bacillus': '114.685768647644', 'Desulfovibrio et rel.': '180.42969481698', 'Burkholderia': '45.8699536871023', 'Ruminococcus gnavus et rel.': '11063.292272352', 'Micrococcaceae': '35.9370046443557', 'Eubacterium cylindroides et rel.': '245.726340375678', 'Clostridium nexile et rel.': '15701.5016700386', 'Bifidobacterium': '11759.0202743222', 'SampleID': 'Sample-1', 'Eubacterium ventriosum et rel.': '11970.8260120332', 'Megasphaera elsdenii et rel.': '1227.78909015878', 'Coprobacillus catenaformis et rel.': '179.971732675445', 'Uncultured Clostridium (sensu stricto)les II': '1028.46847367333', 'Anaerobiospirillum': '35.3563566613223', 'Bilophila et rel.': '130.884906292234', 'Sporobacter termitidis et rel.': '2173.03856541078', 'Pseudomonas': '70.3370411627284', 'Coprococcus eutactus et rel.': '22948.9343288612', 'Anaerovorax odorimutans et rel.': '629.384375572851', 
'Clostridium felsineum et rel.': '35.6307839491112', 'Prevotella tannerae et rel.': '1283.53815363728', 'Vibrio': '154.965222779692', 'Clostridium cellulosi et rel.': '2474.02281831756', 'Outgrouping clostridium cluster XIVa': '29555.6830274738', 'Tannerella et rel.': '1190.85286766131', 'Clostridium thermocellum et rel.': '34.9865501723585', 'Lactobacillus gasseri et rel.': '441.216058087914', 'Proteus et rel.': '289.733611657736', 'Clostridium leptum et rel.': '5154.98353438531', 'Brachyspira': '70.4665794825687', 'Peptostreptococcus anaerobius et rel.': '37.6783424535229', 'Bacteroides uniformis et rel.': '1851.64988304942', 'Lactococcus': '107.351063742561', 'Peptostreptococcus micros et rel.': '175.835238105408', 'Atopobium': '79.7109441226269', 'Anaerofustis': '53.2552249190787', 'Dorea formicigenerans et rel.': '14285.904392846', 'Butyrivibrio crossotus et rel.': '4558.73895680696', 'Escherichia coli et rel.': '472.357875084644', 'Uncultured Clostridium (sensu stricto)les I': '1104.41406152582', 'Phascolarctobacterium faecium et rel.': '725.333175420113', 'Aeromonas': '42.6023272590121', 'Clostridium orbiscindens et rel.': '7362.9410195337', 'Streptococcus intermedius et rel.': '295.604697719047', 'Propionibacterium': '115.758576131571', 'Bacteroides splachnicus et rel.': '1274.47218516652', 'Moraxellaceae': '70.9429502056795', 'Papillibacter cinnamivorans et rel.': '6891.35880891412', 'Clostridium symbiosum et rel.': '16306.6140886602', 'Catenibacterium mitsuokai et rel.': '72.0752367405305', 'Campylobacter': '281.571156604069', 'Clostridium (sensu stricto)': '998.980901995732', 'Anaerostipes caccae et rel.': '11190.2731781007', 'Lactobacillus catenaformis et rel.': '70.6625157216755', 'Weissella et rel.': '171.075788053953', 'Oceanospirillum': '167.953823500534', 'Clostridium stercorarium et rel.': '1625.0810257736', 'Uncultured Mollicutes': '521.322821398893', 'Actinomycetaceae': '72.0189547177961', 'Bacteroides plebeius et rel.': '1154.90165970345', 
'Eubacterium siraeum et rel.': '160.905880469766', 'Xanthomonadaceae': '77.7085722415655', 'Alcaligenes faecalis et rel.': '143.425506392999', 'Peptococcus niger et rel.': '157.298884722962', 'Parabacteroides distasonis et rel.': '1161.35788684798', 'Oxalobacter formigenes et rel.': '204.622353289446', 'Sutterella wadsworthia et rel.': '548.806949964156', 'Subdoligranulum variable at rel.': '2602.6841082063', 'Haemophilus': '71.540476866323', 'Methylobacterium': '35.1269334616332', 'Bacteroides fragilis et rel.': '4292.54489187362', 'Fusobacteria': '323.362880167556', 'Aerococcus': '36.6774989319696', 'Megamonas hypermegale et rel.': '107.311629358762', 'Gemella': '35.899918440336', 'Bulleidia moorei et rel.': '180.874435920243', 'Uncultured Selenomonadaceae': '106.394226924348', 'Faecalibacterium prausnitzii et rel.': '84377.7500987508', 'Roseburia intestinalis et rel.': '3606.82686412831', 'Novosphingobium': '41.0904696615633', 'Eubacterium rectale et rel.': '4279.42265985932', 'Streptococcus mitis et rel.': '1819.27197627525', 'Uncultured Chroococcales': '35.6744941938961', 'Lactobacillus salivarius et rel.': '105.649302240529', 'Helicobacter': '177.382720444565', 'Yersinia et rel.': '148.457166233748', 'Akkermansia': '1381.23685390141', 'Enterococcus': '316.345429422144', 'Mitsuokella multiacida et rel.': '110.397398971489', 'Anaerotruncus colihominis et rel.': '710.087359713864', 'Serratia': '35.6655019469416', 'Corynebacterium': '108.932200272176', 'Veillonella': '211.82167029952', 'Aquabacterium': '47.9446504058434', 'Lachnobacillus bovis et rel.': '2865.07980538427', 'Granulicatella': '40.4354480578309', 'Eubacterium biforme et rel.': '425.830349594737', 'Oscillospira guillermondii et rel.': '5216.48152011872', 'Clostridium sphenoides et rel.': '15437.0905209792', 'Ruminococcus callidus et rel.': '9065.64591869983', 'Enterobacter aerogenes et rel.': '408.69061708681', 'Eubacterium limosum et rel.': '142.661880787189', 'Prevotella oralis et rel.': 
'413.930593766523', 'Aneurinibacillus': '43.918857957738', 'Clostridium ramosum et rel.': '199.98025058557', 'Ruminococcus lactaris et rel.': '301.357562002238', 'Eubacterium hallii et rel.': '4782.98334820634', 'Bacteroides ovatus et rel.': '2519.94166956552'}
{'Bacteroides stercoris et rel.': '957.570251668015', 'Ruminococcus obeum et rel.': '26899.9624612333', 'Allistipes et rel.': '8059.37841236238', 'Ruminococcus bromii et rel.': '3515.12091241772', 'Prevotella melaninogenica et rel.': '110002.189025851', 'Collinsella': '398.964413024397', 'Prevotella ruminicola et rel.': '262.618141353367', 'Leminorella': '39.0104735645511', 'Bryantella formatexigens et rel.': '5141.44688658139', 'Klebisiella pneumoniae et rel.': '244.305766143041', 'Uncultured Bacteroidetes': '67.1658566883046', 'Lachnospira pectinoschiza et rel.': '9406.9397096607', 'Bacteroides vulgatus et rel.': '19166.6836968366', 'Dialister': '4444.32196133912', 'Lactobacillus plantarum et rel.': '629.055893080636', 'Clostridium colinum et rel.': '413.615045277086', 'Bacteroides intestinalis et rel.': '167.360400455124', 'Wissella et rel.': '55.2217934132836', 'Staphylococcus': '74.5011314045027', 'Streptococcus bovis et rel.': '4695.09425255341', 'Eggerthella lenta et rel.': '235.329802792409', 'Asteroleplasma et rel.': '34.9178088231881', 'Clostridium difficile et rel.': '4408.1321277635', 'Bacillus': '110.835070562747', 'Desulfovibrio et rel.': '227.83695959035', 'Burkholderia': '53.4971632171515', 'Ruminococcus gnavus et rel.': '2733.22188237163', 'Micrococcaceae': '35.5590299852275', 'Eubacterium cylindroides et rel.': '176.657324938006', 'Clostridium nexile et rel.': '10737.4539916825', 'Bifidobacterium': '8182.967766195', 'SampleID': 'Sample-2', 'Eubacterium ventriosum et rel.': '5046.77328171548', 'Megasphaera elsdenii et rel.': '256.130727094113', 'Coprobacillus catenaformis et rel.': '388.59014173252', 'Uncultured Clostridium (sensu stricto)les II': '10796.2711980416', 'Anaerobiospirillum': '35.1865673563884', 'Bilophila et rel.': '106.594087755128', 'Sporobacter termitidis et rel.': '14747.6185750786', 'Pseudomonas': '73.0376589972576', 'Coprococcus eutactus et rel.': '28847.3926560496', 'Anaerovorax odorimutans et rel.': '700.722231193224', 
'Clostridium felsineum et rel.': '37.8292135448473', 'Prevotella tannerae et rel.': '861.806667484945', 'Vibrio': '167.467196804019', 'Clostridium cellulosi et rel.': '16033.4932251686', 'Outgrouping clostridium cluster XIVa': '22253.3466630754', 'Tannerella et rel.': '1482.73742652623', 'Clostridium thermocellum et rel.': '35.1775849326978', 'Lactobacillus gasseri et rel.': '447.926146053727', 'Proteus et rel.': '282.501604563543', 'Clostridium leptum et rel.': '7017.24658206289', 'Brachyspira': '69.584645103434', 'Peptostreptococcus anaerobius et rel.': '35.3033711258297', 'Bacteroides uniformis et rel.': '1671.57294611471', 'Lactococcus': '111.86272126066', 'Peptostreptococcus micros et rel.': '174.693251717978', 'Atopobium': '77.5402945928499', 'Anaerofustis': '87.6215719000156', 'Dorea formicigenerans et rel.': '13227.6823622393', 'Butyrivibrio crossotus et rel.': '6282.66462359292', 'Escherichia coli et rel.': '526.384634675218', 'Uncultured Clostridium (sensu stricto)les I': '18897.1453646919', 'Phascolarctobacterium faecium et rel.': '385.560947600988', 'Aeromonas': '38.3398119961015', 'Clostridium orbiscindens et rel.': '13246.1898580334', 'Streptococcus intermedius et rel.': '236.827494354039', 'Propionibacterium': '108.921428810013', 'Bacteroides splachnicus et rel.': '1735.1424288741', 'Moraxellaceae': '69.5055871143714', 'Papillibacter cinnamivorans et rel.': '4023.24490893197', 'Clostridium symbiosum et rel.': '10392.2042903252', 'Catenibacterium mitsuokai et rel.': '92.0590910942654', 'Campylobacter': '280.262341648386', 'Clostridium (sensu stricto)': '3148.89810487013', 'Anaerostipes caccae et rel.': '6860.17346489658', 'Lactobacillus catenaformis et rel.': '70.4119579496169', 'Weissella et rel.': '116.656048660499', 'Oceanospirillum': '140.405880787627', 'Clostridium stercorarium et rel.': '997.99299155555', 'Uncultured Mollicutes': '4415.40208045288', 'Actinomycetaceae': '71.6174398836624', 'Bacteroides plebeius et rel.': '1509.15356520748', 
'Eubacterium siraeum et rel.': '342.736746079781', 'Xanthomonadaceae': '75.7233116621278', 'Alcaligenes faecalis et rel.': '146.719701415408', 'Peptococcus niger et rel.': '249.31084930671', 'Parabacteroides distasonis et rel.': '6889.97725409654', 'Oxalobacter formigenes et rel.': '1121.05746824444', 'Sutterella wadsworthia et rel.': '580.046927124766', 'Subdoligranulum variable at rel.': '28163.5420176167', 'Haemophilus': '85.5371851424096', 'Methylobacterium': '34.7074445519911', 'Bacteroides fragilis et rel.': '2079.61864559992', 'Fusobacteria': '319.817987646445', 'Aerococcus': '35.0939016727016', 'Megamonas hypermegale et rel.': '105.591219666725', 'Gemella': '34.2991985702368', 'Bulleidia moorei et rel.': '175.075114577769', 'Uncultured Selenomonadaceae': '56.5503180707247', 'Faecalibacterium prausnitzii et rel.': '70769.9158415648', 'Roseburia intestinalis et rel.': '1560.78578502096', 'Novosphingobium': '41.3413096229012', 'Eubacterium rectale et rel.': '2316.05651915014', 'Streptococcus mitis et rel.': '1721.05971349193', 'Uncultured Chroococcales': '41.7407912097207', 'Lactobacillus salivarius et rel.': '104.394450599519', 'Helicobacter': '174.759957716599', 'Yersinia et rel.': '148.15376044081', 'Akkermansia': '2361.54262443099', 'Enterococcus': '254.380676293958', 'Mitsuokella multiacida et rel.': '104.570718564164', 'Anaerotruncus colihominis et rel.': '3090.15924687574', 'Serratia': '44.6672277762165', 'Corynebacterium': '108.483046253417', 'Veillonella': '200.735633813136', 'Aquabacterium': '42.909111608071', 'Lachnobacillus bovis et rel.': '10875.9203095783', 'Granulicatella': '35.81588320093', 'Eubacterium biforme et rel.': '542.430519343359', 'Oscillospira guillermondii et rel.': '36743.9299371806', 'Clostridium sphenoides et rel.': '12389.5412600859', 'Ruminococcus callidus et rel.': '3608.63673472775', 'Enterobacter aerogenes et rel.': '458.545114223584', 'Eubacterium limosum et rel.': '111.867465765752', 'Prevotella oralis et rel.': 
'23879.052756104', 'Aneurinibacillus': '41.3611073685682', 'Clostridium ramosum et rel.': '214.050015053772', 'Ruminococcus lactaris et rel.': '10040.5448809581', 'Eubacterium hallii et rel.': '2440.43193914563', 'Bacteroides ovatus et rel.': '2936.89441827642'}
{'Bacteroides stercoris et rel.': '311.299262836978', 'Ruminococcus obeum et rel.': '5615.65670002755', 'Allistipes et rel.': '2206.50573030564', 'Ruminococcus bromii et rel.': '53677.8386287938', 'Prevotella melaninogenica et rel.': '1019.30957689331', 'Collinsella': '583.184503944829', 'Prevotella ruminicola et rel.': '89.9574089207962', 'Leminorella': '38.0013656213506', 'Bryantella formatexigens et rel.': '854.525984993192', 'Klebisiella pneumoniae et rel.': '240.01669299708', 'Uncultured Bacteroidetes': '78.78914155389', 'Lachnospira pectinoschiza et rel.': '2569.09268731024', 'Bacteroides vulgatus et rel.': '26089.5245572182', 'Dialister': '224.521530528055', 'Lactobacillus plantarum et rel.': '701.447735817192', 'Clostridium colinum et rel.': '353.04427970336', 'Bacteroides intestinalis et rel.': '98.0354509404868', 'Wissella et rel.': '36.5872057285934', 'Staphylococcus': '79.5686173634415', 'Streptococcus bovis et rel.': '619.70340551746', 'Eggerthella lenta et rel.': '381.863506948621', 'Asteroleplasma et rel.': '36.4685173763808', 'Clostridium difficile et rel.': '19853.0966345145', 'Bacillus': '114.611935283196', 'Desulfovibrio et rel.': '182.22005236855', 'Burkholderia': '45.4899490968037', 'Ruminococcus gnavus et rel.': '1256.00612585238', 'Micrococcaceae': '36.3074131783969', 'Eubacterium cylindroides et rel.': '181.736519081222', 'Clostridium nexile et rel.': '882.37141217218', 'Bifidobacterium': '7579.92673264528', 'SampleID': 'Sample-3', 'Eubacterium ventriosum et rel.': '622.736371172036', 'Megasphaera elsdenii et rel.': '314.652192054481', 'Coprobacillus catenaformis et rel.': '213.228333791073', 'Uncultured Clostridium (sensu stricto)les II': '2382.60191324172', 'Anaerobiospirillum': '37.5896040144946', 'Bilophila et rel.': '122.263925385916', 'Sporobacter termitidis et rel.': '22678.7885629872', 'Pseudomonas': '72.7077078211695', 'Coprococcus eutactus et rel.': '1614.86611345217', 'Anaerovorax odorimutans et rel.': '2265.74205286361', 
'Clostridium felsineum et rel.': '41.1027333667611', 'Prevotella tannerae et rel.': '765.613978170335', 'Vibrio': '177.587050521686', 'Clostridium cellulosi et rel.': '23937.8801293299', 'Outgrouping clostridium cluster XIVa': '964.384855392337', 'Tannerella et rel.': '732.832389463029', 'Clostridium thermocellum et rel.': '36.5111922408322', 'Lactobacillus gasseri et rel.': '448.760601392134', 'Proteus et rel.': '294.55150629084', 'Clostridium leptum et rel.': '23788.7943354309', 'Brachyspira': '72.5558098238468', 'Peptostreptococcus anaerobius et rel.': '39.4021350122532', 'Bacteroides uniformis et rel.': '246.369229244994', 'Lactococcus': '181.789997362727', 'Peptostreptococcus micros et rel.': '183.280519604566', 'Atopobium': '82.4807102997726', 'Anaerofustis': '45.5692756125726', 'Dorea formicigenerans et rel.': '2996.7534298053', 'Butyrivibrio crossotus et rel.': '17137.6551264532', 'Escherichia coli et rel.': '1524.54711919428', 'Uncultured Clostridium (sensu stricto)les I': '9941.42689025392', 'Phascolarctobacterium faecium et rel.': '438.148611733819', 'Aeromonas': '41.0496803489383', 'Clostridium orbiscindens et rel.': '19842.5106051588', 'Streptococcus intermedius et rel.': '196.246916641573', 'Propionibacterium': '114.948348570259', 'Bacteroides splachnicus et rel.': '835.096362440799', 'Moraxellaceae': '71.886319298056', 'Papillibacter cinnamivorans et rel.': '982.96022990176', 'Clostridium symbiosum et rel.': '2536.03058624394', 'Catenibacterium mitsuokai et rel.': '1568.79149739578', 'Campylobacter': '288.146255625342', 'Clostridium (sensu stricto)': '3503.93342407884', 'Anaerostipes caccae et rel.': '1753.03757188504', 'Lactobacillus catenaformis et rel.': '115.583695287112', 'Weissella et rel.': '137.614018022265', 'Oceanospirillum': '146.760845741096', 'Clostridium stercorarium et rel.': '452.769940169664', 'Uncultured Mollicutes': '2705.45641014781', 'Actinomycetaceae': '76.7885523301023', 'Bacteroides plebeius et rel.': '519.497106023152', 
'Eubacterium siraeum et rel.': '391.258345977737', 'Xanthomonadaceae': '78.4513283270781', 'Alcaligenes faecalis et rel.': '149.895577006333', 'Peptococcus niger et rel.': '423.091515712639', 'Parabacteroides distasonis et rel.': '1724.67480209301', 'Oxalobacter formigenes et rel.': '2397.49282997646', 'Sutterella wadsworthia et rel.': '751.878656807589', 'Subdoligranulum variable at rel.': '2492.97885468133', 'Haemophilus': '72.2606260102919', 'Methylobacterium': '36.0698049436694', 'Bacteroides fragilis et rel.': '988.016805267545', 'Fusobacteria': '331.116658242456', 'Aerococcus': '36.4684590752703', 'Megamonas hypermegale et rel.': '107.775121906872', 'Gemella': '38.6215789610141', 'Bulleidia moorei et rel.': '203.465924081754', 'Uncultured Selenomonadaceae': '35.0873642012232', 'Faecalibacterium prausnitzii et rel.': '2694.6721771549', 'Roseburia intestinalis et rel.': '270.187704629013', 'Novosphingobium': '39.7440376416747', 'Eubacterium rectale et rel.': '538.371251907576', 'Streptococcus mitis et rel.': '462.476703157077', 'Uncultured Chroococcales': '42.8617296375576', 'Lactobacillus salivarius et rel.': '121.210946002814', 'Helicobacter': '182.725411286123', 'Yersinia et rel.': '151.82945051479', 'Akkermansia': '30042.1180325848', 'Enterococcus': '289.312604105773', 'Mitsuokella multiacida et rel.': '107.872788311439', 'Anaerotruncus colihominis et rel.': '2484.29561786565', 'Serratia': '40.9391799568801', 'Corynebacterium': '114.344949585787', 'Veillonella': '143.295195919861', 'Aquabacterium': '39.986798473158', 'Lachnobacillus bovis et rel.': '727.687940879875', 'Granulicatella': '39.1764746724361', 'Eubacterium biforme et rel.': '1036.56415047853', 'Oscillospira guillermondii et rel.': '125726.358948862', 'Clostridium sphenoides et rel.': '1933.85285287853', 'Ruminococcus callidus et rel.': '1045.39863572445', 'Enterobacter aerogenes et rel.': '401.854210793547', 'Eubacterium limosum et rel.': '114.882099026737', 'Prevotella oralis et rel.': 
'354.37644049715', 'Aneurinibacillus': '39.208405728295', 'Clostridium ramosum et rel.': '181.921949980453', 'Ruminococcus lactaris et rel.': '255.371304038065', 'Eubacterium hallii et rel.': '500.453032143198', 'Bacteroides ovatus et rel.': '742.497903203875'}
{'Bacteroides stercoris et rel.': '1613.95729087433', 'Ruminococcus obeum et rel.': '134155.706186624', 'Allistipes et rel.': '21769.9358558725', 'Ruminococcus bromii et rel.': '6129.67841710804', 'Prevotella melaninogenica et rel.': '784.444461008532', 'Collinsella': '580.320385750736', 'Prevotella ruminicola et rel.': '112.000088814222', 'Leminorella': '51.1295752326222', 'Bryantella formatexigens et rel.': '6589.00366913722', 'Klebisiella pneumoniae et rel.': '241.510020067997', 'Uncultured Bacteroidetes': '72.5654498575303', 'Lachnospira pectinoschiza et rel.': '9854.15266598892', 'Bacteroides vulgatus et rel.': '127194.326136257', 'Dialister': '414.498412707916', 'Lactobacillus plantarum et rel.': '649.961100709455', 'Clostridium colinum et rel.': '5035.65856029489', 'Bacteroides intestinalis et rel.': '569.712637030537', 'Wissella et rel.': '44.5241880392279', 'Staphylococcus': '76.0370501997199', 'Streptococcus bovis et rel.': '2088.23764035133', 'Eggerthella lenta et rel.': '279.359098899315', 'Asteroleplasma et rel.': '35.9535392278991', 'Clostridium difficile et rel.': '831.076417227899', 'Bacillus': '121.093595450426', 'Desulfovibrio et rel.': '185.578895198899', 'Burkholderia': '77.3614957825974', 'Ruminococcus gnavus et rel.': '3193.07830780298', 'Micrococcaceae': '36.7943690146329', 'Eubacterium cylindroides et rel.': '189.055199253022', 'Clostridium nexile et rel.': '7685.21399625697', 'Bifidobacterium': '6815.27778254446', 'SampleID': 'Sample-4', 'Eubacterium ventriosum et rel.': '3116.52353584725', 'Megasphaera elsdenii et rel.': '1041.75973297738', 'Coprobacillus catenaformis et rel.': '543.114876516244', 'Uncultured Clostridium (sensu stricto)les II': '2881.87315622668', 'Anaerobiospirillum': '36.7745950580399', 'Bilophila et rel.': '132.504727754643', 'Sporobacter termitidis et rel.': '3229.66463150548', 'Pseudomonas': '72.0951128703707', 'Coprococcus eutactus et rel.': '57660.0808197584', 'Anaerovorax odorimutans et rel.': '1145.53712230086', 
'Clostridium felsineum et rel.': '36.0929985064936', 'Prevotella tannerae et rel.': '2978.38160311097', 'Vibrio': '205.570562332293', 'Clostridium cellulosi et rel.': '2238.94355797021', 'Outgrouping clostridium cluster XIVa': '10648.4608792093', 'Tannerella et rel.': '1899.81421896351', 'Clostridium thermocellum et rel.': '36.3801788740735', 'Lactobacillus gasseri et rel.': '450.839449355039', 'Proteus et rel.': '292.029936329815', 'Clostridium leptum et rel.': '2296.8035288294', 'Brachyspira': '71.3875204011077', 'Peptostreptococcus anaerobius et rel.': '37.2308777724527', 'Bacteroides uniformis et rel.': '6030.43952107733', 'Lactococcus': '107.646637511013', 'Peptostreptococcus micros et rel.': '181.755394682624', 'Atopobium': '77.1669055050925', 'Anaerofustis': '50.7764379830445', 'Dorea formicigenerans et rel.': '17289.4936613187', 'Butyrivibrio crossotus et rel.': '6492.58455877885', 'Escherichia coli et rel.': '608.128418245777', 'Uncultured Clostridium (sensu stricto)les I': '1707.34647259334', 'Phascolarctobacterium faecium et rel.': '402.921163437255', 'Aeromonas': '40.91985003592', 'Clostridium orbiscindens et rel.': '10261.7089969784', 'Streptococcus intermedius et rel.': '244.677945573292', 'Propionibacterium': '124.929547269543', 'Bacteroides splachnicus et rel.': '1391.31730639905', 'Moraxellaceae': '71.1557467405751', 'Papillibacter cinnamivorans et rel.': '16655.7826000318', 'Clostridium symbiosum et rel.': '17220.0762253225', 'Catenibacterium mitsuokai et rel.': '102.963180271077', 'Campylobacter': '286.125561055593', 'Clostridium (sensu stricto)': '949.158895322188', 'Anaerostipes caccae et rel.': '12858.4266577811', 'Lactobacillus catenaformis et rel.': '71.9867820507583', 'Weissella et rel.': '115.188296662287', 'Oceanospirillum': '144.67712966726', 'Clostridium stercorarium et rel.': '986.625684922019', 'Uncultured Mollicutes': '575.770920834396', 'Actinomycetaceae': '74.1533610540817', 'Bacteroides plebeius et rel.': '3675.01457766587', 
'Eubacterium siraeum et rel.': '201.083328677045', 'Xanthomonadaceae': '82.8109342366465', 'Alcaligenes faecalis et rel.': '172.170573810853', 'Peptococcus niger et rel.': '168.995763204274', 'Parabacteroides distasonis et rel.': '1915.86214942697', 'Oxalobacter formigenes et rel.': '1442.36121946445', 'Sutterella wadsworthia et rel.': '3191.88078805316', 'Subdoligranulum variable at rel.': '24215.0265505325', 'Haemophilus': '71.6951137579792', 'Methylobacterium': '36.7947351278379', 'Bacteroides fragilis et rel.': '2500.85651871731', 'Fusobacteria': '332.440206154764', 'Aerococcus': '35.9834717864625', 'Megamonas hypermegale et rel.': '108.54745115488', 'Gemella': '36.7362170790441', 'Bulleidia moorei et rel.': '216.607207853978', 'Uncultured Selenomonadaceae': '35.9099070493019', 'Faecalibacterium prausnitzii et rel.': '48259.7495305778', 'Roseburia intestinalis et rel.': '1436.56517934518', 'Novosphingobium': '41.8065458154256', 'Eubacterium rectale et rel.': '2623.86397872186', 'Streptococcus mitis et rel.': '2073.06294523599', 'Uncultured Chroococcales': '37.942725368598', 'Lactobacillus salivarius et rel.': '107.90290173482', 'Helicobacter': '180.420672155966', 'Yersinia et rel.': '152.548576601', 'Akkermansia': '3886.40096437251', 'Enterococcus': '265.715530081075', 'Mitsuokella multiacida et rel.': '112.843788825819', 'Anaerotruncus colihominis et rel.': '593.978721916727', 'Serratia': '36.7681751259288', 'Corynebacterium': '111.528621553994', 'Veillonella': '180.882271427719', 'Aquabacterium': '154.578790514784', 'Lachnobacillus bovis et rel.': '5175.17938901455', 'Granulicatella': '38.3948023177617', 'Eubacterium biforme et rel.': '3417.80242021815', 'Oscillospira guillermondii et rel.': '26583.2717896654', 'Clostridium sphenoides et rel.': '20049.4161240924', 'Ruminococcus callidus et rel.': '5748.20321839189', 'Enterobacter aerogenes et rel.': '484.822334617022', 'Eubacterium limosum et rel.': '115.556322875858', 'Prevotella oralis et rel.': 
'534.226855460011', 'Aneurinibacillus': '43.4009507065595', 'Clostridium ramosum et rel.': '262.971460039164', 'Ruminococcus lactaris et rel.': '812.265557256994', 'Eubacterium hallii et rel.': '4410.91664956437', 'Bacteroides ovatus et rel.': '2595.56865473167'}

``````

We see that each row contains the `SampleID` and abundance data for various phylogenetically clustered bacteria. For each row in the file, we can now check whether we are interested in that particular `SampleID` (i.e., whether it matched our constraints and is in our `BMI_IDs` dictionary). If so, we retrieve the abundance of the bacteria of interest and add it to the previously identified abundances within that particular BMI_group. If we have not encountered this BMI_group before, we initialize the key with the abundance as value. As we want to calculate the mean of these abundances later, we also keep track of the number of occurrences:

``````

In [27]:

# set up dictionary to track abundances by BMI_group and number of identified records
abundance = {}
# choose a bacteria genus for testing
genus = "Clostridium difficile et rel."
with open('../data/Lahti2014/HITChip.tab') as f:
    csvr = csv.DictReader(f, delimiter = '\t')
    # For each row
    for row in csvr:
        # check whether we need this SampleID
        for g in BMI_IDs:
            if row['SampleID'] in BMI_IDs[g]:
                if g in abundance.keys():
                    # BMI_group seen before: update [sum, count]
                    abundance[g][0] = abundance[g][0] + float(row[genus])
                    abundance[g][1] = abundance[g][1] + 1
                else:
                    # first record for this BMI_group: start [sum, count]
                    abundance[g] = [float(row[genus]), 1]
                # we have found it, so move on
                break

``````
``````

In [28]:

abundance

``````
``````

Out[28]:

{'NA': [26653.01492637446, 3],
'lean': [70636.33360838753, 8],
'overweight': [3571.2360766130105, 2],
'underweight': [3056.63707319002, 1]}

``````
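
The inner loop above scans the ID list of every BMI group for each row. One possible alternative, sketched here on made-up data rather than the real files, is to invert `BMI_IDs` once into a lookup from `SampleID` to BMI group, so that each row needs a single dictionary lookup. The name `sample_to_group`, the list `rows`, and all values are assumptions for illustration only.

```python
# Toy stand-ins for the notebook's data (values are invented for illustration)
BMI_IDs = {'lean': ['Sample-3', 'Sample-7'], 'obese': ['Sample-2']}
rows = [
    {'SampleID': 'Sample-2', 'Aerococcus': '10.0'},
    {'SampleID': 'Sample-3', 'Aerococcus': '20.0'},
    {'SampleID': 'Sample-5', 'Aerococcus': '99.0'},  # not selected
    {'SampleID': 'Sample-7', 'Aerococcus': '40.0'},
]
genus = 'Aerococcus'

# Invert the dictionary: SampleID -> BMI group
sample_to_group = {}
for group, ids in BMI_IDs.items():
    for sample_id in ids:
        sample_to_group[sample_id] = group

# Accumulate [sum, count] per BMI group
abundance = {}
for row in rows:
    group = sample_to_group.get(row['SampleID'])
    if group is None:
        continue  # this sample did not match the constraints
    if group not in abundance:
        abundance[group] = [0.0, 0]
    abundance[group][0] += float(row[genus])
    abundance[group][1] += 1

print(abundance)
```

The result has the same `[sum, count]` layout as the `abundance` dictionary built above, so the later steps would not need to change.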

Now we take care of calculating the mean and printing the results. We need a module that provides `log10`, such as scipy or numpy:

``````

In [29]:

import numpy # For log10 (scipy.log10 is deprecated in recent SciPy releases)

print("____________________________________________________________________")
print("Abundance of " + genus + " In sub-population:")
print("____________________________________________________________________")
for key, value in dict_constraints.items():
    print(key, "->", value)
print("____________________________________________________________________")
for ab in ['NA', 'underweight', 'lean', 'overweight',
           'obese', 'severeobese', 'morbidobese']:
    if ab in abundance.keys():
        # log10 of the mean abundance (sum divided by number of records);
        # stored in a local variable so re-running the cell gives the same result
        log_mean = numpy.log10(abundance[ab][0] / abundance[ab][1])
        print(round(log_mean, 2), '\t', ab)
print("____________________________________________________________________")
print("")

``````
``````

____________________________________________________________________
Abundance of Clostridium difficile et rel. In sub-population:
____________________________________________________________________
Age -> 28
Sex -> female
____________________________________________________________________
3.95 	 NA
3.49 	 underweight
3.95 	 lean
3.25 	 overweight
____________________________________________________________________

``````
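
As a quick check of the arithmetic, the values printed above can be recomputed from the sums and counts shown in `Out[28]`. This sketch uses `math.log10` from the standard library, a substitution made here only to keep the example free of third-party imports; the dictionary name `log_means` is likewise an assumption.

```python
import math

# [sum, count] pairs copied from Out[28] above
abundance = {'NA': [26653.01492637446, 3],
             'lean': [70636.33360838753, 8],
             'overweight': [3571.2360766130105, 2],
             'underweight': [3056.63707319002, 1]}

# log10 of the mean abundance, without modifying the stored sums
log_means = {}
for group, (total, count) in abundance.items():
    log_means[group] = round(math.log10(total / count), 2)

for group in ['NA', 'underweight', 'lean', 'overweight']:
    print(log_means[group], '\t', group)
```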

Last but not least, we put it all together in a function:

``````

In [ ]:

import numpy # For log10 (scipy.log10 is deprecated in recent SciPy releases)

def get_abundance_by_BMI(dict_constraints, genus = 'Aerococcus'):
    # We use a dictionary to store the results
    BMI_IDs = {}
    # Open the file, build a csv DictReader
    # NOTE: f must be the open Metadata file; reuse the `with open(...) as f:`
    # statement from the cell where the Metadata file was first read
    csvr = csv.DictReader(f, delimiter = '\t')
    # For each row
    for row in csvr:
        # check that all conditions are met
        matching = True
        for e in dict_constraints:
            if row[e] != dict_constraints[e]:
                # The constraint is not met. Move to the next record
                matching = False
                break
        # matching is True only if all the constraints have been met
        if matching == True:
            # extract the BMI_group
            my_BMI = row['BMI_group']
            if my_BMI in BMI_IDs.keys():
                # If we've seen it before, add the SampleID
                BMI_IDs[my_BMI] = BMI_IDs[my_BMI] + [row['SampleID']]
            else:
                # If not, initialize
                BMI_IDs[my_BMI] = [row['SampleID']]
    # Now let's open the other file, and keep track of the abundance of the genus for each
    # BMI group
    abundance = {}
    with open('../data/Lahti2014/HITChip.tab') as f:
        csvr = csv.DictReader(f, delimiter = '\t')
        # For each row
        for row in csvr:
            # check whether we need this SampleID
            for g in BMI_IDs:
                if row['SampleID'] in BMI_IDs[g]:
                    if g in abundance.keys():
                        # BMI_group seen before: update [sum, count]
                        abundance[g][0] = abundance[g][0] + float(row[genus])
                        abundance[g][1] = abundance[g][1] + 1
                    else:
                        # first record for this BMI_group: start [sum, count]
                        abundance[g] = [float(row[genus]), 1]
                    # we have found it, so move on
                    break
    # Finally, calculate means, and print results
    print("____________________________________________________________________")
    print("Abundance of " + genus + " In sub-population:")
    print("____________________________________________________________________")
    for key, value in dict_constraints.items():
        print(key, "->", value)
    print("____________________________________________________________________")
    for ab in ['NA', 'underweight', 'lean', 'overweight',
               'obese', 'severeobese', 'morbidobese']:
        if ab in abundance.keys():
            # log10 of the mean abundance (sum divided by number of records)
            abundance[ab][0] = numpy.log10(abundance[ab][0] / abundance[ab][1])
            print(round(abundance[ab][0], 2), '\t', ab)
    print("____________________________________________________________________")
    print("")

``````
``````

In [8]:

get_abundance_by_BMI({'Time': '0', 'Nationality': 'US'},
'Clostridium difficile et rel.')

``````
``````

____________________________________________________________________
Abundance of Clostridium difficile et rel. In sub-population:
____________________________________________________________________
Nationality -> US
Time -> 0
____________________________________________________________________
3.08 	 NA
3.31 	 underweight
3.84 	 lean
2.89 	 overweight
3.31 	 obese
3.45 	 severeobese
____________________________________________________________________

``````
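
The constraint check inside the function can also be written with the built-in `all`, which stops at the first failed comparison just like the explicit `break`. The following is a minimal sketch: the helper name `matches` is hypothetical, and the two rows are abbreviated versions of Metadata records.

```python
def matches(row, dict_constraints):
    """Return True if the row satisfies every key/value constraint."""
    return all(row[key] == value for key, value in dict_constraints.items())

# Abbreviated example records (only a few of the Metadata columns)
rows = [
    {'SampleID': 'Sample-1', 'Age': '28', 'Sex': 'male', 'BMI_group': 'severeobese'},
    {'SampleID': 'Sample-2', 'Age': '24', 'Sex': 'female', 'BMI_group': 'obese'},
]

selected = [row['SampleID'] for row in rows if matches(row, {'Sex': 'female'})]
print(selected)
```

Note that an empty dictionary of constraints matches every row, which is the same behavior as the loop in the function above.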

Repeat this analysis for all genera, and for the records having `Time = 0`.

A function to extract all the genera in the database:

``````

In [10]:

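def get_all_genera():
    with open('../data/Lahti2014/HITChip.tab') as f:
        csvr = csv.DictReader(f, delimiter = '\t')
        # Every column in the header except SampleID is a genus; sort them
        genera = sorted(c for c in csvr.fieldnames if c != 'SampleID')
    return genera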

``````

Testing:

``````

In [7]:

get_all_genera()[:6]

``````
``````

Out[7]:

['Actinomycetaceae',
'Aerococcus',
'Aeromonas',
'Akkermansia',
'Alcaligenes faecalis et rel.',
'Allistipes et rel.']

``````
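
To see how column names can be pulled from a tab-separated header without touching the real file, here is a small sketch on an in-memory string. The two-genus table is invented for illustration; `csv.DictReader` and its `fieldnames` attribute are standard-library features.

```python
import csv
import io

# A tiny tab-separated table standing in for HITChip.tab (made-up values)
text = "SampleID\tAkkermansia\tAerococcus\nSample-1\t3886.4\t35.9\n"

with io.StringIO(text) as f:
    csvr = csv.DictReader(f, delimiter='\t')
    # fieldnames holds the header; drop the ID column and sort the rest
    genera = sorted(name for name in csvr.fieldnames if name != 'SampleID')

print(genera)
```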

Now use this function to print the results at `Time = 0`. To keep the output short, we loop over the first five genera only; drop the `[:5]` slice to cover all genera:

``````

In [8]:

for g in get_all_genera()[:5]:
    get_abundance_by_BMI({'Time': '0'}, g)

``````
``````

____________________________________________________________________
Abundance of Actinomycetaceae In sub-population:
____________________________________________________________________
Time -> 0
____________________________________________________________________
1.98 	 NA
1.95 	 underweight
1.98 	 lean
1.97 	 overweight
1.93 	 obese
1.95 	 severeobese
1.9 	 morbidobese
____________________________________________________________________

____________________________________________________________________
Abundance of Aerococcus In sub-population:
____________________________________________________________________
Time -> 0
____________________________________________________________________
1.66 	 NA
1.63 	 underweight
1.66 	 lean
1.66 	 overweight
1.61 	 obese
1.62 	 severeobese
1.6 	 morbidobese
____________________________________________________________________

____________________________________________________________________
Abundance of Aeromonas In sub-population:
____________________________________________________________________
Time -> 0
____________________________________________________________________
1.68 	 NA
1.68 	 underweight
1.69 	 lean
1.69 	 overweight
1.66 	 obese
1.66 	 severeobese
1.63 	 morbidobese
____________________________________________________________________

____________________________________________________________________
Abundance of Akkermansia In sub-population:
____________________________________________________________________
Time -> 0
____________________________________________________________________
3.53 	 NA
4.0 	 underweight
3.65 	 lean
3.71 	 overweight
3.52 	 obese
3.48 	 severeobese
3.35 	 morbidobese
____________________________________________________________________

____________________________________________________________________
Abundance of Alcaligenes faecalis et rel. In sub-population:
____________________________________________________________________
Time -> 0
____________________________________________________________________
2.32 	 NA
2.26 	 underweight
2.36 	 lean
2.37 	 overweight
2.49 	 obese
2.43 	 severeobese
2.26 	 morbidobese
____________________________________________________________________

``````
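
Calling `get_abundance_by_BMI` once per genus re-reads both files every time. A possible refinement, sketched below on invented in-memory data, accumulates the sums for several genera in a single pass over the abundance table. The function name `mean_log_abundance`, its arguments, and the toy table are hypothetical and not part of the original solution; `math.log10` is used here in place of scipy to keep the sketch self-contained.

```python
import csv
import io
import math

def mean_log_abundance(f, sample_to_group, genera):
    """Return {genus: {BMI group: log10 of mean abundance}} in one pass."""
    sums = {g: {} for g in genera}    # genus -> group -> running sum
    counts = {}                       # group -> number of samples
    for row in csv.DictReader(f, delimiter='\t'):
        group = sample_to_group.get(row['SampleID'])
        if group is None:
            continue  # sample not selected by the constraints
        counts[group] = counts.get(group, 0) + 1
        for g in genera:
            sums[g][group] = sums[g].get(group, 0.0) + float(row[g])
    return {g: {grp: math.log10(total / counts[grp])
                for grp, total in sums[g].items()}
            for g in genera}

# Made-up table: two lean samples and one obese sample
text = ("SampleID\tAerococcus\tAkkermansia\n"
        "Sample-1\t10\t1000\n"
        "Sample-2\t1000\t100\n"
        "Sample-3\t190\t1000\n")
groups = {'Sample-1': 'lean', 'Sample-2': 'obese', 'Sample-3': 'lean'}

result = mean_log_abundance(io.StringIO(text), groups,
                            ['Aerococcus', 'Akkermansia'])
for genus, by_group in result.items():
    for group, value in by_group.items():
        print(genus, group, round(value, 2))
```

Returning a nested dictionary instead of printing also makes it easier to compare genera afterwards, at the cost of keeping all sums in memory.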