Microbiome experiment step-by-step analysis

This is a jupyter notebook example of how to load, process and plot data from a microbiome experiment using Calour.

Setup

Import the calour module



In [1]:

    
import calour as ca









    



/Users/amnon/miniconda3/envs/calour/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

(optional) Set the level of feedback messages from calour

can use:

1 for debug (lots of feedback on each command)
11 for info (useful information from some commands)
21 for warning (just warning messages)

The Calour default is warning (21)



In [2]:

    
ca.set_log_level(11)

Also enable interactive plots inside the jupyter notebook



In [3]:

    
%matplotlib notebook

Loading the data

For an amplicon experiment we use ca.read_amplicon()

First parameter is the location+name of the biom table file (can be hdf5/json/txt biom table - see here for details)

Second (optional) parameter is the sample mapping file locaion+name. First column should be the sample id (identical to the sample ids in the biom table). Rest of the column are information fields about each sample.

normalize=XXX : tells calour to rescale each sample to XXX reads (by dividing each feature frequency by the total number of reads in the sample and multiplying by XXX). Alternatively, can use normalize=None to skip normalization (i.e. in the case the biom table is already rarified)

min_reads=XXX : throw away samples with less than min_reads total (before normalization). Useful to get rid of samples with small number of reads. Can use min_reads=None to keep all samples.

We will use the data from:

Giloteaux, L., Goodrich, J.K., Walters, W.A., Levine, S.M., Ley, R.E. and Hanson, M.R., 2016.

Reduced diversity and altered composition of the gut microbiome in individuals with myalgic encephalomyelitis/chronic fatigue syndrome.

Microbiome, 4(1), p.30.



In [4]:

    
dat=ca.read_amplicon('data/chronic-fatigue-syndrome.biom',
                     'data/chronic-fatigue-syndrome.sample.txt',
                     normalize=10000,min_reads=1000)









    



2018-03-04 12:40:24 INFO loaded 87 samples, 2129 features
2018-03-04 12:40:24 WARNING These have metadata but do not have data - dropped: {'ERR1331814'}
2018-03-04 12:40:24 INFO After filtering, 87 remaining



In [5]:

    
print(dat)









    



AmpliconExperiment ("chronic-fatigue-syndrome.biom") with 87 samples, 2129 features

Process the data

Get rid of the features (bacteria) with small amount of reads

We throw away all features with total reads (over all samples) < 10 (after each sample was normalized to 10k reads/sample). So a bacteria present (with 1 read) in 10 samples will be kept, as well as a bacteria present in only one sample, with 10 reads in this sample. Note alternatively we could filter based on mean reads/sample or fraction of samples where the feature is present. Each method filters away slightly different bacteria. See filtering notebook for details on the filtering functions.



In [6]:

    
dat=dat.filter_abundance(10)









    



2018-03-04 12:40:30 INFO After filtering, 1100 remaining

Cluster (reorder) the features so similarly behaving bacteria are close to each other

Features are clustered (hierarchical clustering) based on euaclidian distance between features (over all samples) following normalizing each feature to mean 0 std 1. For more details and examples, see sorting notebook or cluster_features documentation

Note that if we have a lot of features, clustering is slow, so it is recommended to first filter away the non-interesting features.



In [7]:

    
datc=dat.cluster_features()









    



2018-03-04 12:40:30 INFO After filtering, 1100 remaining

Sort the samples according to physical functioning and Disease state

Note that order within each group of similar value is maintained. We first sort by physical functioning, then sort by the disease state. So within each disease state, samples will still be sorted by physical functioning.



In [8]:

    
datc=datc.sort_samples('Physical_functioning')
datc=datc.sort_samples('Subject')

Plotting the data

Columns (x-axis) are the samples, rows (y-axis) are the features. We will show on the x-axis the host-individual field of each sample.

we will use the jupyter notebook GUI so we will see the interactive plot in the notebook. Alternatively we could use the qt5 GUI to see the plot in a separate standalone window.

A few cool things we can do with the interactive plot:

Click with the mouse on the heatmap to see details about the feature/sample selected (including information from dbBact).
use SHIFT+UP or SHIFT+DOWN to zoom in/out on the features
use UP/DOWN to scroll up/down on the features
use SHIFT+RIGHT or SHIFT+LEFT to zoom in/out on the samples
use RIGHT/LEFT to scroll left/right on the samples

See here for more details



In [9]:

    
datc.plot(sample_field='Subject', gui='jupyter')









    














    











    





 
 










    





 
 










    





 
 










    





 
 










    





 
 










    





 
 










    Out[9]:





<calour.heatmap.plotgui_jupyter.PlotGUI_Jupyter at 0x1093a7630>

Adding a field to the top bar

Now let's add the values of the "Sex" field into the xbar on top First we'll also sort by sex, so values will be continuous (note we then sort by the disease state to get the two groups separated).



In [10]:

    
datc=datc.sort_samples('Sex')
datc=datc.sort_samples('Subject')



In [11]:

    
datc.plot(sample_field='Subject', gui='jupyter',barx_fields=['Sex'])









    














    











    





 
 










    





 
 










    





 
 










    





 
 










    





 
 










    





 
 










    Out[11]:





<calour.heatmap.plotgui_jupyter.PlotGUI_Jupyter at 0x1a19dacc50>

Differential abundance testing

Let's look for bacteria separating sick from healthy We ask it to find all bacteria significantly different between samples with 'Control' and 'Patient' in the 'Subject' field.

By default calour uses the mean of the ranks of each feature (over all samples), with dsFDR multiple hypothesis correction.

For more information, see notebook and function doc



In [12]:

    
dd=datc.diff_abundance(field='Subject',val1='Control',val2='Patient', random_seed=2018)









    



2018-03-04 12:40:36 INFO 87 samples with both values
2018-03-04 12:40:36 INFO After filtering, 1100 remaining
2018-03-04 12:40:36 INFO 39 samples with value 1 (['Control'])
2018-03-04 12:40:37 INFO method meandiff. number of higher in ['Control'] : 44. number of higher in ['Patient'] : 19. total 63

Plotting the differentially abundant features

Let's plot to see the behavior of these bacteria. The output of diff_abundance is an Experiment with only the significant bacteria, which are sorted by the effect size. On the bottom is the bacteria with the largest effect size (higher in Control compared to Patient).



In [13]:

    
dd.plot(sample_field='Subject', gui='jupyter')









    














    











    





 
 










    





 
 










    





 
 










    





 
 










    





 
 










    





 
 










    Out[13]:





<calour.heatmap.plotgui_jupyter.PlotGUI_Jupyter at 0x1a0ffeca20>

dbBact term enrichment

We can ask what is special in the bacteria significanly higher in the Control vs. the Patient group and vice versa.

We supply the parameter ignore_exp=[12] to ignore annotations regarding this experiment (expid=12) since it is already in the dbBact database.

Note since we need to get the per-feature annotations from dbBact, we need a live internet connection to run this command.



In [14]:

    
ax, enriched=dd.plot_diff_abundance_enrichment(term_type='combined',ignore_exp=[12])









    



2018-03-04 12:40:41 INFO Getting dbBact annotations for 63 sequences, please wait...
2018-03-04 12:40:45 INFO Got 2383 annotations
2018-03-04 12:40:45 INFO Added annotation data to experiment. Total 631 annotations, 63 terms
2018-03-04 12:40:45 INFO removed 171 terms

The enriched terms are in a calour experiment class (terms are features, bacteria are samples), so we can see the list of enriched terms with the p-value (pval) and effect size (odif)



In [15]:

    
enriched.feature_metadata









    Out[15]:







  
    
      
      odif
      pvals
      term
    
  
  
    
      crohn's disease
      -1.688995
      0.000999
      crohn's disease
    
    
      **63**little physical activity
      -1.368421
      0.000999
      **63**little physical activity
    
    
      -control
      -1.038876
      0.000999
      -control
    
    
      age > 1 year
      -1.014354
      0.010989
      age > 1 year
    
    
      -small village
      -0.997608
      0.000999
      -small village
    
    
      age
      -0.931873
      0.060939
      age
    
    
      obsolete_juvenile stage
      -0.898923
      0.000999
      obsolete_juvenile stage
    
    
      child
      -0.878190
      0.006993
      child
    
    
      animal product diet
      -0.825359
      0.005994
      animal product diet
    
    
      -rural community
      -0.803828
      0.000999
      -rural community
    
    
      **63**-physical activity
      -0.684211
      0.000999
      **63**-physical activity
    
    
      higher in individuals with low physical activity ( high in little physical activity  compared to physical activity  in feces homo sapiens united states of america
      -0.684211
      0.000999
      higher in individuals with low physical activi...
    
    
      -adult
      -0.674641
      0.000999
      -adult
    
    
      **276**-tunapuco
      -0.593301
      0.000999
      **276**-tunapuco
    
    
      high in united states of america city state of oklahoma  compared to peru small village tunapuco rural community  in feces homo sapiens adult
      -0.593301
      0.000999
      high in united states of america city state o...
    
    
      **276**-peru
      -0.593301
      0.000999
      **276**-peru
    
    
      old age
      -0.586124
      0.007992
      old age
    
    
      high in children with Crohn's disease compared to healthy adult controls ( high in crohn's disease child obsolete_juvenile stage  compared to control adult  in feces homo sapiens glasgow
      -0.578947
      0.000999
      high in children with Crohn's disease compared...
    
    
      high in female  compared to male  in feces homo sapiens united states of america
      -0.473684
      0.000999
      high in female  compared to male  in feces ho...
    
    
      female
      -0.466467
      0.065934
      female
    
    
      infant
      -0.452602
      0.006993
      infant
    
    
      **131**age 30-50 years
      -0.435407
      0.025974
      **131**age 30-50 years
    
    
      canis lupus familiaris
      -0.431685
      0.000999
      canis lupus familiaris
    
    
      **150**enzyme supplement
      -0.421053
      0.006993
      **150**enzyme supplement
    
    
      state of oklahoma
      -0.360048
      0.008991
      state of oklahoma
    
    
      Higher in animal product diet compared to plant diet ( high in diet animal product diet  compared to plant diet  in feces homo sapiens united states of america
      -0.352871
      0.001998
      Higher in animal product diet compared to plan...
    
    
      **74**-plant diet
      -0.352871
      0.001998
      **74**-plant diet
    
    
      high in age 1 year  compared to age 2 months  in feces homo sapiens female infant state of california
      -0.345694
      0.001998
      high in age 1 year  compared to age 2 months ...
    
    
      **284**-age 2 months
      -0.345694
      0.001998
      **284**-age 2 months
    
    
      **214**taconic farms
      -0.315789
      0.025974
      **214**taconic farms
    
    
      ...
      ...
      ...
      ...
    
    
      high in peru small village tunapuco rural community  compared to united states of america city state of oklahoma  in feces homo sapiens adult
      0.500000
      0.000999
      high in peru small village tunapuco rural com...
    
    
      lower in small intestine compared to colon in pigs ( high in caecum left colon right colon  compared to duodenum jejunum ileum  in sus scrofa united kingdom pig
      0.500000
      0.001998
      lower in small intestine compared to colon in ...
    
    
      **256**right colon
      0.543062
      0.000999
      **256**right colon
    
    
      **256**left colon
      0.543062
      0.000999
      **256**left colon
    
    
      lower in babies from finland compared to estonia ( high in estonia  compared to finland  in feces homo sapiens infant age < 3 years
      0.545455
      0.000999
      lower in babies from finland compared to eston...
    
    
      **63**physical activity
      0.545455
      0.017982
      **63**physical activity
    
    
      common  caecum, left colon, right colon, sus scrofa, united kingdom, pig,
      0.561005
      0.000999
      common  caecum, left colon, right colon, sus s...
    
    
      common  feces, homo sapiens, child, egypt, obsolete_juvenile stage,
      0.570574
      0.000999
      common  feces, homo sapiens, child, egypt, obs...
    
    
      **276**peru
      0.575758
      0.000999
      **276**peru
    
    
      **276**tunapuco
      0.575758
      0.000999
      **276**tunapuco
    
    
      -city
      0.590909
      0.000999
      -city
    
    
      high in male  compared to female  in feces homo sapiens united states of america
      0.613636
      0.000999
      high in male  compared to female  in feces ho...
    
    
      lower in wastewater plant effluent compared to influent and sewer in south america ( high in sewage influent  compared to effluent  in city wastewater treatment plant south america
      0.637560
      0.000999
      lower in wastewater plant effluent compared to...
    
    
      -effluent
      0.637560
      0.000999
      -effluent
    
    
      common  feces, homo sapiens, venezuela, amerindian, hunter gatherer,
      0.674641
      0.000999
      common  feces, homo sapiens, venezuela, amerin...
    
    
      common  feces, homo sapiens, adult, peru, small village, tunapuco, rural community,
      0.704545
      0.000999
      common  feces, homo sapiens, adult, peru, smal...
    
    
      rural community
      0.730463
      0.000999
      rural community
    
    
      male
      0.735845
      0.026973
      male
    
    
      -finland
      0.773325
      0.000999
      -finland
    
    
      -united states of america
      0.795455
      0.000999
      -united states of america
    
    
      common  feces, homo sapiens, city, el salvador, small village,
      0.795455
      0.000999
      common  feces, homo sapiens, city, el salvador...
    
    
      -ileum
      0.811005
      0.000999
      -ileum
    
    
      wet season
      0.909091
      0.000999
      wet season
    
    
      -crohn's disease
      1.132775
      0.000999
      -crohn's disease
    
    
      small village
      1.200159
      0.000999
      small village
    
    
      adult
      1.248156
      0.017982
      adult
    
    
      **53**influent
      1.275120
      0.000999
      **53**influent
    
    
      **53**sewage
      1.275120
      0.000999
      **53**sewage
    
    
      homo sapiens
      2.280233
      0.068931
      homo sapiens
    
    
      control
      2.662679
      0.011988
      control
    
  

239 rows × 3 columns



In [ ]:

	odif	pvals	term
crohn's disease	-1.688995	0.000999	crohn's disease
63little physical activity	-1.368421	0.000999	63little physical activity
-control	-1.038876	0.000999	-control
age > 1 year	-1.014354	0.010989	age > 1 year
-small village	-0.997608	0.000999	-small village
age	-0.931873	0.060939	age
obsolete_juvenile stage	-0.898923	0.000999	obsolete_juvenile stage
child	-0.878190	0.006993	child
animal product diet	-0.825359	0.005994	animal product diet
-rural community	-0.803828	0.000999	-rural community
63-physical activity	-0.684211	0.000999	63-physical activity
higher in individuals with low physical activity ( high in little physical activity compared to physical activity in feces homo sapiens united states of america	-0.684211	0.000999	higher in individuals with low physical activi...
-adult	-0.674641	0.000999	-adult
276-tunapuco	-0.593301	0.000999	276-tunapuco
high in united states of america city state of oklahoma compared to peru small village tunapuco rural community in feces homo sapiens adult	-0.593301	0.000999	high in united states of america city state o...
276-peru	-0.593301	0.000999	276-peru
old age	-0.586124	0.007992	old age
high in children with Crohn's disease compared to healthy adult controls ( high in crohn's disease child obsolete_juvenile stage compared to control adult in feces homo sapiens glasgow	-0.578947	0.000999	high in children with Crohn's disease compared...
high in female compared to male in feces homo sapiens united states of america	-0.473684	0.000999	high in female compared to male in feces ho...
female	-0.466467	0.065934	female
infant	-0.452602	0.006993	infant
131age 30-50 years	-0.435407	0.025974	131age 30-50 years
canis lupus familiaris	-0.431685	0.000999	canis lupus familiaris
150enzyme supplement	-0.421053	0.006993	150enzyme supplement
state of oklahoma	-0.360048	0.008991	state of oklahoma
Higher in animal product diet compared to plant diet ( high in diet animal product diet compared to plant diet in feces homo sapiens united states of america	-0.352871	0.001998	Higher in animal product diet compared to plan...
74-plant diet	-0.352871	0.001998	74-plant diet
high in age 1 year compared to age 2 months in feces homo sapiens female infant state of california	-0.345694	0.001998	high in age 1 year compared to age 2 months ...
284-age 2 months	-0.345694	0.001998	284-age 2 months
214taconic farms	-0.315789	0.025974	214taconic farms
...	...	...	...
high in peru small village tunapuco rural community compared to united states of america city state of oklahoma in feces homo sapiens adult	0.500000	0.000999	high in peru small village tunapuco rural com...
lower in small intestine compared to colon in pigs ( high in caecum left colon right colon compared to duodenum jejunum ileum in sus scrofa united kingdom pig	0.500000	0.001998	lower in small intestine compared to colon in ...
256right colon	0.543062	0.000999	256right colon
256left colon	0.543062	0.000999	256left colon
lower in babies from finland compared to estonia ( high in estonia compared to finland in feces homo sapiens infant age < 3 years	0.545455	0.000999	lower in babies from finland compared to eston...
63physical activity	0.545455	0.017982	63physical activity
common caecum, left colon, right colon, sus scrofa, united kingdom, pig,	0.561005	0.000999	common caecum, left colon, right colon, sus s...
common feces, homo sapiens, child, egypt, obsolete_juvenile stage,	0.570574	0.000999	common feces, homo sapiens, child, egypt, obs...
276peru	0.575758	0.000999	276peru
276tunapuco	0.575758	0.000999	276tunapuco
-city	0.590909	0.000999	-city
high in male compared to female in feces homo sapiens united states of america	0.613636	0.000999	high in male compared to female in feces ho...
lower in wastewater plant effluent compared to influent and sewer in south america ( high in sewage influent compared to effluent in city wastewater treatment plant south america	0.637560	0.000999	lower in wastewater plant effluent compared to...
-effluent	0.637560	0.000999	-effluent
common feces, homo sapiens, venezuela, amerindian, hunter gatherer,	0.674641	0.000999	common feces, homo sapiens, venezuela, amerin...
common feces, homo sapiens, adult, peru, small village, tunapuco, rural community,	0.704545	0.000999	common feces, homo sapiens, adult, peru, smal...
rural community	0.730463	0.000999	rural community
male	0.735845	0.026973	male
-finland	0.773325	0.000999	-finland
-united states of america	0.795455	0.000999	-united states of america
common feces, homo sapiens, city, el salvador, small village,	0.795455	0.000999	common feces, homo sapiens, city, el salvador...
-ileum	0.811005	0.000999	-ileum
wet season	0.909091	0.000999	wet season
-crohn's disease	1.132775	0.000999	-crohn's disease
small village	1.200159	0.000999	small village
adult	1.248156	0.017982	adult
53influent	1.275120	0.000999	53influent
53sewage	1.275120	0.000999	53sewage
homo sapiens	2.280233	0.068931	homo sapiens
control	2.662679	0.011988	control