In [1]:
import calour as ca
In [2]:
ca.set_log_level(11)
In [3]:
%matplotlib notebook
For an amplicon experiment we use ca.read_amplicon()
First parameter is the location+name of the biom table file (can be hdf5/json/txt biom table - see here for details)
Second (optional) parameter is the sample mapping file locaion+name. First column should be the sample id (identical to the sample ids in the biom table). Rest of the column are information fields about each sample.
normalize=XXX : tells calour to rescale each sample to XXX reads (by dividing each feature frequency by the total number of reads in the sample and multiplying by XXX). Alternatively, can use normalize=None to skip normalization (i.e. in the case the biom table is already rarified)
min_reads=XXX : throw away samples with less than min_reads total (before normalization). Useful to get rid of samples with small number of reads. Can use min_reads=None to keep all samples.
We will use the data from:
Giloteaux, L., Goodrich, J.K., Walters, W.A., Levine, S.M., Ley, R.E. and Hanson, M.R., 2016.
Reduced diversity and altered composition of the gut microbiome in individuals with myalgic encephalomyelitis/chronic fatigue syndrome.
Microbiome, 4(1), p.30.
In [4]:
dat=ca.read_amplicon('data/chronic-fatigue-syndrome.biom',
'data/chronic-fatigue-syndrome.sample.txt',
normalize=10000,min_reads=1000)
In [5]:
print(dat)
We throw away all features with total reads (over all samples) < 10 (after each sample was normalized to 10k reads/sample). So a bacteria present (with 1 read) in 10 samples will be kept, as well as a bacteria present in only one sample, with 10 reads in this sample. Note alternatively we could filter based on mean reads/sample or fraction of samples where the feature is present. Each method filters away slightly different bacteria. See filtering notebook for details on the filtering functions.
In [6]:
dat=dat.filter_abundance(10)
Features are clustered (hierarchical clustering) based on euaclidian distance between features (over all samples) following normalizing each feature to mean 0 std 1. For more details and examples, see sorting notebook or cluster_features documentation
In [7]:
datc=dat.cluster_features()
In [8]:
datc=datc.sort_samples('Physical_functioning')
datc=datc.sort_samples('Subject')
Columns (x-axis) are the samples, rows (y-axis) are the features. We will show on the x-axis the host-individual field of each sample.
we will use the jupyter notebook GUI so we will see the interactive plot in the notebook. Alternatively we could use the qt5 GUI to see the plot in a separate standalone window.
A few cool things we can do with the interactive plot:
Click with the mouse on the heatmap to see details about the feature/sample selected (including information from dbBact).
use SHIFT+UP or SHIFT+DOWN to zoom in/out on the features
use UP/DOWN to scroll up/down on the features
use SHIFT+RIGHT or SHIFT+LEFT to zoom in/out on the samples
use RIGHT/LEFT to scroll left/right on the samples
See here for more details
In [9]:
datc.plot(sample_field='Subject', gui='jupyter')
Out[9]:
In [10]:
datc=datc.sort_samples('Sex')
datc=datc.sort_samples('Subject')
In [11]:
datc.plot(sample_field='Subject', gui='jupyter',barx_fields=['Sex'])
Out[11]:
Let's look for bacteria separating sick from healthy We ask it to find all bacteria significantly different between samples with 'Control' and 'Patient' in the 'Subject' field.
By default calour uses the mean of the ranks of each feature (over all samples), with dsFDR multiple hypothesis correction.
For more information, see notebook and function doc
In [12]:
dd=datc.diff_abundance(field='Subject',val1='Control',val2='Patient', random_seed=2018)
Let's plot to see the behavior of these bacteria. The output of diff_abundance is an Experiment with only the significant bacteria, which are sorted by the effect size. On the bottom is the bacteria with the largest effect size (higher in Control compared to Patient).
In [13]:
dd.plot(sample_field='Subject', gui='jupyter')
Out[13]:
We can ask what is special in the bacteria significanly higher in the Control vs. the Patient group and vice versa.
We supply the parameter ignore_exp=[12]
to ignore annotations regarding this experiment (expid=12) since it is already in the dbBact database.
In [14]:
ax, enriched=dd.plot_diff_abundance_enrichment(term_type='combined',ignore_exp=[12])
The enriched terms are in a calour experiment class (terms are features, bacteria are samples), so we can see the list of enriched terms with the p-value (pval) and effect size (odif)
In [15]:
enriched.feature_metadata
Out[15]:
In [ ]: