In [1]:
import calour as ca
ca.set_log_level(11)
%matplotlib notebook
import numpy as np
np.random.seed(2018)
we use the Chronic faitigue syndrome dataset from
Giloteaux, L., Goodrich, J.K., Walters, W.A., Levine, S.M., Ley, R.E. and Hanson, M.R., 2016.
Reduced diversity and altered composition of the gut microbiome in individuals with myalgic encephalomyelitis/chronic fatigue syndrome.
Microbiome, 4(1), p.30.
In [2]:
cfs=ca.read_amplicon('data/chronic-fatigue-syndrome.biom',
'data/chronic-fatigue-syndrome.sample.txt',
normalize=10000,min_reads=1000)
In [3]:
print(cfs)
In [4]:
cfs=cfs.filter_abundance(10)
In [5]:
cfs=cfs.cluster_features()
In [6]:
cfs=cfs.sort_samples('Subject')
In [7]:
cfs.plot(sample_field='Subject',gui='jupyter')
Out[7]:
So we see there are lots of bacteria, hard to see which ones are significantly different between Control (healthy) and Patient (CFS).
(rank mean test after rank transform, with ds-FDR FDR control of 0.1)
Note the results of diff_abundance() is a new Experiment, containing only statistically significant bacteria. The bacteria are sorted according to the effect size (where positive effect size is when group1 > group2, and negative is group1 < group2).
In [8]:
dd=cfs.diff_abundance('Subject','Control','Patient')
So we got 55 differentially abundant bacteria when we compare all samples that have 'Control' in sample metadata field 'Subject' compared to samples that have 'Patient'
Let's see what we got in a heatmap
In [9]:
dd.plot(sample_field='Subject',gui='jupyter')
Out[9]:
The bacteria at the top are higher (with statistical significance) in 'Patient' compared to 'Control', with the topmost bacteria the most different.
The bottom bacteria are higher in 'Control' compared to 'Patient', with the bottom-most bacteria the most different.
Note that calour does not show where in the middle is the transition from higher in Control to higher in Patient. This is since we believe if you don't see it by eye, it is not interesting.
However, we can add a colorbar r to indicate the group where the bacteria are higher.
We use the '_calour_diff_abundance_group' field which is added by the diff_abundance() function.
The diff_abundance() also adds the p-value associated with each bacteria ('_calour_diff_abundance_pval'), and the effect size('_calour_diff_abundance_effect') as fields in the dd.feature_metadata Pandas dataframe
In [10]:
dd.plot(sample_field='Subject',gui='jupyter',bary_fields=['_calour_diff_abundance_group'])
Out[10]:
In [ ]:
print(dd.feature_metadata.columns)
diff_abundance() calculates the effect size (and p-value) based on random permutations of the sample labels.
Therefore, running diff_abundance() twice may result in slightly different results. In order to get exactly the same results, we can use the random_seed parameter.
In [11]:
dd=cfs.diff_abundance('Subject','Control','Patient')
In [12]:
dd2=cfs.diff_abundance('Subject','Control','Patient')
In [13]:
print('WITHOUT resetting the random seed:')
print('%d different bacteria between the two function calls' % len(set(dd.feature_metadata.index)^set(dd2.feature_metadata.index)))
In [14]:
dd=cfs.diff_abundance('Subject','Control','Patient', random_seed=2018)
In [15]:
dd2=cfs.diff_abundance('Subject','Control','Patient', random_seed=2018)
In [16]:
print('WITH resetting the random seed:')
print('%d different bacteria between the two function calls' % len(set(dd.feature_metadata.index)^set(dd2.feature_metadata.index)))
In [17]:
dd2=cfs.diff_abundance('Subject','Control','Patient', alpha=0.25)
In [18]:
print('alpha=0.1:\n%s\n\nalpha=0.25\n%s' % (dd, dd2))
Instead of ds-FDR, we can opt for Benjaminy-Hochberg FDR. However, in most cases, it will be more conservative (detect less significant bacteria), due to the discreteness and sparsity of typical microbiome data.
All FDR methods are listed in the diff_abundance API doc.
In [19]:
dd2=cfs.diff_abundance('Subject','Control','Patient', fdr_method='bhfdr', random_seed=2018)
In [20]:
print('dsFDR:\n%s\n\nBH-FDR\n%s' % (dd, dd2))
Instead of the default (rank transform), we can use log2 transform (transform='log2'
), or skip the transformation at all (transform=None
).
All transform options are listed in the diff_abundance
API doc.
We recommend using the default rank transform in order to reduce the effect of outliers.
In [21]:
dd2=cfs.diff_abundance('Subject','Control','Patient', transform=None, random_seed=2018)
In [22]:
print('rankt transform:\n%s\n\nno transform\n%s' % (dd, dd2))
In [ ]: