In [1]:
import calour as ca
ca.set_log_level(11)
%matplotlib notebook
The chronic fatigue syndrome data, from:
Giloteaux, L., Goodrich, J.K., Walters, W.A., Levine, S.M., Ley, R.E. and Hanson, M.R., 2016.
Reduced diversity and altered composition of the gut microbiome in individuals with myalgic encephalomyelitis/chronic fatigue syndrome.
Microbiome, 4(1), p.30.
In [2]:
# drop samples with fewer than 1000 reads, then normalize each sample to 10000 reads
cfs = ca.read_amplicon('data/chronic-fatigue-syndrome.biom',
                       'data/chronic-fatigue-syndrome.sample.txt',
                       normalize=10000, min_reads=1000)
In [3]:
print(cfs)
The Moving Pictures dataset, from:
Caporaso, J.G., Lauber, C.L., Costello, E.K., Berg-Lyons, D., Gonzalez, A., Stombaugh, J., Knights, D., Gajer, P., Ravel, J., Fierer, N. and Gordon, J.I., 2011.
Moving pictures of the human microbiome.
Genome biology, 12(5), p.R50.
In [4]:
movpic = ca.read_amplicon('data/moving_pic.biom',
                          'data/moving_pic.sample.txt',
                          normalize=10000, min_reads=1000)
In [5]:
print(movpic)
Is the original data sorted by the Subject field?
In [6]:
print(cfs.sample_metadata['Subject'].is_monotonic_increasing)
In [7]:
cfs=cfs.sort_samples('Subject')
And is the new data sorted?
In [8]:
print(cfs.sample_metadata['Subject'].is_monotonic_increasing)
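For the Moving Pictures dataset, we want the data sorted by individual and, within each individual, by timepoint. To get this with two single-field sorts, we sort by the minor key (timepoint) first and then by the major key (individual); this relies on the second sort preserving the existing order of samples with equal keys (a stable sort).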
In [9]:
movpic=movpic.sort_samples('DAYS_SINCE_EXPERIMENT_START')
movpic=movpic.sort_samples('HOST_SUBJECT_ID')
In [10]:
print(movpic.sample_metadata['DAYS_SINCE_EXPERIMENT_START'].is_monotonic_increasing)
In [11]:
print(movpic.sample_metadata['HOST_SUBJECT_ID'].is_monotonic_increasing)
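To see why the order of the two sorts matters, here is a minimal pure-Python sketch of the same two-pass idea. It does not use calour; the sample records and their values are made up for illustration, and it relies only on the fact that Python's built-in sorted() is stable.

```python
# Toy sample records (hypothetical values, not taken from the dataset).
samples = [
    {"subject": "M3", "day": 2},
    {"subject": "F4", "day": 5},
    {"subject": "M3", "day": 0},
    {"subject": "F4", "day": 1},
]

# Pass 1: sort by the minor key (timepoint).
by_day = sorted(samples, key=lambda s: s["day"])

# Pass 2: sort by the major key (individual). sorted() is stable, so samples
# from the same subject keep the day ordering established in pass 1.
by_subject_then_day = sorted(by_day, key=lambda s: s["subject"])

order = [(s["subject"], s["day"]) for s in by_subject_then_day]
print(order)

# The day column as a whole is no longer increasing, only within each subject.
days = [day for _, day in order]
print(days == sorted(days))
```

In this toy example the subjects end up grouped together with days increasing inside each group, while the day column taken over all samples is not monotonic. The same can be expected of the timepoint field after the second sort above.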
Let's keep only samples from participant F4.
In [12]:
tt=movpic.filter_samples('HOST_SUBJECT_ID','F4')
print('* original:\n%s\n\n* filtered:\n%s' % (movpic, tt))
Now let's keep only skin and fecal samples.
In [13]:
print(movpic.sample_metadata['BODY_HABITAT'].unique())
In [14]:
yy=tt.filter_samples('BODY_HABITAT', ['UBERON:skin', 'UBERON:feces'])
print(yy)
Let's keep just the non-skin and non-feces samples.
In [15]:
yy=tt.filter_samples('BODY_HABITAT', ['UBERON:skin', 'UBERON:feces'], negate=True)
print(yy)
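The effect of negate can be sketched without calour. The helper below, filter_records, is a hypothetical stand-in written for this illustration (it is not calour's implementation), and the records and the "OTHER" habitat value are made up.

```python
def filter_records(records, field, values, negate=False):
    """Keep records whose `field` is in `values`.

    With negate=True the selection is inverted: keep the records
    whose `field` is NOT in `values`.
    """
    wanted = set(values)
    return [r for r in records if (r[field] in wanted) != negate]


# Toy metadata (hypothetical sample ids; "OTHER" is a placeholder habitat).
records = [
    {"id": "s1", "habitat": "UBERON:skin"},
    {"id": "s2", "habitat": "UBERON:feces"},
    {"id": "s3", "habitat": "OTHER"},
]

kept = filter_records(records, "habitat", ["UBERON:skin", "UBERON:feces"])
rest = filter_records(records, "habitat", ["UBERON:skin", "UBERON:feces"],
                      negate=True)

print([r["id"] for r in kept])
print([r["id"] for r in rest])
```

Every record lands in exactly one of the two results, so the plain and negated filters together partition the samples.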
filter_abundance: remove all features (bacteria) with fewer than 25 reads in total (summed over all samples, after normalization).
This is useful for getting rid of uninteresting features. Note that, unlike filtering based on the fraction of samples in which a feature is present (filter_prevalence), this method (filter_abundance) also keeps features that are present in only a small fraction of the samples but at high frequency.
In [16]:
tt=cfs.filter_abundance(25)
print('* original:\n%s\n\n* filtered:\n%s' % (cfs, tt))
In [17]:
# negate=True inverts the filter: keep only the features below the 25-read threshold
tt=cfs.filter_abundance(25, negate=True)
print('* original:\n%s\n\n* filtered:\n%s' % (cfs,tt))
In [18]:
# remove bacteria present in fewer than half of the samples
tt=cfs.filter_prevalence(0.5)
print('* original:\n%s\n\n* filtered:\n%s' % (cfs, tt))
In [19]:
# keep only high frequency bacteria (mean over all samples > 1%)
tt=cfs.filter_mean(0.01)
print('* original:\n%s\n\n* filtered:\n%s' % (cfs, tt))
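To make the difference between the three feature filters concrete, here is a small pure-Python sketch on a toy table. It only illustrates the ideas (total abundance, prevalence, mean abundance). The helper names, the counts, the "count > 0" presence rule and the mean threshold of 150 are all assumptions for this example, and calour's own cutoffs and scaling may differ.

```python
# Toy feature table: rows are samples, columns are features (hypothetical counts).
features = ["rare_everywhere", "bloom_in_one", "common"]
table = [
    [1, 0, 300],
    [1, 0, 250],
    [1, 0, 280],
    [1, 400, 310],
]

n_samples = len(table)
columns = list(zip(*table))  # one tuple of per-sample counts per feature


def keep_by_total(min_total):
    """Keep features whose count summed over all samples is >= min_total."""
    return [f for f, col in zip(features, columns) if sum(col) >= min_total]


def keep_by_prevalence(min_fraction):
    """Keep features present (count > 0) in >= min_fraction of the samples."""
    return [f for f, col in zip(features, columns)
            if sum(c > 0 for c in col) / n_samples >= min_fraction]


def keep_by_mean(min_mean):
    """Keep features whose mean count over all samples is >= min_mean."""
    return [f for f, col in zip(features, columns)
            if sum(col) / n_samples >= min_mean]


print(keep_by_total(25))        # total-abundance filter
print(keep_by_prevalence(0.5))  # prevalence filter
print(keep_by_mean(150))        # mean-abundance filter
```

Here the total-abundance filter keeps bloom_in_one (present in one sample only, but at high abundance) and drops rare_everywhere, while the prevalence filter does the opposite. Only common passes the mean filter.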