In [1]:
import calour as ca
ca.set_log_level(11)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
We use the chronic fatigue syndrome data from:
Giloteaux, L., Goodrich, J.K., Walters, W.A., Levine, S.M., Ley, R.E. and Hanson, M.R., 2016.
Reduced diversity and altered composition of the gut microbiome in individuals with myalgic encephalomyelitis/chronic fatigue syndrome.
Microbiome, 4(1), p.30.
For each sample we normalize to 10000 reads/sample.
This is done by dividing the number of reads of each feature in a sample by the total sum of reads (over all features) in that sample, and then multiplying by the desired number of reads (i.e. 10000).
After this normalization, the sum of (normalized) reads in each sample will be 10000.
This is different from rarefaction, since each feature can have a non-integer number of reads, and less information is thrown away. However, you need to be careful not to introduce a bias from the original number of reads per sample (mostly in binary/presence-absence methods). dsFDR works fine with this normalization.
Note that we also throw away samples with fewer than min_reads=1000 reads total (before normalization). This reduces the discretization effect in samples with a low number of reads.
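As a minimal sketch of what normalize=10000 does (using plain numpy; the counts matrix and values here are made up for illustration, not taken from the dataset):

import numpy as np

counts = np.array([[ 7., 11., 12.],   # sample 1: 30 reads total
                   [ 5., 10., 35.]])  # sample 2: 50 reads total

total = 10000
# divide each sample (row) by its total read count, then scale to `total`
tss = counts / counts.sum(axis=1, keepdims=True) * total

print(tss.sum(axis=1))  # [10000. 10000.] - constant sum per sample
print(tss[0])           # [2333.33 3666.67 4000.] - non-integer reads, unlike rarefaction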
In [2]:
cfs_normalized = ca.read_amplicon('data/chronic-fatigue-syndrome.biom',
                                  'data/chronic-fatigue-syndrome.sample.txt',
                                  normalize=10000, min_reads=1000)
In [3]:
print(cfs_normalized)
The sum of reads per sample should now be 10000:
In [4]:
cfs_normalized.get_data(sparse=False).sum(axis=1)
Out[4]:
The original number of reads per sample (before normalization) is stored in the sample_metadata table in the field "_calour_original_abundance"
In [5]:
res = plt.hist(cfs_normalized.sample_metadata['_calour_original_abundance'], 50)
plt.xlabel('original number of reads')
plt.ylabel('number of samples')
Out[5]:
We can load the data without normalizing the reads per sample by setting the parameter normalize=None.
This is not recommended for typical microbiome experiments, since the number of reads per sample is arbitrary and does not reflect the number of bacteria in the sample.
We still remove all samples with fewer than 1000 reads total.
In [6]:
cfs_not_normalized = ca.read_amplicon('data/chronic-fatigue-syndrome.biom',
                                      'data/chronic-fatigue-syndrome.sample.txt',
                                      normalize=None, min_reads=1000)
In [7]:
cfs_not_normalized.get_data(sparse=False).sum(axis=1)
Out[7]:
We can also normalize an already loaded experiment to a different total using the normalize() method:
In [8]:
tt = cfs_not_normalized.normalize(5000)
In [9]:
tt.get_data(sparse=False).sum(axis=1)
Out[9]:
normalize_compositional
In some cases, a plausible biological scenario is that a few bacteria have a very large number of reads. An increase in the frequency of such a bacterium will cause a decrease in the frequencies of all other bacteria (even if in reality their total number in the sample remains constant), since the data is normalized to a constant sum per sample.
Under the assumption that most bacteria do not change in total number between the samples, we can instead normalize to a constant sum while ignoring the set of high-frequency bacteria (see the sketch below).
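A minimal numpy sketch of this idea (calour's normalize_compositional implements the real version; the threshold, array names and values below are our own illustration):

import numpy as np

# toy data: rows = samples, columns = features; feature 0 is very high-frequency
data = np.array([[8000., 1200.,  800.],
                 [2000., 4800., 3200.]])

total = 10000
min_frac = 0.5  # illustrative cutoff for calling a feature "high-frequency"

freqs = data / data.sum(axis=1, keepdims=True)
keep = freqs.mean(axis=0) < min_frac   # features used for the scaling factor

# scale each sample so the kept features sum to `total`; the high-frequency
# feature is rescaled by the same factor but excluded from the sum
factor = total / data[:, keep].sum(axis=1, keepdims=True)
normed = data * factor

print(normed[:, keep].sum(axis=1))  # [10000. 10000.] - constant sum over kept features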
We will demonstrate using a synthetic example.
In the original dataset, a few tens of bacteria separate the healthy subjects from the sick:
In [10]:
dd = cfs_normalized.diff_abundance('Subject', 'Control', 'Patient', random_seed=2018)
In [11]:
# create a synthetic dataset: artificially spike feature 0 in all Control samples
tt = cfs_normalized.copy()
tt.sparse = False
tt.data[tt.sample_metadata['Subject'] == 'Control', 0] = 50000
# re-normalize to a constant sum after the spike
tt = tt.normalize(10000)
In [12]:
dd = tt.diff_abundance('Subject', 'Control', 'Patient', random_seed=2018)
We get more bacteria that appear higher in 'Patient', since feature 0 is now very high in the Controls and the data is TSS (total sum scaling) normalized, pushing all other Control features down.
In [13]:
yy = tt.normalize_compositional()
In [14]:
dd=yy.diff_abundance('Subject','Control','Patient', random_seed=2018)
So we reduced the inflation of false differentially abundant bacteria caused by the compositionality of the data.
normalize_by_subset_features
Sometimes we want to normalize while ignoring some features (say, ignoring all mitochondrial sequences), but we still want to keep these features in the data - just not use them in the normalization.
Note that the sum of reads per sample will not be constant (since samples also contain the ignored features); see the sketch below.
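The behavior can be sketched in plain numpy (normalize_by_subset_features does the equivalent internally; the arrays below are made up):

import numpy as np

# toy data: rows = samples; suppose feature 0 is, e.g., a mitochondrial sequence
data = np.array([[ 300., 70., 30.],
                 [1000., 20., 80.]])

ignore = np.array([True, False, False])  # features excluded from the scaling sum
total = 10000

# scale each sample so that the non-ignored features sum to `total`,
# but apply the factor to all features, so the ignored ones are kept
factor = total / data[:, ~ignore].sum(axis=1, keepdims=True)
normed = data * factor

print(normed[:, ~ignore].sum(axis=1))  # [10000. 10000.]
print(normed.sum(axis=1))              # > 10000 - ignored features still present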
Let's ignore the bacteria that don't have a good taxonomy assignment:
In [15]:
# sequences whose taxonomy string is too short to be a real assignment
bad_seqs = [cseq for cseq, ctax in cfs_not_normalized.feature_metadata['taxonomy'].items()
            if len(ctax) < 13]
In [16]:
tt = cfs_not_normalized.normalize_by_subset_features(bad_seqs, total=10000)
In [17]:
tt.get_data(sparse=False).sum(axis=1)
Out[17]: