Setup

import calour


In [1]:
import calour as ca


/Users/amnon/miniconda3/envs/calour/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

let's set to print messages of INFO level from Calour

Calour uses Python's builtin logging module to print out logging messages. By default the logging level is set to WARNING. Let's change it to INFO for the purpose of this tutorial, so we get more detailed information about the function outputs.


In [2]:
ca.set_log_level('INFO')

and we want interactive plots inside the notebook


In [3]:
%matplotlib notebook

Moving picture data set

This data set is from: Caporaso JG, Lauber CL, Costello EK, Berg-Lyons D, Gonzalez A, Stombaugh J, Knights D, Gajer P, Ravel J, Fierer N, et al. (2011) Moving pictures of the human microbiome. Genome Biology, 12, R50.

The raw data are reproccessed with deblur method, which is published in mSystem

Load the data

We use read_amplicon to read the data into AmpliconExperiment class. This class has some amplicon experiment specific functions such as filter_taxonomy etc.

Useful parameters are:

  • biom table name
  • mapping file name (can be None if no sample metadata available)
  • min_reads - the minimal number of reads per sample in order to keep it
  • normalize - the depth to normalize each sample to (note it is not rarefaction butn normalization to constant sum)

In [4]:
exp = ca.read_amplicon(data_file='data/moving_pic.biom', sample_metadata_file='data/moving_pic.sample.txt',
                       min_reads=1000, normalize=10000)


2018-03-04 12:41:21 INFO loaded 1968 samples, 7056 features
2018-03-04 12:41:21 INFO After filtering, 1967 remaining

In [5]:
exp


Out[5]:
AmpliconExperiment ("moving_pic.biom") with 1967 samples, 7056 features

each Experiment has the following attributes:

data : a data sparse (or dense) 2D array

This stores the abundance information. Each row is a sample, each column a feature (i.e. OTU, gene, metabolites, etc)


In [6]:
exp.data


Out[6]:
<1967x7056 sparse matrix of type '<class 'numpy.float64'>'
	with 130081 stored elements in Compressed Sparse Row format>

sample_metadata : a pandas dataframe with one row per sample

index is the SampleID (first column in the mapping file), matching the biom table sampleIDs


In [7]:
exp.sample_metadata.head(5)


Out[7]:
BarcodeSequence LinkerPrimerSequence DAYS_SINCE_EPOCH TARGET_SUBFRAGMENT ASSIGNED_FROM_GEO EXPERIMENT_CENTER TITLE COMMON_SAMPLE_SITE RUN_PREFIX HOST_COMMON_NAME ... KEY_SEQ BODY_PRODUCT AGE_IN_YEARS RUN_CENTER LIBRARY_CONSTRUCTION_PROTOCOL LATITUDE REGION HOST_INDIVIDUAL Description _calour_original_abundance
#SampleID
L2S137.273277 AATCAGTCTCGT GTGCCAGCMGCCGCGGTAA 14223 V4 n CCME Moving_pictures_of_the_human_microbiome L_palm s_2_sequence human ... none UBERON:zone of skin of hand None CGS-GL Five primers, two for PCR and three for sequen... 40.014986 L2 F4 illumina_time_series 10000.0
L3S167.273782 TACTTCGCTCGC GTGCCAGCMGCCGCGGTAA 14526 V4 n CCME Moving_pictures_of_the_human_microbiome L_palm s_3_sequence None ... none UBERON:zone of skin of hand None CGS-GL Five primers, two for PCR and three for sequen... 40.014986 L3 M3 illumina_time_series 10000.0
L4S102.275028 GCAATAGCTGCT GTGCCAGCMGCCGCGGTAA 14310 V4 n CCME Moving_pictures_of_the_human_microbiome R_palm s_4_sequence None ... none UBERON:zone of skin of hand None CGS-GL Five primers, two for PCR and three for sequen... 40.014986 L4 M3 illumina_time_series 10000.0
L6S243.274842 ACGAGTGCTATC GTGCCAGCMGCCGCGGTAA 14507 V4 n CCME Moving_pictures_of_the_human_microbiome Tongue s_6_sequence None ... none UBERON:tongue None CGS-GL Five primers, two for PCR and three for sequen... 40.014986 L6 M3 illumina_time_series 10000.0
L3S240.274358 AGTACGCTCGAG GTGCCAGCMGCCGCGGTAA 14178 V4 n CCME Moving_pictures_of_the_human_microbiome R_palm s_3_sequence human ... none UBERON:zone of skin of hand None CGS-GL Five primers, two for PCR and three for sequen... 40.014986 L3 F4 illumina_time_series 10000.0

5 rows × 48 columns

feature_metadata : a pandas dataframe with one row per feature (i.e. sOTU)

For deblurred data set, index is the actual sequence, matching the biom table

Additional properties are loaded from biom table observation metadata (i.e. taxonomy)


In [8]:
exp.feature_metadata.head(5)


Out[8]:
taxonomy
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCTTAGGCGGCAAATTAAGTTAGTGGTTAAATAGTTCGGCTCAACCGGATTTCGCCATTAAAACTGATATGCTAGAGATTAAACGAGGTAGGCGGAATAAGTTA k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o_...
TACAGAGGTCTCAAGCGTTGTTCGGAATCACTGGGCGTAAAGCGTGCGTAGGCTGTTTCGTAAGTCGTGTGTGAAAGGTGCGGGCTCAACCCGCGGACGGCACATGATACTGCGAGACTAGAGTAATGGAGGGGGAACCGGAATTCTCG k__Bacteria;p__Verrucomicrobia;c__Verrucomicro...
TACGTAGGGTGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTTGTTGCGTCGGGAGTGAAAACTCAGGGCTTAACCCTGAGCCTGCTTCCGATACGGGCAGACTAGAGGTATGCAGGGGAGAACGGAATTCCTG k__Bacteria;p__Actinobacteria;c__Actinobacteri...
TACGTATGGAGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGTGTAGGTGGCCATGCAAGTCAGAAGTGAAAATCCGGGGCTCAACCCCGGAACTGCTTTTGAAACTGCAGGGCTAGAGTGCAGGAGGGGCAAGTGGAATTCCTA k__Bacteria;p__Firmicutes;c__Clostridia;o__Clo...
TACGTAGGTGGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGCTGATTAAGTCAGCGGTGAAAGGTAGCAGCTTAACTGTTTTACATGCCGTTGATACTGGTTAGCTTGAGTTGACAGAAGGCAGATAGAATTCCT k__Bacteria;p__Bacteroidetes;c__Cytophagia;o__...

Sort the samples

sort by the time field, COMMON_SAMPLE_SITE (to get separation between left/right hand), and subject id


In [9]:
exp.sample_metadata.COMMON_SAMPLE_SITE.value_counts()


Out[9]:
Tongue    508
L_palm    499
R_palm    493
feces     467
Name: COMMON_SAMPLE_SITE, dtype: int64

In [10]:
exp=exp.sort_samples('DAYS_SINCE_EXPERIMENT_START').sort_samples('COMMON_SAMPLE_SITE').sort_samples('HOST_SUBJECT_ID')

In [11]:
exp.feature_metadata.head()


Out[11]:
taxonomy
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCTTAGGCGGCAAATTAAGTTAGTGGTTAAATAGTTCGGCTCAACCGGATTTCGCCATTAAAACTGATATGCTAGAGATTAAACGAGGTAGGCGGAATAAGTTA k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o_...
TACAGAGGTCTCAAGCGTTGTTCGGAATCACTGGGCGTAAAGCGTGCGTAGGCTGTTTCGTAAGTCGTGTGTGAAAGGTGCGGGCTCAACCCGCGGACGGCACATGATACTGCGAGACTAGAGTAATGGAGGGGGAACCGGAATTCTCG k__Bacteria;p__Verrucomicrobia;c__Verrucomicro...
TACGTAGGGTGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTTGTTGCGTCGGGAGTGAAAACTCAGGGCTTAACCCTGAGCCTGCTTCCGATACGGGCAGACTAGAGGTATGCAGGGGAGAACGGAATTCCTG k__Bacteria;p__Actinobacteria;c__Actinobacteri...
TACGTATGGAGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGTGTAGGTGGCCATGCAAGTCAGAAGTGAAAATCCGGGGCTCAACCCCGGAACTGCTTTTGAAACTGCAGGGCTAGAGTGCAGGAGGGGCAAGTGGAATTCCTA k__Bacteria;p__Firmicutes;c__Clostridia;o__Clo...
TACGTAGGTGGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGCTGATTAAGTCAGCGGTGAAAGGTAGCAGCTTAACTGTTTTACATGCCGTTGATACTGGTTAGCTTGAGTTGACAGAAGGCAGATAGAATTCCT k__Bacteria;p__Bacteroidetes;c__Cytophagia;o__...

divide to per sample-type experiments


In [12]:
exp.sample_metadata['BODY_PRODUCT'].value_counts()


Out[12]:
UBERON:zone of skin of hand    992
UBERON:tongue                  508
UBERON:feces                   467
Name: BODY_PRODUCT, dtype: int64

let's split the dataset:


In [13]:
feces = exp.filter_samples('BODY_PRODUCT', 'UBERON:feces')
tongue = exp.filter_samples('BODY_PRODUCT', 'UBERON:tongue')
# just to have fun, let's negate and use multiple values
hand = exp.filter_samples('BODY_PRODUCT', ['UBERON:feces', 'UBERON:tongue'], negate=True )

NOTE: The data array and sample and feature metadata are always synchronized to the same order for all the manipulations (filtering, sorting, transforming, etc.) done on Experiment object.


In [14]:
feces


Out[14]:
AmpliconExperiment ("moving_pic.biom") with 467 samples, 7056 features

In [15]:
tongue


Out[15]:
AmpliconExperiment ("moving_pic.biom") with 508 samples, 7056 features

In [16]:
hand


Out[16]:
AmpliconExperiment ("moving_pic.biom") with 992 samples, 7056 features

Note that filtering samples does not change or get rid of features not present in the set of samples.

get rid of features with <50 reads total over all samples

(Where we have 10000 reads/sample after normalization in the load)


In [17]:
tt=exp.filter_abundance(50)


2018-03-04 12:41:32 INFO After filtering, 2512 remaining

And plot the data naively

We plot by sorting samples according to body site, with sample color bars for the subject id.

We use the plot_sort function which does a sort on a (set) of fields on the fly and then plots. We can specify the type of interactive heatmap using the gui with following options:

  • 'jupyter' plots an interactive plot inside the notebook
  • 'qt5' opens the plot in an interactive qt5 window
  • None just creates the static matplotlib figure

In [18]:
f = tt.sort_samples('BODY_PRODUCT').plot(gui='jupyter', feature_field='taxonomy', barx_fields=['HOST_SUBJECT_ID'], clim=[0, 1000])



In [19]:
# you can save figures
f.save_figure('fig.pdf')

Sort bacteria by the taxonomy (just because we can :D)


In [19]:
zz=tt.sort_taxonomy()
zz.sort_samples('BODY_PRODUCT').plot(gui='jupyter',barx_fields=['HOST_SUBJECT_ID'], clim=[0, 1000])


Out[19]:
<calour.heatmap.plotgui_jupyter.PlotGUI_Jupyter at 0x110e256a0>

Sort bacteria by clustering

So similar behaving bacteria will be close to each other We also remove bacteria with < 50 reads total, so the heatmap will be of the more interesting bacteria.


In [20]:
ttt=tt.cluster_features()


2018-03-04 12:41:43 INFO After filtering, 2512 remaining

In [21]:
f=ttt.sort_samples('BODY_PRODUCT').plot(gui='jupyter',barx_fields=['HOST_SUBJECT_ID','COMMON_SAMPLE_SITE'], clim=[0,1000])


This is the same plot but focused on a specific region that is interesting:


In [22]:
f=ttt.sort_samples('BODY_PRODUCT').plot(gui='jupyter',barx_fields=['HOST_SUBJECT_ID','COMMON_SAMPLE_SITE'],
                rect=[-0.5, 1966.5, 2511.5, 2354.5],
                clim=[0,1000], feature_field=None)


We can see there is a set of sOTUs showing up in together in the individual of "M3" sporadically (near the bottom of the heatmap above). This behavior is difficult to see in a naive heatmap without clustering and sorting the features.

find features correlated with time in the individule of "M3"


In [23]:
tt = feces.filter_samples('HOST_SUBJECT_ID','M3')
tt=tt.cluster_features(50)


2018-03-04 12:42:01 INFO After filtering, 493 remaining

In [24]:
dd = tt.correlation('DAYS_SINCE_EXPERIMENT_START', random_seed=2018)


2018-03-04 12:42:02 INFO After filtering, 493 remaining
2018-03-04 12:42:02 INFO method spearman for field DAYS_SINCE_EXPERIMENT_START. Positive correlated features : 61. Negative correlated features : 191. total 252

We get 259 features with significant correlation following FDR control.

Note features are sorted by the effect size (biggest/smallest correlation is top/bottom)


In [25]:
dd.plot(feature_field=None, gui='jupyter')


Out[25]:
<calour.heatmap.plotgui_jupyter.PlotGUI_Jupyter at 0x1a191b7198>

We can also sort by the center of mass


In [26]:
ttt=tt.filter_prevalence(0.3)


2018-03-04 12:42:05 INFO After filtering, 76 remaining

In [27]:
ttt=ttt.sort_centroid()

In [28]:
ttt.sample_metadata.HOST_SUBJECT_ID.value_counts()


Out[28]:
M3    336
Name: HOST_SUBJECT_ID, dtype: int64

In [29]:
f=ttt.plot(gui='jupyter',feature_field=None, clim=[0,1000])


Some more analyses

We can sort the features based on abundance in a specific field/value group. For example, let's sort the features by their abundances in the individual "M3" and then look how abundant those bacteria are in the individual "F4":


In [30]:
tt = feces.sort_abundance({'HOST_SUBJECT_ID': ['M3']})

In [31]:
tt.plot(sample_field='HOST_SUBJECT_ID', feature_field=None, gui='jupyter')


Out[31]:
<calour.heatmap.plotgui_jupyter.PlotGUI_Jupyter at 0x1a191bb9e8>

we can find the sOTUs that are different between left and right hands


In [32]:
m3s=hand.filter_samples('HOST_SUBJECT_ID','M3')
m3s=m3s.normalize_compositional()
m3s=m3s.sort_samples('DAYS_SINCE_EPOCH')

dd=m3s.diff_abundance('COMMON_SAMPLE_SITE','L_palm', random_seed=2018)

dd.sort_samples('COMMON_SAMPLE_SITE').plot(sample_field='COMMON_SAMPLE_SITE', gui='jupyter', title='Host M3')


2018-03-04 12:42:20 INFO After filtering, 2 remaining
2018-03-04 12:42:20 INFO ignoring 2 features
2018-03-04 12:42:20 INFO After filtering, 5247 remaining
2018-03-04 12:42:21 INFO 365 samples with value 1 (['L_palm'])
2018-03-04 12:42:23 INFO method meandiff. number of higher in ['L_palm'] : 3. number of higher in None : 843. total 846
Out[32]:
<calour.heatmap.plotgui_jupyter.PlotGUI_Jupyter at 0x1a1ba1f978>

In [33]:
m3s=hand.filter_samples('HOST_SUBJECT_ID','F4')

m3s=m3s.sort_samples('DAYS_SINCE_EPOCH')

dd=m3s.diff_abundance('COMMON_SAMPLE_SITE','L_palm', random_seed=2018)

dd.sort_samples('COMMON_SAMPLE_SITE').plot(sample_field='COMMON_SAMPLE_SITE', gui='jupyter')


2018-03-04 12:42:24 INFO After filtering, 1892 remaining
2018-03-04 12:42:24 INFO 134 samples with value 1 (['L_palm'])
2018-03-04 12:42:25 INFO method meandiff. number of higher in ['L_palm'] : 183. number of higher in None : 4. total 187
Out[33]:
<calour.heatmap.plotgui_jupyter.PlotGUI_Jupyter at 0x1a1c111d68>

Conclusions

  • We can identify the taxa that colonize and persist, the ones that disappear forever in the middle of the period, and the ones that stochastically come and go, which is also reported in the original paper.
  • In addition, with convenient exploration of the data in heatmap, we find a set of sOTUs appear together in a group. This set is enriched with skin bacteria. We can hypothesize that this set is likely from skin and is a result of sample collecting protocol.

In [ ]: