Setup

import calour



In [1]:

    
import calour as ca









    



/Users/amnon/miniconda3/envs/calour/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

let's set to print messages of INFO level from Calour

Calour uses Python's builtin logging module to print out logging messages. By default the logging level is set to WARNING. Let's change it to INFO for the purpose of this tutorial, so we get more detailed information about the function outputs.



In [2]:

    
ca.set_log_level('INFO')

and we want interactive plots inside the notebook



In [3]:

    
%matplotlib notebook

Moving picture data set

This data set is from: Caporaso JG, Lauber CL, Costello EK, Berg-Lyons D, Gonzalez A, Stombaugh J, Knights D, Gajer P, Ravel J, Fierer N, et al. (2011) Moving pictures of the human microbiome. Genome Biology, 12, R50.

The raw data are reproccessed with deblur method, which is published in mSystem

Load the data

We use read_amplicon to read the data into AmpliconExperiment class. This class has some amplicon experiment specific functions such as filter_taxonomy etc.

Useful parameters are:

biom table name
mapping file name (can be None if no sample metadata available)
min_reads - the minimal number of reads per sample in order to keep it
normalize - the depth to normalize each sample to (note it is not rarefaction butn normalization to constant sum)



In [4]:

    
exp = ca.read_amplicon(data_file='data/moving_pic.biom', sample_metadata_file='data/moving_pic.sample.txt',
                       min_reads=1000, normalize=10000)









    



2018-03-04 12:41:21 INFO loaded 1968 samples, 7056 features
2018-03-04 12:41:21 INFO After filtering, 1967 remaining



In [5]:

    
exp









    Out[5]:





AmpliconExperiment ("moving_pic.biom") with 1967 samples, 7056 features

each `Experiment` has the following attributes:

`data` : a data sparse (or dense) 2D array

This stores the abundance information. Each row is a sample, each column a feature (i.e. OTU, gene, metabolites, etc)



In [6]:

    
exp.data









    Out[6]:





<1967x7056 sparse matrix of type '<class 'numpy.float64'>'
	with 130081 stored elements in Compressed Sparse Row format>

`sample_metadata` : a pandas dataframe with one row per sample

index is the SampleID (first column in the mapping file), matching the biom table sampleIDs



In [7]:

    
exp.sample_metadata.head(5)









    Out[7]:







  
    
      
      BarcodeSequence
      LinkerPrimerSequence
      DAYS_SINCE_EPOCH
      TARGET_SUBFRAGMENT
      ASSIGNED_FROM_GEO
      EXPERIMENT_CENTER
      TITLE
      COMMON_SAMPLE_SITE
      RUN_PREFIX
      HOST_COMMON_NAME
      ...
      KEY_SEQ
      BODY_PRODUCT
      AGE_IN_YEARS
      RUN_CENTER
      LIBRARY_CONSTRUCTION_PROTOCOL
      LATITUDE
      REGION
      HOST_INDIVIDUAL
      Description
      _calour_original_abundance
    
    
      #SampleID
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      L2S137.273277
      AATCAGTCTCGT
      GTGCCAGCMGCCGCGGTAA
      14223
      V4
      n
      CCME
      Moving_pictures_of_the_human_microbiome
      L_palm
      s_2_sequence
      human
      ...
      none
      UBERON:zone of skin of hand
      None
      CGS-GL
      Five primers, two for PCR and three for sequen...
      40.014986
      L2
      F4
      illumina_time_series
      10000.0
    
    
      L3S167.273782
      TACTTCGCTCGC
      GTGCCAGCMGCCGCGGTAA
      14526
      V4
      n
      CCME
      Moving_pictures_of_the_human_microbiome
      L_palm
      s_3_sequence
      None
      ...
      none
      UBERON:zone of skin of hand
      None
      CGS-GL
      Five primers, two for PCR and three for sequen...
      40.014986
      L3
      M3
      illumina_time_series
      10000.0
    
    
      L4S102.275028
      GCAATAGCTGCT
      GTGCCAGCMGCCGCGGTAA
      14310
      V4
      n
      CCME
      Moving_pictures_of_the_human_microbiome
      R_palm
      s_4_sequence
      None
      ...
      none
      UBERON:zone of skin of hand
      None
      CGS-GL
      Five primers, two for PCR and three for sequen...
      40.014986
      L4
      M3
      illumina_time_series
      10000.0
    
    
      L6S243.274842
      ACGAGTGCTATC
      GTGCCAGCMGCCGCGGTAA
      14507
      V4
      n
      CCME
      Moving_pictures_of_the_human_microbiome
      Tongue
      s_6_sequence
      None
      ...
      none
      UBERON:tongue
      None
      CGS-GL
      Five primers, two for PCR and three for sequen...
      40.014986
      L6
      M3
      illumina_time_series
      10000.0
    
    
      L3S240.274358
      AGTACGCTCGAG
      GTGCCAGCMGCCGCGGTAA
      14178
      V4
      n
      CCME
      Moving_pictures_of_the_human_microbiome
      R_palm
      s_3_sequence
      human
      ...
      none
      UBERON:zone of skin of hand
      None
      CGS-GL
      Five primers, two for PCR and three for sequen...
      40.014986
      L3
      F4
      illumina_time_series
      10000.0
    
  

5 rows × 48 columns

`feature_metadata` : a pandas dataframe with one row per feature (i.e. sOTU)

For deblurred data set, index is the actual sequence, matching the biom table

Additional properties are loaded from biom table observation metadata (i.e. taxonomy)



In [8]:

    
exp.feature_metadata.head(5)









    Out[8]:







  
    
      
      taxonomy
    
  
  
    
      TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCTTAGGCGGCAAATTAAGTTAGTGGTTAAATAGTTCGGCTCAACCGGATTTCGCCATTAAAACTGATATGCTAGAGATTAAACGAGGTAGGCGGAATAAGTTA
      k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o_...
    
    
      TACAGAGGTCTCAAGCGTTGTTCGGAATCACTGGGCGTAAAGCGTGCGTAGGCTGTTTCGTAAGTCGTGTGTGAAAGGTGCGGGCTCAACCCGCGGACGGCACATGATACTGCGAGACTAGAGTAATGGAGGGGGAACCGGAATTCTCG
      k__Bacteria;p__Verrucomicrobia;c__Verrucomicro...
    
    
      TACGTAGGGTGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTTGTTGCGTCGGGAGTGAAAACTCAGGGCTTAACCCTGAGCCTGCTTCCGATACGGGCAGACTAGAGGTATGCAGGGGAGAACGGAATTCCTG
      k__Bacteria;p__Actinobacteria;c__Actinobacteri...
    
    
      TACGTATGGAGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGTGTAGGTGGCCATGCAAGTCAGAAGTGAAAATCCGGGGCTCAACCCCGGAACTGCTTTTGAAACTGCAGGGCTAGAGTGCAGGAGGGGCAAGTGGAATTCCTA
      k__Bacteria;p__Firmicutes;c__Clostridia;o__Clo...
    
    
      TACGTAGGTGGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGCTGATTAAGTCAGCGGTGAAAGGTAGCAGCTTAACTGTTTTACATGCCGTTGATACTGGTTAGCTTGAGTTGACAGAAGGCAGATAGAATTCCT
      k__Bacteria;p__Bacteroidetes;c__Cytophagia;o__...

Sort the samples

sort by the time field, COMMON_SAMPLE_SITE (to get separation between left/right hand), and subject id



In [9]:

    
exp.sample_metadata.COMMON_SAMPLE_SITE.value_counts()









    Out[9]:





Tongue    508
L_palm    499
R_palm    493
feces     467
Name: COMMON_SAMPLE_SITE, dtype: int64



In [10]:

    
exp=exp.sort_samples('DAYS_SINCE_EXPERIMENT_START').sort_samples('COMMON_SAMPLE_SITE').sort_samples('HOST_SUBJECT_ID')



In [11]:

    
exp.feature_metadata.head()









    Out[11]:







  
    
      
      taxonomy
    
  
  
    
      TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCTTAGGCGGCAAATTAAGTTAGTGGTTAAATAGTTCGGCTCAACCGGATTTCGCCATTAAAACTGATATGCTAGAGATTAAACGAGGTAGGCGGAATAAGTTA
      k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o_...
    
    
      TACAGAGGTCTCAAGCGTTGTTCGGAATCACTGGGCGTAAAGCGTGCGTAGGCTGTTTCGTAAGTCGTGTGTGAAAGGTGCGGGCTCAACCCGCGGACGGCACATGATACTGCGAGACTAGAGTAATGGAGGGGGAACCGGAATTCTCG
      k__Bacteria;p__Verrucomicrobia;c__Verrucomicro...
    
    
      TACGTAGGGTGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTTGTTGCGTCGGGAGTGAAAACTCAGGGCTTAACCCTGAGCCTGCTTCCGATACGGGCAGACTAGAGGTATGCAGGGGAGAACGGAATTCCTG
      k__Bacteria;p__Actinobacteria;c__Actinobacteri...
    
    
      TACGTATGGAGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGTGTAGGTGGCCATGCAAGTCAGAAGTGAAAATCCGGGGCTCAACCCCGGAACTGCTTTTGAAACTGCAGGGCTAGAGTGCAGGAGGGGCAAGTGGAATTCCTA
      k__Bacteria;p__Firmicutes;c__Clostridia;o__Clo...
    
    
      TACGTAGGTGGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGCTGATTAAGTCAGCGGTGAAAGGTAGCAGCTTAACTGTTTTACATGCCGTTGATACTGGTTAGCTTGAGTTGACAGAAGGCAGATAGAATTCCT
      k__Bacteria;p__Bacteroidetes;c__Cytophagia;o__...

divide to per sample-type experiments



In [12]:

    
exp.sample_metadata['BODY_PRODUCT'].value_counts()









    Out[12]:





UBERON:zone of skin of hand    992
UBERON:tongue                  508
UBERON:feces                   467
Name: BODY_PRODUCT, dtype: int64

let's split the dataset:



In [13]:

    
feces = exp.filter_samples('BODY_PRODUCT', 'UBERON:feces')
tongue = exp.filter_samples('BODY_PRODUCT', 'UBERON:tongue')
# just to have fun, let's negate and use multiple values
hand = exp.filter_samples('BODY_PRODUCT', ['UBERON:feces', 'UBERON:tongue'], negate=True )

NOTE: The data array and sample and feature metadata are always synchronized to the same order for all the manipulations (filtering, sorting, transforming, etc.) done on Experiment object.



In [14]:

    
feces









    Out[14]:





AmpliconExperiment ("moving_pic.biom") with 467 samples, 7056 features



In [15]:

    
tongue









    Out[15]:





AmpliconExperiment ("moving_pic.biom") with 508 samples, 7056 features



In [16]:

    
hand









    Out[16]:





AmpliconExperiment ("moving_pic.biom") with 992 samples, 7056 features

Note that filtering samples does not change or get rid of features not present in the set of samples.

get rid of features with <50 reads total over all samples

(Where we have 10000 reads/sample after normalization in the load)



In [17]:

    
tt=exp.filter_abundance(50)









    



2018-03-04 12:41:32 INFO After filtering, 2512 remaining

And plot the data naively

We plot by sorting samples according to body site, with sample color bars for the subject id.

We use the plot_sort function which does a sort on a (set) of fields on the fly and then plots. We can specify the type of interactive heatmap using the gui with following options:

'jupyter' plots an interactive plot inside the notebook
'qt5' opens the plot in an interactive qt5 window
None just creates the static matplotlib figure



In [18]:

    
f = tt.sort_samples('BODY_PRODUCT').plot(gui='jupyter', feature_field='taxonomy', barx_fields=['HOST_SUBJECT_ID'], clim=[0, 1000])



In [19]:

    
# you can save figures
f.save_figure('fig.pdf')

Sort bacteria by the taxonomy (just because we can :D)



In [19]:

    
zz=tt.sort_taxonomy()
zz.sort_samples('BODY_PRODUCT').plot(gui='jupyter',barx_fields=['HOST_SUBJECT_ID'], clim=[0, 1000])









    














    











    





 
 










    





 
 










    





 
 










    





 
 










    





 
 










    





 
 










    Out[19]:





<calour.heatmap.plotgui_jupyter.PlotGUI_Jupyter at 0x110e256a0>

Sort bacteria by clustering

So similar behaving bacteria will be close to each other We also remove bacteria with < 50 reads total, so the heatmap will be of the more interesting bacteria.



In [20]:

    
ttt=tt.cluster_features()









    



2018-03-04 12:41:43 INFO After filtering, 2512 remaining



In [21]:

    
f=ttt.sort_samples('BODY_PRODUCT').plot(gui='jupyter',barx_fields=['HOST_SUBJECT_ID','COMMON_SAMPLE_SITE'], clim=[0,1000])

This is the same plot but focused on a specific region that is interesting:



In [22]:

    
f=ttt.sort_samples('BODY_PRODUCT').plot(gui='jupyter',barx_fields=['HOST_SUBJECT_ID','COMMON_SAMPLE_SITE'],
                rect=[-0.5, 1966.5, 2511.5, 2354.5],
                clim=[0,1000], feature_field=None)

We can see there is a set of sOTUs showing up in together in the individual of "M3" sporadically (near the bottom of the heatmap above). This behavior is difficult to see in a naive heatmap without clustering and sorting the features.

find features correlated with time in the individule of "M3"



In [23]:

    
tt = feces.filter_samples('HOST_SUBJECT_ID','M3')
tt=tt.cluster_features(50)









    



2018-03-04 12:42:01 INFO After filtering, 493 remaining



In [24]:

    
dd = tt.correlation('DAYS_SINCE_EXPERIMENT_START', random_seed=2018)









    



2018-03-04 12:42:02 INFO After filtering, 493 remaining
2018-03-04 12:42:02 INFO method spearman for field DAYS_SINCE_EXPERIMENT_START. Positive correlated features : 61. Negative correlated features : 191. total 252

We get 259 features with significant correlation following FDR control.

Note features are sorted by the effect size (biggest/smallest correlation is top/bottom)



In [25]:

    
dd.plot(feature_field=None, gui='jupyter')









    














    











    





 
 










    





 
 










    





 
 










    





 
 










    





 
 










    





 
 










    Out[25]:





<calour.heatmap.plotgui_jupyter.PlotGUI_Jupyter at 0x1a191b7198>

We can also sort by the center of mass



In [26]:

    
ttt=tt.filter_prevalence(0.3)









    



2018-03-04 12:42:05 INFO After filtering, 76 remaining



In [27]:

    
ttt=ttt.sort_centroid()



In [28]:

    
ttt.sample_metadata.HOST_SUBJECT_ID.value_counts()









    Out[28]:





M3    336
Name: HOST_SUBJECT_ID, dtype: int64



In [29]:

    
f=ttt.plot(gui='jupyter',feature_field=None, clim=[0,1000])

Some more analyses

We can sort the features based on abundance in a specific field/value group. For example, let's sort the features by their abundances in the individual "M3" and then look how abundant those bacteria are in the individual "F4":



In [30]:

    
tt = feces.sort_abundance({'HOST_SUBJECT_ID': ['M3']})



In [31]:

    
tt.plot(sample_field='HOST_SUBJECT_ID', feature_field=None, gui='jupyter')









    














    











    





 
 










    





 
 










    





 
 










    





 
 










    





 
 










    





 
 










    Out[31]:





<calour.heatmap.plotgui_jupyter.PlotGUI_Jupyter at 0x1a191bb9e8>

we can find the sOTUs that are different between left and right hands



In [32]:

    
m3s=hand.filter_samples('HOST_SUBJECT_ID','M3')
m3s=m3s.normalize_compositional()
m3s=m3s.sort_samples('DAYS_SINCE_EPOCH')

dd=m3s.diff_abundance('COMMON_SAMPLE_SITE','L_palm', random_seed=2018)

dd.sort_samples('COMMON_SAMPLE_SITE').plot(sample_field='COMMON_SAMPLE_SITE', gui='jupyter', title='Host M3')









    



2018-03-04 12:42:20 INFO After filtering, 2 remaining
2018-03-04 12:42:20 INFO ignoring 2 features
2018-03-04 12:42:20 INFO After filtering, 5247 remaining
2018-03-04 12:42:21 INFO 365 samples with value 1 (['L_palm'])
2018-03-04 12:42:23 INFO method meandiff. number of higher in ['L_palm'] : 3. number of higher in None : 843. total 846






    














    











    





 
 










    





 
 










    





 
 










    





 
 










    





 
 










    





 
 










    Out[32]:





<calour.heatmap.plotgui_jupyter.PlotGUI_Jupyter at 0x1a1ba1f978>



In [33]:

    
m3s=hand.filter_samples('HOST_SUBJECT_ID','F4')

m3s=m3s.sort_samples('DAYS_SINCE_EPOCH')

dd=m3s.diff_abundance('COMMON_SAMPLE_SITE','L_palm', random_seed=2018)

dd.sort_samples('COMMON_SAMPLE_SITE').plot(sample_field='COMMON_SAMPLE_SITE', gui='jupyter')









    



2018-03-04 12:42:24 INFO After filtering, 1892 remaining
2018-03-04 12:42:24 INFO 134 samples with value 1 (['L_palm'])
2018-03-04 12:42:25 INFO method meandiff. number of higher in ['L_palm'] : 183. number of higher in None : 4. total 187






    














    











    





 
 










    





 
 










    





 
 










    





 
 










    





 
 










    





 
 










    Out[33]:





<calour.heatmap.plotgui_jupyter.PlotGUI_Jupyter at 0x1a1c111d68>

Conclusions

We can identify the taxa that colonize and persist, the ones that disappear forever in the middle of the period, and the ones that stochastically come and go, which is also reported in the original paper.
In addition, with convenient exploration of the data in heatmap, we find a set of sOTUs appear together in a group. This set is enriched with skin bacteria. We can hypothesize that this set is likely from skin and is a result of sample collecting protocol.



In [ ]:

	BarcodeSequence	LinkerPrimerSequence	DAYS_SINCE_EPOCH	TARGET_SUBFRAGMENT	ASSIGNED_FROM_GEO	EXPERIMENT_CENTER	TITLE	COMMON_SAMPLE_SITE	RUN_PREFIX	HOST_COMMON_NAME	...	KEY_SEQ	BODY_PRODUCT	AGE_IN_YEARS	RUN_CENTER	LIBRARY_CONSTRUCTION_PROTOCOL	LATITUDE	REGION	HOST_INDIVIDUAL	Description	_calour_original_abundance
#SampleID
L2S137.273277	AATCAGTCTCGT	GTGCCAGCMGCCGCGGTAA	14223	V4	n	CCME	Moving_pictures_of_the_human_microbiome	L_palm	s_2_sequence	human	...	none	UBERON:zone of skin of hand	None	CGS-GL	Five primers, two for PCR and three for sequen...	40.014986	L2	F4	illumina_time_series	10000.0
L3S167.273782	TACTTCGCTCGC	GTGCCAGCMGCCGCGGTAA	14526	V4	n	CCME	Moving_pictures_of_the_human_microbiome	L_palm	s_3_sequence	None	...	none	UBERON:zone of skin of hand	None	CGS-GL	Five primers, two for PCR and three for sequen...	40.014986	L3	M3	illumina_time_series	10000.0
L4S102.275028	GCAATAGCTGCT	GTGCCAGCMGCCGCGGTAA	14310	V4	n	CCME	Moving_pictures_of_the_human_microbiome	R_palm	s_4_sequence	None	...	none	UBERON:zone of skin of hand	None	CGS-GL	Five primers, two for PCR and three for sequen...	40.014986	L4	M3	illumina_time_series	10000.0
L6S243.274842	ACGAGTGCTATC	GTGCCAGCMGCCGCGGTAA	14507	V4	n	CCME	Moving_pictures_of_the_human_microbiome	Tongue	s_6_sequence	None	...	none	UBERON:tongue	None	CGS-GL	Five primers, two for PCR and three for sequen...	40.014986	L6	M3	illumina_time_series	10000.0
L3S240.274358	AGTACGCTCGAG	GTGCCAGCMGCCGCGGTAA	14178	V4	n	CCME	Moving_pictures_of_the_human_microbiome	R_palm	s_3_sequence	human	...	none	UBERON:zone of skin of hand	None	CGS-GL	Five primers, two for PCR and three for sequen...	40.014986	L3	F4	illumina_time_series	10000.0

	taxonomy
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCTTAGGCGGCAAATTAAGTTAGTGGTTAAATAGTTCGGCTCAACCGGATTTCGCCATTAAAACTGATATGCTAGAGATTAAACGAGGTAGGCGGAATAAGTTA	k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o_...
TACAGAGGTCTCAAGCGTTGTTCGGAATCACTGGGCGTAAAGCGTGCGTAGGCTGTTTCGTAAGTCGTGTGTGAAAGGTGCGGGCTCAACCCGCGGACGGCACATGATACTGCGAGACTAGAGTAATGGAGGGGGAACCGGAATTCTCG	k__Bacteria;p__Verrucomicrobia;c__Verrucomicro...
TACGTAGGGTGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTTGTTGCGTCGGGAGTGAAAACTCAGGGCTTAACCCTGAGCCTGCTTCCGATACGGGCAGACTAGAGGTATGCAGGGGAGAACGGAATTCCTG	k__Bacteria;p__Actinobacteria;c__Actinobacteri...
TACGTATGGAGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGTGTAGGTGGCCATGCAAGTCAGAAGTGAAAATCCGGGGCTCAACCCCGGAACTGCTTTTGAAACTGCAGGGCTAGAGTGCAGGAGGGGCAAGTGGAATTCCTA	k__Bacteria;p__Firmicutes;c__Clostridia;o__Clo...
TACGTAGGTGGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGCTGATTAAGTCAGCGGTGAAAGGTAGCAGCTTAACTGTTTTACATGCCGTTGATACTGGTTAGCTTGAGTTGACAGAAGGCAGATAGAATTCCT	k__Bacteria;p__Bacteroidetes;c__Cytophagia;o__...