The raw LC-MS/MS data used for these analyses is publically available at https://gnps.ucsd.edu/ProteoSAFe/result.jsp?task=56b82a67d9884b0dbda32978281c5ebf&view=advanced_view
In [1]:
import calour as ca
ca.set_log_level('INFO')
%matplotlib notebook
We load the mzmine2 table and the corresponding sample metadata file (mapping file).
We also load the gnps cluster info file, which contains gnps ids for metabolites (features) in the table. This is not mandatory but enables interactive analysis of the metabolites in the table.
Each sample is normalized using TSS to 100% (so metabolite frequencies are relative)
sample IDs from the mapping file are cut in the first occurance of ".m", since sample ids in the mapping file are made of the mzmine2 sample ids + ".mzXML"
Calour can read several ms table formats: mzmine2, biom table, gnps exported tsv or generic csv (such as from openMS). For more information about reading mass-spec data into Calour, see the read_ms documentation here
In [2]:
exp = ca.read_ms(data_file='data/metabolomics-apnea.mzmine2.csv',
sample_metadata_file='data/metabolomics-apnea.sample.txt',
normalize=100, cut_sample_id_sep='.m',
gnps_file='data/metabolomics-apnea.gnps.tsv')
In [3]:
print(exp)
Below, we cluster highly correlated metabolites for visualization purposes. Then we sort the samples by 'age' (to visualize trends over time) and then sort by 'exposure_type' (to observe differences between treatment and control samples)
Sorting on equal valus retains the previous order, so samples will be sorted by exposure_type, and within each exposure_type will be sorted by age
In [4]:
expc = exp.cluster_features()
expc = expc.sort_samples(field='age').sort_samples(field='exposure_type')
In [5]:
# plotting the modified experiment object
f = expc.plot(sample_field='exposure_type', gui='jupyter', barx_fields=['age'], barx_label_kwargs={'size':5})
Observation 1. We see a stark shift in the molecular composition of gut metabolome between the ages of 10 and 10.5 weeks and the rest of the samples. This is because the samples collected at these ages (10 and 10.5 weeks) were baseline samples collected when mice were given regular chow. After that, the diet was switched to a fat-rich, Western type diet which causes this dramatic shift in the gut metabolome in both, control and treatment group.
In [6]:
expcf = expc.filter_samples(field = 'age', values=['10', '10.5'], negate = True)
print(expcf)
Using rank-mean test with dsFDR multiple hypothesis correction
We use the binarydata transform parameter since we want to detect metabolites with differences in their presence/absence rather than differences in the frequency in each sample.
We also set the random_seed since it is a random permutation based test, and we want it to replicate on reruns
In [7]:
dexp = expcf.diff_abundance(field = 'exposure_type', val1='Air', val2='IHH', random_seed=2018, transform='binarydata')
In [9]:
f=dexp.plot(sample_field='exposure_type', gui='jupyter', barx_fields=['age'], barx_label_kwargs={'size':5})
As a lot of differentially abundant metabolites are not identified, one might be interested in looking at just the identified features to begin with.
To do this, we filter the features (axis='f') based on the field 'gnps_name' (which contains the gnps derived name for the metabolite or nan if not found in gnps), and use select=None to remove the nan sample.
Note This only works if you supply the gnps cluster info file (gnps_file) when calling read_ms()
In [10]:
# filter features without IDs
expcff = expcf.filter_by_metadata(field='gnps_name', select=None, axis='f')
In [11]:
print('There are %d identified features in the experiment' %expcff.shape[1])
In [12]:
dexp = expcff.diff_abundance(field = 'exposure_type', val1='Air', val2='IHH', random_seed=2018,
transform='binarydata')
In [13]:
# plotting identified differentially abundant features
f = dexp.plot(sample_field='exposure_type', gui='jupyter', barx_fields=['age'], barx_label_kwargs={'size':5},
feature_field='gnps_name', yticklabel_kwargs={'size':7,'rotation': 0}, yticklabel_len=50,
yticks_max=None)
In [14]:
print(dexp.feature_metadata.gnps_name[dexp.feature_metadata._calour_direction=='IHH'].values)
In [15]:
print(dexp.feature_metadata.gnps_name[dexp.feature_metadata._calour_direction=='Air'].values)
Observation 2. We observed that this list of differentially abundant metabolites is enriched in bile acids (cholic acid, taurocholic acid, chenodeoxycholic acid, tauroursocholic acid etc.) and hormones (5-Androstene-3.beta.,16.beta.,17.alpha.-triol, 5.beta.-Pregnane-3.alpha.,17-diol-20-one) among other molecules. Looking at this, we can begin to hypothesize the possible pathways disrupted in the microbiome, (and hence the host) due to IHH. In this case, it appears that the downstream effects of IHH could be linked to alterations in bile acid pool and endocrine disruption.
Although we must note that these are spectral aligment based annotations (level 2 annotations) according to metabolomics reporting standards (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3772505/) and should be confirmed by comparision with pure analytical standards (level 1 annotation).
In [16]:
# just keep named metabolites
expff = exp.filter_by_metadata(field='gnps_name', select=None, axis='f')
In [17]:
dexp = expff.diff_abundance('age', val1=['10'], transform='binarydata', random_seed=2018)
In [18]:
dexp = dexp.sort_samples('age')
f = dexp.plot(sample_field='age',barx_fields=['age'], feature_field='gnps_name', gui='jupyter',
yticklabel_kwargs={'rotation':0, 'size':7}, yticklabel_len=50)
In [19]:
print(dexp.feature_metadata.gnps_name[dexp.feature_metadata._calour_direction=='10'].values)
In [20]:
print(dexp.feature_metadata.gnps_name[dexp.feature_metadata._calour_direction=='NOT 10'].values)
Observation 3. Looking at the heatmap and these lists, we can quickly see that as mice are switched to a high-fat diet, there is an large increase in bile acids in the gut metabolome, both host-derived (eg. Taurocholic acid, Cholic acid ) and microbe-derived (eg. Deoxycholic acid, Tauroursodeoxycholic acid, 12-Ketodeoxycholic acid). Similar results have been observed in other studies as well [26595891,29173647]. Here, we are able to verify previosly observed trends in very easy steps!
This way we can examine if clusters of metabolites in the molecular network. In addition to being an easier alternative to look at molecular networking data ( as opposed to rendering it as a network in cytoscape & overlaying metadata), this also helps us query temporal trends in data to observe trends in molecules belonging to the same cluster across samples and along time.
In [21]:
# sorting by cluster index
exps = exp.sort_by_metadata('gnps_component', axis='f')
exps = exps.sort_samples('age')
exps.plot(sample_field='age',barx_fields=['age'],feature_field='gnps_name', gui='jupyter',
yticklabel_kwargs={'rotation':0, 'size':7}, yticklabel_len=50, bary_fields=['gnps_component'],
bary_label=False)
Out[21]:
In [22]:
expsagg =exps.aggregate_by_metadata('gnps_component', agg='sum', axis='f')
expsagg.plot(sample_field='age',barx_fields=['age'], feature_field='gnps_name', gui='jupyter',
yticklabel_kwargs={'rotation':0, 'size':7}, yticklabel_len=50, bary_fields=['gnps_component'],
bary_label=False)
Out[22]:
In [23]:
import numpy as np
o = []
for cm in expsagg.feature_metadata._calour_merge_ids:
resstr=''
for cid in cm.split(';'):
cid = int(cid)
cname = exps.feature_metadata.gnps_name.loc[cid]
if cname=='nan':
continue
if isinstance(cname,float):
continue
resstr += cname+';'
if resstr=='':
resstr='na'
o.append(resstr)
expsagg.feature_metadata['gnps_all'] = o
In [24]:
dexp = expsagg.diff_abundance('age', val1=['10'], transform='binarydata', random_seed=2018)
In [25]:
dexp.plot(sample_field='age',barx_fields=['age'], feature_field='gnps_all', gui='jupyter',
yticklabel_kwargs={'rotation':0, 'size':7}, yticklabel_len=50, bary_fields=['gnps_component'],
bary_label=False)
Out[25]:
Sometimes feature finding is not perfect, and we get several metabolites with very close mz/rt.
In order to identify these occurances, we sort the features by the MZ and then plot the heatmap and look for neighboring metabolites that behave very similar and have close MZ/RT
In [26]:
expart = exp.sort_samples(field='age').sort_samples(field='exposure_type')
expart =expart.sort_by_metadata(field='MZ',axis='f')
#adding new column 'mzrt' to feature metadata
expart.feature_metadata['mzrt'] = expart.feature_metadata.MZ.map("{0:.2f}".format) + '-' + expart.feature_metadata.RT.map("{0:.2f}".format)
f = expart.plot(sample_field='exposure_type', gui='jupyter', barx_fields=['age'], barx_label_kwargs={'size':5},
feature_field='mzrt')
Here are some examples we found when interactively looking at the heatmap.
We get the rect coordinates by pressing the "print axes ranges" button when we get a view containing the suspect features.
This way we can reporoduce the zoomed figure inside the notebook
In [27]:
f = expart.plot(sample_field='exposure_type', gui='jupyter', barx_fields=['age'], barx_label_kwargs={'size':5},
rect=[-0.5, 207.5, 1098.5, 1094.20703125], feature_field='mzrt')
Observation 4a. They have MZ of 994.5221123 and 977.4951502999999 and RT of 228.17 and 227.38. They could be NH4 adducts of the same ion (http://fiehnlab.ucdavis.edu/staff/kind/Metabolomics/MS-Adduct-Calculator)
In [28]:
f = expart.plot(sample_field='exposure_type', gui='jupyter', barx_fields=['age'], barx_label_kwargs={'size':5},
rect=[-0.5, 207.5, 1055.5703125, 1051.27734375],feature_field='mzrt')
In [31]:
f = expart.plot(sample_field='exposure_type', gui='jupyter', barx_fields=['age'], barx_label_kwargs={'size':5},
rect=[-0.5, 207.5, 780.8203125, 772.234375],feature_field='mzrt')
Observation 4b. These could be isomers or artefacts of feature detection where the same feature got split into multiple features due to strong RT drift in some samples, broad peak shape etc. These cases would need further investigation in MZmine2.
Conclusion In this notebook, we obtained insights on how intermittent hypoxia and hypercapnia (IHH) could be perturning the host-commensal metabolism. We also confirmed previous findings on the effect of high-fat diet on gut metabolome. Additionally, we were able to QC feature finding on these data and identified the possibility of adducts, isomers or split features in these data. All this could be done by following some easy steps in calour in a single Jupyter notebook (as opposed to using different software packages for each task, re-formatting data for each platform and so on)
In [ ]: