Microbiome Summer School 2017 - Mass Spectrometry Tutorial

Welcome to this tutorial for Plenary 9 of the Microbiome Summer School 2017. This tutorial concerns Algorithms for Mass Spectrometry.

This notebook contains working code and an example application of the algorithms covered in Plenary 9. A dataset of mass spectra will be processed and corrected by the Virtual Lock Mass algorithm and subsequently aligned. A machine learning algorithm will then be applied to the data.


In [ ]:
#This section contains some fundamental imports for the notebook.
import numpy as np

The following section will load the mass spectra data into memory.

This dataset is a set of 80 samples of red blood cell cultures. Their spectra were acquired by LDTD-ToF mass spectrometry on a Waters Synapt G2-Si instrument, in high-resolution mode with data-independent acquisition ($MS^e$).

Of these 80 samples, 40 are from red blood cell cultures infected by malaria. The other 40 samples are not infected. It is the objective of this tutorial to correct and align these spectra in order to classify them by machine learning.

The dataset is stored in the file dataset.h5, included with this tutorial. HDF5 is an efficient storage format for numeric data and many other types of datasets.
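
If you are curious, the file can also be inspected directly. The cell below is an optional sketch that assumes the h5py package is available; it only lists the top-level groups, since the exact internal layout depends on how load_spectra organizes the data.


In [ ]:
#Optional: list the top-level groups of the HDF5 file (assumes h5py is installed).
#The group names depend on how the dataset was written; nothing below relies on them.
import h5py

with h5py.File("dataset.h5", "r") as f:
    print(list(f.keys()))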

The loading operation may take some seconds to complete.


In [ ]:
from tutorial_code.utils import load_spectra

In [ ]:
datafile = "dataset.h5"
spectra = load_spectra(datafile)

At this point, the mass spectra are loaded in memory and ready for the next step.

The next steps will be to correct and align these spectra in order to render them more comparable for the machine learning analysis to follow.

First, the Virtual Lock Mass algorithm will be applied. The following command will import the corrector code.


In [ ]:
from tutorial_code.virtual_lock_mass import VirtualLockMassCorrector

We must then create a corrector for the spectra.

The following command will create a corrector with a minimum peak intensity of 1000 and a maximum distance of 40 ppm. These settings yield the most correction points on this dataset, and are thus considered optimal.


In [ ]:
corrector = VirtualLockMassCorrector(window_size=40, minimum_peak_intensity=1000)

The corrector is then trained on the dataset in order to detect the VLM correction points. This is done by using the fit function, with the dataset as a parameter.


In [ ]:
corrector.fit(spectra)

Once the corrector has been trained, it can apply its correction to the spectra. We simply call the corrector's transform function on the dataset and store the result in a new variable.


In [ ]:
corrected_spectra = corrector.transform(spectra)

Now the spectra are corrected and larger shifts between samples should be removed. We must still align the spectra together in order to remove small variations in m/z values.

The following command will import the aligner code.


In [ ]:
from tutorial_code.alignment import Mass_Spectra_Aligner

As before, we must create an aligner.

The following command will create this aligner with a window size of 30 ppm.


In [ ]:
aligner = Mass_Spectra_Aligner(window_size=30)

The aligner then detects the alignment points when it is fitted to the mass spectra.


In [ ]:
aligner.fit(corrected_spectra)

Once the aligner is fitted, we have the alignment points.

The spectra will then be aligned by the transform function of the aligner. Once again, the aligned spectra will need to be stored in a new variable.


In [ ]:
aligned_spectra = aligner.transform(corrected_spectra)

The spectra are now aligned. In terms of m/z values, the spectra are ready to be compared.

The spectra must now be changed into a format more appropriate for machine learning, which the algorithms can read. This format is a data matrix, where each row represents a mass spectrum and each column represents a peak present in the dataset.

To make this conversion, import the spectrum_to_matrix function from the utilities (tutorial_code.utils).


In [ ]:
from tutorial_code.utils import spectrum_to_matrix

data = spectrum_to_matrix(aligned_spectra)
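
As a quick sanity check, we can look at the dimensions of the resulting matrix: one row per spectrum and one column per peak. This assumes spectrum_to_matrix returns a NumPy array (or another object with a shape attribute).


In [ ]:
#Sanity check: (number of spectra, number of distinct peaks).
#Assumes the data matrix exposes a NumPy-style shape attribute.
print(data.shape)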

Finally, we need to extract labels from the spectra in order to know which spectrum represents which class (infected or not by malaria).

The following function extracts this information from the spectra's metadata and returns an array of tags: 0 for non-infected samples and 1 for malaria-infected samples.


In [ ]:
from tutorial_code.utils import extract_tags

tags = extract_tags(aligned_spectra)
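
To confirm the class balance described earlier (40 infected and 40 non-infected samples), we can count each label. This assumes the tags come back as an integer NumPy array of 0s and 1s.


In [ ]:
#Count the samples per class: index 0 = non-infected, index 1 = infected.
#Assumes integer labels; np.bincount does not accept floats or strings.
print(np.bincount(tags))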

Here we begin the machine learning analysis proper.

We need to ensure that our experimental workflow is reproducible and that the resulting predictors generalize to new data.

A first step is to split the data into a training set and a testing set of samples. The algorithms will only be exposed to the training set, while the testing set is set apart and kept for a final evaluation of the model. This way, we can check that the model generalizes its predictions to new, previously unseen data.

A dataset can easily be split with the train_test_split function from the scikit-learn package.


In [ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(data, tags, test_size=0.25, random_state=42)

The above command splits the dataset randomly into a training set of 60 spectra (or examples, samples) and a testing set of 20 spectra. The feature information, i.e. the data matrix containing the peak intensities, is contained in the matrices X_train and X_test. The arrays Y_train and Y_test contain the corresponding tags.

The random_state parameter is fixed so that we can reproduce this exact split of samples, instead of getting a different random split each time we repeat this command.
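
A quick look at the shapes confirms the 60/20 split; this assumes the arrays returned by train_test_split expose a NumPy-style shape attribute.


In [ ]:
#With test_size=0.25 we expect 60 training examples and 20 testing examples.
print(X_train.shape, X_test.shape)
print(Y_train.shape, Y_test.shape)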

Next, we will create an object to handle the cross-validation step and our learner at the same time.

We will create a Decision Tree classifier (as presented during the plenary sessions) for this classification task.

Cross-validation is described in more detail on this tutorial's web page, as well as in a previous tutorial. In short, the process breaks the training set into folds, in this case 5. The algorithm is trained on all folds but one and then tested on the remaining one, and this is repeated so that each fold serves as the test fold once. With this method of evaluation, we can determine which parameters of the algorithm (called hyper-parameters) work best.
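
To make the procedure concrete, here is a minimal sketch of what 5-fold cross-validation does for a single hyper-parameter value, written with scikit-learn's KFold; the GridSearchCV object introduced below automates this loop over every value in the parameter grid. The choice of max_depth=3 is arbitrary and only for illustration.


In [ ]:
#Illustration only: manual 5-fold cross-validation for one candidate hyper-parameter.
#GridSearchCV repeats this loop for every value in the parameter grid.
#Assumes X_train and Y_train are NumPy arrays (as returned by train_test_split on arrays).
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

fold_scores = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X_train):
    tree = DecisionTreeClassifier(max_depth=3, random_state=42)
    tree.fit(X_train[train_idx], Y_train[train_idx])
    fold_scores.append(tree.score(X_train[valid_idx], Y_train[valid_idx]))
print(np.mean(fold_scores))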


In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier 
#this algorithm will not be used in the tutorial, but was presented in the plenary.
#Try using it on your own and see the results!

The Decision Tree classifier has one hyper-parameter that we will cross-validate in this tutorial. This hyper-parameter is the maximum depth of the decision tree allowed.

The following commands will create the parameter grid, the classifier and the cross-validator.


In [ ]:
param_grid = {
    "max_depth":[1,2,3,4,5,6]
}

In [ ]:
learner = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid=param_grid,
                      cv=5) #the number of folds

The learner is ready to be trained.

We use the fit method with the training set. When using a GridSearchCV object, fit conducts the cross-validation and then trains the model with the best hyper-parameters on the whole training set.


In [ ]:
learner.fit(X_train, Y_train)

We can then check the optimal parameters for the learner and evaluate its predictions on the training and testing sets.


In [ ]:
print(learner.best_estimator_)
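
GridSearchCV also exposes the chosen hyper-parameter values and their mean cross-validation score directly, which is often easier to read than the full estimator printout.


In [ ]:
#Best hyper-parameters found by the grid search and their mean cross-validation accuracy.
print(learner.best_params_)
print(learner.best_score_)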

The learner can then predict on both the training and testing sets. We can then evaluate the learner by comparing the true labels with the predicted ones.


In [ ]:
predictions_on_train = learner.predict(X_train)

In [ ]:
predictions_on_test = learner.predict(X_test)

We can then use an existing scikit-learn function that builds a classification report comparing the true labels with the predictions. This gives us the precision and recall of the classifier, as well as the F1 score.

An additional function gives us the zero-one loss, or error rate, of the learner. Printing one minus the zero-one loss gives the accuracy of the classifier.

Further information on the metrics is presented on the website of this tutorial.


In [ ]:
from sklearn.metrics import classification_report, zero_one_loss

Here are the results on the training set (empirical risk/accuracy).


In [ ]:
print(classification_report(Y_train, predictions_on_train))
print(1. - zero_one_loss(Y_train, predictions_on_train))

And here are the results on the testing set (test risk/accuracy).


In [ ]:
print(classification_report(Y_test, predictions_on_test))
print(1. - zero_one_loss(Y_test, predictions_on_test))
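
As a cross-check, one minus the zero-one loss should match the value returned by scikit-learn's accuracy_score on the same predictions.


In [ ]:
#Equivalent accuracy computation; should agree with 1 - zero_one_loss above.
from sklearn.metrics import accuracy_score

print(accuracy_score(Y_test, predictions_on_test))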

This marks the end of the tutorial. If you wish to experiment further, feel free to edit parameters and even change the machine learning algorithm.

Below is the code for a simple AdaBoost Classifier to test on the dataset.


In [ ]:
param_grid = {
    "n_estimators":[1,5,10,20,30,40,50,60,70,80,90,100],
    "learning_rate":[0.01, 0.1, 1., 10., 100.]
}

learner = GridSearchCV(AdaBoostClassifier(random_state=42),
                      param_grid=param_grid,
                      cv=5) #the number of folds
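
If you want to try it, the same fit-and-evaluate steps used for the decision tree apply unchanged; the cell below simply repeats them for this learner. Note that the larger parameter grid makes the search noticeably slower.


In [ ]:
#Train the AdaBoost grid search and evaluate it on the testing set,
#mirroring the decision tree workflow above.
learner.fit(X_train, Y_train)
print(learner.best_params_)

adaboost_predictions = learner.predict(X_test)
print(classification_report(Y_test, adaboost_predictions))
print(1. - zero_one_loss(Y_test, adaboost_predictions))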