In [38]:
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold
from calour.training import plot_scatter, plot_roc, plot_cm
In [17]:
import calour as ca
In [18]:
%matplotlib notebook
We will use the data from Qiita study 103 (https://qiita.ucsd.edu/study/description/103#).
In [19]:
dat = ca.read_amplicon('data/88-soil.biom',
                       'data/88-soil.sample.txt',
                       normalize=100, min_reads=10)
In [20]:
print(dat)
We throw away all features with total reads (over all samples) < 1 (after each sample was normalized to 100 reads/sample). Note that, alternatively, we could filter based on the mean reads per sample or on the fraction of samples in which the feature is present. Each method filters away slightly different bacteria. See the filtering notebook for details on the filtering functions.
In [22]:
dat=dat.filter_abundance(1)
dat
Out[22]:
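As noted above, other filtering criteria are possible. The sketch below shows the two alternatives, assuming the filter_mean_abundance and filter_prevalence functions described in the filtering notebook; the cutoff values are purely illustrative.
# alternative 1: filter on the mean abundance per sample (cutoff value and semantics follow the filtering notebook)
dat_mean = dat.filter_mean_abundance(0.001)
# alternative 2: keep features present in at least a given fraction of the samples (illustrative cutoff)
dat_prev = dat.filter_prevalence(0.1)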
Let's look at the distribution of pH for all the samples:
In [7]:
dat.sample_metadata['ph'].hist()
Out[7]:
In [8]:
dat.sort_samples('ph').sort_centroid(n=0.001).plot(sample_field='ph', gui='jupyter')
Out[8]:
We can then run regression analysis:
In [9]:
it = dat.regress('ph', RandomForestRegressor(random_state=0), cv=5, params=[{'n_estimators':3}, {'n_estimators': 500}])
This function returns a generator, which yields the prediction result for each parameter set specified in params. Here we would like to see how the number of trees (named n_estimators) in the model impacts its performance. The result with n_estimators = 3 is:
In [10]:
res1 = next(it)
In [11]:
res1.head()
Out[11]:
We can plot the result as follows. Each dot is a sample with observed and predicted pH, colored by the cross-validation fold the sample is from. The diagonal line indicates perfect prediction. The correlation coefficient between the prediction and the observation, and its p-value, are also annotated at the top of the plot.
In [13]:
plot_scatter(res1, cv=True)
Out[13]:
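If you want the annotated statistics as plain numbers, they can also be computed directly from the result DataFrame. A minimal sketch, assuming the observed and predicted values are stored in columns named 'Y_TRUE' and 'Y_PRED' (the column names are an assumption here; check res1.head() above):
from scipy.stats import pearsonr
# assumed columns: 'Y_TRUE' = observed pH, 'Y_PRED' = cross-validated prediction
r, p = pearsonr(res1['Y_TRUE'], res1['Y_PRED'])
print('pearson r = %.3f, p = %.3g' % (r, p))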
Let's look at the result for n_estimators = 500:
In [14]:
res2 = next(it)
In [15]:
res2.head()
Out[15]:
In [16]:
plot_scatter(res2, cv=True)
Out[16]:
From the plot, you can see that with more trees in the Random Forest model, the prediction is much better, with a higher correlation coefficient.
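Since regress returns a generator, you can also loop over it instead of calling next() by hand. A sketch using the same estimator and parameter grid as above (assuming the results are yielded in the same order as params):
params = [{'n_estimators': 3}, {'n_estimators': 500}]
it = dat.regress('ph', RandomForestRegressor(random_state=0), cv=5, params=params)
# one result DataFrame is yielded per parameter set
for p, res in zip(params, it):
    print(p)
    plot_scatter(res, cv=True)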
We can do a similar analysis for classification. Let's show it with another data set, which was introduced in a previous notebook.
In [23]:
dat = ca.read_amplicon('data/chronic-fatigue-syndrome.biom',
                       'data/chronic-fatigue-syndrome.sample.txt',
                       normalize=10000, min_reads=1000)
In [24]:
print(dat)
Let's see if we can distinguish patient samples from control samples with classification:
In [25]:
dat.sample_metadata['Subject'].value_counts()
Out[25]:
In [31]:
it = dat.classify('Subject', RandomForestClassifier(random_state=0), cv=RepeatedStratifiedKFold(n_splits=5, n_repeats=3), params=[{'n_estimators': 3}, {'n_estimators': 500}])
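Here, instead of a plain 5-fold split, we pass a RepeatedStratifiedKFold object, which repeats stratified 5-fold cross validation 3 times (15 train/test splits in total), keeping the class proportions similar in each fold. A quick check of the number of splits:
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3)
print(rskf.get_n_splits())   # 5 folds x 3 repeats = 15 splits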
In [32]:
res1 = next(it)
In [33]:
res1.head()
Out[33]:
We can plot the result as an ROC curve or a confusion matrix.
In [34]:
plot_roc(res1, classes=['Patient'])
Out[34]:
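The ROC AUC can also be computed as a single number directly from the result DataFrame. A minimal sketch, assuming the per-class probability columns are named after the class labels and the observed labels are in a 'Y_TRUE' column (these column names are an assumption; check res1.head() above):
from sklearn.metrics import roc_auc_score
# assumed columns: 'Patient' = predicted probability of the Patient class, 'Y_TRUE' = observed label
auc = roc_auc_score(res1['Y_TRUE'] == 'Patient', res1['Patient'])
print('AUC = %.3f' % auc)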
You can also plot the confusion matrix:
In [42]:
plot_cm(res1)
Out[42]:
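The same confusion matrix can be built with scikit-learn as well. A sketch, assuming the two classes in the Subject column are 'Control' and 'Patient' and using the assumed column names described above:
from sklearn.metrics import confusion_matrix
# predicted class = the class with the highest predicted probability ('Control' column name is an assumption)
y_pred = res1[['Control', 'Patient']].idxmax(axis=1)
print(confusion_matrix(res1['Y_TRUE'], y_pred, labels=['Control', 'Patient']))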
Let's look at the result for n_estimators = 500:
In [35]:
res2 = next(it)
In [36]:
res2.head()
Out[36]:
In [37]:
plot_roc(res2, classes=['Patient'])
Out[37]:
You can also plot the confusion matrix:
In [41]:
plot_cm(res2, normalize=True)
Out[41]:
From the confusion matrix and ROC plots, you can see that, as with regression, the classification accuracy is much better with more trees in the Random Forest model.