At this point in the workshop we should already have three tables: crassphage-samples-information.txt, samples-species-taxonomy.txt, and samples-genus-taxonomy.txt. We will use one of the taxonomy tables to train a Random Forest regression model, and see if we can predict possible hosts for crassphage.
In preparing this tutorial, I consulted this blog post by Fergus Boyles.
In [29]:
import pandas as pd
tax_level = 'species' # if you want to use genus, just switch this to 'genus'
data_file = 'samples-'+ tax_level + '-taxonomy.txt'
target_file = 'crassphage-samples-information.txt'
raw_data = pd.read_csv(data_file, sep='\t', index_col=0)
raw_target = pd.read_csv(target_file, sep='\t', index_col=0)
In [30]:
all_samples = raw_data.index
crassphage_positive_samples = raw_target.index[raw_target['presence']==True]
print('The number of samples in which crassphage was detected: %s' % len(crassphage_positive_samples))
crassphage_negative_samples = raw_target.index[raw_target['presence']==False]
print("The number of samples in which crassphage wasn't detected: %s" % len(crassphage_negative_samples))
# Keep only positive samples
samples = crassphage_positive_samples
Ok, so there are a lot more negative samples than positive samples.
Let's look at where the positive samples are coming from.
In [31]:
for o in ['USA', 'CHI', 'FIJ', 'TAN']:
    # count positive samples and all samples from each cohort by the sample-name prefix
    a = len([s for s in crassphage_positive_samples if s.startswith(o)])
    b = len([s for s in all_samples if s.startswith(o)])
    c = a / b * 100
    print('The number of crassphage positive samples from %s is %s out of a total of %s (%.0f%%)' % (o, a, b, c))
We can see that crassphage is most prevalent in the US cohort and also quite common in the Chinese cohort, whereas it is nearly absent from the Fiji cohort and completely absent from the eight hunter-gatherers from Tanzania.
The krakenhll output is in counts, so we will normalize it to proportions, so that the values for all taxons in a sample sum to 1. We will also normalize the crassphage values; since we don't have the information needed to translate these values into a proportion of the microbial community, we will simply make sure that they are between zero and one by dividing by the largest value.
In [32]:
normalized_data = raw_data.div(raw_data.sum(axis=1), axis=0)
normalized_target = raw_target['non_outlier_mean_coverage']/raw_target['non_outlier_mean_coverage'].max()
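If you want to convince yourself that the normalization behaved as expected, here is a quick optional check (a sketch that assumes no sample has a zero total count and no coverage values are missing):
import numpy as np
# every row of the normalized taxonomy table should sum to 1,
# and the scaled crassphage values should lie between 0 and 1
assert np.allclose(normalized_data.sum(axis=1), 1)
assert normalized_target.between(0, 1).all()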
We will remove the taxons that don't occur in more than 3 of the crassphage-positive samples. The reasoning is that a taxon that occurs so rarely is more likely to be the outcome of spurious misclassification, and would in any case contribute very little to a successful classifier.
In [33]:
occurrence_in_positive_samples = normalized_data.loc[crassphage_positive_samples, :]
qualifying_taxons = normalized_data.columns[(occurrence_in_positive_samples > 0).sum() > 3]
portion = len(qualifying_taxons) / len(normalized_data.columns) * 100
print('The number of qualifying taxons is %s out of a total of %s (%.0f%%)' % (len(qualifying_taxons),
                                                                               len(normalized_data.columns),
                                                                               portion))
data = normalized_data.loc[samples, qualifying_taxons]
target = normalized_target[samples]
In [34]:
# Add the crassphage coverage values as the last column
merged_df = pd.concat([data, target], axis=1)
# Change the name of the crassphage column
merged_df.columns = list(merged_df.columns[:-1]) + ['crassphage']
# anvi'o doesn't like characters that are not alphanumeric or '_',
# so we fix the index and column labels before writing the file
import re
f = lambda x: x if x.isalnum() else '_'
merged_df.columns = [re.sub(r'_+', '_', ''.join([f(ch) for ch in s])) for s in merged_df.columns]
merged_df.index = [re.sub(r'_+', '_', ''.join([f(ch) for ch in s])) for s in merged_df.index]
# Save the data to a TAB-delimited file
merged_df.to_csv(tax_level + '-matrix.txt', sep='\t', index_label='sample')
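As an optional sanity check, we can read the file back and confirm that the export worked (the shape should be the number of positive samples by the number of qualifying taxons plus the crassphage column):
check = pd.read_csv(tax_level + '-matrix.txt', sep='\t', index_col='sample')
print('Wrote a %s x %s matrix' % check.shape)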
At this point you could leave the Python notebook and use this matrix from the command line (for example, to visualize it with anvi'o), but we will stay in the notebook to train the Random Forest.
In [35]:
from sklearn import ensemble
n_trees = 64
RF = ensemble.RandomForestRegressor(n_estimators=n_trees,
                                    oob_score=True,
                                    random_state=1)
RF.fit(data, target)
Out[35]:
The model finished training; let's check the OOB score (a kind of R^2 estimate for the model). For more details, refer to this part of the scikit-learn documentation.
In [36]:
RF.oob_score_
Out[36]:
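If the OOB score feels like a black box, note that it is simply an R^2 computed between the true values and the out-of-bag predictions, which the fitted model also exposes. A minimal sketch of this equivalence (assuming every sample was left out of the bootstrap sample of at least one tree):
from sklearn.metrics import r2_score
# oob_prediction_ holds, for each sample, the averaged prediction of the
# trees that did not see that sample during training
print(r2_score(target, RF.oob_prediction_))  # should match RF.oob_score_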
Let's look at the importance of the features:
In [37]:
import numpy as np
features_according_to_order_of_importance = data.columns[np.argsort(-RF.feature_importances_)]
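The ordered index gives us the ranking, but it can also help to see the importance values themselves; for example:
# pair each taxon with its impurity-based importance score and show the top 10
importances = pd.Series(RF.feature_importances_, index=data.columns)
print(importances.sort_values(ascending=False).head(10))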
In [38]:
from matplotlib import pyplot as plt
top_features = features_according_to_order_of_importance[:10]
for f in top_features:
    plt.figure()
    # relative abundance on the x-axis, crassphage coverage on the y-axis,
    # to match the axis labels below
    plt.scatter(data[f], target)
    plt.title(f)
    plt.ylabel('crassphage normalized Q2Q3 mean coverage')
    plt.xlabel('%s relative abundance' % f.replace('_', ' '))
    plt.show()
We can see that this list is dominated by members and close relatives of the genus Bacteroides, which is encouraging. A few other taxons are also included, but we need to remember that in a regression model, taxons that anti-correlate with the target can also rank as important features.
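Since the importance scores are unsigned, a quick heuristic (not part of the model itself) for seeing which of the top taxons correlate positively and which negatively with crassphage coverage is a rank correlation:
# Spearman correlation of each of the top 10 features with the target;
# negative values indicate anti-correlating taxons
print(data[top_features].corrwith(target, method='spearman').sort_values())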
In [39]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data,
                                                    target,
                                                    train_size=0.8,
                                                    random_state=42)
RF = ensemble.RandomForestRegressor(n_estimators=n_trees, oob_score=True, random_state=1)
RF.fit(X_train, y_train)
from sklearn.metrics import r2_score
from scipy.stats import spearmanr, pearsonr
predicted_train = RF.predict(X_train)
predicted_test = RF.predict(X_test)
test_score = r2_score(y_test, predicted_test)
spearman = spearmanr(y_test, predicted_test)
pearson = pearsonr(y_test, predicted_test)
print(f'Out-of-bag R^2 score estimate: {RF.oob_score_:>5.3}')
print(f'Test data R^2 score: {test_score:>5.3}')
print(f'Test data Spearman correlation: {spearman[0]:.3}')
print(f'Test data Pearson correlation: {pearson[0]:.3}')
Ok, so we can see that this model doesn't really have predictive strength, and yet it did a pretty good job of finding the important taxons.
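If you want a second opinion on the feature ranking that doesn't rely on the impurity-based importances, one optional follow-up (not part of the original workflow, and assuming scikit-learn >= 0.22) is permutation importance on the held-out test set:
from sklearn.inspection import permutation_importance
# shuffle each feature in turn and measure how much the test score drops;
# features whose shuffling hurts the score most are ranked highest
perm = permutation_importance(RF, X_test, y_test, n_repeats=10, random_state=1)
perm_ranking = pd.Series(perm.importances_mean, index=X_test.columns)
print(perm_ranking.sort_values(ascending=False).head(10))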