Last year, I read a paper titled "Feature Selection Methods for Identifying Genetic Determinants of Host Species in RNA Viruses". This year, I read another paper titled "Predicting host tropism of influenza A virus proteins using random forest". The essence of both papers was to predict influenza virus host tropism from sequence features. The feature engineering steps were somewhat distinct: the former encoded amino acid sequences as binary 1/0s, while the latter used physicochemical characteristics of the amino acids instead. However, the core problem was essentially identical - predict a host classification from influenza protein sequence features. Random forest classifiers were used in both papers, and they are a powerful method for identifying non-linear mappings from features to class labels. My question here was to see whether I could get comparable performance using a simple neural network.
I downloaded influenza HA sequences from the Influenza Research Database, dated from 1980 to 2015. Lab strains were excluded; duplicates were allowed, since they capture the host tropism of particular sequences. All viral subtypes were included.
Below, let's take a deep dive into what it takes to construct an artificial neural network!
First, a quick check of the environment (useful for Theano + CUDA setups), followed by the imports necessary for running this notebook.
In [2]:
! echo $PATH
! echo $CUDA_ROOT
In [3]:
import pandas as pd
import numpy as np
from Bio import SeqIO
from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment
from collections import Counter
from sklearn.preprocessing import LabelBinarizer
from sklearn.cross_validation import train_test_split  # in scikit-learn >= 0.18, this lives in sklearn.model_selection
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.metrics import mutual_info_score as mi
from lasagne import layers
from lasagne.updates import nesterov_momentum
from nolearn.lasagne import NeuralNet
import theano
Read in the viral sequences.
In [4]:
sequences = SeqIO.to_dict(SeqIO.parse('20150902_nnet_ha.fasta', 'fasta'))
# sequences
The sequences are going to be of variable length. To avoid the problem of doing multiple sequence alignments, filter to just the most common length (i.e. 566 amino acids).
In [5]:
lengths = Counter()
for accession, seqrecord in sequences.items():
    lengths[len(seqrecord.seq)] += 1
lengths.most_common(1)[0][0]
Out[5]:
566
There are also sequences that are ambiguously labeled - for example, the "Environment" and "Avian" samples. We would like to make a more detailed prediction as to which hosts they likely came from, so those samples will be set aside further below. First, though, keep only the sequences of the most common length.
In [6]:
# For convenience, we will only work with amino acid sequences of length 566.
final_sequences = dict()
for accession, seqrecord in sequences.items():
    if len(seqrecord.seq) == lengths.most_common(1)[0][0]:
        final_sequences[accession] = seqrecord
Create a numpy array to store the alignment.
In [7]:
alignment = MultipleSeqAlignment(final_sequences.values())
alignment_array = np.array([list(rec) for rec in alignment])
The first piece of meat in the code begins here. In the cell below, we convert the sequence matrix into numerical features. This is important - AFAIK, almost all machine learning algorithms require numerical inputs. One option (commented out below) is to binarize each position into 1s and 0s; here, we instead replace each amino acid with its isoelectric point, a physicochemical property.
In [8]:
# Alternative encoding: binarize each position into 1/0 columns.
# df = pd.DataFrame()
# # Create a dictionary of position + label binarizer objects.
# pos_lb = dict()
# for pos in range(lengths.most_common(1)[0][0]):
#     # Binarize the alignment at this position.
#     lb = LabelBinarizer()
#     # Fit to the alignment at that position.
#     lb.fit(alignment_array[:, pos])
#     # Add the label binarizer to the dictionary.
#     pos_lb[pos] = lb
#     # Create a dataframe of the binarized columns.
#     pos_df = pd.DataFrame(lb.transform(alignment_array[:, pos]))
#     # Append the columns to the dataframe.
#     for col in pos_df.columns:
#         maxcol = len(df.columns)
#         df[maxcol + 1] = pos_df[col]
# Here, we instead encode each residue by its isoelectric point.
from isoelectric_point import isoelectric_points
df = pd.DataFrame(alignment_array).replace(isoelectric_points)
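A note on the isoelectric_point import: it's a small local module (not shown in this notebook) that maps each one-letter amino acid code to its isoelectric point. A minimal sketch of what it might contain, with approximate textbook pI values - the exact numbers and module layout are assumptions:

# isoelectric_point.py -- hypothetical sketch of the local helper module.
# Maps one-letter amino acid codes to approximate isoelectric points (pI).
isoelectric_points = {
    'A': 6.00, 'R': 10.76, 'N': 5.41, 'D': 2.77, 'C': 5.07,
    'E': 3.22, 'Q': 5.65, 'G': 5.97, 'H': 7.59, 'I': 6.02,
    'L': 5.98, 'K': 9.74, 'M': 5.74, 'F': 5.48, 'P': 6.30,
    'S': 5.68, 'T': 5.60, 'W': 5.89, 'Y': 5.66, 'V': 5.96,
}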
In [9]:
# Add in the host data.
df['host'] = [s.id.split('|')[1] for s in final_sequences.values()]
# Drop rows containing ambiguous amino acid codes (X, J, B, Z).
df = df.replace({'X': np.nan, 'J': np.nan, 'B': np.nan, 'Z': np.nan})
df.dropna(inplace=True)
df.to_csv('isoelectric_point_data.csv')
In [10]:
# Standardize features to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler
df_std = pd.DataFrame(StandardScaler().fit_transform(df.iloc[:, :-1]))
df_std['host'] = df['host']
In [11]:
ambiguous_hosts = ['Environment', 'Avian', 'Unknown', 'NA', 'Bird', 'Sea_Mammal', 'Aquatic_Bird']
unknowns = df_std[df_std['host'].isin(ambiguous_hosts)]
train_test_df = df_std[~df_std['host'].isin(ambiguous_hosts)].copy()  # .copy() avoids SettingWithCopyWarning
train_test_df.dropna(inplace=True)
With the cells above, we now have a sequence feature matrix in which each of the 566 amino acid positions is represented as a standardized isoelectric point value. (Had we used the binarized encoding instead, the 566 positions would have been expanded into several thousand binary columns.)
The next step is to grab out the host species labels, and encode them as 1s and 0s as well.
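To make the encoding concrete, here is a toy example of what LabelBinarizer produces (the host names are just for illustration):

from sklearn.preprocessing import LabelBinarizer

toy_lb = LabelBinarizer()
# Columns are the sorted unique labels: Duck, Human, Swine.
toy_lb.fit_transform(['Human', 'Swine', 'Duck', 'Human'])
# array([[0, 1, 0],
#        [0, 0, 1],
#        [1, 0, 0],
#        [0, 1, 0]])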
In [12]:
set([i for i in train_test_df['host'].values])
Out[12]:
In [13]:
# Grab out the labels and one-hot encode them.
output_lb = LabelBinarizer()
Y = output_lb.fit_transform(train_test_df['host'])
Y = Y.astype(np.float32)  # Necessary for passing the data into nolearn.
Y.shape
Out[13]:
In [14]:
X = train_test_df.iloc[:, :-1].values
X = X.astype(np.float32) # Necessary for passing the data into nolearn.
X.shape
Out[14]:
Next up, we do the train/test split.
In [15]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)
For comparison, let's train a random forest classifier, and see what the concordance is between the predicted labels and the actual labels.
In [84]:
rf = RandomForestClassifier()
rf.fit(X_train, Y_train)
predictions = rf.predict(X_test)
predicted_labels = output_lb.inverse_transform(predictions)
# Compute the mutual information between the predicted labels and the actual labels.
mi(predicted_labels, output_lb.inverse_transform(Y_test))
Out[84]:
By the majority-consensus rule, and using mutual information as the metric for scoring, things look not so bad! As mentioned above, the RandomForestClassifier is a pretty powerful method for finding non-linear patterns between features and class labels.
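As a sanity check on the metric: mutual_info_score compares two label vectors via their contingency table, so it is high when the predicted labels carry a lot of information about the true labels, and zero when they are independent. A toy illustration with made-up labels:

from sklearn.metrics import mutual_info_score

truth = ['Human', 'Human', 'Swine', 'Duck']
mutual_info_score(truth, truth)         # maximal: equals the entropy of the true labels (in nats)
mutual_info_score(truth, ['Duck'] * 4)  # 0.0: a constant prediction carries no information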
Uncomment the cell below if you want to try scikit-learn's ExtraTreesClassifier.
In [85]:
# et = ExtraTreesClassifier()
# et.fit(X_train, Y_train)
# predictions = et.predict(X_test)
# predicted_labels = output_lb.inverse_transform(predictions)
# mi(predicted_labels, output_lb.inverse_transform(Y_test))
As a demonstration of how this model can be used, let's look at the ambiguously labeled sequences, i.e. those from "Environment" and "Avian", to see whether we can make a prediction as to which hosts they likely came from.
In [86]:
# unknown_hosts = unknowns.ix[:,:-1].values
# preds = rf.predict(unknown_hosts)
# output_lb.inverse_transform(preds)
Alrighty - we're now ready to try out a neural network! For this try, we will use lasagne and nolearn, two packages which have made building neural networks pretty easy. In this segment, I'm not going to show experiments with multiple architectures, activations and the like. The goal is to illustrate how easy it is to specify a neural network.
The network architecture that we'll try is as follows:
In [109]:
from lasagne import nonlinearities as nl

net1 = NeuralNet(layers=[
        ('input', layers.InputLayer),
        ('hidden1', layers.DenseLayer),
        # ('dropout', layers.DropoutLayer),
        # ('hidden2', layers.DenseLayer),
        # ('dropout2', layers.DropoutLayer),
        ('output', layers.DenseLayer),
        ],
    # Layer parameters:
    input_shape=(None, X.shape[1]),
    hidden1_num_units=300,
    # dropout_p=0.3,
    # hidden2_num_units=500,
    # dropout2_p=0.3,
    output_nonlinearity=nl.softmax,
    output_num_units=Y.shape[1],
    # allow_input_downcast=True,
    # Optimization method:
    update=nesterov_momentum,
    update_learning_rate=0.01,
    update_momentum=0.9,
    regression=True,
    max_epochs=100,
    verbose=1,
    )
Training even this simple neural network on my MacBook Air takes quite a bit of time :). But the function call for fitting it is as simple as net1.fit(X, Y).
In [110]:
net1.fit(X_train, Y_train)
Out[110]:
Let's grab out the predictions!
In [111]:
preds = net1.predict(X_test)
preds.shape
Out[111]:
Let's see how well the classifier did by examining the class labels. One way to visualize this is to put the class labels on the X-axis and the predicted probability on the Y-axis, sample by sample. Here's a simple example with no frills in the matplotlib interface.
In [112]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.bar(np.arange(len(preds[0])), preds[0])
Out[112]:
Alrighty, let's add some frills - the class labels, the probability of each class label, and the original class label.
In [113]:
### NOTE: Change the value of i to inspect any other test sample.
i = 111
plt.figure(figsize=(20,5))
plt.bar(np.arange(len(output_lb.classes_)), preds[i])
plt.xticks(np.arange(len(output_lb.classes_)) + 0.5, output_lb.classes_, rotation='vertical')
plt.title('Original Label: ' + output_lb.inverse_transform(Y_test)[i])
plt.show()
# print(output_lb.inverse_transform(Y_test)[i])
Let's do a majority-consensus rule applied to the labels, and then compute the mutual information score again.
In [114]:
# For each sample, call the class with the highest predicted probability.
preds_labels = []
for i in range(preds.shape[0]):
    pos = np.argmax(preds[i])
    preds_labels.append(output_lb.classes_[pos])
mi(preds_labels, output_lb.inverse_transform(Y_test))
Out[114]:
With a score of 0.73, that's not bad either! It certainly didn't outperform the RandomForestClassifier, but the default parameters on the RFC were probably pretty good to begin with. Notice how little tweaking of the neural network we had to do as well.
For good measure, these were the class labels. Notice how successful influenza has been in replicating across the many different species!
In [115]:
output_lb.classes_
Out[115]:
A bit more about the biology of influenza.
If you made it this far, thank you for hanging on! How does this mini project relate to the biology of flu?
As the flu evolves and moves between viral hosts, it gradually adapts to that host. This allows it to successfully establish an infection in the host population.
We can observe the viral host as we sample viruses from it. Sometimes, we don't catch the virus in its adapted state, but in its un-adapted state, as if it had freshly jumped in from another host population. That is likely why some of the class labels are mis-identified.
Also, there are environmentally sampled isolates. These obviously aren't replicating in the environment itself (i.e. bodies of water), but in some host, and were shed into the water. For these isolates, the host labels won't necessarily match up; instead, there'll be a stronger signal with particular hosts - whether ducks, pigs or even humans.
There are a few obvious things that could be done next: deeper architectures, dropout regularization, different activations, and hyperparameter tuning. One sketch follows below.
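For example, the commented-out lines in the net1 definition above hint at a deeper architecture with dropout. A sketch of what re-enabling them might look like - the layer sizes and dropout probabilities are untuned guesses, and the cell reuses the imports from earlier:

net2 = NeuralNet(layers=[
        ('input', layers.InputLayer),
        ('hidden1', layers.DenseLayer),
        ('dropout1', layers.DropoutLayer),  # randomly zeroes units during training
        ('hidden2', layers.DenseLayer),
        ('dropout2', layers.DropoutLayer),
        ('output', layers.DenseLayer),
        ],
    input_shape=(None, X.shape[1]),
    hidden1_num_units=300,
    dropout1_p=0.3,
    hidden2_num_units=300,
    dropout2_p=0.3,
    output_nonlinearity=nl.softmax,
    output_num_units=Y.shape[1],
    update=nesterov_momentum,
    update_learning_rate=0.01,
    update_momentum=0.9,
    regression=True,
    max_epochs=100,
    verbose=1,
    )
net2.fit(X_train, Y_train)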
What else might be done? Ping me at ericmajinglong@gmail.com with the subject "neural nets and HA". :)