Introduction

In this notebook, I will develop a system for forecasting influenza viral evolution. More specifically, I would like to predict the probability distribution over new influenza sequences, given current sequences.

Use Case

This data product marries my background knowledge of influenza evolution with recently acquired deep learning skills to forecast future influenza sequences. Such a capability is potentially a long-term competitive advantage for a company building vaccines against fast-evolving viruses, and, more generally, a long-term benefit for public health.

Borrowing a quote from one of the great ice hockey players,

I skate to where the puck will be, not where it has already been. (Wayne Gretzky)

Data

The data come from the Influenza Research Database (IRD). Search parameters were as follows:

  • Data Type: Protein
  • Virus Type: A
  • Subtype: H3N2
  • Date Range: From 2000
  • 'Classical' Proteins: HA (complete sequences only)

No other search parameters were used. Data were downloaded on 31 May 2017.

Learning Strategy

  1. Use a variational autoencoder (VAE), a deep learning method, to learn a latent manifold on which sequence evolution takes place.
  2. Simultaneously construct a genotype network of flu evolution (a minimal sketch follows this list).
    1. Nodes: influenza HA protein sequences.
    2. Edges: connect pairs of sequences that differ by exactly one amino acid.
  3. Sanity checks:
    1. Plot the edit distance between random pairs of protein sequences against their distance on the manifold; the relationship should be roughly linear.
  4. Validation:
    1. MVP validation will be a single round of "back-testing": hold out the 2017 data and check whether the 2017 sequences fall within the forecast.
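
Because the genotype network itself is not built elsewhere in this notebook, here is a minimal sketch of what step 2 might look like, using networkx and python-Levenshtein (both imported below). The function name and the all-pairs loop are my own illustrative choices, not a production implementation, and assume a list of Bio.SeqRecord objects like the one loaded later.

In [ ]:
# Minimal sketch of step 2: a genotype network where nodes are unique HA
# protein sequences and edges join sequences one amino acid apart.
# The all-pairs comparison is O(n^2); illustrative only.
import networkx as nx
from Levenshtein import distance as levD

def build_genotype_network(seqrecords):
    G = nx.Graph()
    unique_seqs = sorted({str(s.seq) for s in seqrecords})
    G.add_nodes_from(unique_seqs)
    for i, s1 in enumerate(unique_seqs):
        for s2 in unique_seqs[i + 1:]:
            if levD(s1, s2) == 1:
                G.add_edge(s1, s2)
    return G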

Technical aspects demonstrated

  1. Deep learning.
  2. Basic time series analysis.
  3. Genomics.

In [366]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import networkx as nx
import tensorflow as tf
import pymc3 as pm
import GPflow

plt.style.use('fivethirtyeight')


from Bio import SeqIO
from copy import deepcopy
from custom import load_model
from datetime import datetime
from keras.layers import Input, Dense, Lambda, Dropout
from keras.models import Model, model_from_json
from keras import backend as K
from keras import objectives
from keras.callbacks import EarlyStopping
from bokeh.plotting import figure
from bokeh.io import output_notebook, show
from random import sample

from Levenshtein import distance as levD
from matplotlib.gridspec import GridSpec
from mpl_toolkits.mplot3d import Axes3D
from pymc3 import gp
from sklearn.model_selection import train_test_split
from scipy.spatial import ConvexHull
from scipy.spatial.distance import euclidean as eucD

from custom import encode_array, save_model, get_density_interval
from utils.data import load_sequence_and_metadata

%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'



We will start with some basic cleaning of the data.

  1. We want to keep only human influenza sequences; the dataset may contain sequences from other host species.
  2. We want to forecast 2017 sequences. Therefore, any model fitting is done only on data up to the end of 2016, and predictions are made on the 2017 sequences.

In [367]:
# Let's start by loading the protein sequence FASTA file and metadata.
sequences, metadata = load_sequence_and_metadata(kind='csv')
assert len(sequences) == len(metadata)

# Filter for just human sequences, then split into training and test set.
metadata = metadata[metadata['Host Species'] == 'IRD:Human']
training_metadata = metadata[metadata['Collection Date'] < datetime(2017, 1, 1)]
training_idxs = [i for i, s in enumerate(sequences) if s.id in training_metadata['Sequence Accession'].values]

test_metadata = metadata[metadata['Collection Date'] >= datetime(2017, 1, 1)]
test_idxs = [i for i, s in enumerate(sequences) if s.id in test_metadata['Sequence Accession'].values]

# Encode as array.
sequence_array = encode_array(sequences)
training_array = sequence_array[training_idxs]
test_array = sequence_array[test_idxs]

training_sequences = [sequences[i] for i in training_idxs]
test_sequences = [sequences[i] for i in test_idxs]


load_sequence_and_metadata() took 0:00:01.972372 seconds.
24

To visualize what's going on, let's display the data below.


In [368]:
test_array   # the one-of-K encoding of sequences.


Out[368]:
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [369]:
training_array  # also a one-of-K encoding of sequences


Out[369]:
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])
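
The arrays above come from encode_array in the custom module, whose implementation is not shown here. As a rough sketch only, a one-of-K (one-hot) encoding of aligned protein sequences might look like the following; the alphabet, its ordering, and the handling of unknown characters are assumptions, not the actual encode_array internals.

In [ ]:
# Rough sketch of a one-of-K encoding for aligned protein sequences.
# The real encoding is done by `encode_array`; the alphabet below is assumed.
import numpy as np

ALPHABET = list('ACDEFGHIKLMNPQRSTVWY-X')  # assumed, not the real alphabet

def one_of_k_encode(seqrecords, alphabet=ALPHABET):
    idx = {letter: i for i, letter in enumerate(alphabet)}
    seq_len = len(seqrecords[0].seq)  # assumes the sequences are aligned
    arr = np.zeros((len(seqrecords), seq_len * len(alphabet)))
    for i, record in enumerate(seqrecords):
        for j, letter in enumerate(str(record.seq)):
            arr[i, j * len(alphabet) + idx.get(letter, idx['X'])] = 1.0
    return arr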

Some sanity checks are implemented below:

  1. The length of the training (and test) array matches the length of the corresponding metadata.

In [370]:
# Sanity checks
assert len(training_array) == len(training_metadata)
assert len(test_array) == len(test_metadata)

# Diagnostic prints that may be helpful later
print(training_array.shape)


(13373, 13680)

Let's now build the variational autoencoder (VAE) model. This deep learning model gives us a way to learn a continuous latent representation of the discrete sequence space; it serves both as a visualization tool and as a data compression tool. (The training cell below is commented out because the model has already been trained; the trained weights are loaded from disk further down.)


In [371]:
# # Set up VAE.
# with tf.device('/gpu:0'):
#     intermediate_dim = 1000
#     encoding_dim = 3
#     latent_dim = encoding_dim
#     epsilon_std = 1.0
#     nb_epoch = 250

#     x = Input(shape=(training_array.shape[1],))
#     z_mean = Dense(latent_dim)(x)
#     z_log_var = Dense(latent_dim)(x)

#     def sampling(args):
#         z_mean, z_log_var = args
#         epsilon = K.random_normal(shape=(latent_dim, ), mean=0.,
#                                   stddev=epsilon_std)
#         return z_mean + K.exp(z_log_var / 2) * epsilon


#     def vae_loss(x, x_decoded_mean):
#         xent_loss = training_array.shape[1] * objectives.binary_crossentropy(x, x_decoded_mean)
#         kl_loss = - 0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
#         return xent_loss + kl_loss

#     z = Lambda(sampling, output_shape=(latent_dim,))([z_mean, z_log_var])
#     x_decoded_mean = Dense(training_array.shape[1], activation='sigmoid')(z_mean)

#     vae = Model(x, x_decoded_mean)
#     vae.compile(optimizer='adam', loss=vae_loss)

#     # build a model to project inputs on the latent space
#     encoder = Model(x, z_mean)
#     encoder_var = Model(x, z_log_var)

#     x_train, x_test = train_test_split(training_array)

#     early_stopping = EarlyStopping(monitor="val_loss", patience=2)


#     # build the decoder
#     encoded_input = Input(shape=(encoding_dim,))
#     # retrieve the last layer of the autoencoder model
#     decoder_layer = vae.layers[-1]
#     # create the decoder model
#     decoder = Model(inputs=encoded_input, outputs=decoder_layer(encoded_input))


#     # Train the VAE to learn weights
#     vae.fit(x_train, x_train,
#             shuffle=True,
#             epochs=nb_epoch,
#             validation_data=(x_test, x_test),
#             callbacks=[early_stopping],
#            )

Putting VAEs into Production

How might we put the VAE into production? Keras allows us to save and load models as .json and .h5 files (see the Keras documentation on saving and loading models).

I will save the model as it is right now under the trained_models/ directory.
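
Note that save_model and load_model come from the custom module rather than directly from Keras. As a rough sketch, a save/load wrapper along these lines could be built from Keras' to_json(), save_weights(), model_from_json() and load_weights(); the <prefix>.json / <prefix>.h5 naming below is an assumption.

In [ ]:
# Rough sketch of a Keras save/load wrapper; the notebook's actual
# `save_model`/`load_model` live in the `custom` module.
from keras.models import model_from_json

def save_keras_model(model, prefix):
    with open(prefix + '.json', 'w') as f:
        f.write(model.to_json())        # architecture as JSON
    model.save_weights(prefix + '.h5')  # weights as HDF5

def load_keras_model(prefix):
    with open(prefix + '.json', 'r') as f:
        model = model_from_json(f.read())
    model.load_weights(prefix + '.h5')
    return model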


In [372]:
# save_model(vae, 'trained_models/vae')
# save_model(encoder, 'trained_models/encoder')
# save_model(decoder, 'trained_models/decoder')

In [373]:
# with open('trained_models/vae.yaml', 'r+') as f:
#     yaml_spec = f.read()

In [374]:
vae = load_model('trained_models/vae')
encoder = load_model('trained_models/encoder')
decoder = load_model('trained_models/decoder')

In [375]:
training_embeddings_mean = encoder.predict(training_array)
training_embeddings_mean.shape


Out[375]:
(13373, 3)

Let's compute the convex hull of the test embedding points, so that it can be overlaid on the training embedding points.


In [376]:
test_embeddings_mean = encoder.predict(test_array)
test_embeddings_hull = ConvexHull(test_embeddings_mean)

Before moving on, save test embeddings and training embeddings to disk


In [377]:
pd.DataFrame(test_embeddings_mean)


Out[377]:
0 1 2
0 -0.169175 -1.476121 1.758001
1 -0.822338 -1.677743 3.001601
2 -0.934565 -1.852435 2.426434
3 -0.144333 -1.715674 1.810452
4 -0.140226 -1.707515 2.105132
5 -0.140226 -1.707515 2.105132
6 -0.822338 -1.677743 3.001601
7 -0.822338 -1.677743 3.001601
8 -0.347295 -1.357930 1.691708
9 -0.140226 -1.707515 2.105132
10 -0.140226 -1.707515 2.105132
11 -0.140226 -1.707515 2.105132
12 -0.822338 -1.677743 3.001601
13 -0.767528 -1.696776 2.951524
14 -0.767528 -1.696776 2.951524
15 -0.182890 -1.064798 2.458643
16 -1.739199 -2.354932 2.738085
17 -0.991879 -1.567481 2.622967
18 -0.991879 -1.567481 2.622967
19 -1.103650 -1.716386 3.114609
20 -0.795672 -1.999494 2.489496
21 -0.003452 -1.377378 2.010183
22 0.390658 -0.544416 2.050886
23 -0.662323 -1.770830 2.411823
24 -0.155083 -1.658935 2.032437
25 -0.155083 -1.658935 2.032437
26 -0.155083 -1.658935 2.032437
27 -0.700969 -1.603273 2.359200
28 -0.155083 -1.658935 2.032437
29 -1.343689 -1.772821 3.213515
... ... ... ...
1052 -1.533787 -2.120633 2.746078
1053 -1.533787 -2.120633 2.746078
1054 -1.533787 -2.120633 2.746078
1055 -0.873609 -1.525196 2.817189
1056 -0.873609 -1.525196 2.817189
1057 -0.199075 -1.598989 2.062863
1058 -0.800519 -1.561380 2.297793
1059 -1.533787 -2.120633 2.746078
1060 -0.873609 -1.525196 2.817189
1061 0.390658 -0.544416 2.050886
1062 -0.997703 -1.894238 3.080574
1063 -0.205954 -1.591393 2.033720
1064 -0.822338 -1.677743 3.001601
1065 -0.094305 -1.817686 2.135722
1066 -0.094305 -1.817686 2.135722
1067 -1.506273 -2.185941 2.808886
1068 -1.506273 -2.185941 2.808886
1069 -0.205954 -1.591393 2.033720
1070 -0.822338 -1.677743 3.001601
1071 -0.574521 -1.463377 2.253461
1072 -0.263545 -1.485282 1.932661
1073 -0.205954 -1.591393 2.033720
1074 -0.801519 -1.772139 2.251057
1075 -0.205954 -1.591393 2.033720
1076 -1.533787 -2.120633 2.746078
1077 -1.533787 -2.120633 2.746078
1078 -1.533787 -2.120633 2.746078
1079 -1.533787 -2.120633 2.746078
1080 -1.210322 -2.204110 2.564916
1081 -1.533787 -2.120633 2.746078

1082 rows × 3 columns


In [378]:
lowp, highp = get_density_interval(99, training_embeddings_mean, axis=0)
lowp, highp


Out[378]:
(array([-1.6280247 , -4.0219806 , -0.79145813]),
 array([ 3.00617099,  0.32198843,  3.30450177]))

In [379]:
# lowp, highp = get_density_interval(97.5, test_embeddings_mean, axis=0)
# lowp, highp

Let's visualize the latent space, which is currently encoded in 3 dimensions.


In [380]:
fig = plt.figure(figsize=(5,5))
ax = fig.add_subplot(1, 1, 1, projection='3d')
ax.scatter(training_embeddings_mean[:, 0], 
           training_embeddings_mean[:, 1], 
           training_embeddings_mean[:, 2], 
           c=training_metadata['Collection Date'], cmap='viridis')
# Plot the convex hull of the 2017 sequences.
for simplex in test_embeddings_hull.simplices:
    ax.plot(test_embeddings_mean[simplex, 0], 
            test_embeddings_mean[simplex, 1], 
            test_embeddings_mean[simplex, 2], 'r--', lw=0.5)
ax.set_xlim(lowp[0], highp[0])
ax.set_ylim(lowp[1], highp[1])
ax.set_zlim(lowp[2], highp[2])
ax.set_facecolor('white')
plt.show()


Let's save the learned coordinates to disk, so that they can be visualized in a separate file. Uncomment the to_csv line if the model has been retrained and the coordinates need to be updated.


In [381]:
training_metadata.head()


Out[381]:
Name Sequence Accession Complete Genome Segment Segment Length Subtype Collection Date Host Species Country State/Province Flu Season Strain Name Unnamed: 12
0 HA HM628693 No 4 1701 H3N2 2010-03-03 IRD:Human Brazil -N/A- -N/A- A/Acre/15093/2010(H3N2) NaN
1 HA HM628694 No 4 1701 H3N2 2010-04-04 IRD:Human Brazil -N/A- -N/A- A/Acre/26954/2010(H3N2) NaN
2 HA KT889256 Yes 4 1701 H3N2 2013-09-26 IRD:Human Ethiopia -N/A- -N/A- A/Addis Ababa/1514A07305892N/2013 NaN
3 HA KC882982 No 4 1701 H3N2 2011-01-05 IRD:Human USA Alabama 10-11 A/Alabama/01/2011 NaN
4 HA KT841229 Yes 4 1701 H3N2 2015-01-06 IRD:Human USA Alabama 14-15 A/Alabama/01/2015 NaN

In [382]:
training_metadata.loc[:, 'coords0'] = training_embeddings_mean[:, 0]
training_metadata.loc[:, 'coords1'] = training_embeddings_mean[:, 1]
training_metadata.loc[:, 'coords2'] = training_embeddings_mean[:, 2]
training_metadata.tail()
# training_metadata.to_csv('data/metadata_with_embeddings.csv')


/home/ericmjl/anaconda/envs/pymc3/lib/python3.6/site-packages/pandas/core/indexing.py:296: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
/home/ericmjl/anaconda/envs/pymc3/lib/python3.6/site-packages/pandas/core/indexing.py:476: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
Out[382]:
Name Sequence Accession Complete Genome Segment Segment Length Subtype Collection Date Host Species Country State/Province Flu Season Strain Name Unnamed: 12 coords0 coords1 coords2
16977 HA KJ567656 No 4 1701 H3N2 2012-04-18 IRD:Human China -N/A- -N/A- A/Zhongshan/01/2012 NaN 1.193005 -1.211505 2.009971
16978 HA KJ567658 No 4 1701 H3N2 2013-10-21 IRD:Human China -N/A- -N/A- A/Zhongshan/01/2013 NaN 1.327763 -0.830435 2.577444
16979 HA KJ567655 No 4 1701 H3N2 2011-08-08 IRD:Human China -N/A- -N/A- A/Zhongshan/02/2011 NaN 2.722208 -1.042558 1.311242
16980 HA KJ567657 No 4 1701 H3N2 2012-04-18 IRD:Human China -N/A- -N/A- A/Zhongshan/02/2012 NaN 1.223866 -1.283524 2.154933
16981 HA KJ567659 No 4 1701 H3N2 2013-10-21 IRD:Human China -N/A- -N/A- A/Zhongshan/02/2013 NaN 1.327763 -0.830435 2.577444

Sanity Check 1: Distances

I want to do a sanity check on the protein sequences. The goal is to verify that small Euclidean distances in the embedding roughly correspond to small Levenshtein distances between sequences. I will sample 1000 pairs of sequences, compute their Levenshtein and Euclidean distances, and check that the relationship between the two is roughly linear.


In [383]:
# Generate pairs of random indices.
indices = np.random.randint(low=0, high=len(training_array), size=(1000, 2))
indices


Out[383]:
array([[ 1614, 10334],
       [ 9737, 11399],
       [  938,   780],
       ..., 
       [ 9154,  8958],
       [ 4903,  9467],
       [ 5359,  1792]])

In [384]:
lev_dists = []
euc_dists = []
for i1, i2 in indices:
    lev_dist = levD(str(training_sequences[i1].seq), str(training_sequences[i2].seq))
    euc_dist = eucD(training_embeddings_mean[i1], training_embeddings_mean[i2])
    
    lev_dists.append(lev_dist)
    euc_dists.append(euc_dist)
    
fig = plt.figure(figsize=(5,5))
plt.scatter(lev_dists, euc_dists, alpha=0.1)
plt.xlabel('Levenshtein Distance')
plt.ylabel('Euclidean Distance')


Out[384]:
<matplotlib.text.Text at 0x7f3edd673dd8>

Sanity Check Conclusion

Regardless of which indices are sampled, we generally get back a roughly linear relationship between Euclidean and Levenshtein distances. I feel good about this sanity check.

Visualize Evolutionary Trajectory over Time

I'm now going to try visualizing the evolutionary trajectory of the HA protein over time. We take the median coordinate per time bin (day, week, and quarter) and plot it in 3D space.


In [385]:
from ipywidgets import interact
from ipywidgets.widgets import Dropdown

tm_coords = deepcopy(training_metadata)  # tm_coords means "training metadata with coordinates"
tm_coords['coord0'] = training_embeddings_mean[:, 0]
tm_coords['coord1'] = training_embeddings_mean[:, 1]
tm_coords['coord2'] = training_embeddings_mean[:, 2]
avg_coords_by_day = tm_coords.groupby('Collection Date')\
                            [['coord0', 'coord1', 'coord2']].median()\
                            .resample("D").median().dropna().reset_index()

avg_coords_by_quarter = tm_coords.groupby('Collection Date')\
                            [['coord0', 'coord1', 'coord2']].median().\
                            resample("Q").median().dropna().reset_index()

        
avg_coords_by_week = tm_coords.groupby('Collection Date')\
                            [['coord0', 'coord1', 'coord2']].median().\
                            resample("W").median().dropna().reset_index()

@interact(elev=(-180, 180, 10), azim=(0, 360, 10))
def plot_daily_avg(elev, azim):

    fig = plt.figure(figsize=(16,8))
    ax1 = fig.add_subplot(1,2,1, projection='3d')
    ax1.scatter(xs=avg_coords_by_week['coord0'],
                ys=avg_coords_by_week['coord1'],
                zs=avg_coords_by_week['coord2'],
                c=avg_coords_by_week['Collection Date'], 
                cmap='viridis')
    ax1.view_init(elev, azim)
    ax1.set_title('by week')
    
    ax2 = fig.add_subplot(1,2,2, projection='3d')
    ax2.scatter(xs=avg_coords_by_quarter['coord0'], 
                ys=avg_coords_by_quarter['coord1'], 
                zs=avg_coords_by_quarter['coord2'], 
                c=avg_coords_by_quarter['Collection Date'], 
                cmap='viridis')
    ax2.view_init(elev=elev, azim=azim)
    ax2.set_title('by quarter')
    
    ax1.set_xlim(ax2.get_xlim())
    ax1.set_ylim(ax2.get_ylim())
    ax1.set_zlim(ax2.get_zlim())
    
    for simplex in test_embeddings_hull.simplices:
        ax1.plot(test_embeddings_mean[simplex, 0], 
                 test_embeddings_mean[simplex, 1], 
                 test_embeddings_mean[simplex, 2], 
                 'k--', lw=0.2)
        
    ax1.set_facecolor('white')
    ax2.set_facecolor('white')

    plt.show()


Looking at the data, the most feasible target is to predict coordinates a few months ahead, roughly 3-6 months out.

Let's transform the 3D plot into three 1D time-series plots. (This is just a visualization aid.)


In [386]:
fig = plt.figure(figsize=(15,5))
ax1 = fig.add_subplot(1,3,1)
ax1.plot(avg_coords_by_quarter['coord0'])
ax1.set_xlabel('Time')
ax1.set_ylabel('Coordinate Value')
ax1.set_title('Coordinate 0')

ax2 = fig.add_subplot(1,3,2)
ax2.plot(avg_coords_by_quarter['coord1'])
ax2.set_xlabel('Time')
ax2.set_ylabel('Coordinate Value')
ax2.set_title('Coordinate 1')

ax3 = fig.add_subplot(1,3,3)
ax3.plot(avg_coords_by_quarter['coord2'])
ax3.set_xlabel('Time')
ax3.set_ylabel('Coordinate Value')
ax3.set_title('Coordinate 2')
plt.show()


GPs for Time Series

Here's me trying out Gaussian processes (GPs) for time series forecasting. I've hacked this once before, following an example, so it should be transferable...


In [387]:
import theano.tensor as tt
x_vals = np.array(range(len(avg_coords_by_quarter))).reshape(-1, 1).astype('float32')

def build_coords_model(coordinate):
    y_vals = avg_coords_by_quarter[coordinate].values.astype('float32')
    
    print(x_vals.shape, y_vals.shape)

    with pm.Model() as model:
        # l = pm.HalfCauchy('l', beta=20)
        l = pm.Uniform('l', 0, 30)

        # Covariance function
        log_s2_f = pm.Uniform('log_s2_f', lower=-10, upper=5)
        s2_f = pm.Deterministic('s2_f', tt.exp(log_s2_f))
        # s2_f = pm.HalfCauchy('s2_f', beta=2)
        # f_cov = s2_f * pm.gp.cov.Matern52(input_dim=1, lengthscales=l)
        f_cov = s2_f * pm.gp.cov.ExpQuad(input_dim=1, lengthscales=l)

        # Sigma
        log_s2_n = pm.Uniform('log_s2_n', lower=-10, upper=5)
        s2_n = pm.Deterministic('s2_n', tt.exp(log_s2_n))
        # s2_n = pm.HalfCauchy('s2_n', beta=2)

        y_obs = pm.gp.GP('y_obs', cov_func=f_cov, sigma=s2_n, 
                         observed={'X':x_vals, 
                                   'Y':y_vals})
        trace = pm.sample(2000)

        pp_x = np.arange(len(avg_coords_by_quarter)+2)[:, None]
        gp_samples = pm.gp.sample_gp(trace=trace, gp=y_obs, X_values=pp_x, samples=1000)

    return gp_samples

Now with the function defined, let's train three separate GP models to predict the evolutionary coordinates over time.


In [388]:
coord0_preds = build_coords_model('coord0')


Auto-assigning NUTS sampler...
Initializing NUTS using advi...
(67, 1) (67,)
Average ELBO = -38.676: 100%|██████████| 200000/200000 [02:43<00:00, 1221.92it/s]   
Finished [100%]: Average ELBO = -38.682
100%|██████████| 2000/2000 [00:17<00:00, 114.94it/s]
  0%|          | 0/1000 [00:00<?, ?it/s]/home/ericmjl/anaconda/envs/pymc3/lib/python3.6/site-packages/scipy/stats/_multivariate.py:535: RuntimeWarning: covariance is not positive-semidefinite.
  out = random_state.multivariate_normal(mean, cov, size)
100%|██████████| 1000/1000 [00:23<00:00, 42.81it/s]

In [389]:
coord1_preds = build_coords_model('coord1')


Auto-assigning NUTS sampler...
Initializing NUTS using advi...
(67, 1) (67,)
Average ELBO = -26.999: 100%|██████████| 200000/200000 [02:48<00:00, 1188.19it/s]   
Finished [100%]: Average ELBO = -27.005
100%|██████████| 2000/2000 [00:17<00:00, 111.69it/s]
  0%|          | 0/1000 [00:00<?, ?it/s]/home/ericmjl/anaconda/envs/pymc3/lib/python3.6/site-packages/scipy/stats/_multivariate.py:535: RuntimeWarning: covariance is not positive-semidefinite.
  out = random_state.multivariate_normal(mean, cov, size)
100%|██████████| 1000/1000 [00:23<00:00, 42.54it/s]

In [390]:
coord2_preds = build_coords_model('coord2')


Auto-assigning NUTS sampler...
Initializing NUTS using advi...
(67, 1) (67,)
Average ELBO = 5.1491: 100%|██████████| 200000/200000 [02:49<00:00, 1179.82it/s]
Finished [100%]: Average ELBO = 5.1616
Evidence of divergence detected, inspect ELBO.
100%|██████████| 2000/2000 [00:15<00:00, 128.43it/s]
  0%|          | 0/1000 [00:00<?, ?it/s]/home/ericmjl/anaconda/envs/pymc3/lib/python3.6/site-packages/scipy/stats/_multivariate.py:535: RuntimeWarning: covariance is not positive-semidefinite.
  out = random_state.multivariate_normal(mean, cov, size)
100%|██████████| 1000/1000 [00:26<00:00, 38.26it/s]

In [391]:
coord0_preds.shape, coord1_preds.shape, coord2_preds.shape,


Out[391]:
((1000, 69), (1000, 69), (1000, 69))

Now, we can plot the coordinates with the ground truth data.


In [392]:
from random import sample
def plot_coords_with_groundtruth(coord_preds, data, ax):
    pp_x = np.arange(len(avg_coords_by_quarter)+2)[:, None]
    for x in coord_preds:
        ax.plot(pp_x, x, color='purple', alpha=0.1)
    ax.plot(x_vals, data, 'oy', ms=10);
    ax.set_xlabel("quarter index");
    ax.set_ylabel("coordinate");
    ax.set_title("coordinate posterior predictive distribution");
    
    return ax

fig = plt.figure(figsize=(15,5))
ax0 = fig.add_subplot(1,3,1)
ax1 = fig.add_subplot(1,3,2)
ax2 = fig.add_subplot(1,3,3)
plot_coords_with_groundtruth(coord0_preds, avg_coords_by_quarter['coord0'], ax0)
plot_coords_with_groundtruth(coord1_preds, avg_coords_by_quarter['coord1'], ax1)
plot_coords_with_groundtruth(coord2_preds, avg_coords_by_quarter['coord2'], ax2)


Out[392]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3edc4483c8>

I'd like to try making a Bokeh plot of this.


In [409]:
test_coords_embed = deepcopy(test_metadata)
test_coords_embed['coord0'] = test_embeddings_mean[:, 0]
test_coords_embed['coord1'] = test_embeddings_mean[:, 1]
test_coords_embed['coord2'] = test_embeddings_mean[:, 2]
test_coords_embed = test_coords_embed.set_index('Collection Date').resample('Q').mean().reset_index()
test_coords_embed.to_csv('data/test_metadata_with_embeddings.csv')
test_coords_embed


Out[409]:
Collection Date Segment Segment Length Unnamed: 12 coord0 coord1 coord2
0 2017-03-31 4.0 1734.832853 NaN -0.700494 -1.664800 2.457307
1 2017-06-30 4.0 1737.000000 NaN -0.536831 -1.530502 2.380945

In [414]:
# test_coords_embed[['coord0', 'coord1', 'coord2']].values
decoder.predict(test_coords_embed[['coord0', 'coord1', 'coord2']].values)


Out[414]:
array([[  8.27948770e-06,   8.13587394e-06,   8.28455086e-06, ...,
          8.18558965e-06,   8.26399446e-06,   8.14564100e-06],
       [  1.08876475e-05,   1.07061132e-05,   1.08432678e-05, ...,
          1.08219147e-05,   1.09386965e-05,   1.07924889e-05]], dtype=float32)

Make a distribution plot that shows where the held-out 2017 coordinates fall relative to the 95% credible interval of the forecasts.


In [416]:
from bokeh.models import HoverTool, ColumnDataSource, LinearColorMapper
from bokeh.layouts import row

def ecdf(data):
    x, y = np.sort(data), np.arange(1, len(data)+1) / len(data)
    return x, y

def percentile_xy(x, y, perc):
    return np.percentile(x, perc), np.percentile(y, perc)


def plot_forecast_hpd(coord_preds, coords_test, c, quarter):
    
    
    x, y = ecdf(coord_preds[:, -3+quarter])
    y_perc = (y * 100).astype('int')
    df = pd.DataFrame(dict(x=x, y=y, y_perc=y_perc))
    
    df['color'] = df['y_perc'].apply(lambda x: 'red' if x > 2.5 and x < 97.5 else 'blue')
    src = ColumnDataSource(df)

    
    # Instantiate Figure
    p = figure(plot_width=300, plot_height=300, 
               x_axis_label="coordinate value", y_axis_label="cumulative distribution",
               title="")

    # Make scatter plot.
    p.scatter('x', 'y', 
              color='color',
              source=src)
    p.ray(x=[coords_test['coord{0}'.format(c)][quarter-1]], y=[0], angle=np.pi / 2, length=0, line_width=2, color='blue')
#     p.scatter(*percentile_xy(x, y, 2.5), color='red')
#     p.scatter(*percentile_xy(x, y, 97.5), color='red')

    # Add Hover tool
    hovertool = HoverTool()
    hovertool.tooltips = [
        ("Value", "@x"),
        ("Percentile", "@y_perc")
    ]
    p.add_tools(hovertool)

    return p

q = 2
p0 = plot_forecast_hpd(coord0_preds, test_coords_embed, 0, quarter=q)
p1 = plot_forecast_hpd(coord1_preds, test_coords_embed, 1, quarter=q)
p2 = plot_forecast_hpd(coord2_preds, test_coords_embed, 2, quarter=q)

show(row(p0, p1, p2))


Decode the hold-out coordinates and check where they are in the forecasted sequences.


In [461]:
# coord0_preds[sample(range(len(coord0_preds)), 100)].tolist()

In [493]:
output_notebook()

p = figure(plot_width=300, plot_height=280, x_axis_label='Calendar Quarter', y_axis_label='Coordinate Value')
pp_x = np.arange(len(avg_coords_by_quarter)+2)[:, None]

coord = 2
coord_pred_dict = {'coord0': coord0_preds,
                   'coord1': coord1_preds,
                   'coord2': coord2_preds,}

n_samples = 200
xs = list(pp_x.reshape(-1).repeat(n_samples).reshape(-1, n_samples).T.tolist())
ys = coord_pred_dict['coord{0}'.format(coord)][sample(range(len(coord_pred_dict['coord{0}'.format(coord)])), n_samples)].tolist()
p.multi_line(xs=xs, 
             ys=ys, 
             color='#fff176', 
             alpha=1)

p.circle(x_vals.reshape(-1), avg_coords_by_quarter['coord{0}'.format(coord)], color='#616161')
p.circle(pp_x[-2:].reshape(-1), test_coords_embed['coord{0}'.format(coord)], color='#ff5252', radius=1)
show(p)



We can see that the GPs learned the time-varying function fairly accurately, with some outliers smoothed out. Additionally, the uncertainty around the two-quarter-ahead predictions (right-most side of each plot) fans out, but not excessively. We can sample from these coordinates and decode them back into new sequence predictions.


In [424]:
one_quarter_pred = np.array([coord0_preds[:, -2], coord1_preds[:, -2], coord2_preds[:, -2]]).T
two_quarter_pred = np.array([coord0_preds[:, -1], coord1_preds[:, -1], coord2_preds[:, -1]]).T

# Grab out the 95% HPD interval from the data, filter out the data that are outside of HPD
oneQ_low, oneQ_high = get_density_interval(95, one_quarter_pred)
oneQ_low, oneQ_high


Out[424]:
(array([-0.81481089, -2.0716895 ,  2.07563536]),
 array([  1.64616269e-03,  -1.37735577e+00,   2.65616666e+00]))

Sanity check to see what one_quarter_pred looks like.


In [425]:
one_quarter_pred


Out[425]:
array([[-0.22533602, -1.81781203,  2.07353906],
       [-0.23686901, -1.37654179,  2.16634588],
       [-0.29916472, -1.64404162,  2.69969378],
       ..., 
       [-0.1421011 , -1.93666375,  2.36000013],
       [-0.34343446, -1.71038549,  2.45056756],
       [-0.3791557 , -1.6089714 ,  2.64149927]])

In [426]:
test_embeddings_mean


Out[426]:
array([[-0.16917507, -1.47612083,  1.75800097],
       [-0.8223381 , -1.67774343,  3.00160122],
       [-0.93456453, -1.85243511,  2.42643428],
       ..., 
       [-1.53378665, -2.12063313,  2.74607778],
       [-1.21032155, -2.20410991,  2.56491613],
       [-1.53378665, -2.12063313,  2.74607778]], dtype=float32)

Compute the convex hull of the predicted points, and check whether the two 2017 quarterly coordinates fall within it.


In [427]:
oneQ_hull = ConvexHull(one_quarter_pred)
twoQ_hull = ConvexHull(two_quarter_pred)

In [428]:
oneQ_hull.vertices


Out[428]:
array([ 35,  63,  99, 116, 176, 242, 255, 256, 279, 330, 390, 407, 408,
       446, 474, 492, 498, 505, 518, 569, 575, 577, 609, 614, 647, 648,
       656, 702, 736, 746, 782, 803, 871, 875, 919, 923, 941, 982, 985], dtype=int32)

Quick sanity check: vary "dim" from 0 to 2 and compare against test_coords_embed above. Are the held-out coordinates within the "cube" formed by the convex hull vertices? (Granted, there are better methods for checking whether a point lies inside the convex hull, but this is quick and dirty; see the sketch after the next cell.)


In [429]:
dim = 2
one_quarter_pred[oneQ_hull.vertices][:, dim].min(), one_quarter_pred[oneQ_hull.vertices][:, dim].max()
# two_quarter_pred[twoQ_hull.vertices][:, dim].min(), two_quarter_pred[twoQ_hull.vertices][:, dim].max()


Out[429]:
(1.8668841031409968, 2.8441310934452693)
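
As a slightly more direct (but still rough) alternative to the per-axis min/max check, scipy's Delaunay triangulation can test point-in-hull membership directly: find_simplex returns -1 for points outside the hull of the predicted point cloud. This sketch is my own addition, not part of the original analysis.

In [ ]:
# Rough sketch: test whether the held-out 2017 quarterly coordinates lie
# inside the convex hull of the one-quarter-ahead predictions.
from scipy.spatial import Delaunay

hull_tri = Delaunay(one_quarter_pred)
inside = hull_tri.find_simplex(
    test_coords_embed[['coord0', 'coord1', 'coord2']].values) >= 0
inside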

In [430]:
from ipywidgets import IntSlider
# @interact(elev=range(-180, 180, 20), azim=range(0, 360, 20))
def plot_predicted_coords(elev, azim):
    fig = plt.figure(figsize=(8,8))
    ax = fig.add_subplot(1,1,1, projection='3d')
    ax.scatter(test_coords_embed['coord0'], test_coords_embed['coord1'], 
               test_coords_embed['coord2'], s=200)
    # ax.scatter(one_quarter_pred[:, 0], one_quarter_pred[:, 1], one_quarter_pred[:, 2], alpha=0.1)
    # ax.scatter(two_quarter_pred[:, 0], two_quarter_pred[:, 1], two_quarter_pred[:, 2], alpha=0.1)
    ax.set_facecolor('white')
    ax.view_init(elev=elev, azim=azim)
    
    for simplex in oneQ_hull.simplices:
        ax.plot(one_quarter_pred[simplex, 0], 
                 one_quarter_pred[simplex, 1], 
                 one_quarter_pred[simplex, 2], 
                 'b-', lw=0.2)
    plt.show()
    
elev_slider = IntSlider(value=0, min=-180, max=180, step=20)
azim_slider = IntSlider(value=60, min=0, max=360, step=20)
interact(plot_predicted_coords, elev=elev_slider, azim=azim_slider)


Out[430]:
<function __main__.plot_predicted_coords>

Now, let's try sampling out sequences from this coordinate space.
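
The decoding below relies on embedding2binary and binary2chararray from the custom module, which are not shown here. As a rough sketch under assumptions (the alphabet, its ordering, and the reshape are guesses, not the actual implementation), decoding could run the decoder and take a per-position argmax over the amino-acid alphabet:

In [ ]:
# Rough sketch of decoding latent coordinates back into amino-acid strings.
# The notebook actually uses `embedding2binary`/`binary2chararray`.
import numpy as np

DECODE_ALPHABET = np.array(list('ACDEFGHIKLMNPQRSTVWY-X'))  # assumed

def decode_to_strings(decoder, coords, alphabet=DECODE_ALPHABET):
    probs = decoder.predict(coords)                        # (n, seq_len * K)
    probs = probs.reshape(len(coords), -1, len(alphabet))  # (n, seq_len, K)
    best = probs.argmax(axis=-1)                           # index per position
    return [''.join(alphabet[row]) for row in best]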


In [431]:
from custom import embedding2binary, binary2chararray
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

In [432]:
from collections import Counter

def embedding2seqs(decoder, preds, sequences):
    binary = embedding2binary(decoder, preds)
    chararray = binary2chararray(sequences, binary)
    strs = [''.join(i for i in k) for k in chararray]
    strcounts = Counter(strs)
    seqrecords = []
    for s, c in sorted(strcounts.items(), key=lambda x: x[1], reverse=True):
        s = Seq(s)
        sr = SeqRecord(s, id='weight_' + str(c))
        seqrecords.append(sr)
        
    return binary, chararray, strcounts, strs, seqrecords
    
oneQ_binary, oneQ_chararray, oneQ_strcounts, oneQ_strs, oneQ_seqrecords = embedding2seqs(decoder, one_quarter_pred, sequences)
twoQ_binary, twoQ_chararray, twoQ_strcounts, twoQ_strs, twoQ_seqrecords = embedding2seqs(decoder, two_quarter_pred, sequences)

In [433]:
twoQ_strcounts.keys()


Out[433]:
dict_keys(['MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSSSSFFSRLNWLTHLNYTYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICIAAAA', 'MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSSSSFFSRLNWLTHLNYTYPALNVTMPNKEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRVQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNETYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICIAAAA', 'MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRASSSSFFSRLNWLTHLNYTYPALNVTMPNKEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRVQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNETYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICIAAAA', 'MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRASSSSFFSRLNWLTHLNYTYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICIAAAA', 'MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFKNESFNWTGVTQNGTSSACIRASSSSFFSRLNWLTHLNYTYPALNVTMPNKEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRVQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNETYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICIAAAA', 'MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSSSSFFSRLNWLTHLNYTYPALNVTMPNKEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRVQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICIAAAA', 
'MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSSSSFFSRLNWLTHLNYTYPALNVTMPNKEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNETYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICIAAAA', 'MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFKNESFNWTGVTQNGTSSACIRRSSSSFFSRLNWLTHLNYTYPALNVTMPNKEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRVQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNETYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICIAAAA', 'MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRGSSSSFFSRLNWLTHLNYTYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICIAAAA', 'MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRASSSSFFSRLNWLTHLNYTYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRVQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICIAAAA', 'MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRGSSSSFFSRLNWLTHLNYTYPALNVTMPNKEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRVQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNETYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICIAAAA', 'MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSSSSFFSRLNWLTHLNYTYPALNVTMPNKEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICIAAAA', 
'MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFKNESFNWTGVTQNGTSSACIRGSSSSFFSRLNWLTHLNYTYPALNVTMPNKEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRVQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNETYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICIAAAA', 'MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSSSSFFSRLNWLTHLNYTYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRVQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICIAAAA', 'MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFKNESFNWTGVTQNGTSSACIRRSSSSFFSRLNWLTHLNYTYPALNVTMPNKEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRVQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIESIRNETYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICIAAAA', 'MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFKNESFNWAGVTQNGTSSACIRGSSSSFFSRLNWLTHLNYTYPALNVTMPNKEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRVQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNETYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICIAAAA'])

In [434]:
# oneQ_binary = embedding2binary(decoder, one_quarter_pred)
# oneQ_chararray = binary2chararray(sequences, oneQ_binary)

In [435]:
# twoQ_binary = embedding2binary(decoder, two_quarter_pred)
# twoQ_chararray = binary2chararray(sequences, twoQ_binary)

Now, we can save the data to disk.


In [436]:
# # Data structure here: 
# oneQ_strs = Counter([''.join(i for i in k) for k in oneQ_chararray])

# oneQ_seqrecords = []
# for s, c in sorted(oneQ_strs.items(), key=lambda x: x[1], reverse=True):
#     s = Seq(s)
#     sr = SeqRecord(s, id='weight_' + str(c))
#     oneQ_seqrecords.append(sr)
    
# # Write to disk
SeqIO.write(oneQ_seqrecords, 'data/oneQ_predictions.fasta', 'fasta')
SeqIO.write(twoQ_seqrecords, 'data/twoQ_predictions.fasta', 'fasta')


Out[436]:
16

In [437]:
test_binarray = embedding2binary(decoder, test_coords_embed.loc[:, ['coord0', 'coord1', 'coord2']].values)
test_chararray = binary2chararray(sequences, test_binarray)
test_strs = [''.join(i for i in k) for k in test_chararray]
test_seqrecords = [SeqRecord(seq=Seq(i)) for n, i in enumerate(test_strs)]

In [438]:
twoQ_strcounts[test_strs[1]]
oneQ_strcounts[test_strs[0]]


Out[438]:
335

In [439]:
sorted(oneQ_strcounts.values(), reverse=True)[0:3]


Out[439]:
[572, 335, 29]

In [440]:
sorted(twoQ_strcounts.values(), reverse=True)[0:3]


Out[440]:
[450, 419, 31]

Sampled Point Cloud

I would like to visualize the sampled point cloud. Each unique sequence is assigned its own colour (cycling through a 20-colour palette), so we can see how much of the predicted volume each sequence takes up.


In [441]:
from palettable.tableau import Tableau_20
from matplotlib.colors import ListedColormap
from itertools import cycle
from datetime import datetime

def make_predcoords_df(strcounts, strs, chararray, pred):
    """
    Just a utility function for the notebook. DO NOT PUT INTO PRODUCTION!
    """

    # cm = discrete_cmap(len(strcounts.keys()), base_cmap=Tableau_20)
    cmap_mpl = dict()
    for (seq, rgb) in zip(strcounts.keys(), cycle(Tableau_20.mpl_colors)):
        cmap_mpl[seq] = rgb

    cmap_hexcol = dict()
    for (seq, hexcol) in zip(strcounts.keys(), cycle(Tableau_20.hex_colors)):
        cmap_hexcol[seq] = hexcol
    
    mpl_colors = [cmap_mpl[s] for s in strs]
    hex_colors = [cmap_hexcol[s] for s in strs]    
    
    predcoords = pd.DataFrame()
    predcoords['coords0'] = pred[:, 0]
    predcoords['coords1'] = pred[:, 1]
    predcoords['coords2'] = pred[:, 2]
    predcoords['matplotlib_colors'] = mpl_colors
    predcoords['hexdecimal_colors'] = hex_colors
    
    return predcoords
    
oneQ_predcoords = make_predcoords_df(oneQ_strcounts, oneQ_strs, oneQ_chararray, one_quarter_pred)
twoQ_predcoords = make_predcoords_df(twoQ_strcounts, twoQ_strs, twoQ_chararray, two_quarter_pred)

oneQ_predcoords.to_csv('data/oneQ_prediction_coords_with_colors.csv')
twoQ_predcoords.to_csv('data/twoQ_prediction_coords_with_colors.csv')

dim1 = 'coords0'
dim2 = 'coords2'

def make_bokeh_plot_predcoord_bounding_boxes(predcoords, dim1, dim2):
    """
    This is just a plot made for diagnostic purposes. DO NOT PUT INTO PRODUCTION!
    """
    
    assert dim1 != dim2
    p = figure(plot_width=400, plot_height=300)
    xs_all = []
    ys_all = []
    colors = []
    starttime = datetime.now()
    for (mpl_color, hex_color), dat in predcoords.groupby(['matplotlib_colors', 'hexdecimal_colors']):
        d = dat[[dim1, dim2]]
        
        if len(d) >=10:
            xs = []
            ys = []
            hull = ConvexHull(d[[dim1, dim2]])
            for v in hull.vertices:
                xs.append(d.iloc[v][dim1])
                ys.append(d.iloc[v][dim2])
            xs.append(xs[0])
            ys.append(ys[0])
            xs_all.append(xs)
            ys_all.append(ys)
            colors.append(hex_color)
    p.multi_line(xs_all, ys_all, color=colors)
    endtime = datetime.now()
    elapsed = endtime - starttime
    print(f'elapsed time: {elapsed} s')

    show(p)
    
make_bokeh_plot_predcoord_bounding_boxes(twoQ_predcoords, 'coords0', 'coords2')


elapsed time: 0:00:00.019869 s

Further Check

I wanted to check whether actual sequences looked anything like the "average" sequence. This is the sanity check below.


In [442]:
most_probable_seq = 'MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSSSSFFSRLNWLTHLNYTYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICIAAAA'

In [443]:
from Levenshtein import distance
levDs_from_preds = [distance(str(record.seq), most_probable_seq) for record in test_sequences]

def ecdf_scatter(data):
    x, y = np.sort(data), np.arange(1, len(data)+1) / len(data)
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.scatter(x, y)
    
    return ax
    
    
ax = ecdf_scatter(levDs_from_preds)
ax.set_xlabel('Number of amino acids differing')
ax.set_ylabel('Cumulative distribution')


Out[443]:
<matplotlib.text.Text at 0x7f3edd30acf8>

The answer is... not really. Haha, but this is expected. Based on the last quarter, the "average" sequence looks a median of 11 a.a. different from any given "actual" sequence, with a spread from 6 to 18.

A way to work around this is to sample per month rather than per quarter. But that would involve a crap ton of re-training... okay, maybe that's something I could do over the coming weeks instead! (A minimal sketch of the monthly aggregation follows.)
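
As a minimal sketch (the three GP models above would still need to be re-fit on this finer time grid), the monthly aggregation mirrors the quarterly one:

In [ ]:
# Minimal sketch: monthly rather than quarterly coordinate aggregation,
# reusing `tm_coords` from the trajectory section above.
avg_coords_by_month = tm_coords.groupby('Collection Date')\
                               [['coord0', 'coord1', 'coord2']].median()\
                               .resample('M').median().dropna().reset_index()
avg_coords_by_month.head()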

Vaccine Strain Trajectory

One big question is whether the vaccine strains have been selected well or not. Let's take a look at where the vaccine strains fall on the evolutionary trajectory plots.


In [444]:
import yaml

with open('data/vaccine_strains.yaml', 'r+') as f:
    vaccine_strains = yaml.load(f)
    
set(vaccine_strains.values())


Out[444]:
{'A/Brisbane/10/2007',
 'A/California/7/2004',
 'A/Fujian/411/2002',
 'A/Hong Kong/4801/2014',
 'A/Moscow/10/99',
 'A/Perth/16/2009',
 'A/Switzerland/9715293/2013',
 'A/Texas/50/2012',
 'A/Victoria/361/2011',
 'A/Wisconsin/67/2005'}

In [445]:
training_metadata['Strain Name'] = training_metadata['Strain Name'].str.split('(').str[0]
training_metadata[training_metadata['Strain Name'].isin(set(vaccine_strains.values()))]


/home/ericmjl/anaconda/envs/pymc3/lib/python3.6/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
Out[445]:
Name Sequence Accession Complete Genome Segment Segment Length Subtype Collection Date Host Species Country State/Province Flu Season Strain Name Unnamed: 12 coords0 coords1 coords2
1521 HA KM978061 Yes 4 1701 H3N2 2007-06-01 IRD:Human Australia -N/A- -N/A- A/Brisbane/10/2007 NaN 1.996639 -2.077164 0.645289
1522 HA CY039087 Yes 4 1717 H3N2 2007-06-01 IRD:Human Australia -N/A- -N/A- A/Brisbane/10/2007 NaN 1.996639 -2.077164 0.645289
1523 HA KM821339 Yes 4 1739 H3N2 2007-06-01 IRD:Human Australia -N/A- -N/A- A/Brisbane/10/2007 NaN 1.857847 -1.965022 0.615769
1524 HA KJ609214 Yes 4 1719 H3N2 2007-02-06 IRD:Human Australia -N/A- -N/A- A/Brisbane/10/2007 NaN 2.086244 -2.154771 0.752320
1525 HA KM978060 Yes 4 1701 H3N2 2007-06-01 IRD:Human Australia -N/A- -N/A- A/Brisbane/10/2007 NaN 1.846917 -1.932853 0.670933
1526 HA CY035022 Yes 4 1714 H3N2 2007-02-06 IRD:Human Australia -N/A- -N/A- A/Brisbane/10/2007 NaN 1.996639 -2.077164 0.645289
1527 HA EU199366 Yes 4 1701 H3N2 2007-06-01 IRD:Human Australia -N/A- -N/A- A/Brisbane/10/2007 NaN 2.086244 -2.154771 0.752320
2056 HA CY114373 Yes 4 1731 H3N2 2004-06-01 IRD:Human USA California -N/A- A/California/7/2004 NaN 1.672657 -2.446701 0.157596
2057 HA KM821334 Yes 4 1739 H3N2 2004-06-01 IRD:Human USA California -N/A- A/California/7/2004 NaN 1.672657 -2.446701 0.157596
4219 HA CY088483 Yes 4 1701 H3N2 2002-06-01 IRD:Human China -N/A- -N/A- A/Fujian/411/2002 NaN 0.797548 -3.386261 0.631403
4220 HA KM821324 Yes 4 1734 H3N2 2002-06-01 IRD:Human China -N/A- -N/A- A/Fujian/411/2002 NaN 1.168723 -3.565021 0.478938
4221 HA CY112933 Yes 4 1734 H3N2 2002-06-01 IRD:Human China -N/A- -N/A- A/Fujian/411/2002 NaN 0.817277 -2.969726 0.236463
10630 HA KJ609206 Yes 4 1727 H3N2 2009-04-07 IRD:Human Australia -N/A- -N/A- A/Perth/16/2009 NaN 1.570083 -1.476204 0.913420
10631 HA KM821346 Yes 4 1739 H3N2 2009-06-01 IRD:Human Australia -N/A- -N/A- A/Perth/16/2009 NaN 1.570083 -1.476204 0.913420
10632 HA GQ293081 Yes 4 1757 H3N2 2009-04-07 IRD:Human Australia -N/A- -N/A- A/Perth/16/2009 NaN 1.615223 -1.532344 1.027166
15006 HA KC892248 Yes 4 1701 H3N2 2012-04-15 IRD:Human USA Texas 11-12 A/Texas/50/2012 NaN 1.083027 -0.750671 1.785108
15007 HA KC892952 Yes 4 1701 H3N2 2012-04-15 IRD:Human USA Texas 11-12 A/Texas/50/2012 NaN 1.083027 -0.750671 1.785108
15008 HA KJ942616 Yes 4 1749 H3N2 2012-06-01 IRD:Human USA Texas -N/A- A/Texas/50/2012 NaN 1.356216 -0.956209 2.112774
15755 HA KJ942680 Yes 4 1751 H3N2 2011-06-01 IRD:Human Australia -N/A- -N/A- A/Victoria/361/2011 NaN 0.856970 -1.166669 1.332401
15756 HA KM821347 Yes 4 1734 H3N2 2011-06-01 IRD:Human Australia -N/A- -N/A- A/Victoria/361/2011 NaN 1.024059 -1.062532 1.447274
15757 HA KC306165 Yes 4 1701 H3N2 2011-06-01 IRD:Human Australia -N/A- -N/A- A/Victoria/361/2011 NaN 1.434311 -1.393395 1.895818
16712 HA CY164048 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16713 HA CY163912 Yes 4 1731 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16714 HA CY164096 Yes 4 1731 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16715 HA CY163872 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16716 HA CY164064 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16717 HA CY163696 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16718 HA CY163728 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16719 HA CY163688 Yes 4 1731 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16720 HA CY163720 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
16728 HA CY163928 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16729 HA CY164072 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16730 HA CY163992 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16731 HA CY164088 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16732 HA CY163648 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16733 HA CY163976 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16734 HA CY163664 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16735 HA CY164008 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16736 HA CY163880 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16737 HA CY163744 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16738 HA CY163936 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16739 HA CY164000 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16740 HA CY163904 Yes 4 1731 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16741 HA CY164104 Yes 4 1731 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16742 HA CY163704 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16743 HA CY164056 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16744 HA CY164040 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16745 HA CY163984 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16746 HA CY164016 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16747 HA CY163920 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16748 HA CY164080 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16749 HA CY163656 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.286535 -1.908271 0.217957
16750 HA CY163888 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.328366 -2.004357 0.321468
16751 HA CY163960 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16752 HA CY163944 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16753 HA CY163736 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16754 HA CY114381 Yes 4 1737 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16755 HA KM821341 Yes 4 1746 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16756 HA EU103823 Yes 4 1701 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.438502 -2.019906 0.391133
16757 HA CY034116 Yes 4 1714 H3N2 2005-06-01 IRD:Human USA Wisconsin -N/A- A/Wisconsin/67/2005 NaN 1.286535 -1.908271 0.217957

67 rows × 16 columns
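
The overlay plot itself is not in this notebook yet. As a rough sketch of the intended next step, the vaccine strain rows (with the coordinates attached to training_metadata above) could be plotted over the latent-space scatter like this:

In [ ]:
# Rough sketch: overlay the vaccine strains (red) on the training-set
# latent coordinates, which are coloured by collection date.
vaccine_rows = training_metadata[
    training_metadata['Strain Name'].isin(set(vaccine_strains.values()))]

fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(1, 1, 1, projection='3d')
ax.scatter(training_embeddings_mean[:, 0],
           training_embeddings_mean[:, 1],
           training_embeddings_mean[:, 2],
           c=training_metadata['Collection Date'], cmap='viridis', alpha=0.1)
ax.scatter(vaccine_rows['coords0'], vaccine_rows['coords1'],
           vaccine_rows['coords2'], color='red', s=80, label='vaccine strains')
ax.set_facecolor('white')
ax.legend()
plt.show()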


In [ ]: