Spline transformation in CONCISE

This notebook demonstrates the use of the spline transformation in CONCISE.

B-splines crash course

The core building blocks of the spline transformation are B-splines (see Splines, Knots and Penalties by Eilers and Marx). Here is what they look like and how to compute them with concise.


In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from concise.utils.splines import BSpline
from concise.preprocessing import EncodeSplines  # uses BSpline


Using TensorFlow backend.

In [2]:
def plot_order(degree, ax):
    x = np.linspace(0, 1, 100).reshape((-1,1))
    y = EncodeSplines(n_bases=10, degree=degree).fit_transform(x).squeeze() 
    # alternatively: y = BSpline(0, 1, 10, degree).predict(x)
    ax.plot(x, y)
    ax.set(title="Degree: {d}, order: {o}".format(d=degree, o=degree+1))

In [3]:
fig, ax = plt.subplots(nrows=4)
fig.tight_layout()
for i in range(ax.shape[0]):
    plot_order(i, ax[i])


We can see that for degree=0, the basis functions simply segment the range into equally sized bins.
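
Since B-spline bases form a partition of unity, each row of the encoding sums to one, and for degree=0 exactly one basis is active per position, i.e. the encoding one-hot encodes the bin membership. A quick sanity check (a sketch; it assumes EncodeSplines keeps the bases normalized to a partition of unity over the fitted range):

In [ ]:
x = np.linspace(0, 1, 100).reshape((-1, 1))
y0 = EncodeSplines(n_bases=10, degree=0).fit_transform(x).squeeze()
assert np.allclose(y0.sum(axis=1), 1)  # one active bin per position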

Generalized additive models with Keras

Let's see how to fit a simple smooth curve, as is usually done with generalized additive models (using, say, the mgcv package in R).


In [4]:
# some data
x = np.linspace(1, 15)
y_true = np.sin(x)
y = y_true + np.random.normal(scale=0.3, size=len(x))

plt.scatter(x,y);



In [5]:
import concise.layers as cl
import concise.regularizers as cr
import keras.layers as kl
from keras.models import Model, Sequential
from keras.optimizers import Adam

In [6]:
# pre-processor
x_spl = EncodeSplines(n_bases=10).fit_transform(x.reshape((-1,1)))

In [7]:
x_spl.shape


Out[7]:
(50, 1, 10)

In [10]:
# keras model: a single SplineT layer acts as a linear smooth term (GAM-style)
m = Sequential([
        cl.SplineT(input_shape=x_spl.shape[1:])
    ])
m.compile(Adam(lr=0.1), "mse")

In [11]:
m.fit(x_spl, y, epochs=10, verbose=0);

In [12]:
plt.scatter(x,y)
plt.plot(x, m.predict(x_spl));
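
Since SplineT applies no nonlinearity, the fitted curve is simply $\hat{y}(x) = \sum_k w_k B_k(x)$ (plus a possible bias term). We can reconstruct it by hand from the learned coefficients (a sketch; it assumes the layer's kernel comes first in get_weights(), as also used further below):

In [ ]:
w = m.layers[0].get_weights()[0]  # spline coefficients, shape (10, 1)
plt.scatter(x, y)
plt.plot(x, np.dot(x_spl.squeeze(), np.ravel(w)));  # kernel contribution (bias omitted)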


Smoothness regularization

The advantage of penalized B-splines (P-splines) is that we can use a large number of spline knots and control the effective flexibility with a smoothness penalty rather than with the number of knots. In concise, this is achieved by adding a regularizer to the SplineT layer: concise.regularizers.SplineSmoother(diff_order=2, l2_smooth=0.0, l2=0.0).
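
Conceptually, the smoothness term penalizes squared finite differences of neighbouring spline coefficients; with diff_order=2 it penalizes the discrete curvature of the coefficient sequence. A minimal NumPy sketch of the idea (illustration only, not the actual SplineSmoother implementation):

In [ ]:
def smoothness_penalty(w, l2_smooth=1.0, diff_order=2):
    d = np.diff(np.ravel(w), n=diff_order)  # finite differences of the coefficients
    return l2_smooth * np.sum(d ** 2)       # large for wiggly coefficient sequences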


In [13]:
def fit_lines(x, n_bases=10, l2_smooth=0, diff_order=2):
    # note: y is taken from the surrounding scope
    # train the pre-processor
    es = EncodeSplines(n_bases=n_bases)
    es.fit(x.reshape((-1,1)))
    
    # train the model
    x_spl = es.transform(x.reshape((-1,1)))
    m = Sequential([
            cl.SplineT(kernel_regularizer=cr.SplineSmoother(diff_order=diff_order, # Smoothness regularization
                                                            l2_smooth=l2_smooth),
                       input_shape=x_spl.shape[1:])
                    ])
    m.compile(Adam(lr=0.1), "mse")
    m.fit(x_spl, y, epochs=150, verbose=0)
    
    # denser grid to predict
    x_pred = np.linspace(x.min(), x.max(), num=100)
    return x_pred, m.predict(es.transform(x_pred.reshape((-1,1))))

In [14]:
# Generate data
x = np.linspace(1, 15, num=20)
y_true = np.sin(x)
y = y_true + np.random.normal(scale=0.3, size=len(x))

In [15]:
l2s_list = [0, 10]
y_pred_l = [fit_lines(x, n_bases=50, l2_smooth=l2s) for l2s in l2s_list]

In [16]:
plt.scatter(x,y)
for i, (x_pred, y_pred) in enumerate(y_pred_l):
    plt.plot(x_pred, y_pred, label=l2s_list[i])
plt.legend(title="l2_smooth");


SplineT in sequence-based models

Let's show a simple use-case of the spline transformation in the context of DNA sequence-based models. The core idea is to provide, in addition to the sequence, a scalar track along the sequence encoding distances to some genomic landmarks. For simplicity, we will here use the position within the sequence itself.

We will use a simple simulated dataset in which multiple TTAATGA motifs were embedded in the sequences and the response was computed as the sum of the position-dependent motif effects:

$$ f_{pos}(x) = 0.3 \cdot \frac{1}{\sqrt{2 \pi \cdot 0.01}} \exp\left\{- \frac{(x - 0.2)^2}{0.02}\right\} + 0.05 \sin(15 x) + 0.8 \;,$$

where $x\in[0,1]$ is the position within the sequence.


In [17]:
from scipy.stats import norm
pef = lambda x: 0.3*norm.pdf(x, 0.2, 0.1) + 0.05*np.sin(15*x) + 0.8
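
To make the setup concrete, here is a sketch of how such a response could be generated for a single sequence (a hypothetical illustration; the actual dataset used below was simulated elsewhere):

In [ ]:
def simulate_response(seq, motif="TTAATGA"):
    L = len(seq)
    # positions where the motif occurs
    matches = [i for i in range(L - len(motif) + 1) if seq[i:i + len(motif)] == motif]
    # sum of the position-dependent motif effects
    return sum(pef(i / float(L)) for i in matches)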

In [55]:
positions = np.arange(500)
pos_effect_theoretical = pef(positions/positions.max())
plt.plot(pos_effect_theoretical);
plt.xlabel("Position")
plt.ylabel("Pos. effect");



In [56]:
import pandas as pd
from concise.preprocessing import encodeDNA
from concise.utils.pwm import PWM
import concise.initializers as ci
from keras.callbacks import EarlyStopping

In [57]:
# Load the data
data_dir = "../data/"
dt = pd.read_csv(data_dir + "01-fixed_seq_len-1.csv")
motifs = ["TTAATGA"]
dt


Out[57]:
            y  seq
0      5.4763  TTCTGGGAGGCGTCCTTACGT...
1      0.2309  AAGTATGATCATACGCACCGT...
2      6.0881  TATGCAGGATCGAAATACTCT...
...       ...  ...
2997   0.2773  GAGGGGTCAGCATGCGTATGA...
2998  15.6793  CTACCAAATCCAAACTTCAGC...
2999  13.0256  CTGACCATACGTTTCACACCG...

3000 rows × 2 columns


In [58]:
# Create arrays for model training
x_seq = encodeDNA(dt["seq"])
y = dt["y"].values
seq_length = x_seq.shape[1]

x_seq.shape # (n_samples, seq_length, n_bases)


Out[58]:
(3000, 500, 4)

In [59]:
# positional track
x_pos = np.repeat(np.arange(seq_length).reshape((1,-1)), y.shape[0], axis=0)
x_pos[:5]


Out[59]:
array([[  0,   1,   2, ..., 497, 498, 499],
       [  0,   1,   2, ..., 497, 498, 499],
       [  0,   1,   2, ..., 497, 498, 499],
       [  0,   1,   2, ..., 497, 498, 499],
       [  0,   1,   2, ..., 497, 498, 499]])

In [60]:
# default data-format for SplineT positions is: (N_batch, seq_length, channels)
x_pos = x_pos.reshape((-1, seq_length, 1)) 
x_pos.shape


Out[60]:
(3000, 500, 1)

In [61]:
# transform positions with the EncodeSplines pre-processor
es = EncodeSplines(n_bases=10)
x_pos_spl = es.fit_transform(x_pos)
# This adds a new axis n_bases to the end
x_pos_spl.shape


Out[61]:
(3000, 500, 1, 10)

In [62]:
# Compile the training set
train = {"seq": x_seq, "pos": x_pos_spl}, y

In [63]:
# To make the task easier, we will initialize the filters 
# of the conv layer to the simulated motif
pwm_list = [PWM.from_consensus(motif) for motif in motifs]
motif_width = 7
print(pwm_list)
pwm_list[0].plotPWM(figsize=(3,0.5));


[PWM(name: None, consensus: TTAATGA)]

In [69]:
def model_SplineT(n_bases=10):
    """Create Keras model based on two data inputs: sequence and scalar-vector along the sequence"""
    np.random.seed(42)

    # DNA input
    input_dna = cl.InputDNA(seq_length, name="seq")
    
    # Positions input
    input_pos = kl.Input((seq_length, 1, n_bases), name="pos")

    # Sequence module
    x_seq = cl.ConvDNA(filters=1,
                       kernel_size=motif_width,
                       activation="relu",
                       # motif initialization
                       kernel_initializer=ci.PSSMKernelInitializer(pwm_list),
                       # mean_max_scale of 1 means that only the consensus
                       # sequence gets a score larger than 0
                       bias_initializer=ci.PSSMBiasInitializer(pwm_list,
                                                               kernel_size=motif_width,
                                                               mean_max_scale=0.99),
                       padding="same",
                       name="conv_dna",
                      )(input_dna)
    
    # Positional module
    x_pos = cl.SplineT(kernel_regularizer=cr.SplineSmoother(l2_smooth=1e-3),
                       kernel_initializer="zeros",
                       name="SplineT")(input_pos)
    # offset by 1 so that the zero-initialized SplineT starts as a
    # neutral multiplicative effect
    x_pos = kl.Lambda(lambda x: 1 + x)(x_pos)
    
    # Merge
    x = kl.multiply([x_seq, x_pos])
    x = cl.GlobalSumPooling1D()(x)
    x = kl.Dense(units=1, activation="linear")(x)
    model = Model(inputs=[input_dna, input_pos], outputs=x)

    # compile the model
    model.compile(optimizer=Adam(lr=0.1), loss="mse")
    return model

In [70]:
# Get the compiled model
np.random.seed(42)
m = model_SplineT()

In [71]:
# Train
np.random.seed(42)
m.fit(train[0], train[1],
      epochs=30, verbose=2,
      callbacks=[EarlyStopping(patience=5)],
      validation_split=.2);


Train on 2400 samples, validate on 600 samples
Epoch 1/30
0s - loss: 2.4880 - val_loss: 1.7815
Epoch 2/30
0s - loss: 1.7226 - val_loss: 1.5614
Epoch 3/30
0s - loss: 1.6402 - val_loss: 1.5303
Epoch 4/30
0s - loss: 1.7361 - val_loss: 1.6460
Epoch 5/30
0s - loss: 1.7810 - val_loss: 1.5749
Epoch 6/30
0s - loss: 1.8376 - val_loss: 1.5673
Epoch 7/30
0s - loss: 1.7018 - val_loss: 1.9090
Epoch 8/30
0s - loss: 1.6653 - val_loss: 1.6473
Epoch 9/30
0s - loss: 1.7693 - val_loss: 1.5478

In [72]:
## Visualize the inferred positional effect: 1 + B(pos) . w
w_pos = m.get_layer("SplineT").get_weights()[0]  # spline coefficients
x_pos_single = train[0]["pos"][0, :, 0, :]       # basis encoding along one sequence

In [73]:
plt.plot(1 + np.dot(x_pos_single, np.ravel(w_pos)))
plt.plot(pos_effect_theoretical);
plt.ylabel('Positional effect')
plt.xlabel('Position')
plt.title("Positional effect")
plt.legend(labels=["Inferred", "Theoretical"]);
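
As a rough quantitative check, we can correlate the inferred positional effect with the theoretical one (a sketch reusing the arrays from above):

In [ ]:
pos_effect_inferred = 1 + np.dot(x_pos_single, np.ravel(w_pos))
print(np.corrcoef(pos_effect_inferred, pos_effect_theoretical)[0, 1])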


Convenience layer: SplineWeight1D

Concise provides a shortcut for modelling positions within the sequence: the SplineWeight1D layer. It generates the positional B-spline encoding internally and scales the incoming activations by one plus a smooth positional curve, so no separate positional input is needed.


In [74]:
def model_SplineWeight1D():
    np.random.seed(42)
    model = Sequential([
                cl.ConvDNA(filters=1,
                       kernel_size=motif_width,
                       activation="relu",
                       kernel_initializer=ci.PSSMKernelInitializer(pwm_list),
                       # mean_max_scale of 1 means that only the consensus
                       # sequence gets a score larger than 0
                       bias_initializer=ci.PSSMBiasInitializer(pwm_list,
                                                               kernel_size=motif_width,
                                                               mean_max_scale=0.99),
                       padding="same",
                       name="conv_dna",
                       seq_length=seq_length
                      ),
                cl.SplineWeight1D(n_bases=10, l2_smooth=1e-3, l2=0, name="spline_weight"),
                cl.GlobalSumPooling1D(),
                kl.Dense(units=1, activation="linear")
        ])
    model.compile(optimizer=Adam(lr=0.1), loss="mse")
    return model

In [75]:
# Train
np.random.seed(42)
m = model_SplineWeight1D()
m.fit(train[0]["seq"], train[1],
      epochs=30, verbose=2,
      callbacks=[EarlyStopping(patience=5)],
      validation_split=.2);


Train on 2400 samples, validate on 600 samples
Epoch 1/30
0s - loss: 2.4785 - val_loss: 1.8418
Epoch 2/30
0s - loss: 1.6359 - val_loss: 1.5540
Epoch 3/30
0s - loss: 1.6398 - val_loss: 1.6543
Epoch 4/30
0s - loss: 1.6762 - val_loss: 1.4844
Epoch 5/30
0s - loss: 1.7411 - val_loss: 1.6224
Epoch 6/30
0s - loss: 1.9801 - val_loss: 1.8591
Epoch 7/30
0s - loss: 1.8632 - val_loss: 1.6679
Epoch 8/30
0s - loss: 1.8000 - val_loss: 1.5397
Epoch 9/30
0s - loss: 1.6719 - val_loss: 1.7291
Epoch 10/30
0s - loss: 1.6877 - val_loss: 1.7287

In [76]:
# plot
m.get_layer("spline_weight").plot() # layer also has a plotting method
plt.plot(positions, pos_effect_theoretical - 1)
plt.ylabel('Positional effect')
plt.xlabel('Position')
plt.title("Positional effect")
plt.legend(labels=["Inferred", "Theoretical"]);