This notebook demonstrates the use of spline transformations in CONCISE.
The core building blocks of the spline transformation are B-splines (see "Splines, knots and penalties" by Eilers and Marx). Here is what they look like and how to compute them with concise.
In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from concise.utils.splines import BSpline
from concise.preprocessing import EncodeSplines  # uses BSpline internally
In [2]:
def plot_order(degree, ax):
    x = np.linspace(0, 1, 100).reshape((-1, 1))
    y = EncodeSplines(n_bases=10, degree=degree).fit_transform(x).squeeze()
    # alternatively: y = BSpline(0, 1, 10, degree).predict(x)
    ax.plot(x, y)
    ax.set(title="Degree: {d}, order: {o}".format(d=degree, o=degree+1))
In [3]:
fig, ax = plt.subplots(nrows=4)
fig.tight_layout()
for i in range(ax.shape[0]):
    plot_order(i, ax[i])
We can see that for degree=0, the B-splines simply partition the range into equally sized bins; higher degrees yield smoother, overlapping bases.
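Since B-splines form a partition of unity within their range, the basis values at each point should sum to one; for degree=0 each point then falls into exactly one bin. A quick check (a sketch, assuming EncodeSplines follows the standard B-spline construction):
x = np.linspace(0, 1, 100).reshape((-1, 1))
y0 = EncodeSplines(n_bases=10, degree=0).fit_transform(x).squeeze()  # (100, 10)
print(y0.sum(axis=1)[:5])  # expected: ones (partition of unity)
print(y0[:3])              # degree 0: one-hot bin membership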
Let's see how to fit a simple smooth curve, as is usually done with generalized additive models (using, say, the mgcv package in R).
In [4]:
# some data
x = np.linspace(1, 15)
y_true = np.sin(x)
y = y_true + np.random.normal(scale=0.3, size=len(x))
plt.scatter(x,y);
In [5]:
import concise.layers as cl
import concise.regularizers as cr
import keras.layers as kl
from keras.models import Model, Sequential
from keras.optimizers import Adam
In [6]:
# pre-processor
x_spl = EncodeSplines(n_bases=10).fit_transform(x.reshape((-1,1)))
In [7]:
x_spl.shape
Out[7]:
(50, 1, 10)
In [10]:
# keras model
m = Sequential([
    cl.SplineT(input_shape=x_spl.shape[1:])
])
m.compile(Adam(lr=0.1), "mse")
In [11]:
m.fit(x_spl, y, epochs=10, verbose=0);
In [12]:
plt.scatter(x,y)
plt.plot(x, m.predict(x_spl));
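For intuition: this single SplineT layer is just a linear model on the spline basis, $\hat{y} = \sum_k w_k B_k(x)$. As a sketch (assuming, as in the positional-effect reconstruction further below, that the layer's kernel is its first weight), the fitted curve can be reconstructed by hand:
w = m.layers[0].get_weights()[0]        # spline coefficients; ravel to length n_bases
y_manual = x_spl[:, 0, :] @ np.ravel(w)
# if the layer also has a bias term, it would be get_weights()[1]
plt.scatter(x, y)
plt.plot(x, y_manual);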
The advantage of penalized B-splines (P-splines) is that we can use a large number of spline knots together with a smoothness regularization. In concise, this is achieved by adding a regularizer to the SplineT layer: concise.regularizers.SplineSmoother(diff_order=2, l2_smooth=0.0, l2=0.0).
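Following Eilers and Marx, the penalty is the squared diff_order-th difference of the spline coefficients, scaled by l2_smooth. A minimal numpy sketch of that penalty term (assuming SplineSmoother implements the standard P-spline difference penalty):
# second-order difference penalty on hypothetical spline coefficients w:
# l2_smooth * sum_i (w[i+2] - 2*w[i+1] + w[i])**2
w = np.random.randn(10)
penalty = 10 * np.sum(np.diff(w, n=2) ** 2)  # l2_smooth=10, diff_order=2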
In [13]:
def fit_lines(x, n_bases=10, l2_smooth=0, diff_order=2):
    # fit the pre-processor
    es = EncodeSplines(n_bases=n_bases)
    es.fit(x.reshape((-1, 1)))
    # train the model (note: y is taken from the notebook's global scope)
    x_spl = es.transform(x.reshape((-1, 1)))
    m = Sequential([
        cl.SplineT(kernel_regularizer=cr.SplineSmoother(diff_order=diff_order,  # smoothness regularization
                                                        l2_smooth=l2_smooth),
                   input_shape=x_spl.shape[1:])
    ])
    m.compile(Adam(lr=0.1), "mse")
    m.fit(x_spl, y, epochs=150, verbose=0)
    # predict on a denser grid
    x_pred = np.linspace(x.min(), x.max(), num=100)
    return x_pred, m.predict(es.transform(x_pred.reshape((-1, 1))))
In [14]:
# Generate data
x = np.linspace(1, 15, num=20)
y_true = np.sin(x)
y = y_true + np.random.normal(scale=0.3, size=len(x))
In [15]:
l2s_list = [0, 10]
y_pred_l = [fit_lines(x, n_bases=50, l2_smooth=l2s) for l2s in l2s_list]
In [16]:
plt.scatter(x, y)
for i, (x_pred, y_pred) in enumerate(y_pred_l):
    plt.plot(x_pred, y_pred, label=l2s_list[i])
plt.legend(title="l2_smooth");
Let's show a simple use case of the spline transformation in the context of DNA sequence-based models. The core idea is to provide, in addition to the sequence, a scalar track along the sequence encoding distances to some genomic landmarks. For simplicity, we will here use the position within the sequence itself.
We will use a simple simulated dataset, in which multiple TTAATGA motifs were embedded in the sequences and the response was computed as the sum of position-dependent motif effects:
$$ f_{pos}(x) = 0.3 \, \frac{1}{\sqrt{2 \pi \cdot 0.01}} \exp\left\{- \frac{(x - 0.2)^2}{0.02}\right\} + 0.05 \sin(15 x) + 0.8 \;,$$
where $x \in [0, 1]$ is the position within the sequence.
In [17]:
from scipy.stats import norm
pef = lambda x: 0.3*norm.pdf(x, 0.2, 0.1) + 0.05*np.sin(15*x) + 0.8
In [55]:
positions = np.arange(499)
pos_effect_theoretical = pef(positions/positions.max())
plt.plot(pos_effect_theoretical);
plt.xlabel("Position")
plt.ylabel("Pos. effect");
In [56]:
import pandas as pd
from concise.preprocessing import encodeDNA
from concise.utils.pwm import PWM
import concise.initializers as ci
from keras.callbacks import EarlyStopping
In [57]:
# Load the data
data_dir = "../data/"
dt = pd.read_csv(data_dir + "/01-fixed_seq_len-1.csv")
motifs = ["TTAATGA"]
dt
Out[57]:
In [58]:
# Create arrays for model training
x_seq = encodeDNA(dt["seq"])
y = dt["y"].values  # .as_matrix() is deprecated in newer pandas
seq_length = x_seq.shape[1]
x_seq.shape  # (n_samples, seq_length, 4); the last axis one-hot encodes the 4 DNA bases
Out[58]:
In [59]:
# positional track
x_pos = np.repeat(np.arange(seq_length).reshape((1,-1)), y.shape[0], axis=0)
x_pos[:5]
Out[59]:
In [60]:
# default data-format for SplineT positions is: (N_batch, seq_length, channels)
x_pos = x_pos.reshape((-1, seq_length, 1))
x_pos.shape
Out[60]:
In [61]:
# transform positions with the EncodeSplines pre-processor
es = EncodeSplines(n_bases=10)
x_pos_spl = es.fit_transform(x_pos)
# This adds a new axis n_bases to the end
x_pos_spl.shape
Out[61]:
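As a quick sanity check (added here for illustration), EncodeSplines should have appended an n_bases axis to the positional array:
# (N, seq_length, 1) -> (N, seq_length, 1, n_bases)
assert x_pos_spl.shape == x_pos.shape + (10,)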
In [62]:
# Compile the training set
train = {"seq": x_seq, "pos": x_pos_spl}, y
In [63]:
# To make the task easier, we will initialize the filters
# of the conv layer to the simulated motif
pwm_list = [PWM.from_consensus(motif) for motif in motifs]
motif_width = 7
print(pwm_list)
pwm_list[0].plotPWM(figsize=(3,0.5));
In [69]:
def model_SplineT(n_bases=10):
    """Keras model with two inputs: the DNA sequence and a positional track along it"""
    np.random.seed(42)
    # DNA input
    input_dna = cl.InputDNA(seq_length, name="seq")
    # Positions input
    input_pos = kl.Input((seq_length, 1, n_bases), name="pos")
    # Sequence module
    x_seq = cl.ConvDNA(filters=1,
                       kernel_size=motif_width,
                       activation="relu",
                       # initialize the filter to the simulated motif
                       kernel_initializer=ci.PSSMKernelInitializer(pwm_list),
                       # mean_max_scale of 1 means that only the consensus
                       # sequence gets a score larger than 0
                       bias_initializer=ci.PSSMBiasInitializer(pwm_list,
                                                               kernel_size=motif_width,
                                                               mean_max_scale=0.99),
                       padding="same",
                       name="conv_dna",
                       )(input_dna)
    # Positional module
    x_pos = cl.SplineT(kernel_regularizer=cr.SplineSmoother(l2_smooth=1e-3),
                       kernel_initializer="zeros",
                       name="SplineT")(input_pos)
    # shift by 1 so that the zero-initialized spline weights start
    # at a neutral (multiply-by-one) positional effect
    x_pos = kl.Lambda(lambda x: 1 + x)(x_pos)
    # Merge: scale the motif activations by the positional effect
    x = kl.multiply([x_seq, x_pos])
    x = cl.GlobalSumPooling1D()(x)
    x = kl.Dense(units=1, activation="linear")(x)
    model = Model(inputs=[input_dna, input_pos], outputs=x)
    # compile the model
    model.compile(optimizer=Adam(lr=0.1), loss="mse")
    return model
In [70]:
# Get the compiled model
np.random.seed(42)
m = model_SplineT()
In [71]:
# Train
np.random.seed(42)
m.fit(train[0], train[1],
      epochs=30, verbose=2,
      callbacks=[EarlyStopping(patience=5)],
      validation_split=.2);
In [72]:
## visualize the effect
w_pos = m.get_layer("SplineT").get_weights()[0]
x_pos_single = train[0]["pos"][0,:,0,:]
In [73]:
plt.plot(1 + np.dot(x_pos_single, np.ravel(w_pos)))
plt.plot(pos_effect_theoretical);
plt.ylabel('Positional effect')
plt.xlabel('Position')
plt.title("Positional effect")
plt.legend(labels=["Inferred", "Theoretical"]);
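As an alternative to feeding an explicit positional track, concise also provides the SplineWeight1D layer, which learns a smooth positional weighting directly on top of the convolutional activations, so no separate positional input is needed: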
In [74]:
def model_SplineWeight1D():
    np.random.seed(42)
    model = Sequential([
        cl.ConvDNA(filters=1,
                   kernel_size=motif_width,
                   activation="relu",
                   kernel_initializer=ci.PSSMKernelInitializer(pwm_list),
                   # mean_max_scale of 1 means that only the consensus
                   # sequence gets a score larger than 0
                   bias_initializer=ci.PSSMBiasInitializer(pwm_list,
                                                           kernel_size=motif_width,
                                                           mean_max_scale=0.99),
                   padding="same",
                   name="conv_dna",
                   seq_length=seq_length
                   ),
        cl.SplineWeight1D(n_bases=10, l2_smooth=1e-3, l2=0, name="spline_weight"),
        cl.GlobalSumPooling1D(),
        kl.Dense(units=1, activation="linear")
    ])
    model.compile(optimizer=Adam(lr=0.1), loss="mse")
    return model
In [75]:
# Train
np.random.seed(42)
m = model_SplineWeight1D()
m.fit(train[0]["seq"], train[1],
      epochs=30, verbose=2,
      callbacks=[EarlyStopping(patience=5)],
      validation_split=.2);
In [76]:
# plot
m.get_layer("spline_weight").plot() # layer also has a plotting method
plt.plot(positions, pos_effect_theoretical - 1)  # subtract the baseline of 1 to match the layer's zero-centered weights
plt.ylabel('Positional effect')
plt.xlabel('Position')
plt.title("Positional effect")
plt.legend(labels=["Inferred", "Theoretical"]);