Initializing filters on known motifs

In the scenario where data is scarse, it is often useful to initialize the filters of the first convolutional layer to some known position weights matrices (PWM's). That way, the model already starts with a parameter configuration much closer to the 'right' one.

Concise provides access to 2 PWM databases:

  • transcription factors from ENCODE (2067 PWMs)
  • transcription factors from HOCOMOCO v10 (640 PWMs)
  • rna-binding proteins from ATtrACT (1583 PWMs)

Find the motif of interest

Each PWM database is provided as a module under concise.data. It provides two functions:

  • concise.data.<db>.get_metadata() - returns a pandas.DataFrame with metadata information about each PWM
  • concise.data.<db>.get_pwm_list() - given a list of PWM ids, return a list with concise.utils.pwm.PWM instances

Metadata tables


In [2]:
%matplotlib inline
import matplotlib.pyplot as plt

In [4]:
# RBP PWM's
from concise.data import attract

dfa = attract.get_metadata()
dfa


Out[4]:
PWM_id Gene_name Gene_id Mutated Organism Motif Len Experiment_description Database Pubmed Experiment_description.1 Family Score
0 519 3IVK 3IVK no Mus_musculus GAAACA 6 X-RAY DIFFRACTION PDB 19965478 X-RAY DIFFRACTION NaN 1.000000**
1 574 3IVK 3IVK no Mus_musculus UGGG 4 X-RAY DIFFRACTION PDB 19965478 X-RAY DIFFRACTION NaN 1.000000**
2 464 4KZD 4KZD no Mus_musculus GAAAC 5 X-RAY DIFFRACTION PDB 24952597 X-RAY DIFFRACTION NaN 1.000000**
... ... ... ... ... ... ... ... ... ... ... ... ... ...
4879 1396 HNRNPAB ENSG00000197451 no Homo_sapiens AUAGCA 6 In vitro splicing assays AEDB 12426391 other RRM 1.000000**
4880 1397 HNRNPA1 ENSG00000135486 no Homo_sapiens UAGG 4 Immunoprecipitation;U... AEDB 15506926 UV cross-linking RRM 1.000000**
4881 1398 PTBP1 ENSG00000011304 no Homo_sapiens UUCUUC 6 In vivo splicing assa... AEDB 14966131 UV cross-linking RRM 1.000000**

4882 rows × 13 columns


In [5]:
# TF PWM's
from concise.data import encode

dfe = encode.get_metadata()
dfe


Out[5]:
PWM_id info1 info2 consensus
0 AFP_1 AFP1_transfac_M00616 None ATTAACTACAC
1 AHR::ARNT::HIF1A_1 AhR,-Arnt,-HIF-1_tran... None TGCGTGCGG
2 AHR::ARNT_1 AhR:Arnt_transfac_M00235 None TAAGGGTTGCGTGCCC
... ... ... ... ...
2064 ZSCAN4_3 ZSCAN4_jolma_full_M54 None TGCACACACTGAAAA
2065 fake_AACGSSAA None None AACGCCAA
2066 fake_AAGCSSAA None None AAGCCCAA

2067 rows × 4 columns


In [6]:
# TF PWM's
from concise.data import hocomoco

dfh = hocomoco.get_metadata()
dfh


Out[6]:
PWM_id TF Organism DB info consensus
0 AHR_HUMAN.H10MO.B AHR HUMAN H10MO B GTTGCGTGC
1 AIRE_HUMAN.H10MO.C AIRE HUMAN H10MO C ATTGGTTATATTGGTTAA
2 ALX1_HUMAN.H10MO.B ALX1 HUMAN H10MO B ATAATTGAATTA
... ... ... ... ... ... ...
637 ZN784_HUMAN.H10MO.D ZN784 HUMAN H10MO D GAGGTAGGTAC
638 ZSC16_HUMAN.H10MO.D ZSC16 HUMAN H10MO D GAGGTGTTCTGTTAACACTA
639 ZSCA4_HUMAN.H10MO.D ZSCA4 HUMAN H10MO D AGTGTGTGCACT

640 rows × 6 columns

Let's choose PUM2 PWM (RBP in Human):


In [7]:
dfa_pum2 = dfa[dfa.Gene_name.str.match("PUM2") & \
               dfa.Organism.str.match("Homo_sapiens")]
dfa_pum2


Out[7]:
PWM_id Gene_name Gene_id Mutated Organism Motif Len Experiment_description Database Pubmed Experiment_description.1 Family Score
2603 503 PUM2 ENSG00000055917 no Homo_sapiens UGUAAAUA 8 X-RAY DIFFRACTION PDB 21397187 X-RAY DIFFRACTION PUF 1.000000**
2604 361 PUM2 ENSG00000055917 no Homo_sapiens UGUACAUC 8 X-RAY DIFFRACTION PDB 21397187 X-RAY DIFFRACTION PUF 1.000000**
2605 514 PUM2 ENSG00000055917 no Homo_sapiens UGUAGAUA 8 X-RAY DIFFRACTION PDB 21397187 X-RAY DIFFRACTION PUF 1.000000**
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2613 107 PUM2 ENSG00000055917 no Homo_sapiens UGUAUAUA 8 PAR-clip C 20371350 genome-wide in vivo i... PUF 0.250000**
2614 107 PUM2 ENSG00000055917 no Homo_sapiens UGUACAUA 8 PAR-clip C 20371350 genome-wide in vivo i... PUF 0.250000**
2615 107 PUM2 ENSG00000055917 no Homo_sapiens UGUAGAUA 8 PAR-clip C 20371350 genome-wide in vivo i... PUF 0.250000**

13 rows × 13 columns

Visualization - PWM class

The PWM class provides a method plotPWM to visualize the PWM.


In [8]:
# Visualize the PUM2 Motifs from different experiments
from concise.utils.pwm import PWM
dfa_pum2_uniq = dfa_pum2[["Experiment_description", "PWM_id"]].drop_duplicates()
pwm_list = attract.get_pwm_list(dfa_pum2_uniq.PWM_id)

In [9]:
for i, pwm in enumerate(pwm_list):
    print("PWM_id:", pwm.name, "; Experiment_description:", dfa_pum2_uniq.Experiment_description.iloc[i])
    pwm.plotPWM(figsize=(3,1))


PWM_id: 503 ; Experiment_description: X-RAY DIFFRACTION
PWM_id: 361 ; Experiment_description: X-RAY DIFFRACTION
PWM_id: 514 ; Experiment_description: X-RAY DIFFRACTION
PWM_id: 116 ; Experiment_description: RIP-chip
PWM_id: 129 ; Experiment_description: genome-wide in vivo immunoprecipitation
PWM_id: 107 ; Experiment_description: PAR-clip

We can select the PWM with id 129.


In [10]:
pwm_list = [pwm for pwm in pwm_list if pwm.name == "129"]

In [11]:
pwm_list


Out[11]:
[PWM(name: 129, consensus: TGTAAATA)]

Initialize the conv filters with PWM values


In [12]:
import concise.layers as cl
import keras.layers as kl
import concise.initializers as ci
import concise.regularizers as cr
from keras.callbacks import EarlyStopping
from concise.preprocessing import encodeDNA
from keras.models import Model, load_model
from keras.optimizers import Adam

In [13]:
# get the data
def load(split="train", st=None):
    dt = pd.read_csv("../data/RBP/PUM2_{0}.csv".format(split))
    # DNA/RNA sequence
    xseq = encodeDNA(dt.seq) # list of sequences -> np.ndarray
    # response variable
    y = dt.binding_site.as_matrix().reshape((-1, 1)).astype("float")
    return {"seq": xseq}, y

train, valid, test = load("train"), load("valid"), load("test")

# deduce sequence length
seq_length = train[0]["seq"].shape[1]

In [14]:
# define the model
def model(train, filters=1, kernel_size=9, pwm_list=None, lr=0.001):
    seq_length = train[0]["seq"].shape[1]
    if pwm_list is None:
        kinit = "glorot_uniform"
        binit = "zeros"
    else:
        kinit = ci.PSSMKernelInitializer(pwm_list, add_noise_before_Pwm2Pssm=True)
        binit = "zeros"
        
    # sequence
    in_dna = cl.InputDNA(seq_length=seq_length, name="seq")
    x = cl.ConvDNA(filters=filters, 
                   kernel_size=kernel_size, 
                   activation="relu",
                   kernel_initializer=kinit,
                   bias_initializer=binit,
                   name="conv1")(in_dna)
    x = kl.AveragePooling1D(pool_size=4)(x)
    x = kl.Flatten()(x)
    
    x = kl.Dense(units=1)(x)
    m = Model(in_dna, x)
    m.compile(Adam(lr=lr), loss="binary_crossentropy", metrics=["acc"])
    return m

ci.PSSMKernelInitializer will set the filters of the first convolutional layer to the values of the position-specific scoring matrix (PSSM):

$$ pssm_{ij} = log \frac{pwm_{ij}}{b_j} \;,$$

where $b_j$ is the background probability of observing base $j$.

We add gaussian noise to each individual filter. Let's visualize the filters:


In [15]:
# create two models: with and without PWM initialization
m_rand_init = model(train, filters=3, pwm_list=None) # random initialization
m_pwm_init = model(train, filters=3, pwm_list=pwm_list) # motif initialization

In [17]:
print("Random initialization:")
m_rand_init.get_layer("conv1").plot_weights(figsize=(3, 5));


Random initialization:
<matplotlib.figure.Figure at 0x7f9003607208>

In [18]:
print("Known PWM initialization:")
m_pwm_init.get_layer("conv1").plot_weights(figsize=(3, 5));


Known PWM initialization:
<matplotlib.figure.Figure at 0x7f8f965565f8>

In [20]:
# train the models
m_rand_init.fit(train[0], train[1], epochs=50, validation_data=valid, 
                verbose=0,
                callbacks=[EarlyStopping(patience=5)])


Out[20]:
<keras.callbacks.History at 0x7f8f9b78ce48>

In [22]:
m_pwm_init.fit(train[0], train[1], epochs=50, validation_data=valid, 
                verbose=0,
                callbacks=[EarlyStopping(patience=5)]);

Test-set performance


In [23]:
import concise.eval_metrics as cem

In [24]:
# performance on the test-set
# Random initialization
print("Random intiailzation auPR:", cem.auprc(test[1], m_rand_init.predict(test[0])))
# PWM initialization
print("Known PWM initialization auPR:", cem.auprc(test[1], m_pwm_init.predict(test[0])))


Random intiailzation auPR: 0.580103849926
Known PWM initialization auPR: 0.713372711603

Filter visualization


In [25]:
m_rand_init.get_layer("conv1").plot_weights(plot_type="motif_pwm_info", figsize=(3, 5));


<matplotlib.figure.Figure at 0x7f8f9b7d0fd0>

In [26]:
m_pwm_init.get_layer("conv1").plot_weights(plot_type="motif_pwm_info", figsize=(3, 5));


<matplotlib.figure.Figure at 0x7f8f9b94c4e0>

Benefits of motif initialization

  • Interpretatbility
    • we can use fewer filters and know that the major effect will be captured by the first filters
      • handy when studying the model parameters
  • Better predictive performance
  • More stable training