DeepCpG basics

This tutorial describes how to create DeepCpG data files, train models, and evaluate models. These topics are described in greater detail in the DeepCpG documentation.

Initialization

We first initialize some variables that will be used throughout the tutorial. Setting test_mode=1 speeds up computations by using only a subset of the data and is recommended for testing. For real applications, set test_mode=0.


In [1]:
function run {
  local cmd=$@
  echo
  echo "#################################"
  echo $cmd
  echo "#################################"
  eval $cmd
}

test_mode=1 # Set to 1 for testing and 0 otherwise.
example_dir="../../data" # Example data.
cpg_dir="$example_dir/cpg" # CpG profiles.
dna_dir="$example_dir/dna/mm10" # DNA sequences.
anno_dir="$example_dir/anno/mm10" # Annotations of genomic contexts.

data_dir="./data" # DeepCpG data.
models_dir="./models" # Trained models.
mkdir -p $models_dir
eval_dir="./eval" # Evaluation data.
mkdir -p $eval_dir

Creating DeepCpG data files

We first store the known CpG methylation states of each cell in a tab-delimited file with the following columns:

  • Chromosome (without chr)
  • Position of the CpG site on the chromosome
  • Binary methylation state of the CpG site (0=unmethylated, 1=methylated)

CpG sites with a methylation rate between zero and one should be binarized by rounding. Filenames should correspond to cell names.
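Such rounding can be done with a short awk one-liner, for example (an illustrative sketch with made-up input; column 3 is assumed to hold the methylation rate):

```shell
# Binarize methylation rates by rounding (rates >= 0.5 become 1, else 0).
# Input columns: chromosome, position, methylation rate.
printf '1\t3000827\t0.8\n1\t3001007\t0.3\n' \
  | awk -F'\t' -v OFS='\t' '{ $3 = ($3 >= 0.5) ? 1 : 0; print }'
# 1	3000827	1
# 1	3001007	0
```

In practice, replace the printf with the actual per-cell rate file.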

Each position must point to the cytosine residue of a CpG site (positions are enumerated starting from 1). Otherwise, dcpg_data.py will report warnings, e.g. if the wrong genome build is used or CpG sites were not correctly aligned.

For this tutorial we are using a subset of CpG profiles from serum-cultured mouse embryonic stem cells from Smallwood et al. (2014):


In [2]:
ls $cpg_dir


BS27_1_SER.tsv BS27_3_SER.tsv BS27_5_SER.tsv BS27_6_SER.tsv BS27_8_SER.tsv

We can have a look at the methylation profile of cell 'BS27_1_SER':


In [3]:
head "$cpg_dir/BS27_1_SER.tsv"


1	3000827	1.0
1	3001007	1.0
1	3001018	1.0
1	3001277	1.0
1	3001629	1.0
1	3003226	1.0
1	3003339	1.0
1	3003379	1.0
1	3006416	1.0
1	3007580	1.0

Since we are dealing with mouse cells, we are using the mm10 (GRCm38) mouse genome build:


In [4]:
ls $dna_dir


Mus_musculus.GRCm38.dna.chromosome.1.fa.gz
Mus_musculus.GRCm38.dna.chromosome.10.fa.gz
Mus_musculus.GRCm38.dna.chromosome.11.fa.gz
Mus_musculus.GRCm38.dna.chromosome.12.fa.gz
Mus_musculus.GRCm38.dna.chromosome.13.fa.gz
Mus_musculus.GRCm38.dna.chromosome.14.fa.gz
Mus_musculus.GRCm38.dna.chromosome.15.fa.gz
Mus_musculus.GRCm38.dna.chromosome.16.fa.gz
Mus_musculus.GRCm38.dna.chromosome.17.fa.gz
Mus_musculus.GRCm38.dna.chromosome.18.fa.gz
Mus_musculus.GRCm38.dna.chromosome.19.fa.gz
Mus_musculus.GRCm38.dna.chromosome.2.fa.gz
Mus_musculus.GRCm38.dna.chromosome.3.fa.gz
Mus_musculus.GRCm38.dna.chromosome.4.fa.gz
Mus_musculus.GRCm38.dna.chromosome.5.fa.gz
Mus_musculus.GRCm38.dna.chromosome.6.fa.gz
Mus_musculus.GRCm38.dna.chromosome.7.fa.gz
Mus_musculus.GRCm38.dna.chromosome.8.fa.gz
Mus_musculus.GRCm38.dna.chromosome.9.fa.gz
Mus_musculus.GRCm38.dna.chromosome.MT.fa.gz
Mus_musculus.GRCm38.dna.chromosome.X.fa.gz
Mus_musculus.GRCm38.dna.chromosome.Y.fa.gz

These files were downloaded by setup.py. Other genomes, e.g. human genome hg38, can be downloaded, for example, with the following command:

wget ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.*.fa.gz

Now we can run dcpg_data.py to create the input data for DeepCpG. For testing purposes, we only consider a few CpG sites on chromosome 19:


In [5]:
cmd="dcpg_data.py
    --cpg_profiles $cpg_dir/*.tsv
    --dna_files $dna_dir
    --out_dir $data_dir
    --cpg_wlen 50
    --dna_wlen 1001
"
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --chromo 1 13
        --nb_sample_chromo 1000
        "
fi
run $cmd


#################################
dcpg_data.py --cpg_profiles ../../data/cpg/BS27_1_SER.tsv ../../data/cpg/BS27_3_SER.tsv ../../data/cpg/BS27_5_SER.tsv ../../data/cpg/BS27_6_SER.tsv ../../data/cpg/BS27_8_SER.tsv --dna_files ../../data/dna/mm10 --out_dir ./data --cpg_wlen 50 --dna_wlen 1001 --chromo 1 13 --nb_sample_chromo 1000
#################################
INFO (2017-04-14 10:26:50,391): Reading CpG profiles ...
INFO (2017-04-14 10:26:50,391): ../../data/cpg/BS27_1_SER.tsv
INFO (2017-04-14 10:26:56,879): ../../data/cpg/BS27_3_SER.tsv
INFO (2017-04-14 10:27:01,592): ../../data/cpg/BS27_5_SER.tsv
INFO (2017-04-14 10:27:09,046): ../../data/cpg/BS27_6_SER.tsv
INFO (2017-04-14 10:27:14,598): ../../data/cpg/BS27_8_SER.tsv
INFO (2017-04-14 10:27:19,908): 2000 samples
INFO (2017-04-14 10:27:19,908): --------------------------------------------------------------------------------
INFO (2017-04-14 10:27:19,909): Chromosome 1 ...
INFO (2017-04-14 10:27:19,923): 1000 / 1000 (100.0%) sites matched minimum coverage filter
INFO (2017-04-14 10:27:23,697): Chunk 	1 / 1
INFO (2017-04-14 10:27:23,703): Extracting DNA sequence windows ...
INFO (2017-04-14 10:27:24,124): Extracting CpG neighbors ...
INFO (2017-04-14 10:27:24,240): --------------------------------------------------------------------------------
INFO (2017-04-14 10:27:24,240): Chromosome 13 ...
INFO (2017-04-14 10:27:24,254): 1000 / 1000 (100.0%) sites matched minimum coverage filter
INFO (2017-04-14 10:27:26,493): Chunk 	1 / 1
INFO (2017-04-14 10:27:26,499): Extracting DNA sequence windows ...
INFO (2017-04-14 10:27:26,779): Extracting CpG neighbors ...
INFO (2017-04-14 10:27:26,898): Done!

For each CpG site that is observed in at least one cell, this command extracts the 50 neighboring CpG sites (25 to the left and 25 to the right) and the 1001 bp DNA sequence window centered on the CpG site. In test mode, only 1000 CpG sites are randomly sampled from each of chromosomes 1 and 13. The command creates multiple HDF5 files named cX_FROM-TO.h5, where X is the chromosome, and FROM and TO are the indices of the first and last CpG site stored in the file:


In [6]:
ls $data_dir


c13_000000-001000.h5 c1_000000-001000.h5
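This naming convention can be unpacked with plain bash parameter expansion, for instance (an illustrative sketch, not part of the DeepCpG scripts):

```shell
# Parse a DeepCpG data filename of the form cX_FROM-TO.h5.
fname="c13_000000-001000.h5"
base=${fname%.h5}      # strip the extension  -> c13_000000-001000
chromo=${base%%_*}     # part before the '_'  -> c13
chromo=${chromo#c}     # drop the 'c' prefix  -> 13
range=${base#*_}       # part after the '_'   -> 000000-001000
from=${range%-*}       # -> 000000
to=${range#*-}         # -> 001000
echo "chromosome $chromo, CpG sites $from to $to"
# chromosome 13, CpG sites 000000 to 001000
```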

Training models

We can now train models on the created data. First, we need to split the data into training, validation, and test sets. The training set should contain at least 3 million CpG sites. We will use chromosomes 1, 3, 5, 7, and 9 as the training set, and chromosomes 13, 14, 15, 16, and 17 as the validation set:


In [7]:
function get_data_files {
  local data_dir=$1
  shift
  local chromos=$@

  files=""
  for chromo in $chromos; do
    files="$files $(ls $data_dir/c${chromo}_*.h5 2> /dev/null)"
  done
  echo $files
}

train_files=$(get_data_files $data_dir 1 3 5 7 9)
val_files=$(get_data_files $data_dir 13 14 15 16 17)

In [8]:
echo $train_files


./data/c1_000000-001000.h5

In [9]:
echo $val_files


./data/c13_000000-001000.h5

We can count the number of CpG sites in the training set using dcpg_data_stats.py:


In [10]:
cmd="dcpg_data_stats.py $train_files"
run $cmd


#################################
dcpg_data_stats.py ./data/c1_000000-001000.h5
#################################
           output  nb_tot  nb_obs  frac_obs      mean       var
0  cpg/BS27_1_SER    1000     187     0.187  0.775401  0.174154
1  cpg/BS27_3_SER    1000     208     0.208  0.711538  0.205251
2  cpg/BS27_5_SER    1000     200     0.200  0.690000  0.213900
3  cpg/BS27_6_SER    1000     195     0.195  0.666667  0.222222
4  cpg/BS27_8_SER    1000     210     0.210  0.776190  0.173719

For each output cell, nb_tot is the total number of CpG sites, nb_obs the number of CpG sites with known methylation state, frac_obs the ratio between nb_obs and nb_tot, mean the mean methylation rate, and var the variance of the methylation rate.
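Since methylation states are binary, the variance should equal mean * (1 - mean), which offers a quick sanity check of the table (a sketch using the numbers from the first row):

```shell
# For a binary variable with mean m, the variance is m * (1 - m).
# First table row (cpg/BS27_1_SER): mean 0.775401, var 0.174154.
awk 'BEGIN { m = 0.775401; printf "%.6f\n", m * (1 - m) }'
# 0.174154
```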

We can now train our model. DeepCpG consists of a DNA, CpG, and Joint model, which can be trained either jointly or separately. We will train them jointly, starting with the CpG model:


In [11]:
cmd="dcpg_train.py
    $train_files
    --val_files $val_files
    --cpg_model RnnL1
    --out_dir $models_dir/cpg
    "
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --nb_epoch 1
        --nb_train_sample 1000
        --nb_val_sample 1000
    "
else
    cmd="$cmd
        --nb_epoch 30
        "
fi
run $cmd


#################################
dcpg_train.py ./data/c1_000000-001000.h5 --val_files ./data/c13_000000-001000.h5 --cpg_model RnnL1 --out_dir ./models/cpg --nb_epoch 1 --nb_train_sample 1000 --nb_val_sample 1000
#################################
Using TensorFlow backend.
INFO (2017-04-14 10:27:34,697): Building model ...
Replicate names:
BS27_1_SER, BS27_3_SER, BS27_5_SER, BS27_6_SER, BS27_8_SER

INFO (2017-04-14 10:27:34,703): Building CpG model ...
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
cpg/state (InputLayer)           (None, 5, 50)         0                                            
____________________________________________________________________________________________________
cpg/dist (InputLayer)            (None, 5, 50)         0                                            
____________________________________________________________________________________________________
cpg/merge_1 (Merge)              (None, 5, 100)        0           cpg/state[0][0]                  
                                                                   cpg/dist[0][0]                   
____________________________________________________________________________________________________
cpg/timedistributed_1 (TimeDistr (None, 5, 256)        25856       cpg/merge_1[0][0]                
____________________________________________________________________________________________________
cpg/bidirectional_1 (Bidirection (None, 512)           787968      cpg/timedistributed_1[0][0]      
____________________________________________________________________________________________________
cpg/dropout_1 (Dropout)          (None, 512)           0           cpg/bidirectional_1[0][0]        
____________________________________________________________________________________________________
cpg/BS27_1_SER (Dense)           (None, 1)             513         cpg/dropout_1[0][0]              
____________________________________________________________________________________________________
cpg/BS27_3_SER (Dense)           (None, 1)             513         cpg/dropout_1[0][0]              
____________________________________________________________________________________________________
cpg/BS27_5_SER (Dense)           (None, 1)             513         cpg/dropout_1[0][0]              
____________________________________________________________________________________________________
cpg/BS27_6_SER (Dense)           (None, 1)             513         cpg/dropout_1[0][0]              
____________________________________________________________________________________________________
cpg/BS27_8_SER (Dense)           (None, 1)             513         cpg/dropout_1[0][0]              
====================================================================================================
Total params: 816,389
Trainable params: 816,389
Non-trainable params: 0
____________________________________________________________________________________________________
INFO (2017-04-14 10:27:35,361): Computing output statistics ...
Output statistics:
          name | nb_tot | nb_obs | frac_obs | mean |  var
---------------------------------------------------------
cpg/BS27_1_SER |   1000 |    187 |     0.19 | 0.78 | 0.17
cpg/BS27_3_SER |   1000 |    208 |     0.21 | 0.71 | 0.21
cpg/BS27_5_SER |   1000 |    200 |     0.20 | 0.69 | 0.21
cpg/BS27_6_SER |   1000 |    195 |     0.20 | 0.67 | 0.22
cpg/BS27_8_SER |   1000 |    210 |     0.21 | 0.78 | 0.17

Class weights:
cpg/BS27_1_SER | cpg/BS27_3_SER | cpg/BS27_5_SER | cpg/BS27_6_SER | cpg/BS27_8_SER
----------------------------------------------------------------------------------
        0=0.78 |         0=0.71 |         0=0.69 |         0=0.67 |         0=0.78
        1=0.22 |         1=0.29 |         1=0.31 |         1=0.33 |         1=0.22

INFO (2017-04-14 10:27:35,888): Loading data ...
INFO (2017-04-14 10:27:35,892): Initializing callbacks ...
INFO (2017-04-14 10:27:35,892): Training model ...

Training samples: 1000
Validation samples: 1000
Epochs: 1
Learning rate: 0.0001
====================================================================================================
Epoch 1/1
====================================================================================================
done (%) | time |   loss |    acc | cpg/BS27_1_SER_loss | cpg/BS27_3_SER_loss | cpg/BS27_5_SER_loss | cpg/BS27_6_SER_loss | cpg/BS27_8_SER_loss | cpg/BS27_1_SER_acc | cpg/BS27_3_SER_acc | cpg/BS27_5_SER_acc | cpg/BS27_6_SER_acc | cpg/BS27_8_SER_acc
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    12.8 |  0.0 | 5.0195 | 0.4144 |              0.0461 |              0.0569 |              0.0591 |              0.0610 |              0.0495 |             0.1200 |             0.5217 |             0.3462 |             0.3600 |             0.7241
    25.6 |  0.0 | 5.0225 | 0.4290 |              0.0483 |              0.0628 |              0.0599 |              0.0581 |              0.0503 |             0.2339 |             0.5109 |             0.4531 |             0.2887 |             0.6584
    38.4 |  0.0 | 5.0225 | 0.4215 |              0.0455 |              0.0611 |              0.0624 |              0.0557 |              0.0585 |             0.2353 |             0.4795 |             0.4211 |             0.3425 |             0.6294
    51.2 |  0.0 | 5.0164 | 0.4312 |              0.0457 |              0.0576 |              0.0627 |              0.0580 |              0.0569 |             0.2965 |             0.4164 |             0.4455 |             0.3908 |             0.6067
    64.0 |  0.0 | 5.0125 | 0.4199 |              0.0485 |              0.0595 |              0.0627 |              0.0579 |              0.0519 |             0.3184 |             0.3931 |             0.4524 |             0.4026 |             0.5329
    76.8 |  0.0 | 5.0053 | 0.4155 |              0.0457 |              0.0635 |              0.0596 |              0.0594 |              0.0490 |             0.3591 |             0.3609 |             0.4559 |             0.4269 |             0.4744
    89.6 |  0.0 | 5.0022 | 0.4281 |              0.0468 |              0.0612 |              0.0583 |              0.0611 |              0.0503 |             0.4083 |             0.3483 |             0.4529 |             0.4527 |             0.4781
   100.0 |  0.0 | 4.9983 | 0.4316 |              0.0459 |              0.0594 |              0.0603 |              0.0608 |              0.0507 |             0.4410 |             0.3488 |             0.4443 |             0.4576 |             0.4662
Epoch 00000: val_loss improved from inf to 5.38326, saving model to ./models/cpg/model_weights_val.h5

 split |   loss |    acc | cpg/BS27_1_SER_loss | cpg/BS27_3_SER_loss | cpg/BS27_5_SER_loss | cpg/BS27_6_SER_loss | cpg/BS27_8_SER_loss | cpg/BS27_1_SER_acc | cpg/BS27_3_SER_acc | cpg/BS27_5_SER_acc | cpg/BS27_6_SER_acc | cpg/BS27_8_SER_acc
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 train | 4.9983 | 0.4316 |              0.0459 |              0.0594 |              0.0603 |              0.0608 |              0.0507 |             0.4410 |             0.3488 |             0.4443 |             0.4576 |             0.4662
   val | 5.3833 | 0.5048 |              0.1382 |              0.1420 |              0.1343 |              0.1407 |              0.1415 |             0.6551 |             0.3811 |             0.4814 |             0.6307 |             0.3756
====================================================================================================

Training set performance:
  loss |    acc | cpg/BS27_1_SER_loss | cpg/BS27_3_SER_loss | cpg/BS27_5_SER_loss | cpg/BS27_6_SER_loss | cpg/BS27_8_SER_loss | cpg/BS27_1_SER_acc | cpg/BS27_3_SER_acc | cpg/BS27_5_SER_acc | cpg/BS27_6_SER_acc | cpg/BS27_8_SER_acc
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
4.9983 | 0.4316 |              0.0459 |              0.0594 |              0.0603 |              0.0608 |              0.0507 |             0.4410 |             0.3488 |             0.4443 |             0.4576 |             0.4662

Validation set performance:
  loss |    acc | cpg/BS27_1_SER_loss | cpg/BS27_3_SER_loss | cpg/BS27_5_SER_loss | cpg/BS27_6_SER_loss | cpg/BS27_8_SER_loss | cpg/BS27_1_SER_acc | cpg/BS27_3_SER_acc | cpg/BS27_5_SER_acc | cpg/BS27_6_SER_acc | cpg/BS27_8_SER_acc
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
5.3833 | 0.5048 |              0.1382 |              0.1420 |              0.1343 |              0.1407 |              0.1415 |             0.6551 |             0.3811 |             0.4814 |             0.6307 |             0.3756
INFO (2017-04-14 10:27:53,178): Done!

--cpg_model RnnL1 specifies the architecture of the CpG model and --nb_epoch the number of training epochs. These are hyper-parameters that can be adapted depending on the size of the training set and the model complexity. For testing purposes, we decrease the number of samples using --nb_train_sample and --nb_val_sample.

The CpG model is often already quite accurate on its own. However, we can further boost performance by also training a DNA model, which leverages the DNA sequence:


In [12]:
cmd="dcpg_train.py
    $train_files
    --val_files $val_files
    --dna_model CnnL2h128
    --out_dir $models_dir/dna
    "
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --nb_epoch 1
        --nb_train_sample 1000
        --nb_val_sample 1000
    "
else
    cmd="$cmd
        --nb_epoch 30
        "
fi
run $cmd


#################################
dcpg_train.py ./data/c1_000000-001000.h5 --val_files ./data/c13_000000-001000.h5 --dna_model CnnL2h128 --out_dir ./models/dna --nb_epoch 1 --nb_train_sample 1000 --nb_val_sample 1000
#################################
Using TensorFlow backend.
INFO (2017-04-14 10:27:57,142): Building model ...
INFO (2017-04-14 10:27:57,145): Building DNA model ...
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
dna (InputLayer)                 (None, 1001, 4)       0                                            
____________________________________________________________________________________________________
dna/convolution1d_1 (Convolution (None, 991, 128)      5760        dna[0][0]                        
____________________________________________________________________________________________________
dna/activation_1 (Activation)    (None, 991, 128)      0           dna/convolution1d_1[0][0]        
____________________________________________________________________________________________________
dna/maxpooling1d_1 (MaxPooling1D (None, 247, 128)      0           dna/activation_1[0][0]           
____________________________________________________________________________________________________
dna/convolution1d_2 (Convolution (None, 245, 256)      98560       dna/maxpooling1d_1[0][0]         
____________________________________________________________________________________________________
dna/activation_2 (Activation)    (None, 245, 256)      0           dna/convolution1d_2[0][0]        
____________________________________________________________________________________________________
dna/maxpooling1d_2 (MaxPooling1D (None, 122, 256)      0           dna/activation_2[0][0]           
____________________________________________________________________________________________________
dna/flatten_1 (Flatten)          (None, 31232)         0           dna/maxpooling1d_2[0][0]         
____________________________________________________________________________________________________
dna/dense_1 (Dense)              (None, 128)           3997824     dna/flatten_1[0][0]              
____________________________________________________________________________________________________
dna/activation_3 (Activation)    (None, 128)           0           dna/dense_1[0][0]                
____________________________________________________________________________________________________
dna/dropout_1 (Dropout)          (None, 128)           0           dna/activation_3[0][0]           
____________________________________________________________________________________________________
cpg/BS27_1_SER (Dense)           (None, 1)             129         dna/dropout_1[0][0]              
____________________________________________________________________________________________________
cpg/BS27_3_SER (Dense)           (None, 1)             129         dna/dropout_1[0][0]              
____________________________________________________________________________________________________
cpg/BS27_5_SER (Dense)           (None, 1)             129         dna/dropout_1[0][0]              
____________________________________________________________________________________________________
cpg/BS27_6_SER (Dense)           (None, 1)             129         dna/dropout_1[0][0]              
____________________________________________________________________________________________________
cpg/BS27_8_SER (Dense)           (None, 1)             129         dna/dropout_1[0][0]              
====================================================================================================
Total params: 4,102,789
Trainable params: 4,102,789
Non-trainable params: 0
____________________________________________________________________________________________________
INFO (2017-04-14 10:27:57,364): Computing output statistics ...
Output statistics:
          name | nb_tot | nb_obs | frac_obs | mean |  var
---------------------------------------------------------
cpg/BS27_1_SER |   1000 |    187 |     0.19 | 0.78 | 0.17
cpg/BS27_3_SER |   1000 |    208 |     0.21 | 0.71 | 0.21
cpg/BS27_5_SER |   1000 |    200 |     0.20 | 0.69 | 0.21
cpg/BS27_6_SER |   1000 |    195 |     0.20 | 0.67 | 0.22
cpg/BS27_8_SER |   1000 |    210 |     0.21 | 0.78 | 0.17

Class weights:
cpg/BS27_1_SER | cpg/BS27_3_SER | cpg/BS27_5_SER | cpg/BS27_6_SER | cpg/BS27_8_SER
----------------------------------------------------------------------------------
        0=0.78 |         0=0.71 |         0=0.69 |         0=0.67 |         0=0.78
        1=0.22 |         1=0.29 |         1=0.31 |         1=0.33 |         1=0.22

INFO (2017-04-14 10:27:57,811): Loading data ...
INFO (2017-04-14 10:27:57,815): Initializing callbacks ...
INFO (2017-04-14 10:27:57,815): Training model ...

Training samples: 1000
Validation samples: 1000
Epochs: 1
Learning rate: 0.0001
====================================================================================================
Epoch 1/1
====================================================================================================
done (%) | time |   loss |    acc | cpg/BS27_1_SER_loss | cpg/BS27_3_SER_loss | cpg/BS27_5_SER_loss | cpg/BS27_6_SER_loss | cpg/BS27_8_SER_loss | cpg/BS27_1_SER_acc | cpg/BS27_3_SER_acc | cpg/BS27_5_SER_acc | cpg/BS27_6_SER_acc | cpg/BS27_8_SER_acc
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    12.8 |  0.0 | 3.1027 | 0.5018 |              0.0423 |              0.0404 |              0.0722 |              0.0596 |              0.0549 |             0.8000 |             0.5500 |             0.3438 |             0.5652 |             0.2500
    25.6 |  0.0 | 3.0962 | 0.4770 |              0.0441 |              0.0528 |              0.0657 |              0.0641 |              0.0567 |             0.6778 |             0.5898 |             0.4762 |             0.4033 |             0.2379
    38.4 |  0.1 | 3.0768 | 0.4997 |              0.0373 |              0.0553 |              0.0705 |              0.0638 |              0.0575 |             0.7656 |             0.5877 |             0.4791 |             0.4155 |             0.2506
    51.2 |  0.1 | 3.0570 | 0.4826 |              0.0448 |              0.0578 |              0.0679 |              0.0596 |              0.0547 |             0.7435 |             0.5657 |             0.4459 |             0.3866 |             0.2713
    64.0 |  0.1 | 3.0312 | 0.4709 |              0.0450 |              0.0597 |              0.0628 |              0.0603 |              0.0513 |             0.7734 |             0.5259 |             0.4093 |             0.4133 |             0.2324
    76.8 |  0.1 | 3.0065 | 0.4499 |              0.0465 |              0.0634 |              0.0588 |              0.0575 |              0.0483 |             0.7803 |             0.4810 |             0.4036 |             0.3841 |             0.2003
    89.6 |  0.1 | 2.9857 | 0.4438 |              0.0461 |              0.0612 |              0.0572 |              0.0586 |              0.0506 |             0.7732 |             0.4772 |             0.3784 |             0.3769 |             0.2132
   100.0 |  0.2 | 2.9700 | 0.4499 |              0.0451 |              0.0592 |              0.0593 |              0.0601 |              0.0506 |             0.7760 |             0.4731 |             0.3642 |             0.4125 |             0.2238
Epoch 00000: val_loss improved from inf to 3.20989, saving model to ./models/dna/model_weights_val.h5

 split |   loss |    acc | cpg/BS27_1_SER_loss | cpg/BS27_3_SER_loss | cpg/BS27_5_SER_loss | cpg/BS27_6_SER_loss | cpg/BS27_8_SER_loss | cpg/BS27_1_SER_acc | cpg/BS27_3_SER_acc | cpg/BS27_5_SER_acc | cpg/BS27_6_SER_acc | cpg/BS27_8_SER_acc
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 train | 2.9700 | 0.4499 |              0.0451 |              0.0592 |              0.0593 |              0.0601 |              0.0506 |             0.7760 |             0.4731 |             0.3642 |             0.4125 |             0.2238
   val | 3.2099 | 0.5143 |              0.1412 |              0.1386 |              0.1339 |              0.1434 |              0.1378 |             0.8071 |             0.6503 |             0.3238 |             0.5729 |             0.2174
====================================================================================================

Training set performance:
  loss |    acc | cpg/BS27_1_SER_loss | cpg/BS27_3_SER_loss | cpg/BS27_5_SER_loss | cpg/BS27_6_SER_loss | cpg/BS27_8_SER_loss | cpg/BS27_1_SER_acc | cpg/BS27_3_SER_acc | cpg/BS27_5_SER_acc | cpg/BS27_6_SER_acc | cpg/BS27_8_SER_acc
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2.9700 | 0.4499 |              0.0451 |              0.0592 |              0.0593 |              0.0601 |              0.0506 |             0.7760 |             0.4731 |             0.3642 |             0.4125 |             0.2238

Validation set performance:
  loss |    acc | cpg/BS27_1_SER_loss | cpg/BS27_3_SER_loss | cpg/BS27_5_SER_loss | cpg/BS27_6_SER_loss | cpg/BS27_8_SER_loss | cpg/BS27_1_SER_acc | cpg/BS27_3_SER_acc | cpg/BS27_5_SER_acc | cpg/BS27_6_SER_acc | cpg/BS27_8_SER_acc
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
3.2099 | 0.5143 |              0.1412 |              0.1386 |              0.1339 |              0.1434 |              0.1378 |             0.8071 |             0.6503 |             0.3238 |             0.5729 |             0.2174
INFO (2017-04-14 10:28:17,323): Done!

Finally, we combine both models by training a Joint model:


In [13]:
cmd="dcpg_train.py
    $train_files
    --val_files $val_files
    --dna_model $models_dir/dna
    --cpg_model $models_dir/cpg
    --joint_model JointL2h512
    --train_models joint
    --out_dir $models_dir/joint
"
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --nb_epoch 1
        --nb_train_sample 1000
        --nb_val_sample 1000
    "
else
    cmd="$cmd
        --nb_epoch 10
        "
fi
run $cmd


#################################
dcpg_train.py ./data/c1_000000-001000.h5 --val_files ./data/c13_000000-001000.h5 --dna_model ./models/dna --cpg_model ./models/cpg --joint_model JointL2h512 --train_models joint --out_dir ./models/joint --nb_epoch 1 --nb_train_sample 1000 --nb_val_sample 1000
#################################
Using TensorFlow backend.
INFO (2017-04-14 10:28:21,112): Building model ...
INFO (2017-04-14 10:28:21,115): Loading existing DNA model ...
INFO (2017-04-14 10:28:21,115): Using model files ./models/dna/model.json ./models/dna/model_weights_val.h5
Replicate names:
BS27_1_SER, BS27_3_SER, BS27_5_SER, BS27_6_SER, BS27_8_SER

INFO (2017-04-14 10:28:21,513): Loading existing CpG model ...
INFO (2017-04-14 10:28:21,513): Using model files ./models/cpg/model.json ./models/cpg/model_weights_val.h5
INFO (2017-04-14 10:28:22,546): Joining models ...
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
dna (InputLayer)                 (None, 1001, 4)       0                                            
____________________________________________________________________________________________________
dna/convolution1d_1 (Convolution (None, 991, 128)      5760        dna[0][0]                        
____________________________________________________________________________________________________
dna/activation_1 (Activation)    (None, 991, 128)      0           dna/convolution1d_1[0][0]        
____________________________________________________________________________________________________
dna/maxpooling1d_1 (MaxPooling1D (None, 247, 128)      0           dna/activation_1[0][0]           
____________________________________________________________________________________________________
dna/convolution1d_2 (Convolution (None, 245, 256)      98560       dna/maxpooling1d_1[0][0]         
____________________________________________________________________________________________________
dna/activation_2 (Activation)    (None, 245, 256)      0           dna/convolution1d_2[0][0]        
____________________________________________________________________________________________________
dna/maxpooling1d_2 (MaxPooling1D (None, 122, 256)      0           dna/activation_2[0][0]           
____________________________________________________________________________________________________
cpg/state (InputLayer)           (None, 5, 50)         0                                            
____________________________________________________________________________________________________
cpg/dist (InputLayer)            (None, 5, 50)         0                                            
____________________________________________________________________________________________________
dna/flatten_1 (Flatten)          (None, 31232)         0           dna/maxpooling1d_2[0][0]         
____________________________________________________________________________________________________
cpg/merge_1 (Merge)              (None, 5, 100)        0           cpg/state[0][0]                  
                                                                   cpg/dist[0][0]                   
____________________________________________________________________________________________________
dna/dense_1 (Dense)              (None, 128)           3997824     dna/flatten_1[0][0]              
____________________________________________________________________________________________________
cpg/timedistributed_1 (TimeDistr (None, 5, 256)        25856       cpg/merge_1[0][0]                
____________________________________________________________________________________________________
dna/activation_3 (Activation)    (None, 128)           0           dna/dense_1[0][0]                
____________________________________________________________________________________________________
cpg/bidirectional_1 (Bidirection (None, 512)           787968      cpg/timedistributed_1[0][0]      
____________________________________________________________________________________________________
dna/dropout_1 (Dropout)          (None, 128)           0           dna/activation_3[0][0]           
____________________________________________________________________________________________________
cpg/dropout_1 (Dropout)          (None, 512)           0           cpg/bidirectional_1[0][0]        
____________________________________________________________________________________________________
merge_1 (Merge)                  (None, 640)           0           dna/dropout_1[0][0]              
                                                                   cpg/dropout_1[0][0]              
____________________________________________________________________________________________________
joint/dense_1 (Dense)            (None, 512)           328192      merge_1[0][0]                    
____________________________________________________________________________________________________
joint/activation_1 (Activation)  (None, 512)           0           joint/dense_1[0][0]              
____________________________________________________________________________________________________
joint/dropout_1 (Dropout)        (None, 512)           0           joint/activation_1[0][0]         
____________________________________________________________________________________________________
joint/dense_2 (Dense)            (None, 512)           262656      joint/dropout_1[0][0]            
____________________________________________________________________________________________________
joint/activation_2 (Activation)  (None, 512)           0           joint/dense_2[0][0]              
____________________________________________________________________________________________________
joint/dropout_2 (Dropout)        (None, 512)           0           joint/activation_2[0][0]         
____________________________________________________________________________________________________
cpg/BS27_1_SER (Dense)           (None, 1)             513         joint/dropout_2[0][0]            
____________________________________________________________________________________________________
cpg/BS27_3_SER (Dense)           (None, 1)             513         joint/dropout_2[0][0]            
____________________________________________________________________________________________________
cpg/BS27_5_SER (Dense)           (None, 1)             513         joint/dropout_2[0][0]            
____________________________________________________________________________________________________
cpg/BS27_6_SER (Dense)           (None, 1)             513         joint/dropout_2[0][0]            
____________________________________________________________________________________________________
cpg/BS27_8_SER (Dense)           (None, 1)             513         joint/dropout_2[0][0]            
====================================================================================================
Total params: 5,509,381
Trainable params: 5,509,381
Non-trainable params: 0
____________________________________________________________________________________________________
Layer trainability:
                layer | trainable
---------------------------------
  dna/convolution1d_1 |     False
     dna/activation_1 |     False
   dna/maxpooling1d_1 |     False
  dna/convolution1d_2 |     False
     dna/activation_2 |     False
   dna/maxpooling1d_2 |     False
        dna/flatten_1 |     False
          dna/dense_1 |     False
cpg/timedistributed_1 |     False
     dna/activation_3 |     False
  cpg/bidirectional_1 |     False
        dna/dropout_1 |     False
        cpg/dropout_1 |     False
        joint/dense_1 |      True
   joint/activation_1 |      True
      joint/dropout_1 |      True
        joint/dense_2 |      True
   joint/activation_2 |      True
      joint/dropout_2 |      True

INFO (2017-04-14 10:28:22,692): Computing output statistics ...
Output statistics:
          name | nb_tot | nb_obs | frac_obs | mean |  var
---------------------------------------------------------
cpg/BS27_1_SER |   1000 |    187 |     0.19 | 0.78 | 0.17
cpg/BS27_3_SER |   1000 |    208 |     0.21 | 0.71 | 0.21
cpg/BS27_5_SER |   1000 |    200 |     0.20 | 0.69 | 0.21
cpg/BS27_6_SER |   1000 |    195 |     0.20 | 0.67 | 0.22
cpg/BS27_8_SER |   1000 |    210 |     0.21 | 0.78 | 0.17

Class weights:
cpg/BS27_1_SER | cpg/BS27_3_SER | cpg/BS27_5_SER | cpg/BS27_6_SER | cpg/BS27_8_SER
----------------------------------------------------------------------------------
        0=0.78 |         0=0.71 |         0=0.69 |         0=0.67 |         0=0.78
        1=0.22 |         1=0.29 |         1=0.31 |         1=0.33 |         1=0.22

INFO (2017-04-14 10:28:23,205): Loading data ...
INFO (2017-04-14 10:28:23,208): Initializing callbacks ...
INFO (2017-04-14 10:28:23,208): Training model ...

Training samples: 1000
Validation samples: 1000
Epochs: 1
Learning rate: 0.0001
====================================================================================================
Epoch 1/1
====================================================================================================
done (%) | time |   loss |    acc | cpg/BS27_1_SER_loss | cpg/BS27_3_SER_loss | cpg/BS27_5_SER_loss | cpg/BS27_6_SER_loss | cpg/BS27_8_SER_loss | cpg/BS27_1_SER_acc | cpg/BS27_3_SER_acc | cpg/BS27_5_SER_acc | cpg/BS27_6_SER_acc | cpg/BS27_8_SER_acc
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    12.8 |  0.0 | 9.7520 | 0.6253 |              0.0326 |              0.0491 |              0.0584 |              0.0841 |              0.0343 |             0.9231 |             0.2727 |             0.6000 |             0.5806 |             0.7500
    25.6 |  0.0 | 9.7728 | 0.5915 |              0.0427 |              0.0584 |              0.0623 |              0.0748 |              0.0474 |             0.7865 |             0.3328 |             0.5308 |             0.5703 |             0.7371
    38.4 |  0.0 | 9.7655 | 0.5533 |              0.0367 |              0.0597 |              0.0651 |              0.0647 |              0.0586 |             0.8185 |             0.3330 |             0.4803 |             0.4278 |             0.7071
    51.2 |  0.1 | 9.7612 | 0.5167 |              0.0449 |              0.0556 |              0.0627 |              0.0644 |              0.0591 |             0.7639 |             0.3122 |             0.3987 |             0.3949 |             0.7136
    64.0 |  0.1 | 9.7530 | 0.5198 |              0.0489 |              0.0587 |              0.0616 |              0.0611 |              0.0545 |             0.7595 |             0.3060 |             0.4059 |             0.4002 |             0.7274
    76.8 |  0.1 | 9.7414 | 0.5023 |              0.0473 |              0.0625 |              0.0594 |              0.0600 |              0.0503 |             0.7718 |             0.2892 |             0.3966 |             0.3682 |             0.6856
    89.6 |  0.1 | 9.7334 | 0.4878 |              0.0475 |              0.0610 |              0.0583 |              0.0610 |              0.0501 |             0.7789 |             0.2836 |             0.3710 |             0.3685 |             0.6371
   100.0 |  0.1 | 9.7288 | 0.4866 |              0.0466 |              0.0595 |              0.0602 |              0.0612 |              0.0512 |             0.7759 |             0.3126 |             0.3547 |             0.3648 |             0.6251
Epoch 00000: val_loss improved from inf to 10.08812, saving model to ./models/joint/model_weights_val.h5

 split |    loss |    acc | cpg/BS27_1_SER_loss | cpg/BS27_3_SER_loss | cpg/BS27_5_SER_loss | cpg/BS27_6_SER_loss | cpg/BS27_8_SER_loss | cpg/BS27_1_SER_acc | cpg/BS27_3_SER_acc | cpg/BS27_5_SER_acc | cpg/BS27_6_SER_acc | cpg/BS27_8_SER_acc
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 train |  9.7288 | 0.4866 |              0.0466 |              0.0595 |              0.0602 |              0.0612 |              0.0512 |             0.7759 |             0.3126 |             0.3547 |             0.3648 |             0.6251
   val | 10.0881 | 0.5095 |              0.1356 |              0.1376 |              0.1391 |              0.1460 |              0.1380 |             0.7981 |             0.6195 |             0.3139 |             0.3500 |             0.4662
====================================================================================================

Training set performance:
  loss |    acc | cpg/BS27_1_SER_loss | cpg/BS27_3_SER_loss | cpg/BS27_5_SER_loss | cpg/BS27_6_SER_loss | cpg/BS27_8_SER_loss | cpg/BS27_1_SER_acc | cpg/BS27_3_SER_acc | cpg/BS27_5_SER_acc | cpg/BS27_6_SER_acc | cpg/BS27_8_SER_acc
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
9.7288 | 0.4866 |              0.0466 |              0.0595 |              0.0602 |              0.0612 |              0.0512 |             0.7759 |             0.3126 |             0.3547 |             0.3648 |             0.6251

Validation set performance:
   loss |    acc | cpg/BS27_1_SER_loss | cpg/BS27_3_SER_loss | cpg/BS27_5_SER_loss | cpg/BS27_6_SER_loss | cpg/BS27_8_SER_loss | cpg/BS27_1_SER_acc | cpg/BS27_3_SER_acc | cpg/BS27_5_SER_acc | cpg/BS27_6_SER_acc | cpg/BS27_8_SER_acc
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
10.0881 | 0.5095 |              0.1356 |              0.1376 |              0.1391 |              0.1460 |              0.1380 |             0.7981 |             0.6195 |             0.3139 |             0.3500 |             0.4662
INFO (2017-04-14 10:28:48,257): Done!

You can find more information about training and model architectures in the DeepCpG documentation.

Imputing methylation profiles

Finally, we use dcpg_eval.py to impute missing methylation states and to evaluate prediction performance on observed states. We will use the trained Joint model, though the CpG or DNA model could be evaluated in the same way.


In [14]:
cmd="dcpg_eval.py
    $data_dir/c*.h5
    --model_files $models_dir/joint
    --out_data $eval_dir/data.h5
    --out_report $eval_dir/report.tsv
    "
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd
        --nb_sample 10000
        "
fi
run $cmd


#################################
dcpg_eval.py ./data/c13_000000-001000.h5 ./data/c1_000000-001000.h5 --model_files ./models/joint --out_data ./eval/data.h5 --out_report ./eval/report.tsv --nb_sample 10000
#################################
Using TensorFlow backend.
INFO (2017-04-14 10:28:52,032): Loading model ...
INFO (2017-04-14 10:28:53,425): Loading data ...
INFO (2017-04-14 10:28:53,429): Predicting ...
INFO (2017-04-14 10:28:53,447):  128/2000 (6.4%)
INFO (2017-04-14 10:28:54,343):  384/2000 (19.2%)
INFO (2017-04-14 10:28:55,203):  640/2000 (32.0%)
INFO (2017-04-14 10:28:56,036):  896/2000 (44.8%)
INFO (2017-04-14 10:28:56,818): 1128/2000 (56.4%)
INFO (2017-04-14 10:28:57,652): 1384/2000 (69.2%)
INFO (2017-04-14 10:28:58,499): 1640/2000 (82.0%)
INFO (2017-04-14 10:28:59,376): 1896/2000 (94.8%)
INFO (2017-04-14 10:28:59,788): 2000/2000 (100.0%)
/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/metrics/classification.py:516: RuntimeWarning: invalid value encountered in double_scalars
  mcc = cov_ytyp / np.sqrt(var_yt * var_yp)
           output       auc       acc       tpr       tnr        f1       mcc      n
1  cpg/BS27_3_SER  0.585749  0.622549  0.668874  0.490566  0.724014  0.144615  408.0
2  cpg/BS27_5_SER  0.541900  0.307888  0.000000  1.000000  0.000000  0.000000  393.0
4  cpg/BS27_8_SER  0.539363  0.482843  0.438486  0.637363  0.568507  0.063941  408.0
0  cpg/BS27_1_SER  0.526798  0.785166  0.990291  0.012195  0.879310  0.010059  391.0
3  cpg/BS27_6_SER  0.501337  0.370647  0.100746  0.910448  0.175896  0.017829  402.0
INFO (2017-04-14 10:29:00,257): Done!

The imputed methylation profiles of all cells are stored in data.h5, and performance metrics in report.tsv.
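The metrics reported above follow standard definitions (dcpg_eval.py itself computes them with scikit-learn). As a rough sketch of how a metric such as accuracy is derived from imputed values, here is a toy computation; the file path and all values are made up for illustration:

```shell
# Toy (observed binary state, predicted methylation rate) pairs.
# These values are illustrative only, not from the tutorial data.
printf '1\t0.81\n0\t0.43\n1\t0.35\n0\t0.64\n1\t0.92\n' > /tmp/toy_eval.tsv

# Accuracy: binarize predictions at 0.5 and count matches with the labels.
awk -F'\t' '{ p = ($2 >= 0.5) ? 1 : 0; if (p == $1) c++ } END { printf "%.2f\n", c / NR }' /tmp/toy_eval.tsv
# prints 0.60
```

Only CpG sites with an observed state contribute to the metrics, which is why the `n` column above is smaller than the total number of sites.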


In [15]:
h5ls -r $eval_dir/data.h5


/                        Group
/chromo                  Dataset {2000}
/outputs                 Group
/outputs/cpg             Group
/outputs/cpg/BS27_1_SER  Dataset {2000}
/outputs/cpg/BS27_3_SER  Dataset {2000}
/outputs/cpg/BS27_5_SER  Dataset {2000}
/outputs/cpg/BS27_6_SER  Dataset {2000}
/outputs/cpg/BS27_8_SER  Dataset {2000}
/pos                     Dataset {2000}
/preds                   Group
/preds/cpg               Group
/preds/cpg/BS27_1_SER    Dataset {2000}
/preds/cpg/BS27_3_SER    Dataset {2000}
/preds/cpg/BS27_5_SER    Dataset {2000}
/preds/cpg/BS27_6_SER    Dataset {2000}
/preds/cpg/BS27_8_SER    Dataset {2000}

In [16]:
cat $eval_dir/report.tsv


metric	output	value
acc	cpg/BS27_1_SER	0.7851662404092071
acc	cpg/BS27_3_SER	0.6225490196078431
acc	cpg/BS27_5_SER	0.30788804071246817
acc	cpg/BS27_6_SER	0.3706467661691542
acc	cpg/BS27_8_SER	0.48284313725490197
auc	cpg/BS27_1_SER	0.5267976951614176
auc	cpg/BS27_3_SER	0.5857490940897163
auc	cpg/BS27_5_SER	0.541899611084103
auc	cpg/BS27_6_SER	0.5013366005791936
auc	cpg/BS27_8_SER	0.5393628453565362
f1	cpg/BS27_1_SER	0.8793103448275862
f1	cpg/BS27_3_SER	0.7240143369175627
f1	cpg/BS27_5_SER	0.0
f1	cpg/BS27_6_SER	0.17589576547231267
f1	cpg/BS27_8_SER	0.5685071574642127
mcc	cpg/BS27_1_SER	0.010059326521099646
mcc	cpg/BS27_3_SER	0.1446147146482661
mcc	cpg/BS27_5_SER	0.0
mcc	cpg/BS27_6_SER	0.017828739558016723
mcc	cpg/BS27_8_SER	0.06394060856947545
n	cpg/BS27_1_SER	391.0
n	cpg/BS27_3_SER	408.0
n	cpg/BS27_5_SER	393.0
n	cpg/BS27_6_SER	402.0
n	cpg/BS27_8_SER	408.0
tnr	cpg/BS27_1_SER	0.012195121951219513
tnr	cpg/BS27_3_SER	0.49056603773584906
tnr	cpg/BS27_5_SER	1.0
tnr	cpg/BS27_6_SER	0.9104477611940298
tnr	cpg/BS27_8_SER	0.6373626373626373
tpr	cpg/BS27_1_SER	0.9902912621359223
tpr	cpg/BS27_3_SER	0.6688741721854304
tpr	cpg/BS27_5_SER	0.0
tpr	cpg/BS27_6_SER	0.10074626865671642
tpr	cpg/BS27_8_SER	0.4384858044164038
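Because report.tsv is in long format (one metric per row), quick queries are easy with standard command-line tools. For example, to rank cells by AUC — shown here on a toy file that mimics the format above; to query the real report, point the commands at $eval_dir/report.tsv instead:

```shell
# Build a toy report in the same long format; values are illustrative.
printf 'metric\toutput\tvalue\n' > /tmp/toy_report.tsv
printf 'acc\tcpg/A\t0.70\nauc\tcpg/A\t0.55\nacc\tcpg/B\t0.62\nauc\tcpg/B\t0.61\n' >> /tmp/toy_report.tsv

# Keep only the AUC rows and sort cells from best to worst.
awk -F'\t' '$1 == "auc"' /tmp/toy_report.tsv | sort -k3,3 -nr
```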

Exporting methylation profiles

dcpg_eval_export.py can be used to export imputed methylation profiles:


In [17]:
cmd="dcpg_eval_export.py
    $eval_dir/data.h5
    -o $eval_dir/hdf
    -f hdf
"
eval $cmd


INFO (2017-04-14 10:29:02,007): cpg/BS27_1_SER
INFO (2017-04-14 10:29:02,014): cpg/BS27_3_SER
INFO (2017-04-14 10:29:02,019): cpg/BS27_5_SER
INFO (2017-04-14 10:29:02,023): cpg/BS27_6_SER
INFO (2017-04-14 10:29:02,028): cpg/BS27_8_SER
INFO (2017-04-14 10:29:02,033): Done!

In [18]:
ls $eval_dir/hdf


BS27_1_SER.h5 BS27_3_SER.h5 BS27_5_SER.h5 BS27_6_SER.h5 BS27_8_SER.h5

By default, dcpg_eval_export.py exports profiles to HDF5 files. With -f bedGraph, profiles are instead exported as gzip-compressed bedGraph files, which takes longer.
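As a sketch of the bedGraph representation: a bedGraph record is chromosome, 0-based start, half-open end, and value, so a CpG site at 1-based position p becomes the interval [p-1, p). The conversion below uses made-up predictions and is only meant to illustrate the coordinate shift, not DeepCpG's exact output:

```shell
# Toy predictions: chromosome, 1-based CpG position, predicted rate.
# Values are illustrative only.
printf '1\t3000827\t0.91\n1\t3001007\t0.47\n' > /tmp/toy_preds.tsv

# bedGraph uses 0-based, half-open intervals: start = pos - 1, end = pos.
awk -F'\t' '{ printf "%s\t%d\t%d\t%s\n", $1, $2 - 1, $2, $3 }' /tmp/toy_preds.tsv
```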

Evaluating prediction performances

dcpg_eval_perf.py evaluates prediction performance genome-wide, in specific genomic contexts, and via performance curves. With --anno_files, you can specify a list of BED files with annotation tracks to evaluate, and with --curves roc pr you can compute ROC and precision-recall (PR) curves for individual outputs. --anno_curves roc pr additionally computes performance curves within the annotations specified by --anno_files.


In [19]:
cmd="dcpg_eval_perf.py
    $eval_dir/data.h5
    --out_dir $eval_dir/perf
    --curves roc pr
    --anno_files $anno_dir/CGI*.bed $anno_dir/Gene_body.bed $anno_dir/Introns.bed $anno_dir/Exons.bed $anno_dir/UW_DNase1.bed
"
eval $cmd


INFO (2017-04-14 10:29:03,746): Loading data ...
INFO (2017-04-14 10:29:03,760): 2000 samples
INFO (2017-04-14 10:29:03,760): Evaluating globally ...
/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/metrics/classification.py:516: RuntimeWarning: invalid value encountered in double_scalars
  mcc = cov_ytyp / np.sqrt(var_yt * var_yp)
           output    anno       auc       acc       tpr       tnr        f1       mcc      n
1  cpg/BS27_3_SER  global  0.585749  0.622549  0.668874  0.490566  0.724014  0.144615  408.0
2  cpg/BS27_5_SER  global  0.541900  0.307888  0.000000  1.000000  0.000000  0.000000  393.0
4  cpg/BS27_8_SER  global  0.539363  0.482843  0.438486  0.637363  0.568507  0.063941  408.0
0  cpg/BS27_1_SER  global  0.526798  0.785166  0.990291  0.012195  0.879310  0.010059  391.0
3  cpg/BS27_6_SER  global  0.501337  0.370647  0.100746  0.910448  0.175896  0.017829  402.0
INFO (2017-04-14 10:29:03,828): roc curve
INFO (2017-04-14 10:29:03,837): pr curve
INFO (2017-04-14 10:29:03,846): Evaluating annotations ...
INFO (2017-04-14 10:29:04,028): CGI: 312
INFO (2017-04-14 10:29:04,348): CGI_shelf: 97
INFO (2017-04-14 10:29:04,348): Skipping due to insufficient annotated sites!
INFO (2017-04-14 10:29:04,639): CGI_shore: 120
INFO (2017-04-14 10:29:04,874): Gene_body: 969
INFO (2017-04-14 10:29:06,179): Introns: 724
INFO (2017-04-14 10:29:07,814): Exons: 261
INFO (2017-04-14 10:29:09,510): UW_DNase1: 160
INFO (2017-04-14 10:29:09,534): Writing ./eval/perf/metrics.tsv ...
INFO (2017-04-14 10:29:09,540): Writing ./eval/perf/curves.tsv ...
INFO (2017-04-14 10:29:09,566): Done!

Performance metrics are stored in metrics.tsv, and performance curves in curves.tsv:


In [20]:
head $eval_dir/perf/metrics.tsv


anno	metric	output	value
global	acc	cpg/BS27_5_SER	0.30789
global	acc	cpg/BS27_6_SER	0.37065
global	acc	cpg/BS27_8_SER	0.48284
global	acc	cpg/BS27_3_SER	0.62255
global	acc	cpg/BS27_1_SER	0.78517
global	auc	cpg/BS27_6_SER	0.50134
global	auc	cpg/BS27_1_SER	0.52680
global	auc	cpg/BS27_8_SER	0.53936
global	auc	cpg/BS27_5_SER	0.54190

In [21]:
head $eval_dir/perf/curves.tsv


anno	curve	output	x	y	thr
global	roc	cpg/BS27_1_SER	0.00000	0.00324	0.55943
global	roc	cpg/BS27_1_SER	0.00000	0.01294	0.55266
global	roc	cpg/BS27_1_SER	0.01220	0.01294	0.55247
global	roc	cpg/BS27_1_SER	0.01220	0.02589	0.55041
global	roc	cpg/BS27_1_SER	0.02439	0.02589	0.54989
global	roc	cpg/BS27_1_SER	0.02439	0.03236	0.54941
global	roc	cpg/BS27_1_SER	0.03659	0.03236	0.54921
global	roc	cpg/BS27_1_SER	0.03659	0.07120	0.54488
global	roc	cpg/BS27_1_SER	0.06098	0.07120	0.54435
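The x and y columns of a roc curve are the false and true positive rates at decreasing thresholds (thr), so the area under the curve can be recovered from the points by trapezoidal integration. A toy example with made-up points:

```shell
# Toy ROC points (x = FPR, y = TPR), sorted by x; values are illustrative.
printf '0.0\t0.0\n0.5\t0.8\n1.0\t1.0\n' > /tmp/toy_curve.tsv

# Trapezoidal rule: sum dx * average height over consecutive points.
awk -F'\t' 'NR > 1 { auc += ($1 - px) * ($2 + py) / 2 } { px = $1; py = $2 } END { printf "%.2f\n", auc }' /tmp/toy_curve.tsv
# prints 0.65
```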

You can find more annotation tracks in the example data directory:


In [22]:
ls $anno_dir


Active_enhancers.bed H3K4me1.bed          Tet2.bed
CGI.bed              H3K4me1_Tet1.bed     UW_DNase1.bed
CGI_shelf.bed        IAP.bed              Wu_Tet1.bed
CGI_shore.bed        Intergenic.bed       mESC_enhancers.bed
Exons.bed            Introns.bed          p300.bed
Gene_body.bed        LMRs.bed             prom_2k05k.bed
H3K27ac.bed          Oct4_2i.bed          prom_2k05k_cgi.bed
H3K27me3.bed         TSSs.bed             prom_2k05k_ncgi.bed

To visualize the performance of a single model, you can use the R Markdown script R/eval_perf_single.Rmd in the DeepCpG directory. The following command only works if R and the required libraries (rmarkdown, knitr, ggplot2, dplyr, tidyr, xtable, grid) are installed. R/eval_perf_mult.Rmd can be used for a side-by-side comparison of multiple models.


In [23]:
cp ../../../R/eval_perf_single.Rmd $eval_dir/perf/index.Rmd
cwd=$PWD
cd $eval_dir/perf
Rscript -e "library(rmarkdown); render('./index.Rmd', output_format='html_document')"
cd $cwd


During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C" 
2: Setting LC_COLLATE failed, using "C" 
3: Setting LC_TIME failed, using "C" 
4: Setting LC_MESSAGES failed, using "C" 
5: Setting LC_MONETARY failed, using "C" 


processing file: index.Rmd
  |...                                                              |   5%
   inline R code fragments

  |......                                                           |   9%
label: unnamed-chunk-1 (with options) 
List of 1
 $ include: symbol F

  |.........                                                        |  14%
  ordinary text without R code

  |............                                                     |  18%
label: unnamed-chunk-2 (with options) 
List of 1
 $ include: symbol F


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

  |...............                                                  |  23%
  ordinary text without R code

  |..................                                               |  27%
label: unnamed-chunk-3
  |.....................                                            |  32%
  ordinary text without R code

  |........................                                         |  36%
label: unnamed-chunk-4
  |...........................                                      |  41%
  ordinary text without R code

  |..............................                                   |  45%
label: unnamed-chunk-5
  |................................                                 |  50%
  ordinary text without R code

  |...................................                              |  55%
label: unnamed-chunk-6
  |......................................                           |  59%
  ordinary text without R code

  |.........................................                        |  64%
label: unnamed-chunk-7 (with options) 
List of 1
 $ results: chr "asis"

  |............................................                     |  68%
  ordinary text without R code

  |...............................................                  |  73%
label: unnamed-chunk-8 (with options) 
List of 2
 $ fig.width : num 10
 $ fig.height: num 10

  |..................................................               |  77%
  ordinary text without R code

  |.....................................................            |  82%
label: unnamed-chunk-9 (with options) 
List of 2
 $ fig.width : num 10
 $ fig.height: num 6

  |........................................................         |  86%
  ordinary text without R code

  |...........................................................      |  91%
label: unnamed-chunk-10 (with options) 
List of 1
 $ results: chr "asis"

  |..............................................................   |  95%
  ordinary text without R code

  |.................................................................| 100%
label: unnamed-chunk-11 (with options) 
List of 2
 $ fig.width : num 10
 $ fig.height: num 25


output file: index.knit.md

/opt/local/bin/pandoc +RTS -K512m -RTS index.utf8.md --to html --from markdown+autolink_bare_uris+ascii_identifiers+tex_math_single_backslash --output index.html --smart --email-obfuscation none --self-contained --standalone --section-divs --table-of-contents --toc-depth 3 --template /Users/angermue/Library/R/3.3/library/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable 'theme:bootstrap' --include-in-header /var/folders/5n/38b69b1x03n7bb63rbnlzd280000gn/T//RtmpKTFbAA/rmarkdown-strbca3cadbd8a.html --mathjax --variable 'mathjax-url:https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' 

Output created: index.html
Warning messages:
1: `axis.ticks.margin` is deprecated. Please set `margin` property  of `axis.text` instead 
2: `axis.ticks.margin` is deprecated. Please set `margin` property  of `axis.text` instead 
3: `axis.ticks.margin` is deprecated. Please set `margin` property  of `axis.text` instead