This tutorial describes how to create DeepCpG data files, train models, and evaluate models. These topics are described in greater detail in the DeepCpG documentation.
We first initialize some variables that will be used throughout the tutorial. test_mode=1 should be used for testing purposes, which speeds up computations by using only a subset of the data. For real applications, test_mode=0 should be used.
In [1]:
# Print a banner with the command, then execute it.
function run {
local cmd=$@
echo
echo "#################################"
echo $cmd
echo "#################################"
eval $cmd
}
test_mode=1 # Set to 1 for testing and 0 otherwise.
example_dir="../../data" # Example data.
cpg_dir="$example_dir/cpg" # CpG profiles.
dna_dir="$example_dir/dna/mm10" # DNA sequences.
anno_dir="$example_dir/anno/mm10" # Annotations of genomic contexts.
data_dir="./data" # DeepCpG data.
models_dir="./models" # Trained models.
mkdir -p $models_dir
eval_dir="./eval" # Evaluation data.
mkdir -p $eval_dir
We first store the known CpG methylation states of each cell into a tab-delimited file with the following columns:

1. the chromosome,
2. the position of the CpG site on the chromosome, and
3. the binary methylation state of the CpG site (0 = unmethylated, 1 = methylated).

CpG sites with a methylation rate between zero and one should be binarized by rounding. Filenames should correspond to cell names. Each position must point to the cytosine residue of a CpG site (positions are enumerated starting from 1). Otherwise, dcpg_data.py will report warnings, e.g. if the wrong genome is used or CpG sites were not correctly aligned.
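For illustration, the first lines of such a profile file might look as follows (the positions are hypothetical):

1	3000827	1
1	3001007	0
1	3001018	1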
For this tutorial we are using a subset of serum mouse embryonic stem cells from Smallwood et al. (2014):
In [2]:
ls $cpg_dir
We can have a look at the methylation profile of cell 'BS27_1_SER':
In [3]:
head "$cpg_dir/BS27_1_SER.tsv"
Since we are dealing with mouse cells, we are using the mm10 (GRCm38) mouse genome build:
In [4]:
ls $dna_dir
These files were downloaded by setup.py. Other genomes, e.g. the human genome hg38, can be downloaded, for example, with the following command:
wget ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.*.fa.gz
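As a sketch, the downloaded FASTA files could be stored in their own directory, analogous to $dna_dir for mm10 (the hg38 path below is hypothetical):

hg38_dir="$example_dir/dna/hg38" # Hypothetical directory for the human genome.
mkdir -p $hg38_dir
(cd $hg38_dir && wget ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.*.fa.gz)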
Now we can run dcpg_data.py to create the input data for DeepCpG. For testing purposes, we only consider a few CpG sites on chromosomes 1 and 13:
In [5]:
cmd="dcpg_data.py
--cpg_profiles $cpg_dir/*.tsv
--dna_files $dna_dir
--out_dir $data_dir
--cpg_wlen 50
--dna_wlen 1001
"
if [[ $test_mode -eq 1 ]]; then
cmd="$cmd
--chromo 1 13
--nb_sample_chromo 1000
"
fi
run $cmd
For each CpG site that is observed in at least one cell, this command extracts the 50 neighboring CpG sites (25 to the left and 25 to the right) and the 1001 bp long DNA sequence window centered on the CpG site. In test mode, only 1000 CpG sites will be randomly sampled from chromosomes 1 and 13. The command creates multiple HDF5 files named cX_FROM_TO.h5, where X is the chromosome, and FROM and TO are the indices of the first and last CpG site stored in the file:
In [6]:
ls $data_dir
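To get a feeling for what these files contain, you can inspect the structure of one of them with h5ls (part of the HDF5 tools, which are also used later in this tutorial); the exact group layout depends on the inputs passed to dcpg_data.py:

h5ls -r $(ls $data_dir/c*.h5 | head -n 1)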
We can now train models on the created data. First, we need to split the data into a training, a validation, and a test set. The training set should contain at least 3 million CpG sites. We will use chromosomes 1, 3, 5, 7, and 9 as the training set, and chromosomes 13, 14, 15, 16, and 17 as the validation set:
In [7]:
# Return the DeepCpG data files for the given chromosomes.
function get_data_files {
local data_dir=$1
shift
local chromos=$@
files=""
for chromo in $chromos; do
files="$files $(ls $data_dir/c${chromo}_*.h5 2> /dev/null)"
done
echo $files
}
train_files=$(get_data_files $data_dir 1 3 5 7 9)
val_files=$(get_data_files $data_dir 13 14 15 16 17)
In [8]:
echo $train_files
In [9]:
echo $val_files
We can count the number of CpG sites in the training set using dcpg_data_stats.py:
In [10]:
cmd="dcpg_data_stats.py $train_files"
run $cmd
For each output cell, nb_tot is the total number of CpG sites, nb_obs the number of CpG sites with known methylation state, frac_obs the ratio between nb_obs and nb_tot, mean the mean methylation rate, and var the variance of the methylation rate.
We can now train our model. DeepCpG consists of a DNA, a CpG, and a Joint model, which can be trained either jointly or separately. We will train them separately, starting with the CpG model:
In [11]:
cmd="dcpg_train.py
$train_files
--val_files $val_files
--cpg_model RnnL1
--out_dir $models_dir/cpg
"
if [[ $test_mode -eq 1 ]]; then
cmd="$cmd
--nb_epoch 1
--nb_train_sample 1000
--nb_val_sample 1000
"
else
cmd="$cmd
--nb_epoch 30
"
fi
run $cmd
--cpg_model RnnL1 specifies the architecture of the CpG model, and --nb_epoch the number of training epochs. These are hyper-parameters, which can be adapted depending on the size of the training set and the model complexity. For testing purposes, we decrease the number of samples using --nb_train_sample and --nb_val_sample.
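After training has finished, the fitted model is stored in the output directory; listing it is a quick way to check that training completed (the exact file names depend on the DeepCpG version):

ls $models_dir/cpg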
The CpG model is often already quite accurate on its own. However, we can further boost the performance by also training a DNA model, which leverages the DNA sequence:
In [12]:
cmd="dcpg_train.py
$train_files
--val_files $val_files
--dna_model CnnL2h128
--out_dir $models_dir/dna
"
if [[ $test_mode -eq 1 ]]; then
cmd="$cmd
--nb_epoch 1
--nb_train_sample 1000
--nb_val_sample 1000
"
else
cmd="$cmd
--nb_epoch 30
"
fi
run $cmd
Finally, we combine both models by training a Joint model:
In [13]:
cmd="dcpg_train.py
$train_files
--val_files $val_files
--dna_model $models_dir/dna
--cpg_model $models_dir/cpg
--joint_model JointL2h512
--train_models joint
--out_dir $models_dir/joint
"
if [[ $test_mode -eq 1 ]]; then
cmd="$cmd
--nb_epoch 1
--nb_train_sample 1000
--nb_val_sample 1000
"
else
cmd="$cmd
--nb_epoch 10
"
fi
run $cmd
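At this point, $models_dir should contain one subdirectory per trained model:

ls $models_dir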
You can find more information about training and model architectures in the DeepCpG documentation.
Finally, we use dcpg_eval.py to impute missing methylation states and to evaluate the prediction performance on observed states. We will use the trained Joint model, but could of course also evaluate the CpG or DNA model.
In [14]:
cmd="dcpg_eval.py
$data_dir/c*.h5
--model_files $models_dir/joint
--out_data $eval_dir/data.h5
--out_report $eval_dir/report.tsv
"
if [[ $test_mode -eq 1 ]]; then
cmd="$cmd
--nb_sample 10000
"
fi
run $cmd
The imputed methylation profiles of all cells are stored in data.h5, and performance metrics in report.tsv.
In [15]:
h5ls -r $eval_dir/data.h5
In [16]:
cat $eval_dir/report.tsv
dcpg_eval_export.py can be used to export the imputed methylation profiles:
In [17]:
cmd="dcpg_eval_export.py
$eval_dir/data.h5
-o $eval_dir/hdf
-f hdf
"
eval $cmd
In [18]:
ls $eval_dir/hdf
By default, dcpg_eval_export.py exports profiles to HDF5 files. You can use -f bedGraph to export profiles to gzip-compressed bedGraph files, which, however, takes longer.
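For example, exporting the same profiles to bedGraph files could look like this (the output directory name is hypothetical):

cmd="dcpg_eval_export.py
    $eval_dir/data.h5
    -o $eval_dir/bedGraph
    -f bedGraph
    "
eval $cmd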
dcpg_eval_perf.py enables evaluating prediction performance genome-wide, in specific genomic contexts, and by computing performance curves. Using --anno_files, you can specify a list of BED files with annotation tracks that are evaluated, and you can compute ROC and precision-recall (PR) curves for individual outputs using --curves roc pr. You can also use --anno_curves roc pr to compute performance curves for the annotations specified by --anno_files.
In [19]:
cmd="dcpg_eval_perf.py
$eval_dir/data.h5
--out_dir $eval_dir/perf
--curves roc pr
--anno_files $anno_dir/CGI*.bed $anno_dir/Gene_body.bed $anno_dir/Introns.bed $anno_dir/Exons.bed $anno_dir/UW_DNase1.bed
"
eval $cmd
Performance metrics are stored in metrics.tsv, and performance curves in curves.tsv:
In [20]:
head $eval_dir/perf/metrics.tsv
In [21]:
head $eval_dir/perf/curves.tsv
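As a sketch, per-annotation curves could be computed by additionally passing --anno_curves to the call above, writing to a separate (hypothetical) output directory:

cmd="dcpg_eval_perf.py
    $eval_dir/data.h5
    --out_dir $eval_dir/perf_anno
    --curves roc pr
    --anno_curves roc pr
    --anno_files $anno_dir/CGI*.bed
    "
eval $cmd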
You can find more annotation tracks in the example data directory:
In [22]:
ls $anno_dir
To visualize the performance of a single model, you can use the R Markdown script R/eval_perf_single.Rmd in the DeepCpG directory. The following command will only work if you have installed R and the required libraries (rmarkdown, knitr, ggplot2, dplyr, tidyr, xtable, grid). R/eval_perf_mult.Rmd can be used for a side-by-side comparison of multiple models.
In [23]:
cp ../../../R/eval_perf_single.Rmd $eval_dir/perf/index.Rmd
cwd=$PWD
cd $eval_dir/perf
Rscript -e "library(rmarkdown); render('./index.Rmd', output_format='html_document')"
cd $cwd