This tutorial describes how to visualize and analyze motifs learnt by DeepCpG, i.e. the filters of the first convolutional layer of the DNA model.
We fill first compute the activations (occurrence frequencies) of motifs in sequence windows, and then extract and align sequence fragments that maximally activate each motif. Finally, we will visualize the resulting alignments via Weblogo and compare them to known motifs via Tomtom.
Tomtom is required for comparing motifs and is part of the MEME-Suite, which can be downloaded here.
WebLogo3 is required for visualizing motifs and can be installed with pip
:
pip install weblogo
We first initialize some variables that will be used throughout the tutorial. test_mode=1
should be used for testing purposes, which speeds up computations by only using a subset of the data. For real applications, test_mode=0
should be used.
In [11]:
function run {
local cmd=$@
echo
echo "#################################"
echo $cmd
echo "#################################"
eval $cmd
}
test_mode=1 # Set to 1 for testing and 0 otherwise
example_dir="../../data" # Directory with example data.
cpg_dir="$example_dir/cpg" # Directory with CpG profiles.
dna_dir="$example_dir/dna/mm10" # Directory with DNA sequences.
motif_dir="./motifs" # Motif output directory
We will analyze the filters of the DeepCpG DNA model that was trained on serum cells from Smallwood et al. (2014), and which is described in the DeepCpG publication. This model can be downloaded with dcpg_download.py
:
In [12]:
cmd="dcpg_download.py
Smallwood2014_serum_dna
--out_dir $motif_dir/model
"
run $cmd
Next, we use dcpg_data.py
to extract sequence windows for computing filter activations. If you have already created a dataset, you can skip this step. We use --nb_sample_chromo
to randomly sample only 1500 CpG sites from each chromosome, which is sufficient for visualizing motifs and reduces compute costs.
In [13]:
data_dir="./data"
cmd="dcpg_data.py
--cpg_profiles $cpg_dir/*.tsv
--dna_files $dna_dir
--out_dir $data_dir
--dna_wlen 1001
"
if [[ $test_mode -eq 1 ]]; then
cmd="$cmd --nb_sample 1000"
else
cmd="$cmd --nb_sample_chromo 1500"
fi
run $cmd
Now we use dcpg_filter_act.py
to compute the activation of filters in DNA sequence windows. We are using --store_inputs
to copy DNA sequences from the input file to the output file, since they are needed in the following step align sequence fragments. We are considering only 30000 CpG sites, which is usually sufficient for visualizing motifs.
In [14]:
cmd="dcpg_filter_act.py
$data_dir/*.h5
--model_files ${motif_dir}/model
--out_file ${motif_dir}/activations.h5
--store_inputs
"
if [[ $test_mode -eq 1 ]]; then
cmd="$cmd --nb_sample 1000"
else
cmd="$cmd --nb_sample 30000"
fi
run $cmd
dcpg_filter_motifs.py
enables to visualize and cluster motifs, compare motifs to known motifs, and compute motif statistics. We will compare motifs to motifs in the CIS-BP database, and plot motif heatmaps, motif activations, as well as the first two principal components of motif activations via the --plot_heat
, --plot_dens
and --plot_pca
argument, respectively. In test mode, we only select filter 0, 1, and 2:
In [15]:
cmd="dcpg_filter_motifs.py
$motif_dir/activations.h5
--out_dir $motif_dir
--motif_db $example_dir/motif_databases/CIS-BP/Mus_musculus.meme
--plot_heat
--plot_dens
--plot_pca
--out_format png
"
if [[ $test_mode -eq 1 ]]; then
cmd="$cmd --filter -2"
fi
run $cmd
We can now have a look into the output motif directory. report_top.csv
contains for each DeepCpG motif the most similar motif in the CIS-BP database. report.csv
lists all motifs that are similar to DeepCpG motifs, not only the top motif. Both files include the following columns:
idx
: Index of the filter starting with zero.motif
: The DNA sequence that is recognized by the filter.act_mean
: The mean activation of the filter. If zero, the motif has been never recognized, which means that the filter is dead. If there are a lot of dead filters, something went wrong with the training or you used too few samples for computing the activations with dcpg_filter_act.py
.act_std
: Standard deviation of the filter activation. If high, the filter is very sensitive to a particular motif.ic
: Information content of the motifs.nb_site
: Number of sequences in alignment that was used to create the sequence logo.target_id
: Id of the found motif.q-value
: FDR adjusted p-value of the found motif.protein
: Name of the found motif.
In [16]:
cat $motif_dir/report_top.csv | head -n 25
Sequence logos are stored in the logos/
directory. Let's have a look at the top-ranked motif 121-Ctcf (not true in test mode):
The corresponding PWM weight matrix stored in heat/
looks as follows:
Densities of motif activations are stored in dens/
, e.g. the density of the selected motif:
Finally, plot_wmean.png
shows the two first principal components of weighted-mean motif activations in sequence windows. Motifs with similar activaton pattern cluster close to each other: