Motif analysis

This tutorial describes how to visualize and analyze motifs learnt by DeepCpG, i.e. the filters of the first convolutional layer of the DNA model.

We fill first compute the activations (occurrence frequencies) of motifs in sequence windows, and then extract and align sequence fragments that maximally activate each motif. Finally, we will visualize the resulting alignments via Weblogo and compare them to known motifs via Tomtom.

Requirements

Tomtom is required for comparing motifs and is part of the MEME-Suite, which can be downloaded here.

WebLogo3 is required for visualizing motifs and can be installed with pip:

pip install weblogo

Initialization

We first initialize some variables that will be used throughout the tutorial. test_mode=1 should be used for testing purposes, which speeds up computations by only using a subset of the data. For real applications, test_mode=0 should be used.


In [11]:
function run {
  local cmd=$@
  echo
  echo "#################################"
  echo $cmd
  echo "#################################"
  eval $cmd
}

test_mode=1 # Set to 1 for testing and 0 otherwise
example_dir="../../data" # Directory with example data.
cpg_dir="$example_dir/cpg" # Directory with CpG profiles.
dna_dir="$example_dir/dna/mm10" # Directory with DNA sequences.

motif_dir="./motifs" # Motif output directory



Dowloading the DeepCpG model

We will analyze the filters of the DeepCpG DNA model that was trained on serum cells from Smallwood et al. (2014), and which is described in the DeepCpG publication. This model can be downloaded with dcpg_download.py:


In [12]:
cmd="dcpg_download.py
  Smallwood2014_serum_dna
  --out_dir $motif_dir/model
  "
run $cmd


#################################
dcpg_download.py Smallwood2014_serum_dna --out_dir ./motifs/model
#################################
INFO (2017-03-05 19:09:31,461): Downloading model ...
INFO (2017-03-05 19:09:31,461): Model URL: http://www.ebi.ac.uk/~angermue/deepcpg/alias/1754b5bbc21a8257663acc52e657f69c
--2017-03-05 19:09:31--  http://www.ebi.ac.uk/~angermue/deepcpg/alias/1754b5bbc21a8257663acc52e657f69c
Resolving www.ebi.ac.uk (www.ebi.ac.uk)... 193.62.192.80
Connecting to www.ebi.ac.uk (www.ebi.ac.uk)|193.62.192.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 124853155 (119M) [text/plain]
Saving to: ‘./motifs/model/model.zip’

./motifs/model/mode 100%[===================>] 119.07M  10.2MB/s    in 10s     

2017-03-05 19:09:42 (11.8 MB/s) - ‘./motifs/model/model.zip’ saved [124853155/124853155]

Archive:  ./motifs/model/model.zip
  inflating: ./motifs/model/model.h5  
  inflating: ./motifs/model/model.json  
  inflating: ./motifs/model/model_weights.h5  
  inflating: ./motifs/model/model_weights_train.h5  
  inflating: ./motifs/model/model_weights_val.h5  
INFO (2017-03-05 19:09:44,010): Done!

Creating DeepCpG data files

Next, we use dcpg_data.py to extract sequence windows for computing filter activations. If you have already created a dataset, you can skip this step. We use --nb_sample_chromo to randomly sample only 1500 CpG sites from each chromosome, which is sufficient for visualizing motifs and reduces compute costs.


In [13]:
data_dir="./data"
cmd="dcpg_data.py
    --cpg_profiles $cpg_dir/*.tsv
    --dna_files $dna_dir
    --out_dir $data_dir
    --dna_wlen 1001
    "
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd --nb_sample 1000"
else
    cmd="$cmd --nb_sample_chromo 1500"
fi
run $cmd


#################################
dcpg_data.py --cpg_profiles ../../data//cpg/BS27_1_SER.tsv ../../data//cpg/BS27_3_SER.tsv ../../data//cpg/BS27_5_SER.tsv ../../data//cpg/BS27_6_SER.tsv ../../data//cpg/BS27_8_SER.tsv --dna_files ../../data//dna/mm10 --out_dir ./data --dna_wlen 1001 --nb_sample 1000
#################################
INFO (2017-03-05 19:09:49,454): Reading single-cell profiles ...
INFO (2017-03-05 19:09:49,730): 1000 samples
INFO (2017-03-05 19:09:49,731): --------------------------------------------------------------------------------
INFO (2017-03-05 19:09:49,731): Chromosome 1 ...
INFO (2017-03-05 19:09:49,739): 1000 / 1000 (100.0%) sites matched minimum coverage filter
INFO (2017-03-05 19:09:54,709): Chunk 	1 / 1
INFO (2017-03-05 19:09:54,776): Extracting DNA sequence windows ...
INFO (2017-03-05 19:09:55,273): Done!

Computing filter activations

Now we use dcpg_filter_act.py to compute the activation of filters in DNA sequence windows. We are using --store_inputs to copy DNA sequences from the input file to the output file, since they are needed in the following step align sequence fragments. We are considering only 30000 CpG sites, which is usually sufficient for visualizing motifs.


In [14]:
cmd="dcpg_filter_act.py
    $data_dir/*.h5
    --model_files ${motif_dir}/model
    --out_file ${motif_dir}/activations.h5
    --store_inputs
    "
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd --nb_sample 1000"
else
    cmd="$cmd --nb_sample 30000"
fi
run $cmd


#################################
dcpg_filter_act.py ./data/c1_000000-001000.h5 ./data/c1_000000-025000.h5 --model_files ./motifs/model --out_file ./motifs/activations.h5 --store_inputs --nb_sample 1000
#################################
Using TensorFlow backend.
INFO (2017-03-05 19:10:06,304): Loading model ...
INFO (2017-03-05 19:10:06,305): Using model files ./motifs/model/model.json ./motifs/model/model_weights.h5
INFO (2017-03-05 19:10:07,068): Using activation layer "dna/activation_1"
INFO (2017-03-05 19:10:07,068): Using weight layer "dna/convolution1d_1"
INFO (2017-03-05 19:10:07,069): Reading data ...
INFO (2017-03-05 19:10:07,084): Computing activations
INFO (2017-03-05 19:10:07,093):  128/1000 (12.8%)
INFO (2017-03-05 19:10:08,145):  256/1000 (25.6%)
INFO (2017-03-05 19:10:09,347):  384/1000 (38.4%)
INFO (2017-03-05 19:10:10,359):  512/1000 (51.2%)
INFO (2017-03-05 19:10:11,396):  640/1000 (64.0%)
INFO (2017-03-05 19:10:12,631):  768/1000 (76.8%)
INFO (2017-03-05 19:10:13,726):  896/1000 (89.6%)
INFO (2017-03-05 19:10:14,773): 1000/1000 (100.0%)
INFO (2017-03-05 19:10:15,584): Done!

Visualizing and analyzing motifs

dcpg_filter_motifs.py enables to visualize and cluster motifs, compare motifs to known motifs, and compute motif statistics. We will compare motifs to motifs in the CIS-BP database, and plot motif heatmaps, motif activations, as well as the first two principal components of motif activations via the --plot_heat, --plot_dens and --plot_pca argument, respectively. In test mode, we only select filter 0, 1, and 2:


In [15]:
cmd="dcpg_filter_motifs.py
    $motif_dir/activations.h5
    --out_dir $motif_dir
    --motif_db $example_dir/motif_databases/CIS-BP/Mus_musculus.meme
    --plot_heat
    --plot_dens
    --plot_pca
    --out_format png
    "
if [[ $test_mode -eq 1 ]]; then
    cmd="$cmd --filter -2"
fi
run $cmd


#################################
dcpg_filter_motifs.py ./motifs/activations.h5 --out_dir ./motifs --motif_db ../../data//motif_databases/CIS-BP/Mus_musculus.meme --plot_heat --plot_dens --plot_pca --out_format png --filter -2
#################################
INFO (2017-03-05 19:10:22,986): Reading data
Filters: 3
Filter len: 11
Samples: 1000
INFO (2017-03-05 19:10:23,022): Performing PCA on activations using 1000 samples
INFO (2017-03-05 19:10:24,148): Analyzing filters
INFO (2017-03-05 19:10:24,148): -----------------
INFO (2017-03-05 19:10:24,148): Filter 0
INFO (2017-03-05 19:10:24,516): Plotting filter densities
INFO (2017-03-05 19:10:24,866): Plotting filter heatmap
INFO (2017-03-05 19:10:25,119): Extracting activating kmers
INFO (2017-03-05 19:10:25,136): Plotting sequence logo
INFO (2017-03-05 19:10:26,204): Computing PWM
INFO (2017-03-05 19:10:26,209): Filter 1
INFO (2017-03-05 19:10:26,536): Plotting filter densities
INFO (2017-03-05 19:10:26,847): Plotting filter heatmap
INFO (2017-03-05 19:10:27,365): Extracting activating kmers
INFO (2017-03-05 19:10:27,381): Plotting sequence logo
INFO (2017-03-05 19:10:27,961): Computing PWM
INFO (2017-03-05 19:10:27,963): Filter 2
INFO (2017-03-05 19:10:28,289): Plotting filter densities
INFO (2017-03-05 19:10:28,596): Plotting filter heatmap
INFO (2017-03-05 19:10:28,848): Extracting activating kmers
INFO (2017-03-05 19:10:28,875): Plotting sequence logo
INFO (2017-03-05 19:10:29,625): Computing PWM


Filter statistics:
   idx        motif  act_mean   act_std        ic  nb_site
2    2  ACGCCTGCCGC  0.005773  0.028300  6.129625     6035
0    0  AGGTGTACCCC  0.004393  0.024795  8.054280     2058
1    1  CCGCGGGGGGG  0.002785  0.021427  7.536105     1264
INFO (2017-03-05 19:10:29,636): Running tomtom

 tomtom -dist pearson -thresh 0.05 -oc ./motifs/tomtom ./motifs/meme.txt ../../data//motif_databases/CIS-BP/Mus_musculus.meme
The output directory './motifs/tomtom' already exists.
Its contents will be overwritten.
Processing query 1 out of 3 
Estimating pi_0 from all 1436 observed p-values.
Estimating pi_0.
Estimated pi_0=1
Processing query 2 out of 3 
Estimating pi_0 from all 1436 observed p-values.
Estimating pi_0.
Estimated pi_0=0.97983
Processing query 3 out of 3 
Estimating pi_0 from all 1436 observed p-values.
Estimating pi_0.
Estimated pi_0=1

Tomtom results:
   idx        motif  act_mean  act_std      ic  nb_site   target id   p-value   e-value   q-value  overlap query consensus target consensus orientation protein                                                url
0    1  CCGCGGGGGGG    0.0028   0.0214  7.5361     1264  M6535_1.02  0.000038  0.027624  0.020922     11.0     CCGCGGGGGGG    GCGGGGGCGGGGG           +     Wt1  http://cisbp.ccbr.utoronto.ca/TFreport.php?sea...
1    2  ACGCCTGCCGC    0.0058   0.0283  6.1296     6035         NaN       NaN       NaN       NaN      NaN             NaN              NaN         NaN     NaN                                                NaN
2    0  AGGTGTACCCC    0.0044   0.0248  8.0543     2058         NaN       NaN       NaN       NaN      NaN             NaN              NaN         NaN     NaN                                                NaN
INFO (2017-03-05 19:10:32,320): Done!

We can now have a look into the output motif directory. report_top.csv contains for each DeepCpG motif the most similar motif in the CIS-BP database. report.csv lists all motifs that are similar to DeepCpG motifs, not only the top motif. Both files include the following columns:

  • idx: Index of the filter starting with zero.
  • motif: The DNA sequence that is recognized by the filter.
  • act_mean: The mean activation of the filter. If zero, the motif has been never recognized, which means that the filter is dead. If there are a lot of dead filters, something went wrong with the training or you used too few samples for computing the activations with dcpg_filter_act.py.
  • act_std: Standard deviation of the filter activation. If high, the filter is very sensitive to a particular motif.
  • ic: Information content of the motifs.
  • nb_site: Number of sequences in alignment that was used to create the sequence logo.
  • target_id: Id of the found motif.
  • q-value: FDR adjusted p-value of the found motif.
  • protein: Name of the found motif.

In [16]:
cat $motif_dir/report_top.csv | head -n 25


idx	motif	act_mean	act_std	ic	nb_site	target id	p-value	e-value	q-value	overlap	query consensus	target consensus	orientation	protein	url
1	CCGCGGGGGGG	0.003	0.021	7.536	1264	M6535_1.02	0.000	0.028	0.021	11.000	CCGCGGGGGGG	GCGGGGGCGGGGG	+	Wt1	http://cisbp.ccbr.utoronto.ca/TFreport.php?searchTF=T049053_1.02
2	ACGCCTGCCGC	0.006	0.028	6.130	6035										
0	AGGTGTACCCC	0.004	0.025	8.054	2058										

Sequence logos are stored in the logos/ directory. Let's have a look at the top-ranked motif 121-Ctcf (not true in test mode):

The corresponding PWM weight matrix stored in heat/ looks as follows:

Densities of motif activations are stored in dens/, e.g. the density of the selected motif:

Finally, plot_wmean.png shows the two first principal components of weighted-mean motif activations in sequence windows. Motifs with similar activaton pattern cluster close to each other: