This tutorial illustrates how to estimate and visualize the effect of single-nucleotide mutations on DNA methylation.
We first important the relevant libraries and initialize global variables.
In [1]:
import os
from os import path as pt
import h5py as h5
from matplotlib import pyplot as plt
import seaborn as sns
In [2]:
%matplotlib inline
In [3]:
example_dir = "../../data" # Directory with example data.
cpg_dir = pt.join(example_dir, "cpg") # Directory with CpG profiles.
dna_dir = pt.join(example_dir, "dna/mm10") # Directory with DNA sequences.
data_dir = "./data" # Directory with data files
model_dir = "./model" # Directory with DeepCpG model
snp_dir = "./snp" # Directory with results of mutation analysis
We will use the DNA model trained on serum mESCs from Smallwood et al. (2014) to compute effects, which we download from the DeepCpG model zoo with dcpg_download.py
. You can use any DeepCpG DNA model or Joint model instead. In case of using a Joint model, you will obtain mutation effects conditioned on observed neighoring CpG sites.
In [4]:
!dcpg_download.py \
Smallwood2014_serum_dna \
--out_dir {model_dir}
Next, we use dcpg_data.py
to extract DNA sequence windows for which we will compute effects. You can skip this step if you have already created data files.
In [5]:
!dcpg_data.py \
--cpg_profiles {cpg_dir}/*.tsv \
--dna_files {dna_dir} \
--out_dir {data_dir} \
--dna_wlen 1001 \
--chromo 4 \
--nb_sample 1000
Now we can run dcpg_snp.py
to compute the effect of single-nucleotide mutations for the created data files using the downloaded model. With --targets mean var
, dcpg_snp.py
computes the effect on the mean methylation rate and variance across cells. With --store_inputs
, input DNA sequence windows will be stored in the output file (--out_file
), which is useful for analyzing effects.
In [6]:
effects_file = "./effects.h5"
!dcpg_snp.py \
{data_dir}/*.h5 \
--model_files {model_dir} \
--out_file {effects_file} \
--targets mean var \
--store_inputs \
--nb_sample 1000
The output file is a HDF5 file with following datasets:
/chromo
: The chromosome of the DNA sequence window./pos
: The position of the DNA sequence window on the chromosome./dna
: The input DNA sequence window. Only present if --store_inputs
was used./mean
: The effect of each nucleotide in the DNA sequence window on the mean methylation rate across cells./var
: The effect of each nucleotide in the DNA sequence window on the variance across cells.
In [7]:
!h5ls -r {effects_file}
Here, 1000
is number of DNA sequence windows, 1001
the length of DNA sequence windows, and 4
the number of nucleotides.
To visualize effects, we first read the output data of dcpg_snp.py
using h5py:
In [8]:
h5_file = h5.File(effects_file, 'r')
data = {key: value.value for key, value in h5_file.items()}
h5_file.close()
Next, we compute the maximum absolute effect across nucleotides, averaged over all sequence windows:
In [9]:
global_effects = dict()
for target in ['mean', 'var']:
global_effects[target] = np.mean(np.max(np.abs(data[target]), axis=2), axis=0)
These can be now visualized at different resolutions:
In [10]:
def slice_ctr(d, wlen, axis=0):
_wlen = d.shape[axis]
ctr = _wlen // 2
dwlen = wlen // 2
idx = slice(ctr - dwlen, ctr + dwlen + 1)
if axis == 0:
return d[idx]
elif axis == 1:
return d[:, idx]
elif axis == 2:
return d[:, :, idx]
def plot_global_effects(effects, wlen=None):
if wlen:
effects = {key: slice_ctr(value, wlen) for key, value in effects.items()}
wlen = len(list(effects.values())[0])
fig, ax = plt.subplots(figsize=(15, 4))
x = np.arange(wlen) - (wlen // 2)
for key, value in effects.items():
ax.plot(x, value, label=key)
ax.legend()
plot_global_effects(global_effects)
plot_global_effects(global_effects, 401)
plot_global_effects(global_effects, 201)
The plots show that mutations close to the target CpG site have the largest effect and that mutations have a stronger effect on the mean methylation rate than on the variance between cells.
In addition to visualizing average effects, we can visualize the effects for individual sequences:
In [11]:
def select_seq(data, chromo, pos, target='mean'):
idx = (data['chromo'] == chromo.encode()) & (data['pos'] == pos)
assert idx.sum() == 1
return data[target][idx].squeeze().T
def plot_seq_effects(effects, wlen=51):
ctr = effects.shape[1] // 2
dwlen = wlen // 2
effects = effects[:, ctr-dwlen:ctr+dwlen+1]
fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(15, 8),
gridspec_kw={'height_ratios': [1, 2]})
x = np.arange(wlen) - dwlen
ax[0].plot(x, np.max(np.abs(effects), axis=0))
sns.heatmap(effects, ax=ax[1],
cbar_kws={'orientation': 'horizontal', 'pad': 0.1, 'shrink': 0.3})
ax[1].set_xticklabels(x)
ax[1].set_yticklabels(['A', 'T', 'G', 'C'])
for pos in data['pos'][:5]:
effects = select_seq(data, '4', pos)
plot_seq_effects(effects)
The heatmap shows the effect of each nucleotide within a window of 51 bp centered on the target CpG site, and the top row the maximum absolute effect across nucleotides. We used select_seq
to select the first 5 sequences. Instead you can use data['chromo']
and data['pos']
to select any CpG site in which you are interested in.
In [12]:
data['chromo'][:50]
Out[12]:
In [13]:
data['pos'][:50]
Out[13]: