Welcome to the Jupyter notebook dedicated to Hi-C data analysis. Here we will be working in an interactive Python environment, mixed with some bash command-line tools.
Here is the outline of what we are going to do:
If you have any questions, please contact Aleksandra Galitsyna (agalitzina@gmail.com)
In [ ]:
# This is a regular Python comment inside a Jupyter "Code" cell.
# You can easily run "Hello world" in the "Code" cell (focus on the cell and press Shift+Enter):
print("Hello world!")
Note that a regular Python session is running behind this interactive notebook, so Python variables are accessible only through your history of actions in the notebook. To create a variable, you have to execute the corresponding block of code. All your variables will be lost when you restart the notebook's kernel.
You can pause or stop the kernel, save the notebook (.ipynb) file, and copy or insert cells via the buttons in the toolbar. Please take a look at these useful buttons.
Also, try pressing 'Esc' and then 'h' to see the keyboard shortcuts help.
Jupyter notebooks also provide "magic" commands. We will use %%bash, %%capture and %matplotlib. For example, the %%bash magic makes it easy to run bash commands:
In [ ]:
%%bash
echo "Current directory is: "; pwd
echo "List of files in the current directory is: "; ls
In [ ]:
# Import a module under a custom name
import numpy as np
In [ ]:
# Appending "?" to a name brings up its documentation:
np?
OK, it seems that now we are ready to start our Hi-C data analysis! I've placed a "Go top" shortcut for you in each section so that you can navigate quickly through the notebook.
A Hi-C experiment results in paired-end sequencing data, where each read pair represents one possible contact. The analysis starts with the raw sequencing data (.fastq files).
I've downloaded the raw files from Flyamer et al. 2017 (GEO ID GSE80006) and placed them in the DATA/FASTQ/ directory.
We can easily view these files with a little bash. Forward and reverse reads, respectively:
In [ ]:
%%bash
head -n 8 '../DATA/FASTQ/K562_B-bulk_R1.fastq'
In [ ]:
%%bash
head -n 8 '../DATA/FASTQ/K562_B-bulk_R2.fastq'
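Each FASTQ record occupies four lines: a read identifier (starting with '@'), the sequence, a '+' separator, and per-base quality scores. As a quick sanity check, we can parse the first record in plain Python:
In [ ]:
# Read the first FASTQ record (4 lines) and print its parts
with open('../DATA/FASTQ/K562_B-bulk_R1.fastq') as f:
    read_id, seq, sep, qual = (next(f).rstrip() for _ in range(4))
print('ID:      ', read_id)
print('Sequence:', seq)
print('Quality: ', qual)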
Now we have to map these reads to the genome of interest (Homo sapiens hg19, downloaded from UCSC in this case). We are going to use only chromosome 1 to minimise computational time.
The genome is also pre-downloaded:
In [ ]:
%%bash
ls ../GENOMES/HG19_FASTA
For Hi-C data mapping we will use hiclib. It relies on the Bowtie 2 read mapping software. Bowtie 2 indexes the genome prior to read mapping in order to reduce memory usage. Usually you have to run genome indexing yourself, but I've already done this time-consuming step. That's why the code for this step is included below but commented out.
In [ ]:
#%%bash
## Time-consuming step: build the Bowtie 2 index for chromosome 1
#bowtie2-build /home/jovyan/GENOMES/HG19_FASTA/chr1.fa /home/jovyan/GENOMES/HG19_IND/hg19_chr1
In [ ]:
%%bash
ls ../GENOMES/HG19_IND
In [ ]:
import os
from hiclib import mapping
from mirnylib import h5dict, genome
Then we need to set some parameters and prepare our environment:
In [ ]:
%%bash
which bowtie2
# Bowtie 2 path
In [ ]:
%%bash
pwd
# Current working directory path
In [ ]:
# Setting parameters and paths
bowtie_path = '/opt/conda/bin/bowtie2'
enzyme = 'DpnII'
bowtie_index_path = '/home/jovyan/GENOMES/HG19_IND/hg19_chr1'
fasta_path = '/home/jovyan/GENOMES/HG19_FASTA/'
chrms = ['1']
# Reading the genome
genome_db = genome.Genome(fasta_path, readChrms=chrms)
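The Genome object wraps the FASTA files and stores chromosome metadata. As a quick check (assuming this mirnylib version exposes the chrmLabels and chrmLens attributes, as recent ones do):
In [ ]:
# Which chromosomes were read, and how long are they (in bp)?
print(genome_db.chrmLabels)
print(genome_db.chrmLens)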
In [ ]:
# Creating directories for further data processing
if not os.path.exists('tmp/'):
    os.mkdir('tmp/')
if not os.path.exists('../DATA/SAM/'):
    os.mkdir('../DATA/SAM/')
In [ ]:
# Set parameters for iterative mapping
min_seq_len = 25
len_step = 5
nthreads = 2
temp_dir = 'tmp'
bowtie_flags = '--very-sensitive'
infile1 = '/home/jovyan/DATA/FASTQ/K562_B-bulk_R1.fastq'
infile2 = '/home/jovyan/DATA/FASTQ/K562_B-bulk_R2.fastq'
out1 = '/home/jovyan/DATA/SAM/K562_B-bulk_R1.chr1.sam'
out2 = '/home/jovyan/DATA/SAM/K562_B-bulk_R2.chr1.sam'
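Iterative mapping first truncates every read to min_seq_len bp, maps it, and then extends the still-unmapped reads by len_step bp at a time until the full read length is reached. To illustrate, assuming 75 bp reads (a hypothetical length for this sketch), the truncation lengths tried would be:
In [ ]:
# Truncation lengths tried by iterative mapping for a hypothetical 75 bp read
read_len = 75
print(list(range(min_seq_len, read_len + 1, len_step)))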
In [ ]:
# Iterative mapping itself. Time-consuming step!
mapping.iterative_mapping(
    bowtie_path=bowtie_path,
    bowtie_index_path=bowtie_index_path,
    fastq_path=infile1,
    out_sam_path=out1,
    min_seq_len=min_seq_len,
    len_step=len_step,
    nthreads=nthreads,
    temp_dir=temp_dir,
    bowtie_flags=bowtie_flags)

mapping.iterative_mapping(
    bowtie_path=bowtie_path,
    bowtie_index_path=bowtie_index_path,
    fastq_path=infile2,
    out_sam_path=out2,
    min_seq_len=min_seq_len,
    len_step=len_step,
    nthreads=nthreads,
    temp_dir=temp_dir,
    bowtie_flags=bowtie_flags)
Let's take a look at the .sam files that were created during iterative mapping:
In [ ]:
%%bash
ls /home/jovyan/DATA/SAM/
In [ ]:
%%bash
head -n 10 /home/jovyan/DATA/SAM/K562_B-bulk_R1.chr1.sam.25
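A SAM file starts with header lines (prefixed with '@'), followed by one alignment record per line, so we can count the records at each truncation length with plain bash:
In [ ]:
%%bash
# Count alignment records (non-header lines) in each SAM file
for f in /home/jovyan/DATA/SAM/K562_B-bulk_R1.chr1.sam.*; do echo "$f: $(grep -vc '^@' "$f")"; done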
For each read truncation length and orientation, we have a separate file. Now we need to merge them into a single dataset (.hdf5 file):
In [ ]:
# Create the directory for the output
if not os.path.exists('../DATA/HDF5/'):
    os.mkdir('../DATA/HDF5/')

# Define the file name for the output
out = '/home/jovyan/DATA/HDF5/K562_B-bulk.fragments.hdf5'
In [ ]:
# Open the output file
mapped_reads = h5dict.h5dict(out)

# Parse the mapping data and write it to the output file
mapping.parse_sam(
    sam_basename1=out1,
    sam_basename2=out2,
    out_dict=mapped_reads,
    genome_db=genome_db,
    enzyme_name=enzyme,
    save_seqs=False,
    keep_ids=False)
Let's take a look at the created file:
In [ ]:
%%bash
ls /home/jovyan/DATA/HDF5/
In [ ]:
import h5py

# Open the file for reading
a = h5py.File('/home/jovyan/DATA/HDF5/K562_B-bulk.fragments.hdf5', 'r')
In [ ]:
# "a" variable has dictionary-like structure, we can view its keys, for example:
list( a.keys() )
In [ ]:
# Mapping positions for forward reads are stored under the 'cuts1' key:
a['cuts1'][()]
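Besides 'cuts1', the parsed dictionary should also contain chromosome indices and strands for both read sides (e.g. 'chrms1', 'strands1'; see the key listing above). A rough look at the forward-read positions:
In [ ]:
cuts1 = a['cuts1'][()]  # load the dataset into a numpy array
print(cuts1.shape)               # number of read sides stored
print(cuts1.min(), cuts1.max())  # range of mapping positions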
In [ ]:
from hiclib import fragmentHiC
In [ ]:
inp = '/home/jovyan/DATA/HDF5/K562_B-bulk.fragments.hdf5'
out = '/home/jovyan/DATA/HDF5/K562_B-bulk.fragments_filtered.hdf5'
In [ ]:
# Create the output file
fragments = fragmentHiC.HiCdataset(
    filename=out,
    genome=genome_db,
    maximumMoleculeLength=500,
    mode='w')

# Parse the input data
fragments.parseInputData(dictLike=inp)

# Filtering
fragments.filterRsiteStart(offset=5)         # remove reads mapping too close to a restriction site
fragments.filterDuplicates()                 # remove PCR duplicates
fragments.filterLarge()                      # remove too large restriction fragments
fragments.filterExtreme(cutH=0.005, cutL=0)  # remove fragments with too high or too low counts
In [ ]:
# Some hidden filters were also applied; we can check them all:
fragments.printMetadata()
A nicer view of these statistics:
In [ ]:
import pandas as pd
df_stat = pd.DataFrame(list(fragments.metadata.items()), columns=['Feature', 'Count'])
df_stat
In [ ]:
df_stat['Ratio of total'] = 100*df_stat['Count']/df_stat.loc[2,'Count']  # normalise by the count in row 2 (the total)
df_stat
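To see at a glance how many reads each filter removed, here is a minimal bar-chart sketch of the table above (it imports matplotlib itself so that the cell is self-contained; run the %matplotlib inline cell below first if the figure does not show):
In [ ]:
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
plt.barh(range(len(df_stat)), df_stat['Count'])
plt.yticks(range(len(df_stat)), df_stat['Feature'])
plt.xlabel('Count')
plt.gca().invert_yaxis()  # keep the original table order, top to bottom
plt.tight_layout()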
In [ ]:
# Define the file name for binned data. Note the "{}" placeholder prepared for string formatting
out_bin = '/home/jovyan/DATA/HDF5/K562_B-bulk.binned_{}.hdf5'
In [ ]:
res_kb = [100, 20]  # several resolutions, in kb
for res in res_kb:
    print(res)
    outmap = out_bin.format(str(res) + 'kb')  # string formatting
    fragments.saveHeatmap(outmap, res * 1000)  # save the heatmap
In [ ]:
del fragments # delete unwanted object
In [ ]:
# Importing visualisation modules
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('ticks')
In [ ]:
%matplotlib inline
In [ ]:
from hiclib.binnedData import binnedDataAnalysis
In [ ]:
res = 100 # Resolution in Kb
In [ ]:
# prepare to read the data
data_hic = binnedDataAnalysis(resolution=res*1000, genome=genome_db)
In [ ]:
# read the data
data_hic.simpleLoad('/home/jovyan/DATA/HDF5/K562_B-bulk.binned_{}.hdf5'.format(str(res)+'kb'),'hic')
mtx = data_hic.dataDict['hic']
In [ ]:
# Show the heatmap
plt.figure(figsize=[15,15])
plt.imshow(mtx[0:200, 0:200], cmap='jet', interpolation='none')
In [ ]:
# Additional data filtering
data_hic.removeDiagonal()
data_hic.removePoorRegions()
data_hic.removeZeros()
data_hic.iterativeCorrectWithoutSS(force=True)
data_hic.restoreZeros()
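iterativeCorrectWithoutSS performs iterative correction: it repeatedly divides each row and column by its coverage until all bins have (nearly) equal total contact counts, which removes multiplicative bin-level biases. A toy numpy sketch of the idea (not the hiclib implementation):
In [ ]:
def toy_ice(m, n_iter=20):
    # A toy version of iterative correction on a small symmetric matrix
    m = m.astype(float).copy()
    for _ in range(n_iter):
        s = m.sum(axis=0)
        s /= s[s > 0].mean()  # keep the overall scale stable
        s[s == 0] = 1         # leave empty bins untouched
        m /= np.outer(s, s)   # divide each entry by its row and column biases
    return m

demo = np.array([[0., 4., 2.],
                 [4., 0., 6.],
                 [2., 6., 0.]])
print(toy_ice(demo).sum(axis=0))  # row sums become (nearly) equal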
In [ ]:
mtx = data_hic.dataDict['hic']
In [ ]:
plt.figure(figsize=[15,15])
plt.imshow(mtx[200:500, 200:500], cmap='jet', interpolation='none')
In [ ]:
# Load compartments computed previously for the K562 dataset from Rao et al. 2014
eig = np.loadtxt('/home/jovyan/DATA/ANNOT/comp_K562_100Kb_chr1.tsv')
In [ ]:
eig
In [ ]:
from matplotlib import gridspec
In [ ]:
bgn = 0
end = 500
fig = plt.figure(figsize=(10,10))
gs = gridspec.GridSpec(2, 1, height_ratios=[20,2])
gs.update(wspace=0.0, hspace=0.0)

ax = plt.subplot(gs[0,0])
ax.matshow(mtx[bgn:end, bgn:end], cmap='jet', origin='lower', aspect='auto')
ax.set_xticks([])
ax.set_yticks([])

axl = plt.subplot(gs[1,0])
plt.plot(range(end-bgn), eig[bgn:end])
plt.xlim(0, end-bgn)
plt.ylabel('Eigenvector')
plt.xlabel('Genomic position')
ticks = range(bgn, end+1, 100)
ticklabels = ['{} Mb'.format(x * res // 1000) for x in ticks]  # bin index times 100 kb resolution
plt.xticks(ticks, ticklabels)
print('')
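As a rough, self-contained sanity check (not part of the original pipeline), we can derive a PC1-like track from our own matrix, taking the first eigenvector of the correlation matrix of the observed/expected map, and compare it with the loaded eigenvector. The sign of an eigenvector is arbitrary, so we look at the absolute correlation:
In [ ]:
sub = mtx[bgn:end, bgn:end].copy()
n = sub.shape[0]
# Observed/expected: divide each diagonal by its mean
oe = np.zeros(sub.shape)
for d in range(n):
    diag = np.diagonal(sub, d)
    m = diag.mean()
    if m > 0:
        i = np.arange(n - d)
        oe[i, i + d] = diag / m
        oe[i + d, i] = diag / m
# Drop empty bins and bins where the reference eigenvector is undefined
valid = (sub.sum(axis=0) > 0) & ~np.isnan(eig[bgn:end])
C = np.corrcoef(oe[valid][:, valid])
vals, vecs = np.linalg.eigh(C)
pc1 = vecs[:, -1]  # eigenvector with the largest eigenvalue
print(np.abs(np.corrcoef(pc1, eig[bgn:end][valid])[0, 1]))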
Nothing special is visible in the compartments so far. What if we had much better coverage by reads? Let's take a look at the dataset from Rao et al. 2014 (GEO GSE63525, HIC069):
In [ ]:
mtx_Rao = np.genfromtxt('../DATA/ANNOT/Rao_K562_chr1.csv', delimiter=',')
In [ ]:
bgn = 0
end = 500
fig = plt.figure(figsize=(10,10))
gs = gridspec.GridSpec(2, 1, height_ratios=[20,2])
gs.update(wspace=0.0, hspace=0.0)

ax = plt.subplot(gs[0,0])
ax.matshow(mtx_Rao[bgn:end, bgn:end], cmap='jet', origin='lower', aspect='auto', vmax=1000)
ax.set_xticks([])
ax.set_yticks([])

axl = plt.subplot(gs[1,0])
plt.plot(range(end-bgn), eig[bgn:end])
plt.xlim(0, end-bgn)
plt.ylabel('Eigenvector')
plt.xlabel('Genomic position')
ticks = range(bgn, end+1, 100)
ticklabels = ['{} Mb'.format(x * res // 1000) for x in ticks]
plt.xticks(ticks, ticklabels)
print('')
For TAD calling we will use the lavaburst package. The code below is based on this example.
In [ ]:
# Import Python package
import lavaburst
In [ ]:
good_bins = mtx.astype(bool).sum(axis=0) > 1  # mask rows/columns where data is missing
gam = [0.15, 0.25, 0.5, 0.75, 1.0]  # set of gamma parameters for TAD calling
segments_dict = {}
for gam_current in gam:
    print(gam_current)
    S = lavaburst.scoring.armatus_score(mtx, gamma=gam_current, binmask=good_bins)
    model = lavaburst.model.SegModel(S)
    segments = model.optimal_segmentation()  # positions of TADs for the input matrix
    segments_dict[gam_current] = segments.copy()
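Higher gamma values favour smaller domains, so the number of segments should grow with gamma. A quick check:
In [ ]:
# Number of segments found at each gamma
for gam_current in gam:
    print(gam_current, len(segments_dict[gam_current]))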
In [ ]:
A = mtx.copy()
good_bins = A.astype(bool).sum(axis=0) > 0
At = lavaburst.utils.tilt_heatmap(mtx, n_diags=100)

start_tmp = 0
end_tmp = 500

f = plt.figure(figsize=(20, 6))
ax = f.add_subplot(111)

blues = sns.cubehelix_palette(0.4, gamma=0.5, rot=-0.3, dark=0.1, light=0.9, as_cmap=True)
ax.matshow(np.log(At[start_tmp:end_tmp]), cmap=blues)
cmap = mpl.cm.get_cmap('brg')

gammas = list(segments_dict.keys())
for n, gamma in enumerate(gammas):
    segments = segments_dict[gamma]
    for a in segments[:-1]:
        if a[1] < start_tmp or a[0] > end_tmp:
            continue
        ax.plot([a[0]-start_tmp, a[0]+(a[1]-a[0])/2-start_tmp], [0, -(a[1]-a[0])], c=cmap(n/len(gammas)), alpha=0.5)
        ax.plot([a[0]+(a[1]-a[0])/2-start_tmp, a[1]-start_tmp], [-(a[1]-a[0]), 0], c=cmap(n/len(gammas)), alpha=0.5)
    # Plot the last segment separately so that each gamma gets a legend label
    a = segments[-1]
    ax.plot([a[0]-start_tmp, a[0]+(a[1]-a[0])/2-start_tmp], [0, -(a[1]-a[0])], c=cmap(n/len(gammas)), alpha=0.5, label=gamma)
    ax.plot([a[0]+(a[1]-a[0])/2-start_tmp, a[1]-start_tmp], [-(a[1]-a[0]), 0], c=cmap(n/len(gammas)), alpha=0.5)

ax.set_xlim([0, end_tmp-start_tmp])
ax.set_ylim([100, -100])
ax.legend(bbox_to_anchor=(1.1, 1.05))
ax.set_aspect(0.5)
In [ ]:
# Let's check the mean TAD size for different gamma values:
for gam_current in gam:
    segments = segments_dict[gam_current]
    tad_lens = segments[:, 1] - segments[:, 0]
    good_lens = (tad_lens >= 200/res) & (tad_lens < 100)  # keep TADs between 200 kb and 10 Mb (in bins)
    print(res * 1000 * np.mean(tad_lens[good_lens]))  # mean TAD size in bp
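Finally, to use the called TADs elsewhere (e.g. in a genome browser), we can convert the bin coordinates back to base pairs and write a BED-like file. A minimal sketch for a single gamma value (the output path is hypothetical):
In [ ]:
# Write the TADs called at gamma = 0.5 to a BED-like file (hypothetical path)
with open('../DATA/ANNOT/TADs_K562_gamma0.5.bed', 'w') as f:
    for start, stop in segments_dict[0.5]:
        f.write('chr1\t{}\t{}\n'.format(start * res * 1000, stop * res * 1000))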