Hi-C data analysis

Welcome to the Jupyter notebook dedicated to Hi-C data analysis. Here we will be working in an interactive Python environment, with a mixture of bash command-line tools.

Here is the outline of what we are going to do:

  0. Notebook basics
  1. Reads mapping
  2. Data filtering
  3. Data binning
  4. Hi-C data visualisation
  5. Iterative correction
  6. Compartments and TADs

If you have any questions, please, contact Aleksandra Galitsyna (agalitzina@gmail.com)

0. Notebook basics

If you are new to Python and the Jupyter notebook, please take a quick look through this small list of tips.

  • First of all, a Jupyter notebook is organised into cells, which may contain text, comments and code blocks of any size.

In [ ]:
# This is regular Python comment inside Jupyter "Code" cell.
# You can easily run "Hello world" in the "Code" cell (focus on the cell and press Shift+Enter):
print("Hello world!")
  • There are also other types of cells, for example, "Markdown". Double click this cell to view raw Markdown markup content.
  • You can define functions and classes, run pipelines and visualisations, and execute thousands of lines of code inside a single Jupyter cell. Usually, though, it is more convenient to write small and clean blocks of code.
  • Note that behind this interactive notebook there is a regular Python session running. Python variables therefore persist across cells within a single session: to create a variable, you have to execute the corresponding block of code, and all your variables will be lost when you restart the kernel of the notebook. For example:
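
In [ ]:
# A variable defined in this cell (the name is arbitrary, for illustration only)...
answer = 42

In [ ]:
# ...remains available in any cell executed afterwards, until the kernel is restarted
print(answer)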

  • You can pause or stop the kernel, save the notebook (.ipynb) file, and copy or insert cells via buttons in the toolbar. Please take a look at these useful buttons.

  • Also, try pressing 'Esc' and then 'h'. You will see the keyboard shortcuts help.

  • Jupyter notebook also provides "magic" commands for cells. We will use %%bash, %%capture and %matplotlib. For example, the %%bash magic makes it easier to run bash commands:


In [ ]:
%%bash 
echo "Current directory is: "; pwd
echo "List of files in the current directory is: "; ls
  • If you are not sure about a function, class or variable, add '?' to its name to get the available documentation. Here is an example for the common module numpy:

In [ ]:
# Module import under custom name
import numpy as np

In [ ]:
# You've started asking questions about it
np?

OK, it seems that now we are ready to start our Hi-C data analysis! I've placed a Go top shortcut for you in each section so that you can navigate quickly throughout the notebook.

1. Reads mapping

Go top

1.1 Input raw data

A Hi-C experiment results in paired-end sequencing data, where each read pair represents one possible contact. The analysis starts with the raw sequencing data (.fastq files).

I've downloaded raw files from Flyamer et al. 2017 (GEO ID GSE80006) and placed them in the DATA/FASTQ/ directory.

We can easily view these files with the help of bash. Forward and reverse reads, respectively:


In [ ]:
%%bash 
head -n 8 '../DATA/FASTQ/K562_B-bulk_R1.fastq'

In [ ]:
%%bash 
head -n 8 '../DATA/FASTQ/K562_B-bulk_R2.fastq'

1.2 Genome

Now we have to map these reads to the genome of interest (Homo sapiens hg19 downloaded from UCSC in this case). We are going to use only chromosome 1 to minimise computational time.

The genome is also pre-downloaded:


In [ ]:
%%bash 
ls ../GENOMES/HG19_FASTA

For Hi-C data mapping we will use hiclib. It utilises the bowtie 2 read mapping software. Bowtie 2 indexes the genome prior to read mapping in order to reduce memory usage. Usually, you have to run genome indexing yourself, but I've already done this time-consuming step; that's why the code for this step is included but commented out.


In [ ]:
#%%bash
#bowtie2-build /home/jovyan/GENOMES/HG19_FASTA/chr1.fa /home/jovyan/GENOMES/HG19_IND/hg19_chr1
#Time consuming step

In [ ]:
%%bash 
ls ../GENOMES/HG19_IND

1.3 Iterative mapping

During iterative mapping, reads are first truncated to a short length and mapped to the genome; reads that fail to map uniquely are then extended by a small step and mapped again, until the full read length is reached. First of all, we need to import the useful Python packages:


In [ ]:
import os

from hiclib import mapping
from mirnylib import h5dict, genome

Then we need to set some parameters and prepare our environment:


In [ ]:
%%bash 
which bowtie2
# Bowtie 2 path

In [ ]:
%%bash
pwd
# Current working directory path

In [ ]:
# Setting parameters and environmental variables
bowtie_path       = '/opt/conda/bin/bowtie2'

enzyme = 'DpnII'

bowtie_index_path = '/home/jovyan/GENOMES/HG19_IND/hg19_chr1'
fasta_path        = '/home/jovyan/GENOMES/HG19_FASTA/'
chrms             = ['1']

# Reading the genome
genome_db    = genome.Genome(fasta_path, readChrms=chrms)

In [ ]:
# Creating directories for further data processing

if not os.path.exists('tmp/'):
    os.mkdir('tmp/')
if not os.path.exists('../DATA/SAM/'):
    os.mkdir('../DATA/SAM/')

In [ ]:
# Set parameters for iterative mapping

min_seq_len       = 25    # reads are truncated to 25 bp for the first mapping iteration
len_step          = 5     # non-uniquely mapped reads are extended by 5 bp at each next iteration
nthreads          = 2
temp_dir          = 'tmp'
bowtie_flags      = '--very-sensitive'

infile1           = '/home/jovyan/DATA/FASTQ/K562_B-bulk_R1.fastq'
infile2           = '/home/jovyan/DATA/FASTQ/K562_B-bulk_R2.fastq'
out1              = '/home/jovyan/DATA/SAM/K562_B-bulk_R1.chr1.sam'
out2              = '/home/jovyan/DATA/SAM/K562_B-bulk_R2.chr1.sam'

In [ ]:
# Iterative mapping itself. Time consuming step!

mapping.iterative_mapping(
    bowtie_path       = bowtie_path,
    bowtie_index_path = bowtie_index_path,
    fastq_path        = infile1,
    out_sam_path      = out1,
    min_seq_len       = min_seq_len,
    len_step          = len_step,
    nthreads          = nthreads,
    temp_dir          = temp_dir, 
    bowtie_flags      = bowtie_flags)

mapping.iterative_mapping(
    bowtie_path       = bowtie_path,
    bowtie_index_path = bowtie_index_path,
    fastq_path        = infile2,
    out_sam_path      = out2,
    min_seq_len       = min_seq_len,
    len_step          = len_step,
    nthreads          = nthreads,
    temp_dir          = temp_dir, 
    bowtie_flags      = bowtie_flags)

Let's take a look at .sam files that were created during iterative mapping:


In [ ]:
%%bash
ls /home/jovyan/DATA/SAM/

In [ ]:
%%bash
head -n 10 /home/jovyan/DATA/SAM/K562_B-bulk_R1.chr1.sam.25

1.4 Making sense of mapping output

For each read length and orientation, we have a separate file. Now we need to merge them into a single dataset (.hdf5 file):


In [ ]:
# Create the directory for output
if not os.path.exists('../DATA/HDF5/'):
    os.mkdir('../DATA/HDF5/')

# Define file name for output
out = '/home/jovyan/DATA/HDF5/K562_B-bulk.fragments.hdf5'

In [ ]:
# Open output file
mapped_reads = h5dict.h5dict(out)

# Parse mapping data and write to output file
mapping.parse_sam(
    sam_basename1  =  out1,
    sam_basename2  =  out2,
    out_dict       =  mapped_reads,
    genome_db      =  genome_db,
    enzyme_name    =  enzyme, 
    save_seqs      =  False,
    keep_ids       =  False)

Let's take a look at the created file:


In [ ]:
%%bash
ls /home/jovyan/DATA/HDF5/

In [ ]:
import h5py

# Reading the file
a = h5py.File('/home/jovyan/DATA/HDF5/K562_B-bulk.fragments.hdf5', 'r')

In [ ]:
# "a" variable has dictionary-like structure, we can view its keys, for example:
list( a.keys() )

In [ ]:
# Mapping positions for forward reads are stored under 'cuts1' key:
a['cuts1'][:]

2. Data filtering

Go top

The raw Hi-C data have now been mapped and parsed; the next step is to filter out possible methodological artefacts:


In [ ]:
from hiclib import fragmentHiC

In [ ]:
inp = '/home/jovyan/DATA/HDF5/K562_B-bulk.fragments.hdf5'
out = '/home/jovyan/DATA/HDF5/K562_B-bulk.fragments_filtered.hdf5'

In [ ]:
# Create output file
fragments = fragmentHiC.HiCdataset(
    filename             = out,
    genome               = genome_db,
    maximumMoleculeLength= 500,
    mode                 = 'w')

# Parse input data
fragments.parseInputData(
    dictLike=inp)

# Filtering
fragments.filterRsiteStart(offset=5) # reads map too close to restriction site
fragments.filterDuplicates() # remove PCR duplicates
fragments.filterLarge() # remove too large restriction fragments
fragments.filterExtreme(cutH=0.005, cutL=0) # remove fragments with too high and low counts

In [ ]:
# Some hidden filters were also applied; we can check them all:
fragments.printMetadata()

Let's make a nicer visualisation of the filtering statistics:


In [ ]:
import pandas as pd

df_stat = pd.DataFrame(list(fragments.metadata.items()), columns=['Feature', 'Count'])
df_stat

In [ ]:
# Express each count as a percentage of the total (the reference row 2 of the table above)
df_stat['Ratio of total'] = 100*df_stat['Count']/df_stat.loc[2,'Count']
df_stat

3. Data binning

Go top

The previous analysis involved interactions of restriction fragments; now we would like to work with interactions of genomic bins.


In [ ]:
# Define file name for binned data. Note "{}" prepared for string formatting
out_bin = '/home/jovyan/DATA/HDF5/K562_B-bulk.binned_{}.hdf5'

In [ ]:
res_kb = [100, 20] # Several resolutions in Kb

for res in res_kb:
    print(res)
    
    outmap = out_bin.format(str(res)+'kb') # String formatting

    fragments.saveHeatmap(outmap, res*1000) # Save heatmap

In [ ]:
del fragments # delete unwanted object

4. Hi-C data visualisation

Go top

Let's take a look at the resulting heat maps.


In [ ]:
# Importing visualisation modules
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('ticks')

In [ ]:
%matplotlib inline

In [ ]:
from hiclib.binnedData import binnedDataAnalysis

In [ ]:
res = 100 # Resolution in Kb

In [ ]:
# prepare to read the data
data_hic = binnedDataAnalysis(resolution=res*1000, genome=genome_db)

In [ ]:
# read the data
data_hic.simpleLoad('/home/jovyan/DATA/HDF5/K562_B-bulk.binned_{}.hdf5'.format(str(res)+'kb'),'hic')
mtx = data_hic.dataDict['hic']

In [ ]:
# show heatmap
plt.figure(figsize=[15,15])
plt.imshow(mtx[0:200, 0:200], cmap='jet', interpolation='none')

5. Iterative correction

Go top

The next typical step is data correction for unequal amplification and accessibility of genomic regions. We will use iterative correction.
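
Before applying hiclib's correction to our data, here is a toy illustration of the idea (a minimal sketch, not hiclib's actual implementation): we repeatedly divide each row and column of a symmetric contact matrix by its total coverage, so that multiplicative per-bin biases cancel out and every bin ends up with roughly the same total number of contacts.

In [ ]:
# Minimal ICE-like balancing sketch on a toy symmetric matrix (for intuition only;
# the real data below are corrected with hiclib's iterativeCorrectWithoutSS)
toy = np.array([[0., 4., 2.],
                [4., 0., 8.],
                [2., 8., 0.]])

for _ in range(50):
    coverage = toy.sum(axis=0)                 # total contacts per bin
    coverage[coverage == 0] = 1                # skip empty bins
    coverage = coverage / coverage.mean()      # keep the overall scale of the matrix
    toy = toy / coverage[None, :] / coverage[:, None]

print(toy.sum(axis=0))                         # row/column sums are now (almost) equal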


In [ ]:
# Additional data filtering 
data_hic.removeDiagonal()
data_hic.removePoorRegions()
data_hic.removeZeros()
data_hic.iterativeCorrectWithoutSS(force=True)
data_hic.restoreZeros()

In [ ]:
mtx = data_hic.dataDict['hic']

In [ ]:
plt.figure(figsize=[15,15])
plt.imshow(mtx[200:500, 200:500], cmap='jet', interpolation='none')

6. Compartments and TADs

Go top

6.1 Comparison with compartments

Compartments can usually be found in whole-genome datasets, but we have only chromosome 1. Still, we can try to find some visual signs of compartments.
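
As a side note, compartment profiles are commonly derived from a corrected map itself: compute the observed/expected matrix, take its correlation matrix, and look at the leading eigenvector; the sign of its values separates the two compartments. Below is a minimal sketch of this idea applied to our matrix (for illustration only; the eigenvector track loaded in the next cell was precomputed separately, and the names obs, oe and ev1 are arbitrary):

In [ ]:
# Observed/expected normalisation: divide each diagonal by its mean contact frequency
obs = mtx[0:500, 0:500].copy()
n = obs.shape[0]
expected = np.ones_like(obs)
for d in range(n):
    diag_mean = np.diagonal(obs, offset=d).mean()
    if diag_mean > 0:
        idx = np.arange(n - d)
        expected[idx, idx + d] = diag_mean
        expected[idx + d, idx] = diag_mean
oe = obs / expected

# Correlation matrix and its leading eigenvector
corr = np.corrcoef(oe)
corr[~np.isfinite(corr)] = 0                # rows with no signal give NaNs
eigvals, eigvecs = np.linalg.eigh(corr)     # eigh: the matrix is symmetric
ev1 = eigvecs[:, -1]                        # eigenvector with the largest eigenvalue

plt.figure(figsize=(10, 2))
plt.plot(ev1)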


In [ ]:
# Load compartments computed previously based on K562 dataset from Rao et al. 2014
eig =  np.loadtxt('/home/jovyan/DATA/ANNOT/comp_K562_100Kb_chr1.tsv')

In [ ]:
eig

In [ ]:
from matplotlib import gridspec

In [ ]:
bgn = 0
end = 500

fig = plt.figure(figsize=(10,10))

gs = gridspec.GridSpec(2, 1, height_ratios=[20,2]) 
gs.update(wspace=0.0, hspace=0.0)

ax = plt.subplot(gs[0,0])

ax.matshow(mtx[bgn:end, bgn:end], cmap='jet', origin='lower', aspect='auto')
ax.set_xticks([])
ax.set_yticks([])

axl = plt.subplot(gs[1,0])
plt.plot(range(end-bgn), eig[bgn:end] )
plt.xlim(0, end-bgn)
plt.ylabel('Eigenvector value')

ticks = range(bgn, end+1, 100)
ticklabels = ['{} Mb'.format(x*res//1000) for x in ticks] # bin index -> genomic position (res is in Kb)
plt.xticks(ticks, ticklabels)

print('')

It seems there is nothing special about compartments here. What if we had much better read coverage? Let's take a look at the dataset from Rao et al. 2014 (GEO GSE63525, sample HIC069):


In [ ]:
mtx_Rao = np.genfromtxt('../DATA/ANNOT/Rao_K562_chr1.csv', delimiter=',')

In [ ]:
bgn = 0
end = 500

fig = plt.figure(figsize=(10,10))

gs = gridspec.GridSpec(2, 1, height_ratios=[20,2])
gs.update(wspace=0.0, hspace=0.0)

ax = plt.subplot(gs[0,0])

ax.matshow(mtx_Rao[bgn:end, bgn:end], cmap='jet', origin='lower', aspect='auto', vmax=1000)
ax.set_xticks([])
ax.set_yticks([])

axl = plt.subplot(gs[1,0])
plt.plot(range(end-bgn), eig[bgn:end] )
plt.xlim(0, end-bgn)
plt.ylabel('Eigenvector value')

ticks = range(bgn, end+1, 100)
ticklabels = ['{} Mb'.format(x*res//1000) for x in ticks] # bin index -> genomic position (res is in Kb)
plt.xticks(ticks, ticklabels)

print('')

6.2 Topologically associating domains (TADs)

For TAD calling we will use the lavaburst package. The code below is based on this example. The gamma parameter of the scoring function controls the characteristic size of the called TADs (larger gamma values favour smaller domains), so we will try several values.


In [ ]:
# Import Python package 
import lavaburst

In [ ]:
good_bins = mtx.astype(bool).sum(axis=0) > 1 # We have to mask rows/cols if data is missing

gam=[0.15, 0.25, 0.5, 0.75, 1.0] # set of parameters gamma for TADs calling

segments_dict = {}

for gam_current in gam:
    print(gam_current)
    
    S = lavaburst.scoring.armatus_score(mtx, gamma=gam_current, binmask=good_bins)
    model = lavaburst.model.SegModel(S)
    segments = model.optimal_segmentation() # Positions of TADs for input matrix

    segments_dict[gam_current] = segments.copy()
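
Each segmentation returned above is an N x 2 array of bin coordinates: column 0 holds the start bin and column 1 the end bin of each TAD. A quick look at the first few TADs called with one of the gamma values (assuming the cell above has been run):

In [ ]:
# First five TADs (start bin, end bin) called with gamma = 0.5
segments_dict[0.5][:5]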

In [ ]:
A = mtx.copy()

good_bins = A.astype(bool).sum(axis=0) > 0

At = lavaburst.utils.tilt_heatmap(mtx, n_diags=100)

start_tmp = 0
end_tmp = 500

f = plt.figure(figsize=(20, 6))

ax = f.add_subplot(111)
blues = sns.cubehelix_palette(0.4, gamma=0.5, rot=-0.3, dark=0.1, light=0.9, as_cmap=True)
ax.matshow(np.log(At[start_tmp: end_tmp]), cmap=blues)

cmap = mpl.cm.get_cmap('brg')

gammas = segments_dict.keys()
for n, gamma in enumerate(gammas):

    segments = segments_dict[gamma]

    for a in segments[:-1]:
        if a[1]<start_tmp or a[0]>end_tmp:
            continue
        ax.plot([a[0]-start_tmp, a[0]+(a[1]-a[0])/2-start_tmp], [0, -(a[1]-a[0])], c=cmap(n/len(gammas)), alpha=0.5)
        ax.plot([a[0]+(a[1]-a[0])/2-start_tmp, a[1]-start_tmp], [-(a[1]-a[0]), 0], c=cmap(n/len(gammas)), alpha=0.5)

    a = segments[-1]
    ax.plot([a[0]-start_tmp, a[0]+(a[1]-a[0])/2-start_tmp], [0, -(a[1]-a[0])], c=cmap(n/len(gammas)), alpha=0.5, label=gamma)
    ax.plot([a[0]+(a[1]-a[0])/2-start_tmp, a[1]-start_tmp], [-(a[1]-a[0]), 0], c=cmap(n/len(gammas)), alpha=0.5)
    
ax.set_xlim([0,end_tmp-start_tmp])
ax.set_ylim([100,-100])
        
ax.legend(bbox_to_anchor=(1.1, 1.05))
ax.set_aspect(0.5)

In [ ]:
# Let's check the mean TAD size (in bp) for each value of gamma:

for gam_current in gam:
    segments = segments_dict[gam_current]
    tad_lens = segments[:,1]-segments[:,0]          # TAD lengths in bins
    good_lens = (tad_lens>=200/res)&(tad_lens<100)  # keep TADs of at least 200 Kb and below 100 bins
    print(gam_current, res*1000*np.mean(tad_lens[good_lens]))