ChIP-seq I - Prelab

Why use ChIP-seq?
What is ChIP-seq?
How ChIP-seq libraries are prepared
How ChIP-seq data is analyzed
Questions

1. Why use ChIP-seq?

Before we introduce the technology of ChIP-seq, we should have an idea of why it is useful. As we all know, the process of transcribing DNA into RNA and protein is immensely complex and highly regulated by, among other things, proteins known as transcription factors. These transcription factors bind to chromatin at specific regions of genomic DNA while performing their regulatory functions, and so knowing where these transcription factors bind throughout the genome allows us to better understand their mechanisms of action and which genes they are regulating. ChIP-seq is a technique that can be used for exactly this purpose.

DNA can also be organized into nucleosomes, which contain four core histone proteins and DNA. These histone proteins are chemically "marked" during certain processes such as transcription, and so these histone modifications or histone marks can be used to infer the functional activity of a certain region of DNA. For example, if the histone H3 is trimethylated at its lysine at residue number 27 (H3K27me3), which is a known mark for repression of transcription, we can assume that the genomic DNA at that region is not being transcribed; in fact, its transcription is being actively repressed. Therefore, these marks provide valuable information about the functional state of DNA across the genome, and so being able to measure the levels of these marks genome-wide is very valuable. Again, ChIP-seq can be used for this purpose.

2. What is ChIP-seq?

By now, we have become familiar with some "flavors" of sequencing data, the most straightforward being DNA sequencing. However, the basic idea of identifying the nucleotide composition of short pieces of DNA in a massively high-throughput fashion has also been applied to many biological questions beyond understanding the genomic DNA in a sample. These different applications typically involve varying the steps involved in "library preparation", the process of taking a sample (i.e. some cells from cell culture or a tissue sample from a patient) and ultimately extracting some set of DNA fragments to be measured using sequencing technology. For example, RNA sequencing involves reverse transcribing RNA molecules into cDNA, which can then be sequenced using standard DNA sequencing. Chromatin immunoprecipitation followed by sequencing (ChIP-seq) follows the same idea. The goal of the protocol is to isolate DNA fragments that are bound by a specific transcription factor or marked by a specific histone modification, and the fundamental idea is to use antibodies against the transcription factor or histone modification to retrieve only the relevant DNA, then sequence these DNA fragments.

3. How ChIP-seq libraries are prepared

We won't go into the nitty gritty of library preparation here, but we can follow this diagram to understand the general steps:

(Diagram from Wikipedia: https://en.wikipedia.org/wiki/ChIP-sequencing)

First, the input samples are typically treated with formaldehyde, which cross-links DNA with any proteins that are bound to that site (such as transcription factors and histone proteins in the chromatin complex).

Next, the cells are lysed and the crosslinked chromatin is purified and sheared by sonication or micrococcal nuclease so that small (typically 200-600 bp) fragments of DNA remain, including the proteins bound to that DNA.

An antibody is then applied to immunoprecipitate (IP) any DNA:protein/chromatin fragments containing the transcription factor or histone modification of interest.

The crosslinking is then reversed (typically by heating, which reverses formaldehyde fixation) to release the DNA, which is then purified and prepared for sequencing.

4. How ChIP-seq data are analyzed

Once the DNA fragments bound by a specific protein or marked by a specific histone modification have been sequenced, the next challenge is a computational one: how do you take these reads and use them to infer which DNA sites were "truly" bound by the protein or marked by the histone modification?

There are some important principles that should be kept in mind in order to understand ChIP-seq analysis:

Principle 1: The DNA binding event is binary.

At the most basic level of an interaction between a single protein P, a DNA location L, and a cell X, there are only two conditions: either protein P binds to location L in cell X, or it does not. In an ideal world, we would have some technology that could tell, unambiguously, whether this binding event occurs. However, there are several practical limitations that prevent us from measuring this; for example, the scenario I describe only refers to a single cell, but we do not currently have a method sensitive enough to be able to detect binding events from the material of a single cell. This means that we have to measure on pooled samples and compute the relative enrichment of binding events across all the cells in our pool, giving us an estimate of the proportion of these cells have protein P bound at location L.

Principle 2: read pileups reflect binding events. (Figure from Park, Nature Reviews Genetics, 2009)

A binding event of protein P at location L in cell X (or in most or all of the cells in your pooled sample), after the ChIP protocol is applied, will yield a set of DNA fragments overlapping and adjacent to L. In the ideal world, we would know the full sequences of all these fragments, giving us a so-called "read pileup" exactly at the location of protein binding. However, the actual sequencing protocol has some limitations that prevent this from occuring. Namely, because there is a limit to the length of DNA fragments that are sequencable using current technology, each read will only represent part of the full binding event and will be essentially randomly placed around the true binding site, so we have to combine information across reads to detect peaks of reads.

Furthermore, because sequencing is done from the 5' end, reads from each strand of DNA will form two peaks that flank the 'true' binding event (see the red and blue distributions in the figure for an illustration).

For single-end reads, an estimate of the DNA fragment size (i.e. as determined by BioAnalyzer) can be used to combine information across these two peaks to estimate the distribution of all the DNA fragments, thereby giving an estimate of the true, non-strand-specific binding site.

Principle 3: using control data yields sample enrichment.

Another issue is that, even if we detect a peak signal in our ChIP-seq data, there are several possible explanations for where this comes from: it could be due to protein P actually binding to location L, but it could also be due to location L being preferentially accessible to proteins (so that it gets crosslinked with protein P without a true binding event), or because your antibody is non-specific and pulled down location L despite not having a binding event, among other possibilities. Thus, the steps involved in preparing a ChIP-seq library introduce several sources of potential artifacts, and these scenarios highlight the usefulness of having control data in a ChIP-seq experiment.

The idea of using a control is similar to a control you'd see in any well-designed study: calculating the enrichment of the ChIP-seq signal in an experimental condition relative to the signal at the same location in a control experiment gives us a more robust signal and allows us to more confidently state that the signal is true and not just due to noise or artifacts from the library prepration. There are three commonly used types of controls for ChIP-seq experiments:

Input DNA, where some of the DNA is removed prior to antibody-based immunoprecipitation, so that all chromatin-associated DNA is assayed. This represents all the DNA that could "possibly" be sequenced in this experiment, and calculating enrichment relative to this also corrects for biases due to differing relative solubility of different DNA regions, uneven shearing of DNA, and amplification bias in the sequencing library preparation.
Mock immunoprecipitation, where IP is performed without antibodies, so that DNA is pseudo-randomly selected and biases introduced during the IP procedure are corrected for.
Nonspecific antibody IP, where IP is performed with a non-specific antibody such as the one for immunoglobulin, which is used to show that your antibody "works" and selects DNA:chromatin fragments more specifically than the nonspecific antibody. In the specific context of assaying histone modifications, a total histone antibody is also useful to pull out all the histone-associated regions, which should include the specific histone modification of interest, acting as a positive control. The ratio of the specific histone modification to the total histone signal can also give an estimate of the fraction of nucleosomes containing that modification, averaged over all the cells in the sample.

Principle 4: there are many ways to quantify ChIP-seq enrichment.

Given an experiment with good controls and appropriate identification of peaks reflecting binding events, the next important step is to define some way to identify regions that are statistically enriched (or, much more rarely, depleted) in the experimental condition relative to the controls. There are, as usual, several important aspects of the data that must be considered to ensure that the analysis is as correct as possible.

One very simple but important question is exactly how to quantify the 'strength' of a peak. In a very broad sense, we want to identify regions where the depth of the peak is significantly more in the experimental condition than in the control (or significantly less to find depleted regions). However, this idea of relative depth depends on two things: the raw number of reads in each condition as well as the fold ratio of reads between the two conditions. To illustrate this, observe this figure from (Park, Nature Reviews Genetics, 2009). In the left example, the ratio between the two conditions is relatively low (1.5-fold change) and the raw read counts are also low, so we consider this slight enrichment to be not significant. On the right, on the other hand, we have two examples of scenarios which we are more likely to want to call significant: one is having a high fold ratio despite relatively low read counts (in other words, a 4 fold enrichment should be significant even if it is only on a small scale), and the other is having a low fold ratio but having high read counts (a 1.5-fold enrichment given 100-150 reads at that region is much more likely to represent a true binding event, although perhaps a weak one). Of course, the scenario that isn't illustrated here is the most desirable, where the region displays a high fold ratio as well as a high number of raw reads. Also note that, although it is the most widely used, a ratio of reads between the two conditions isn't the only way to quantify enrichment; one can also use the absolute difference in read counts (Experiment - Control) or the proportion of read counts attributed to the experimental condition (Experiment / (Experiment + Control)).

Another aspect to consider is the fact that ChIP-seq can be performed using any antibody, so it can be applied to anything from transcription factors to histone modifications, as described above. Consistent with the differing biological functions of different transcription factors as well as the fact that histone modifications can mark specific regions as well as larger domains, this means that ChIP-seq performed using different antibodies will yield differently shaped peaks. These peak shapes are typically split into sharp, broad, and mixed, where sharp reflects specific binding events or histone modifications at regulatory elements, broad reflects longer domains such as repressed regions, and mixed is found when analyzing factors or marks that can generate both types of peaks, such as RNA Polymerase II. See the figure from (Sims et al., Nature Reviews Genetics, 2014) for an illustration of the different types of peaks. It should be clear that different statistical models are required for identifying peaks of each type, as a broad peak, which might have a somewhat lower number of reads at any one position but be spread across a larger region, will not look the same as a sharp peak, which will have more reads in a smaller region.

There are clearly many things to take into account when performing ChIP-seq analysis, and there are a plethora of software tools that have been developed for ChIP-seq analysis that each deal differently with these issues. For this class, we are going to focus on a single tool, which is currently the most commonly used, called Model-based analysis of ChIP-seq data (MACS), but if you use ChIP-seq for your own research, we would encourage you to compare different tools to figure out which is the most appropriate for your project.

As you probably noticed, I used a lot of figures from a nice ChIP-seq review; if you want a more in-depth discussion of some of these issues, I highly recommend this paper, which is now slightly out of date but still gives a very good background on the technology and its applications: Park, P.J., 2009. ChIP–seq: advantages and challenges of a maturing technology. Nature Reviews Genetics, 10(10), pp.669-680. (http://www.nature.com/nrg/journal/v10/n10/full/nrg2641.html)

5. Questions

1. In ChIP-seq library preparation, what step filters DNA fragments bound to our protein of interest from DNA fragments bound to any protein?