Before we introduce the technology of ChIP-seq, we should have an idea of why it is useful. As we all know, gene expression (transcribing DNA into RNA and translating that RNA into protein) is immensely complex and highly regulated by, among other things, proteins known as transcription factors. These transcription factors bind to chromatin at specific regions of genomic DNA while performing their regulatory functions, so knowing where they bind throughout the genome helps us understand their mechanisms of action and which genes they regulate. ChIP-seq is a technique that can be used for exactly this purpose.
DNA is also organized into nucleosomes, each of which consists of DNA wrapped around an octamer of the four core histone proteins (two copies each of H2A, H2B, H3, and H4). These histone proteins are chemically "marked" during certain processes such as transcription, and so these histone modifications, or histone marks, can be used to infer the functional activity of a region of DNA. For example, if histone H3 is trimethylated at lysine 27 (H3K27me3), a known mark of transcriptional repression, we can infer that the genomic DNA in that region is probably not being transcribed; in fact, its transcription is being actively repressed. These marks therefore provide valuable information about the functional state of DNA across the genome, and being able to measure their levels genome-wide is very useful. Again, ChIP-seq can be used for this purpose.
By now, we have become familiar with some "flavors" of sequencing data, the most straightforward being DNA sequencing. However, the basic idea of identifying the nucleotide composition of short pieces of DNA in a massively high-throughput fashion has also been applied to many biological questions beyond understanding the genomic DNA in a sample. These different applications typically vary the steps involved in "library preparation", the process of taking a sample (e.g. some cells from cell culture or a tissue sample from a patient) and ultimately extracting some set of DNA fragments to be measured using sequencing technology. For example, RNA sequencing involves reverse transcribing RNA molecules into cDNA, which can then be sequenced using standard DNA sequencing. Chromatin immunoprecipitation followed by sequencing (ChIP-seq) follows the same idea. The goal of the protocol is to isolate DNA fragments that are bound by a specific transcription factor or marked by a specific histone modification. The fundamental idea is to use antibodies against the transcription factor or histone modification to retrieve only the relevant DNA, then sequence these DNA fragments.
We won't go into the nitty gritty of library preparation here, but we can follow this diagram to understand the general steps:
Once the DNA fragments bound by a specific protein or marked by a specific histone modification have been sequenced, the next challenge is a computational one: how do you take these reads and use them to infer which DNA sites were "truly" bound by the protein or marked by the histone modification?
There are some important principles that should be kept in mind in order to understand ChIP-seq analysis:
At the most basic level of an interaction between a single protein P, a DNA location L, and a cell X, there are only two conditions: either protein P binds to location L in cell X, or it does not. In an ideal world, we would have some technology that could tell, unambiguously, whether this binding event occurs. However, there are several practical limitations that prevent us from measuring this; for example, the scenario I describe refers to a single cell, but standard ChIP-seq protocols are not sensitive enough to detect binding events from the material of a single cell. This means that we have to measure pooled samples and compute the relative enrichment of binding events across all the cells in our pool, giving us an estimate of the proportion of these cells that have protein P bound at location L.
A binding event of protein P at location L in cell X (or in most or all of the cells in your pooled sample), after the ChIP protocol is applied, will yield a set of DNA fragments overlapping and adjacent to L. In an ideal world, we would know the full sequences of all these fragments, giving us a so-called "read pileup" exactly at the location of protein binding. However, the actual sequencing protocol has some limitations that prevent this from occurring. Namely, because there is a limit to the length of DNA that can be sequenced using current technology, each read represents only part of a fragment and is placed essentially at random around the true binding site, so we have to combine information across reads to detect peaks of reads.
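To make the idea of a read pileup concrete, here is a minimal sketch that counts how many reads cover each base in a small region and reports the position with the highest coverage. The read coordinates are made up for illustration, and the function name `pileup` is hypothetical; real pipelines work from aligned BAM files with dedicated tools rather than Python lists.

```python
def pileup(reads, region_start, region_end):
    """Count the reads covering each base in [region_start, region_end).

    `reads` is a list of (start, end) half-open intervals on one chromosome.
    """
    coverage = [0] * (region_end - region_start)
    for start, end in reads:
        # Clip each read to the region before counting it
        lo = max(start, region_start)
        hi = min(end, region_end)
        for pos in range(lo, hi):
            coverage[pos - region_start] += 1
    return coverage


# Hypothetical 36 bp reads scattered around a binding site
reads = [(100, 136), (110, 146), (120, 156), (125, 161), (150, 186)]
coverage = pileup(reads, 100, 190)

# First position with the maximum coverage, in genomic coordinates
summit = coverage.index(max(coverage)) + 100
print(max(coverage), summit)
```

No single read marks the binding site on its own; the peak only appears once the reads are stacked together.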
Furthermore, because sequencing is done from the 5' end, reads from each strand of DNA will form two peaks that flank the 'true' binding event (see the red and blue distributions in the figure for an illustration).
For single-end reads, an estimate of the DNA fragment size (e.g. as determined by BioAnalyzer) can be used to combine information across these two peaks to estimate the distribution of all the DNA fragments, thereby giving an estimate of the true, non-strand-specific binding site.
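One simple way to combine the two strand-specific peaks is to shift each read's 5' end toward the middle of its fragment by half the estimated fragment size: forward-strand reads move downstream and reverse-strand reads move upstream. The sketch below assumes a fragment size of 200 bp and uses invented read positions; the function name `shift_to_fragment_centers` is hypothetical, and this is an illustration of the idea rather than the exact procedure used by any particular tool.

```python
from statistics import mean


def shift_to_fragment_centers(five_prime_ends, fragment_size):
    """Shift each read's 5' end by half the fragment size toward the fragment center.

    `five_prime_ends` is a list of (position, strand) pairs, with strand "+" or "-".
    """
    shift = fragment_size // 2
    centers = []
    for pos, strand in five_prime_ends:
        if strand == "+":
            centers.append(pos + shift)  # forward reads sit upstream of the site
        else:
            centers.append(pos - shift)  # reverse reads sit downstream of the site
    return centers


# Hypothetical 5' ends: forward reads cluster near 900, reverse reads near 1100
reads = [(900, "+"), (910, "+"), (890, "+"), (1100, "-"), (1090, "-"), (1110, "-")]
centers = shift_to_fragment_centers(reads, fragment_size=200)

# After shifting, both strands pile up around the same location
binding_site = mean(centers)
print(centers, binding_site)
```

After the shift, the forward and reverse reads agree on a single location, which serves as the estimate of the binding site.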
Another issue is that, even if we detect a peak signal in our ChIP-seq data, there are several possible explanations for where this comes from: it could be due to protein P actually binding to location L, but it could also be due to location L being preferentially accessible to proteins (so that it gets crosslinked with protein P without a true binding event), or because your antibody is non-specific and pulled down location L despite not having a binding event, among other possibilities. Thus, the steps involved in preparing a ChIP-seq library introduce several sources of potential artifacts, and these scenarios highlight the usefulness of having control data in a ChIP-seq experiment.
The idea of using a control is similar to a control you'd see in any well-designed study: calculating the enrichment of the ChIP-seq signal in an experimental condition relative to the signal at the same location in a control experiment gives us a more robust signal and allows us to more confidently state that the signal is true and not just due to noise or artifacts from the library preparation. There are three commonly used types of controls for ChIP-seq experiments:
Principle 4: there are many ways to quantify ChIP-seq enrichment.
Given an experiment with good controls and appropriate identification of peaks reflecting binding events, the next important step is to define some way to identify regions that are statistically enriched (or, much more rarely, depleted) in the experimental condition relative to the controls. There are, as usual, several important aspects of the data that must be considered to ensure that the analysis is as correct as possible.
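As a rough illustration of what "statistically enriched relative to the controls" can mean, the sketch below scales the control read count in a region to the sequencing depth of the ChIP sample, computes a fold enrichment, and asks how surprising the ChIP count would be under a Poisson model whose rate is the scaled control count. The counts, library sizes, pseudocount, and function names are all assumptions made up for this example. Tools such as MACS also use a Poisson-based model, but they estimate the background rate much more carefully than this.

```python
import math


def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), summing the tail terms directly."""
    if k <= 0:
        return 1.0
    # Probability of exactly k events, computed in log space for stability
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    # Keep adding terms until they are past the mode and negligibly small
    while term > 0.0 and (i < lam or term > total * 1e-16):
        total += term
        i += 1
        term *= lam / i
    return min(total, 1.0)


def enrichment(chip_count, control_count, chip_total, control_total, pseudocount=1.0):
    """Fold enrichment and Poisson p-value of a ChIP count against a scaled control."""
    scale = chip_total / control_total  # correct for different sequencing depths
    expected = (control_count + pseudocount) * scale  # pseudocount avoids a zero rate
    fold = chip_count / expected
    p_value = poisson_upper_tail(chip_count, expected)
    return fold, p_value


# Hypothetical region: 50 ChIP reads vs 9 control reads,
# with the ChIP library sequenced twice as deeply as the control
fold, p_value = enrichment(50, 9, chip_total=2_000_000, control_total=1_000_000)
print(fold, p_value)
```

In a genome-wide analysis this test would be repeated over many candidate regions, so the resulting p-values would also need to be corrected for multiple testing.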
There are clearly many things to take into account when performing ChIP-seq analysis, and a plethora of software tools have been developed that each deal differently with these issues. For this class, we are going to focus on a single tool, which is currently the most commonly used, called Model-based Analysis of ChIP-Seq (MACS), but if you use ChIP-seq for your own research, we would encourage you to compare different tools to figure out which is the most appropriate for your project.
As you probably noticed, I used a lot of figures from a nice ChIP-seq review; if you want a more in-depth discussion of some of these issues, I highly recommend this paper, which is now slightly out of date but still gives a very good background on the technology and its applications: Park, P.J., 2009. ChIP–seq: advantages and challenges of a maturing technology. Nature Reviews Genetics, 10(10), pp.669-680. (http://www.nature.com/nrg/journal/v10/n10/full/nrg2641.html)
2. Describe why ChIP-seq reads pile up in two distinct peaks at a binding site when mapped to the reference genome.