ChIP-seq II - In Class

Experiment Description
.bam file analysis
bedGraph file generation
Running peak calling
bedtools intersect usage

1. Experiment Description

In this class, we are going to apply the ChIP-seq analysis approach we learned about in the first ChIP-seq module to a real dataset. We are going to analyze some p53 ChIP-seq data from a Genome Research paper from 2014, Sammons et al., which comes from Shelley Berger's group here at Penn: http://genome.cshlp.org/content/early/2014/11/12/gr.181883.114.abstract. This experiment consists of ChIP-seq data from the tumor suppressor transcription factor p53 in basal (uninduced) and activated (induced) conditions in IMR90 primary human fibroblasts. Activation is induced using Nutlin-3, a small molecule inhibitor of the interaction between MDM2 and p53, while the basal uninduced condition is just treated with DMSO as a control. We won't be recreating their whole analysis, but we will go through some of the steps of the ChIP-seq analysis.

2. .bam file analysis

In the interest of not wasting your class time waiting for alignment to finish, we have mapped a subsampled set of reads (1 million) from each of four conditions: input DNA treated with DMSO (Input_DMSO.sub.sorted.bam), input DNA treated with Nutlin (Input_Nutlin.sub.sorted.bam), p53 ChIP treated with DMSO (p53_DMSO.sub.sorted.bam), and p53 ChIP treated with Nutlin (p53_Nutlin.sub.sorted.bam). We mapped these files using the same bowtie2 parameters as in the previous module. Now we can look at some characteristics of these bam files.

First, provide the command to move into the 09_Data_ChIP-Seq-II directory which contains these bam files.

Then, let's use the samtools flagstat commmand to get some information about each of these files, including the percentage of the reads that mapped. Give the 4 commands to run samtools flagstat on each bam file:

$ $ $ $

What was the range of the percentage of input reads that mapped for each of these four files?

3. bedGraph file generation

Using the command from ChIP-seq module I, give and run the commands to generate four bedGraph files from these bam files, and call them Input_DMSO.sub.bedgraph, Input_Nutlin.sub.bedgraph, p53_DMSO.sub.bedgraph, and p53_Nutlin.sub.bedgraph. We are going to slightly modify this command; since we are going to upload these tracks to the UCSC Genome Browser, we want to add some track attributes (https://genome.ucsc.edu/goldenpath/help/customTrack.html#TRACK) to these files so that we can tell them apart. This can be done by using the -trackopts argument to the bedtools genomecov call. For each file, just add this argument:

-trackopts 'name="Input_DMSO" type=bedGraph'

Make sure to replace "Input_DMSO" with the relevant name for each file, and if you want to add any more attributes you can either change this argument to genomecov or just go in and edit the file yourself after it's generated. Now give the four commands to generate these annotated bedGraph files:

$ $ $ $

Give the command to find the number of lines in each of these files:

How many lines are in each? Does this correspond to the number of reads in each input file? Why or why not?

Give the commands to pull out the chromosome column, sort it, and count how many intervals per chromosome there are for each file:

What do you make of all the extra chromosomes in these results?

Try adding this grep call after your call to 'uniq' above:

grep "chr[0-9XY]*$"

This will pull out only the canonical chromosomes. The explanation is that it looks for chromosome names that contain the numbers between 0 and 9 or the letters 'X' or 'Y' any number of times followed immediately by the end of the line. This avoids all the chromosome names like "chr1_gl000192_random" because the underscores do not match.

4. Running peak calling

Next, we are going to run MACS2 on these files to identify the peaks in each condition. Following the command from module I, give and run the MACS2 commands to perform peak calling on the .bam files from each condition, using the prefixes p53-input.DMSO and p53-input.Nutlin. Remember that you need both the p53 and Input files for each peak call.

$ $

Now we have bed files for each condition, so give the command to see how many peaks there are in each condition:

5. bedtools intersect usage

One important analysis that comes up often in bioinformatics and especially in peak calling applications is to take two .bed files containing genomic intervals and finding how many of these intervals are present in both files. To accomplish this, we will use another tool from the bedtools suite, called bedtools intersect or intersectBed (http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html). Like all the bedtools tools, there are many flags for this method, but here we will just use a very simple approach, where we will report any shared intervals between the two files. Try running this command to get the intersection between the published peaks in the two conditions:

$ bedtools intersect -a GSM1418969_p53_DMSO_Peaks_hg19_FDR1.bed -b GSM1418970_p53_Nutlin_Peaks_hg19_FDR1.bed

You can pipe this to wc -l to count how many peaks were overlapping. Next, give the command to find out how many lines were in each of these files:

What proportion of the published peaks from the Nutlin condition were found in the DMSO condition?

Homework problems

Question 1. Upload all four bedgraph files (Input_DMSO.sub.bedgraph, Input_Nutlin.sub.bedgraph, p53_DMSO.sub.bedgraph, and p53_Nutlin.sub.bedgraph) to UCSC Genome Browser and check the p21/CDKN1A gene locus using coordinate chr6:36,633,000-36,656,000. By visualizing the alignments in this region (remember that the bedGraph files are just the raw signal of where your reads mapped in the genome, before any peak calling), can you roughly tell where p53 binds under Nutlin induced condition (i.e. the majority of the aligned reads are enriched in which of the following regions: the CDKN1A promoter, gene body, gene end or other non-coding region)? How about uninduced condition (DMSO)? (2 points)

Question 2. Check the p21/CDKN1A locus using coordinate chr6:36,633,000-36,656,000 using the called peak files from section 4 above (files should have the format NAME_peaks.bed); you can upload these to the genome browser or just look at them directly. Where does p53 bind under induced and uninduced condition in this region (please give the exact peak coordinates)? Are they consistent with your answer in the first question? (2 points)

Question 3. Give the bedtools intersect call to find the number of p53 peaks in the Nutlin condition (from your .bed peak-calls file) that are present in the DMSO condition: (1 point)

Question 4. What fraction of Nutlin p53 peaks is present in DMSO? (1 point)

Question 5. Give the two bedtools intersect calls to compare the Nutlin and DMSO condition peak calls we generated (the .bed files created in section 4 before) with the published results (remember, the published files are GSM1418969_p53_DMSO_Peaks_hg19_FDR1.bed and GSM1418970_p53_Nutlin_Peaks_hg19_FDR1.bed): (2 points)

$ $

Question 6. How many of the published peaks did you find in the induced (Nutlin) and uninduced (DMSO) conditions? (1 point)

Question 7. Short answer: other than the fact that our data was a small fraction of the data from the published paper, what other factors might affect the peaks that we detected with our analysis? Can you suggest any other analyses that would be useful to further understand this data? (1 point)