Now that we have some background of what ChIP-seq data looks like and what we want to get out of it, let's look a little more closely at how you would actually go about analyzing this kind of data. Let's say you've performed your ChIP-seq experiment in the lab, sent it off for sequencing, and you now have the raw sequence data from the sequencing core or whoever did the sequencing.
For this in-class activity, we are going to walk through the steps of a standard ChIP-seq analysis to give you an idea of the different file formats and commands that are necessary to go all the way from sequencing reads to ChIP-seq peaks. To brush up on your Unix, we will be asking you to fill in the commands necessary to look at these example data files.
Step 1: Read alignment (fastq -> sam/bam)
As we've seen when we looked in previous classes, raw sequencing data are stored in the FASTQ format, where each read is represented by four lines: information about the read and a sequence identifier, the actual sequenced bases of the read, a third line that sometimes repeats the read information, and a line encoding the quality of the sequenced reads. Let's take a look at a sample fastq file containing a small number of reads. First, give the command to move into the inclass_data
directory:
To figure out how many reads we have in this file, give the command to count the number of lines in the file p53_DMSO.sub.short.fastq
:
Each read is represented by four lines in a fastq file. How many lines and reads are there in this file?
Give the command to look at just the first read (so the first four lines of the fastq file):
This shows us the fastq format: the first line here includes a bunch of information about the read, such as the sequence identifier for this read (HWI-ST965:593:C2NYUACXX:6:1101:1200:1893) and the barcode that was used for that read (for this class, we're not going to look much at this information). The second line is the actual nucleotide calls that were sequenced. You may notice that the first character here is an "N", which is not one of the four standard nucleotides. This is because sequencers have some special codes for various situations; the "N" means that it was not sure whether it was an A, C, G, T, or U. This actually comes from the fasta format, which is a related file format used for storing sequence, and you can find all the special codes here: https://en.wikipedia.org/wiki/FASTA_format#Sequence_representation. The third line is just a "+" with nothing after it. This is because the third line of a read in a fastq format is required to start with "+", and can optionally be followed by the sequence identifier again, but this file does not repeat that information. Finally, the fourth line contains the quality information for this read.
You can find all the detail you might ever want to know about fastq files on Wikipedia: https://en.wikipedia.org/wiki/FASTQ_format
Now that we know what fastq files look like, the first step you have to take for your ChIP-seq analysis (and really for any sequencing analysis) is to take these reads and align them to the genome of the organism you are studying. You have a choice of many different alignment programs here, but Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) is often used for ChIP-seq analysis. You won't be directly using Bowtie2 here because it requires very large genome reference files, but we will provide you with the result files and the commands run at each step so that you can see how we would go through the real analysis.
We have mapped this very short fastq file for you using the following bowtie2 command:
Explanation of the flags:
--end-to-end: means that each read has to fully align to the genome (as opposed to allowing some part of the read to not map)
-k 1: report only one alignment per read
-N 1: allow up to 1 mismatch in the alignment of the read
-t: print the time taken by each search
-x ~/data/refgenomes/hg19/bowtie2/hg19: this is the prefix for the Bowtie alignment files, and the path just reflects where these files reside on the server that I ran this on
-U p53_DMSO.sub.short.fastq: this specifies the fastq file containing the reads to be mapped
-S p53_DMSO.sub.short.sam: this says to output the results in a sam file called p53_DMSO.sub.short.sam
After bowtie2 maps the reads in the fastq file, we get what are called sequence alignment/map (sam) files. As you might expect, this is a file format that stores a bunch of information about the mapped reads, including the genomic position that each read was mapped to, the quality of the mapping, and other information. For the full specification, you can check the documentation: http://www.htslib.org/doc/sam.html, or a more detailed explanation at http://samtools.github.io/hts-specs/SAMv1.pdf. For now, let's just look at this file. Give a command to scroll through this file:
As you can see, the sam file consists of a bunch of header lines, which start with the '@' symbol and contain information about the reference genome like how long each chromosome is, followed by one line for each mapped read. In this case, all 10 reads were mapped, so the last ten lines of the file contain our mapped reads. Give a command using 'tail' with a flag to pull out those last 10 reads:
As you can confirm in the documentation for sam files, each of these lines contains 11 required columns, including the name of the specific read, the mapping position and quality, the input sequence, and other information. These required columns are followed by 1 or more optional columns which are formatted as TAG:TYPE:VALUE, where tag is a two character string and type is a single case-sensitive letter describing the format of the value. We won't go into too much detail for these columns as there are tons of different optional columns used by different programs (including information like alignment scores, weightings for multi-mapping reads, etc) and we're not going to use this information here.
As we saw, sam files are stored as plain text, which uses up a lot of disk space (if you have more than just 10 reads). If you recall from the Unix modules, there are ways of compressing files so that they are not stored as plain text but instead are no longer human-readable, but take up less space. Of course, you can use one of the compression tools we learned (zip, tar, gzip) on a sam file so that it takes up less space, but for the purposes of alignment files, we always use a related file format called a bam file, which is a binary version of a sam file. This means that the information in the sam file is now encoded as a series of 0s and 1s, but is stored in a specific way such that tools can access this data without having to decompress the file back to the much larger sam file.
A very useful way to deal with sam and bam files is called samtools, which is a suite of tools that offers a great deal of functionality for manipulating these files. There are over 20 different commands that you can run from this, but here we are only going to look at a few of them. You can find the full list at http://www.htslib.org/doc/samtools.html.
First, we will convert the .sam file into a .bam file. The way to do this is to use the 'samtools view' command with some specific flags. You can learn about the possible flags for samtools view by just running it alone, i.e. try running this:
$ samtools view
Then you can see all the possible flags. Give (and run) the command to convert the .sam file into a .bam file called "p53_DMSO.sub.short.bam"; the only flags you need are that the input is a sam file and the output is a bam file:
Next, we want to visualize our tracks. To do this, we need to sort the bam files and then convert them to what are called bedGraph files. First, write the samtools command to sort the bam file and write it to a file called p53_DMSO.sub.short.sorted.bam (hint: it's a very simple command):
Now, we need to use a command from the bedtools suite (http://bedtools.readthedocs.org/en/latest/) to convert this bam file into a bedGraph file, which we can then use to visualize our mapped reads. As an aside, the bedtools suite is a very useful toolbox for dealing with genomic data in various formats, including the .bed format, which is very common. The tool we will use to convert our sorted bam file into bedGraph format can be called as 'genomeCoverageBed' or 'bedtools genomecov'. You can find the flags and more description of this command on its help page: http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html
The bedGraph file format is quite simple, and consists of four columns: the chromosome, the start of a region, the end of a region, and a score for that region. In this case, the score represents the read coverage of that region (i.e. the number of reads mapping to that region). Let's convert our sorted bam file into a genomeCoverageBed file. To do this, the genomeCoverageBed command needs to know the sizes of each chromosome, which we have provided in the file 'hg19.chrom.sizes', in the inclass_data/ folder.
Using the help page, give the command (starting with either 'genomeCoverageBed' or 'bedtools genomecov') to convert our bam file into a bedGraph file called 'p53_DMSO.sub.short.bedgraph' (hint: we need three flags telling it to use a bam file as input, to output a bedGraph file, and where the genome file (hg19.chrom.sizes) is):
Step 2: Visualize mapped reads
Now let's look at this file. Give the command to print out the whole file (it only consists of 10 lines, one for each read that we mapped):
Notice that the last column in each row just contains a '1'; this is because there was only 1 read mapped to that region. In general, the usefulness of this file type is that it can summarize many reads mapping to a region, so if there are 100 reads in a region the file does not need to use 100 lines to store each one, but instead can represent all 100 in a single line.
For the second module, we will be looking at bedGraph files of full data, so we want to learn how to look at these results. Let's start by uploading this smaller bedGraph file to the UCSC Genome Browser for visualization. To learn how to upload the track, go to https://genome.ucsc.edu/goldenPath/help/customTrack.html, and remember that we are using the hg19 assembly for this. We only have 10 reads here, so there aren't any meaningful patterns to look at, but use the coordinates in the bedgraph file to navigate to where these reads are in the genome browser to confirm that they are represented. You should see black bars representing each read.
Step 3: Peak calling (sam/bam -> bed)
Finally, the last step necessary to go from sequencing reads to interpretable ChIP-seq data is the peak calling step. As we saw in the prelab (principles 2 and 4), there are many complexities and factors to consider in order to correctly identify and interpret "peaks" representing true binding pileups. As discussed in the prelab, for this class we are going to give you a fairly typical 'one size fits all' approach, but it is important to remember that there are many different ways to analyze this data, and each analysis choice should be very carefully considered!
As mentioned in the prelab, we will be using a common tool for identifying peaks called Model-based analysis of ChIP-seq data, or MACS. We won't go into too much detail with the statistics here (see the original paper at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2592715/), but the idea is to model the shift between ChIP-seq tags on different strands and a Poisson-based test for enrichment to identify binding sites that are significantly enriched over the background with good spatial resolution.
In terms of file formats, we are going from a sam or bam file, containing the mapped reads, to what is called a browser extensible data file, or bed file. These are related to bedGraph files, but are more general. See this useful page from the UCSC Genome Browser for more detail: https://genome.ucsc.edu/FAQ/FAQformat.html#format1. For now, all we need to know is that the first three columns, chromosome, start, and end, are required, and the other optional columns contain more information about the intervals defined by the first three columns. Because it is not exactly meaningful to run peak calling on only 10 reads, we have provided you with the result of peak calling on the full p53_DMSO dataset, using an input file as the control. More detail about these files will come in the next module, but for now, here is the command ran to generate the file p53-input.DMSO_peaks.narrowPeak:
We're using fairly straightforward flags here:
-t: this defines the file containing the data from the treatment condition
-c: this defines the file containing the data from the control condition
-f BAM: this says that our input files are in .bam format
-g hs: this says that we are in the Homo sapiens genome, and this flag tells MACS the size of the genome that is actually mappable
-n p53-input.DMSO: this says that our output filenames should be prefixed with 'p53-input.DMSO', reflecting that it is the p53 minus the input, treated with DMSO
-p 1e-5: this says to use a p-value cutoff of $1x10^{-5}$ for reporting peaks
The main output file, 'p53-input.DMSO_peaks.narrowPeak' is a bed file (even though its extension is not .bed.) containing 4 extra columns. Write the command to look at the top few lines of this file:
Notice the first line, which I've manually added. This is a line necessary for telling the UCSC Genome Browser that this file is formatted as a narrowPeak file, so that it knows how to interpret it. Then, all the lines following this follow the .bed format.
The first three columns are the chromosome, start position, and end position, the fourth column contains the name of the peak, the fifth is the integer score for that peak, the sixth is the strand (these are all '.', meaning that we have no strandedness information), the seventh is the fold change between the input and experimental conditions, the 8th and 9th are the -log10 values of the p- and q-values, respectively, and the 10th is the location of the "summit" of the peak, relative to the start of the peak.
We have presented a fairly simple approach for ChIP-seq analysis: read alignment, some visualization, and peak calling. However, it is important to note that there are also many other steps that can be added to a ChIP-seq analysis pipeline and sources of bias to consider. For a very in-depth review of these issues, see the following review paper from one of the authors of the MACS paper: http://www.nature.com/nrg/journal/v15/n11/full/nrg3788.html. We saw some of these considerations in the prelab for this class, such as the shape of the peaks and the appropriate choice of controls. This review discusses many other sources of bias and analysis considerations such as allele specificity of reads, duplicate read removal, and adjusting for sequencing depth.
Question 1. First, let's use some Unix commands to further characterize our data. Give the command to find how many peaks we have (1 point):
Question 2. Next, give a command to pull out the first column of the bed file, sort it, and count how many peaks there are on each chromosome (1 point):
Question 3. Next, upload this file to the UCSC genome browser (you can use the same uploading approach as the bedGraph file). Then, navigate to the genomic interval chr15:63,449,045-63,449,917 and look at the peak overlap. What gene does this peak overlap? (1 point)
Question 4. Now go back to the p53-input.DMSO_peaks.narrowPeak
file. Which peak name (in the fourth column) does this locus correspond to? (1 point)
Question 5. What are the -log10 values of the p- and q-value for this peak? (2 points)
Question 6. We have provided a bedgraph file of some of the reads from the p53 condition around this region, in the partial_p53_chr15_reads.bedgraph
file (there are no reads from the input condition). Next, upload this file to the genome browser and look at this bedgraph file next to the called peak, making sure you're at the chr15:63,449,045-63,449,917 genomic interval. Do the raw read patterns seem to support the called peak? (1 point)
Question 7. Short answer: Where along the gene does this peak seem to fall? Does it seem to differ between isoforms (make sure to look at the UCSC genes)? Given that this data reflects p53 binding, a transcription factor, what might this mean about the relationship between p53 and the gene? (3 points)