Task If you're unfamiliar with epigenetics of ChIP-seq, please read the lecture notes on canvas.
Task: Skim through A User’s Guide to the Encyclopedia of DNA Elements (ENCODE): http://www.plosbiology.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pbio.1001046&representation=PDF
1) Give a brief summary of the goal of the ENCODE project.
2) In addition to ChIP-seq, name and describe 3 other experimental techinques used in ENCODE.
3) In which scenario(s) would ENCODE data be useful:
a. You want to find all regions of the genome where a specific transcription factor bind in humans.
b. You want a list of all variants associated with a disease in human.
c. You want to understand what epigenetic marks are enriched near a set of genetic variants you generated.
d. You want to understand how the epigentic landscape changes across time in a single cell.
e. You want to collect evidence about whether a specific region of the genome is transcribed or not in a specific cell type.
ENCODE data is housed at https://www.encodeproject.org/. Go to this website. Then, click on the 'Data' Tab. You will see a drop-down menu with several options. The 'Assays' option is the link to get to the ENCODE data. Click on some of the data sets. The files to download will be at the bottom of the page. You will see several different data formats available. One of these formats is bed format, which you have seen previously. As a refresher, this format is tab-separated and contains the chromosome and coordinates of each feature of interest (for example, evidence of transcription factor binding or DNAse hypersensitivity).
When epigenetic experiments are performed, the result is not a simple yes/no for containing a feature of interest for each base pair in the genome. Instead, there are peaks, often with a clear signal at the feature in the center, and a less clear signal surrounding it. To deal with this, the ENCODE consortium has developed several different data types that are very similar in character to bed files. Amount the most common are: narrow-peaked ENCODE format and broad-peaked ENCODE format. These file formats are suitable for data types in which certain regions of the genome will have a signal and others won't. In contrast, Wiggle format is used for data types in which the value of all regions of the genome will have a value, such as percentage of nucleotides of a given type.
Explanations of all ENCODE data types can be found here: https://genome.ucsc.edu/FAQ/FAQformat.html#format12
An extremely useful piece of software for processing genomic data is Bedtools (http://bedtools.readthedocs.org/en/latest/). Bedtools allows manipulation and analysis of data that is in the bed format. You previously used this tool to process chip-seq data. In this module, we will make further use of it to study epigenetics.
Bedtools has a plethora of commands. Some of the most useful are:
Intersectbed: Finds overlapping basepairs in the features contained in two different bed files.
Mergebed: Merges overlapping elements into a single feature (a nice figure explaining this function: http://bedtools.readthedocs.org/en/latest/content/tools/merge.html)
Shufflebed: Randomly shuffles genomic location of features, keeping their size and number the same.
A list of all bedtools utilities, and their documentation, can be found at: http://bedtools.readthedocs.org/en/latest/content/bedtools-suite.html
Basic Use of Bedtools
Bedtools is simple to use. For instance, if you want to find the number of features in fileA.bed that overlap with at least one feature infileB.bed you would simply run the command:
bedtools intersect -a fileA.bed -b fileB.bed
This command would output a file containing all overlaps of size 1 bp or higher. For instance, if fileA contained an element that was found on chr. 1, coordinates 670000-670010 and fileB contained a feature on chr. 1 670009-67050, this would be reported as an overlap at chr.1 670009-670010
Intersect also has advanced options. For instance, if you only want bedtools to report features overlapping a certain size, you can use the -f option to specify the minimum fraction of basepairs of the feature in fileA that must be overlapped by a feature in fileB to be reported.
So the command:
bedtools intersect -a datasetA.bed -b datasetB.bed -f .25
would not report the overlap on chromosome 1, since 1 base pair is not at least one quarter of the feature length in file A.
1) Write out the command you would use if you wanted to merge overlapping peaks together in a file called fileC.bed.