In-Class: ENCODE

Part 1: ENCODE on the Genome Browser

We will now start to look at ENCODE data. The quickest way to look at epigenetic marks in a specific region of the genome is through the UCSC Genome Browser.

Task: You are interested in the gene SLC24A5 because of its association with albinism (Wei et al. 2013) and would like to learn about its regulation using ENCODE data.

Go to the UCSC Genome Browser at https://genome.ucsc.edu/cgi-bin/hgGateway
Using the hg19 assembly, enter the gene name SLC24A5 (coordinates chr15:48,413,169-48,434,589) into the search box.
Go to the "Regulation" group of tracks. Show whichever tracks you think will be relevant to answering the questions below. For help, refer to: https://genome.ucsc.edu/ENCODE/usageResources/ENCODE_QuickReferenceCard.pdf

1) Wei et al. showed that decreased levels of SLC24A5 are associated with albinism. You want to investigate this further by knocking out relevant transcription factors. Based on the UCSC Genome Browser, name two transcription factors that show evidence of binding within the gene and three that bind upstream of the gene.

2) Look at the DNase hypersensitivity information in the region directly upstream of this gene by loading the relevant ENCODE track (the ENCODE tracks have the black and white helix logo next to them). Can you find evidence of DNase hypersensitivity in a cell type relevant to this phenotype? If so, what is that cell type?

3) You want to determine whether the region upstream of this gene is an enhancer in any cell type. As you can see, it is very difficult to tell this quickly from the epigenetic marks alone. Luckily, there are tools that make computational predictions about the functional activity of a region by combining many epigenetic marks. One such tool is ChromHMM. Look at the ChromHMM tracks by loading the ENCODE histone modification track. Is there an enhancer directly upstream of this gene in any of the assayed cell types? If so, which one? Hint: Zoom out 1.5x to view the entire upstream region.

Part 2: Using ENCODE like a Genomicist

The Genome Browser works great if you want to look at a few specific sites, but it is not a feasible way to investigate many sites across the entire genome. To do this, we will download data from the ENCODE consortium, then analyze it using bedtools.

You are interested in understanding the epigenetic basis of the autoimmune disease Type I Diabetes. You are specifically interested in the possible role that B lymphocytes play in the disease. In order to study this connection, you want to download several data sets.

Task: Download data types of interest using the instructions below.

Go to the ENCODE website: https://www.encodeproject.org/
From the 'Data' Tab on the top menu bar, select 'Search'.
Using the search filters on the left-hand side of the screen, find the data set: DNase-seq of B cell (Homo sapiens, female, adult 27 year) from Gregory Crawford's lab. To do this, click on the filters: Homo Sapiens, adult, DNase-seq, Gregory Crawford.
Under "External resources", click the link to the data on GEO.
Right click to copy the path to the bed narrowPeak formatted file.
Use wget in the terminal to download the file to CoCalc.
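
Before using the downloaded file, it can help to confirm that it really is in narrowPeak format. The Python sketch below shows the ten tab-separated columns that the format defines (the six standard BED columns, then signalValue, pValue, qValue, and the summit offset). The toy line and the names Peak and parse_narrowpeak_line are made up for illustration; they are not part of ENCODE or bedtools. If the file you downloaded ends in .gz, it would need to be decompressed first.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    """One narrowPeak record (BED6+4). Coordinates are 0-based, half-open."""
    chrom: str
    start: int
    end: int
    name: str
    score: int
    strand: str
    signal_value: float
    p_value: float      # -log10 p-value, or -1 if not assigned
    q_value: float      # -log10 q-value, or -1 if not assigned
    summit_offset: int  # summit position relative to start, or -1 if not called


def parse_narrowpeak_line(line: str) -> Peak:
    """Split one tab-separated narrowPeak line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, found {len(fields)}")
    return Peak(
        fields[0], int(fields[1]), int(fields[2]), fields[3], int(fields[4]),
        fields[5], float(fields[6]), float(fields[7]), float(fields[8]),
        int(fields[9]),
    )


# Hypothetical record, not taken from the real data set.
toy_line = "chr1\t1000\t1150\t.\t0\t.\t12.5\t-1\t-1\t75\n"
peak = parse_narrowpeak_line(toy_line)
print(peak.chrom, peak.start, peak.end, "width:", peak.end - peak.start)
```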

Task: Let's start with a simple experiment. Are SNPs associated with Type I Diabetes more likely to be in DNase hypersensitive sites than random SNPs in the genome?

Use the bedtools function intersect (http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html) to measure the overlap between the disease-associated SNPs (found in T1D_SNPs.bed) and the DNase hypersensitive regions. Intersect looks for the intersection of regions between two bed-formatted files. The -a and -b flags specify the file names.

Tip: To write the results to a file, use the Unix redirection operator >.

Remember, you need to start a terminal window to use bedtools.
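
To make the idea of an intersection concrete, here is a small Python sketch of the overlap rule for BED intervals, which are 0-based and half-open. It is a toy illustration on made-up coordinates, not a replacement for bedtools and not the way bedtools is implemented. The names overlaps and intersect_pairs are hypothetical.

```python
from typing import List, Tuple

# (chrom, start, end) in BED convention: 0-based start, end not included.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if a and b are on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def intersect_pairs(a_records: List[Interval],
                    b_records: List[Interval]) -> List[Tuple[Interval, Interval]]:
    """Return every (a, b) pair that overlaps (brute force, fine for toy data)."""
    return [(a, b) for a in a_records for b in b_records if overlaps(a, b)]


# Made-up SNPs (length 1) and made-up peaks; two of the peaks overlap each other.
toy_snps = [("chr1", 100, 101), ("chr1", 500, 501), ("chr2", 100, 101)]
toy_peaks = [("chr1", 50, 150), ("chr1", 90, 120), ("chr2", 200, 300)]

pairs = intersect_pairs(toy_snps, toy_peaks)
snps_hit = {a for a, _ in pairs}
print("overlapping pairs:", len(pairs))
print("distinct SNPs with an overlap:", len(snps_hit))
```

Notice that the number of overlapping pairs and the number of distinct SNPs can differ when peaks overlap one another. When you count overlaps with bedtools, check the intersect documentation to see which of the two your command reports.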

Question 1a. How many SNPs are in T1D_SNPs.bed?

Question 1b. What is the command to find the intersections?

Question 1c. Now, run the command. How many overlaps are there?

Next, we want to test whether this overlap is larger than expected by chance. To do this, we are going to use simulations to get an idea of how likely it is to see an overlap this large by chance alone.

To do these simulations, we are going to generate a random set of SNPs the same size as our set of interest using the bedtools random function. The -l flag specifies the length of each region (in this case it will be 1, since we are generating single nucleotide variants), the -n flag specifies the number of random regions to generate, and the -g flag specifies a file describing the structure of the genome (the chromosomes and their lengths).

bedtools random -l 1 -n 836 -seed x -g human.hg19.genome

Do this 10 times, replacing x with the seeds 1 through 10. Specifying a seed causes the random number generator to produce the same numbers every time the command is run with that seed. This allows for reproducible research when software relies on randomness. So, your first command would be:

bedtools random -l 1 -n 836 -seed 1 -g human.hg19.genome
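
The effect of a seed can be seen in a few lines of Python. This is only an analogy: it uses Python's own random number generator on a made-up two-chromosome genome, and it makes no claim about how bedtools random samples positions internally. The names toy_genome and random_snps are hypothetical.

```python
import random
from typing import Dict, List, Tuple

# Made-up genome file contents: chromosome name -> length in bases.
toy_genome: Dict[str, int] = {"chrA": 1000, "chrB": 500}


def random_snps(n: int, seed: int, genome: Dict[str, int]) -> List[Tuple[str, int, int]]:
    """Draw n length-1 intervals, choosing chromosomes in proportion to length."""
    rng = random.Random(seed)  # private generator, fully determined by the seed
    chroms = list(genome)
    weights = [genome[c] for c in chroms]
    snps = []
    for _ in range(n):
        chrom = rng.choices(chroms, weights=weights)[0]
        start = rng.randrange(genome[chrom])  # 0-based start within the chromosome
        snps.append((chrom, start, start + 1))
    return snps


first = random_snps(5, seed=1, genome=toy_genome)
again = random_snps(5, seed=1, genome=toy_genome)
print("same seed gives the same SNPs:", first == again)
```

Running the function twice with seed 1 returns identical lists, which is why anyone who reruns your ten commands with seeds 1 through 10 should obtain the same ten random SNP sets.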

Question 2. After each random replicate, run the intersect command to find the number of overlaps. Each time, write the number below.

Question 3. Does the overlap with the original SNP set still seem significant?

Question 4. Should this convince you that DNase hypersensitivity in B cells is causing these SNPs to be associated with Type I Diabetes? Hint: Think about confounding factors.

Homework: Using ENCODE to study Alzheimer's

We will now try to learn about possible genetic mechanisms of Alzheimer's disease using ENCODE data.

SNPs associated with this disease (in hg19) can be found in ALZ_SNPs.bed.

Choose an ENCODE data set from the ENCODE website that you think may be relevant to the disease (note: feel free to choose a tissue other than brain, since other tissues could be linked to the disease as well). Think about both the tissue type and the ENCODE mark. Please do not use DNase hypersensitivity or RNA-seq datasets. Hint: it will be easiest if you find a data set with data in the bed file format.

Question 1: What dataset did you choose and why? Include the tissue, epigenetic mark tested, and any identifying information, such as the name of the sample. (2 points)

Question 2: Now, test for significant overlap between the Alzheimer's SNPs and your ENCODE mark. This time, instead of generating random SNP sets, use Fisher's exact test as implemented in bedtools (http://bedtools.readthedocs.org/en/latest/content/tools/fisher.html). This test looks for an association between two classifications. In our case, we are looking for an association between being a SNP in ALZ_SNPs.bed and being in an interval from your ENCODE data set. Run this test on your datasets. What is the right-tailed p-value? This is the probability of seeing at least as much overlap as you observed if the SNPs and the intervals were unrelated. (Hint: you will need to sort each file before running this test. To do this, you can use the Linux command: sort -k1,1 -k2,2n in.bed > in.sorted.bed ) (5 points)
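
To see what a right-tailed p-value means, the Python sketch below computes it for a small 2x2 table directly from the hypergeometric distribution. It illustrates the statistic only: bedtools builds its own table from your two files and the genome, and the table used here is a made-up textbook-sized example. The name fisher_right_tail is hypothetical.

```python
from math import comb


def fisher_right_tail(a: int, b: int, c: int, d: int) -> float:
    """Right-tailed Fisher's exact test for the table [[a, b], [c, d]].

    Returns P(X >= a), where X is the top-left cell under the null hypothesis
    of no association, with all row and column totals held fixed.
    """
    row1 = a + b          # total of the first row
    col1 = a + c          # total of the first column
    n = a + b + c + d     # grand total
    largest = min(row1, col1)  # largest value the top-left cell can take
    favourable = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, largest + 1)
    )
    return favourable / comb(n, col1)


# Made-up table: rows = in both / in one only, columns are likewise illustrative.
p_right = fisher_right_tail(3, 1, 1, 3)
print(f"right-tailed p-value: {p_right:.4f}")
```

A small right-tailed p-value says the top-left cell (here standing in for the overlap count) is larger than would usually be seen if the two classifications were independent.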

Question 3: Was there a significant association between the disease-associated genetic variants and the ENCODE mark? Explain your findings to the best of your ability. Are you surprised by the result? If so, why? If not, why not? (3 points)