Motif discovery analysis - III

Motif analysis for ChIP-Seq data

In this lab, we are going to learn an important downstream analysis for ChIP-Seq data: how to find motifs enriched in ChIP-Seq peaks.

Why might we want to do motif analysis for ChIP-Seq data? There are several reasons:

(1). Motif analysis can be used to validate ChIP-Seq experimental data. If you perform a ChIP-Seq experiment for a transcription factor with a known binding motif, you would expect that motif to be enriched in the ChIP-Seq peaks. For example, the transcription factor Foxa2 is known to bind the motif "GTAAACA", so motif analysis of a Foxa2 ChIP-Seq experiment should identify "GTAAACA" as one of the enriched motifs in the peaks. If it does not, the quality of the ChIP-Seq experiment is questionable and needs further investigation. If you are interested, see the following reference for an example: Xu, Chenhuan, et al. "Genome-wide roles of Foxa2 in directing liver specification." Journal of molecular cell biology (2012)
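As a rough illustration of this validation idea, the sketch below counts how many sequences contain "GTAAACA" on either strand. The sequences are made-up toy strings, not real Foxa2 peaks, and the function names are hypothetical helpers written for this example. Exact string matching is a simplification: HOMER scores motifs as position weight matrices rather than fixed strings.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T/N)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def contains_motif(seq, motif):
    """True if the motif occurs on the forward or the reverse strand of seq."""
    seq = seq.upper()
    motif = motif.upper()
    return motif in seq or reverse_complement(motif) in seq


def fraction_with_motif(sequences, motif):
    """Fraction of sequences that contain the motif on either strand."""
    if not sequences:
        return 0.0
    hits = sum(contains_motif(s, motif) for s in sequences)
    return hits / len(sequences)


# Toy sequences, invented for illustration only (not real Foxa2 peaks).
toy_peaks = [
    "CCGTAAACATT",   # motif on the forward strand
    "AATGTTTACGG",   # motif on the reverse strand (TGTTTAC)
    "CCCCGGGGCCCC",  # no motif
    "TTGTAAACAGG",   # motif on the forward strand
]
print(fraction_with_motif(toy_peaks, "GTAAACA"))
```

A high fraction on its own is not evidence of enrichment. It only becomes meaningful when compared with the fraction observed in background sequences.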

(2). Motif analysis can also be used to identify novel binding motifs for transcription factors. If you are studying a transcription factor whose binding motif is unknown, you can use motif analysis to discover it. A novel binding motif can give useful information about the function of the transcription factor. For example, Bing Ren's group identified a novel binding motif for the insulator protein CTCF. By analyzing the new motif, they identified some new functions of the CTCF protein. For reference, you can read this paper: Kim, Tae Hoon, et al. "Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome." Cell 128.6 (2007): 1231-1245.

(3). Motif analysis can be used to identify cofactors from the ChIP-Seq experiment. It is common to identify multiple different motifs from a given ChIP-Seq experiment. Some of the motifs may belong to transcription factors that are not studied in the ChIP-Seq experiment and those transcription factors are potentially cofactors. Here is a reference for this kind of analysis: Ding, Jun, et al. "Systematic discovery of cofactor motifs from ChIP-seq data by SIOMICS." Methods 79 (2015): 47-51.

In this lab, we are going to learn how to use HOMER to run motif analysis on some ChIP-Seq data. HOMER is a toolkit for motif discovery based on sequencing data, and it is freely available at http://homer.salk.edu/homer/ngs/peakMotifs.html. HOMER contains several Perl scripts (Perl is a programming language similar to Python). We have already installed HOMER on CoCalc, so you can use it directly for the following analysis.

The data we are going to use is from a published Foxa2 ChIP-Seq experiment. The winged helix protein FOXA2 is a highly conserved, regionally-expressed transcription factor that regulates networks of genes controlling complex metabolic functions. The raw reads were aligned to the reference genome and Foxa2 binding peaks were identified. You can find the ChIP-Seq peak data (GSE25836_Human_Liver_FOXA2_GLITR_1p5_FDR.bed) in BED format in the folder "data_for_motif_analysis". We are going to use this file for the following motif analysis.
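Before running the motif analysis, it can help to check what the peak file looks like. The following is a minimal sketch, assuming a standard BED layout (chromosome, 0-based start, end in the first three columns). The function names are hypothetical and the toy lines are invented for illustration.

```python
import statistics


def parse_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples.

    Blank lines and header lines (comments, 'track', 'browser') are skipped.
    """
    peaks = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        peaks.append((fields[0], int(fields[1]), int(fields[2])))
    return peaks


def peak_widths(peaks):
    """BED intervals are half-open, so the width is simply end - start."""
    return [end - start for _chrom, start, end in peaks]


# Toy BED content, invented for illustration only.
toy_bed = [
    "track name=toy_peaks",
    "chr1\t100\t300",
    "chr2\t1000\t1150\tpeak2",
]
toy_peaks = parse_bed(toy_bed)
toy_widths = peak_widths(toy_peaks)
print(len(toy_peaks), "peaks; median width:", statistics.median(toy_widths))
```

To apply this to the lab data, pass an open file handle instead of the toy list, for example `with open(path_to_bed) as handle: peaks = parse_bed(handle)`, where `path_to_bed` points at the BED file in "data_for_motif_analysis".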

We will use the findMotifsGenome.pl script in HOMER to find enriched motifs in Foxa2 ChIP-Seq peaks. The basic syntax is as follows:

NOTE: These commands are to be run in the CoCalc terminal

findMotifsGenome.pl <peak/BED file> <genome> <output directory> -size # [options]
  1. <peak/BED file> is the input ChIP-Seq peaks in BED file format.
  2. <genome> is the reference genome. We will use the human reference genome hg18 for the analysis.
  3. <output directory> is the output folder.
  4. -size # selects the size of the region used for motif finding. To find motifs using the exact sizes of your peaks, use the option "-size given". However, for transcription factor peaks, most motifs are found within +/- 50-75 bp of the peak center, so it is usually better to use a fixed size than to depend on your peak sizes.
  5. [options] are other optional settings.
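To see what a fixed region size means in practice, here is a minimal sketch that recenters a peak on its midpoint and gives it a fixed width. The function `resize_peak` is a hypothetical helper for illustration. HOMER's own rounding and its handling of chromosome ends may differ.

```python
def resize_peak(start, end, size):
    """Return a window of the given size centered on the peak midpoint.

    The window start is clamped at 0. Chromosome ends are not checked,
    because chromosome lengths are not available in this sketch.
    """
    center = (start + end) // 2
    new_start = max(0, center - size // 2)
    return new_start, new_start + size


# A 500 bp peak reduced to the central 200 bp (+/- 100 bp around the center).
print(resize_peak(1000, 1500, 200))

# A narrow peak is widened to the same fixed size.
print(resize_peak(1240, 1260, 200))
```

Using one size for every peak keeps the comparison with background sequences consistent and focuses the search on the region around the peak center, where most motifs are expected.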

Here is an example using findMotifsGenome.pl to identify motifs in peaks.bed.

findMotifsGenome.pl peaks.bed hg18 MotifOutput/ -size 200 -mask -preparsedDir parsed_genome -len 8
  • "-mask" means we want to mask out the repeat sequences in the genome.
  • "-preparsedDir parsed_genome" specifies the output folder for parsed sequences.
  • "-len 8" specifies the motif length to be 8.
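If you prefer to launch HOMER from a script or notebook rather than typing the command, the same example can be assembled as an argument list. This is only a sketch: `build_homer_command` is a hypothetical helper, and it uses only the options shown in the example above.

```python
import shlex


def build_homer_command(peaks, genome, out_dir, size=200, motif_len=8,
                        mask=True, preparsed_dir=None):
    """Assemble a findMotifsGenome.pl call as a list of arguments."""
    cmd = ["findMotifsGenome.pl", peaks, genome, out_dir, "-size", str(size)]
    if mask:
        cmd.append("-mask")  # mask repeat sequences
    if preparsed_dir:
        cmd += ["-preparsedDir", preparsed_dir]  # folder for parsed sequences
    cmd += ["-len", str(motif_len)]  # motif length
    return cmd


example_cmd = build_homer_command(
    "peaks.bed", "hg18", "MotifOutput/", preparsed_dir="parsed_genome"
)
# Print the command in a form that can be pasted into the terminal.
print(shlex.join(example_cmd))
```

To actually run it, pass the list to `subprocess.run(example_cmd, check=True)`. That step only works where HOMER is installed and on the PATH, as it is on CoCalc.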

NOTE: You do not need to create folders when executing this command

Based on the above example, now run the motif analysis on the Foxa2 ChIP-Seq peaks. Write down your code.

NOTE: The above analysis may take 5-10 minutes to run.

While we are waiting, let's do something else. Most bioinformaticians would agree that they spend more of their time reading documentation, manuals, or search results from Google than coding. So try to be a real bioinformatician and read the section "How findMotifsGenome.pl works" at http://homer.salk.edu/homer/ngs/peakMotifs.html. Try to get a general idea of how HOMER (findMotifsGenome.pl) works, and write down the steps the program goes through.

When your motif analysis is done, look at the output folder on CoCalc. You will find two HTML files, "knownResults.html" and "homerResults.html". Open "knownResults.html" first (if the HTML file does not open in CoCalc, download the MotifOutput directory to your local computer and view it there). You will see a series of known motifs enriched in the ChIP-Seq peaks, ranked by their significance (q-value column). Note that this method is partly random, because of how HOMER chooses the set of background sequences to compare against, among other factors. Your results may therefore differ from run to run and from the results your friends get!
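To make the idea of significance against background concrete, the sketch below computes a binomial tail probability: the chance of seeing at least the observed number of peaks with a motif if peaks contained it at the background rate. Treat this as an illustrative assumption rather than HOMER's exact calculation. The numbers are invented, and `binomial_enrichment_pvalue` is a hypothetical helper.

```python
import math


def log_binom_pmf(k, n, p):
    """Natural log of the Binomial(n, p) probability of exactly k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def binomial_enrichment_pvalue(hits, n_targets, background_rate):
    """P(X >= hits) for X ~ Binomial(n_targets, background_rate)."""
    if not 0 < background_rate < 1:
        raise ValueError("background_rate must be strictly between 0 and 1")
    if hits <= 0:
        return 1.0
    total = sum(math.exp(log_binom_pmf(k, n_targets, background_rate))
                for k in range(hits, n_targets + 1))
    return min(1.0, total)


# Invented numbers: 1000 peaks, and the motif occurs in 10% of background sequences.
enriched = binomial_enrichment_pvalue(300, 1000, 0.10)      # motif in 30% of peaks
not_enriched = binomial_enrichment_pvalue(100, 1000, 0.10)  # motif in 10% of peaks
print(enriched, not_enriched)
```

Because the background rate enters the calculation directly, a different random draw of background sequences changes the result, which is one reason the rankings can shift between runs. For very strong enrichment this simple sum can underflow to zero, so tools typically report such values on a log scale.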

What are top 5 known motifs enriched in our Foxa2 ChIP-Seq peaks? Write down their names.

Considering that we are analyzing Foxa2 ChIP-Seq data, do the top 5 motifs make sense? Why?

Look at the list of known motifs. Do you find any motifs that do not belong to the Fox family? Find three and write down their protein names.

Why do we find those non-Fox motifs in Foxa2 ChIP-Seq peaks?

Let's open "homerResults.html". Those motifs are de novo motifs identified by HOMER. Since we have specified "-len 8", all the de novo motifs are 8mers. HOMER also matches the de novo motifs to known motifs and gives the best match in the output.
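The core idea behind de novo discovery can be illustrated by counting every k-mer in the target sequences and comparing against background. The sketch below is a toy simplification, not HOMER's algorithm (HOMER allows mismatches and optimizes motif matrices). It ignores the reverse strand, the sequences are invented, and the function names are hypothetical.

```python
from collections import Counter


def count_sequences_with_kmer(sequences, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def rank_kmers(targets, background, k):
    """Rank k-mers by the ratio of target fraction to background fraction."""
    target_counts = count_sequences_with_kmer(targets, k)
    background_counts = count_sequences_with_kmer(background, k)
    scored = []
    for kmer, n_target in target_counts.items():
        frac_target = n_target / len(targets)
        # A pseudocount avoids division by zero for k-mers absent from background.
        frac_background = (background_counts.get(kmer, 0) + 1) / (len(background) + 1)
        scored.append((frac_target / frac_background, kmer))
    scored.sort(reverse=True)
    return [(kmer, ratio) for ratio, kmer in scored]


# Toy sequences, invented for illustration only.
toy_targets = ["AAGTAAACATT", "CCGTAAACAGG", "TTGTAAACACC"]
toy_background = ["ACGTACGTACGT", "TTTTCCCCGGGG", "AGAGAGAGAGAG"]
ranked = rank_kmers(toy_targets, toy_background, 7)
print(ranked[0])
```

In this toy data the 7-mer shared by all three target sequences comes out on top. A fixed k plays the same role as the "-len" option: it restricts the search to motifs of one length.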

What is the top de novo motif identified in our Foxa2 ChIP-Seq peaks? Write down the best match protein name.

For the top motif, what percentage of ChIP-Seq peaks have this motif? What percentage of background sequences have this motif?

Are there any other non-Fox de novo motifs in the list? What biological questions can you ask based on these results?

NOTE: Please delete all unnecessary HOMER output files before submitting the homework, using the command below. First navigate to "data_for_motif_analysis/", then run "rm -r MotifOutput parsed_genome".

Homework (10 points)

Let's use another ChIP-Seq dataset for the homework. The nuclear receptor PPARg is a transcription factor that regulates networks of genes controlling complex metabolic functions. We will use a dataset from a PPARg ChIP-Seq experiment performed on a human adipocyte cell line (SGBS).

Question 1: Perform the motif analysis on the PPARg ChIP-Seq peaks (GSE25836_Human_SGBS_Ads_PPARg_GLITR_1p5_FDR.bed) with HOMER. Write down your code. (1 point)

Question 2: In the list of known motifs that are enriched in the ChIP-Seq peaks, can you find the PPARg motif? What can you conclude about the quality of this ChIP-Seq experiment? (1 point)

Question 3: What other non-PPARg motifs can you find in the list of known motifs? Find three and write down their names. Then qualitatively compare their motifs to the top motif. What similarities are there, and why were all these different motifs found to be significantly enriched? (3 points)

Question 4: What are the top three de novo motifs? Write down the names of their best match proteins. Similarly to the last question, compare these three de novo motifs. What kind of similarities are there? Are they more or less similar to each other than the known motifs were? Do any of them seem similar to the known motifs that were identified? (2 points)

Question 5: Compare the p-value of the top known motif with the p-value of the top de novo motif. Why might they differ so much? (1 point)

Question 6: How do you interpret these p-values? (Hint: what is the null hypothesis, and what is the alternative hypothesis?) (1 point)

Question 7: Find a known motif that is present in a higher percentage of target sequences than the top one. Why does this motif have a larger p-value than the top one? (1 point)

NOTE: Please delete all unnecessary HOMER output files before submitting the homework, using the command below. First navigate to "data_for_motif_analysis/", then run "rm -r MotifOutput parsed_genome".