Motif analysis

It is often interesting to find out whether we can associate the identified binding sites with a sequence pattern or motif. To do so, we will identify the summit regions of the strongest PAX5 binding sites, retrieve the sequences associated with these regions, and use MEME for motif analysis.

Since many peak-finding tools merge overlapping areas of enrichment, the resulting peaks tend to be much wider than the actual binding sites. The summit and its vicinity are the best estimate for the true protein binding site, and so it is here where we look for repeated sequence patterns, called motifs, to which the transcription factor may preferentially bind.

Sub-dividing the enriched areas by accurately partitioning enriched loci into a finer-resolution set of individual binding sites, and fetching sequences from the summit region where binding motifs are most likely to appear enhances the quality of the motif analysis. Sub-peak summit sequences have already been called by MACS2 with the --call-summits option.

De novo motif finding programs take as input a set of sequences in which to search for repeated short sequences. Since motif discovery is computationally heavy, we will restrict our search for the Oct4 motif to the genome regions around the summits of the 300 most significant PAX5 subpeaks on Chromosome 1.

If you are not in there already, change into the data directory.


In [ ]:
cd data

Sort the PAX5 peaks by the height of the summit (the maximum number of overlapping reads).


In [ ]:
sort -k5 -nr PAX5_summits.bed > PAX5_summits.sorted.bed

Using the sorted file, select the top 300 peaks and create a BED file for the regions of 60 base pairs centred around the peak summit.


In [ ]:
awk 'BEGIN{FS=OFS="\t"}; NR < 301 { print $1, $2-30, $3+29 }' \
PAX5_summits.sorted.bed > PAX5_top300_summits.bed

The human genome sequence is available in FASTA format in the bowtie_index directory.

Use bedtools to extract the sequences around the PAX5 peak summits in FASTA format, which we save in a file named PAX5_top300_summits.fa.


In [ ]:
bedtools getfasta -fi genome/HS19.fa \
-bed PAX5_top300_summits.bed -fo PAX5_top300_summits.fa

We are now ready to perform de novo motif discovery, for which we will use the tool MEME.

Open a web bowser, go to the MEME website at http://meme-suite.org/, and choose the "MEME" tool.

Fill in the necessary details, such as:

  • the sub-peaks fasta file PAX5_top300_summits.fa (will need uploading), or just paste in the sequences.
  • the number of motifs we expect to find (1 per sequence)
  • the width of the desired motif (between 6 to 20) in the "Advanced" options
  • the maximum number of motifs to find (3 by default).

For PAX5 one classical motif is known.

Start Search.

Your MEME analysis will now be queued and will run on a server in the US. The results page will refresh automatically and once the tool has finished running there will be a link to the results. Depending on how busy the servers are your analysis may take a longer or shorter time to run.

You can check the load of the server here:

http://meme-suite.org/opal2/dashboard?command=statistics

Analyse the results from MEME

We would like to know if this motif is similar to any other known motif. We will use the results from TOMTOM for this.

On either the results from the web MEME run or the local run please follow the link “MEME html output”. Scroll down until you see the first motif logo.

Click under the option Submit/Download and choose the TOMTOM button to compare to known motifs in motif databases, and on the new page choose to compare your motif to those in the JASPAR CORE and UniPROBE Mouse database.

Running MEME locally

If you want to speed things up you may want to run MEME on your own machine. You can try to do this as well if you wish, or skip the following bonus exercise and go to the next section.

To bring up the help page for the local installation of MEME, type:


In [ ]:
meme

Run MEME locally, setting the output directory with the option -o (e.g. -o meme_out).


In [ ]:
meme PAX5_top300_summits.fa -o meme_out -dna -nmotifs 1 -minw 6 -maxw 20

Once MEME has finished running look in this directory for the file meme.html and open it in a web browser. You can do this by either copying the path to the file to the address bar in Firefox or double click on the .html file.

Alternatively, you can run the following command to automatically open the HTML file in Firefox:


In [ ]:
firefox meme_out/meme.html

Scroll down until you see the first motif logo.

We would like to know if this motif is similar to any other known motif. We will use TOMTOM and a set of known motif databases stored in motif_databases for this.

To compare your newly found motifs to the motif databases JASPAR CORE and UniPROBE Mouse you can run:


In [ ]:
tomtom -o tomtom_out meme_out/meme.html \
motif_databases/JASPAR/JASPAR_CORE_2016_vertebrates.meme \
motif_databases/MOUSE/uniprobe_mouse.meme

Once again, once TOMTOM has finished running look in tomtom_out for the file tomtom.html.

Open tomtom.html in a web browser.


In [ ]:
firefox tomtom_out/tomtom.html

Questions

Q1. Which motif was found to be the most similar to your motif?


Congratulations, you have reached the end of this tutorial!

We hope you've enjoyed our ChIP-Seq tutorial. You can find the answers to all of the questions in this tutorial in answers.ipynb. You can revisit inspecting genomic regions using bedtools or, go back to the beginning.