Motif discovery and regulatory analysis - II

Table of Contents

  1. Sequence motif databases
  2. Gene list promoter motif analysis

1. Sequence motif databases

As we saw in the prelab lecture, there are several databases containing known motifs for various transcription factors. Two very commonly used tools that we saw in the lecture are JASPAR, a motif database, and TOMTOM, a tool for matching motifs to databases, and here we will have some practice using these tools. Recall the PWM from questions 1-3 from the last homework:

NucleotidePos. 1Pos. 2Pos. 3Pos. 4Pos. 5Pos. 6Pos. 7Pos. 8
A0.010.10.970.950.50.050.80.4
C0.030.050.010.010.10.60.10.08
G0.950.050.010.030.10.050.050.02
T0.010.80.010.010.30.30.050.5

Let's try submitting this PWM to JASPAR. Navigate to http://jaspar.genereg.net/ and scroll to the very bottom, to the box labelled 'Align to a custom matrix or IUPAC string.' We can't directly use this matrix in the table format provided here, so copy and paste the table in the following format:

A [ 0.01 0.1 0.97 0.95 0.5 0.05 0.8 0.4 ] C [ 0.03 0.05 0.01 0.01 0.1 0.6 0.1 0.08 ] G [ 0.95 0.05 0.01 0.03 0.1 0.05 0.05 0.02 ] T [ 0.01 0.8 0.01 0.01 0.3 0.3 0.05 0.5 ]

Then, press the 'Align' button underneath the table entry form. You will be taken to a result page showing all the matching transcription factors including their name, species, class, family, score, percent score, and canonical sequence logo. If you click on the sequence logo for any match, you can find more information including the latin name of the species that the motif comes from. What is the best matching transcription factor, and what species does it come from?

What is the second best matching transcription factor, and what species does it come from?

What do you make of the fact that the top two matches come from very different species?

A related tool is TOMTOM, which is part of the MEME suite. To practice with this, we are going to use the MEME results we generated in the second half of the homework from last class. If you lost the link, you can re-run the job the same way you did last time. Once you have these results, go to the top discovered motif and click the rightward facing arrow. This time, select the 'Submit motif' tab. TOMTOM is in the same suite of tools as the MEME motif discovery tool, and so they are synced up. This means that we can easily submit this top motif to the TOMTOM tool to find its matching TFs. Select 'TOMTOM' under 'Submit to program' and then click submit. You will be taken to the TOMTOM submission page, with 'submitted motifs' already chosen for you under the 'Input query motifs' section. Select 'JASPAR (NON-REDUNDANT) DNA' under the 'Select target motifs' section, and make sure the box is checked under 'run immediately', then click the 'start search' button. This should take you to the results page fairly quickly.

How many matches did TOMTOM return?

What is the name and p-value of the top match?

In the box below, provide an "in your own words" interpretation of the p-value for the top match. What's the null hypothesis? What's the alternative hypothesis? Once you have something, check it with your table. Once your table has converged on an answer, raise your hands and dicsuss the answer with one of the course staff members.

Your interpretation:

Your table's interpretation:

Do you notice any sort of pattern in the top 5 or so matches?

For this homework, we are going to do a similar analysis using MEME and TOMTOM, but this time using the file called 'promoter_seqs.fasta' in the inclass_data/ folder of this assignment. This file contains promoter sequences from several co-expressed genes, so these different promoters may contain different TF motifs. You should submit this file now so that it will be done processing by the time you get to the actual homework questions.

Submit this file the same way you did before (make sure to provide an email if you think you might lose the link because this analysis will take a while), but make sure to select 'any number of repetitions' under the 'Select the site distribution' section, and lower the maximum motif width to 20 under the advanced options.

2. Gene list promoter motif analysis

Another analysis we might want to do is to find the over-represented motifs for a list of co-expressed or otherwise related genes. To practice with this, we will be using a set of genes from an RNA sequencing study of a mouse model of ALS, where a disease-related protein called TDP-43 was mutated to localize to the cytosol instead of the nucleus (http://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0141836). In the inclass_data/ folder, there are two files you will need for this analysis: tdp_downreg_genes.txt, which contains the 88 genes that were significantly downregulated by at least 2-fold when the protein was mutated, and all_tdp_genes.txt, which contains the 10,601 genes that met the minimum read count threshold in this experiment. You will probably want to download these files to your own computer so that you can upload them easily without having to copy and paste 10,000 genes. We would like to determine whether there is any significant motif enrichment in these genes, so we will use the oPOSSUM tool, which was described in the prelab lecture.

Go to the site at http://opossum.cisreg.ca/oPOSSUM3/, and select Human under 'Single site analysis (SSA)', in order to detect over-represented conserved transcription factor binding sites in these sequences. Upload the 'tdp_downreg_genes.txt' file as the target genes and the 'all_tdp_genes.txt' file as the background gene list, and then run the analysis using the default settings (using JASPAR CORE profiles). This may take a few minutes to finish running.

Once you have the results, click the column marked 'Fisher score' in order to sort by that column. This column contains the negative natural logarithm of the p-value for the Fisher's exact test-based enrichment. In other words, the higher, the better.

What is the transcription factor with the most significant overrepresentation based on the Fisher score?

What is its Fisher score, and what p-value does this correspond to (you can use exp(-X) in R, where X is the Fisher score)?

Provide your interpretation of this p-value in the box below (hint: what is the null hypothesis, and what is the alternative hypothesis?).

How many of the target genes is this TF predicted to interact with? (Use the 'Target gene hits' column)

Now sort the list by Z-score. What are the TFs with the highest and lowest Z-scores?

Navigate to the help page at http://opossum.cisreg.ca/oPOSSUM3/help.html. Using the statistical analysis section, can you explain the differences between the Z-score test and the Fisher score test?

How might these differences explain the difference between the TFs prioritized by the two different scores?

Homework problems: matching motifs to TFs

For the first section of this homework, we are going to use the results of the MEME analysis of the promoter_seqs.fasta file.

Question 1: What is the consensus sequence for the best motif found by MEME? What is its E-value? </b>(1 point)</b>

Question 2: Now, follow the same procedure as what we did for the in-class assignment to submit this top motif to TOMTOM, using the JASPAR database. What's the name of the closest matching motif? (1 point)

Question 3: What is the consensus sequence for this motif (either degenerate or exact)? How does this compare to the motif found by MEME? (1 point)

Question 4: Now let's run the top motif through JASPAR ourselves, for comparison. Use the top motif from MEME, but instead of submitting it directly to TOMTOM, download the probability matrix. (Under the 'download motif' tab). To submit this to JASPAR, you will have to transpose this matrix: the format MEME provides has the rows as the positions and the columns as each nucleotide, while JASPAR requires the rows to be nucleotides and the columns to be positions. There are many ways to accomplish this: one easy way is to download the PWM, read it into R, and use the t() command to transpose it. Then, you can write that result out as a table and copy and paste it to get it into JASPAR. If you do this, make sure to include the following arguments to write.table: "quote=F, row.names=F, col.names=F". Paste the table you are using here:

Now upload this PWM to JASPAR. What is the name of the top matched TF for this motif? How does it compare to the one we found using TOMTOM? (1 point)

Question 5: Now compare the other matches in JASPAR and TOMTOM. List the second and third best matches in each database. Can you interpret why these aren't the same? (1 point)

For the rest of the homework, we will perform another oPOSSUM analysis, but using another dataset based on microarray results from a colon cancer study. In the inclass_data folder, the file 'cancer_gene_list.txt' contains a list of genes that were upregulated in colon cancer. Upload this gene list to the oPOSSUM tool, this time just using the 'all 24,752 genes in the oPOSSUM database' as the background, and run a similar analysis to what we did for the in-class data.

Question 6: What is the most significantly overrepresented transcription factor by Fisher score? (1 point)

Question 7: What is the Fisher score and corresponding p-value for this TF, and what is your interpretation of this p-value? (1 point)

Question 8: How many genes in the list is this TF predicted to interact with? (1 point)

Question 9: Click on the entry in the JASPAR ID column to go to the JASPAR page for this TF. Looking at the sequence motif, which are the most informative positions of this motif and what bases do they contain? (1 point)

Question 10 (short answer): The next most enriched TF by Fisher score interacts with the same number of target genes. However, if you compare which genes they interact with, they are not completely overlapping. Can you look at its JASPAR motif and the class of the two TFs to explain why these are different? (1 point)