Analysis of Gene Expression Data via Arrays using Bioconductor/R - III

Today, this notebook constitutes your in-class activity and homework.

We will pick up from where we left off with the analysis of our gene expression data set from last time.

We left off having performed our differential gene-expression analysis. However, our results were in terms of "probes" rather than genes; we'd like to be able to know what genes are actually of interest in our analysis. Today, we'll connect our analysis to annotation data so that we can obtain this information.

Q1. Let's recover our pipeline from last time.

Copy code from the in-class assignment Gene Expr. II, from: Q1. These are the steps that correspond to: (i) loading R libraries, (ii) loading the file that maps phenotype to cel files (i.e., the .csv file), (iii) storing the list of cel files available, (iv) reading the cel files.
Copy code from the in-class assignment Gene Expr. II, from: Q2 and Q3. These steps correspond to (i) the normalization step, and (ii) creating a box-plot of the normalized array data.
Copy code from the in-class assignment Gene Expr. II, from: Q4 and Q5. These steps correspond to (i) the group/design analysis matrices, (ii) the analysis, (iii) the empirical Bayes correction, (iv) the top table construction, and (v) print the top five results to screen.



In [0]:

Q2. Load the annotation library package biomaRt in R. Provide and execute your R code below.



In [0]:

Q3. OK, now let's obtain some annotation information

create a variable called "mymart", which loads the appropriate library from the ensembl database for mouse.
obtain a list of the affy probe ids for the top 100 results variable "tt" above, stored in a variable called pidsTophits
print the first 5 probe entries from pidsTophits

Provide and execute your R code below.



In [0]:

Q4. Continue obtaining annotation information. Note that getting all of the annotations will take some time!

obtain the following annotations from ensembl stored in a new variable called "annot" for your list of affy probe ids:

  probeids (for the array technology used for this experiment, affy_mouse430_2)
  chromosome
  start position
  end position
  mgi gene symbol (mouse)
  entrez gene id
  Description

print out to screen for the first 5 entries in your table: probeid, chr, start, end, mgi, and description

Provide and execute your R code below.



In [0]:

Q5. Output your annotations to file that reports everything but the description, without quotes, and separated by commas (useful for data parsing) to a file called tt-annot-forparse.csv

Provide and execute your R code below.



In [0]:

Q5. Now that we have annotations, we need to merge them together with the table that contained our association information. However, notice that the annot matrix lists the probe ids in a different order than the statistical association table.

Create a table called "tt_ids" which includes data from the "tt" variable with the variable "pidsTophits", using the R function cbind(). Store into a new variable called tt_ids.
Rename the column containing pidTophits to affyprobeid in the variable tt_ids (hint: colnames).
print the first five entries of tt_ids to the screen.

Provide and execute your R code below.



In [0]:

Q6. Now, we need to prepare and merge the annot file with tt_ids:

Rename the first column in the variable "annot" - the one that contains your probeid to "affyprobeid" using the R function colnames().
output the first 5 entries in the variable "annot" to ensure that the name has been changed
Use the R function merge() to aggregate the variable "tt_ids" with "annot", using affyprobeid as the key. Save the results into a new table called "tt_plus_annot"

Provide and execute your R code below.



In [0]:

Q7. Let's sort the table to get the most significant results to the top of the dable.

use the R function order() to sort the the table by the key "adj.P.Val". Store the sorted table to a new variable called "tt_annot_s"
output the results of the first 5 row of "tt_annot_s"

Provide and execute your R code below.



In [0]:

Q8. What are the genes listed in the top 10 results?

The top 20 differentially expressed probes map to the genes: Mbl2, Mafb, Rgs1, Ubd, Pglyrp2, Il7r, Gpnmb, Lpl, Lilra6, and Gm15922.

Q9. How many unique genes are given in the your top results variable tt_annot_s?

TIP: Think through (in human terms) the specific tasks/function calls you will need to do perform the task.
Create a variable called "uniqgenelist" that lists your mgi gene symbols. (Hint: Use the variable "tt_annot_s".)
Double check the list of genes that you have constructed. Do all of the entries have a gene listed?
Create a new variable called "uniqgenelist_filt" that excludes entries that are "NA" or are empty (i.e., "").
Hint #2: length(), unique(), and is.na() functions may be useful to you.
Hint #3: the "!" (she-bang) character computes ths inverse of the function. e.g. "!is.na(list)" means in human terms "which elements in list are NOT NA?"
Hint #4: you can use logical operations (i.e., ==, !=) to match 'empty' entries (i.e., ""), even within []. Try it!

Provide and execute your R code below.



In [0]:

Q10. Save the list of unique genes (filtered to exclude "NA"s) to a file called "mygenelist.txt", no quotes, row names, or column names. Provide and execute your R code below.



In [0]:

Now, let us use WebGestalt to perform some ontology analysis using the unique set of genes created above.

Under "Basic Parameters" choose the appropriate organism, ORA as the method, and geneontology/Biological_process_noRedundant as the functional database.
Copy and paste (or upload) the list of genes in the "Gene List" section
Choose the appropriate reference set from the drop down menu in the "Reference Gene List" section. Hint: What is found in "affyRaw", the variable that stores your array data?
Under "Advanced parameters", choose BH (Benjamini-Hochberg) for multiple test correction and tell it to output the top 10 results. Leave the other options unchanged.
Hit submit.

Q11. What background gene list did you select? Why is this important?

Q12. Does that tool find all of the genes that you provided? Why or why not? How could you fix this?

Q13. Do any of the top 10 associations pass a FDR (false discovery rate) of 0.002 or better? What does "significance level" mean? If yes, do any of these categories make plausible or intuitive sense for the gene expression analysis that you performed?

Q14. Here, we selected the top 100 genes for enrichment analysis. Was this choice warrented? Why or why not? Having performed this analysis, what might you do differently?

Q11. One final item in your pipeline to finish it off: run the R function sessionInfo() to record the functions and versions used for your analysis.

Provide and execute your R code below.



In [0]:

Analysis of Gene Expression Data via Arrays using Bioconductor/R - III

Today, this notebook constitutes your in-class activity and homework.

We will pick up from where we left off with the analysis of our gene expression data set from last time.

Q1. Let's recover our pipeline from last time.

Q2. Load the annotation library package biomaRt in R. Provide and execute your R code below.

Q3. OK, now let's obtain some annotation information

Provide and execute your R code below.

Q4. Continue obtaining annotation information. Note that getting all of the annotations will take some time!

Provide and execute your R code below.

Q5. Output your annotations to file that reports everything but the description, without quotes, and separated by commas (useful for data parsing) to a file called tt-annot-forparse.csv

Provide and execute your R code below.

Q5. Now that we have annotations, we need to merge them together with the table that contained our association information. However, notice that the annot matrix lists the probe ids in a different order than the statistical association table.

Provide and execute your R code below.

Q6. Now, we need to prepare and merge the annot file with tt_ids:

Provide and execute your R code below.

Q7. Let's sort the table to get the most significant results to the top of the dable.

Provide and execute your R code below.

Q8. What are the genes listed in the top 10 results?

Q9. How many unique genes are given in the your top results variable tt_annot_s?

Provide and execute your R code below.

Q10. Save the list of unique genes (filtered to exclude "NA"s) to a file called "mygenelist.txt", no quotes, row names, or column names. Provide and execute your R code below.

Now, let us use WebGestalt to perform some ontology analysis using the unique set of genes created above.

Q11. What background gene list did you select? Why is this important?

Q12. Does that tool find all of the genes that you provided? Why or why not? How could you fix this?

Q13. Do any of the top 10 associations pass a FDR (false discovery rate) of 0.002 or better? What does "significance level" mean? If yes, do any of these categories make plausible or intuitive sense for the gene expression analysis that you performed?

Q14. Here, we selected the top 100 genes for enrichment analysis. Was this choice warrented? Why or why not? Having performed this analysis, what might you do differently?

Q11. One final item in your pipeline to finish it off: run the R function sessionInfo() to record the functions and versions used for your analysis.

Provide and execute your R code below.

Congrats! You have completed your first pipeline! High five yourself and breathe a sigh of relief. :)