Analysis of Gene Expression Data via Arrays using Bioconductor/R - II

Today, this notebook constitutes your in-class activity and homework.

We will pick up from where we left off with the analysis of our gene expression data set from last time.

Q1. Let's recover our pipline from last time.

  • Copy code from the previous in-class set from: Q4, Q7, Q8 and Q9. These are the steps that correspond to:
  1. loading R libraries
  2. loading the file that maps phenotype to cel files (i.e., the .csv file)
  3. storing the list of cel files available
  4. reading the cel files, and (v) making the box plot. (Remember to only include the CEL files of interest here)
  • At the end, you should have a boxplot of the data loaded. We'll pick up here!

In [0]:

Q2. Our next step will be to normalize the intensity data using RMA, and store the normalized data in a variable called "genenorm". Provide and execute your R code.


In [0]:

Q3. Plot a boxplot for the normalized intensity data. Provide and execute your R code.


In [0]:

Q4. 6. Prepare your design matrix for the analysis

  • create list of your treated/untreated sample and store in a variable called "group". Print it to the screen.
  • create the design matrix using the "group" variable, and store in a variable called "design". Print it to the screen.

Provide and execute your R code below.


In [0]:

Q5. Great! We can now finally perform the gene expression analysis.

  • Analyze the data using the R function lmFit(). Use your "design" and "genenorm" variables as arguments. Store result in a new variable called "fit".
  • Apply empirical Bayes correction using the R function eBayes(). Store result a variable called "efit".
  • Obtain the top 100 results from the "efit" variable using the R function topTable(). Sort this by P-value, and store result in variable called "tt"
  • Output the top 5 results to the screen.

Provide and execute your R code below.


In [0]:

Q6. Generate summary outputs for your top results.

  • Output to file the list of your top 100 results variable tt to a file named "mytop100results.txt". Exclude quotation marks, separate the entries by comma, and include row and column names in the output table.

Provide and execute your R code below.


In [0]:

Q7. Make a volcano plot of all of your results from the "efit" variable, using the argument "coef=2" (i.e., so as to plot the association p-values for gene expression, rather than the intercept).

Provide and execute your R code below.


In [0]:

Q9. Some final "challenge" questions about the analysis you performed here.

  • How many probes did you analyze in total? (Hint: look at the eFit object)

In [0]:

Q10. How many probes returned an adjusted p-value less than 0.05?

  • Create a table with all of the results from the "efit" object into a new object called "allTT"
  • TIP: Think through (in human terms) the specific tasks/function calls you will need to do perform the task.
  • Hint: there are three specific tasks/steps, and two functions that will be helpful to you.

Provide and execute your R code below.


In [0]:

Q11. One final item in your pipeline to finish it off complete.

  • run the R function sessionInfo() to record the functions and versions used for your analysis.

Provide and execute your R code below.


In [0]:

Congrats! You have completed part 2. Feel free to continue on to Part 3, so that you can get ahead of the work in class! :)