Analysis of Gene Expression Data via Arrays using Bioconductor/R - I

This notebook is today's in-class activity and homework. Over the next 3 days, you will construct your own gene expression analysis pipeline, using tools available in R and data available from the Gene Expression Omnibus (GEO):

https://www.ncbi.nlm.nih.gov/geo/

Let's take a look at a new data set to analyze: GSE35961

Q1. Describe, in your own words, the treatment groups, the number of samples, and the subjects that were characterized in this experiment.

Q2. What is the citation for the paper attached to this data set?

Q3. What is NASH? What is metformin? What hypothesis was this experiment designed to test?

Let's use R to download this data set, then use UNIX to prepare the associated input files and organize our directory.

Q4. Using R, load the libraries we will use for our analysis here. Provide/Execute your code below.


In [0]:

Q5. Using R, download the data set GSE35961 from GEO. Provide/Execute your code below.


In [0]:

Q6. Now, we need to process the data that we downloaded. Using a UNIX terminal, perform the following tasks. Provide your commands below, but execute them in a terminal.

  • Navigate to the newly created directory GSE35961
  • Expand the GSE35961_RAW.tar archive [UNIX]
  • Uncompress all of the .gz files [UNIX]
  • Delete the GSE35961_RAW.tar file [UNIX]
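If you would like to rehearse these steps before touching the real download, the sketch below runs them on a throwaway archive in a scratch directory. The names demo_RAW.tar, sample_A.CEL and sample_B.CEL are made up for illustration; they stand in for GSE35961_RAW.tar and its contents, and nothing here reads or changes your actual data.

```shell
#!/bin/sh
# Dry run of the expand / uncompress / delete steps on a stand-in archive.
# All file names below are hypothetical placeholders.
set -eu

workdir=$(mktemp -d)          # scratch directory standing in for GSE35961/
cd "$workdir"

# Build a tiny stand-in archive: two gzip-compressed dummy files in a tar.
printf 'dummy A\n' > sample_A.CEL
printf 'dummy B\n' > sample_B.CEL
gzip sample_A.CEL sample_B.CEL
tar -cf demo_RAW.tar sample_A.CEL.gz sample_B.CEL.gz
rm sample_A.CEL.gz sample_B.CEL.gz

# The three processing steps from the list above:
tar -xf demo_RAW.tar          # expand the archive
gunzip ./*.gz                 # uncompress every .gz file
rm demo_RAW.tar               # delete the archive

ls                            # list what is left in the directory
```

On the real data, the same three commands would be run from inside the GSE35961 directory with GSE35961_RAW.tar in place of demo_RAW.tar.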

Q7. We need to prepare our phenotype file for analysis. For this assignment, we will compare the NASH samples to the NASH samples treated with metformin.

  • In UNIX, prepare a phenotype.csv file which references the appropriate CEL files for analysis (use the example phenotype file from the prelab as a reference).

How many samples are there?
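As a rough illustration of the idea, the sketch below writes a small phenotype file from the terminal and counts its samples. The two-column layout (FileName, Group), the file names, and the group labels are all assumptions made for this example; the prelab phenotype file defines the layout you should actually follow, and the real CEL file names come from your GSE35961 directory.

```shell
#!/bin/sh
# Sketch: write a phenotype file with a heredoc, then count its samples.
# Column names, file names, and group labels are hypothetical placeholders.
set -eu

workdir=$(mktemp -d)          # scratch directory so nothing real is overwritten
cd "$workdir"

# One header line, then one row per CEL file.
cat > phenotype.csv <<'EOF'
FileName,Group
sample_A.CEL,NASH
sample_B.CEL,NASH
sample_C.CEL,NASH_metformin
sample_D.CEL,NASH_metformin
EOF

# Number of samples = number of lines after the header.
n_samples=$(tail -n +2 phenotype.csv | wc -l | tr -d ' ')
echo "samples: $n_samples"

# Samples per group, taken from the second column.
tail -n +2 phenotype.csv | cut -d, -f2 | sort | uniq -c
```

With the four placeholder rows above, the first count printed is "samples: 4". Counting rows this way is a quick check that the file lists exactly the samples you intend to compare.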


In [ ]:

Then, using R:

  • load this phenotype file into R and store it in a variable called phenoData
  • return a summary of the variable phenoData using the pData command

Provide and execute your R code below.


In [0]:

There should be only 8 files!!!

Q8. Next, we need to load our CEL file data into our R pipeline.

  • read a list of CEL files into an object called "celFileList" (using list.celfiles(), where you give the name of the directory that contains the CEL files)
  • create a new variable called "celFiles" which contains only the CEL files that you want to analyze (i.e., those present in the phenoData variable from Q7, above) using brackets [] on celFileList
  • report the contents of the celFiles variable using print()
  • read intensity data from the list of celFiles into a variable called "affyRaw" using read.celfiles()

Provide and execute your R code below.


In [0]:

Q9. Great! Now let's take a look at the data.

  • Create a boxplot of the intensity data for all CEL files loaded.

Provide and execute your R code below.


In [0]:

Q10. Next, create a histogram of the intensity data for all CEL files loaded. Provide and execute your R code below.


In [0]:

Congrats! You have completed part 1. Feel free to continue on to Part 2, so that you can get ahead of the work in class! :)