In-class: Reproducible Computational Research

NOTE: Make sure your kernel is set to "R (SageMath)" so you can test these commands in your ipynb! (Kernel > Change kernel)

Today, you we will be covering several practical principles and guidelines to making your code reproducible.

  1. Making your code clear: Comments, variable names.
  2. Versioning and reusing code: Creating a new notebook
  3. Notebook documentation: Who, what, where, why, when.
  4. A little automation: Making your code "generic"

Your homework will build upon these, to practice making a new pipeline (and writing it so that is clear to you).

Part 1: Making your code clear

When writing code (or an analysis pipeline), it is essential that each part of what your pipeline is doing is clearly documented and understandable to someone who is knowledgable enough to read code.

The best way that I think about this is the "Future Self: 3 month from now" principle. You will frequently find yourself in a position where you have generated what you think is a final result (for a scientific paper), only to realize 3 months later that you need to do something else (tweak the code, reanalyze data again, etc.).

It is precisely this situation that leaving yourself 'clues' as to what you were doing (and thinking) at the time you were writing particular bits of code really help you quickly get back up to speed on what you were previously doing (hours instead of days; minutes, instead of hours), but also cofidence that you can reproduce exactly what you had done previously.

In all programming languages, you can add "comments" your code. In R, which we will focus on today, that can be done using the # symbol.

For every block of code in R, it is very good practice to have comments describing what it does. For instance:


In [0]:
# Prints the next even number after a
a <- 9
if(a %% 2 == 0) {
    print(a + 2)
} else { 
    print(a + 1) }

1. Now, look at the following 3 blocks of R code. For each block:

a. Figure out what the following code blocks are doing, and add a comment describing, in human terms, what it is doing.

b. In addition, rename any variables with sub-optimal names (i.e., those we named "badvariable")

Variables names should be short and descriptive when possible. For instance, a variable that contains a list of prime numbers could be called primes. However, if you have a variable that just stores the value 5, something simple, like "a" would be fine. In order to figure out what the code is doing, run the code blocks. If needed, you can replace the "Hello Students" strings or x value with whatever you think would be useful in helping you figure it out.


In [0]:
w = "Hello Students" 
s <- strsplit(w, " ")
substr(s[[1]], 1, 2)

In [0]:
w = "Hello Students"
badlynamedvariable1 <- c("a", "e", "i", "o", "u")
badlynamedvariable2 <- tolower(strsplit(w, "")[[1]])
badlynamedvariable3 <- badlynamedvariable2[badlynamedvariable2 %in% badlynamedvariable1]
print(badlynamedvariable3)

In [0]:
x <- 18
nums <- c(5, 10, 16, 19, 11)
badlynamedvariable4 <- abs(nums - x)
which.min(badlynamedvariable4)

2a. I have provided you a list of numbers in the variable "mylist". Write some code in R that calculates the sample standard deviation, by hand:

x_i: each i elements in your list

xbar: the mean of your list

N: the size of your list

After each block of code that you write, use code comments to document what is being done.

2b. Savvy users will note that R has a built in function mean() and sd(). Use this function to check your work (with a comment stating as such!)


In [0]:
# We're giving you a list of numbers. Make sure to run this cell to have mylist available to you.
mylist <- c(71, 27, 363, 12, 3, 976)

In [0]:

Part 2: Versioning and reusing code: Creating a new notebook

In class, you have been working with notebooks that we have been providing you.

But in reality, you will have to make notebooks entirely on your own, ones that execute the code that you want to run. Literally, a "blank slate". In this section, we'll walk you through the process of creating your own notebook, and good practices there.

3a. Using CoCalc, create a new notebook. You can do this by clicking on the "+ create" button, name your workbook "my_new_notebook", and then click the "Jupyter Notebook" button.

3b. We need to change a setting so that the code you input is recognized as R code, and processed by the notebook accordingly. To do this, select the "kernel" menu, then the option "change kernel" and select "R (SageMath)".

Your notebook consists of "cells" in which you can input information. These cells have specific types: For example, they can be set to process basic text ("Markdown"), or be set to execute computer code ("Code").

3c. Change the first cell type to "Markdown". You can do this by click on the cell (so that the cursor is visible), then select "Markdown" from the drop-down box (found between the circular refresh and keyboard icons).

Part 3: Notebook Documentation

Each time you create a new notebook, computational pipeline, or code for analysis, it is important that you include the following details at the start of the document - the "header". There, you should provide:

  • A title for the workbook (ideally, the same name as the file)
  • Who wrote the document/code
  • When they created it
  • When it was last modified
  • What the code is designed to do, the objectives, and rationale (why)
  • List any data file that the code and pipeline was designed to use, or analyze.

4a. As the name of our first notebook implies, let's imagine that we are writing this book to calculate a mean and sample standard deviation for the list that we provided above. In cell that you just set to Markdown, add details into the document (title, your name, when created, the above information) to your notebook.

4b. Next, create a new cell. You can do this by using "Insert" menu, and "Insert cell below". Remember, you want to makes sure you add comments where you think they are appropriate. For example, what analysis are you trying to do? What results do you get and how you interpret them?

There is no right way to organize or comment your code in jupyter notebooks, as long as it is easy to read and descriptive!

4c. Finally, copy the code you wrote in 2b into this new cell. Execute the code to make sure it works!

Part 4: A little bit of automation

Ok, so now you have code that will calculate a mean and sd for a given list of numbers. That's cool. But let's make it cooler.

5. Suppose that instead, your data is located in a file called "mylist.csv", a file that is separated by commas. How would you read that data into R, into a variable called "mylist"? Write your R code below.

Hint: Use the as.numeric() to convert the contents of the data in mylist to "numbers", e.g.

mylist <- as.numeric( read.table( ... ) )


In [0]:

6. Based on this, add a new block of code in your new notebook to calculate the mean and sample standard deviation from the file "mylist.csv". Create a new cell ("Code") to achieve this task.

Be sure to add R comments to your new code!

Sometimes, you will want to run the same bit of code on a couple of data files. In the code we have written, this isn't very obviously amenable to this.

What we'd like to do is start by providing the file name, and then the code will "run" using data contained in the file. That way, all we'd have to do is change the file name (at the start of the program). Or eventually, read the file from the command line (in UNIX).

7. Modify your code in your file in the following way:

a. Copy your code into a new cell.

b. Create a new variable called "myfile", and assign to it the name of your input file "mylist.csv".

c. Change the read.table() function to refer to the variable myfile, rather than the name of the file.

d. Execute your code to make sure it works!

Homework assignment

In the following homework assignment, imagine you are working on a gene expression dataset. You want to do some analysis with this dataset and publish a paper. To make sure someone else can reproduce your analysis and results, you want to generate a report with the code and the output results. You will do so using a jupyter notebook.

In this HW, you will use the programming language R, as you have been above.

You will be doing a basic analysis of the genes.table dataset.

1. Open the jupyter notebook you created in class ("my_new_notebook.ipynb"). Add a new section called "Homework", and include an appropriate header. Add your code and answers to the questions below to the Homework section. Make sure the kernel is set to R (SageMath)

2: Create "chunks" (i.e., new cells) of R code that perform the following tasks.

a. First, read in the genes.table file and stored into a variable. Use the head() function to print out the first few rows of the table and check if you correctly load the table.

What are the columns? Write your answer in a markdown cell in your new notebook.

3. Using the hist() function, create a new cell and write code to plot the distribution of abundance for each gene. Outside the code chunk, use comments to interpret what you observe from these plots. (For example, do they look normally distributed?) If you wanted to test for differences in expression between genes, which tests would you use for which comparisons? Why? (Hint: What does the t-test assume? What is an alternative to a t-test?). Write your answers in a markdown cell in your new notebook.

4. Create a new cell, and use the approprite statistical test to compare the expression patterns of the following pairs: Hint: In R, the functions t.test() and wilcox.test() may be useful to you.

a. geneA vs geneB

b. geneA vs geneC

c. geneB vs geneC

d. geneB vs geneD

Outside the code chunks, write down what statistical test you use in the analysis and what results you get. Write your answers in a markdown cell in your new notebook.

5a. Now, make sure you hit the save button on your jupyter notebook!

5b. Download your ipython notebook file as an html file. To accomplish this, go to the "File" menu and select "Download as", and select HTML.

Open in up in a web browser. As you can see, it looks very similar to the code in the jupyter notebook, except that you can not edit it.

5c. Next, upload the HTML back into your directory. .html files can be opened by any web browser. Therefore, this html file can be useful for sharing your results with anyone who does not have jupyter installed on their computer. You can also download jupyter notebooks as .pdf files if you have jupyter installed on your own computer. However, the pdf convertor does not always work on CoCalc.

6. Finally, create a new cell, and in R, call the sessionInfo() command. This command gives you a snapshot of the versions of libraries and various tools that you have installed (and run). Useful for reproducibility!