Prelab: Introduction to R, Continued

Preamble:

R provides a platform with many tools for statistical tasks, including data manipulation, hypothesis testing, graphical display of data, and statistical programming. Many extensions (packages) add more specialized functionality for a variety of tasks. We are going to introduce you to the basics of R; at the end of this document there is a list of resources with further information and more comprehensive coverage of these basics.

Following this exercise, we anticipate that the user will be able to:

Open the R programming environment and run commands interactively
Create and work with basic data types in R, including vectors and tables (data frames)
Run basic functions for statistical computing, like “mean”, “sd”, and “t.test”
Plot data as a scatterplot and add a regression line
Visualize distributions of data by histogram or boxplot
Find additional documentation on any command – e.g. help(mean) or ?mean

To get complete credit for this prelab, execute the commands in your ipynb. Remember to set your kernel to R (SageMath).

Load the genes.table dataset with the following command:



In [0]:

    
# header=TRUE treats the first line of the file as column names
genes <- read.table("genes.table", header=TRUE)
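To see what the header argument does, here is a minimal sketch that writes a tiny table to a temporary file and reads it back. The file contents and the name demo are invented for illustration and are not the genes.table data.

```r
# Write a small whitespace-delimited table to a temporary file (invented values)
tmp <- tempfile(fileext = ".table")
writeLines(c("geneA geneB", "1.0 2.0", "3.0 4.0"), tmp)

# header = TRUE uses the first line as column names
demo <- read.table(tmp, header = TRUE)

str(demo)   # column names and types
head(demo)  # first rows of the table
```

Running str() and head() on a freshly loaded table is a quick way to confirm that the columns were read as expected.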

Attach the data so that we can refer to the columns by column names.



In [0]:

    
attach(genes)
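As an aside, columns can also be reached without attach(). The sketch below uses a small made-up data frame called toy (an assumption for illustration, not the genes data) to show the $ operator and with().

```r
# Hypothetical stand-in for the genes table; the values are invented
toy <- data.frame(geneA = c(1.2, 2.4, 3.1), geneB = c(0.9, 2.2, 3.3))

# Access a column directly with the $ operator
mean_a <- mean(toy$geneA)

# Evaluate an expression using column names, without attaching
mean_b <- with(toy, mean(geneB))

print(mean_a)
print(mean_b)
```

These forms make it explicit which table a column comes from, which can help once more than one data frame is loaded.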

Linear Regression

Are the expression values for geneA and geneB correlated across samples? We can plot the data against each other with this simple command:



In [0]:

    
plot(geneA, geneB)

You will see a plot pop up after you run the above code. The X axis is the first variable in the plot function, which is geneA, and the Y axis is the second variable in the plot function, which is geneB.

Do they look correlated? Let’s find out if there is a significant linear correlation. Fit the regression line with the "lm" (linear model) function:



In [0]:

    
regline <- lm(geneB ~ geneA)

The syntax "geneB ~ geneA" means geneB is the response variable (sometimes also called the dependent variable) and geneA is the explanatory variable (or independent variable). Here we include only one explanatory variable, geneA, which is called simple linear regression. To include additional explanatory variables such as geneC and geneD, we can write "geneB ~ geneA + geneC + geneD" in the lm function.
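The two formula forms described above can be compared side by side. This is a minimal sketch on simulated data: the column names mirror the prelab, but the data frame sim and its values are invented for illustration.

```r
set.seed(1)  # make the simulated values reproducible
n <- 50

# Simulated data; names mirror the prelab but the values are invented
sim <- data.frame(geneA = rnorm(n), geneC = rnorm(n), geneD = rnorm(n))
sim$geneB <- 2 * sim$geneA + 0.5 * sim$geneC + rnorm(n, sd = 0.3)

# Simple regression: one explanatory variable
fit_simple <- lm(geneB ~ geneA, data = sim)

# Multiple regression: three explanatory variables
fit_multi <- lm(geneB ~ geneA + geneC + geneD, data = sim)

print(coef(fit_simple))  # intercept and one slope
print(coef(fit_multi))   # intercept and three slopes
```

Passing data = sim tells lm where to look up the variable names, so the data frame does not need to be attached.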

The lm function output can be added to the plot:



In [0]:

    
plot(geneA,geneB)
abline(regline,col=2)

"col=2" selects color number 2 from R's palette, which is red by default, for drawing the regression line.
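Numeric colors are looked up in the current palette, so it can be worth checking what index 2 maps to in your R session. The sketch below uses made-up x and y values; naming the color explicitly is one way to avoid depending on the palette.

```r
# Colors in the current palette; index 2 is what col=2 refers to
pal <- palette()
print(pal[2])

# Invented points lying exactly on the line y = 1 + 2x
x <- 1:10
y <- 2 * x + 1
plot(x, y)

# Draw a line with intercept 1 and slope 2, naming the color explicitly
abline(a = 1, b = 2, col = "red")
```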

We can check to see if there is a significant correlation between geneA and geneB with this command:



In [0]:

    
summary(regline)

The output includes a table of regression coefficients with a p-value for each variable, along with an F statistic for the overall significance of the model. If any of these statistics are unfamiliar, refer to the statistics lecture.
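The numbers printed by summary() can also be pulled out as values. The following is a sketch on simulated data (x, y, and fit are invented names, not part of the prelab) showing where the coefficient table, R-squared, and F statistic live in the summary object.

```r
set.seed(2)  # reproducible simulated data
x <- rnorm(40)
y <- 1.5 * x + rnorm(40)

fit <- lm(y ~ x)
s <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(s)
slope_p <- coefs["x", "Pr(>|t|)"]

# Proportion of variance explained
r2 <- s$r.squared

# F statistic (value, numerator df, denominator df) and its p-value
f <- s$fstatistic
overall_p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

print(coefs)
print(r2)
print(overall_p)
```

With a single explanatory variable, the p-value for the slope and the p-value for the overall F test agree, and R-squared equals the squared correlation between x and y.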

Plotting With R

One of the most useful things R can do for you is to help visualize your data. There are far too many plotting options to cover here in a meaningful way, so we will look at just a few examples. Try out the following:

Display data in a boxplot.



In [0]:

    
boxplot(geneA)

Each boxplot represents the expression values of one gene. The line in the middle of the box indicates the median. Several genes can be drawn side by side in one plot:



In [0]:

    
boxplot(geneA,geneB,geneC,geneD)
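To see the numbers behind a boxplot, the base function boxplot.stats returns the values that are drawn. This sketch uses a short invented vector, vals, rather than the gene data.

```r
# Invented values; 30 is far from the rest of the data
vals <- c(2, 4, 4, 5, 7, 9, 30)

bs <- boxplot.stats(vals)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

# Points drawn individually beyond the whiskers
print(bs$out)
```

The third element of bs$stats is the median, which is the line in the middle of the box.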

Now let’s plot the distribution of each gene's expression values across the samples.



In [0]:

    
hist(geneA, col="green")
hist(geneB, col="red")
hist(geneC, col="black")
hist(geneD, col="blue")

You can see a bell-shaped distribution for geneA and geneB, but not for geneC or geneD. It is important to check the distribution of the data before applying a t-test, because the t-test assumes the data are normally distributed. There are statistical methods that rigorously test whether data are normally distributed. In practice, however, it is usually OK to plot the data and check whether the distribution looks bell-shaped.
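One commonly used formal check is the Shapiro-Wilk test, available in base R as shapiro.test, often paired with a normal Q-Q plot. The sketch below runs both on simulated data; the vectors bell and skewed are invented for illustration and are not the gene data.

```r
set.seed(3)  # reproducible simulated data
bell <- rnorm(100, mean = 10, sd = 2)  # drawn from a normal distribution
skewed <- rexp(100, rate = 1)          # drawn from a right-skewed distribution

# Shapiro-Wilk test: a small p-value is evidence against normality
sw_bell <- shapiro.test(bell)
sw_skew <- shapiro.test(skewed)
print(sw_bell$p.value)
print(sw_skew$p.value)

# Q-Q plot: points far from the reference line suggest non-normal data
qqnorm(skewed)
qqline(skewed)
```

A formal test and a plot complement each other: the test gives a p-value, while the plot shows how the data depart from a bell shape.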