1. Preamble:
Thus far you've been running a lot of command line tools. These are programs that are already written. Now that you're becoming comfortable with this, it's time to take on a new challenge. You've got your yellow belt.
Now let's earn that orange belt! You can also write programs yourself! With R, we're going to be writing programs that perform a few steps to produce an output.
R provides a platform containing many tools to perform statistical tasks including the manipulation, hypothesis testing, graphical display of data, as well as statistical programming. There are many extensions (packages) for R which can add more specialized functionality for a variety of tasks. We are going to introduce you to the basics of R, and at the end of this document there is a list of resources for further information and more comprehensive coverage of these basics.
Following this lecture note, we anticipate that the user will be able to:
To get complete credit for this prelab, execute the code cells in all examples below. Feel free to add your own code to try out more commands.
2. Using R in CoCalc
We will be using R in iPython notebooks directly on CoCalc. The kernel of your ipynb should be set to "R (SageMath)". If your kernel is set, you can execute the code in a cell by clicking the cell and then clicking "run cell" or typing Shift+Enter. To make sure R is working in your ipynb, run the code block below. The answer should appear below the cell.
In [0]:
3*4
You can also use R on your personal computer. Go to https://cran.rstudio.com/ and find the right version for your computer (Linux, Mac or Windows).
3. Basic Usage
At its simplest level, R can function as a handy calculator. Try some basic arithmetic operations e.g.
In [0]:
7*12+2
R can also hold information in variables; try assigning the result to a variable ‘x’ by using "<-" or "=" (either way is OK, but <- is preferred):
In [0]:
x <- 7*12+2
x = 7*12+2
The variable ‘x’ now contains the result of that operation, so we can just refer to the variable to retrieve it:
In [0]:
x
To see all the variables currently in use, try the ls() command.
In [0]:
ls()
By using vectors, we can work with many numbers at once. In R, the simplest way to create a vector is to use "c()" command. Try creating a vector, in this case think of the vector as a list of numbers:
In [0]:
myvec <- c(1,8,3,2,6,4,5,7)
myvec
For consecutive values such as a vector of 1,2,3,4, we can use one of the three ways:
In [0]:
myvec <- c(1,2,3,4,5,6,7,8)
myvec <- seq(1,8)
myvec <- 1:8
myvec
The seq function generates sequences of numbers given a starting and stopping point. For example, seq(1,8) is to generate a sequence numbers (a vector) from 1 to 8. For more information try:
In [0]:
help(seq)
We can use the index to take out elements from the vector with "[]". For example, if we want the second element of a vector:
In [0]:
myvec[2]
If we want more than one elements, we can use a vector of index. For example, if we want the first and third elements:
In [0]:
myvec[c(1,3)]
If we want the first,second and third elements:
In [0]:
myvec[c(1,2,3)]
In [0]:
myvec[1:3]
We can use conditions to take out elements. For example, we want elements that larger than 5:
In [0]:
myvec>5
In [0]:
myvec[myvec>5]
Here are the logical operators in R:
| operator | function |
|---|---|
| < | less than |
| <= | less than or equal to |
| > | greater than |
| >= | greater than or equal to |
| == | equals |
| != | does not equal |
| !x | not x |
| x | y | x or y |
| x & y | x and y |
| isTRUE(x) | test if x is true |
R is very useful because it has lots of functions for statistical computing. For example, if we want to calculate the mean value of a numeric vector, we can use the "mean()" function:
In [0]:
mean(myvec)
We can calculate the stadard deviation of a numeric vector with "sd()" function:
In [0]:
sd(myvec)
We can calculate the sum of a numeric vector with "sum()" function:
In [0]:
sum(myvec)
We can find the max of min of a numeric vector with "max()" or "min()" function:
In [0]:
max(myvec)
In [0]:
min(myvec)
A very useful function to calculate the summary statistic is "summary()". It calculates min, max, mean and quantiles of a numeric vector.
In [0]:
summary(myvec)
Besides vector, another very useful data type in R is matrix. Vector is one dimensional, but matrix is two dimensional. To create a 3x3 matrix (three rows and three columns):
In [0]:
y <- matrix(1:9,ncol=3)
y
Similarly, we can take out elements we want by using the index method with "[]". Since matrix has row index and column index, we need to specify them separately.
Take out the element in the first row and first column:
In [0]:
y[1,1]
Take out the elements in the second column:
In [0]:
y[,2]
In the above examples, the first number in "[]" is the index number for rows and the second is for columns. If the number is not specified (empty), all rows or columns will be chosen. For example, in y[,2], the row index is empty, so all rows are chosen.
4. Getting help
There are so many functions in R, we can not remember every detail about the functions. If we aren’t sure from the name of the function or the output, we can try looking at help(x) or simply ?x, where x is the function name of what you are interested in knowing more about.
In [0]:
help(mean)
In [0]:
?mean
We will see a help page describing the mean function. At the bottom of the page, there are some examples of the function. If you don't know how to use this function, copy the example code and modify it to fit your purpose.
5. Loading files into R
Now let’s consider a real dataset. A common use for R is to analyze gene expression data. There are many sophisticated tools for the analysis of gene expression data in the Bioconductor packages for R, but for our purposes we will just be looking at a small example. Start by loading the data into a variable called ‘genes’ (this file is in the directory for this prelab).
In [0]:
genes <- read.table("genes.table", header=TRUE)
The parameter of "header=TRUE" tells R to read the first line of the data as the header.
Take a look at the data:
In [0]:
genes
The line with "geneA geneB ..." is the header.
To see the dimensions of this genes matrix, use:
In [0]:
dim(genes)
The dim() function returns two numbers: number of rows and number of columns in the data. In our case, there are 500 rows(samples) and 4 columns (genes).
To take a look at the summary statistic about this data, try out the summary command:
In [0]:
summary(genes)
The output is the summary statistic for each gene.
Attach this data frame to the R search path to make referring to the individual columns easier:
In [0]:
attach(genes)
Then, we can take out the geneA column from the data matrix by simply typing the name of that column "geneA". Otherwise, we have to use the method that we mentioned before to take out certain columns with "[]".
Suppose we want to get the expression value for geneA from sample number 72. This can be accessed this way:
In [0]:
geneA[72]
Suppose we want to get consecutive expression values for geneA through geneC, from sample number 72 to 82. This can be accessed this way:
In [0]:
genes[72:82,1:3]
6. T-test
In the following exercise, we assume you have basic knowledge about hypothesis testing. If don't have such knowledge, please watch the pre-recorded stats lecture, which is the pre-lab for the first lecture.
Next, we will perform a T-test to see if the true mean of the distribution of values for geneA is different from 0. Before doing the T-test, it is important to know that T-test has an assumpution: the data must follow a normal distribution. Otherwise, the T-test is not valid. The simple way to check if the data is normally distributed or not is to plot them. We will learn how to plot in R in the next lecture. For now, let's assume the data are normally distributed and it is safe to use the t-test on them.
In this case, the null hypothesis of the T-test is that the true mean of the distribution of values for geneA is 0 and alternative hypothesis is that the true mean of the distribution of values for geneA is not 0. If the test statistic is very large or the p value is very small, this indicates that the null hypothesis should be rejected. In other words, the true mean for geneA is not 0. If the test statistic is small or the p value is large, it indicates that we don't have enough evidence to reject the null hypothesis. In other words, the true mean for geneA is 0.
The following command performs the above test:
In [0]:
t.test(geneA, mu = 0)
From the output, you will see the name of the test is "One Sample t-test". The t-statistic is "t = -1.423" and the p value is "p-value = 0.1554". The alternative hypothesis and 95% confidence interval is also included in the output. Since the p value is relatively large (>0.05) or the 95% confidence interval for the estimate of the mean includes 0, we don't have enough evidence to reject the null hypothesis and thus conclude that the true mean of geneA is 0.
In the above example, we did a one-sample T-test to compare the mean value of geneA to 0. We can also perform a two-sample T-test to compare the mean value of two genes. For example, we can compare the mean value of geneA with that of geneB. In thie case, the null hypothesis of the two-sample T-test is that the true mean of the distribution of values for geneA is the same as that for geneB. The alternative hypothesis is that the true mean of the distribution of values for geneA is not the same as that for geneB. If the test statistic is very large or the p value is very small, it indicates that the null hypothesis should be rejected. In other words, the true mean for geneA is not the same as the true mean for geneB. If the test statistic is small or the p value is large, it indicates that we don't have enough evidence to reject the null hypothesis. In other words, the true mean for geneA is the same as the true mean for geneB.
In [0]:
t.test(geneA, geneB)
The output is similar to one-sample T-test. You can see the name of the test is "Two Sample t-test". The t-statistic is "t = 0.20203" and the p value is "p-value = 0.8399". The alternative hypothesis and 95% confidence interval is also included in the output. Since the p value is relatively large (>0.05) or the 95% confidence interval includes 0, we don't have enough evidence to reject the null hypothesis and thus conclude that the true mean of geneA and geneB are the same.
7. Appendix: some commonly used functions in R
| function | comments |
|---|---|
| read.table("filename") | reads file and creates data frame from contents |
| read.csv("filename") | read comma-delimited files |
| paste | convert vectors to character and concatenate |
| strsplit(x,split) | split character string into substring |
| na.omit(x) | ignore observations with missing data |
| seq(a,b,by=i) | creates sequence from a to b in increments of i |
| rep(x,t) | replicate x t times |
| data.frame() | create data frame from input |
| rbind() | combine by rows |
| cbind() | combine by columns |
| x[n] | nth element of x |
| x[1:n] | first n elements of x |
| x[x > 1 & x <= 4] | elements of x within (1,4] |
| x[i,j] | element at row i, column j |
| methods(as) | view element types and how to convert |
| class(x) | get the class of x |
| length(x) | number of elements in x |
| unique(x) | unique elements of x |
| max(x), min(x) | maximum or minimum of x |
| var(x), sd(x) | variance, standard deviation of elements of x |
| sum(x), diff(x), prod(x) | sum, difference, product of elements of x |
| summary(x) | statistical summary of x |
| log(x, b) | logarithm of x with base b |
| plot(x,y) | scatter plot of x on x axis, Y on y axis |
| hist(x) | histogram of x |
| boxplot(x) | box-and-whiskers plot |
| points(x,y) | adds points to plot |
| lines(x,y) | add lines to plot |
| text(x,y,labels) | adds text to plot at coordinates x,y |
| abline(x) | draws line given by x |
| lm(formula) | fit linear model |
| help.search("test") | view statistical tests |
| rnorm(n,mean=0,sd=1) | normal distribution; replace r with d (density), p (cum. prob. fun.), or q (quantile) |