Introduction to R

Make sure your kernel is set to "R (SageMath)" so you can test these commands in your ipynb (Kernel > Change kernel). To run a selected cell, press Shift+Enter.

Basic Usage

In the following exercises, write your code in the cells.

Use R to calculate (7+2*6-100)^2



In [0]:

Assign the result of the above calculation to a new variable "newvar".



In [0]:

Create a vector from 5 to 10 with "c()" and assign it to a variable "newvec".



In [0]:

Create the same vector with the "seq()" function and assign it to a variable "newvec2".



In [0]:

Calculate the mean of the vector "newvec"



In [0]:

Calculate the standard deviation of the vector "newvec"



In [0]:

Calculate the summary statistics of the vector "newvec"



In [0]:

Sometimes the data contain outliers, which can strongly influence the mean. The median is robust to outliers and can be used to summarize the data instead. Check the help page of the "median" function and calculate the median value of the vector "newvec".
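To illustrate why the median is considered robust, here is a minimal sketch on made-up numbers (the vector "x" is hypothetical and is not part of the assignment data). A single extreme value shifts the mean a lot but leaves the median where it was.

```r
# Hypothetical measurements; the last value is an outlier.
x <- c(2, 3, 4, 5, 100)

mean_x <- mean(x)      # 22.8, pulled toward the outlier
median_x <- median(x)  # 4, the middle value after sorting

# Help pages can be opened with either of these (commented out here):
# ?median
# help("median")
```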



In [0]:

Then load genes.table, the same file we used in the prelab. It has been copied into this assignment's directory, so you can use the same code as in the prelab to load it.



In [0]:

Use the dim() function to count how many samples and genes are in the data set.



In [0]:

Calculate summary statistics for each gene:



In [0]:

What are the mean values of geneA, geneB, geneC and geneD (read them from the above output)? Write down the values in the following cell.



In [0]:

Get the expression value for the first sample of the first gene in the data matrix.



In [0]:

Get the expression values for all genes from samples 70 and 71.



In [0]:

Attach the data so that you can refer to the individual columns:



In [0]:

Use a one-sample t-test to test whether the true mean of geneB is 0.



In [0]:

What is the value of the above t statistic? What is the 95% confidence interval for the estimate of the mean? Would you reject the null hypothesis based on the value of the t statistic? Write down your answers in the following cell.
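As a rough guide to where these quantities appear in R, the sketch below runs a one-sample t-test on made-up numbers (the vector "x" is hypothetical, not geneB) and pulls the t statistic, confidence interval and p-value out of the result object.

```r
# Hypothetical sample; the values are made up for illustration.
x <- c(0.4, -0.1, 0.9, 0.3, 0.6, 0.2)

# One-sample t-test of the null hypothesis that the true mean is 0.
res <- t.test(x, mu = 0)

t_stat <- res$statistic  # t statistic
ci <- res$conf.int       # confidence interval for the mean (95% by default)
p_val <- res$p.value     # two-sided p-value
```

For a two-sided test at the 5% level, the null hypothesis is rejected when the p-value is below 0.05, which is equivalent to the hypothesized mean falling outside the 95% confidence interval.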

Compare the mean expression of genes A and D using a two-sample t-test. What is your null hypothesis for this test? Describe what t.test tells you, and justify your decision using the output. Put your code in the first cell and your answers in the second cell.



In [0]:



In [0]:

Homework (10 points)

For the following questions, write your code in the cells.

Create a vector with elements 5,6,7,8,9,10 and assign it to a variable "vec"



In [0]:

Calculate vec*10.



In [0]:

You can see that each element in "vec" is multiplied by 10 with this command. Vectorized calculation is a very useful feature of R. With this feature, we don't need to write a "for" loop to do the same calculation, which can take longer to run.
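The comparison between a vectorized expression and a loop can be sketched as follows. The vector "x" and the result names are hypothetical, chosen only for illustration.

```r
# Hypothetical vector, not the one from the exercise.
x <- c(1, 2, 3, 4)

# Vectorized: one expression operates on every element.
doubled_vec <- x * 2

# Equivalent explicit loop, for comparison.
doubled_loop <- numeric(length(x))
for (i in seq_along(x)) {
  doubled_loop[i] <- x[i] * 2
}

same_result <- identical(doubled_vec, doubled_loop)  # TRUE
```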

A robust version of the mean is the trimmed mean: some of the smallest and largest values are dropped before the mean is calculated. In R, the trimmed mean can be calculated with the "trim" parameter. Check the help page of the "mean" function and calculate the mean value with trim=0.1.
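As a small illustration of what "trim" does, the sketch below uses made-up numbers and a different trim value (0.2) than the exercise. "trim" is the fraction of observations removed from each end of the sorted data, so with 10 values and trim = 0.2 the two smallest and two largest are dropped.

```r
# Hypothetical data with one large outlier.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

plain_mean <- mean(x)                # 14.5, inflated by the outlier
trimmed_mean <- mean(x, trim = 0.2)  # 5.5

# The same result computed by hand: sort, drop 2 values from each end.
by_hand <- mean(sort(x)[3:8])        # 5.5
```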



In [0]:

Next, we will use a recently published single-cell RNA-Seq dataset. Load "single_cell_rnaseq_hw1.txt" into a variable named "scdata". Note that the first line of this file is the header.
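The effect of a header line can be seen in a self-contained sketch that writes a tiny made-up table to a temporary file and reads it back. The gene names, the values and the tab separator are assumptions for illustration; the homework file may be delimited differently.

```r
# Write a tiny, made-up table to a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("geneX\tgeneY", "1.5\t2.0", "0.7\t3.1"), tmp)

# header = TRUE tells read.table() to use the first line as column names.
toy <- read.table(tmp, header = TRUE, sep = "\t")

toy_names <- colnames(toy)  # "geneX" "geneY"
toy_dim <- dim(toy)         # 2 rows, 2 columns

unlink(tmp)  # clean up the temporary file
```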



In [0]:

Use "[]" to get the first 5 rows and all columns of "scdata". Note that the columns are genes and rows are samples.
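As a reminder of the "[rows, columns]" convention, here is a minimal sketch on a made-up data frame (the names "toy", "geneX" and "geneY" are hypothetical). Leaving an index empty selects everything along that dimension.

```r
# Hypothetical data frame: 3 samples (rows) by 2 genes (columns).
toy <- data.frame(geneX = c(1.5, 0.7, 2.2), geneY = c(2.0, 3.1, 0.4))

one_value <- toy[1, 2]       # row 1, column 2
two_rows <- toy[2:3, ]       # rows 2 and 3, all columns
one_gene <- toy[, "geneX"]   # all rows of one column, selected by name
```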



In [0]:

Use "[]" and "=" to change the value of the object in the second row, second column in scdata to 333. Use "[]" to get the first 5 rows and all columns again.



In [0]:

Now reload the file "single_cell_rnaseq_hw1.txt" into the scdata variable again and get the first 5 rows again. Is the value in the second row, second column 333, or the original value? Why? Write your code in the first cell and answer in the second cell.



In [0]:



In [0]:

Calculate the dimensions of "scdata". How many samples and genes are in this dataset? Note that only a small subset of the original dataset is provided to you. Write down the code in the first cell and your answers in the second cell.



In [0]:



In [0]:

Calculate the summary statistics for each gene. What are the mean values for genes "Sub1" and "Scg2" from the summary statistics? Write down the code in the first cell and your answers in the second cell.



In [0]:



In [0]:

Next, you want to investigate whether the mean values of Sub1 and Scg2 are the same. If we use a t-test to test whether these two genes differ, what are the null hypothesis and the alternative hypothesis in this case? Write your answer in the following cell.



In [0]:

Perform the t-test to compare the mean values of "Sub1" and "Scg2". Don't forget to attach the data matrix first; otherwise you can't access the columns just by their column names.
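What attaching changes can be sketched on a made-up data frame (the names "toy", "g1" and "g2" are hypothetical). Before attach(), a column is reached through the data frame with "$"; after attach(), the bare column name works.

```r
# Hypothetical data frame; names and values are made up for illustration.
toy <- data.frame(g1 = c(1.2, 0.8, 1.5), g2 = c(3.1, 2.9, 3.4))

# Without attach(), columns are reached through the data frame.
m_dollar <- mean(toy$g1)

# After attach(), column names can be used directly.
attach(toy)
m_attached <- mean(g1)
detach(toy)  # remove the data frame from the search path when done
```

Detaching afterwards is a design choice in this sketch: it keeps column names from lingering on the search path and masking other objects later.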



In [0]:

What are the t-statistic and p-value from the above test? What is your conclusion?



In [0]: