Introduction to R

The purpose of this notebook, or series of notebooks, is to provide an introduction to R for students of the BSc programme in Geosciences at the LMU Muenchen.

Here I provide a number of examples that can be run in an interactive manner and should help the student get familiar with R as a tool for the analysis of data.

It will also provide an opportunity to write markdown and see its power.

R basics

Perhaps the first thing you need to know is that R can be used interactively. You can ask R to add two numbers, for example, and R will add them and return the result. For instance, if you run the following operation (press Ctrl+Enter while the code snippet is selected), R will output 2.


In [1]:
1+1


2

If you now want to know the value of the number Pi:


In [2]:
pi


3.14159265358979

and so on. But R can do more interesting things than this, such as analyzing a dataset that you or somebody else generated during a research project. To do so, the first thing you need to do is load the data into R. There are many ways to do this; here we will use only one:


In [9]:
GlidingSnakeUndulations <- read.csv("~/Repos/WP49_2_Datenverarbeitung_WiSe1617/Data/chap03e1GlidingSnakeUndulations.csv", header = TRUE, sep = ",")

Here we need to note three things:

  • We have two words separated by an arrow (<-), which is the assignment operator. The left word is the name of a new object where our data will be stored. The right word (note: the . is part of the "word") is a function (called a method throughout this notebook) telling R what to do. In this case, it tells R to read a csv file located at the quoted path, and it also tells R that the first row of the csv file contains column names rather than data.
  • The method name is followed by (), where the options (arguments) the method needs to do its job are specified.
  • The arrow indicates the direction in which the output moves. In this case, the method takes as input a location on my computer, reads the file found there, and outputs an object of a certain type that is stored under the name GlidingSnakeUndulations.

If the previous line of code runs, you have just loaded data into R. Congrats!
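The path above only exists on the author's computer. As a minimal sketch for trying read.csv() on any machine, the snippet below first writes a tiny csv file to a temporary location and then reads it back. The object name snakes, the temporary file, and the three values are hypothetical stand-ins, not the course dataset.

```r
# Assumption: a small stand-in file, not the original course data.
tmp <- tempfile(fileext = ".csv")

# First line is the column name, the following lines are the data.
writeLines(c("undulationRate", "0.9", "1.2", "1.2"), tmp)

# header = TRUE tells R that the first row holds column names.
snakes <- read.csv(tmp, header = TRUE, sep = ",")

# Inspect the type and structure of the object that was created.
class(snakes)
str(snakes)
```

If the file is read correctly, snakes is a data frame with one column called undulationRate.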

Now that you have data in R, let's see what the data look like before getting into any analysis. A few commands come in handy very often and provide information about the data loaded into R:


In [10]:
head(GlidingSnakeUndulations)


undulationRate
0.9
1.2
1.2
1.3
1.4
1.4

In [11]:
names(GlidingSnakeUndulations)


'undulationRate'

In [12]:
ls()


'GlidingSnakeUndulations'

ls() can be used to list the objects stored in the environment. If you want to delete an object and free the memory it occupies, you can use rm().
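A short sketch of how ls() and rm() work together is given here. The object scratch_values is a hypothetical throwaway object created only for this illustration.

```r
# Create a throwaway object (hypothetical name, used only for this example).
scratch_values <- c(1, 2, 3)

# The new object now shows up among the objects in the environment.
"scratch_values" %in% ls()

# Remove the object again.
rm(scratch_values)

# exists() reports whether an object with that name can still be found.
exists("scratch_values")
```

Be careful with rm(): a removed object cannot be recovered unless you run the code that created it again.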

Now let's say we want to describe the data we have at hand, in this case the undulation rate of some snakes. A common way to do this is to call the method summary() on the data object, in this case GlidingSnakeUndulations. For instance,


In [13]:
summary(GlidingSnakeUndulations)


 undulationRate 
 Min.   :0.900  
 1st Qu.:1.200  
 Median :1.350  
 Mean   :1.375  
 3rd Qu.:1.450  
 Max.   :2.000  

This method will provide you with basic information about the distribution of the variable. It gives you the mean and the median, as well as the 1st and 3rd quartiles, and it provides the max and min observed values.
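The individual values of a summary can also be picked out by name, which is handy when you need them in later calculations. The sketch below applies summary() to a plain numeric vector. The vector rates is an assumption: its eight values were reconstructed from the head() and summary() output shown in this notebook and are not read from the original file.

```r
# Assumption: values reconstructed from the output shown in this notebook.
rates <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

# summary() on a numeric vector returns a named set of values.
s <- summary(rates)
names(s)

# Pick out single entries by name.
s["Median"]
s["Mean"]

# The interquartile range is the distance between the 3rd and 1st quartiles.
s[["3rd Qu."]] - s[["1st Qu."]]
IQR(rates)
```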

If a more graphical description is desired, a good way to visualize how a variable is distributed is a histogram. In our example there are probably too few data points to produce a nice histogram, but R doesn't care about that and will produce the graph anyway.


In [14]:
hist(GlidingSnakeUndulations$undulationRate)
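To see how few observations end up in each bar, you can ask hist() for the bins without drawing anything. This is only a sketch: the vector rates is an assumption, reconstructed from the head() and summary() output shown in this notebook rather than read from the original file.

```r
# Assumption: values reconstructed from the output shown in this notebook.
rates <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

# plot = FALSE returns the histogram as an object instead of drawing it.
h <- hist(rates, plot = FALSE)

# Bin boundaries chosen by R and the number of observations in each bin.
h$breaks
h$counts

# Every observation is counted exactly once.
sum(h$counts) == length(rates)
```

Calling hist(rates) without plot = FALSE draws the figure. The arguments main and xlab can be used to set the title and the x-axis label.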


When we have a continuous variable, we usually want an idea of the central value of its distribution; this is typically the mean (although the best choice depends on the shape of the distribution). The method summary() gives you the mean, but it gives no information about the dispersion of the data around it. Typically we measure dispersion by taking the mean as a reference and assessing how far each data point lies from it. This idea is condensed in the standard deviation, a measurement that must always accompany the mean; so remember, every time you report a mean you have to report a standard deviation.
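To make the idea of "distance from the mean" concrete, the sketch below computes the sample standard deviation step by step and compares it with R's built-in sd(). The vector rates is an assumption: its values were reconstructed from the head() and summary() output shown in this notebook.

```r
# Assumption: values reconstructed from the output shown in this notebook.
rates <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

# Distance of each data point from the mean.
deviations <- rates - mean(rates)

# Sample variance: sum of squared deviations divided by n - 1.
sample_variance <- sum(deviations^2) / (length(rates) - 1)

# The standard deviation is the square root of the variance.
manual_sd <- sqrt(sample_variance)
manual_sd

# The built-in function should agree with the manual calculation.
all.equal(manual_sd, sd(rates))
```

Note that sd() divides by n - 1 rather than n, that is, it returns the sample standard deviation.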

R provides several methods to calculate different values from a dataset. For instance:


In [15]:
mean(GlidingSnakeUndulations$undulationRate)


1.375

In [16]:
median(GlidingSnakeUndulations$undulationRate)


1.35

In [17]:
sd(GlidingSnakeUndulations$undulationRate)


0.324037034920393

In [18]:
max(GlidingSnakeUndulations$undulationRate)


2

In [19]:
min(GlidingSnakeUndulations$undulationRate)


0.9

R also provides methods to calculate whatever quantile you want to have. For instance:


In [20]:
quantile(GlidingSnakeUndulations$undulationRate, 0.25)


25%: 1.2

In [21]:
quantile(GlidingSnakeUndulations$undulationRate, 0.5)


50%: 1.35

In [22]:
quantile(GlidingSnakeUndulations$undulationRate, 0.75)


75%: 1.45
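quantile() also accepts several probabilities at once, so the three calls above can be combined into one. In this sketch the vector rates is an assumption, reconstructed from the head() and summary() output shown in this notebook.

```r
# Assumption: values reconstructed from the output shown in this notebook.
rates <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

# Ask for the three quartiles in a single call.
quartiles <- quantile(rates, probs = c(0.25, 0.5, 0.75))
quartiles

# The result is a named vector; single values can be picked by name.
quartiles["50%"]
```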

With these basic methods you can describe a continuous quantitative variable very thoroughly. With a bit of imagination you can also derive other descriptors by combining methods; for instance, the standard error of the mean and the coefficient of variation can be calculated as follows:


In [23]:
sd(GlidingSnakeUndulations$undulationRate)/sqrt(nrow(GlidingSnakeUndulations))


0.114564392373896

In [24]:
sd(GlidingSnakeUndulations$undulationRate)/mean(GlidingSnakeUndulations$undulationRate)


0.235663298123922
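If you need these two descriptors often, you can wrap them in your own functions. The following is a minimal sketch: the function names standard_error and coefficient_of_variation are hypothetical, and the vector rates is an assumption reconstructed from the head() and summary() output shown in this notebook.

```r
# Assumption: values reconstructed from the output shown in this notebook.
rates <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

# Standard error of the mean: standard deviation divided by sqrt(n).
# For a plain vector, n is given by length() rather than nrow().
standard_error <- function(x) {
  sd(x) / sqrt(length(x))
}

# Coefficient of variation: standard deviation relative to the mean.
coefficient_of_variation <- function(x) {
  sd(x) / mean(x)
}

standard_error(rates)
coefficient_of_variation(rates)
```

With the data frame used in this notebook, the same functions could be called on the column GlidingSnakeUndulations$undulationRate.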