Julien Wist / 2017 / Universidad del Valle
Andrés Bernal / 2017 / ???
An up-to-date version of this notebook can be found here: https://github.com/jwist/chemometrics/
In [1]:
options(repr.plot.width=4, repr.plot.height=4)
R is a language designed from the statistical standpoint. At first it is a little counterintuitive for people used to MATLAB-like programming. MATLAB and the like (Scilab, Octave, Python) are built from the linear algebra standpoint.
R is therefore well suited to statistical data manipulation, since it provides a large number of built-in functions. However, many of these functions are overlooked because they have no equivalent in MATLAB.
In this section, I will list some very important functions that are usually discovered too late... and that can greatly simplify the code and, more importantly, make it more readable.
First load some data to play with.
In [3]:
load(url('https://github.com/jwist/chemometrics/raw/master/datasets/coffeeMulti.rda'))
In [4]:
# use ls() to list the variables and discover what was saved in this file
ls()
In [6]:
# use the following command to explore the variable.
names(coffeeMulti)
In [10]:
# use head() to preview the first rows of the data
head( coffeeMulti$irms )
Let's play with the Isotope Ratio Mass Spectrometry (IRMS) data.
In [13]:
d <- coffeeMulti$irms
is(d)
Let's compute the mean column from the two columns "caffeine1" and "caffeine2". This is easily done with the apply function, which operates on arrays. The first argument is the array (here the two columns), the second argument is the MARGIN (1 for rows, 2 for columns), and the last argument is the function to be applied. The MARGIN tells apply along which dimension to apply the mean function: with MARGIN = 1 we get one mean per row.
In [24]:
m <- apply(d[,c('caffeine1','caffeine2')], 1, mean)
head( data.frame(mean = m) )
To make this clear, we can use MARGIN = 2, which returns one mean per column instead.
In [28]:
m <- apply(d[,c('caffeine1','caffeine2')], 2, mean)
head( data.frame('mean by columns' = m) )
Any function can be used in place of mean, such as max, min or quantile.
In [27]:
m <- apply(d[,c('caffeine1','caffeine2')], 2, quantile)
head( data.frame(quantile = m) )
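To see the effect of MARGIN without the coffee data, here is a minimal sketch on a small made-up matrix (the object `toy` and its values are illustrative and not part of the dataset). For means specifically, base R also provides rowMeans() and colMeans(), which should give the same numbers as apply().

```r
# toy is a made-up 3 x 2 matrix, used only for illustration
toy <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2,
              dimnames = list(NULL, c("a", "b")))

by_row <- apply(toy, 1, mean)   # MARGIN = 1: one value per row
by_col <- apply(toy, 2, mean)   # MARGIN = 2: one value per column

by_row
by_col

# rowMeans() and colMeans() are shortcuts for the same computation
all.equal(by_row, rowMeans(toy))
all.equal(by_col, colMeans(toy))
```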
The message is that explicit loops should be avoided as much as possible in R, and that there are many built-in functions for that purpose.
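As a rough illustration of that point, the sketch below computes the maximum of each column twice, once with an explicit loop and once with sapply(). The data frame `toy_df` is made up for the example.

```r
# toy_df is a made-up data frame, used only for illustration
toy_df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# explicit loop: preallocate the result, then fill it column by column
loop_max <- numeric(ncol(toy_df))
names(loop_max) <- names(toy_df)
for (j in seq_along(toy_df)) {
  loop_max[j] <- max(toy_df[[j]])
}

# the same computation in a single call
apply_max <- sapply(toy_df, max)

all.equal(loop_max, apply_max)
```

Both versions give the same values; the second one simply says what is computed rather than how to iterate.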
Another useful example is centering the data. Say you want to find the mean of a column and then subtract this value from all the elements of that column. This way you will have centered your data. At first glance, this looks fairly complex and involves several operations. R provides a simple framework for this, called sweep().
In [61]:
head( sweep( d[c('caffeine1', 'caffeine2')], 2, apply(d[c('caffeine1', 'caffeine2')], 2, mean), "-" ) )
In [67]:
head( apply(d[c('caffeine1', 'caffeine2')], 2, function(x) x - mean(x)) )
This second example is my favourite and shows how apply works. Instead of using a predefined function such as mean(), we use a user-defined function. Here x receives one column at a time and the function returns the column minus its mean.
Although sweep() achieves the same result, the second example is much more general.
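The two approaches can be checked against each other on a small made-up data frame (`toy_df` below is illustrative). The sketch also includes scale(), another base R function that centers columns when called with scale = FALSE; it is not used in the cells above and is shown only as a third option.

```r
# toy_df is a made-up data frame, used only for illustration
toy_df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))

centered_sweep <- sweep(toy_df, 2, colMeans(toy_df), "-")
centered_apply <- apply(toy_df, 2, function(x) x - mean(x))
centered_scale <- scale(toy_df, center = TRUE, scale = FALSE)

# after centering, every column mean is zero (up to rounding)
colMeans(centered_sweep)

# the three approaches agree on the values; only the returned class differs
all.equal(as.matrix(centered_sweep), centered_apply, check.attributes = FALSE)
all.equal(centered_apply, centered_scale, check.attributes = FALSE)
```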
Another very useful example is data aggregation. For example, you may want to find the mean value of your variable by country, in this case the mean isotope ratio by country. Again, if looked at from the MATLAB standpoint, this is not straightforward. However, R provides a very handy solution. The first aggregation function you should know is table(). Look how this works:
In [71]:
table( d$country )
So, we know how many samples we have from each country. But now we want to compute the mean.
In [84]:
aggregate(d[c('caffeine1', 'caffeine2')], by = list(unlist(d['country'], use.names = FALSE)), mean)
The by= argument must be a list object. Because our data are not stored cleanly, we first have to unlist our country column and wrap the result in a new, clean list. The unlist() function is very useful for flattening any vector of data before reassigning it with a new type.
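A minimal sketch of the same aggregation on made-up data follows (`toy_df` and the country labels "A" and "B" are placeholders). It shows the list-based call used above next to the formula interface of aggregate(), which avoids building the list by hand.

```r
# toy_df is made up for illustration; the country labels are placeholders
toy_df <- data.frame(
  country   = c("A", "A", "B", "B", "B"),
  caffeine1 = c(1, 3, 10, 20, 30),
  caffeine2 = c(2, 4, 20, 30, 40)
)

# list-based interface, as in the cell above; naming the list element
# gives the grouping column a readable name instead of Group.1
agg_list <- aggregate(toy_df[c("caffeine1", "caffeine2")],
                      by = list(country = toy_df$country), FUN = mean)

# formula interface: cbind() selects the columns to summarise
agg_formula <- aggregate(cbind(caffeine1, caffeine2) ~ country,
                         data = toy_df, FUN = mean)

agg_list
agg_formula
```

Both calls return one row per country with the group means of the two columns.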
In [248]:
# this assumes d contains a 'mean' column holding the row means of caffeine1 and caffeine2
by(d['mean'], d['country'], function(x) max(x))
Another way to obtain the same result is to pass the whole data frame and select the column inside the function.
In [249]:
by(d, d$country, function(x) max(x$mean))
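When only a single vector needs to be summarised by group, tapply() is a lighter alternative to by(). The sketch below compares the two on made-up data (`toy_df`, its `mean` column and the country labels are illustrative).

```r
# toy_df is made up for illustration; the country labels are placeholders
toy_df <- data.frame(
  country = c("A", "A", "B", "B", "B"),
  mean    = c(1.5, 3.5, 15, 25, 35)
)

# by() splits the data frame and hands each piece to the function
max_by <- by(toy_df, toy_df$country, function(x) max(x$mean))

# tapply() does the same for a single vector, grouped by a second vector
max_tapply <- tapply(toy_df$mean, toy_df$country, max)

max_by
max_tapply
```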
For more control over the data, aggregation can be performed without summarising anything, by passing the identity function. The grouped values are kept as they are, and functions can be applied to them later.
In [250]:
a <- aggregate(d[c('caffeine1', 'caffeine2')], list(d$country), function(x) x)
a
For example, to obtain boxplots:
In [241]:
apply(a[c(2,3)], 2, function(x) {
    boxplot(x, main = names(x), names = a[[1]], xlab = "country", ylab = "IRMS")
})
The above example is, however, not a very good one, since boxplot() is itself a very powerful function for aggregating data. The same result is obtained with the simple call:
In [224]:
boxplot(mean ~ country, d)