In [ ]:
Julien Wist / 2017 / Universidad del Valle
Andrés Bernal / 2017 / ???
An up-to-date version of this notebook can be found here: https://github.com/jwist/chemometrics/
In [1]:
options(repr.plot.width=3, repr.plot.height=3)
Although the internet has accelerated the pace at which information is shared, it has had very little effect on the way information is formatted. Publications illustrate this phenomenon well. Today, data produced in the lab are mostly digital. Analysis of data, digital or not, is performed using computers. The writing and sharing of the results are done digitally. Despite this, the end product, the publication, is an exact copy of a plain printed paper with close to no enhanced features. A paper from 1905 is exactly the same product as a paper from 2017, even if its production and distribution differ. Publications were introduced to allow researchers to claim authorship of their contributions, thereby accelerating the speed at which information was shared. Today, publication may well be serving the opposite objective: slowing the flow of information. In some research areas this fragmentation of knowledge and data is so detrimental that it may invalidate years of research that cost billions of dollars. Here we present cheminfo.org, a platform to store and share data and code libraries that facilitate solving problems related to chemistry. In our opinion this represents a step toward a more efficient way to share contributions within the academic community.
A webpage is typically a more appropriate and "modern" way to share information, in that not only the content is transferred but also the code for its proper visualization. Usually, however, the raw data and the code used to manipulate them are not transferred, since the latter is executed on the server. Although it is technically possible to transfer the tool to the end user, this would require installation and configuration on the client side.
A better way to share information in the context of collaborative science requires the author to share all the information necessary for someone else first to check what has been done and second to reproduce and extend the research. Examples of this exist, but there is still room for improvement. One example is the very tool used for this presentation/publication, Jupyter.
The three elements of collaborative science
Raw data must be analyzed, and therefore libraries of functions must be developed. Today, two very powerful options exist: R and Python.
Both are high-level languages and thus make it relatively easy to write pieces of code. Both are scripting languages, meaning that they don't require compilation. This makes coding and debugging much simpler tasks.
The following link shows the most common programming languages on GitHub: http://githut.info/.
It can be readily installed for all platforms (https://www.continuum.io/downloads)
Alternatively you can download R directly (https://www.r-project.org/)
Because writing scripts is not always an easy task, programs referred to as IDEs (integrated development environments) exist to simplify it. These programs offer an environment that suggests or auto-completes code, provides easy access to help, etc.
Front-ends for R are:
Desktop (multi platform) https://www.rstudio.com/
Online (no installation required) https://www.getdatajoy.com/
Front-ends for python:
Among others!
We've just seen that important software exists and is available to the community to perform data analysis or statistics. Using a well-described, widely used programming language instead of closed, proprietary, commercial software is a huge step towards reproducible science. However, a piece is still missing: the notebook.
Data analysis is an experimental art, much as chemistry is. Thus, data analysts need to document the steps they follow to obtain their results. Several solutions exist to
In this example we will publish a simple linear regression. The function we use is the following:
$$ f(x) = -1.874 + 0.075 x + 0.01 x^2 $$
Instead of sharing an image of the results, we provide here all the code that is necessary for another group to reproduce our findings.
In [ ]:
#rm(list=ls(all=TRUE)) # uncomment to clear the variable space
N <- 20 # number of distinct x values
# we create a fake dataset using a quadratic
# equation and adding some noise
# first create a vector of x values, repeated n_rep times
n_rep <- 2 # number of replicates (named n_rep to avoid shadowing the rep() function)
X <- rep(seq(from=43, to=96, length.out=N), n_rep)
# then create the Y vector according to the equation:
Y <- -1.874 + 0.075 * X + 0.01 * X^2
# create some uniform noise in [-1, 1]
noise <- runif(length(Y), -1, 1)
# add the noise, scaled by 3, to Y
Y <- Y + 3*noise
y <- Y # working copy, re-ordered in the next cell
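Because the noise is drawn at random, each run of the cell above produces slightly different data. A minimal sketch, assuming one wants collaborators to obtain identical numbers, is to fix the seed of the random number generator before drawing. The seed value 42 and the names `noise_a` and `noise_b` are arbitrary choices for illustration and are not part of the original notebook.

```r
# Fixing the seed makes runif() return the same sequence on every run.
# The seed value (42) is an arbitrary choice for illustration.
set.seed(42)
noise_a <- runif(5, -1, 1)

# Resetting the same seed restarts the sequence from the same point.
set.seed(42)
noise_b <- runif(5, -1, 1)

identical(noise_a, noise_b)  # compares the two draws element by element
```

Placing a single `set.seed()` call at the top of the data-generation cell would be enough for that cell to be repeatable.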
In [2]:
x_sorted <- sort(X,index.return=TRUE)
x <-x_sorted$x
y <- Y[x_sorted$ix]
In [3]:
data <- data.frame(x=x, x2=x^2, y=y)
data[1:6,]
dim(data)
Now the first thing to do is to have a look at the data. This can be done readily with a single command:
In [4]:
plot(data$x, data$y, xlab='x', ylab='y') # plot data
There are two options. If we have no idea about the model, we may simply test a simple linear regression, since the data look linear. For this we can use a built-in tool in R that does just that, exactly the same way you would ask Microsoft Excel to add a linear regression to your data.
In [5]:
fit.lm = lm(y ~ x, data=data)
#plot(fit.lm)
The line above means that we will use a routine called lm(), for linear model, and define the model as "y = ax + b", the equation of a line (the intercept b is included by default). If you are curious, you can remove the "#" in the cell above and see what results.
We may now want to see the results and superimpose the resulting model on the "experimental" data.
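To make the formula notation concrete, here is a small sketch on a toy dataset that lies exactly on a known line. The data and the names `toy`, `fit.toy` and `fit.origin` are made up for illustration and are not the notebook's data. It shows that `y ~ x` fits both an intercept and a slope, and that the intercept can be dropped explicitly if a line through the origin is really what is wanted.

```r
# Toy data lying exactly on y = 2 + 3x (hypothetical values, for illustration only)
toy <- data.frame(x = 1:10)
toy$y <- 2 + 3 * toy$x

# y ~ x fits y = b + a*x; the intercept b is implicit in the formula
fit.toy <- lm(y ~ x, data = toy)
coef(fit.toy)      # named vector with entries "(Intercept)" and "x"

# Appending "- 1" removes the intercept, giving the model y = a*x
fit.origin <- lm(y ~ x - 1, data = toy)
coef(fit.origin)   # a single coefficient named "x"
```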
In [6]:
plot(data$x, data$y, xlab='x', ylab='y') # plot data
lines(data$x, fitted(fit.lm), col=2) # plot fitted curve
The result looks nice, but we would like to know the coefficient of determination for this model. Most complex routines in R have an associated function called summary(). It shows important parameters of the model object fit.lm. If you only need the coefficient of determination, you can simplify the output using the second line.
In [7]:
summary(fit.lm)
round(summary(fit.lm)$r.squared, 3)
Although an R^2 of 0.99 is acceptable, a more careful look at our data indicates that we may have a quadratic term. To test this, we can simply use another formula in the lm() routine.
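One way to take that closer look is to plot the residuals of the straight-line fit against x: a systematic curved pattern, rather than a random scatter around zero, hints at a missing higher-order term. The sketch below regenerates data from the equation given earlier so that it runs on its own. The seed and the names `xs`, `ys`, `fit.line` and `res` are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only to make the illustration repeatable
xs <- seq(from = 43, to = 96, length.out = 20)
ys <- -1.874 + 0.075 * xs + 0.01 * xs^2 + 3 * runif(length(xs), -1, 1)

fit.line <- lm(ys ~ xs)       # straight-line model
res <- residuals(fit.line)    # observed minus fitted values

plot(xs, res, xlab = 'x', ylab = 'residual')  # residuals versus x
abline(h = 0, lty = 2)                        # dashed reference line at zero
```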
In [8]:
fit.lm = lm(y ~ x + x2, data=data)
plot(data$x, data$y, xlab='x', ylab='y') # plot data
lines(data$x, fitted(fit.lm), col=2) # plot fitted curve
#round(summary(fit.lm)$r.squared, 3)
print(paste("R^2 =",round(summary(fit.lm)$r.squared, 3)))
This new model fits our data better than the previous one, and we may conclude that a quadratic effect exists in the phenomenon we are trying to observe.
With this simple example we have demonstrated that it is fairly simple to use R for simple tasks usually done in Microsoft Excel or similar programs. We also illustrated the power of sharing not only a PDF of the class notes, but a file that contains the information and all the code necessary for it to be checked and validated.
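The visual comparison can be complemented by a formal test. The sketch below is not part of the original analysis: it assumes the same generating equation, rebuilds a comparable dataset so that it runs on its own, and uses anova() to compare the two nested models with an F-test. The names `d`, `fit.linear`, `fit.quadratic` and `cmp` are illustrative.

```r
set.seed(1)  # arbitrary seed, only to make the illustration repeatable
d <- data.frame(x = rep(seq(from = 43, to = 96, length.out = 20), 2))
d$y <- -1.874 + 0.075 * d$x + 0.01 * d$x^2 + 3 * runif(nrow(d), -1, 1)

fit.linear    <- lm(y ~ x, data = d)
# I() protects x^2 so that "^" is read as arithmetic inside the formula
fit.quadratic <- lm(y ~ x + I(x^2), data = d)

# F-test for nested models: does the quadratic term reduce
# the residual sum of squares more than chance would explain?
cmp <- anova(fit.linear, fit.quadratic)
print(cmp)
```

Using `I(x^2)` in the formula is an alternative to storing a separate `x2` column in the data frame; both describe the same model.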
http://localhost:8888/notebooks/jupyterNotebooks/LOSC_Event_tutorial.ipynb
Homework
Please check and complete at least one of the tutorials listed above!
end of document