In [ ]:

Chemometrics

Julien Wist / 2017 / Universidad del Valle
Andrés Bernal / 2017 / ???

An up-to-date version of this notebook can be found here: https://github.com/jwist/chemometrics/



In [1]:

    
options(repr.plot.width=3, repr.plot.height=3)

Aims

reproducibility in science
available tools for chemometrics
a first modern publication

reproducible science

a better publication for a better science

Although the internet has accelerated the pace at which information is shared, it has had very little effect on the way information is formatted. Publications illustrate this phenomenon well. Today, data produced in the lab are mostly digital. Analysis of data, digital or not, is performed using computers. The writing and sharing of the results is done digitally. Despite this, the end product, the publication, is an exact copy of a plain printed paper with close to no enhanced features. A paper from 1905 is exactly the same product as a paper from 2017, even if its production and distribution differ. Publications were introduced to allow researchers to claim authorship of their contributions, and thereby to accelerate the speed at which information was shared. Today, publication may well be serving the opposite objective: slowing the flow of information. In some research areas this fragmentation of knowledge and data is so detrimental that it may invalidate years of research that cost billions of dollars. Here we present cheminfo.org, a platform to store and share data and code libraries to facilitate solving problems related to chemistry. In our opinion this represents a step toward a more efficient way to share contributions within the academic community.

A web page is typically a more appropriate and "modern" way to share information, in that not only the content is transferred but also the code for its proper visualization. Usually, however, the raw data and the code used to manipulate them are not transferred, since the latter is executed on the server. Although it is technically possible to transfer the tool to the end user, this would require installation and configuration on the client side.

A better way to share information in the context of collaborative science requires the author to share all the information necessary for someone else, first, to check what has been done and, second, to reproduce and extend the research. Examples of this exist, but there is still room for improvement. One example is the very tool used for this presentation/publication, jupyter.

The three elements of collaborative science

Useful tools for cheminformatics

Raw data must be analyzed, and therefore libraries of functions must be developed. Today, two very powerful options exist.

python + pandas (numerical simulation + statistics)
R (statistics)

Both are high-level languages and thus make it relatively easy to write pieces of code. Both are scripting languages, meaning that they don't require a separate compilation step. This makes coding and debugging much simpler tasks.

The following link shows the most common programming languages on GitHub: http://githut.info/.

what is R?

R is a powerful and free to use tool to perform statistics
R is designed primarily for statistics rather than as a general-purpose numerical simulation package
R is free
R has a large community that supports its development, and many tutorials are available

Common commercial alternatives for statistics are:

Stata
SPSS
SAS

what is python?

python is a free language widely used for numerical simulation
python is well suited to vector and matrix algebra
python is open source
python is supported by a very large community. New methods are rapidly implemented and shared.

Alternatives to python are:
Octave
scilab
matlab
to some extent: mathematica

How to get R and python?

Both can be readily installed on all platforms with the Anaconda distribution (https://www.continuum.io/downloads)

Alternatively you can download R directly (https://www.r-project.org/)

front-ends

Because writing scripts is not always an easy task, programs referred to as IDEs (integrated development environments) exist to simplify it. These programs offer an environment that suggests or auto-completes code, provides easy access to help, etc.

Front-ends for R are:

Desktop (multi platform) https://www.rstudio.com/
Online (no installation required) https://www.getdatajoy.com/

Front-ends for python:

pycharm

Among others!

R

From now on we will focus on R.

great documentation

https://cran.r-project.org/manuals.html

free online tutorials

online code execution

http://www.tutorialspoint.com
(most languages available, has an IDE, install.packages() works, free)
https://cloud.sagemath.com/
(allows running R, scilab, julia and octave and maintaining a lab book; install.packages() might work with a paid subscription; allows classroom management; otherwise free)
https://datajoy.com (discontinued)
(allows installing packages, not free)
http://www.r-fiddle.org
(allows running simple code, very limited in size, install.packages() not available, no registration, free)
Google Colaboratory (check if available)

a few dataset to play with

the notebook for data analysis

We've just seen that important software exists and is available to the community to perform data analysis or statistics. Using a well-described, widely used programming language instead of closed, proprietary, commercial software is a huge step towards reproducible science. However, a piece is still missing: the notebook.

Data analysis is an experimental art, much as chemistry is. Thus, data analysts need to document the steps they follow to obtain their results. Several solutions exist to

a first example of publication with jupyter

In this example we will publish a simple linear regression. The function we use is the following:

$$ f(x) = -1.874 + 0.075 x + 0.01 x^2$$

Instead of sharing an image of the results, we provide here all the code that is necessary for another group to reproduce our findings.
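As a quick sanity check, the model above can be written as an R function and evaluated by hand. This is only a minimal sketch: the name f mirrors the equation, and the two evaluation points (43 and 96) are the ends of the x range used in the experimental section.

```r
# Noise-free model from the equation above
f <- function(x) -1.874 + 0.075 * x + 0.01 * x^2

# Values at both ends of the x range used in this example
f(c(43, 96))
```

The two values are 19.841 and 97.486, so simulated y values with noise of at most 3 units should stay close to these numbers at the ends of the range.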

experimental section

creation of the data. Here data are simulated, but real experimental data could be used instead.



In [ ]:

    
#rm(list=ls(all=TRUE)) # uncomment to clear the variable space

N <- 20 # number of distinct x values

# we create a fake dataset using a quadratic
# equation and adding some noise
# first create a vector of x values, repeated n_rep times
n_rep <- 2 # number of replicates (named n_rep to avoid masking the rep() function)
X <- rep(seq(from=43, to=96, length.out=N), n_rep)
# then create the Y vector according to the equation:
Y <- -1.874 + 0.075 * X + 0.01 * X^2
# create uniform noise in [-1, 1]
noise <- runif(length(Y), -1, 1)
# add the noise, scaled by 3, to Y
Y <- Y + 3*noise
y <- Y



In [2]:

    
x_sorted <- sort(X,index.return=TRUE)
x <-x_sorted$x
y <- Y[x_sorted$ix]
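Because the noise is drawn at random, every run of the simulation above yields slightly different numbers. One common way to make simulated data repeatable is to fix the seed of the random number generator first. The sketch below is an illustration only: the seed value 2017 and the variable names noise_first and noise_second are arbitrary choices, not part of the original analysis.

```r
set.seed(2017)  # arbitrary seed, chosen here only for illustration
noise_first <- runif(40, -1, 1)

set.seed(2017)  # resetting the same seed replays the same random stream
noise_second <- runif(40, -1, 1)

identical(noise_first, noise_second)  # TRUE: both draws are the same
```

Calling set.seed() once at the top of the data-creation cell would let another group regenerate exactly the same "experimental" data.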

results and discussion

blah blah



In [3]:

    
data <- data.frame(x=x, x2=x^2, y=y)
data[1:6,]
dim(data)









    





x         x2        y
43.00000  1849.000  17.37903
43.00000  1849.000  16.92747
45.78947  2096.676  25.28350
45.78947  2096.676  20.41811
48.57895  2359.914  24.94950
48.57895  2359.914  25.20949

40  3

Now the first thing to do is to have a look at the data. This can be done readily with a single command:



In [4]:

    
plot(data$x, data$y, xlab='x', ylab='y') # plot data

If we have no idea about the model, a first option is simply to try a simple linear regression, since the data look roughly linear. R has a built-in tool that does just this, in exactly the same way you would ask Microsoft Excel to add a linear regression to your data.



In [5]:

    
fit.lm = lm(y ~ x, data=data)
#plot(fit.lm)

The line above means that we use a routine called lm(), for linear model, and define the model as "y = ax + b", the equation of a line (R adds the intercept b automatically). If you are curious, you can remove the "#" in the cell above to uncomment plot(fit.lm) and see the resulting diagnostic plots.

We may also want to superimpose the resulting model on the "experimental" data.
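To see what the formula notation does, it can help to fit noise-free data whose coefficients are known in advance. The sketch below uses a made-up toy data frame following y = 2 + 3x (these values and all the names are chosen only for illustration) and compares the default formula with one in which the intercept is removed explicitly.

```r
# Noise-free toy data following y = 2 + 3x (values chosen for illustration)
toy <- data.frame(x = 1:10)
toy$y <- 2 + 3 * toy$x

fit_with_intercept <- lm(y ~ x, data = toy)      # slope and intercept
fit_through_origin <- lm(y ~ x - 1, data = toy)  # slope only, intercept removed

coef(fit_with_intercept)  # two coefficients: (Intercept) and x
coef(fit_through_origin)  # a single coefficient: x
```

The first fit recovers an intercept of 2 and a slope of 3, which shows that y ~ x already includes an intercept term.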



In [6]:

    
plot(data$x, data$y, xlab='x', ylab='y') # plot data
lines(data$x, fitted(fit.lm), col=2) # plot fitted curve

The result looks nice, but we would like to know the coefficient of determination (R^2) for this model. Most model-fitting routines in R come with a summary() method, which shows the important parameters of the model object fit.lm. If you only need R^2, you can simplify the output using the second line of the cell below.



In [7]:

    
summary(fit.lm)
round(summary(fit.lm)$r.squared, 3)









    





Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1348 -1.8015 -0.4273  2.1378  6.6372 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -49.57573    2.07861  -23.85   <2e-16 ***
x             1.49347    0.02914   51.26   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.964 on 38 degrees of freedom
Multiple R-squared:  0.9857,	Adjusted R-squared:  0.9854 
F-statistic:  2627 on 1 and 38 DF,  p-value: < 2.2e-16







    




0.986
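A high R^2 alone does not tell us whether a straight line is the right model. A common check is to look at the residuals: if they form a U shape instead of scattering randomly around zero, a term is probably missing. The sketch below is self-contained and re-simulates data with the same equation as in the experimental section. The seed and the names x_sim, y_sim, fit_line and is_outer are assumptions made for this illustration.

```r
set.seed(1)  # arbitrary seed (assumption), only to make the sketch repeatable
x_sim <- rep(seq(from = 43, to = 96, length.out = 20), each = 2)
y_sim <- -1.874 + 0.075 * x_sim + 0.01 * x_sim^2 + 3 * runif(length(x_sim), -1, 1)

fit_line <- lm(y_sim ~ x_sim)
res <- residuals(fit_line)

# Residuals against fitted values: a U-shaped pattern hints at a missing quadratic term
plot(fitted(fit_line), res, xlab = "fitted values", ylab = "residuals")
abline(h = 0, lty = 2)

# Numerical version of the same check: mean residual at the ends vs. in the middle
is_outer <- x_sim < quantile(x_sim, 0.25) | x_sim > quantile(x_sim, 0.75)
mean(res[is_outer])   # expected to be positive for these data
mean(res[!is_outer])  # expected to be negative for these data
```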

Although an R^2 of 0.99 is acceptable, a more careful look at our data indicates that we may have a quadratic term. To test this, we can simply use another formula in the lm() call.



In [8]:

    
fit.lm = lm(y ~ x + x2, data=data)
plot(data$x, data$y, xlab='x', ylab='y') # plot data
lines(data$x, fitted(fit.lm), col=2) # plot fitted curve
#round(summary(fit.lm)$r.squared, 3)
print(paste("R^2 =",round(summary(fit.lm)$r.squared, 3)))









    



[1] "R^2 = 0.995"

This new model fits our data better than the previous one, and we may conclude that a quadratic effect exists in the phenomenon we are trying to observe.

With this simple example we have demonstrated that it is fairly easy to use R for simple tasks usually done in Microsoft Excel or similar programs. We have also illustrated the power of sharing not only a PDF of the class notes, but a file that contains the information and all the code necessary for it to be checked and validated.
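Comparing two R^2 values by eye is informal, since R^2 never decreases when a term is added. One possible way to make the comparison more rigorous is an F-test between the two nested models with anova(). The sketch below is not part of the original analysis: it re-simulates the data (the seed and all variable names are assumptions) and writes the quadratic term as I(x^2) in the formula, which is equivalent to adding a separate x2 column.

```r
set.seed(1)  # arbitrary seed (assumption), only to make the sketch repeatable
x_sim <- rep(seq(from = 43, to = 96, length.out = 20), each = 2)
y_sim <- -1.874 + 0.075 * x_sim + 0.01 * x_sim^2 + 3 * runif(length(x_sim), -1, 1)
d <- data.frame(x = x_sim, y = y_sim)

fit_linear    <- lm(y ~ x, data = d)
fit_quadratic <- lm(y ~ x + I(x^2), data = d)  # I() keeps ^ as arithmetic inside a formula

# F-test between the nested models: a small p-value favours the quadratic model
comparison <- anova(fit_linear, fit_quadratic)
comparison
p_value <- comparison[["Pr(>F)"]][2]
p_value
```

A small p-value here indicates that the quadratic term explains significantly more variance than the noise alone would.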

some publications are already available in this format

https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks#reproducible-academic-publications

http://localhost:8888/notebooks/jupyterNotebooks/LOSC_Event_tutorial.ipynb

Homework

Please check and complete at least one of the tutorials listed above!

end of document
