Welcome to R from a Jupyter notebook!

This is what a Jupyter Notebook looks like. Here I will create the dataset for our toy lake example and introduce some basic (and some more advanced) plotting functionality.

First a test that the R kernel works:


In [1]:
1+1


2

It works! By the way, the text you see here is Markdown. A Jupyter Notebook is a good way to combine text (formatted with Markdown) and R or Python code to create reproducible analytical pipelines.

In this notebook, I will use bold face whenever I write R syntax we discussed during the lecture.

Generating data for Lake Poing

The first step after our design lecture is to generate the data we wanted to sample from our lake. We agreed on 50 stations and gained some knowledge about the geography of the lake (i.e. its surroundings, etc.). We are going to use this knowledge to conduct our initial sampling and later in the course to design our experiments in a meaningful way.

Next I will generate a dataframe that will contain the following variables. To aggregate the variable in our dataframe I will create 1D arrays that will represent a physical/chemical parameter of the lake (e.g. temperature). Once I have all the sampling points I want in the vectors I will aggregate them into a dataframe.

We agreed that we wanted to measure the following parameters at each station:

  • Distance from coast line (meters)
  • Depth (meters)
  • Surface temperature (degrees Celsius)
  • Turbidity (Secchi disk depth, meters)
  • Nitrate concentration (mg/ml)
  • Phosphate concentration (mg/ml)
  • pH
  • Dissolved oxygen (mg/ml)
  • Chlorophyll concentration (mg/ml)

And we mentioned different areas in which the lake can be divided:

  • Natural reserve
  • Agricultural land use
  • Town area
  • Industrial area (the mining processing company that wants to expand)

I think we need to include a fifth area, something like open waters, that will group stations that are too far away from land. The problem is, of course, that at the beginning of the study we don't know what "far" means. For now we only document the distance to land of each station we sample. We can use these data later if an open water category is needed to group some stations.
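As a sketch of how that later grouping could work (the 1000 m cut-off below is purely hypothetical and only for illustration), ifelse() can turn recorded distances into an open-water label:

```r
#toy distances from the coast line in meters (made-up values)
distance_to_land <- c(120, 950, 1800, 300, 2500)

#stations farther than a (purely hypothetical) 1000 m threshold are labelled open waters
area_label <- ifelse(distance_to_land > 1000, "open_waters", "shore")
print(area_label)
```

The threshold itself would have to come from the data (or the design discussion), not from this sketch.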

To generate the data, we need to learn a bit about probability distributions. I will introduce a few of them in the next blocks of R code and I will try to give examples of how these distributions can be used.

Let's start with the normal distribution. This one looks... normal.


In [2]:
#this is how you load packages into your session

library(MASS)#needed to generate multivariate normal data
library(tmvtnorm)#needed to generate truncated multivariate normal data
library(lavaan)#needed to get the function cor2cov


Loading required package: mvtnorm
Loading required package: Matrix
Loading required package: stats4
Loading required package: gmm
Loading required package: sandwich
This is lavaan 0.5-23.1097
lavaan is BETA software! Please report any bugs.

In [3]:
plot(function(x) dnorm(x, 0, 1), -5, 5, ylab="Density", main="A normal distribution with mean=0 and stdev=1")


If you only want positive numbers (we will use this later to generate the data for some variables), you can use a lognormal distribution, which only produces values greater than zero.


In [4]:
plot(function(x) dlnorm(x, 1, 1), -1, 20, ylab="Density", main="A lognormal distribution with meanlog=1 and sdlog=1")
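
A quick sanity check of the claim above (a minimal sketch): samples drawn with rlnorm() are always strictly positive.

```r
set.seed(42)
#draw 1000 values from a lognormal distribution
lognormal_sample <- rlnorm(1000, meanlog = 1, sdlog = 1)

#every sampled value is strictly greater than zero
print(all(lognormal_sample > 0))#prints [1] TRUE
```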


These two are good for continuous variables. What if you have counts? Usually, counts are modelled using a Poisson distribution.


In [5]:
plot(dpois(x=1:6, lambda=1), ylab="Probability", xlab="x", main="A Poisson distribution with lambda=1")
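
As a small sketch of why the Poisson suits counts: samples drawn with rpois() are always non-negative integers, and their mean is close to lambda.

```r
set.seed(42)
#draw 1000 counts from a Poisson distribution with lambda=1
count_sample <- rpois(1000, lambda = 1)

#Poisson samples are non-negative whole numbers
print(all(count_sample >= 0 & count_sample == round(count_sample)))#prints [1] TRUE
#the sample mean should be close to lambda=1
print(mean(count_sample))
```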


We could keep adding distributions to the Notebook, but I am going to stop here. I just want to add that R implements many distributions and that we use them to generate data points. For instance, suppose we want to simulate sampling 50 data points from a normal distribution with mean=0 and stdev=1 (a standard normal distribution). We do the following:


In [6]:
set.seed(12345)#this ensures that we always get the same results when the next function is called
normal_data<-rnorm(50,0,1)
print(normal_data)


 [1]  0.58552882  0.70946602 -0.10930331 -0.45349717  0.60588746 -1.81795597
 [7]  0.63009855 -0.27618411 -0.28415974 -0.91932200 -0.11624781  1.81731204
[13]  0.37062786  0.52021646 -0.75053199  0.81689984 -0.88635752 -0.33157759
[19]  1.12071265  0.29872370  0.77962192  1.45578508 -0.64432843 -1.55313741
[25] -1.59770952  1.80509752 -0.48164736  0.62037980  0.61212349 -0.16231098
[31]  0.81187318  2.19683355  2.04919034  1.63244564  0.25427119  0.49118828
[37] -0.32408658 -1.66205024  1.76773385  0.02580105  1.12851083 -2.38035806
[43] -1.06026555  0.93714054  0.85445172  1.46072940 -1.41309878  0.56740325
[49]  0.58318765 -1.30679883

These data come from the distribution plotted above. However, if you plot them, the result doesn't look particularly normal, or very similar to the curve we saw before.


In [7]:
hist(normal_data)


This is because we only took 50 points. If you increase the number of sampled points to 5000, things change.


In [8]:
set.seed(12345)#this ensures that we always get the same results when the next function is called
lots_of_normal_data<-rnorm(5000,0,1)
hist(lots_of_normal_data)


Let's now generate the data for Lake Poing.

We want:

  • Distance from coast line (meters)
  • Depth (meters)
  • Surface temperature (degrees Celsius)
  • Turbidity (Secchi disk depth, meters)
  • Nitrate concentration (mg/ml)
  • Phosphate concentration (mg/ml)
  • pH
  • Dissolved oxygen (mg/ml)
  • Chlorophyll concentration (mg/ml)

The function below generates data from the 50 stations for each of the variables above (assuming of course that they are normally distributed).

To keep it simple, I will generate 50 data points per variable first, assuming we went there and sampled once so far.

The function needs a vector of means, and because I want to generate data that makes some sense, I will draw from a multivariate normal distribution to add some (correlation) structure to it. For that I need a covariance matrix. However, I cannot come up with sensible covariance values on the fly (I don't know which covariance values to use...). The good news is that I can calculate the covariance matrix from the correlation matrix, which is easy to specify. All I need is the correlation matrix and a vector with the standard deviation of each variable, and the function cor2cov from the package lavaan will do the magic.
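
For reference, cor2cov() implements the standard identity "covariance = D %*% R %*% D", where R is the correlation matrix and D a diagonal matrix of standard deviations. Here is a minimal sketch with toy numbers, not using lavaan:

```r
#a small toy correlation matrix and the standard deviations of the two variables
R <- matrix(c(1.0, 0.5,
              0.5, 1.0), 2, 2, byrow = TRUE)
sds <- c(2, 3)

#scale the correlations by the standard deviations on both sides
Sigma <- diag(sds) %*% R %*% diag(sds)
print(Sigma)#diagonal holds the variances 4 and 9, off-diagonal the covariance 3
```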

As discussed in class, I don't want negative values for most variables, so I will use a truncated multivariate normal distribution to avoid values < 0.


In [9]:
#btw, this is a comment...


genMultiVariateNormalData <- function(n, mu=c(0), sigma=matrix(1), varnames, truncate=T, left_truncate=c(-Inf), right_truncate=c(Inf)) {
  
  #check that the mean vector and the covariance matrix have matching dimensions.
  if (length(mu) != NCOL(sigma)) {
    stop("mu and sigma must have the same number of columns", call.=FALSE)
  }

  stationID<-1:n

  if(truncate){

    x <- rtmvnorm(n, mean=mu, sigma=sigma, lower=left_truncate, upper=right_truncate)
      
  }else{
  
    #note the argument name is sigma (lower case) in mvtnorm's rmvnorm
    x <- rmvnorm(n, mean=mu, sigma=sigma)
      
  }
  
  dd<-data.frame(var1=stationID, var2=x)
  colnames(dd)<-varnames
  dd  
}

#the mean value for each variable
meanVector<-c(500,35,17.8,22.4,27.79,2.03,8.4,4.1,398.73)

#the value of the standard deviation of each variable
sdVector<-c(150,17,3.3,8,7.8,1.8,0.3,3.1,200)

#the correlation matrix specifying how the different variables relate to each other (one row/column per variable; the matrix is symmetric).
corMatrix<-matrix(c( 1.00, 0.23,-0.36,-0.43,-0.62,-0.51, 0.10, 0.15, 0.27,
                     0.23, 1.00,-0.34,-0.54,-0.57,-0.47,-0.12, 0.17, 0.32,
                    -0.36,-0.34, 1.00, 0.07, 0.45, 0.39, 0.09,-0.24, 0.23,
                    -0.43,-0.54, 0.07, 1.00, 0.70, 0.59,-0.03,-0.27, 0.80,
                    -0.62,-0.57, 0.45, 0.70, 1.00, 0.79, 0.21, 0.40, 0.87,
                    -0.51,-0.47, 0.39, 0.59, 0.79, 1.00, 0.13, 0.57, 0.91,
                     0.10,-0.12, 0.09,-0.03, 0.21, 0.13, 1.00, 0.07, 0.23,
                     0.15, 0.17,-0.24,-0.27, 0.40, 0.57, 0.07, 1.00, 0.28,
                     0.27, 0.32, 0.23, 0.80, 0.87, 0.91, 0.23, 0.28, 1.00), 9, 9, byrow=T)
#here we calculate the covariance matrix.
covMatrix<-cor2cov(corMatrix,sdVector)

#and here we generate the data. Note we get some warnings!
LakePoing_PhysicoChemData<-genMultiVariateNormalData(50, meanVector, covMatrix, c("StationID","DistanceFromCoastLine","Depth","Temperature","Turbidity","NitrateConcentration","PhosphateConcentration","pH","DissolvedOxygen","ChlorophylConcentration"), truncate=T, left_truncate=rep(0,9), right_truncate=rep(Inf,9))


Warning message:
In rtmvnorm.rejection(n, mean, sigma, lower, upper, ...): Acceptance rate is very low (0) and rejection sampling becomes inefficient. Consider using Gibbs sampling.
Warning message:
In rmvnorm(nproposals, mean = mean, sigma = sigma): sigma is numerically not positive semidefinite

The data has been generated, but is it how we wanted it to be?


In [10]:
LakePoing_PhysicoChemData[1,2]<-NA#set one value to NA (missing data); note below how it affects mean() and sd()
#check the header of the data
head(LakePoing_PhysicoChemData)

#print the meanVector used to generate the data
meanVector
#apply the function mean to all the columns in the dataframe (na.rm=TRUE would make mean() skip the NA)
sapply(LakePoing_PhysicoChemData, mean)
#print the sdVector used to generate the data
sdVector
#apply the function sd to all the columns in the dataframe
sapply(LakePoing_PhysicoChemData, sd)


  StationID DistanceFromCoastLine    Depth Temperature Turbidity NitrateConcentration PhosphateConcentration       pH DissolvedOxygen ChlorophylConcentration
1         1                    NA 66.25608    19.58872  17.00429             24.99985               1.757394 8.155144        3.633303                433.2720
2         2              406.1727 39.42994    15.99275  34.23102             39.45532               5.198505 8.163536        6.349586                587.0538
3         3              749.9593 52.85495    18.52725  25.13875             28.12810               4.056897 8.378468        4.963911                732.9942
4         4              511.9407 70.05928    17.24642  17.56738             29.62252               3.210157 8.379396        8.481022                578.2946
5         5              490.1708 21.29867    19.73065  17.47653             26.19793               1.061041 8.588204        3.977796                255.7663
6         6              406.1146 44.73939    16.85460  32.09275             36.91632               4.762509 8.770558        5.268300                569.1504
[1] 500.00  35.00  17.80  22.40  27.79   2.03   8.40   4.10 398.73

StationID                 25.5
DistanceFromCoastLine     NA
Depth                     35.6221835580608
Temperature               18.7932010447359
Turbidity                 24.7425744096173
NitrateConcentration      32.9751833129685
PhosphateConcentration    3.14902905966643
pH                        8.40507489017078
DissolvedOxygen           5.30868257423964
ChlorophylConcentration   456.269838526354

[1] 150.0  17.0   3.3   8.0   7.8   1.8   0.3   3.1 200.0

StationID                 14.5773797371133
DistanceFromCoastLine     NA
Depth                     16.6911668632812
Temperature               3.06820799305462
Turbidity                 8.73061357623073
NitrateConcentration      7.97899025277311
PhosphateConcentration    1.90228191159872
pH                        0.275940524092886
DissolvedOxygen           2.5117915641513
ChlorophylConcentration   168.476392031186

Looks decent. What about the correlation structure of the variables?


In [11]:
#print correlation matrix used
corMatrix
#we don't care about the first column = stationID; print actual correlation matrix
cor(LakePoing_PhysicoChemData[,2:10])


 1.00  0.23 -0.36 -0.43 -0.62 -0.51  0.10  0.15  0.27
 0.23  1.00 -0.34 -0.54 -0.57 -0.47 -0.12  0.17  0.32
-0.36 -0.34  1.00  0.07  0.45  0.39  0.09 -0.24  0.23
-0.43 -0.54  0.07  1.00  0.70  0.59 -0.03 -0.27  0.80
-0.62 -0.57  0.45  0.70  1.00  0.79  0.21  0.40  0.87
-0.51 -0.47  0.39  0.59  0.79  1.00  0.13  0.57  0.91
 0.10 -0.12  0.09 -0.03  0.21  0.13  1.00  0.07  0.23
 0.15  0.17 -0.24 -0.27  0.40  0.57  0.07  1.00  0.28
 0.27  0.32  0.23  0.80  0.87  0.91  0.23  0.28  1.00
(columns in the same order as the rows: DistanceFromCoastLine ... ChlorophylConcentration)

DistanceFromCoastLine     1            NA           NA          NA         NA          NA          NA          NA         NA
Depth                    NA   1.000000000  0.006836875 -0.28713456 -0.2245421 -0.12865580 -0.10911953  0.07600874  0.2852102
Temperature              NA   0.006836875  1.000000000 -0.05344176  0.1721365  0.17918399  0.15425890 -0.44786711 -0.0984531
Turbidity                NA  -0.287134559 -0.053441763  1.00000000  0.8702103  0.62758425  0.13935539 -0.13901238  0.5863490
NitrateConcentration     NA  -0.224542143  0.172136490  0.87021031  1.0000000  0.77123670  0.32658433  0.11916594  0.5272028
PhosphateConcentration   NA  -0.128655801  0.179183987  0.62758425  0.7712367  1.00000000  0.06232714  0.28630361  0.4355788
pH                       NA  -0.109119528  0.154258901  0.13935539  0.3265843  0.06232714  1.00000000  0.15459914  0.1156120
DissolvedOxygen          NA   0.076008739 -0.447867115 -0.13901238  0.1191659  0.28630361  0.15459914  1.00000000  0.2328112
ChlorophylConcentration  NA   0.285210156 -0.098453102  0.58634903  0.5272028  0.43557884  0.11561198  0.23281121  1.0000000

Another way to do it is to use the function pairs().


In [12]:
#again skipping the StationID column
pairs(LakePoing_PhysicoChemData[,2:10])


Now the usual exploratory data analysis round. This time I'll do it the lazy way... I mean with a bit of programming.


In [13]:
for(column in colnames(LakePoing_PhysicoChemData[,2:10])){
    
    title<-paste("Histogram of", column)#paste glues its arguments together with a space by default
    hist(LakePoing_PhysicoChemData[[column]], main=title, xlab=column)
    
}


This is the equivalent of:

>hist(LakePoing_PhysicoChemData$DistanceFromCoastLine, main="DistanceFromCoastLine")

>hist(LakePoing_PhysicoChemData$Temperature, main="Temperature")

...

>hist(LakePoing_PhysicoChemData$ChlorophylConcentration, main="ChlorophylConcentration")

but way faster... I mean I have to write less code (but think/google more).
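
One subtlety that bites here is paste(): sep only separates multiple arguments element-wise, while collapse is what joins the elements of a single vector into one string. A quick sketch:

```r
#sep glues separate arguments together (with a space by default)
print(paste("Histogram of", "Depth"))#prints [1] "Histogram of Depth"

#a single vector with sep comes back as a vector of length 2, not one string
print(paste(c("Histogram of", "Depth"), sep = " "))

#collapse joins the elements of one vector into a single string
print(paste(c("Histogram of", "Depth"), collapse = " "))#prints [1] "Histogram of Depth"
```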

If you are interested in something similar to what the pairs() function does, but only for selected variables, you can plot them directly against each other. For example:


In [14]:
plot(LakePoing_PhysicoChemData$ChlorophylConcentration~LakePoing_PhysicoChemData$PhosphateConcentration)


Or you can try to code something:


In [15]:
#remember we don't want column 1...
#note that y stops at the second-to-last column; otherwise x would run past the last column and R would complain about undefined columns

for(y in 2:(ncol(LakePoing_PhysicoChemData)-1)){#for all columns from column 2 up to the second-to-last
    for(x in (y+1):ncol(LakePoing_PhysicoChemData)){#for all columns after the currently selected one
        title<-paste(colnames(LakePoing_PhysicoChemData)[y], "vs.", colnames(LakePoing_PhysicoChemData)[x])
        plot(LakePoing_PhysicoChemData[[y]]~LakePoing_PhysicoChemData[[x]], main=title)
    } 
}

#how can you modify the x and y labels in the previous nested for loop to plot nicer graphs?

Now, remember we wanted a sampling that reflected the zonation of the lake margins. These areas were:

  • Natural reserve
  • Agricultural land use
  • Town area
  • Industrial area (the mining processing company that wants to expand)

Using the function genMultiVariateNormalData, we could generate n data points for each of these areas, as long as they add up to 50 (our N), because we sampled only 50 stations.

Above, I mentioned I wanted a fifth class to group stations that cannot be assigned to any area. With five areas, this will allow us to sample 10 data points per area.

Now, coupling a for loop with a call to the function genMultiVariateNormalData, I can generate 10 data points for all the variables (9) in all the areas (5).


In [ ]:
areaDF<-data.frame(StationID=as.numeric(),DistanceFromCoastLine=as.numeric(),Depth=as.numeric(),Temperature=as.numeric(),Turbidity=as.numeric(),NitrateConcentration=as.numeric(),PhosphateConcentration=as.numeric(),pH=as.numeric(),DissolvedOxygen=as.numeric(),ChlorophylConcentration=as.numeric())

for(area in letters[1:5]) {
    
    #the mean value for each variable
    meanVector<-c(500,35,17.8,22.4,27.79,2.03,8.4,4.1,398.73)

    #the value of the standard deviation of each variable
    sdVector<-c(150,17,3.3,8,7.8,1.8,0.3,3.1,200)

    #the correlation matrix specifying how different variables relate to each other.
    corMatrix<-matrix(c(1,0.23,-0.36,-0.43,-0.62,-0.51,0.1,0.15,0.27,0.23,1,-0.34,-0.54,-0.57,-0.47,-0.12,0.17,0.32,-0.36,-0.34,1,0.07,0.45,0.39,0.09,-0.24,0.23,-0.43,-0.54,0.07,1,0.7,0.59,-0.03,-0.27,0.8,-0.62,-0.57,0.45,0.7,1,0.79,0.21,0.4,0.87,-0.51,-0.47,0.39,0.59,0.79,1,0.13,0.57,0.91,0.1,-0.12,0.09,-0.03,0.21,0.13,1,0.07,0.23,0.15,0.17,-0.24,-0.27,0.4,0.57,0.07,1,0.28,0.27,0.32,0.23,0.8,0.87,0.91,0.23,0.28,1),9,9)
    
    #here we calculate the covariance matrix.
    covMatrix<-cor2cov(corMatrix,sdVector)
    tmpDF<-data.frame(genMultiVariateNormalData(10, meanVector, covMatrix, c("StationID","DistanceFromCoastLine","Depth","Temperature","Turbidity","NitrateConcentration","PhosphateConcentration","pH","DissolvedOxygen","ChlorophylConcentration"), truncate=T, left_truncate=rep(0,9), right_truncate=rep(Inf,9)),Area=rep(area,10))
    areaDF<-merge(areaDF,tmpDF, all=T)
}

Let's check the generated data for one of the variables:


In [ ]:
tapply(areaDF$PhosphateConcentration,areaDF$Area, mean)
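
The same per-area summary can also be computed with aggregate(), which returns a data frame rather than the named vector tapply() gives you; a minimal sketch on a toy data frame standing in for areaDF:

```r
#toy data with two areas and made-up phosphate values
toy <- data.frame(Area = rep(c("a", "b"), each = 3),
                  PhosphateConcentration = c(1, 2, 3, 4, 5, 6))

#tapply: named vector of per-area means
print(tapply(toy$PhosphateConcentration, toy$Area, mean))

#aggregate: the same means, returned as a data frame
print(aggregate(PhosphateConcentration ~ Area, data = toy, FUN = mean))
```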

To make the different areas of the lake differ in a meaningful way, one would pass a matrix specifying the means (and possibly the standard deviations) of the different categories, and iterate over its rows (each row representing the mean or standard deviation vector of one area) to sample from a multivariate normal that matches those particular values. For this example I will keep the correlation structure constant (as far as possible), i.e. it will be the same for all localities, and I will use the same standard deviation for each variable; only the mean values will change between areas.

One way to achieve this is to call the function genMultiVariateNormalData inside a for loop that iterates over the matrix rows.

For example:


In [ ]:
#the mean value for each variable will appear as a column in the matrix. Rows will represent localities.

#the order of the variables is:
#c("DistanceFromCoastLine","Depth","Temperature","Turbidity","NitrateConcentration","PhosphateConcentration","pH","DissolvedOxygen","ChlorophylConcentration")

#the order of the areas will be:
#1=lake, 2=city, 3=natural reserve, 4=agriculture, 5=mining

meanMatrix<-matrix(c(420,32,16.2,17,25.85,2.21,8.1,3.9,360.30,
                     510,31,17.1,16,26.90,2.45,8.2,4.1,380.65,
                     370,23,16.8,19,22.20,1.03,8.3,4.3,300.03,
                     368,28,17.5,15,29.90,3.50,8.0,3.0,450.79,
                     480,35,16.4,14,27.79,2.03,7.4,2.2,398.47), ncol=9, nrow=5, byrow=T)

#the value of the standard deviation of each variable
sdVector<-c(50,7,2.1,4,2.8,0.8,0.3,1.1,120)

#the correlation matrix specifying how the different variables relate to each other (same structure as before).
corMatrix<-matrix(c( 1.00, 0.23,-0.36,-0.43,-0.62,-0.51, 0.10, 0.15, 0.27,
                     0.23, 1.00,-0.34,-0.54,-0.57,-0.47,-0.12, 0.17, 0.32,
                    -0.36,-0.34, 1.00, 0.07, 0.45, 0.39, 0.09,-0.24, 0.23,
                    -0.43,-0.54, 0.07, 1.00, 0.70, 0.59,-0.03,-0.27, 0.80,
                    -0.62,-0.57, 0.45, 0.70, 1.00, 0.79, 0.21, 0.40, 0.87,
                    -0.51,-0.47, 0.39, 0.59, 0.79, 1.00, 0.13, 0.57, 0.91,
                     0.10,-0.12, 0.09,-0.03, 0.21, 0.13, 1.00, 0.07, 0.23,
                     0.15, 0.17,-0.24,-0.27, 0.40, 0.57, 0.07, 1.00, 0.28,
                     0.27, 0.32, 0.23, 0.80, 0.87, 0.91, 0.23, 0.28, 1.00), 9, 9, byrow=T)

#here we calculate the covariance matrix.
covMatrix<-cor2cov(corMatrix,sdVector)

#empty dataframe to store the data.
areaDF<-data.frame(StationID=as.numeric(),DistanceFromCoastLine=as.numeric(),Depth=as.numeric(),Temperature=as.numeric(),Turbidity=as.numeric(),NitrateConcentration=as.numeric(),PhosphateConcentration=as.numeric(),pH=as.numeric(),DissolvedOxygen=as.numeric(),ChlorophylConcentration=as.numeric())

for(row in 1:nrow(meanMatrix)){
    
    tmpDF<-data.frame(genMultiVariateNormalData(10, meanMatrix[row,], covMatrix, c("StationID","DistanceFromCoastLine","Depth","Temperature","Turbidity","NitrateConcentration","PhosphateConcentration","pH","DissolvedOxygen","ChlorophylConcentration"), truncate=T, left_truncate=rep(0,9), right_truncate=rep(Inf,9)),Area=rep(letters[row],10))
    areaDF<-merge(areaDF,tmpDF, all=T)
}

#note this saves the 50-station dataframe generated earlier, not areaDF
write.csv(LakePoing_PhysicoChemData, "./LakePoing_PhysicoChemData.csv")
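
If you want to read the file back in later, note that write.csv() adds the row names as an extra first column by default; a small round-trip sketch using a temporary file:

```r
#toy data frame standing in for the lake data
df <- data.frame(StationID = 1:3, Depth = c(12.5, 40.1, 7.3))

#write.csv() stores the row names as an unnamed first column
tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp)

#row.names=1 turns that column back into row names on the way in
df_back <- read.csv(tmp, row.names = 1)
print(identical(dim(df), dim(df_back)))#prints [1] TRUE
```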

In [ ]:
plot(areaDF$ChlorophylConcentration~areaDF$Area)