This notebook trains machine learning algorithms to predict facies from well log data. The dataset comes from a class exercise from the University of Kansas on Neural Networks and Fuzzy Systems and is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields.
My approach in this notebook will be to build many derivative features in an attempt to incorporate more spatial awareness into each observation. I will tune the model using a cross-validation technique that combines the well-known leave-one-out and k-fold techniques. Since the contest goal is to accurately predict facies (as measured by the F1-score metric) for the STUART and CRAWFORD wells, the cross-validation borrows from leave-one-out in that it will leave the data for two wells out of the training set. It borrows from k-fold in that I will loop through each pair of left-out wells, building multiple models for each set of tuning parameters.
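To make the scheme concrete, here is a minimal toy sketch of the leave-two-wells-out pair loop (the data frame and well names here are placeholders; the real loops over the actual wells appear later in this notebook):

# Toy illustration of the leave-two-wells-out scheme (placeholder data)
alldata <- data.frame(Well.Name = rep(c("A", "B", "C", "D"), each = 5))
wells <- unique(alldata$Well.Name)
for (i in 1:(length(wells) - 1)) {
  for (j in (i + 1):length(wells)) {
    holdout <- alldata$Well.Name %in% wells[c(i, j)]
    # fit on alldata[!holdout,], score F1 on alldata[holdout,]
    print(paste("holding out:", wells[i], "and", wells[j]))
  }
}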
Because only some of our wells have the PE log, we should build two classifier systems: one including PE but trained on fewer observations, and one excluding PE but trained on more observations. It will be interesting to note the performance differences between the two.
To start, let's load the data! For brevity, much of the code is hidden in R files; you will see this whenever a code cell calls source("somefile.R"). These files live in the same directory as this notebook and may be viewed there.
In [1]:
source("loadData.R")
# load and clean the data
raw <- loadData()
raw <- cleanData(raw)
head(raw,3)
In [2]:
source("preProcData.R")
"Before center and scaling:"
format(head(raw, 3), digits=3)
# center and scale the petrophysical log features
# (na.rm=TRUE handles the missing PE values noted above)
cs <- raw
logCols <- c("GR", "ILD_log10", "DeltaPHI", "PHIND", "PE")
for (col in logCols) {
  cs[[col]] <- (cs[[col]] - mean(cs[[col]], na.rm=TRUE)) / sd(cs[[col]], na.rm=TRUE)
}
"After center and scaling:"
format(head(cs, 3), digits=3)
Our centering and scaling appears to be working. Now let's pre-process the data by adding derivative features. For this, we'll loop through each well and apply a lag to each petrophysical feature. The purpose is to add some spatial context to each observation (in a way not dissimilar from how a petrophysicist interprets well logs).
Now let's move forward and split the data into two data frames:
In [3]:
# we only want data with complete observations (no missing features)
# we also exclude the Recruit F9 well, as our derivative features won't make sense for them
cs_withPE <- cs[complete.cases(cs) & cs$Well.Name != "Recruit F9",]
cs_withoutPE <- cs[cs$Well.Name != "Recruit F9",
                   c("Facies", "Formation", "Well.Name", "Depth", "GR",
                     "ILD_log10", "DeltaPHI", "PHIND", "isMarine", "RELPOS")]
"With PE log:"
paste("Rows:", nrow(cs_withPE))
format(head(cs_withPE, 3), digits=3)
"Without PE log:"
paste("Rows:", nrow(cs_withoutPE))
format(head(cs_withoutPE, 3), digits=3)
From this we can see that the With PE data frame has an extra feature, PE, giving it 3164 values not in the Without PE data frame. However, the Without PE data frame has 4069 - 3164 = 905 observations not in the With PE data frame. Across the four raw petrophysical features (GR, ILD_log10, DeltaPHI, and PHIND), that amounts to 905 x 4 = 3620 raw numeric values, plus the categorical/metadata features Formation, Well name, Depth, isMarine, and Relative position.
Moving forward, it will be interesting to see the differences in performance between these two cases.
Now let's add some derivative features. First, I shall add forward, backward, and central first-order differences of the petrophysical logs GR, ILD_log10, DeltaPHI, PHIND, and PE. Second, I shall add features by lagging GR, ILD_log10, DeltaPHI, PHIND, PE, and isMarine by various amounts. After this, the original channels will be discarded, as they are equivalent to the zero-lag channels.
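The heavy lifting here is done by lagData() in preProcData.R, whose implementation isn't shown in the notebook. As a rough illustration only (my hypothetical sketch, not the actual implementation), the per-log building blocks might look like the function below, which assumes a single well's log x with length n > max(k, 2):

# Hypothetical sketch of the per-log derivative features
diffAndLag <- function(x, k) {
  n <- length(x)
  data.frame(
    fwd  = c(diff(x), NA),                        # forward diff:  x[i+1] - x[i]
    bwd  = c(NA, diff(x)),                        # backward diff: x[i] - x[i-1]
    ctr  = c(NA, (x[3:n] - x[1:(n-2)]) / 2, NA),  # central diff:  (x[i+1] - x[i-1]) / 2
    lagk = c(rep(NA, k), x[1:(n - k)])            # value from k samples up-hole
  )
}
# e.g. diffAndLag(cs_withPE$GR[cs_withPE$Well.Name == "SHRIMPLIN"], 5)

Presumably lagData() applies constructions like these to every log within every well, with its second argument (30 in the next cell) controlling how far the lags reach.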
In [4]:
source("preProcData.R")
paste("# columns before forward, backward, and central difference feature-building:", ncol(cs_withPE))
lag_withPE <- lagData(cs_withPE, 30)
lag_withoutPE <- lagData(cs_withoutPE, 30)
paste("# columns after lag feature-building:", ncol(lag_withPE))
format(head(lag_withPE,3), digits=3)
Looks good, let's do some cleanup and move on to model tuning!
In [5]:
# keep the final frames under new names, then free intermediates from memory
data_withPE <- lag_withPE
data_withoutPE <- lag_withoutPE
data_raw <- raw
rm(raw, cs, cs_withPE, cs_withoutPE, lag_withPE, lag_withoutPE)
In [7]:
options(warn=-1)
library(caret)
source("accuracyMetrics.R")
train <- data_withPE[data_withPE$Well.Name != "SHRIMPLIN",]
test <- data_withPE[data_withPE$Well.Name == "SHRIMPLIN",]
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10)
fit <- train(Facies ~ ., data=subset(train, select=-c(Well.Name, Depth)),
             method="rf", metric="Kappa",
             trControl=fitControl, tuneLength=10, preProcess=c("ica"))
test$Predicted <- predict(fit, newdata=test)
f1 <- myF1Metric(test$Predicted, test$Facies)
print(paste("F1-score of:", f1))
In [7]:
library(caret)
source("accuracyMetrics.R")
fits <- list()
weights <- list()
train <- data_withPE[data_withPE$Well.Name != "SHRIMPLIN",]
test <- data_withPE[data_withPE$Well.Name == "SHRIMPLIN",]
wells <- unique(train$Well.Name)
for (well_i in wells) {
  if (well_i == "Recruit F9") {
    # Recruit F9 was excluded when building data_withPE, so this branch
    # should never trigger; skip it defensively
    next
  }
  # model trained only on the data for the current well_i
  train_i <- train[train$Well.Name == well_i,]
  fit <- train(Facies ~ ., data=subset(train_i, select=-c(Well.Name, Depth)),
               method="svmLinear", metric="Kappa",
               trControl=fitControl, tuneLength=10, preProcess=c("ica"))
  fits[[paste(well_i)]] <- fit
  test[,paste(well_i)] <- predict(fit, newdata=test)
}
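The loop above adds one prediction column per training well and stores each fit in fits. The empty weights list suggests the per-well predictions are meant to be combined into an ensemble; a simple hypothetical combination (not part of the original code) is an unweighted majority vote across the per-well columns:

# Hypothetical follow-up: unweighted majority vote across per-well predictions
voteCols <- as.character(wells[wells != "Recruit F9"])
test$Voted <- factor(apply(test[, voteCols], 1,
                           function(row) names(which.max(table(row)))),
                     levels=levels(test$Facies))
f1 <- myF1Metric(test$Voted, test$Facies)
print(paste("Majority-vote F1-score of:", f1))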
In [13]:
library(caret)
source("accuracyMetrics.R")
train <- data_withoutPE[data_withoutPE$Well.Name != "SHRIMPLIN",]
test <- data_withoutPE[data_withoutPE$Well.Name == "SHRIMPLIN",]
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10)
fit <- train(Facies ~ ., data=subset(train, select=-c(Well.Name, Depth)),
             method="rf", metric="Kappa", trControl=fitControl)
test$Predicted <- predict(fit, newdata=test)
f1 <- myF1Metric(test$Predicted, test$Facies)
print(paste("F1-score of:", f1))
In [14]:
print(fit)
In [23]:
source("accuracyMetrics.R")
options(warn=-1)
suppressMessages(library(caret))
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated three times
                           repeats = 3)
wells <- unique(data_withPE$Well.Name)
results <- data.frame()
for (i in 1:(length(wells)-1)) {
  for (j in (i+1):(length(wells))) {
    trainIndex <- data_withPE$Well.Name != wells[i] & data_withPE$Well.Name != wells[j]
    train <- data_withPE[trainIndex,]
    test <- data_withPE[!trainIndex,]
    fit <- train(Facies ~ ., data=subset(train, select=-c(Well.Name, Depth)),
                 method="rf",
                 metric="Kappa",
                 trControl=fitControl)
    test$Predicted <- predict(fit, newdata=test)
    f1 <- myF1Metric(test$Predicted, test$Facies)
    results <- rbind(results, data.frame(testwell1=wells[i], testwell2=wells[j], f1=f1))
  }
}
print(format(results, digits=4))
print(paste("Average F1-score of:", round(mean(results$f1), 4)))
In [ ]:
source("accuracyMetrics.R")
options(warn=-1)
suppressMessages(library(caret))
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated three times
                           repeats = 3)
wells <- unique(data_withPE$Well.Name)
results <- data.frame()
for (i in 1:(length(wells)-1)) {
  for (j in (i+1):(length(wells))) {
    trainIndex <- data_withPE$Well.Name != wells[i] & data_withPE$Well.Name != wells[j]
    train <- data_withPE[trainIndex,]
    test <- data_withPE[!trainIndex,]
    fit <- train(Facies ~ ., data=subset(train, select=-c(Well.Name, Depth)),
                 method="svmLinear",
                 metric="Kappa",
                 tuneLength=4,
                 trControl=fitControl)
    print(fit)
    test$Predicted <- predict(fit, newdata=test)
    f1 <- myF1Metric(test$Predicted, test$Facies)
    results <- rbind(results, data.frame(testwell1=wells[i], testwell2=wells[j], f1=f1))
  }
}
print(format(results, digits=4))
print(paste("Average F1-score of:", round(mean(results$f1), 4)))
In [26]:
raw_withPE <- data_raw[complete.cases(data_raw) & data_raw$Well.Name != "Recruit F9",]
raw_withoutPE <- data_raw[data_raw$Well.Name != "Recruit F9",
                          c("Facies", "Formation", "Well.Name", "Depth", "GR",
                            "ILD_log10", "DeltaPHI", "PHIND", "isMarine", "RELPOS")]
In [27]:
source("accuracyMetrics.R")
options(warn=-1)
suppressMessages(library(caret))
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated three times
                           repeats = 3)
wells <- unique(raw_withPE$Well.Name)
results <- data.frame()
for (i in 1:(length(wells)-1)) {
  for (j in (i+1):(length(wells))) {
    trainIndex <- raw_withPE$Well.Name != wells[i] & raw_withPE$Well.Name != wells[j]
    train <- raw_withPE[trainIndex,]
    test <- raw_withPE[!trainIndex,]
    fit <- train(Facies ~ ., data=subset(train, select=-c(Well.Name, Depth)),
                 method="rf",
                 metric="Kappa",
                 trControl=fitControl)
    test$Predicted <- predict(fit, newdata=test)
    f1 <- myF1Metric(test$Predicted, test$Facies)
    results <- rbind(results, data.frame(testwell1=wells[i], testwell2=wells[j], f1=f1))
  }
}
print(format(results, digits=4))
In [28]:
print(paste("Average F1-score of:", round(mean(results$f1), 4)))
In [29]:
source("accuracyMetrics.R")
options(warn=-1)
suppressMessages(library(caret))
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10)
wells <- unique(raw_withPE$Well.Name)
results <- data.frame()
for (i in 1:(length(wells)-1)) {
  for (j in (i+1):(length(wells))) {
    trainIndex <- raw_withPE$Well.Name != wells[i] & raw_withPE$Well.Name != wells[j]
    train <- raw_withPE[trainIndex,]
    test <- raw_withPE[!trainIndex,]
    fit <- train(Facies ~ ., data=subset(train, select=-c(Well.Name, Depth)),
                 method="rf",
                 metric="Kappa",
                 trControl=fitControl)
    test$Predicted <- predict(fit, newdata=test)
    f1 <- myF1Metric(test$Predicted, test$Facies)
    results <- rbind(results, data.frame(testwell1=wells[i], testwell2=wells[j], f1=f1))
  }
}
print(format(results, digits=4))
print(paste("Average F1-score of:", round(mean(results$f1), 4)))