Facies classification using Machine Learning

Joshua Poirier, NEOS

This notebook trains machine learning algorithms to predict facies from well log data. The dataset comes from a class exercise from the University of Kansas on Neural Networks and Fuzzy Systems and is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields.

My approach in this notebook will be to build many derivative features in an attempt to incorporate more spatial awareness into each observation. I will tune the model using a cross-validation technique which combines the well-known leave one out and k-folds techniques. Since the contest goal is to accurately (using the F1-score metric) predict facies for the STUART and CRAWFORD wells, the cross-validation will leave one out in the sense that it will the data for two wells out of the training set. It will borrow from k-folds in the sense that I will loop through each pair of leave one out wells, building multiple models for each set of tuning parameters.

Because only some of our wells have the PE log, we should build two classifier systems. One including PE but fewer observations, and one excluding PE but with more observations. It will be interesting to note performance differences between the two.

Load data

To start, let's load the data! For brevity, much of the code is hidden in R files. You may note this whenever you see source("somefile.R") in the code cell. These files may be viewed and are located in the same directory as this notebook.


In [1]:
source("loadData.R")

# load and clean the data
raw <- loadData()
raw <- cleanData(raw)

head(raw,3)


FaciesFormationWell.NameDepthGRILD_log10DeltaPHIPHINDPEisMarineRELPOS
FSiS A1 SH SHRIMPLIN2793.0 77.45 0.664 9.9 11.915 4.6 FALSE 1.000
FSiS A1 SH SHRIMPLIN2793.5 78.26 0.661 14.2 12.565 4.1 FALSE 0.979
FSiS A1 SH SHRIMPLIN2794.0 79.05 0.658 14.8 13.050 3.6 FALSE 0.957

Pre-process data

First, let's center and scale the petrophysical features (GR, ILD_log10, DeltaPHI, PHIND, and PE).


In [2]:
source("preProcData.R")

"Before center and scaling:"
format(head(raw, 3), digits=3)

# center and scale the petrophysical log features
cs <- raw
cs$GR <- (cs$GR - mean(cs$GR)) / sd(cs$GR)
cs$ILD_log10 <- (cs$ILD_log10 - mean(cs$ILD_log10)) / sd(cs$ILD_log10)
cs$DeltaPHI <- (cs$DeltaPHI - mean(cs$DeltaPHI)) / sd(cs$DeltaPHI)
cs$PHIND <- (cs$PHIND - mean(cs$PHIND)) / sd(cs$PHIND)
cs$PE <- (cs$PE - mean(cs$PE, na.rm=T)) / sd(cs$PE, na.rm=T)

"After center and scaling:"
format(head(cs, 3), digits=3)


'Before center and scaling:'
FaciesFormationWell.NameDepthGRILD_log10DeltaPHIPHINDPEisMarineRELPOS
FSiS A1 SH SHRIMPLIN2793 77.5 0.664 9.9 11.9 4.6 FALSE 1.000
FSiS A1 SH SHRIMPLIN2794 78.3 0.661 14.2 12.6 4.1 FALSE 0.979
FSiS A1 SH SHRIMPLIN2794 79.0 0.658 14.8 13.1 3.6 FALSE 0.957
'After center and scaling:'
FaciesFormationWell.NameDepthGRILD_log10DeltaPHIPHINDPEisMarineRELPOS
FSiS A1 SH SHRIMPLIN2793 0.413 0.01755 1.04 -0.1803 0.976 FALSE 1.000
FSiS A1 SH SHRIMPLIN2794 0.440 0.00568 1.86 -0.0892 0.418 FALSE 0.979
FSiS A1 SH SHRIMPLIN2794 0.466 -0.00620 1.97 -0.0212 -0.140 FALSE 0.957

Our center and scaling appears to be working. Now let's pre-process the data by adding derivative features. For this, we'll loop thorugh each well and apply a lag to each petrophysical feature. The purpose is to add some spatial context to each observation (in a way not dissimilar from how a petrophysicist would interpret well logs).

Now let's move forward and split the data into two data frames:

  • Every observation has a valid value for every feature (with the PE log in our case)
  • All features with a valid value for every observation (without the PE log)

In [3]:
# we only want data with complete observations (no missing features)
# we also exclude the Recruit F9 well, as our derivative features won't make sense for them
cs_withPE <- cs[complete.cases(cs) & cs$Well.Name != "Recruit F9",]
cs_withoutPE <- cs[cs$Well.Name != "Recruit F9",c("Facies", "Formation", "Well.Name", "Depth", "GR", 
                          "ILD_log10", "DeltaPHI", "PHIND", "isMarine", "RELPOS")]

"With PE log:"
paste("Rows:", nrow(cs_withPE))
format(head(cs_withPE, 3), digits=3)

"Without PE log:"
paste("Rows:", nrow(cs_withoutPE))
format(head(cs_withoutPE, 3), digits=3)


'With PE log:'
'Rows: 3164'
FaciesFormationWell.NameDepthGRILD_log10DeltaPHIPHINDPEisMarineRELPOS
FSiS A1 SH SHRIMPLIN2793 0.413 0.01755 1.04 -0.1803 0.976 FALSE 1.000
FSiS A1 SH SHRIMPLIN2794 0.440 0.00568 1.86 -0.0892 0.418 FALSE 0.979
FSiS A1 SH SHRIMPLIN2794 0.466 -0.00620 1.97 -0.0212 -0.140 FALSE 0.957
'Without PE log:'
'Rows: 4069'
FaciesFormationWell.NameDepthGRILD_log10DeltaPHIPHINDisMarineRELPOS
FSiS A1 SH SHRIMPLIN2793 0.413 0.01755 1.04 -0.1803 FALSE 1.000
FSiS A1 SH SHRIMPLIN2794 0.440 0.00568 1.86 -0.0892 FALSE 0.979
FSiS A1 SH SHRIMPLIN2794 0.466 -0.00620 1.97 -0.0212 FALSE 0.957

From this we can see that the With PE data frame has an extra feature, PE, giving it 3164 values not in the Without PE dataframe. However, the Without PE data frame has 4069-3164=905 observations not in the With PE data frame - across four raw petrophysical features (GR, ILD_log10, DeltaPHI, and PHIND) amounting 3620 raw numeric values, plus the categorical/metadata features of Formation, Well name, Depth, isMarine, and Relative position.

Moving forward, it will be interesting to see the differences in performance between these two cases.

Now let's add some derivative features. First, I shall add forward, backward, and central first-order differences of the petrophysical logs GR, ILD_log10, DeltaPHI, PHIND, and PE. Second, I shall add features by lagging GR, ILD_log10, DeltaPHI, PHIND, PE, and isMarine by various amounts. After this the original channels will be discarded as they will be equivalent to the zero-lag channels.


In [4]:
source("preProcData.R")

paste("# columns before forward, backward, and central difference feature-building:", ncol(cs_withPE))

lag_withPE <- lagData(cs_withPE, 30)
lag_withoutPE <- lagData(cs_withoutPE, 30)

paste("# columns after lag feature-building:", ncol(lag_withPE))
format(head(lag_withPE,3), digits=3)


'# columns before forward, backward, and central difference feature-building: 11'
'# columns after lag feature-building: 191'
FaciesFormationWell.NameDepthRELPOSGR_-15ILD_log10_-15DeltaPHI_-15PHIND_-15isMarine_-15...DeltaPHI_14PHIND_14isMarine_14PE_14GR_15ILD_log10_15DeltaPHI_15PHIND_15isMarine_15PE_15
FSiS A1 SH SHRIMPLIN2793 1.000 0.436 -0.762 1.402 -0.1803 FALSE ... 1.88 -0.1803 FALSE 0.976 0.436 -0.762 1.40 -0.1803 FALSE 0.976
FSiS A1 SH SHRIMPLIN2794 0.979 0.144 -0.877 0.966 -0.0892 FALSE ... 1.69 -0.0892 FALSE 0.418 0.830 -0.556 1.88 -0.0892 FALSE 0.418
FSiS A1 SH SHRIMPLIN2794 0.957 -0.046 -0.956 0.834 -0.0212 FALSE ... 1.31 -0.0212 FALSE -0.140 0.610 -0.485 1.69 -0.0212 FALSE -0.140

Looks good, let's do some cleanup and move on to model tuning!


In [5]:
# cleanup memory
data_withPE <- lag_withPE
data_withoutPE <- lag_withoutPE
data_raw <- raw

rm(raw, cs, cs_withPE, cs_withoutPE, lag_withPE, lag_withoutPE)

With PE


In [7]:
options(warn=-1)

library(caret)
source("accuracyMetrics.R")

train <- data_withPE[data_withPE$Well.Name != "SHRIMPLIN",]
test <- data_withPE[data_withPE$Well.Name == "SHRIMPLIN",]

fitControl <- trainControl(method = "repeatedcv", 
                           number = 10,
                           repeats = 10
                          )
fit <- train(Facies ~ ., data=subset(train, select=-c(Well.Name, Depth)), 
             method="rf", metric="Kappa", 
             trControl=fitControl, tuneLength=10, preProcess=c("ica"))
test$Predicted <- predict(fit, newdata=test)

f1 <- myF1Metric(test$Predicted, test$Facies)
print(paste("F1-score of:", f1))


[1] "F1-score of: 0.431372549019608"

In [7]:
library(caret)
source("accuracyMetrics.R")

fits <- list()
weights <- list()

train <- data_withPE[data_withPE$Well.Name != "SHRIMPLIN",]
test <- data_withPE[data_withPE$Well.Name == "SHRIMPLIN",]
wells <- unique(train$Well.Name)

for (well_i in wells) {
    if (well_i == "Recruit F9") {
        # generic model using all data
    } else {
        # model for current well_i
        train_i <- train[train$Well.Name == well_i,]
        fit <- train(Facies ~ ., data=subset(train_i, select=-c(Well.Name, Depth)),
                    method="svmLinear", metric="Kappa",
                    trControl=fitControl, tuneLength=10, preProcess=c("ica"))
        test[,paste(well_i)] <- predict(fit, newdata=test)
    }
}


[1] "F1-score of: 0.456475583864119"

Without PE


In [13]:
library(caret)
source("accuracyMetrics.R")

train <- data_withoutPE[data_withoutPE$Well.Name != "SHRIMPLIN",]
test <- data_withoutPE[data_withoutPE$Well.Name == "SHRIMPLIN",]

fitControl <- trainControl(## 10-fold CV
                   method = "repeatedcv",
                   number = 10,
                   ## repeated ten times
                   repeats = 10)
fit <- train(Facies ~ ., data=train, method="rf", metric="Kappa", trControl=fitControl)
test$Predicted <- predict(fit, newdata=test)

f1 <- myF1Metric(test$Predicted, test$Facies)
print(paste("F1-score of:", f1))


[1] "F1-score of: 0.549019607843137"

In [14]:
print(fit)


Random Forest 

3598 samples
 121 predictor
   9 classes: 'SS', 'CSiS', 'FSiS', 'SiSh', 'MS', 'WS', 'D', 'PS', 'BS' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 3241, 3237, 3239, 3237, 3237, 3238, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
    2   0.7134674  0.6563012
   71   0.8150548  0.7804672
  141   0.8136680  0.7789113

Kappa was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 71. 

Tuning

With PE Case


In [23]:
source("accuracyMetrics.R")

options(warn=-1)
suppressMessages(library(caret))

fitControl <- trainControl(## 10-fold CV
                   method = "repeatedcv",
                   number = 10,
                   ## repeated ten times
                   repeats = 3)

wells <- unique(data_withPE$Well.Name)

results <- data.frame()

for (i in 1:(length(wells)-1)) {
    for (j in (i+1):(length(wells))) {
        trainIndex <- data_withPE$Well.Name != wells[i] & data_withPE$Well.Name != wells[j]
        train <- data_withPE[trainIndex,]
        test <- data_withPE[!trainIndex,]
        
        fit <- train(Facies ~ ., data=train,
                    method="rf",
                    metric="Kappa")
        
        test$Predicted <- predict(fit, newdata=test)
        
        f1 <- myF1Metric(test$Predicted, test$Facies)
        
        results <- rbind(results, data.frame(testwell1=wells[i], testwell2=wells[j], f1=f1))
    }
}

print(format(results, digits=4))
print(paste("Average F1-score of:", round(mean(results$f1), 4)))


[1] "Average F1-score of:"
        testwell1       testwell2     f1
1       SHRIMPLIN         SHANKLE 0.5849
2       SHRIMPLIN        LUKE G U 0.5478
3       SHRIMPLIN  CROSS H CATTLE 0.4615
4       SHRIMPLIN           NOLAN 0.5079
5       SHRIMPLIN           NEWBY 0.5241
6       SHRIMPLIN CHURCHMAN BIBLE 0.5640
7         SHANKLE        LUKE G U 0.4736
8         SHANKLE  CROSS H CATTLE 0.4680
9         SHANKLE           NOLAN 0.5278
10        SHANKLE           NEWBY 0.5164
11        SHANKLE CHURCHMAN BIBLE 0.4841
12       LUKE G U  CROSS H CATTLE 0.5073
13       LUKE G U           NOLAN 0.5194
14       LUKE G U           NEWBY 0.4177
15       LUKE G U CHURCHMAN BIBLE 0.5040
16 CROSS H CATTLE           NOLAN 0.4007
17 CROSS H CATTLE           NEWBY 0.4149
18 CROSS H CATTLE CHURCHMAN BIBLE 0.3691
19          NOLAN           NEWBY 0.4704
20          NOLAN CHURCHMAN BIBLE 0.4481
21          NEWBY CHURCHMAN BIBLE 0.4529

In [ ]:
source("accuracyMetrics.R")

options(warn=-1)
suppressMessages(library(caret))

fitControl <- trainControl(## 10-fold CV
                   method = "repeatedcv",
                   number = 10,
                   ## repeated ten times
                   repeats = 3)

wells <- unique(data_withPE$Well.Name)

results <- data.frame()

for (i in 1:(length(wells)-1)) {
    for (j in (i+1):(length(wells))) {
        trainIndex <- data_withPE$Well.Name != wells[i] & data_withPE$Well.Name != wells[j]
        train <- data_withPE[trainIndex,]
        test <- data_withPE[!trainIndex,]
        
        fit <- train(Facies ~ ., data=train,
                    method="svmLinear",
                    metric="Kappa",
                    tuneLength=4)
        print(fit)
        
        test$Predicted <- predict(fit, newdata=test)
        
        f1 <- myF1Metric(test$Predicted, test$Facies)
        
        results <- rbind(results, data.frame(testwell1=wells[i], testwell2=wells[j], f1=f1))
    }
}

print(format(results, digits=4))
print(paste("Average F1-score of:", round(mean(results$f1), 4)))


Loading required package: kernlab

Attaching package: 'kernlab'

The following object is masked from 'package:ggplot2':

    alpha

Without PE Case


In [26]:
raw_withPE <- data_raw[complete.cases(data_raw) & data_raw$Well.Name != "Recruit F9",]
raw_withoutPE <- data_raw[data_raw$Well.Name != "Recruit F9",c("Facies", "Formation", "Well.Name", "Depth", "GR", 
                          "ILD_log10", "DeltaPHI", "PHIND", "isMarine", "RELPOS")]

Raw with PE case


In [27]:
source("accuracyMetrics.R")

options(warn=-1)
suppressMessages(library(caret))

fitControl <- trainControl(## 10-fold CV
                   method = "repeatedcv",
                   number = 10,
                   ## repeated ten times
                   repeats = 3)

wells <- unique(raw_withPE$Well.Name)

results <- data.frame()

for (i in 1:(length(wells)-1)) {
    for (j in (i+1):(length(wells))) {
        trainIndex <- raw_withPE$Well.Name != wells[i] & raw_withPE$Well.Name != wells[j]
        train <- raw_withPE[trainIndex,]
        test <- raw_withPE[!trainIndex,]
        
        fit <- train(Facies ~ ., data=train,
                    method="rf",
                    metric="Kappa")
        
        test$Predicted <- predict(fit, newdata=test)
        
        f1 <- myF1Metric(test$Predicted, test$Facies)
        
        results <- rbind(results, data.frame(testwell1=wells[i], testwell2=wells[j], f1=f1))
    }
}

print(format(results, digits=4))


        testwell1       testwell2     f1
1       SHRIMPLIN         SHANKLE 0.5971
2       SHRIMPLIN        LUKE G U 0.5011
3       SHRIMPLIN  CROSS H CATTLE 0.4781
4       SHRIMPLIN           NOLAN 0.5327
5       SHRIMPLIN           NEWBY 0.5421
6       SHRIMPLIN CHURCHMAN BIBLE 0.6000
7         SHANKLE        LUKE G U 0.4659
8         SHANKLE  CROSS H CATTLE 0.5391
9         SHANKLE           NOLAN 0.5498
10        SHANKLE           NEWBY 0.5492
11        SHANKLE CHURCHMAN BIBLE 0.5304
12       LUKE G U  CROSS H CATTLE 0.5218
13       LUKE G U           NOLAN 0.4954
14       LUKE G U           NEWBY 0.4423
15       LUKE G U CHURCHMAN BIBLE 0.4740
16 CROSS H CATTLE           NOLAN 0.4192
17 CROSS H CATTLE           NEWBY 0.3955
18 CROSS H CATTLE CHURCHMAN BIBLE 0.3845
19          NOLAN           NEWBY 0.4888
20          NOLAN CHURCHMAN BIBLE 0.4969
21          NEWBY CHURCHMAN BIBLE 0.4300

In [28]:
print(paste("Average F1-score of:", round(mean(results$f1), 4)))


[1] "Average F1-score of: 0.4969"

In [29]:
source("accuracyMetrics.R")

options(warn=-1)
suppressMessages(library(caret))

fitControl <- trainControl(## 10-fold CV
                   method = "repeatedcv",
                   number = 10,
                   ## repeated ten times
                   repeats = 10)

wells <- unique(raw_withPE$Well.Name)

results <- data.frame()

for (i in 1:(length(wells)-1)) {
    for (j in (i+1):(length(wells))) {
        trainIndex <- raw_withPE$Well.Name != wells[i] & raw_withPE$Well.Name != wells[j]
        train <- raw_withPE[trainIndex,]
        test <- raw_withPE[!trainIndex,]
        
        fit <- train(Facies ~ ., data=train,
                    method="rf",
                    metric="Kappa")
        
        test$Predicted <- predict(fit, newdata=test)
        
        f1 <- myF1Metric(test$Predicted, test$Facies)
        
        results <- rbind(results, data.frame(testwell1=wells[i], testwell2=wells[j], f1=f1))
    }
}

print(format(results, digits=4))
print(paste("Average F1-score of:", round(mean(results$f1), 4)))


        testwell1       testwell2     f1
1       SHRIMPLIN         SHANKLE 0.6044
2       SHRIMPLIN        LUKE G U 0.5033
3       SHRIMPLIN  CROSS H CATTLE 0.5711
4       SHRIMPLIN           NOLAN 0.5282
5       SHRIMPLIN           NEWBY 0.5398
6       SHRIMPLIN CHURCHMAN BIBLE 0.5502
7         SHANKLE        LUKE G U 0.4604
8         SHANKLE  CROSS H CATTLE 0.5491
9         SHANKLE           NOLAN 0.6122
10        SHANKLE           NEWBY 0.5556
11        SHANKLE CHURCHMAN BIBLE 0.5652
12       LUKE G U  CROSS H CATTLE 0.5301
13       LUKE G U           NOLAN 0.4943
14       LUKE G U           NEWBY 0.4267
15       LUKE G U CHURCHMAN BIBLE 0.4933
16 CROSS H CATTLE           NOLAN 0.4290
17 CROSS H CATTLE           NEWBY 0.4800
18 CROSS H CATTLE CHURCHMAN BIBLE 0.3757
19          NOLAN           NEWBY 0.4817
20          NOLAN CHURCHMAN BIBLE 0.5031
21          NEWBY CHURCHMAN BIBLE 0.4300
[1] "Average F1-score of: 0.5087"

Raw without PE case


In [ ]: