Nonmarine vs. Marine Analysis

Joshua Poirier, NEOS

2016 SEG Machine Learning Contest

1 Introduction

This notebook analyzes the relationship between the Nonmarine vs. Marine log indicator (NM_M in the dataset) and facies. The idea is that, for each facies, nearly all observations should belong to either the nonmarine class or the marine class. If this is the case, it may be prudent to develop two separate classifiers split on the Nonmarine vs. Marine indicator. This may help reduce misclassification (observations with a nonmarine indicator could only be classified as a nonmarine facies, and likewise for marine). Alternatively, the machine learning algorithm may have already "learned" this relationship - in which case the results will be equal or worse.

After loading the data, I will examine the distribution of the NM_M feature across each facies. I will then determine which facies may be considered nonmarine in origin and which marine. Finally, I will apply the Support Vector Machine algorithm to (A) the entire dataset (as in jpoirier001.ipynb) and (B) the nonmarine and marine observations separately, to assess any improvement in predictive accuracy. Here, accuracy is measured by the average F1 score across all classes on the Newby well, which is treated as a blind test and not used when training the models.
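Before diving in, here is a minimal sketch of that metric: a macro-averaged F1 score, assuming actual and predicted are factors sharing the same nine facies levels (the helper functions later in the notebook compute the same per-facies quantities). This is illustrative only, not the contest's official scorer.

# minimal sketch of the scoring metric: macro-averaged F1 across facies
macro_f1 <- function(actual, predicted) {
    cm <- table(Prediction = predicted, Reference = actual)
    precision <- diag(cm) / rowSums(cm)   # correct picks / all predictions of that facies
    recall    <- diag(cm) / colSums(cm)   # correct picks / all true members of that facies
    f1 <- 2 * precision * recall / (precision + recall)
    mean(f1, na.rm = TRUE)                # average F1 across classes
}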

First, let's load packages required to perform our analysis.


In [2]:
# visualization packages
library(repr)
library(ggplot2)
library(ggthemes)
library(cowplot)

# machine learning packages
library(caret)
library(e1071)

2 Exploring the dataset

First let's load the data and look at the first few rows. The data is contained in the file facies_vectors.csv and contains five wireline log measurements, two indicator variables, and a facies label at half-foot increments.


In [3]:
# load data
fname <- "../facies_vectors.csv"
data <- read.csv(fname, colClasses=c(rep("factor",3), rep("numeric",6), "factor", "numeric"))

# display the first few rows of the data set and its dimensions
head(data)
paste(dim(data)[1], "rows x", dim(data)[2], "columns")


Facies  Formation  Well.Name  Depth   GR     ILD_log10  DeltaPHI  PHIND   PE   NM_M  RELPOS
3       A1 SH      SHRIMPLIN  2793.0  77.45  0.664       9.9      11.915  4.6  1     1.000
3       A1 SH      SHRIMPLIN  2793.5  78.26  0.661      14.2      12.565  4.1  1     0.979
3       A1 SH      SHRIMPLIN  2794.0  79.05  0.658      14.8      13.050  3.6  1     0.957
3       A1 SH      SHRIMPLIN  2794.5  86.10  0.655      13.9      13.115  3.5  1     0.936
3       A1 SH      SHRIMPLIN  2795.0  74.58  0.647      13.5      13.300  3.4  1     0.915
3       A1 SH      SHRIMPLIN  2795.5  73.97  0.636      14.0      13.385  3.6  1     0.894
'4149 rows x 11 columns'

In [4]:
# hold out the Newby well as a blind test set; train on the remaining wells
blind <- data[data$Well.Name == 'NEWBY',]
data <- data[data$Well.Name != 'NEWBY',]

This data is from the Panoma Council Grove Field (a predominantly gas reservoir) covering over 2700 square miles in southwest Kansas. The dataset contains 4149 samples from nine wells. Each sample consists of seven predictor variables and a rock facies label. The validation (test) data consist of 830 samples from two wells with the same seven predictor variables. Facies are based on examination of cores from the nine wells, taken vertically at half-foot intervals. The predictor variables include five wireline log measurements and two geologic constraining variables that are derived from geologic knowledge and are sampled at the same half-foot rate.

The five wireline log measurements are:

  • GR - Gamma ray
  • ILD_log10 - Resistivity (log base 10)
  • PE - Photoelectric effect (some wells are missing this log)
  • DeltaPHI - Neutron-density porosity difference
  • PHIND - Average neutron-density porosity

The two geologic constraining variables are:

  • NM_M - nonmarine-marine indicator
  • RELPOS - Relative position

The nine discrete facies (classes of rock) are:

  1. Nonmarine sandstone
  2. Nonmarine coarse siltstone
  3. Nonmarine fine siltstone
  4. Marine siltstone and shale
  5. Mudstone (limestone)
  6. Wackestone (limestone)
  7. Dolomite
  8. Packstone-grainstone (limestone)
  9. Phylloid-algal bafflestone (limestone)

These facies are not discrete and can gradually blend into one another. Some have neighboring facies that are rather close. Mislabeling within these neighboring facies can be expected to occur. The following table lists the facies, their abbreviated labels, and their approximate neighbors.

Facies Label Adjacent Facies
1 SS 2
2 CSiS 1,3
3 FSiS 2
4 SiSh 5
5 MS 4,6
6 WS 5,7
7 D 6,8
8 PS 6,7,9
9 BS 7,8

Now let's define a colormap for the facies such that they are represented by consistent colors in this tutorial. We'll also take a peek at the statistical distribution of the input variables.


In [5]:
# 1=sandstone, 2=c_siltstone, 3=f_siltstone, 4=marine_silt_shale, 5=mudstone,
# 6=wackestone, 7=dolomite, 8=packstone, 9=bafflestone

facies_colors <- c('#F4D03F', '#F5B041', '#DC7633', '#6E2C00', '#1B4F72', '#2E86C1', '#AED6F1', '#A569BD', '#196F3D')
facies_labels <- c('SS', 'CSiS', 'FSiS', 'SiSh', 'MS', 'WS', 'D', 'PS', 'BS')

summary(data)


     Facies      Formation             Well.Name       Depth     
 2      :842   C LM   : 594   CROSS H CATTLE:501   Min.   :2574  
 3      :700   A1 LM  : 509   SHRIMPLIN     :471   1st Qu.:2806  
 8      :630   A1 SH  : 379   ALEXANDER D   :466   Median :2931  
 6      :486   C SH   : 351   LUKE G U      :461   Mean   :2903  
 1      :268   B5 LM  : 319   SHANKLE       :449   3rd Qu.:3008  
 5      :268   B1 SH  : 303   KIMZEY A      :439   Max.   :3138  
 (Other):492   (Other):1231   (Other)       :899                 
       GR           ILD_log10           DeltaPHI           PHIND      
 Min.   : 10.15   Min.   :-0.02595   Min.   :-21.832   Min.   : 0.55  
 1st Qu.: 45.57   1st Qu.: 0.48900   1st Qu.:  1.700   1st Qu.: 8.55  
 Median : 65.68   Median : 0.63500   Median :  4.369   Median :12.17  
 Mean   : 65.46   Mean   : 0.65765   Mean   :  4.519   Mean   :13.40  
 3rd Qu.: 80.20   3rd Qu.: 0.82900   3rd Qu.:  7.600   3rd Qu.:16.24  
 Max.   :361.15   Max.   : 1.80000   Max.   : 19.312   Max.   :84.40  
                                                                      
       PE        NM_M         RELPOS      
 Min.   :0.200   1:1820   Min.   :0.0000  
 1st Qu.:3.078   2:1866   1st Qu.:0.2740  
 Median :3.500            Median :0.5260  
 Mean   :3.717            Mean   :0.5211  
 3rd Qu.:4.371            3rd Qu.:0.7678  
 Max.   :8.094            Max.   :1.0000  
 NA's   :917                              

Looking at the statistical summary of the input variables, it can be seen that PE (photoelectric effect) is the only input with NA values listed (917 of them). For this tutorial, we will drop the feature vectors that do not have a valid PE entry.


In [6]:
# keep only complete rows (this drops the samples missing a PE value)
PE_mask <- complete.cases(data)
data <- data[PE_mask,]
paste(dim(data)[1], "rows x", dim(data)[2], "columns")


'2769 rows x 11 columns'

Out of the original 4149 samples, we will be training our model on 2769 samples. Now let's build some familiar log plots!


In [7]:
logplot <- function(x, incl_fac=TRUE, incl_pred=FALSE) {
    # GR gamma ray track
    g1 <- ggplot(x) + theme_economist_white(gray_bg=T) + 
        scale_y_continuous(lim=c(0,400), breaks=seq(0,400,100), labels=c("0"="0","100"="","200"="200","300"="","400"="400")) +
        scale_x_continuous(trans="reverse") + coord_flip() + labs(title="", x="Depth", y="GR") +
        geom_bar(stat="identity", data=x, aes(x=Depth, y=GR, fill=GR, alpha=0.5), width=0.5) +
        geom_line(aes(x=Depth, y=GR), lwd=.5, col='black') +
        scale_fill_continuous(limits=c(0,225), low="yellow", high="black") +
        theme(panel.grid.major.x = element_line(colour="gray", size=0.5), legend.position="none",
             axis.text=element_text(size=6), axis.title=element_text(size=8,face="bold"))
    g1 <- switch_axis_position(g1, 'x')
    
    # ILD resistivity track (transform it back to actual units)
    g2 <- ggplot(x) + theme_economist_white(gray_bg=T) +
        scale_y_log10(lim=c(0.1,50), breaks=c(.1,.2,.4,.6,.8,1,2,4,6,8,10,20,40), 
                      labels=c(".1"=".1",".2"="",".4"="",".6"="",".8"="",
                               "1"="1","2"="","4"="","6"="","8"="","10"="10",
                              "20"="","40"="")) + 
        scale_x_continuous(trans="reverse") + 
        coord_flip() + labs(title="", x="", y="ILD") +
        geom_line(aes(x=Depth, y=10^ILD_log10), lwd=.5, col="skyblue4") +
        theme(panel.grid.major.x = element_line(colour="gray", size=0.25), legend.position="none", 
              axis.text=element_text(size=6), axis.title=element_text(size=8,face="bold"), axis.text.y=element_blank())
    g2 <- switch_axis_position(g2, 'x')
    
    # DeltaPhi track
    g3 <- ggplot(x) + theme_economist_white(gray_bg=T) +
        scale_y_continuous(lim=c(-20,20), breaks=seq(-20,20,10),labels=c("-20"="-20","-10"="","0"="0","10"="","20"="20")) + 
        scale_x_continuous(trans="reverse") + coord_flip() + labs(title="", x="", y="DeltaPhi") +
        geom_line(aes(x=Depth, y=DeltaPHI), lwd=.5, col="seagreen4") +
        theme(panel.grid.major.x = element_line(colour="gray", size=0.25), legend.position="none", 
              axis.text=element_text(size=6), axis.title=element_text(size=8,face="bold"), axis.text.y=element_blank())
    g3 <- switch_axis_position(g3, 'x')
    
    # PHIND neutron porosity track
    g4 <- ggplot(x) + theme_economist_white(gray_bg=T) +
        scale_y_continuous(lim=c(0,50), breaks=c(0,15,30,45)) + scale_x_continuous(trans="reverse") +
        coord_flip() + labs(title="", x="", y="PHIND") +
        geom_line(aes(x=Depth, y=PHIND), lwd=.5, col="firebrick") +
        theme(panel.grid.major.x = element_line(colour="gray", size=0.25), legend.position="none", 
              axis.text=element_text(size=6), axis.title=element_text(size=8,face="bold"), axis.text.y=element_blank())
    g4 <- switch_axis_position(g4, 'x')
    
    # PE photoelectric effect track
    g5 <- ggplot(x) + theme_economist_white(gray_bg=T) +
        scale_y_continuous(lim=c(0,8), breaks=c(0,2,4,6,8)) + scale_x_continuous(trans="reverse") +
        coord_flip() + labs(title="", x="", y="PE") +
        geom_line(aes(x=Depth, y=PE), lwd=.5, col="black") +
        theme(panel.grid.major.x = element_line(colour="gray", size=0.25), legend.position="none", 
              axis.text=element_text(size=6), axis.title=element_text(size=8,face="bold"), axis.text.y=element_blank())
    g5 <- switch_axis_position(g5, 'x')
    
    x$ones <- rep(1, nrow(x))
    # build a facies track if we are to include
    if (incl_fac) {
        g6 <- ggplot(x) + theme_economist_white(gray_bg=T) +
            scale_y_continuous(lim=c(-0.1,1.1), breaks=c(0,1), labels=c("0"="", "1"="")) + scale_x_continuous(trans="reverse") +
            coord_flip() + labs(title="", x="", y="Facies") +
            geom_bar(stat="identity", data=x, aes(x=Depth, y=ones, fill=Facies), width=0.5) +
            scale_fill_manual(values=facies_colors, drop=F, labels=facies_labels) +
            theme(axis.title=element_text(size=8,face="bold"), axis.text.y=element_blank(), axis.text.x=element_text(size=6))
    }
    
    # build a prediction track if we are to include
    if (incl_pred) {
        # build Predicted Facies track
        g7 <- ggplot(x) + theme_economist_white(gray_bg=T) +
            scale_y_continuous(lim=c(-0.1,1.1), breaks=c(0,1), labels=c("0"="", "1"="")) + scale_x_continuous(trans="reverse") +
            coord_flip() + labs(title="", x="", y="Predicted") +
            geom_bar(stat="identity", data=x, aes(x=Depth, y=ones, fill=Predicted), width=0.5) +
            scale_fill_manual(values=facies_colors, drop=F, labels=facies_labels) +
            theme(legend.position="right", legend.text=element_text(size=6), legend.title=element_blank()) +
            theme(axis.title=element_text(size=8,face="bold"), axis.text.y=element_blank(), axis.text.x=element_text(size=6))
        g7 <- switch_axis_position(g7, 'x')
        
        # finish off Facies track with no legend if we are to include
        if (incl_fac) {
            g6 <- g6 + theme(legend.position="none")
            g6 <- switch_axis_position(g6, 'x')
            
            # bring all the tracks together as a grid
            g <- plot_grid(g1, g2, g3, g4, g5, g6, g7, ncol=7, rel_widths=c(4,3,3,3,3,2,5))
        }
        else {
            # bring all the tracks together as a grid
            g <- plot_grid(g1, g2, g3, g4, g5, g7, ncol=6, rel_widths=c(4,3,3,3,3,5))
        }
        
        ggdraw() + draw_plot(g, width=1, height=1) + draw_plot_label(x$Well.Name[1], size=10)
    }
    else {
        if (incl_fac) {
            # finish off Facies track with a legend
            g6 <- g6 + theme(legend.position="right", legend.text=element_text(size=6), legend.title=element_blank())
            g6 <- switch_axis_position(g6, 'x')
            
            # bring all the tracks together as a grid
            g <- plot_grid(g1, g2, g3, g4, g5, g6, ncol=6, rel_widths=c(4,3,3,3,3,6))
        }
        else {
            # bring all the tracks together as a grid
            g <- plot_grid(g1, g2, g3, g4, g5, ncol=5, rel_widths=c(4,3,3,3,3))
        }
        
        ggdraw() + draw_plot(g, width=1, height=1) + draw_plot_label(x$Well.Name[1], size=10)    
    }
}

In [8]:
options(repr.plot.width=8, repr.plot.height=5)

# plot logs for the Shrimplin and Shankle wells
logplot(data[data$Well.Name == "SHRIMPLIN",])
logplot(data[data$Well.Name == "SHANKLE",])


3 Too marine or not too marine?

Now let's examine the distribution of facies for both observations featuring a nonmarine indicator as well as a marine indicator. To do this, let's build a histogram faceted on the NM_M variable.


In [9]:
options(repr.plot.width=8, repr.plot.height=4)

# modify the NM_M factors to be a string - more descriptive and plots nicer
levels(data$NM_M)[levels(data$NM_M)=="1"] <- "Nonmarine"
levels(data$NM_M)[levels(data$NM_M)=="2"]   <- "Marine"

# build histogram faceted on the NM_M (nonmarine vs marine) feature
g <- ggplot(data, aes(x=Facies)) + theme_economist_white(gray_bg=T) + 
        facet_grid(. ~ NM_M) +
        geom_bar(aes(x=Facies, fill=Facies)) + labs(title="Distribution of Facies", x="", y="") +
        scale_x_discrete(labels=facies_labels) +
        scale_fill_manual(values=facies_colors, drop=F, labels=facies_labels) +
        theme(legend.position="none", legend.title=element_blank(), legend.text=element_text(size=6),
             axis.text=element_text(size=6), plot.title=element_text(size=10), axis.title=element_blank(),
             axis.ticks.x=element_blank())

g


That's a pretty interesting visual. The SS, CSiS, and FSiS facies appear to be nonmarine, while the remaining facies appear to be marine. There do seem to be a handful of cross-classifications - let's build a table to quantify the percentage of observations not following this trend.


In [10]:
# modify the factor levels for facies to be more descriptive
levels(data$Facies)[levels(data$Facies)=="1"] <- "SS"
levels(data$Facies)[levels(data$Facies)=="2"]   <- "CSiS"
levels(data$Facies)[levels(data$Facies)=="3"]   <- "FSiS"
levels(data$Facies)[levels(data$Facies)=="4"]   <- "SiSh"
levels(data$Facies)[levels(data$Facies)=="5"]   <- "MS"
levels(data$Facies)[levels(data$Facies)=="6"]   <- "WS"
levels(data$Facies)[levels(data$Facies)=="7"]   <- "D"
levels(data$Facies)[levels(data$Facies)=="8"]   <- "PS"
levels(data$Facies)[levels(data$Facies)=="9"]   <- "BS"

# count observations of facies which are nonmarine or marine
t <- table(data$Facies, data$NM_M == "Marine")

# calculate those counts as percentages
nm_percent <- round(100 * t[,1] / (t[,1] + t[,2]),0)
m_percent <- round(100 * t[,2] / (t[,1] + t[,2]),0)
t <- as.table(cbind(t[,1], t[,2], nm_percent, m_percent))

# format the table and output
dimnames(t)[[2]] <- c('# Nonmarine', '# Marine', '% Nonmarine', '% Marine')
t


     # Nonmarine # Marine % Nonmarine % Marine
SS           259        0         100        0
CSiS         634        6          99        1
FSiS         519       16          97        3
SiSh           5      121           4       96
MS            11      178           6       94
WS             2      364           1       99
D              1       81           1       99
PS            12      430           3       97
BS             0      130           0      100

MS - Mudstone is the facies with the most uncertainty here. Of the 189 observations of the MS facies, 11 (6%) were considered nonmarine. It's probably safe to call SS, CSiS, and FSiS all nonmarine, and the remaining facies marine. How many of the observations overall do not fall into this ordering of the data?


In [11]:
# sum up how many correctly/incorrectly fall under our "rule"
nm_m_true <- t[1,1] + t[2,1] + t[3,1] + t[4,2] + t[5,2] + t[6,2] + t[7,2] + t[8,2] + t[9,2]
nm_m_false <- t[1,2] + t[2,2] + t[3,2] + t[4,1] + t[5,1] + t[6,1] + t[7,1] + t[8,1] + t[9,1]

# percentage of observations not falling under our "rule"
paste(round(100 * nm_m_false / (nm_m_true + nm_m_false),2), "%")


'1.91 %'
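
For reference, the same percentage can be computed directly from the data frame without building the table first; a quick sketch (using the descriptive factor labels defined above):

# fraction of observations whose facies group disagrees with the NM_M indicator
rule_nonmarine <- data$Facies %in% c("SS", "CSiS", "FSiS")
mean(rule_nonmarine != (data$NM_M == "Nonmarine"))   # ~0.019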

Less than 2% of observations fail our rule. Let's move forward to applying some Support Vector Machines to our data!

4 A support vector machine classifier

As in the previous notebook jpoirier001.ipynb, let's build a Support Vector Machine classifier. First, we'll build a classifier and tune the results. Second, we'll build two classifiers (one for nonmarine indicators, one for marine).

4.1 Support vector machine with all data

This workflow will be very similar to that of jpoirier001.ipynb. Ultimately we expect the same results.


In [12]:
## HELPER FUNCTIONS
## ################################################################################################

# list of adjacent facies
adjacent_facies <- list(as.list(c(2)), as.list(c(1,3)), 
                        as.list(c(2)), as.list(c(5)), 
                        as.list(c(4,6)), as.list(c(5,7,8)), 
                        as.list(c(6,8)), as.list(c(6,7,9)), 
                        as.list(c(7,8)))

# function to calculate the confusion matrix which includes adjacent facies as correctly classified
adj_cm_table <- function(cm_table) {    
    adj_cm_table <- cm_table
    
    # loop through facies to build adjacent facies confusion matrix
    for (i in 1:9) {
        cor <- cm_table[i,i]

        # move adjacently correct facies into correct facies
        for (j in 1:length(adjacent_facies[[i]])) {
            cor <- cor + cm_table[adjacent_facies[[i]][[j]],i]
            adj_cm_table[adjacent_facies[[i]][[j]],i] <- 0
        }
        adj_cm_table[i,i] <- cor
    }
    
    # return adjacently corrected confusion matrix
    adj_cm_table
}

# function to display a confusion matrix
#   replaces the integer facies with facies shortnames first
disp_cm <- function(cm) {
    dimnames(cm)$Prediction <- facies_labels
    dimnames(cm)$Reference <- facies_labels
    print(cm)
}

# cm - confusion matrix either as a confusionMatrix object or table object
# x - the data used, from this we calculate how many observations (support) for each facies
accuracy_metrics <- function(cm, x) {
    # if given confusion matrix (cm) is a confusionMatrix object
    if (class(cm) == "confusionMatrix") {
        df <- data.frame("Facies" = facies_labels,
                        "Precision" = cm[["byClass"]][,5],
                        "Recall" = cm[["byClass"]][,6],
                        "F1" = cm[["byClass"]][,7],
                        "Support" = as.matrix(table(x$Facies)))
        df[,-1] <- round(df[,-1],2)
        rownames(df) <- NULL
    }
    # if given confusion matrix is a table object
    else if (class(cm) == "table") {
       # initialize vectors for precision, recall, and f1 metrics with zeros
        prec <- rep(0,9)
        recall <- rep(0,9)
        f1 <- rep(0,9)

        # loop through facies to compute precision, recall, and f1 for each facies
        beta <- 1
        for (i in 1:9) {
            prec[i] <- cm[i,i] / sum(cm[i,])
            recall[i] <- cm[i,i] / sum(cm[,i])
            f1[i] <- (1 + beta^2) * prec[i] * recall[i] / ((beta^2 * prec[i]) + recall[i])
        }
        
        # calculate model metrics for precision, recall, and f1 and output
        df <- data.frame(Facies=facies_labels, 
                         Precision=prec, 
                         Recall=recall, 
                         F1=f1, 
                         Support=as.matrix(table(x$Facies)))

        # round values to two digits
        df[,-1] <- round(df[,-1],2)
    }
    
    # average F1 score across all classes
    print(paste0("Overall F1-score of: ", round(mean(df$F1, na.rm=T),2)))
    print("Accuracy metrics:")
    df
}
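
To make the adjacency correction concrete, here is a toy sketch with made-up counts (purely illustrative, not part of the analysis): a confusion matrix where every facies has 10 correct predictions and 5 SS samples were predicted as the adjacent CSiS facies.

# toy 9x9 confusion matrix - illustrative numbers only
toy <- matrix(0, nrow = 9, ncol = 9,
              dimnames = list(Prediction = facies_labels, Reference = facies_labels))
diag(toy) <- 10
toy["CSiS", "SS"] <- 5          # 5 SS observations predicted as CSiS (adjacent)

adj_cm_table(toy)["SS", "SS"]   # 15 - the adjacent misclassifications now count as correct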

In [13]:
set.seed(1234)

# split into training and test data sets
feature_vectors <- data[, c("Facies", "GR", "ILD_log10", "DeltaPHI", "PHIND", "PE", "NM_M", "RELPOS")]
trainIndex <- createDataPartition(feature_vectors$Facies, p=.8, list=F, times=1)
x_train <- feature_vectors[trainIndex,]
x_test <- feature_vectors[-trainIndex,]

set.seed(3124)

# calculates models for a variety of hyperparameters - this will take a few minutes
tune.out <- tune(svm, Facies ~ ., data=x_train, 
                 kernel="radial",
                 ranges=list(cost=c(.01,1,5,10,20,50,100,1000,5000,10000),
                            gamma=c(.0001,.001,.01,1,10)))

# predict facies using the best model
cv_predictions <- predict(tune.out$best.model, newdata=x_test)

## PART ONE: Confusion matrix for facies classification
"Facies classification confusion matrix:"
cv_cm <- confusionMatrix(cv_predictions, x_test$Facies)
disp_cm(as.matrix(cv_cm[["table"]]))

## PART TWO: Confusion matrix for adjacent facies classification
"Adjacent facies classification confusion matrix:"
cv_adj <- adj_cm_table(as.matrix(cv_cm[["table"]]))
disp_cm(cv_adj)


'Facies classification confusion matrix:'
          Reference
Prediction SS CSiS FSiS SiSh MS WS  D PS BS
      SS   40   11    0    0  0  0  0  0  0
      CSiS 10   95   18    0  1  1  1  0  0
      FSiS  1   22   87    1  0  0  0  5  0
      SiSh  0    0    2   17  1  5  0  0  0
      MS    0    0    0    2 24  3  0  1  0
      WS    0    0    0    2  6 59  2 10  0
      D     0    0    0    1  0  0  9  2  0
      PS    0    0    0    2  5  5  2 70  0
      BS    0    0    0    0  0  0  2  0 26
'Adjacent facies classification confusion matrix:'
          Reference
Prediction  SS CSiS FSiS SiSh  MS  WS   D  PS  BS
      SS    50    0    0    0   0   0   0   0   0
      CSiS   0  128    0    0   1   1   1   0   0
      FSiS   1    0  105    1   0   0   0   5   0
      SiSh   0    0    2   19   0   5   0   0   0
      MS     0    0    0    0  31   0   0   1   0
      WS     0    0    0    2   0  67   0   0   0
      D      0    0    0    1   0   0  13   0   0
      PS     0    0    0    2   5   0   0  82   0
      BS     0    0    0    0   0   0   2   0  26

Now let's apply the model to our blind data set and output the confusion matrices and accuracy metrics!


In [14]:
# modify the factor levels for facies to be more descriptive
levels(blind$Facies)[levels(blind$Facies)=="1"] <- "SS"
levels(blind$Facies)[levels(blind$Facies)=="2"]   <- "CSiS"
levels(blind$Facies)[levels(blind$Facies)=="3"]   <- "FSiS"
levels(blind$Facies)[levels(blind$Facies)=="4"]   <- "SiSh"
levels(blind$Facies)[levels(blind$Facies)=="5"]   <- "MS"
levels(blind$Facies)[levels(blind$Facies)=="6"]   <- "WS"
levels(blind$Facies)[levels(blind$Facies)=="7"]   <- "D"
levels(blind$Facies)[levels(blind$Facies)=="8"]   <- "PS"
levels(blind$Facies)[levels(blind$Facies)=="9"]   <- "BS"

"PART ONE: Facies classification"
blind_predictions <- predict(tune.out$best.model, newdata=blind[,c(-2,-3,-4)])
blind_cm <- confusionMatrix(blind_predictions, blind$Facies)
disp_cm(as.matrix(blind_cm[["table"]]))
accuracy_metrics(blind_cm, blind)

"PART TWO: Adjacent facies classification"
"Facies classification accuracy metrics"
blind_adj <- adj_cm_table(as.matrix(blind_cm[["table"]]))
disp_cm(blind_adj)
accuracy_metrics(cv_cm, x_test)


'PART ONE: Facies classification'
          Reference
Prediction SS CSiS FSiS SiSh MS WS  D PS BS
      SS    0   20    4    0  0  0  0  0  0
      CSiS  0   49   29    0  1  1  0  1  0
      FSiS  0   27   43    4  1  4  0  6  0
      SiSh  0    0    0   42  4  5  0  0  2
      MS    0    0    0    2  1 32  0  5  0
      WS    0    0    0    5  9 10  0  4  4
      D     0    0    0    0  0  0  0  2  0
      PS    0    2    4    5 12 42 16 38 25
      BS    0    0    0    0  0  2  0  0  0
[1] "Overall F1-score of: 0.4"
[1] "Accuracy metrics:"
Facies  Precision  Recall  F1    Support
SS      0.00       NA      NA    0
CSiS    0.60       0.50    0.55  98
FSiS    0.51       0.54    0.52  80
SiSh    0.79       0.72    0.76  58
MS      0.02       0.04    0.03  28
WS      0.31       0.10    0.16  96
D       0.00       0.00    NaN   16
PS      0.26       0.68    0.38  56
BS      0.00       0.00    NaN   31
'PART TWO: Adjacent facies classification'
'Facies classification accuracy metrics'
          Reference
Prediction SS CSiS FSiS SiSh MS WS  D PS BS
      SS    0    0    4    0  0  0  0  0  0
      CSiS  0   96    0    0  1  1  0  1  0
      FSiS  0    0   72    4  1  4  0  6  0
      SiSh  0    0    0   44  0  5  0  0  2
      MS    0    0    0    0 14  0  0  5  0
      WS    0    0    0    5  0 84  0  0  4
      D     0    0    0    0  0  0 16  0  0
      PS    0    2    4    5 12  0  0 44  0
      BS    0    0    0    0  0  2  0  0 25
[1] "Overall F1-score of: 0.77"
[1] "Accuracy metrics:"
Facies  Precision  Recall  F1    Support
SS      0.78       0.78    0.78  51
CSiS    0.75       0.74    0.75  128
FSiS    0.75       0.81    0.78  107
SiSh    0.68       0.68    0.68  25
MS      0.80       0.65    0.72  37
WS      0.75       0.81    0.78  73
D       0.75       0.56    0.64  16
PS      0.83       0.80    0.81  88
BS      0.93       1.00    0.96  26

These results are consistent with what we observed in jpoirier001.ipynb - an overall F1 score of 0.4 and 0.77 for Facies and Adjacent Facies classification respectively.

4.2.1 Support vector machine with nonmarine data

Now let's subset the training data to include only observations with a nonmarine indicator, and build a support vector machine model from that subset. We will perform cross-validation again to see whether the optimal model parameters differ for this subset of the data. We'll hold off on the adjacent facies classification accuracy analysis until we bring these classifications back together with the marine classifications.


In [15]:
# subset data - drop "marine" levels and also no longer train using NM_M channel since we only have one value
nm_feature_vectors <- droplevels(feature_vectors[feature_vectors$NM_M == "Nonmarine" &
                                      feature_vectors$Facies %in% c("SS", "CSiS", "FSiS"),-7])

# split into training and cross-validation
set.seed(3124)
nm_trainIndex <- createDataPartition(nm_feature_vectors$Facies, p=.8, list=F, times=1)
nm_x_train <- nm_feature_vectors[nm_trainIndex,]
nm_x_test <- nm_feature_vectors[-nm_trainIndex,]

# calculates models for a variety of hyperparameters - this will take a few minutes
nm_tune.out <- tune(svm, Facies ~ ., data=nm_x_train, 
                 kernel="radial",
                 ranges=list(cost=c(.01,1,5,10,20,50,100,1000,5000,10000),
                            gamma=c(.0001,.001,.01,1,10)))
summary(nm_tune.out)

# predict facies using the best model
nm_cv_predictions <- predict(nm_tune.out$best.model, newdata=nm_x_test)

## PART ONE: Confusion matrix for facies classification
"Facies classification confusion matrix:"
nm_cv_cm <- confusionMatrix(nm_cv_predictions, nm_x_test$Facies)
print(as.matrix(nm_cv_cm[["table"]]))


Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost gamma
    5     1

- best performance: 0.1908166 

- Detailed performance results:
    cost gamma     error dispersion
1  1e-02 1e-04 0.5512343 0.03163411
2  1e+00 1e-04 0.5512343 0.03163411
3  5e+00 1e-04 0.5309036 0.02981048
4  1e+01 1e-04 0.5017233 0.04100947
5  2e+01 1e-04 0.4513973 0.04280660
6  5e+01 1e-04 0.3895668 0.04841838
7  1e+02 1e-04 0.3922683 0.04295195
8  1e+03 1e-04 0.3453656 0.03622960
9  5e+03 1e-04 0.3480515 0.03436988
10 1e+04 1e-04 0.3515991 0.03959831
11 1e-02 1e-03 0.5512343 0.03163411
12 1e+00 1e-03 0.5026083 0.03988109
13 5e+00 1e-03 0.3895591 0.04593555
14 1e+01 1e-03 0.3887285 0.04055643
15 2e+01 1e-03 0.3816410 0.04816919
16 5e+01 1e-03 0.3533458 0.03854425
17 1e+02 1e-03 0.3506986 0.03504697
18 1e+03 1e-03 0.3374631 0.03777764
19 5e+03 1e-03 0.3242121 0.03736446
20 1e+04 1e-03 0.3250893 0.03928614
21 1e-02 1e-02 0.5512343 0.03163411
22 1e+00 1e-02 0.3692750 0.03898310
23 5e+00 1e-02 0.3418724 0.03865736
24 1e+01 1e-02 0.3445117 0.03786381
25 2e+01 1e-02 0.3374554 0.03431208
26 5e+01 1e-02 0.3233271 0.04113683
27 1e+02 1e-02 0.3206645 0.03950213
28 1e+03 1e-02 0.2870828 0.02447096
29 5e+03 1e-02 0.2747011 0.02892338
30 1e+04 1e-02 0.2720618 0.02460487
31 1e-02 1e+00 0.5512343 0.03163411
32 1e+00 1e+00 0.2084226 0.04889297
33 5e+00 1e+00 0.1908166 0.03394943
34 1e+01 1e+00 0.1969880 0.03812521
35 2e+01 1e+00 0.2040677 0.04054369
36 5e+01 1e+00 0.2084847 0.04458786
37 1e+02 1e+00 0.2208508 0.03885730
38 1e+03 1e+00 0.2217358 0.04092815
39 5e+03 1e+00 0.2217358 0.04092815
40 1e+04 1e+00 0.2217358 0.04092815
41 1e-02 1e+01 0.5512343 0.03163411
42 1e+00 1e+01 0.3612405 0.03704766
43 5e+00 1e+01 0.3373544 0.04673191
44 1e+01 1e+01 0.3373544 0.04673191
45 2e+01 1e+01 0.3373544 0.04673191
46 5e+01 1e+01 0.3373544 0.04673191
47 1e+02 1e+01 0.3373544 0.04673191
48 1e+03 1e+01 0.3373544 0.04673191
49 5e+03 1e+01 0.3373544 0.04673191
50 1e+04 1e+01 0.3373544 0.04673191
'Facies classification confusion matrix:'
          Reference
Prediction  SS CSiS FSiS
      SS    35    4    0
      CSiS  15  104   16
      FSiS   1   18   87

Recall that in jpoirier001.ipynb the cross-validation showed best parameters of cost 10 and gamma 1. Here we observe best parameters of cost 5 and gamma 1 (although the error levels for the two are very close). It is noteworthy that the best cross-validation error has improved from 0.2475297 in jpoirier001.ipynb to 0.1908166 for the nonmarine classifier. What about marine?
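
If you'd rather pull these numbers out programmatically than read through the summary, the tune object exposes them directly; a quick sketch:

# best hyperparameters and the corresponding cross-validation error
nm_tune.out$best.parameters    # cost = 5, gamma = 1 (per the summary above)
nm_tune.out$best.performance   # 0.1908166
# the full grid, sorted from lowest to highest cross-validation error
head(nm_tune.out$performances[order(nm_tune.out$performances$error), ])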

4.2.2 Support vector machine with marine data

Now let's subset the training data to only include observations with a marine indicator. From this, I will build a support vector machine model. We will perform cross-validation again to see if the optimal model parameters are different for this subset of data.


In [16]:
# subset data - drop "nonmarine" levels and also no longer train using NM_M channel since we only have one value
m_feature_vectors <- droplevels(feature_vectors[feature_vectors$NM_M == "Marine" &
                                      feature_vectors$Facies %in% c("SiSh", "MS", "WS", "D", "PS", "BS"),-7])

# split into training and cross-validation
set.seed(3124)
m_trainIndex <- createDataPartition(m_feature_vectors$Facies, p=.8, list=F, times=1)
m_x_train <- m_feature_vectors[m_trainIndex,]
m_x_test <- m_feature_vectors[-m_trainIndex,]

# calculates models for a variety of hyperparameters - this will take a few minutes
m_tune.out <- tune(svm, Facies ~ ., data=m_x_train, 
                 kernel="radial",
                 ranges=list(cost=c(.01,1,5,10,20,50,100,1000,5000,10000),
                            gamma=c(.0001,.001,.01,1,10)))
summary(m_tune.out)

# predict facies using the best model
m_cv_predictions <- predict(m_tune.out$best.model, newdata=m_x_test)

## PART ONE: Confusion matrix for facies classification
"Facies classification confusion matrix:"
m_cv_cm <- confusionMatrix(m_cv_predictions, m_x_test$Facies)
print(as.matrix(m_cv_cm[["table"]]))


Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost gamma
    5     1

- best performance: 0.2506868 

- Detailed performance results:
    cost gamma     error dispersion
1  1e-02 1e-04 0.6708242 0.03281789
2  1e+00 1e-04 0.6708242 0.03281789
3  5e+00 1e-04 0.6708242 0.03281789
4  1e+01 1e-04 0.6315476 0.05430612
5  2e+01 1e-04 0.5453938 0.05752132
6  5e+01 1e-04 0.5367582 0.06444069
7  1e+02 1e-04 0.4898810 0.06566294
8  1e+03 1e-04 0.4659707 0.06404152
9  5e+03 1e-04 0.4611905 0.06686358
10 1e+04 1e-04 0.4621520 0.06554248
11 1e-02 1e-03 0.6708242 0.03281789
12 1e+00 1e-03 0.6334615 0.05409037
13 5e+00 1e-03 0.5367674 0.06357036
14 1e+01 1e-03 0.4908333 0.06563223
15 2e+01 1e-03 0.4755586 0.06708216
16 5e+01 1e-03 0.4678938 0.06955061
17 1e+02 1e-03 0.4621154 0.06885379
18 1e+03 1e-03 0.4563645 0.05971992
19 5e+03 1e-03 0.4458425 0.05120021
20 1e+04 1e-03 0.4525641 0.05573591
21 1e-02 1e-02 0.6708242 0.03281789
22 1e+00 1e-02 0.4841392 0.06235293
23 5e+00 1e-02 0.4602289 0.06401126
24 1e+01 1e-02 0.4583150 0.05748891
25 2e+01 1e-02 0.4487546 0.06197430
26 5e+01 1e-02 0.4439469 0.05503038
27 1e+02 1e-02 0.4372527 0.05466984
28 1e+03 1e-02 0.3885073 0.05481514
29 5e+03 1e-02 0.3473810 0.05054735
30 1e+04 1e-02 0.3367949 0.04861741
31 1e-02 1e+00 0.6708242 0.03281789
32 1e+00 1e+00 0.2775366 0.05516705
33 5e+00 1e+00 0.2506868 0.03271017
34 1e+01 1e+00 0.2583059 0.03280258
35 2e+01 1e+00 0.2602656 0.02811034
36 5e+01 1e+00 0.2650275 0.03352356
37 1e+02 1e+00 0.2726923 0.03482104
38 1e+03 1e+00 0.2784432 0.03288058
39 5e+03 1e+00 0.2784432 0.03288058
40 1e+04 1e+00 0.2784432 0.03288058
41 1e-02 1e+01 0.6708242 0.03281789
42 1e+00 1e+01 0.4258425 0.02893821
43 5e+00 1e+01 0.3837821 0.03667919
44 1e+01 1e+01 0.3837821 0.03667919
45 2e+01 1e+01 0.3837821 0.03667919
46 5e+01 1e+01 0.3837821 0.03667919
47 1e+02 1e+01 0.3837821 0.03667919
48 1e+03 1e+01 0.3837821 0.03667919
49 5e+03 1e+01 0.3837821 0.03667919
50 1e+04 1e+01 0.3837821 0.03667919
'Facies classification confusion matrix:'
          Reference
Prediction SiSh MS WS  D PS BS
      SiSh   19  5  3  0  0  0
      MS      0 14  4  0  3  0
      WS      0  6 48  0 12  0
      D       0  1  1 11  2  1
      PS      5  9 14  4 68  0
      BS      0  0  2  1  1 25

For the marine facies, the best cross-validation error has actually gone up to 0.2506868 (versus 0.2475297 for the all-data model in jpoirier001.ipynb). Realistically, repeating the cross-validation would give us a more stable picture of these error levels (see the sketch below). But for now, let's see what happens when we bring it all together!
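
As a sketch of what repeating the cross-validation could look like (not run here; the reduced grid is illustrative only), e1071's tune.control can repeat the 10-fold procedure and aggregate the errors:

# repeat 10-fold cross-validation 5 times and average the error estimates
ctrl <- tune.control(sampling = "cross", cross = 10, nrepeat = 5)
m_tune_rep <- tune(svm, Facies ~ ., data = m_x_train,
                   kernel = "radial",
                   ranges = list(cost = c(1, 5, 10, 20), gamma = c(0.1, 1)),
                   tunecontrol = ctrl)
summary(m_tune_rep)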

4.2.3 Bringing nonmarine and marine classifications together

Now let's apply the cross-validated models to the blind well data (Newby) and evaluate the results against our initial model built using all data at once.


In [65]:
# modify the NM_M factors to be a string - more descriptive and plots nicer
levels(blind$NM_M)[levels(blind$NM_M)=="1"] <- "Nonmarine"
levels(blind$NM_M)[levels(blind$NM_M)=="2"]   <- "Marine"

# subset data - drop "nonmarine" levels and also no longer train using NM_M channel since we only have one value
nm_blind <- blind[blind$NM_M == "Nonmarine",]
m_blind <- blind[blind$NM_M == "Marine",]

# predict facies using the best models
nm_blind_predictions <- predict(nm_tune.out$best.model, newdata=nm_blind)
m_blind_predictions <- predict(m_tune.out$best.model, newdata=m_blind)

# combine the confusion matrices for the nonmarine and marine predictions:
# start from an empty 9x9 matrix over the full facies labels, then fill the
# nonmarine (SS-FSiS) and marine (SiSh-BS) blocks from the two models
nm_blind_cm <- as.matrix(confusionMatrix(nm_blind_predictions, nm_blind$Facies)[["table"]])
m_blind_cm <- as.matrix(confusionMatrix(m_blind_predictions, m_blind$Facies)[["table"]])
nmm_blind_cm <- cv_adj
nmm_blind_cm[1:9, 1:9] <- 0
nmm_blind_cm[1:3, 1:3] <- nm_blind_cm[1:3, 1:3]
nmm_blind_cm[4:9, 4:9] <- m_blind_cm[4:9, 4:9]

"PART ONE: Facies classification"
"Facies classification accuracy metrics"
nmm_blind_cm
accuracy_metrics(nmm_blind_cm, blind)

"PART TWO: Adjacent facies classification"
"Facies classification accuracy metrics"
nmm_blind_adj <- adj_cm_table(nmm_blind_cm)
disp_cm(nmm_blind_adj)
accuracy_metrics(nmm_blind_adj, blind)


Warning message in confusionMatrix.default(nm_blind_predictions, nm_blind$Facies):
"Levels are not in the same order for reference and data. Refactoring data to match."
Warning message in levels(reference) != levels(data):
"longer object length is not a multiple of shorter object length"
Warning message in confusionMatrix.default(m_blind_predictions, m_blind$Facies):
"Levels are not in the same order for reference and data. Refactoring data to match."
'PART ONE: Facies classification'
'Facies classification accuracy metrics'
          Reference
Prediction SS CSiS FSiS SiSh MS WS  D PS BS
      SS    0   23    5    0  0  0  0  0  0
      CSiS  0   49   29    0  0  0  0  0  0
      FSiS  0   26   46    0  0  0  0  0  0
      SiSh  0    0    0   29  2  5  0  0  1
      MS    0    0    0    2  1 33  0  2  0
      WS    0    0    0   16 10 11  0  7  3
      D     0    0    0    5  0  0  0  1  0
      PS    0    0    0    6 14 45 16 46 27
      BS    0    0    0    0  1  2  0  0  0
[1] "Overall F1-score of: 0.4"
[1] "Accuracy metrics:"
Facies  Precision  Recall  F1    Support
SS      0.00       NaN     NaN   0
CSiS    0.63       0.50    0.56  98
FSiS    0.64       0.57    0.61  80
SiSh    0.78       0.50    0.61  58
MS      0.03       0.04    0.03  28
WS      0.23       0.11    0.15  96
D       0.00       0.00    NaN   16
PS      0.30       0.82    0.44  56
BS      0.00       0.00    NaN   31
'PART TWO: Adjacent facies classification'
'Facies classification accuracy metrics'
          Reference
Prediction SS CSiS FSiS SiSh MS WS  D PS BS
      SS    0    0    5    0  0  0  0  0  0
      CSiS  0   98    0    0  0  0  0  0  0
      FSiS  0    0   75    0  0  0  0  0  0
      SiSh  0    0    0   31  0  5  0  0  1
      MS    0    0    0    0 13  0  0  2  0
      WS    0    0    0   16  0 89  0  0  3
      D     0    0    0    5  0  0 16  0  0
      PS    0    0    0    6 14  0  0 54  0
      BS    0    0    0    0  1  2  0  0 27
[1] "Overall F1-score of: 0.83"
[1] "Accuracy metrics:"
Facies  Precision  Recall  F1    Support
SS      0.00       NaN     NaN   0
CSiS    1.00       1.00    1.00  98
FSiS    1.00       0.94    0.97  80
SiSh    0.84       0.53    0.65  58
MS      0.87       0.46    0.60  28
WS      0.82       0.93    0.87  96
D       0.76       1.00    0.86  16
PS      0.73       0.96    0.83  56
BS      0.90       0.87    0.89  31

5 Conclusions

While the overall F1-score for adjacent facies classification increased from 0.77 to 0.83, the score for the facies classification problem itself stayed the same at 0.4. The SVM algorithm likely identified this strong relationship between the nonmarine/marine indicator and facies on its own, so there was little to gain by enforcing the split manually. So let's try something else next.