Let's take a different approach from traditional machine learning algorithms - something simple. For each test observation, I will cross-correlate its log section (the observation plus the surrounding observations) against every log section in the training data set. The training section with the highest correlation (averaged across all logs) gets to assign its facies to the test observation.
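To make the idea concrete, here is a toy sketch: slide a short test section along a longer train log, compute cor() at each position, and keep the best match. The vectors here are synthetic, invented purely for illustration.

# toy sketch: slide a 10-sample test section along a 100-sample train log
gr_test  <- sin(seq(0, 3, length.out = 10))     # synthetic test section
gr_train <- sin(seq(0, 30, length.out = 100))   # synthetic train log
l <- length(gr_test)
cors <- sapply(1:(length(gr_train) - l + 1),
               function(k) cor(gr_test, gr_train[k:(k + l - 1)]))
which.max(cors)  # position of the best-matching train section
max(cors)        # its correlation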
Let's load the entire training data set and perform some quick pre-processing (turning the non-marine/marine feature into a boolean). Let's also center and scale the data - but I'll do it on a well-by-well basis to correct for any instrument/environmental bias.
In [5]:
source("loadData.R")
# load and clean the data
data <- loadData()
data <- cleanData(data)
dataPrime <- data.frame()
wells <- unique(data$Well.Name)
for (well_i in wells) {
data_i <- data[data$Well.Name == well_i,]
data_i$GR <- (data_i$GR - mean(data_i$GR, na.rm=T)) / sd(data_i$GR, na.rm=T)
data_i$ILD_log10 <- (data_i$ILD_log10 - mean(data_i$ILD_log10, na.rm=T)) / sd(data_i$ILD_log10, na.rm=T)
data_i$DeltaPHI <- (data_i$DeltaPHI - mean(data_i$DeltaPHI, na.rm=T)) / sd(data_i$DeltaPHI, na.rm=T)
data_i$PHIND <- (data_i$PHIND - mean(data_i$PHIND, na.rm=T)) / sd(data_i$PHIND, na.rm=T)
data_i$PE <- (data_i$PE - mean(data_i$PE, na.rm=T)) / sd(data_i$PE, na.rm=T)
dataPrime <- rbind(dataPrime, data_i)
}
data <- dataPrime
rm(dataPrime)
format(head(data,3), digits=3)
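As an aside, the same per-well standardization can be written more compactly with base R's ave(), which applies a function within groups while preserving row order (an equivalent sketch, not what loadData.R actually does):

# equivalent per-well z-scoring with ave(); row order is preserved
zscore <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
for (log in logs) data[[log]] <- ave(data[[log]], data$Well.Name, FUN = zscore)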
For each test observation I will now cross-correlate its section (the observation plus the n observations above and below it) against each train well. Each well returns the best correlation (averaged across the logs) it found, along with the facies at the corresponding train observation.
My suspicion is that the advantage of this approach is that it leverages all of the data. Other approaches must choose between subsetting the observations (since the PE log is only available for some wells) and subsetting the features (excluding the PE log so that all observations can be used).
A disadvantage may be that it does not make sense to utilize our Recruit F9 pseudowell, as it is composed of manually selected observations with no spatial context. This approach attempts to leverage spatial (vertical) context by cross-correlating log sections rather than looking at each observation individually, which in my opinion is closer to how a petrophysicist works.
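The corrPredict function below leans on subsetData from mirrorData.R, which isn't reproduced here. Judging by the file name and how it's called, it plausibly returns rows top through base of a data frame, mirror-padding any indices that fall off either end of the well. A minimal sketch of that behaviour (my assumption, not the actual implementation):

# hypothetical sketch of subsetData - returns rows top..base,
# reflecting indices that fall outside the data frame
subsetDataSketch <- function(df, top, base) {
  n <- nrow(df)
  idx <- top:base
  idx[idx < 1] <- 2 - idx[idx < 1]       # reflect off the top of the well
  idx[idx > n] <- 2 * n - idx[idx > n]   # reflect off the base of the well
  df[idx, ]
}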
In [7]:
source("mirrorData.R")
corrPredict <- function(train, test, l) {
wells <- unique(train$Well.Name)
for (i in 1:nrow(test)) {
top <- i - l / 2 + 1
base <- i + l / 2
test_i <- subsetData(test, top, base)
for (well_j in wells) {
train_j <- train[train$Well.Name == well_j,]
cors <- data.frame()
for (k in 1:nrow(train_j)) {
top_k <- k - l / 2 + 1
base_k <- k + l / 2
train_jk <- subsetData(train_j, top_k, base_k)
corGR <- cor(test_i$GR, train_jk$GR)
corILD <- cor(test_i$ILD_log10, train_jk$ILD_log10)
corDeltaPHI <- cor(test_i$DeltaPHI, train_jk$DeltaPHI)
corPHIND <- cor(test_i$PHIND, train_jk$PHIND)
if (sum(!is.na(test_i$PE)) == nrow(test_i) & sum(!is.na(train_jk$PE)) == nrow(train_jk)) {
corPE <- cor(test_i$PE, train_jk$PE)
} else { corPE <- NA }
c <- c(corGR, corILD, corDeltaPHI, corPHIND, corPE)
corAVG <- mean(c, na.rm=T)
temp <- data.frame(corGR=corGR, corILD=corILD, corDeltaPHI=corDeltaPHI, corPHIND=corPHIND, corPE=corPE,
corAVG=corAVG,
testWell=test_i$Well.Name[i], trainWell=well_j,
testDepth=test$Depth[i], trainDepth=train_j$Depth[k])
cors <- rbind(cors, temp)
}
best_j <- cors[which.max(cors$corAVG),]
test[i, paste0("Facies_", well_j)] <- train_j[train_j$Depth==best_j$trainDepth[1], "Facies"][1]
test[i, paste0("Corr_", well_j)] <- best_j$corAVG[1]
}
}
test
}
Before we include the contest test wells (STUART and CRAWFORD), let's perform some cross-validation to see what kind of performance we can expect from this unorthodox machine learning approach. To simulate contest conditions, I will hold out each possible two-well combination as a test set; the train set will be the remaining wells. I will thus be building a model for each combination, and we can see how much the performance varies.
Each well combination will call the previously defined corrPredict function, which collects each train well's vote. Instead of a democratic vote, I will simply take the highest cross-correlation across all train wells and choose that train observation's facies as the prediction.
Each well combination will also print out the names of the two test wells and the F1-score from that model.
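myF1Metric comes from accuracyMetrics.R, which isn't shown here. Assuming it is something like the contest's support-weighted average of per-class F1 scores, a minimal sketch might look like this (an illustration, not the actual implementation):

# hypothetical sketch of a support-weighted multiclass F1 score
myF1Sketch <- function(predicted, actual) {
  classes <- levels(factor(actual))
  f1 <- sapply(classes, function(cl) {
    tp <- sum(predicted == cl & actual == cl)
    fp <- sum(predicted == cl & actual != cl)
    fn <- sum(predicted != cl & actual == cl)
    p <- ifelse(tp + fp == 0, 0, tp / (tp + fp))   # precision
    r <- ifelse(tp + fn == 0, 0, tp / (tp + fn))   # recall
    ifelse(p + r == 0, 0, 2 * p * r / (p + r))
  })
  support <- table(factor(actual, levels = classes))
  sum(f1 * support / sum(support))  # weight each class by its frequency
}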
In [ ]:
source("accuracyMetrics.R")
wells <- unique(data$Well.Name)
wells <- wells[!wells %in% c("Recruit F9")]
# loop through test well pairs
for (i in 1:(length(wells)-1)) {
for (j in (i+1):(length(wells))) {
trainIndex <- data$Well.Name != wells[i] & data$Well.Name != wells[j]
train <- data[trainIndex & data$Well.Name != "Recruit F9",]
test <- data[!trainIndex,]
trainWells <- unique(train$Well.Name)
testPrime <- corrPredict(train, test, 20)
print(head(testPrime))
# find the best cross correlation from each well - use that as the predictor
# for (i in 1:nrow(testPrime)) {
# c <- NULL
# f <- NULL
# for (well_j in trainWells) {
# c <- c(c, testPrime[i, paste0("Corr_", well_j)])
# f <- c(f, testPrime[i, paste0("Facies_", well_j)])
# }
# j <- which.max(c)
# testPrime[i, "Predicted"] <- f[j]
# }
# testPrime$Predicted <- as.factor(testPrime$Predicted)
# levels(testPrime$Predicted) <- c("SS", "CSiS", "FSiS", "SiSh", "MS", "WS", "D", "PS", "BS")
# print(paste("-----------",
# "\nTest well 1:", wells[i],
# "\nTest well 2:", wells[j],
# "\nF1-score:", myF1Metric(testPrime$Predicted, testPrime$Facies),
# "\n-----------"))
}
}