Facies classification from well logs using a Convolutional Network

Notebook by Valentin Tschannen and Janis Keuper, Fraunhofer ITWM, Kaiserslautern, Germany.


In this notebook we propose a possible approach to the facies classification contest proposed by the SEG. We thank the organisers and the original notebook authors for providing us with a great dataset and an enjoyable challenge.

Using the deep learning library TensorFlow, we implemented a prototype network built from inception modules and tried (many) regularization techniques to reduce overfitting. During the testing phase, we assessed the performance of our model by retaining one well and using it as a blind validation set. Despite the effort we put into hyper-parameter tuning, prediction accuracy remains disappointing.

As convolutional networks have proven very successful on many computer vision benchmarks, we nevertheless believe that this approach is interesting. With more time to improve the code, and probably also more data, there is no reason to believe these nets should perform poorly on such a problem. Finding out how much data is needed to obtain acceptable classification scores would be an interesting question to answer, but it requires access to a larger dataset.

This notebook will outline the 3 main parts of our approach:

  • Preparing the data as a set of overlapping sequences with associated labels
  • The use of inception modules
  • Possible regularization methods


In [1]:
%matplotlib notebook

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

import random

from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, confusion_matrix

import classification_utilities as utils
from ConvNet import inception_net

Data preparation

Load from csv

For information about the data and geological facies we try to predict, you can refer to the original notebook.


In [2]:
# Some parameters
feature_names = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE', 'NM_M', 'RELPOS']
facies_names = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS', 'WS', 'D', 'PS', 'BS']
facies_colors = ['#F4D03F', '#F5B041','#DC7633','#6E2C00', '#1B4F72','#2E86C1', '#AED6F1', '#A569BD', '#196F3D']
approximate_neighbors = [[2], [1, 3], [2], [5], [4, 6], [5, 7, 8], [6, 8, 9], [6, 7, 9], [7, 8]]

# Load training data from file
training_data = pd.read_csv('../facies_vectors.csv').drop('Formation', axis=1)
pd.set_option("display.max_rows", 4, "float_format", lambda v: "%.2f" % v )
training_data


Out[2]:
Facies Well Name Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS
0 3 SHRIMPLIN 2793.00 77.45 0.66 9.90 11.91 4.60 1 1.00
1 3 SHRIMPLIN 2793.50 78.26 0.66 14.20 12.56 4.10 1 0.98
... ... ... ... ... ... ... ... ... ... ...
4147 5 CHURCHMAN BIBLE 3122.00 51.47 0.96 3.08 7.71 3.15 2 0.66
4148 5 CHURCHMAN BIBLE 3122.50 50.03 0.97 2.61 6.67 3.29 2 0.65

4149 rows × 10 columns


In [3]:
# Well labels
well_name = training_data['Well Name']

# Features and labels
X = training_data[feature_names].values
y = training_data['Facies'].values

In [4]:
# Not sure if good idea to use F9 well?
# We don't
well_to_be_used = []
for well in well_name.unique():
    if well != 'Recruit F9':
        well_to_be_used.append(well)        
#well_to_be_used.append('Recruit F9')

In [5]:
# Load blind data
blind_data = pd.read_csv('../validation_data_nofacies.csv')
well_blind = blind_data['Well Name']
X_blind = blind_data[feature_names].values

Filling the missing values

As explained on the forum, there are "917 with missing values for the PE variable. Of these, 466 are from well ALEXANDER D, 439 from well KIMZEY A, and 12 from well Recruit F9." Borrowing an idea from other competitors (we do not know which team did it first), we train a regressor to impute those missing values. We use a Laplacian kernel ridge regressor from scikit-learn, with parameters selected by k-fold cross-validation.


In [6]:
krr_param_grid = {"alpha": [1e0, 1e-1, 1e-2, 1e-3],
                  "kernel": ['laplacian'],
                  "gamma": np.linspace(0.01,1,10)}
krr = GridSearchCV(KernelRidge(), cv=5, param_grid=krr_param_grid)

data_missing_vals = training_data[feature_names].copy()
data_no_missing_vals = data_missing_vals.dropna(axis = 0, inplace=False)
f_ = data_no_missing_vals.loc[:, data_no_missing_vals.columns != 'PE']
l_ = data_no_missing_vals.loc[:, 'PE']
krr.fit(f_, l_)
X[np.array(data_missing_vals.PE.isnull()),4] = krr.predict(data_missing_vals.loc[data_missing_vals.PE.isnull(),:].drop('PE',axis=1,inplace=False))

In [7]:
fig, ax = plt.subplots(figsize=(10,3))
mask = np.isnan(training_data[feature_names].values[:,4])
ax.plot(np.arange(X.shape[0]), X[:,4],color='r', label='imputed by KRR')
ax.plot(np.arange(X.shape[0]), training_data[feature_names].values[:,4])
ax.set_xlim(300,X.shape[0]-1000)
ax.set_ylim(X[:,4].min(), X[:,4].max()*1.1)
ax.set_yticks([])
ax.legend(loc='best')
fig.suptitle('Concatenated PE logs', fontweight='bold')


Out[7]:
<matplotlib.text.Text at 0x7fccb826c630>

Preparing the data for the network


In [8]:
training_data_imp = pd.DataFrame(X,columns=feature_names, index=well_name)
training_data_imp['Facies'] = training_data['Facies'].values
training_data_imp['Depth'] = training_data['Depth'].values

blind_data_ = pd.DataFrame(X_blind,columns=feature_names, index=well_blind)
blind_data_['Depth'] = blind_data['Depth'].values

For each well, we split the logs into overlapping sequences of equal length in depth (with an odd number of samples), and attribute to every sequence the facies of its middle sample. We use zero-padding at the top and bottom of every well. When fed such a sequence, the network can see what lies above and below the point it is currently classifying, which we hope makes the learning more robust. The sequence length is a hyperparameter that needs to be played with; we chose a 31-sample window (15.5 feet).
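The windowing step can be sketched in plain numpy. The `make_sequences` helper below is hypothetical (the real logic lives in `utils.generate_sequences`, which works with a percentage overlap and also handles labels and scaling); here we use a fixed stride of one sample for simplicity:

```python
import numpy as np

def make_sequences(log, seq_len=31, stride=1):
    """Split a single-well log (n_samples, n_features) into overlapping
    windows of odd length, zero-padded at the top and bottom, each
    associated with the index of its centre sample."""
    assert seq_len % 2 == 1, "window length must be odd"
    half = seq_len // 2
    padded = np.pad(log, ((half, half), (0, 0)), mode='constant')
    windows, centres = [], []
    for i in range(0, log.shape[0], stride):
        windows.append(padded[i:i + seq_len])
        centres.append(i)  # the label of this centre sample goes with the window
    return np.stack(windows), np.array(centres)

log = np.random.rand(100, 7)          # e.g. 100 depth samples, 7 log curves
X_seq, centre_idx = make_sequences(log)
print(X_seq.shape)                    # (100, 31, 7)
```

Note that the first window contains 15 rows of zero padding above the first real sample, so the depth resolution of the predictions is preserved end to end.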


In [9]:
sequence_length = 31
perc_overlap = 30

# Generate sequences
TST_X, TST_Y, scaler = utils.generate_sequences(training_data_imp, well_to_be_used,feature_names,
                                                sequence_length=sequence_length,perc_overlap=perc_overlap)

        

X_stuart, Y_stuart_absolute_idx_pos = utils.generate_sequences(blind_data_, ['STUART'],feature_names,
                                                             sequence_length=sequence_length, 
                                                             with_labels=False,perc_overlap=perc_overlap, 
                                                               scaler=scaler)


X_crawford, Y_crawford_absolute_idx_pos = utils.generate_sequences(blind_data_, ['CRAWFORD'],feature_names,
                                                             sequence_length=sequence_length, 
                                                             with_labels=False,perc_overlap=perc_overlap, 
                                                                   scaler=scaler)

Data augmentation ?

When the training data set is too small, it is common practice to use augmentation techniques. As we were unsure what would make sense for well logs, we prefer not to do anything of that sort here. However, as explained in the original notebook, mislabeling can occur because some facies have a similar response to the different well measurements. The author gives a list linking each facies to the other facies it is likely to be confused with. We use this additional information to perturb a certain percentage of the training examples, in the hope that this will make the network more robust.
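In our pipeline the perturbation is done by `utils.perturbe_label`; a minimal numpy sketch of the idea (with a hypothetical `perturb_labels` helper) looks like this:

```python
import numpy as np

# Each facies (1-based) listed with the facies it is commonly confused with
approximate_neighbors = [[2], [1, 3], [2], [5], [4, 6], [5, 7, 8],
                         [6, 8, 9], [6, 7, 9], [7, 8]]

def perturb_labels(y, neighbors, ratio=0.08, seed=None):
    """Replace a fraction `ratio` of the facies labels with one of their
    geologically plausible neighbours, drawn uniformly at random."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    n_swap = int(ratio * y.size)
    idx = rng.choice(y.size, size=n_swap, replace=False)
    for i in idx:
        y[i] = rng.choice(neighbors[y[i] - 1])  # labels are 1..9
    return y

y = np.ones(100, dtype=int)           # 100 samples, all facies 1 (SS)
y_pert = perturb_labels(y, approximate_neighbors, ratio=0.08, seed=0)
print((y_pert != 1).sum())            # 8 labels swapped to facies 2
```

Since facies 1 has a single neighbour (facies 2), exactly 8% of the labels become 2 in this toy run.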


In [10]:
approximate_neighbors = [[2], [1, 3], [2], [5], [4, 6], [5, 7, 8], [6, 8, 9], [6, 7, 9], [7, 8]]
perturbation_ratio = 0.08

The network

Convolutional layers

Our network uses what the original authors call "inception" modules, stacked on top of each other with pooling layers in between and Rectified Linear Units for the activations. The exact architecture can be seen in the file ConvNet.py.

Such modules have (at least) two interesting characteristics. The first one, which doesn't matter so much in this exercise as our dataset is small, is the use of 1x1 convolutions to compress the data in the channel/feature dimension and therefore keep the total number of parameters reasonable (moreover, these 1x1 convolutions are also followed by a ReLU activation, which brings additional non-linearity to the net and is beneficial according to the authors). The second one, and the reason why we chose to use them, is that they are composed of 4 different paths that the input data follows in parallel before being concatenated at the output.

  • $1^{st}$: A 1x1 convolution, which besides compressing the data in the feature dimension, also preserves the vertical resolution of the sequences.
  • $2^{nd}$: A 1x1 convolution followed by a small kernel convolution (1x2 in our case). This filter should be more sensitive to high frequency information present in the logs.
  • $3^{rd}$: A 1x1 convolution followed by a large kernel convolution (1x4 in our case). This filter should be more sensitive to low frequency information.
  • $4^{th}$: A pooling followed by a 1x1 convolution. This operation can be seen as a low-frequency filtering of the logs, in order to progressively look at more spatially averaged features.

Since it is unclear to us whether the low-frequency or the high-frequency information (or both) dominates the learning process, better to let the network decide on its own! As we stack those modules on top of each other, the net is supposed to progressively learn more and more abstract features from the data, which can be fed to the classifier.
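The channel-compression role of the 1x1 convolutions can be sketched in plain numpy, treating a 1x1 convolution as a per-sample linear map across channels (the weights below are random placeholders, not trained parameters):

```python
import numpy as np

def conv1x1(x, w):
    """A 1x1 convolution mixes channels at each depth sample independently:
    x is (length, in_channels), w is (in_channels, out_channels)."""
    return np.maximum(x @ w, 0.0)     # ReLU follows the 1x1, as in the modules

seq = np.random.rand(31, 64)          # a 31-sample window with 64 features
w = np.random.randn(64, 16) * 0.1     # compress 64 channels down to 16
out = conv1x1(seq, w)
print(out.shape)                      # (31, 16): depth resolution is preserved
```

The parameter count drops from what a wide-kernel convolution over 64 channels would need to just 64 x 16 weights, which is the compression argument made above.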

Fully connected layers and classifier

We use 2 fully connected layers of width 140 and 70 as a transition between the convolutional layers and the output layer. We tried reducing the width, and using only one layer, to avoid overfitting, but nothing helped... The output layer is a softmax classifier with 9 outputs (for the 9 facies we want to classify) and uses cross-entropy to measure the discrepancy between the training and predicted labels. This setting has the advantage of giving the prediction as a probability distribution over the 9 facies; effectively, we select as the answer the facies with maximum probability.
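As a small worked example of the softmax-plus-cross-entropy output stage (the logits below are made up for illustration, not real network outputs):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())           # shift by the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 0.5, 0.1, -1.0, 0.0, 0.3, -0.5, 1.2, 0.4])  # 9 facies
p = softmax(logits)                   # a probability distribution over facies

pred = np.argmax(p) + 1               # facies labels are 1-based -> facies 1
true = np.zeros(9)
true[0] = 1.0                         # one-hot encoding of facies 1 (SS)
cross_entropy = -np.sum(true * np.log(p))
print(pred, cross_entropy)
```

With a one-hot target, the cross-entropy reduces to the negative log-probability the network assigns to the true facies, so it is zero only when the model is perfectly confident and correct.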

Regularization

Since we had a hard time training a network that performed well on blind data, we used many regularization techniques, without too much success... In practice, it is probably not recommended to use so many, and needing them is a sign that we have a problem. Assuming there is no bug in the code, it probably means we do not have enough data to train on. Anyway, to list them without explaining them in detail, we used:

  • Dropout in the convolution layers
  • Dropout in the fully connected layers
  • Batch renormalization
  • l2 penalty on the convolutional weights
  • Max norm clipping on the fully connected weights
  • Use a truncated Gaussian to initialize the weights
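Of the items above, dropout is the simplest to sketch: units are zeroed with probability 1 - keep_prob and the survivors are rescaled so the expected activation is unchanged ("inverted" dropout). This is a minimal numpy illustration, not the TensorFlow implementation used in the net:

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Inverted dropout: zero each unit with probability 1 - keep_prob and
    scale the survivors by 1/keep_prob to preserve the expected activation."""
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((4, 5))
out = dropout(a, keep_prob=0.5, rng=rng)
print(out)                            # roughly half zeros, the rest 2.0
```

At test time (keep_prob = 1.0, as in the feed_dicts further down) the layer becomes the identity, which is why no rescaling is needed at prediction time.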

Of those, dropout is probably the easiest to use and also helps a lot. Batch normalization is more difficult to understand and takes quite a lot of time to compute, but it also seems to help. The l2 penalty decreases the performance of the classifier but reduces the standard deviation of the blind f1 score (since the weights are initialized randomly and the training data is shuffled and randomly perturbed, we see variations between two runs of the same code). We also employ an exponentially decaying learning rate and a fairly small batch size, which makes each epoch longer to compute, but might help with generalization.


In [11]:
# Parameters for the network
display_step = 10

# Batch size
batch_size = 32 
training_epochs = 71

# Learning rate
initial_learning_rate = 0.001 
epoch_decrease_every = 25
decrease_factor = 0.96

# Regularization
beta = .02 # l2 penalty on the convolutional weights
clip_norm=1.5 # Max norm clipping on the fully connected weights.
dropout_fc = 0.5 # probability to keep units in fully connected layers
dropout_conv = 0.5 # probability to keep units in convolutional layers
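For reference, the staircase decay configured above can be written in plain Python. Here the schedule is indexed by epoch; in the graph, tf.train.exponential_decay counts optimizer steps via global_step with decay_steps = epoch_decrease_every * total_batch, which amounts to the same thing:

```python
initial_learning_rate = 0.001
epoch_decrease_every = 25
decrease_factor = 0.96

def learning_rate(epoch):
    # staircase=True: the exponent only increases every `epoch_decrease_every` epochs
    return initial_learning_rate * decrease_factor ** (epoch // epoch_decrease_every)

print(learning_rate(0))    # 0.001
print(learning_rate(25))   # ~0.00096
print(learning_rate(71))   # ~0.00092 (0.001 * 0.96**2)
```

Over the 71 training epochs the rate therefore only drops twice, a fairly gentle schedule.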

In [12]:
# Take some data from training set for intermediate testing
X_train, Y_train, X_test, Y_test = utils.split_training_testing(TST_X, TST_Y, percentage_for_testing=10)

    
    
total_batch = X_train.shape[0]//batch_size

    
# Parameters
# learning rate
global_step = tf.Variable(0, trainable=False)
lr = tf.train.exponential_decay(initial_learning_rate, global_step, 
                                epoch_decrease_every*total_batch, decrease_factor, staircase=True)

        

# Network i/o
n_input = sequence_length 
n_features = len(feature_names)
n_classes = len(facies_names)


x_net = tf.placeholder(tf.float32, [None, 1, n_input, n_features])
y_net = tf.placeholder(tf.float32, [None, n_classes])
keep_prob_fc = tf.placeholder(tf.float32) #dropout
keep_prob_conv = tf.placeholder(tf.float32)
is_training = tf.placeholder(tf.bool)

    
# Construct model
pred = inception_net(x_net, keep_prob_fc, is_training, dropout_conv=keep_prob_conv, clip_norm=clip_norm)

# Loss and optimizer
conv_weights = [var for var in tf.global_variables() if 'kernel' in var.name]
weights_constraint = tf.reduce_sum(beta*tf.stack([tf.nn.l2_loss(w) for w in conv_weights]))
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y_net))
cost = tf.add(loss,weights_constraint)
optimizer = tf.train.AdamOptimizer(learning_rate=lr).minimize(cost, global_step=global_step)

# Evaluate model
correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(y_net, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

# Initializing the variables
init = tf.global_variables_initializer()

In [13]:
# Launch the graph
with tf.Session() as sess:
    sess.run(init)
    
    # Training cycle
    for epoch in range(training_epochs):
        X_train, Y_train, X_test, Y_test = utils.split_training_testing(TST_X, TST_Y, 
                                                                                percentage_for_testing=10)
        # shuffle training and testing set
        idx_train = np.random.permutation(np.arange(X_train.shape[0]))
        idx_test = np.random.permutation(np.arange(X_test.shape[0]))
        
        x_train = X_train[idx_train,...]
        y_train = Y_train[idx_train]
        x_test = X_test[idx_test,...]
        y_test = Y_test[idx_test]
        
        # perturbe training labels
        y_train = utils.perturbe_label(y_train, approximate_neighbors, perturbation_ratio=perturbation_ratio)
        
        # Loop over all batches
        for i in range(total_batch):
            batch_idx = range(i*batch_size,(i+1)*batch_size)
            batch_x = x_train[batch_idx,...] 
            batch_y = utils.dense_to_one_hot(y_train[batch_idx],n_classes)
            # Run optimization op (backprop)
            sess.run(optimizer, feed_dict={x_net: batch_x, 
                                           y_net: batch_y, 
                                           keep_prob_fc: dropout_fc,
                                           keep_prob_conv:dropout_conv,
                                           is_training:True})
            
        

    # Calculate prediction for blind wells
    y_stuart_pred = sess.run([pred], feed_dict={x_net: X_stuart,
                                               keep_prob_fc: 1.,
                                               keep_prob_conv:1.,
                                               is_training:False})
    
    
    y_crawford_pred = sess.run([pred], feed_dict={x_net: X_crawford,
                                               keep_prob_fc: 1.,
                                               keep_prob_conv:1.,
                                               is_training:False})
    
    
    y_stuart_pred_sm = sess.run(tf.nn.softmax(y_stuart_pred[0]))
    y_crawford_pred_sm = sess.run(tf.nn.softmax(y_crawford_pred[0]))

In [14]:
y_stuart_pred_final = utils.redundant_to_final(blind_data_.loc['STUART']['PE'].shape[0],
                                                      utils.one_hot_to_dense(y_stuart_pred_sm),
                                                      Y_stuart_absolute_idx_pos)
    

y_crawford_pred_final = utils.redundant_to_final(blind_data_.loc['CRAWFORD']['PE'].shape[0],
                                                      utils.one_hot_to_dense(y_crawford_pred_sm),
                                                      Y_crawford_absolute_idx_pos)

In [15]:
#blind_data_final = blind_data_[blind_data_['Well Name'] == 'STUART']
#blind_data_final['Facies'] = y_stuart_pred_final

In [16]:
blind_data['Facies'] = np.vstack((np.expand_dims(y_stuart_pred_final,1),np.expand_dims(y_crawford_pred_final,1)))
blind_data


Out[16]:
Formation Well Name Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS Facies
0 A1 SH STUART 2808.00 66.28 0.63 3.30 10.65 3.59 1 1.00 3
1 A1 SH STUART 2808.50 77.25 0.58 6.50 11.95 3.34 1 0.98 3
... ... ... ... ... ... ... ... ... ... ... ...
828 C SH CRAWFORD 3160.00 90.97 0.52 7.04 17.00 3.15 1 0.56 2
829 C SH CRAWFORD 3160.50 90.11 0.51 7.50 17.59 3.12 1 0.53 2

830 rows × 11 columns


In [17]:
_ = utils.make_facies_log_plot(blind_data[blind_data['Well Name'] == 'STUART'],facies_colors, figsize=(8,8))
_ = utils.make_facies_log_plot(blind_data[blind_data['Well Name'] == 'CRAWFORD'],facies_colors, figsize=(8,8))



In [18]:
# Write to file
blind_data.to_csv('./predicted_facies_itwm_01.csv')