Facies classification using Machine Learning

LA Team Submission 3

Lukas Mosser, Alfredo De la Fuente

In this python notebook we explore a facies classification model using Deep Neural Networks taking into account spatial dependencies to outperform the prediction model proposed in the prediction facies from wel logs challenge.

Problem Modeling


The dataset we will use comes from a class excercise from The University of Kansas on Neural Networks and Fuzzy Systems. This exercise is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more info on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).

The dataset we will use is log data from nine wells that have been labeled with a facies type based on oberservation of core. We will use this log data to train a classifier to predict facies types.

This data is from the Council Grove gas reservoir in Southwest Kansas. The Panoma Council Grove Field is predominantly a carbonate gas reservoir encompassing 2700 square miles in Southwestern Kansas. This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector and validation (test) data (830 examples from two wells) having the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five from wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.

The seven predictor variables are:

The nine discrete facies (classes of rocks) are:

  1. Nonmarine sandstone
  2. Nonmarine coarse siltstone
  3. Nonmarine fine siltstone
  4. Marine siltstone and shale
  5. Mudstone (limestone)
  6. Wackestone (limestone)
  7. Dolomite
  8. Packstone-grainstone (limestone)
  9. Phylloid-algal bafflestone (limestone)

These facies aren't discrete, and gradually blend into one another. Some have neighboring facies that are rather close. Mislabeling within these neighboring facies can be expected to occur. The following table lists the facies, their abbreviated labels and their approximate neighbors.

Facies Label Adjacent Facies
1 SS 2
2 CSiS 1,3
3 FSiS 2
4 SiSh 5
5 MS 4,6
6 WS 5,7
7 D 6,8
8 PS 6,7,9
9 BS 7,8

Data Preprocessing


Let's import all the libraries that will be particularly needed for the analysis.


In [1]:
%%sh
pip install pandas
pip install scikit-learn
pip install keras


Requirement already satisfied (use --upgrade to upgrade): pandas in /home/alfredo/anaconda2/lib/python2.7/site-packages
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in /home/alfredo/anaconda2/lib/python2.7/site-packages (from pandas)
Requirement already satisfied (use --upgrade to upgrade): pytz>=2011k in /home/alfredo/anaconda2/lib/python2.7/site-packages (from pandas)
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.7.0 in /home/alfredo/anaconda2/lib/python2.7/site-packages (from pandas)
Requirement already satisfied (use --upgrade to upgrade): six>=1.5 in /home/alfredo/anaconda2/lib/python2.7/site-packages (from python-dateutil->pandas)
Requirement already satisfied (use --upgrade to upgrade): scikit-learn in /home/alfredo/anaconda2/lib/python2.7/site-packages
Requirement already satisfied (use --upgrade to upgrade): keras in /home/alfredo/anaconda2/lib/python2.7/site-packages
Requirement already satisfied (use --upgrade to upgrade): pyyaml in /home/alfredo/anaconda2/lib/python2.7/site-packages (from keras)
Requirement already satisfied (use --upgrade to upgrade): theano in /home/alfredo/anaconda2/lib/python2.7/site-packages (from keras)
Requirement already satisfied (use --upgrade to upgrade): six in /home/alfredo/anaconda2/lib/python2.7/site-packages (from keras)
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.7.1 in /home/alfredo/anaconda2/lib/python2.7/site-packages (from theano->keras)
Requirement already satisfied (use --upgrade to upgrade): scipy>=0.11 in /home/alfredo/anaconda2/lib/python2.7/site-packages (from theano->keras)
You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

In [2]:
from __future__ import print_function
import numpy as np
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold , StratifiedKFold
from classification_utilities import display_cm, display_adj_cm
from sklearn.metrics import confusion_matrix, f1_score
from sklearn import preprocessing


Using TensorFlow backend.

We load the training and testing data to preprocess it for further analysis.


In [3]:
filename = 'train_test_data.csv'
data = pd.read_csv(filename)
data.head(10)


Out[3]:
Facies Formation Well Name Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS
0 3 A1 SH SHRIMPLIN 2793.0 77.45 0.664 9.9 11.915 4.6 1 1.000
1 3 A1 SH SHRIMPLIN 2793.5 78.26 0.661 14.2 12.565 4.1 1 0.979
2 3 A1 SH SHRIMPLIN 2794.0 79.05 0.658 14.8 13.050 3.6 1 0.957
3 3 A1 SH SHRIMPLIN 2794.5 86.10 0.655 13.9 13.115 3.5 1 0.936
4 3 A1 SH SHRIMPLIN 2795.0 74.58 0.647 13.5 13.300 3.4 1 0.915
5 3 A1 SH SHRIMPLIN 2795.5 73.97 0.636 14.0 13.385 3.6 1 0.894
6 3 A1 SH SHRIMPLIN 2796.0 73.72 0.630 15.6 13.930 3.7 1 0.872
7 3 A1 SH SHRIMPLIN 2796.5 75.65 0.625 16.5 13.920 3.5 1 0.830
8 3 A1 SH SHRIMPLIN 2797.0 73.79 0.624 16.2 13.980 3.4 1 0.809
9 3 A1 SH SHRIMPLIN 2797.5 76.89 0.615 16.9 14.220 3.5 1 0.787

We fill the missing data values in the PE field with zero and proceed to normalize the data that will be fed into our model.


In [4]:
# Set 'Well Name' and 'Formation' fields as categories
data['Well Name'] = data['Well Name'].astype('category')
data['Formation'] = data['Formation'].astype('category')

# Fill missing values and normalize for 'PE' field
data['PE'] = data['PE'].fillna(value=0)
mean_pe = data['PE'].mean()
std_pe = data['PE'].std()
data['PE'] = (data['PE']-mean_pe)/std_pe

# Normalize the rest of fields (GR, ILD_log10, DelthaPHI, PHIND,NM_M,RELPOS)
correct_facies_labels = data['Facies'].values
feature_vectors = data.drop(['Formation', 'Depth'], axis=1)
well_labels = data[['Well Name', 'Facies']].values
data_vectors = feature_vectors.drop(['Well Name', 'Facies'], axis=1).values

scaler = preprocessing.StandardScaler().fit(data_vectors)
scaled_features = scaler.transform(data_vectors)
data_out = np.hstack([well_labels, scaled_features])

In order to start training stage, it is required to format the data by considering a sliding window over the depth component in order to classify a given set of features at some specific depth for each well in the training set.


In [5]:
def preprocess(data_out):
    
    data = data_out
    well_data = {}
    well_names = list(set(data[:, 0]))
    for name in well_names:
        well_data[name] = [[], []]

    for row in data:
        well_data[row[0]][1].append(row[1])
        well_data[row[0]][0].append(list(row[2::]))

    # Sliding window
    positive_lag = 0
    negative_lag = 0

    chunks = []
    chunks_test = []
    chunk_length = positive_lag+negative_lag+1 
    chunks_facies = []
    for name in well_names:
        if name not in ['STUART', 'CRAWFORD']:
            test_well_data = well_data[name]
            log_values = np.array(test_well_data[0])
            facies_values =  np.array(test_well_data[1])
            for i in range(log_values.shape[0]):
                chunks.append(log_values[i:i+1, :])
                chunks_facies.append(facies_values[i])
        else:
            test_well_data = well_data[name]
            log_values = np.array(test_well_data[0])
            facies_values =  np.array(test_well_data[1])
            for i in range(log_values.shape[0]):
                chunks_test.append(log_values[i:i+1, :])
    
    chunks_facies = np.array(chunks_facies, dtype=np.int32)-1
    X_ = np.array(chunks)
    X = np.zeros((len(X_),len(X_[0][0]) * len(X_[0])))
    for i in range(len(X_)):
        X[i,:] = X_[i].flatten()
        
    X_test = np.array(chunks_test)
    X_test_out = np.zeros((len(X_test),len(X_test[0][0]) * len(X_test[0])))
    for i in range(len(X_test)):
        X_test_out[i,:] = X_test[i].flatten()
    y = np_utils.to_categorical(chunks_facies)
    return X, y, X_test_out

Data Analysis


We will experiment with an ensemble of classification models to outperform the accuracy at predicting facies. As input we will consider a set of features extracted by padding a depth interval segment, that way we take into account spatial dependencies. As output we obtain a vector filled with values ranging from [0-8] that indicate the presence of each facie with respect to depth.


In [6]:
# Reproducibility
np.random.seed(7) 
# Load data
X_train, y_train, X_test = preprocess(data_out)
# Obtain labels
y_labels = np.zeros((len(y_train),1))
for i in range(len(y_train)):
    y_labels[i] = np.argmax(y_train[i])
y_labels = y_labels.astype(int)

In order to evaluate our classification model accurary we will use the our following defined metrics, based on the confusion matrix once the classification is performed. The first metric only considers misclassification error and the second one takes into account the fact that facies could be misclassified if they belong to a same group with similar geological characteristics.


In [7]:
def accuracy(conf):
    total_correct = 0.
    nb_classes = conf.shape[0]
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
    acc = total_correct/sum(sum(conf))
    return acc

adjacent_facies = np.array([[1], [0,2], [1], [4], [3,5], [4,6,7], [5,7], [5,6,8], [6,7]])
facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS','WS', 'D','PS', 'BS']

def accuracy_adjacent(conf, adjacent_facies):
    nb_classes = conf.shape[0]
    total_correct = 0.
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
        for j in adjacent_facies[i]:
            total_correct += conf[i][j]
    return total_correct / sum(sum(conf))

Model 1 - Deep Neural Network

Our model will be composed by a Deep Neural Network of an input layer, two hidden layers and an output layer.


In [8]:
# Set parameters
input_dim = 77
hidden_dim_1 = 200
hidden_dim_2 = 50
output_dim = 9 
batch_size = 32
nb_epoch = 10

In [9]:
def dnn_model():
    # Define the model
    model = Sequential()
    model.add(Dense(200, input_dim=7, init='normal', activation='relu'))
    model.add(Dense(50, input_dim=200, init='normal', activation='relu'))
    model.add(Dense(9, init='normal', activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

Once the set of parameters are fixed, the training stage of our model begins. We perform a Cross Validation routine to evaluate the performance of the model.


In [10]:
# Cross Validation
estimator = KerasClassifier(build_fn=dnn_model, nb_epoch=10, batch_size=1, verbose=0)
skf = StratifiedKFold(n_splits=5, shuffle=True)
results_dnn = cross_val_score(estimator, X_train, y_train, cv= skf.get_n_splits(X_train, y_train))
print(' Cross Validation Results')
print( results_dnn )

# Load the model
model_dnn = dnn_model()

#Train model
model_dnn.fit(X_train, y_train, nb_epoch=10, verbose=2)

# Predict Values on Training set
y_predicted = model_dnn.predict( X_train , batch_size=1, verbose=2)

# Print Report

# Format output [0 - 8 ]
y_ = np.zeros((len(y_train),1))
for i in range(len(y_train)):
    y_[i] = np.argmax(y_train[i])

y_predicted_ = np.zeros((len(y_predicted), 1))
for i in range(len(y_predicted)):
    y_predicted_[i] = np.argmax( y_predicted[i] )
    
# Confusion Matrix
conf = confusion_matrix(y_, y_predicted_)

# Print Results
print ("\nModel Report")
print ("-Accuracy: %.6f" % ( accuracy(conf) ))
print ("-Adjacent Accuracy: %.6f" % ( accuracy_adjacent(conf, adjacent_facies) ))
print ("\nConfusion Matrix")
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)


 Cross Validation Results
[ 0.55662651  0.52289157  0.50963855  0.43855422  0.46803378]
Epoch 1/10
0s - loss: 1.5783 - acc: 0.4268
Epoch 2/10
0s - loss: 1.1611 - acc: 0.5298
Epoch 3/10
0s - loss: 1.0903 - acc: 0.5642
Epoch 4/10
0s - loss: 1.0545 - acc: 0.5736
Epoch 5/10
0s - loss: 1.0272 - acc: 0.5809
Epoch 6/10
0s - loss: 1.0030 - acc: 0.5932
Epoch 7/10
0s - loss: 0.9870 - acc: 0.5951
Epoch 8/10
0s - loss: 0.9673 - acc: 0.5975
Epoch 9/10
0s - loss: 0.9459 - acc: 0.6117
Epoch 10/10
0s - loss: 0.9327 - acc: 0.6230

Model Report
-Accuracy: 0.624970
-Adjacent Accuracy: 0.926006

Confusion Matrix
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   161    93    14                                       268
     CSiS    66   599   272                 2           1         940
     FSiS     4   168   596     2           3           7         780
     SiSh     1     2     2   203     4    52     2     5         271
       MS           6     8    41    12   168     5    56         296
       WS     1           1    54    13   377    12   122     2   582
        D                 1     9     2    12    82    33     2   141
       PS           1    12    30     6   145    15   460    17   686
       BS                       2          12    12    56   103   185

Precision  0.69  0.69  0.66  0.60  0.32  0.49  0.64  0.62  0.83  0.62
   Recall  0.60  0.64  0.76  0.75  0.04  0.65  0.58  0.67  0.56  0.62
       F1  0.64  0.66  0.71  0.66  0.07  0.56  0.61  0.65  0.67  0.61

Prediction


We obtain the predictions for test data.


In [11]:
# DNN model Prediction
y_test = model_dnn.predict( X_test , batch_size=100, verbose=0)
predictions_dnn = np.zeros((len(y_test),1))
for i in range(len(y_test)):
    predictions_dnn[i] = np.argmax(y_test[i]) + 1 
predictions_dnn = predictions_dnn.astype(int)
# Store results
test_data = pd.read_csv('../validation_data_nofacies.csv')
test_data['Facies'] = predictions_dnn
test_data.to_csv('Prediction_3.csv')

In [ ]: