This notebook demonstrates how to train a machine learning algorithm to predict facies from well log data. The dataset comes from a class exercise on Neural Networks and Fuzzy Systems at The University of Kansas. That exercise is based on a consortium project that used machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more information on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).
The dataset consists of log data from nine wells that have been labeled with a facies type based on observation of core. We will use these logs to train a majority-vote classifier composed of a support vector machine, a random forest, and a gradient-boosted tree ensemble, using the implementations in scikit-learn and XGBoost.
First, we will examine the dataset used to train the classifier. The training data is contained in the file training_data.csv. The dataset consists of five wireline log measurements, two indicator variables, and a facies label at half-foot intervals. In machine learning terminology, each depth interval is represented by a feature vector that maps a set of 'features' (the log measurements) to a class (the facies type). We will use the pandas library to load the data into a dataframe, which provides a convenient data structure to work with well log data.
In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from mpl_toolkits.axes_grid1 import make_axes_locatable
from pandas import set_option
set_option("display.max_rows", 20)
pd.options.mode.chained_assignment = None
filename = '../training_data.csv'
training_data = pd.read_csv(filename)
training_data
Out[2]:
This data is from the Council Grove gas reservoir in southwest Kansas. The Panoma Council Grove Field is a predominantly carbonate gas reservoir encompassing 2700 square miles in southwestern Kansas. The training set comes from nine wells (4149 examples), each example consisting of seven predictor variables and a rock facies (class) label; the validation (test) data (830 examples from two wells) have the same seven predictor variables in the feature vector. Facies are based on examination of cores from the nine wells, taken vertically at half-foot intervals. The predictor variables include five wireline log measurements and two geologic constraining variables derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot rate.
The seven predictor variables are:

- GR: gamma ray
- ILD_log10: resistivity (base-10 logarithm)
- DeltaPHI: neutron-density porosity difference
- PHIND: average neutron-density porosity
- PE: photoelectric effect
- NM_M: nonmarine/marine indicator
- RELPOS: relative position
The nine discrete facies (classes of rocks) are:

1. SS: nonmarine sandstone
2. CSiS: nonmarine coarse siltstone
3. FSiS: nonmarine fine siltstone
4. SiSh: marine siltstone and shale
5. MS: mudstone
6. WS: wackestone
7. D: dolomite
8. PS: packstone
9. BS: bafflestone
These facies are not truly discrete; they gradually blend into one another, and some have neighboring facies that are quite similar. Mislabeling between such neighbors is to be expected. The following table lists the facies, their abbreviated labels, and their approximate neighbors.
Facies | Label | Adjacent Facies |
---|---|---|
1 | SS | 2 |
2 | CSiS | 1,3 |
3 | FSiS | 2 |
4 | SiSh | 5 |
5 | MS | 4,6 |
6 | WS | 5,7,8 |
7 | D | 6,8 |
8 | PS | 6,7,9 |
9 | BS | 7,8 |
Let's clean up this dataset. The 'Well Name' and 'Formation' columns can be turned into a categorical data type.
In [3]:
training_data['Well Name'] = training_data['Well Name'].astype('category')
training_data['Formation'] = training_data['Formation'].astype('category')
training_data['Well Name'].unique()
Out[3]:
In [4]:
training_data.columns[4:]
Out[4]:
In [5]:
training_data.describe()
Out[5]:
This is a quick view of the statistical distribution of the input variables. Looking at the count values, there are 3232 feature vectors in the training set.
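Class balance matters for a classifier, so it is worth a quick look at how the examples are distributed across the nine facies; a one-line check using only the dataframe already loaded:

In [ ]:
# number of training examples per facies class (1-9)
training_data['Facies'].value_counts().sort_index()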
In [6]:
# 1=sandstone 2=c_siltstone 3=f_siltstone
# 4=marine_silt_shale 5=mudstone 6=wackestone 7=dolomite
# 8=packstone 9=bafflestone
facies_colors = ['#F4D03F', '#F5B041','#DC7633','#6E2C00',
'#1B4F72','#2E86C1', '#AED6F1', '#A569BD', '#196F3D']
facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS',
'WS', 'D','PS', 'BS']
#facies_color_map is a dictionary that maps facies labels
#to their respective colors
facies_color_map = {}
for ind, label in enumerate(facies_labels):
facies_color_map[label] = facies_colors[ind]
def label_facies(row, labels):
return labels[ row['Facies'] -1]
def make_facies_log_plot(logs, facies_colors):
#make sure logs are sorted by depth
logs = logs.sort_values(by='Depth')
cmap_facies = colors.ListedColormap(
facies_colors[0:len(facies_colors)], 'indexed')
ztop=logs.Depth.min(); zbot=logs.Depth.max()
cluster=np.repeat(np.expand_dims(logs['Facies'].values,1), 100, 1)
f, ax = plt.subplots(nrows=1, ncols=6, figsize=(8, 12))
ax[0].plot(logs.GR, logs.Depth, '-g')
ax[1].plot(logs.ILD_log10, logs.Depth, '-')
ax[2].plot(logs.DeltaPHI, logs.Depth, '-', color='0.5')
ax[3].plot(logs.PHIND, logs.Depth, '-', color='r')
ax[4].plot(logs.PE, logs.Depth, '-', color='black')
im=ax[5].imshow(cluster, interpolation='none', aspect='auto',
cmap=cmap_facies,vmin=1,vmax=9)
divider = make_axes_locatable(ax[5])
cax = divider.append_axes("right", size="20%", pad=0.05)
cbar=plt.colorbar(im, cax=cax)
cbar.set_label((17*' ').join([' SS ', 'CSiS', 'FSiS',
'SiSh', ' MS ', ' WS ', ' D ',
' PS ', ' BS ']))
cbar.set_ticks(range(0,1)); cbar.set_ticklabels('')
for i in range(len(ax)-1):
ax[i].set_ylim(ztop,zbot)
ax[i].invert_yaxis()
ax[i].grid()
ax[i].locator_params(axis='x', nbins=3)
ax[0].set_xlabel("GR")
ax[0].set_xlim(logs.GR.min(),logs.GR.max())
ax[1].set_xlabel("ILD_log10")
ax[1].set_xlim(logs.ILD_log10.min(),logs.ILD_log10.max())
ax[2].set_xlabel("DeltaPHI")
ax[2].set_xlim(logs.DeltaPHI.min(),logs.DeltaPHI.max())
ax[3].set_xlabel("PHIND")
ax[3].set_xlim(logs.PHIND.min(),logs.PHIND.max())
ax[4].set_xlabel("PE")
ax[4].set_xlim(logs.PE.min(),logs.PE.max())
ax[5].set_xlabel('Facies')
ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([])
ax[4].set_yticklabels([]); ax[5].set_yticklabels([])
ax[5].set_xticklabels([])
f.suptitle('Well: %s'%logs.iloc[0]['Well Name'], fontsize=14,y=0.94)
def compare_facies_plot(logs, compadre, facies_colors):
#make sure logs are sorted by depth
logs = logs.sort_values(by='Depth')
cmap_facies = colors.ListedColormap(
facies_colors[0:len(facies_colors)], 'indexed')
ztop=logs.Depth.min(); zbot=logs.Depth.max()
cluster1 = np.repeat(np.expand_dims(logs['Facies'].values,1), 100, 1)
cluster2 = np.repeat(np.expand_dims(logs[compadre].values,1), 100, 1)
f, ax = plt.subplots(nrows=1, ncols=7, figsize=(9, 12))
ax[0].plot(logs.GR, logs.Depth, '-g')
ax[1].plot(logs.ILD_log10, logs.Depth, '-')
ax[2].plot(logs.DeltaPHI, logs.Depth, '-', color='0.5')
ax[3].plot(logs.PHIND, logs.Depth, '-', color='r')
ax[4].plot(logs.PE, logs.Depth, '-', color='black')
im1 = ax[5].imshow(cluster1, interpolation='none', aspect='auto',
cmap=cmap_facies,vmin=1,vmax=9)
im2 = ax[6].imshow(cluster2, interpolation='none', aspect='auto',
cmap=cmap_facies,vmin=1,vmax=9)
divider = make_axes_locatable(ax[6])
cax = divider.append_axes("right", size="20%", pad=0.05)
cbar=plt.colorbar(im2, cax=cax)
cbar.set_label((17*' ').join([' SS ', 'CSiS', 'FSiS',
'SiSh', ' MS ', ' WS ', ' D ',
' PS ', ' BS ']))
cbar.set_ticks(range(0,1)); cbar.set_ticklabels('')
for i in range(len(ax)-2):
ax[i].set_ylim(ztop,zbot)
ax[i].invert_yaxis()
ax[i].grid()
ax[i].locator_params(axis='x', nbins=3)
ax[0].set_xlabel("GR")
ax[0].set_xlim(logs.GR.min(),logs.GR.max())
ax[1].set_xlabel("ILD_log10")
ax[1].set_xlim(logs.ILD_log10.min(),logs.ILD_log10.max())
ax[2].set_xlabel("DeltaPHI")
ax[2].set_xlim(logs.DeltaPHI.min(),logs.DeltaPHI.max())
ax[3].set_xlabel("PHIND")
ax[3].set_xlim(logs.PHIND.min(),logs.PHIND.max())
ax[4].set_xlabel("PE")
ax[4].set_xlim(logs.PE.min(),logs.PE.max())
ax[5].set_xlabel('Facies')
ax[6].set_xlabel(compadre)
ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([])
ax[4].set_yticklabels([]); ax[5].set_yticklabels([])
ax[5].set_xticklabels([])
ax[6].set_xticklabels([])
f.suptitle('Well: %s'%logs.iloc[0]['Well Name'], fontsize=14,y=0.94)
In [6]:
make_facies_log_plot(
training_data[training_data['Well Name'] == 'SHRIMPLIN'],
facies_colors)
Next: make the label and feature vectors.
In [7]:
feat_labels =training_data.columns[4:]
In [8]:
def add_del(df, feat_names):
    """Add a first-derivative feature del_<name> for each feature in
    feat_names, approximated with np.gradient along the depth axis."""
    for fn in feat_names:
        df["del_" + fn] = np.gradient(df[fn])
    return df
training_data.columns[4:]
# apply add_del twice: the second pass also differentiates the del_ features,
# adding approximate second derivatives (del_del_<name>)
training_data = add_del(training_data, [fn for fn in training_data.columns[4:] if fn != 'NM_M'])
training_data = add_del(training_data, [fn for fn in training_data.columns[4:] if fn != 'NM_M'])
training_data.columns[4:]
feat_labels = training_data.columns[4:]
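Since the derivative features come from np.gradient, here is a quick illustration of what it computes on a toy array (central differences in the interior, one-sided differences at the ends):

In [ ]:
# e.g. the second entry is (4 - 1) / 2 = 1.5
np.gradient(np.array([1.0, 2.0, 4.0, 7.0]))  # -> array([1. , 1.5, 2.5, 3. ])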
In [9]:
feat_labels
Out[9]:
Next, extract the label vector and feature matrix from the dataframe.
In [10]:
y = training_data['Facies'].values
X = training_data.drop(['Formation', 'Well Name', 'Depth', 'Facies'], axis=1).values
feat_labels = training_data.columns[4:]
label_encoded_y = np.unique(y)  # the set of class labels, 1 through 9
In [12]:
## import and initialize a few classifiers to experiment with
In [13]:
def randomize(dataset, labels):
    # shuffle examples and labels together with one shared permutation
    permutation = np.random.permutation(labels.shape[0])
    shuffled_dataset = dataset[permutation, :]
    shuffled_labels = labels[permutation]
    return shuffled_dataset, shuffled_labels
X, y = randomize(X, y)
In [14]:
from sklearn import __version__ as sklearn_version
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
tree = DecisionTreeClassifier(criterion='entropy', max_depth=1)  # decision stumps for the bagger
forest = RandomForestClassifier(n_estimators=10000, random_state=50, n_jobs=-1)
bag = BaggingClassifier(base_estimator=tree, n_estimators=400, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
In [15]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import svm
SVC_classifier = svm.SVC(kernel='rbf', random_state=0, gamma=0.01)
pipe_svm = Pipeline([('scl', StandardScaler()), ('clf', SVC_classifier)])
In [16]:
import xgboost
from xgboost import XGBClassifier
Split the training data into training and test sets. Let's use 10% of the data for the test set.
In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X,y, test_size=0.1, random_state=42)
In [18]:
# the scores
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import precision_score, recall_score, f1_score, make_scorer
pre_scorer = make_scorer(score_func=f1_score, greater_is_better=True, average='micro')
# the scorer is f1_score with micro averaging, i.e. true/false positive and
# false negative counts are pooled over all classes before computing the score
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
# stratified k-fold cross-validation keeps the class proportions balanced in each fold
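Micro averaging pools the counts over all classes, so for a single-label multiclass problem micro-F1 is the same number as plain accuracy; a toy check with made-up labels:

In [ ]:
from sklearn.metrics import accuracy_score
y_a = [1, 2, 3, 3]
y_b = [1, 2, 3, 2]
# both print 0.75
print(f1_score(y_a, y_b, average='micro'), accuracy_score(y_a, y_b))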
In [19]:
from sklearn.model_selection import GridSearchCV
In [20]:
forest.fit(X, y)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
# list the features in decreasing order of random-forest importance
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))
It looks like the features are far from equally important; we might consider dropping those with relative importance below 2.3%.
In [21]:
if sklearn_version < '0.18':
X_selected = forest.transform(X, threshold=0.023)
else:
from sklearn.feature_selection import SelectFromModel
sfm = SelectFromModel(forest, threshold=0.023, prefit=True)
X_selected = sfm.transform(X)
In [22]:
# in the end, keep all the features (skip the selection step above)
X_selected = X
In [23]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X_selected,y, test_size=0.1, random_state=42)
In [26]:
#tuned forest
forest = RandomForestClassifier(
n_estimators = 303,
min_samples_leaf = 1,
oob_score = True,
random_state=50, n_jobs=-1)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
print ("F1-score: %.4g" % f1_score(y_test,y_pred, average='micro'))
In [27]:
#tuned xgb
xg = XGBClassifier(learning_rate =0.005,
n_estimators=1888,
max_depth=10,
min_child_weight=1,
gamma=0.2,
subsample=0.9,
colsample_bytree=0.7,
reg_alpha=0,
nthread=-1,
objective='multi:softprob',
scale_pos_weight=1,
seed=43)
# note: eval_metric is only applied when an eval_set is supplied, so it has no effect here
xg.fit(X_train, y_train, eval_metric=pre_scorer)
y_pred = xg.predict(X_test)
print ("F1-score: %.4g" % f1_score(y_test,y_pred, average='micro'))
In [28]:
print('Predicted F1-Score: {}'.format(xg.score(X_test, y_test)))  # .score() reports accuracy, which equals micro-F1 here
In [29]:
#tuned SVM
pipe_svm = Pipeline([('scl',StandardScaler()),('clf',svm.SVC(kernel = 'rbf', random_state=0, gamma=0.1, C=100))])
In [30]:
from sklearn.ensemble import VotingClassifier
mv = VotingClassifier(estimators=[('forest', forest), ('XGBoost', xg), ('svm', pipe_svm)])
In [31]:
clf_labels = ['Random Forest', 'XGBoost', 'SVM', 'Majority-Vote']
print('10-fold cross validation:\n')
for clf, label in zip([forest,xg,pipe_svm,mv], clf_labels):
scores = cross_val_score(estimator=clf,
X=X_train,
y=y_train,
cv=kfold,
scoring=pre_scorer)
print("F1-score: %0.2f (+/- %0.2f) [%s]"
% (scores.mean(), scores.std(), label))
In [32]:
mv.fit(X_train,y_train)
print('Predicted F1-Score: {}'.format(mv.score(X_test,y_test)))
In [33]:
y_pred=mv.predict(X_test)
print('Predicted F1-Score: {}'.format(f1_score(y_true=y_test,y_pred=y_pred, average='micro')))
Some more detailed metrics help evaluate how well our classifier is doing. A confusion matrix is a table that can be used to describe the performance of a classification model. Scikit-learn allows us to easily create a confusion matrix by supplying the actual and predicted facies labels.
The confusion matrix is simply a 2D array. The entry C[i][j] is the number of observations predicted to have facies j, but known to have facies i.
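For example, with made-up labels (illustration only):

In [ ]:
from sklearn.metrics import confusion_matrix
# rows are true classes, columns are predictions: one class-2
# observation was predicted as class 1
confusion_matrix([1, 2, 2], [1, 1, 2])
# -> array([[1, 0],
#           [1, 1]])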
To simplify reading the confusion matrix, a function has been written to display the matrix along with facies labels and various error metrics. See the file classification_utilities.py in this repo for the display_cm() function.
In [34]:
predicted_labels=mv.predict(X_test)
In [35]:
from sklearn.metrics import confusion_matrix
In [36]:
from classification_utilities import display_cm, display_adj_cm
facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS',
'WS', 'D','PS', 'BS'];
In [37]:
conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
The entries along the diagonal are the facies that have been correctly classified. Below we define two functions that will give an overall value for how the algorithm is performing. The accuracy is defined as the number of correct classifications divided by the total number of classifications.
In [38]:
def accuracy(conf):
total_correct = 0.
nb_classes = conf.shape[0]
for i in np.arange(0,nb_classes):
total_correct += conf[i][i]
acc = total_correct/sum(sum(conf))
return acc
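The same number falls out of a numpy one-liner, since the correct classifications lie on the diagonal of the confusion matrix:

In [ ]:
# trace = sum of the diagonal = total correct classifications
np.trace(conf) / conf.sum()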
As noted above, the boundaries between the facies classes are not all sharp, and some of them blend into one another. The error within these 'adjacent facies' can also be calculated. We define an array to represent the facies adjacent to each other: for (zero-based) facies label i, adjacent_facies[i] is an array of the adjacent facies labels.
In [39]:
adjacent_facies = np.array([[1], [0,2], [1], [4], [3,5], [4,6,7], [5,7], [5,6,8], [6,7]])
def accuracy_adjacent(conf, adjacent_facies):
nb_classes = conf.shape[0]
total_correct = 0.
for i in np.arange(0,nb_classes):
total_correct += conf[i][i]
for j in adjacent_facies[i]:
total_correct += conf[i][j]
return total_correct / sum(sum(conf))
In [40]:
print('Facies classification accuracy = %f' % accuracy(conf))
print('Adjacent facies classification accuracy = %f' % accuracy_adjacent(conf, adjacent_facies))
In [41]:
display_adj_cm(conf, facies_labels, adjacent_facies,
display_metrics=True, hide_zeros=True)
Considering adjacent facies, the F1 scores for all facies types are above 0.9, except for SiSh (marine siltstone and shale). The classifier often misclassifies this facies (recall of 0.87), most often as wackestone.
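scikit-learn's classification_report offers another view of the same per-class precision, recall, and F1 numbers (passing labels explicitly in case a rare facies is absent from the test split):

In [ ]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predicted_labels,
                            labels=list(range(1, 10)),
                            target_names=facies_labels))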
In [42]:
mv.fit(X, y)
print('Training F1-Score: {}'.format(mv.score(X, y)))  # score on the training data itself
Now that we have a trained facies classification model, we can use it to identify facies in wells that do not have core data. In this case, we will apply the classifier to two wells, but we could use it on any number of wells for which we have the same set of input logs. This dataset is similar to the training data, except that it does not have facies labels. It is loaded into a dataframe called well_data.
In [43]:
well_data = pd.read_csv('../validation_data_nofacies.csv')
well_data['Well Name'] = well_data['Well Name'].astype('category')
# there is no Facies column here, so the predictors start at column index 3
well_data = add_del(well_data, [fn for fn in well_data.columns[3:] if fn != 'NM_M'])
well_data = add_del(well_data, [fn for fn in well_data.columns[3:] if fn != 'NM_M'])
feat_labels = well_data.columns[3:]
X_unknown = well_data.drop(['Formation', 'Well Name', 'Depth'], axis=1).values
In [44]:
well_data.columns[3:]
#training_data.columns[4:]
Out[44]:
Finally, we predict facies labels for the unknown data and store the results in a Facies column of the well_data dataframe.
In [45]:
#predict facies of unclassified data
y_unknown = mv.predict(X_unknown)
well_data['Facies'] = y_unknown
well_data
Out[45]:
In [46]:
well_data['Well Name'].unique()
Out[46]:
We can use the well log plot to view the classification results along with the well logs.
In [47]:
make_facies_log_plot(
well_data[well_data['Well Name'] == 'STUART'],
facies_colors=facies_colors)
make_facies_log_plot(
well_data[well_data['Well Name'] == 'CRAWFORD'],
facies_colors=facies_colors)
Finally, we can write out a CSV file with the well data along with the facies classification results.
In [49]:
well_data.to_csv('well_data_with_facies.csv')