In [1]:
import sklearn
print(sklearn.__version__)
In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from mpl_toolkits.axes_grid1 import make_axes_locatable
pd.set_option("display.max_rows", 10)
pd.options.mode.chained_assignment = None
filename = '../facies_vectors.csv'
training_data = pd.read_csv(filename)
training_data
Out[2]:
Remove a single well to use as a blind test later.
In [3]:
blind = training_data[training_data['Well Name'] == 'SHANKLE']
training_data = training_data[training_data['Well Name'] != 'SHANKLE']
In [4]:
'SHANKLE' in training_data['Well Name'].unique()
Out[4]:
In [5]:
training_data['Well Name'] = training_data['Well Name'].astype('category')
training_data['Formation'] = training_data['Formation'].astype('category')
training_data['Well Name'].unique()
Out[5]:
These are the names of the training wells remaining in the Council Grove reservoir data (the blind well SHANKLE has been removed), plus the pseudo-well 'Recruit F9', into which data has been recruited to better represent facies 9, the phylloid-algal bafflestone.
Before we plot the well data, let's define a color map so that each facies is represented by a consistent color in all the plots in this tutorial. We also create abbreviated facies labels and add them to the training_data dataframe.
In [6]:
# 1=sandstone  2=c_siltstone  3=f_siltstone  4=marine_silt_shale
# 5=mudstone   6=wackestone   7=dolomite     8=packstone  9=bafflestone
facies_colors = ['#F4D03F', '#F5B041', '#DC7633', '#A569BD',
                 '#000000', '#000080', '#2E86C1', '#AED6F1', '#196F3D']

facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS',
                 'WS', 'D', 'PS', 'BS']

# facies_color_map is a dictionary that maps facies labels
# to their respective colors
facies_color_map = {}
for ind, label in enumerate(facies_labels):
    facies_color_map[label] = facies_colors[ind]

def label_facies(row, labels):
    return labels[row['Facies'] - 1]

training_data.loc[:, 'FaciesLabels'] = training_data.apply(
    lambda row: label_facies(row, facies_labels), axis=1)
training_data.describe()
Out[6]:
This is a quick view of the statistical distribution of the input variables. Looking at the count values, most variables have 4149 valid entries, except for PE, which has 3232. In this tutorial we will drop the feature vectors that don't have a valid PE entry.
In [7]:
PE_mask = training_data['PE'].notnull().values
training_data = training_data[PE_mask]
In [8]:
training_data.describe()
Out[8]:
Let's take a look at the data from individual wells in a more familiar log plot form. We will create plots for the five well log variables, as well as a log for the facies labels. The plots are based on those described in Alessandro Amato del Monte's excellent tutorial.
In [9]:
# Colors changed to match the paper for easier comparison.
def make_facies_log_plot(logs, facies_colors):
    # make sure logs are sorted by depth
    logs = logs.sort_values(by='Depth')
    cmap_facies = colors.ListedColormap(
        facies_colors[0:len(facies_colors)], 'indexed')

    ztop = logs.Depth.min(); zbot = logs.Depth.max()

    cluster = np.repeat(np.expand_dims(logs['Facies'].values, 1), 100, 1)

    f, ax = plt.subplots(nrows=1, ncols=6, figsize=(8, 12))
    ax[0].plot(logs.GR, logs.Depth, '-g')
    ax[1].plot(logs.ILD_log10, logs.Depth, '-')
    ax[2].plot(logs.DeltaPHI, logs.Depth, '-', color='0.5')
    ax[3].plot(logs.PHIND, logs.Depth, '-', color='r')
    ax[4].plot(logs.PE, logs.Depth, '-', color='black')
    im = ax[5].imshow(cluster, interpolation='none', aspect='auto',
                      cmap=cmap_facies, vmin=1, vmax=9)

    divider = make_axes_locatable(ax[5])
    cax = divider.append_axes("right", size="20%", pad=0.05)
    cbar = plt.colorbar(im, cax=cax)
    cbar.set_label((17*' ').join([' SS ', 'CSiS', 'FSiS',
                                  'SiSh', ' MS ', ' WS ', ' D ',
                                  ' PS ', ' BS ']))
    cbar.set_ticks(range(0, 1)); cbar.set_ticklabels('')

    for i in range(len(ax)-1):
        ax[i].set_ylim(ztop, zbot)
        ax[i].invert_yaxis()
        ax[i].grid()
        ax[i].locator_params(axis='x', nbins=3)

    ax[0].set_xlabel("GR")
    ax[0].set_xlim(logs.GR.min(), logs.GR.max())
    ax[1].set_xlabel("ILD_log10")
    ax[1].set_xlim(logs.ILD_log10.min(), logs.ILD_log10.max())
    ax[2].set_xlabel("DeltaPHI")
    ax[2].set_xlim(logs.DeltaPHI.min(), logs.DeltaPHI.max())
    ax[3].set_xlabel("PHIND")
    ax[3].set_xlim(logs.PHIND.min(), logs.PHIND.max())
    ax[4].set_xlabel("PE")
    ax[4].set_xlim(logs.PE.min(), logs.PE.max())
    ax[5].set_xlabel('Facies')

    ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([])
    ax[4].set_yticklabels([]); ax[5].set_yticklabels([])
    ax[5].set_xticklabels([])
    f.suptitle('Well: %s' % logs.iloc[0]['Well Name'], fontsize=14, y=0.94)
In addition to individual wells, we can look at how the various facies are represented in the entire training set. Let's plot a histogram of the number of training examples for each facies class.
In [10]:
#count the number of unique entries for each facies, sort them by
#facies number (instead of by number of entries)
facies_counts = training_data['Facies'].value_counts().sort_index()
#use facies labels to index each count
facies_counts.index = facies_labels
facies_counts.plot(kind='bar',color=facies_colors,
title='Distribution of Training Data by Facies')
facies_counts
Out[10]:
In [11]:
correct_facies_labels = training_data['Facies'].values
feature_vectors = training_data.drop(['Formation', 'Well Name', 'Depth','Facies','FaciesLabels'], axis=1)
feature_vectors.describe()
Out[11]:
In [12]:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(feature_vectors)
scaled_features = scaler.transform(feature_vectors)
In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    scaled_features, correct_facies_labels, test_size=0.3, random_state=42)
Now we use the cleaned and conditioned training set to create a facies classifier. As mentioned above, we will use a type of machine learning model known as a support vector machine (SVM). The SVM maps the feature vectors as points in a multidimensional space, arranged so that examples from different facies are divided by a gap that is as wide as possible.
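To make the concept concrete, here is a minimal sketch (not part of the original workflow) that fits a linear SVM on just two of the scaled features and inspects the resulting separating hyperplanes. It assumes the feature columns are ordered GR, ILD_log10, DeltaPHI, PHIND, PE, NM_M, RELPOS, as in the dataframe above.

# Illustrative only: fit a linear SVM on two features (GR and PHIND).
# Uses scaled_features and correct_facies_labels from the cells above.
from sklearn import svm
toy_clf = svm.LinearSVC(random_state=42)
toy_clf.fit(scaled_features[:, [0, 3]], correct_facies_labels)
# With 9 facies classes, LinearSVC fits one-vs-rest hyperplanes:
# coef_ holds each hyperplane's normal vector, intercept_ its offset.
print(toy_clf.coef_.shape)       # (9, 2)
print(toy_clf.intercept_.shape)  # (9,)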
In [14]:
from sklearn import svm
clf = svm.LinearSVC(random_state=42)
Now we can train the classifier using the training set we created above.
In [15]:
clf.fit(X_train, y_train)
Out[15]:
In [16]:
predicted_labels = clf.predict(X_test)
In [17]:
from sklearn.metrics import confusion_matrix
from classification_utilities import display_cm, display_adj_cm
conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
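display_cm and display_adj_cm come from this tutorial's classification_utilities module. If that file isn't available, scikit-learn's built-in classification_report gives comparable per-class precision, recall, and F1 scores — a minimal sketch, assuming y_test and predicted_labels from the cells above:

# Fallback for classification_utilities; all nine facies must appear in
# y_test for target_names to line up with the classes.
from sklearn.metrics import classification_report
print(classification_report(y_test, predicted_labels,
                            target_names=facies_labels))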
In [18]:
def accuracy(conf):
    # fraction of predictions on the diagonal of the confusion matrix
    total_correct = 0.
    nb_classes = conf.shape[0]
    for i in np.arange(0, nb_classes):
        total_correct += conf[i][i]
    acc = total_correct / sum(sum(conf))
    return acc
In [19]:
# adjacent_facies[i] lists the zero-based indices of the facies adjacent to
# facies i+1; dtype=object because the rows have different lengths
adjacent_facies = np.array([[1], [0, 2], [1], [4], [3, 5], [4, 6, 7],
                            [5, 7], [5, 6, 8], [6, 7]], dtype=object)

def accuracy_adjacent(conf, adjacent_facies):
    nb_classes = conf.shape[0]
    total_correct = 0.
    for i in np.arange(0, nb_classes):
        total_correct += conf[i][i]
        for j in adjacent_facies[i]:
            total_correct += conf[i][j]
    return total_correct / sum(sum(conf))
In [20]:
print('Facies classification accuracy = %f' % accuracy(conf))
print('Adjacent facies classification accuracy = %f' % accuracy_adjacent(conf, adjacent_facies))
In [21]:
# model selection takes a few minutes; change this variable
# to True to run the parameter loop
do_model_selection = False

if do_model_selection:
    C_range = np.array([.01, 1, 5, 10, 20, 50, 100, 1000, 5000, 10000])
    gamma_range = np.array([0.0001, 0.001, 0.01, 0.1, 1, 10])

    fig, axes = plt.subplots(3, 2,
                             sharex='col', sharey='row', figsize=(10, 10))
    plot_number = 0
    for outer_ind, gamma_value in enumerate(gamma_range):
        row = int(plot_number / 2)
        column = int(plot_number % 2)
        cv_errors = np.zeros(C_range.shape)
        train_errors = np.zeros(C_range.shape)
        for index, c_value in enumerate(C_range):
            clf = svm.SVC(C=c_value, gamma=gamma_value)
            clf.fit(X_train, y_train)

            train_conf = confusion_matrix(y_train, clf.predict(X_train))
            cv_conf = confusion_matrix(y_test, clf.predict(X_test))

            cv_errors[index] = accuracy(cv_conf)
            train_errors[index] = accuracy(train_conf)

        ax = axes[row, column]
        ax.set_title('Gamma = %g' % gamma_value)
        ax.semilogx(C_range, cv_errors, label='CV accuracy')
        ax.semilogx(C_range, train_errors, label='Train accuracy')
        plot_number += 1
        ax.set_ylim([0.2, 1])

    ax.legend(bbox_to_anchor=(1.05, 0), loc='lower left', borderaxespad=0.)
    fig.text(0.5, 0.03, 'C value', ha='center',
             fontsize=14)
    fig.text(0.04, 0.5, 'Classification Accuracy', va='center',
             rotation='vertical', fontsize=14)
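The same parameter search can be written more compactly with scikit-learn's GridSearchCV. This is a sketch, not the workflow used above: it scores by 5-fold cross-validation on the training split rather than against the fixed test set.

# Hypothetical alternative to the manual loop above.
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [.01, 1, 5, 10, 20, 50, 100, 1000, 5000, 10000],
              'gamma': [0.0001, 0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(svm.SVC(), param_grid, scoring='accuracy', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)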
The best accuracy on the cross-validation curves was achieved for gamma = 1 and C = 10. We can now create and train an optimized classifier; here we use a LinearSVC with balanced class weights:
In [22]:
clf = svm.LinearSVC(class_weight='balanced', tol=1e-03, random_state=42)
clf.fit(X_train, y_train)
cv_conf = confusion_matrix(y_test, clf.predict(X_test))
print('Optimized facies classification accuracy = %.2f' % accuracy(cv_conf))
print('Optimized adjacent facies classification accuracy = %.2f' % accuracy_adjacent(cv_conf, adjacent_facies))
In [23]:
display_cm(cv_conf, facies_labels,
display_metrics=True, hide_zeros=True)
In [24]:
display_adj_cm(cv_conf, facies_labels, adjacent_facies,
display_metrics=True, hide_zeros=True)
In [25]:
blind
Out[25]:
The label vector is just the Facies column:
In [26]:
y_blind = blind['Facies'].values
We can form the feature matrix by dropping some of the columns and making a new dataframe:
In [27]:
well_features = blind.drop(['Facies', 'Formation', 'Well Name', 'Depth'], axis=1)
Now we can transform this with the scaler we made before:
In [28]:
X_blind = scaler.transform(well_features)
Now it's a simple matter of making a prediction and storing it back in the dataframe:
In [29]:
y_pred = clf.predict(X_blind)
blind['Prediction'] = y_pred
Let's see how we did with the confusion matrix:
In [30]:
cv_conf = confusion_matrix(y_blind, y_pred)
print('Optimized facies classification accuracy = %.2f' % accuracy(cv_conf))
print('Optimized adjacent facies classification accuracy = %.2f' % accuracy_adjacent(cv_conf, adjacent_facies))
We managed an accuracy of 0.75 on the held-out test data, but those samples came from the same wells as the training data. This more realistic test does not perform as well...
In [31]:
display_cm(cv_conf, facies_labels,
display_metrics=True, hide_zeros=True)
...but does remarkably well on the adjacent facies predictions.
In [32]:
display_adj_cm(cv_conf, facies_labels, adjacent_facies,
display_metrics=True, hide_zeros=True)
In [33]:
def compare_facies_plot(logs, compadre, facies_colors):
    # make sure logs are sorted by depth
    logs = logs.sort_values(by='Depth')
    cmap_facies = colors.ListedColormap(
        facies_colors[0:len(facies_colors)], 'indexed')

    ztop = logs.Depth.min(); zbot = logs.Depth.max()

    cluster1 = np.repeat(np.expand_dims(logs['Facies'].values, 1), 100, 1)
    cluster2 = np.repeat(np.expand_dims(logs[compadre].values, 1), 100, 1)

    f, ax = plt.subplots(nrows=1, ncols=7, figsize=(9, 12))
    ax[0].plot(logs.GR, logs.Depth, '-g')
    ax[1].plot(logs.ILD_log10, logs.Depth, '-')
    ax[2].plot(logs.DeltaPHI, logs.Depth, '-', color='0.5')
    ax[3].plot(logs.PHIND, logs.Depth, '-', color='r')
    ax[4].plot(logs.PE, logs.Depth, '-', color='black')
    im1 = ax[5].imshow(cluster1, interpolation='none', aspect='auto',
                       cmap=cmap_facies, vmin=1, vmax=9)
    im2 = ax[6].imshow(cluster2, interpolation='none', aspect='auto',
                       cmap=cmap_facies, vmin=1, vmax=9)

    divider = make_axes_locatable(ax[6])
    cax = divider.append_axes("right", size="20%", pad=0.05)
    cbar = plt.colorbar(im2, cax=cax)
    cbar.set_label((17*' ').join([' SS ', 'CSiS', 'FSiS',
                                  'SiSh', ' MS ', ' WS ', ' D ',
                                  ' PS ', ' BS ']))
    cbar.set_ticks(range(0, 1)); cbar.set_ticklabels('')

    for i in range(len(ax)-2):
        ax[i].set_ylim(ztop, zbot)
        ax[i].invert_yaxis()
        ax[i].grid()
        ax[i].locator_params(axis='x', nbins=3)

    ax[0].set_xlabel("GR")
    ax[0].set_xlim(logs.GR.min(), logs.GR.max())
    ax[1].set_xlabel("ILD_log10")
    ax[1].set_xlim(logs.ILD_log10.min(), logs.ILD_log10.max())
    ax[2].set_xlabel("DeltaPHI")
    ax[2].set_xlim(logs.DeltaPHI.min(), logs.DeltaPHI.max())
    ax[3].set_xlabel("PHIND")
    ax[3].set_xlim(logs.PHIND.min(), logs.PHIND.max())
    ax[4].set_xlabel("PE")
    ax[4].set_xlim(logs.PE.min(), logs.PE.max())
    ax[5].set_xlabel('Facies')
    ax[6].set_xlabel(compadre)

    ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([])
    ax[4].set_yticklabels([]); ax[5].set_yticklabels([])
    ax[5].set_xticklabels([])
    ax[6].set_xticklabels([])
    f.suptitle('Well: %s' % logs.iloc[0]['Well Name'], fontsize=14, y=0.94)
In [34]:
compare_facies_plot(blind, 'Prediction', facies_colors)
In [35]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=2)
In [36]:
print(kf)
In [ ]:
# toy example: split 4 samples into 2 folds (X and y are hypothetical here)
X = np.arange(8).reshape(4, 2)
y = np.array([1, 2, 3, 4])
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
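To obtain a cross-validated score for the classifier in a single call, scikit-learn's cross_val_score can be used — a sketch, assuming the scaled_features and correct_facies_labels arrays built earlier:

# Illustrative only: mean accuracy over 5 folds.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(svm.LinearSVC(random_state=42),
                         scaled_features, correct_facies_labels, cv=5)
print(scores.mean(), scores.std())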
Now that we have a trained facies classification model, we can use it to identify facies in wells that do not have core data. In this case, we will apply the classifier to two wells, but we could use it on any number of wells for which we have the same set of input well logs.
This dataset is similar to the training data, except it does not have facies labels. It is loaded into a dataframe called well_data.
In [35]:
well_data = pd.read_csv('../validation_data_nofacies.csv')
well_data['Well Name'] = well_data['Well Name'].astype('category')
well_features = well_data.drop(['Formation', 'Well Name', 'Depth'], axis=1)
The data needs to be scaled using the same parameters (means and standard deviations) we computed from the training data.
In [36]:
X_unknown = scaler.transform(well_features)
Finally we predict facies labels for the unknown data, and store the results in a Facies column of the well_data dataframe.
In [37]:
#predict facies of unclassified data
y_unknown = clf.predict(X_unknown)
well_data['Facies'] = y_unknown
well_data
Out[37]:
In [38]:
well_data['Well Name'].unique()
Out[38]:
We can use the well log plot to view the classification results along with the well logs.
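For example (a sketch, assuming the validation set contains the well 'STUART', as in the original contest data):

# Plot predicted facies alongside the logs for one validation well.
make_facies_log_plot(
    well_data[well_data['Well Name'] == 'STUART'],
    facies_colors)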
In [39]:
well_data.to_csv('well_data_with_facies.csv')