SEG Machine Learning (Well Log Facies Prediction) Contest

Entry by Justin Gosses of team Pet_Stromatolite

This is an "open science" contest designed to introduce people to machine learning with well logs and brainstorm different methods through collaboration with others, so this notebook is based heavily on the introductary notebook with my own modifications.

more information at https://github.com/seg/2016-ml-contest

and even more information at http://library.seg.org/doi/abs/10.1190/tle35100906.1

This data is from the Council Grove gas reservoir in Southwest Kansas. The Panoma Council Grove Field is predominantly a carbonate gas reservoir encompassing 2700 square miles in Southwestern Kansas. This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector and validation (test) data (830 examples from two wells) having the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five from wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.

The seven predictor variables are:

Five wire line log curves include gamma ray (GR), resistivity logging (ILD_log10), photoelectric effect (PE), neutron-density porosity difference and average neutron-density porosity (DeltaPHI and PHIND). Note, some wells do not have PE.
Two geologic constraining variables: nonmarine-marine indicator (NM_M) and relative position (RELPOS)

The nine discrete facies (classes of rocks) are:

Nonmarine sandstone
Nonmarine coarse siltstone
Nonmarine fine siltstone
Marine siltstone and shale
Mudstone (limestone)
Wackestone (limestone)
Dolomite
Packstone-grainstone (limestone)
Phylloid-algal bafflestone (limestone)

These facies aren't discrete, and gradually blend into one another. Some have neighboring facies that are rather close. Mislabeling within these neighboring facies can be expected to occur. The following table lists the facies, their abbreviated labels and their approximate neighbors.

Facies	Label	Adjacent Facies
1	SS	2
2	CSiS	1,3
3	FSiS	2
4	SiSh	5
5	MS	4,6
6	WS	5,7
7	D	6,8
8	PS	6,7,9
9	BS	7,8

Let's clean up this dataset. The 'Well Name' and 'Formation' columns can be turned into a categorical data type.

=================================================================================================================

Notes:

Early Ideas for feature engineering

take out any points in individual wells where not all the logs are present
test whether error increases around the depths where PE is absent?
test whether using formation, depth, or depth&formation as variables impacts prediction
examine well logs & facies logs (including prediction wells) to see if there aren't trends that might be dealt with by increasing the population of certain wells over others in the training set?
explore effect size of using/not using marine or non-marine flags
explore making 'likely to predict wrong' flags based on first-pass results with thin facies surrounded by thicker facies such that you might expand a 'blended' response due to the measured response of the tool being thicker than predicted facies
explore doing the same above but before prediction using range of thickness in predicted facies flags vs. range of thickness in known facies flags
explore using multiple prediction loops, in order words, predict errors not just facies.
Explore error distribution: adjacent vs. non-adjacent facies, by thickness, marine vs. non-marine, by formation, and possible human judgement patterns that influence interpreted facies.



In [1]:

    
### loading 
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from mpl_toolkits.axes_grid1 import make_axes_locatable



In [2]:

    
### setting up options in pandas
from pandas import set_option
set_option("display.max_rows", 10)
pd.options.mode.chained_assignment = None


### taking a look at the training dataset 
filename = 'training_data.csv'
training_data = pd.read_csv(filename)
training_data









    Out[2]:






  
    
      
      Facies
      Formation
      Well Name
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      0
      3
      A1 SH
      SHRIMPLIN
      2793.0
      77.450
      0.664
      9.900
      11.915
      4.600
      1
      1.000
    
    
      1
      3
      A1 SH
      SHRIMPLIN
      2793.5
      78.260
      0.661
      14.200
      12.565
      4.100
      1
      0.979
    
    
      2
      3
      A1 SH
      SHRIMPLIN
      2794.0
      79.050
      0.658
      14.800
      13.050
      3.600
      1
      0.957
    
    
      3
      3
      A1 SH
      SHRIMPLIN
      2794.5
      86.100
      0.655
      13.900
      13.115
      3.500
      1
      0.936
    
    
      4
      3
      A1 SH
      SHRIMPLIN
      2795.0
      74.580
      0.647
      13.500
      13.300
      3.400
      1
      0.915
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      3227
      5
      C LM
      CHURCHMAN BIBLE
      3120.5
      46.719
      0.947
      1.828
      7.254
      3.617
      2
      0.685
    
    
      3228
      5
      C LM
      CHURCHMAN BIBLE
      3121.0
      44.563
      0.953
      2.241
      8.013
      3.344
      2
      0.677
    
    
      3229
      5
      C LM
      CHURCHMAN BIBLE
      3121.5
      49.719
      0.964
      2.925
      8.013
      3.190
      2
      0.669
    
    
      3230
      5
      C LM
      CHURCHMAN BIBLE
      3122.0
      51.469
      0.965
      3.083
      7.708
      3.152
      2
      0.661
    
    
      3231
      5
      C LM
      CHURCHMAN BIBLE
      3122.5
      50.031
      0.970
      2.609
      6.668
      3.295
      2
      0.653
    
  

3232 rows × 11 columns



In [3]:

    
### Checking out Well Names
training_data['Well Name'] = training_data['Well Name'].astype('category')

training_data['Well Name'].unique()









    Out[3]:





[SHRIMPLIN, SHANKLE, LUKE G U, CROSS H CATTLE, NOLAN, Recruit F9, NEWBY, CHURCHMAN BIBLE]
Categories (8, object): [SHRIMPLIN, SHANKLE, LUKE G U, CROSS H CATTLE, NOLAN, Recruit F9, NEWBY, CHURCHMAN BIBLE]



In [4]:

    
training_data['Well Name']









    Out[4]:





0             SHRIMPLIN
1             SHRIMPLIN
2             SHRIMPLIN
3             SHRIMPLIN
4             SHRIMPLIN
             ...       
3227    CHURCHMAN BIBLE
3228    CHURCHMAN BIBLE
3229    CHURCHMAN BIBLE
3230    CHURCHMAN BIBLE
3231    CHURCHMAN BIBLE
Name: Well Name, dtype: category
Categories (8, object): [CHURCHMAN BIBLE, CROSS H CATTLE, LUKE G U, NEWBY, NOLAN, Recruit F9, SHANKLE, SHRIMPLIN]



In [5]:

    
well_name_list = training_data['Well Name'].unique()
well_name_list









    Out[5]:





[SHRIMPLIN, SHANKLE, LUKE G U, CROSS H CATTLE, NOLAN, Recruit F9, NEWBY, CHURCHMAN BIBLE]
Categories (8, object): [SHRIMPLIN, SHANKLE, LUKE G U, CROSS H CATTLE, NOLAN, Recruit F9, NEWBY, CHURCHMAN BIBLE]



In [6]:

    
### Checking out Formation Names
training_data['Formation'] = training_data['Formation'].astype('category')
training_data['Formation'].unique()









    Out[6]:





[A1 SH, A1 LM, B1 SH, B1 LM, B2 SH, ..., B4 LM, B5 SH, B5 LM, C SH, C LM]
Length: 14
Categories (14, object): [A1 SH, A1 LM, B1 SH, B1 LM, ..., B5 SH, B5 LM, C SH, C LM]



In [7]:

    
training_data.describe()









    Out[7]:






  
    
      
      Facies
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      count
      3232.000000
      3232.000000
      3232.000000
      3232.000000
      3232.000000
      3232.000000
      3232.000000
      3232.000000
      3232.000000
    
    
      mean
      4.422030
      2875.824567
      66.135769
      0.642719
      3.559642
      13.483213
      3.725014
      1.498453
      0.520287
    
    
      std
      2.504243
      131.006274
      30.854826
      0.241845
      5.228948
      7.698980
      0.896152
      0.500075
      0.286792
    
    
      min
      1.000000
      2573.500000
      13.250000
      -0.025949
      -21.832000
      0.550000
      0.200000
      1.000000
      0.010000
    
    
      25%
      2.000000
      2791.000000
      46.918750
      0.492750
      1.163750
      8.346750
      3.100000
      1.000000
      0.273000
    
    
      50%
      4.000000
      2893.500000
      65.721500
      0.624437
      3.500000
      12.150000
      3.551500
      1.000000
      0.526000
    
    
      75%
      6.000000
      2980.000000
      79.626250
      0.812735
      6.432500
      16.453750
      4.300000
      2.000000
      0.767250
    
    
      max
      9.000000
      3122.500000
      361.150000
      1.480000
      18.600000
      84.400000
      8.094000
      2.000000
      1.000000



In [8]:

    
facies_1 = training_data.loc[training_data['Facies'] == 1]
facies_2 = training_data.loc[training_data['Facies'] == 2]
facies_3 = training_data.loc[training_data['Facies'] == 3]
facies_4 = training_data.loc[training_data['Facies'] == 4]
facies_5 = training_data.loc[training_data['Facies'] == 5]
facies_6 = training_data.loc[training_data['Facies'] == 6]
facies_7 = training_data.loc[training_data['Facies'] == 7]
facies_8 = training_data.loc[training_data['Facies'] == 8]
facies_9 = training_data.loc[training_data['Facies'] == 9]

#showing description for just facies 1, Sandstone
facies_1.describe()









    Out[8]:






  
    
      
      Facies
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      count
      259.0
      259.000000
      259.000000
      259.000000
      259.000000
      259.000000
      259.000000
      259.0
      259.000000
    
    
      mean
      1.0
      2749.503861
      64.497761
      0.371312
      3.506463
      14.818996
      2.913313
      1.0
      0.452486
    
    
      std
      0.0
      103.722844
      9.397137
      0.211272
      3.052479
      3.910413
      0.278879
      0.0
      0.215447
    
    
      min
      1.0
      2585.000000
      47.697000
      -0.025949
      -19.500000
      6.250000
      2.242000
      1.0
      0.074000
    
    
      25%
      1.0
      2673.250000
      58.047000
      0.215493
      1.819000
      12.275000
      2.700000
      1.0
      0.273000
    
    
      50%
      1.0
      2732.000000
      61.969000
      0.354000
      3.500000
      15.150000
      2.885000
      1.0
      0.429000
    
    
      75%
      1.0
      2839.750000
      70.062500
      0.541493
      5.000000
      17.587500
      3.098000
      1.0
      0.602000
    
    
      max
      1.0
      2941.500000
      92.783000
      0.874000
      11.400000
      36.350000
      3.652000
      1.0
      0.944000



In [9]:

    
#showing description for just facies 9, Phylloid-algal bafflestone (limestone)

facies_9.describe()









    Out[9]:






  
    
      
      Facies
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      count
      161.0
      161.000000
      161.000000
      161.000000
      161.000000
      161.000000
      161.000000
      161.0
      161.000000
    
    
      mean
      9.0
      3033.400621
      44.877280
      0.574677
      0.198547
      13.226944
      5.295925
      2.0
      0.451925
    
    
      std
      0.0
      49.093878
      29.823639
      0.298074
      3.191655
      4.376141
      0.839187
      0.0
      0.227553
    
    
      min
      9.0
      2843.000000
      14.840000
      0.100000
      -11.400000
      2.300000
      3.200000
      2.0
      0.088000
    
    
      25%
      9.0
      3011.000000
      34.920000
      0.346000
      -0.022000
      10.040000
      4.700000
      2.0
      0.233000
    
    
      50%
      9.0
      3048.000000
      39.219000
      0.598000
      0.990000
      13.229000
      5.200000
      2.0
      0.467000
    
    
      75%
      9.0
      3071.000000
      44.531000
      0.784000
      1.921000
      17.218000
      5.652000
      2.0
      0.633000
    
    
      max
      9.0
      3104.500000
      222.500000
      1.294000
      4.614000
      19.453000
      8.094000
      2.0
      0.952000



In [10]:

    
#showing description for just facies 8, Packstone-grainstone (limestone)

facies_8.describe()









    Out[10]:






  
    
      
      Facies
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      count
      498.0
      498.000000
      498.000000
      498.000000
      498.000000
      498.000000
      498.000000
      498.000000
      498.000000
    
    
      mean
      8.0
      2891.041165
      47.195886
      0.747021
      1.444823
      9.585972
      4.518653
      1.975904
      0.585058
    
    
      std
      0.0
      119.378744
      34.410690
      0.236509
      3.740654
      4.703665
      0.745893
      0.153503
      0.313764
    
    
      min
      8.0
      2608.500000
      13.250000
      -0.012000
      -21.832000
      0.550000
      1.605000
      1.000000
      0.010000
    
    
      25%
      8.0
      2813.500000
      23.502500
      0.600500
      0.033250
      6.400000
      4.000000
      2.000000
      0.318000
    
    
      50%
      8.0
      2923.250000
      34.981500
      0.775000
      1.502500
      8.716500
      4.559500
      2.000000
      0.630000
    
    
      75%
      8.0
      2983.875000
      61.215750
      0.904000
      3.487250
      11.858500
      5.000000
      2.000000
      0.879500
    
    
      max
      8.0
      3108.000000
      273.480000
      1.452000
      13.700000
      47.721000
      8.063000
      2.000000
      1.000000

This is a quick view of the statistical distribution of the input variables. Looking at the count values, there are 3232 feature vectors in the training set. Remove a single well to use as a blind test later.



In [11]:

    
blind = training_data[training_data['Well Name'] == 'SHANKLE']
training_data = training_data[training_data['Well Name'] != 'SHANKLE']

Before we plot the well data, let's define a color map so the facies are represented by consistent color in all the plots in this tutorial. We also create the abbreviated facies labels, and add those to the facies_vectors dataframe.



In [12]:

    
# 1=sandstone  2=c_siltstone   3=f_siltstone 
# 4=marine_silt_shale 5=mudstone 6=wackestone 7=dolomite
# 8=packstone 9=bafflestone
facies_colors = ['#F4D03F', '#F5B041','#DC7633','#6E2C00',
       '#1B4F72','#2E86C1', '#AED6F1', '#A569BD', '#196F3D']

facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS',
                 'WS', 'D','PS', 'BS']
#facies_color_map is a dictionary that maps facies labels
#to their respective colors
facies_color_map = {}
for ind, label in enumerate(facies_labels):
    facies_color_map[label] = facies_colors[ind]

def label_facies(row, labels):
    return labels[ row['Facies'] -1]
    
training_data.loc[:,'FaciesLabels'] = training_data.apply(lambda row: label_facies(row, facies_labels), axis=1)

Let's take a look at the data from individual wells in a more familiar log plot form. We will create plots for the five well log variables, as well as a log for facies labels. The plots are based on the those described in Alessandro Amato del Monte's excellent tutorial.



In [13]:

    
def make_facies_log_plot(logs, facies_colors):
    #make sure logs are sorted by depth
    logs = logs.sort_values(by='Depth')
    cmap_facies = colors.ListedColormap(
            facies_colors[0:len(facies_colors)], 'indexed')
    
    ztop=logs.Depth.min(); zbot=logs.Depth.max()
    
    cluster=np.repeat(np.expand_dims(logs['Facies'].values,1), 100, 1)
    
    f, ax = plt.subplots(nrows=1, ncols=6, figsize=(8, 12))
    ax[0].plot(logs.GR, logs.Depth, '-g')
    ax[1].plot(logs.ILD_log10, logs.Depth, '-')
    ax[2].plot(logs.DeltaPHI, logs.Depth, '-', color='0.5')
    ax[3].plot(logs.PHIND, logs.Depth, '-', color='r')
    ax[4].plot(logs.PE, logs.Depth, '-', color='black')
    im=ax[5].imshow(cluster, interpolation='none', aspect='auto',
                    cmap=cmap_facies,vmin=1,vmax=9)
    
    divider = make_axes_locatable(ax[5])
    cax = divider.append_axes("right", size="20%", pad=0.05)
    cbar=plt.colorbar(im, cax=cax)
    cbar.set_label((17*' ').join([' SS ', 'CSiS', 'FSiS', 
                                'SiSh', ' MS ', ' WS ', ' D  ', 
                                ' PS ', ' BS ']))
    cbar.set_ticks(range(0,1)); cbar.set_ticklabels('')
    
    for i in range(len(ax)-1):
        ax[i].set_ylim(ztop,zbot)
        ax[i].invert_yaxis()
        ax[i].grid()
        ax[i].locator_params(axis='x', nbins=3)
    
    ax[0].set_xlabel("GR")
    ax[0].set_xlim(logs.GR.min(),logs.GR.max())
    ax[1].set_xlabel("ILD_log10")
    ax[1].set_xlim(logs.ILD_log10.min(),logs.ILD_log10.max())
    ax[2].set_xlabel("DeltaPHI")
    ax[2].set_xlim(logs.DeltaPHI.min(),logs.DeltaPHI.max())
    ax[3].set_xlabel("PHIND")
    ax[3].set_xlim(logs.PHIND.min(),logs.PHIND.max())
    ax[4].set_xlabel("PE")
    ax[4].set_xlim(logs.PE.min(),logs.PE.max())
    ax[5].set_xlabel('Facies')
    
    ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([])
    ax[4].set_yticklabels([]); ax[5].set_yticklabels([])
    ax[5].set_xticklabels([])
    f.suptitle('Well: %s'%logs.iloc[0]['Well Name'], fontsize=14,y=0.94)



In [14]:

    
make_facies_log_plot(
    training_data[training_data['Well Name'] == 'SHRIMPLIN'],
    facies_colors)

editing the well viewer code in an attempt to understand it and potentially not show everything



In [15]:

    
def make_faciesOnly_log_plot(logs, facies_colors):
    #make sure logs are sorted by depth
    logs = logs.sort_values(by='Depth')
    cmap_facies = colors.ListedColormap(
            facies_colors[0:len(facies_colors)], 'indexed')
    
    ztop=logs.Depth.min(); zbot=logs.Depth.max()
    
    cluster=np.repeat(np.expand_dims(logs['Facies'].values,1), 100, 1)
    f, ax = plt.subplots(nrows=1, ncols=2, figsize=(3, 9))
#     f, ax = plt.subplots(nrows=1, ncols=6, figsize=(8, 12))
#     ax[0].plot(logs.GR, logs.Depth, '-g')
    ax[0].plot(logs.ILD_log10, logs.Depth, '-')
#     ax[2].plot(logs.DeltaPHI, logs.Depth, '-', color='0.5')
#     ax[3].plot(logs.PHIND, logs.Depth, '-', color='r')
#     ax[4].plot(logs.PE, logs.Depth, '-', color='black')
    im=ax[1].imshow(cluster, interpolation='none', aspect='auto',
                    cmap=cmap_facies,vmin=1,vmax=9)
    
    divider = make_axes_locatable(ax[1])
    cax = divider.append_axes("right", size="20%", pad=0.05)
    cbar=plt.colorbar(im, cax=cax)
    cbar.set_label((17*' ').join([' SS ', 'CSiS', 'FSiS', 
                                'SiSh', ' MS ', ' WS ', ' D  ', 
                                ' PS ', ' BS ']))
    cbar.set_ticks(range(0,1)); cbar.set_ticklabels('')
    
    for i in range(len(ax)-1):
        ax[i].set_ylim(ztop,zbot)
        ax[i].invert_yaxis()
        ax[i].grid()
        ax[i].locator_params(axis='x', nbins=3)
    
#     ax[0].set_xlabel("GR")
#     ax[0].set_xlim(logs.GR.min(),logs.GR.max())
    ax[0].set_xlabel("ILD_log10")
    ax[0].set_xlim(logs.ILD_log10.min(),logs.ILD_log10.max())
#     ax[2].set_xlabel("DeltaPHI")
#     ax[2].set_xlim(logs.DeltaPHI.min(),logs.DeltaPHI.max())
#     ax[3].set_xlabel("PHIND")
#     ax[3].set_xlim(logs.PHIND.min(),logs.PHIND.max())
#     ax[4].set_xlabel("PE")
#     ax[4].set_xlim(logs.PE.min(),logs.PE.max())
    ax[1].set_xlabel('Facies')
    
#     ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([])
#     ax[4].set_yticklabels([]); 
    ax[1].set_yticklabels([])
    ax[1].set_xticklabels([])
    f.suptitle('Well: %s'%logs.iloc[0]['Well Name'], fontsize=14,y=0.94)

looking at several wells at once



In [16]:

    
# make_faciesOnly_log_plot(
#     training_data[training_data['Well Name'] == 'SHRIMPLIN'],
#     facies_colors)

for i in range(len(well_name_list)-1):
#     well_name_list[i]
    make_faciesOnly_log_plot(
        training_data[training_data['Well Name'] == well_name_list[i]],
        facies_colors)









    



/Users/justingosses/anaconda/lib/python3.5/site-packages/matplotlib/axes/_base.py:3045: UserWarning: Attempting to set identical bottom==top results
in singular transformations; automatically expanding.
bottom=-0.5, top=-0.5
  'bottom=%s, top=%s') % (bottom, top))






    



---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-16-cd71a9b6d309> in <module>()
      7     make_faciesOnly_log_plot(
      8         training_data[training_data['Well Name'] == well_name_list[i]],
----> 9         facies_colors)

<ipython-input-15-7b8f06744f32> in make_faciesOnly_log_plot(logs, facies_colors)
     48     ax[1].set_yticklabels([])
     49     ax[1].set_xticklabels([])
---> 50     f.suptitle('Well: %s'%logs.iloc[0]['Well Name'], fontsize=14,y=0.94)
     51 
     52 

/Users/justingosses/anaconda/lib/python3.5/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1284             return self._getitem_tuple(key)
   1285         else:
-> 1286             return self._getitem_axis(key, axis=0)
   1287 
   1288     def _getitem_axis(self, key, axis=0):

/Users/justingosses/anaconda/lib/python3.5/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1571 
   1572                 # validate the location
-> 1573                 self._is_valid_integer(key, axis)
   1574 
   1575             return self._get_loc(key, axis=axis)

/Users/justingosses/anaconda/lib/python3.5/site-packages/pandas/core/indexing.py in _is_valid_integer(self, key, axis)
   1485         l = len(ax)
   1486         if key >= l or key < -l:
-> 1487             raise IndexError("single positional indexer is out-of-bounds")
   1488         return True
   1489 

IndexError: single positional indexer is out-of-bounds

In addition to individual wells, we can look at how the various facies are represented by the entire training set. Let's plot a histogram of the number of training examples for each facies class.

This shows the distribution of examples by facies for the examples in the training set. Dolomite (facies 7) has the fewest with 81 examples. Depending on the performance of the classifier we are going to train, we may consider getting more examples of these facies. Crossplots are a familiar tool in the geosciences to visualize how two properties vary with rock type. This dataset contains 5 log variables, and scatter matrix can help to quickly visualize the variation between the all the variables in the dataset. We can employ the very useful Seaborn library to quickly create a nice looking scatter matrix. Each pane in the plot shows the relationship between two of the variables on the x and y axis, with each point is colored according to its facies. The same colormap is used to represent the 9 facies.



In [ ]:

    
#save plot display settings to change back to when done plotting with seaborn
inline_rc = dict(mpl.rcParams)

import seaborn as sns
sns.set()
sns.pairplot(training_data.drop(['Well Name','Facies','Formation','Depth','NM_M','RELPOS'],axis=1),
             hue='FaciesLabels', palette=facies_color_map,
             hue_order=list(reversed(facies_labels)))

#switch back to default matplotlib plot style
mpl.rcParams.update(inline_rc)



In [ ]:

    
from pandas.tools.plotting import radviz
radviz(training_data.drop(['Well Name','Formation','Depth','NM_M','RELPOS']), "Facies")

Conditioning the data set

Now we extract just the feature variables we need to perform the classification. The predictor variables are the five wireline values and two geologic constraining variables. We also get a vector of the facies labels that correspond to each feature vector.



In [ ]:

    
correct_facies_labels = training_data['Facies'].values

# dropping certain labels and only keeping the geophysical log values to train on
feature_vectors = training_data.drop(['Formation', 'Well Name', 'Depth','Facies','FaciesLabels'], axis=1)
feature_vectors.describe()

Scikit includes a preprocessing module that can 'standardize' the data (giving each variable zero mean and unit variance, also called whitening). Many machine learning algorithms assume features will be standard normally distributed data (ie: Gaussian with zero mean and unit variance). The factors used to standardize the training set must be applied to any subsequent feature set that will be input to the classifier. The StandardScalar class can be fit to the training set, and later used to standardize any training data.



In [ ]:

    
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(feature_vectors)
scaled_features = scaler.transform(feature_vectors)



In [ ]:

    
feature_vectors



In [ ]:

    
feature_vectors.describe()

Scikit also includes a handy function to randomly split the training data into training and test sets. The test set contains a small subset of feature vectors that are not used to train the network. Because we know the true facies labels for these examples, we can compare the results of the classifier to the actual facies and determine the accuracy of the model. Let's use 20% of the data for the test set.



In [ ]:

    
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
        scaled_features, correct_facies_labels, test_size=0.1, random_state=42)

Training the SVM classifier

Now we use the cleaned and conditioned training set to create a facies classifier. As mentioned above, we will use a type of machine learning model known as a support vector machine. The SVM is a map of the feature vectors as points in a multi dimensional space, mapped so that examples from different facies are divided by a clear gap that is as wide as possible.

The SVM implementation in scikit-learn takes a number of important parameters. First we create a classifier using the default settings.



In [ ]:

    
from sklearn import svm

clf = svm.SVC()

Now we can train the classifier using the training set we created above.



In [ ]:

    
clf.fit(X_train,y_train)

Now that the model has been trained on our data, we can use it to predict the facies of the feature vectors in the test set. Because we know the true facies labels of the vectors in the test set, we can use the results to evaluate the accuracy of the classifier.



In [ ]:

    
predicted_labels = clf.predict(X_test)

We need some metrics to evaluate how good our classifier is doing. A confusion matrix is a table that can be used to describe the performance of a classification model. Scikit-learn allows us to easily create a confusion matrix by supplying the actual and predicted facies labels.

The confusion matrix is simply a 2D array. The entries of confusion matrix C[i][j] are equal to the number of observations predicted to have facies j, but are known to have facies i.

To simplify reading the confusion matrix, a function has been written to display the matrix along with facies labels and various error metrics. See the file classification_utilities.py in this repo for the display_cm() function.



In [ ]:

    
from sklearn.metrics import confusion_matrix
from classification_utilities import display_cm, display_adj_cm

conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, hide_zeros=True)



In [ ]:

    
def accuracy(conf):
    total_correct = 0.
    nb_classes = conf.shape[0]
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
    acc = total_correct/sum(sum(conf))
    return acc

As noted above, the boundaries between the facies classes are not all sharp, and some of them blend into one another. The error within these 'adjacent facies' can also be calculated. We define an array to represent the facies adjacent to each other. For facies label i, adjacent_facies[i] is an array of the adjacent facies labels.



In [ ]:

    
adjacent_facies = np.array([[1], [0,2], [1], [4], [3,5], [4,6,7], [5,7], [5,6,8], [6,7]])

def accuracy_adjacent(conf, adjacent_facies):
    nb_classes = conf.shape[0]
    total_correct = 0.
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
        for j in adjacent_facies[i]:
            total_correct += conf[i][j]
    return total_correct / sum(sum(conf))



In [ ]:

    
print('Facies classification accuracy = %f' % accuracy(conf))
print('Adjacent facies classification accuracy = %f' % accuracy_adjacent(conf, adjacent_facies))

Model parameter selection

The classifier so far has been built with the default parameters. However, we may be able to get improved classification results with optimal parameter choices.

We will consider two parameters. The parameter C is a regularization factor, and tells the classifier how much we want to avoid misclassifying training examples. A large value of C will try to correctly classify more examples from the training set, but if C is too large it may 'overfit' the data and fail to generalize when classifying new data. If C is too small then the model will not be good at fitting outliers and will have a large error on the training set.

The SVM learning algorithm uses a kernel function to compute the distance between feature vectors. Many kernel functions exist, but in this case we are using the radial basis function rbf kernel (the default). The gamma parameter describes the size of the radial basis functions, which is how far away two vectors in the feature space need to be to be considered close.

We will train a series of classifiers with different values for C and gamma. Two nested loops are used to train a classifier for every possible combination of values in the ranges specified. The classification accuracy is recorded for each combination of parameter values. The results are shown in a series of plots, so the parameter values that give the best classification accuracy on the test set can be selected.

This process is also known as 'cross validation'. Often a separate 'cross validation' dataset will be created in addition to the training and test sets to do model selection. For this tutorial we will just use the test set to choose model parameters.



In [ ]:

    
#model selection takes a few minutes, change this variable
#to true to run the parameter loop
do_model_selection = True

if do_model_selection:
    C_range = np.array([.01, 1, 5, 10, 20, 50, 100, 1000, 5000, 10000])
    gamma_range = np.array([0.0001, 0.001, 0.01, 0.1, 1, 10])
    
    fig, axes = plt.subplots(3, 2, 
                        sharex='col', sharey='row',figsize=(10,10))
    plot_number = 0
    for outer_ind, gamma_value in enumerate(gamma_range):
        row = int(plot_number / 2)
        column = int(plot_number % 2)
        cv_errors = np.zeros(C_range.shape)
        train_errors = np.zeros(C_range.shape)
        for index, c_value in enumerate(C_range):
            
            clf = svm.SVC(C=c_value, gamma=gamma_value)
            clf.fit(X_train,y_train)
            
            train_conf = confusion_matrix(y_train, clf.predict(X_train))
            cv_conf = confusion_matrix(y_test, clf.predict(X_test))
        
            cv_errors[index] = accuracy(cv_conf)
            train_errors[index] = accuracy(train_conf)

        ax = axes[row, column]
        ax.set_title('Gamma = %g'%gamma_value)
        ax.semilogx(C_range, cv_errors, label='CV error')
        ax.semilogx(C_range, train_errors, label='Train error')
        plot_number += 1
        ax.set_ylim([0.2,1])
        
    ax.legend(bbox_to_anchor=(1.05, 0), loc='lower left', borderaxespad=0.)
    fig.text(0.5, 0.03, 'C value', ha='center',
             fontsize=14)
             
    fig.text(0.04, 0.5, 'Classification Accuracy', va='center', 
             rotation='vertical', fontsize=14)

The best accuracy on the cross validation error curve was achieved for gamma = 1, and C = 10. We can now create and train an optimized classifier based on these parameters:



In [ ]:

    
clf = svm.SVC(C=10, gamma=1)        
clf.fit(X_train, y_train)

cv_conf = confusion_matrix(y_test, clf.predict(X_test))

print('Optimized facies classification accuracy = %.2f' % accuracy(cv_conf))
print('Optimized adjacent facies classification accuracy = %.2f' % accuracy_adjacent(cv_conf, adjacent_facies))

Precision and recall are metrics that give more insight into how the classifier performs for individual facies. Precision is the probability that given a classification result for a sample, the sample actually belongs to that class. Recall is the probability that a sample will be correctly classified for a given class.

Precision and recall can be computed easily using the confusion matrix. The code to do so has been added to the display_confusion_matrix() function:



In [ ]:

    
display_cm(cv_conf, facies_labels, 
           display_metrics=True, hide_zeros=True)

To interpret these results, consider facies SS. In our test set, if a sample was labeled SS the probability the sample was correct is 0.8 (precision). If we know a sample has facies SS, then the probability it will be correctly labeled by the classifier is 0.78 (recall). It is desirable to have high values for both precision and recall, but often when an algorithm is tuned to increase one, the other decreases. The F1 score combines both to give a single measure of relevancy of the classifier results.

These results can help guide intuition for how to improve the classifier results. For example, for a sample with facies MS or mudstone, it is only classified correctly 57% of the time (recall). Perhaps this could be improved by introducing more training samples. Sample quality could also play a role. Facies BS or bafflestone has the best F1 score and relatively few training examples. But this data was handpicked from other wells to provide training examples to identify this facies.

We can also consider the classification metrics when we consider misclassifying an adjacent facies as correct:



In [ ]:

    
display_adj_cm(cv_conf, facies_labels, adjacent_facies, 
           display_metrics=True, hide_zeros=True)

Considering adjacent facies, the F1 scores for all facies types are above 0.9, except when classifying SiSh or marine siltstone and shale. The classifier often misclassifies this facies (recall of 0.66), most often as wackestone.

These results are comparable to those reported in Dubois et al. (2007).

We held a well back from the training, and stored it in a dataframe called blind:



In [ ]:

    
blind

The label vector is just the Facies column:



In [ ]:

    
y_blind = blind['Facies'].values

We can form the feature matrix by dropping some of the columns and making a new dataframe:



In [ ]:

    
well_features = blind.drop(['Facies', 'Formation', 'Well Name', 'Depth'], axis=1)

Now we can transform this with the scaler we made before:



In [ ]:

    
X_blind = scaler.transform(well_features)

Now it's a simple matter of making a prediction and storing it back in the dataframe:



In [ ]:

    
y_pred = clf.predict(X_blind)
blind['Prediction'] = y_pred

Let's see how we did with the confusion matrix:



In [ ]:

    
cv_conf = confusion_matrix(y_blind, y_pred)

print('Optimized facies classification accuracy = %.2f' % accuracy(cv_conf))
print('Optimized adjacent facies classification accuracy = %.2f' % accuracy_adjacent(cv_conf, adjacent_facies))

We managed 0.71 using the test data, but it was from the same wells as the training data. This more reasonable test does not perform as well...



In [ ]:

    
display_cm(cv_conf, facies_labels,
           display_metrics=True, hide_zeros=True)

...but does remarkably well on the adjacent facies predictions.



In [ ]:

    
display_adj_cm(cv_conf, facies_labels, adjacent_facies,
               display_metrics=True, hide_zeros=True)



In [ ]:

    
def compare_facies_plot(logs, compadre, facies_colors):
    #make sure logs are sorted by depth
    logs = logs.sort_values(by='Depth')
    cmap_facies = colors.ListedColormap(
            facies_colors[0:len(facies_colors)], 'indexed')
    
    ztop=logs.Depth.min(); zbot=logs.Depth.max()
    
    cluster1 = np.repeat(np.expand_dims(logs['Facies'].values,1), 100, 1)
    cluster2 = np.repeat(np.expand_dims(logs[compadre].values,1), 100, 1)
    
    f, ax = plt.subplots(nrows=1, ncols=7, figsize=(9, 12))
    ax[0].plot(logs.GR, logs.Depth, '-g')
    ax[1].plot(logs.ILD_log10, logs.Depth, '-')
    ax[2].plot(logs.DeltaPHI, logs.Depth, '-', color='0.5')
    ax[3].plot(logs.PHIND, logs.Depth, '-', color='r')
    ax[4].plot(logs.PE, logs.Depth, '-', color='black')
    im1 = ax[5].imshow(cluster1, interpolation='none', aspect='auto',
                    cmap=cmap_facies,vmin=1,vmax=9)
    im2 = ax[6].imshow(cluster2, interpolation='none', aspect='auto',
                    cmap=cmap_facies,vmin=1,vmax=9)
    
    divider = make_axes_locatable(ax[6])
    cax = divider.append_axes("right", size="20%", pad=0.05)
    cbar=plt.colorbar(im2, cax=cax)
    cbar.set_label((17*' ').join([' SS ', 'CSiS', 'FSiS', 
                                'SiSh', ' MS ', ' WS ', ' D  ', 
                                ' PS ', ' BS ']))
    cbar.set_ticks(range(0,1)); cbar.set_ticklabels('')
    
    for i in range(len(ax)-2):
        ax[i].set_ylim(ztop,zbot)
        ax[i].invert_yaxis()
        ax[i].grid()
        ax[i].locator_params(axis='x', nbins=3)
    
    ax[0].set_xlabel("GR")
    ax[0].set_xlim(logs.GR.min(),logs.GR.max())
    ax[1].set_xlabel("ILD_log10")
    ax[1].set_xlim(logs.ILD_log10.min(),logs.ILD_log10.max())
    ax[2].set_xlabel("DeltaPHI")
    ax[2].set_xlim(logs.DeltaPHI.min(),logs.DeltaPHI.max())
    ax[3].set_xlabel("PHIND")
    ax[3].set_xlim(logs.PHIND.min(),logs.PHIND.max())
    ax[4].set_xlabel("PE")
    ax[4].set_xlim(logs.PE.min(),logs.PE.max())
    ax[5].set_xlabel('Facies')
    ax[6].set_xlabel(compadre)
    
    ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([])
    ax[4].set_yticklabels([]); ax[5].set_yticklabels([])
    ax[5].set_xticklabels([])
    ax[6].set_xticklabels([])
    f.suptitle('Well: %s'%logs.iloc[0]['Well Name'], fontsize=14,y=0.94)



In [ ]:

    
compare_facies_plot(blind, 'Prediction', facies_colors)

Applying the classification model to new data

Now that we have a trained facies classification model we can use it to identify facies in wells that do not have core data. In this case, we will apply the classifier to two wells, but we could use it on any number of wells for which we have the same set of well logs for input.

This dataset is similar to the training data except it does not have facies labels. It is loaded into a dataframe called test_data.



In [ ]:

    
well_data = pd.read_csv('validation_data_nofacies.csv')
well_data['Well Name'] = well_data['Well Name'].astype('category')
well_features = well_data.drop(['Formation', 'Well Name', 'Depth'], axis=1)

The data needs to be scaled using the same constants we used for the training data.



In [ ]:

    
X_unknown = scaler.transform(well_features)

Finally we predict facies labels for the unknown data, and store the results in a Facies column of the test_data dataframe.



In [ ]:

    
#predict facies of unclassified data
y_unknown = clf.predict(X_unknown)
well_data['Facies'] = y_unknown
well_data



In [ ]:

    
well_data['Well Name'].unique()

We can use the well log plot to view the classification results along with the well logs.



In [ ]:

    
make_facies_log_plot(
    well_data[well_data['Well Name'] == 'STUART'],
    facies_colors=facies_colors)

make_facies_log_plot(
    well_data[well_data['Well Name'] == 'CRAWFORD'],
    facies_colors=facies_colors)



In [ ]:

    
well_data.to_csv('well_data_with_facies.csv')

References

Amato del Monte, A., 2015. Seismic Petrophysics: Part 1, The Leading Edge, 34 (4). doi:10.1190/tle34040440.1

Bohling, G. C., and M. K. Dubois, 2003. An Integrated Application of Neural Network and Markov Chain Techniques to Prediction of Lithofacies from Well Logs, KGS Open-File Report 2003-50, 6 pp. pdf

Dubois, M. K., G. C. Bohling, and S. Chakrabarti, 2007, Comparison of four approaches to a rock facies classification problem, Computers & Geosciences, 33 (5), 599-617 pp. doi:10.1016/j.cageo.2006.08.011



In [ ]:

	Facies	Formation	Well Name	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
0	3	A1 SH	SHRIMPLIN	2793.0	77.450	0.664	9.900	11.915	4.600	1	1.000
1	3	A1 SH	SHRIMPLIN	2793.5	78.260	0.661	14.200	12.565	4.100	1	0.979
2	3	A1 SH	SHRIMPLIN	2794.0	79.050	0.658	14.800	13.050	3.600	1	0.957
3	3	A1 SH	SHRIMPLIN	2794.5	86.100	0.655	13.900	13.115	3.500	1	0.936
4	3	A1 SH	SHRIMPLIN	2795.0	74.580	0.647	13.500	13.300	3.400	1	0.915
...	...	...	...	...	...	...	...	...	...	...	...
3227	5	C LM	CHURCHMAN BIBLE	3120.5	46.719	0.947	1.828	7.254	3.617	2	0.685
3228	5	C LM	CHURCHMAN BIBLE	3121.0	44.563	0.953	2.241	8.013	3.344	2	0.677
3229	5	C LM	CHURCHMAN BIBLE	3121.5	49.719	0.964	2.925	8.013	3.190	2	0.669
3230	5	C LM	CHURCHMAN BIBLE	3122.0	51.469	0.965	3.083	7.708	3.152	2	0.661
3231	5	C LM	CHURCHMAN BIBLE	3122.5	50.031	0.970	2.609	6.668	3.295	2	0.653

	Facies	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
count	3232.000000	3232.000000	3232.000000	3232.000000	3232.000000	3232.000000	3232.000000	3232.000000	3232.000000
mean	4.422030	2875.824567	66.135769	0.642719	3.559642	13.483213	3.725014	1.498453	0.520287
std	2.504243	131.006274	30.854826	0.241845	5.228948	7.698980	0.896152	0.500075	0.286792
min	1.000000	2573.500000	13.250000	-0.025949	-21.832000	0.550000	0.200000	1.000000	0.010000
25%	2.000000	2791.000000	46.918750	0.492750	1.163750	8.346750	3.100000	1.000000	0.273000
50%	4.000000	2893.500000	65.721500	0.624437	3.500000	12.150000	3.551500	1.000000	0.526000
75%	6.000000	2980.000000	79.626250	0.812735	6.432500	16.453750	4.300000	2.000000	0.767250
max	9.000000	3122.500000	361.150000	1.480000	18.600000	84.400000	8.094000	2.000000	1.000000

	Facies	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
count	259.0	259.000000	259.000000	259.000000	259.000000	259.000000	259.000000	259.0	259.000000
mean	1.0	2749.503861	64.497761	0.371312	3.506463	14.818996	2.913313	1.0	0.452486
std	0.0	103.722844	9.397137	0.211272	3.052479	3.910413	0.278879	0.0	0.215447
min	1.0	2585.000000	47.697000	-0.025949	-19.500000	6.250000	2.242000	1.0	0.074000
25%	1.0	2673.250000	58.047000	0.215493	1.819000	12.275000	2.700000	1.0	0.273000
50%	1.0	2732.000000	61.969000	0.354000	3.500000	15.150000	2.885000	1.0	0.429000
75%	1.0	2839.750000	70.062500	0.541493	5.000000	17.587500	3.098000	1.0	0.602000
max	1.0	2941.500000	92.783000	0.874000	11.400000	36.350000	3.652000	1.0	0.944000

	Facies	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
count	161.0	161.000000	161.000000	161.000000	161.000000	161.000000	161.000000	161.0	161.000000
mean	9.0	3033.400621	44.877280	0.574677	0.198547	13.226944	5.295925	2.0	0.451925
std	0.0	49.093878	29.823639	0.298074	3.191655	4.376141	0.839187	0.0	0.227553
min	9.0	2843.000000	14.840000	0.100000	-11.400000	2.300000	3.200000	2.0	0.088000
25%	9.0	3011.000000	34.920000	0.346000	-0.022000	10.040000	4.700000	2.0	0.233000
50%	9.0	3048.000000	39.219000	0.598000	0.990000	13.229000	5.200000	2.0	0.467000
75%	9.0	3071.000000	44.531000	0.784000	1.921000	17.218000	5.652000	2.0	0.633000
max	9.0	3104.500000	222.500000	1.294000	4.614000	19.453000	8.094000	2.0	0.952000

	Facies	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
count	498.0	498.000000	498.000000	498.000000	498.000000	498.000000	498.000000	498.000000	498.000000
mean	8.0	2891.041165	47.195886	0.747021	1.444823	9.585972	4.518653	1.975904	0.585058
std	0.0	119.378744	34.410690	0.236509	3.740654	4.703665	0.745893	0.153503	0.313764
min	8.0	2608.500000	13.250000	-0.012000	-21.832000	0.550000	1.605000	1.000000	0.010000
25%	8.0	2813.500000	23.502500	0.600500	0.033250	6.400000	4.000000	2.000000	0.318000
50%	8.0	2923.250000	34.981500	0.775000	1.502500	8.716500	4.559500	2.000000	0.630000
75%	8.0	2983.875000	61.215750	0.904000	3.487250	11.858500	5.000000	2.000000	0.879500
max	8.0	3108.000000	273.480000	1.452000	13.700000	47.721000	8.063000	2.000000	1.000000