Facies classification using Machine Learning

Original contest notebook by Brendon Hall, Enthought

Let's train some data. I'm Jesper and I have no clue what I'm doing but I'm having fun. Most texts are still from the original notebook from Brandon Hall. As well as the basis for the code.



In [63]:

    
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from mpl_toolkits.axes_grid1 import make_axes_locatable

from pandas import set_option
set_option("display.max_rows", 10)
pd.set_option('display.width', 1000)
pd.options.mode.chained_assignment = None

def hotcoder(data,sortlist,base):
    hotdata = [0]*len(data)
    dexin = {x:i for i,x in enumerate(sortlist)}
    for i,a in enumerate(data):
        for j,b in enumerate(a):
            if isinstance(a,list):
                if b not in sortlist:
                    print("Found illegal value of {0} at position {1}".format(b, i))
                    break
                hotdata[i] += base**dexin[b]
            else:
                if a not in sortlist:
                    print("Found illegal value of {0} at position {1}".format(a, i))
                    break
                hotdata[i] += base**dexin[a]
                break
    return hotdata

def distance(latlon1,latlon2):
    lat1 = np.deg2rad(latlon1[0])
    lon1 = np.deg2rad(latlon1[1])
    lat2 = np.deg2rad(latlon2[0])
    lon2 = np.deg2rad(latlon2[1])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    
    return (6371 * c)


facies_labels = ['SS','CSiS','FSiS','SiSh','MS','WS','D','PS','BS']
facies = pd.DataFrame({'ShortName': pd.Series(facies_labels,index=facies_labels),'Facies' : pd.Series(['Nonmarine sandstone', 'Nonmarine coarse siltstone', 'Nonmarine fine siltstone ', 'Marine siltstone and shale', 'Mudstone (limestone)',  'Wackestone (limestone)', 'Dolomite', 'Packstone-grainstone (limestone)','Phylloid-algal bafflestone (limestone)'],index=facies_labels), 'Neighbours' : pd.Series(['CSiS',['SS','FSiS'],'CSiS','MS',['SiSh','WS'],['MS','D'],['WS','PS'],['WS','D','BS'],['D','PS']],index=facies_labels), 'Colors' : pd.Series(['#F4D03F', '#F5B041','#DC7633','#6E2C00',
       '#1B4F72','#2E86C1', '#AED6F1', '#A569BD', '#196F3D'],index=facies_labels)})


filename = '../facies_vectors.csv'
training_data = pd.read_csv(filename)
rows = len(training_data.index)
training_data









    Out[63]:






  
    
      
      Facies
      Formation
      Well Name
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      0
      3
      A1 SH
      SHRIMPLIN
      2793.0
      77.450
      0.664
      9.900
      11.915
      4.600
      1
      1.000
    
    
      1
      3
      A1 SH
      SHRIMPLIN
      2793.5
      78.260
      0.661
      14.200
      12.565
      4.100
      1
      0.979
    
    
      2
      3
      A1 SH
      SHRIMPLIN
      2794.0
      79.050
      0.658
      14.800
      13.050
      3.600
      1
      0.957
    
    
      3
      3
      A1 SH
      SHRIMPLIN
      2794.5
      86.100
      0.655
      13.900
      13.115
      3.500
      1
      0.936
    
    
      4
      3
      A1 SH
      SHRIMPLIN
      2795.0
      74.580
      0.647
      13.500
      13.300
      3.400
      1
      0.915
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      4144
      5
      C LM
      CHURCHMAN BIBLE
      3120.5
      46.719
      0.947
      1.828
      7.254
      3.617
      2
      0.685
    
    
      4145
      5
      C LM
      CHURCHMAN BIBLE
      3121.0
      44.563
      0.953
      2.241
      8.013
      3.344
      2
      0.677
    
    
      4146
      5
      C LM
      CHURCHMAN BIBLE
      3121.5
      49.719
      0.964
      2.925
      8.013
      3.190
      2
      0.669
    
    
      4147
      5
      C LM
      CHURCHMAN BIBLE
      3122.0
      51.469
      0.965
      3.083
      7.708
      3.152
      2
      0.661
    
    
      4148
      5
      C LM
      CHURCHMAN BIBLE
      3122.5
      50.031
      0.970
      2.609
      6.668
      3.295
      2
      0.653
    
  

4149 rows × 11 columns

Verbatim from Brendon Hall:

This data is from the Council Grove gas reservoir in Southwest Kansas. The Panoma Council Grove Field is predominantly a carbonate gas reservoir encompassing 2700 square miles in Southwestern Kansas. This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector and validation (test) data (830 examples from two wells) having the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five from wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.

The seven predictor variables are:

Five wire line log curves include gamma ray (GR), resistivity logging (ILD_log10), photoelectric effect (PE), neutron-density porosity difference and average neutron-density porosity (DeltaPHI and PHIND). Note, some wells do not have PE.
Two geologic constraining variables: nonmarine-marine indicator (NM_M) and relative position (RELPOS)

The nine discrete facies (classes of rocks) are:



In [64]:

    
facies[['Facies']]









    Out[64]:






  
    
      
      Facies
    
  
  
    
      SS
      Nonmarine sandstone
    
    
      CSiS
      Nonmarine coarse siltstone
    
    
      FSiS
      Nonmarine fine siltstone
    
    
      SiSh
      Marine siltstone and shale
    
    
      MS
      Mudstone (limestone)
    
    
      WS
      Wackestone (limestone)
    
    
      D
      Dolomite
    
    
      PS
      Packstone-grainstone (limestone)
    
    
      BS
      Phylloid-algal bafflestone (limestone)

These facies gradually blend into one another. We can see a marine non-marine neighbourhood already.



In [65]:

    
pd.DataFrame({'Neighbours' : [facies.ShortName[x] for x in facies.Neighbours]},index=facies_labels)









    Out[65]:






  
    
      
      Neighbours
    
  
  
    
      SS
      CSiS
    
    
      CSiS
      SS        SS
FSiS    FSiS
Name: ShortName, dty...
    
    
      FSiS
      CSiS
    
    
      SiSh
      MS
    
    
      MS
      SiSh    SiSh
WS        WS
Name: ShortName, dty...
    
    
      WS
      MS    MS
D      D
Name: ShortName, dtype: object
    
    
      D
      WS    WS
PS    PS
Name: ShortName, dtype: object
    
    
      PS
      WS    WS
D      D
BS    BS
Name: ShortName, dt...
    
    
      BS
      D      D
PS    PS
Name: ShortName, dtype: object

The 'Well Name' and 'Formation' columns can be turned into a categorical data type.



In [66]:

    
training_data['Well Name'] = training_data['Well Name'].astype('category')
training_data['Formation'] = training_data['Formation'].astype('category')
training_data['Well Name'].unique()









    Out[66]:





[SHRIMPLIN, ALEXANDER D, SHANKLE, LUKE G U, KIMZEY A, CROSS H CATTLE, NOLAN, Recruit F9, NEWBY, CHURCHMAN BIBLE]
Categories (10, object): [SHRIMPLIN, ALEXANDER D, SHANKLE, LUKE G U, ..., NOLAN, Recruit F9, NEWBY, CHURCHMAN BIBLE]

Shrimplin:

Longitude: -100.98329
Latitude: 37.98126

ALEXANDER D

Longitude: -95.3534525
Latitude: 37.6607787

SHANKLE

Longitude: -101.3891894
Latitude: 38.0633727

LUKE G U

Longitude: -101.6073725
Latitude: 37.4537739

KIMZEY A

Longitude: -101.39697
Latitude: 37.12289

CROSS H CATTLE

Longitude: -101.6464517
Latitude: 37.9105826

NOLAN

Longitude: -101.0451641
Latitude: 37.7866294

Recruit F9

NEWBY

Longitude: -101.5847635
Latitude: 37.5406739

CHURCHMAN BIBLE

Longitude: -101.1060761
Latitude: 37.3497658

STUART

Longitude: -101.1391063
Latitude: 37.4857262

CRAWFORD

Longitude: -101.1494994
Latitude: 37.1893654



In [67]:

    
latlong = pd.DataFrame({"SHRIMPLIN": [37.98126,-100.98329], "ALEXANDER D": [37.6607787,-95.3534525], "SHANKLE": [38.0633727,-101.3891894], "LUKE G U": [37.4537739,-101.6073725], "KIMZEY A": [37.12289,-101.39697], "CROSS H CATTLE": [37.9105826,-101.6464517], "NOLAN": [37.7866294,-101.0451641], "NEWBY": [37.5406739,-101.5847635], "CHURCHMAN BIBLE": [37.3497658,-101.1060761], "STUART": [37.4857262,-101.1391063], "CRAWFORD": [37.1893654,-101.1494994], "Recruit F9": [37.4,-101]})
#distance([37.660779,-95.353453],[37.349766,-101.106076])
dist_mat= pd.DataFrame(np.zeros((latlong.shape[1],latlong.shape[1])))
#dist_mat = np.zeros(len(latlong.index),len(latlong.index))
for i,x in enumerate(latlong):
    for j,y in enumerate(latlong):
        if i > j:
            dist_mat[i][j] = (distance(latlong[x].values,latlong[y].values))
            dist_mat[j][i] = dist_mat[i][j]
dist_mat









    Out[67]:






  
    
      
      0
      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
    
  
  
    
      0
      0.000000
      508.539085
      514.414809
      553.609438
      537.155038
      551.655595
      549.025730
      500.715074
      498.685691
      531.653931
      495.709564
      510.181434
    
    
      1
      508.539085
      0.000000
      18.244918
      78.442557
      36.049841
      45.766211
      47.291283
      48.872768
      10.911501
      83.166276
      71.045797
      15.396954
    
    
      2
      514.414809
      18.244918
      0.000000
      91.381670
      23.143376
      50.037144
      54.823941
      67.047680
      26.897113
      99.451087
      89.264274
      32.966625
    
    
      3
      553.609438
      78.442557
      91.381670
      0.000000
      90.308868
      50.911093
      41.488243
      54.564443
      80.385857
      28.230578
      58.679434
      64.994310
    
    
      4
      537.155038
      36.049841
      23.143376
      90.308868
      0.000000
      41.232963
      49.333475
      80.071330
      46.729386
      104.579152
      102.175650
      46.345830
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      7
      500.715074
      48.872768
      67.047680
      54.564443
      80.071330
      61.820025
      54.808216
      0.000000
      43.174992
      43.098540
      22.312772
      34.466311
    
    
      8
      498.685691
      10.911501
      26.897113
      80.385857
      46.729386
      53.965104
      53.924917
      43.174992
      0.000000
      81.317150
      64.649885
      15.546242
    
    
      9
      531.653931
      83.166276
      99.451087
      28.230578
      104.579152
      70.445691
      60.608168
      43.098540
      81.317150
      0.000000
      36.708814
      67.888026
    
    
      10
      495.709564
      71.045797
      89.264274
      58.679434
      102.175650
      80.333850
      72.081453
      22.312772
      64.649885
      36.708814
      0.000000
      56.779037
    
    
      11
      510.181434
      15.396954
      32.966625
      64.994310
      46.345830
      41.478111
      39.779513
      34.466311
      15.546242
      67.888026
      56.779037
      0.000000
    
  

12 rows × 12 columns



In [68]:

    
training_data.describe()









    



C:\Users\Jesper\Anaconda3\lib\site-packages\numpy\lib\function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)






    Out[68]:






  
    
      
      Facies
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      count
      4149.000000
      4149.000000
      4149.000000
      4149.000000
      4149.000000
      4149.000000
      3232.000000
      4149.000000
      4149.000000
    
    
      mean
      4.503254
      2906.867438
      64.933985
      0.659566
      4.402484
      13.201066
      3.725014
      1.518438
      0.521852
    
    
      std
      2.474324
      133.300164
      30.302530
      0.252703
      5.274947
      7.132846
      0.896152
      0.499720
      0.286644
    
    
      min
      1.000000
      2573.500000
      10.149000
      -0.025949
      -21.832000
      0.550000
      0.200000
      1.000000
      0.000000
    
    
      25%
      2.000000
      2821.500000
      44.730000
      0.498000
      1.600000
      8.500000
      NaN
      1.000000
      0.277000
    
    
      50%
      4.000000
      2932.500000
      64.990000
      0.639000
      4.300000
      12.020000
      NaN
      2.000000
      0.528000
    
    
      75%
      6.000000
      3007.000000
      79.438000
      0.822000
      7.500000
      16.050000
      NaN
      2.000000
      0.769000
    
    
      max
      9.000000
      3138.000000
      361.150000
      1.800000
      19.312000
      84.400000
      8.094000
      2.000000
      1.000000

The new data set contains about 1000 data points or 33% more than the original one. Rejoice!



In [71]:

    
#facies_color_map is a dictionary that maps facies labels
#to their respective colors
facies_color_map = {}
for ind, label in enumerate(facies_labels):
    facies_color_map[label] = facies.Colors[ind]

def label_facies(row, labels):
    return labels[ row['Facies'] ]
processed_data = training_data.copy()
processed_data['Facies'] = processed_data['Facies']-1
processed_data= processed_data[processed_data['PE'].notnull().values]
processed_data.loc[:,'FaciesLabels'] = processed_data.apply(lambda row: label_facies(row, facies_labels), axis=1)

Let's take a look at the data from individual wells in a more familiar log plot form. We will create plots for the five well log variables, as well as a log for facies labels. The plots are based on the those described in Alessandro Amato del Monte's excellent tutorial.



In [ ]:

    
def make_facies_log_plot(logs, facies_colors):
    #make sure logs are sorted by depth
    logs = logs.sort_values(by='Depth')
    cmap_facies = colors.ListedColormap(
            facies.Colors[0:len(facies.Colors)], 'indexed')
    cmap_coarse = colors.ListedColormap(['#c0c0c0','#000000'])
    
    ztop=logs.Depth.min(); zbot=logs.Depth.max()
    
    cluster=np.repeat(np.expand_dims(logs['Facies'].values,1), 100, 1)
    clustercoarse=np.repeat(np.expand_dims(logs['NM_M'].values,1), 100, 1)
    
    f, ax = plt.subplots(nrows=1, ncols=7, figsize=(8, 12))
    ax[0].plot(logs.GR, logs.Depth, '-g')
    ax[1].plot(logs.ILD_log10, logs.Depth, '-')
    ax[2].plot(logs.DeltaPHI, logs.Depth, '-', color='0.5')
    ax[3].plot(logs.PHIND, logs.Depth, '-', color='r')
    ax[4].plot(logs.PE, logs.Depth, '-', color='black')
    
    
    im1 = ax[5].imshow(cluster, interpolation='none', aspect='auto',
                    cmap=cmap_facies,vmin=0,vmax=8)
    im2 = ax[6].imshow(clustercoarse, interpolation='none', aspect='auto',
                    cmap=cmap_coarse,vmin=1,vmax=2)
    
    divider = make_axes_locatable(ax[6])
    cax = divider.append_axes("right", size="20%", pad=0.05)
    cbar=plt.colorbar(im1, cax=cax)
    cbar.set_label((18*' ').join(facies.index))
    cbar.set_ticks(range(0,1)); cbar.set_ticklabels('')
    
    for i in range(len(ax)-2):
        ax[i].set_ylim(ztop,zbot)
        ax[i].invert_yaxis()
        ax[i].grid()
        ax[i].locator_params(axis='x', nbins=3)
    
    
    ax[0].set_xlabel("GR")
    ax[0].set_xlim(logs.GR.min(),logs.GR.max())
    ax[1].set_xlabel("ILD_log10")
    ax[1].set_xlim(logs.ILD_log10.min(),logs.ILD_log10.max())
    ax[2].set_xlabel("DeltaPHI")
    ax[2].set_xlim(logs.DeltaPHI.min(),logs.DeltaPHI.max())
    ax[3].set_xlabel("PHIND")
    ax[3].set_xlim(logs.PHIND.min(),logs.PHIND.max())
    ax[4].set_xlabel("PE")
    ax[4].set_xlim(logs.PE.min(),logs.PE.max())
    ax[5].set_xlabel('Facies')
    ax[6].set_xlabel('NM_M')
    
    ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([])
    ax[4].set_yticklabels([]); ax[5].set_yticklabels([]); ax[6].set_yticklabels([])
    ax[5].set_xticklabels([]); ax[6].set_xticklabels([])
    f.suptitle('Well: %s'%logs.iloc[0]['Well Name'], fontsize=14,y=0.94)
    plt.show()

Here's the SHRIMPLIN well for your vieling pleasure.



In [ ]:

    
make_facies_log_plot(
    processed_data[processed_data['Well Name'] == 'SHRIMPLIN'],
    facies.Colors)

In addition to individual wells, we can look at how the various facies are represented by the entire training set. Let's plot a histogram of the number of training examples for each facies class.



In [ ]:

    
#count the number of unique entries for each facies, sort them by
#facies number (instead of by number of entries)
facies_counts = processed_data['Facies'].value_counts().sort_index()
tmp = processed_data.query('NM_M==1')
facies_counts0 = tmp['Facies'].value_counts().sort_index()

tmp = processed_data.query('NM_M==2')
facies_counts1 = tmp['Facies'].value_counts().sort_index()

#use facies labels to index each count
facies_counts.index = facies_labels

facies_counts.plot(kind='bar',color=facies.Colors, 
                   title='Distribution of Training Data by Facies')
pd.DataFrame(facies_counts).T

This shows the distribution of examples by facies for the examples in the training set. Dolomite (facies 7) has the fewest with 81 examples. Depending on the performance of the classifier we are going to train, we may consider getting more examples of these facies.

Crossplots are a familiar tool in the geosciences to visualize how two properties vary with rock type. This dataset contains 5 log variables, and scatter matrix can help to quickly visualize the variation between the all the variables in the dataset. We can employ the very useful Seaborn library to quickly create a nice looking scatter matrix. Each pane in the plot shows the relationship between two of the variables on the x and y axis, with each point is colored according to its facies. The same colormap is used to represent the 9 facies.



In [ ]:

    
#save plot display settings to change back to when done plotting with seaborn
inline_rc = dict(mpl.rcParams)

import seaborn as sns
sns.set()
plotdata=processed_data.drop(['Well Name','Formation','Depth','RELPOS'],axis=1).dropna()

sns.pairplot(plotdata.drop(['Facies','NM_M'],axis=1),
             hue='FaciesLabels', palette=facies_color_map,
             hue_order=list(reversed(facies_labels)))

At this point I woult like to take a look at the separated main classes.



In [ ]:

    
sns.pairplot(plotdata.query('NM_M==1').drop(['Facies','NM_M'],axis=1),
             hue='FaciesLabels', palette=facies_color_map,
             hue_order=list(reversed(facies_labels)))



In [ ]:

    
facies_counts0.plot(kind='bar',color=facies.Colors, 
                   title='Distribution of Training Data by Facies')
pd.DataFrame(facies_counts0).T

In this view it becomes obvious that GR, PHIND and PE have some outliers at:

GR > 150
PhiND > 60
PE < 1



In [ ]:

    
sns.pairplot(plotdata.query('NM_M==2').drop(['Facies','NM_M'],axis=1),
             hue='FaciesLabels', palette=facies_color_map,
             hue_order=list(reversed(facies_labels)))



In [ ]:

    
facies_counts1.plot(kind='bar',color=facies.Colors, 
                   title='Distribution of Training Data by Facies')
pd.DataFrame(facies_counts1).T

Here we can see aparent outliers in GR, PhiND, PE and DeltaPhi

GR > 350
DeltaPhi < -20
PhiND > 45
PE < 2

I have decided to keep GR as it is a general trend in the data (and not unheard of in that rock). DeltaPhi also fits a general trend and seems to be a feature. PhiND and PE do not lie in a trend or cluster and will therefore be removed.



In [ ]:

    
processed_data0 = processed_data.query('NM_M==1 and not (PHIND > 60 or PE < 1 or GR > 150) and Facies < 3')
processed_data1 = processed_data.query('NM_M==2 and (PHIND < 45 or PE > 2) and Facies > 2')

processed_data = pd.concat([processed_data0,processed_data1])
print("Dropped {0} rows due to outliers and misclassification.".format(rows-len(processed_data.index),))



In [ ]:

    
sns.set()
plotdata=processed_data.drop(['Well Name','Facies','Formation','Depth','NM_M','RELPOS'],axis=1).dropna()

sns.pairplot(plotdata,
             hue='FaciesLabels', palette=facies_color_map,
             hue_order=list(reversed(facies_labels)))

#switch back to default matplotlib plot style
mpl.rcParams.update(inline_rc)

Feature Engineering



In [ ]:

    
correct_facies_labels  = training_data['Facies'].values
correct_facies_labels0 = training_data0['Facies'].values
#facies_labels=['Sand','Carbs']
correct_facies_labels1 = training_data1['Facies'].values

feature_vectors = training_data.drop(['Formation', 'Well Name','Depth','Facies','NM_M','FaciesLabels'], axis=1)
feature_vectors0 = training_data0.drop(['Formation', 'Well Name','Depth','Facies','NM_M','FaciesLabels'], axis=1)
feature_vectors1 = training_data1.drop(['Formation', 'Well Name','Depth','Facies','NM_M','FaciesLabels'], axis=1)



In [ ]:

    
feature_vectors.describe()



In [ ]:

    
feature_vectors0.describe()



In [ ]:

    
feature_vectors1.describe()

Scikit includes a preprocessing module that can 'standardize' the data (giving each variable zero mean and unit variance, also called whitening). Many machine learning algorithms assume features will be standard normally distributed data (ie: Gaussian with zero mean and unit variance). The factors used to standardize the training set must be applied to any subsequent feature set that will be input to the classifier. The StandardScalar class can be fit to the training set, and later used to standardize any training data.



In [ ]:

    
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(feature_vectors)
scaled_features = scaler.transform(feature_vectors)

scaler0 = preprocessing.StandardScaler().fit(feature_vectors0)
scaled_features0 = scaler0.transform(feature_vectors0)

scaler1 = preprocessing.StandardScaler().fit(feature_vectors1)
scaled_features1 = scaler1.transform(feature_vectors1)

Scikit also includes a handy function to randomly split the training data into training and test sets. The test set contains a small subset of feature vectors that are not used to train the network. Because we know the true facies labels for these examples, we can compare the results of the classifier to the actual facies and determine the accuracy of the model. Let's use 20% of the data for the test set.

Training the SVM classifier

Now we use the cleaned and conditioned training set to create a facies classifier. As mentioned above, we will use a type of machine learning model known as a support vector machine. The SVM is a map of the feature vectors as points in a multi dimensional space, mapped so that examples from different facies are divided by a clear gap that is as wide as possible.

The SVM implementation in scikit-learn takes a number of important parameters. First we create a classifier using the default settings.



In [ ]:

    
from sklearn import svm

clf  = svm.SVC()
clf0 = svm.SVC()
clf1 = svm.SVC()

Now we can train the classifier using the training set we created above.



In [ ]:

    
clf.fit(scaled_features,correct_facies_labels)
clf0.fit(scaled_features0,correct_facies_labels0)
clf1.fit(scaled_features1,correct_facies_labels1)

Now that the model has been trained on our data, we can use it to predict the facies of the feature vectors in the test set. Because we know the true facies labels of the vectors in the test set, we can use the results to evaluate the accuracy of the classifier.



In [ ]:

    
predicted_labels = clf.predict(scaled_features)
predicted_labels0 = clf0.predict(scaled_features0)
predicted_labels1 = clf1.predict(scaled_features1)

We need some metrics to evaluate how good our classifier is doing. A confusion matrix is a table that can be used to describe the performance of a classification model. Scikit-learn allows us to easily create a confusion matrix by supplying the actual and predicted facies labels.

The confusion matrix is simply a 2D array. The entries of confusion matrix C[i][j] are equal to the number of observations predicted to have facies j, but are known to have facies i.

To simplify reading the confusion matrix, a function has been written to display the matrix along with facies labels and various error metrics. See the file classification_utilities.py in this repo for the display_cm() function.



In [ ]:

    
from sklearn.metrics import confusion_matrix,f1_score, accuracy_score, make_scorer
from classification_utilities import display_cm, display_adj_cm



In [ ]:

    
conf = confusion_matrix(correct_facies_labels, predicted_labels)
display_cm(conf, facies_labels, hide_zeros=True)



In [ ]:

    
conf0 = confusion_matrix(correct_facies_labels0, predicted_labels0)
display_cm(conf0, facies_labels[0:3], hide_zeros=True)



In [ ]:

    
conf1 = confusion_matrix(correct_facies_labels1, predicted_labels1)
display_cm(conf1, facies_labels[3:], hide_zeros=True)



In [ ]:

    
y_test_comb = np.concatenate([correct_facies_labels0,correct_facies_labels1])
predicted_labels_comb = np.concatenate([predicted_labels0,predicted_labels1])
conf_comb = confusion_matrix(y_test_comb, predicted_labels_comb)
display_cm(conf_comb, facies_labels, hide_zeros=True)

The rows of the confusion matrix correspond to the actual facies labels. The columns correspond to the labels assigned by the classifier. For example, consider the first row. For the feature vectors in the test set that actually have label SS, 23 were correctly indentified as SS, 21 were classified as CSiS and 2 were classified as FSiS.

The entries along the diagonal are the facies that have been correctly classified. Below we define two functions that will give an overall value for how the algorithm is performing. The accuracy is defined as the number of correct classifications divided by the total number of classifications.



In [ ]:

    
def accuracy(conf):
    total_correct = 0.
    nb_classes = conf.shape[0]
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
    acc = total_correct/sum(sum(conf))
    return acc

As noted above, the boundaries between the facies classes are not all sharp, and some of them blend into one another. The error within these 'adjacent facies' can also be calculated. We define an array to represent the facies adjacent to each other. For facies label i, adjacent_facies[i] is an array of the adjacent facies labels.



In [ ]:

    
adjacent_facies = np.array([[1], [0,2], [1], [4], [3,5], [4,6,7], [5,7], [5,6,8], [6,7]])

def accuracy_adjacent(conf, adjacent_facies,offset):
    nb_classes = conf.shape[0]
    total_correct = 0.
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
        for j in adjacent_facies[i]:
            total_correct += conf[i][j-offset]
    return total_correct / sum(sum(conf))



In [ ]:

    
print('Facies classification accuracy = %f' % accuracy(conf))
print('Adjacent facies classification accuracy = %f' % accuracy_adjacent(conf, adjacent_facies,0))



In [ ]:

    
print('Facies classification accuracy = %f' % accuracy(conf0))
print('Adjacent facies classification accuracy = %f' % accuracy_adjacent(conf0, adjacent_facies[0:3],0))



In [ ]:

    
print('Facies classification accuracy = %f' % accuracy(conf1))
print('Adjacent facies classification accuracy = %f' % accuracy_adjacent(conf1, adjacent_facies[3:],3))



In [ ]:

    
print('Facies classification accuracy = %f' % accuracy(conf_comb))
print('Adjacent facies classification accuracy = %f' % accuracy_adjacent(conf_comb, adjacent_facies,0))

Model parameter selection

The classifier so far has been built with the default parameters. However, we may be able to get improved classification results with optimal parameter choices.

We will consider two parameters. The parameter C is a regularization factor, and tells the classifier how much we want to avoid misclassifying training examples. A large value of C will try to correctly classify more examples from the training set, but if C is too large it may 'overfit' the data and fail to generalize when classifying new data. If C is too small then the model will not be good at fitting outliers and will have a large error on the training set.

The SVM learning algorithm uses a kernel function to compute the distance between feature vectors. Many kernel functions exist, but in this case we are using the radial basis function rbf kernel (the default). The gamma parameter describes the size of the radial basis functions, which is how far away two vectors in the feature space need to be to be considered close.

We will train a series of classifiers with different values for C and gamma. Two nested loops are used to train a classifier for every possible combination of values in the ranges specified. The classification accuracy is recorded for each combination of parameter values. The results are shown in a series of plots, so the parameter values that give the best classification accuracy on the test set can be selected.

This process is also known as 'cross validation'. Often a separate 'cross validation' dataset will be created in addition to the training and test sets to do model selection. For this tutorial we will just use the test set to choose model parameters.



In [ ]:

    
from sklearn.model_selection import RandomizedSearchCV,LeaveOneGroupOut
from scipy.stats import uniform as sp_uniform

Fscorer = make_scorer(f1_score, average = 'micro')
Ascorer = make_scorer(accuracy_score)

clf = svm.SVC(cache_size = 800, random_state=1)
clf0 = svm.SVC(cache_size = 800, random_state=1)
clf1 = svm.SVC(cache_size = 800, random_state=1)

#parm_grid={'kernel': ['linear', 'rbf'],
#            'C': [0.5, 1, 10, 50],
#            'gamma':[0.0001, 0.01, 1, 10]}

param_dist = {"kernel": ['rbf'],
              "C": sp_uniform(0.0001,10),
              "gamma": sp_uniform(0.0001, 10)}

n_iter_search = 20



In [ ]:

    
random_search0 = RandomizedSearchCV(clf0, param_distributions=param_dist,scoring = Fscorer,n_iter=n_iter_search)
random_search0.fit(scaled_features0,correct_facies_labels0)
print('Best score for NM: {}'.format(random_search0.best_score_))
print('Best parameters for NM: {}'.format(random_search0.best_params_))
clf0 = random_search0.best_estimator_
random_search0.best_estimator_



In [ ]:

    
random_search1 = RandomizedSearchCV(clf1, param_distributions=param_dist,scoring = Fscorer,n_iter=n_iter_search)
random_search1.fit(scaled_features1,correct_facies_labels1)
print('Best score for random: {}'.format(random_search1.best_score_))
print('Best parameters for random: {}'.format(random_search1.best_params_))
clf1 = random_search1.best_estimator_
random_search1.best_estimator_



In [ ]:

    
random_search =  (clf, param_distributions=param_dist,scoring = Fscorer,n_iter=n_iter_search)
random_search.fit(scaled_features,correct_facies_labels)
print('Best score for combo: {}'.format(random_search.best_score_))
print('Best parameters for combo: {}'.format(random_search.best_params_))
clf = random_search.best_estimator_
random_search.best_estimator_

Okay, take it up a notch.



In [ ]:

    
clf0.fit(scaled_features0, correct_facies_labels0)



In [ ]:

    
clf1.fit(scaled_features1, correct_facies_labels1)



In [ ]:

    
clf.fit(scaled_features, correct_facies_labels)



In [ ]:

    
cv_conf = confusion_matrix(np.concatenate([correct_facies_labels0,correct_facies_labels1]), np.concatenate([clf0.predict(scaled_features0),clf1.predict(scaled_features1)]))
cv_conf_comb = confusion_matrix(correct_facies_labels, clf.predict(scaled_features))


print('Optimized facies classification accuracy = %.2f' % accuracy(cv_conf))
print('Optimized adjacent facies classification accuracy = %.2f' % accuracy_adjacent(cv_conf, adjacent_facies,0))



In [ ]:

    
print('Optimized facies classification accuracy = %.2f' % accuracy(cv_conf_comb))
print('Optimized adjacent facies classification accuracy = %.2f' % accuracy_adjacent(cv_conf_comb, adjacent_facies,0))

Precision and recall are metrics that give more insight into how the classifier performs for individual facies. Precision is the probability that given a classification result for a sample, the sample actually belongs to that class. Recall is the probability that a sample will be correctly classified for a given class.

Precision and recall can be computed easily using the confusion matrix. The code to do so has been added to the display_confusion_matrix() function:



In [ ]:

    
display_cm(cv_conf, facies_labels, 
           display_metrics=True, hide_zeros=True)



In [ ]:

    
display_cm(cv_conf_comb, facies_labels, 
           display_metrics=True, hide_zeros=True)

To interpret these results, consider facies SS. In our test set, if a sample was labeled SS the probability the sample was correct is 0.8 (precision). If we know a sample has facies SS, then the probability it will be correctly labeled by the classifier is 0.78 (recall). It is desirable to have high values for both precision and recall, but often when an algorithm is tuned to increase one, the other decreases. The F1 score combines both to give a single measure of relevancy of the classifier results.

These results can help guide intuition for how to improve the classifier results. For example, for a sample with facies MS or mudstone, it is only classified correctly 57% of the time (recall). Perhaps this could be improved by introducing more training samples. Sample quality could also play a role. Facies BS or bafflestone has the best F1 score and relatively few training examples. But this data was handpicked from other wells to provide training examples to identify this facies.

We can also consider the classification metrics when we consider misclassifying an adjacent facies as correct:



In [ ]:

    
display_adj_cm(cv_conf, facies_labels, adjacent_facies, 
           display_metrics=True, hide_zeros=True)



In [ ]:

    
display_adj_cm(cv_conf_comb, facies_labels, adjacent_facies, 
           display_metrics=True, hide_zeros=True)

Considering adjacent facies, the F1 scores for all facies types are above 0.9, except when classifying SiSh or marine siltstone and shale. The classifier often misclassifies this facies (recall of 0.66), most often as wackestone.

These results are comparable to those reported in Dubois et al. (2007).

LOGO



In [ ]:

    
wells = processed_data["Well Name"].values
wells0 = processed_data0["Well Name"].values
wells1 = processed_data1["Well Name"].values
logo = LeaveOneGroupOut()
logo0 = LeaveOneGroupOut()
logo1 = LeaveOneGroupOut()

f1_SVC0 = []
pred0 = {}
for train, test in logo0.split(scaled_features0, correct_facies_labels0, groups=wells0):
    well_name = wells0[test[0]]
    clf0.fit(scaled_features0[train], correct_facies_labels0[train])
    pred0[well_name] = clf0.predict(scaled_features0[test])
    sc = f1_score(correct_facies_labels0[test], pred0[well_name], labels = np.arange(10), average = 'micro')
    print("{:>20s}  {:.3f}".format(well_name, sc))
    f1_SVC0.append(sc)

print("-Average leave-one-well-out F1 Score: {0}".format(sum(f1_SVC0)/(1.0*(len(f1_SVC0)))))



In [ ]:

    
f1_SVC1 = []
pred1 = {}
for train, test in logo1.split(scaled_features1, correct_facies_labels1, groups=wells1):
    well_name = wells1[test[0]]
    clf1.fit(scaled_features1[train], correct_facies_labels1[train])
    pred1[well_name] = clf1.predict(scaled_features1[test])
    sc = f1_score(correct_facies_labels1[test], pred1[well_name], labels = np.arange(10), average = 'micro')
    print("{:>20s}  {:.3f}".format(well_name, sc))
    f1_SVC1.append(sc)

print("-Average leave-one-well-out F1 Score: {0}".format(sum(f1_SVC1)/(1.0*(len(f1_SVC1)))))



In [ ]:

    
f1_SVC_comb = []

for train, test in logo.split(np.concatenate([scaled_features0,scaled_features1]), np.concatenate([correct_facies_labels0,correct_facies_labels1]), groups=np.concatenate([wells0,wells1])):
    well_name = np.concatenate([wells0,wells1])[test[0]]
    pred=np.concatenate([pred0.get(well_name, []),pred1.get(well_name, [])])
    sc = f1_score(np.concatenate([correct_facies_labels0,correct_facies_labels1])[test], pred, labels = np.arange(10), average = 'micro')
    print("{:>20s}  {:.3f}".format(well_name, sc))
    f1_SVC_comb.append(sc)

print("-Average leave-one-well-out F1 Score: {0}".format(sum(f1_SVC_comb)/(1.0*(len(f1_SVC_comb)))))



In [ ]:

    
f1_SVC_compare = []
pred_comp = {}
for train, test in logo.split(scaled_features, correct_facies_labels, groups=wells):
    well_name = wells[test[0]]
    clf.fit(scaled_features[train], correct_facies_labels[train])
    pred0[well_name] = clf.predict(scaled_features[test])
    sc = f1_score(correct_facies_labels[test], pred0[well_name], labels = np.arange(10), average = 'micro')
    print("{:>20s}  {:.3f}".format(well_name, sc))
    f1_SVC_compare.append(sc)

print("-Average leave-one-well-out F1 Score Comparison: {0}".format(sum(f1_SVC_compare)/(1.0*(len(f1_SVC_compare)))))



In [ ]:

    
display_adj_cm(cv_conf, facies_labels, adjacent_facies,
               display_metrics=True, hide_zeros=True)

Applying the classification model to new data

Now that we have a trained facies classification model we can use it to identify facies in wells that do not have core data. In this case, we will apply the classifier to two wells, but we could use it on any number of wells for which we have the same set of well logs for input.

This dataset is similar to the training data except it does not have facies labels. It is loaded into a dataframe called test_data.



In [ ]:

    
well_data = pd.read_csv('../validation_data_nofacies.csv')
well_data['Well Name'] = well_data['Well Name'].astype('category')
well_data0 = well_data.query('NM_M==1')
well_data1 = well_data.query('NM_M==2')
well_features0 = well_data0.drop(['Formation', 'Well Name','Depth','NM_M'], axis=1)
well_features1 = well_data1.drop(['Formation', 'Well Name','Depth','NM_M'], axis=1)

The data needs to be scaled using the same constants we used for the training data.



In [ ]:

    
X_unknown0 = scaler0.transform(well_features0)
X_unknown1 = scaler1.transform(well_features1)

Finally we predict facies labels for the unknown data, and store the results in a Facies column of the test_data dataframe.



In [72]:

    
#predict facies of unclassified data
y_unknown0 = clf0.predict(X_unknown0)
y_unknown1 = clf1.predict(X_unknown1)
well_data0['Facies'] = y_unknown0
well_data1['Facies'] = y_unknown1
well_data=pd.concat([well_data0,well_data1], axis=0)
well_data









    Out[72]:






  
    
      
      Formation
      Well Name
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
      Facies
    
  
  
    
      0
      A1 SH
      STUART
      2808.0
      66.276
      0.630
      3.300
      10.650
      3.591
      1
      1.000
      3
    
    
      1
      A1 SH
      STUART
      2808.5
      77.252
      0.585
      6.500
      11.950
      3.341
      1
      0.978
      3
    
    
      2
      A1 SH
      STUART
      2809.0
      82.899
      0.566
      9.400
      13.600
      3.064
      1
      0.956
      3
    
    
      3
      A1 SH
      STUART
      2809.5
      80.671
      0.593
      9.500
      13.250
      2.977
      1
      0.933
      2
    
    
      4
      A1 SH
      STUART
      2810.0
      75.971
      0.638
      8.700
      12.350
      3.020
      1
      0.911
      3
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      807
      B5 LM
      CRAWFORD
      3149.5
      83.794
      0.749
      2.075
      7.845
      4.023
      2
      0.086
      6
    
    
      808
      B5 LM
      CRAWFORD
      3150.0
      79.722
      0.771
      2.890
      6.640
      4.040
      2
      0.069
      6
    
    
      809
      B5 LM
      CRAWFORD
      3150.5
      76.334
      0.800
      2.960
      6.290
      3.997
      2
      0.052
      6
    
    
      810
      B5 LM
      CRAWFORD
      3151.0
      73.631
      0.800
      2.680
      6.690
      3.828
      2
      0.034
      6
    
    
      811
      B5 LM
      CRAWFORD
      3151.5
      76.865
      0.772
      2.420
      8.600
      3.535
      2
      0.017
      6
    
  

830 rows × 11 columns



In [73]:

    
well_data['Well Name'].unique()









    Out[73]:





[STUART, CRAWFORD]
Categories (2, object): [STUART, CRAWFORD]

We can use the well log plot to view the classification results along with the well logs.



In [74]:

    
make_facies_log_plot(
    well_data[well_data['Well Name'] == 'STUART'],
    facies_colors=facies.Colors)



In [75]:

    
make_facies_log_plot(
    well_data[well_data['Well Name'] == 'CRAWFORD'],
    facies_colors=facies.Colors)

Finally we can write out a csv file with the well data along with the facies classification results.



In [ ]:

    
well_data0['Facies'] = y_unknown0+1
well_data1['Facies'] = y_unknown1+1
well_data=pd.concat([well_data0,well_data1], axis=0)
well_data.to_csv('well_data_with_facies.csv')

References

Amato del Monte, A., 2015. Seismic Petrophysics: Part 1, The Leading Edge, 34 (4). doi:10.1190/tle34040440.1

Bohling, G. C., and M. K. Dubois, 2003. An Integrated Application of Neural Network and Markov Chain Techniques to Prediction of Lithofacies from Well Logs, KGS Open-File Report 2003-50, 6 pp. pdf

Dubois, M. K., G. C. Bohling, and S. Chakrabarti, 2007, Comparison of four approaches to a rock facies classification problem, Computers & Geosciences, 33 (5), 599-617 pp. doi:10.1016/j.cageo.2006.08.011

	Facies	Formation	Well Name	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
0	3	A1 SH	SHRIMPLIN	2793.0	77.450	0.664	9.900	11.915	4.600	1	1.000
1	3	A1 SH	SHRIMPLIN	2793.5	78.260	0.661	14.200	12.565	4.100	1	0.979
2	3	A1 SH	SHRIMPLIN	2794.0	79.050	0.658	14.800	13.050	3.600	1	0.957
3	3	A1 SH	SHRIMPLIN	2794.5	86.100	0.655	13.900	13.115	3.500	1	0.936
4	3	A1 SH	SHRIMPLIN	2795.0	74.580	0.647	13.500	13.300	3.400	1	0.915
...	...	...	...	...	...	...	...	...	...	...	...
4144	5	C LM	CHURCHMAN BIBLE	3120.5	46.719	0.947	1.828	7.254	3.617	2	0.685
4145	5	C LM	CHURCHMAN BIBLE	3121.0	44.563	0.953	2.241	8.013	3.344	2	0.677
4146	5	C LM	CHURCHMAN BIBLE	3121.5	49.719	0.964	2.925	8.013	3.190	2	0.669
4147	5	C LM	CHURCHMAN BIBLE	3122.0	51.469	0.965	3.083	7.708	3.152	2	0.661
4148	5	C LM	CHURCHMAN BIBLE	3122.5	50.031	0.970	2.609	6.668	3.295	2	0.653

	Facies
SS	Nonmarine sandstone
CSiS	Nonmarine coarse siltstone
FSiS	Nonmarine fine siltstone
SiSh	Marine siltstone and shale
MS	Mudstone (limestone)
WS	Wackestone (limestone)
D	Dolomite
PS	Packstone-grainstone (limestone)
BS	Phylloid-algal bafflestone (limestone)

	Neighbours
SS	CSiS
CSiS	SS SS FSiS FSiS Name: ShortName, dty...
FSiS	CSiS
SiSh	MS
MS	SiSh SiSh WS WS Name: ShortName, dty...
WS	MS MS D D Name: ShortName, dtype: object
D	WS WS PS PS Name: ShortName, dtype: object
PS	WS WS D D BS BS Name: ShortName, dt...
BS	D D PS PS Name: ShortName, dtype: object

	0	1	2	3	4	5	6	7	8	9	10	11
0	0.000000	508.539085	514.414809	553.609438	537.155038	551.655595	549.025730	500.715074	498.685691	531.653931	495.709564	510.181434
1	508.539085	0.000000	18.244918	78.442557	36.049841	45.766211	47.291283	48.872768	10.911501	83.166276	71.045797	15.396954
2	514.414809	18.244918	0.000000	91.381670	23.143376	50.037144	54.823941	67.047680	26.897113	99.451087	89.264274	32.966625
3	553.609438	78.442557	91.381670	0.000000	90.308868	50.911093	41.488243	54.564443	80.385857	28.230578	58.679434	64.994310
4	537.155038	36.049841	23.143376	90.308868	0.000000	41.232963	49.333475	80.071330	46.729386	104.579152	102.175650	46.345830
...	...	...	...	...	...	...	...	...	...	...	...	...
7	500.715074	48.872768	67.047680	54.564443	80.071330	61.820025	54.808216	0.000000	43.174992	43.098540	22.312772	34.466311
8	498.685691	10.911501	26.897113	80.385857	46.729386	53.965104	53.924917	43.174992	0.000000	81.317150	64.649885	15.546242
9	531.653931	83.166276	99.451087	28.230578	104.579152	70.445691	60.608168	43.098540	81.317150	0.000000	36.708814	67.888026
10	495.709564	71.045797	89.264274	58.679434	102.175650	80.333850	72.081453	22.312772	64.649885	36.708814	0.000000	56.779037
11	510.181434	15.396954	32.966625	64.994310	46.345830	41.478111	39.779513	34.466311	15.546242	67.888026	56.779037	0.000000

	Facies	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
count	4149.000000	4149.000000	4149.000000	4149.000000	4149.000000	4149.000000	3232.000000	4149.000000	4149.000000
mean	4.503254	2906.867438	64.933985	0.659566	4.402484	13.201066	3.725014	1.518438	0.521852
std	2.474324	133.300164	30.302530	0.252703	5.274947	7.132846	0.896152	0.499720	0.286644
min	1.000000	2573.500000	10.149000	-0.025949	-21.832000	0.550000	0.200000	1.000000	0.000000
25%	2.000000	2821.500000	44.730000	0.498000	1.600000	8.500000	NaN	1.000000	0.277000
50%	4.000000	2932.500000	64.990000	0.639000	4.300000	12.020000	NaN	2.000000	0.528000
75%	6.000000	3007.000000	79.438000	0.822000	7.500000	16.050000	NaN	2.000000	0.769000
max	9.000000	3138.000000	361.150000	1.800000	19.312000	84.400000	8.094000	2.000000	1.000000

	Formation	Well Name	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS	Facies
0	A1 SH	STUART	2808.0	66.276	0.630	3.300	10.650	3.591	1	1.000	3
1	A1 SH	STUART	2808.5	77.252	0.585	6.500	11.950	3.341	1	0.978	3
2	A1 SH	STUART	2809.0	82.899	0.566	9.400	13.600	3.064	1	0.956	3
3	A1 SH	STUART	2809.5	80.671	0.593	9.500	13.250	2.977	1	0.933	2
4	A1 SH	STUART	2810.0	75.971	0.638	8.700	12.350	3.020	1	0.911	3
...	...	...	...	...	...	...	...	...	...	...	...
807	B5 LM	CRAWFORD	3149.5	83.794	0.749	2.075	7.845	4.023	2	0.086	6
808	B5 LM	CRAWFORD	3150.0	79.722	0.771	2.890	6.640	4.040	2	0.069	6
809	B5 LM	CRAWFORD	3150.5	76.334	0.800	2.960	6.290	3.997	2	0.052	6
810	B5 LM	CRAWFORD	3151.0	73.631	0.800	2.680	6.690	3.828	2	0.034	6
811	B5 LM	CRAWFORD	3151.5	76.865	0.772	2.420	8.600	3.535	2	0.017	6