SEG Machine Learning (Well Log Facies Prediction) Contest

Entry by Justin Gosses of team Pet_Stromatolite

Facies classification using Machine Learning

This is an "open science" contest designed to introduce people to machine learning with well logs and brainstorm different methods through collaboration with others, so this notebook is based heavily on the introductary notebook with my own modifications.

In particular, I relied on the previous work by LA team, Paolo Bestagini, and especially Hou Math. You can find their work and more information here: https://github.com/seg/2016-ml-contest

and even more information at http://library.seg.org/doi/abs/10.1190/tle35100906.1

This data is from the Council Grove gas reservoir in Southwest Kansas. The Panoma Council Grove Field is predominantly a carbonate gas reservoir encompassing 2700 square miles in Southwestern Kansas. This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector and validation (test) data (830 examples from two wells) having the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five from wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.

The seven predictor variables are:

Five wire line log curves include gamma ray (GR), resistivity logging (ILD_log10), photoelectric effect (PE), neutron-density porosity difference and average neutron-density porosity (DeltaPHI and PHIND). Note, some wells do not have PE.
Two geologic constraining variables: nonmarine-marine indicator (NM_M) and relative position (RELPOS)

The nine discrete facies (classes of rocks) are:

Nonmarine sandstone
Nonmarine coarse siltstone
Nonmarine fine siltstone
Marine siltstone and shale
Mudstone (limestone)
Wackestone (limestone)
Dolomite
Packstone-grainstone (limestone)
Phylloid-algal bafflestone (limestone)

These facies aren't discrete, and gradually blend into one another. Some have neighboring facies that are rather close. Mislabeling within these neighboring facies can be expected to occur. The following table lists the facies, their abbreviated labels and their approximate neighbors.

Facies	Label	Adjacent Facies
1	SS	2
2	CSiS	1,3
3	FSiS	2
4	SiSh	5
5	MS	4,6
6	WS	5,7
7	D	6,8
8	PS	6,7,9
9	BS	7,8

Let's clean up this dataset. The 'Well Name' and 'Formation' columns can be turned into a categorical data type.

=================================================================================================================

Early Ideas for feature engineering

take out any points in individual wells where not all the logs are present
test whether error increases around the depths where PE is absent?
test whether using formation, depth, or depth&formation as variables impacts prediction
examine well logs & facies logs (including prediction wells) to see if there aren't trends that might be dealt with by increasing the population of certain wells over others in the training set?
explore effect size of using/not using marine or non-marine flags
explore making 'likely to predict wrong' flags based on first-pass results with thin facies surrounded by thicker facies such that you might expand a 'blended' response due to the measured response of the tool being thicker than predicted facies
explore doing the same above but before prediction using range of thickness in predicted facies flags vs. range of thickness in known facies flags
explore using multiple prediction loops, in order words, predict errors not just facies.
Explore error distribution: adjacent vs. non-adjacent facies, by thickness, marine vs. non-marine, by formation, and possible human judgement patterns that influence interpreted facies.

In the end, ran out of time and concentrated most of my time on creating a feature based on formations.

Each formation was given a float value between 0 and 50. The numbers were not evenly distributed such that formations with a similiar distribution of facies were close together and those with different distribution of facies were farther appart. Due to problems with an error I couldn't quite fix and not wanting to double dip on marine vs. nonmarine, I put the formation data into the NM_M column and ignored the original NM_M data.



In [197]:

    
## import modules

from __future__ import division
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rcParams['figure.figsize']=(20.0,10.0)
inline_rc = dict(mpl.rcParams)

import pandas as pd
import numpy as np
import seaborn as sns

from sklearn import preprocessing
from sklearn.model_selection import LeavePGroupsOut
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsOneClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

from scipy.signal import medfilt

from pandas.tools.plotting import scatter_matrix

import matplotlib.colors as colors

import xgboost as xgb
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score
from classification_utilities import display_cm, display_adj_cm
from sklearn.model_selection import GridSearchCV


from sklearn.model_selection import validation_curve
from sklearn.datasets import load_svmlight_files

from xgboost.sklearn import XGBClassifier
from scipy.sparse import vstack

seed = 123
np.random.seed(seed)



In [198]:

    
# Parameters
feature_names = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE', 'NM_M', 'RELPOS']
### in the one below, formation is renamed NM_M and made into floats
# feature_names = ['NM_M','GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE', 'NM_M', 'RELPOS']
featurePlusFacies_names = ['Facies', 'GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE', 'NM_M', 'RELPOS']
facies_names = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS', 'WS', 'D', 'PS', 'BS']
facies_colors = ['#F4D03F', '#F5B041','#DC7633','#6E2C00', '#1B4F72','#2E86C1', '#AED6F1', '#A569BD', '#196F3D']
formation_names = ['A1 LM', 'A1 SH', 'B1 LM', 'B1 SH', 'B2 LM', 'B2 SH', 'B3 LM','B3 SH', 'B4 LM', 'B4 SH', 'B5 LM', 'B5 SH', 'C LM', 'C SH']
## formation_colors = ['#F4D03F', '#F5B041','#DC7633','#6E2C00', '#1B4F72','#2E86C1', '#AED6F1', '#A569BD', '#196F3D','#FAA03F', '#FAA041','#DCAA33','#6AAC00', '#1BAA72']
formation_colors = ['#000000','#00FF00','#0000FF','#FF0000','#01FFFE','#FFA6FE','#FFDB66','#006401','#010067','#95003A','#007DB5','#FF00F6','#FFEEE8','#774D00']



In [199]:

    
### setting up options in pandas
from pandas import set_option
set_option("display.max_rows", 10)
pd.options.mode.chained_assignment = None



In [200]:

    
# Load data for training ata from file
data = pd.read_csv('facies_vectors.csv')



In [201]:

    
data









    Out[201]:






  
    
      
      Facies
      Formation
      Well Name
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      0
      3
      A1 SH
      SHRIMPLIN
      2793.0
      77.450
      0.664
      9.900
      11.915
      4.600
      1
      1.000
    
    
      1
      3
      A1 SH
      SHRIMPLIN
      2793.5
      78.260
      0.661
      14.200
      12.565
      4.100
      1
      0.979
    
    
      2
      3
      A1 SH
      SHRIMPLIN
      2794.0
      79.050
      0.658
      14.800
      13.050
      3.600
      1
      0.957
    
    
      3
      3
      A1 SH
      SHRIMPLIN
      2794.5
      86.100
      0.655
      13.900
      13.115
      3.500
      1
      0.936
    
    
      4
      3
      A1 SH
      SHRIMPLIN
      2795.0
      74.580
      0.647
      13.500
      13.300
      3.400
      1
      0.915
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      4144
      5
      C LM
      CHURCHMAN BIBLE
      3120.5
      46.719
      0.947
      1.828
      7.254
      3.617
      2
      0.685
    
    
      4145
      5
      C LM
      CHURCHMAN BIBLE
      3121.0
      44.563
      0.953
      2.241
      8.013
      3.344
      2
      0.677
    
    
      4146
      5
      C LM
      CHURCHMAN BIBLE
      3121.5
      49.719
      0.964
      2.925
      8.013
      3.190
      2
      0.669
    
    
      4147
      5
      C LM
      CHURCHMAN BIBLE
      3122.0
      51.469
      0.965
      3.083
      7.708
      3.152
      2
      0.661
    
    
      4148
      5
      C LM
      CHURCHMAN BIBLE
      3122.5
      50.031
      0.970
      2.609
      6.668
      3.295
      2
      0.653
    
  

4149 rows × 11 columns



In [202]:

    
# replaced NM_M column with integers based on Formation column

data.loc[data.Formation == 'A1 LM', 'NM_M'] = 31
data.loc[data.Formation == 'A1 SH', 'NM_M'] = 4
data.loc[data.Formation == 'B1 LM', 'NM_M'] = 33
data.loc[data.Formation == 'B1 SH', 'NM_M'] = 10
data.loc[data.Formation == 'B2 LM', 'NM_M'] = 43
data.loc[data.Formation == 'B2 SH', 'NM_M'] = 16
data.loc[data.Formation == 'B3 LM', 'NM_M'] = 37
data.loc[data.Formation == 'B3 SH', 'NM_M'] = 6
data.loc[data.Formation == 'B4 LM', 'NM_M'] = 35
data.loc[data.Formation == 'B4 SH', 'NM_M'] = 2
data.loc[data.Formation == 'B5 LM', 'NM_M'] = 41
data.loc[data.Formation == 'B5 SH', 'NM_M'] = 12
data.loc[data.Formation == 'C LM', 'NM_M'] = 50
data.loc[data.Formation == 'C SH', 'NM_M'] = 12



In [203]:

    
data.describe()









    Out[203]:






  
    
      
      Facies
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      count
      4149.000000
      4149.000000
      4149.000000
      4149.000000
      4149.000000
      4149.000000
      3232.000000
      4149.000000
      4149.000000
    
    
      mean
      4.503254
      2906.867438
      64.933985
      0.659566
      4.402484
      13.201066
      3.725014
      24.825982
      0.521852
    
    
      std
      2.474324
      133.300164
      30.302530
      0.252703
      5.274947
      7.132846
      0.896152
      17.101252
      0.286644
    
    
      min
      1.000000
      2573.500000
      10.149000
      -0.025949
      -21.832000
      0.550000
      0.200000
      2.000000
      0.000000
    
    
      25%
      2.000000
      2821.500000
      44.730000
      0.498000
      1.600000
      8.500000
      3.100000
      10.000000
      0.277000
    
    
      50%
      4.000000
      2932.500000
      64.990000
      0.639000
      4.300000
      12.020000
      3.551500
      31.000000
      0.528000
    
    
      75%
      6.000000
      3007.000000
      79.438000
      0.822000
      7.500000
      16.050000
      4.300000
      41.000000
      0.769000
    
    
      max
      9.000000
      3138.000000
      361.150000
      1.800000
      19.312000
      84.400000
      8.094000
      50.000000
      1.000000



In [204]:

    
data









    Out[204]:






  
    
      
      Facies
      Formation
      Well Name
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      0
      3
      A1 SH
      SHRIMPLIN
      2793.0
      77.450
      0.664
      9.900
      11.915
      4.600
      4
      1.000
    
    
      1
      3
      A1 SH
      SHRIMPLIN
      2793.5
      78.260
      0.661
      14.200
      12.565
      4.100
      4
      0.979
    
    
      2
      3
      A1 SH
      SHRIMPLIN
      2794.0
      79.050
      0.658
      14.800
      13.050
      3.600
      4
      0.957
    
    
      3
      3
      A1 SH
      SHRIMPLIN
      2794.5
      86.100
      0.655
      13.900
      13.115
      3.500
      4
      0.936
    
    
      4
      3
      A1 SH
      SHRIMPLIN
      2795.0
      74.580
      0.647
      13.500
      13.300
      3.400
      4
      0.915
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      4144
      5
      C LM
      CHURCHMAN BIBLE
      3120.5
      46.719
      0.947
      1.828
      7.254
      3.617
      50
      0.685
    
    
      4145
      5
      C LM
      CHURCHMAN BIBLE
      3121.0
      44.563
      0.953
      2.241
      8.013
      3.344
      50
      0.677
    
    
      4146
      5
      C LM
      CHURCHMAN BIBLE
      3121.5
      49.719
      0.964
      2.925
      8.013
      3.190
      50
      0.669
    
    
      4147
      5
      C LM
      CHURCHMAN BIBLE
      3122.0
      51.469
      0.965
      3.083
      7.708
      3.152
      50
      0.661
    
    
      4148
      5
      C LM
      CHURCHMAN BIBLE
      3122.5
      50.031
      0.970
      2.609
      6.668
      3.295
      50
      0.653
    
  

4149 rows × 11 columns



In [205]:

    
# data = data.drop('NM_M', 1)



In [206]:

    
data









    Out[206]:






  
    
      
      Facies
      Formation
      Well Name
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      0
      3
      A1 SH
      SHRIMPLIN
      2793.0
      77.450
      0.664
      9.900
      11.915
      4.600
      4
      1.000
    
    
      1
      3
      A1 SH
      SHRIMPLIN
      2793.5
      78.260
      0.661
      14.200
      12.565
      4.100
      4
      0.979
    
    
      2
      3
      A1 SH
      SHRIMPLIN
      2794.0
      79.050
      0.658
      14.800
      13.050
      3.600
      4
      0.957
    
    
      3
      3
      A1 SH
      SHRIMPLIN
      2794.5
      86.100
      0.655
      13.900
      13.115
      3.500
      4
      0.936
    
    
      4
      3
      A1 SH
      SHRIMPLIN
      2795.0
      74.580
      0.647
      13.500
      13.300
      3.400
      4
      0.915
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      4144
      5
      C LM
      CHURCHMAN BIBLE
      3120.5
      46.719
      0.947
      1.828
      7.254
      3.617
      50
      0.685
    
    
      4145
      5
      C LM
      CHURCHMAN BIBLE
      3121.0
      44.563
      0.953
      2.241
      8.013
      3.344
      50
      0.677
    
    
      4146
      5
      C LM
      CHURCHMAN BIBLE
      3121.5
      49.719
      0.964
      2.925
      8.013
      3.190
      50
      0.669
    
    
      4147
      5
      C LM
      CHURCHMAN BIBLE
      3122.0
      51.469
      0.965
      3.083
      7.708
      3.152
      50
      0.661
    
    
      4148
      5
      C LM
      CHURCHMAN BIBLE
      3122.5
      50.031
      0.970
      2.609
      6.668
      3.295
      50
      0.653
    
  

4149 rows × 11 columns



In [207]:

    
data









    Out[207]:






  
    
      
      Facies
      Formation
      Well Name
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      0
      3
      A1 SH
      SHRIMPLIN
      2793.0
      77.450
      0.664
      9.900
      11.915
      4.600
      4
      1.000
    
    
      1
      3
      A1 SH
      SHRIMPLIN
      2793.5
      78.260
      0.661
      14.200
      12.565
      4.100
      4
      0.979
    
    
      2
      3
      A1 SH
      SHRIMPLIN
      2794.0
      79.050
      0.658
      14.800
      13.050
      3.600
      4
      0.957
    
    
      3
      3
      A1 SH
      SHRIMPLIN
      2794.5
      86.100
      0.655
      13.900
      13.115
      3.500
      4
      0.936
    
    
      4
      3
      A1 SH
      SHRIMPLIN
      2795.0
      74.580
      0.647
      13.500
      13.300
      3.400
      4
      0.915
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      4144
      5
      C LM
      CHURCHMAN BIBLE
      3120.5
      46.719
      0.947
      1.828
      7.254
      3.617
      50
      0.685
    
    
      4145
      5
      C LM
      CHURCHMAN BIBLE
      3121.0
      44.563
      0.953
      2.241
      8.013
      3.344
      50
      0.677
    
    
      4146
      5
      C LM
      CHURCHMAN BIBLE
      3121.5
      49.719
      0.964
      2.925
      8.013
      3.190
      50
      0.669
    
    
      4147
      5
      C LM
      CHURCHMAN BIBLE
      3122.0
      51.469
      0.965
      3.083
      7.708
      3.152
      50
      0.661
    
    
      4148
      5
      C LM
      CHURCHMAN BIBLE
      3122.5
      50.031
      0.970
      2.609
      6.668
      3.295
      50
      0.653
    
  

4149 rows × 11 columns

getting well, depth, facies which are list of all the individual instances of each column (well, depth, facies)



In [208]:

    
X = data[feature_names].values
# XplusFacies = data[featurePlusFacies_names].values
y = data['Facies'].values



In [209]:

    
well = data['Well Name'].values
depth = data['Depth'].values
# facies = data['Facies'].values

getting list of formations



In [210]:

    
formation = data['NM_M'].values
np.unique(formation)









    Out[210]:





array([ 2,  4,  6, 10, 12, 16, 31, 33, 35, 37, 41, 43, 50])



In [211]:

    
data['Facies'].describe()









    Out[211]:





count    4149.000000
mean        4.503254
std         2.474324
min         1.000000
25%         2.000000
50%         4.000000
75%         6.000000
max         9.000000
Name: Facies, dtype: float64



In [212]:

    
data[feature_names].describe()









    Out[212]:






  
    
      
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      count
      4149.000000
      4149.000000
      4149.000000
      4149.000000
      3232.000000
      4149.000000
      4149.000000
    
    
      mean
      64.933985
      0.659566
      4.402484
      13.201066
      3.725014
      24.825982
      0.521852
    
    
      std
      30.302530
      0.252703
      5.274947
      7.132846
      0.896152
      17.101252
      0.286644
    
    
      min
      10.149000
      -0.025949
      -21.832000
      0.550000
      0.200000
      2.000000
      0.000000
    
    
      25%
      44.730000
      0.498000
      1.600000
      8.500000
      3.100000
      10.000000
      0.277000
    
    
      50%
      64.990000
      0.639000
      4.300000
      12.020000
      3.551500
      31.000000
      0.528000
    
    
      75%
      79.438000
      0.822000
      7.500000
      16.050000
      4.300000
      41.000000
      0.769000
    
    
      max
      361.150000
      1.800000
      19.312000
      84.400000
      8.094000
      50.000000
      1.000000

quick investigation of these to make sure I understand them



In [213]:

    
well.shape









    Out[213]:





(4149,)



In [214]:

    
facies.shape









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-214-657a1cc54aaa> in <module>()
----> 1 facies.shape

NameError: name 'facies' is not defined



In [215]:

    
np.unique(facies)









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-215-fa750c583390> in <module>()
----> 1 np.unique(facies)

NameError: name 'facies' is not defined



In [216]:

    
np.unique(well)









    Out[216]:





array(['ALEXANDER D', 'CHURCHMAN BIBLE', 'CROSS H CATTLE', 'KIMZEY A',
       'LUKE G U', 'NEWBY', 'NOLAN', 'Recruit F9', 'SHANKLE', 'SHRIMPLIN'], dtype=object)



In [217]:

    
X.size









    Out[217]:





29043



In [218]:

    
y.size









    Out[218]:





4149

Plotting descriptive visual statistics



In [219]:

    
# Define function for plotting feature statistics
def plot_feature_stats(X, y, feature_names, facies_colors, facies_names):
    
    # Remove NaN
    nan_idx = np.any(np.isnan(X), axis=1)
    X = X[np.logical_not(nan_idx), :]
    y = y[np.logical_not(nan_idx)]
    
    # Merge features and labels into a single DataFrame
    features = pd.DataFrame(X, columns=feature_names)
    labels = pd.DataFrame(y, columns=['Facies'])
    for f_idx, facies in enumerate(facies_names):
        labels[labels[:] == f_idx] = facies
    data = pd.concat((labels, features), axis=1)

    # Plot features statistics
    facies_color_map = {}
    for ind, label in enumerate(facies_names):
        facies_color_map[label] = facies_colors[ind]

    sns.pairplot(data, hue='Facies', palette=facies_color_map, hue_order=list(reversed(facies_names)))



In [220]:

    
# Feature distribution
plot_feature_stats(X, y, feature_names, facies_colors, facies_names)
mpl.rcParams.update(inline_rc)

The same as above but with hue as formation instead of facies (note: legend still says 'facies')

using formation_names & formation_colors



In [273]:

    
# Feature distribution
plot_feature_stats(X, y, feature_names, formation_colors, formation_names)
mpl.rcParams.update(inline_rc)

Bar Charts of Number of datapoints of each Facies by well



In [274]:

    
# Facies per well
for w_idx, w in enumerate(np.unique(well)):
    ax = plt.subplot(3, 4, w_idx+1)
    hist = np.histogram(y[well == w], bins=np.arange(len(facies_names)+1)+.5)
    plt.bar(np.arange(len(hist[0])), hist[0], color=facies_colors, align='center')
    ax.set_xticks(np.arange(len(hist[0])))
    ax.set_xticklabels(facies_names)
    ax.set_title(w)

Bar Charts of Number of datapoints of each Facies by well



In [275]:

    
# Facies per formation
for w_idx, w in enumerate(np.unique(formation)):
    ax = plt.subplot(4, 4, w_idx+1)
    hist = np.histogram(y[formation == w], bins=np.arange(len(facies_names)+1)+.5)
    plt.bar(np.arange(len(hist[0])), hist[0], color=facies_colors, align='center')
    ax.set_xticks(np.arange(len(hist[0])))
    ax.set_xticklabels(facies_names)
    ax.set_title(w)

Checks if every feature is in each well



In [276]:

    
# Features per well
for w_idx, w in enumerate(np.unique(well)):
    ax = plt.subplot(3, 4, w_idx+1)
    hist = np.logical_not(np.any(np.isnan(X[well == w, :]), axis=0))
    plt.bar(np.arange(len(hist)), hist, color=facies_colors, align='center')
    ax.set_xticks(np.arange(len(hist)))
    ax.set_xticklabels(feature_names)
    ax.set_yticks([0, 1])
    ax.set_yticklabels(['miss', 'hit'])
    ax.set_title(w)

PE is missing in several of the wells: Alexander D, Kinzey A, and Recruit F9



In [277]:

    
reg = RandomForestRegressor(max_features='sqrt', n_estimators=50)
DataImpAll = data[feature_names].copy()
DataImp = DataImpAll.dropna(axis = 0, inplace=False)
Ximp=DataImp.loc[:, DataImp.columns != 'PE']
Yimp=DataImp.loc[:, 'PE']
reg.fit(Ximp, Yimp)
X[np.array(DataImpAll.PE.isnull()),4] = reg.predict(DataImpAll.loc[DataImpAll.PE.isnull(),:].drop('PE',axis=1,inplace=False))

next three functions for Feature windows concatenation, Feature gradient computation, and Feature augmentation come from Paolo Bestagini's work.



In [278]:

    
# Feature windows concatenation function
def augment_features_window(X, N_neig):
    
    # Parameters
    N_row = X.shape[0]
    N_feat = X.shape[1]

    # Zero padding
    X = np.vstack((np.zeros((N_neig, N_feat)), X, (np.zeros((N_neig, N_feat)))))

    # Loop over windows
    X_aug = np.zeros((N_row, N_feat*(2*N_neig+1)))
    for r in np.arange(N_row)+N_neig:
        this_row = []
        for c in np.arange(-N_neig,N_neig+1):
            this_row = np.hstack((this_row, X[r+c]))
        X_aug[r-N_neig] = this_row

    return X_aug



In [279]:

    
# Feature gradient computation function
def augment_features_gradient(X, depth):
    
    # Compute features gradient
    d_diff = np.diff(depth).reshape((-1, 1))
    d_diff[d_diff==0] = 0.001
    X_diff = np.diff(X, axis=0)
    X_grad = X_diff / d_diff
        
    # Compensate for last missing value
    X_grad = np.concatenate((X_grad, np.zeros((1, X_grad.shape[1]))))
    
    return X_grad



In [280]:

    
# Feature augmentation function
def augment_features(X, well, depth, N_neig=1):
    
    # Augment features
    X_aug = np.zeros((X.shape[0], X.shape[1]*(N_neig*2+2)))
    for w in np.unique(well):
        w_idx = np.where(well == w)[0]
        X_aug_win = augment_features_window(X[w_idx, :], N_neig)
        X_aug_grad = augment_features_gradient(X[w_idx, :], depth[w_idx])
        X_aug[w_idx, :] = np.concatenate((X_aug_win, X_aug_grad), axis=1)
    
    # Find padded rows
    padded_rows = np.unique(np.where(X_aug[:, 0:7] == np.zeros((1, 7)))[0])
    
    return X_aug, padded_rows



In [281]:

    
X_aug, padded_rows = augment_features(X, well, depth)



In [344]:

    
# Initialize model selection methods
lpgo = LeavePGroupsOut(2)

# Generate splits
split_list = []
for train, val in lpgo.split(X, y, groups=data['Well Name']):
    hist_tr = np.histogram(y[train], bins=np.arange(len(facies_names)+1)+.5)
    hist_val = np.histogram(y[val], bins=np.arange(len(facies_names)+1)+.5)
    if np.all(hist_tr[0] != 0) & np.all(hist_val[0] != 0):
        split_list.append({'train':train, 'val':val})
            
# Print splits
for s, split in enumerate(split_list):
    print('Split %d' % s)
    print('    training:   %s' % (data['Well Name'][split['train']].unique()))
    print('    validation: %s' % (data['Well Name'][split['val']].unique()))









    



Split 0
    training:   ['SHRIMPLIN' 'SHANKLE' 'LUKE G U' 'KIMZEY A' 'CROSS H CATTLE' 'NOLAN'
 'Recruit F9' 'NEWBY']
    validation: ['ALEXANDER D' 'CHURCHMAN BIBLE']
Split 1
    training:   ['SHRIMPLIN' 'SHANKLE' 'LUKE G U' 'KIMZEY A' 'NOLAN' 'Recruit F9' 'NEWBY'
 'CHURCHMAN BIBLE']
    validation: ['ALEXANDER D' 'CROSS H CATTLE']
Split 2
    training:   ['SHRIMPLIN' 'SHANKLE' 'LUKE G U' 'CROSS H CATTLE' 'NOLAN' 'Recruit F9'
 'NEWBY' 'CHURCHMAN BIBLE']
    validation: ['ALEXANDER D' 'KIMZEY A']
Split 3
    training:   ['SHRIMPLIN' 'SHANKLE' 'LUKE G U' 'KIMZEY A' 'CROSS H CATTLE' 'Recruit F9'
 'NEWBY' 'CHURCHMAN BIBLE']
    validation: ['ALEXANDER D' 'NOLAN']
Split 4
    training:   ['SHRIMPLIN' 'LUKE G U' 'KIMZEY A' 'CROSS H CATTLE' 'NOLAN' 'Recruit F9'
 'NEWBY' 'CHURCHMAN BIBLE']
    validation: ['ALEXANDER D' 'SHANKLE']
Split 5
    training:   ['SHRIMPLIN' 'ALEXANDER D' 'SHANKLE' 'LUKE G U' 'KIMZEY A' 'NOLAN'
 'Recruit F9' 'NEWBY']
    validation: ['CROSS H CATTLE' 'CHURCHMAN BIBLE']
Split 6
    training:   ['SHRIMPLIN' 'ALEXANDER D' 'SHANKLE' 'LUKE G U' 'CROSS H CATTLE' 'NOLAN'
 'Recruit F9' 'NEWBY']
    validation: ['KIMZEY A' 'CHURCHMAN BIBLE']
Split 7
    training:   ['SHRIMPLIN' 'ALEXANDER D' 'SHANKLE' 'KIMZEY A' 'CROSS H CATTLE' 'NOLAN'
 'Recruit F9' 'NEWBY']
    validation: ['LUKE G U' 'CHURCHMAN BIBLE']
Split 8
    training:   ['SHRIMPLIN' 'ALEXANDER D' 'SHANKLE' 'LUKE G U' 'KIMZEY A' 'CROSS H CATTLE'
 'NOLAN' 'Recruit F9']
    validation: ['NEWBY' 'CHURCHMAN BIBLE']
Split 9
    training:   ['SHRIMPLIN' 'ALEXANDER D' 'SHANKLE' 'LUKE G U' 'KIMZEY A' 'CROSS H CATTLE'
 'Recruit F9' 'NEWBY']
    validation: ['NOLAN' 'CHURCHMAN BIBLE']
Split 10
    training:   ['SHRIMPLIN' 'ALEXANDER D' 'SHANKLE' 'LUKE G U' 'KIMZEY A' 'CROSS H CATTLE'
 'NOLAN' 'NEWBY']
    validation: ['Recruit F9' 'CHURCHMAN BIBLE']
Split 11
    training:   ['SHRIMPLIN' 'ALEXANDER D' 'LUKE G U' 'KIMZEY A' 'CROSS H CATTLE' 'NOLAN'
 'Recruit F9' 'NEWBY']
    validation: ['SHANKLE' 'CHURCHMAN BIBLE']
Split 12
    training:   ['ALEXANDER D' 'SHANKLE' 'LUKE G U' 'KIMZEY A' 'CROSS H CATTLE' 'NOLAN'
 'Recruit F9' 'NEWBY']
    validation: ['SHRIMPLIN' 'CHURCHMAN BIBLE']
Split 13
    training:   ['SHRIMPLIN' 'ALEXANDER D' 'SHANKLE' 'LUKE G U' 'NOLAN' 'Recruit F9'
 'NEWBY' 'CHURCHMAN BIBLE']
    validation: ['KIMZEY A' 'CROSS H CATTLE']
Split 14
    training:   ['SHRIMPLIN' 'ALEXANDER D' 'SHANKLE' 'LUKE G U' 'KIMZEY A' 'NOLAN'
 'Recruit F9' 'CHURCHMAN BIBLE']
    validation: ['CROSS H CATTLE' 'NEWBY']
Split 15
    training:   ['SHRIMPLIN' 'ALEXANDER D' 'SHANKLE' 'LUKE G U' 'KIMZEY A' 'NOLAN' 'NEWBY'
 'CHURCHMAN BIBLE']
    validation: ['CROSS H CATTLE' 'Recruit F9']
Split 16
    training:   ['ALEXANDER D' 'SHANKLE' 'LUKE G U' 'KIMZEY A' 'NOLAN' 'Recruit F9' 'NEWBY'
 'CHURCHMAN BIBLE']
    validation: ['SHRIMPLIN' 'CROSS H CATTLE']
Split 17
    training:   ['SHRIMPLIN' 'ALEXANDER D' 'SHANKLE' 'CROSS H CATTLE' 'NOLAN' 'Recruit F9'
 'NEWBY' 'CHURCHMAN BIBLE']
    validation: ['LUKE G U' 'KIMZEY A']
Split 18
    training:   ['SHRIMPLIN' 'ALEXANDER D' 'SHANKLE' 'LUKE G U' 'CROSS H CATTLE' 'NOLAN'
 'Recruit F9' 'CHURCHMAN BIBLE']
    validation: ['KIMZEY A' 'NEWBY']
Split 19
    training:   ['SHRIMPLIN' 'ALEXANDER D' 'SHANKLE' 'LUKE G U' 'CROSS H CATTLE'
 'Recruit F9' 'NEWBY' 'CHURCHMAN BIBLE']
    validation: ['KIMZEY A' 'NOLAN']
Split 20
    training:   ['SHRIMPLIN' 'ALEXANDER D' 'SHANKLE' 'LUKE G U' 'CROSS H CATTLE' 'NOLAN'
 'NEWBY' 'CHURCHMAN BIBLE']
    validation: ['KIMZEY A' 'Recruit F9']
Split 21
    training:   ['SHRIMPLIN' 'ALEXANDER D' 'LUKE G U' 'CROSS H CATTLE' 'NOLAN' 'Recruit F9'
 'NEWBY' 'CHURCHMAN BIBLE']
    validation: ['SHANKLE' 'KIMZEY A']
Split 22
    training:   ['ALEXANDER D' 'SHANKLE' 'LUKE G U' 'CROSS H CATTLE' 'NOLAN' 'Recruit F9'
 'NEWBY' 'CHURCHMAN BIBLE']
    validation: ['SHRIMPLIN' 'KIMZEY A']
Split 23
    training:   ['SHRIMPLIN' 'ALEXANDER D' 'SHANKLE' 'LUKE G U' 'KIMZEY A' 'CROSS H CATTLE'
 'Recruit F9' 'CHURCHMAN BIBLE']
    validation: ['NOLAN' 'NEWBY']
Split 24
    training:   ['SHRIMPLIN' 'ALEXANDER D' 'LUKE G U' 'KIMZEY A' 'CROSS H CATTLE' 'NOLAN'
 'Recruit F9' 'CHURCHMAN BIBLE']
    validation: ['SHANKLE' 'NEWBY']
Split 25
    training:   ['SHRIMPLIN' 'ALEXANDER D' 'SHANKLE' 'LUKE G U' 'KIMZEY A' 'CROSS H CATTLE'
 'NEWBY' 'CHURCHMAN BIBLE']
    validation: ['NOLAN' 'Recruit F9']
Split 26
    training:   ['ALEXANDER D' 'SHANKLE' 'LUKE G U' 'KIMZEY A' 'CROSS H CATTLE'
 'Recruit F9' 'NEWBY' 'CHURCHMAN BIBLE']
    validation: ['SHRIMPLIN' 'NOLAN']
Split 27
    training:   ['SHRIMPLIN' 'ALEXANDER D' 'LUKE G U' 'KIMZEY A' 'CROSS H CATTLE' 'NOLAN'
 'NEWBY' 'CHURCHMAN BIBLE']
    validation: ['SHANKLE' 'Recruit F9']
Split 28
    training:   ['ALEXANDER D' 'LUKE G U' 'KIMZEY A' 'CROSS H CATTLE' 'NOLAN' 'Recruit F9'
 'NEWBY' 'CHURCHMAN BIBLE']
    validation: ['SHRIMPLIN' 'SHANKLE']



In [345]:

    
# Parameters search grid (uncomment parameters for full grid search... may take a lot of time)
md_grid = [2,3]
# changed gamma_grid to 4 from 3 and then to 2 and back to 3
mcw_grid = [1]
gamma_grid = [0]
# changed gamma_grid to 0 from 0.3
ss_grid = [1]
# changed ss_grid to 0.7 from 1 and then back to 1 and back to 0.7
csb_grid = [0.8,0.9]
# 
alpha_grid =[0,0.2,0.3]
lr_grid = [0.05]
ne_grid = [200,300]
param_grid = []
for N in md_grid:
    for M in mcw_grid:
        for S in gamma_grid:
            for L in ss_grid:
                for K in csb_grid:
                    for P in alpha_grid:
                        for R in lr_grid:
                            for E in ne_grid:
                                param_grid.append({'maxdepth':N, 
                                                   'minchildweight':M, 
                                                   'gamma':S, 
                                                   'subsample':L,
                                                   'colsamplebytree':K,
                                                   'alpha':P,
                                                   'learningrate':R,
                                                   'n_estimators':E})



In [346]:

    
# Train and test a classifier
def train_and_test(X_tr, y_tr, X_v,well_v):
    
    #Feature normalization
    scaler = preprocessing.RobustScaler(quantile_range=(25.0, 75.0)).fit(X_tr)
    X_tr = scaler.transform(X_tr)
    X_v = scaler.transform(X_v)


    clf.fit(X_tr, y_tr)
    
    # Test classifier
    y_v_hat = clf.predict(X_v)
    
    # Clean isolated facies for each well
    for w in np.unique(well_v):
        y_v_hat[well_v==w] = medfilt(y_v_hat[well_v==w], kernel_size=5)
    
    return y_v_hat



In [347]:

    
# For each set of parameters
score_param = []
for param in param_grid:
    
    clf = OneVsOneClassifier(XGBClassifier(
            learning_rate = param['learningrate'],
            n_estimators=param['n_estimators'],
            max_depth=param['maxdepth'],
            min_child_weight=param['minchildweight'],
            gamma = param['gamma'],
            subsample=param['subsample'],
            colsample_bytree=param['colsamplebytree'],
            reg_alpha = param['alpha'],
            nthread = 1, 
            # changed nthread to 1 instead of 4 due to running on mac without NVIDA
            seed = seed,
        ) , n_jobs=-1)
    # For each data split
    score_split = []
    for split in split_list:
    
        # Remove padded rows
        split_train_no_pad = np.setdiff1d(split['train'], padded_rows)
        
        # Select training and validation data from current split
        X_tr = X_aug[split_train_no_pad, :]
        X_v = X_aug[split['val'], :]
        y_tr = y[split_train_no_pad]
        y_v = y[split['val']]
        
        # Select well labels for validation data
        well_v = well[split['val']]

        # Train and test
        y_v_hat = train_and_test(X_tr, y_tr, X_v, well_v)
        
        # Score
        score = f1_score(y_v, y_v_hat, average='micro')
        score_split.append(score)
        
    # Average score for this param
    score_param.append(np.mean(score_split))
    print('F1 score = %.3f %s' % (score_param[-1], param))
          
# Best set of parameters
best_idx = np.argmax(score_param)
param_best = param_grid[best_idx]
score_best = score_param[best_idx]
1.

print('\nBest F1 score = %.3f %s' % (score_best, param_best))









    



F1 score = 0.573 {'n_estimators': 200, 'colsamplebytree': 0.8, 'maxdepth': 2, 'subsample': 1, 'alpha': 0, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.570 {'n_estimators': 300, 'colsamplebytree': 0.8, 'maxdepth': 2, 'subsample': 1, 'alpha': 0, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.572 {'n_estimators': 200, 'colsamplebytree': 0.8, 'maxdepth': 2, 'subsample': 1, 'alpha': 0.2, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.572 {'n_estimators': 300, 'colsamplebytree': 0.8, 'maxdepth': 2, 'subsample': 1, 'alpha': 0.2, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.572 {'n_estimators': 200, 'colsamplebytree': 0.8, 'maxdepth': 2, 'subsample': 1, 'alpha': 0.3, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.572 {'n_estimators': 300, 'colsamplebytree': 0.8, 'maxdepth': 2, 'subsample': 1, 'alpha': 0.3, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.573 {'n_estimators': 200, 'colsamplebytree': 0.9, 'maxdepth': 2, 'subsample': 1, 'alpha': 0, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.572 {'n_estimators': 300, 'colsamplebytree': 0.9, 'maxdepth': 2, 'subsample': 1, 'alpha': 0, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.573 {'n_estimators': 200, 'colsamplebytree': 0.9, 'maxdepth': 2, 'subsample': 1, 'alpha': 0.2, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.574 {'n_estimators': 300, 'colsamplebytree': 0.9, 'maxdepth': 2, 'subsample': 1, 'alpha': 0.2, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.571 {'n_estimators': 200, 'colsamplebytree': 0.9, 'maxdepth': 2, 'subsample': 1, 'alpha': 0.3, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.572 {'n_estimators': 300, 'colsamplebytree': 0.9, 'maxdepth': 2, 'subsample': 1, 'alpha': 0.3, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.571 {'n_estimators': 200, 'colsamplebytree': 0.8, 'maxdepth': 3, 'subsample': 1, 'alpha': 0, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.569 {'n_estimators': 300, 'colsamplebytree': 0.8, 'maxdepth': 3, 'subsample': 1, 'alpha': 0, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.570 {'n_estimators': 200, 'colsamplebytree': 0.8, 'maxdepth': 3, 'subsample': 1, 'alpha': 0.2, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.569 {'n_estimators': 300, 'colsamplebytree': 0.8, 'maxdepth': 3, 'subsample': 1, 'alpha': 0.2, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.570 {'n_estimators': 200, 'colsamplebytree': 0.8, 'maxdepth': 3, 'subsample': 1, 'alpha': 0.3, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.569 {'n_estimators': 300, 'colsamplebytree': 0.8, 'maxdepth': 3, 'subsample': 1, 'alpha': 0.3, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.571 {'n_estimators': 200, 'colsamplebytree': 0.9, 'maxdepth': 3, 'subsample': 1, 'alpha': 0, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.570 {'n_estimators': 300, 'colsamplebytree': 0.9, 'maxdepth': 3, 'subsample': 1, 'alpha': 0, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.570 {'n_estimators': 200, 'colsamplebytree': 0.9, 'maxdepth': 3, 'subsample': 1, 'alpha': 0.2, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.569 {'n_estimators': 300, 'colsamplebytree': 0.9, 'maxdepth': 3, 'subsample': 1, 'alpha': 0.2, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.570 {'n_estimators': 200, 'colsamplebytree': 0.9, 'maxdepth': 3, 'subsample': 1, 'alpha': 0.3, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}
F1 score = 0.566 {'n_estimators': 300, 'colsamplebytree': 0.9, 'maxdepth': 3, 'subsample': 1, 'alpha': 0.3, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}

Best F1 score = 0.574 {'n_estimators': 300, 'colsamplebytree': 0.9, 'maxdepth': 2, 'subsample': 1, 'alpha': 0.2, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}

Best Scores from different runs (each takes maybe 10 minutes)

Final

1.Best F1 score = 0.574 {'n_estimators': 300, 'colsamplebytree': 0.9, 'maxdepth': 2, 'subsample': 1, 'alpha': 0.2, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}

2.Best F1 score = 0.572 {'n_estimators': 200, 'colsamplebytree': 0.8, 'maxdepth': 2, 'subsample': 1, 'alpha': 0.2, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}

3.F1 score = 0.566 {'n_estimators': 200, 'colsamplebytree': 0.8, 'maxdepth': 3, 'subsample': 0.7, 'alpha': 0.2, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}

Best F1 score = 0.566 {'n_estimators': 200, 'colsamplebytree': 0.8, 'maxdepth': 3, 'subsample': 0.7, 'alpha': 0.2, 'minchildweight': 1, 'learningrate': 0.05, 'gamma': 0}



In [362]:

    
for param in param_grid:
    
    clf = OneVsOneClassifier(XGBClassifier(
            learning_rate = param['learningrate'],
            n_estimators=param['n_estimators'],
            max_depth=param['maxdepth'],
            min_child_weight=param['minchildweight'],
            gamma = param['gamma'],
            subsample=param['subsample'],
            colsample_bytree=param['colsamplebytree'],
            reg_alpha = param['alpha'],
            nthread = 1,
#             changed nthread to not be listed, it was 4
            seed = seed,
        ) , n_jobs=-1)



In [363]:

    
clf









    Out[363]:





OneVsOneClassifier(estimator=XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=0.9,
       gamma=0, learning_rate=0.05, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=300, nthread=1,
       objective='binary:logistic', reg_alpha=0.3, reg_lambda=1,
       scale_pos_weight=1, seed=123, silent=True, subsample=1),
          n_jobs=-1)

Investigating Errors in Prediciton of Training Wells



In [364]:

    
X_aug.shape









    Out[364]:





(4149, 28)



In [365]:

    
X.shape









    Out[365]:





(4149, 7)



In [366]:

    
X_tr.shape









    Out[366]:





(3228, 28)



In [367]:

    
y.shape









    Out[367]:





(4149,)



In [368]:

    
# predicted_labels = clf.predict(X_test)
clf.fit(X_tr,y_tr)









    Out[368]:





OneVsOneClassifier(estimator=XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=0.9,
       gamma=0, learning_rate=0.05, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=300, nthread=1,
       objective='binary:logistic', reg_alpha=0.3, reg_lambda=1,
       scale_pos_weight=1, seed=123, silent=True, subsample=1),
          n_jobs=-1)



In [369]:

    
predicted_labels = clf.predict(X_tr)
predicted_labels









    Out[369]:





array([3, 3, 3, ..., 5, 5, 5])

The confusion matrix is simply a 2D array. The entries of confusion matrix C[i][j] are equal to the number of observations predicted to have facies j, but are known to have facies i. To simplify reading the confusion matrix, a function has been written to display the matrix along with facies labels and various error metrics. See the file classification_utilities.py in this repo for the display_cm() function.



In [370]:

    
from sklearn.metrics import confusion_matrix
from classification_utilities import display_cm, display_adj_cm

The rows of the confusion matrix correspond to the actual facies labels. The columns correspond to the labels assigned by the classifier.

The entries along the diagonal are the facies that have been correctly classified.

Below we define two functions that will give an overall value for how the algorithm is performing. The accuracy is defined as the number of correct classifications divided by the total number of classifications.

confusion matrix all facies, including adjacent



In [371]:

    
display_cm(conf, facies_names, display_metrics=True, hide_zeros=True)









    



     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   175     1     2                                       178
     CSiS     1   697    31                             1         730
     FSiS          59   473                 1                     533
     SiSh           1         240           1           4         246
       MS                           175    27          12         214
       WS                       5     3   408          31     1   448
        D           1           1           1   110     4     2   119
       PS                 1     3     4    27     2   535     1   573
       BS           1                       1           1   169   172

Precision  0.99  0.92  0.93  0.96  0.96  0.88  0.98  0.91  0.98  0.93
   Recall  0.98  0.95  0.89  0.98  0.82  0.91  0.92  0.93  0.98  0.93
       F1  0.99  0.94  0.91  0.97  0.88  0.89  0.95  0.92  0.98  0.93

looks like MS (mudstone) and WS (wackestone) take the biggest hits on F1 score...

confusion matrix, excluding adjacent facies



In [372]:

    
display_adj_cm(conf, facies_names, adjacent_facies,
               display_metrics=True, hide_zeros=True)









    



     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   176           2                                       178
     CSiS         729                                   1         730
     FSiS               532                 1                     533
     SiSh           1         240           1           4         246
       MS                           202                12         214
       WS                       5         442                 1   448
        D           1           1               115           2   119
       PS                 1     3     4               565         573
       BS           1                       1               170   172

Precision  1.00  1.00  0.99  0.96  0.98  0.99  1.00  0.97  0.98  0.99
   Recall  0.99  1.00  1.00  0.98  0.94  0.99  0.97  0.99  0.99  0.99
       F1  0.99  1.00  1.00  0.97  0.96  0.99  0.98  0.98  0.99  0.99



In [373]:

    
def accuracy(conf):
    total_correct = 0.
    nb_classes = conf.shape[0]
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
    acc = total_correct/sum(sum(conf))
    return acc



In [374]:

    
adjacent_facies = np.array([[1], [0,2], [1], [4], [3,5], [4,6,7], [5,7], [5,6,8], [6,7]])

def accuracy_adjacent(conf, adjacent_facies):
    nb_classes = conf.shape[0]
    total_correct = 0.
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
        for j in adjacent_facies[i]:
            total_correct += conf[i][j]
    return total_correct / sum(sum(conf))



In [375]:

    
print('Facies classification accuracy = %f' % accuracy(conf))
print('Adjacent facies classification accuracy = %f' % accuracy_adjacent(conf, adjacent_facies))









    



Facies classification accuracy = 0.928105
Adjacent facies classification accuracy = 0.986928

are these accuracy numbers above reversed?



In [376]:

    
# # Prepare training data
# X_trB = X
# y_trB = y

# # Augment features
# X_trB, padded_rows = augment_features(X_trB, well, depth)

# # Removed padded rows
# X_tr = np.delete(X_trB, padded_rows, axis=0)
# y_tr = np.delete(y_trB, padded_rows, axis=0)



In [377]:

    
def makeVizWell(wellName):
    ## create this to visualize the prediction at one of the training wells
    vis_well = data[data['Well Name'] == wellName]

    # Prepare test data 2
    well_tsB = vis_well['Well Name'].values
    depth_tsB = vis_well['Depth'].values
    X_tsB = vis_well[feature_names].values

    # Augment features
    X_tsB, padded_rows = augment_features(X_tsB, well_tsB, depth_tsB)

    # Predict test labels
    y_ts_hatB = train_and_test(X_trB, y_trB, X_tsB, well_tsB)

    # Save predicted labels
    vis_well['Prediction'] = y_ts_hatB



In [378]:

    
def compare_facies_plot(logs, compadre, facies_colors):
    #make sure logs are sorted by depth
    logs = logs.sort_values(by='Depth')
    cmap_facies = colors.ListedColormap(
            facies_colors[0:len(facies_colors)], 'indexed')
    
    ztop=logs.Depth.min(); zbot=logs.Depth.max()
    
    cluster1 = np.repeat(np.expand_dims(logs['Facies'].values,1), 100, 1)
    cluster2 = np.repeat(np.expand_dims(logs[compadre].values,1), 100, 1)
    
    f, ax = plt.subplots(nrows=1, ncols=7, figsize=(9, 12))
    ax[0].plot(logs.GR, logs.Depth, '-g')
    ax[1].plot(logs.ILD_log10, logs.Depth, '-')
    ax[2].plot(logs.DeltaPHI, logs.Depth, '-', color='0.5')
    ax[3].plot(logs.PHIND, logs.Depth, '-', color='r')
    ax[4].plot(logs.PE, logs.Depth, '-', color='black')
    im1 = ax[5].imshow(cluster1, interpolation='none', aspect='auto',
                    cmap=cmap_facies,vmin=1,vmax=9)
    im2 = ax[6].imshow(cluster2, interpolation='none', aspect='auto',
                    cmap=cmap_facies,vmin=1,vmax=9)
    
    divider = make_axes_locatable(ax[6])
    cax = divider.append_axes("right", size="20%", pad=0.05)
    cbar=plt.colorbar(im2, cax=cax)
    cbar.set_label((17*' ').join([' SS ', 'CSiS', 'FSiS', 
                                'SiSh', ' MS ', ' WS ', ' D  ', 
                                ' PS ', ' BS ']))
    cbar.set_ticks(range(0,1)); cbar.set_ticklabels('')
    
    for i in range(len(ax)-2):
        ax[i].set_ylim(ztop,zbot)
        ax[i].invert_yaxis()
        ax[i].grid()
        ax[i].locator_params(axis='x', nbins=3)
    
    ax[0].set_xlabel("GR")
    ax[0].set_xlim(logs.GR.min(),logs.GR.max())
    ax[1].set_xlabel("ILD_log10")
    ax[1].set_xlim(logs.ILD_log10.min(),logs.ILD_log10.max())
    ax[2].set_xlabel("DeltaPHI")
    ax[2].set_xlim(logs.DeltaPHI.min(),logs.DeltaPHI.max())
    ax[3].set_xlabel("PHIND")
    ax[3].set_xlim(logs.PHIND.min(),logs.PHIND.max())
    ax[4].set_xlabel("PE")
    ax[4].set_xlim(logs.PE.min(),logs.PE.max())
    ax[5].set_xlabel('Facies')
    ax[6].set_xlabel(compadre)
    
    ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([])
    ax[4].set_yticklabels([]); ax[5].set_yticklabels([])
    ax[5].set_xticklabels([])
    ax[6].set_xticklabels([])
    f.suptitle('Well: %s'%logs.iloc[0]['Well Name'], fontsize=14,y=0.94)



In [379]:

    
# makeVizWell('SHANKLE')
# compare_facies_plot(vis_well, 'Prediction', facies_colors)



In [306]:

    
# makeVizWell('SHRIMPLIN')
# compare_facies_plot(viz_well, 'Prediction', facies_colors)



In [307]:

    
# makeVizWell('NEWBY')
# compare_facies_plot(makeVizWell('NEWBY'), 'Prediction', facies_colors)



In [308]:

    
# makeVizWell('LUKE G U')
# compare_facies_plot(vis_well, 'Prediction', facies_colors)



In [309]:

    
# data['Well Name'].unique()

Loading test data, running model on it, and exporting predictions



In [380]:

    
# Load data from file
test_data = pd.read_csv('validation_data_nofacies.csv')



In [381]:

    
test_data









    Out[381]:






  
    
      
      Formation
      Well Name
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      0
      A1 SH
      STUART
      2808.0
      66.276
      0.630
      3.300
      10.650
      3.591
      1
      1.000
    
    
      1
      A1 SH
      STUART
      2808.5
      77.252
      0.585
      6.500
      11.950
      3.341
      1
      0.978
    
    
      2
      A1 SH
      STUART
      2809.0
      82.899
      0.566
      9.400
      13.600
      3.064
      1
      0.956
    
    
      3
      A1 SH
      STUART
      2809.5
      80.671
      0.593
      9.500
      13.250
      2.977
      1
      0.933
    
    
      4
      A1 SH
      STUART
      2810.0
      75.971
      0.638
      8.700
      12.350
      3.020
      1
      0.911
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      825
      C SH
      CRAWFORD
      3158.5
      86.078
      0.554
      5.040
      16.150
      3.161
      1
      0.639
    
    
      826
      C SH
      CRAWFORD
      3159.0
      88.855
      0.539
      5.560
      16.750
      3.118
      1
      0.611
    
    
      827
      C SH
      CRAWFORD
      3159.5
      90.490
      0.530
      6.360
      16.780
      3.168
      1
      0.583
    
    
      828
      C SH
      CRAWFORD
      3160.0
      90.975
      0.522
      7.035
      16.995
      3.154
      1
      0.556
    
    
      829
      C SH
      CRAWFORD
      3160.5
      90.108
      0.513
      7.505
      17.595
      3.125
      1
      0.528
    
  

830 rows × 10 columns

adjusted test data, chaning MN_N column to integers representing Formations



In [382]:

    
# use formation inplace of NM_M column!
test_data.loc[test_data.Formation == 'A1 LM', 'NM_M'] = 31
test_data.loc[test_data.Formation == 'A1 SH', 'NM_M'] = 4
test_data.loc[test_data.Formation == 'B1 LM', 'NM_M'] = 33
test_data.loc[test_data.Formation == 'B1 SH', 'NM_M'] = 10
test_data.loc[test_data.Formation == 'B2 LM', 'NM_M'] = 35
test_data.loc[test_data.Formation == 'B2 SH', 'NM_M'] = 16
test_data.loc[test_data.Formation == 'B3 LM', 'NM_M'] = 37
test_data.loc[test_data.Formation == 'B3 SH', 'NM_M'] = 6
test_data.loc[test_data.Formation == 'B4 LM', 'NM_M'] = 45
test_data.loc[test_data.Formation == 'B4 SH', 'NM_M'] = 2
test_data.loc[test_data.Formation == 'B5 LM', 'NM_M'] = 41
test_data.loc[test_data.Formation == 'B5 SH', 'NM_M'] = 12
test_data.loc[test_data.Formation == 'C LM', 'NM_M'] = 43
test_data.loc[test_data.Formation == 'C SH', 'NM_M'] = 12



In [383]:

    
test_data









    Out[383]:






  
    
      
      Formation
      Well Name
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      0
      A1 SH
      STUART
      2808.0
      66.276
      0.630
      3.300
      10.650
      3.591
      4
      1.000
    
    
      1
      A1 SH
      STUART
      2808.5
      77.252
      0.585
      6.500
      11.950
      3.341
      4
      0.978
    
    
      2
      A1 SH
      STUART
      2809.0
      82.899
      0.566
      9.400
      13.600
      3.064
      4
      0.956
    
    
      3
      A1 SH
      STUART
      2809.5
      80.671
      0.593
      9.500
      13.250
      2.977
      4
      0.933
    
    
      4
      A1 SH
      STUART
      2810.0
      75.971
      0.638
      8.700
      12.350
      3.020
      4
      0.911
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      825
      C SH
      CRAWFORD
      3158.5
      86.078
      0.554
      5.040
      16.150
      3.161
      12
      0.639
    
    
      826
      C SH
      CRAWFORD
      3159.0
      88.855
      0.539
      5.560
      16.750
      3.118
      12
      0.611
    
    
      827
      C SH
      CRAWFORD
      3159.5
      90.490
      0.530
      6.360
      16.780
      3.168
      12
      0.583
    
    
      828
      C SH
      CRAWFORD
      3160.0
      90.975
      0.522
      7.035
      16.995
      3.154
      12
      0.556
    
    
      829
      C SH
      CRAWFORD
      3160.5
      90.108
      0.513
      7.505
      17.595
      3.125
      12
      0.528
    
  

830 rows × 10 columns



In [384]:

    
# Prepare training data
X_tr = X
y_tr = y

# Augment features
X_tr, padded_rows = augment_features(X_tr, well, depth)


# Removed padded rows
X_tr = np.delete(X_tr, padded_rows, axis=0)
y_tr = np.delete(y_tr, padded_rows, axis=0)



In [385]:

    
# Prepare test data 2
well_ts = test_data['Well Name'].values
depth_ts = test_data['Depth'].values
X_ts = test_data[feature_names].values



# Augment features
X_ts, padded_rows = augment_features(X_ts, well_ts, depth_ts)



In [386]:

    
X_ts.shape









    Out[386]:





(830, 28)



In [387]:

    
X_tr.shape









    Out[387]:





(4131, 28)



In [388]:

    
y_tr.shape









    Out[388]:





(4131,)



In [389]:

    
well_ts.shape









    Out[389]:





(830,)



In [390]:

    
# Predict test labels
y_ts_hat = train_and_test(X_tr, y_tr, X_ts, well_ts)



In [391]:

    
# Save predicted labels
test_data['Facies'] = y_ts_hat
test_data.to_csv('Prediction7_final.csv')



In [392]:

    
# print test data
test_data









    Out[392]:






  
    
      
      Formation
      Well Name
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
      Facies
    
  
  
    
      0
      A1 SH
      STUART
      2808.0
      66.276
      0.630
      3.300
      10.650
      3.591
      4
      1.000
      3
    
    
      1
      A1 SH
      STUART
      2808.5
      77.252
      0.585
      6.500
      11.950
      3.341
      4
      0.978
      3
    
    
      2
      A1 SH
      STUART
      2809.0
      82.899
      0.566
      9.400
      13.600
      3.064
      4
      0.956
      3
    
    
      3
      A1 SH
      STUART
      2809.5
      80.671
      0.593
      9.500
      13.250
      2.977
      4
      0.933
      3
    
    
      4
      A1 SH
      STUART
      2810.0
      75.971
      0.638
      8.700
      12.350
      3.020
      4
      0.911
      3
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      825
      C SH
      CRAWFORD
      3158.5
      86.078
      0.554
      5.040
      16.150
      3.161
      12
      0.639
      2
    
    
      826
      C SH
      CRAWFORD
      3159.0
      88.855
      0.539
      5.560
      16.750
      3.118
      12
      0.611
      2
    
    
      827
      C SH
      CRAWFORD
      3159.5
      90.490
      0.530
      6.360
      16.780
      3.168
      12
      0.583
      2
    
    
      828
      C SH
      CRAWFORD
      3160.0
      90.975
      0.522
      7.035
      16.995
      3.154
      12
      0.556
      2
    
    
      829
      C SH
      CRAWFORD
      3160.5
      90.108
      0.513
      7.505
      17.595
      3.125
      12
      0.528
      2
    
  

830 rows × 11 columns

Display test data in log form



In [393]:

    
def make_facies_log_plot(logs, facies_colors):
    #make sure logs are sorted by depth
    logs = logs.sort_values(by='Depth')
    cmap_facies = colors.ListedColormap(
            facies_colors[0:len(facies_colors)], 'indexed')
    
    ztop=logs.Depth.min(); zbot=logs.Depth.max()
    
    cluster=np.repeat(np.expand_dims(logs['Facies'].values,1), 100, 1)
    
    f, ax = plt.subplots(nrows=1, ncols=6, figsize=(8, 12))
    ax[0].plot(logs.GR, logs.Depth, '-g')
    ax[1].plot(logs.ILD_log10, logs.Depth, '-')
    ax[2].plot(logs.DeltaPHI, logs.Depth, '-', color='0.5')
    ax[3].plot(logs.PHIND, logs.Depth, '-', color='r')
    ax[4].plot(logs.PE, logs.Depth, '-', color='black')
    im=ax[5].imshow(cluster, interpolation='none', aspect='auto',
                    cmap=cmap_facies,vmin=1,vmax=9)
    
    divider = make_axes_locatable(ax[5])
    cax = divider.append_axes("right", size="20%", pad=0.05)
    cbar=plt.colorbar(im, cax=cax)
    cbar.set_label((17*' ').join([' SS ', 'CSiS', 'FSiS', 
                                'SiSh', ' MS ', ' WS ', ' D  ', 
                                ' PS ', ' BS ']))
    cbar.set_ticks(range(0,1)); cbar.set_ticklabels('')
    
    for i in range(len(ax)-1):
        ax[i].set_ylim(ztop,zbot)
        ax[i].invert_yaxis()
        ax[i].grid()
        ax[i].locator_params(axis='x', nbins=3)
    
    ax[0].set_xlabel("GR")
    ax[0].set_xlim(logs.GR.min(),logs.GR.max())
    ax[1].set_xlabel("ILD_log10")
    ax[1].set_xlim(logs.ILD_log10.min(),logs.ILD_log10.max())
    ax[2].set_xlabel("DeltaPHI")
    ax[2].set_xlim(logs.DeltaPHI.min(),logs.DeltaPHI.max())
    ax[3].set_xlabel("PHIND")
    ax[3].set_xlim(logs.PHIND.min(),logs.PHIND.max())
    ax[4].set_xlabel("PE")
    ax[4].set_xlim(logs.PE.min(),logs.PE.max())
    ax[5].set_xlabel('Facies')
    
    ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([])
    ax[4].set_yticklabels([]); ax[5].set_yticklabels([])
    ax[5].set_xticklabels([])
    f.suptitle('Well: %s'%logs.iloc[0]['Well Name'], fontsize=14,y=0.94)



In [394]:

    
from mpl_toolkits.axes_grid1 import make_axes_locatable



In [395]:

    
# Plot predicted labels
make_facies_log_plot(
    test_data[test_data['Well Name'] == 'STUART'],
    facies_colors=facies_colors)

make_facies_log_plot(
    test_data[test_data['Well Name'] == 'CRAWFORD'],
    facies_colors=facies_colors)
mpl.rcParams.update(inline_rc)



In [ ]:



In [ ]:



In [ ]:



In [ ]:

	Facies	Formation	Well Name	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
0	3	A1 SH	SHRIMPLIN	2793.0	77.450	0.664	9.900	11.915	4.600	1	1.000
1	3	A1 SH	SHRIMPLIN	2793.5	78.260	0.661	14.200	12.565	4.100	1	0.979
2	3	A1 SH	SHRIMPLIN	2794.0	79.050	0.658	14.800	13.050	3.600	1	0.957
3	3	A1 SH	SHRIMPLIN	2794.5	86.100	0.655	13.900	13.115	3.500	1	0.936
4	3	A1 SH	SHRIMPLIN	2795.0	74.580	0.647	13.500	13.300	3.400	1	0.915
...	...	...	...	...	...	...	...	...	...	...	...
4144	5	C LM	CHURCHMAN BIBLE	3120.5	46.719	0.947	1.828	7.254	3.617	2	0.685
4145	5	C LM	CHURCHMAN BIBLE	3121.0	44.563	0.953	2.241	8.013	3.344	2	0.677
4146	5	C LM	CHURCHMAN BIBLE	3121.5	49.719	0.964	2.925	8.013	3.190	2	0.669
4147	5	C LM	CHURCHMAN BIBLE	3122.0	51.469	0.965	3.083	7.708	3.152	2	0.661
4148	5	C LM	CHURCHMAN BIBLE	3122.5	50.031	0.970	2.609	6.668	3.295	2	0.653

	Facies	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
count	4149.000000	4149.000000	4149.000000	4149.000000	4149.000000	4149.000000	3232.000000	4149.000000	4149.000000
mean	4.503254	2906.867438	64.933985	0.659566	4.402484	13.201066	3.725014	24.825982	0.521852
std	2.474324	133.300164	30.302530	0.252703	5.274947	7.132846	0.896152	17.101252	0.286644
min	1.000000	2573.500000	10.149000	-0.025949	-21.832000	0.550000	0.200000	2.000000	0.000000
25%	2.000000	2821.500000	44.730000	0.498000	1.600000	8.500000	3.100000	10.000000	0.277000
50%	4.000000	2932.500000	64.990000	0.639000	4.300000	12.020000	3.551500	31.000000	0.528000
75%	6.000000	3007.000000	79.438000	0.822000	7.500000	16.050000	4.300000	41.000000	0.769000
max	9.000000	3138.000000	361.150000	1.800000	19.312000	84.400000	8.094000	50.000000	1.000000

	Formation	Well Name	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
0	A1 SH	STUART	2808.0	66.276	0.630	3.300	10.650	3.591	1	1.000
1	A1 SH	STUART	2808.5	77.252	0.585	6.500	11.950	3.341	1	0.978
2	A1 SH	STUART	2809.0	82.899	0.566	9.400	13.600	3.064	1	0.956
3	A1 SH	STUART	2809.5	80.671	0.593	9.500	13.250	2.977	1	0.933
4	A1 SH	STUART	2810.0	75.971	0.638	8.700	12.350	3.020	1	0.911
...	...	...	...	...	...	...	...	...	...	...
825	C SH	CRAWFORD	3158.5	86.078	0.554	5.040	16.150	3.161	1	0.639
826	C SH	CRAWFORD	3159.0	88.855	0.539	5.560	16.750	3.118	1	0.611
827	C SH	CRAWFORD	3159.5	90.490	0.530	6.360	16.780	3.168	1	0.583
828	C SH	CRAWFORD	3160.0	90.975	0.522	7.035	16.995	3.154	1	0.556
829	C SH	CRAWFORD	3160.5	90.108	0.513	7.505	17.595	3.125	1	0.528

SEG Machine Learning (Well Log Facies Prediction) Contest

Entry by Justin Gosses of team Pet_Stromatolite

In particular, I relied on the previous work by LA team, Paolo Bestagini, and especially Hou Math. You can find their work and more information here: https://github.com/seg/2016-ml-contest

and even more information at http://library.seg.org/doi/abs/10.1190/tle35100906.1

Early Ideas for feature engineering

In the end, ran out of time and concentrated most of my time on creating a feature based on formations.

getting well, depth, facies which are list of all the individual instances of each column (well, depth, facies)

getting list of formations

quick investigation of these to make sure I understand them

Plotting descriptive visual statistics

The same as above but with hue as formation instead of facies (note: legend still says 'facies')

Bar Charts of Number of datapoints of each Facies by well

Bar Charts of Number of datapoints of each Facies by well

Checks if every feature is in each well

PE is missing in several of the wells: Alexander D, Kinzey A, and Recruit F9

Best Scores from different runs (each takes maybe 10 minutes)

Final

Previous

Investigating Errors in Prediciton of Training Wells

confusion matrix all facies, including adjacent

confusion matrix, excluding adjacent facies

Loading test data, running model on it, and exporting predictions

adjusted test data, chaning MN_N column to integers representing Formations

Display test data in log form