Facies Classification using TPOT

George Crowther - https://www.linkedin.com/in/george-crowther-9669a931?trk=hp-identity-name

I've had a play with some of the data here and used something of a brute force approach, by creating a large number of additional features and then using the TPOT library to train a model and refine the model parameters. I will be interested to see whether this has over-fitted, as the selected Extra Trees Classifier can do that.

1. Data Loading and Initial Observations



In [1]:

    
# Initial imports for reading data and first observations
import pandas as pd
import bokeh.plotting as bk
import numpy as np

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

from tpot import TPOTClassifier

bk.output_notebook()









    





    
        
        Loading BokehJS ...



In [89]:

    
# Input file paths
train_path = r'training_data.csv'
test_path = r'validation_data_nofacies.csv'

# Read training data to dataframe
train = pd.read_csv(train_path)

# TPOT library requires that the target class is renamed to 'class'
train.rename(columns={'Facies': 'class'}, inplace=True)



In [90]:

    
train.head()



In [91]:

    
train.describe()









    Out[91]:






  
    
      
      class
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      count
      3232.000000
      3232.000000
      3232.000000
      3232.000000
      3232.000000
      3232.000000
      3232.000000
      3232.000000
      3232.000000
    
    
      mean
      4.422030
      2875.824567
      66.135769
      0.642719
      3.559642
      13.483213
      3.725014
      1.498453
      0.520287
    
    
      std
      2.504243
      131.006274
      30.854826
      0.241845
      5.228948
      7.698980
      0.896152
      0.500075
      0.286792
    
    
      min
      1.000000
      2573.500000
      13.250000
      -0.025949
      -21.832000
      0.550000
      0.200000
      1.000000
      0.010000
    
    
      25%
      2.000000
      2791.000000
      46.918750
      0.492750
      1.163750
      8.346750
      3.100000
      1.000000
      0.273000
    
    
      50%
      4.000000
      2893.500000
      65.721500
      0.624437
      3.500000
      12.150000
      3.551500
      1.000000
      0.526000
    
    
      75%
      6.000000
      2980.000000
      79.626250
      0.812735
      6.432500
      16.453750
      4.300000
      2.000000
      0.767250
    
    
      max
      9.000000
      3122.500000
      361.150000
      1.480000
      18.600000
      84.400000
      8.094000
      2.000000
      1.000000



In [92]:

    
formations = {}
for i, value in enumerate(train['Formation'].unique()):
    formations[value] = i
    train.loc[train['Formation'] == value, 'Formation'] = i

wells = {}
for i, value in enumerate(train['Well Name'].unique()):
    wells[value] = i
    train.loc[train['Well Name'] == value, 'Well Name'] = i



In [7]:

    
from bokeh.palettes import viridis

def log_plot(well_frame):
    plots = []
    iterator = 0
    
    #cmap = viridis(len(column.unique()))
    
    index = 0
    for i, column in well_frame.iteritems():
        if index == 0:
            plots.append(bk.figure(height = 800, width = 150))
        else:
            plots.append(bk.figure(height = 800, width = 75, y_range = plots[0].y_range))
        plots[-1].line(column, well_frame['Depth'])
        plots[-1].yaxis.visible = False
        plots[-1].title.text = i
        
    plots[0].yaxis.visible = True
    grid = bk.gridplot([plots])
    bk.show(grid)
            
for i, group in train.groupby('Well Name'):
    well_frame = group.sort_values('Depth')
    log_plot(well_frame)



In [93]:

    
facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS',
                 'WS', 'D','PS', 'BS']

Feature construction and data clean-up.

1. Z-score normalisation of data.
2. Group each of the measurement parameters into quartiles. Most of the classification methods find data like this easier to work with.
3. Create a series of 'adjacent' parameters by looking for the above and below depth sample for each well. Create a series of features associated with the above and below parameters.



In [94]:

    
train_columns = train.columns[1:]
std_scaler = preprocessing.StandardScaler().fit(train[train_columns])

train_std = std_scaler.transform(train[train_columns])

train_std_frame = train
for i, column in enumerate(train_columns):
    train_std_frame.loc[:, column] = train_std[:, i]

train = train_std_frame
master_columns = train.columns[4:]

def in_range(row, vmin, vmax, variable):
    
    if vmin <= row[variable] < vmax:
        return 1
    else:
        return 0
    
for i, column in train[master_columns].iteritems():
    ds = np.linspace(0, 1.0, 5)
    quantiles = [column.quantile(n) for n in ds]
    
    for j in range(len(quantiles) - 1):
        train[i + '_{0}'.format(j)] = train.apply(lambda row: in_range(row, ds[j], ds[j + 1], i), axis = 1)

master_columns = train.columns[4:]

above = []
below = []
for i, group in train.groupby('Well Name'):
    
    df = group.sort_values('Depth')
    u = df.shift(-1).fillna(method = 'ffill')
    b = df.shift(1).fillna(method = 'bfill')
    
    above.append(u[master_columns])
    below.append(b[master_columns])
    
above_frame = pd.concat(above)
above_frame.columns = ['above_'+ column for column in above_frame.columns]
below_frame = pd.concat(below)
below_frame.columns = ['below_'+ column for column in below_frame.columns]

frame = pd.concat((train, above_frame, below_frame), axis = 1)

4. TPOT



In [96]:

    
train_vector = ['class']
train_columns = frame.columns[4:]

train_f, test_f = train_test_split(frame, test_size = 0.1, 
                                   random_state = 7)

TPOT uses a genetic algorithm to tune model parameters for the most effective fit. This can take quite a while to process if you want to re-run this part!



In [26]:

    
tpot = TPOTClassifier(verbosity = 2, generations = 5, max_eval_time_mins = 30)
tpot.fit(train_f[train_columns], train_f['class'])









    



C:\Users\george.crowther\AppData\Local\Continuum\Anaconda3_4\envs\optasense\lib\site-packages\sklearn\utils\extmath.py:410: RuntimeWarning: invalid value encountered in subtract
  out = np.log(np.sum(np.exp(arr - vmax), axis=0))
C:\Users\george.crowther\AppData\Local\Continuum\Anaconda3_4\envs\optasense\lib\site-packages\sklearn\ensemble\gradient_boosting.py:558: RuntimeWarning: invalid value encountered in multiply
  return np.sum(-1 * sample_weight * (Y * pred).sum(axis=1) +
Optimization Progress:   0%|▋                                                                                                                                    | 3/600 [00:10<30:58,  3.11s/pipeline]C:\Users\george.crowther\AppData\Local\Continuum\Anaconda3_4\envs\optasense\lib\site-packages\sklearn\utils\extmath.py:410: RuntimeWarning: invalid value encountered in subtract
  out = np.log(np.sum(np.exp(arr - vmax), axis=0))
Optimization Progress:  17%|█████████████████████▌                                                                                                           | 100/600 [1:07:13<30:03,  3.61s/pipeline]





    



Generation 1 - Current best internal CV score: 0.8408606836029527






    



Optimization Progress:  32%|████████████████████████████████████████▊                                                                                      | 193/600 [2:37:51<9:36:07, 84.93s/pipeline]





    



Generation 2 - Current best internal CV score: 0.8408606836029527






    



Optimization Progress:  48%|████████████████████████████████████████████████████████████▌                                                                  | 286/600 [3:42:05<2:49:53, 32.46s/pipeline]





    



Generation 3 - Current best internal CV score: 0.8416541124289602






    



Optimization Progress:  62%|█████████████████████████████████████████████████████████████████████████████▎                                               | 371/600 [5:32:08<36:10:06, 568.59s/pipeline]





    



Timeout during evaluation of pipeline #371. Skipping to the next pipeline.






    



Optimization Progress:  65%|██████████████████████████████████████████████████████████████████████████████████▊                                            | 391/600 [5:40:46<1:06:19, 19.04s/pipeline]





    



Generation 4 - Current best internal CV score: 0.8579135039637568






    



Optimization Progress:  82%|██████████████████████████████████████████████████████████████████████████████████████████████████████▋                       | 489/600 [7:41:28<3:21:39, 109.01s/pipeline]





    



Generation 5 - Current best internal CV score: 0.8603160397473301






    









    



Best pipeline: ExtraTreesClassifier(VarianceThreshold(input_matrix, 0.37), 21, 0.70999999999999996)



In [61]:

    
tpot.score(test_f[train_columns], test_f['class'])









    Out[61]:





0.89883901153186463



In [28]:

    
tpot.export('contest_export.py')



In [62]:

    
result = tpot.predict(frame[train_columns])

from sklearn.metrics import confusion_matrix
from classification_utilities import display_cm, display_adj_cm

conf = confusion_matrix(frame['class'], result)
display_cm(conf, facies_labels, hide_zeros=True, display_metrics = True)

def accuracy(conf):
    total_correct = 0.
    nb_classes = conf.shape[0]
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
    acc = total_correct/sum(sum(conf))
    return acc

print(accuracy(conf))

adjacent_facies = np.array([[1], [0,2], [1], [4], [3,5], [4,6,7], [5,7], [5,6,8], [6,7]])

def accuracy_adjacent(conf, adjacent_facies):
    nb_classes = conf.shape[0]
    total_correct = 0.
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
        for j in adjacent_facies[i]:
            total_correct += conf[i][j]
    return total_correct / sum(sum(conf))

print(accuracy_adjacent(conf, adjacent_facies))









    



     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   259                                                   259
     CSiS     2   731     5                                       738
     FSiS          11   604                                       615
     SiSh                     179     2     2           1         184
       MS                           213     2           2         217
       WS                       1         453           8         462
        D                                        95     2     1    98
       PS                 1           2     9         486         498
       BS                                         2     3   156   161

Precision  0.99  0.99  0.99  0.99  0.98  0.97  0.98  0.97  0.99  0.98
   Recall  1.00  0.99  0.98  0.97  0.98  0.98  0.97  0.98  0.97  0.98
       F1  1.00  0.99  0.99  0.98  0.98  0.98  0.97  0.97  0.98  0.98
0.982673267327
0.996905940594

5.0 Workflow for Test Data

Run this to generate results from output model.



In [102]:

    
test_path = r'validation_data_nofacies.csv'

# Read training data to dataframe
test = pd.read_csv(test_path)

# TPOT library requires that the target class is renamed to 'class'
test.rename(columns={'Facies': 'class'}, inplace=True)

test_columns = test.columns

formations = {}
for i, value in enumerate(test['Formation'].unique()):
    formations[value] = i
    test.loc[test['Formation'] == value, 'Formation'] = i

wells = {}
for i, value in enumerate(test['Well Name'].unique()):
    wells[value] = i
    test.loc[test['Well Name'] == value, 'Well Name'] = i

std_scaler = preprocessing.StandardScaler().fit(test[test_columns])
test_std = std_scaler.transform(test[test_columns])

test_std_frame = test
for i, column in enumerate(test_columns):
    test_std_frame.loc[:, column] = test_std[:, i]

test = test_std_frame
master_columns = test.columns[3:]

def in_range(row, vmin, vmax, variable):
    
    if vmin <= row[variable] < vmax:
        return 1
    else:
        return 0
    
for i, column in test[master_columns].iteritems():
    ds = np.linspace(0, 1.0, 5)
    quantiles = [column.quantile(n) for n in ds]
    
    for j in range(len(quantiles) - 1):
        test[i + '_{0}'.format(j)] = test.apply(lambda row: in_range(row, ds[j], ds[j + 1], i), axis = 1)

master_columns = test.columns[3:]

above = []
below = []
for i, group in test.groupby('Well Name'):
    
    df = group.sort_values('Depth')
    u = df.shift(-1).fillna(method = 'ffill')
    b = df.shift(1).fillna(method = 'bfill')
    
    above.append(u[master_columns])
    below.append(b[master_columns])
    
above_frame = pd.concat(above)
above_frame.columns = ['above_'+ column for column in above_frame.columns]
below_frame = pd.concat(below)
below_frame.columns = ['below_'+ column for column in below_frame.columns]

frame = pd.concat((test, above_frame, below_frame), axis = 1)

test_columns = frame.columns[3:]

result = tpot.predict(frame[test_columns])



In [107]:

    
result









    Out[107]:





array([3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 2, 8, 8, 8,
       8, 6, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 4, 4, 4, 6, 6, 5, 6, 6, 4, 4,
       6, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 8, 8, 8, 8, 6,
       6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, 6,
       6, 6, 8, 8, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 3, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 4, 4, 5, 4, 4, 6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 6, 8, 8, 8, 2, 2,
       3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 2, 2, 3, 3, 3, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 9, 9, 8, 9, 9, 9, 9, 8, 6, 6, 6, 6, 8, 2, 2, 3, 3,
       3, 3, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       3, 3, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 6, 6, 8, 3, 3, 3, 3, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 8, 8, 8, 6, 6, 6, 8, 6, 9, 9, 9,
       9, 9, 9, 6, 6, 6, 6, 6, 6, 6, 6, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 3, 3, 2,
       3, 3, 3, 3, 3, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 8, 8, 5, 5, 5, 5, 5, 5, 5, 5, 7, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 8, 8, 6, 4, 4,
       4, 4, 4, 4, 8, 4, 4, 4, 4, 6, 6, 4, 4, 4, 6, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 5, 5, 5, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4,
       4, 4, 6, 6, 8, 8, 6, 6, 6, 4, 4, 4, 4, 6, 6, 8, 8, 8, 8, 8, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 4, 7, 7, 7, 7, 7,
       7, 7, 7, 7, 7, 7, 7, 7, 7, 5, 6, 6, 6, 6, 6, 6, 6, 8, 6, 8, 6, 6, 8,
       3, 3, 8, 8, 8, 8, 8, 8, 8, 5, 3, 3, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 8, 6, 6, 6, 8, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 8, 8, 8, 8,
       8, 8, 8, 9, 9, 9, 9, 9, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 6, 8, 5,
       3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 8, 8, 6, 6, 6, 8, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 3, 3, 2, 2, 2, 2, 2, 3, 3, 5, 8, 8, 8, 8, 8, 8, 8, 6, 6, 6, 6, 6,
       6, 6, 8, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 8, 8, 8, 8, 8, 8,
       8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
       7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 5, 5, 8,
       6, 6, 6, 6, 6, 6, 8, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
       3, 3], dtype=int64)



In [113]:

    
output_frame = pd.read_csv(test_path)
output_frame['Facies'] = result
output_frame.to_csv('Well Facies Prediction - Test Data Set.csv')

	class	Formation	Well Name	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
0	3	A1 SH	SHRIMPLIN	2793.0	77.45	0.664	9.9	11.915	4.6	1	1.000
1	3	A1 SH	SHRIMPLIN	2793.5	78.26	0.661	14.2	12.565	4.1	1	0.979
2	3	A1 SH	SHRIMPLIN	2794.0	79.05	0.658	14.8	13.050	3.6	1	0.957
3	3	A1 SH	SHRIMPLIN	2794.5	86.10	0.655	13.9	13.115	3.5	1	0.936
4	3	A1 SH	SHRIMPLIN	2795.0	74.58	0.647	13.5	13.300	3.4	1	0.915

	class	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
count	3232.000000	3232.000000	3232.000000	3232.000000	3232.000000	3232.000000	3232.000000	3232.000000	3232.000000
mean	4.422030	2875.824567	66.135769	0.642719	3.559642	13.483213	3.725014	1.498453	0.520287
std	2.504243	131.006274	30.854826	0.241845	5.228948	7.698980	0.896152	0.500075	0.286792
min	1.000000	2573.500000	13.250000	-0.025949	-21.832000	0.550000	0.200000	1.000000	0.010000
25%	2.000000	2791.000000	46.918750	0.492750	1.163750	8.346750	3.100000	1.000000	0.273000
50%	4.000000	2893.500000	65.721500	0.624437	3.500000	12.150000	3.551500	1.000000	0.526000
75%	6.000000	2980.000000	79.626250	0.812735	6.432500	16.453750	4.300000	2.000000	0.767250
max	9.000000	3122.500000	361.150000	1.480000	18.600000	84.400000	8.094000	2.000000	1.000000