Facies Classification using TPOT

George Crowther - https://www.linkedin.com/in/george-crowther-9669a931?trk=hp-identity-name

I've had a play with some of the data here and used something of a brute force approach, by creating a large number of additional features and then using the TPOT library to train a model and refine the model parameters. I will be interested to see whether this has over-fitted, as the selected Extra Trees Classifier can do that.

1. Data Loading and Initial Observations


In [1]:
# Initial imports for reading data and first observations
import pandas as pd
import bokeh.plotting as bk
import numpy as np

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

from tpot import TPOTClassifier

bk.output_notebook()


Loading BokehJS ...

In [89]:
# Input file paths
train_path = r'training_data.csv'
test_path = r'validation_data_nofacies.csv'

# Read training data to dataframe
train = pd.read_csv(train_path)

# TPOT library requires that the target class is renamed to 'class'
train.rename(columns={'Facies': 'class'}, inplace=True)

In [90]:
train.head()


Out[90]:
class Formation Well Name Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS
0 3 A1 SH SHRIMPLIN 2793.0 77.45 0.664 9.9 11.915 4.6 1 1.000
1 3 A1 SH SHRIMPLIN 2793.5 78.26 0.661 14.2 12.565 4.1 1 0.979
2 3 A1 SH SHRIMPLIN 2794.0 79.05 0.658 14.8 13.050 3.6 1 0.957
3 3 A1 SH SHRIMPLIN 2794.5 86.10 0.655 13.9 13.115 3.5 1 0.936
4 3 A1 SH SHRIMPLIN 2795.0 74.58 0.647 13.5 13.300 3.4 1 0.915

In [91]:
train.describe()


Out[91]:
class Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS
count 3232.000000 3232.000000 3232.000000 3232.000000 3232.000000 3232.000000 3232.000000 3232.000000 3232.000000
mean 4.422030 2875.824567 66.135769 0.642719 3.559642 13.483213 3.725014 1.498453 0.520287
std 2.504243 131.006274 30.854826 0.241845 5.228948 7.698980 0.896152 0.500075 0.286792
min 1.000000 2573.500000 13.250000 -0.025949 -21.832000 0.550000 0.200000 1.000000 0.010000
25% 2.000000 2791.000000 46.918750 0.492750 1.163750 8.346750 3.100000 1.000000 0.273000
50% 4.000000 2893.500000 65.721500 0.624437 3.500000 12.150000 3.551500 1.000000 0.526000
75% 6.000000 2980.000000 79.626250 0.812735 6.432500 16.453750 4.300000 2.000000 0.767250
max 9.000000 3122.500000 361.150000 1.480000 18.600000 84.400000 8.094000 2.000000 1.000000

In [92]:
formations = {}
for i, value in enumerate(train['Formation'].unique()):
    formations[value] = i
    train.loc[train['Formation'] == value, 'Formation'] = i

wells = {}
for i, value in enumerate(train['Well Name'].unique()):
    wells[value] = i
    train.loc[train['Well Name'] == value, 'Well Name'] = i

In [7]:
from bokeh.palettes import viridis

def log_plot(well_frame):
    plots = []
    iterator = 0
    
    #cmap = viridis(len(column.unique()))
    
    index = 0
    for i, column in well_frame.iteritems():
        if index == 0:
            plots.append(bk.figure(height = 800, width = 150))
        else:
            plots.append(bk.figure(height = 800, width = 75, y_range = plots[0].y_range))
        plots[-1].line(column, well_frame['Depth'])
        plots[-1].yaxis.visible = False
        plots[-1].title.text = i
        
    plots[0].yaxis.visible = True
    grid = bk.gridplot([plots])
    bk.show(grid)
            
for i, group in train.groupby('Well Name'):
    well_frame = group.sort_values('Depth')
    log_plot(well_frame)



In [93]:
facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS',
                 'WS', 'D','PS', 'BS']

Feature construction and data clean-up.

1. Z-score normalisation of data.
2. Group each of the measurement parameters into quartiles. Most of the classification methods find data like this easier to work with.
3. Create a series of 'adjacent' parameters by looking for the above and below depth sample for each well. Create a series of features associated with the above and below parameters.

In [94]:
train_columns = train.columns[1:]
std_scaler = preprocessing.StandardScaler().fit(train[train_columns])

train_std = std_scaler.transform(train[train_columns])

train_std_frame = train
for i, column in enumerate(train_columns):
    train_std_frame.loc[:, column] = train_std[:, i]

train = train_std_frame
master_columns = train.columns[4:]

def in_range(row, vmin, vmax, variable):
    
    if vmin <= row[variable] < vmax:
        return 1
    else:
        return 0
    
for i, column in train[master_columns].iteritems():
    ds = np.linspace(0, 1.0, 5)
    quantiles = [column.quantile(n) for n in ds]
    
    for j in range(len(quantiles) - 1):
        train[i + '_{0}'.format(j)] = train.apply(lambda row: in_range(row, ds[j], ds[j + 1], i), axis = 1)

master_columns = train.columns[4:]

above = []
below = []
for i, group in train.groupby('Well Name'):
    
    df = group.sort_values('Depth')
    u = df.shift(-1).fillna(method = 'ffill')
    b = df.shift(1).fillna(method = 'bfill')
    
    above.append(u[master_columns])
    below.append(b[master_columns])
    
above_frame = pd.concat(above)
above_frame.columns = ['above_'+ column for column in above_frame.columns]
below_frame = pd.concat(below)
below_frame.columns = ['below_'+ column for column in below_frame.columns]

frame = pd.concat((train, above_frame, below_frame), axis = 1)

4. TPOT


In [96]:
train_vector = ['class']
train_columns = frame.columns[4:]

train_f, test_f = train_test_split(frame, test_size = 0.1, 
                                   random_state = 7)

TPOT uses a genetic algorithm to tune model parameters for the most effective fit. This can take quite a while to process if you want to re-run this part!


In [26]:
tpot = TPOTClassifier(verbosity = 2, generations = 5, max_eval_time_mins = 30)
tpot.fit(train_f[train_columns], train_f['class'])


C:\Users\george.crowther\AppData\Local\Continuum\Anaconda3_4\envs\optasense\lib\site-packages\sklearn\utils\extmath.py:410: RuntimeWarning: invalid value encountered in subtract
  out = np.log(np.sum(np.exp(arr - vmax), axis=0))
C:\Users\george.crowther\AppData\Local\Continuum\Anaconda3_4\envs\optasense\lib\site-packages\sklearn\ensemble\gradient_boosting.py:558: RuntimeWarning: invalid value encountered in multiply
  return np.sum(-1 * sample_weight * (Y * pred).sum(axis=1) +
Optimization Progress:   0%|▋                                                                                                                                    | 3/600 [00:10<30:58,  3.11s/pipeline]C:\Users\george.crowther\AppData\Local\Continuum\Anaconda3_4\envs\optasense\lib\site-packages\sklearn\utils\extmath.py:410: RuntimeWarning: invalid value encountered in subtract
  out = np.log(np.sum(np.exp(arr - vmax), axis=0))
Optimization Progress:  17%|█████████████████████▌                                                                                                           | 100/600 [1:07:13<30:03,  3.61s/pipeline]
Generation 1 - Current best internal CV score: 0.8408606836029527
Optimization Progress:  32%|████████████████████████████████████████▊                                                                                      | 193/600 [2:37:51<9:36:07, 84.93s/pipeline]
Generation 2 - Current best internal CV score: 0.8408606836029527
Optimization Progress:  48%|████████████████████████████████████████████████████████████▌                                                                  | 286/600 [3:42:05<2:49:53, 32.46s/pipeline]
Generation 3 - Current best internal CV score: 0.8416541124289602
Optimization Progress:  62%|█████████████████████████████████████████████████████████████████████████████▎                                               | 371/600 [5:32:08<36:10:06, 568.59s/pipeline]
Timeout during evaluation of pipeline #371. Skipping to the next pipeline.
Optimization Progress:  65%|██████████████████████████████████████████████████████████████████████████████████▊                                            | 391/600 [5:40:46<1:06:19, 19.04s/pipeline]
Generation 4 - Current best internal CV score: 0.8579135039637568
Optimization Progress:  82%|██████████████████████████████████████████████████████████████████████████████████████████████████████▋                       | 489/600 [7:41:28<3:21:39, 109.01s/pipeline]
Generation 5 - Current best internal CV score: 0.8603160397473301

Best pipeline: ExtraTreesClassifier(VarianceThreshold(input_matrix, 0.37), 21, 0.70999999999999996)

In [61]:
tpot.score(test_f[train_columns], test_f['class'])


Out[61]:
0.89883901153186463

In [28]:
tpot.export('contest_export.py')

In [62]:
result = tpot.predict(frame[train_columns])

from sklearn.metrics import confusion_matrix
from classification_utilities import display_cm, display_adj_cm

conf = confusion_matrix(frame['class'], result)
display_cm(conf, facies_labels, hide_zeros=True, display_metrics = True)

def accuracy(conf):
    total_correct = 0.
    nb_classes = conf.shape[0]
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
    acc = total_correct/sum(sum(conf))
    return acc

print(accuracy(conf))

adjacent_facies = np.array([[1], [0,2], [1], [4], [3,5], [4,6,7], [5,7], [5,6,8], [6,7]])

def accuracy_adjacent(conf, adjacent_facies):
    nb_classes = conf.shape[0]
    total_correct = 0.
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
        for j in adjacent_facies[i]:
            total_correct += conf[i][j]
    return total_correct / sum(sum(conf))

print(accuracy_adjacent(conf, adjacent_facies))


     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   259                                                   259
     CSiS     2   731     5                                       738
     FSiS          11   604                                       615
     SiSh                     179     2     2           1         184
       MS                           213     2           2         217
       WS                       1         453           8         462
        D                                        95     2     1    98
       PS                 1           2     9         486         498
       BS                                         2     3   156   161

Precision  0.99  0.99  0.99  0.99  0.98  0.97  0.98  0.97  0.99  0.98
   Recall  1.00  0.99  0.98  0.97  0.98  0.98  0.97  0.98  0.97  0.98
       F1  1.00  0.99  0.99  0.98  0.98  0.98  0.97  0.97  0.98  0.98
0.982673267327
0.996905940594

5.0 Workflow for Test Data

Run this to generate results from output model.


In [102]:
test_path = r'validation_data_nofacies.csv'

# Read training data to dataframe
test = pd.read_csv(test_path)

# TPOT library requires that the target class is renamed to 'class'
test.rename(columns={'Facies': 'class'}, inplace=True)

test_columns = test.columns

formations = {}
for i, value in enumerate(test['Formation'].unique()):
    formations[value] = i
    test.loc[test['Formation'] == value, 'Formation'] = i

wells = {}
for i, value in enumerate(test['Well Name'].unique()):
    wells[value] = i
    test.loc[test['Well Name'] == value, 'Well Name'] = i

std_scaler = preprocessing.StandardScaler().fit(test[test_columns])
test_std = std_scaler.transform(test[test_columns])

test_std_frame = test
for i, column in enumerate(test_columns):
    test_std_frame.loc[:, column] = test_std[:, i]

test = test_std_frame
master_columns = test.columns[3:]

def in_range(row, vmin, vmax, variable):
    
    if vmin <= row[variable] < vmax:
        return 1
    else:
        return 0
    
for i, column in test[master_columns].iteritems():
    ds = np.linspace(0, 1.0, 5)
    quantiles = [column.quantile(n) for n in ds]
    
    for j in range(len(quantiles) - 1):
        test[i + '_{0}'.format(j)] = test.apply(lambda row: in_range(row, ds[j], ds[j + 1], i), axis = 1)

master_columns = test.columns[3:]

above = []
below = []
for i, group in test.groupby('Well Name'):
    
    df = group.sort_values('Depth')
    u = df.shift(-1).fillna(method = 'ffill')
    b = df.shift(1).fillna(method = 'bfill')
    
    above.append(u[master_columns])
    below.append(b[master_columns])
    
above_frame = pd.concat(above)
above_frame.columns = ['above_'+ column for column in above_frame.columns]
below_frame = pd.concat(below)
below_frame.columns = ['below_'+ column for column in below_frame.columns]

frame = pd.concat((test, above_frame, below_frame), axis = 1)

test_columns = frame.columns[3:]

result = tpot.predict(frame[test_columns])

In [107]:
result


Out[107]:
array([3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 2, 8, 8, 8,
       8, 6, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 4, 4, 4, 6, 6, 5, 6, 6, 4, 4,
       6, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 8, 8, 8, 8, 6,
       6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, 6,
       6, 6, 8, 8, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 3, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 4, 4, 5, 4, 4, 6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 6, 8, 8, 8, 2, 2,
       3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 2, 2, 3, 3, 3, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 9, 9, 8, 9, 9, 9, 9, 8, 6, 6, 6, 6, 8, 2, 2, 3, 3,
       3, 3, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       3, 3, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 6, 6, 8, 3, 3, 3, 3, 8, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 8, 8, 8, 6, 6, 6, 8, 6, 9, 9, 9,
       9, 9, 9, 6, 6, 6, 6, 6, 6, 6, 6, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 3, 3, 2,
       3, 3, 3, 3, 3, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 8, 8, 8, 5, 5, 5, 5, 5, 5, 5, 5, 7, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 8, 8, 6, 4, 4,
       4, 4, 4, 4, 8, 4, 4, 4, 4, 6, 6, 4, 4, 4, 6, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 5, 5, 5, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4,
       4, 4, 6, 6, 8, 8, 6, 6, 6, 4, 4, 4, 4, 6, 6, 8, 8, 8, 8, 8, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 4, 7, 7, 7, 7, 7,
       7, 7, 7, 7, 7, 7, 7, 7, 7, 5, 6, 6, 6, 6, 6, 6, 6, 8, 6, 8, 6, 6, 8,
       3, 3, 8, 8, 8, 8, 8, 8, 8, 5, 3, 3, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
       8, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 8, 6, 6, 6, 8, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 8, 8, 8, 8,
       8, 8, 8, 9, 9, 9, 9, 9, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 6, 8, 5,
       3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 8, 8, 8,
       8, 8, 8, 8, 8, 8, 8, 8, 6, 6, 6, 8, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 3, 3, 2, 2, 2, 2, 2, 3, 3, 5, 8, 8, 8, 8, 8, 8, 8, 6, 6, 6, 6, 6,
       6, 6, 8, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 8, 8, 8, 8, 8, 8,
       8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
       7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 5, 5, 8,
       6, 6, 6, 6, 6, 6, 8, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
       3, 3], dtype=int64)

In [113]:
output_frame = pd.read_csv(test_path)
output_frame['Facies'] = result
output_frame.to_csv('Well Facies Prediction - Test Data Set.csv')