Facies classification using Machine Learning

LA Team Submission 5

Lukas Mosser, Alfredo De la Fuente

In this approach to solving the facies classification problem (https://github.com/seg/2016-ml-contest) we will explore the following strategies:

  • Feature exploration: based on Paolo Bestagini's work, we will consider imputation, normalization and augmentation routines for the initial features.
  • Model tuning: automated pipeline search with TPOT (scored with micro-averaged F1), followed by a stacked BernoulliNB + RandomForestClassifier pipeline and a median-filter cleanup of isolated predictions.

Libraries

We will need to install the following libraries and packages.


In [1]:
%%sh
pip install pandas
pip install scikit-learn
pip install tpot


Requirement already satisfied (use --upgrade to upgrade): pandas in /home/alfredo/anaconda2/lib/python2.7/site-packages
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in /home/alfredo/anaconda2/lib/python2.7/site-packages (from pandas)
Requirement already satisfied (use --upgrade to upgrade): pytz>=2011k in /home/alfredo/anaconda2/lib/python2.7/site-packages (from pandas)
Requirement already satisfied (use --upgrade to upgrade): numpy>=1.7.0 in /home/alfredo/anaconda2/lib/python2.7/site-packages (from pandas)
Requirement already satisfied (use --upgrade to upgrade): six>=1.5 in /home/alfredo/anaconda2/lib/python2.7/site-packages (from python-dateutil->pandas)
Requirement already satisfied (use --upgrade to upgrade): scikit-learn in /home/alfredo/anaconda2/lib/python2.7/site-packages
Requirement already satisfied (use --upgrade to upgrade): tpot in /home/alfredo/anaconda2/lib/python2.7/site-packages
Requirement already satisfied (use --upgrade to upgrade): deap in /home/alfredo/anaconda2/lib/python2.7/site-packages (from tpot)
Requirement already satisfied (use --upgrade to upgrade): numpy in /home/alfredo/anaconda2/lib/python2.7/site-packages (from tpot)
Requirement already satisfied (use --upgrade to upgrade): update-checker in /home/alfredo/anaconda2/lib/python2.7/site-packages (from tpot)
Requirement already satisfied (use --upgrade to upgrade): scipy in /home/alfredo/anaconda2/lib/python2.7/site-packages (from tpot)
Requirement already satisfied (use --upgrade to upgrade): tqdm in /home/alfredo/anaconda2/lib/python2.7/site-packages (from tpot)
Requirement already satisfied (use --upgrade to upgrade): scikit-learn in /home/alfredo/anaconda2/lib/python2.7/site-packages (from tpot)
Requirement already satisfied (use --upgrade to upgrade): requests>=2.3.0 in /home/alfredo/anaconda2/lib/python2.7/site-packages (from update-checker->tpot)
You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

In [2]:
from __future__ import print_function
import numpy as np
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold, StratifiedKFold
from classification_utilities import display_cm, display_adj_cm
from sklearn.metrics import confusion_matrix, f1_score
from sklearn import preprocessing
from sklearn.model_selection import LeavePGroupsOut
from sklearn.multiclass import OneVsOneClassifier
from sklearn.ensemble import RandomForestClassifier
from scipy.signal import medfilt

Data Preprocessing


In [3]:
# Load data
data = pd.read_csv('../facies_vectors.csv')

# Parameters
feature_names = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE', 'NM_M', 'RELPOS']
facies_names = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS', 'WS', 'D', 'PS', 'BS']
facies_colors = ['#F4D03F', '#F5B041','#DC7633','#6E2C00', '#1B4F72','#2E86C1', '#AED6F1', '#A569BD', '#196F3D']

# Store features and labels
X = data[feature_names].values 
y = data['Facies'].values 

# Store well labels and depths
well = data['Well Name'].values
depth = data['Depth'].values

# Fill 'PE' missing values with mean
imp = preprocessing.Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit(X)
X = imp.transform(X)
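
Before imputing, it is worth confirming that 'PE' is the only log with missing values. The following quick check is our addition (an illustrative sketch, not part of the original workflow):


In [ ]:
# Count missing entries per feature column; only the photoelectric
# effect log ('PE') is expected to contain NaNs in this dataset.
for name in feature_names:
    print(name, np.isnan(data[name].values).sum())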

We proceed to run Paolo Bestagini's routine, which concatenates a small window of neighbouring samples, to account for the spatial component of the log analysis, and adds the gradient of each feature with respect to depth. This will be our prepared training dataset.


In [4]:
# Feature windows concatenation function
def augment_features_window(X, N_neig):
    
    # Parameters
    N_row = X.shape[0]
    N_feat = X.shape[1]

    # Zero padding
    X = np.vstack((np.zeros((N_neig, N_feat)), X, (np.zeros((N_neig, N_feat)))))

    # Loop over windows
    X_aug = np.zeros((N_row, N_feat*(2*N_neig+1)))
    for r in np.arange(N_row)+N_neig:
        this_row = []
        for c in np.arange(-N_neig,N_neig+1):
            this_row = np.hstack((this_row, X[r+c]))
        X_aug[r-N_neig] = this_row

    return X_aug


# Feature gradient computation function
def augment_features_gradient(X, depth):
    
    # Compute features gradient
    d_diff = np.diff(depth).reshape((-1, 1))
    d_diff[d_diff==0] = 0.001
    X_diff = np.diff(X, axis=0)
    X_grad = X_diff / d_diff
        
    # Compensate for last missing value
    X_grad = np.concatenate((X_grad, np.zeros((1, X_grad.shape[1]))))
    
    return X_grad


# Feature augmentation function
def augment_features(X, well, depth, N_neig=1):
    
    # Augment features
    X_aug = np.zeros((X.shape[0], X.shape[1]*(N_neig*2+2)))
    for w in np.unique(well):
        w_idx = np.where(well == w)[0]
        X_aug_win = augment_features_window(X[w_idx, :], N_neig)
        X_aug_grad = augment_features_gradient(X[w_idx, :], depth[w_idx])
        X_aug[w_idx, :] = np.concatenate((X_aug_win, X_aug_grad), axis=1)
    
    # Find padded rows
    padded_rows = np.unique(np.where(X_aug[:, 0:7] == np.zeros((1, 7)))[0])
    
    return X_aug, padded_rows

X_aug, padded_rows = augment_features(X, well, depth)
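
As a quick sanity check, the following sketch (our addition, on synthetic values rather than the contest data) confirms the augmented feature count: with the default N_neig=1, each of the 7 logs yields 3 windowed copies plus 1 depth gradient, i.e. 7 * (2*1 + 2) = 28 columns per sample.


In [ ]:
# Minimal shape check on synthetic data (illustrative only).
X_toy = np.random.rand(10, 7)
well_toy = np.array(['A'] * 5 + ['B'] * 5)
depth_toy = np.linspace(1000.0, 1004.5, 10)
X_toy_aug, toy_padded = augment_features(X_toy, well_toy, depth_toy)
print(X_toy_aug.shape)  # expected: (10, 28)
print(toy_padded)       # [0 5]: the first sample of each well, whose
                        # backward window slot falls in the zero padding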

In [8]:
# Initialize model selection methods
lpgo = LeavePGroupsOut(2)

# Generate splits
split_list = []
for train, val in lpgo.split(X, y, groups=data['Well Name']):
    hist_tr = np.histogram(y[train], bins=np.arange(len(facies_names)+1)+.5)
    hist_val = np.histogram(y[val], bins=np.arange(len(facies_names)+1)+.5)
    if np.all(hist_tr[0] != 0) & np.all(hist_val[0] != 0):
        split_list.append({'train':train, 'val':val})
    
        
def preprocess():
    
    # Preprocess data to use in model
    X_train_aux = []
    X_test_aux = []
    y_train_aux = []
    y_test_aux = []
    
    # Use a single leave-two-wells-out split (index 5 of split_list)
    split = split_list[5]
        
    # Remove padded rows
    split_train_no_pad = np.setdiff1d(split['train'], padded_rows)

    # Select training and validation data from current split
    X_tr = X_aug[split_train_no_pad, :]
    X_v = X_aug[split['val'], :]
    y_tr = y[split_train_no_pad]
    y_v = y[split['val']]

    # Select well labels for validation data (not used further here)
    well_v = well[split['val']]

    # Feature normalization
    scaler = preprocessing.RobustScaler(quantile_range=(25.0, 75.0)).fit(X_tr)
    X_tr = scaler.transform(X_tr)
    X_v = scaler.transform(X_v)
        
    X_train_aux.append(X_tr)
    X_test_aux.append(X_v)
    y_train_aux.append(y_tr)
    y_test_aux.append(y_v)

    X_train = np.concatenate(X_train_aux)
    X_test = np.concatenate(X_test_aux)
    y_train = np.concatenate(y_train_aux)
    y_test = np.concatenate(y_test_aux)

    return X_train, X_test, y_train, y_test

Data Analysis

In this section we run a cross-validation routine using TPOT, an automated machine-learning tool that searches over scikit-learn pipelines with genetic programming, scored with micro-averaged F1 on the hold-out well pair prepared above.


In [9]:
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = preprocess()

tpot = TPOTClassifier(generations=5, population_size=20,
                      verbosity=2, max_eval_time_mins=20,
                      max_time_mins=100, scoring='f1_micro',
                      random_state=17)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('FinalPipeline.py')


Optimization Progress: 100%|██████████| 20/20 [06:32<00:00, 25.68s/pipeline]
Generation 1 - Current best internal CV score: 0.585100621865
Optimization Progress:  88%|████████▊ | 35/40 [09:44<01:00, 12.18s/pipeline]
Generation 2 - Current best internal CV score: 0.585100621865
Optimization Progress:  93%|█████████▎| 56/60 [15:07<01:02, 15.73s/pipeline]
Generation 3 - Current best internal CV score: 0.585100621865
Optimization Progress: 100%|██████████| 80/80 [22:05<00:00, 22.42s/pipeline]
Generation 4 - Current best internal CV score: 0.588208285608
Optimization Progress:  99%|█████████▉| 99/100 [30:45<00:34, 34.94s/pipeline]
Generation 5 - Current best internal CV score: 0.588208285608
Optimization Progress:  98%|█████████▊| 118/120 [37:15<00:50, 25.11s/pipeline]
Generation 6 - Current best internal CV score: 0.588208285608
Optimization Progress:  99%|█████████▉| 139/140 [44:52<00:20, 20.95s/pipeline]
Generation 7 - Current best internal CV score: 0.588208285608
Optimization Progress:  99%|█████████▉| 158/160 [55:25<00:54, 27.37s/pipeline]
Generation 8 - Current best internal CV score: 0.588208285608
Optimization Progress: 100%|██████████| 180/180 [1:04:39<00:00, 21.91s/pipeline]
Generation 9 - Current best internal CV score: 0.588208285608
Optimization Progress:  98%|█████████▊| 197/200 [1:10:20<00:46, 15.36s/pipeline]
Generation 10 - Current best internal CV score: 0.588208285608
Optimization Progress:  99%|█████████▉| 218/220 [1:18:09<00:38, 19.29s/pipeline]
Generation 11 - Current best internal CV score: 0.592520404373
Optimization Progress: 100%|██████████| 240/240 [1:25:40<00:00, 16.96s/pipeline]
Generation 12 - Current best internal CV score: 0.592520404373
Optimization Progress:  99%|█████████▉| 258/260 [1:30:56<00:31, 15.84s/pipeline]
Generation 13 - Current best internal CV score: 0.592520404373
Optimization Progress:  99%|█████████▉| 278/280 [1:38:04<00:35, 17.68s/pipeline]
Generation 14 - Current best internal CV score: 0.592520404373

GP closed prematurely - will use current best pipeline

Best pipeline: RandomForestClassifier(BernoulliNB(input_matrix, 60.0, 0.26000000000000001))
0.476243093923
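
TPOT's pipeline string reads inside-out: BernoulliNB(alpha=60.0, binarize=0.26) is fitted first and its predictions are appended to the original features (the make_union/VotingClassifier construct in the exported code), which are then fed to a RandomForestClassifier; the cells below rebuild exactly this pipeline. As an optional extra check (our addition, assuming the display_cm(cm, labels, hide_zeros, display_metrics) signature from the contest's classification_utilities), the hold-out prediction can also be scored explicitly:


In [ ]:
# Score the fitted TPOT pipeline with the same micro-averaged F1 used
# as the search objective, and inspect the per-facies confusion matrix.
y_test_hat = tpot.predict(X_test)
print(f1_score(y_test, y_test_hat, average='micro'))
display_cm(confusion_matrix(y_test, y_test_hat), facies_names,
           display_metrics=True, hide_zeros=True)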

In [10]:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer

In [19]:
# Train and test a classifier
def train_and_test(X_tr, y_tr, X_v, well_v):
    
    # Feature normalization
    scaler = preprocessing.RobustScaler(quantile_range=(25.0, 75.0)).fit(X_tr)
    X_tr = scaler.transform(X_tr)
    X_v = scaler.transform(X_v)
    
    # Train classifier
    #clf = make_pipeline(make_union(VotingClassifier([("est", ExtraTreesClassifier(criterion="gini", max_features=1.0, n_estimators=500))]), FunctionTransformer(lambda X: X)), XGBClassifier(learning_rate=0.73, max_depth=10, min_child_weight=10, n_estimators=500, subsample=0.27))
    #clf =  make_pipeline( KNeighborsClassifier(n_neighbors=5, weights="distance") ) 
    #clf = make_pipeline(MaxAbsScaler(),make_union(VotingClassifier([("est", RandomForestClassifier(n_estimators=500))]), FunctionTransformer(lambda X: X)),ExtraTreesClassifier(criterion="entropy", max_features=0.0001, n_estimators=500))
    clf = make_pipeline(
        make_union(
            VotingClassifier([("est", BernoulliNB(alpha=60.0, binarize=0.26, fit_prior=True))]),
            FunctionTransformer(lambda X: X)
        ),
        RandomForestClassifier(n_estimators=500)
    )
    clf.fit(X_tr, y_tr)
    
    # Test classifier
    y_v_hat = clf.predict(X_v)
    
    # Clean isolated facies for each well
    for w in np.unique(well_v):
        y_v_hat[well_v==w] = medfilt(y_v_hat[well_v==w], kernel_size=5)
    
    return y_v_hat
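
The median filter acts as a simple spatial regularizer on each predicted well log: with kernel_size=5, isolated one- or two-sample facies switches are voted away, while runs of three or more consecutive samples survive. A minimal illustration (our addition):


In [ ]:
# Illustrative only: an isolated facies code (5) inside a run of 3s is
# removed by the kernel-size-5 median filter; longer runs are preserved.
y_demo = np.array([3, 3, 3, 5, 3, 3, 3], dtype=float)
print(medfilt(y_demo, kernel_size=5))  # -> [ 3.  3.  3.  3.  3.  3.  3.]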

Prediction


In [21]:
# Load testing data
test_data = pd.read_csv('../validation_data_nofacies.csv')

# Prepare training data
X_tr = X
y_tr = y

# Augment features
X_tr, padded_rows = augment_features(X_tr, well, depth)

# Remove padded rows
X_tr = np.delete(X_tr, padded_rows, axis=0)
y_tr = np.delete(y_tr, padded_rows, axis=0) 

# Prepare test data
well_ts = test_data['Well Name'].values
depth_ts = test_data['Depth'].values
X_ts = test_data[feature_names].values

# Augment features
X_ts, padded_rows = augment_features(X_ts, well_ts, depth_ts)

# Predict test labels
y_ts_hat = train_and_test(X_tr, y_tr, X_ts, well_ts)

# Save predicted labels
test_data['Facies'] = y_ts_hat
test_data.to_csv('Prediction_X_Final.csv')
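
As a final sanity check before submission (optional, our addition), we can look at the distribution of predicted facies codes in the blind wells:


In [ ]:
# Quick look at how the predicted facies codes are distributed.
print(test_data['Facies'].value_counts().sort_index())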
