Facies classification using Random Forest

Contest entry by geoLEARN:

Martin Blouin, Lorenzo Perozzi and Antoine Caté

in collaboration with Erwan Gloaguen

Original contest notebook by Brendon Hall, Enthought

In this notebook we will train a machine learning algorithm to predict facies from well log data. The dataset comes from a class exercise from The University of Kansas on Neural Networks and Fuzzy Systems. This exercise is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more info on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).

The dataset consists of log data from nine wells that have been labeled with a facies type based on observation of core. We will use this log data to train a Random Forest model to classify facies types.

Original approach

  • We focus on feature engineering rather than on algorithm optimization.
  • The prediction is split into two steps: 1) predict the non-marine facies; 2) predict the marine facies. Validation scores are computed after the predictions from the two steps have been stacked back together.

N.B. We make extensive use of an interpreted variable (NM_M). We are well aware that this may not be representative of a true machine learning exercise, where interpreted information would not be available. But since we do have access to it, why not use it. Moreover, our approach could easily be adapted into a two-step prediction in which we would first discriminate marine from non-marine sediments and then classify the facies within each group.
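
As a minimal, hypothetical sketch of that adapted two-step scheme (the arrays X_train, y_train, nm_m_train and X_test are placeholders; this is not the exact workflow used below):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def two_step_predict(X_train, y_train, nm_m_train, X_test):
    # Step 1: learn to discriminate marine (NM_M == 2) from non-marine (NM_M == 1) sediments
    group_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
    group_clf.fit(X_train, nm_m_train)
    group_pred = group_clf.predict(X_test)

    # Step 2: one facies classifier per group, trained only on that group's samples
    nm_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
    nm_clf.fit(X_train[nm_m_train == 1], y_train[nm_m_train == 1])
    m_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
    m_clf.fit(X_train[nm_m_train == 2], y_train[nm_m_train == 2])

    # Route each test sample to the facies model of its predicted group
    y_pred = np.empty(len(X_test), dtype=int)
    for flag, model in ((1, nm_clf), (2, m_clf)):
        mask = group_pred == flag
        if mask.any():
            y_pred[mask] = model.predict(X_test[mask])
    return y_pred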


In [1]:
%matplotlib inline
# to install the watermark magic command: pip install watermark
%load_ext watermark 
%watermark -v -p numpy,scipy,pandas,matplotlib,seaborn,sklearn


CPython 3.5.2
IPython 4.2.0

numpy 1.11.1
scipy 0.17.1
pandas 0.18.1
matplotlib 1.5.1
seaborn 0.7.1
sklearn 0.18

Exploring the dataset

First, we will examine the data set we will use to train the classifier.


In [2]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from mpl_toolkits.axes_grid1 import make_axes_locatable
from sklearn import preprocessing

from pandas import set_option
set_option("display.max_rows",10)
pd.options.mode.chained_assignment = None

# turn off ipython warnings 
import warnings
warnings.filterwarnings('ignore')

filename = '../facies_vectors.csv'
training_data = pd.read_csv(filename)
training_data.describe()


Out[2]:
Facies Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS
count 4149.000000 4149.000000 4149.000000 4149.000000 4149.000000 4149.000000 3232.000000 4149.000000 4149.000000
mean 4.503254 2906.867438 64.933985 0.659566 4.402484 13.201066 3.725014 1.518438 0.521852
std 2.474324 133.300164 30.302530 0.252703 5.274947 7.132846 0.896152 0.499720 0.286644
min 1.000000 2573.500000 10.149000 -0.025949 -21.832000 0.550000 0.200000 1.000000 0.000000
25% 2.000000 2821.500000 44.730000 0.498000 1.600000 8.500000 NaN 1.000000 0.277000
50% 4.000000 2932.500000 64.990000 0.639000 4.300000 12.020000 NaN 2.000000 0.528000
75% 6.000000 3007.000000 79.438000 0.822000 7.500000 16.050000 NaN 2.000000 0.769000
max 9.000000 3138.000000 361.150000 1.800000 19.312000 84.400000 8.094000 2.000000 1.000000

This data is from the Council Grove gas reservoir in Southwest Kansas. The Panoma Council Grove Field is predominantly a carbonate gas reservoir encompassing 2700 square miles in Southwestern Kansas. This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector. Validation (test) data (830 examples from two wells) have the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.

The seven predictor variables are:

  • Five wireline log measurements: gamma ray (GR), resistivity (ILD_log10), photoelectric effect (PE), neutron-density porosity difference (DeltaPHI) and average neutron-density porosity (PHIND).
  • Two geologic constraining variables: a nonmarine-marine indicator (NM_M) and relative position (RELPOS).

The nine discrete facies (classes of rocks) are:

  1. Nonmarine sandstone
  2. Nonmarine coarse siltstone
  3. Nonmarine fine siltstone
  4. Marine siltstone and shale
  5. Mudstone (limestone)
  6. Wackestone (limestone)
  7. Dolomite
  8. Packstone-grainstone (limestone)
  9. Phylloid-algal bafflestone (limestone)

These facies aren't truly discrete and gradually blend into one another; some neighboring facies are quite similar, and mislabeling between them can be expected. The following table lists the facies, their abbreviated labels and their approximate neighbors.

Facies Label Adjacent Facies
1 SS 2
2 CSiS 1,3
3 FSiS 2
4 SiSh 5
5 MS 4,6
6 WS 5,7
7 D 6,8
8 PS 6,7,9
9 BS 7,8
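
This adjacency table can be encoded directly. As a small added sketch (not used for the scores reported below), an 'adjacent accuracy' would also count a prediction as correct when it falls in a neighboring facies:

import numpy as np

# Neighbors of each facies (1-9), as listed in the table above
adjacent_facies = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4, 6],
                   6: [5, 7], 7: [6, 8], 8: [6, 7, 9], 9: [7, 8]}

def adjacent_accuracy(y_true, y_pred):
    # Fraction of samples predicted as the true facies or one of its neighbors
    hits = [p == t or p in adjacent_facies[t] for t, p in zip(y_true, y_pred)]
    return np.mean(hits)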

Let's clean up this dataset. The 'Well Name' and 'Formation' columns can be turned into a categorical data type.


In [3]:
training_data['Well Name'] = training_data['Well Name'].astype('category')
training_data['Formation'] = training_data['Formation'].astype('category')
training_data['Well Name'].unique()


Out[3]:
[SHRIMPLIN, ALEXANDER D, SHANKLE, LUKE G U, KIMZEY A, CROSS H CATTLE, NOLAN, Recruit F9, NEWBY, CHURCHMAN BIBLE]
Categories (10, object): [SHRIMPLIN, ALEXANDER D, SHANKLE, LUKE G U, ..., NOLAN, Recruit F9, NEWBY, CHURCHMAN BIBLE]

These are the names of the 10 training wells in the Council Grove reservoir. Data has been recruited into pseudo-well 'Recruit F9' to better represent facies 9, the Phylloid-algal bafflestone.

Before we plot the well data, let's define a color map so the facies are represented by consistent colors in all the plots in this tutorial. We also create the abbreviated facies labels and add those to the training_data dataframe.


In [4]:
# 1=sandstone  2=c_siltstone   3=f_siltstone 
# 4=marine_silt_shale 5=mudstone 6=wackestone 7=dolomite
# 8=packstone 9=bafflestone
facies_colors = ['#F4D03F', '#F5B041','#DC7633','#6E2C00',
       '#1B4F72','#2E86C1', '#AED6F1', '#A569BD', '#196F3D']

facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS',
                 'WS', 'D','PS', 'BS']
#facies_color_map is a dictionary that maps facies labels
#to their respective colors
facies_color_map = {}
for ind, label in enumerate(facies_labels):
    facies_color_map[label] = facies_colors[ind]

def label_facies(row, labels):
    return labels[ row['Facies'] -1]
    
training_data.loc[:,'FaciesLabels'] = training_data.apply(lambda row: label_facies(row, facies_labels), axis=1)
training_data.describe()


Out[4]:
Facies Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS
count 4149.000000 4149.000000 4149.000000 4149.000000 4149.000000 4149.000000 3232.000000 4149.000000 4149.000000
mean 4.503254 2906.867438 64.933985 0.659566 4.402484 13.201066 3.725014 1.518438 0.521852
std 2.474324 133.300164 30.302530 0.252703 5.274947 7.132846 0.896152 0.499720 0.286644
min 1.000000 2573.500000 10.149000 -0.025949 -21.832000 0.550000 0.200000 1.000000 0.000000
25% 2.000000 2821.500000 44.730000 0.498000 1.600000 8.500000 NaN 1.000000 0.277000
50% 4.000000 2932.500000 64.990000 0.639000 4.300000 12.020000 NaN 2.000000 0.528000
75% 6.000000 3007.000000 79.438000 0.822000 7.500000 16.050000 NaN 2.000000 0.769000
max 9.000000 3138.000000 361.150000 1.800000 19.312000 84.400000 8.094000 2.000000 1.000000

We build a function to create two new variables: the absolute distance from a deposit to the marine context above it (younger), and the distance to the marine context below it (older).

Sedimentary facies tend to be deposited in sequences that reflect the evolution of the paleo-environment (variations in water depth, water temperature, biological activity, current strength, detrital input, ...). Each facies represents a specific depositional environment and is in contact with facies that represent a progressive transition to another environment. The distance to a marine context introduces information on the depositional environment of the samples neighbouring the predicted sample and thus improves the classification. This new variable illustrates our approach, which favours combining machine learning algorithms with feature engineering adapted to the target variables.


In [5]:
def make_dist_mar_vars(wells_df):
 
    grouped = wells_df.groupby(['Well Name'])
    new_df = pd.DataFrame()
    
    for key in grouped.groups.keys():
    
        NM_M = grouped.get_group(key)['NM_M'].values

        #We create a temporary dataframe that we reset for every well
        temp_df = pd.DataFrame()
        temp_df['Depth'] = grouped.get_group(key)['Depth']
        temp_df['Well Name'] = [key for _ in range(len(NM_M))]

        #We initialize new variables
        dist_mar_up = np.zeros(len(NM_M))
        dist_mar_down = np.zeros(len(NM_M))

        # A counter tracking the distance from the marine deposit above (and, in the second loop, below)
        # We initialize them to -99999 since we do not know what's above the first log
        count = -99999
        
        #we build them in two separate loops
        for i in range(len(NM_M)):

            if ((NM_M[i] == 1) & (count>-99999)):

                count+=0.5
                dist_mar_up[i] += count

            elif NM_M[i] == 2:

                count=0

            else:
                dist_mar_up[i] = count

        #********************************************#
        #we reset count
        count = -99999
        
        for i in range(len(NM_M)-1,-1,-1):
               
            if ((NM_M[i] == 1) & (count>-99999)):

                count+=0.5
                dist_mar_down[i] += count

            elif NM_M[i] == 2:            
                count=0

            else:
                dist_mar_down[i] = count
                
        #********************************************#       

        temp_df['dist_mar_up'] = dist_mar_up
        temp_df['dist_mar_down'] = dist_mar_down
    
        # We append each well variable to a larger dataframe
        # We use a dataframe to preserve the index
        new_df = new_df.append(temp_df) 
        
    new_df = new_df.sort_index()
    new_df = new_df.drop(['Well Name','Depth'],axis=1)
    
    # We use concat rather than merge, which can create duplicate rows that would then need to be dropped
    return pd.concat([wells_df,new_df],axis=1)

training_data = make_dist_mar_vars(training_data)
training_data.head()


Out[5]:
Facies Formation Well Name Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS FaciesLabels dist_mar_up dist_mar_down
0 3 A1 SH SHRIMPLIN 2793.0 77.45 0.664 9.9 11.915 4.6 1 1.000 FSiS -99999.0 21.5
1 3 A1 SH SHRIMPLIN 2793.5 78.26 0.661 14.2 12.565 4.1 1 0.979 FSiS -99999.0 21.0
2 3 A1 SH SHRIMPLIN 2794.0 79.05 0.658 14.8 13.050 3.6 1 0.957 FSiS -99999.0 20.5
3 3 A1 SH SHRIMPLIN 2794.5 86.10 0.655 13.9 13.115 3.5 1 0.936 FSiS -99999.0 20.0
4 3 A1 SH SHRIMPLIN 2795.0 74.58 0.647 13.5 13.300 3.4 1 0.915 FSiS -99999.0 19.5

Encoding Formation labels

The nature of the 'Formation' variable is not entirely clear, and we do not know how it has been obtained. In this Notebook, we include it in the predictor variables. However, if 'Formation' appears to be in part derived from 'Facies', it should be removed from the predictor variables.

Formation labels are encoded as integer values.
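
As a quick optional check (not part of the original workflow), one could inspect how strongly Formation co-varies with Facies before deciding to keep it as a predictor:

# Counts of each facies within each formation; a near one-to-one mapping would suggest leakage
pd.crosstab(training_data['Formation'], training_data['Facies'])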


In [6]:
LE = preprocessing.LabelEncoder()
training_data['Formation_category'] = LE.fit_transform(training_data.Formation)

Let's see how well the NM_M variable discriminates facies.


In [7]:
import seaborn as sns
training_data['NM_M_label'] = training_data['NM_M'].apply(lambda x: 'non_marine' if x == 1 else 'marine')

plt.figure(figsize=(8,6))

sns.countplot(x="FaciesLabels", hue="NM_M_label", 
              order=['SS', 'CSiS', 'FSiS', 'SiSh','MS','WS','D','PS','BS'], 
              data=training_data, palette = "RdBu");


Now, as the NM_M variable discriminates facies fairly well, we assume that facies 1-3 (SS to FSiS) are non-marine and facies 4-9 (SiSh to BS) are marine.


In [8]:
training_data.head()


Out[8]:
Facies Formation Well Name Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS FaciesLabels dist_mar_up dist_mar_down Formation_category NM_M_label
0 3 A1 SH SHRIMPLIN 2793.0 77.45 0.664 9.9 11.915 4.6 1 1.000 FSiS -99999.0 21.5 1 non_marine
1 3 A1 SH SHRIMPLIN 2793.5 78.26 0.661 14.2 12.565 4.1 1 0.979 FSiS -99999.0 21.0 1 non_marine
2 3 A1 SH SHRIMPLIN 2794.0 79.05 0.658 14.8 13.050 3.6 1 0.957 FSiS -99999.0 20.5 1 non_marine
3 3 A1 SH SHRIMPLIN 2794.5 86.10 0.655 13.9 13.115 3.5 1 0.936 FSiS -99999.0 20.0 1 non_marine
4 3 A1 SH SHRIMPLIN 2795.0 74.58 0.647 13.5 13.300 3.4 1 0.915 FSiS -99999.0 19.5 1 non_marine

In [9]:
training_data = training_data.drop(['NM_M_label','FaciesLabels'], axis=1)
NM_training_data = training_data[training_data['NM_M'] == 1]
M_training_data = training_data[training_data['NM_M'] == 2]

Instead of dropping samples with NaN values for the PE variable, we decided to replace NaN with an 'out of range' value (i.e. -99999). This allows the classifier to take all samples into account despite the missing PE values.


In [10]:
NM_training_data.replace(to_replace=np.nan,value=-99999,inplace=True)
M_training_data.replace(to_replace=np.nan,value=-99999,inplace=True)

We prepare the lists of unused variables to drop for the marine and non-marine datasets.

Non-marine facies are dominated by clastic sedimentary rocks, and marine facies by carbonate sedimentary rocks. The best predictive variables for each group differ. Thus, marine and non-marine rocks are classified separately here, and a different set of predictive variables is chosen for each group.


In [11]:
nm_drop_list = ['Formation', 'Well Name', 'Depth', 
                'Facies','DeltaPHI','PE','NM_M','RELPOS']

m_drop_list = ['Formation', 'Well Name', 'Depth', 
               'Facies','dist_mar_down','dist_mar_up','PE','NM_M','PHIND']
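
For readability (an added sketch, not in the original notebook), the predictors actually kept for each group are the complements of these lists:

nm_features = [c for c in NM_training_data.columns if c not in nm_drop_list]
m_features = [c for c in M_training_data.columns if c not in m_drop_list]
print('Non-marine predictors:', nm_features)
print('Marine predictors:', m_features)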

In [12]:
from sklearn import ensemble
from sklearn import metrics
clf = ensemble.RandomForestClassifier(n_estimators=500,n_jobs=-1)

names = list(np.unique(training_data['Well Name']))

nm_grouped = NM_training_data.groupby(['Well Name'])
m_grouped = M_training_data.groupby(['Well Name'])

new_df = pd.DataFrame()

scores = []

for name in names:

    temp_df = pd.DataFrame()

    # We need to isolate Recruit F9 since it has only marine facies
    if name == 'Recruit F9':

        #Build list of well names and remove blind well (and remove recruit F9 for NM)
        m_train_names = names.copy()
        m_train_names.remove(name)

        # Do it for marine sediments
        m_test = m_grouped.get_group(name)
        m_test_depth = m_test.Depth
        m_X_test = m_test.drop(m_drop_list, axis=1).values
        y_test = m_test['Facies'].values

        m_train = pd.DataFrame()

        for train_name in m_train_names:
            m_train = m_train.append(m_grouped.get_group(train_name))

        id_train_M = m_train.Facies >= 4
        m_train = m_train [id_train_M]
        m_X_train = m_train.drop(m_drop_list, axis=1).values
        m_y_train = m_train['Facies'].values

        #Then we do random forest classification and prediction
        clf.fit(m_X_train, m_y_train)
        y_pred = clf.predict(m_X_test)

    else:

        #Build list of well names and remove blind well (and remove recruit F9 for NM)
        m_train_names = names.copy()
        m_train_names.remove(name)   

        nm_train_names = m_train_names.copy()
        nm_train_names.remove('Recruit F9')


        # Do it for non-marine sediments
        nm_test = nm_grouped.get_group(name)
        nm_test_depth = nm_test.Depth
        nm_X_test = nm_test.drop(nm_drop_list, axis=1).values
        nm_y_test = nm_test['Facies'].values

        nm_train = pd.DataFrame()

        for train_name in nm_train_names:
            nm_train = nm_train.append(nm_grouped.get_group(train_name))

        id_train_NM = nm_train.Facies <= 3
        nm_train = nm_train [id_train_NM]
        nm_X_train = nm_train.drop(nm_drop_list, axis=1).values
        nm_y_train = nm_train['Facies'].values

        #Then we do random forest classification and prediction
        clf.fit(nm_X_train, nm_y_train)
        nm_y_pred = clf.predict(nm_X_test)

        print(clf.feature_importances_)

        #*********************************************************************#

        # Do it for marine sediments
        m_test = m_grouped.get_group(name)
        m_test_depth = m_test.Depth
        m_X_test = m_test.drop(m_drop_list, axis=1).values
        m_y_test = m_test['Facies'].values

        m_train = pd.DataFrame()

        for train_name in m_train_names:
            m_train = m_train.append(m_grouped.get_group(train_name))

        id_train_M = m_train.Facies >= 4
        m_train = m_train [id_train_M]
        m_X_train = m_train.drop(m_drop_list, axis=1).values
        m_y_train = m_train['Facies'].values

        #Then we do random forest classification and prediction
        clf.fit(m_X_train, m_y_train)
        m_y_pred = clf.predict(m_X_test)

        print(clf.feature_importances_)

        #================================================================#
        #                         combine results
        #================================================================#

        y_test = np.hstack((nm_y_test,m_y_test))
        y_pred = np.hstack((nm_y_pred,m_y_pred))

    #Scoring
    conf_mat = metrics.confusion_matrix(y_test,y_pred)
    print(conf_mat)
    try:
        score = metrics.f1_score(y_test, y_pred,average='weighted')
    except:
        #this exception is for Recruit F9
        score = metrics.f1_score(y_test, y_pred)
    scores.append(score)
    print('********')
    print('Blind well is {0}, F1 score : {1:.4%}\n'.format(name,score))

    if name == 'Recruit F9':
        depth = m_test_depth
    else:
        depth = np.hstack((nm_test_depth,m_test_depth))

    idx = np.argsort(depth)
    temp_df['Depth'] = depth[idx]
    temp_df['True Facies'] = y_test[idx]
    temp_df['Predicted Facies'] = y_pred[idx]

    temp_df['Well Name'] = [name for _ in range(len(depth))]
    new_df = new_df.append(temp_df)

print("="*30)
print('*********** RESULT ***********')
print("="*30)
print('\nAverage  F1-score is {:.4%}'.format(np.mean(scores)))


[ 0.14697974  0.19261253  0.18381727  0.17769512  0.18521185  0.1136835 ]
[ 0.24010757  0.23580513  0.19873271  0.22246258  0.10289201]
[[ 0  0  0  0  0  0  0  0  0]
 [ 0 96 21  0  0  0  0  0  0]
 [ 1 36 54  0  0  0  0  0  0]
 [ 0  0  0 34  0  2  2  6  0]
 [ 0  0  0  3  9  4  1  9  0]
 [ 0  0  0 14 11 12  6 26  0]
 [ 0  0  0  0  4  0 12  0  0]
 [ 0  0  0  6  6 14 17 48  7]
 [ 0  0  0  0  0  0  0  0  5]]
********
Blind well is ALEXANDER D, F1 score : 56.6869%

[ 0.15260149  0.19059186  0.19341511  0.17462172  0.1795732   0.10919661]
[ 0.24561755  0.23272532  0.19774223  0.22712629  0.09678861]
[[ 0  4  4  0  0  0  0  0  0]
 [ 0 34 21  0  0  1  0  0  0]
 [ 0  5 39  0  0  4  0  3  0]
 [ 0  0  0  4  0  8  0  1  0]
 [ 0  0  1  0  7 16  0  6  0]
 [ 0  0  0  9  0 54  0 24  0]
 [ 0  0  1  0  1  1  0 24  7]
 [ 0  1  2  0  2 19  0 50  1]
 [ 0  0  0  0  0  0  0  0 50]]
********
Blind well is CHURCHMAN BIBLE, F1 score : 55.1339%

[ 0.1576827   0.15157848  0.22979379  0.17501712  0.19002444  0.09590347]
[ 0.24185244  0.23012407  0.19302004  0.23073584  0.1042676 ]
[[65 76 17  0  0  0  0  0  0]
 [ 1 74 63  0  0  0  0  4  0]
 [ 0  4 37  0  0  1  0  4  1]
 [ 0  0  4  6  1 12  0  0  2]
 [ 4  0  3  0  5 13  0  3  0]
 [ 0  0  0  0  1 23  1  6  0]
 [ 0  0  0  0  0  0  0  2  0]
 [ 0  0  4  0  3 23  2 36  0]
 [ 0  0  0  0  0  0  0  0  0]]
********
Blind well is CROSS H CATTLE, F1 score : 50.2310%

[ 0.1590843   0.19182197  0.19182749  0.17704258  0.17705693  0.10316674]
[ 0.24529998  0.23953386  0.1913017   0.22366797  0.10019648]
[[ 0  4  5  0  0  0  0  0  0]
 [ 0 76  9  0  0  0  0  0  0]
 [ 1 36 37  0  0  0  0  0  0]
 [ 0  0  0 25  0 18  0  0  0]
 [ 0  0  0  4  2 21  6 20  0]
 [ 0  1  0  4  0 29  3 12  2]
 [ 0  0  0  0  2  2  9 14  0]
 [ 0  0  0  1  1 18  5 65  0]
 [ 0  0  0  0  0  0  0  0  7]]
********
Blind well is KIMZEY A, F1 score : 53.4579%

[ 0.14820498  0.19560181  0.1936073   0.17146413  0.18954427  0.10157751]
[ 0.24326298  0.22860323  0.19728321  0.22678613  0.10406445]
[[ 0  0  0  0  0  0  0  0  0]
 [ 6 91 20  0  0  0  0  0  0]
 [ 9 26 94  0  0  0  0  0  0]
 [ 0  0  0 29  3  3  0  0  0]
 [ 0  0  0  0  0  1  0  1  0]
 [ 0  0  0  6 16 42  0 20  0]
 [ 0  0  0  0  0  2  6  9  3]
 [ 0  0  0  2  2 15  0 55  0]
 [ 0  0  0  0  0  0  0  0  0]]
********
Blind well is LUKE G U, F1 score : 71.0244%

[ 0.15494882  0.18830023  0.19854817  0.17337321  0.17160476  0.1132248 ]
[ 0.24581734  0.22035168  0.19781314  0.23037869  0.10563915]
[[67 31  0  0  0  0  0  0]
 [30 50  0  0  0  0  0  0]
 [ 0  0 35  0 15  0  8  0]
 [ 0  0  2  3 10  3  6  4]
 [ 0  0  5 18 32  2 38  1]
 [ 0  0  0  0  5  5  6  0]
 [ 0  0  0  0 12  0 43  1]
 [ 0  0  2  0  1  0 17 11]]
********
Blind well is NEWBY, F1 score : 52.7507%

[ 0.15746267  0.19864105  0.1857417   0.18087319  0.17285021  0.10443117]
[ 0.23383888  0.23708749  0.19409552  0.23174869  0.10322942]
[[ 0  4  0  0  0  0  0  0  0]
 [ 6 98 13  0  0  0  0  1  0]
 [ 0 18 47  0  0  0  0  3  0]
 [ 1  0  0  6  3 16  0  2  0]
 [ 0  2  1  0 12 16  2 11  3]
 [ 0  1  0  2  2 12  0 13  0]
 [ 0  0  0  1  2  1  0  0  0]
 [ 0  2  3  5  5 28 18 53  2]
 [ 0  0  0  0  0  0  0  0  0]]
********
Blind well is NOLAN, F1 score : 56.9320%

[[ 0  0  0]
 [ 0  0  0]
 [ 1  6 73]]
********
Blind well is Recruit F9, F1 score : 95.4248%

[ 0.14437667  0.21553682  0.18959205  0.17136952  0.17839371  0.10073124]
[ 0.2431058   0.22548482  0.20024289  0.22799877  0.10316772]
[[59 28  2  0  0  0  0  0  0]
 [23 57  9  0  0  0  0  0  0]
 [ 1 51 65  0  0  0  0  0  0]
 [ 0  0  0  1  2  2  0  2  0]
 [ 0  0  0  6  2  8  2  1  0]
 [ 0  0  1  2  2 46  2 18  0]
 [ 0  0  0  0  0  0  5 10  2]
 [ 0  0  0  1  4  4  2 18 11]
 [ 0  0  0  0  0  0  0  0  0]]
********
Blind well is SHANKLE, F1 score : 57.9313%

[ 0.15387716  0.19274262  0.19747853  0.17742982  0.18009982  0.09837205]
[ 0.23738826  0.23143893  0.20185246  0.23110058  0.09821976]
[[102  16   0   0   0   0   0   0]
 [ 38  85   0   0   0   0   0   0]
 [  0   0  14   1   2   0   1   0]
 [  0   0   1  16  37   0   9   0]
 [  0   0   7   1  37   0  18   0]
 [  0   0   0   1   1   0   3   0]
 [  0   0   0   2   8   4  50   5]
 [  0   0   0   0   0   0   0  12]]
********
Blind well is SHRIMPLIN, F1 score : 65.9621%

==============================
*********** RESULT ***********
==============================

Average  F1-score is 61.5535%
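
The importance vectors printed above are unlabeled; as a small added convenience they can be paired with the predictor names (clf keeps its last fit from the loop, i.e. the marine model for the final blind well):

m_features = [c for c in M_training_data.columns if c not in m_drop_list]
for feat, imp in zip(m_features, clf.feature_importances_):
    print('{:<20s} {:.3f}'.format(feat, imp))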

Applying the classification model to new data


In [13]:
filename = '../validation_data_nofacies.csv'
test_data = pd.read_csv(filename)
test_data


Out[13]:
Formation Well Name Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS
0 A1 SH STUART 2808.0 66.276 0.630 3.300 10.650 3.591 1 1.000
1 A1 SH STUART 2808.5 77.252 0.585 6.500 11.950 3.341 1 0.978
2 A1 SH STUART 2809.0 82.899 0.566 9.400 13.600 3.064 1 0.956
3 A1 SH STUART 2809.5 80.671 0.593 9.500 13.250 2.977 1 0.933
4 A1 SH STUART 2810.0 75.971 0.638 8.700 12.350 3.020 1 0.911
... ... ... ... ... ... ... ... ... ... ...
825 C SH CRAWFORD 3158.5 86.078 0.554 5.040 16.150 3.161 1 0.639
826 C SH CRAWFORD 3159.0 88.855 0.539 5.560 16.750 3.118 1 0.611
827 C SH CRAWFORD 3159.5 90.490 0.530 6.360 16.780 3.168 1 0.583
828 C SH CRAWFORD 3160.0 90.975 0.522 7.035 16.995 3.154 1 0.556
829 C SH CRAWFORD 3160.5 90.108 0.513 7.505 17.595 3.125 1 0.528

830 rows × 10 columns

Create new variables for the test dataset

  • we compute the absolute distance from a deposit to the marine context above it (younger) and to the marine context below it (older)
  • we encode the formation category

In [14]:
# absolute distance
test_data = make_dist_mar_vars(test_data)
# formation category encoding (reuse the encoder fitted on the training data
# so the integer codes stay consistent between training and test sets)
test_data['Formation_category'] = LE.transform(test_data.Formation)

We train our classifiers on the entire available dataset (10 wells), keeping the separation between marine and non-marine sediments.


In [15]:
#  nm = non-marine and m = marine
nm_drop_list = ['Formation', 'Well Name', 'Depth','DeltaPHI','PE',
                'NM_M','RELPOS']
m_drop_list = ['Formation', 'Well Name', 'Depth',
               'dist_mar_down','dist_mar_up','PE','NM_M','PHIND']

Train the models and make predictions for the test wells


In [16]:
grouped = test_data.groupby('Well Name')
new_df = pd.DataFrame()
    
for name in grouped.groups.keys():
    
    temp_df = pd.DataFrame()
    temp_df['Depth'] = grouped.get_group(name)['Depth']

    well = grouped.get_group(name)
    nm_test = well[well['NM_M'] == 1]
    m_test = well[well['NM_M'] == 2]

    # Do it for non-marine sediments
    nm_test_depth = nm_test.Depth
    nm_X_test = nm_test.drop(nm_drop_list, axis=1).values

    id_train_NM = NM_training_data.Facies <= 3


    nm_train = NM_training_data[id_train_NM]

    nm_X_train = nm_train.drop(nm_drop_list, axis=1)
    nm_X_train = nm_X_train.drop('Facies', axis=1).values
    nm_y_train = nm_train['Facies'].values

    #Then we do random forest classification
    clf.fit(nm_X_train, nm_y_train)
    nm_y_pred = clf.predict(nm_X_test)

    #*********************************************************************#

    # Do it for marine sediments
    m_test_depth = m_test.Depth
    m_X_test = m_test.drop(m_drop_list, axis=1).values

    id_train_M = M_training_data.Facies >= 4
    m_train = M_training_data[id_train_M]

    m_X_train = m_train.drop('Facies', axis=1)
    m_X_train = m_X_train.drop(m_drop_list,axis=1).values
    m_y_train = m_train['Facies'].values

    #Then we do random forest classification
    clf.fit(m_X_train, m_y_train)
    m_y_pred = clf.predict(m_X_test)

    #================================================================#
    #                         combine results
    #================================================================#

    y_pred = np.hstack((nm_y_pred,m_y_pred))


    depth = np.hstack((nm_test_depth,m_test_depth))
    idx = np.argsort(depth)
    
    temp_df['Predicted Facies'] = y_pred[idx]
    temp_df['Well Name'] = [name for _ in range(len(depth))]
    
    new_df = new_df.append(temp_df)
new_df = new_df.sort_index()

In [17]:
new_df.to_pickle('Prediction_blind_wells.pkl')
new_df


Out[17]:
Depth Predicted Facies Well Name
0 2808.0 2 STUART
1 2808.5 3 STUART
2 2809.0 3 STUART
3 2809.5 3 STUART
4 2810.0 3 STUART
... ... ... ...
825 3158.5 3 CRAWFORD
826 3159.0 3 CRAWFORD
827 3159.5 2 CRAWFORD
828 3160.0 2 CRAWFORD
829 3160.5 2 CRAWFORD

830 rows × 3 columns
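
As a brief added check (a sketch, not part of the original notebook), the distribution of predicted facies per test well can be summarized with:

pd.crosstab(new_df['Well Name'], new_df['Predicted Facies'])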


The code and ideas in this notebook, by geoLEARN, are licensed under a Creative Commons Attribution 4.0 International License.