Two folds OneVsRest RF classification with RF imputation of missing values

Contest entry by :geoLEARN

Martin Blouin, Lorenzo Perozzi, Antoine Caté

in collaboration with Erwan Gloaguen

Original contest notebook by Brendon Hall, Enthought

In this Notebook, a two-folds classification strategy is tested. Missing PE values in test data are imputed using RF Regressor. A OneVsRest Classifer using Random Forest is used as the model for both steps. The first step consists in predicting facies using the feature engineering method we proposed in our previous submission. The second steps takes the probability of occurence of each facies and further feature engineering using rolling windows on each well is performed before implmenting a second random forest to get the final prediction. Cross-validation testing suggests sligthly lower scores than our previous submission, but with a more stable prediction (less difference in success rate between wells.


In [1]:
###### Importing all used packages
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
#import matplotlib as mpl
#import matplotlib.pyplot as plt
#import matplotlib.colors as colors
#from mpl_toolkits.axes_grid1 import make_axes_locatable
#import seaborn as sns

from pandas import set_option
# set_option("display.max_rows", 10)
pd.options.mode.chained_assignment = None

###### Import packages needed for the make_vars functions
import Feature_Engineering as FE

##### import stuff from scikit learn
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import KFold, cross_val_score,LeavePGroupsOut, LeaveOneGroupOut, cross_val_predict
from sklearn.metrics import confusion_matrix, make_scorer, f1_score, accuracy_score, recall_score, precision_score

filename = '../facies_vectors.csv'
df = pd.read_csv(filename)
df.head()


Out[1]:
Facies Formation Well Name Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS
0 3 A1 SH SHRIMPLIN 2793.0 77.45 0.664 9.9 11.915 4.6 1 1.000
1 3 A1 SH SHRIMPLIN 2793.5 78.26 0.661 14.2 12.565 4.1 1 0.979
2 3 A1 SH SHRIMPLIN 2794.0 79.05 0.658 14.8 13.050 3.6 1 0.957
3 3 A1 SH SHRIMPLIN 2794.5 86.10 0.655 13.9 13.115 3.5 1 0.936
4 3 A1 SH SHRIMPLIN 2795.0 74.58 0.647 13.5 13.300 3.4 1 0.915

Missing values (PE) imputation


In [2]:
####### create X_train and y_train
X_train = df[['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'NM_M', 'RELPOS']][df.PE.notnull()]
y_train = df['PE'][df.PE.notnull()]
groups_train = df['Well Name'][df.PE.notnull()]

X_fit = df[['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'NM_M', 'RELPOS']][df.PE.isnull()]

In [3]:
Cl = RandomForestRegressor(n_estimators=100)
Cl.fit(X_train, y_train)
y_predict = Cl.predict(X_fit)
df['PE'][df.PE.isnull()] = y_predict
training_data = df

Feature engineering

As stated in our previous submission, we believe that feature engineering has a high potential for increasing classification success. A strategy for building new variables is explained below.

The dataset is distributed along a series of drillholes intersecting a stratigraphic sequence. Sedimentary facies tend to be deposited in sequences that reflect the evolution of the paleo-environment (variations in water depth, water temperature, biological activity, currents strenght, detrital input, ...). Each facies represents a specific depositional environment and is in contact with facies that represent a progressive transition to an other environment. Thus, there is a relationship between neighbouring samples, and the distribution of the data along drillholes can be as important as data values for predicting facies.

A series of new variables (features) are calculated and tested below to help represent the relationship of neighbouring samples and the overall texture of the data along drillholes. These variables are:

  • detail and approximation coeficients at various levels of two wavelet transforms (using two types of Daubechies wavelets);
  • measures of the local entropy with variable observation windows;
  • measures of the local gradient with variable observation windows;
  • rolling statistical calculations (i.e., mean, standard deviation, min and max) with variable observation windows;
  • ratios between marine and non-marine lithofacies with different observation windows;
  • distances from the nearest marine or non-marine occurence uphole and downhole.

Functions used to build these variables are located in the Feature Engineering python script.

All the data exploration work related to the conception and study of these variables is not presented here.


In [4]:
##### cD From wavelet db1
dwt_db1_cD_df = FE.make_dwt_vars_cD(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                       levels=[1, 2, 3, 4], wavelet='db1')

##### cA From wavelet db1
dwt_db1_cA_df = FE.make_dwt_vars_cA(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                       levels=[1, 2, 3, 4], wavelet='db1')

##### cD From wavelet db3
dwt_db3_cD_df = FE.make_dwt_vars_cD(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                       levels=[1, 2, 3, 4], wavelet='db3')

##### cA From wavelet db3
dwt_db3_cA_df = FE.make_dwt_vars_cA(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                       levels=[1, 2, 3, 4], wavelet='db3')

##### From entropy
entropy_df = FE.make_entropy_vars(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                               l_foots=[2, 3, 4, 5, 7, 10])

###### From gradient
gradient_df = FE.make_gradient_vars(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                 dx_list=[2, 3, 4, 5, 6, 10, 20])

##### From rolling average
moving_av_df = FE.make_moving_av_vars(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                   windows=[1, 2, 5, 10, 20])

##### From rolling standard deviation
moving_std_df = FE.make_moving_std_vars(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                     windows=[3 , 4, 5, 7, 10, 15, 20])

##### From rolling max
moving_max_df = FE.make_moving_max_vars(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                     windows=[3, 4, 5, 7, 10, 15, 20])

##### From rolling min
moving_min_df = FE.make_moving_min_vars(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                     windows=[3 , 4, 5, 7, 10, 15, 20])

###### From rolling NM/M ratio
rolling_marine_ratio_df = FE.make_rolling_marine_ratio_vars(wells_df=training_data, windows=[5, 10, 15, 20, 30, 50, 75, 100, 200])

###### From distance to NM and M, up and down
dist_M_up_df = FE.make_distance_to_M_up_vars(wells_df=training_data)
dist_M_down_df = FE.make_distance_to_M_down_vars(wells_df=training_data)
dist_NM_up_df = FE.make_distance_to_NM_up_vars(wells_df=training_data)
dist_NM_down_df = FE.make_distance_to_NM_down_vars(wells_df=training_data)

In [5]:
list_df_var = [dwt_db1_cD_df, dwt_db1_cA_df, dwt_db3_cD_df, dwt_db3_cA_df,
               entropy_df, gradient_df, moving_av_df, moving_std_df, moving_max_df, moving_min_df,
              rolling_marine_ratio_df, dist_M_up_df, dist_M_down_df, dist_NM_up_df, dist_NM_down_df]
combined_df = training_data
for var_df in list_df_var:
    temp_df = var_df
    combined_df = pd.concat([combined_df,temp_df],axis=1)
combined_df.replace(to_replace=np.nan, value='-1', inplace=True)

Creating the first-fold classification model


In [6]:
X = combined_df.iloc[:, 4:]
y = combined_df['Facies']
groups = combined_df['Well Name']
RF = RandomForestClassifier(n_estimators=100, max_features=0.1, min_samples_leaf=25,
                            min_samples_split=50, class_weight='balanced', random_state=42, n_jobs=-1)
Cl = OneVsRestClassifier(RF, n_jobs=-1)

Creating the dataset for the second-fold classification


In [7]:
###### Doing a first prediction of lithologies using predict_proba
cv=LeaveOneGroupOut().split(X, y, groups)
RF = RandomForestClassifier(n_estimators=100, max_features=0.1, min_samples_leaf=25,
                            min_samples_split=50, class_weight='balanced', random_state=42, n_jobs=-1)
Cl = OneVsRestClassifier(RF, n_jobs=-1)
proba = cross_val_predict(Cl, X, y, cv=cv, method='predict_proba')

In [8]:
###### creating new variables from predict proba
list_facies = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS', 'WS', 'D', 'PS', 'BS']
proba = pd.DataFrame(proba, columns=list_facies)
proba = pd.concat([combined_df.iloc[:, :4], proba], axis=1)

In [9]:
###### From gradient
gradient_df = FE.make_gradient_vars(wells_df=proba, logs=list_facies, dx_list=[2, 3, 4, 5, 6, 10, 20])

##### From rolling average
moving_av_df = FE.make_moving_av_vars(wells_df=proba, logs=list_facies, windows=[1, 2, 5, 10, 20])

##### From rolling standard deviation
moving_std_df = FE.make_moving_std_vars(wells_df=proba, logs=list_facies, windows=[3 , 4, 5, 7, 10, 15, 20])

##### From rolling max
moving_max_df = FE.make_moving_max_vars(wells_df=proba, logs=list_facies, windows=[3, 4, 5, 7, 10, 15, 20])

##### From rolling min
moving_min_df = FE.make_moving_min_vars(wells_df=proba, logs=list_facies, windows=[3 , 4, 5, 7, 10, 15, 20])

list_df_var = [gradient_df, moving_av_df, moving_std_df, moving_max_df, moving_min_df]
combined_df = proba
for var_df in list_df_var:
    combined_df = pd.concat([combined_df,var_df],axis=1)
combined_df.replace(to_replace=np.nan, value='-1', inplace=True)

In [10]:
X2 = combined_df.iloc[:, 4:]

Creating the second-fold classification model


In [11]:
Cl2 = RandomForestClassifier(n_estimators=100, max_features=0.7, min_samples_leaf=0.01,
                            min_samples_split=100, class_weight='balanced', n_jobs=-1)

Importing test data


In [12]:
filename = '../validation_data_nofacies.csv'
test_data = pd.read_csv(filename)
test_data.head(5)


Out[12]:
Formation Well Name Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS
0 A1 SH STUART 2808.0 66.276 0.630 3.3 10.65 3.591 1 1.000
1 A1 SH STUART 2808.5 77.252 0.585 6.5 11.95 3.341 1 0.978
2 A1 SH STUART 2809.0 82.899 0.566 9.4 13.60 3.064 1 0.956
3 A1 SH STUART 2809.5 80.671 0.593 9.5 13.25 2.977 1 0.933
4 A1 SH STUART 2810.0 75.971 0.638 8.7 12.35 3.020 1 0.911

Implementing first fold of classification on test data


In [13]:
##### cD From wavelet db1
dwt_db1_cD_df = FE.make_dwt_vars_cD(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                    levels=[1, 2, 3, 4], wavelet='db1')

##### cA From wavelet db1
dwt_db1_cA_df = FE.make_dwt_vars_cA(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                       levels=[1, 2, 3, 4], wavelet='db1')

##### cD From wavelet db3
dwt_db3_cD_df = FE.make_dwt_vars_cD(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                       levels=[1, 2, 3, 4], wavelet='db3')

##### cA From wavelet db3
dwt_db3_cA_df = FE.make_dwt_vars_cA(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                       levels=[1, 2, 3, 4], wavelet='db3')

##### From entropy
entropy_df = FE.make_entropy_vars(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                               l_foots=[2, 3, 4, 5, 7, 10])

###### From gradient
gradient_df = FE.make_gradient_vars(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                 dx_list=[2, 3, 4, 5, 6, 10, 20])

##### From rolling average
moving_av_df = FE.make_moving_av_vars(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                   windows=[1, 2, 5, 10, 20])

##### From rolling standard deviation
moving_std_df = FE.make_moving_std_vars(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                     windows=[3 , 4, 5, 7, 10, 15, 20])

##### From rolling max
moving_max_df = FE.make_moving_max_vars(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                     windows=[3, 4, 5, 7, 10, 15, 20])

##### From rolling min
moving_min_df = FE.make_moving_min_vars(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                     windows=[3 , 4, 5, 7, 10, 15, 20])

###### From rolling NM/M ratio
rolling_marine_ratio_df = FE.make_rolling_marine_ratio_vars(wells_df=test_data, windows=[5, 10, 15, 20, 30, 50, 75, 100, 200])

###### From distance to NM and M, up and down
dist_M_up_df = FE.make_distance_to_M_up_vars(wells_df=test_data)
dist_M_down_df = FE.make_distance_to_M_down_vars(wells_df=test_data)
dist_NM_up_df = FE.make_distance_to_NM_up_vars(wells_df=test_data)
dist_NM_down_df = FE.make_distance_to_NM_down_vars(wells_df=test_data)

In [14]:
combined_test_df = test_data
list_df_var = [dwt_db1_cD_df, dwt_db1_cA_df, dwt_db3_cD_df, dwt_db3_cA_df,
               entropy_df, gradient_df, moving_av_df, moving_std_df, moving_max_df, moving_min_df,
              rolling_marine_ratio_df, dist_M_up_df, dist_M_down_df, dist_NM_up_df, dist_NM_down_df]
for var_df in list_df_var:
    temp_df = var_df
    combined_test_df = pd.concat([combined_test_df,temp_df],axis=1)
combined_test_df.replace(to_replace=np.nan, value='-1', inplace=True)

X_test = combined_test_df.iloc[:, 3:]

In [15]:
Cl.fit(X, y)
proba_test = Cl.predict_proba(X_test)

Implementing the second fold of classification on test data


In [16]:
###### creating new variables from predict proba
list_facies = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS', 'WS', 'D', 'PS', 'BS']
proba_test = pd.DataFrame(proba_test, columns=list_facies)
proba_test = pd.concat([combined_test_df.iloc[:, :3], proba_test], axis=1)

In [17]:
###### From gradient
gradient_df = FE.make_gradient_vars(wells_df=proba_test, logs=list_facies, dx_list=[2, 3, 4, 5, 6, 10, 20])

##### From rolling average
moving_av_df = FE.make_moving_av_vars(wells_df=proba_test, logs=list_facies, windows=[1, 2, 5, 10, 20])

##### From rolling standard deviation
moving_std_df = FE.make_moving_std_vars(wells_df=proba_test, logs=list_facies, windows=[3 , 4, 5, 7, 10, 15, 20])

##### From rolling max
moving_max_df = FE.make_moving_max_vars(wells_df=proba_test, logs=list_facies, windows=[3, 4, 5, 7, 10, 15, 20])

##### From rolling min
moving_min_df = FE.make_moving_min_vars(wells_df=proba_test, logs=list_facies, windows=[3 , 4, 5, 7, 10, 15, 20])

list_df_var = [gradient_df, moving_av_df, moving_std_df, moving_max_df, moving_min_df]
combined_test_df = proba_test
for var_df in list_df_var:
    combined_test_df = pd.concat([combined_test_df,var_df],axis=1)
combined_test_df.replace(to_replace=np.nan, value='-1', inplace=True)

In [19]:
X_test2 = combined_test_df.iloc[:, 3:]
test_pred_df = combined_test_df[['Well Name', 'Depth']]
for i in range(100):
    Cl2.fit(X2, y)
    y_test = Cl2.predict(X_test2)
    y_test = pd.DataFrame(y_test, columns=['Predicted Facies #' + str(i)])
    test_pred_df = pd.concat([test_pred_df, y_test], axis=1)
print (test_pred_df.shape)
test_pred_df.head()


(830, 102)
Out[19]:
Well Name Depth Predicted Facies #0 Predicted Facies #1 Predicted Facies #2 Predicted Facies #3 Predicted Facies #4 Predicted Facies #5 Predicted Facies #6 Predicted Facies #7 ... Predicted Facies #90 Predicted Facies #91 Predicted Facies #92 Predicted Facies #93 Predicted Facies #94 Predicted Facies #95 Predicted Facies #96 Predicted Facies #97 Predicted Facies #98 Predicted Facies #99
0 STUART 2808.0 2 1 1 1 1 2 1 1 ... 1 1 1 1 1 1 1 1 1 1
1 STUART 2808.5 2 2 2 2 2 2 2 2 ... 2 2 2 2 2 2 2 2 2 2
2 STUART 2809.0 2 2 2 2 2 2 2 2 ... 2 2 2 2 2 2 2 2 2 2
3 STUART 2809.5 2 2 2 2 2 2 2 2 ... 2 2 2 2 2 2 2 2 2 2
4 STUART 2810.0 2 2 2 2 2 2 2 2 ... 2 2 2 2 2 2 2 2 2 2

5 rows × 102 columns


In [20]:
test_pred_df.to_pickle('Prediction_submission4_TwoSteps.pkl')

In [ ]: