in collaboration with Erwan Gloaguen
In this notebook we will train a machine learning algorithm to predict facies from well log data. The dataset comes from a class exercise from The University of Kansas on Neural Networks and Fuzzy Systems. This exercise is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more info on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).
The dataset consists of log data from nine wells that have been labeled with a facies type based on observation of core. We will use this log data to train a Random Forest model to classify facies types.
In [4]:
###### Importing all used packages
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from mpl_toolkits.axes_grid1 import make_axes_locatable
import seaborn as sns
from pandas import set_option
# set_option("display.max_rows", 10)
pd.options.mode.chained_assignment = None
###### Import packages needed for the make_vars functions
import Feature_Engineering as FE
##### import stuff from scikit learn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score,LeavePGroupsOut, LeaveOneGroupOut, cross_val_predict
from sklearn.metrics import confusion_matrix, make_scorer, f1_score, accuracy_score, recall_score, precision_score
filename = '../facies_vectors.csv'
training_data = pd.read_csv(filename)
training_data.head()
Out[4]:
In [5]:
training_data.describe()
Out[5]:
A complete description of the dataset is given in the original contest notebook by Brendon Hall, Enthought. Five measured rock properties (GR, ILD_log10, DeltaPHI, PE and PHIND) and two interpreted geological properties are given as raw predictor variables for the prediction of the "Facies" class.
As stated in our previous submission, we believe that feature engineering has a high potential for increasing classification success. A strategy for building new variables is explained below.
The dataset is distributed along a series of drillholes intersecting a stratigraphic sequence. Sedimentary facies tend to be deposited in sequences that reflect the evolution of the paleo-environment (variations in water depth, water temperature, biological activity, current strength, detrital input, ...). Each facies represents a specific depositional environment and is in contact with facies that represent a progressive transition to another environment. Thus, there is a relationship between neighbouring samples, and the distribution of the data along drillholes can be as important as data values for predicting facies.
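A series of new variables (features) are calculated and tested below to help represent the relationship of neighbouring samples and the overall texture of the data along drillholes. These variables are: wavelet detail (cD) and approximation (cA) coefficients for the db1 and db3 wavelets, entropy, gradient, rolling average, rolling standard deviation, rolling maximum, rolling minimum, a rolling non-marine/marine (NM/M) ratio, and the distance to marine (M) and non-marine (NM) samples, measured up and down the drillhole.
The functions used to build these variables are located in the Feature_Engineering Python script.
The data exploration work behind the design and study of these variables is not presented here.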
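To make the idea of a neighbourhood feature concrete, here is a minimal sketch of a centred rolling mean computed separately for each well. It is written in plain Python so that the logic is explicit. The function name `rolling_mean_per_well`, the toy rows and the choice to shrink the window at well ends are assumptions made for illustration. They are not taken from the Feature_Engineering script, whose implementation is not shown here.

```python
from collections import defaultdict

def rolling_mean_per_well(rows, log, window):
    """Centred rolling mean of `log`, computed separately for each well.

    `rows` is a list of dicts with 'Well Name', 'Depth' and `log` keys.
    Returns a list of values aligned with `rows`.
    """
    # Group row indices by well so that a window never straddles two wells.
    by_well = defaultdict(list)
    for i, row in enumerate(rows):
        by_well[row['Well Name']].append(i)

    out = [None] * len(rows)
    half = window // 2
    for idx in by_well.values():
        idx.sort(key=lambda i: rows[i]['Depth'])  # order samples by depth
        values = [rows[i][log] for i in idx]
        for pos, i in enumerate(idx):
            lo = max(0, pos - half)
            hi = min(len(values), pos + half + 1)
            # Assumption: the window simply shrinks at the top and bottom of a well.
            out[i] = sum(values[lo:hi]) / (hi - lo)
    return out

# Hypothetical toy data: two wells, one log.
rows = [
    {'Well Name': 'A', 'Depth': 1.0, 'GR': 10.0},
    {'Well Name': 'A', 'Depth': 1.5, 'GR': 20.0},
    {'Well Name': 'A', 'Depth': 2.0, 'GR': 60.0},
    {'Well Name': 'B', 'Depth': 1.0, 'GR': 100.0},
    {'Well Name': 'B', 'Depth': 1.5, 'GR': 200.0},
]
gr_mean_3 = rolling_mean_per_well(rows, 'GR', window=3)
print(gr_mean_3)
```

Grouping by well before applying the window is the important design choice: samples from different drillholes are not neighbours, even when they sit next to each other in the table.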
In [6]:
##### cD From wavelet db1
dwt_db1_cD_df = FE.make_dwt_vars_cD(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
levels=[1, 2, 3, 4], wavelet='db1')
##### cA From wavelet db1
dwt_db1_cA_df = FE.make_dwt_vars_cA(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
levels=[1, 2, 3, 4], wavelet='db1')
##### cD From wavelet db3
dwt_db3_cD_df = FE.make_dwt_vars_cD(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
levels=[1, 2, 3, 4], wavelet='db3')
##### cA From wavelet db3
dwt_db3_cA_df = FE.make_dwt_vars_cA(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
levels=[1, 2, 3, 4], wavelet='db3')
##### From entropy
entropy_df = FE.make_entropy_vars(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
l_foots=[2, 3, 4, 5, 7, 10])
###### From gradient
gradient_df = FE.make_gradient_vars(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
dx_list=[2, 3, 4, 5, 6, 10, 20])
##### From rolling average
moving_av_df = FE.make_moving_av_vars(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
windows=[1, 2, 5, 10, 20])
##### From rolling standard deviation
moving_std_df = FE.make_moving_std_vars(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
windows=[3 , 4, 5, 7, 10, 15, 20])
##### From rolling max
moving_max_df = FE.make_moving_max_vars(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
windows=[3, 4, 5, 7, 10, 15, 20])
##### From rolling min
moving_min_df = FE.make_moving_min_vars(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
windows=[3 , 4, 5, 7, 10, 15, 20])
###### From rolling NM/M ratio
rolling_marine_ratio_df = FE.make_rolling_marine_ratio_vars(wells_df=training_data, windows=[5, 10, 15, 20, 30, 50, 75, 100, 200])
###### From distance to NM and M, up and down
dist_M_up_df = FE.make_distance_to_M_up_vars(wells_df=training_data)
dist_M_down_df = FE.make_distance_to_M_down_vars(wells_df=training_data)
dist_NM_up_df = FE.make_distance_to_NM_up_vars(wells_df=training_data)
dist_NM_down_df = FE.make_distance_to_NM_down_vars(wells_df=training_data)
In [7]:
list_df_var = [dwt_db1_cD_df, dwt_db1_cA_df, dwt_db3_cD_df, dwt_db3_cA_df,
entropy_df, gradient_df, moving_av_df, moving_std_df, moving_max_df, moving_min_df,
rolling_marine_ratio_df, dist_M_up_df, dist_M_down_df, dist_NM_up_df, dist_NM_down_df]
combined_df = training_data
for var_df in list_df_var:
    combined_df = pd.concat([combined_df, var_df], axis=1)
# Replace missing values with a numeric sentinel so the columns stay numeric
combined_df.replace(to_replace=np.nan, value=-1, inplace=True)
print (combined_df.shape)
combined_df.head(5)
Out[7]:
A Random Forest model is built here to test the effect of these new variables on predictive power. Algorithm parameters have been tuned with the LeaveOneGroupOut cross-validation strategy so as to take into account the non-stationarity between the training and testing sets. The minimum size of individual tree leaves and nodes has been increased as far as possible without significantly increasing the bias, so as to reduce the variance of the prediction. Box plots of a series of scores obtained through cross-validation are presented below.
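The group-wise cross-validation used in this notebook (LeaveOneGroupOut and LeavePGroupsOut, with wells as groups) holds out entire wells rather than random samples. The sketch below reproduces that splitting logic in plain Python to show what each fold contains. The helper `leave_p_groups_out` and the well labels are hypothetical, and this is not scikit-learn's implementation.

```python
from itertools import combinations

def leave_p_groups_out(groups, p):
    """Yield (train_idx, test_idx), holding out every combination of p groups."""
    unique = sorted(set(groups))
    for held_out in combinations(unique, p):
        test_idx = [i for i, g in enumerate(groups) if g in held_out]
        train_idx = [i for i, g in enumerate(groups) if g not in held_out]
        yield train_idx, test_idx

# Hypothetical well label for each sample.
groups = ['W1', 'W1', 'W2', 'W2', 'W3', 'W4']
splits = list(leave_p_groups_out(groups, p=2))
for train_idx, test_idx in splits:
    print('train:', train_idx, 'test:', test_idx)
```

With four wells and `p=2` this yields six folds, and no well ever appears on both sides of a fold. Setting `p=1` gives the leave-one-well-out case.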
In [8]:
X = combined_df.iloc[:, 4:]
y = combined_df['Facies']
groups = combined_df['Well Name']
In [25]:
scoring_param = ['accuracy', 'recall_weighted', 'precision_weighted','f1_weighted']
scores = []
Cl = RandomForestClassifier(n_estimators=100, max_features=0.1, min_samples_leaf=25,
min_samples_split=50, class_weight='balanced', random_state=42, n_jobs=-1)
lpgo = LeavePGroupsOut(n_groups=2)
for scoring in scoring_param:
    # split() returns a one-shot generator, so it is rebuilt for each metric
    cv = lpgo.split(X, y, groups)
    validated = cross_val_score(Cl, X, y, scoring=scoring, cv=cv, n_jobs=-1)
    scores.append(validated)
scores = np.array(scores)
scores = np.swapaxes(scores, 0, 1)
scores = pd.DataFrame(data=scores, columns=scoring_param)
In [26]:
sns.set_style('white')
fig,ax = plt.subplots(figsize=(8,6))
sns.boxplot(data=scores)
plt.xlabel('scoring parameters')
plt.ylabel('score')
plt.title('Classification scores for tuned parameters');
The individual contribution of each feature to the classification (i.e., the feature importances) can be obtained from a Random Forest classifier. This gives a good idea of the classification power of individual features and helps to understand which type of feature engineering is the most promising.
Caution should be taken when interpreting feature importances, as highly correlated variables tend to share their classification power between themselves and will rank lower than uncorrelated variables.
In [11]:
####### Evaluation of feature importances
Cl = RandomForestClassifier(n_estimators=75, max_features=0.1, min_samples_leaf=25,
min_samples_split=50, class_weight='balanced', random_state=42,oob_score=True, n_jobs=-1)
Cl.fit(X, y)
print ('OOB estimate of accuracy for facies classification using all features: %s' % str(Cl.oob_score_))
importances = Cl.feature_importances_
std = np.std([tree.feature_importances_ for tree in Cl.estimators_], axis=0)
indices = np.argsort(importances)[::-1]
print("Feature ranking:")
Vars = list(X.columns.values)
for f in range(X.shape[1]):
    print("%d. feature %d %s (%f)" % (f + 1, indices[f], Vars[indices[f]], importances[indices[f]]))
In [12]:
sns.set_style('white')
fig,ax = plt.subplots(figsize=(15,5))
ax.bar(range(X.shape[1]), importances[indices],color="r", align="center")
plt.ylabel("Feature importance")
plt.xlabel('Ranked features')
plt.xticks([])  # hide x ticks: there are too many features to label individually
plt.xlim([-1, X.shape[1]]);
Features derived from raw geological variables tend to have the highest classification power. Rolling min, max and mean tend to have better classification power than raw data. Wavelet approximation coefficients tend to have similar or lower classification power than raw data. Features expressing the local texture of the data (entropy, gradient, standard deviation and wavelet detail coefficients) have a low classification power but still participate in the prediction.
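For readers unfamiliar with the wavelet terms used above, the difference between approximation and detail coefficients can be illustrated with a single level of the Haar transform (the db1 wavelet). This is a minimal plain-Python sketch on made-up values. The function `haar_dwt` is hypothetical and is not how the Feature_Engineering script computes its features, and the sign convention of the detail coefficients varies between libraries.

```python
import math

def haar_dwt(signal):
    """One level of the Haar (db1) transform on an even-length sequence.

    Returns (cA, cD): scaled pairwise sums (approximation) and
    scaled pairwise differences (detail).
    """
    if len(signal) % 2:
        raise ValueError('this sketch expects an even number of samples')
    s = math.sqrt(2.0)
    cA = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    cD = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return cA, cD

# Made-up gamma-ray values: a flat pair followed by a sharp jump.
gr = [50.0, 52.0, 51.0, 90.0]
cA, cD = haar_dwt(gr)
print('approximation:', cA)
print('detail:', cD)
```

The approximation coefficients behave like a smoothed, downsampled copy of the log, which is consistent with their ranking close to the raw data. The detail coefficients respond only to local contrasts, such as the jump in the second pair, which is why they are grouped with the texture features.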
In [13]:
######## Confusion matrix from this tuning
cv=LeaveOneGroupOut().split(X, y, groups)
y_pred = cross_val_predict(Cl, X, y, cv=cv, n_jobs=-1)
conf_mat = confusion_matrix(y, y_pred)
list_facies = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS', 'WS', 'D', 'PS', 'BS']
conf_mat = pd.DataFrame(conf_mat, columns=list_facies, index=list_facies)
conf_mat.head(10)
Out[13]:
In [15]:
filename = '../validation_data_nofacies.csv'
test_data = pd.read_csv(filename)
test_data.head(5)
Out[15]:
In [16]:
##### cD From wavelet db1
dwt_db1_cD_df = FE.make_dwt_vars_cD(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
levels=[1, 2, 3, 4], wavelet='db1')
##### cA From wavelet db1
dwt_db1_cA_df = FE.make_dwt_vars_cA(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
levels=[1, 2, 3, 4], wavelet='db1')
##### cD From wavelet db3
dwt_db3_cD_df = FE.make_dwt_vars_cD(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
levels=[1, 2, 3, 4], wavelet='db3')
##### cA From wavelet db3
dwt_db3_cA_df = FE.make_dwt_vars_cA(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
levels=[1, 2, 3, 4], wavelet='db3')
##### From entropy
entropy_df = FE.make_entropy_vars(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
l_foots=[2, 3, 4, 5, 7, 10])
###### From gradient
gradient_df = FE.make_gradient_vars(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
dx_list=[2, 3, 4, 5, 6, 10, 20])
##### From rolling average
moving_av_df = FE.make_moving_av_vars(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
windows=[1, 2, 5, 10, 20])
##### From rolling standard deviation
moving_std_df = FE.make_moving_std_vars(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
windows=[3 , 4, 5, 7, 10, 15, 20])
##### From rolling max
moving_max_df = FE.make_moving_max_vars(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
windows=[3, 4, 5, 7, 10, 15, 20])
##### From rolling min
moving_min_df = FE.make_moving_min_vars(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
windows=[3 , 4, 5, 7, 10, 15, 20])
###### From rolling NM/M ratio
rolling_marine_ratio_df = FE.make_rolling_marine_ratio_vars(wells_df=test_data, windows=[5, 10, 15, 20, 30, 50, 75, 100, 200])
###### From distance to NM and M, up and down
dist_M_up_df = FE.make_distance_to_M_up_vars(wells_df=test_data)
dist_M_down_df = FE.make_distance_to_M_down_vars(wells_df=test_data)
dist_NM_up_df = FE.make_distance_to_NM_up_vars(wells_df=test_data)
dist_NM_down_df = FE.make_distance_to_NM_down_vars(wells_df=test_data)
In [17]:
combined_test_df = test_data
list_df_var = [dwt_db1_cD_df, dwt_db1_cA_df, dwt_db3_cD_df, dwt_db3_cA_df,
entropy_df, gradient_df, moving_av_df, moving_std_df, moving_max_df, moving_min_df,
rolling_marine_ratio_df, dist_M_up_df, dist_M_down_df, dist_NM_up_df, dist_NM_down_df]
for var_df in list_df_var:
    combined_test_df = pd.concat([combined_test_df, var_df], axis=1)
# Replace missing values with a numeric sentinel so the columns stay numeric
combined_test_df.replace(to_replace=np.nan, value=-99999, inplace=True)
X_test = combined_test_df.iloc[:, 3:]
print (combined_test_df.shape)
combined_test_df.head(5)
Out[17]:
In [18]:
Cl = RandomForestClassifier(n_estimators=100, max_features=0.1, min_samples_leaf=25,
min_samples_split=50, class_weight='balanced', random_state=42, n_jobs=-1)
Cl.fit(X, y)
y_test = Cl.predict(X_test)
y_test = pd.DataFrame(y_test, columns=['Predicted Facies'])
test_pred_df = pd.concat([combined_test_df[['Well Name', 'Depth']], y_test], axis=1)
test_pred_df.head()
Out[18]:
In [19]:
test_pred_df.to_pickle('Prediction_blind_wells_RF_c.pkl')