Facies classification using Machine Learning
This is an "open science" contest designed to introduce people to machine learning with well logs and brainstorm different methods through collaboration with others, so this notebook is based heavily on the introductary notebook with my own modifications.
This data is from the Council Grove gas reservoir in Southwest Kansas. The Panoma Council Grove Field is predominantly a carbonate gas reservoir encompassing 2700 square miles in Southwestern Kansas. This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector and validation (test) data (830 examples from two wells) having the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five from wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.
The seven predictor variables are:
The nine discrete facies (classes of rocks) are:
These facies aren't discrete, and gradually blend into one another. Some have neighboring facies that are rather close. Mislabeling within these neighboring facies can be expected to occur. The following table lists the facies, their abbreviated labels and their approximate neighbors.
Facies | Label | Adjacent Facies |
---|---|---|
1 | SS | 2 |
2 | CSiS | 1,3 |
3 | FSiS | 2 |
4 | SiSh | 5 |
5 | MS | 4,6 |
6 | WS | 5,7 |
7 | D | 6,8 |
8 | PS | 6,7,9 |
9 | BS | 7,8 |
Let's clean up this dataset. The 'Well Name' and 'Formation' columns can be turned into a categorical data type.
=================================================================================================================
In [360]:
## import modules
from __future__ import division
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rcParams['figure.figsize']=(20.0,10.0)
inline_rc = dict(mpl.rcParams)
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import LeavePGroupsOut
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsOneClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from scipy.signal import medfilt
from pandas.tools.plotting import scatter_matrix
import matplotlib.colors as colors
import xgboost as xgb
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score
from classification_utilities import display_cm, display_adj_cm
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import validation_curve
from sklearn.datasets import load_svmlight_files
from xgboost.sklearn import XGBClassifier
from scipy.sparse import vstack
seed = 123
np.random.seed(seed)
In [361]:
# Parameters
feature_names = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE', 'NM_M', 'RELPOS']
### in the one below, formation is renamed NM_M and made into floats
# feature_names = ['NM_M','GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE', 'NM_M', 'RELPOS']
featurePlusFacies_names = ['Facies', 'GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE', 'NM_M', 'RELPOS']
facies_names = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS', 'WS', 'D', 'PS', 'BS']
facies_colors = ['#F4D03F', '#F5B041','#DC7633','#6E2C00', '#1B4F72','#2E86C1', '#AED6F1', '#A569BD', '#196F3D']
formation_names = ['A1 LM', 'A1 SH', 'B1 LM', 'B1 SH', 'B2 LM', 'B2 SH', 'B3 LM','B3 SH', 'B4 LM', 'B4 SH', 'B5 LM', 'B5 SH', 'C LM', 'C SH']
## formation_colors = ['#F4D03F', '#F5B041','#DC7633','#6E2C00', '#1B4F72','#2E86C1', '#AED6F1', '#A569BD', '#196F3D','#FAA03F', '#FAA041','#DCAA33','#6AAC00', '#1BAA72']
formation_colors = ['#000000','#00FF00','#0000FF','#FF0000','#01FFFE','#FFA6FE','#FFDB66','#006401','#010067','#95003A','#007DB5','#FF00F6','#FFEEE8','#774D00']
In [362]:
### setting up options in pandas
from pandas import set_option
set_option("display.max_rows", 10)
pd.options.mode.chained_assignment = None
In [364]:
# Load data for training ata from file
data = pd.read_csv('facies_vectors.csv')
In [431]:
data
Out[431]:
In [500]:
# replaced NM_M column with integers based on Formation column
data.loc[data.Formation == 'A1 LM', 'NM_M'] = 31
data.loc[data.Formation == 'A1 SH', 'NM_M'] = 4
data.loc[data.Formation == 'B1 LM', 'NM_M'] = 33
data.loc[data.Formation == 'B1 SH', 'NM_M'] = 10
data.loc[data.Formation == 'B2 LM', 'NM_M'] = 43
data.loc[data.Formation == 'B2 SH', 'NM_M'] = 16
data.loc[data.Formation == 'B3 LM', 'NM_M'] = 37
data.loc[data.Formation == 'B3 SH', 'NM_M'] = 6
data.loc[data.Formation == 'B4 LM', 'NM_M'] = 35
data.loc[data.Formation == 'B4 SH', 'NM_M'] = 2
data.loc[data.Formation == 'B5 LM', 'NM_M'] = 41
data.loc[data.Formation == 'B5 SH', 'NM_M'] = 12
data.loc[data.Formation == 'C LM', 'NM_M'] = 50
data.loc[data.Formation == 'C SH', 'NM_M'] = 12
In [501]:
data.describe()
Out[501]:
In [502]:
data
Out[502]:
In [503]:
# data = data.drop('NM_M', 1)
In [504]:
data
Out[504]:
In [505]:
data
Out[505]:
In [506]:
X = data[feature_names].values
# XplusFacies = data[featurePlusFacies_names].values
y = data['Facies'].values
In [507]:
well = data['Well Name'].values
depth = data['Depth'].values
# facies = data['Facies'].values
In [508]:
formation = data['NM_M'].values
np.unique(formation)
Out[508]:
In [509]:
data['Facies'].describe()
Out[509]:
In [510]:
data[feature_names].describe()
Out[510]:
In [511]:
well.shape
Out[511]:
In [512]:
facies.shape
Out[512]:
In [513]:
np.unique(facies)
Out[513]:
In [514]:
np.unique(well)
Out[514]:
In [515]:
X.size
Out[515]:
In [516]:
y.size
Out[516]:
In [517]:
# Define function for plotting feature statistics
def plot_feature_stats(X, y, feature_names, facies_colors, facies_names):
# Remove NaN
nan_idx = np.any(np.isnan(X), axis=1)
X = X[np.logical_not(nan_idx), :]
y = y[np.logical_not(nan_idx)]
# Merge features and labels into a single DataFrame
features = pd.DataFrame(X, columns=feature_names)
labels = pd.DataFrame(y, columns=['Facies'])
for f_idx, facies in enumerate(facies_names):
labels[labels[:] == f_idx] = facies
data = pd.concat((labels, features), axis=1)
# Plot features statistics
facies_color_map = {}
for ind, label in enumerate(facies_names):
facies_color_map[label] = facies_colors[ind]
sns.pairplot(data, hue='Facies', palette=facies_color_map, hue_order=list(reversed(facies_names)))
In [518]:
# Feature distribution
plot_feature_stats(X, y, feature_names, facies_colors, facies_names)
mpl.rcParams.update(inline_rc)
using formation_names & formation_colors
In [519]:
# Feature distribution
plot_feature_stats(X, y, feature_names, formation_colors, formation_names)
mpl.rcParams.update(inline_rc)
In [520]:
# Facies per well
for w_idx, w in enumerate(np.unique(well)):
ax = plt.subplot(3, 4, w_idx+1)
hist = np.histogram(y[well == w], bins=np.arange(len(facies_names)+1)+.5)
plt.bar(np.arange(len(hist[0])), hist[0], color=facies_colors, align='center')
ax.set_xticks(np.arange(len(hist[0])))
ax.set_xticklabels(facies_names)
ax.set_title(w)
In [521]:
# Facies per formation
for w_idx, w in enumerate(np.unique(formation)):
ax = plt.subplot(4, 4, w_idx+1)
hist = np.histogram(y[formation == w], bins=np.arange(len(facies_names)+1)+.5)
plt.bar(np.arange(len(hist[0])), hist[0], color=facies_colors, align='center')
ax.set_xticks(np.arange(len(hist[0])))
ax.set_xticklabels(facies_names)
ax.set_title(w)
In [522]:
# Features per well
for w_idx, w in enumerate(np.unique(well)):
ax = plt.subplot(3, 4, w_idx+1)
hist = np.logical_not(np.any(np.isnan(X[well == w, :]), axis=0))
plt.bar(np.arange(len(hist)), hist, color=facies_colors, align='center')
ax.set_xticks(np.arange(len(hist)))
ax.set_xticklabels(feature_names)
ax.set_yticks([0, 1])
ax.set_yticklabels(['miss', 'hit'])
ax.set_title(w)
In [523]:
reg = RandomForestRegressor(max_features='sqrt', n_estimators=50)
DataImpAll = data[feature_names].copy()
DataImp = DataImpAll.dropna(axis = 0, inplace=False)
Ximp=DataImp.loc[:, DataImp.columns != 'PE']
Yimp=DataImp.loc[:, 'PE']
reg.fit(Ximp, Yimp)
X[np.array(DataImpAll.PE.isnull()),4] = reg.predict(DataImpAll.loc[DataImpAll.PE.isnull(),:].drop('PE',axis=1,inplace=False))
next three functions for Feature windows concatenation, Feature gradient computation, and Feature augmentation come from Paolo Bestagini's work.
In [524]:
# Feature windows concatenation function
def augment_features_window(X, N_neig):
# Parameters
N_row = X.shape[0]
N_feat = X.shape[1]
# Zero padding
X = np.vstack((np.zeros((N_neig, N_feat)), X, (np.zeros((N_neig, N_feat)))))
# Loop over windows
X_aug = np.zeros((N_row, N_feat*(2*N_neig+1)))
for r in np.arange(N_row)+N_neig:
this_row = []
for c in np.arange(-N_neig,N_neig+1):
this_row = np.hstack((this_row, X[r+c]))
X_aug[r-N_neig] = this_row
return X_aug
In [525]:
# Feature gradient computation function
def augment_features_gradient(X, depth):
# Compute features gradient
d_diff = np.diff(depth).reshape((-1, 1))
d_diff[d_diff==0] = 0.001
X_diff = np.diff(X, axis=0)
X_grad = X_diff / d_diff
# Compensate for last missing value
X_grad = np.concatenate((X_grad, np.zeros((1, X_grad.shape[1]))))
return X_grad
In [526]:
# Feature augmentation function
def augment_features(X, well, depth, N_neig=1):
# Augment features
X_aug = np.zeros((X.shape[0], X.shape[1]*(N_neig*2+2)))
for w in np.unique(well):
w_idx = np.where(well == w)[0]
X_aug_win = augment_features_window(X[w_idx, :], N_neig)
X_aug_grad = augment_features_gradient(X[w_idx, :], depth[w_idx])
X_aug[w_idx, :] = np.concatenate((X_aug_win, X_aug_grad), axis=1)
# Find padded rows
padded_rows = np.unique(np.where(X_aug[:, 0:7] == np.zeros((1, 7)))[0])
return X_aug, padded_rows
In [527]:
X_aug, padded_rows = augment_features(X, well, depth)
In [528]:
# Initialize model selection methods
lpgo = LeavePGroupsOut(2)
# Generate splits
split_list = []
for train, val in lpgo.split(X, y, groups=data['Well Name']):
hist_tr = np.histogram(y[train], bins=np.arange(len(facies_names)+1)+.5)
hist_val = np.histogram(y[val], bins=np.arange(len(facies_names)+1)+.5)
if np.all(hist_tr[0] != 0) & np.all(hist_val[0] != 0):
split_list.append({'train':train, 'val':val})
# Print splits
for s, split in enumerate(split_list):
print('Split %d' % s)
print(' training: %s' % (data['Well Name'][split['train']].unique()))
print(' validation: %s' % (data['Well Name'][split['val']].unique()))
In [582]:
# Parameters search grid (uncomment parameters for full grid search... may take a lot of time)
md_grid = [2]
# changed gamma_grid to 4 from 3 and then to 2
mcw_grid = [1]
gamma_grid = [0]
# changed gamma_grid to 0 from 0.3
ss_grid = [1]
# changed ss_grid to 0.7 from 1 and then back to 1
csb_grid = [0.8]
#
alpha_grid =[0.2]
lr_grid = [0.05]
ne_grid = [200]
param_grid = []
for N in md_grid:
for M in mcw_grid:
for S in gamma_grid:
for L in ss_grid:
for K in csb_grid:
for P in alpha_grid:
for R in lr_grid:
for E in ne_grid:
param_grid.append({'maxdepth':N,
'minchildweight':M,
'gamma':S,
'subsample':L,
'colsamplebytree':K,
'alpha':P,
'learningrate':R,
'n_estimators':E})
In [578]:
# Train and test a classifier
def train_and_test(X_tr, y_tr, X_v,well_v):
#Feature normalization
scaler = preprocessing.RobustScaler(quantile_range=(25.0, 75.0)).fit(X_tr)
X_tr = scaler.transform(X_tr)
X_v = scaler.transform(X_v)
clf.fit(X_tr, y_tr)
# Test classifier
y_v_hat = clf.predict(X_v)
# Clean isolated facies for each well
for w in np.unique(well_v):
y_v_hat[well_v==w] = medfilt(y_v_hat[well_v==w], kernel_size=5)
return y_v_hat
In [579]:
# For each set of parameters
score_param = []
for param in param_grid:
clf = OneVsOneClassifier(XGBClassifier(
learning_rate = param['learningrate'],
n_estimators=param['n_estimators'],
max_depth=param['maxdepth'],
min_child_weight=param['minchildweight'],
gamma = param['gamma'],
subsample=param['subsample'],
colsample_bytree=param['colsamplebytree'],
reg_alpha = param['alpha'],
nthread = 1,
# changed nthread to 1 instead of 4 due to running on mac without NVIDA
seed = seed,
) , n_jobs=-1)
# For each data split
score_split = []
for split in split_list:
# Remove padded rows
split_train_no_pad = np.setdiff1d(split['train'], padded_rows)
# Select training and validation data from current split
X_tr = X_aug[split_train_no_pad, :]
X_v = X_aug[split['val'], :]
y_tr = y[split_train_no_pad]
y_v = y[split['val']]
# Select well labels for validation data
well_v = well[split['val']]
# Train and test
y_v_hat = train_and_test(X_tr, y_tr, X_v, well_v)
# Score
score = f1_score(y_v, y_v_hat, average='micro')
score_split.append(score)
# Average score for this param
score_param.append(np.mean(score_split))
print('F1 score = %.3f %s' % (score_param[-1], param))
# Best set of parameters
best_idx = np.argmax(score_param)
param_best = param_grid[best_idx]
score_best = score_param[best_idx]
print('\nBest F1 score = %.3f %s' % (score_best, param_best))
In [583]:
for param in param_grid:
clf = OneVsOneClassifier(XGBClassifier(
learning_rate = param['learningrate'],
n_estimators=param['n_estimators'],
max_depth=param['maxdepth'],
min_child_weight=param['minchildweight'],
gamma = param['gamma'],
subsample=param['subsample'],
colsample_bytree=param['colsamplebytree'],
reg_alpha = param['alpha'],
nthread = 1,
# changed nthread to not be listed, it was 4
seed = seed,
) , n_jobs=-1)
In [584]:
clf
Out[584]:
In [601]:
# Load data from file
test_data = pd.read_csv('validation_data_nofacies.csv')
In [602]:
test_data
Out[602]:
In [603]:
# use formation inplace of NM_M column!
test_data.loc[test_data.Formation == 'A1 LM', 'NM_M'] = 31
test_data.loc[test_data.Formation == 'A1 SH', 'NM_M'] = 4
test_data.loc[test_data.Formation == 'B1 LM', 'NM_M'] = 33
test_data.loc[test_data.Formation == 'B1 SH', 'NM_M'] = 10
test_data.loc[test_data.Formation == 'B2 LM', 'NM_M'] = 35
test_data.loc[test_data.Formation == 'B2 SH', 'NM_M'] = 16
test_data.loc[test_data.Formation == 'B3 LM', 'NM_M'] = 37
test_data.loc[test_data.Formation == 'B3 SH', 'NM_M'] = 6
test_data.loc[test_data.Formation == 'B4 LM', 'NM_M'] = 45
test_data.loc[test_data.Formation == 'B4 SH', 'NM_M'] = 2
test_data.loc[test_data.Formation == 'B5 LM', 'NM_M'] = 41
test_data.loc[test_data.Formation == 'B5 SH', 'NM_M'] = 12
test_data.loc[test_data.Formation == 'C LM', 'NM_M'] = 43
test_data.loc[test_data.Formation == 'C SH', 'NM_M'] = 12
In [604]:
test_data
Out[604]:
In [605]:
# Prepare training data
X_tr = X
y_tr = y
# Augment features
X_tr, padded_rows = augment_features(X_tr, well, depth)
# Removed padded rows
X_tr = np.delete(X_tr, padded_rows, axis=0)
y_tr = np.delete(y_tr, padded_rows, axis=0)
In [606]:
# Prepare test data 2
well_ts = test_data['Well Name'].values
depth_ts = test_data['Depth'].values
X_ts = test_data[feature_names].values
# Augment features
X_ts, padded_rows = augment_features(X_ts, well_ts, depth_ts)
In [607]:
X_ts.shape
Out[607]:
In [608]:
X_tr.shape
Out[608]:
In [609]:
y_tr.shape
Out[609]:
In [610]:
well_ts.shape
Out[610]:
In [611]:
# Predict test labels
y_ts_hat = train_and_test(X_tr, y_tr, X_ts, well_ts)
In [612]:
# Save predicted labels
test_data['Facies'] = y_ts_hat
test_data.to_csv('Prediction5_final.csv')
In [613]:
# print test data
test_data
Out[613]:
In [614]:
def make_facies_log_plot(logs, facies_colors):
#make sure logs are sorted by depth
logs = logs.sort_values(by='Depth')
cmap_facies = colors.ListedColormap(
facies_colors[0:len(facies_colors)], 'indexed')
ztop=logs.Depth.min(); zbot=logs.Depth.max()
cluster=np.repeat(np.expand_dims(logs['Facies'].values,1), 100, 1)
f, ax = plt.subplots(nrows=1, ncols=6, figsize=(8, 12))
ax[0].plot(logs.GR, logs.Depth, '-g')
ax[1].plot(logs.ILD_log10, logs.Depth, '-')
ax[2].plot(logs.DeltaPHI, logs.Depth, '-', color='0.5')
ax[3].plot(logs.PHIND, logs.Depth, '-', color='r')
ax[4].plot(logs.PE, logs.Depth, '-', color='black')
im=ax[5].imshow(cluster, interpolation='none', aspect='auto',
cmap=cmap_facies,vmin=1,vmax=9)
divider = make_axes_locatable(ax[5])
cax = divider.append_axes("right", size="20%", pad=0.05)
cbar=plt.colorbar(im, cax=cax)
cbar.set_label((17*' ').join([' SS ', 'CSiS', 'FSiS',
'SiSh', ' MS ', ' WS ', ' D ',
' PS ', ' BS ']))
cbar.set_ticks(range(0,1)); cbar.set_ticklabels('')
for i in range(len(ax)-1):
ax[i].set_ylim(ztop,zbot)
ax[i].invert_yaxis()
ax[i].grid()
ax[i].locator_params(axis='x', nbins=3)
ax[0].set_xlabel("GR")
ax[0].set_xlim(logs.GR.min(),logs.GR.max())
ax[1].set_xlabel("ILD_log10")
ax[1].set_xlim(logs.ILD_log10.min(),logs.ILD_log10.max())
ax[2].set_xlabel("DeltaPHI")
ax[2].set_xlim(logs.DeltaPHI.min(),logs.DeltaPHI.max())
ax[3].set_xlabel("PHIND")
ax[3].set_xlim(logs.PHIND.min(),logs.PHIND.max())
ax[4].set_xlabel("PE")
ax[4].set_xlim(logs.PE.min(),logs.PE.max())
ax[5].set_xlabel('Facies')
ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([])
ax[4].set_yticklabels([]); ax[5].set_yticklabels([])
ax[5].set_xticklabels([])
f.suptitle('Well: %s'%logs.iloc[0]['Well Name'], fontsize=14,y=0.94)
In [615]:
from mpl_toolkits.axes_grid1 import make_axes_locatable
In [616]:
# Plot predicted labels
make_facies_log_plot(
test_data[test_data['Well Name'] == 'STUART'],
facies_colors=facies_colors)
make_facies_log_plot(
test_data[test_data['Well Name'] == 'CRAWFORD'],
facies_colors=facies_colors)
mpl.rcParams.update(inline_rc)
In [ ]:
In [ ]:
In [ ]:
In [ ]: