in collaboration with Erwan Gloaguen
In this notebook we will train a machine learning algorithm to predict facies from well log data. The dataset comes from a class exercise from The University of Kansas on Neural Networks and Fuzzy Systems. This exercise is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more info on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).
The dataset consists of log data from nine wells that have been labeled with a facies type based on observation of core. We will use this log data to train a Random Forest model to classify facies types.
N.B. We make extensive use of an interpreted variable (NM_M). We are well aware that this may not be representative of a true machine learning scenario, where interpreted information is not available. But since we have access to it here, why not use it? Moreover, we could easily adapt our approach into a two-step prediction, where we would first discriminate marine from non-marine sediments and then classify the facies within each group.
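For illustration, here is a minimal sketch of that two-step idea (this is not the workflow used below, which takes the provided NM_M variable as given); the feature list and hyperparameters are illustrative assumptions, not tuned choices:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('../facies_vectors.csv').dropna()
logs = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']  # wireline logs only

# Step 1: predict the marine/non-marine indicator from the logs
step1 = RandomForestClassifier(n_estimators=100, n_jobs=-1)
step1.fit(df[logs], df['NM_M'])

# Step 2: train one facies classifier per group, keyed by the step-1 label
step2 = {}
for group in (1, 2):  # 1 = non-marine, 2 = marine
    subset = df[df['NM_M'] == group]
    clf_g = RandomForestClassifier(n_estimators=100, n_jobs=-1)
    clf_g.fit(subset[logs], subset['Facies'])
    step2[group] = clf_g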
In [1]:
%matplotlib inline
# to install the watermark magic command: pip install watermark
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,matplotlib,seaborn,sklearn
In [2]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from mpl_toolkits.axes_grid1 import make_axes_locatable
from sklearn import preprocessing
from pandas import set_option
set_option("display.max_rows",10)
pd.options.mode.chained_assignment = None
# turn off ipython warnings
import warnings
warnings.filterwarnings('ignore')
filename = '../facies_vectors.csv'
training_data = pd.read_csv(filename)
training_data.describe()
Out[2]:
This data is from the Council Grove gas reservoir in Southwest Kansas. The Panoma Council Grove Field is predominantly a carbonate gas reservoir encompassing 2700 square miles in Southwestern Kansas. This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector. Validation (test) data (830 examples from two wells) have the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.
The seven predictor variables are:

- Five wireline log measurements: gamma ray (GR), resistivity (ILD_log10), photoelectric effect (PE), neutron-density porosity difference (DeltaPHI), and average neutron-density porosity (PHIND).
- Two geologic constraining variables: a nonmarine-marine indicator (NM_M) and relative position within a unit (RELPOS).
The nine discrete facies (classes of rocks) are:

1. Nonmarine sandstone (SS)
2. Nonmarine coarse siltstone (CSiS)
3. Nonmarine fine siltstone (FSiS)
4. Marine siltstone and shale (SiSh)
5. Mudstone (MS)
6. Wackestone (WS)
7. Dolomite (D)
8. Packstone-grainstone (PS)
9. Phylloid-algal bafflestone (BS)
These facies aren't discrete, and gradually blend into one another. Some have neighboring facies that are rather close. Mislabeling within these neighboring facies can be expected to occur. The following table lists the facies, their abbreviated labels and their approximate neighbors.
Facies | Label | Adjacent Facies |
---|---|---|
1 | SS | 2 |
2 | CSiS | 1,3 |
3 | FSiS | 2 |
4 | SiSh | 5 |
5 | MS | 4,6 |
6 | WS | 5,7 |
7 | D | 6,8 |
8 | PS | 6,7,9 |
9 | BS | 7,8 |
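Given this expected mislabeling between neighbours, a lenient score that also counts a prediction as correct when it falls in an adjacent facies can complement the strict F1 score. A minimal sketch (the adjacency map transcribes the table above; the function name is ours):

import numpy as np

# Adjacent facies from the table above, keyed by facies number
adjacent_facies = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4, 6],
                   6: [5, 7], 7: [6, 8], 8: [6, 7, 9], 9: [7, 8]}

def adjacent_accuracy(y_true, y_pred):
    """Fraction of predictions hitting the true facies or a neighbour."""
    hits = [p == t or p in adjacent_facies[t] for t, p in zip(y_true, y_pred)]
    return np.mean(hits)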
Let's clean up this dataset. The 'Well Name' and 'Formation' columns can be turned into a categorical data type.
In [3]:
training_data['Well Name'] = training_data['Well Name'].astype('category')
training_data['Formation'] = training_data['Formation'].astype('category')
training_data['Well Name'].unique()
Out[3]:
These are the names of the 10 training wells in the Council Grove reservoir. Data has been recruited into pseudo-well 'Recruit F9' to better represent facies 9, the Phylloid-algal bafflestone.
Before we plot the well data, let's define a color map so the facies are represented by a consistent color in all the plots in this tutorial. We also create the abbreviated facies labels and add them to the training_data dataframe.
In [4]:
# 1=sandstone 2=c_siltstone 3=f_siltstone
# 4=marine_silt_shale 5=mudstone 6=wackestone 7=dolomite
# 8=packstone 9=bafflestone
facies_colors = ['#F4D03F', '#F5B041','#DC7633','#6E2C00',
'#1B4F72','#2E86C1', '#AED6F1', '#A569BD', '#196F3D']
facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS',
'WS', 'D','PS', 'BS']
#facies_color_map is a dictionary that maps facies labels
#to their respective colors
facies_color_map = {}
for ind, label in enumerate(facies_labels):
    facies_color_map[label] = facies_colors[ind]

def label_facies(row, labels):
    return labels[row['Facies'] - 1]
training_data.loc[:,'FaciesLabels'] = training_data.apply(lambda row: label_facies(row, facies_labels), axis=1)
training_data.describe()
Out[4]:
Sedimentary facies tend to be deposited in sequences that reflect the evolution of the paleo-environment (variations in water depth, water temperature, biological activity, current strength, detrital input, etc.). Each facies represents a specific depositional environment and is in contact with facies that represent a progressive transition to another environment. The distance to a marine context introduces information about the depositional environment of the samples neighbouring the predicted sample, and thus improves the classification. This new variable illustrates our approach, which favours combining machine learning algorithms with feature engineering adapted to the target variables.
In [5]:
def make_dist_mar_vars(wells_df):
    grouped = wells_df.groupby(['Well Name'])
    new_df = pd.DataFrame()
    for key in grouped.groups.keys():
        NM_M = grouped.get_group(key)['NM_M'].values
        # We create a temporary dataframe that we reset for every well
        temp_df = pd.DataFrame()
        temp_df['Depth'] = grouped.get_group(key)['Depth']
        temp_df['Well Name'] = [key for _ in range(len(NM_M))]
        # We initialize the new variables
        dist_mar_up = np.zeros(len(NM_M))
        dist_mar_down = np.zeros(len(NM_M))
        # One variable counts the distance (in feet) from the marine deposit
        # above, the other the distance from the marine deposit below.
        # We initialize the counter to -99999 since we do not know what lies
        # above the first logged sample.
        count = -99999
        # We build the two variables in two separate loops
        for i in range(len(NM_M)):
            if (NM_M[i] == 1) & (count > -99999):
                count += 0.5
                dist_mar_up[i] += count
            elif NM_M[i] == 2:
                count = 0
            else:
                dist_mar_up[i] = count
        # ********************************************#
        # We reset the counter
        count = -99999
        for i in range(len(NM_M) - 1, -1, -1):
            if (NM_M[i] == 1) & (count > -99999):
                count += 0.5
                dist_mar_down[i] += count
            elif NM_M[i] == 2:
                count = 0
            else:
                dist_mar_down[i] = count
        # ********************************************#
        temp_df['dist_mar_up'] = dist_mar_up
        temp_df['dist_mar_down'] = dist_mar_down
        # We append each well's variables to a larger dataframe
        # We use a dataframe to preserve the index
        new_df = new_df.append(temp_df)

    new_df = new_df.sort_index()
    new_df = new_df.drop(['Well Name', 'Depth'], axis=1)
    # We don't use merge, as it creates duplicates that we would later
    # have to drop
    return pd.concat([wells_df, new_df], axis=1)
training_data = make_dist_mar_vars(training_data)
training_data.head()
Out[5]:
The nature of the 'Formation' variable is not entirely clear, and we do not know how it has been obtained. In this Notebook, we include it in the predictor variables. However, if 'Formation' appears to be in part derived from 'Facies', it should be removed from the predictor variables.
Formation labels are encoded as integer values.
In [6]:
LE = preprocessing.LabelEncoder()
training_data['Formation_category'] = LE.fit_transform(training_data.Formation)
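To check which integer each formation name was assigned (useful when encoding the validation data later), the fitted encoder's classes_ attribute can be inspected; a quick sketch:

# classes_ holds the formation names in the order of their integer codes
formation_codes = dict(zip(LE.classes_, range(len(LE.classes_))))
print(formation_codes)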
In [7]:
import seaborn as sns
training_data['NM_M_label'] = training_data['NM_M'].apply(lambda x: 'non_marine' if x == 1 else 'marine')
plt.figure(figsize=(8, 6))
sns.countplot(x="FaciesLabels", hue="NM_M_label",
              order=['SS', 'CSiS', 'FSiS', 'SiSh', 'MS', 'WS', 'D', 'PS', 'BS'],
              data=training_data, palette="RdBu");
Now, as the NM_M variable discriminates the facies fairly well, we assume that facies 1-3 (SS to FSiS) are non-marine and facies 4-9 (SiSh to BS) are marine.
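That split can be checked quantitatively with a crosstab of facies against the indicator; a quick sketch:

# Counts of non-marine (1) vs marine (2) samples per facies
pd.crosstab(training_data['Facies'], training_data['NM_M'])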
In [8]:
training_data.head()
Out[8]:
In [9]:
training_data = training_data.drop(['NM_M_label','FaciesLabels'], axis=1)
NM_training_data = training_data[training_data['NM_M'] == 1]
M_training_data = training_data[training_data['NM_M'] == 2]
Instead of dropping samples with NaN values for the PE variable, we decided to replace NaN with an 'out of range' value (i.e. -99999). This allows the classifier to take all the samples into account despite the missing PE values.
In [10]:
NM_training_data.replace(to_replace=np.nan,value=-99999,inplace=True)
M_training_data.replace(to_replace=np.nan,value=-99999,inplace=True)
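For reference, an alternative to the sentinel value would be to impute the missing PE values from the data; a minimal sketch using scikit-learn's SimpleImputer (available from scikit-learn 0.20, and not used in this notebook):

# Alternative (not used here): impute missing PE values with the median
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
pe_filled = imputer.fit_transform(training_data[['PE']])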
Non-marine facies are dominated by clastic sedimentary rocks, and marine facies by carbonate sedimentary rocks. The best predictive variables for each group differ. Thus, marine and non-marine rocks are classified separately here, and a different set of predictive variables is chosen for each group.
In [11]:
nm_drop_list = ['Formation', 'Well Name', 'Depth',
'Facies','DeltaPHI','PE','NM_M','RELPOS']
m_drop_list = ['Formation', 'Well Name', 'Depth',
'Facies','dist_mar_down','dist_mar_up','PE','NM_M','PHIND']
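Since the lists above are defined by what is dropped, it can help to print what is kept for each classifier; a quick sketch:

# Features that remain for each classifier after dropping
print('Non-marine features:',
      [c for c in training_data.columns if c not in nm_drop_list])
print('Marine features:',
      [c for c in training_data.columns if c not in m_drop_list])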
In [12]:
from sklearn import ensemble
from sklearn import metrics
clf = ensemble.RandomForestClassifier(n_estimators=500,n_jobs=-1)
names = list(np.unique(training_data['Well Name']))
nm_grouped = NM_training_data.groupby(['Well Name'])
m_grouped = M_training_data.groupby(['Well Name'])
new_df = pd.DataFrame()
scores = []
for name in names:
    temp_df = pd.DataFrame()
    # We need to isolate Recruit F9 since it has only marine facies
    if name == 'Recruit F9':
        # Build the list of training well names and remove the blind well
        m_train_names = names.copy()
        m_train_names.remove(name)
        # Do it for marine sediments
        m_test = m_grouped.get_group(name)
        m_test_depth = m_test.Depth
        m_X_test = m_test.drop(m_drop_list, axis=1).values
        y_test = m_test['Facies'].values
        m_train = pd.DataFrame()
        for train_name in m_train_names:
            m_train = m_train.append(m_grouped.get_group(train_name))
        id_train_M = m_train.Facies >= 4
        m_train = m_train[id_train_M]
        m_X_train = m_train.drop(m_drop_list, axis=1).values
        m_y_train = m_train['Facies'].values
        # Then we do random forest classification and prediction
        clf.fit(m_X_train, m_y_train)
        y_pred = clf.predict(m_X_test)
    else:
        # Build the lists of training well names, removing the blind well
        # (and Recruit F9 for the non-marine classifier)
        m_train_names = names.copy()
        m_train_names.remove(name)
        nm_train_names = m_train_names.copy()
        nm_train_names.remove('Recruit F9')
        # Do it for non-marine sediments
        nm_test = nm_grouped.get_group(name)
        nm_test_depth = nm_test.Depth
        nm_X_test = nm_test.drop(nm_drop_list, axis=1).values
        nm_y_test = nm_test['Facies'].values
        nm_train = pd.DataFrame()
        for train_name in nm_train_names:
            nm_train = nm_train.append(nm_grouped.get_group(train_name))
        id_train_NM = nm_train.Facies <= 3
        nm_train = nm_train[id_train_NM]
        nm_X_train = nm_train.drop(nm_drop_list, axis=1).values
        nm_y_train = nm_train['Facies'].values
        # Then we do random forest classification and prediction
        clf.fit(nm_X_train, nm_y_train)
        nm_y_pred = clf.predict(nm_X_test)
        print(clf.feature_importances_)
        # ****************************************************************#
        # Do it for marine sediments
        m_test = m_grouped.get_group(name)
        m_test_depth = m_test.Depth
        m_X_test = m_test.drop(m_drop_list, axis=1).values
        m_y_test = m_test['Facies'].values
        m_train = pd.DataFrame()
        for train_name in m_train_names:
            m_train = m_train.append(m_grouped.get_group(train_name))
        id_train_M = m_train.Facies >= 4
        m_train = m_train[id_train_M]
        m_X_train = m_train.drop(m_drop_list, axis=1).values
        m_y_train = m_train['Facies'].values
        # Then we do random forest classification and prediction
        clf.fit(m_X_train, m_y_train)
        m_y_pred = clf.predict(m_X_test)
        print(clf.feature_importances_)
        # ================================================================#
        # Combine results
        # ================================================================#
        y_test = np.hstack((nm_y_test, m_y_test))
        y_pred = np.hstack((nm_y_pred, m_y_pred))
    # Scoring
    conf_mat = metrics.confusion_matrix(y_test, y_pred)
    print(conf_mat)
    try:
        score = metrics.f1_score(y_test, y_pred, average='weighted')
    except:
        # This exception is for Recruit F9
        score = metrics.f1_score(y_test, y_pred)
    scores.append(score)
    print('********')
    print('Blind well is {0}, F1 score : {1:.4%}\n'.format(name, score))
    if name == 'Recruit F9':
        depth = m_test_depth.values  # .values so positional indexing works below
    else:
        depth = np.hstack((nm_test_depth, m_test_depth))
    idx = np.argsort(depth)
    temp_df['Depth'] = depth[idx]
    temp_df['True Facies'] = y_test[idx]
    temp_df['Predicted Facies'] = y_pred[idx]
    temp_df['Well Name'] = [name for _ in range(len(depth))]
    new_df = new_df.append(temp_df)
print("="*30)
print('*********** RESULT ***********')
print("="*30)
print('\nAverage F1-score is {:.4%}'.format(np.mean(scores)))
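The raw feature_importances_ arrays printed above are hard to read on their own; pairing them with the surviving column names makes them interpretable. A quick sketch, here for the marine classifier fitted in the last cross-validation fold (column order is preserved by drop, so the zip lines up):

# Column names that survive the marine drop list, i.e. the features the
# last-fitted (marine) classifier actually saw, in order
m_feature_names = [c for c in M_training_data.columns if c not in m_drop_list]
for fname, imp in zip(m_feature_names, clf.feature_importances_):
    print('{:20s} {:.3f}'.format(fname, imp))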
In [13]:
filename = '../validation_data_nofacies.csv'
test_data = pd.read_csv(filename)
test_data
Out[13]:
In [14]:
# Add the absolute-distance-to-marine variables
test_data = make_dist_mar_vars(test_data)
# Formation category encoding; we reuse the encoder fitted on the training
# data so the integer codes stay consistent between the two datasets
test_data['Formation_category'] = LE.transform(test_data.Formation)
In [15]:
# nm = non-marine and m = marine
nm_drop_list = ['Formation', 'Well Name', 'Depth','DeltaPHI','PE',
'NM_M','RELPOS']
m_drop_list = ['Formation', 'Well Name', 'Depth',
'dist_mar_down','dist_mar_up','PE','NM_M','PHIND']
In [16]:
grouped = test_data.groupby('Well Name')
new_df = pd.DataFrame()
for name in grouped.groups.keys():
    temp_df = pd.DataFrame()
    well = grouped.get_group(name)
    temp_df['Depth'] = well['Depth']
    # Index with the group's own NM_M column so the boolean mask is not
    # reindexed against the full dataframe
    nm_test = well[well['NM_M'] == 1]
    m_test = well[well['NM_M'] == 2]
    # Do it for non-marine sediments
    nm_test_depth = nm_test.Depth
    nm_X_test = nm_test.drop(nm_drop_list, axis=1).values
    id_train_NM = NM_training_data.Facies <= 3
    nm_train = NM_training_data[id_train_NM]
    nm_X_train = nm_train.drop(nm_drop_list, axis=1)
    nm_X_train = nm_X_train.drop('Facies', axis=1).values
    nm_y_train = nm_train['Facies'].values
    # Then we do random forest classification
    clf.fit(nm_X_train, nm_y_train)
    nm_y_pred = clf.predict(nm_X_test)
    # ****************************************************************#
    # Do it for marine sediments
    m_test_depth = m_test.Depth
    m_X_test = m_test.drop(m_drop_list, axis=1).values
    id_train_M = M_training_data.Facies >= 4
    m_train = M_training_data[id_train_M]
    m_X_train = m_train.drop('Facies', axis=1)
    m_X_train = m_X_train.drop(m_drop_list, axis=1).values
    m_y_train = m_train['Facies'].values
    # Then we do random forest classification
    clf.fit(m_X_train, m_y_train)
    m_y_pred = clf.predict(m_X_test)
    # ================================================================#
    # Combine results
    # ================================================================#
    y_pred = np.hstack((nm_y_pred, m_y_pred))
    depth = np.hstack((nm_test_depth, m_test_depth))
    idx = np.argsort(depth)
    temp_df['Predicted Facies'] = y_pred[idx]
    temp_df['Well Name'] = [name for _ in range(len(depth))]
    new_df = new_df.append(temp_df)
new_df = new_df.sort_index()
In [17]:
new_df.to_pickle('Prediction_blind_wells.pkl')
new_df
Out[17]:
The code and ideas in this notebook, by geoLEARN, are licensed under a Creative Commons Attribution 4.0 International License.