This notebook demonstrates how to train a machine learning algorithm to predict facies from well log data. The dataset comes from a class exercise on Neural Networks and Fuzzy Systems at The University of Kansas. That exercise is based on a consortium project that used machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more information on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).
The dataset consists of log data from nine wells that have been labeled with a facies type based on observation of core. We will use these logs to train a majority-vote classifier composed of a support vector machine, a random forest, and a gradient-boosted tree ensemble, using the implementations in scikit-learn and XGBoost.
First, we will examine the dataset used to train the classifier. The training data is contained in the file training_data.csv. The dataset consists of five wireline log measurements, two indicator variables, and a facies label at half-foot intervals. In machine learning terminology, each depth interval is represented by a feature vector that maps a set of 'features' (the log measurements) to a class (the facies type). We will use the pandas library to load the data into a dataframe, which provides a convenient data structure to work with well log data.
In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from mpl_toolkits.axes_grid1 import make_axes_locatable
from pandas import set_option
set_option("display.max_rows", 20)
pd.options.mode.chained_assignment = None
filename = '../training_data.csv'
training_data = pd.read_csv(filename)
training_data
Out[2]:
This data is from the Council Grove gas reservoir in southwest Kansas. The Panoma Council Grove Field is a predominantly carbonate gas reservoir encompassing 2700 square miles in southwestern Kansas. The training set comes from nine wells (4149 examples), each example consisting of seven predictor variables and a rock facies (class) label; the validation (test) data (830 examples from two wells) have the same seven predictor variables in the feature vector. Facies are based on examination of cores from the nine wells, taken vertically at half-foot intervals. The predictor variables include five wireline log measurements and two geologic constraining variables derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot rate.
The seven predictor variables are:

- GR: gamma ray
- ILD_log10: resistivity (base-10 logarithm)
- DeltaPHI: neutron-density porosity difference
- PHIND: average neutron-density porosity
- PE: photoelectric effect
- NM_M: nonmarine/marine indicator
- RELPOS: relative position
The nine discrete facies (classes of rocks) are:

1. SS: nonmarine sandstone
2. CSiS: nonmarine coarse siltstone
3. FSiS: nonmarine fine siltstone
4. SiSh: marine siltstone and shale
5. MS: mudstone
6. WS: wackestone
7. D: dolomite
8. PS: packstone
9. BS: bafflestone
These facies are not truly discrete; they gradually blend into one another, and some have neighboring facies that are quite similar. Mislabeling between such neighbors is to be expected. The following table lists the facies, their abbreviated labels, and their approximate neighbors.
Facies | Label | Adjacent Facies |
---|---|---|
1 | SS | 2 |
2 | CSiS | 1,3 |
3 | FSiS | 2 |
4 | SiSh | 5 |
5 | MS | 4,6 |
6 | WS | 5,7,8 |
7 | D | 6,8 |
8 | PS | 6,7,9 |
9 | BS | 7,8 |
Let's clean up this dataset. The 'Well Name' and 'Formation' columns can be turned into a categorical data type.
In [3]:
training_data['Well Name'] = training_data['Well Name'].astype('category')
training_data['Formation'] = training_data['Formation'].astype('category')
training_data['Well Name'].unique()
Out[3]:
In [4]:
training_data.columns[4:]
Out[4]:
In [5]:
training_data.describe()
Out[5]:
This is a quick view of the statistical distribution of the input variables. Looking at the count values, there are 3232 feature vectors in the training set.
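Class balance matters for a classifier, so it is worth a quick look at how the examples are distributed across the nine facies; a one-line check using only the dataframe already loaded:

In [ ]:
# number of training examples per facies class (1-9)
training_data['Facies'].value_counts().sort_index()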
In [6]:
# 1=sandstone 2=c_siltstone 3=f_siltstone
# 4=marine_silt_shale 5=mudstone 6=wackestone 7=dolomite
# 8=packstone 9=bafflestone
facies_colors = ['#F4D03F', '#F5B041','#DC7633','#6E2C00',
'#1B4F72','#2E86C1', '#AED6F1', '#A569BD', '#196F3D']
facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS',
'WS', 'D','PS', 'BS']
#facies_color_map is a dictionary that maps facies labels
#to their respective colors
facies_color_map = {}
for ind, label in enumerate(facies_labels):
facies_color_map[label] = facies_colors[ind]
def label_facies(row, labels):
return labels[ row['Facies'] -1]
def make_facies_log_plot(logs, facies_colors):
#make sure logs are sorted by depth
logs = logs.sort_values(by='Depth')
cmap_facies = colors.ListedColormap(
facies_colors[0:len(facies_colors)], 'indexed')
ztop=logs.Depth.min(); zbot=logs.Depth.max()
cluster=np.repeat(np.expand_dims(logs['Facies'].values,1), 100, 1)
f, ax = plt.subplots(nrows=1, ncols=6, figsize=(8, 12))
ax[0].plot(logs.GR, logs.Depth, '-g')
ax[1].plot(logs.ILD_log10, logs.Depth, '-')
ax[2].plot(logs.DeltaPHI, logs.Depth, '-', color='0.5')
ax[3].plot(logs.PHIND, logs.Depth, '-', color='r')
ax[4].plot(logs.PE, logs.Depth, '-', color='black')
im=ax[5].imshow(cluster, interpolation='none', aspect='auto',
cmap=cmap_facies,vmin=1,vmax=9)
divider = make_axes_locatable(ax[5])
cax = divider.append_axes("right", size="20%", pad=0.05)
cbar=plt.colorbar(im, cax=cax)
cbar.set_label((17*' ').join([' SS ', 'CSiS', 'FSiS',
'SiSh', ' MS ', ' WS ', ' D ',
' PS ', ' BS ']))
cbar.set_ticks(range(0,1)); cbar.set_ticklabels('')
for i in range(len(ax)-1):
ax[i].set_ylim(ztop,zbot)
ax[i].invert_yaxis()
ax[i].grid()
ax[i].locator_params(axis='x', nbins=3)
ax[0].set_xlabel("GR")
ax[0].set_xlim(logs.GR.min(),logs.GR.max())
ax[1].set_xlabel("ILD_log10")
ax[1].set_xlim(logs.ILD_log10.min(),logs.ILD_log10.max())
ax[2].set_xlabel("DeltaPHI")
ax[2].set_xlim(logs.DeltaPHI.min(),logs.DeltaPHI.max())
ax[3].set_xlabel("PHIND")
ax[3].set_xlim(logs.PHIND.min(),logs.PHIND.max())
ax[4].set_xlabel("PE")
ax[4].set_xlim(logs.PE.min(),logs.PE.max())
ax[5].set_xlabel('Facies')
ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([])
ax[4].set_yticklabels([]); ax[5].set_yticklabels([])
ax[5].set_xticklabels([])
f.suptitle('Well: %s'%logs.iloc[0]['Well Name'], fontsize=14,y=0.94)
def compare_facies_plot(logs, compadre, facies_colors):
#make sure logs are sorted by depth
logs = logs.sort_values(by='Depth')
cmap_facies = colors.ListedColormap(
facies_colors[0:len(facies_colors)], 'indexed')
ztop=logs.Depth.min(); zbot=logs.Depth.max()
cluster1 = np.repeat(np.expand_dims(logs['Facies'].values,1), 100, 1)
cluster2 = np.repeat(np.expand_dims(logs[compadre].values,1), 100, 1)
f, ax = plt.subplots(nrows=1, ncols=7, figsize=(9, 12))
ax[0].plot(logs.GR, logs.Depth, '-g')
ax[1].plot(logs.ILD_log10, logs.Depth, '-')
ax[2].plot(logs.DeltaPHI, logs.Depth, '-', color='0.5')
ax[3].plot(logs.PHIND, logs.Depth, '-', color='r')
ax[4].plot(logs.PE, logs.Depth, '-', color='black')
im1 = ax[5].imshow(cluster1, interpolation='none', aspect='auto',
cmap=cmap_facies,vmin=1,vmax=9)
im2 = ax[6].imshow(cluster2, interpolation='none', aspect='auto',
cmap=cmap_facies,vmin=1,vmax=9)
divider = make_axes_locatable(ax[6])
cax = divider.append_axes("right", size="20%", pad=0.05)
cbar=plt.colorbar(im2, cax=cax)
cbar.set_label((17*' ').join([' SS ', 'CSiS', 'FSiS',
'SiSh', ' MS ', ' WS ', ' D ',
' PS ', ' BS ']))
cbar.set_ticks(range(0,1)); cbar.set_ticklabels('')
for i in range(len(ax)-2):
ax[i].set_ylim(ztop,zbot)
ax[i].invert_yaxis()
ax[i].grid()
ax[i].locator_params(axis='x', nbins=3)
ax[0].set_xlabel("GR")
ax[0].set_xlim(logs.GR.min(),logs.GR.max())
ax[1].set_xlabel("ILD_log10")
ax[1].set_xlim(logs.ILD_log10.min(),logs.ILD_log10.max())
ax[2].set_xlabel("DeltaPHI")
ax[2].set_xlim(logs.DeltaPHI.min(),logs.DeltaPHI.max())
ax[3].set_xlabel("PHIND")
ax[3].set_xlim(logs.PHIND.min(),logs.PHIND.max())
ax[4].set_xlabel("PE")
ax[4].set_xlim(logs.PE.min(),logs.PE.max())
ax[5].set_xlabel('Facies')
ax[6].set_xlabel(compadre)
ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([])
ax[4].set_yticklabels([]); ax[5].set_yticklabels([])
ax[5].set_xticklabels([])
ax[6].set_xticklabels([])
f.suptitle('Well: %s'%logs.iloc[0]['Well Name'], fontsize=14,y=0.94)
In [6]:
make_facies_log_plot(
training_data[training_data['Well Name'] == 'SHRIMPLIN'],
facies_colors)
Next: make the label and feature vectors.
In [7]:
feat_labels =training_data.columns[4:]
In [8]:
def add_del(df, feat_names):
    """Add a first-derivative feature del_<name> for each feature in
    feat_names, approximated with np.gradient along the depth axis."""
    for fn in feat_names:
        df["del_" + fn] = np.gradient(df[fn])
    return df
training_data.columns[4:]
# apply add_del twice: the second pass also differentiates the del_ features,
# adding approximate second derivatives (del_del_<name>)
training_data = add_del(training_data, [fn for fn in training_data.columns[4:] if fn != 'NM_M'])
training_data = add_del(training_data, [fn for fn in training_data.columns[4:] if fn != 'NM_M'])
training_data.columns[4:]
feat_labels = training_data.columns[4:]
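Since the derivative features come from np.gradient, here is a quick illustration of what it computes on a toy array (central differences in the interior, one-sided differences at the ends):

In [ ]:
# e.g. the second entry is (4 - 1) / 2 = 1.5
np.gradient(np.array([1.0, 2.0, 4.0, 7.0]))  # -> array([1. , 1.5, 2.5, 3. ])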
In [9]:
feat_labels
Out[9]:
Next, extract the label vector and feature matrix from the dataframe.
In [10]:
y = training_data['Facies'].values
X = training_data.drop(['Formation', 'Well Name', 'Depth', 'Facies'], axis=1).values
feat_labels = training_data.columns[4:]
label_encoded_y = np.unique(y)  # the set of class labels, 1 through 9
In [12]:
## import and initialize a few classifiers to experiment with
In [13]:
def randomize(dataset, labels):
    # shuffle examples and labels together with one shared permutation
    permutation = np.random.permutation(labels.shape[0])
    shuffled_dataset = dataset[permutation, :]
    shuffled_labels = labels[permutation]
    return shuffled_dataset, shuffled_labels
X, y = randomize(X, y)
In [14]:
from sklearn import __version__ as sklearn_version
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
tree = DecisionTreeClassifier(criterion='entropy', max_depth=1)  # decision stumps for the bagger
forest = RandomForestClassifier(n_estimators=10000, random_state=50, n_jobs=-1)
bag = BaggingClassifier(base_estimator=tree, n_estimators=400, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
In [15]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import svm
SVC_classifier = svm.SVC(kernel='rbf', random_state=0, gamma=0.01)
pipe_svm = Pipeline([('scl', StandardScaler()), ('clf', SVC_classifier)])
In [16]:
import xgboost
from xgboost import XGBClassifier
Split the training data into training and test sets. Let's use 10% of the data for the test set.
In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X,y, test_size=0.1, random_state=42)
In [18]:
# the scores
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import precision_score, recall_score, f1_score, make_scorer
pre_scorer = make_scorer(score_func=f1_score, greater_is_better=True, average='micro')
# the scorer is f1_score with micro averaging, i.e. true/false positive and
# false negative counts are pooled over all classes before computing the score
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
# stratified k-fold cross-validation keeps the class proportions balanced in each fold
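Micro averaging pools the counts over all classes, so for a single-label multiclass problem micro-F1 is the same number as plain accuracy; a toy check with made-up labels:

In [ ]:
from sklearn.metrics import accuracy_score
y_a = [1, 2, 3, 3]
y_b = [1, 2, 3, 2]
# both print 0.75
print(f1_score(y_a, y_b, average='micro'), accuracy_score(y_a, y_b))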
In [19]:
from sklearn.model_selection import GridSearchCV
In [20]:
forest.fit(X, y)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
# list the features in decreasing order of random-forest importance
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))
It looks like the features are far from equally important; we might consider dropping those with relative importance below 2.3%.
In [21]:
if sklearn_version < '0.18':
X_selected = forest.transform(X, threshold=0.023)
else:
from sklearn.feature_selection import SelectFromModel
sfm = SelectFromModel(forest, threshold=0.023, prefit=True)
X_selected = sfm.transform(X)
In [22]:
# in the end, keep all the features (skip the selection step above)
X_selected = X
In [23]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X_selected,y, test_size=0.1, random_state=42)
In [26]:
#tuned forest
forest = RandomForestClassifier(
n_estimators = 303,
min_samples_leaf = 1,
oob_score = True,
random_state=50, n_jobs=-1)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
print ("F1-score: %.4g" % f1_score(y_test,y_pred, average='micro'))
In [27]:
#tuned xgb
xg = XGBClassifier(learning_rate =0.005,
n_estimators=1888,
max_depth=10,
min_child_weight=1,
gamma=0.2,
subsample=0.9,
colsample_bytree=0.7,
reg_alpha=0,
nthread=-1,
objective='multi:softprob',
scale_pos_weight=1,
seed=43)
# note: eval_metric is only applied when an eval_set is supplied, so it has no effect here
xg.fit(X_train, y_train, eval_metric=pre_scorer)
y_pred = xg.predict(X_test)
print ("F1-score: %.4g" % f1_score(y_test,y_pred, average='micro'))
In [28]:
print('Predicted F1-Score: {}'.format(xg.score(X_test, y_test)))  # .score() reports accuracy, which equals micro-F1 here
In [29]:
#tuned SVM
pipe_svm = Pipeline([('scl',StandardScaler()),('clf',svm.SVC(kernel = 'rbf', random_state=0, gamma=0.1, C=100))])
In [30]:
from sklearn.ensemble import VotingClassifier
mv = VotingClassifier(estimators=[('forest', forest), ('XGBoost', xg), ('svm', pipe_svm)])
In [31]:
clf_labels = ['Random Forest', 'XGBoost', 'SVM', 'Majority-Vote']
print('10-fold cross validation:\n')
for clf, label in zip([forest,xg,pipe_svm,mv], clf_labels):
scores = cross_val_score(estimator=clf,
X=X_train,
y=y_train,
cv=kfold,
scoring=pre_scorer)
print("F1-score: %0.2f (+/- %0.2f) [%s]"
% (scores.mean(), scores.std(), label))
In [32]:
mv.fit(X_train,y_train)
print('Predicted F1-Score: {}'.format(mv.score(X_test,y_test)))
In [33]:
y_pred=mv.predict(X_test)
print('Predicted F1-Score: {}'.format(f1_score(y_true=y_test,y_pred=y_pred, average='micro')))
Some more detailed metrics help evaluate how well our classifier is doing. A confusion matrix is a table that can be used to describe the performance of a classification model. Scikit-learn allows us to easily create a confusion matrix by supplying the actual and predicted facies labels.
The confusion matrix is simply a 2D array. The entry C[i][j] is the number of observations predicted to have facies j, but known to have facies i.
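For example, with made-up labels (illustration only):

In [ ]:
from sklearn.metrics import confusion_matrix
# rows are true classes, columns are predictions: one class-2
# observation was predicted as class 1
confusion_matrix([1, 2, 2], [1, 1, 2])
# -> array([[1, 0],
#           [1, 1]])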
To simplify reading the confusion matrix, a function has been written to display the matrix along with facies labels and various error metrics. See the file classification_utilities.py in this repo for the display_cm() function.
In [34]:
predicted_labels=mv.predict(X_test)
In [35]:
from sklearn.metrics import confusion_matrix
In [36]:
from classification_utilities import display_cm, display_adj_cm
facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS',
'WS', 'D','PS', 'BS'];
In [37]:
conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
The entries along the diagonal are the facies that have been correctly classified. Below we define two functions that will give an overall value for how the algorithm is performing. The accuracy is defined as the number of correct classifications divided by the total number of classifications.
In [38]:
def accuracy(conf):
total_correct = 0.
nb_classes = conf.shape[0]
for i in np.arange(0,nb_classes):
total_correct += conf[i][i]
acc = total_correct/sum(sum(conf))
return acc
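The same number falls out of a numpy one-liner, since the correct classifications lie on the diagonal of the confusion matrix:

In [ ]:
# trace = sum of the diagonal = total correct classifications
np.trace(conf) / conf.sum()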
As noted above, the boundaries between the facies classes are not all sharp, and some of them blend into one another. The error within these 'adjacent facies' can also be calculated. We define an array to represent the facies adjacent to each other: for (zero-based) facies label i, adjacent_facies[i] is an array of the adjacent facies labels.
In [39]:
adjacent_facies = np.array([[1], [0,2], [1], [4], [3,5], [4,6,7], [5,7], [5,6,8], [6,7]])
def accuracy_adjacent(conf, adjacent_facies):
nb_classes = conf.shape[0]
total_correct = 0.
for i in np.arange(0,nb_classes):
total_correct += conf[i][i]
for j in adjacent_facies[i]:
total_correct += conf[i][j]
return total_correct / sum(sum(conf))
In [40]:
print('Facies classification accuracy = %f' % accuracy(conf))
print('Adjacent facies classification accuracy = %f' % accuracy_adjacent(conf, adjacent_facies))
In [41]:
display_adj_cm(conf, facies_labels, adjacent_facies,
display_metrics=True, hide_zeros=True)
Considering adjacent facies, the F1 scores for all facies types are above 0.9, except for SiSh (marine siltstone and shale). The classifier often misclassifies this facies (recall of 0.87), most often as wackestone.
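scikit-learn's classification_report offers another view of the same per-class precision, recall, and F1 numbers (passing labels explicitly in case a rare facies is absent from the test split):

In [ ]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predicted_labels,
                            labels=list(range(1, 10)),
                            target_names=facies_labels))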
In [42]:
mv.fit(X, y)
print('Training F1-Score: {}'.format(mv.score(X, y)))  # score on the training data itself
Now that we have a trained facies classification model, we can use it to identify facies in wells that do not have core data. In this case, we will apply the classifier to two wells, but we could use it on any number of wells for which we have the same set of input logs. This dataset is similar to the training data, except that it does not have facies labels. It is loaded into a dataframe called well_data.
In [43]:
well_data = pd.read_csv('../validation_data_nofacies.csv')
well_data['Well Name'] = well_data['Well Name'].astype('category')
# there is no Facies column here, so the predictors start at column index 3
well_data = add_del(well_data, [fn for fn in well_data.columns[3:] if fn != 'NM_M'])
well_data = add_del(well_data, [fn for fn in well_data.columns[3:] if fn != 'NM_M'])
feat_labels = well_data.columns[3:]
X_unknown = well_data.drop(['Formation', 'Well Name', 'Depth'], axis=1).values
In [44]:
well_data.columns[3:]
#training_data.columns[4:]
Out[44]:
Finally, we predict facies labels for the unknown data and store the results in a Facies column of the well_data dataframe.
In [45]:
#predict facies of unclassified data
y_unknown = mv.predict(X_unknown)
well_data['Facies'] = y_unknown
well_data
Out[45]:
In [46]:
well_data['Well Name'].unique()
Out[46]:
We can use the well log plot to view the classification results along with the well logs.
In [47]:
make_facies_log_plot(
well_data[well_data['Well Name'] == 'STUART'],
facies_colors=facies_colors)
make_facies_log_plot(
well_data[well_data['Well Name'] == 'CRAWFORD'],
facies_colors=facies_colors)
Finally, we can write out a CSV file with the well data along with the facies classification results.
In [49]:
well_data.to_csv('well_data_with_facies.csv')