<span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">The code and ideas in this notebook,</span> by <span xmlns:cc="http://creativecommons.org/ns#" property="cc:attributionName">Matteo Niccoli and Mark Dahl,</span> are licensed under a Creative Commons Attribution 4.0 International License.
In this notebook we will train a machine learning algorithm to predict facies from well log data. The dataset comes from a class exercise from The University of Kansas on Neural Networks and Fuzzy Systems. This exercise is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more info on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).
The dataset consists of log data from nine wells that have been labeled with a facies type based on observation of core. We will use this log data to train a support vector machine to classify facies types.
After a quick exploration of the dataset, we will:
In [1]:
%matplotlib inline
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from mpl_toolkits.axes_grid1 import make_axes_locatable
from sklearn import preprocessing
from sklearn.metrics import f1_score, accuracy_score, make_scorer
from sklearn.model_selection import LeaveOneGroupOut, validation_curve
import pandas as pd
from pandas import set_option
set_option("display.max_rows", 10)
pd.options.mode.chained_assignment = None
In [2]:
filename = 'facies_vectors.csv'
training_data = pd.read_csv(filename)
training_data
Out[2]:
This data is from the Council Grove gas reservoir in Southwest Kansas. The Panoma Council Grove Field is predominantly a carbonate gas reservoir encompassing 2700 square miles in Southwestern Kansas. This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector and validation (test) data (830 examples from two wells) having the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five from wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.
The seven predictor variables are:
The nine discrete facies (classes of rocks) are:
These facies aren't discrete, and gradually blend into one another. Some have neighboring facies that are rather close. Mislabeling within these neighboring facies can be expected to occur. The following table lists the facies, their abbreviated labels and their approximate neighbors.
Facies | Label | Adjacent Facies |
---|---|---|
1 | SS | 2 |
2 | CSiS | 1,3 |
3 | FSiS | 2 |
4 | SiSh | 5 |
5 | MS | 4,6 |
6 | WS | 5,7 |
7 | D | 6,8 |
8 | PS | 6,7,9 |
9 | BS | 7,8 |
Let's clean up this dataset. The 'Well Name' and 'Formation' columns can be turned into a categorical data type.
In [3]:
training_data['Well Name'] = training_data['Well Name'].astype('category')
training_data['Formation'] = training_data['Formation'].astype('category')
training_data['Well Name'].unique()
Out[3]:
These are the names of the 10 training wells in the Council Grove reservoir. Data has been recruited into pseudo-well 'Recruit F9' to better represent facies 9, the Phylloid-algal bafflestone.
Before we plot the well data, let's define a color map so the facies are represented by consistent color in all the plots in this tutorial. We also create the abbreviated facies labels, and add those to the facies_vectors
dataframe.
In [4]:
# 1=sandstone 2=c_siltstone 3=f_siltstone # 4=marine_silt_shale
#5=mudstone 6=wackestone 7=dolomite 8=packstone 9=bafflestone
facies_colors = ['#F4D03F', '#F5B041', '#DC7633','#A569BD',
'#000000', '#000080', '#2E86C1', '#AED6F1', '#196F3D']
facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS',
'WS', 'D','PS', 'BS']
#facies_color_map is a dictionary that maps facies labels
#to their respective colors
facies_color_map = {}
for ind, label in enumerate(facies_labels):
facies_color_map[label] = facies_colors[ind]
def label_facies(row, labels):
return labels[ row['Facies'] -1]
training_data.loc[:,'FaciesLabels'] = training_data.apply(lambda row: label_facies(row, facies_labels), axis=1)
training_data.describe()
Out[4]:
This is a quick view of the statistical distribution of the input variables. Looking at the count
values, most values have 4149 valid values except for PE
, which has 3232. We will drop the feature vectors that don't have a valid PE
entry.
In [5]:
PE_mask = training_data['PE'].notnull().values
training_data = training_data[PE_mask]
In [6]:
training_data.describe()
Out[6]:
Now we extract just the feature variables we need to perform the classification. The predictor variables are the five log values and two geologic constraining variables, and we are also using depth. We also get a vector of the facies labels that correspond to each feature vector.
In [7]:
y = training_data['Facies'].values
print y[25:40]
print np.shape(y)
In [8]:
X = training_data.drop(['Formation', 'Well Name','Facies','FaciesLabels'], axis=1)
print np.shape(X)
X.describe(percentiles=[.05, .25, .50, .75, .95])
Out[8]:
In [9]:
scaler = preprocessing.StandardScaler().fit(X)
X = scaler.transform(X)
In [10]:
Fscorer = make_scorer(f1_score, average = 'micro')
# Ascorer = make_scorer(accuracy_score)
One of the key steps in machine learning is to estimate a model's performance on data that it has not seen before. Scikit-learn provides a simple utility utility (train_test_split) to partition the data into a training and a test set, but the disadvantage with that is that we ignore a portion of our dataset during training. An additional disadvantage of simple spit, inherent to log data, is that there's a depth dependence.
A possible strategy to avoid this is cross-validation. With k-fold cross-validation we randomly split the data into k-folds without replacement, where k-1 folds are used for training and one fold for testing. The process is repeated k times, and the performance is obtained by taking the average of the k individual performances.
Stratified k-fold is an improvement over standard k-fold in that the class proportions are preserved in each fold to ensure that each fold is representative of the class proportions in the data.
Another important aspect of machine learning is the search for the optimal model parameters (i.e. those that will yield the best performance). This tuning is done using grid search.
The above short summary is based on Sebastian Raschka's Python Machine Learning book.
In [11]:
from sklearn.model_selection import GridSearchCV
First things first, how many samples do we have for each leave-one-well-out split? Since this will be our validation method, it is good to select a number of k-fold splits that will use for training the same number of samples (on average) we'd have with validation.
In [12]:
wells = training_data["Well Name"].values
logo = LeaveOneGroupOut()
n_samples =[]
for train, test in logo.split(X, y, groups=wells):
well_name = wells[test[0]]
print well_name, 'out: ', np.shape(train)[0], 'training samples - ', np.shape(test)[0], 'test samples'
n_samples.append(np.shape(train)[0])
Let's get the average number of training samples (with the exclusion of Recruit F9). This is going to be the desired size for the folds.
In [13]:
n_samples = np.delete(n_samples,5,0)
ave_n = (n_samples.sum()/len(n_samples))
From that and the total number of samples we can estimate the appropriate number of folds for cross validation, knowing that:
size_folds = n_samples/n_splits
gives:
n_splits = n_samples/size_folds
In [14]:
(len(y)/((len(y)-ave_n)))
Out[14]:
In [15]:
from sklearn import svm
SVC_classifier = svm.SVC(kernel = 'rbf', cache_size = 2400, random_state=1)
We've found in our previous notebook that the rbf kernel performs better than the linear kernel. We will now run grid search with a reasonable range for C (based on our experience) and a large range of gamma values. The intention is to ensure we do not settle on a local minimum for gamma, and then to revisit the value of C with validation curves.
In [16]:
parm_grid={'C': [0.1, 0.05, 0.1, 0.5, 1, 5, 10, 15, 20, 25, 50],
'gamma':[0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid_search = GridSearchCV(SVC_classifier,
param_grid=parm_grid,
scoring = Fscorer,
n_jobs = 7, cv = 7)
grid_search.fit(X, y)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
grid_search.best_estimator_
Out[16]:
Their use is very nicely illustrated in this tutorial:
http://www.astroml.org/sklearn_tutorial/practical.html#learning-curves
https://github.com/jakevdp/sklearn_pycon2015/blob/master/notebooks/05-Validation.ipynb
In [17]:
def rms_error(model, X, y):
y_pred = model.predict(X)
return np.sqrt(np.mean((y - y_pred) ** 2))
In [18]:
from sklearn import svm
SVC_classifier_LOWO_VC = svm.SVC(cache_size=2400, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.01, kernel='rbf',
max_iter=-1, probability=False, random_state=1, shrinking=True,
tol=0.001, verbose=False)
In [19]:
parm_range1 = np.logspace(-2, 6, 9)
train_scores1, test_scores1 = validation_curve(SVC_classifier_LOWO_VC, X, y, "C", parm_range1,
cv =logo.split(X, y, groups=wells),
scoring = rms_error, n_jobs = 9)
In [20]:
train_scores_mean1 = np.mean(train_scores1, axis=1)
#train_scores_std1= np.std(train_scores1, axis=1)
test_scores_mean1 = np.mean(test_scores1, axis=1)
#test_scores_std1 = np.std(test_scores1, axis=1)
print test_scores_mean1
print np.amin(test_scores_mean1) # optimal (minimum) average RMS test error
print np.logspace(-2, 6, 9)[test_scores_mean1.argmin(axis=0)] # optimal C parameter value
In [21]:
plt.figure(figsize=(10,10))
plt.title("Validation Curve with SVM")
plt.xlabel("C")
plt.ylabel("RMS")
#plt.ylim(0.2, 1)
lw = 2
plt.semilogx(parm_range1, train_scores_mean1, label="Training error", color="darkorange", lw=lw)
plt.semilogx(parm_range1, test_scores_mean1, label="Cross-validation error", color="navy", lw=lw)
plt.legend(loc="best")
#plt.gca().invert_xaxis()
plt.show()
In [22]:
from sklearn import svm
SVC_classifier_conf = svm.SVC(C = 100, cache_size=2400, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.01, kernel='rbf',
max_iter=-1, probability=False, random_state=1, shrinking=True,
tol=0.001, verbose=False)
In [23]:
SVC_classifier_conf.fit(X,y)
svc_pred = SVC_classifier_conf.predict(X)
from sklearn.metrics import confusion_matrix
from classification_utilities import display_cm, display_adj_cm
conf = confusion_matrix(svc_pred, y)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
In [24]:
f1_SVC = []
for train, test in logo.split(X, y, groups=wells):
well_name = wells[test[0]]
SVC_classifier_conf.fit(X[train], y[train])
pred = SVC_classifier_conf.predict(X[test])
sc = f1_score(y[test], pred, labels = np.arange(10), average = 'micro')
print("{:>20s} {:.3f}".format(well_name, sc))
f1_SVC.append(sc)
print "-Average leave-one-well-out F1 Score: %6f" % (sum(f1_SVC)/(1.0*(len(f1_SVC))))
In [25]:
blind = pd.read_csv('validation_data_nofacies.csv')
X_blind = np.array(blind.drop(['Formation', 'Well Name'], axis=1))
X_blind = scaler.transform(X_blind)
y_pred = SVC_classifier_conf.fit(X, y).predict(X_blind)
blind['Facies'] = y_pred
In [26]:
def make_facies_log_plot(logs, facies_colors):
#make sure logs are sorted by depth
logs = logs.sort_values(by='Depth')
cmap_facies = colors.ListedColormap(
facies_colors[0:len(facies_colors)], 'indexed')
ztop=logs.Depth.min(); zbot=logs.Depth.max()
cluster=np.repeat(np.expand_dims(logs['Facies'].values,1), 100, 1)
f, ax = plt.subplots(nrows=1, ncols=6, figsize=(8, 12))
ax[0].plot(logs.GR, logs.Depth, '-g')
ax[1].plot(logs.ILD_log10, logs.Depth, '-')
ax[2].plot(logs.DeltaPHI, logs.Depth, '-', color='0.5')
ax[3].plot(logs.PHIND, logs.Depth, '-', color='r')
ax[4].plot(logs.PE, logs.Depth, '-', color='black')
im=ax[5].imshow(cluster, interpolation='none', aspect='auto',
cmap=cmap_facies,vmin=1,vmax=9)
divider = make_axes_locatable(ax[5])
cax = divider.append_axes("right", size="20%", pad=0.05)
cbar=plt.colorbar(im, cax=cax)
cbar.set_label((17*' ').join([' SS ', 'CSiS', 'FSiS',
'SiSh', ' MS ', ' WS ', ' D ',
' PS ', ' BS ']))
cbar.set_ticks(range(0,1)); cbar.set_ticklabels('')
for i in range(len(ax)-1):
ax[i].set_ylim(ztop,zbot)
ax[i].invert_yaxis()
ax[i].grid()
ax[i].locator_params(axis='x', nbins=3)
ax[0].set_xlabel("GR")
ax[0].set_xlim(logs.GR.min(),logs.GR.max())
ax[1].set_xlabel("ILD_log10")
ax[1].set_xlim(logs.ILD_log10.min(),logs.ILD_log10.max())
ax[2].set_xlabel("DeltaPHI")
ax[2].set_xlim(logs.DeltaPHI.min(),logs.DeltaPHI.max())
ax[3].set_xlabel("PHIND")
ax[3].set_xlim(logs.PHIND.min(),logs.PHIND.max())
ax[4].set_xlabel("PE")
ax[4].set_xlim(logs.PE.min(),logs.PE.max())
ax[5].set_xlabel('Facies')
ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([])
ax[4].set_yticklabels([]); ax[5].set_yticklabels([])
ax[5].set_xticklabels([])
f.suptitle('Well: %s'%logs.iloc[0]['Well Name'], fontsize=14,y=0.94)
In [27]:
make_facies_log_plot(blind[blind['Well Name'] == 'STUART'], facies_colors)
make_facies_log_plot(blind[blind['Well Name'] == 'CRAWFORD'], facies_colors)
In [28]:
np.save('ypred.npy', y_pred)
This is a nice display to finish up with, as it gives us a visual idea of the predicted faces where we have facies from the core observations. The plot we will use a function from the original notebook. Let's look at the well with the lowest F1 from the previous code block, CROSS H CATTLE, and the one with the highest F1 (excluding Recruit F9), which is SHRIMPLIN.
In [29]:
def compare_facies_plot(logs, compadre, facies_colors):
#make sure logs are sorted by depth
logs = logs.sort_values(by='Depth')
cmap_facies = colors.ListedColormap(
facies_colors[0:len(facies_colors)], 'indexed')
ztop=logs.Depth.min(); zbot=logs.Depth.max()
cluster1 = np.repeat(np.expand_dims(logs['Facies'].values,1), 100, 1)
cluster2 = np.repeat(np.expand_dims(logs[compadre].values,1), 100, 1)
f, ax = plt.subplots(nrows=1, ncols=7, figsize=(9, 12))
ax[0].plot(logs.GR, logs.Depth, '-g')
ax[1].plot(logs.ILD_log10, logs.Depth, '-')
ax[2].plot(logs.DeltaPHI, logs.Depth, '-', color='0.5')
ax[3].plot(logs.PHIND, logs.Depth, '-', color='r')
ax[4].plot(logs.PE, logs.Depth, '-', color='black')
im1 = ax[5].imshow(cluster1, interpolation='none', aspect='auto',
cmap=cmap_facies,vmin=1,vmax=9)
im2 = ax[6].imshow(cluster2, interpolation='none', aspect='auto',
cmap=cmap_facies,vmin=1,vmax=9)
divider = make_axes_locatable(ax[6])
cax = divider.append_axes("right", size="20%", pad=0.05)
cbar=plt.colorbar(im2, cax=cax)
cbar.set_label((17*' ').join([' SS ', 'CSiS', 'FSiS',
'SiSh', ' MS ', ' WS ', ' D ',
' PS ', ' BS ']))
cbar.set_ticks(range(0,1)); cbar.set_ticklabels('')
for i in range(len(ax)-2):
ax[i].set_ylim(ztop,zbot)
ax[i].invert_yaxis()
ax[i].grid()
ax[i].locator_params(axis='x', nbins=3)
ax[0].set_xlabel("GR")
ax[0].set_xlim(logs.GR.min(),logs.GR.max())
ax[1].set_xlabel("ILD_log10")
ax[1].set_xlim(logs.ILD_log10.min(),logs.ILD_log10.max())
ax[2].set_xlabel("DeltaPHI")
ax[2].set_xlim(logs.DeltaPHI.min(),logs.DeltaPHI.max())
ax[3].set_xlabel("PHIND")
ax[3].set_xlim(logs.PHIND.min(),logs.PHIND.max())
ax[4].set_xlabel("PE")
ax[4].set_xlim(logs.PE.min(),logs.PE.max())
ax[5].set_xlabel('Facies')
ax[6].set_xlabel(compadre)
ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([])
ax[4].set_yticklabels([]); ax[5].set_yticklabels([])
ax[5].set_xticklabels([])
ax[6].set_xticklabels([])
f.suptitle('Well: %s'%logs.iloc[0]['Well Name'], fontsize=14,y=0.94)
In [30]:
SVC_classifier_conf.fit(X,y)
pred = SVC_classifier_conf.predict(X)
X = training_data
X['Prediction'] = pred
In [31]:
compare_facies_plot(X[X['Well Name'] == 'CROSS H CATTLE'], 'Prediction', facies_colors)
compare_facies_plot(X[X['Well Name'] == 'SHRIMPLIN'], 'Prediction', facies_colors)