The code and ideas in this notebook, by Matteo Niccoli and Mark Dahl, are licensed under a Creative Commons Attribution 4.0 International License.
In this notebook we will train a machine learning algorithm to predict facies from well log data. The dataset comes from a class exercise from The University of Kansas on Neural Networks and Fuzzy Systems. This exercise is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more info on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).
The dataset consists of log data from nine wells that have been labeled with a facies type based on observation of core. We will use this log data to train a support vector machine to classify facies types.
After a quick exploration of the dataset, we will:
- tune a support vector machine classifier using grid search with stratified k-fold cross-validation;
- examine learning curves of training and cross-validation scores versus the number of samples;
- evaluate the tuned classifier using leave-one-well-out cross-validation.
In [1]:
%matplotlib inline
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.metrics import f1_score, accuracy_score, make_scorer
from sklearn.model_selection import LeaveOneGroupOut, validation_curve
import pandas as pd
from pandas import set_option
set_option("display.max_rows", 10)
pd.options.mode.chained_assignment = None
In [2]:
filename = '../facies_vectors.csv'
training_data = pd.read_csv(filename)
training_data
Out[2]:
This data is from the Council Grove gas reservoir in southwest Kansas. The Panoma Council Grove Field is a predominantly carbonate gas reservoir encompassing 2700 square miles in southwestern Kansas. The training data come from nine wells (4149 examples), each example consisting of seven predictor variables and a rock facies (class) label; the validation (test) data (830 examples from two wells) have the same seven predictor variables in the feature vector. Facies are based on examination of cores from the nine wells, described vertically at half-foot intervals. The predictor variables include five wireline log measurements and two geologic constraining variables derived from geologic knowledge; all are essentially continuous variables sampled at a half-foot rate.
The seven predictor variables are:
- Five wireline log measurements: gamma ray (GR), resistivity (ILD_log10), photoelectric effect (PE), neutron-density porosity difference (DeltaPHI), and average neutron-density porosity (PHIND).
- Two geologic constraining variables: a nonmarine-marine indicator (NM_M) and relative position (RELPOS).
The nine discrete facies (classes of rocks) are:
1. Nonmarine sandstone (SS)
2. Nonmarine coarse siltstone (CSiS)
3. Nonmarine fine siltstone (FSiS)
4. Marine siltstone and shale (SiSh)
5. Mudstone (MS)
6. Wackestone (WS)
7. Dolomite (D)
8. Packstone (PS)
9. Phylloid-algal bafflestone (BS)
These facies are not truly discrete, and gradually blend into one another; some have neighboring facies that are rather close, so mislabeling within these neighboring facies can be expected. The following table lists the facies, their abbreviated labels, and their approximate neighbors (a sketch of a metric that counts adjacent facies as correct follows the table).
Facies | Label | Adjacent Facies |
---|---|---|
1 | SS | 2 |
2 | CSiS | 1,3 |
3 | FSiS | 2 |
4 | SiSh | 5 |
5 | MS | 4,6 |
6 | WS | 5,7 |
7 | D | 6,8 |
8 | PS | 6,7,9 |
9 | BS | 7,8 |
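Since neighboring facies are easy to confuse, it can be useful to also score predictions with adjacent facies counted as correct. The contest's classification_utilities module (used later for display_adj_cm) provides utilities along these lines; the snippet below is only a minimal sketch of the idea, and the adjacent_facies mapping and accuracy_adjacent helper are our own illustrative names, not part of the original exercise.

# Adjacent facies from the table above, keyed by facies number (1-9).
# Both the mapping and the helper below are our own illustration.
adjacent_facies = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4, 6],
                   6: [5, 7], 7: [6, 8], 8: [6, 7, 9], 9: [7, 8]}

def accuracy_adjacent(y_true, y_pred):
    # Count a prediction as correct if it matches the true facies
    # or one of its geological neighbors.
    hits = sum(1 for t, p in zip(y_true, y_pred)
               if p == t or p in adjacent_facies[t])
    return float(hits) / len(y_true)

For example, accuracy_adjacent would count a WS prediction for a true MS sample as a hit, since facies 5 and 6 are neighbors.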
Let's clean up this dataset. The 'Well Name' and 'Formation' columns can be turned into a categorical data type.
In [3]:
training_data['Well Name'] = training_data['Well Name'].astype('category')
training_data['Formation'] = training_data['Formation'].astype('category')
training_data['Well Name'].unique()
Out[3]:
These are the names of the 10 training wells in the Council Grove reservoir. Data has been recruited into pseudo-well 'Recruit F9' to better represent facies 9, the Phylloid-algal bafflestone.
Before we plot the well data, let's define a color map so the facies are represented by consistent colors in all the plots in this tutorial. We also create the abbreviated facies labels and add them to the training_data dataframe.
In [4]:
# 1=sandstone  2=c_siltstone  3=f_siltstone  4=marine_silt_shale
# 5=mudstone   6=wackestone   7=dolomite     8=packstone  9=bafflestone
facies_colors = ['#F4D03F', '#F5B041', '#DC7633','#A569BD',
'#000000', '#000080', '#2E86C1', '#AED6F1', '#196F3D']
facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS',
'WS', 'D','PS', 'BS']
#facies_color_map is a dictionary that maps facies labels
#to their respective colors
facies_color_map = {}
for ind, label in enumerate(facies_labels):
facies_color_map[label] = facies_colors[ind]
def label_facies(row, labels):
return labels[ row['Facies'] -1]
training_data.loc[:,'FaciesLabels'] = training_data.apply(lambda row: label_facies(row, facies_labels), axis=1)
training_data.describe()
Out[4]:
This is a quick view of the statistical distribution of the input variables. Looking at the count row, most variables have 4149 valid values, except for PE, which has only 3232. We will drop the feature vectors that don't have a valid PE entry.
In [5]:
PE_mask = training_data['PE'].notnull().values
training_data = training_data[PE_mask]
In [6]:
training_data.describe()
Out[6]:
Now we extract just the feature variables we need to perform the classification. The predictor variables are the five log values and two geologic constraining variables, and we are also using depth. We also get a vector of the facies labels that correspond to each feature vector.
In [7]:
y = training_data['Facies'].values
print y[25:40]
print np.shape(y)
In [8]:
X = training_data.drop(['Formation', 'Well Name','Facies','FaciesLabels'], axis=1)
print np.shape(X)
X.describe(percentiles=[.05, .25, .50, .75, .95])
Out[8]:
One of the key steps in machine learning is to estimate a model's performance on data that it has not seen before. Scikit-learn provides a simple utility (train_test_split) to partition the data into a training and a test set, but the disadvantage is that we ignore a portion of our dataset during training. An additional disadvantage of a simple split, inherent to log data, is that there's a depth dependence.
A possible strategy to avoid this is cross-validation. With k-fold cross-validation we randomly split the data into k-folds without replacement, where k-1 folds are used for training and one fold for testing. The process is repeated k times, and the performance is obtained by taking the average of the k individual performances.
Stratified k-fold is an improvement over standard k-fold in that the class proportions are preserved in each fold to ensure that each fold is representative of the class proportions in the data.
Another important aspect of machine learning is the search for the optimal model parameters (i.e. those that will yield the best performance). This tuning is done using grid search.
The above short summary is based on Sebastian Raschka's Python Machine Learning book.
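As a quick illustration of the stratification idea, the snippet below (our own minimal sketch, not part of the original workflow, assuming X and y as defined in cells In [7] and In [8] above) compares the facies proportions in the full label vector with those in one stratified training fold:

from sklearn.model_selection import StratifiedKFold
import numpy as np

# Class proportions in the full label vector vs. in one stratified
# training fold -- with stratification they should be nearly identical.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
train_idx, test_idx = next(skf.split(X, y))
print(np.round(np.bincount(y)[1:] / float(len(y)), 3))
print(np.round(np.bincount(y[train_idx])[1:] / float(len(train_idx)), 3))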
In [9]:
from sklearn.model_selection import GridSearchCV
Below we will perform a grid search with stratified k-fold cross-validation, following the example at http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_digits.html#sphx-glr-auto-examples-model-selection-grid-search-digits-py. This will give us reasonable values for the classifier parameters that matter most for performance.
In [10]:
Fscorer = make_scorer(f1_score, average = 'micro')
Ascorer = make_scorer(accuracy_score)
In [11]:
from sklearn import svm
SVC_classifier = svm.SVC(cache_size = 800, random_state=1)
In [12]:
training_data = pd.read_csv('../training_data.csv')
X = training_data.drop(['Formation', 'Well Name', 'Facies'], axis=1).values
scaler = preprocessing.StandardScaler().fit(X)
X = scaler.transform(X)
y = training_data['Facies'].values
In [13]:
parm_grid={'kernel': ['linear', 'rbf'],
'C': [0.5, 1, 5, 10, 15],
'gamma':[0.0001, 0.001, 0.01, 0.1, 1, 10]}
grid_search = GridSearchCV(SVC_classifier,
param_grid=parm_grid,
scoring = Fscorer,
cv=10) # Stratified K-fold with n_splits=10
# For integer inputs, if the estimator is a
# classifier and y is either binary or multiclass,
# as in our case, StratifiedKFold is used
grid_search.fit(X, y)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
grid_search.best_estimator_
Out[13]:
The idea from this point forward is to use the parameters tuned above, but in a brand new classifier for the learning curves exercise. This classifier will therefore be well tuned, but will not yet have seen the training data.
We will look at learning curves of training and (cross-validated) test error versus the number of samples, hoping to gain some insight into whether:
- the classifier would benefit from more training data;
- it suffers mainly from high variance (overfitting) or from high bias (underfitting).
The plots are adapted from: http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
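One way to carry the tuned parameters over without retyping them is to construct a fresh estimator from grid_search.best_params_. This is only a sketch of an alternative to the hard-coded constructor call used in cell In [16] below (SVC_tuned is our own name, and we assume grid_search from cell In [13] is still in scope):

from sklearn import svm

# Build a fresh, untrained SVC directly from the grid search results
# instead of copying the values by hand.
SVC_tuned = svm.SVC(cache_size=800, random_state=1,
                    **grid_search.best_params_)

This would be equivalent to the explicit constructor call below, assuming the grid search returned C=5, gamma=0.01 and kernel='rbf'.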
In [14]:
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
n_jobs=1, train_sizes=np.linspace(0.1, 1., 5)):
"""
Generate a simple plot of the test and training learning curve.
Parameters
----------
estimator : object type that implements the "fit" and "predict" methods
An object of that type which is cloned for each validation.
title : string
Title for the chart.
X : array-like, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and
n_features is the number of features.
y : array-like, shape (n_samples) or (n_samples, n_features), optional
Target relative to X for classification or regression;
None for unsupervised learning.
ylim : tuple, shape (ymin, ymax), optional
Defines minimum and maximum yvalues plotted.
cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy.
Possible inputs for cv are:
- None, to use the default 3-fold cross-validation,
- integer, to specify the number of folds.
- An object to be used as a cross-validation generator.
- An iterable yielding train/test splits.
For integer/None inputs, if ``y`` is binary or multiclass,
:class:`StratifiedKFold` is used. If the estimator is not a classifier
or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.
Refer :ref:`User Guide <cross_validation>` for the various
cross-validators that can be used here.
n_jobs : integer, optional
Number of jobs to run in parallel (default 1).
"""
plt.figure()
plt.title(title)
if ylim is not None:
plt.ylim(*ylim)
plt.xlabel("Training examples")
plt.ylabel("Score")
train_sizes, train_scores, test_scores = learning_curve(
estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, scoring = Fscorer)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.grid()
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training F1")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation F1")
plt.legend(loc="best")
return plt
First things first, how many samples do we have for each leave-one-well-out split?
In [15]:
training_data = pd.read_csv('../training_data.csv')
X = training_data.drop(['Formation', 'Well Name', 'Facies'], axis=1).values
scaler = preprocessing.StandardScaler().fit(X)
X = scaler.transform(X)
y = training_data['Facies'].values
wells = training_data["Well Name"].values
logo = LeaveOneGroupOut()
for train, test in logo.split(X, y, groups=wells):
well_name = wells[test[0]]
print well_name, 'out: ', np.shape(train)[0], 'training samples - ', np.shape(test)[0], 'test samples'
On average, we'll have about 2830 training samples and about 400 test samples in each leave-one-well-out split.
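As a quick check of that estimate (our own addition, reusing the logo, X, y and wells objects defined above), we can average the split sizes over all leave-one-well-out folds:

import numpy as np

# Average the train/test sizes over the leave-one-well-out splits above.
sizes = [(len(train), len(test))
         for train, test in logo.split(X, y, groups=wells)]
print(np.mean(sizes, axis=0))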
In [16]:
from sklearn import svm
SVC_classifier_learn = svm.SVC(C=5, cache_size=800, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.01, kernel='rbf',
max_iter=-1, probability=False, random_state=1, shrinking=True,
tol=0.001, verbose=False)
In [17]:
training_data = pd.read_csv('../training_data.csv')
X = training_data.drop(['Formation', 'Well Name', 'Facies'], axis=1).values
scaler = preprocessing.StandardScaler().fit(X)
X = scaler.transform(X)
y = training_data['Facies'].values
In [18]:
title = "Learning Curves (SVC)"
# Learning curves with 50 iterations to get smoother mean test and train
# score curves; each time we hold 15% of the data randomly as a validation set.
# This is equivalent to leaving about 1 well out, on average (3232 minus ~2800 samples)
cv = ShuffleSplit(n_splits=50, test_size=0.15, random_state=1)
plot_learning_curve(SVC_classifier_learn, title, X, y, cv=cv, ylim=(0.45, 0.75), n_jobs=4)
plt.show()
Since we cannot address the overfitting by increasing the number of samples (without sacrificing the leave-one-well-out strategy), we can increase the regularization a bit by slightly decreasing the parameter C (in an SVM, a smaller C penalizes misclassified training samples less, yielding a smoother decision boundary).
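Before settling on a value for C, it can also help to look at a validation curve over C, with gamma held at the tuned value. The validation_curve utility was imported at the top of the notebook but not used so far; the snippet below is a minimal sketch, with the C range and the 10-split ShuffleSplit being our own choices for illustration:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import ShuffleSplit, validation_curve

# Cross-validated F1 (micro) as a function of C, with gamma fixed at 0.01.
C_range = np.array([0.5, 1, 2, 5, 10, 15])
train_scores, test_scores = validation_curve(
    svm.SVC(cache_size=800, gamma=0.01, kernel='rbf', random_state=1),
    X, y, param_name='C', param_range=C_range,
    cv=ShuffleSplit(n_splits=10, test_size=0.15, random_state=1),
    scoring=Fscorer)

plt.plot(C_range, np.mean(train_scores, axis=1), 'o-', label='Training F1')
plt.plot(C_range, np.mean(test_scores, axis=1), 'o-', label='Cross-validation F1')
plt.xlabel('C')
plt.ylabel('F1 (micro)')
plt.legend(loc='best')
plt.show()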
In [19]:
SVC_classifier_learn_2 = svm.SVC(C=2, cache_size=800, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.01, kernel='rbf',
max_iter=-1, probability=False, random_state=1, shrinking=True,
tol=0.001, verbose=False)
In [20]:
title = "Learning Curves (SVC)"
# Learning curves with 50 iterations to get smoother mean test and train
# score curves; each time we hold 15% of the data randomly as a validation set.
# This is equivalent to leaving about 1 well out, on average (3232 minus ~2800 samples)
cv = ShuffleSplit(n_splits=50, test_size=0.15, random_state=1)
plot_learning_curve(SVC_classifier_learn_2, title, X, y, cv=cv, ylim=(0.45, 0.75), n_jobs=4)
plt.show()
In [21]:
from sklearn import svm
SVC_classifier_conf = svm.SVC(C=2, cache_size=800, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.01, kernel='rbf',
max_iter=-1, probability=False, random_state=1, shrinking=True,
tol=0.001, verbose=False)
In [22]:
SVC_classifier_conf.fit(X, y)
svc_pred = SVC_classifier_conf.predict(X)
from sklearn.metrics import confusion_matrix
from classification_utilities import display_cm, display_adj_cm
conf = confusion_matrix(y, svc_pred)  # rows: true facies, columns: predicted
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
In [23]:
from sklearn import svm
SVC_classifier_LOWO = svm.SVC(C=2, cache_size=800, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.01, kernel='rbf',
max_iter=-1, probability=False, random_state=1, shrinking=True,
tol=0.001, verbose=False)
In [24]:
training_data = pd.read_csv('../training_data.csv')
X = training_data.drop(['Formation', 'Well Name', 'Facies'], axis=1).values
scaler = preprocessing.StandardScaler().fit(X)
X = scaler.transform(X)
y = training_data['Facies'].values
wells = training_data["Well Name"].values
logo = LeaveOneGroupOut()
f1_SVC = []
for train, test in logo.split(X, y, groups=wells):
well_name = wells[test[0]]
SVC_classifier_LOWO.fit(X[train], y[train])
pred = SVC_classifier_LOWO.predict(X[test])
sc = f1_score(y[test], pred, labels = np.arange(10), average = 'micro')
print("{:>20s} {:.3f}".format(well_name, sc))
f1_SVC.append(sc)
In [25]:
print "Average leave-one-well-out F1 Score: %.3f" % (sum(f1_SVC)/(1.0*len(f1_SVC)))
In [26]:
from sklearn import svm
SVC_classifier_LOWO_C5 = svm.SVC(C=5, cache_size=800, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.01, kernel='rbf',
max_iter=-1, probability=False, random_state=1, shrinking=True,
tol=0.001, verbose=False)
In [27]:
training_data = pd.read_csv('../training_data.csv')
X = training_data.drop(['Formation', 'Well Name', 'Facies'], axis=1).values
scaler = preprocessing.StandardScaler().fit(X)
X = scaler.transform(X)
y = training_data['Facies'].values
wells = training_data["Well Name"].values
logo = LeaveOneGroupOut()
f1_SVC = []
for train, test in logo.split(X, y, groups=wells):
well_name = wells[test[0]]
SVC_classifier_LOWO_C5.fit(X[train], y[train])
pred = SVC_classifier_LOWO_C5.predict(X[test])
sc = f1_score(y[test], pred, labels = np.arange(10), average = 'micro')
print("{:>20s} {:.3f}".format(well_name, sc))
f1_SVC.append(sc)
In [28]:
print "Average leave-one-well-out F1 Score: %.3f" % (sum(f1_SVC)/(1.0*len(f1_SVC)))
... we would have achieved a slightly higher cross-validated score, as seen above. Would that necessarily have led to a higher validation score on the blind wells? This is an interesting conundrum: it is a matter of choosing between accepting a certain degree of variance, or reducing it at the cost of increased bias (lower performance). Perhaps there is an argument for tolerating the variance in order to gain performance, if both training and test scores go up, but the best way to make that decision is to test the classifier with different tools.
We will try two more approaches: