Facies classification using Machine Learning

Original contest notebook by Brendon Hall, Enthought

Let's train on some data. I'm Jesper, I have no clue what I'm doing, but I'm having fun. Most of the text is still from Brendon Hall's original notebook, which also provided the basis for the code.


In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from mpl_toolkits.axes_grid1 import make_axes_locatable

from pandas import set_option
set_option("display.max_rows", 10)
pd.set_option('display.width', 1000)
pd.options.mode.chained_assignment = None

def distance(latlon1, latlon2):
    """Great-circle (haversine) distance in km between two (lat, lon) points in degrees."""
    lat1 = np.deg2rad(latlon1[0])
    lon1 = np.deg2rad(latlon1[1])
    lat2 = np.deg2rad(latlon2[0])
    lon2 = np.deg2rad(latlon2[1])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return 6371 * c  # mean Earth radius in km

facies_labels = ['SS','CSiS','FSiS','SiSh','MS','WS','D','PS','BS']
facies = pd.DataFrame({
    'ShortName': pd.Series(facies_labels, index=facies_labels),
    'Facies': pd.Series(['Nonmarine sandstone', 'Nonmarine coarse siltstone',
                         'Nonmarine fine siltstone', 'Marine siltstone and shale',
                         'Mudstone (limestone)', 'Wackestone (limestone)', 'Dolomite',
                         'Packstone-grainstone (limestone)',
                         'Phylloid-algal bafflestone (limestone)'], index=facies_labels),
    'Neighbours': pd.Series(['CSiS', ['SS','FSiS'], 'CSiS', 'MS', ['SiSh','WS'],
                             ['MS','D'], ['WS','PS'], ['WS','D','BS'], ['D','PS']],
                            index=facies_labels),
    'Colors': pd.Series(['#F4D03F', '#F5B041', '#DC7633', '#6E2C00', '#1B4F72',
                         '#2E86C1', '#AED6F1', '#A569BD', '#196F3D'], index=facies_labels)})

filename = '../facies_vectors.csv'
training_data = pd.read_csv(filename)
training_data


Out[1]:
Facies Formation Well Name Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS
0 3 A1 SH SHRIMPLIN 2793.0 77.450 0.664 9.900 11.915 4.600 1 1.000
1 3 A1 SH SHRIMPLIN 2793.5 78.260 0.661 14.200 12.565 4.100 1 0.979
2 3 A1 SH SHRIMPLIN 2794.0 79.050 0.658 14.800 13.050 3.600 1 0.957
3 3 A1 SH SHRIMPLIN 2794.5 86.100 0.655 13.900 13.115 3.500 1 0.936
4 3 A1 SH SHRIMPLIN 2795.0 74.580 0.647 13.500 13.300 3.400 1 0.915
... ... ... ... ... ... ... ... ... ... ... ...
4144 5 C LM CHURCHMAN BIBLE 3120.5 46.719 0.947 1.828 7.254 3.617 2 0.685
4145 5 C LM CHURCHMAN BIBLE 3121.0 44.563 0.953 2.241 8.013 3.344 2 0.677
4146 5 C LM CHURCHMAN BIBLE 3121.5 49.719 0.964 2.925 8.013 3.190 2 0.669
4147 5 C LM CHURCHMAN BIBLE 3122.0 51.469 0.965 3.083 7.708 3.152 2 0.661
4148 5 C LM CHURCHMAN BIBLE 3122.5 50.031 0.970 2.609 6.668 3.295 2 0.653

4149 rows × 11 columns
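
As a quick sanity check of the distance() helper (a sketch, not an original cell), we can compute the great-circle distance between two wells using the coordinates listed further below:

# Distance between SHRIMPLIN and SHANKLE, in kilometres.
print(distance([37.98126, -100.98329], [38.0633727, -101.3891894]))  # ~36.7 km, matching the distance matrix later on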

Verbatim from Brendon Hall:

This data is from the Council Grove gas reservoir in Southwest Kansas. The Panoma Council Grove Field is predominantly a carbonate gas reservoir encompassing 2700 square miles in Southwestern Kansas. This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector and validation (test) data (830 examples from two wells) having the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five from wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.

The seven predictor variables are:

  • Five wireline log measurements: gamma ray (GR), resistivity (ILD_log10), photoelectric effect (PE), and the neutron-density porosity difference and average (DeltaPHI and PHIND)
  • Two geologic constraining variables: a nonmarine/marine indicator (NM_M) and relative position within the formation (RELPOS)

The nine discrete facies (classes of rocks) are listed below:


In [2]:
facies[['Facies']]


Out[2]:
Facies
SS Nonmarine sandstone
CSiS Nonmarine coarse siltstone
FSiS Nonmarine fine siltstone
SiSh Marine siltstone and shale
MS Mudstone (limestone)
WS Wackestone (limestone)
D Dolomite
PS Packstone-grainstone (limestone)
BS Phylloid-algal bafflestone (limestone)

These facies gradually blend into one another. The neighbour table below already hints at the marine/nonmarine grouping.


In [3]:
pd.DataFrame({'Neighbours': [', '.join(np.atleast_1d(x).tolist()) for x in facies.Neighbours]},
             index=facies_labels)


Out[3]:
Neighbours
SS CSiS
CSiS SS, FSiS
FSiS CSiS
SiSh MS
MS SiSh, WS
WS MS, D
D WS, PS
PS WS, D, BS
BS D, PS

The 'Well Name' and 'Formation' columns can be turned into a categorical data type.


In [4]:
training_data['Well Name'] = training_data['Well Name'].astype('category')
training_data['Formation'] = training_data['Formation'].astype('category')
training_data['Well Name'].unique()


Out[4]:
[SHRIMPLIN, ALEXANDER D, SHANKLE, LUKE G U, KIMZEY A, CROSS H CATTLE, NOLAN, Recruit F9, NEWBY, CHURCHMAN BIBLE]
Categories (10, object): [SHRIMPLIN, ALEXANDER D, SHANKLE, LUKE G U, ..., NOLAN, Recruit F9, NEWBY, CHURCHMAN BIBLE]

SHRIMPLIN

  • Longitude: -100.98329
  • Latitude: 37.98126

ALEXANDER D

  • Longitude: -95.3534525
  • Latitude: 37.6607787

SHANKLE

  • Longitude: -101.3891894
  • Latitude: 38.0633727

LUKE G U

  • Longitude: -101.6073725
  • Latitude: 37.4537739

KIMZEY A

  • Longitude: -101.39697
  • Latitude: 37.12289

CROSS H CATTLE

  • Longitude: -101.6464517
  • Latitude: 37.9105826

NOLAN

  • Longitude: -101.0451641
  • Latitude: 37.7866294

Recruit F9

  • No location: a synthetic "well" of hand-picked samples, so an arbitrary point is used in the code below.

NEWBY

  • Longitude: -101.5847635
  • Latitude: 37.5406739

CHURCHMAN BIBLE

  • Longitude: -101.1060761
  • Latitude: 37.3497658

STUART

  • Longitude: -101.1391063
  • Latitude: 37.4857262

CRAWFORD

  • Longitude: -101.1494994
  • Latitude: 37.1893654

In [5]:
latlong = pd.DataFrame({"SHRIMPLIN": [37.98126,-100.98329], "ALEXANDER D": [37.6607787,-95.3534525], "SHANKLE": [38.0633727,-101.3891894], "LUKE G U": [37.4537739,-101.6073725], "KIMZEY A": [37.12289,-101.39697], "CROSS H CATTLE": [37.9105826,-101.6464517], "NOLAN": [37.7866294,-101.0451641], "NEWBY": [37.5406739,-101.5847635], "CHURCHMAN BIBLE": [37.3497658,-101.1060761], "STUART": [37.4857262,-101.1391063], "CRAWFORD": [37.1893654,-101.1494994], "Recruit F9": [37.4,-101]})
# pairwise great-circle distances (km) between wells; symmetric with zero diagonal
dist_mat = pd.DataFrame(np.zeros((latlong.shape[1], latlong.shape[1])))
for i, x in enumerate(latlong):
    for j, y in enumerate(latlong):
        if i > j:
            dist_mat[i][j] = distance(latlong[x].values, latlong[y].values)
            dist_mat[j][i] = dist_mat[i][j]
dist_mat


Out[5]:
0 1 2 3 4 5 6 7 8 9 10 11
0 0.000000 508.539085 514.414809 553.609438 537.155038 551.655595 549.025730 500.715074 498.685691 531.653931 495.709564 510.181434
1 508.539085 0.000000 18.244918 78.442557 36.049841 45.766211 47.291283 48.872768 10.911501 83.166276 71.045797 15.396954
2 514.414809 18.244918 0.000000 91.381670 23.143376 50.037144 54.823941 67.047680 26.897113 99.451087 89.264274 32.966625
3 553.609438 78.442557 91.381670 0.000000 90.308868 50.911093 41.488243 54.564443 80.385857 28.230578 58.679434 64.994310
4 537.155038 36.049841 23.143376 90.308868 0.000000 41.232963 49.333475 80.071330 46.729386 104.579152 102.175650 46.345830
... ... ... ... ... ... ... ... ... ... ... ... ...
7 500.715074 48.872768 67.047680 54.564443 80.071330 61.820025 54.808216 0.000000 43.174992 43.098540 22.312772 34.466311
8 498.685691 10.911501 26.897113 80.385857 46.729386 53.965104 53.924917 43.174992 0.000000 81.317150 64.649885 15.546242
9 531.653931 83.166276 99.451087 28.230578 104.579152 70.445691 60.608168 43.098540 81.317150 0.000000 36.708814 67.888026
10 495.709564 71.045797 89.264274 58.679434 102.175650 80.333850 72.081453 22.312772 64.649885 36.708814 0.000000 56.779037
11 510.181434 15.396954 32.966625 64.994310 46.345830 41.478111 39.779513 34.466311 15.546242 67.888026 56.779037 0.000000

12 rows × 12 columns
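
The integer row/column labels are hard to read; a small convenience sketch (not an original cell) relabels the same matrix with well names:

dist_mat_named = dist_mat.copy()
dist_mat_named.index = latlong.columns    # columns of latlong are the well names
dist_mat_named.columns = latlong.columns  # same order as the loop above
dist_mat_named.round(1)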


In [6]:
training_data.describe()


C:\Users\Jesper\Anaconda3\lib\site-packages\numpy\lib\function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)
Out[6]:
Facies Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS
count 4149.000000 4149.000000 4149.000000 4149.000000 4149.000000 4149.000000 3232.000000 4149.000000 4149.000000
mean 4.503254 2906.867438 64.933985 0.659566 4.402484 13.201066 3.725014 1.518438 0.521852
std 2.474324 133.300164 30.302530 0.252703 5.274947 7.132846 0.896152 0.499720 0.286644
min 1.000000 2573.500000 10.149000 -0.025949 -21.832000 0.550000 0.200000 1.000000 0.000000
25% 2.000000 2821.500000 44.730000 0.498000 1.600000 8.500000 NaN 1.000000 0.277000
50% 4.000000 2932.500000 64.990000 0.639000 4.300000 12.020000 NaN 2.000000 0.528000
75% 6.000000 3007.000000 79.438000 0.822000 7.500000 16.050000 NaN 2.000000 0.769000
max 9.000000 3138.000000 361.150000 1.800000 19.312000 84.400000 8.094000 2.000000 1.000000

The new data set contains 4149 examples — roughly 900 (about 28%) more than the original one; the extra rows are those with a missing PE value (note the PE count of 3232 above). Rejoice!
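
A quick check (a sketch, not an original cell) confirms how many rows lack a PE value:

# 4149 total rows minus the 3232 with PE (see the describe() output above)
training_data['PE'].isnull().sum()  # 917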


In [7]:
#facies_color_map is a dictionary that maps facies labels
#to their respective colors
facies_color_map = {}
for label in facies_labels:
    facies_color_map[label] = facies.Colors[label]

def label_facies(row, labels):
    return labels[ row['Facies'] ]
    
training_data['Facies'] = training_data['Facies']-1
training_data.loc[:,'FaciesLabels'] = training_data.apply(lambda row: label_facies(row, facies_labels), axis=1)

rows = len(training_data.index)

processed_data = training_data.copy()

Let's take a look at the data from individual wells in a more familiar log plot form. We will create plots for the five well log variables, as well as a log for the facies labels. The plots are based on those described in Alessandro Amato del Monte's excellent tutorial.


In [109]:
def make_facies_log_plot(logs, facies_colors):
    #make sure logs are sorted by depth
    logs = logs.sort_values(by='Depth')
    cmap_facies = colors.ListedColormap(facies_colors, 'indexed')
    cmap_coarse = colors.ListedColormap(['#c0c0c0','#000000'])
    
    ztop=logs.Depth.min(); zbot=logs.Depth.max()
    
    cluster=np.repeat(np.expand_dims(logs['Facies'].values,1), 100, 1)
    clustercoarse=np.repeat(np.expand_dims(logs['NM_M'].values,1), 100, 1)
    
    f, ax = plt.subplots(nrows=1, ncols=7, figsize=(8, 12))
    ax[0].plot(logs.GR, logs.Depth, '-g')
    ax[1].plot(logs.ILD_log10, logs.Depth, '-')
    ax[2].plot(logs.DeltaPHI, logs.Depth, '-', color='0.5')
    ax[3].plot(logs.PHIND, logs.Depth, '-', color='r')
    ax[4].plot(logs.PE, logs.Depth, '-', color='black')
    
    
    im1 = ax[5].imshow(cluster, interpolation='none', aspect='auto',
                    cmap=cmap_facies,vmin=0,vmax=8)
    im2 = ax[6].imshow(clustercoarse, interpolation='none', aspect='auto',
                    cmap=cmap_coarse,vmin=0,vmax=1)
    
    divider = make_axes_locatable(ax[6])
    cax = divider.append_axes("right", size="20%", pad=0.05)
    cbar=plt.colorbar(im1, cax=cax)
    cbar.set_label((18*' ').join(facies.index))
    cbar.set_ticks(range(0,1)); cbar.set_ticklabels('')
    
    for i in range(len(ax)-2):
        ax[i].set_ylim(ztop,zbot)
        ax[i].invert_yaxis()
        ax[i].grid()
        ax[i].locator_params(axis='x', nbins=3)
    
    
    ax[0].set_xlabel("GR")
    ax[0].set_xlim(logs.GR.min(),logs.GR.max())
    ax[1].set_xlabel("ILD_log10")
    ax[1].set_xlim(logs.ILD_log10.min(),logs.ILD_log10.max())
    ax[2].set_xlabel("DeltaPHI")
    ax[2].set_xlim(logs.DeltaPHI.min(),logs.DeltaPHI.max())
    ax[3].set_xlabel("PHIND")
    ax[3].set_xlim(logs.PHIND.min(),logs.PHIND.max())
    ax[4].set_xlabel("PE")
    ax[4].set_xlim(logs.PE.min(),logs.PE.max())
    ax[5].set_xlabel('Facies')
    ax[6].set_xlabel('NM_M')
    
    ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([])
    ax[4].set_yticklabels([]); ax[5].set_yticklabels([]); ax[6].set_yticklabels([])
    ax[5].set_xticklabels([]); ax[6].set_xticklabels([])
    f.suptitle('Well: %s'%logs.iloc[0]['Well Name'], fontsize=14,y=0.94)
    plt.show()

Here's the SHRIMPLIN well for your viewing pleasure.


In [9]:
make_facies_log_plot(
    processed_data[processed_data['Well Name'] == 'SHRIMPLIN'],
    facies.Colors)


In addition to individual wells, we can look at how the various facies are represented by the entire training set. Let's plot a histogram of the number of training examples for each facies class.


In [10]:
#count the number of unique entries for each facies, sort them by
#facies number (instead of by number of entries)
facies_counts = processed_data['Facies'].value_counts().sort_index()
tmp = processed_data.query('NM_M==1')
facies_counts0 = tmp['Facies'].value_counts().sort_index()

tmp = processed_data.query('NM_M==2')
facies_counts1 = tmp['Facies'].value_counts().sort_index()

#use facies labels to index each count
facies_counts.index = facies_labels

facies_counts.plot(kind='bar',color=facies.Colors, 
                   title='Distribution of Training Data by Facies')
pd.DataFrame(facies_counts).T


Out[10]:
SS CSiS FSiS SiSh MS WS D PS BS
Facies 268 940 780 271 296 582 141 686 185

This shows the distribution of examples by facies in the training set. Dolomite (facies D) has the fewest, with 141 examples. Depending on the performance of the classifier we are going to train, we may consider getting more examples of this facies.

Crossplots are a familiar tool in the geosciences for visualizing how two properties vary with rock type. This dataset contains five log variables, and a scatter matrix helps to quickly visualize the variation between all of them. We can employ the very useful Seaborn library to quickly create a nice-looking scatter matrix. Each pane in the plot shows the relationship between two of the variables, with each point colored according to its facies. The same colormap is used to represent the nine facies.


In [11]:
#save plot display settings to change back to when done plotting with seaborn
inline_rc = dict(mpl.rcParams)

import seaborn as sns
sns.set()
plotdata=processed_data.drop(['Well Name','Formation','Depth','RELPOS'],axis=1).dropna()

sns.pairplot(plotdata.drop(['Facies','NM_M'],axis=1),
             hue='FaciesLabels', palette=facies_color_map,
             hue_order=list(reversed(facies_labels)))


Out[11]:
<seaborn.axisgrid.PairGrid at 0x1756fcef7f0>

At this point I would like to look at the two main classes (nonmarine, NM_M=1, and marine, NM_M=2) separately.


In [12]:
facies_counts0.plot(kind='bar',color=facies.Colors, 
                   title='Distribution of Training Data by Facies')
pd.DataFrame(facies_counts0).T


Out[12]:
0 1 2 3 4 5 6 7
Facies 268 934 764 5 11 3 1 12

In [13]:
facies_counts1.plot(kind='bar',color=facies.Colors, 
                   title='Distribution of Training Data by Facies')
pd.DataFrame(facies_counts1).T


Out[13]:
1 2 3 4 5 6 7 8
Facies 6 16 266 285 579 140 674 185

Preprocessing


In [81]:
from sklearn import preprocessing
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from scipy import signal


enc = preprocessing.OneHotEncoder()
processed_data['NM_M'] = enc.fit(processed_data[['NM_M']]).transform(processed_data[['NM_M']]).toarray()

feature_vectors = processed_data.drop(['Formation', 'Well Name','Depth','Facies','FaciesLabels'], axis=1)
# for the comparison model: drop rows with missing PE instead of imputing
training_data = training_data[training_data['PE'].notnull().values]
correct_facies_labels = processed_data['Facies'].values
correct_facies_labels_comp = training_data['Facies'].values

feature_vectors_comp = training_data.drop(['Formation', 'Well Name','Depth','Facies','FaciesLabels'], axis=1)

preping = Pipeline([("imputer", preprocessing.Imputer(missing_values='NaN', strategy="mean", axis=0)),
                    ("scaling", preprocessing.StandardScaler()),
                    ("trees", RandomForestClassifier())])

preping.set_params(trees__random_state=42).fit(feature_vectors, correct_facies_labels)

score = cross_val_score(preping, feature_vectors, correct_facies_labels).mean()
print("Score after preprocessing: {0}".format(score))
print("Score after preprocessing: {0}".format(score))


Score after preprocessing: 0.500591066256444

In [36]:
processed_data.describe()


C:\Users\Jesper\Anaconda3\lib\site-packages\numpy\lib\function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)
Out[36]:
Facies Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS
count 4149.000000 4149.000000 4149.000000 4149.000000 4149.000000 4149.000000 3232.000000 4149.000000 4149.000000
mean 3.503254 2906.867438 64.933985 0.659566 4.402484 13.201066 3.725014 0.518438 0.521852
std 2.474324 133.300164 30.302530 0.252703 5.274947 7.132846 0.896152 0.499720 0.286644
min 0.000000 2573.500000 10.149000 -0.025949 -21.832000 0.550000 0.200000 0.000000 0.000000
25% 1.000000 2821.500000 44.730000 0.498000 1.600000 8.500000 NaN 0.000000 0.277000
50% 3.000000 2932.500000 64.990000 0.639000 4.300000 12.020000 NaN 1.000000 0.528000
75% 5.000000 3007.000000 79.438000 0.822000 7.500000 16.050000 NaN 1.000000 0.769000
max 8.000000 3138.000000 361.150000 1.800000 19.312000 84.400000 8.094000 1.000000 1.000000

Feature Engineering


In [37]:
#scaler = preprocessing.StandardScaler().fit(feature_vectors)
#scaled_features = scaler.transform(feature_vectors)

scaler_comp = preprocessing.StandardScaler().fit(feature_vectors_comp)
scaled_features_comp = scaler_comp.transform(feature_vectors_comp)

Scikit includes a preprocessing module that can 'standardize' the data (giving each variable zero mean and unit variance, sometimes loosely called whitening). Many machine learning algorithms assume features are standard normally distributed (i.e., Gaussian with zero mean and unit variance). The factors used to standardize the training set must be applied to any subsequent feature set that will be input to the classifier. The StandardScaler class can be fit to the training set and later used to standardize any subsequent data.
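
A minimal sketch of that pattern (with hypothetical X_train / X_new arrays, not cells from this notebook):

scaler = preprocessing.StandardScaler().fit(X_train)  # learn means/stds from training data only
X_train_std = scaler.transform(X_train)
X_new_std = scaler.transform(X_new)  # reuse the same training-set constants on new data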

Scikit also includes a handy function to randomly split the training data into training and test sets. The test set contains a small subset of feature vectors that are not used to train the model. Because we know the true facies labels for these examples, we can compare the classifier's results to the actual facies and determine the accuracy of the model. In this notebook, however, we mostly rely on cross-validation and leave-one-well-out splits rather than a single random hold-out.
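
For reference, a 20% random hold-out would look like this (a sketch, not run here):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    feature_vectors, correct_facies_labels, test_size=0.2, random_state=42)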

Training the classifier

Now we use the cleaned and conditioned training set to create a facies classifier.


In [38]:
from sklearn.ensemble import RandomForestClassifier

clf  = preping
clf_comp = RandomForestClassifier()

Now we can train the classifier using the training set we created above.


In [39]:
clf_comp.fit(scaled_features_comp,correct_facies_labels_comp)


Out[39]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

Now that the model has been trained on our data, we can use it to predict facies labels. Note that below we predict on the same feature vectors the models were trained on, so the resulting accuracies measure training fit and will be optimistic.


In [51]:
predicted_labels = clf.predict(feature_vectors)
predicted_labels_comp = clf_comp.predict(scaled_features_comp)


---------------------------------------------------------------------------
NotFittedError                            Traceback (most recent call last)
<ipython-input-51-23c7984e62d3> in <module>()
----> 1 predicted_labels = clf.predict(feature_vectors)
      2 predicted_labels_comp = clf_comp.predict(scaled_features_comp)

C:\Users\Jesper\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py in predict(self, X)
    532             The predicted classes.
    533         """
--> 534         proba = self.predict_proba(X)
    535 
    536         if self.n_outputs_ == 1:

C:\Users\Jesper\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py in predict_proba(self, X)
    571         """
    572         # Check data
--> 573         X = self._validate_X_predict(X)
    574 
    575         # Assign chunk of trees to jobs

C:\Users\Jesper\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py in _validate_X_predict(self, X)
    350         """Validate X whenever one tries to predict, apply, predict_proba"""
    351         if self.estimators_ is None or len(self.estimators_) == 0:
--> 352             raise NotFittedError("Estimator not fitted, "
    353                                  "call `fit` before exploiting the model.")
    354 

NotFittedError: Estimator not fitted, call `fit` before exploiting the model.

(A cell-ordering hiccup: the classifier had not been fitted when this cell first ran. After re-running the fit above, the predictions used by the confusion matrices below were produced.)

We need some metrics to evaluate how good our classifier is doing. A confusion matrix is a table that can be used to describe the performance of a classification model. Scikit-learn allows us to easily create a confusion matrix by supplying the actual and predicted facies labels.

The confusion matrix is simply a 2D array. Entry C[i][j] of the confusion matrix is the number of observations known to have facies i that were predicted to have facies j.
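
A tiny illustration with hypothetical labels (not a cell from this notebook):

from sklearn.metrics import confusion_matrix

confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1])
# array([[1, 1],
#        [0, 2]])  -> one true class-0 sample was misclassified as class 1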

To simplify reading the confusion matrix, a function has been written to display the matrix along with facies labels and various error metrics. See the file classification_utilities.py in this repo for the display_cm() function.


In [52]:
from sklearn.metrics import confusion_matrix,f1_score, accuracy_score, make_scorer
from classification_utilities import display_cm, display_adj_cm

In [53]:
conf = confusion_matrix(correct_facies_labels, predicted_labels)
display_cm(conf, facies_labels, hide_zeros=True)


     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   267           1                                       268
     CSiS     1   933     6                                       940
     FSiS     1    13   765     1                                 780
     SiSh           1         269     1                           271
       MS           1           1   287     3           4         296
       WS                 1               579           2         582
        D                                   1   137           3   141
       PS           1     1     2     1     6         675         686
       BS                                               1   184   185

In [54]:
conf_comp = confusion_matrix(correct_facies_labels_comp, predicted_labels_comp)
display_cm(conf_comp, facies_labels, hide_zeros=True)


     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   255     3     1                                       259
     CSiS         734     4                                       738
     FSiS          10   605                                       615
     SiSh                     180     1                 3         184
       MS                 1     1   212     2           1         217
       WS                 1     1         459           1         462
        D                                        95           3    98
       PS                 1                 5         492         498
       BS                                               2   159   161

The rows of the confusion matrix correspond to the actual facies labels. The columns correspond to the labels assigned by the classifier. For example, consider the first row: of the feature vectors that actually have label SS, 267 were correctly identified as SS and 1 was classified as FSiS.

The entries along the diagonal are the facies that have been correctly classified. Below we define two functions that will give an overall value for how the algorithm is performing. The accuracy is defined as the number of correct classifications divided by the total number of classifications.


In [45]:
def accuracy(conf):
    total_correct = 0.
    nb_classes = conf.shape[0]
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
    acc = total_correct/sum(sum(conf))
    return acc

As noted above, the boundaries between the facies classes are not all sharp, and some of them blend into one another. The error within these 'adjacent facies' can also be calculated. We define an array to represent the facies adjacent to each other. For facies label i, adjacent_facies[i] is an array of the adjacent facies labels.


In [46]:
adjacent_facies = np.array([[1], [0,2], [1], [4], [3,5], [4,6,7], [5,7], [5,6,8], [6,7]])

def accuracy_adjacent(conf, adjacent_facies, offset):
    # offset shifts the adjacent-facies indices into the confusion matrix's
    # column numbering (0 here, since our facies labels already start at 0)
    nb_classes = conf.shape[0]
    total_correct = 0.
    for i in np.arange(0, nb_classes):
        total_correct += conf[i][i]
        for j in adjacent_facies[i]:
            total_correct += conf[i][j-offset]
    return total_correct / sum(sum(conf))

In [47]:
print('Processed Full')
print('Facies classification accuracy = %f' % accuracy(conf))
print('Adjacent facies classification accuracy = %f' % accuracy_adjacent(conf, adjacent_facies,0))


Processed Full
Facies classification accuracy = 0.987226
Adjacent facies classification accuracy = 0.995662

In [48]:
print('Non-processed')
print('Facies classification accuracy = %f' % accuracy(conf_comp))
print('Adjacent facies classification accuracy = %f' % accuracy_adjacent(conf_comp, adjacent_facies,0))


Non-processed
Facies classification accuracy = 0.987314
Adjacent facies classification accuracy = 0.996287

Model parameter selection

The classifier so far has been built with the default parameters. However, we may be able to get improved classification results with better parameter choices.

We will tune several random forest hyperparameters: the number of trees (n_estimators), the maximum tree depth (max_depth), the number of features considered at each split (max_features), and the minimum number of samples required to split a node or to sit at a leaf (min_samples_split, min_samples_leaf). More and deeper trees let the forest fit the training set more closely but risk overfitting; the minimum-sample parameters act as regularizers that smooth the individual trees.

Rather than an exhaustive grid search over every combination, we use RandomizedSearchCV, which samples n_iter parameter combinations from the specified distributions and scores each one with cross-validation. The combination with the best cross-validated score is selected.

Often a separate validation dataset will be created in addition to the training and test sets to do model selection. For this tutorial we rely on the cross-validation built into the search.


In [58]:
from sklearn.model_selection import RandomizedSearchCV,LeaveOneGroupOut
from scipy.stats import randint as sp_randint

Fscorer = make_scorer(f1_score, average = 'micro')
Ascorer = make_scorer(accuracy_score)

clf = Pipeline([("imputer", preprocessing.Imputer(missing_values='NaN',strategy="mean",axis=0)),
                    ("scaling", preprocessing.StandardScaler()),
                    ("trees", RandomForestClassifier())])
clf_comp = RandomForestClassifier()

n_iter_search = 30


param_dist =  {"trees__max_depth": [int(np.sqrt(feature_vectors.shape[1])), int(np.log2(feature_vectors.shape[1])), None],
               "trees__n_estimators": sp_randint(30,200),
               "trees__max_features": sp_randint(3,feature_vectors.shape[1]), 
               "trees__min_samples_split": sp_randint(10,100), 
               "trees__min_samples_leaf": sp_randint(2,30)}

param_dist_comp =  {"max_depth": [int(np.sqrt(scaled_features_comp.shape[1])), int(np.log2(scaled_features_comp.shape[1])), None],
               "n_estimators": sp_randint(30,200),
               "max_features": sp_randint(3,scaled_features_comp.shape[1]), 
               "min_samples_split": sp_randint(10,100), 
               "min_samples_leaf": sp_randint(2,30)}

In [60]:
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,scoring = Fscorer,n_iter=n_iter_search)
random_search.fit(feature_vectors,correct_facies_labels)
print('Best score for combo: {}'.format(random_search.best_score_))
print('Best parameters for combo: {}'.format(random_search.best_params_))
clf = random_search.best_estimator_
random_search.best_estimator_


Best score for combo: 0.5538684020245842
Best parameters for combo: {'trees__max_depth': None, 'trees__max_features': 3, 'trees__min_samples_leaf': 27, 'trees__min_samples_split': 90, 'trees__n_estimators': 149}
Out[60]:
Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('scaling', StandardScaler(copy=True, with_mean=True, with_std=True)), ('trees', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=3, m...mators=149, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False))])

In [61]:
random_search_comp = RandomizedSearchCV(clf_comp, param_distributions=param_dist_comp,scoring = Fscorer,n_iter=n_iter_search)
random_search_comp.fit(scaled_features_comp,correct_facies_labels_comp)
print('Best score for unprocessed: {}'.format(random_search_comp.best_score_))
print('Best parameters for unprocessed: {}'.format(random_search_comp.best_params_))
clf_comp = random_search_comp.best_estimator_
random_search_comp.best_estimator_


Best score for unprocessed: 0.5411509900990099
Best parameters for unprocessed: {'min_samples_split': 94, 'min_samples_leaf': 2, 'max_depth': None, 'n_estimators': 136, 'max_features': 3}
Out[61]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=3, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=2,
            min_samples_split=94, min_weight_fraction_leaf=0.0,
            n_estimators=136, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

Okay, let's take it up a notch: refit the tuned models on the full training set.


In [62]:
clf.fit(feature_vectors, correct_facies_labels)


Out[62]:
Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('scaling', StandardScaler(copy=True, with_mean=True, with_std=True)), ('trees', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=3, m...mators=149, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False))])

In [63]:
clf_comp.fit(scaled_features_comp, correct_facies_labels_comp)


Out[63]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=3, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=2,
            min_samples_split=94, min_weight_fraction_leaf=0.0,
            n_estimators=136, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [65]:
cv_conf = confusion_matrix(correct_facies_labels, clf.predict(feature_vectors))
cv_conf_comp = confusion_matrix(correct_facies_labels_comp, clf_comp.predict(scaled_features_comp))

In [66]:
print('Preprocessed')
print('Optimized facies classification accuracy = %.2f' % accuracy(cv_conf))
print('Optimized adjacent facies classification accuracy = %.2f' % accuracy_adjacent(cv_conf, adjacent_facies,0))


Preprocessed
Optimized facies classification accuracy = 0.66
Optimized adjacent facies classification accuracy = 0.93

In [67]:
print('Unprocessed')
print('Optimized facies classification accuracy = %.2f' % accuracy(cv_conf_comp))
print('Optimized adjacent facies classification accuracy = %.2f' % accuracy_adjacent(cv_conf_comp, adjacent_facies,0))


Unprocessed
Optimized facies classification accuracy = 0.68
Optimized adjacent facies classification accuracy = 0.93

Precision and recall are metrics that give more insight into how the classifier performs for individual facies. Precision is the probability that given a classification result for a sample, the sample actually belongs to that class. Recall is the probability that a sample will be correctly classified for a given class.

Precision and recall can be computed easily from the confusion matrix. The code to do so has been added to the display_cm() function:
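
For reference, the same quantities can be computed directly from the matrix (a sketch, not an original cell):

# Per-class precision and recall straight from a confusion matrix conf,
# where rows are true classes and columns are predicted classes.
def precision_recall(conf):
    tp = np.diag(conf).astype(float)
    precision = tp / conf.sum(axis=0)  # column sums: all samples predicted as class i
    recall = tp / conf.sum(axis=1)     # row sums: all samples truly of class i
    return precision, recall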


In [68]:
print('Pre-processed')
display_cm(cv_conf, facies_labels, 
           display_metrics=True, hide_zeros=True)


Pre-processed
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   139   118    11                                       268
     CSiS    29   724   181     2           1     1     2         940
     FSiS     5   197   562     3           1          12         780
     SiSh     1           4   191          60     4    11         271
       MS           6     5    34    47   133    13    58         296
       WS           2     1    32     3   397     9   136     2   582
        D                 1     8          16    63    50     3   141
       PS           4     8    18     1   119    13   518     5   686
       BS                       4          17     6    58   100   185

Precision  0.80  0.69  0.73  0.65  0.92  0.53  0.58  0.61  0.91  0.69
   Recall  0.52  0.77  0.72  0.70  0.16  0.68  0.45  0.76  0.54  0.66
       F1  0.63  0.73  0.72  0.68  0.27  0.60  0.50  0.68  0.68  0.65

In [69]:
print('Un-processed')
display_cm(cv_conf_comp, facies_labels, 
           display_metrics=True, hide_zeros=True)


Un-processed
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   150    98    11                                       259
     CSiS    34   556   142     3           1     1     1         738
     FSiS     5   125   470     4           2           9         615
     SiSh     1           4   132          39           8         184
       MS           7     4    27    27   107     6    38     1   217
       WS     1           1    27     3   351     3    75     1   462
        D                 1    10          15    43    28     1    98
       PS     1     2     9    11     1    92     1   366    15   498
       BS                       4          11          41   105   161

Precision  0.78  0.71  0.73  0.61  0.87  0.57  0.80  0.65  0.85  0.70
   Recall  0.58  0.75  0.76  0.72  0.12  0.76  0.44  0.73  0.65  0.68
       F1  0.67  0.73  0.75  0.66  0.22  0.65  0.57  0.69  0.74  0.67

To interpret these results, consider facies SS in the pre-processed case. If a sample was labeled SS, the probability that the label is correct is 0.80 (precision). If we know a sample truly has facies SS, the probability it will be correctly labeled by the classifier is only 0.52 (recall). It is desirable to have high values for both precision and recall, but often when an algorithm is tuned to increase one, the other decreases. The F1 score combines both to give a single measure of relevancy of the classifier results.
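
Concretely, F1 is the harmonic mean of precision and recall; for SS above:

# F1 = 2 * precision * recall / (precision + recall)
2 * 0.80 * 0.52 / (0.80 + 0.52)  # ≈ 0.63, matching the SS column of the table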

These results can help guide intuition for how to improve the classifier. For example, a sample with facies MS (mudstone) is only classified correctly 16% of the time (recall). Perhaps this could be improved by introducing more training samples. Sample quality could also play a role. Facies BS (bafflestone) still achieves a reasonable F1 score (0.68) despite relatively few training examples — but that data was handpicked from other wells specifically to provide training examples for this facies.

We can also consider the classification metrics when we consider misclassifying an adjacent facies as correct:


In [70]:
display_adj_cm(cv_conf, facies_labels, adjacent_facies, 
           display_metrics=True, hide_zeros=True)


     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   257          11                                       268
     CSiS         934           2           1     1     2         940
     FSiS     5         759     3           1          12         780
     SiSh     1           4   191          60     4    11         271
       MS           6     5         214          13    58         296
       WS           2     1    32         545                 2   582
        D                 1     8               129           3   141
       PS           4     8    18     1               655         686
       BS                       4          17               164   185

Precision  0.98  0.99  0.96  0.74  1.00  0.87  0.88  0.89  0.97  0.93
   Recall  0.96  0.99  0.97  0.70  0.72  0.94  0.91  0.95  0.89  0.93
       F1  0.97  0.99  0.97  0.72  0.84  0.90  0.90  0.92  0.93  0.93

In [71]:
display_adj_cm(cv_conf_comp, facies_labels, adjacent_facies, 
           display_metrics=True, hide_zeros=True)


     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   248          11                                       259
     CSiS         732           3           1     1     1         738
     FSiS     5         595     4           2           9         615
     SiSh     1           4   132          39           8         184
       MS           7     4         161           6    38     1   217
       WS     1           1    27         432                 1   462
        D                 1    10                86           1    98
       PS     1     2     9    11     1               474         498
       BS                       4          11               146   161

Precision  0.97  0.99  0.95  0.69  0.99  0.89  0.92  0.89  0.98  0.93
   Recall  0.96  0.99  0.97  0.72  0.74  0.94  0.88  0.95  0.91  0.93
       F1  0.96  0.99  0.96  0.70  0.85  0.91  0.90  0.92  0.94  0.93

Considering adjacent facies, the F1 scores for all facies types are above 0.9, except for SiSh (marine siltstone and shale, F1 of 0.72) and MS (mudstone, F1 of 0.84). The classifier still misclassifies SiSh fairly often (recall of 0.70), most often as wackestone.

These results are comparable to those reported in Dubois et al. (2007).

All of the scores above were computed on data the models had already seen. Samples from the same well are strongly correlated, so a more honest estimate of generalization is leave-one-well-out cross-validation: train on all wells but one, then predict the held-out well.


In [77]:
wells = processed_data["Well Name"].values
wells_comp = training_data["Well Name"].values
logo = LeaveOneGroupOut()
logo_comp = LeaveOneGroupOut()

In [95]:
f1_SVC = []
pred = {}

X = feature_vectors.as_matrix()

for train, test in logo.split(feature_vectors, correct_facies_labels, groups=wells):
    well_name = wells[test[0]]
    clf.fit(X[train], correct_facies_labels[train])
    pred[well_name] = clf.predict(X[test])
    sc = f1_score(correct_facies_labels[test], pred[well_name], labels=np.arange(9), average='micro')  # 9 facies, labels 0-8
    print("{:>20s}  {:.3f}".format(well_name, sc))
    f1_SVC.append(sc)

print("-Average leave-one-well-out F1 Score: {0}".format(sum(f1_SVC)/(1.0*(len(f1_SVC)))))


         ALEXANDER D  0.599
     CHURCHMAN BIBLE  0.545
      CROSS H CATTLE  0.315
            KIMZEY A  0.531
            LUKE G U  0.644
               NEWBY  0.508
               NOLAN  0.516
          Recruit F9  0.400
             SHANKLE  0.479
           SHRIMPLIN  0.597
-Average leave-one-well-out F1 Score: 0.513230638767481

In [96]:
f1_SVC_comp = []
pred_comp = {}
for train, test in logo_comp.split(scaled_features_comp, correct_facies_labels_comp, groups=wells_comp):
    well_name = wells_comp[test[0]]
    clf_comp.fit(scaled_features_comp[train], correct_facies_labels_comp[train])
    pred_comp[well_name] = clf_comp.predict(scaled_features_comp[test])
    sc = f1_score(correct_facies_labels_comp[test], pred_comp[well_name], labels=np.arange(9), average='micro')  # 9 facies, labels 0-8
    print("{:>20s}  {:.3f}".format(well_name, sc))
    f1_SVC_comp.append(sc)

print("-Average leave-one-well-out F1 Score: {0}".format(sum(f1_SVC_comp)/(1.0*(len(f1_SVC_comp)))))


     CHURCHMAN BIBLE  0.542
      CROSS H CATTLE  0.325
            LUKE G U  0.627
               NEWBY  0.499
               NOLAN  0.501
          Recruit F9  0.471
             SHANKLE  0.517
           SHRIMPLIN  0.571
-Average leave-one-well-out F1 Score: 0.5066085936986217

Applying the classification model to new data

Now that we have a trained facies classification model we can use it to identify facies in wells that do not have core data. In this case, we will apply the classifier to two wells, but we could use it on any number of wells for which we have the same set of well logs for input.

This dataset is similar to the training data, except it does not have facies labels. It is loaded into a dataframe called well_data.


In [100]:
well_data = pd.read_csv('../validation_data_nofacies.csv')
well_data['Well Name'] = well_data['Well Name'].astype('category')
# NM_M takes the same two values here as in training, so refitting the encoder is equivalent
well_data['NM_M'] = enc.fit(well_data[['NM_M']]).transform(well_data[['NM_M']]).toarray()
well_features = well_data.drop(['Formation', 'Well Name','Depth'], axis=1)

The imputer and scaler inside the clf pipeline were fit on the training data, so clf.predict will automatically apply the same constants to the new data.
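
We can peek at the stored training-set constants the pipeline will reuse (a sketch, not an original cell):

clf.named_steps['scaling'].mean_   # per-feature training means
clf.named_steps['scaling'].scale_  # per-feature training standard deviations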


In [103]:
well_features.describe()


Out[103]:
GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS
count 830.00000 830.000000 830.000000 830.000000 830.000000 830.000000 830.000000
mean 57.61173 0.666312 2.851964 11.655277 3.654178 0.321687 0.535807
std 27.52774 0.288367 3.442074 5.190236 0.649793 0.467405 0.283062
min 12.03600 -0.468000 -8.900000 1.855000 2.113000 0.000000 0.013000
25% 36.77325 0.541000 0.411250 7.700000 3.171500 0.000000 0.300000
50% 58.34450 0.675000 2.397500 10.950000 3.515500 0.000000 0.547500
75% 73.05150 0.850750 4.600000 14.793750 4.191500 1.000000 0.778000
max 220.41300 1.507000 16.500000 31.335000 6.321000 1.000000 1.000000

Finally we predict facies labels for the unknown data and store the results in a Facies column of the well_data dataframe.


In [104]:
#predict facies of unclassified data
y_unknown = clf.predict(well_features)
well_data['Facies'] = y_unknown
well_data


Out[104]:
Formation Well Name Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS Facies
0 A1 SH STUART 2808.0 66.276 0.630 3.300 10.650 3.591 1.0 1.000 2
1 A1 SH STUART 2808.5 77.252 0.585 6.500 11.950 3.341 1.0 0.978 2
2 A1 SH STUART 2809.0 82.899 0.566 9.400 13.600 3.064 1.0 0.956 1
3 A1 SH STUART 2809.5 80.671 0.593 9.500 13.250 2.977 1.0 0.933 1
4 A1 SH STUART 2810.0 75.971 0.638 8.700 12.350 3.020 1.0 0.911 1
... ... ... ... ... ... ... ... ... ... ... ...
825 C SH CRAWFORD 3158.5 86.078 0.554 5.040 16.150 3.161 1.0 0.639 1
826 C SH CRAWFORD 3159.0 88.855 0.539 5.560 16.750 3.118 1.0 0.611 1
827 C SH CRAWFORD 3159.5 90.490 0.530 6.360 16.780 3.168 1.0 0.583 1
828 C SH CRAWFORD 3160.0 90.975 0.522 7.035 16.995 3.154 1.0 0.556 1
829 C SH CRAWFORD 3160.5 90.108 0.513 7.505 17.595 3.125 1.0 0.528 2

830 rows × 11 columns


In [105]:
well_data['Well Name'].unique()


Out[105]:
[STUART, CRAWFORD]
Categories (2, object): [STUART, CRAWFORD]

We can use the well log plot to view the classification results along with the well logs.


In [110]:
make_facies_log_plot(
    well_data[well_data['Well Name'] == 'STUART'],
    facies_colors=facies.Colors)



In [111]:
make_facies_log_plot(
    well_data[well_data['Well Name'] == 'CRAWFORD'],
    facies_colors=facies.Colors)


Finally we can write out a csv file with the well data along with the facies classification results.


In [112]:
well_data['Facies'] = y_unknown + 1  # shift labels back to the original 1-9 range
well_data.to_csv('well_data_with_facies.csv')

References

Amato del Monte, A., 2015. Seismic Petrophysics: Part 1, The Leading Edge, 34 (4). doi:10.1190/tle34040440.1

Bohling, G. C., and M. K. Dubois, 2003. An Integrated Application of Neural Network and Markov Chain Techniques to Prediction of Lithofacies from Well Logs, KGS Open-File Report 2003-50, 6 pp.

Dubois, M. K., G. C. Bohling, and S. Chakrabarti, 2007. Comparison of four approaches to a rock facies classification problem, Computers & Geosciences, 33 (5), 599-617. doi:10.1016/j.cageo.2006.08.011