Facies classification using KNearestNeighbors


Dan Hallau

A lot of sophisticated models have been submitted for the contest so far (deep neural nets, random forests, etc.) so I thought I'd try submitting a simpler model to see how it stacks up. In that spirit here's a KNearestNeighbors classifier.

I spend a few cells back-calculating some more standard logging curves (RHOB, NPHI, etc) then create a log-based lithology model from a Umaa-Rhomaa plot. After training, I finish it up with a LeaveOneGroupOut test.


In [1]:
import pandas as pd
import numpy as np

from sklearn import neighbors
from sklearn import preprocessing
from sklearn.model_selection import LeaveOneGroupOut

import inversion

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Load training data


In [2]:
df = pd.read_csv('../facies_vectors.csv')
df.dropna(inplace=True)

Build features

In the real world it would be unusual to have neutron-density cross-plot porosity (i.e. PHIND) without the corresponding raw input curves, namely bulk density and neutron porosity, as we have in this contest dataset. So as part of the feature engineering process, I back-calculate estimates of those raw curves from the provided DeltaPHI and PHIND curves. One issue with this approach though is that cross-plot porosity differs between vendors, toolstrings, and software packages, and it is not known exactly how the PHIND in this dataset was computed. So I make the assumption here that PHIND ≈ sum of squares porosity, which is usually an adequate approximation of neutron-density crossplot porosity. That equation looks like this:

$$PHIND ≈ \sqrt{\frac{NPHI^2 + DPHI^2}{2}}$$

and it is assumed here that DeltaPHI is:

$$DeltaPHI = NPHI - DPHI$$

The functions below use the relationships from the above equations (...two equations, two unknowns...) to estimate NPHI and DPHI (and consequently RHOB).

Once we have RHOB, we can use it combined with PE to estimate apparent grain density (RHOMAA) and apparent photoelectric capture cross-section (UMAA), which are useful in lithology estimations from well logs.


In [3]:
def estimate_dphi(df):
    return ((4*(df['PHIND']**2) - (df['DeltaPHI']**2))**0.5 - df['DeltaPHI']) / 2

def estimate_rhob(df):
    return (2.71 - (df['DPHI_EST']/100) * 1.71)

def estimate_nphi(df):
    return df['DPHI_EST'] + df['DeltaPHI']

def compute_rhomaa(df):
    return (df['RHOB_EST'] - (df['PHIND'] / 100)) / (1 - df['PHIND'] / 100)
            
def compute_umaa(df):
    return ((df['PE'] * df['RHOB_EST']) - (df['PHIND']/100 * 0.398)) / (1 - df['PHIND'] / 100)

Because solving the sum of squares equation involved the quadratic formula, in some cases imaginary numbers result due to porosities being negative, which is what the warning below is about.


In [4]:
df['DPHI_EST'] = df.apply(lambda x: estimate_dphi(x), axis=1).astype(float)
df['RHOB_EST'] = df.apply(lambda x: estimate_rhob(x), axis=1)
df['NPHI_EST'] = df.apply(lambda x: estimate_nphi(x), axis=1)
df['RHOMAA_EST'] = df.apply(lambda x: compute_rhomaa(x), axis=1)
df['UMAA_EST'] = df.apply(lambda x: compute_umaa(x), axis=1)


/home/delta/anaconda3/lib/python3.5/site-packages/pandas/core/common.py:1920: ComplexWarning: Casting complex values to real discards the imaginary part
  return arr.astype(dtype)

Just for fun, below is a basic Umaa-Rhomaa plot to view relative abundances of quartz, calcite, dolomite, and clay. The red triangle represents a ternary solution for QTZ, CAL, and DOL, while the green triangle represents a solution for QTZ, CAL, and CLAY (illite).


In [5]:
df[df.GR < 125].plot(kind='scatter', x='UMAA_EST', y='RHOMAA_EST', c='GR', figsize=(8,6))
plt.ylim(3.1, 2.2)
plt.xlim(0.0, 17.0)
plt.plot([4.8, 9.0, 13.8, 4.8], [2.65, 2.87, 2.71, 2.65], c='r')
plt.plot([4.8, 11.9, 13.8, 4.8], [2.65, 3.06, 2.71, 2.65], c='g')
plt.scatter([4.8], [2.65], s=50, c='r')
plt.scatter([9.0], [2.87], s=50, c='r')
plt.scatter([13.8], [2.71], s=50, c='r')
plt.scatter([11.9], [3.06], s=50, c='g')
plt.text(2.8, 2.65, 'Quartz', backgroundcolor='w')
plt.text(14.4, 2.71, 'Calcite', backgroundcolor='w')
plt.text(9.6, 2.87, 'Dolomite', backgroundcolor='w')
plt.text(12.5, 3.06, 'Illite', backgroundcolor='w')
plt.text(7.0, 2.55, "gas effect", ha="center", va="center", rotation=-55,
            size=8, bbox=dict(boxstyle="larrow,pad=0.3", fc="pink", ec="red", lw=2))
plt.text(15.0, 2.78, "heavies?", ha="center", va="center", rotation=0,
            size=8, bbox=dict(boxstyle="rarrow,pad=0.3", fc="yellow", ec="orange", lw=2))


Out[5]:
<matplotlib.text.Text at 0x7f3ccf2ea668>

Here I use matrix inversion to "solve" the ternary plot for each lithologic component. Essentially each datapoint is a mix of the three components defined by the ternary diagram, with abundances of each defined by the relative distances from each endpoint. I use a GR cutoff of 40 API to determine when to use either the QTZ-CAL-DOL or QTZ-CAL-CLAY ternary solutions. In other words, it is assumed that below 40 API, there is 0% clay, and above 40 API there is 0% dolomite, and also that these four lithologic components are the only components in these rocks. Admittedly it's not a great assumption, especially since the ternary plot indicates other stuff is going on. For example the high Umaa datapoints near the Calcite endpoint may indicate some heavy minerals (e.g., pyrite) or even barite-weighted mud. The "pull" of datapoints to the northwest quadrant probably reflects some gas effect, so my lithologies in those gassy zones will be skewed.


In [6]:
# QTZ-CAL-CLAY
ur = inversion.UmaaRhomaa()
ur.set_dol_uma(11.9)
ur.set_dol_rhoma(3.06)
# QTZ-CAL-DOL
ur2 = inversion.UmaaRhomaa()

df['UR_QTZ'] = np.nan
df['UR_CLY'] = np.nan
df['UR_CAL'] = np.nan
df['UR_DOL'] = np.nan

df.ix[df.GR >= 40, 'UR_QTZ'] = df.ix[df.GR >= 40].apply(lambda x: ur.get_qtz(x.UMAA_EST, x.RHOMAA_EST), axis=1)
df.ix[df.GR >= 40, 'UR_CLY'] = df.ix[df.GR >= 40].apply(lambda x: ur.get_dol(x.UMAA_EST, x.RHOMAA_EST), axis=1) 
df.ix[df.GR >= 40, 'UR_CAL'] = df.ix[df.GR >= 40].apply(lambda x: ur.get_cal(x.UMAA_EST, x.RHOMAA_EST), axis=1)
df.ix[df.GR >= 40, 'UR_DOL'] = 0

df.ix[df.GR < 40, 'UR_QTZ'] = df.ix[df.GR < 40].apply(lambda x: ur2.get_qtz(x.UMAA_EST, x.RHOMAA_EST), axis=1)
df.ix[df.GR < 40, 'UR_DOL'] = df[df.GR < 40].apply(lambda x: ur2.get_dol(x.UMAA_EST, x.RHOMAA_EST), axis=1) 
df.ix[df.GR < 40, 'UR_CAL'] = df[df.GR < 40].apply(lambda x: ur2.get_cal(x.UMAA_EST, x.RHOMAA_EST), axis=1)
df.ix[df.GR < 40, 'UR_CLY'] = 0

Fit KNearestNeighbors model and apply LeaveOneGroupOut test

There is some bad log data in this dataset which I'd guess is due to rugose hole. PHIND gets as high at 80%, which is certainly spurious. For the purposes of this exercise, I'll just remove records where neutron-density crossplot porosity is super high.

Also the CROSS H CATTLE well consistently returns anomalously low F1 scores, so I'm going to omit it from the training set.


In [7]:
df1 = df.dropna()
df1 = df1[(df1.PHIND <= 40) & (df1['Well Name'] != 'CROSS H CATTLE')]

facies = df1['Facies'].values
wells = df1['Well Name'].values

drop_list = ['Formation', 'Well Name', 'Facies', 'Depth', 'DPHI_EST',  'NPHI_EST', 'DeltaPHI',
             'RHOMAA_EST', 'UMAA_EST', 'UR_QTZ', 'UR_DOL'] 

fv = df1.drop(drop_list, axis=1).values

clf = neighbors.KNeighborsClassifier(n_neighbors=56, weights='distance') 

X = preprocessing.StandardScaler().fit(fv).transform(fv)
y = facies

logo = LeaveOneGroupOut()
f1knn = []

for train, test in logo.split(X, y, groups=wells):
    well_name = wells[test[0]]
    clf.fit(X[train], y[train])
    predicted_labels = clf.predict(X[test])
    score = clf.fit(X[train], y[train]).score(X[test], y[test])
    print("{:>20s}  {:.3f}".format(well_name, score))
    f1knn.append(score)
    
print("-Average leave-one-well-out F1 Score: %6f" % (np.mean(f1knn)))
f1knn.pop(4)
print("-Average leave-one-well-out F1 Score, no Recruit F1: %6f" % (np.mean(f1knn)))


     CHURCHMAN BIBLE  0.544
            LUKE G U  0.584
               NEWBY  0.501
               NOLAN  0.499
          Recruit F9  0.868
             SHANKLE  0.510
           SHRIMPLIN  0.575
-Average leave-one-well-out F1 Score: 0.582957
-Average leave-one-well-out F1 Score, no Recruit F1: 0.535509

Apply model to validation dataset

Load validation data (vd), build features, and use the classfier from above to predict facies.


In [8]:
vd = pd.read_csv('../validation_data_nofacies.csv', index_col=0)
vd.dropna(inplace=True)

vd['DPHI_EST'] = vd.apply(lambda x: estimate_dphi(x), axis=1).astype(float)
vd['RHOB_EST'] = vd.apply(lambda x: estimate_rhob(x), axis=1)
vd['NPHI_EST'] = vd.apply(lambda x: estimate_nphi(x), axis=1)
vd['RHOMAA_EST'] = vd.apply(lambda x: compute_rhomaa(x), axis=1)
vd['UMAA_EST'] = vd.apply(lambda x: compute_umaa(x), axis=1)

vd['UR_QTZ'] = np.nan
vd['UR_CLY'] = np.nan
vd['UR_CAL'] = np.nan
vd['UR_DOL'] = np.nan

vd.ix[vd.GR >= 40, 'UR_QTZ'] = vd.ix[vd.GR >= 40].apply(lambda x: ur.get_qtz(x.UMAA_EST, x.RHOMAA_EST), axis=1)
vd.ix[vd.GR >= 40, 'UR_CLY'] = vd.ix[vd.GR >= 40].apply(lambda x: ur.get_dol(x.UMAA_EST, x.RHOMAA_EST), axis=1) 
vd.ix[vd.GR >= 40, 'UR_CAL'] = vd.ix[vd.GR >= 40].apply(lambda x: ur.get_cal(x.UMAA_EST, x.RHOMAA_EST), axis=1)
vd.ix[vd.GR >= 40, 'UR_DOL'] = 0

vd.ix[vd.GR < 40, 'UR_QTZ'] = vd.ix[vd.GR < 40].apply(lambda x: ur2.get_qtz(x.UMAA_EST, x.RHOMAA_EST), axis=1)
vd.ix[vd.GR < 40, 'UR_DOL'] = vd[vd.GR < 40].apply(lambda x: ur2.get_dol(x.UMAA_EST, x.RHOMAA_EST), axis=1) 
vd.ix[vd.GR < 40, 'UR_CAL'] = vd[vd.GR < 40].apply(lambda x: ur2.get_cal(x.UMAA_EST, x.RHOMAA_EST), axis=1)
vd.ix[vd.GR < 40, 'UR_CLY'] = 0

vd1 = vd.dropna()
vd1 = vd1[(vd1.PHIND <= 40)]

drop_list1 = ['Well Name', 'Depth', 'DPHI_EST',  'NPHI_EST', 'DeltaPHI',
             'RHOMAA_EST', 'UMAA_EST', 'UR_QTZ', 'UR_DOL'] 

fv1 = vd1.drop(drop_list1, axis=1).values

X1 = preprocessing.StandardScaler().fit(fv1).transform(fv1)

vd_predicted_facies = clf.predict(X1)