In [2]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.pyplot as plt
import itertools
import pandas as pd
from sklearn import decomposition, preprocessing
from skimage.feature import greycomatrix, greycoprops
from skimage import exposure
The code and ideas to engineer new features used in this notebook, by Matteo Niccoli and Mark Dahl, with contributions by Daniel Kittridge, are licensed under a Creative Commons Attribution 4.0 International License.
In [3]:
# import data and fill missing PE values with the average
filename = 'nofacies_data.csv'
training_data = pd.read_csv(filename)
training_data['PE'].fillna((training_data['PE'].mean()), inplace=True)
print np.shape(training_data)
In [5]:
pd.set_option('display.float_format', lambda x: '%.4f' % x)
training_data.describe()
Out[5]:
To keep feature importance on a level playing field, we will rescale each WL log before calculating moments. We will use sklearn.preprocessing.StandardScaler.
In [6]:
# standardize features to go into moments calculation
feature_vectors = training_data.drop(['Formation', 'Well Name', 'Depth'], axis=1)
scaler = preprocessing.StandardScaler().fit(feature_vectors)
scaled_features = scaler.transform(feature_vectors)
In [7]:
scaled_vectors_df = pd.DataFrame(scaled_features, columns=list(feature_vectors))
scaled_feat_df = pd.concat((training_data[['Depth', 'Well Name', 'Formation']], scaled_vectors_df),1)
scaled_feat_df.head()
Out[7]:
In [8]:
scaled_feat_df.shape
Out[8]:
The rate of change of a series of values is commonly used as a booster feature for machine learning classifiers. We will calculate the first and second derivatives of each WL log curve in each well.
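As a quick illustration with made-up numbers (not from the well data), np.gradient uses central differences in the interior of an array and one-sided differences at its edges:
# illustrative only: derivatives of a made-up series
x = np.array([0., 1., 4., 9., 16.])
print np.gradient(x)               # first derivative: [ 1.  2.  4.  6.  7.]
print np.gradient(np.gradient(x))  # second derivative of the same series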
In [9]:
# calculate all 1st and 2nd derivative for all logs, for all wells
derivative_df = pd.DataFrame() # final dataframe
grouped = training_data['Well Name'].unique()
for well in grouped: # for each well
new_df = pd.DataFrame() # make a new temporary dataframe
for log in ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND' ,'PE']:
d1 = np.array(np.gradient(scaled_feat_df[log][scaled_feat_df['Well Name'] == well]))
d2 = np.array(np.gradient(np.gradient(scaled_feat_df[log][scaled_feat_df['Well Name'] == well])))
# write to temporary dataframe
new_df[str(log) + '_d1'] = d1
new_df[str(log) + '_d2'] = d2
# append all rows of temporary dataframe to final dataframe
derivative_df = pd.concat([derivative_df, new_df])
In [10]:
derivative_df.describe()
Out[10]:
Facies are interpreted groupings of rocks, commonly composed of several rock elements, each demonstrating different properties. Therefore, we should expect to see a distribution of WL log responses for each facies. A corollary of this is that attempting to directly solve for a facies from the WL log responses at any given depth will be tenuous. Facies require a context, a context provided by the surrounding rock. Likewise, if we are to effectively solve for facies from WL logs, we should provide a context for each response at a given depth. We can accomplish this with rolling windows.
A rolling window provides a local neighbourhood of values about a central point, which can be stepped through an array of values. The sample size of the neighbourhood evaluated (its depth thickness divided by the sampling rate) should relate directly to the thickness of a facies. Because facies are observed at different thicknesses, we will build neighbourhoods large enough to include the thickest observed facies. To keep the number of rolling windows reasonable, we will use a geometric function in which the half-window length is doubled for each subsequent value.
In [11]:
# import file
filename = 'facies_vectors.csv'
data = pd.read_csv(filename)
# eliminate nulls
PE_mask = data['PE'].notnull().values
data = data[PE_mask]
# get facies
y = data['Facies'].values
# get longest facies
max_len = max(len(list(s)) for (c,s) in itertools.groupby(y))
max_len
Out[11]:
In [12]:
# function to create a geometric series of window sizes
# using powers of 2 up to one just above a reference geological size (longest facies)
def geom_windows(max_sz):
"""returns a list of square window sizes using powers of two"""
return list(int(2**(n+1)+1) for n in np.arange(np.ceil(np.log2(max_sz))))
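A quick sanity check with an arbitrary reference size (100 here is made up, not the actual longest facies):
# illustrative: window lengths are of the form 2**(n+1) + 1
print geom_windows(100)   # [3, 5, 9, 17, 33, 65, 129]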
In [13]:
# window sizes
sizes = geom_windows(max_len)
sizes
Out[13]:
The simplest and most fundamental way to numerically describe the shape of a distribution of values is using moments. The first moment, mean $\mu$, characterizes the central tendency of the distribution. The second moment, variance $\sigma^2$, characterizes the spread of the values about the central tendency. The third moment, skewness $\gamma_1$, characterizes the symmetry (or lack thereof) about the central tendency.
We will calculate the first three moments (with one small modification) for each rolling window size at every depth. The small modification is that instead of variance $\sigma^2$ we calculate standard deviation $\sigma$, because variance has units of the mean squared $\mu^2$; as a result, the feature importance of variance is artificially inflated by the dimension of its values. Standard deviation $\sigma$ has the same dimension as the mean $\mu$.
With respect to facies prediction, in addition to the raw WL log inputs, we will now describe, at multiple scales, the shapes of the distributions of WL log responses associated with each facies.
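Before applying them to the logs, here is a minimal illustration of the three statistics on a small made-up sample:
# illustrative only: the three summary statistics on a small made-up sample
x = np.array([1., 2., 2., 3., 7.])
print np.mean(x)        # first moment: central tendency
print np.std(x)         # standard deviation: spread, in the same units as the mean
print sp.stats.skew(x)  # third moment: asymmetry of the distribution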
In [14]:
# Efficient rolling statistics with NumPy
# http://www.rigtorp.se/2011/01/01/rolling-statistics-numpy.html
def rolling_window(a, window):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
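For example (illustrative only), on a short array the function returns overlapping views of the requested length:
# illustrative: overlapping windows of length 3 over np.arange(5)
print rolling_window(np.arange(5), 3)   # [[0 1 2], [1 2 3], [2 3 4]]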
In [15]:
# function to calculate moments using a rolling window
def rollin_moments(arr, w, moment ='mean'):
"""- pad input array by (w-1)/2 samples at the top and bottom
- apply rolling window function
- calculate moment: mean (default), var, or skew"""
arr = np.pad(arr, ((w-1)/2, (w-1)/2), 'edge')
if moment == 'std':
return np.array(np.std(rolling_window(arr, w), 1))
elif moment == 'skew':
return np.array(sp.stats.skew(rolling_window(arr, w), 1))
else:
return np.array(np.mean(rolling_window(arr, w), 1))
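Because the input is padded by (w-1)/2 samples at each end, the output keeps the length of the input; a quick illustrative check on made-up values:
# illustrative: the rolling moment preserves the input length
print len(rollin_moments(np.random.randn(100), 5, moment='std'))   # 100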
In [16]:
moments = ['mean', 'std', 'skew']
In [17]:
# calculate all moments for all logs, for all wells
moments_df = pd.DataFrame() # final dataframe
grouped = training_data['Well Name'].unique()
for well in grouped: # for each well
new_df = pd.DataFrame() # make a new temporary dataframe
for log in ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND' ,'PE']:
for mo in moments: # for each moment
# calculate the rolling moments with each window size
# and also the mean of moments (all window sizes)
results = np.array([rollin_moments(scaled_feat_df[log][scaled_feat_df['Well Name'] == well],
size, moment = mo) for size in sizes])
mean_result = np.mean(results, axis=0)
# write to temporary dataframe
new_df[str(log)+ '_' + str(mo)+'_wsize=' +str(sizes[0])] = results[0]
new_df[str(log)+ '_' + str(mo)+'_wsize=' +str(sizes[1])] = results[1]
new_df[str(log)+ '_' + str(mo)+'_wsize=' +str(sizes[2])] = results[2]
new_df[str(log)+ '_' + str(mo)+'_wsize=' +str(sizes[3])] = results[3]
new_df[str(log)+ '_' + str(mo)+'_wsize=' +str(sizes[4])] = results[4]
new_df[str(log)+ '_' + str(mo)+'_wsize=' +str(sizes[5])] = results[5]
new_df[str(log)+ '_' + str(mo)+'_wsize=' +str(sizes[6])] = results[6]
new_df[str(log)+ '_' + str(mo)+'_wsize=ave'] = mean_result
# append all rows of temporary dataframe to final dataframe
moments_df = pd.concat([moments_df, new_df])
In [18]:
moments_df.describe()
Out[18]:
Statistical moments can be said to characterize the composition of a neighbourhood of values. However, we can easily describe two neighbourhoods with identical composition that are distinctly different. For example, N1 = [00001111] and N2 = [01010101] have exactly the same mean $\mu$, variance $\sigma^2$, and skewness $\gamma_1$ but, in terms of rocks, might represent different facies. Therefore, in addition to describing the shape of a distribution of values for a facies, we need something to evaluate the ordering of those values. That something is a grey-level co-occurrence matrix (GLCM).
A GLCM is a second-order statistical method that numerically describes the ordering of elements by evaluating the probability of values being neighbours. Think of the GLCM as a histogram that preserves the ordering of values. For more about the GLCM, see Mryka Hall-Beyer's tutorial and read the skimage.feature.greycomatrix documentation. Just as we calculated moments to describe the shape of a histogram, we need to represent the arrangement of values in a GLCM with a single value. Properties that capture different characteristics of a GLCM, including contrast, dissimilarity, homogeneity, ASM, energy, and correlation, can be calculated with skimage.feature.greycoprops. To keep the resulting dimensions equivalent to the moments previously calculated, we will use three properties: dissimilarity, energy, and correlation.
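As a minimal sketch using the made-up neighbourhoods N1 and N2 from above (arranged as single-column images for simplicity), the dissimilarity property separates two arrays with identical moments:
# illustrative: identical moments, very different GLCM dissimilarity
N1 = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=np.uint8).reshape(-1, 1)
N2 = np.array([0, 1, 0, 1, 0, 1, 0, 1], dtype=np.uint8).reshape(-1, 1)
for name, n in [('N1', N1), ('N2', N2)]:
    glcm = greycomatrix(n, [1], [np.pi/2], levels=2, normed=True)
    print name, np.mean(n), np.std(n), greycoprops(glcm, 'dissimilarity')[0, 0]
# both have mean 0.5 and standard deviation 0.5, but N2's dissimilarity is much higher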
In [19]:
# function to calculate glcm and greycoprops using a rolling window
def gprops_calc(arr, w, lv, sym = True, prop='dissimilarity'):
"""- make w copies of the input array, roll it up one row at a time
- calculate glcm on a square window of size w
- calculate greycoprops from glcm: dissimilarity (default), energy, or correlation
- repeat until back at row one
N.B. the input array is padded by (w-1)/2 samples at the top and bottom"""
diss = []
itr = len(arr)
arr = np.pad(arr, ((w-1)/2, (w-1)/2), 'edge')
s = np.array([arr,]*w,dtype=np.uint8).transpose()
for _ in np.arange(itr):
        glcm = greycomatrix(s[:w,:], [1], [np.pi/2], levels = lv, symmetric = sym, normed = True)
if prop == 'correlation':
ds = greycoprops(glcm, 'correlation')
elif prop == 'energy':
ds = greycoprops(glcm, 'energy')
else:
ds = greycoprops(glcm, 'dissimilarity')
diss.append(ds)
s = np.roll(s[:, :], -w)
return np.ndarray.flatten(np.array(diss))
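A quick illustrative check (with a random made-up input) that the padded rolling GLCM returns one property value per input sample:
# illustrative: output length equals input length
test = np.random.randint(0, 64, size=200)
print len(gprops_calc(test, 9, lv=64, prop='energy'))   # 200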
In [20]:
methods = ['dissimilarity','energy', 'correlation']
Similar to the step preceding the moments calculation, we will rescale the raw WL logs for the GLCM property calculation so that each resulting property is unaffected by the magnitude of the raw WL log values. skimage.feature.greycomatrix requires uint8 values, so we need an alternative to sklearn.preprocessing.StandardScaler. Unlike with calculating moments, preserving the shape of the histogram is not important to the integrity of a GLCM property. We will use histogram equalization, which flattens a histogram (puts an equal number of values in each bin). To maximize the effectiveness of a GLCM, it is commonly wise to reduce the bit depth from 8 to avoid processing expense and the noise caused by empty matrix entries. After some trial and error, we found that 64 bins work nicely. Note that 64 bins results in a 64x64 matrix at every depth for every rolling window size.
In [23]:
# functions to equalize histogram of features to go into GLCM calculation
def eqlz(arr, bins):
return (bins-1) * exposure.equalize_hist(arr)
def eqlz_along_axis(arr, bins):
return np.apply_along_axis(eqlz, 0, arr, bins)
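A quick illustrative check (on made-up values) that the equalized output spans roughly 0 to bins-1 with an approximately flat histogram:
# illustrative: histogram equalization of a made-up log maps values onto ~[0, 63]
xe = eqlz(np.random.randn(1000), 64)
print xe.min(), xe.max()   # approximately 0 and 63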
In [24]:
# equalize features
feature_vectors_glcm = training_data.drop(['Formation', 'Well Name', 'Depth'], axis=1)
eq_vectors_glcm = eqlz_along_axis(feature_vectors_glcm, 64)
In [25]:
eq_vectors_glcm_df = pd.DataFrame(eq_vectors_glcm, columns=list(feature_vectors_glcm))
eq_vectors_glcm_df = np.round(eq_vectors_glcm_df).astype(int)
eq_glcm_df = pd.concat((training_data[['Depth', 'Well Name', 'Formation']], eq_vectors_glcm_df),1)
eq_glcm_df.head()
Out[25]:
One last consideration for the GLCM is its symmetry. Symmetry in a GLCM refers to a bi-directional evaluation of the reference-neighbour pair. In plain English, if you were to construct a GLCM by hand, you would move through an array in one direction and then in the opposite direction. It is often desirable to do this because it removes the asymmetry caused at the edge of a neighbourhood. See Mryka Hall-Beyer's tutorial for a full explanation. However, since sedimentary rocks (provided that they are structurally undisturbed) are laid down from bottom to top, we thought that, in addition to the symmetric GLCM, it would be useful to evaluate the asymmetric GLCM, where we look at the neighbour above.
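A tiny made-up example of the difference: with symmetric=True each reference-neighbour pair is counted in both directions, so the resulting matrix is symmetric:
# illustrative: asymmetric vs. symmetric GLCM of a 4-sample column
col = np.array([0, 0, 1, 1], dtype=np.uint8).reshape(-1, 1)
print greycomatrix(col, [1], [np.pi/2], levels=2, symmetric=False)[:, :, 0, 0]
print greycomatrix(col, [1], [np.pi/2], levels=2, symmetric=True)[:, :, 0, 0]
# asymmetric: [[1 1], [0 1]]   symmetric: [[2 1], [1 2]]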
First let's calculate symmetric GLCM properties:
In [26]:
glcm_sym_df = pd.DataFrame() # final dataframe
grouped = training_data['Well Name'].unique()
for well in grouped: # for each well
new_dfg = pd.DataFrame() # make a new temporary dataframe
for log in ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']: # for each log
for me in methods: # for each property
# calculate rolling GLCM properties with each window size
            # and also the mean of GLCM properties (all window sizes)
lg = eq_glcm_df[log][eq_glcm_df['Well Name'] == well]
results = np.array([gprops_calc(lg.astype(int), wd, lv = 64, sym = True, prop = me) for wd in sizes])
mean_result = np.mean(results, axis=0)
# write to temporary dataframe
new_dfg[str(log)+ '_GLCM_' + str(me)+'_wsize=' +str(sizes[0])] = results[0]
new_dfg[str(log)+ '_GLCM_' + str(me)+'_wsize=' +str(sizes[1])] = results[1]
new_dfg[str(log)+ '_GLCM_' + str(me)+'_wsize=' +str(sizes[2])] = results[2]
new_dfg[str(log)+ '_GLCM_' + str(me)+'_wsize=' +str(sizes[3])] = results[3]
new_dfg[str(log)+ '_GLCM_' + str(me)+'_wsize=' +str(sizes[4])] = results[4]
new_dfg[str(log)+ '_GLCM_' + str(me)+'_wsize=' +str(sizes[5])] = results[5]
new_dfg[str(log)+ '_GLCM_' + str(me)+'_wsize=' +str(sizes[6])] = results[6]
new_dfg[str(log)+ '_GLCM_' + str(me)+'_wsize=ave'] = mean_result
# append all rows of temporary dataframe to final dataframe
glcm_sym_df = pd.concat([glcm_sym_df , new_dfg])
In [29]:
glcm_sym_df.describe()
Out[29]:
And now let's calculate asymmetric GLCM properties using only the upward neighbour:
In [28]:
glcm_asym_df = pd.DataFrame() # final dataframe
grouped1 = training_data['Well Name'].unique()
for well1 in grouped1: # for each well
new_dfg1 = pd.DataFrame() # make a new temporary dataframe
for log1 in ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']: # for each log
for me in methods: # for each property
# calculate rolling GLCM properties with each window size
            # and also the mean of GLCM properties (all window sizes)
            lg1 = eq_glcm_df[log1][eq_glcm_df['Well Name'] == well1]
results1 = np.array([gprops_calc(lg1.astype(int), wd, lv = 64, sym = False, prop = me) for wd in sizes])
mean_result1 = np.mean(results1, axis=0)
# write to temporary dataframe
new_dfg1[str(log1)+ '_GLCM_' + str(me)+'_asym_wsize=' +str(sizes[0])] = results1[0]
new_dfg1[str(log1)+ '_GLCM_' + str(me)+'_asym_wsize=' +str(sizes[1])] = results1[1]
new_dfg1[str(log1)+ '_GLCM_' + str(me)+'_asym_wsize=' +str(sizes[2])] = results1[2]
new_dfg1[str(log1)+ '_GLCM_' + str(me)+'_asym_wsize=' +str(sizes[3])] = results1[3]
new_dfg1[str(log1)+ '_GLCM_' + str(me)+'_asym_wsize=' +str(sizes[4])] = results1[4]
new_dfg1[str(log1)+ '_GLCM_' + str(me)+'_asym_wsize=' +str(sizes[5])] = results1[5]
new_dfg1[str(log1)+ '_GLCM_' + str(me)+'_asym_wsize=' +str(sizes[6])] = results1[6]
new_dfg1[str(log1)+ '_GLCM_' + str(me)+'_asym_wsize=ave'] = mean_result1
# append all rows of temporary dataframe to final dataframe
glcm_asym_df = pd.concat([glcm_asym_df , new_dfg1])
In [30]:
glcm_asym_df.describe()
Out[30]:
In [31]:
arr_final = (np.concatenate((training_data.values, derivative_df.values, moments_df, glcm_sym_df, glcm_asym_df), axis=1))
print np.shape(arr_final)
cols1 = list(training_data) + list(derivative_df) + list(moments_df) + list(glcm_sym_df) + list(glcm_asym_df)
arr_final_df = pd.DataFrame(arr_final, columns=cols1)
arr_final_df.describe()
#arr_final_df.dtypes
Out[31]:
In [32]:
lll2 = list(training_data)[3:] + list(derivative_df) + list(moments_df) + list(glcm_sym_df) + list(glcm_asym_df)
for l2 in lll2:
arr_final_df[l2] = arr_final_df[l2].astype('float64')
arr_final_df['Formation'] = arr_final_df['Formation'].astype('category')
arr_final_df['Well Name'] = arr_final_df['Well Name'].astype('category')
arr_final_df['NM_M'] = arr_final_df['NM_M'].astype('int64')
In [33]:
arr_final_df.describe()
Out[33]:
In [35]:
# a quick check that corresponding symmetric and asymmetric GLCM properties are indeed different
arr_final_df['PE_GLCM_correlation_asym_wsize=33'] == arr_final_df['PE_GLCM_correlation_wsize=33']
Out[35]:
Run PCA, and look at the significance of the components.
The explained variance shows how much information (variance) can be attributed to each of the principal components, and its cumulative sum can be used to determine the number of components to select:
In [37]:
pca = decomposition.PCA()
scld = arr_final_df.drop(['Well Name', 'Formation'],axis=1)
scaler = preprocessing.StandardScaler().fit(scld)
scld = scaler.transform(scld)
pca.fit(scld)
np.set_printoptions(suppress=True) # so output is not in scientific notation
print np.cumsum(pca.explained_variance_ratio_)[:170]
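As a side sketch (not the feature-reduction approach ultimately used), the number of components needed to reach a chosen threshold can be read directly off this cumulative sum:
# illustrative: smallest number of components explaining, e.g., 99% of the variance
cum_var = np.cumsum(pca.explained_variance_ratio_)
print np.argmax(cum_var >= 0.99) + 1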
In [38]:
fig = plt.figure(figsize=(14,8))
plt.plot(np.arange(1, len(np.cumsum(pca.explained_variance_ratio_))+1, 1)[:170],
np.cumsum(pca.explained_variance_ratio_)[:170])
plt.show()
Out[38]:
From the plot above, it looks like it would take more than 100 PCs for the cumulative explained variance ratio to reach 0.99. We will use another technique to reduce the number of features going into the classification.
In [39]:
arr_final_df.to_csv('engineered_features_validation_set2.csv', sep=',', index=False)
In [ ]: