In [6]:

    
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.pyplot as plt
import itertools
import pandas as pd

from sklearn import decomposition, preprocessing
from skimage.feature import greycomatrix, greycoprops
from skimage import exposure

Feature engineering

<span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">The code and ideas to engineer new features used in this notebook, </span> by <span xmlns:cc="http://creativecommons.org/ns#" property="cc:attributionName">Matteo Niccoli and Mark Dahl, with contributions by Daniel Kittridge,</span> are licensed under a Creative Commons Attribution 4.0 International License.

1 - Clean up and rescale data



In [7]:

    
# import data and filling missing PE values with average

filename = 'facies_vectors.csv'
training_data = pd.read_csv(filename)

training_data['PE'].fillna((training_data['PE'].mean()), inplace=True)
print  np.shape(training_data)
training_data['PE'].fillna((training_data['PE'].mean()), inplace=True)
print  np.shape(training_data)









    



(4149, 11)
(4149, 11)



In [8]:

    
pd.set_option('display.float_format', lambda x: '%.4f' % x)
training_data.describe()

To keep feature importance on a level playing field, we will rescale each WL log before calculating moments. We will use sklearn.preprocessing.StandardScaler.



In [12]:

    
# standardize features to go into moments calculation
feature_vectors = training_data.drop(['Formation', 'Well Name', 'Depth','Facies'], axis=1)
scaler = preprocessing.StandardScaler().fit(feature_vectors)
scaled_features = scaler.transform(feature_vectors)



In [13]:

    
scaled_vectors_df = pd.DataFrame(scaled_features, columns=list(feature_vectors))
scaled_feat_df = pd.concat((training_data[['Depth', 'Well Name', 'Formation', 'Facies']], scaled_vectors_df),1)
scaled_feat_df.head()



In [14]:

    
scaled_feat_df.shape









    Out[14]:





(4149, 11)

2 - Calculate derivatives

The rate of change of a function of series of values is commonly used as a booster for machine learning classifiers. We will calculate the first and second derivatives for each WL log curve in each well.



In [15]:

    
# calculate all 1st and 2nd derivative for all logs, for all wells

derivative_df = pd.DataFrame()             # final dataframe

grouped = training_data['Well Name'].unique()

for well in grouped:                  # for each well     
    new_df = pd.DataFrame()           # make a new temporary dataframe 
   
    for log in ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND' ,'PE']:
        d1 = np.array(np.gradient(scaled_feat_df[log][scaled_feat_df['Well Name'] == well]))
        d2 = np.array(np.gradient(np.gradient(scaled_feat_df[log][scaled_feat_df['Well Name'] == well])))
                                      # write to temporary dataframe 
        new_df[str(log) + '_d1'] = d1
        new_df[str(log) + '_d2'] = d2
                                      # append all rows of temporary dataframe to final dataframe          

    derivative_df = pd.concat([derivative_df, new_df])



In [16]:

    
derivative_df.describe()









    Out[16]:






  
    
      
      GR_d1
      GR_d2
      ILD_log10_d1
      ILD_log10_d2
      DeltaPHI_d1
      DeltaPHI_d2
      PHIND_d1
      PHIND_d2
      PE_d1
      PE_d2
    
  
  
    
      count
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
    
    
      mean
      -0.0002
      0.0019
      0.0014
      -0.0000
      0.0001
      -0.0010
      -0.0015
      0.0000
      0.0010
      0.0008
    
    
      std
      0.3225
      0.2047
      0.1854
      0.1034
      0.3797
      0.2813
      0.3619
      0.2470
      0.2932
      0.2017
    
    
      min
      -4.0549
      -2.9966
      -1.2289
      -1.4060
      -3.2421
      -2.1899
      -3.2775
      -3.1040
      -1.6135
      -1.5822
    
    
      25%
      -0.0949
      -0.0548
      -0.0792
      -0.0257
      -0.1456
      -0.1017
      -0.1017
      -0.0587
      -0.0702
      -0.0531
    
    
      50%
      0.0100
      0.0060
      -0.0020
      0.0020
      0.0000
      0.0047
      0.0013
      0.0063
      0.0000
      0.0000
    
    
      75%
      0.1035
      0.0658
      0.0792
      0.0307
      0.1517
      0.0995
      0.1017
      0.0740
      0.0632
      0.0632
    
    
      max
      3.8355
      1.8359
      1.9808
      1.1200
      2.5691
      2.2041
      4.9075
      2.1961
      2.7819
      1.8968

3 - Create a list of geometrically-expanding windows for rolling features

Facies are interpreted groupings of rocks and commonly composed of a several rock elements, each demonstrating different properties. Therefore, we should expect to see a distribution of WL log responses for each facies. A corollary of this is that attempting to directly solve for a facies with WL log responses at any given depth will be tenuous. Facies require a context; a context provided by the surrounding rock. Likewise, if we are to effectively solve for facies from WL logs, we should provide a context to for each response at a given depth. We can accomplish this with rolling windows.

A rolling window provides a local neighbourhood of values about a central point, which can be stepped through an array of values. The neighbourhood sample size, which is the depth thickness/sampling rate, of the neighbourhood evaluated should relate directly to the thickness of a facies. Because facies are observed with different thicknesses, we will build neighbourhoods to include the thickest observed facies. To keep the number of rolling windows reasonable, we will use a geometric function where the half window length is doubled for each subsequent value.



In [17]:

    
# import file
filename = 'facies_vectors.csv'
data = pd.read_csv(filename)

# eliminate nulls
PE_mask = data['PE'].notnull().values
data = data[PE_mask]

# get facies
y = data['Facies'].values

# get longest facies 
max_len = max(len(list(s)) for (c,s) in itertools.groupby(y))
max_len









    Out[17]:





68



In [18]:

    
# function to create a geometric series of window sizes 
# using powers of 2 up to one just above a reference geological size (longest facies)

def geom_windows(max_sz): 
    """returns a list of square window sizes using powers of two"""
    return list(int(2**(n+1)+1) for n in np.arange(np.ceil(np.log2(max_sz))))



In [19]:

    
# window sizes
sizes = geom_windows(max_len)
sizes









    Out[19]:





[3, 5, 9, 17, 33, 65, 129]

4 - Moments feature generation

The simplest and most fundamental way to numerically describe the shape of a distribution of values is using moments. The first moment, mean $\mu$, characterizes the central tendency of the distribution. The second moment, variance $\sigma^2$, characterizes the spread of the values about the central tendency. The third moment, skewness $\gamma_1$, characterizes the symmetry (or lack thereof) about the central tendency.

We will calculate the first three moments (with one small modification) for each rolling window size at every depth. The small modification is that instead of variance $\sigma^2$, we are calculating standard deviation $\sigma$ because the results of variance $\sigma^2$ produce values with units of the mean squared $\mu^2$. As a result, feature importance of variance is artificially high due to the dimension of the variance values. Standard deviation $\sigma$ has the same dimension as mean $\mu$.

With respect to facies prediction, now, in addition to the raw WL log inputs, we will describe at multiple scales the shapes of the distributions of WL log responses associated with each facies.



In [20]:

    
# Efficient rolling statistics with NumPy
# http://www.rigtorp.se/2011/01/01/rolling-statistics-numpy.html

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)



In [21]:

    
# function to calculate moments using a rolling window

def rollin_moments(arr, w, moment ='mean'):
    """- pad input array by (w-1)/2 samples at the top and bottom
       - apply rolling window function
       - calculate moment: mean (default), var, or skew"""
    mom = []
    arr = np.pad(arr, ((w-1)/2, (w-1)/2), 'edge')  
    if moment == 'std':
        return np.array(np.std(rolling_window(arr, w), 1))
    elif moment == 'skew':
        return np.array(sp.stats.skew(rolling_window(arr, w), 1))
    else:
        return np.array(np.mean(rolling_window(arr, w), 1))



In [22]:

    
moments = ['mean', 'std', 'skew']



In [23]:

    
# calculate all moments for all logs, for all wells

moments_df = pd.DataFrame()             # final dataframe

grouped = training_data['Well Name'].unique()

for well in grouped:                  # for each well     
    new_df = pd.DataFrame()           # make a new temporary dataframe 
   
    for log in ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND' ,'PE']:
        for mo in moments:            # for each moment
                                      # calculate the rolling moments with each window size
                                      # and also the mean of moments (all window sizes)
            results = np.array([rollin_moments(scaled_feat_df[log][scaled_feat_df['Well Name'] == well],
                                               size, moment = mo) for size in sizes])
            mean_result = np.mean(results, axis=0)
                                      # write to temporary dataframe 
            new_df[str(log)+ '_' + str(mo)+'_wsize=' +str(sizes[0])] = results[0]
            new_df[str(log)+ '_' + str(mo)+'_wsize=' +str(sizes[1])] = results[1]
            new_df[str(log)+ '_' + str(mo)+'_wsize=' +str(sizes[2])] = results[2]
            new_df[str(log)+ '_' + str(mo)+'_wsize=' +str(sizes[3])] = results[3]
            new_df[str(log)+ '_' + str(mo)+'_wsize=' +str(sizes[4])] = results[4]
            new_df[str(log)+ '_' + str(mo)+'_wsize=' +str(sizes[5])] = results[5]
            new_df[str(log)+ '_' + str(mo)+'_wsize=' +str(sizes[6])] = results[6]
            new_df[str(log)+ '_' + str(mo)+'_wsize=ave'] = mean_result
                                      # append all rows of temporary dataframe to final dataframe          

    moments_df = pd.concat([moments_df, new_df])



In [24]:

    
moments_df.describe()









    Out[24]:






  
    
      
      GR_mean_wsize=3
      GR_mean_wsize=5
      GR_mean_wsize=9
      GR_mean_wsize=17
      GR_mean_wsize=33
      GR_mean_wsize=65
      GR_mean_wsize=129
      GR_mean_wsize=ave
      GR_std_wsize=3
      GR_std_wsize=5
      ...
      PE_std_wsize=129
      PE_std_wsize=ave
      PE_skew_wsize=3
      PE_skew_wsize=5
      PE_skew_wsize=9
      PE_skew_wsize=17
      PE_skew_wsize=33
      PE_skew_wsize=65
      PE_skew_wsize=129
      PE_skew_wsize=ave
    
  
  
    
      count
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      ...
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
    
    
      mean
      -0.0000
      0.0003
      0.0007
      0.0012
      0.0020
      0.0058
      0.0235
      0.0048
      0.1579
      0.2542
      ...
      0.7163
      0.4516
      0.0210
      0.0378
      0.0458
      0.0658
      0.1263
      0.1739
      0.2244
      0.0993
    
    
      std
      0.9616
      0.9142
      0.8249
      0.6920
      0.5254
      0.3908
      0.3570
      0.5765
      0.2251
      0.3180
      ...
      0.4618
      0.3330
      0.4521
      0.5773
      0.6270
      0.6641
      0.6752
      0.6219
      0.5662
      0.3510
    
    
      min
      -1.7746
      -1.7264
      -1.6490
      -1.5003
      -1.2895
      -1.0583
      -1.0015
      -1.2453
      0.0002
      0.0094
      ...
      0.0000
      0.0000
      -1.0000
      -1.5000
      -2.4749
      -3.3231
      -4.1331
      -2.4207
      -2.8087
      -1.4898
    
    
      25%
      -0.6317
      -0.6244
      -0.5697
      -0.4635
      -0.3208
      -0.2104
      -0.1733
      -0.3910
      0.0476
      0.0850
      ...
      0.3675
      0.1999
      -0.2883
      -0.3079
      -0.2344
      -0.2307
      -0.1231
      -0.0225
      0.0000
      -0.0218
    
    
      50%
      0.0036
      0.0100
      0.0280
      0.0129
      -0.0182
      0.0059
      0.0222
      0.0082
      0.0929
      0.1568
      ...
      0.7977
      0.4429
      0.0000
      0.0000
      0.0000
      0.0000
      0.0000
      0.0462
      0.2203
      0.0205
    
    
      75%
      0.4656
      0.4494
      0.4192
      0.3744
      0.3114
      0.2160
      0.2156
      0.3134
      0.1784
      0.2972
      ...
      1.0570
      0.6720
      0.3818
      0.3632
      0.3688
      0.4000
      0.4877
      0.5521
      0.5927
      0.3244
    
    
      max
      8.7912
      6.9026
      5.3798
      3.6272
      3.1635
      2.7984
      2.6547
      3.6996
      3.3147
      3.9119
      ...
      1.7849
      1.6715
      1.0000
      1.5000
      2.4749
      3.1863
      3.1959
      2.4672
      2.3229
      1.2788
    
  

8 rows × 120 columns

5 - GLCM feature generation

Statistical moments can be said to characterize the composition of a neighbourhood of values. However, we can easily describe two neighbourhoods with identical composition that are distinctly different. For example N1 = [00001111] and N2 = [01010101] have exactly the same mean $\mu$, variance $\sigma^2$, and skewness $\gamma_1$, but, in terms of rocks, might represent different facies. Therefore, in addition to describing the shape of a distribution of values for a facies, we need something to evaluate the ordering of those values. That something is a grey-level coocurrence matrix (GLCM).

A GLCM is a second order statistical method that numerically describes ordering of elements by evaluating the probability of values to be neighbours. Think of the GLCM as a histogram that preserves the ordering of values. For more about the GLCM, see Mryka Hall-Beyer's tutorial and read skimage.feature.greycomatrix documentation. Just as we calculated moments to describe the shape of a histogram, we need to represent the arrangement of values in a GLCM with a single value. Properties that capture different characteristics of a GLCM including contrast, dissimilarity, homogeneity, ASM, energy, and correlation can be calculated with skimage.feature.greycoprops. To keep resulting dimensions equivalent to the moments previously calculated, we will use the properties dissimilarity, energy, and correlation.



In [25]:

    
# function to calculate glcm and greycoprops using a rolling window

def gprops_calc(arr, w, lv, sym = True, prop='dissimilarity'):
    """- make w copies of the input array, roll it up one row at a time
       - calculate glcm on a square window of size w
       - calculate greycoprops from glcm: dissimilarity (default), energy, or correlation
       - repeat until back at row one
       N.B. the input array is padded by (w-1)/2 samples at the top and bottom"""
    diss = []
    itr = len(arr)
    arr = np.pad(arr, ((w-1)/2, (w-1)/2), 'edge')
    s = np.array([arr,]*w,dtype=np.uint8).transpose()
    for _ in np.arange(itr):
        if sym == True:
            glcm = greycomatrix(s[:w,:], [1], [np.pi/2], levels = lv, symmetric = True, normed = True)
        else:
            glcm = greycomatrix(s[:w,:], [1], [np.pi/2], levels = lv, symmetric = False, normed = True)
        if prop == 'correlation':
            ds = greycoprops(glcm, 'correlation')
        elif prop == 'energy':
            ds = greycoprops(glcm, 'energy')
        else:
            ds = greycoprops(glcm, 'dissimilarity')
        diss.append(ds)
        s = np.roll(s[:, :], -w)
    return np.ndarray.flatten(np.array(diss))



In [26]:

    
methods = ['dissimilarity','energy', 'correlation']

Similar to the step preceeding moments calculation, we will rescale the raw WL logs for GLCM property calculation so each resulting property is unaffected by the magnitude of the raw WL log values. skimage.feature.greycomatrix requires uint8 values, so we need an alternative to sklearn.preprocessing.StandardScaler. Unlike with calculating moments, preserving the shape of the histogram is not important to the integrity of a GLCM property. We will use histogram equalization, which flattens a histogram (puts an equal number of values in each bin). To maximize the effectiveness of a GLCM, it is commonly wise to reduce the bit depth from 8 to avoid processing expense and noise caused by empty matrix entries. After some trial and error, we found the 64 bins works nicely. Note that 64 bins results in a 64x64 matrix at every depth for every rolling window size.



In [27]:

    
# functions to equalize histogram of features to go into GLCM calculation
def eqlz(arr, bins):
    return (bins-1) * exposure.equalize_hist(arr)
def eqlz_along_axis(arr, bins):
    return np.apply_along_axis(eqlz, 0, arr, bins)



In [28]:

    
# equalize features
feature_vectors_glcm = training_data.drop(['Formation', 'Well Name', 'Depth','Facies'], axis=1)
eq_vectors_glcm = eqlz_along_axis(feature_vectors_glcm, 64)



In [29]:

    
eq_vectors_glcm_df = pd.DataFrame(eq_vectors_glcm, columns=list(feature_vectors_glcm))
eq_vectors_glcm_df = np.round(eq_vectors_glcm_df).astype(int)
eq_glcm_df = pd.concat((training_data[['Depth', 'Well Name', 'Formation', 'Facies']], eq_vectors_glcm_df),1)
eq_glcm_df.head()

One last consideration for the GLCM is its symmetry. Symmetry in a GLCM refers to a bi-directional evaluation of the reference-neighbour pair. In plain English, if you were to construct a GLCM by hand, you would move through an array in one direction and then in the opposite direction. It is often desirable to do this because this removes the asymmety caused at the edge of a neighbourhood. See Mryka Hall-Beyer's tutorial for a full explanation of this. However, since sedimentary rocks (provided that they are structurally undisturbed) are laid down from bottom to top, we thought in addition to the symmetric GLCM, it would be useful to evaluate the asymmetric GLCM where we look at the neighbour above.

First let's calculate symmetric GLCM properties:



In [30]:

    
glcm_sym_df = pd.DataFrame()        # final dataframe
grouped = training_data['Well Name'].unique()

for well in grouped:                   # for each well   
    new_dfg = pd.DataFrame()           # make a new temporary dataframe 

    for log in ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']:   # for each log
        for me in methods:            # for each property
                                      # calculate rolling GLCM properties with each window size
                                      # and also the mean of moments (all window sizes)
            lg = eq_glcm_df[log][eq_glcm_df['Well Name'] == well]
            results = np.array([gprops_calc(lg.astype(int), wd, lv = 64, sym = True, prop = me) for wd in sizes])
            mean_result = np.mean(results, axis=0)
                                      # write to temporary dataframe 
            new_dfg[str(log)+ '_GLCM_' + str(me)+'_wsize=' +str(sizes[0])] = results[0]
            new_dfg[str(log)+ '_GLCM_' + str(me)+'_wsize=' +str(sizes[1])] = results[1]
            new_dfg[str(log)+ '_GLCM_' + str(me)+'_wsize=' +str(sizes[2])] = results[2]
            new_dfg[str(log)+ '_GLCM_' + str(me)+'_wsize=' +str(sizes[3])] = results[3]
            new_dfg[str(log)+ '_GLCM_' + str(me)+'_wsize=' +str(sizes[4])] = results[4]
            new_dfg[str(log)+ '_GLCM_' + str(me)+'_wsize=' +str(sizes[5])] = results[5]
            new_dfg[str(log)+ '_GLCM_' + str(me)+'_wsize=' +str(sizes[6])] = results[6]
            new_dfg[str(log)+ '_GLCM_' + str(me)+'_wsize=ave'] = mean_result
                                      # append all rows of temporary dataframe to final dataframe 

    glcm_sym_df  = pd.concat([glcm_sym_df , new_dfg])



In [31]:

    
glcm_sym_df.describe()









    Out[31]:






  
    
      
      GR_GLCM_dissimilarity_wsize=3
      GR_GLCM_dissimilarity_wsize=5
      GR_GLCM_dissimilarity_wsize=9
      GR_GLCM_dissimilarity_wsize=17
      GR_GLCM_dissimilarity_wsize=33
      GR_GLCM_dissimilarity_wsize=65
      GR_GLCM_dissimilarity_wsize=129
      GR_GLCM_dissimilarity_wsize=ave
      GR_GLCM_energy_wsize=3
      GR_GLCM_energy_wsize=5
      ...
      PE_GLCM_energy_wsize=129
      PE_GLCM_energy_wsize=ave
      PE_GLCM_correlation_wsize=3
      PE_GLCM_correlation_wsize=5
      PE_GLCM_correlation_wsize=9
      PE_GLCM_correlation_wsize=17
      PE_GLCM_correlation_wsize=33
      PE_GLCM_correlation_wsize=65
      PE_GLCM_correlation_wsize=129
      PE_GLCM_correlation_wsize=ave
    
  
  
    
      count
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      ...
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
    
    
      mean
      3.6655
      3.6587
      3.6478
      3.6262
      3.5938
      3.5214
      3.3838
      3.5853
      0.5572
      0.4111
      ...
      0.3571
      0.4703
      0.0791
      0.4102
      0.6635
      0.8093
      0.8790
      0.9136
      0.9294
      0.6692
    
    
      std
      3.8264
      2.9219
      2.1691
      1.6652
      1.3437
      1.1368
      0.9819
      1.6104
      0.1114
      0.0921
      ...
      0.3527
      0.2879
      0.6409
      0.4530
      0.2956
      0.1837
      0.1229
      0.0819
      0.0553
      0.2077
    
    
      min
      0.0000
      0.0000
      0.1250
      0.4375
      0.5000
      0.9375
      1.1641
      0.7578
      0.5000
      0.3536
      ...
      0.0806
      0.2290
      -1.0000
      -1.0000
      -0.5924
      -0.1399
      -0.0086
      0.1382
      0.5059
      0.0809
    
    
      25%
      1.0000
      1.5000
      1.8750
      2.3750
      2.5938
      2.6875
      2.5703
      2.3382
      0.5000
      0.3536
      ...
      0.1174
      0.2760
      -0.3333
      0.0338
      0.4949
      0.7214
      0.8346
      0.8855
      0.9009
      0.5314
    
    
      50%
      2.5000
      2.7500
      3.2500
      3.5625
      3.5312
      3.4062
      3.3750
      3.2790
      0.5000
      0.3953
      ...
      0.1508
      0.3255
      -0.1111
      0.4667
      0.7241
      0.8514
      0.9014
      0.9221
      0.9300
      0.6392
    
    
      75%
      5.0000
      5.0000
      5.1250
      4.6250
      4.4375
      4.2969
      4.0625
      4.4799
      0.6124
      0.4330
      ...
      0.4579
      0.4985
      1.0000
      0.6800
      0.8770
      0.9434
      0.9635
      0.9660
      0.9623
      0.7444
    
    
      max
      30.5000
      20.0000
      13.5000
      9.9375
      8.5000
      7.6250
      6.2188
      10.6150
      1.0000
      1.0000
      ...
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
    
  

8 rows × 120 columns

And now let's calculate asymmetric GLCM properties using only the upward neighbour:



In [32]:

    
glcm_asym_df = pd.DataFrame()        # final dataframe
grouped1 = training_data['Well Name'].unique()

for well1 in grouped1:                   # for each well   
    new_dfg1 = pd.DataFrame()           # make a new temporary dataframe 

    for log1 in ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']:   # for each log
        for me in methods:            # for each property
                                      # calculate rolling GLCM properties with each window size
                                      # and also the mean of moments (all window sizes)
            lg1 = eq_glcm_df[log][eq_glcm_df['Well Name'] == well1]
            results1 = np.array([gprops_calc(lg1.astype(int), wd, lv = 64, sym = False, prop = me) for wd in sizes])
            mean_result1 = np.mean(results1, axis=0)
            
                                      # write to temporary dataframe 
            new_dfg1[str(log1)+ '_GLCM_' + str(me)+'_asym_wsize=' +str(sizes[0])] = results1[0]
            new_dfg1[str(log1)+ '_GLCM_' + str(me)+'_asym_wsize=' +str(sizes[1])] = results1[1]
            new_dfg1[str(log1)+ '_GLCM_' + str(me)+'_asym_wsize=' +str(sizes[2])] = results1[2]
            new_dfg1[str(log1)+ '_GLCM_' + str(me)+'_asym_wsize=' +str(sizes[3])] = results1[3]
            new_dfg1[str(log1)+ '_GLCM_' + str(me)+'_asym_wsize=' +str(sizes[4])] = results1[4]
            new_dfg1[str(log1)+ '_GLCM_' + str(me)+'_asym_wsize=' +str(sizes[5])] = results1[5]
            new_dfg1[str(log1)+ '_GLCM_' + str(me)+'_asym_wsize=' +str(sizes[6])] = results1[6]
            new_dfg1[str(log1)+ '_GLCM_' + str(me)+'_asym_wsize=ave'] = mean_result1
                                      # append all rows of temporary dataframe to final dataframe 

    glcm_asym_df  = pd.concat([glcm_asym_df , new_dfg1])



In [33]:

    
glcm_asym_df.describe()









    Out[33]:






  
    
      
      GR_GLCM_dissimilarity_asym_wsize=3
      GR_GLCM_dissimilarity_asym_wsize=5
      GR_GLCM_dissimilarity_asym_wsize=9
      GR_GLCM_dissimilarity_asym_wsize=17
      GR_GLCM_dissimilarity_asym_wsize=33
      GR_GLCM_dissimilarity_asym_wsize=65
      GR_GLCM_dissimilarity_asym_wsize=129
      GR_GLCM_dissimilarity_asym_wsize=ave
      GR_GLCM_energy_asym_wsize=3
      GR_GLCM_energy_asym_wsize=5
      ...
      PE_GLCM_energy_asym_wsize=129
      PE_GLCM_energy_asym_wsize=ave
      PE_GLCM_correlation_asym_wsize=3
      PE_GLCM_correlation_asym_wsize=5
      PE_GLCM_correlation_asym_wsize=9
      PE_GLCM_correlation_asym_wsize=17
      PE_GLCM_correlation_asym_wsize=33
      PE_GLCM_correlation_asym_wsize=65
      PE_GLCM_correlation_asym_wsize=129
      PE_GLCM_correlation_asym_wsize=ave
    
  
  
    
      count
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      ...
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
    
    
      mean
      2.9636
      2.9576
      2.9468
      2.9312
      2.9064
      2.8641
      2.7776
      2.9068
      0.7939
      0.6432
      ...
      0.3627
      0.5089
      0.7599
      0.6033
      0.7199
      0.8247
      0.8825
      0.9141
      0.9295
      0.8048
    
    
      std
      4.5490
      3.6087
      2.8980
      2.4526
      2.1418
      1.9531
      1.8213
      2.3654
      0.1338
      0.2072
      ...
      0.3491
      0.2649
      0.6501
      0.4803
      0.2880
      0.1768
      0.1206
      0.0816
      0.0552
      0.1904
    
    
      min
      0.0000
      0.0000
      0.0000
      0.0000
      0.0000
      0.0000
      0.0000
      0.0000
      0.7071
      0.5000
      ...
      0.0963
      0.3162
      -1.0000
      -1.0000
      -0.5855
      -0.3742
      0.0065
      0.1725
      0.5352
      0.0577
    
    
      25%
      0.0000
      0.2500
      0.5000
      0.6250
      0.8750
      1.2031
      1.5000
      1.0592
      0.7071
      0.5000
      ...
      0.1269
      0.3357
      1.0000
      0.3333
      0.5636
      0.7441
      0.8393
      0.8861
      0.9009
      0.7163
    
    
      50%
      1.5000
      1.5000
      2.0000
      2.7500
      3.0000
      3.0781
      3.0234
      2.6440
      0.7071
      0.5000
      ...
      0.1559
      0.3661
      1.0000
      0.8121
      0.8121
      0.8654
      0.9041
      0.9224
      0.9300
      0.8567
    
    
      75%
      3.5000
      4.2500
      4.8750
      4.6250
      4.6250
      4.4219
      4.2266
      4.3438
      1.0000
      0.7906
      ...
      0.4595
      0.5179
      1.0000
      1.0000
      0.9527
      0.9575
      0.9667
      0.9670
      0.9623
      0.9467
    
    
      max
      32.0000
      22.7500
      15.0000
      10.8750
      9.0938
      7.0781
      6.2891
      12.8013
      1.0000
      1.0000
      ...
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
    
  

8 rows × 120 columns

6 - Concatenate results with input into a single numpy array, then make it into final dataframe



In [34]:

    
arr_final = (np.concatenate((training_data.values, derivative_df.values, moments_df, glcm_sym_df, glcm_asym_df), axis=1))
print np.shape(arr_final)
cols1 = list(training_data) + list(derivative_df) + list(moments_df) + list(glcm_sym_df) + list(glcm_asym_df)
arr_final_df = pd.DataFrame(arr_final, columns=cols1)
arr_final_df.describe()
#arr_final_df.dtypes









    



(4149, 381)






    Out[34]:






  
    
      
      Facies
      Formation
      Well Name
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      ...
      PE_GLCM_energy_asym_wsize=129
      PE_GLCM_energy_asym_wsize=ave
      PE_GLCM_correlation_asym_wsize=3
      PE_GLCM_correlation_asym_wsize=5
      PE_GLCM_correlation_asym_wsize=9
      PE_GLCM_correlation_asym_wsize=17
      PE_GLCM_correlation_asym_wsize=33
      PE_GLCM_correlation_asym_wsize=65
      PE_GLCM_correlation_asym_wsize=129
      PE_GLCM_correlation_asym_wsize=ave
    
  
  
    
      count
      4149
      4149
      4149
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149
      ...
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
    
    
      unique
      9
      14
      10
      1130.0000
      3561.0000
      1387.0000
      1407.0000
      2616.0000
      996.0000
      2
      ...
      783.0000
      3055.0000
      2.0000
      1694.0000
      2960.0000
      3170.0000
      3193.0000
      3207.0000
      3215.0000
      3245.0000
    
    
      top
      2
      C LM
      CROSS H CATTLE
      2944.0000
      36.5600
      0.5300
      3.6000
      13.0500
      3.7250
      2
      ...
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
    
    
      freq
      940
      662
      501
      8.0000
      6.0000
      27.0000
      35.0000
      14.0000
      917.0000
      2151
      ...
      905.0000
      905.0000
      3651.0000
      1118.0000
      920.0000
      905.0000
      905.0000
      905.0000
      905.0000
      905.0000
    
  

4 rows × 381 columns



In [35]:

    
lll2 = list(training_data)[3:] + list(derivative_df) + list(moments_df) + list(glcm_sym_df) + list(glcm_asym_df)
for l2 in lll2:
    arr_final_df[l2] = arr_final_df[l2].astype('float64')
    
arr_final_df['Facies'] = arr_final_df['Facies'].astype('int64')
arr_final_df['Formation'] = arr_final_df['Formation'].astype('category')
arr_final_df['Well Name'] = arr_final_df['Well Name'].astype('category')
arr_final_df['NM_M'] = arr_final_df['NM_M'].astype('int64')



In [36]:

    
arr_final_df.describe()









    Out[36]:






  
    
      
      Facies
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
      GR_d1
      ...
      PE_GLCM_energy_asym_wsize=129
      PE_GLCM_energy_asym_wsize=ave
      PE_GLCM_correlation_asym_wsize=3
      PE_GLCM_correlation_asym_wsize=5
      PE_GLCM_correlation_asym_wsize=9
      PE_GLCM_correlation_asym_wsize=17
      PE_GLCM_correlation_asym_wsize=33
      PE_GLCM_correlation_asym_wsize=65
      PE_GLCM_correlation_asym_wsize=129
      PE_GLCM_correlation_asym_wsize=ave
    
  
  
    
      count
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      ...
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
      4149.0000
    
    
      mean
      4.5033
      2906.8674
      64.9340
      0.6596
      4.4025
      13.2011
      3.7250
      1.5184
      0.5219
      -0.0002
      ...
      0.3627
      0.5089
      0.7599
      0.6033
      0.7199
      0.8247
      0.8825
      0.9141
      0.9295
      0.8048
    
    
      std
      2.4743
      133.3002
      30.3025
      0.2527
      5.2749
      7.1328
      0.7909
      0.4997
      0.2866
      0.3225
      ...
      0.3491
      0.2649
      0.6501
      0.4803
      0.2880
      0.1768
      0.1206
      0.0816
      0.0552
      0.1904
    
    
      min
      1.0000
      2573.5000
      10.1490
      -0.0259
      -21.8320
      0.5500
      0.2000
      1.0000
      0.0000
      -4.0549
      ...
      0.0963
      0.3162
      -1.0000
      -1.0000
      -0.5855
      -0.3742
      0.0065
      0.1725
      0.5352
      0.0577
    
    
      25%
      2.0000
      2821.5000
      44.7300
      0.4980
      1.6000
      8.5000
      3.2000
      1.0000
      0.2770
      -0.0949
      ...
      0.1269
      0.3357
      1.0000
      0.3333
      0.5636
      0.7441
      0.8393
      0.8861
      0.9009
      0.7163
    
    
      50%
      4.0000
      2932.5000
      64.9900
      0.6390
      4.3000
      12.0200
      3.7250
      2.0000
      0.5280
      0.0100
      ...
      0.1559
      0.3661
      1.0000
      0.8121
      0.8121
      0.8654
      0.9041
      0.9224
      0.9300
      0.8567
    
    
      75%
      6.0000
      3007.0000
      79.4380
      0.8220
      7.5000
      16.0500
      4.0000
      2.0000
      0.7690
      0.1035
      ...
      0.4595
      0.5179
      1.0000
      1.0000
      0.9527
      0.9575
      0.9667
      0.9670
      0.9623
      0.9467
    
    
      max
      9.0000
      3138.0000
      361.1500
      1.8000
      19.3120
      84.4000
      8.0940
      2.0000
      1.0000
      3.8355
      ...
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
      1.0000
    
  

8 rows × 379 columns



In [37]:

    
# just a quick test
arr_final_df['PE_GLCM_correlation_asym_wsize=33'] == arr_final_df['PE_GLCM_correlation_wsize=33']









    Out[37]:





0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
4119    False
4120    False
4121    False
4122    False
4123    False
4124    False
4125    False
4126    False
4127    False
4128    False
4129    False
4130    False
4131    False
4132    False
4133    False
4134    False
4135    False
4136    False
4137    False
4138    False
4139    False
4140    False
4141    False
4142    False
4143     True
4144    False
4145    False
4146    False
4147    False
4148    False
dtype: bool

7 - PCA dimensionality analysis

Run PCA, and look at the significance of the components.

The explained variance shows how much information (variance) can be attributed to each of the principal components, and its cumulative sum can be used to determine the number of components to select:



In [38]:

    
pca = decomposition.PCA()
scld = arr_final_df.drop(['Well Name', 'Formation', 'Facies'],axis=1)
scaler = preprocessing.StandardScaler().fit(scld)
scld = scaler.transform(scld)
pca.fit(scld)

np.set_printoptions(suppress=True) # so output is not in scientific notation
print np.cumsum(pca.explained_variance_ratio_)[:170]









    



[ 0.26891999  0.347724    0.40239399  0.44299017  0.47978752  0.51283051
  0.53856363  0.56246759  0.58310323  0.60189869  0.61865306  0.63421904
  0.64869259  0.66144743  0.67399898  0.68489675  0.69521286  0.70513935
  0.71473596  0.72406989  0.73324915  0.74126384  0.74908675  0.75664708
  0.76389835  0.77066015  0.77738794  0.78368575  0.78989718  0.79602748
  0.80183893  0.80757984  0.81283002  0.81795652  0.82286045  0.82756341
  0.8321701   0.83670903  0.84099655  0.84510988  0.84919064  0.85306341
  0.85686058  0.86052807  0.86406564  0.86733779  0.8705615   0.87366228
  0.87668848  0.87961783  0.88239448  0.88515589  0.88781564  0.89040241
  0.89295104  0.89545783  0.89790335  0.90028186  0.90261332  0.90487944
  0.90707728  0.90919977  0.91128496  0.91336043  0.91536466  0.91728106
  0.919189    0.92107664  0.92293236  0.92472061  0.92648888  0.92822126
  0.92993215  0.93158708  0.93320072  0.93474563  0.93627746  0.93779927
  0.93929135  0.94075223  0.9421688   0.9435784   0.9449775   0.94634678
  0.94767174  0.94897573  0.95023048  0.95145889  0.95267125  0.95387656
  0.9550268   0.95615941  0.95728665  0.95840377  0.95949028  0.96054794
  0.9615944   0.96262168  0.96359514  0.96455927  0.96550185  0.96642221
  0.96733346  0.96823413  0.96910909  0.96996699  0.97079757  0.97160719
  0.97241521  0.97320608  0.97398491  0.97475905  0.97551931  0.97625819
  0.97698425  0.97766534  0.97834181  0.97899576  0.97963812  0.98027397
  0.9808737   0.98146388  0.9820479   0.98258937  0.98312789  0.9836588
  0.98416979  0.98465825  0.98513543  0.98561182  0.98605491  0.9864904
  0.98691261  0.98731916  0.98770533  0.98809116  0.98846767  0.98883263
  0.98918792  0.98953821  0.98987773  0.99021174  0.99053762  0.99085333
  0.99116327  0.99147024  0.99176123  0.99204102  0.99231616  0.99258884
  0.99284956  0.99310743  0.99335337  0.99359548  0.99382435  0.9940462
  0.99426671  0.99448082  0.99469096  0.99489596  0.99509559  0.99528956
  0.99547565  0.99565682  0.99583202  0.9960024   0.99616928  0.99633191
  0.99648657  0.9966376 ]



In [39]:

    
fig = plt.figure(figsize=(14,8))
plt.plot(np.arange(1, len(np.cumsum(pca.explained_variance_ratio_))+1, 1)[:170], 
         np.cumsum(pca.explained_variance_ratio_)[:170])

plt.show









    Out[39]:





<function matplotlib.pyplot.show>

It looks like from the plot above that it would take more than 100 PCs for the cumulative explained variance ratio reache 0.99. We will use another technique to reduce the number of features to go into the classification.



In [40]:

    
arr_final_df.to_csv('engineered_features.csv', sep=',',  index=False)



In [ ]:

	Depth	Well Name	Formation	Facies	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
0	2793.0000	SHRIMPLIN	A1 SH	3	0.4131	0.0175	1.0423	-0.1803	1.1064	-1.0376	1.6683
1	2793.5000	SHRIMPLIN	A1 SH	3	0.4398	0.0057	1.8576	-0.0892	0.4742	-1.0376	1.5950
2	2794.0000	SHRIMPLIN	A1 SH	3	0.4659	-0.0062	1.9714	-0.0212	-0.1581	-1.0376	1.5183
3	2794.5000	SHRIMPLIN	A1 SH	3	0.6986	-0.0181	1.8007	-0.0121	-0.2845	-1.0376	1.4450
4	2795.0000	SHRIMPLIN	A1 SH	3	0.3184	-0.0497	1.7249	0.0139	-0.4110	-1.0376	1.3717

	Facies	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
count	4149.0000	4149.0000	4149.0000	4149.0000	4149.0000	4149.0000	4149.0000	4149.0000	4149.0000
mean	4.5033	2906.8674	64.9340	0.6596	4.4025	13.2011	3.7250	1.5184	0.5219
std	2.4743	133.3002	30.3025	0.2527	5.2749	7.1328	0.7909	0.4997	0.2866
min	1.0000	2573.5000	10.1490	-0.0259	-21.8320	0.5500	0.2000	1.0000	0.0000
25%	2.0000	2821.5000	44.7300	0.4980	1.6000	8.5000	3.2000	1.0000	0.2770
50%	4.0000	2932.5000	64.9900	0.6390	4.3000	12.0200	3.7250	2.0000	0.5280
75%	6.0000	3007.0000	79.4380	0.8220	7.5000	16.0500	4.0000	2.0000	0.7690
max	9.0000	3138.0000	361.1500	1.8000	19.3120	84.4000	8.0940	2.0000	1.0000

	Facies	Formation	Well Name	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	...	PE_GLCM_energy_asym_wsize=129	PE_GLCM_energy_asym_wsize=ave	PE_GLCM_correlation_asym_wsize=3	PE_GLCM_correlation_asym_wsize=5	PE_GLCM_correlation_asym_wsize=9	PE_GLCM_correlation_asym_wsize=17	PE_GLCM_correlation_asym_wsize=33	PE_GLCM_correlation_asym_wsize=65	PE_GLCM_correlation_asym_wsize=129	PE_GLCM_correlation_asym_wsize=ave
count	4149	4149	4149	4149.0000	4149.0000	4149.0000	4149.0000	4149.0000	4149.0000	4149	...	4149.0000	4149.0000	4149.0000	4149.0000	4149.0000	4149.0000	4149.0000	4149.0000	4149.0000	4149.0000
unique	9	14	10	1130.0000	3561.0000	1387.0000	1407.0000	2616.0000	996.0000	2	...	783.0000	3055.0000	2.0000	1694.0000	2960.0000	3170.0000	3193.0000	3207.0000	3215.0000	3245.0000
top	2	C LM	CROSS H CATTLE	2944.0000	36.5600	0.5300	3.6000	13.0500	3.7250	2	...	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000	1.0000
freq	940	662	501	8.0000	6.0000	27.0000	35.0000	14.0000	917.0000	2151	...	905.0000	905.0000	3651.0000	1118.0000	920.0000	905.0000	905.0000	905.0000	905.0000	905.0000