1.0 - Facies Classification Using RandomForestClassifier.

Chris Esprey - https://www.linkedin.com/in/christopher-esprey-beng-8aab1078?trk=nav_responsive_tab_profile

I generated two main feature types:

  • The absolute difference between every pair of features within each sample (e.g. |GR - PE|).
  • The difference between each sample and the per-facies mean and standard deviation of each feature.

I then threw this at a RandomForestClassifier.
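To make the two feature types concrete, here is a minimal sketch on a single made-up log sample (all numbers are illustrative only, not from the contest data):

```python
import pandas as pd

# Toy sample with the five wireline measurements used throughout the notebook.
sample = pd.Series({'GR': 66.3, 'ILD_log10': 0.64, 'DeltaPHI': 3.6,
                    'PHIND': 13.0, 'PE': 3.6})

# Feature type 1: absolute difference between every ordered pair of features.
pairwise = {'diff' + a + b: abs(sample[a] - sample[b])
            for a in sample.index for b in sample.index if a != b}

# Feature type 2: distance of a measurement from a per-facies statistic
# (facies_mean_GR is a made-up stand-in for one facies' mean GR).
facies_mean_GR = 70.0
mean_diff_GR = sample['GR'] - facies_mean_GR
```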

Possible future improvements:

  • Perform univariate feature selection to home in on the best features
  • Try out other classifiers, e.g. gradient boosting, SVM, etc.
  • Use an ensemble of algorithms for classification
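As a sketch of the first improvement, scikit-learn's SelectKBest can rank features by a univariate score before training; the matrix, labels, and k below are synthetic placeholders, not the notebook's data:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the scaled feature matrix and facies labels.
rng = np.random.RandomState(0)
X = rng.randn(200, 30)
y = rng.randint(1, 10, size=200)

# Keep the k features with the highest ANOVA F-score against the labels.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
```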

In [1]:
%matplotlib notebook
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from classification_utilities import display_cm, display_adj_cm



In [2]:
filename = 'training_data.csv'
training_data = pd.read_csv(filename)

2.0 - Feature Generation


In [3]:
## Create a difference vector for each feature e.g. x1-x2, x1-x3... x2-x3...

# Put the features in a fixed column order, Depth first.

feature_vectors = training_data.drop(['Formation', 'Well Name','Facies'], axis=1)
feature_vectors = feature_vectors[['Depth','GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']]

def difference_vector(feature_vectors):
    length = len(feature_vectors['Depth'])
    df_temp = np.zeros((25, length))
                          
    for i in range(length):
                       
        vector_i = feature_vectors.iloc[i,:]
        vector_i = vector_i[['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']]
        for j, value_j in enumerate(vector_i):
            for k, value_k in enumerate(vector_i): 
                differ_j_k = value_j - value_k          
                df_temp[5*j+k, i] = np.abs(differ_j_k)
                
    return df_temp

def diff_vec2frame(feature_vectors, df_temp):
    
    heads = feature_vectors.columns[1::]
    for i in range(0,5):
        string_i = heads[i]
        for j in range(0,5):
            string_j = heads[j]
            col_head = 'diff'+string_i+string_j
            
            df = pd.Series(df_temp[5*i+j, :])
            feature_vectors[col_head] = df
    return feature_vectors
            
df_diff = difference_vector(feature_vectors)    
feature_vectors = diff_vec2frame(feature_vectors, df_diff)

# Drop duplicated columns and column of zeros
feature_vectors = feature_vectors.T.drop_duplicates().T   
feature_vectors.drop('diffGRGR', axis = 1, inplace = True)
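The triple loop in difference_vector can also be expressed with NumPy broadcasting; this is an equivalent sketch (not the notebook's code) on a small made-up frame, producing the same |x_j - x_k| values in the same 5*j+k ordering per sample:

```python
import numpy as np
import pandas as pd

# Small made-up frame with the same five log columns.
logs = pd.DataFrame({'GR': [66.3, 70.1], 'ILD_log10': [0.64, 0.58],
                     'DeltaPHI': [3.6, 4.1], 'PHIND': [13.0, 12.5],
                     'PE': [3.6, 3.8]})

X = logs.values                                # shape (n_samples, 5)
# Broadcast to (n_samples, 5, 5): entry [i, j, k] = |X[i, j] - X[i, k]|
diffs = np.abs(X[:, :, None] - X[:, None, :])
# Flatten to 25 columns per sample, matching the 5*j+k layout of df_temp.
flat = diffs.reshape(len(logs), -1)
```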

In [4]:
# Add Facies column back into features vector

feature_vectors['Facies'] = training_data['Facies']

# Group by facies and take statistics (mean, std) of each facies, then take the
# difference of each sample from those per-facies statistics.

def facies_stats(feature_vectors):
    facies_labels = np.sort(feature_vectors['Facies'].unique())
    frame_mean = pd.DataFrame()
    frame_std = pd.DataFrame()
    for i, value in enumerate(facies_labels):
        facies_subframe = feature_vectors[feature_vectors['Facies']==value]
        subframe_mean = facies_subframe.mean()
        subframe_std = facies_subframe.std()
        
        frame_mean[str(value)] = subframe_mean
        frame_std[str(value)] = subframe_std
    
    return frame_mean.T, frame_std.T

def feature_stat_diff(feature_vectors, frame_mean, frame_std):
    
    feature_vec_origin = feature_vectors[['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']]
    
    for i, column in enumerate(feature_vec_origin):
        
        feature_column = feature_vec_origin[column]
        stat_column_mean = frame_mean[column]
        stat_column_std = frame_std[column]
        
        for j in range(0, 9):
            
            # Use positional access: the stats frames are indexed by facies-label strings.
            stat_column_mean_facie = stat_column_mean.iloc[j]
            stat_column_std_facie = stat_column_std.iloc[j]
            
            feature_vectors[column + '_mean_diff_facies' + str(j)] = feature_column-stat_column_mean_facie
            feature_vectors[column + '_std_diff_facies' + str(j)] = feature_column-stat_column_std_facie
    return feature_vectors
             
frame_mean, frame_std = facies_stats(feature_vectors)  
feature_vectors = feature_stat_diff(feature_vectors, frame_mean, frame_std)
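For reference, the per-facies mean and std that facies_stats builds column by column can also be obtained with a single groupby; a minimal sketch on made-up data:

```python
import pandas as pd

# Toy frame: one log curve, two facies classes.
df = pd.DataFrame({'Facies': [1, 1, 2, 2],
                   'GR': [60.0, 70.0, 80.0, 90.0]})

# One row per facies, analogous to frame_mean / frame_std above.
facies_mean = df.groupby('Facies').mean()
facies_std = df.groupby('Facies').std()
```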

3.0 - Generate plots of each feature


In [5]:
# A = feature_vectors.sort_values(by='Facies')
# A.reset_index(drop=True).plot(subplots=True, style='b', figsize = [12, 400])

4.0 - Train model using RandomForestClassifier


In [6]:
df = feature_vectors
predictors = feature_vectors.columns
predictors = list(predictors.drop('Facies'))
correct_facies_labels = df['Facies'].values
# Scale features
df = df[predictors]

scaler = preprocessing.StandardScaler().fit(df)
scaled_features = scaler.transform(df)

# Train test split:

X_train, X_test, y_train, y_test = train_test_split(scaled_features,  correct_facies_labels, test_size=0.2, random_state=0)
alg = RandomForestClassifier(random_state=1, n_estimators=200, min_samples_split=8, min_samples_leaf=3, max_features= None)
alg.fit(X_train, y_train)

predicted_random_forest = alg.predict(X_test)
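A single 80/20 split can give an optimistic estimate; as a sketch, k-fold cross-validation gives a more stable accuracy figure. The hyperparameters are copied from the cell above, but the feature matrix and labels here are synthetic stand-ins:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the scaled feature matrix and facies labels.
rng = np.random.RandomState(0)
X = rng.randn(150, 20)
y = rng.randint(1, 10, size=150)

clf = RandomForestClassifier(random_state=1, n_estimators=200,
                             min_samples_split=8, min_samples_leaf=3,
                             max_features=None)

# One accuracy score per fold; their mean estimates generalisation accuracy.
scores = cross_val_score(clf, X, y, cv=5)
```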

In [7]:
facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS',
                 'WS', 'D','PS', 'BS']
result = predicted_random_forest
conf = confusion_matrix(y_test, result)
display_cm(conf, facies_labels, hide_zeros=True, display_metrics = True)

def accuracy(conf):
    total_correct = 0.
    nb_classes = conf.shape[0]
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
    acc = total_correct/sum(sum(conf))
    return acc

print(accuracy(conf))

adjacent_facies = np.array([[1], [0,2], [1], [4], [3,5], [4,6,7], [5,7], [5,6,8], [6,7]])

def accuracy_adjacent(conf, adjacent_facies):
    nb_classes = conf.shape[0]
    total_correct = 0.
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
        for j in adjacent_facies[i]:
            total_correct += conf[i][j]
    return total_correct / sum(sum(conf))

print(accuracy_adjacent(conf, adjacent_facies))


     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS    31     7     3                 1                      42
     CSiS     6   125    21                 2           1         155
     FSiS     1    24    88           1     2     1     2         119
     SiSh           5     2    19     3     4           3          36
       MS           4     2          20     4           3          33
       WS           5           3     5    76     1    14         104
        D           1                       1     9     4     1    16
       PS           5     1           2    16     2    76     4   106
       BS                             1     2     2     5    26    36

Precision  0.82  0.71  0.75  0.86  0.62  0.70  0.60  0.70  0.84  0.73
   Recall  0.74  0.81  0.74  0.53  0.61  0.73  0.56  0.72  0.72  0.73
       F1  0.78  0.76  0.75  0.66  0.62  0.72  0.58  0.71  0.78  0.73
0.726429675425
0.910355486862
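As a quick sanity check of the two accuracy helpers, here is an equivalent vectorized restatement (a sketch, not the notebook's code) applied to a made-up 2x2 confusion matrix in which each class is adjacent to the other:

```python
import numpy as np

conf = np.array([[8, 2],
                 [1, 9]])

def accuracy(conf):
    # Fraction of predictions on the diagonal (exactly correct).
    return np.trace(conf) / conf.sum()

# Each class counts the other as adjacent, so every prediction is "close enough".
adjacent = [[1], [0]]

def accuracy_adjacent(conf, adjacent):
    total = np.trace(conf)
    for i, adj in enumerate(adjacent):
        total += sum(conf[i][j] for j in adj)
    return total / conf.sum()
```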

5.0 - Predict on test data


In [8]:
# read in Test data

filename = 'validation_data_nofacies.csv'
test_data = pd.read_csv(filename)

In [10]:
# Reproduce feature generation

feature_vectors_test = test_data.drop(['Formation', 'Well Name'], axis=1)
feature_vectors_test = feature_vectors_test[['Depth','GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']]

df_diff_test = difference_vector(feature_vectors_test)    
feature_vectors_test = diff_vec2frame(feature_vectors_test, df_diff_test)

# Drop duplicated columns and column of zeros

feature_vectors_test = feature_vectors_test.T.drop_duplicates().T   
feature_vectors_test.drop('diffGRGR', axis = 1, inplace = True)

# Create statistical feature differences using the previously calculated mean and std values from the training data.

feature_vectors_test = feature_stat_diff(feature_vectors_test, frame_mean, frame_std)
feature_vectors_test = feature_vectors_test[predictors]
# Reuse the scaler fitted on the training data (do not refit on the test set).
scaled_features = scaler.transform(feature_vectors_test)

predicted_random_forest = alg.predict(scaled_features)

In [14]:
predicted_random_forest
test_data['Facies'] = predicted_random_forest
test_data.to_csv('test_data_prediction_CE.csv')

In [ ]: