Facies classification using Random Forest

Contest entry by <a href=\"https://geolern.github.io/index.html#\">geoLEARN</a>:

<a href=\"https://github.com/mablou\">Martin Blouin</a>, <a href=\"https://github.com/lperozzi\">Lorenzo Perozzi</a> and <a href=\"https://github.com/Antoine-Cate\">Antoine Caté</a>

in collaboration with <a href=\"http://ete.inrs.ca/erwan-gloaguen\">Erwan Gloaguen</a>

Original contest notebook by Brendon Hall, Enthought

In this notebook we will train a machine learning algorithm to predict facies from well log data. The dataset comes from a class exercise from The University of Kansas on Neural Networks and Fuzzy Systems. This exercise is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more info on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).

The dataset consists of log data from nine wells that have been labeled with a facies type based on observation of core. We will use this log data to train a Random Forest model to classify facies types.

Exploring the dataset

First, we import and examine the dataset used to train the classifier.



In [4]:

    
###### Importing all used packages
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from mpl_toolkits.axes_grid1 import make_axes_locatable
import seaborn as sns

from pandas import set_option
# set_option("display.max_rows", 10)
pd.options.mode.chained_assignment = None

###### Import packages needed for the make_vars functions
import Feature_Engineering as FE

##### import stuff from scikit learn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score,LeavePGroupsOut, LeaveOneGroupOut, cross_val_predict
from sklearn.metrics import confusion_matrix, make_scorer, f1_score, accuracy_score, recall_score, precision_score

filename = '../facies_vectors.csv'
training_data = pd.read_csv(filename)
training_data.head()



In [5]:

    
training_data.describe()









    Out[5]:






  
    
      
      Facies
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      count
      4149.000000
      4149.000000
      4149.000000
      4149.000000
      4149.000000
      4149.000000
      3232.000000
      4149.000000
      4149.000000
    
    
      mean
      4.503254
      2906.867438
      64.933985
      0.659566
      4.402484
      13.201066
      3.725014
      1.518438
      0.521852
    
    
      std
      2.474324
      133.300164
      30.302530
      0.252703
      5.274947
      7.132846
      0.896152
      0.499720
      0.286644
    
    
      min
      1.000000
      2573.500000
      10.149000
      -0.025949
      -21.832000
      0.550000
      0.200000
      1.000000
      0.000000
    
    
      25%
      2.000000
      2821.500000
      44.730000
      0.498000
      1.600000
      8.500000
      3.100000
      1.000000
      0.277000
    
    
      50%
      4.000000
      2932.500000
      64.990000
      0.639000
      4.300000
      12.020000
      3.551500
      2.000000
      0.528000
    
    
      75%
      6.000000
      3007.000000
      79.438000
      0.822000
      7.500000
      16.050000
      4.300000
      2.000000
      0.769000
    
    
      max
      9.000000
      3138.000000
      361.150000
      1.800000
      19.312000
      84.400000
      8.094000
      2.000000
      1.000000

A complete description of the dataset is given in the Original contest notebook by Brendon Hall, Enthought. A total of four measured rock properties and two interpreted geological properties are given as raw predictor variables for the prediction of the "Facies" class.

Feature engineering

As stated in our previous submission, we believe that feature engineering has a high potential for increasing classification success. A strategy for building new variables is explained below.

The dataset is distributed along a series of drillholes intersecting a stratigraphic sequence. Sedimentary facies tend to be deposited in sequences that reflect the evolution of the paleo-environment (variations in water depth, water temperature, biological activity, currents strenght, detrital input, ...). Each facies represents a specific depositional environment and is in contact with facies that represent a progressive transition to an other environment. Thus, there is a relationship between neighbouring samples, and the distribution of the data along drillholes can be as important as data values for predicting facies.

A series of new variables (features) are calculated and tested below to help represent the relationship of neighbouring samples and the overall texture of the data along drillholes. These variables are:

detail and approximation coeficients at various levels of two wavelet transforms (using two types of Daubechies wavelets);
measures of the local entropy with variable observation windows;
measures of the local gradient with variable observation windows;
rolling statistical calculations (i.e., mean, standard deviation, min and max) with variable observation windows;
ratios between marine and non-marine lithofacies with different observation windows;
distances from the nearest marine or non-marine occurence uphole and downhole.

Functions used to build these variables are located in the Feature Engineering python script.

All the data exploration work related to the conception and study of these variables is not presented here.



In [6]:

    
##### cD From wavelet db1
dwt_db1_cD_df = FE.make_dwt_vars_cD(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                       levels=[1, 2, 3, 4], wavelet='db1')

##### cA From wavelet db1
dwt_db1_cA_df = FE.make_dwt_vars_cA(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                       levels=[1, 2, 3, 4], wavelet='db1')

##### cD From wavelet db3
dwt_db3_cD_df = FE.make_dwt_vars_cD(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                       levels=[1, 2, 3, 4], wavelet='db3')

##### cA From wavelet db3
dwt_db3_cA_df = FE.make_dwt_vars_cA(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                       levels=[1, 2, 3, 4], wavelet='db3')

##### From entropy
entropy_df = FE.make_entropy_vars(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                               l_foots=[2, 3, 4, 5, 7, 10])

###### From gradient
gradient_df = FE.make_gradient_vars(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                 dx_list=[2, 3, 4, 5, 6, 10, 20])

##### From rolling average
moving_av_df = FE.make_moving_av_vars(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                   windows=[1, 2, 5, 10, 20])

##### From rolling standard deviation
moving_std_df = FE.make_moving_std_vars(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                     windows=[3 , 4, 5, 7, 10, 15, 20])

##### From rolling max
moving_max_df = FE.make_moving_max_vars(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                     windows=[3, 4, 5, 7, 10, 15, 20])

##### From rolling min
moving_min_df = FE.make_moving_min_vars(wells_df=training_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                     windows=[3 , 4, 5, 7, 10, 15, 20])

###### From rolling NM/M ratio
rolling_marine_ratio_df = FE.make_rolling_marine_ratio_vars(wells_df=training_data, windows=[5, 10, 15, 20, 30, 50, 75, 100, 200])

###### From distance to NM and M, up and down
dist_M_up_df = FE.make_distance_to_M_up_vars(wells_df=training_data)
dist_M_down_df = FE.make_distance_to_M_down_vars(wells_df=training_data)
dist_NM_up_df = FE.make_distance_to_NM_up_vars(wells_df=training_data)
dist_NM_down_df = FE.make_distance_to_NM_down_vars(wells_df=training_data)



In [7]:

    
list_df_var = [dwt_db1_cD_df, dwt_db1_cA_df, dwt_db3_cD_df, dwt_db3_cA_df,
               entropy_df, gradient_df, moving_av_df, moving_std_df, moving_max_df, moving_min_df,
              rolling_marine_ratio_df, dist_M_up_df, dist_M_down_df, dist_NM_up_df, dist_NM_down_df]
combined_df = training_data
for var_df in list_df_var:
    temp_df = var_df
    combined_df = pd.concat([combined_df,temp_df],axis=1)
combined_df.replace(to_replace=np.nan, value='-1', inplace=True)
print (combined_df.shape)
combined_df.head(5)









    



(4149, 299)






    Out[7]:






  
    
      
      Facies
      Formation
      Well Name
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      ...
      Marine_ratio_20_centered
      Marine_ratio_30_centered
      Marine_ratio_50_centered
      Marine_ratio_75_centered
      Marine_ratio_100_centered
      Marine_ratio_200_centered
      dist_M_up
      dist_M_down
      dist_NM_up
      dist_NM_down
    
  
  
    
      0
      3
      A1 SH
      SHRIMPLIN
      2793.0
      77.45
      0.664
      9.9
      11.915
      4.6
      1
      ...
      1.0
      1.0
      1.0
      1.0
      1.140000
      1.510000
      -1.0
      21.5
      0.0
      0.0
    
    
      1
      3
      A1 SH
      SHRIMPLIN
      2793.5
      78.26
      0.661
      14.2
      12.565
      4.1
      1
      ...
      1.0
      1.0
      1.0
      1.0
      1.156863
      1.504950
      -1.0
      21.0
      0.0
      0.0
    
    
      2
      3
      A1 SH
      SHRIMPLIN
      2794.0
      79.05
      0.658
      14.8
      13.050
      3.6
      1
      ...
      1.0
      1.0
      1.0
      1.0
      1.173077
      1.500000
      -1.0
      20.5
      0.0
      0.0
    
    
      3
      3
      A1 SH
      SHRIMPLIN
      2794.5
      86.10
      0.655
      13.9
      13.115
      3.5
      1
      ...
      1.0
      1.0
      1.0
      1.0
      1.188679
      1.495146
      -1.0
      20.0
      0.0
      0.0
    
    
      4
      3
      A1 SH
      SHRIMPLIN
      2795.0
      74.58
      0.647
      13.5
      13.300
      3.4
      1
      ...
      1.0
      1.0
      1.0
      1.0
      1.203704
      1.490385
      -1.0
      19.5
      0.0
      0.0
    
  

5 rows × 299 columns

Building a prediction model from these variables

A Random Forest model is built here to test the effect of these new variables on the prediction power. Algorithm parameters have been tuned so as to take into account the non-stationarity of the training and testing sets using the LeaveOneGroupOut cross-validation strategy. The size of individual tree leafs and nodes has been increased to the maximum possible without significantly increasing the variance so as to reduce the bias of the prediction. Box plot for a series of scores obtained through cross validation are presented below.

Create predictor and target arrays



In [8]:

    
X = combined_df.iloc[:, 4:]
y = combined_df['Facies']
groups = combined_df['Well Name']

Estimation of validation scores from this tuning



In [25]:

    
scoring_param = ['accuracy', 'recall_weighted', 'precision_weighted','f1_weighted']
scores = []

Cl = RandomForestClassifier(n_estimators=100, max_features=0.1, min_samples_leaf=25,
                            min_samples_split=50, class_weight='balanced', random_state=42, n_jobs=-1)

lpgo = LeavePGroupsOut(n_groups=2)

for scoring in scoring_param:
    
    cv=lpgo.split(X, y, groups)
    validated = cross_val_score(Cl, X, y, scoring=scoring, cv=cv, n_jobs=-1)
    scores.append(validated)
    
scores = np.array(scores)
scores = np.swapaxes(scores, 0, 1)
scores = pd.DataFrame(data=scores, columns=scoring_param)



In [26]:

    
sns.set_style('white')
fig,ax = plt.subplots(figsize=(8,6))
sns.boxplot(data=scores)
plt.xlabel('scoring parameters')
plt.ylabel('score')
plt.title('Classification scores for tuned parameters');

Evaluating feature importances

The individual contribution to the classification for each feature (i.e., feature importances) can be obtained from a Random Forest classifier. This gives a good idea of the classification power of individual features and helps understanding which type of feature engineering is the most promising.

Caution should be taken when interpreting feature importances, as highly correlated variables will tend to dilute their classification power between themselves and will rank lower than uncorelated variables.



In [11]:

    
####### Evaluation of feature importances
Cl = RandomForestClassifier(n_estimators=75, max_features=0.1, min_samples_leaf=25,
                            min_samples_split=50, class_weight='balanced', random_state=42,oob_score=True, n_jobs=-1)
Cl.fit(X, y)
print ('OOB estimate of accuracy for prospectivity classification using all features: %s' % str(Cl.oob_score_))

importances = Cl.feature_importances_ 
std = np.std([tree.feature_importances_ for tree in Cl.estimators_], axis=0)
indices = np.argsort(importances)[::-1]
print("Feature ranking:")
Vars = list(X.columns.values)
for f in range(X.shape[1]):
    print("%d. feature %d %s (%f)" % (f + 1, indices[f], Vars[indices[f]], importances[indices[f]]))









    



OOB estimate of accuracy for prospectivity classification using all features: 0.723065798988
Feature ranking:
1. feature 282 Marine_ratio_5_centered (0.031530)
2. feature 293 dist_NM_up (0.030115)
3. feature 288 Marine_ratio_75_centered (0.029691)
4. feature 292 dist_M_down (0.027875)
5. feature 284 Marine_ratio_15_centered (0.027651)
6. feature 285 Marine_ratio_20_centered (0.026680)
7. feature 290 Marine_ratio_200_centered (0.026423)
8. feature 286 Marine_ratio_30_centered (0.022103)
9. feature 5 NM_M (0.021090)
10. feature 289 Marine_ratio_100_centered (0.018909)
11. feature 287 Marine_ratio_50_centered (0.017506)
12. feature 283 Marine_ratio_10_centered (0.016424)
13. feature 171 PE_moving_av_20ft (0.014572)
14. feature 249 GR_moving_min_5ft (0.012678)
15. feature 291 dist_M_up (0.011630)
16. feature 250 GR_moving_min_7ft (0.011605)
17. feature 81 PE_cA_level_3 (0.010347)
18. feature 254 ILD_log10_moving_min_3ft (0.009445)
19. feature 272 PE_moving_min_10ft (0.009424)
20. feature 294 dist_NM_down (0.008873)
21. feature 219 ILD_log10_moving_max_3ft (0.008808)
22. feature 241 PHIND_moving_max_4ft (0.008783)
23. feature 224 ILD_log10_moving_max_15ft (0.008776)
24. feature 248 GR_moving_min_4ft (0.008697)
25. feature 0 GR (0.008360)
26. feature 223 ILD_log10_moving_max_10ft (0.008254)
27. feature 271 PE_moving_min_7ft (0.008216)
28. feature 170 PE_moving_av_10ft (0.007934)
29. feature 251 GR_moving_min_10ft (0.007851)
30. feature 277 PHIND_moving_min_5ft (0.007727)
31. feature 275 PHIND_moving_min_3ft (0.007675)
32. feature 257 ILD_log10_moving_min_7ft (0.007360)
33. feature 240 PHIND_moving_max_3ft (0.007212)
34. feature 247 GR_moving_min_3ft (0.007172)
35. feature 159 ILD_log10_moving_av_5ft (0.007125)
36. feature 221 ILD_log10_moving_max_5ft (0.007063)
37. feature 174 PHIND_moving_av_5ft (0.006821)
38. feature 37 DeltaPHI_cA_level_3 (0.006339)
39. feature 160 ILD_log10_moving_av_10ft (0.006312)
40. feature 255 ILD_log10_moving_min_4ft (0.006159)
41. feature 259 ILD_log10_moving_min_15ft (0.005927)
42. feature 269 PE_moving_min_4ft (0.005768)
43. feature 31 ILD_log10_cA_level_1 (0.005768)
44. feature 239 PE_moving_max_20ft (0.005693)
45. feature 152 GR_moving_av_1ft (0.005620)
46. feature 1 ILD_log10 (0.005587)
47. feature 3 PHIND (0.005504)
48. feature 6 RELPOS (0.005443)
49. feature 153 GR_moving_av_2ft (0.005402)
50. feature 34 ILD_log10_cA_level_4 (0.005058)
51. feature 238 PE_moving_max_15ft (0.005040)
52. feature 157 ILD_log10_moving_av_1ft (0.004991)
53. feature 43 PHIND_cA_level_1 (0.004968)
54. feature 246 PHIND_moving_max_20ft (0.004958)
55. feature 276 PHIND_moving_min_4ft (0.004852)
56. feature 230 DeltaPHI_moving_max_10ft (0.004707)
57. feature 173 PHIND_moving_av_2ft (0.004687)
58. feature 39 PE_cA_level_1 (0.004664)
59. feature 32 ILD_log10_cA_level_2 (0.004654)
60. feature 242 PHIND_moving_max_5ft (0.004548)
61. feature 172 PHIND_moving_av_1ft (0.004467)
62. feature 41 PE_cA_level_3 (0.004452)
63. feature 158 ILD_log10_moving_av_2ft (0.004395)
64. feature 256 ILD_log10_moving_min_5ft (0.004373)
65. feature 264 DeltaPHI_moving_min_7ft (0.004358)
66. feature 222 ILD_log10_moving_max_7ft (0.004249)
67. feature 165 DeltaPHI_moving_av_10ft (0.004245)
68. feature 270 PE_moving_min_5ft (0.004242)
69. feature 161 ILD_log10_moving_av_20ft (0.004107)
70. feature 38 DeltaPHI_cA_level_4 (0.004102)
71. feature 155 GR_moving_av_10ft (0.004062)
72. feature 236 PE_moving_max_7ft (0.003970)
73. feature 204 PE_moving_std_20ft (0.003910)
74. feature 79 PE_cA_level_1 (0.003905)
75. feature 83 PHIND_cA_level_1 (0.003900)
76. feature 278 PHIND_moving_min_7ft (0.003852)
77. feature 33 ILD_log10_cA_level_3 (0.003812)
78. feature 220 ILD_log10_moving_max_4ft (0.003765)
79. feature 169 PE_moving_av_5ft (0.003751)
80. feature 86 PHIND_cA_level_4 (0.003729)
81. feature 231 DeltaPHI_moving_max_15ft (0.003673)
82. feature 176 PHIND_moving_av_20ft (0.003595)
83. feature 243 PHIND_moving_max_7ft (0.003566)
84. feature 70 GR_cA_level_4 (0.003537)
85. feature 263 DeltaPHI_moving_min_5ft (0.003492)
86. feature 212 GR_moving_max_3ft (0.003473)
87. feature 229 DeltaPHI_moving_max_7ft (0.003400)
88. feature 260 ILD_log10_moving_min_20ft (0.003384)
89. feature 168 PE_moving_av_2ft (0.003371)
90. feature 22 PE_cD_level_4 (0.003364)
91. feature 245 PHIND_moving_max_15ft (0.003225)
92. feature 274 PE_moving_min_20ft (0.003152)
93. feature 82 PE_cA_level_4 (0.003132)
94. feature 87 GR_entropy_foot2 (0.003090)
95. feature 27 GR_cA_level_1 (0.003087)
96. feature 10 GR_cD_level_4 (0.003044)
97. feature 54 ILD_log10_cD_level_4 (0.003015)
98. feature 40 PE_cA_level_2 (0.002986)
99. feature 62 PE_cD_level_4 (0.002894)
100. feature 2 DeltaPHI (0.002862)
101. feature 98 ILD_log10_entropy_foot10 (0.002792)
102. feature 78 DeltaPHI_cA_level_4 (0.002736)
103. feature 266 DeltaPHI_moving_min_15ft (0.002735)
104. feature 45 PHIND_cA_level_3 (0.002653)
105. feature 261 DeltaPHI_moving_min_3ft (0.002625)
106. feature 166 DeltaPHI_moving_av_20ft (0.002619)
107. feature 46 PHIND_cA_level_4 (0.002609)
108. feature 154 GR_moving_av_5ft (0.002592)
109. feature 77 DeltaPHI_cA_level_3 (0.002551)
110. feature 232 DeltaPHI_moving_max_20ft (0.002545)
111. feature 252 GR_moving_min_15ft (0.002541)
112. feature 268 PE_moving_min_3ft (0.002493)
113. feature 9 GR_cD_level_3 (0.002458)
114. feature 175 PHIND_moving_av_10ft (0.002419)
115. feature 227 DeltaPHI_moving_max_4ft (0.002415)
116. feature 30 GR_cA_level_4 (0.002377)
117. feature 211 PHIND_moving_std_20ft (0.002376)
118. feature 244 PHIND_moving_max_10ft (0.002362)
119. feature 267 DeltaPHI_moving_min_20ft (0.002321)
120. feature 163 DeltaPHI_moving_av_2ft (0.002276)
121. feature 253 GR_moving_min_20ft (0.002248)
122. feature 210 PHIND_moving_std_15ft (0.002224)
123. feature 228 DeltaPHI_moving_max_5ft (0.002209)
124. feature 92 GR_entropy_foot10 (0.002205)
125. feature 279 PHIND_moving_min_10ft (0.002189)
126. feature 68 GR_cA_level_2 (0.002179)
127. feature 225 ILD_log10_moving_max_20ft (0.002164)
128. feature 35 DeltaPHI_cA_level_1 (0.002157)
129. feature 196 DeltaPHI_moving_std_15ft (0.002150)
130. feature 235 PE_moving_max_5ft (0.002136)
131. feature 237 PE_moving_max_10ft (0.002133)
132. feature 233 PE_moving_max_3ft (0.002054)
133. feature 17 DeltaPHI_cD_level_3 (0.002046)
134. feature 44 PHIND_cA_level_2 (0.002030)
135. feature 197 DeltaPHI_moving_std_20ft (0.002011)
136. feature 74 ILD_log10_cA_level_4 (0.002005)
137. feature 69 GR_cA_level_3 (0.001974)
138. feature 84 PHIND_cA_level_2 (0.001971)
139. feature 234 PE_moving_max_4ft (0.001941)
140. feature 36 DeltaPHI_cA_level_2 (0.001940)
141. feature 265 DeltaPHI_moving_min_10ft (0.001915)
142. feature 25 PHIND_cD_level_3 (0.001901)
143. feature 280 PHIND_moving_min_15ft (0.001895)
144. feature 156 GR_moving_av_20ft (0.001888)
145. feature 42 PE_cA_level_4 (0.001869)
146. feature 226 DeltaPHI_moving_max_3ft (0.001856)
147. feature 113 PHIND_entropy_foot4 (0.001846)
148. feature 262 DeltaPHI_moving_min_4ft (0.001806)
149. feature 58 DeltaPHI_cD_level_4 (0.001804)
150. feature 66 PHIND_cD_level_4 (0.001798)
151. feature 203 PE_moving_std_15ft (0.001797)
152. feature 162 DeltaPHI_moving_av_1ft (0.001776)
153. feature 164 DeltaPHI_moving_av_5ft (0.001759)
154. feature 88 GR_entropy_foot3 (0.001689)
155. feature 14 ILD_log10_cD_level_4 (0.001689)
156. feature 18 DeltaPHI_cD_level_4 (0.001684)
157. feature 90 GR_entropy_foot5 (0.001682)
158. feature 89 GR_entropy_foot4 (0.001646)
159. feature 215 GR_moving_max_7ft (0.001638)
160. feature 217 GR_moving_max_15ft (0.001597)
161. feature 73 ILD_log10_cA_level_3 (0.001554)
162. feature 28 GR_cA_level_2 (0.001541)
163. feature 216 GR_moving_max_10ft (0.001515)
164. feature 218 GR_moving_max_20ft (0.001498)
165. feature 273 PE_moving_min_15ft (0.001492)
166. feature 190 ILD_log10_moving_std_20ft (0.001424)
167. feature 189 ILD_log10_moving_std_15ft (0.001422)
168. feature 50 GR_cD_level_4 (0.001421)
169. feature 258 ILD_log10_moving_min_10ft (0.001419)
170. feature 94 ILD_log10_entropy_foot3 (0.001407)
171. feature 182 GR_moving_std_15ft (0.001398)
172. feature 202 PE_moving_std_10ft (0.001389)
173. feature 97 ILD_log10_entropy_foot7 (0.001343)
174. feature 67 GR_cA_level_1 (0.001277)
175. feature 65 PHIND_cD_level_3 (0.001261)
176. feature 183 GR_moving_std_20ft (0.001236)
177. feature 61 PE_cD_level_3 (0.001223)
178. feature 96 ILD_log10_entropy_foot5 (0.001221)
179. feature 12 ILD_log10_cD_level_2 (0.001213)
180. feature 29 GR_cA_level_3 (0.001206)
181. feature 209 PHIND_moving_std_10ft (0.001194)
182. feature 115 PHIND_entropy_foot7 (0.001187)
183. feature 213 GR_moving_max_4ft (0.001178)
184. feature 57 DeltaPHI_cD_level_3 (0.001171)
185. feature 95 ILD_log10_entropy_foot4 (0.001152)
186. feature 281 PHIND_moving_min_20ft (0.001129)
187. feature 181 GR_moving_std_10ft (0.001116)
188. feature 8 GR_cD_level_2 (0.001071)
189. feature 80 PE_cA_level_2 (0.001048)
190. feature 26 PHIND_cD_level_4 (0.001042)
191. feature 111 PHIND_entropy_foot2 (0.001025)
192. feature 179 GR_moving_std_5ft (0.001022)
193. feature 75 DeltaPHI_cA_level_1 (0.001017)
194. feature 214 GR_moving_max_5ft (0.001001)
195. feature 180 GR_moving_std_7ft (0.000997)
196. feature 85 PHIND_cA_level_3 (0.000971)
197. feature 114 PHIND_entropy_foot5 (0.000971)
198. feature 201 PE_moving_std_7ft (0.000905)
199. feature 72 ILD_log10_cA_level_2 (0.000866)
200. feature 13 ILD_log10_cD_level_3 (0.000861)
201. feature 186 ILD_log10_moving_std_5ft (0.000842)
202. feature 194 DeltaPHI_moving_std_7ft (0.000805)
203. feature 21 PE_cD_level_3 (0.000796)
204. feature 93 ILD_log10_entropy_foot2 (0.000760)
205. feature 60 PE_cD_level_2 (0.000742)
206. feature 64 PHIND_cD_level_2 (0.000741)
207. feature 200 PE_moving_std_5ft (0.000726)
208. feature 139 PEgradient_dx3 (0.000719)
209. feature 187 ILD_log10_moving_std_7ft (0.000715)
210. feature 7 GR_cD_level_1 (0.000714)
211. feature 24 PHIND_cD_level_2 (0.000688)
212. feature 53 ILD_log10_cD_level_3 (0.000687)
213. feature 206 PHIND_moving_std_4ft (0.000684)
214. feature 112 PHIND_entropy_foot3 (0.000655)
215. feature 4 PE (0.000629)
216. feature 76 DeltaPHI_cA_level_2 (0.000592)
217. feature 71 ILD_log10_cA_level_1 (0.000589)
218. feature 55 DeltaPHI_cD_level_1 (0.000575)
219. feature 100 DeltaPHI_entropy_foot3 (0.000571)
220. feature 208 PHIND_moving_std_7ft (0.000568)
221. feature 207 PHIND_moving_std_5ft (0.000560)
222. feature 104 DeltaPHI_entropy_foot10 (0.000551)
223. feature 195 DeltaPHI_moving_std_10ft (0.000547)
224. feature 205 PHIND_moving_std_3ft (0.000535)
225. feature 116 PHIND_entropy_foot10 (0.000525)
226. feature 188 ILD_log10_moving_std_10ft (0.000519)
227. feature 101 DeltaPHI_entropy_foot4 (0.000516)
228. feature 192 DeltaPHI_moving_std_4ft (0.000513)
229. feature 49 GR_cD_level_3 (0.000511)
230. feature 128 ILD_log10gradient_dx6 (0.000503)
231. feature 140 PEgradient_dx4 (0.000470)
232. feature 56 DeltaPHI_cD_level_2 (0.000447)
233. feature 19 PE_cD_level_1 (0.000443)
234. feature 119 GRgradient_dx4 (0.000442)
235. feature 122 GRgradient_dx10 (0.000438)
236. feature 191 DeltaPHI_moving_std_3ft (0.000429)
237. feature 198 PE_moving_std_3ft (0.000415)
238. feature 138 PEgradient_dx2 (0.000411)
239. feature 129 ILD_log10gradient_dx10 (0.000406)
240. feature 193 DeltaPHI_moving_std_5ft (0.000394)
241. feature 145 PHINDgradient_dx2 (0.000389)
242. feature 125 ILD_log10gradient_dx3 (0.000387)
243. feature 52 ILD_log10_cD_level_2 (0.000387)
244. feature 151 PHINDgradient_dx20 (0.000368)
245. feature 127 ILD_log10gradient_dx5 (0.000368)
246. feature 121 GRgradient_dx6 (0.000347)
247. feature 132 DeltaPHIgradient_dx3 (0.000344)
248. feature 99 DeltaPHI_entropy_foot2 (0.000341)
249. feature 51 ILD_log10_cD_level_1 (0.000336)
250. feature 63 PHIND_cD_level_1 (0.000321)
251. feature 146 PHINDgradient_dx3 (0.000315)
252. feature 147 PHINDgradient_dx4 (0.000314)
253. feature 141 PEgradient_dx5 (0.000312)
254. feature 136 DeltaPHIgradient_dx10 (0.000311)
255. feature 103 DeltaPHI_entropy_foot7 (0.000310)
256. feature 167 PE_moving_av_1ft (0.000301)
257. feature 126 ILD_log10gradient_dx4 (0.000293)
258. feature 149 PHINDgradient_dx6 (0.000293)
259. feature 48 GR_cD_level_2 (0.000289)
260. feature 102 DeltaPHI_entropy_foot5 (0.000280)
261. feature 118 GRgradient_dx3 (0.000279)
262. feature 15 DeltaPHI_cD_level_1 (0.000274)
263. feature 11 ILD_log10_cD_level_1 (0.000258)
264. feature 148 PHINDgradient_dx5 (0.000247)
265. feature 123 GRgradient_dx20 (0.000244)
266. feature 144 PEgradient_dx20 (0.000236)
267. feature 150 PHINDgradient_dx10 (0.000231)
268. feature 124 ILD_log10gradient_dx2 (0.000230)
269. feature 108 PE_entropy_foot5 (0.000223)
270. feature 142 PEgradient_dx6 (0.000223)
271. feature 185 ILD_log10_moving_std_4ft (0.000197)
272. feature 16 DeltaPHI_cD_level_2 (0.000194)
273. feature 135 DeltaPHIgradient_dx6 (0.000187)
274. feature 178 GR_moving_std_4ft (0.000178)
275. feature 20 PE_cD_level_2 (0.000170)
276. feature 59 PE_cD_level_1 (0.000169)
277. feature 130 ILD_log10gradient_dx20 (0.000166)
278. feature 23 PHIND_cD_level_1 (0.000166)
279. feature 137 DeltaPHIgradient_dx20 (0.000162)
280. feature 134 DeltaPHIgradient_dx5 (0.000160)
281. feature 133 DeltaPHIgradient_dx4 (0.000151)
282. feature 143 PEgradient_dx10 (0.000150)
283. feature 120 GRgradient_dx5 (0.000132)
284. feature 110 PE_entropy_foot10 (0.000127)
285. feature 106 PE_entropy_foot3 (0.000118)
286. feature 47 GR_cD_level_1 (0.000105)
287. feature 109 PE_entropy_foot7 (0.000104)
288. feature 184 ILD_log10_moving_std_3ft (0.000100)
289. feature 107 PE_entropy_foot4 (0.000093)
290. feature 177 GR_moving_std_3ft (0.000077)
291. feature 131 DeltaPHIgradient_dx2 (0.000072)
292. feature 199 PE_moving_std_4ft (0.000071)
293. feature 117 GRgradient_dx2 (0.000065)
294. feature 91 GR_entropy_foot7 (0.000041)
295. feature 105 PE_entropy_foot2 (0.000000)

Plot the feature importances of the forest



In [12]:

    
sns.set_style('white')
fig,ax = plt.subplots(figsize=(15,5))
ax.bar(range(X.shape[1]), importances[indices],color="r", align="center")
plt.ylabel("Feature importance")
plt.xlabel('Ranked features')
plt.xticks([], indices)
plt.xlim([-1, X.shape[1]]);

Features derived from raw geological variables tend to have the highest classification power. Rolling min, max and mean tend to have better classification power than raw data. Wavelet approximation coeficients tend to have a similar to lower classification power than raw data. Features expressing local texture of the data (entropy, gradient, standard deviation and wavelet detail coeficients) have a low classification power but still participate in the prediction.

Confusion matrix

The confusion matrix from the validation test is presented below.



In [13]:

    
######## Confusion matrix from this tuning
cv=LeaveOneGroupOut().split(X, y, groups)
y_pred = cross_val_predict(Cl, X, y, cv=cv, n_jobs=-1)
conf_mat = confusion_matrix(y, y_pred)
list_facies = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS', 'WS', 'D', 'PS', 'BS']
conf_mat = pd.DataFrame(conf_mat, columns=list_facies, index=list_facies)
conf_mat.head(10)

Applying the classification model to test data



In [15]:

    
filename = '../validation_data_nofacies.csv'
test_data = pd.read_csv(filename)
test_data.head(5)



In [16]:

    
##### cD From wavelet db1
dwt_db1_cD_df = FE.make_dwt_vars_cD(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                    levels=[1, 2, 3, 4], wavelet='db1')

##### cA From wavelet db1
dwt_db1_cA_df = FE.make_dwt_vars_cA(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                       levels=[1, 2, 3, 4], wavelet='db1')

##### cD From wavelet db3
dwt_db3_cD_df = FE.make_dwt_vars_cD(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                       levels=[1, 2, 3, 4], wavelet='db3')

##### cA From wavelet db3
dwt_db3_cA_df = FE.make_dwt_vars_cA(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                       levels=[1, 2, 3, 4], wavelet='db3')

##### From entropy
entropy_df = FE.make_entropy_vars(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                               l_foots=[2, 3, 4, 5, 7, 10])

###### From gradient
gradient_df = FE.make_gradient_vars(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                 dx_list=[2, 3, 4, 5, 6, 10, 20])

##### From rolling average
moving_av_df = FE.make_moving_av_vars(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                   windows=[1, 2, 5, 10, 20])

##### From rolling standard deviation
moving_std_df = FE.make_moving_std_vars(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                     windows=[3 , 4, 5, 7, 10, 15, 20])

##### From rolling max
moving_max_df = FE.make_moving_max_vars(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                     windows=[3, 4, 5, 7, 10, 15, 20])

##### From rolling min
moving_min_df = FE.make_moving_min_vars(wells_df=test_data, logs=['GR', 'ILD_log10', 'DeltaPHI', 'PE', 'PHIND'],
                                     windows=[3 , 4, 5, 7, 10, 15, 20])

###### From rolling NM/M ratio
rolling_marine_ratio_df = FE.make_rolling_marine_ratio_vars(wells_df=test_data, windows=[5, 10, 15, 20, 30, 50, 75, 100, 200])

###### From distance to NM and M, up and down
dist_M_up_df = FE.make_distance_to_M_up_vars(wells_df=test_data)
dist_M_down_df = FE.make_distance_to_M_down_vars(wells_df=test_data)
dist_NM_up_df = FE.make_distance_to_NM_up_vars(wells_df=test_data)
dist_NM_down_df = FE.make_distance_to_NM_down_vars(wells_df=test_data)



In [17]:

    
combined_test_df = test_data
list_df_var = [dwt_db1_cD_df, dwt_db1_cA_df, dwt_db3_cD_df, dwt_db3_cA_df,
               entropy_df, gradient_df, moving_av_df, moving_std_df, moving_max_df, moving_min_df,
              rolling_marine_ratio_df, dist_M_up_df, dist_M_down_df, dist_NM_up_df, dist_NM_down_df]
for var_df in list_df_var:
    temp_df = var_df
    combined_test_df = pd.concat([combined_test_df,temp_df],axis=1)
combined_test_df.replace(to_replace=np.nan, value='-99999', inplace=True)

X_test = combined_test_df.iloc[:, 3:]

print (combined_test_df.shape)
combined_test_df.head(5)









    



(830, 298)






    Out[17]:






  
    
      
      Formation
      Well Name
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
      ...
      Marine_ratio_20_centered
      Marine_ratio_30_centered
      Marine_ratio_50_centered
      Marine_ratio_75_centered
      Marine_ratio_100_centered
      Marine_ratio_200_centered
      dist_M_up
      dist_M_down
      dist_NM_up
      dist_NM_down
    
  
  
    
      0
      A1 SH
      STUART
      2808.0
      66.276
      0.630
      3.3
      10.65
      3.591
      1
      1.000
      ...
      1.0
      1.0
      1.0
      1.0
      1.140000
      1.570000
      -1.0
      21.5
      0.0
      0.0
    
    
      1
      A1 SH
      STUART
      2808.5
      77.252
      0.585
      6.5
      11.95
      3.341
      1
      0.978
      ...
      1.0
      1.0
      1.0
      1.0
      1.156863
      1.574257
      -1.0
      21.0
      0.0
      0.0
    
    
      2
      A1 SH
      STUART
      2809.0
      82.899
      0.566
      9.4
      13.60
      3.064
      1
      0.956
      ...
      1.0
      1.0
      1.0
      1.0
      1.173077
      1.578431
      -1.0
      20.5
      0.0
      0.0
    
    
      3
      A1 SH
      STUART
      2809.5
      80.671
      0.593
      9.5
      13.25
      2.977
      1
      0.933
      ...
      1.0
      1.0
      1.0
      1.0
      1.188679
      1.582524
      -1.0
      20.0
      0.0
      0.0
    
    
      4
      A1 SH
      STUART
      2810.0
      75.971
      0.638
      8.7
      12.35
      3.020
      1
      0.911
      ...
      1.0
      1.0
      1.0
      1.0
      1.203704
      1.586538
      -1.0
      19.5
      0.0
      0.0
    
  

5 rows × 298 columns



In [18]:

    
Cl = RandomForestClassifier(n_estimators=100, max_features=0.1, min_samples_leaf=25,
                            min_samples_split=50, class_weight='balanced', random_state=42, n_jobs=-1)
Cl.fit(X, y)
y_test = Cl.predict(X_test)
y_test = pd.DataFrame(y_test, columns=['Predicted Facies'])
test_pred_df = pd.concat([combined_test_df[['Well Name', 'Depth']], y_test], axis=1)
test_pred_df.head()









    Out[18]:






  
    
      
      Well Name
      Depth
      Predicted Facies
    
  
  
    
      0
      STUART
      2808.0
      3
    
    
      1
      STUART
      2808.5
      3
    
    
      2
      STUART
      2809.0
      3
    
    
      3
      STUART
      2809.5
      3
    
    
      4
      STUART
      2810.0
      3

Exporting results



In [19]:

    
test_pred_df.to_pickle('Prediction_blind_wells_RF_c.pkl')



In [ ]:

	Facies	Formation	Well Name	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
0	3	A1 SH	SHRIMPLIN	2793.0	77.45	0.664	9.9	11.915	4.6	1	1.000
1	3	A1 SH	SHRIMPLIN	2793.5	78.26	0.661	14.2	12.565	4.1	1	0.979
2	3	A1 SH	SHRIMPLIN	2794.0	79.05	0.658	14.8	13.050	3.6	1	0.957
3	3	A1 SH	SHRIMPLIN	2794.5	86.10	0.655	13.9	13.115	3.5	1	0.936
4	3	A1 SH	SHRIMPLIN	2795.0	74.58	0.647	13.5	13.300	3.4	1	0.915

	Facies	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
count	4149.000000	4149.000000	4149.000000	4149.000000	4149.000000	4149.000000	3232.000000	4149.000000	4149.000000
mean	4.503254	2906.867438	64.933985	0.659566	4.402484	13.201066	3.725014	1.518438	0.521852
std	2.474324	133.300164	30.302530	0.252703	5.274947	7.132846	0.896152	0.499720	0.286644
min	1.000000	2573.500000	10.149000	-0.025949	-21.832000	0.550000	0.200000	1.000000	0.000000
25%	2.000000	2821.500000	44.730000	0.498000	1.600000	8.500000	3.100000	1.000000	0.277000
50%	4.000000	2932.500000	64.990000	0.639000	4.300000	12.020000	3.551500	2.000000	0.528000
75%	6.000000	3007.000000	79.438000	0.822000	7.500000	16.050000	4.300000	2.000000	0.769000
max	9.000000	3138.000000	361.150000	1.800000	19.312000	84.400000	8.094000	2.000000	1.000000

	SS	CSiS	FSiS	SiSh	MS	WS	D	PS	BS
SS	170	78	20	0	0	0	0	0	0
CSiS	77	604	254	3	0	2	0	0	0
FSiS	25	255	484	8	0	2	1	5	0
SiSh	0	2	3	177	25	38	18	8	0
MS	0	3	9	53	30	105	26	50	20
WS	0	0	4	71	86	225	35	150	11
D	0	0	1	16	3	9	77	35	0
PS	0	4	11	35	15	159	73	341	48
BS	0	2	0	0	1	13	15	54	100

	Formation	Well Name	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
0	A1 SH	STUART	2808.0	66.276	0.630	3.3	10.65	3.591	1	1.000
1	A1 SH	STUART	2808.5	77.252	0.585	6.5	11.95	3.341	1	0.978
2	A1 SH	STUART	2809.0	82.899	0.566	9.4	13.60	3.064	1	0.956
3	A1 SH	STUART	2809.5	80.671	0.593	9.5	13.25	2.977	1	0.933
4	A1 SH	STUART	2810.0	75.971	0.638	8.7	12.35	3.020	1	0.911