In this notebook, we mainly use extreme gradient boosting (XGBoost, abbreviated XGB below) to improve the prediction model originally proposed in the TLE November 2016 machine learning tutorial. Extreme gradient boosting can be viewed as an enhanced version of gradient boosting that uses a more regularized model formalization to control over-fitting, and it often performs better in practice. Applications of XGB can be found in many Kaggle competitions. Some recommended tutorials can be found

Our work will be organized in the following order:

•Background

•Exploratory Data Analysis

•Data Preparation and Model Selection

•Final Results

Background

The dataset we will use comes from a class exercise from The University of Kansas on Neural Networks and Fuzzy Systems. This exercise is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more info on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).

The dataset we will use is log data from nine wells that have been labeled with a facies type based on observation of core. We will use this log data to train a classifier to predict facies types.

This data is from the Council Grove gas reservoir in Southwest Kansas. The Panoma Council Grove Field is predominantly a carbonate gas reservoir encompassing 2700 square miles in Southwestern Kansas. This dataset is from nine wells (4149 examples); each example vector consists of seven predictor variables and a rock facies (class). The validation (test) data (830 examples from two wells) has the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five from wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.

The seven predictor variables are:

•Five wireline log curves: gamma ray (GR), resistivity logging (ILD_log10), photoelectric effect (PE), neutron-density porosity difference (DeltaPHI) and average neutron-density porosity (PHIND). Note that some wells do not have PE.

•Two geologic constraining variables: nonmarine-marine indicator (NM_M) and relative position (RELPOS).

The nine discrete facies (classes of rocks) are:

1.Nonmarine sandstone

2.Nonmarine coarse siltstone

3.Nonmarine fine siltstone

4.Marine siltstone and shale

5.Mudstone (limestone)

6.Wackestone (limestone)

7.Dolomite

8.Packstone-grainstone (limestone)

9.Phylloid-algal bafflestone (limestone)

These facies aren't discrete, and gradually blend into one another. Some have neighboring facies that are rather close. Mislabeling within these neighboring facies can be expected to occur. The following table lists the facies, their abbreviated labels and their approximate neighbors.

Facies   Label   Adjacent Facies
1        SS      2
2        CSiS    1,3
3        FSiS    2
4        SiSh    5
5        MS      4,6
6        WS      5,7
7        D       6,8
8        PS      6,7,9
9        BS      7,8

The first thing we notice about this data is that the neighboring facies are not symmetric: for example, facies 9 lists 7 as adjacent, yet facies 7 does not list 9. We have already contacted the authors regarding this.
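The asymmetry can be checked mechanically. The sketch below transcribes the table above into a dictionary and lists every pair where one facies names the other as a neighbor but not the other way around. The names `adjacent` and `asymmetric_pairs` are ours and are not part of the original tutorial.

```python
# Adjacent facies transcribed from the table above (1-based facies numbers).
adjacent = {
    1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4, 6],
    6: [5, 7], 7: [6, 8], 8: [6, 7, 9], 9: [7, 8],
}

def asymmetric_pairs(adj):
    """Return (a, b) pairs where b is listed as adjacent to a, but a is not listed for b."""
    return [(a, b) for a in sorted(adj) for b in adj[a] if a not in adj[b]]

print(asymmetric_pairs(adjacent))
```

With the table as transcribed here, the check reports two one-way relations: (8, 6) and (9, 7).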

Exploratory Data Analysis

After the background introduction, we import the pandas library for some basic data analysis and manipulation. matplotlib and seaborn are imported for data visualization.


In [1]:
%matplotlib inline
import pandas as pd
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in later pandas releases
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import matplotlib.colors as colors

In [4]:
filename = '../facies_vectors.csv'
training_data = pd.read_csv(filename)
training_data


Out[4]:
Facies Formation Well Name Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS
0 3 A1 SH SHRIMPLIN 2793.0 77.450 0.664 9.900 11.915 4.600 1 1.000
1 3 A1 SH SHRIMPLIN 2793.5 78.260 0.661 14.200 12.565 4.100 1 0.979
2 3 A1 SH SHRIMPLIN 2794.0 79.050 0.658 14.800 13.050 3.600 1 0.957
3 3 A1 SH SHRIMPLIN 2794.5 86.100 0.655 13.900 13.115 3.500 1 0.936
4 3 A1 SH SHRIMPLIN 2795.0 74.580 0.647 13.500 13.300 3.400 1 0.915
5 3 A1 SH SHRIMPLIN 2795.5 73.970 0.636 14.000 13.385 3.600 1 0.894
6 3 A1 SH SHRIMPLIN 2796.0 73.720 0.630 15.600 13.930 3.700 1 0.872
7 3 A1 SH SHRIMPLIN 2796.5 75.650 0.625 16.500 13.920 3.500 1 0.830
8 3 A1 SH SHRIMPLIN 2797.0 73.790 0.624 16.200 13.980 3.400 1 0.809
9 3 A1 SH SHRIMPLIN 2797.5 76.890 0.615 16.900 14.220 3.500 1 0.787
10 3 A1 SH SHRIMPLIN 2798.0 76.110 0.600 14.800 13.375 3.600 1 0.766
11 3 A1 SH SHRIMPLIN 2798.5 74.950 0.583 13.300 12.690 3.700 1 0.745
12 3 A1 SH SHRIMPLIN 2799.0 71.870 0.561 11.300 12.475 3.500 1 0.723
13 3 A1 SH SHRIMPLIN 2799.5 83.420 0.537 13.300 14.930 3.400 1 0.702
14 2 A1 SH SHRIMPLIN 2800.0 90.100 0.519 14.300 16.555 3.200 1 0.681
15 2 A1 SH SHRIMPLIN 2800.5 78.150 0.467 11.800 15.960 3.100 1 0.638
16 2 A1 SH SHRIMPLIN 2801.0 69.300 0.438 9.500 15.120 3.100 1 0.617
17 2 A1 SH SHRIMPLIN 2801.5 63.540 0.418 8.800 15.190 3.000 1 0.596
18 2 A1 SH SHRIMPLIN 2802.0 63.870 0.401 7.200 15.390 2.900 1 0.574
19 2 A1 SH SHRIMPLIN 2802.5 58.320 0.386 6.600 14.885 2.800 1 0.553
20 2 A1 SH SHRIMPLIN 2803.0 56.610 0.369 5.500 14.800 3.000 1 0.532
21 2 A1 SH SHRIMPLIN 2803.5 55.970 0.352 6.100 14.460 3.000 1 0.511
22 2 A1 SH SHRIMPLIN 2804.0 63.670 0.344 6.000 14.745 3.000 1 0.489
23 2 A1 SH SHRIMPLIN 2804.5 66.200 0.342 6.800 15.135 3.000 1 0.468
24 2 A1 SH SHRIMPLIN 2805.0 61.270 0.346 6.100 15.480 3.000 1 0.447
25 3 A1 SH SHRIMPLIN 2805.5 69.480 0.354 5.800 14.675 3.000 1 0.404
26 3 A1 SH SHRIMPLIN 2806.0 76.370 0.354 5.200 13.635 3.000 1 0.383
27 2 A1 SH SHRIMPLIN 2806.5 82.200 0.348 7.400 15.055 3.000 1 0.362
28 2 A1 SH SHRIMPLIN 2807.0 90.250 0.346 11.500 20.230 3.100 1 0.340
29 2 A1 SH SHRIMPLIN 2807.5 94.380 0.358 14.200 24.015 3.000 1 0.319
... ... ... ... ... ... ... ... ... ... ... ...
4119 8 C LM CHURCHMAN BIBLE 3108.0 30.734 0.991 1.552 5.382 4.738 2 0.887
4120 6 C LM CHURCHMAN BIBLE 3108.5 32.219 1.013 1.342 5.055 4.637 2 0.879
4121 6 C LM CHURCHMAN BIBLE 3109.0 37.688 1.040 0.681 4.739 4.539 2 0.871
4122 6 C LM CHURCHMAN BIBLE 3109.5 35.844 1.044 0.960 3.533 4.832 2 0.863
4123 6 C LM CHURCHMAN BIBLE 3110.0 42.156 1.051 1.448 3.337 4.797 2 0.855
4124 6 C LM CHURCHMAN BIBLE 3110.5 42.094 1.057 2.736 4.051 4.500 2 0.847
4125 5 C LM CHURCHMAN BIBLE 3111.0 49.719 1.060 3.092 5.893 3.830 2 0.839
4126 5 C LM CHURCHMAN BIBLE 3111.5 46.219 1.062 3.018 6.503 3.434 2 0.831
4127 6 C LM CHURCHMAN BIBLE 3112.0 42.313 1.050 2.245 5.958 3.318 2 0.823
4128 6 C LM CHURCHMAN BIBLE 3112.5 36.031 1.028 1.193 5.936 3.393 2 0.815
4129 6 C LM CHURCHMAN BIBLE 3113.0 32.594 1.014 0.662 5.978 3.422 2 0.806
4130 6 C LM CHURCHMAN BIBLE 3113.5 37.094 1.005 0.377 6.605 3.697 2 0.798
4131 5 C LM CHURCHMAN BIBLE 3114.0 40.031 1.027 0.615 6.270 4.035 2 0.790
4132 5 C LM CHURCHMAN BIBLE 3114.5 42.500 1.057 0.672 5.871 4.422 2 0.782
4133 6 C LM CHURCHMAN BIBLE 3115.0 39.719 1.087 0.648 4.479 4.203 2 0.774
4134 6 C LM CHURCHMAN BIBLE 3115.5 38.844 1.109 1.025 2.686 3.908 2 0.766
4135 6 C LM CHURCHMAN BIBLE 3116.0 41.719 1.107 0.659 2.320 3.943 2 0.758
4136 5 C LM CHURCHMAN BIBLE 3116.5 44.750 1.085 1.165 2.937 4.020 2 0.750
4137 5 C LM CHURCHMAN BIBLE 3117.0 46.469 1.070 1.872 5.013 4.156 2 0.742
4138 5 C LM CHURCHMAN BIBLE 3117.5 51.000 1.061 3.760 6.445 3.828 2 0.734
4139 5 C LM CHURCHMAN BIBLE 3118.0 55.563 1.052 4.296 7.325 3.805 2 0.726
4140 5 C LM CHURCHMAN BIBLE 3118.5 58.313 1.034 3.863 7.465 3.584 2 0.718
4141 5 C LM CHURCHMAN BIBLE 3119.0 55.344 1.003 2.225 7.541 3.645 2 0.710
4142 5 C LM CHURCHMAN BIBLE 3119.5 53.313 0.972 1.640 7.295 3.629 2 0.702
4143 5 C LM CHURCHMAN BIBLE 3120.0 49.594 0.954 1.494 7.149 3.727 2 0.694
4144 5 C LM CHURCHMAN BIBLE 3120.5 46.719 0.947 1.828 7.254 3.617 2 0.685
4145 5 C LM CHURCHMAN BIBLE 3121.0 44.563 0.953 2.241 8.013 3.344 2 0.677
4146 5 C LM CHURCHMAN BIBLE 3121.5 49.719 0.964 2.925 8.013 3.190 2 0.669
4147 5 C LM CHURCHMAN BIBLE 3122.0 51.469 0.965 3.083 7.708 3.152 2 0.661
4148 5 C LM CHURCHMAN BIBLE 3122.5 50.031 0.970 2.609 6.668 3.295 2 0.653

4149 rows × 11 columns


In [11]:
training_data['Well Name'] = training_data['Well Name'].astype('category')
training_data['Formation'] = training_data['Formation'].astype('category')
training_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4149 entries, 0 to 4148
Data columns (total 11 columns):
Facies       4149 non-null int64
Formation    4149 non-null category
Well Name    4149 non-null category
Depth        4149 non-null float64
GR           4149 non-null float64
ILD_log10    4149 non-null float64
DeltaPHI     4149 non-null float64
PHIND        4149 non-null float64
PE           3232 non-null float64
NM_M         4149 non-null int64
RELPOS       4149 non-null float64
dtypes: category(2), float64(7), int64(2)
memory usage: 300.1 KB
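The summary shows that PE has 3232 non-null values out of 4149, so 917 rows lack a PE reading. One way to see which wells are affected is to count the missing values per well. The sketch below uses a small made-up frame so that it runs on its own: the well names `WELL_A` and `WELL_B` and all values are hypothetical. On the real data the same `groupby` call would be applied to `training_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for training_data.
toy = pd.DataFrame({
    'Well Name': ['WELL_A', 'WELL_A', 'WELL_B', 'WELL_B', 'WELL_B'],
    'PE': [4.6, 4.1, np.nan, np.nan, 3.5],
})

# Number of rows with a missing PE value in each well.
missing_pe = toy['PE'].isnull().groupby(toy['Well Name']).sum()
print(missing_pe)
```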

In [5]:
facies_colors = ['#F4D03F', '#F5B041','#DC7633','#6E2C00','#1B4F72',
                 '#2E86C1', '#AED6F1', '#A569BD', '#196F3D']

facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS','WS', 'D','PS', 'BS']

facies_counts = training_data['Facies'].value_counts().sort_index()
facies_counts.index = facies_labels
facies_counts.plot(kind='bar',color=facies_colors,title='Distribution of Training Data by Facies')


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x115e3ccc0>

In [6]:
sns.heatmap(training_data.corr(), vmax=1.0, square=True)


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x115e3a4e0>

In [7]:
training_data.describe()


/Users/littleni/anaconda/lib/python3.5/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)
Out[7]:
Facies Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS
count 4149.000000 4149.000000 4149.000000 4149.000000 4149.000000 4149.000000 3232.000000 4149.000000 4149.000000
mean 4.503254 2906.867438 64.933985 0.659566 4.402484 13.201066 3.725014 1.518438 0.521852
std 2.474324 133.300164 30.302530 0.252703 5.274947 7.132846 0.896152 0.499720 0.286644
min 1.000000 2573.500000 10.149000 -0.025949 -21.832000 0.550000 0.200000 1.000000 0.000000
25% 2.000000 2821.500000 44.730000 0.498000 1.600000 8.500000 NaN 1.000000 0.277000
50% 4.000000 2932.500000 64.990000 0.639000 4.300000 12.020000 NaN 2.000000 0.528000
75% 6.000000 3007.000000 79.438000 0.822000 7.500000 16.050000 NaN 2.000000 0.769000
max 9.000000 3138.000000 361.150000 1.800000 19.312000 84.400000 8.094000 2.000000 1.000000

Data Preparation and Model Selection

Now we are ready to test the XGB approach. Along the way, confusion_matrix and f1_score are imported as classification metrics, together with GridSearchCV, which is an excellent tool for parameter optimization.


In [17]:
import xgboost as xgb
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score
from classification_utilities import display_cm, display_adj_cm
from sklearn.model_selection import GridSearchCV

In [12]:
X_train = training_data.drop(['Facies', 'Well Name','Formation','Depth'], axis = 1 ) 
Y_train = training_data['Facies' ] - 1
dtrain = xgb.DMatrix(X_train, Y_train)

The accuracy and accuracy_adjacent functions are defined below to quantify the prediction correctness.


In [13]:
def accuracy(conf):
    total_correct = 0.
    nb_classes = conf.shape[0]
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
    acc = total_correct/sum(sum(conf))
    return acc

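# Zero-based indices of the classes adjacent to each facies. The rows have different
# lengths, so dtype=object is needed (recent NumPy versions reject ragged input otherwise).
adjacent_facies = np.array([[1], [0,2], [1], [4], [3,5], [4,6,7], [5,7], [5,6,8], [6,7]], dtype=object)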

def accuracy_adjacent(conf, adjacent_facies):
    nb_classes = conf.shape[0]
    total_correct = 0.
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
        for j in adjacent_facies[i]:
            total_correct += conf[i][j]
    return total_correct / sum(sum(conf))
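To make the two metrics concrete, here is a small worked sketch on a made-up 3-class confusion matrix. It uses vectorized equivalents of the functions above, renamed `accuracy_vec` and `accuracy_adjacent_vec` so they do not shadow the originals. The matrix `toy_conf` and the adjacency list `toy_adjacent` are invented for illustration and are not taken from the facies data.

```python
import numpy as np

def accuracy_vec(conf):
    # Correct predictions lie on the diagonal of the confusion matrix.
    return np.trace(conf) / conf.sum()

def accuracy_adjacent_vec(conf, adjacent_facies):
    # Count a prediction as correct if it hits the true class or one of its neighbors.
    total_correct = np.trace(conf)
    for i, neighbors in enumerate(adjacent_facies):
        total_correct += conf[i, neighbors].sum()
    return total_correct / conf.sum()

# Hypothetical confusion matrix: rows are true classes, columns are predictions.
toy_conf = np.array([[8, 2, 0],
                     [1, 7, 2],
                     [3, 0, 7]])
toy_adjacent = [[1], [0, 2], [1]]  # class 0 and class 2 are not neighbors

print(accuracy_vec(toy_conf))                         # 22 / 30
print(accuracy_adjacent_vec(toy_conf, toy_adjacent))  # 27 / 30
```

The three true-class-2 samples predicted as class 0 stay wrong under both metrics, because classes 0 and 2 are not neighbors in this toy setup.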

Initial model


In [10]:
# Proposed Initial Model
xgb1 = xgb.XGBClassifier( learning_rate =0.1, n_estimators=200, max_depth=5,
                          min_child_weight=1, gamma=0, subsample=0.6,
                          colsample_bytree=0.6, reg_alpha=0, reg_lambda=1, objective='multi:softmax',
                          nthread=4, scale_pos_weight=1, seed=100)


#Fit the algorithm on the data
xgb1.fit(X_train, Y_train,eval_metric='merror')

#Predict training set:
predictions = xgb1.predict(X_train)
        
#Print model report

# Confusion Matrix
conf = confusion_matrix(Y_train, predictions)

# Print Results
print ("\nModel Report")
print ("-Accuracy: %.6f" % ( accuracy(conf) ))
print ("-Adjacent Accuracy: %.6f" % ( accuracy_adjacent(conf, adjacent_facies) ))

print ("\nConfusion Matrix")
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)

# Print Feature Importance
feat_imp = pd.Series(xgb1.booster().get_fscore()).sort_values(ascending=False)
feat_imp.plot(kind='bar', title='Feature Importances')
plt.ylabel('Feature Importance Score')


Model Report
-Accuracy: 0.970354
-Adjacent Accuracy: 0.993492

Confusion Matrix
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   259     5     4                                       268
     CSiS     1   919    20                                       940
     FSiS     1    34   745                                       780
     SiSh                     268           2           1         271
       MS           1           1   285     5           4         296
       WS                 1     1     4   566          10         582
        D                                   1   137           3   141
       PS                 1     4     2    15         664         686
       BS                       1           1               183   185

Precision  0.99  0.96  0.97  0.97  0.98  0.96  1.00  0.98  0.98  0.97
   Recall  0.97  0.98  0.96  0.99  0.96  0.97  0.97  0.97  0.99  0.97
       F1  0.98  0.97  0.96  0.98  0.97  0.97  0.99  0.97  0.99  0.97
Out[10]:
<matplotlib.text.Text at 0xbbc4a90>

In [11]:
# Cross Validation parameters
cv_folds = 10
rounds = 100

xgb_param_1 = xgb1.get_xgb_params()
xgb_param_1['num_class'] = 9

# Perform cross-validation
cvresult1 = xgb.cv(xgb_param_1, dtrain, num_boost_round=xgb_param_1['n_estimators'], 
                  stratified = True, nfold=cv_folds, metrics='merror', early_stopping_rounds=rounds)

print ("\nCross Validation Training Report Summary")
print (cvresult1.head())
print (cvresult1.tail())


Cross Validation Training Report Summary
   test-merror-mean  test-merror-std  train-merror-mean  train-merror-std
0          0.463624         0.034581           0.419595          0.019798
1          0.433773         0.028935           0.372004          0.014199
2          0.408699         0.026354           0.350609          0.007946
3          0.404589         0.026290           0.339788          0.007658
4          0.398107         0.024423           0.331486          0.007193
     test-merror-mean  test-merror-std  train-merror-mean  train-merror-std
195          0.292358         0.021023           0.023353          0.000796
196          0.290915         0.021367           0.022790          0.000619
197          0.291154         0.020785           0.022522          0.000776
198          0.291633         0.021096           0.022281          0.000906
199          0.290673         0.019750           0.021612          0.001124
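The gap between the training error (about 0.02) and the cross-validated error (about 0.29) in the last rounds suggests overfitting. Because the result of `xgb.cv` is printed here as a pandas DataFrame with one row per boosting round, the round with the lowest held-out error can be read off directly. The sketch below rebuilds a few rows by hand from the report above to show the idea. The frame `cv_excerpt` is a stand-in for `cvresult1`.

```python
import pandas as pd

# A few rows copied from the report above; the index is the boosting round.
cv_excerpt = pd.DataFrame(
    {'test-merror-mean': [0.463624, 0.398107, 0.292358, 0.290915, 0.290673],
     'train-merror-mean': [0.419595, 0.331486, 0.023353, 0.022790, 0.021612]},
    index=[0, 4, 195, 196, 199])

best_round = cv_excerpt['test-merror-mean'].idxmin()
gap = cv_excerpt['test-merror-mean'] - cv_excerpt['train-merror-mean']

print(best_round)           # round with the lowest cross-validated error
print(gap.loc[best_round])  # train/test gap at that round
```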

The typical range for the learning rate is around 0.01 to 0.2, so we vary the learning rate a bit and, at the same time, scan over the number of boosted trees to fit. This will take a little while to finish.


In [12]:
print("Parameter optimization")
grid_search1 = GridSearchCV(xgb1,{'learning_rate':[0.05,0.01,0.1,0.2] , 'n_estimators':[200,400,600,800]},
                                   scoring='accuracy' , n_jobs = 4)
grid_search1.fit(X_train,Y_train)
print("Best Set of Parameters")
grid_search1.grid_scores_, grid_search1.best_params_, grid_search1.best_score_


Parameter optimization
Best Set of Parameters
C:\Users\chenzhan\AppData\Local\Continuum\Anaconda64\lib\site-packages\sklearn\model_selection\_search.py:667: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
Out[12]:
([mean: 0.54616, std: 0.03023, params: {'n_estimators': 200, 'learning_rate': 0.05},
  mean: 0.53893, std: 0.02403, params: {'n_estimators': 400, 'learning_rate': 0.05},
  mean: 0.53651, std: 0.02372, params: {'n_estimators': 600, 'learning_rate': 0.05},
  mean: 0.53169, std: 0.02483, params: {'n_estimators': 800, 'learning_rate': 0.05},
  mean: 0.55363, std: 0.02880, params: {'n_estimators': 200, 'learning_rate': 0.01},
  mean: 0.55604, std: 0.02784, params: {'n_estimators': 400, 'learning_rate': 0.01},
  mean: 0.55411, std: 0.02605, params: {'n_estimators': 600, 'learning_rate': 0.01},
  mean: 0.54832, std: 0.02556, params: {'n_estimators': 800, 'learning_rate': 0.01},
  mean: 0.53989, std: 0.02591, params: {'n_estimators': 200, 'learning_rate': 0.1},
  mean: 0.53507, std: 0.02213, params: {'n_estimators': 400, 'learning_rate': 0.1},
  mean: 0.52711, std: 0.02248, params: {'n_estimators': 600, 'learning_rate': 0.1},
  mean: 0.52663, std: 0.02164, params: {'n_estimators': 800, 'learning_rate': 0.1},
  mean: 0.52398, std: 0.02532, params: {'n_estimators': 200, 'learning_rate': 0.2},
  mean: 0.52615, std: 0.02738, params: {'n_estimators': 400, 'learning_rate': 0.2},
  mean: 0.52061, std: 0.02497, params: {'n_estimators': 600, 'learning_rate': 0.2},
  mean: 0.51747, std: 0.02464, params: {'n_estimators': 800, 'learning_rate': 0.2}],
 {'learning_rate': 0.01, 'n_estimators': 400},
 0.5560375994215474)
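The warning above notes that `grid_scores_` is deprecated in favor of `cv_results_`. A minimal sketch of the replacement follows. To keep it quick and self-contained, it tunes a scikit-learn `DecisionTreeClassifier` on the bundled iris data rather than the XGBoost model, so the estimator, data and parameter grid here are stand-ins and not the ones used in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {'max_depth': [1, 2, 3]},
                      scoring='accuracy', cv=3)
search.fit(X, y)

# cv_results_ is a dict of arrays; a DataFrame makes it easy to rank the candidates.
results = pd.DataFrame(search.cv_results_)
summary = results[['params', 'mean_test_score', 'std_test_score']]
print(summary.sort_values('mean_test_score', ascending=False))
print(search.best_params_, search.best_score_)
```

The `mean_test_score` and `std_test_score` columns carry the same information as the `mean` and `std` entries printed by `grid_scores_`.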

The grid search favors a smaller learning rate (0.01), which should also help to reduce overfitting. The number of boosted trees to fit is updated accordingly, to 400.


In [13]:
# Proposed Model with optimized learning rate and number of boosted trees to fit
xgb2 = xgb.XGBClassifier( learning_rate =0.01, n_estimators=400, max_depth=5,
                          min_child_weight=1, gamma=0, subsample=0.6,
                          colsample_bytree=0.6, reg_alpha=0, reg_lambda=1, objective='multi:softmax',
                          nthread=4, scale_pos_weight=1, seed=100)

#Fit the algorithm on the data
xgb2.fit(X_train, Y_train,eval_metric='merror')

#Predict training set:
predictions = xgb2.predict(X_train)
        
#Print model report

# Confusion Matrix
conf = confusion_matrix(Y_train, predictions )

# Print Results
print ("\nModel Report")
print ("-Accuracy: %.6f" % ( accuracy(conf) ))
print ("-Adjacent Accuracy: %.6f" % ( accuracy_adjacent(conf, adjacent_facies) ))

# Confusion Matrix
print ("\nConfusion Matrix")
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)

# Print Feature Importance
feat_imp = pd.Series(xgb2.booster().get_fscore()).sort_values(ascending=False)
feat_imp.plot(kind='bar', title='Feature Importances')
plt.ylabel('Feature Importance Score')


Model Report
-Accuracy: 0.779706
-Adjacent Accuracy: 0.952519

Confusion Matrix
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   166    89    13                                       268
     CSiS    18   803   116                 2           1         940
     FSiS         139   635                 1           5         780
     SiSh                 6   224     1    27     2    11         271
       MS           6     5    17   151    75     2    40         296
       WS     1           2    30    14   432     7    93     3   582
        D                 1     3           5   106    25     1   141
       PS           1     6    16     4    82     9   566     2   686
       BS                                   8     2    23   152   185

Precision  0.90  0.77  0.81  0.77  0.89  0.68  0.83  0.74  0.96  0.79
   Recall  0.62  0.85  0.81  0.83  0.51  0.74  0.75  0.83  0.82  0.78
       F1  0.73  0.81  0.81  0.80  0.65  0.71  0.79  0.78  0.89  0.78
Out[13]:
<matplotlib.text.Text at 0xbc80eb8>

In [14]:
# Cross Validation parameters
cv_folds = 10
rounds = 100

xgb_param_2 = xgb2.get_xgb_params()
xgb_param_2['num_class'] = 9

# Perform cross-validation
cvresult2 = xgb.cv(xgb_param_2, dtrain, num_boost_round=xgb_param_2['n_estimators'], 
                  stratified = True, nfold=cv_folds, metrics='merror', early_stopping_rounds=rounds)

print ("\nCross Validation Training Report Summary")
print (cvresult2.head())
print (cvresult2.tail())


Cross Validation Training Report Summary
   test-merror-mean  test-merror-std  train-merror-mean  train-merror-std
0          0.463624         0.034581           0.419595          0.019798
1          0.435210         0.031384           0.375298          0.014082
2          0.420986         0.024074           0.356848          0.010152
3          0.416908         0.024509           0.351465          0.008039
4          0.403438         0.015630           0.345599          0.005708
     test-merror-mean  test-merror-std  train-merror-mean  train-merror-std
395          0.336699         0.025564           0.214028          0.002945
396          0.337423         0.025555           0.213947          0.002970
397          0.336940         0.025623           0.213760          0.002869
398          0.336223         0.026504           0.213385          0.002796
399          0.335978         0.025790           0.213305          0.002611
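The next search varies `reg_alpha` and `reg_lambda`. As a reminder of what these control, the XGBoost documentation describes them as the L1 and L2 regularization terms on the leaf weights. With T leaves in a tree and leaf weights w_j, the regularized objective can be sketched as follows. This is our summary of the standard formulation, not something derived in this notebook.

```latex
\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|
```

Here γ corresponds to the `gamma` parameter, λ to `reg_lambda`, and α to `reg_alpha`. Larger values penalize complex trees more strongly.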

In [15]:
print("Parameter optimization")
grid_search2 = GridSearchCV(xgb2,{'reg_alpha':[0, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10], 'reg_lambda':[0, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10] },
                                   scoring='accuracy' , n_jobs = 4)
grid_search2.fit(X_train,Y_train)
print("Best Set of Parameters")
grid_search2.grid_scores_, grid_search2.best_params_, grid_search2.best_score_


Parameter optimization
Best Set of Parameters
C:\Users\chenzhan\AppData\Local\Continuum\Anaconda64\lib\site-packages\sklearn\model_selection\_search.py:667: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
Out[15]:
([mean: 0.55363, std: 0.02560, params: {'reg_alpha': 0, 'reg_lambda': 0},
  mean: 0.55483, std: 0.02838, params: {'reg_alpha': 0, 'reg_lambda': 0.05},
  mean: 0.55483, std: 0.02776, params: {'reg_alpha': 0, 'reg_lambda': 0.1},
  mean: 0.55459, std: 0.02749, params: {'reg_alpha': 0, 'reg_lambda': 0.2},
  mean: 0.55483, std: 0.02620, params: {'reg_alpha': 0, 'reg_lambda': 0.5},
  mean: 0.55604, std: 0.02784, params: {'reg_alpha': 0, 'reg_lambda': 1},
  mean: 0.55459, std: 0.02897, params: {'reg_alpha': 0, 'reg_lambda': 2},
  mean: 0.55098, std: 0.02991, params: {'reg_alpha': 0, 'reg_lambda': 5},
  mean: 0.55242, std: 0.03191, params: {'reg_alpha': 0, 'reg_lambda': 10},
  mean: 0.55411, std: 0.02701, params: {'reg_alpha': 0.05, 'reg_lambda': 0},
  mean: 0.55459, std: 0.02749, params: {'reg_alpha': 0.05, 'reg_lambda': 0.05},
  mean: 0.55459, std: 0.02784, params: {'reg_alpha': 0.05, 'reg_lambda': 0.1},
  mean: 0.55580, std: 0.02595, params: {'reg_alpha': 0.05, 'reg_lambda': 0.2},
  mean: 0.55604, std: 0.02640, params: {'reg_alpha': 0.05, 'reg_lambda': 0.5},
  mean: 0.55507, std: 0.02600, params: {'reg_alpha': 0.05, 'reg_lambda': 1},
  mean: 0.55315, std: 0.02898, params: {'reg_alpha': 0.05, 'reg_lambda': 2},
  mean: 0.55146, std: 0.02914, params: {'reg_alpha': 0.05, 'reg_lambda': 5},
  mean: 0.55194, std: 0.03225, params: {'reg_alpha': 0.05, 'reg_lambda': 10},
  mean: 0.55363, std: 0.02756, params: {'reg_alpha': 0.1, 'reg_lambda': 0},
  mean: 0.55435, std: 0.02644, params: {'reg_alpha': 0.1, 'reg_lambda': 0.05},
  mean: 0.55435, std: 0.02750, params: {'reg_alpha': 0.1, 'reg_lambda': 0.1},
  mean: 0.55628, std: 0.02721, params: {'reg_alpha': 0.1, 'reg_lambda': 0.2},
  mean: 0.55652, std: 0.02552, params: {'reg_alpha': 0.1, 'reg_lambda': 0.5},
  mean: 0.55652, std: 0.02734, params: {'reg_alpha': 0.1, 'reg_lambda': 1},
  mean: 0.55435, std: 0.02857, params: {'reg_alpha': 0.1, 'reg_lambda': 2},
  mean: 0.55170, std: 0.02891, params: {'reg_alpha': 0.1, 'reg_lambda': 5},
  mean: 0.55194, std: 0.03269, params: {'reg_alpha': 0.1, 'reg_lambda': 10},
  mean: 0.55483, std: 0.02519, params: {'reg_alpha': 0.2, 'reg_lambda': 0},
  mean: 0.55411, std: 0.02519, params: {'reg_alpha': 0.2, 'reg_lambda': 0.05},
  mean: 0.55411, std: 0.02480, params: {'reg_alpha': 0.2, 'reg_lambda': 0.1},
  mean: 0.55580, std: 0.02591, params: {'reg_alpha': 0.2, 'reg_lambda': 0.2},
  mean: 0.55435, std: 0.02634, params: {'reg_alpha': 0.2, 'reg_lambda': 0.5},
  mean: 0.55194, std: 0.02746, params: {'reg_alpha': 0.2, 'reg_lambda': 1},
  mean: 0.55411, std: 0.02770, params: {'reg_alpha': 0.2, 'reg_lambda': 2},
  mean: 0.55266, std: 0.03008, params: {'reg_alpha': 0.2, 'reg_lambda': 5},
  mean: 0.55194, std: 0.03360, params: {'reg_alpha': 0.2, 'reg_lambda': 10},
  mean: 0.55459, std: 0.02602, params: {'reg_alpha': 0.5, 'reg_lambda': 0},
  mean: 0.55507, std: 0.02602, params: {'reg_alpha': 0.5, 'reg_lambda': 0.05},
  mean: 0.55652, std: 0.02633, params: {'reg_alpha': 0.5, 'reg_lambda': 0.1},
  mean: 0.55507, std: 0.02602, params: {'reg_alpha': 0.5, 'reg_lambda': 0.2},
  mean: 0.55290, std: 0.02814, params: {'reg_alpha': 0.5, 'reg_lambda': 0.5},
  mean: 0.55242, std: 0.02823, params: {'reg_alpha': 0.5, 'reg_lambda': 1},
  mean: 0.55146, std: 0.02872, params: {'reg_alpha': 0.5, 'reg_lambda': 2},
  mean: 0.55242, std: 0.03230, params: {'reg_alpha': 0.5, 'reg_lambda': 5},
  mean: 0.55098, std: 0.03272, params: {'reg_alpha': 0.5, 'reg_lambda': 10},
  mean: 0.55387, std: 0.02924, params: {'reg_alpha': 1, 'reg_lambda': 0},
  mean: 0.55266, std: 0.02893, params: {'reg_alpha': 1, 'reg_lambda': 0.05},
  mean: 0.55266, std: 0.02893, params: {'reg_alpha': 1, 'reg_lambda': 0.1},
  mean: 0.55363, std: 0.03067, params: {'reg_alpha': 1, 'reg_lambda': 0.2},
  mean: 0.55339, std: 0.03040, params: {'reg_alpha': 1, 'reg_lambda': 0.5},
  mean: 0.55387, std: 0.02969, params: {'reg_alpha': 1, 'reg_lambda': 1},
  mean: 0.55170, std: 0.02859, params: {'reg_alpha': 1, 'reg_lambda': 2},
  mean: 0.55290, std: 0.03314, params: {'reg_alpha': 1, 'reg_lambda': 5},
  mean: 0.54929, std: 0.03083, params: {'reg_alpha': 1, 'reg_lambda': 10},
  mean: 0.55025, std: 0.03203, params: {'reg_alpha': 2, 'reg_lambda': 0},
  mean: 0.55146, std: 0.03014, params: {'reg_alpha': 2, 'reg_lambda': 0.05},
  mean: 0.55290, std: 0.03070, params: {'reg_alpha': 2, 'reg_lambda': 0.1},
  mean: 0.55218, std: 0.03158, params: {'reg_alpha': 2, 'reg_lambda': 0.2},
  mean: 0.55194, std: 0.03249, params: {'reg_alpha': 2, 'reg_lambda': 0.5},
  mean: 0.55290, std: 0.03226, params: {'reg_alpha': 2, 'reg_lambda': 1},
  mean: 0.55266, std: 0.03405, params: {'reg_alpha': 2, 'reg_lambda': 2},
  mean: 0.55242, std: 0.03318, params: {'reg_alpha': 2, 'reg_lambda': 5},
  mean: 0.54881, std: 0.02862, params: {'reg_alpha': 2, 'reg_lambda': 10},
  mean: 0.55146, std: 0.03166, params: {'reg_alpha': 5, 'reg_lambda': 0},
  mean: 0.55122, std: 0.03076, params: {'reg_alpha': 5, 'reg_lambda': 0.05},
  mean: 0.55242, std: 0.03043, params: {'reg_alpha': 5, 'reg_lambda': 0.1},
  mean: 0.55074, std: 0.03090, params: {'reg_alpha': 5, 'reg_lambda': 0.2},
  mean: 0.55194, std: 0.03018, params: {'reg_alpha': 5, 'reg_lambda': 0.5},
  mean: 0.55194, std: 0.03115, params: {'reg_alpha': 5, 'reg_lambda': 1},
  mean: 0.55122, std: 0.02885, params: {'reg_alpha': 5, 'reg_lambda': 2},
  mean: 0.55387, std: 0.02835, params: {'reg_alpha': 5, 'reg_lambda': 5},
  mean: 0.55459, std: 0.02933, params: {'reg_alpha': 5, 'reg_lambda': 10},
  mean: 0.55459, std: 0.02804, params: {'reg_alpha': 10, 'reg_lambda': 0},
  mean: 0.55483, std: 0.02781, params: {'reg_alpha': 10, 'reg_lambda': 0.05},
  mean: 0.55435, std: 0.02801, params: {'reg_alpha': 10, 'reg_lambda': 0.1},
  mean: 0.55411, std: 0.02824, params: {'reg_alpha': 10, 'reg_lambda': 0.2},
  mean: 0.55411, std: 0.02795, params: {'reg_alpha': 10, 'reg_lambda': 0.5},
  mean: 0.55411, std: 0.02852, params: {'reg_alpha': 10, 'reg_lambda': 1},
  mean: 0.55411, std: 0.02999, params: {'reg_alpha': 10, 'reg_lambda': 2},
  mean: 0.55435, std: 0.02819, params: {'reg_alpha': 10, 'reg_lambda': 5},
  mean: 0.55387, std: 0.02639, params: {'reg_alpha': 10, 'reg_lambda': 10}],
 {'reg_alpha': 0.1, 'reg_lambda': 0.5},
 0.55651964328753911)

In [16]:
# Proposed Model with optimized regularization 
xgb3 = xgb.XGBClassifier( learning_rate =0.01, n_estimators=400, max_depth=5,
                          min_child_weight=1, gamma=0, subsample=0.6,
                          colsample_bytree=0.6, reg_alpha=0.1, reg_lambda=0.5, objective='multi:softmax',
                          nthread=4, scale_pos_weight=1, seed=100)

#Fit the algorithm on the data
xgb3.fit(X_train, Y_train,eval_metric='merror')

#Predict training set:
predictions = xgb3.predict(X_train)
        
#Print model report

# Confusion Matrix
conf = confusion_matrix(Y_train, predictions )

# Print Results
print ("\nModel Report")
print ("-Accuracy: %.6f" % ( accuracy(conf) ))
print ("-Adjacent Accuracy: %.6f" % ( accuracy_adjacent(conf, adjacent_facies) ))

# Confusion Matrix
print ("\nConfusion Matrix")
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)

# Print Feature Importance
feat_imp = pd.Series(xgb3.booster().get_fscore()).sort_values(ascending=False)
feat_imp.plot(kind='bar', title='Feature Importances')
plt.ylabel('Feature Importance Score')


Model Report
-Accuracy: 0.784285
-Adjacent Accuracy: 0.953242

Confusion Matrix
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   167    89    12                                       268
     CSiS    17   808   112                 2           1         940
     FSiS         139   636                 2           3         780
     SiSh                 6   225     1    26     2    11         271
       MS           6     5    17   152    74     2    40         296
       WS     1           2    29    12   440     7    88     3   582
        D                 1     3           5   106    25     1   141
       PS           1     6    16     5    82     9   566     1   686
       BS                                   8     2    21   154   185

Precision  0.90  0.77  0.82  0.78  0.89  0.69  0.83  0.75  0.97  0.79
   Recall  0.62  0.86  0.82  0.83  0.51  0.76  0.75  0.83  0.83  0.78
       F1  0.74  0.81  0.82  0.80  0.65  0.72  0.79  0.79  0.90  0.78
Out[16]:
<matplotlib.text.Text at 0xbccacf8>

In [17]:
print("Parameter optimization")
grid_search3 = GridSearchCV(xgb3,{'max_depth':[2, 5, 8], 'gamma':[0, 1], 'subsample':[0.4, 0.6, 0.8],'colsample_bytree':[0.4, 0.6, 0.8] },
                                   scoring='accuracy' , n_jobs = 4)
grid_search3.fit(X_train,Y_train)
print("Best Set of Parameters")
grid_search3.grid_scores_, grid_search3.best_params_, grid_search3.best_score_


Parameter optimization
Best Set of Parameters
C:\Users\chenzhan\AppData\Local\Continuum\Anaconda64\lib\site-packages\sklearn\model_selection\_search.py:667: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
Out[17]:
([mean: 0.55146, std: 0.02078, params: {'max_depth': 2, 'subsample': 0.4, 'gamma': 0, 'colsample_bytree': 0.4},
  mean: 0.55170, std: 0.01781, params: {'colsample_bytree': 0.4, 'max_depth': 2, 'subsample': 0.6, 'gamma': 0},
  mean: 0.54977, std: 0.01751, params: {'colsample_bytree': 0.4, 'max_depth': 2, 'subsample': 0.8, 'gamma': 0},
  mean: 0.53531, std: 0.01736, params: {'max_depth': 5, 'subsample': 0.4, 'gamma': 0, 'colsample_bytree': 0.4},
  mean: 0.54182, std: 0.01044, params: {'subsample': 0.6, 'max_depth': 5, 'gamma': 0, 'colsample_bytree': 0.4},
  mean: 0.53917, std: 0.01401, params: {'colsample_bytree': 0.4, 'max_depth': 5, 'subsample': 0.8, 'gamma': 0},
  mean: 0.52953, std: 0.01619, params: {'max_depth': 8, 'subsample': 0.4, 'gamma': 0, 'colsample_bytree': 0.4},
  mean: 0.52832, std: 0.01334, params: {'max_depth': 8, 'subsample': 0.6, 'gamma': 0, 'colsample_bytree': 0.4},
  mean: 0.53025, std: 0.01603, params: {'colsample_bytree': 0.4, 'max_depth': 8, 'subsample': 0.8, 'gamma': 0},
  mean: 0.55098, std: 0.02163, params: {'max_depth': 2, 'subsample': 0.4, 'gamma': 1, 'colsample_bytree': 0.4},
  mean: 0.55122, std: 0.01714, params: {'colsample_bytree': 0.4, 'max_depth': 2, 'subsample': 0.6, 'gamma': 1},
  mean: 0.54905, std: 0.01806, params: {'subsample': 0.8, 'max_depth': 2, 'gamma': 1, 'colsample_bytree': 0.4},
  mean: 0.53651, std: 0.01349, params: {'max_depth': 5, 'subsample': 0.4, 'gamma': 1, 'colsample_bytree': 0.4},
  mean: 0.54037, std: 0.01404, params: {'colsample_bytree': 0.4, 'max_depth': 5, 'subsample': 0.6, 'gamma': 1},
  mean: 0.53844, std: 0.01161, params: {'subsample': 0.8, 'max_depth': 5, 'gamma': 1, 'colsample_bytree': 0.4},
  mean: 0.53218, std: 0.01724, params: {'max_depth': 8, 'subsample': 0.4, 'gamma': 1, 'colsample_bytree': 0.4},
  mean: 0.53410, std: 0.01634, params: {'colsample_bytree': 0.4, 'max_depth': 8, 'subsample': 0.6, 'gamma': 1},
  mean: 0.53266, std: 0.00957, params: {'max_depth': 8, 'subsample': 0.8, 'gamma': 1, 'colsample_bytree': 0.4},
  mean: 0.55170, std: 0.02279, params: {'max_depth': 2, 'subsample': 0.4, 'gamma': 0, 'colsample_bytree': 0.6},
  mean: 0.55146, std: 0.01988, params: {'max_depth': 2, 'subsample': 0.6, 'gamma': 0, 'colsample_bytree': 0.6},
  mean: 0.54832, std: 0.01790, params: {'max_depth': 2, 'subsample': 0.8, 'gamma': 0, 'colsample_bytree': 0.6},
  mean: 0.55266, std: 0.03049, params: {'max_depth': 5, 'subsample': 0.4, 'gamma': 0, 'colsample_bytree': 0.6},
  mean: 0.55652, std: 0.02552, params: {'max_depth': 5, 'subsample': 0.6, 'gamma': 0, 'colsample_bytree': 0.6},
  mean: 0.54905, std: 0.02224, params: {'max_depth': 5, 'subsample': 0.8, 'gamma': 0, 'colsample_bytree': 0.6},
  mean: 0.54375, std: 0.02369, params: {'subsample': 0.4, 'max_depth': 8, 'gamma': 0, 'colsample_bytree': 0.6},
  mean: 0.54664, std: 0.02236, params: {'max_depth': 8, 'subsample': 0.6, 'gamma': 0, 'colsample_bytree': 0.6},
  mean: 0.54640, std: 0.02479, params: {'colsample_bytree': 0.6, 'max_depth': 8, 'subsample': 0.8, 'gamma': 0},
  mean: 0.55074, std: 0.02137, params: {'subsample': 0.4, 'max_depth': 2, 'gamma': 1, 'colsample_bytree': 0.6},
  mean: 0.55194, std: 0.02152, params: {'subsample': 0.6, 'max_depth': 2, 'gamma': 1, 'colsample_bytree': 0.6},
  mean: 0.54881, std: 0.01910, params: {'max_depth': 2, 'subsample': 0.8, 'gamma': 1, 'colsample_bytree': 0.6},
  mean: 0.54857, std: 0.03170, params: {'colsample_bytree': 0.6, 'max_depth': 5, 'subsample': 0.4, 'gamma': 1},
  mean: 0.55483, std: 0.02627, params: {'subsample': 0.6, 'max_depth': 5, 'gamma': 1, 'colsample_bytree': 0.6},
  mean: 0.55194, std: 0.02645, params: {'max_depth': 5, 'subsample': 0.8, 'gamma': 1, 'colsample_bytree': 0.6},
  mean: 0.54881, std: 0.02548, params: {'colsample_bytree': 0.6, 'max_depth': 8, 'subsample': 0.4, 'gamma': 1},
  mean: 0.54760, std: 0.02325, params: {'subsample': 0.6, 'max_depth': 8, 'gamma': 1, 'colsample_bytree': 0.6},
  mean: 0.54977, std: 0.01933, params: {'max_depth': 8, 'subsample': 0.8, 'gamma': 1, 'colsample_bytree': 0.6},
  mean: 0.55363, std: 0.01914, params: {'max_depth': 2, 'subsample': 0.4, 'gamma': 0, 'colsample_bytree': 0.8},
  mean: 0.55074, std: 0.01928, params: {'colsample_bytree': 0.8, 'max_depth': 2, 'subsample': 0.6, 'gamma': 0},
  mean: 0.55122, std: 0.01444, params: {'max_depth': 2, 'subsample': 0.8, 'gamma': 0, 'colsample_bytree': 0.8},
  mean: 0.54712, std: 0.02818, params: {'max_depth': 5, 'subsample': 0.4, 'gamma': 0, 'colsample_bytree': 0.8},
  mean: 0.55049, std: 0.03193, params: {'max_depth': 5, 'subsample': 0.6, 'gamma': 0, 'colsample_bytree': 0.8},
  mean: 0.54640, std: 0.02575, params: {'max_depth': 5, 'subsample': 0.8, 'gamma': 0, 'colsample_bytree': 0.8},
  mean: 0.54447, std: 0.03035, params: {'max_depth': 8, 'subsample': 0.4, 'gamma': 0, 'colsample_bytree': 0.8},
  mean: 0.54784, std: 0.03084, params: {'colsample_bytree': 0.8, 'max_depth': 8, 'subsample': 0.6, 'gamma': 0},
  mean: 0.54543, std: 0.02831, params: {'max_depth': 8, 'subsample': 0.8, 'gamma': 0, 'colsample_bytree': 0.8},
  mean: 0.55218, std: 0.01772, params: {'max_depth': 2, 'subsample': 0.4, 'gamma': 1, 'colsample_bytree': 0.8},
  mean: 0.55001, std: 0.01803, params: {'max_depth': 2, 'subsample': 0.6, 'gamma': 1, 'colsample_bytree': 0.8},
  mean: 0.55146, std: 0.01476, params: {'max_depth': 2, 'subsample': 0.8, 'gamma': 1, 'colsample_bytree': 0.8},
  mean: 0.54736, std: 0.02696, params: {'colsample_bytree': 0.8, 'max_depth': 5, 'subsample': 0.4, 'gamma': 1},
  mean: 0.55170, std: 0.03230, params: {'subsample': 0.6, 'max_depth': 5, 'gamma': 1, 'colsample_bytree': 0.8},
  mean: 0.54423, std: 0.02674, params: {'max_depth': 5, 'subsample': 0.8, 'gamma': 1, 'colsample_bytree': 0.8},
  mean: 0.54616, std: 0.02970, params: {'colsample_bytree': 0.8, 'max_depth': 8, 'subsample': 0.4, 'gamma': 1},
  mean: 0.54832, std: 0.02628, params: {'subsample': 0.6, 'max_depth': 8, 'gamma': 1, 'colsample_bytree': 0.8},
  mean: 0.54688, std: 0.03003, params: {'max_depth': 8, 'subsample': 0.8, 'gamma': 1, 'colsample_bytree': 0.8}],
 {'colsample_bytree': 0.6, 'gamma': 0, 'max_depth': 5, 'subsample': 0.6},
 0.55651964328753911)
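
The deprecation warning above points to cv_results_ as the replacement for grid_scores_. As a minimal sketch, assuming cv_results_ behaves like a dict of parallel lists with the keys 'params', 'mean_test_score' and 'std_test_score', the same ranking can be read as follows. The dict below is a hand-built stand-in holding four rows copied from the output above, not the result of a real run, and rank_results is a hypothetical helper name.

```python
# Hypothetical stand-in for grid_search3.cv_results_: a dict of parallel lists.
# The four entries are copied from the grid search output above.
cv_results = {
    "params": [
        {"colsample_bytree": 0.4, "gamma": 0, "max_depth": 2, "subsample": 0.6},
        {"colsample_bytree": 0.6, "gamma": 0, "max_depth": 5, "subsample": 0.6},
        {"colsample_bytree": 0.6, "gamma": 1, "max_depth": 5, "subsample": 0.6},
        {"colsample_bytree": 0.8, "gamma": 0, "max_depth": 2, "subsample": 0.4},
    ],
    "mean_test_score": [0.55170, 0.55652, 0.55483, 0.55363],
    "std_test_score": [0.01781, 0.02552, 0.02627, 0.01914],
}


def rank_results(results):
    """Return (mean, std, params) tuples sorted by mean score, best first."""
    rows = zip(results["mean_test_score"],
               results["std_test_score"],
               results["params"])
    return sorted(rows, key=lambda row: row[0], reverse=True)


ranked = rank_results(cv_results)
for mean, std, params in ranked:
    print("mean: %.5f, std: %.5f, params: %s" % (mean, std, params))

best_mean, best_std, best_params = ranked[0]
```

Read this way, the gap between the best setting (0.55652) and the runners-up is a few thousandths, while the fold-to-fold standard deviation is around 0.02 to 0.03, so the differences between these settings are within the noise of the cross validation.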

In [18]:
# Load data
filename = '../facies_vectors.csv'
data = pd.read_csv(filename)

# Change to category data type
data['Well Name'] = data['Well Name'].astype('category')
data['Formation'] = data['Formation'].astype('category')

from classification_utilities import display_cm, display_adj_cm
facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS',
                 'WS', 'D', 'PS', 'BS']

# Features and zero-based labels (facies 1..9 become classes 0..8)
X_train = data.drop(['Facies', 'Formation', 'Depth'], axis=1)
Y_train = data['Facies'] - 1

# Leave one well out for cross validation
well_names = data['Well Name'].unique()
f1 = []
for i in range(len(well_names)):

    # Hold out well i for testing; train on the remaining wells
    is_test = X_train['Well Name'] == well_names[i]
    train_X = X_train[~is_test].drop(['Well Name'], axis=1)
    train_Y = Y_train[~is_test]
    test_X = X_train[is_test].drop(['Well Name'], axis=1)
    test_Y = Y_train[is_test]

    # Final recommended model based on the extensive parameter search
    model_final = xgb.XGBClassifier(learning_rate=0.01, n_estimators=400, max_depth=5,
                                    min_child_weight=1, gamma=0, subsample=0.6,
                                    reg_alpha=0.1, reg_lambda=0.5, colsample_bytree=0.6,
                                    objective='multi:softmax', nthread=4,
                                    scale_pos_weight=1, seed=100)

    # Train the model on the training wells
    model_final.fit(train_X, train_Y, eval_metric='merror')

    # Predict on the held-out well
    predictions = model_final.predict(test_X)

    # Compute the metrics once, then print the report
    conf = confusion_matrix(test_Y, predictions, labels=np.arange(9))
    fold_f1 = f1_score(test_Y, predictions, labels=np.arange(9), average='weighted')
    f1.append(fold_f1)
    print("\n------------------------------------------------------")
    print("Validation on the leaving out well " + well_names[i])
    print("\nModel Report")
    print("-Accuracy: %.6f" % (accuracy(conf)))
    print("-Adjacent Accuracy: %.6f" % (accuracy_adjacent(conf, adjacent_facies)))
    print("-F1 Score: %.6f" % fold_f1)
    print("\nConfusion Matrix Results")
    display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)

print("\n------------------------------------------------------")
print("Final Results")
print("-Average F1 Score: %.6f" % (sum(f1) / (1.0 * len(f1))))


------------------------------------------------------
Validation on the leaving out well SHRIMPLIN

Model Report
-Accuracy: 0.607219
-Adjacent Accuracy: 0.959660
-F1 Score: 0.587285

Confusion Matrix Results
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS                                                           0
     CSiS     7    94    17                                       118
     FSiS          52    71                                       123
     SiSh                      13           2           3          18
       MS                       6     6    43           8          63
       WS                       2     2    38     1    18     2    63
        D                                   1     1     3           5
       PS                             2    10     1    52     4    69
       BS                                               1    11    12

Precision  0.00  0.64  0.81  0.62  0.60  0.40  0.33  0.61  0.65  0.64
   Recall  0.00  0.80  0.58  0.72  0.10  0.60  0.20  0.75  0.92  0.61
       F1  0.00  0.71  0.67  0.67  0.16  0.48  0.25  0.68  0.76  0.59
/Users/littleni/anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1115: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no true samples.
  'recall', 'true', average, warn_for)
------------------------------------------------------
Validation on the leaving out well ALEXANDER D

Model Report
-Accuracy: 0.626609
-Adjacent Accuracy: 0.916309
-F1 Score: 0.589447

Confusion Matrix Results
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS                                                           0
     CSiS          85    32                                       117
     FSiS          17    74                                        91
     SiSh                      39                 3     2          44
       MS                       3    11     2          10          26
       WS                      12    19     1     4    33          69
        D                             1           8     7          16
       PS                       7     3     5     8    73     2    98
       BS                                   1     3           1     5

Precision  0.00  0.83  0.70  0.64  0.32  0.11  0.31  0.58  0.33  0.58
   Recall  0.00  0.73  0.81  0.89  0.42  0.01  0.50  0.74  0.20  0.63
       F1  0.00  0.78  0.75  0.74  0.37  0.03  0.38  0.65  0.25  0.59
/Users/littleni/anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
------------------------------------------------------
Validation on the leaving out well SHANKLE

Model Report
-Accuracy: 0.487751
-Adjacent Accuracy: 0.966592
-F1 Score: 0.467278

Confusion Matrix Results
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS     7    81     1                                        89
     CSiS     8    69    12                                        89
     FSiS          55    61                             1         117
     SiSh                       4           1           2           7
       MS                      14           3     1     1          19
       WS                       7     5    46          13          71
        D                             1     2     9     5          17
       PS                                  16     1    23          40
       BS                                                           0

Precision  0.47  0.34  0.82  0.16  0.00  0.68  0.82  0.51  0.00  0.56
   Recall  0.08  0.78  0.52  0.57  0.00  0.65  0.53  0.57  0.00  0.49
       F1  0.13  0.47  0.64  0.25  0.00  0.66  0.64  0.54  0.00  0.47
------------------------------------------------------
Validation on the leaving out well LUKE G U

Model Report
-Accuracy: 0.637744
-Adjacent Accuracy: 0.928416
-F1 Score: 0.659138

Confusion Matrix Results
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS                                                           0
     CSiS    11    88    18                                       117
     FSiS     7    43    75                             4         129
     SiSh                      31     1     3                      35
       MS                             1                 1           2
       WS                       8    17    42          16     1    84
        D                                   1     7    11     1    20
       PS                 2     3     3    15     1    50          74
       BS                                                           0

Precision  0.00  0.67  0.79  0.74  0.05  0.69  0.88  0.61  0.00  0.71
   Recall  0.00  0.75  0.58  0.89  0.50  0.50  0.35  0.68  0.00  0.64
       F1  0.00  0.71  0.67  0.81  0.08  0.58  0.50  0.64  0.00  0.66
------------------------------------------------------
Validation on the leaving out well KIMZEY A

Model Report
-Accuracy: 0.530752
-Adjacent Accuracy: 0.895216
-F1 Score: 0.495145

Confusion Matrix Results
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS           5     4                                         9
     CSiS          75    10                                        85
     FSiS          40    34                                        74
     SiSh                      27          16                      43
       MS                       7     2    29          15          53
       WS                 1     3          28     1    18          51
        D                       3           5     5    14          27
       PS                       1     1    24     4    60          90
       BS                                   2           3     2     7

Precision  0.00  0.62  0.69  0.66  0.67  0.27  0.50  0.55  1.00  0.57
   Recall  0.00  0.88  0.46  0.63  0.04  0.55  0.19  0.67  0.29  0.53
       F1  0.00  0.73  0.55  0.64  0.07  0.36  0.27  0.60  0.44  0.50
------------------------------------------------------
Validation on the leaving out well CROSS H CATTLE

Model Report
-Accuracy: 0.361277
-Adjacent Accuracy: 0.878244
-F1 Score: 0.339461

Confusion Matrix Results
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS    31   112    15                                       158
     CSiS     2    58    81                             1         142
     FSiS           5    39                 1           2          47
     SiSh                 4     1     2    17           1          25
       MS           4     3                17     1     3          28
       WS                                  24     1     6          31
        D                                         1     1           2
       PS                 4     4     1    27     5    27          68
       BS                                                           0

Precision  0.94  0.32  0.27  0.20  0.00  0.28  0.12  0.66  0.00  0.53
   Recall  0.20  0.41  0.83  0.04  0.00  0.77  0.50  0.40  0.00  0.36
       F1  0.32  0.36  0.40  0.07  0.00  0.41  0.20  0.50  0.00  0.34
------------------------------------------------------
Validation on the leaving out well NOLAN

Model Report
-Accuracy: 0.520482
-Adjacent Accuracy: 0.872289
-F1 Score: 0.541316

Confusion Matrix Results
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS           4                                               4
     CSiS    15    85    17                 1                     118
     FSiS     3    23    39                 1           2          68
     SiSh           1           7     3    11     1     5          28
       MS           1     2           1    32     1     6     4    47
       WS     1                 1          11          12     5    30
        D                       1                 2     1           4
       PS                 5     1          14     2    71    23   116
       BS                                                           0

Precision  0.00  0.75  0.62  0.70  0.25  0.16  0.33  0.73  0.00  0.61
   Recall  0.00  0.72  0.57  0.25  0.02  0.37  0.50  0.61  0.00  0.52
       F1  0.00  0.73  0.60  0.37  0.04  0.22  0.40  0.67  0.00  0.54
------------------------------------------------------
Validation on the leaving out well Recruit F9

Model Report
-Accuracy: 0.637500
-Adjacent Accuracy: 0.925000
-F1 Score: 0.778626

Confusion Matrix Results
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS                                                           0
     CSiS                                                           0
     FSiS                                                           0
     SiSh                                                           0
       MS                                                           0
       WS                                                           0
        D                                                           0
       PS                                                           0
       BS                       1           5     4    19    51    80

Precision  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00  1.00
   Recall  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.64  0.64
       F1  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.78  0.78
------------------------------------------------------
Validation on the leaving out well NEWBY

Model Report
-Accuracy: 0.494600
-Adjacent Accuracy: 0.892009
-F1 Score: 0.486147

Confusion Matrix Results
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS                                                           0
     CSiS    12    62    23                             1          98
     FSiS     1    36    43                                        80
     SiSh           1          34     4    14     2     3          58
       MS                 3     2           8     4    11          28
       WS                       4    12    40     1    39          96
        D                                   3     4     9          16
       PS                             1    10          45          56
       BS                                   5          25     1    31

Precision  0.00  0.63  0.62  0.85  0.00  0.50  0.36  0.34  1.00  0.57
   Recall  0.00  0.63  0.54  0.59  0.00  0.42  0.25  0.80  0.03  0.49
       F1  0.00  0.63  0.58  0.69  0.00  0.45  0.30  0.48  0.06  0.49
------------------------------------------------------
Validation on the leaving out well CHURCHMAN BIBLE

Model Report
-Accuracy: 0.574257
-Adjacent Accuracy: 0.878713
-F1 Score: 0.548002

Confusion Matrix Results
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS           7     1                                         8
     CSiS     3    31    21                 1                      56
     FSiS           8    39     2           1           1          51
     SiSh                 1     7           5                      13
       MS                 1     4     3    18           4          30
       WS                      13          65           9          87
        D                 2     7           7     2    16          34
       PS                 3     1     3    24          43     1    75
       BS                       2           1           5    42    50

Precision  0.00  0.67  0.57  0.19  0.50  0.53  1.00  0.55  0.98  0.63
   Recall  0.00  0.55  0.76  0.54  0.10  0.75  0.06  0.57  0.84  0.57
       F1  0.00  0.61  0.66  0.29  0.17  0.62  0.11  0.56  0.90  0.55

------------------------------------------------------
Final Results
-Average F1 Score: 0.549185
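
The average above gives every well the same weight, although the wells differ a lot in size. As a rough check, the sketch below recomputes the unweighted mean from the per-well F1 scores printed above and compares it with a mean weighted by the number of samples in each well. The sample counts are the sums of the "Total" column of each confusion matrix; the weighted variant is our own illustration and not part of the original evaluation.

```python
# Per-well weighted F1 scores and sample counts, copied from the reports above.
# Sample count = sum of the "Total" column of that well's confusion matrix.
folds = [
    ("SHRIMPLIN", 0.587285, 471),
    ("ALEXANDER D", 0.589447, 466),
    ("SHANKLE", 0.467278, 449),
    ("LUKE G U", 0.659138, 461),
    ("KIMZEY A", 0.495145, 439),
    ("CROSS H CATTLE", 0.339461, 501),
    ("NOLAN", 0.541316, 415),
    ("Recruit F9", 0.778626, 80),
    ("NEWBY", 0.486147, 463),
    ("CHURCHMAN BIBLE", 0.548002, 404),
]

n_total = sum(n for _, _, n in folds)

# Unweighted mean: each well counts once, as in the notebook's final result.
mean_f1 = sum(f for _, f, _ in folds) / len(folds)

# Sample-weighted mean: each well counts in proportion to its size.
weighted_f1 = sum(f * n for _, f, n in folds) / n_total

print("samples: %d" % n_total)
print("unweighted mean F1: %.4f" % mean_f1)
print("sample-weighted mean F1: %.4f" % weighted_f1)
```

The counts add up to the 4149 training examples. Recruit F9 has only 80 samples, all of them BS, yet it has the highest F1 score and counts as much as any other well in the unweighted mean; weighting by size gives a somewhat lower figure. Which average is more appropriate depends on whether we care about per-well or per-sample performance.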

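The accuracy and accuracy_adjacent helpers and the adjacent_facies table used in the report are defined earlier in the notebook and are not repeated in this cell. The sketch below is our reading of what they compute, assuming accuracy is the fraction of samples on the diagonal of the confusion matrix and adjacent accuracy also counts predictions that fall in a neighboring facies. The adjacency table is an assumption; with it, the SHRIMPLIN confusion matrix copied from the report above reproduces both reported numbers.

```python
# Confusion matrix for the SHRIMPLIN fold, copied from the report above
# (rows = true facies, columns = predicted facies, order SS..BS).
conf = [
    [0,  0,  0,  0, 0,  0, 0,  0,  0],   # SS
    [7, 94, 17,  0, 0,  0, 0,  0,  0],   # CSiS
    [0, 52, 71,  0, 0,  0, 0,  0,  0],   # FSiS
    [0,  0,  0, 13, 0,  2, 0,  3,  0],   # SiSh
    [0,  0,  0,  6, 6, 43, 0,  8,  0],   # MS
    [0,  0,  0,  2, 2, 38, 1, 18,  2],   # WS
    [0,  0,  0,  0, 0,  1, 1,  3,  0],   # D
    [0,  0,  0,  0, 2, 10, 1, 52,  4],   # PS
    [0,  0,  0,  0, 0,  0, 0,  1, 11],   # BS
]

# Assumed adjacency table: entry i lists the classes treated as neighbors
# of class i (0 = SS, ..., 8 = BS). Not shown in this section.
adjacent_facies = [[1], [0, 2], [1], [4], [3, 5],
                   [4, 6, 7], [5, 7], [5, 6, 8], [6, 7]]


def accuracy(conf):
    """Fraction of samples predicted as their true facies."""
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(len(conf)))
    return correct / total


def accuracy_adjacent(conf, adjacent_facies):
    """Fraction of samples predicted as the true facies or a neighbor."""
    total = sum(sum(row) for row in conf)
    correct = 0
    for i, row in enumerate(conf):
        correct += row[i]
        correct += sum(row[j] for j in adjacent_facies[i])
    return correct / total


print("-Accuracy: %.6f" % accuracy(conf))
print("-Adjacent Accuracy: %.6f" % accuracy_adjacent(conf, adjacent_facies))
```

For this well, 286 of 471 samples lie on the diagonal and 452 lie on the diagonal or in an adjacent class, which corresponds to the 0.607219 and 0.959660 in the SHRIMPLIN report.
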
In [19]:
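# Load test data
test_data = pd.read_csv('../validation_data_nofacies.csv')
test_data['Well Name'] = test_data['Well Name'].astype('category')
X_test = test_data.drop(['Formation', 'Well Name', 'Depth'], axis=1)
# Predict facies of unclassified data.
# Note: model_final here is the model fitted in the last cross-validation
# fold above, i.e. trained on all wells except the last held-out one.
Y_predicted = model_final.predict(X_test)
# Shift class indices 0..8 back to facies labels 1..9
test_data['Facies'] = Y_predicted + 1
# Store the prediction
test_data.to_csv('Prediction1.csv')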

In [20]:
test_data


Out[20]:
     Formation  Well Name   Depth       GR  ILD_log10  DeltaPHI   PHIND     PE  NM_M  RELPOS  Facies
  0      A1 SH     STUART  2808.0   66.276      0.630     3.300  10.650  3.591     1   1.000       2
  1      A1 SH     STUART  2808.5   77.252      0.585     6.500  11.950  3.341     1   0.978       3
  2      A1 SH     STUART  2809.0   82.899      0.566     9.400  13.600  3.064     1   0.956       2
  3      A1 SH     STUART  2809.5   80.671      0.593     9.500  13.250  2.977     1   0.933       2
  4      A1 SH     STUART  2810.0   75.971      0.638     8.700  12.350  3.020     1   0.911       2
  5      A1 SH     STUART  2810.5   73.955      0.667     6.900  12.250  3.086     1   0.889       2
  6      A1 SH     STUART  2811.0   77.962      0.674     6.500  12.450  3.092     1   0.867       2
  7      A1 SH     STUART  2811.5   83.894      0.667     6.300  12.650  3.123     1   0.844       2
  8      A1 SH     STUART  2812.0   84.424      0.653     6.700  13.050  3.121     1   0.822       2
  9      A1 SH     STUART  2812.5   83.160      0.642     7.300  12.950  3.127     1   0.800       2
 10      A1 SH     STUART  2813.0   79.063      0.651     7.300  12.050  3.147     1   0.778       2
 11      A1 SH     STUART  2813.5   69.002      0.677     6.200  10.800  3.096     1   0.756       2
 12      A1 SH     STUART  2814.0   63.983      0.690     4.400   9.700  3.103     1   0.733       2
 13      A1 SH     STUART  2814.5   61.797      0.675     3.500   9.150  3.101     1   0.711       2
 14      A1 SH     STUART  2815.0   61.372      0.646     2.800   9.300  3.065     1   0.689       2
 15      A1 SH     STUART  2815.5   63.535      0.621     2.800   9.800  2.982     1   0.667       2
 16      A1 SH     STUART  2816.0   65.126      0.600     3.300  10.550  2.914     1   0.644       2
 17      A1 SH     STUART  2816.5   75.930      0.576     3.400  11.900  2.845     1   0.600       2
 18      A1 SH     STUART  2817.0   85.077      0.584     4.400  12.900  2.854     1   0.578       2
 19      A1 SH     STUART  2817.5   89.459      0.598     6.600  13.500  2.986     1   0.556       2
 20      A1 SH     STUART  2818.0   88.619      0.610     7.200  14.800  2.988     1   0.533       2
 21      A1 SH     STUART  2818.5   81.593      0.636     6.400  13.900  2.998     1   0.511       2
 22      A1 SH     STUART  2819.0   66.595      0.702     2.800  11.400  2.988     1   0.489       2
 23      A1 SH     STUART  2819.5   55.081      0.789     2.700   8.150  3.028     1   0.467       1
 24      A1 SH     STUART  2820.0   48.112      0.840     1.000   7.500  3.073     1   0.444       2
 25      A1 SH     STUART  2820.5   43.730      0.846     0.400   7.100  3.146     1   0.422       1
 26      A1 SH     STUART  2821.0   44.097      0.840     0.700   6.650  3.205     1   0.400       1
 27      A1 SH     STUART  2821.5   46.839      0.842     0.800   6.600  3.254     1   0.378       1
 28      A1 SH     STUART  2822.0   50.348      0.843     1.100   6.750  3.230     1   0.356       1
 29      A1 SH     STUART  2822.5   57.129      0.822     2.200   7.300  3.237     1   0.333       1
...        ...        ...     ...      ...        ...       ...     ...    ...   ...     ...     ...
800      B5 LM   CRAWFORD  3146.0  167.803     -0.219     4.270  23.370  3.810     2   0.190       8
801      B5 LM   CRAWFORD  3146.5  151.183     -0.057     0.925  17.125  4.153     2   0.172       8
802      B5 LM   CRAWFORD  3147.0  123.264      0.067     0.285  14.215  4.404     2   0.155       8
803      B5 LM   CRAWFORD  3147.5  108.569      0.234     0.705  12.225  4.499     2   0.138       8
804      B5 LM   CRAWFORD  3148.0  101.072      0.427     1.150  10.760  4.392     2   0.121       8
805      B5 LM   CRAWFORD  3148.5   91.748      0.625     1.135   9.605  4.254     2   0.103       8
806      B5 LM   CRAWFORD  3149.0   83.794      0.749     2.075   7.845  4.023     2   0.086       6
807      B5 LM   CRAWFORD  3149.5   83.794      0.749     2.075   7.845  4.023     2   0.086       6
808      B5 LM   CRAWFORD  3150.0   79.722      0.771     2.890   6.640  4.040     2   0.069       6
809      B5 LM   CRAWFORD  3150.5   76.334      0.800     2.960   6.290  3.997     2   0.052       8
810      B5 LM   CRAWFORD  3151.0   73.631      0.800     2.680   6.690  3.828     2   0.034       8
811      B5 LM   CRAWFORD  3151.5   76.865      0.772     2.420   8.600  3.535     2   0.017       8
812       C SH   CRAWFORD  3152.0   79.924      0.752     2.620  11.510  3.148     1   1.000       2
813       C SH   CRAWFORD  3152.5   82.199      0.728     3.725  14.555  2.964     1   0.972       3
814       C SH   CRAWFORD  3153.0   79.953      0.700     5.610  16.930  2.793     1   0.944       3
815       C SH   CRAWFORD  3153.5   75.881      0.673     6.300  17.570  2.969     1   0.917       3
816       C SH   CRAWFORD  3154.0   67.470      0.652     4.775  15.795  3.282     1   0.889       2
817       C SH   CRAWFORD  3154.5   58.832      0.640     4.315  13.575  3.642     1   0.861       2
818       C SH   CRAWFORD  3155.0   57.946      0.631     3.595  11.305  3.893     1   0.833       2
819       C SH   CRAWFORD  3155.5   65.755      0.625     3.465  10.355  3.911     1   0.806       2
820       C SH   CRAWFORD  3156.0   69.445      0.617     3.390  11.540  3.820     1   0.778       2
821       C SH   CRAWFORD  3156.5   73.389      0.608     3.625  12.775  3.620     1   0.750       2
822       C SH   CRAWFORD  3157.0   77.115      0.605     4.140  13.420  3.467     1   0.722       2
823       C SH   CRAWFORD  3157.5   79.840      0.596     4.875  13.825  3.360     1   0.694       2
824       C SH   CRAWFORD  3158.0   82.616      0.577     5.235  14.845  3.207     1   0.667       2
825       C SH   CRAWFORD  3158.5   86.078      0.554     5.040  16.150  3.161     1   0.639       2
826       C SH   CRAWFORD  3159.0   88.855      0.539     5.560  16.750  3.118     1   0.611       2
827       C SH   CRAWFORD  3159.5   90.490      0.530     6.360  16.780  3.168     1   0.583       3
828       C SH   CRAWFORD  3160.0   90.975      0.522     7.035  16.995  3.154     1   0.556       2
829       C SH   CRAWFORD  3160.5   90.108      0.513     7.505  17.595  3.125     1   0.528       3

830 rows × 11 columns

Future work: design a more customized objective function. We could also use RandomizedSearchCV instead of GridSearchCV to explore parameter values between the fixed grid points, which may avoid getting trapped in a local optimum and further improve the test results.
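
To illustrate the random search idea, here is a minimal sketch using only the Python standard library. It samples parameter candidates from ranges instead of a fixed grid and keeps the best one. The ranges, the helper names sample_params and random_search, and the toy_score function are all assumptions made for this example; toy_score is a made-up stand-in for a cross-validated accuracy. In practice, scikit-learn's RandomizedSearchCV wraps the same idea together with cross validation.

```python
import random


def sample_params(rng):
    """Draw one candidate; parameter names follow the grid search above,
    the ranges are illustrative assumptions."""
    return {
        "max_depth": rng.randint(2, 8),          # integer, both ends included
        "gamma": rng.uniform(0.0, 1.0),
        "subsample": rng.uniform(0.4, 0.8),
        "colsample_bytree": rng.uniform(0.4, 0.8),
    }


def random_search(score_fn, n_iter, seed=0):
    """Evaluate n_iter random candidates; return (best_score, best_params)."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_iter):
        params = sample_params(rng)
        score = score_fn(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


def toy_score(params):
    """Hypothetical stand-in for a cross-validated score. In the notebook this
    would fit the classifier with params and score it on held-out data."""
    return -((params["max_depth"] - 5) ** 2) - (params["subsample"] - 0.6) ** 2


best_score, best_params = random_search(toy_score, n_iter=20, seed=100)
print(best_score, best_params)
```

With a fixed budget of 20 evaluations this tries 20 different values of each continuous parameter, whereas the 54-point grid above tries only two or three values per parameter.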