In this notebook, we mainly utilize extreme gradient boost to improve the prediction model originially proposed in TLE 2016 November machine learning tuotrial. Extreme gradient boost can be viewed as an enhanced version of gradient boost by using a more regularized model formalization to control over-fitting, and XGB usually performs better. Applications of XGB can be found in many Kaggle competitions. Some recommended tutorrials can be found

Our work will be orginized in the follwing order:

•Background

•Exploratory Data Analysis

•Data Prepration and Model Selection

•Final Results

Background

The dataset we will use comes from a class excercise from The University of Kansas on Neural Networks and Fuzzy Systems. This exercise is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more info on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).

The dataset we will use is log data from nine wells that have been labeled with a facies type based on oberservation of core. We will use this log data to train a classifier to predict facies types.

This data is from the Council Grove gas reservoir in Southwest Kansas. The Panoma Council Grove Field is predominantly a carbonate gas reservoir encompassing 2700 square miles in Southwestern Kansas. This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector and validation (test) data (830 examples from two wells) having the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five from wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.

The seven predictor variables are: •Five wire line log curves include gamma ray (GR), resistivity logging (ILD_log10), photoelectric effect (PE), neutron-density porosity difference and average neutron-density porosity (DeltaPHI and PHIND). Note, some wells do not have PE. •Two geologic constraining variables: nonmarine-marine indicator (NM_M) and relative position (RELPOS)

The nine discrete facies (classes of rocks) are:

1.Nonmarine sandstone

2.Nonmarine coarse siltstone

3.Nonmarine fine siltstone

4.Marine siltstone and shale

5.Mudstone (limestone)

6.Wackestone (limestone)

7.Dolomite

8.Packstone-grainstone (limestone)

9.Phylloid-algal bafflestone (limestone)

These facies aren't discrete, and gradually blend into one another. Some have neighboring facies that are rather close. Mislabeling within these neighboring facies can be expected to occur. The following table lists the facies, their abbreviated labels and their approximate neighbors.

Facies/ Label/ Adjacent Facies

1 SS 2

2 CSiS 1,3

3 FSiS 2

4 SiSh 5

5 MS 4,6

6 WS 5,7

7 D 6,8

8 PS 6,7,9

9 BS 7,8

Exprolatory Data Analysis

After the background intorduction, we start to import the pandas library for some basic data analysis and manipulation. The matplotblib and seaborn are imported for data vislization.


In [1]:
%matplotlib inline
import pandas as pd
from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import matplotlib.colors as colors

import xgboost as xgb
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score, roc_auc_score
from classification_utilities import display_cm, display_adj_cm
from sklearn.model_selection import GridSearchCV


from sklearn.model_selection import validation_curve
from sklearn.datasets import load_svmlight_files
from sklearn.model_selection import StratifiedKFold, cross_val_score, LeavePGroupsOut
from sklearn.datasets import make_classification
from xgboost.sklearn import XGBClassifier
from scipy.sparse import vstack

#use a fixed seed for reproducibility
seed = 123
np.random.seed(seed)

In [2]:
filename = './facies_vectors.csv'
training_data = pd.read_csv(filename)
training_data.head(10)


Out[2]:
Facies Formation Well Name Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS
0 3 A1 SH SHRIMPLIN 2793.0 77.45 0.664 9.9 11.915 4.6 1 1.000
1 3 A1 SH SHRIMPLIN 2793.5 78.26 0.661 14.2 12.565 4.1 1 0.979
2 3 A1 SH SHRIMPLIN 2794.0 79.05 0.658 14.8 13.050 3.6 1 0.957
3 3 A1 SH SHRIMPLIN 2794.5 86.10 0.655 13.9 13.115 3.5 1 0.936
4 3 A1 SH SHRIMPLIN 2795.0 74.58 0.647 13.5 13.300 3.4 1 0.915
5 3 A1 SH SHRIMPLIN 2795.5 73.97 0.636 14.0 13.385 3.6 1 0.894
6 3 A1 SH SHRIMPLIN 2796.0 73.72 0.630 15.6 13.930 3.7 1 0.872
7 3 A1 SH SHRIMPLIN 2796.5 75.65 0.625 16.5 13.920 3.5 1 0.830
8 3 A1 SH SHRIMPLIN 2797.0 73.79 0.624 16.2 13.980 3.4 1 0.809
9 3 A1 SH SHRIMPLIN 2797.5 76.89 0.615 16.9 14.220 3.5 1 0.787

Set columns 'Well Name' and 'Formation' to be category


In [3]:
training_data['Well Name'] = training_data['Well Name'].astype('category')
training_data['Formation'] = training_data['Formation'].astype('category')
training_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4149 entries, 0 to 4148
Data columns (total 11 columns):
Facies       4149 non-null int64
Formation    4149 non-null category
Well Name    4149 non-null category
Depth        4149 non-null float64
GR           4149 non-null float64
ILD_log10    4149 non-null float64
DeltaPHI     4149 non-null float64
PHIND        4149 non-null float64
PE           3232 non-null float64
NM_M         4149 non-null int64
RELPOS       4149 non-null float64
dtypes: category(2), float64(7), int64(2)
memory usage: 300.1 KB

In [4]:
training_data.describe()


/Users/littleni/anaconda/lib/python3.5/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)
Out[4]:
Facies Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS
count 4149.000000 4149.000000 4149.000000 4149.000000 4149.000000 4149.000000 3232.000000 4149.000000 4149.000000
mean 4.503254 2906.867438 64.933985 0.659566 4.402484 13.201066 3.725014 1.518438 0.521852
std 2.474324 133.300164 30.302530 0.252703 5.274947 7.132846 0.896152 0.499720 0.286644
min 1.000000 2573.500000 10.149000 -0.025949 -21.832000 0.550000 0.200000 1.000000 0.000000
25% 2.000000 2821.500000 44.730000 0.498000 1.600000 8.500000 NaN 1.000000 0.277000
50% 4.000000 2932.500000 64.990000 0.639000 4.300000 12.020000 NaN 2.000000 0.528000
75% 6.000000 3007.000000 79.438000 0.822000 7.500000 16.050000 NaN 2.000000 0.769000
max 9.000000 3138.000000 361.150000 1.800000 19.312000 84.400000 8.094000 2.000000 1.000000

Check distribution of classes in whole dataset


In [5]:
plt.figure(figsize=(5,5))
facies_colors = ['#F4D03F', '#F5B041','#DC7633','#6E2C00','#1B4F72',
                 '#2E86C1', '#AED6F1', '#A569BD', '#196F3D']

facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS','WS', 'D','PS', 'BS']

facies_counts = training_data['Facies'].value_counts().sort_index()
facies_counts.index = facies_labels
facies_counts.plot(kind='bar',color=facies_colors,title='Distribution of Training Data by Facies')


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x10492f438>

Check distribution of classes in each well


In [6]:
wells = training_data['Well Name'].unique()

In [7]:
plt.figure(figsize=(15,9))
for index, w in enumerate(wells):
    ax = plt.subplot(2,5,index+1)

    facies_counts = pd.Series(np.zeros(9), index=range(1,10))
    facies_counts = facies_counts.add(training_data[training_data['Well Name']==w]['Facies'].value_counts().sort_index())
    #facies_counts.replace(np.nan,0)
    facies_counts.index = facies_labels
    facies_counts.plot(kind='bar',color=facies_colors,title=w)
    ax.set_ylim(0,160)


We can see that classes are very imbalanced in each well


In [8]:
plt.figure(figsize=(5,5))
sns.heatmap(training_data.corr(), vmax=1.0, square=True)


Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x1160a4160>

Data Preparation and Model Selection

Now we are ready to test the XGB approach, and will use confusion matrix and f1_score, which were imported, as metric for classification, as well as GridSearchCV, which is an excellent tool for parameter optimization.


In [9]:
X_train = training_data.drop(['Facies', 'Well Name','Formation','Depth'], axis = 1 ) 
Y_train = training_data['Facies' ] - 1
dtrain = xgb.DMatrix(X_train, Y_train)

In [39]:
features = ['GR','ILD_log10','DeltaPHI','PHIND','PE','NM_M','RELPOS']

The accuracy function and accuracy_adjacent function are defined in the following to quatify the prediction correctness.


In [10]:
def accuracy(conf):
    total_correct = 0.
    nb_classes = conf.shape[0]
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
    acc = total_correct/sum(sum(conf))
    return acc

adjacent_facies = np.array([[1], [0,2], [1], [4], [3,5], [4,6,7], [5,7], [5,6,8], [6,7]])

def accuracy_adjacent(conf, adjacent_facies):
    nb_classes = conf.shape[0]
    total_correct = 0.
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
        for j in adjacent_facies[i]:
            total_correct += conf[i][j]
    return total_correct / sum(sum(conf))

Before processing further, we define a functin which will help us create XGBoost models and perform cross-validation.


In [11]:
skf = StratifiedKFold(n_splits=5)

In [13]:
cv = skf.split(X_train, Y_train)

In [24]:
def modelfit(alg, Xtrain, Ytrain, useTrainCV=True, cv_fold=skf):
        
    #Fit the algorithm on the data
    alg.fit(Xtrain, Ytrain,eval_metric='merror')
        
    #Predict training set:
    dtrain_prediction = alg.predict(Xtrain)
    #dtrain_predprob = alg.predict_proba(Xtrain)[:,1]
        
    #Pring model report
    print ("\nModel Report")
    print ("Accuracy : %.4g" % accuracy_score(Ytrain,dtrain_prediction))
    print ("F1 score (Train) : %f" % f1_score(Ytrain,dtrain_prediction,average='micro'))
    #Perform cross-validation:
    if useTrainCV:
        cv_score = cross_val_score(alg, Xtrain, Ytrain, cv=cv_fold, scoring='f1_micro')
        print ("CV Score : Mean - %.7g | Std - %.7g | Min - %.7g | Max - %.7g" % 
        (np.mean(cv_score), np.std(cv_score), np.min(cv_score), np.max(cv_score)))
    
    #Pring Feature Importance
    feat_imp = pd.Series(alg.booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar',title='Feature Importances')
    plt.ylabel('Feature Importance Score')

General Approach for Parameter Tuning

We are going to preform the steps as follows:

1.Choose a relatively high learning rate, e.g., 0.1. Usually somewhere between 0.05 and 0.3 should work for different problems.

2.Determine the optimum number of tress for this learning rate.XGBoost has a very usefull function called as "cv" which performs cross-validation at each boosting iteration and thus returns the optimum number of tress required.

3.Tune tree-based parameters(max_depth, min_child_weight, gamma, subsample, colsample_bytree) for decided learning rate and number of trees.

4.Tune regularization parameters(lambda, alpha) for xgboost which can help reduce model complexity and enhance performance.

5.Lower the learning rate and decide the optimal parameters.

Step 1:Fix learning rate and number of estimators for tuning tree-based parameters

In order to decide on boosting parameters, we need to set some initial values of other parameters. Lets take the following values:

1.max_depth = 5

2.min_child_weight = 1

3.gamma = 0

4.subsample, colsample_bytree = 0.8 : This is a commonly used used start value.

5.scale_pos_weight = 1

Please note that all the above are just initial estimates and will be tuned later. Lets take the default learning rate of 0.1 here and check the optimum number of trees using cv function of xgboost. The function defined above will do it for us.


In [15]:
xgb1= XGBClassifier(
    learning_rate=0.05,
    objective = 'multi:softmax',
    nthread = 4, 
    seed = seed
)

In [16]:
xgb1


Out[16]:
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.05, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=4,
       objective='multi:softmax', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=123, silent=True, subsample=1)

In [25]:
modelfit(xgb1, X_train, Y_train)


Model Report
Accuracy : 0.6756
F1 score (Train) : 0.675584
CV Score : Mean - 0.5253533 | Std - 0.05590592 | Min - 0.4323671 | Max - 0.5795181

Step 2: Tune max_depth and min_child_weight


In [26]:
param_test1={
    'n_estimators':range(20, 100, 10)
}

gs1 = GridSearchCV(xgb1,param_grid=param_test1, 
                   scoring='accuracy', n_jobs=4,iid=False, cv=skf)
gs1.fit(X_train, Y_train)
gs1.grid_scores_, gs1.best_params_,gs1.best_score_


/Users/littleni/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_search.py:667: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
Out[26]:
([mean: 0.52342, std: 0.05811, params: {'n_estimators': 20},
  mean: 0.52221, std: 0.05941, params: {'n_estimators': 30},
  mean: 0.52509, std: 0.06253, params: {'n_estimators': 40},
  mean: 0.52583, std: 0.05667, params: {'n_estimators': 50},
  mean: 0.52703, std: 0.05816, params: {'n_estimators': 60},
  mean: 0.52511, std: 0.05431, params: {'n_estimators': 70},
  mean: 0.52558, std: 0.05486, params: {'n_estimators': 80},
  mean: 0.52535, std: 0.05566, params: {'n_estimators': 90}],
 {'n_estimators': 60},
 0.52702933462869905)

In [27]:
gs1.best_estimator_


Out[27]:
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.05, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=60, nthread=4,
       objective='multi:softprob', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=123, silent=True, subsample=1)

In [29]:
param_test2={
    'max_depth':range(5,16,2),
    'min_child_weight':range(1,15,2)
}

gs2 = GridSearchCV(gs1.best_estimator_,param_grid=param_test2, 
                   scoring='accuracy', n_jobs=4,iid=False, cv=skf)
gs2.fit(X_train, Y_train)
gs2.grid_scores_, gs2.best_params_,gs2.best_score_


/Users/littleni/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_search.py:667: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
Out[29]:
([mean: 0.52899, std: 0.06227, params: {'max_depth': 5, 'min_child_weight': 1},
  mean: 0.52728, std: 0.06266, params: {'max_depth': 5, 'min_child_weight': 3},
  mean: 0.52656, std: 0.06450, params: {'max_depth': 5, 'min_child_weight': 5},
  mean: 0.52873, std: 0.06452, params: {'max_depth': 5, 'min_child_weight': 7},
  mean: 0.52921, std: 0.06631, params: {'max_depth': 5, 'min_child_weight': 9},
  mean: 0.52775, std: 0.07044, params: {'max_depth': 5, 'min_child_weight': 11},
  mean: 0.52510, std: 0.06619, params: {'max_depth': 5, 'min_child_weight': 13},
  mean: 0.52174, std: 0.06052, params: {'max_depth': 7, 'min_child_weight': 1},
  mean: 0.52366, std: 0.05723, params: {'max_depth': 7, 'min_child_weight': 3},
  mean: 0.52897, std: 0.06175, params: {'max_depth': 7, 'min_child_weight': 5},
  mean: 0.53715, std: 0.06338, params: {'max_depth': 7, 'min_child_weight': 7},
  mean: 0.52798, std: 0.06687, params: {'max_depth': 7, 'min_child_weight': 9},
  mean: 0.52510, std: 0.06674, params: {'max_depth': 7, 'min_child_weight': 11},
  mean: 0.52920, std: 0.06322, params: {'max_depth': 7, 'min_child_weight': 13},
  mean: 0.51741, std: 0.06442, params: {'max_depth': 9, 'min_child_weight': 1},
  mean: 0.52461, std: 0.06423, params: {'max_depth': 9, 'min_child_weight': 3},
  mean: 0.53643, std: 0.06838, params: {'max_depth': 9, 'min_child_weight': 5},
  mean: 0.53404, std: 0.06697, params: {'max_depth': 9, 'min_child_weight': 7},
  mean: 0.53161, std: 0.06506, params: {'max_depth': 9, 'min_child_weight': 9},
  mean: 0.53066, std: 0.06886, params: {'max_depth': 9, 'min_child_weight': 11},
  mean: 0.52848, std: 0.06856, params: {'max_depth': 9, 'min_child_weight': 13},
  mean: 0.51788, std: 0.06488, params: {'max_depth': 11, 'min_child_weight': 1},
  mean: 0.52605, std: 0.06566, params: {'max_depth': 11, 'min_child_weight': 3},
  mean: 0.52751, std: 0.06690, params: {'max_depth': 11, 'min_child_weight': 5},
  mean: 0.52728, std: 0.06630, params: {'max_depth': 11, 'min_child_weight': 7},
  mean: 0.52872, std: 0.06417, params: {'max_depth': 11, 'min_child_weight': 9},
  mean: 0.52415, std: 0.06648, params: {'max_depth': 11, 'min_child_weight': 11},
  mean: 0.52584, std: 0.06655, params: {'max_depth': 11, 'min_child_weight': 13},
  mean: 0.51692, std: 0.06642, params: {'max_depth': 13, 'min_child_weight': 1},
  mean: 0.52412, std: 0.06762, params: {'max_depth': 13, 'min_child_weight': 3},
  mean: 0.52149, std: 0.06568, params: {'max_depth': 13, 'min_child_weight': 5},
  mean: 0.52510, std: 0.06875, params: {'max_depth': 13, 'min_child_weight': 7},
  mean: 0.52438, std: 0.06386, params: {'max_depth': 13, 'min_child_weight': 9},
  mean: 0.52416, std: 0.06536, params: {'max_depth': 13, 'min_child_weight': 11},
  mean: 0.52632, std: 0.06628, params: {'max_depth': 13, 'min_child_weight': 13},
  mean: 0.51162, std: 0.07259, params: {'max_depth': 15, 'min_child_weight': 1},
  mean: 0.52292, std: 0.07189, params: {'max_depth': 15, 'min_child_weight': 3},
  mean: 0.52775, std: 0.06432, params: {'max_depth': 15, 'min_child_weight': 5},
  mean: 0.52414, std: 0.06726, params: {'max_depth': 15, 'min_child_weight': 7},
  mean: 0.52751, std: 0.06757, params: {'max_depth': 15, 'min_child_weight': 9},
  mean: 0.52319, std: 0.06455, params: {'max_depth': 15, 'min_child_weight': 11},
  mean: 0.52655, std: 0.06827, params: {'max_depth': 15, 'min_child_weight': 13}],
 {'max_depth': 7, 'min_child_weight': 7},
 0.53715189797054053)

In [30]:
gs2.best_estimator_


Out[30]:
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.05, max_delta_step=0, max_depth=7,
       min_child_weight=7, missing=None, n_estimators=60, nthread=4,
       objective='multi:softprob', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=123, silent=True, subsample=1)

In [31]:
modelfit(gs2.best_estimator_, X_train, Y_train)


Model Report
Accuracy : 0.7932
F1 score (Train) : 0.793203
CV Score : Mean - 0.5371519 | Std - 0.06337505 | Min - 0.4227053 | Max - 0.6024096

Step 3: Tune gamma


In [32]:
param_test3={
    'gamma':[0,.05,.1,.15,.2,.3,.4],
    'subsample':[0.6,.7,.75,.8,.85,.9],
    'colsample_bytree':[i/10.0 for i in range(4,10)]
}

gs3 = GridSearchCV(gs2.best_estimator_,param_grid=param_test3, 
                   scoring='accuracy', n_jobs=4,iid=False, cv=skf)
gs3.fit(X_train, Y_train)
gs3.grid_scores_, gs3.best_params_,gs3.best_score_


/Users/littleni/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_search.py:667: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
Out[32]:
([mean: 0.51742, std: 0.05528, params: {'gamma': 0, 'subsample': 0.6, 'colsample_bytree': 0.4},
  mean: 0.52104, std: 0.05652, params: {'gamma': 0, 'subsample': 0.7, 'colsample_bytree': 0.4},
  mean: 0.51863, std: 0.05843, params: {'gamma': 0, 'subsample': 0.75, 'colsample_bytree': 0.4},
  mean: 0.51982, std: 0.05658, params: {'gamma': 0, 'subsample': 0.8, 'colsample_bytree': 0.4},
  mean: 0.51911, std: 0.05845, params: {'gamma': 0, 'subsample': 0.85, 'colsample_bytree': 0.4},
  mean: 0.51719, std: 0.05397, params: {'gamma': 0, 'subsample': 0.9, 'colsample_bytree': 0.4},
  mean: 0.51838, std: 0.05525, params: {'gamma': 0.05, 'subsample': 0.6, 'colsample_bytree': 0.4},
  mean: 0.52079, std: 0.05709, params: {'gamma': 0.05, 'subsample': 0.7, 'colsample_bytree': 0.4},
  mean: 0.52007, std: 0.05982, params: {'gamma': 0.05, 'subsample': 0.75, 'colsample_bytree': 0.4},
  mean: 0.51837, std: 0.05939, params: {'gamma': 0.05, 'subsample': 0.8, 'colsample_bytree': 0.4},
  mean: 0.52104, std: 0.05854, params: {'gamma': 0.05, 'subsample': 0.85, 'colsample_bytree': 0.4},
  mean: 0.51791, std: 0.05434, params: {'gamma': 0.05, 'subsample': 0.9, 'colsample_bytree': 0.4},
  mean: 0.51935, std: 0.05540, params: {'gamma': 0.1, 'subsample': 0.6, 'colsample_bytree': 0.4},
  mean: 0.52079, std: 0.05879, params: {'gamma': 0.1, 'subsample': 0.7, 'colsample_bytree': 0.4},
  mean: 0.51934, std: 0.05962, params: {'gamma': 0.1, 'subsample': 0.75, 'colsample_bytree': 0.4},
  mean: 0.51717, std: 0.05676, params: {'gamma': 0.1, 'subsample': 0.8, 'colsample_bytree': 0.4},
  mean: 0.52105, std: 0.05700, params: {'gamma': 0.1, 'subsample': 0.85, 'colsample_bytree': 0.4},
  mean: 0.51959, std: 0.05420, params: {'gamma': 0.1, 'subsample': 0.9, 'colsample_bytree': 0.4},
  mean: 0.52055, std: 0.05607, params: {'gamma': 0.15, 'subsample': 0.6, 'colsample_bytree': 0.4},
  mean: 0.52007, std: 0.05644, params: {'gamma': 0.15, 'subsample': 0.7, 'colsample_bytree': 0.4},
  mean: 0.51983, std: 0.05791, params: {'gamma': 0.15, 'subsample': 0.75, 'colsample_bytree': 0.4},
  mean: 0.51910, std: 0.05797, params: {'gamma': 0.15, 'subsample': 0.8, 'colsample_bytree': 0.4},
  mean: 0.51887, std: 0.05652, params: {'gamma': 0.15, 'subsample': 0.85, 'colsample_bytree': 0.4},
  mean: 0.51718, std: 0.05536, params: {'gamma': 0.15, 'subsample': 0.9, 'colsample_bytree': 0.4},
  mean: 0.51765, std: 0.05617, params: {'gamma': 0.2, 'subsample': 0.6, 'colsample_bytree': 0.4},
  mean: 0.52103, std: 0.05701, params: {'gamma': 0.2, 'subsample': 0.7, 'colsample_bytree': 0.4},
  mean: 0.52104, std: 0.05946, params: {'gamma': 0.2, 'subsample': 0.75, 'colsample_bytree': 0.4},
  mean: 0.51983, std: 0.05938, params: {'gamma': 0.2, 'subsample': 0.8, 'colsample_bytree': 0.4},
  mean: 0.52033, std: 0.05653, params: {'gamma': 0.2, 'subsample': 0.85, 'colsample_bytree': 0.4},
  mean: 0.51863, std: 0.05552, params: {'gamma': 0.2, 'subsample': 0.9, 'colsample_bytree': 0.4},
  mean: 0.51934, std: 0.05705, params: {'gamma': 0.3, 'subsample': 0.6, 'colsample_bytree': 0.4},
  mean: 0.51887, std: 0.05718, params: {'gamma': 0.3, 'subsample': 0.7, 'colsample_bytree': 0.4},
  mean: 0.51838, std: 0.05890, params: {'gamma': 0.3, 'subsample': 0.75, 'colsample_bytree': 0.4},
  mean: 0.51788, std: 0.05879, params: {'gamma': 0.3, 'subsample': 0.8, 'colsample_bytree': 0.4},
  mean: 0.51768, std: 0.05607, params: {'gamma': 0.3, 'subsample': 0.85, 'colsample_bytree': 0.4},
  mean: 0.51791, std: 0.05485, params: {'gamma': 0.3, 'subsample': 0.9, 'colsample_bytree': 0.4},
  mean: 0.51836, std: 0.05758, params: {'gamma': 0.4, 'subsample': 0.6, 'colsample_bytree': 0.4},
  mean: 0.51766, std: 0.05684, params: {'gamma': 0.4, 'subsample': 0.7, 'colsample_bytree': 0.4},
  mean: 0.51958, std: 0.06113, params: {'gamma': 0.4, 'subsample': 0.75, 'colsample_bytree': 0.4},
  mean: 0.51909, std: 0.05894, params: {'gamma': 0.4, 'subsample': 0.8, 'colsample_bytree': 0.4},
  mean: 0.51863, std: 0.05769, params: {'gamma': 0.4, 'subsample': 0.85, 'colsample_bytree': 0.4},
  mean: 0.51911, std: 0.05557, params: {'gamma': 0.4, 'subsample': 0.9, 'colsample_bytree': 0.4},
  mean: 0.53547, std: 0.06129, params: {'gamma': 0, 'subsample': 0.6, 'colsample_bytree': 0.5},
  mean: 0.53548, std: 0.05720, params: {'gamma': 0, 'subsample': 0.7, 'colsample_bytree': 0.5},
  mean: 0.53836, std: 0.05929, params: {'gamma': 0, 'subsample': 0.75, 'colsample_bytree': 0.5},
  mean: 0.53452, std: 0.05982, params: {'gamma': 0, 'subsample': 0.8, 'colsample_bytree': 0.5},
  mean: 0.53500, std: 0.05600, params: {'gamma': 0, 'subsample': 0.85, 'colsample_bytree': 0.5},
  mean: 0.53957, std: 0.05904, params: {'gamma': 0, 'subsample': 0.9, 'colsample_bytree': 0.5},
  mean: 0.53740, std: 0.05967, params: {'gamma': 0.05, 'subsample': 0.6, 'colsample_bytree': 0.5},
  mean: 0.53403, std: 0.05688, params: {'gamma': 0.05, 'subsample': 0.7, 'colsample_bytree': 0.5},
  mean: 0.53571, std: 0.05824, params: {'gamma': 0.05, 'subsample': 0.75, 'colsample_bytree': 0.5},
  mean: 0.53765, std: 0.05863, params: {'gamma': 0.05, 'subsample': 0.8, 'colsample_bytree': 0.5},
  mean: 0.53572, std: 0.05564, params: {'gamma': 0.05, 'subsample': 0.85, 'colsample_bytree': 0.5},
  mean: 0.54197, std: 0.06020, params: {'gamma': 0.05, 'subsample': 0.9, 'colsample_bytree': 0.5},
  mean: 0.53644, std: 0.05812, params: {'gamma': 0.1, 'subsample': 0.6, 'colsample_bytree': 0.5},
  mean: 0.53524, std: 0.05620, params: {'gamma': 0.1, 'subsample': 0.7, 'colsample_bytree': 0.5},
  mean: 0.53763, std: 0.05929, params: {'gamma': 0.1, 'subsample': 0.75, 'colsample_bytree': 0.5},
  mean: 0.53765, std: 0.05845, params: {'gamma': 0.1, 'subsample': 0.8, 'colsample_bytree': 0.5},
  mean: 0.53668, std: 0.05748, params: {'gamma': 0.1, 'subsample': 0.85, 'colsample_bytree': 0.5},
  mean: 0.54294, std: 0.06001, params: {'gamma': 0.1, 'subsample': 0.9, 'colsample_bytree': 0.5},
  mean: 0.53837, std: 0.05784, params: {'gamma': 0.15, 'subsample': 0.6, 'colsample_bytree': 0.5},
  mean: 0.53620, std: 0.05711, params: {'gamma': 0.15, 'subsample': 0.7, 'colsample_bytree': 0.5},
  mean: 0.53739, std: 0.05940, params: {'gamma': 0.15, 'subsample': 0.75, 'colsample_bytree': 0.5},
  mean: 0.53620, std: 0.05942, params: {'gamma': 0.15, 'subsample': 0.8, 'colsample_bytree': 0.5},
  mean: 0.53571, std: 0.05876, params: {'gamma': 0.15, 'subsample': 0.85, 'colsample_bytree': 0.5},
  mean: 0.53956, std: 0.06055, params: {'gamma': 0.15, 'subsample': 0.9, 'colsample_bytree': 0.5},
  mean: 0.53692, std: 0.06019, params: {'gamma': 0.2, 'subsample': 0.6, 'colsample_bytree': 0.5},
  mean: 0.53789, std: 0.05528, params: {'gamma': 0.2, 'subsample': 0.7, 'colsample_bytree': 0.5},
  mean: 0.53860, std: 0.05772, params: {'gamma': 0.2, 'subsample': 0.75, 'colsample_bytree': 0.5},
  mean: 0.53381, std: 0.05576, params: {'gamma': 0.2, 'subsample': 0.8, 'colsample_bytree': 0.5},
  mean: 0.53644, std: 0.05764, params: {'gamma': 0.2, 'subsample': 0.85, 'colsample_bytree': 0.5},
  mean: 0.54029, std: 0.05790, params: {'gamma': 0.2, 'subsample': 0.9, 'colsample_bytree': 0.5},
  mean: 0.53885, std: 0.06013, params: {'gamma': 0.3, 'subsample': 0.6, 'colsample_bytree': 0.5},
  mean: 0.53740, std: 0.05664, params: {'gamma': 0.3, 'subsample': 0.7, 'colsample_bytree': 0.5},
  mean: 0.53619, std: 0.05848, params: {'gamma': 0.3, 'subsample': 0.75, 'colsample_bytree': 0.5},
  mean: 0.53525, std: 0.05721, params: {'gamma': 0.3, 'subsample': 0.8, 'colsample_bytree': 0.5},
  mean: 0.53692, std: 0.05999, params: {'gamma': 0.3, 'subsample': 0.85, 'colsample_bytree': 0.5},
  mean: 0.54198, std: 0.05822, params: {'gamma': 0.3, 'subsample': 0.9, 'colsample_bytree': 0.5},
  mean: 0.53692, std: 0.06048, params: {'gamma': 0.4, 'subsample': 0.6, 'colsample_bytree': 0.5},
  mean: 0.53596, std: 0.05673, params: {'gamma': 0.4, 'subsample': 0.7, 'colsample_bytree': 0.5},
  mean: 0.53619, std: 0.05770, params: {'gamma': 0.4, 'subsample': 0.75, 'colsample_bytree': 0.5},
  mean: 0.53742, std: 0.05799, params: {'gamma': 0.4, 'subsample': 0.8, 'colsample_bytree': 0.5},
  mean: 0.53837, std: 0.05818, params: {'gamma': 0.4, 'subsample': 0.85, 'colsample_bytree': 0.5},
  mean: 0.54005, std: 0.05864, params: {'gamma': 0.4, 'subsample': 0.9, 'colsample_bytree': 0.5},
  mean: 0.54247, std: 0.05741, params: {'gamma': 0, 'subsample': 0.6, 'colsample_bytree': 0.6},
  mean: 0.53716, std: 0.05912, params: {'gamma': 0, 'subsample': 0.7, 'colsample_bytree': 0.6},
  mean: 0.54053, std: 0.06128, params: {'gamma': 0, 'subsample': 0.75, 'colsample_bytree': 0.6},
  mean: 0.53884, std: 0.06020, params: {'gamma': 0, 'subsample': 0.8, 'colsample_bytree': 0.6},
  mean: 0.54488, std: 0.06146, params: {'gamma': 0, 'subsample': 0.85, 'colsample_bytree': 0.6},
  mean: 0.54512, std: 0.05946, params: {'gamma': 0, 'subsample': 0.9, 'colsample_bytree': 0.6},
  mean: 0.54367, std: 0.05880, params: {'gamma': 0.05, 'subsample': 0.6, 'colsample_bytree': 0.6},
  mean: 0.53789, std: 0.05839, params: {'gamma': 0.05, 'subsample': 0.7, 'colsample_bytree': 0.6},
  mean: 0.54005, std: 0.06012, params: {'gamma': 0.05, 'subsample': 0.75, 'colsample_bytree': 0.6},
  mean: 0.53812, std: 0.05907, params: {'gamma': 0.05, 'subsample': 0.8, 'colsample_bytree': 0.6},
  mean: 0.54464, std: 0.06038, params: {'gamma': 0.05, 'subsample': 0.85, 'colsample_bytree': 0.6},
  mean: 0.54392, std: 0.05940, params: {'gamma': 0.05, 'subsample': 0.9, 'colsample_bytree': 0.6},
  mean: 0.54391, std: 0.06157, params: {'gamma': 0.1, 'subsample': 0.6, 'colsample_bytree': 0.6},
  mean: 0.54078, std: 0.06326, params: {'gamma': 0.1, 'subsample': 0.7, 'colsample_bytree': 0.6},
  mean: 0.53908, std: 0.06093, params: {'gamma': 0.1, 'subsample': 0.75, 'colsample_bytree': 0.6},
  mean: 0.53981, std: 0.05805, params: {'gamma': 0.1, 'subsample': 0.8, 'colsample_bytree': 0.6},
  mean: 0.54248, std: 0.05867, params: {'gamma': 0.1, 'subsample': 0.85, 'colsample_bytree': 0.6},
  mean: 0.54464, std: 0.05889, params: {'gamma': 0.1, 'subsample': 0.9, 'colsample_bytree': 0.6},
  mean: 0.54584, std: 0.05738, params: {'gamma': 0.15, 'subsample': 0.6, 'colsample_bytree': 0.6},
  mean: 0.53933, std: 0.06040, params: {'gamma': 0.15, 'subsample': 0.7, 'colsample_bytree': 0.6},
  mean: 0.53812, std: 0.06023, params: {'gamma': 0.15, 'subsample': 0.75, 'colsample_bytree': 0.6},
  mean: 0.54030, std: 0.05817, params: {'gamma': 0.15, 'subsample': 0.8, 'colsample_bytree': 0.6},
  mean: 0.54440, std: 0.05940, params: {'gamma': 0.15, 'subsample': 0.85, 'colsample_bytree': 0.6},
  mean: 0.54247, std: 0.05795, params: {'gamma': 0.15, 'subsample': 0.9, 'colsample_bytree': 0.6},
  mean: 0.54464, std: 0.05684, params: {'gamma': 0.2, 'subsample': 0.6, 'colsample_bytree': 0.6},
  mean: 0.53862, std: 0.05865, params: {'gamma': 0.2, 'subsample': 0.7, 'colsample_bytree': 0.6},
  mean: 0.53933, std: 0.06252, params: {'gamma': 0.2, 'subsample': 0.75, 'colsample_bytree': 0.6},
  mean: 0.53885, std: 0.06004, params: {'gamma': 0.2, 'subsample': 0.8, 'colsample_bytree': 0.6},
  mean: 0.54536, std: 0.06090, params: {'gamma': 0.2, 'subsample': 0.85, 'colsample_bytree': 0.6},
  mean: 0.54103, std: 0.05695, params: {'gamma': 0.2, 'subsample': 0.9, 'colsample_bytree': 0.6},
  mean: 0.54415, std: 0.05886, params: {'gamma': 0.3, 'subsample': 0.6, 'colsample_bytree': 0.6},
  mean: 0.54006, std: 0.06041, params: {'gamma': 0.3, 'subsample': 0.7, 'colsample_bytree': 0.6},
  mean: 0.53788, std: 0.06006, params: {'gamma': 0.3, 'subsample': 0.75, 'colsample_bytree': 0.6},
  mean: 0.53933, std: 0.05991, params: {'gamma': 0.3, 'subsample': 0.8, 'colsample_bytree': 0.6},
  mean: 0.54295, std: 0.06138, params: {'gamma': 0.3, 'subsample': 0.85, 'colsample_bytree': 0.6},
  mean: 0.54175, std: 0.05846, params: {'gamma': 0.3, 'subsample': 0.9, 'colsample_bytree': 0.6},
  mean: 0.54487, std: 0.05856, params: {'gamma': 0.4, 'subsample': 0.6, 'colsample_bytree': 0.6},
  mean: 0.53958, std: 0.05770, params: {'gamma': 0.4, 'subsample': 0.7, 'colsample_bytree': 0.6},
  mean: 0.53981, std: 0.05806, params: {'gamma': 0.4, 'subsample': 0.75, 'colsample_bytree': 0.6},
  mean: 0.53861, std: 0.05931, params: {'gamma': 0.4, 'subsample': 0.8, 'colsample_bytree': 0.6},
  mean: 0.54295, std: 0.06028, params: {'gamma': 0.4, 'subsample': 0.85, 'colsample_bytree': 0.6},
  mean: 0.54223, std: 0.05858, params: {'gamma': 0.4, 'subsample': 0.9, 'colsample_bytree': 0.6},
  mean: 0.54247, std: 0.05741, params: {'gamma': 0, 'subsample': 0.6, 'colsample_bytree': 0.7},
  mean: 0.53716, std: 0.05912, params: {'gamma': 0, 'subsample': 0.7, 'colsample_bytree': 0.7},
  mean: 0.54053, std: 0.06128, params: {'gamma': 0, 'subsample': 0.75, 'colsample_bytree': 0.7},
  mean: 0.53884, std: 0.06020, params: {'gamma': 0, 'subsample': 0.8, 'colsample_bytree': 0.7},
  mean: 0.54488, std: 0.06146, params: {'gamma': 0, 'subsample': 0.85, 'colsample_bytree': 0.7},
  mean: 0.54512, std: 0.05946, params: {'gamma': 0, 'subsample': 0.9, 'colsample_bytree': 0.7},
  mean: 0.54367, std: 0.05880, params: {'gamma': 0.05, 'subsample': 0.6, 'colsample_bytree': 0.7},
  mean: 0.53789, std: 0.05839, params: {'gamma': 0.05, 'subsample': 0.7, 'colsample_bytree': 0.7},
  mean: 0.54005, std: 0.06012, params: {'gamma': 0.05, 'subsample': 0.75, 'colsample_bytree': 0.7},
  mean: 0.53812, std: 0.05907, params: {'gamma': 0.05, 'subsample': 0.8, 'colsample_bytree': 0.7},
  mean: 0.54464, std: 0.06038, params: {'gamma': 0.05, 'subsample': 0.85, 'colsample_bytree': 0.7},
  mean: 0.54392, std: 0.05940, params: {'gamma': 0.05, 'subsample': 0.9, 'colsample_bytree': 0.7},
  mean: 0.54391, std: 0.06157, params: {'gamma': 0.1, 'subsample': 0.6, 'colsample_bytree': 0.7},
  mean: 0.54078, std: 0.06326, params: {'gamma': 0.1, 'subsample': 0.7, 'colsample_bytree': 0.7},
  mean: 0.53908, std: 0.06093, params: {'gamma': 0.1, 'subsample': 0.75, 'colsample_bytree': 0.7},
  mean: 0.53981, std: 0.05805, params: {'gamma': 0.1, 'subsample': 0.8, 'colsample_bytree': 0.7},
  mean: 0.54248, std: 0.05867, params: {'gamma': 0.1, 'subsample': 0.85, 'colsample_bytree': 0.7},
  mean: 0.54464, std: 0.05889, params: {'gamma': 0.1, 'subsample': 0.9, 'colsample_bytree': 0.7},
  mean: 0.54584, std: 0.05738, params: {'gamma': 0.15, 'subsample': 0.6, 'colsample_bytree': 0.7},
  mean: 0.53933, std: 0.06040, params: {'gamma': 0.15, 'subsample': 0.7, 'colsample_bytree': 0.7},
  mean: 0.53812, std: 0.06023, params: {'gamma': 0.15, 'subsample': 0.75, 'colsample_bytree': 0.7},
  mean: 0.54030, std: 0.05817, params: {'gamma': 0.15, 'subsample': 0.8, 'colsample_bytree': 0.7},
  mean: 0.54440, std: 0.05940, params: {'gamma': 0.15, 'subsample': 0.85, 'colsample_bytree': 0.7},
  mean: 0.54247, std: 0.05795, params: {'gamma': 0.15, 'subsample': 0.9, 'colsample_bytree': 0.7},
  mean: 0.54464, std: 0.05684, params: {'gamma': 0.2, 'subsample': 0.6, 'colsample_bytree': 0.7},
  mean: 0.53862, std: 0.05865, params: {'gamma': 0.2, 'subsample': 0.7, 'colsample_bytree': 0.7},
  mean: 0.53933, std: 0.06252, params: {'gamma': 0.2, 'subsample': 0.75, 'colsample_bytree': 0.7},
  mean: 0.53885, std: 0.06004, params: {'gamma': 0.2, 'subsample': 0.8, 'colsample_bytree': 0.7},
  mean: 0.54536, std: 0.06090, params: {'gamma': 0.2, 'subsample': 0.85, 'colsample_bytree': 0.7},
  mean: 0.54103, std: 0.05695, params: {'gamma': 0.2, 'subsample': 0.9, 'colsample_bytree': 0.7},
  mean: 0.54415, std: 0.05886, params: {'gamma': 0.3, 'subsample': 0.6, 'colsample_bytree': 0.7},
  mean: 0.54006, std: 0.06041, params: {'gamma': 0.3, 'subsample': 0.7, 'colsample_bytree': 0.7},
  mean: 0.53788, std: 0.06006, params: {'gamma': 0.3, 'subsample': 0.75, 'colsample_bytree': 0.7},
  mean: 0.53933, std: 0.05991, params: {'gamma': 0.3, 'subsample': 0.8, 'colsample_bytree': 0.7},
  mean: 0.54295, std: 0.06138, params: {'gamma': 0.3, 'subsample': 0.85, 'colsample_bytree': 0.7},
  mean: 0.54175, std: 0.05846, params: {'gamma': 0.3, 'subsample': 0.9, 'colsample_bytree': 0.7},
  mean: 0.54487, std: 0.05856, params: {'gamma': 0.4, 'subsample': 0.6, 'colsample_bytree': 0.7},
  mean: 0.53958, std: 0.05770, params: {'gamma': 0.4, 'subsample': 0.7, 'colsample_bytree': 0.7},
  mean: 0.53981, std: 0.05806, params: {'gamma': 0.4, 'subsample': 0.75, 'colsample_bytree': 0.7},
  mean: 0.53861, std: 0.05931, params: {'gamma': 0.4, 'subsample': 0.8, 'colsample_bytree': 0.7},
  mean: 0.54295, std: 0.06028, params: {'gamma': 0.4, 'subsample': 0.85, 'colsample_bytree': 0.7},
  mean: 0.54223, std: 0.05858, params: {'gamma': 0.4, 'subsample': 0.9, 'colsample_bytree': 0.7},
  mean: 0.54414, std: 0.06667, params: {'gamma': 0, 'subsample': 0.6, 'colsample_bytree': 0.8},
  mean: 0.54101, std: 0.06193, params: {'gamma': 0, 'subsample': 0.7, 'colsample_bytree': 0.8},
  mean: 0.54318, std: 0.06244, params: {'gamma': 0, 'subsample': 0.75, 'colsample_bytree': 0.8},
  mean: 0.53861, std: 0.06144, params: {'gamma': 0, 'subsample': 0.8, 'colsample_bytree': 0.8},
  mean: 0.53620, std: 0.06564, params: {'gamma': 0, 'subsample': 0.85, 'colsample_bytree': 0.8},
  mean: 0.53620, std: 0.06363, params: {'gamma': 0, 'subsample': 0.9, 'colsample_bytree': 0.8},
  mean: 0.54775, std: 0.06776, params: {'gamma': 0.05, 'subsample': 0.6, 'colsample_bytree': 0.8},
  mean: 0.54029, std: 0.06213, params: {'gamma': 0.05, 'subsample': 0.7, 'colsample_bytree': 0.8},
  mean: 0.53957, std: 0.06077, params: {'gamma': 0.05, 'subsample': 0.75, 'colsample_bytree': 0.8},
  mean: 0.53765, std: 0.06127, params: {'gamma': 0.05, 'subsample': 0.8, 'colsample_bytree': 0.8},
  mean: 0.53812, std: 0.06492, params: {'gamma': 0.05, 'subsample': 0.85, 'colsample_bytree': 0.8},
  mean: 0.53740, std: 0.06543, params: {'gamma': 0.05, 'subsample': 0.9, 'colsample_bytree': 0.8},
  mean: 0.54511, std: 0.06684, params: {'gamma': 0.1, 'subsample': 0.6, 'colsample_bytree': 0.8},
  mean: 0.54102, std: 0.06112, params: {'gamma': 0.1, 'subsample': 0.7, 'colsample_bytree': 0.8},
  mean: 0.54342, std: 0.05979, params: {'gamma': 0.1, 'subsample': 0.75, 'colsample_bytree': 0.8},
  mean: 0.53885, std: 0.06316, params: {'gamma': 0.1, 'subsample': 0.8, 'colsample_bytree': 0.8},
  mean: 0.53716, std: 0.06587, params: {'gamma': 0.1, 'subsample': 0.85, 'colsample_bytree': 0.8},
  mean: 0.53981, std: 0.06602, params: {'gamma': 0.1, 'subsample': 0.9, 'colsample_bytree': 0.8},
  mean: 0.54534, std: 0.06576, params: {'gamma': 0.15, 'subsample': 0.6, 'colsample_bytree': 0.8},
  mean: 0.54198, std: 0.06113, params: {'gamma': 0.15, 'subsample': 0.7, 'colsample_bytree': 0.8},
  mean: 0.54246, std: 0.06235, params: {'gamma': 0.15, 'subsample': 0.75, 'colsample_bytree': 0.8},
  mean: 0.53739, std: 0.06376, params: {'gamma': 0.15, 'subsample': 0.8, 'colsample_bytree': 0.8},
  mean: 0.53644, std: 0.06606, params: {'gamma': 0.15, 'subsample': 0.85, 'colsample_bytree': 0.8},
  mean: 0.53692, std: 0.06356, params: {'gamma': 0.15, 'subsample': 0.9, 'colsample_bytree': 0.8},
  mean: 0.54390, std: 0.06508, params: {'gamma': 0.2, 'subsample': 0.6, 'colsample_bytree': 0.8},
  mean: 0.54246, std: 0.06027, params: {'gamma': 0.2, 'subsample': 0.7, 'colsample_bytree': 0.8},
  mean: 0.54366, std: 0.06093, params: {'gamma': 0.2, 'subsample': 0.75, 'colsample_bytree': 0.8},
  mean: 0.53668, std: 0.06195, params: {'gamma': 0.2, 'subsample': 0.8, 'colsample_bytree': 0.8},
  mean: 0.53716, std: 0.06672, params: {'gamma': 0.2, 'subsample': 0.85, 'colsample_bytree': 0.8},
  mean: 0.53812, std: 0.06392, params: {'gamma': 0.2, 'subsample': 0.9, 'colsample_bytree': 0.8},
  mean: 0.54511, std: 0.06492, params: {'gamma': 0.3, 'subsample': 0.6, 'colsample_bytree': 0.8},
  mean: 0.54270, std: 0.06238, params: {'gamma': 0.3, 'subsample': 0.7, 'colsample_bytree': 0.8},
  mean: 0.54198, std: 0.05905, params: {'gamma': 0.3, 'subsample': 0.75, 'colsample_bytree': 0.8},
  mean: 0.53980, std: 0.06639, params: {'gamma': 0.3, 'subsample': 0.8, 'colsample_bytree': 0.8},
  mean: 0.53933, std: 0.06534, params: {'gamma': 0.3, 'subsample': 0.85, 'colsample_bytree': 0.8},
  mean: 0.53426, std: 0.06530, params: {'gamma': 0.3, 'subsample': 0.9, 'colsample_bytree': 0.8},
  mean: 0.54679, std: 0.06668, params: {'gamma': 0.4, 'subsample': 0.6, 'colsample_bytree': 0.8},
  mean: 0.54053, std: 0.06293, params: {'gamma': 0.4, 'subsample': 0.7, 'colsample_bytree': 0.8},
  mean: 0.54487, std: 0.05863, params: {'gamma': 0.4, 'subsample': 0.75, 'colsample_bytree': 0.8},
  mean: 0.53716, std: 0.06329, params: {'gamma': 0.4, 'subsample': 0.8, 'colsample_bytree': 0.8},
  mean: 0.54078, std: 0.06667, params: {'gamma': 0.4, 'subsample': 0.85, 'colsample_bytree': 0.8},
  mean: 0.53667, std: 0.06667, params: {'gamma': 0.4, 'subsample': 0.9, 'colsample_bytree': 0.8},
  mean: 0.53667, std: 0.06978, params: {'gamma': 0, 'subsample': 0.6, 'colsample_bytree': 0.9},
  mean: 0.53451, std: 0.06471, params: {'gamma': 0, 'subsample': 0.7, 'colsample_bytree': 0.9},
  mean: 0.53885, std: 0.06589, params: {'gamma': 0, 'subsample': 0.75, 'colsample_bytree': 0.9},
  mean: 0.53619, std: 0.06430, params: {'gamma': 0, 'subsample': 0.8, 'colsample_bytree': 0.9},
  mean: 0.53450, std: 0.06663, params: {'gamma': 0, 'subsample': 0.85, 'colsample_bytree': 0.9},
  mean: 0.53305, std: 0.06511, params: {'gamma': 0, 'subsample': 0.9, 'colsample_bytree': 0.9},
  mean: 0.53691, std: 0.06954, params: {'gamma': 0.05, 'subsample': 0.6, 'colsample_bytree': 0.9},
  mean: 0.53548, std: 0.06507, params: {'gamma': 0.05, 'subsample': 0.7, 'colsample_bytree': 0.9},
  mean: 0.54077, std: 0.06506, params: {'gamma': 0.05, 'subsample': 0.75, 'colsample_bytree': 0.9},
  mean: 0.53644, std: 0.06264, params: {'gamma': 0.05, 'subsample': 0.8, 'colsample_bytree': 0.9},
  mean: 0.53596, std: 0.06432, params: {'gamma': 0.05, 'subsample': 0.85, 'colsample_bytree': 0.9},
  mean: 0.53378, std: 0.06318, params: {'gamma': 0.05, 'subsample': 0.9, 'colsample_bytree': 0.9},
  mean: 0.53667, std: 0.06909, params: {'gamma': 0.1, 'subsample': 0.6, 'colsample_bytree': 0.9},
  mean: 0.53621, std: 0.06409, params: {'gamma': 0.1, 'subsample': 0.7, 'colsample_bytree': 0.9},
  mean: 0.53885, std: 0.06631, params: {'gamma': 0.1, 'subsample': 0.75, 'colsample_bytree': 0.9},
  mean: 0.53427, std: 0.06157, params: {'gamma': 0.1, 'subsample': 0.8, 'colsample_bytree': 0.9},
  mean: 0.53595, std: 0.06377, params: {'gamma': 0.1, 'subsample': 0.85, 'colsample_bytree': 0.9},
  mean: 0.53402, std: 0.06465, params: {'gamma': 0.1, 'subsample': 0.9, 'colsample_bytree': 0.9},
  mean: 0.53451, std: 0.06751, params: {'gamma': 0.15, 'subsample': 0.6, 'colsample_bytree': 0.9},
  mean: 0.53765, std: 0.06604, params: {'gamma': 0.15, 'subsample': 0.7, 'colsample_bytree': 0.9},
  mean: 0.54076, std: 0.06835, params: {'gamma': 0.15, 'subsample': 0.75, 'colsample_bytree': 0.9},
  mean: 0.53258, std: 0.06252, params: {'gamma': 0.15, 'subsample': 0.8, 'colsample_bytree': 0.9},
  mean: 0.53620, std: 0.06548, params: {'gamma': 0.15, 'subsample': 0.85, 'colsample_bytree': 0.9},
  mean: 0.52969, std: 0.06371, params: {'gamma': 0.15, 'subsample': 0.9, 'colsample_bytree': 0.9},
  mean: 0.53692, std: 0.06788, params: {'gamma': 0.2, 'subsample': 0.6, 'colsample_bytree': 0.9},
  mean: 0.53596, std: 0.06364, params: {'gamma': 0.2, 'subsample': 0.7, 'colsample_bytree': 0.9},
  mean: 0.54221, std: 0.06852, params: {'gamma': 0.2, 'subsample': 0.75, 'colsample_bytree': 0.9},
  mean: 0.53548, std: 0.06142, params: {'gamma': 0.2, 'subsample': 0.8, 'colsample_bytree': 0.9},
  mean: 0.53451, std: 0.06439, params: {'gamma': 0.2, 'subsample': 0.85, 'colsample_bytree': 0.9},
  mean: 0.53355, std: 0.06223, params: {'gamma': 0.2, 'subsample': 0.9, 'colsample_bytree': 0.9},
  mean: 0.53716, std: 0.06680, params: {'gamma': 0.3, 'subsample': 0.6, 'colsample_bytree': 0.9},
  mean: 0.53596, std: 0.06460, params: {'gamma': 0.3, 'subsample': 0.7, 'colsample_bytree': 0.9},
  mean: 0.53619, std: 0.06968, params: {'gamma': 0.3, 'subsample': 0.75, 'colsample_bytree': 0.9},
  mean: 0.53596, std: 0.06236, params: {'gamma': 0.3, 'subsample': 0.8, 'colsample_bytree': 0.9},
  mean: 0.53716, std: 0.06634, params: {'gamma': 0.3, 'subsample': 0.85, 'colsample_bytree': 0.9},
  mean: 0.53211, std: 0.06159, params: {'gamma': 0.3, 'subsample': 0.9, 'colsample_bytree': 0.9},
  mean: 0.53619, std: 0.06868, params: {'gamma': 0.4, 'subsample': 0.6, 'colsample_bytree': 0.9},
  mean: 0.53548, std: 0.06532, params: {'gamma': 0.4, 'subsample': 0.7, 'colsample_bytree': 0.9},
  mean: 0.53812, std: 0.06628, params: {'gamma': 0.4, 'subsample': 0.75, 'colsample_bytree': 0.9},
  mean: 0.53548, std: 0.06376, params: {'gamma': 0.4, 'subsample': 0.8, 'colsample_bytree': 0.9},
  mean: 0.53620, std: 0.06377, params: {'gamma': 0.4, 'subsample': 0.85, 'colsample_bytree': 0.9},
  mean: 0.53499, std: 0.06242, params: {'gamma': 0.4, 'subsample': 0.9, 'colsample_bytree': 0.9}],
 {'colsample_bytree': 0.8, 'gamma': 0.05, 'subsample': 0.6},
 0.54775322798111314)

In [33]:
gs3.best_estimator_


Out[33]:
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=0.8,
       gamma=0.05, learning_rate=0.05, max_delta_step=0, max_depth=7,
       min_child_weight=7, missing=None, n_estimators=60, nthread=4,
       objective='multi:softprob', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=123, silent=True, subsample=0.6)

In [34]:
modelfit(gs3.best_estimator_,X_train,Y_train)


Model Report
Accuracy : 0.7857
F1 score (Train) : 0.785732
CV Score : Mean - 0.5477532 | Std - 0.06775601 | Min - 0.4311594 | Max - 0.6228916

Step 5: Tuning Regularization Parameters


In [35]:
param_test4={
    'reg_alpha':[0, 1e-5, 1e-2, 0.1, 0.2],
    'reg_lambda':[0, .25,.5,.75,.1]
}

gs4 = GridSearchCV(gs3.best_estimator_,param_grid=param_test4, 
                   scoring='accuracy', n_jobs=4,iid=False, cv=skf)
gs4.fit(X_train, Y_train)
gs4.grid_scores_, gs4.best_params_,gs4.best_score_


/Users/littleni/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_search.py:667: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
Out[35]:
([mean: 0.54246, std: 0.06716, params: {'reg_lambda': 0, 'reg_alpha': 0},
  mean: 0.54318, std: 0.06594, params: {'reg_lambda': 0.25, 'reg_alpha': 0},
  mean: 0.54294, std: 0.06187, params: {'reg_lambda': 0.5, 'reg_alpha': 0},
  mean: 0.54535, std: 0.06315, params: {'reg_lambda': 0.75, 'reg_alpha': 0},
  mean: 0.54367, std: 0.06458, params: {'reg_lambda': 0.1, 'reg_alpha': 0},
  mean: 0.54174, std: 0.06755, params: {'reg_lambda': 0, 'reg_alpha': 1e-05},
  mean: 0.54342, std: 0.06608, params: {'reg_lambda': 0.25, 'reg_alpha': 1e-05},
  mean: 0.54294, std: 0.06187, params: {'reg_lambda': 0.5, 'reg_alpha': 1e-05},
  mean: 0.54535, std: 0.06315, params: {'reg_lambda': 0.75, 'reg_alpha': 1e-05},
  mean: 0.54391, std: 0.06471, params: {'reg_lambda': 0.1, 'reg_alpha': 1e-05},
  mean: 0.54246, std: 0.06682, params: {'reg_lambda': 0, 'reg_alpha': 0.01},
  mean: 0.54222, std: 0.06460, params: {'reg_lambda': 0.25, 'reg_alpha': 0.01},
  mean: 0.54366, std: 0.06576, params: {'reg_lambda': 0.5, 'reg_alpha': 0.01},
  mean: 0.54318, std: 0.06475, params: {'reg_lambda': 0.75, 'reg_alpha': 0.01},
  mean: 0.54367, std: 0.06506, params: {'reg_lambda': 0.1, 'reg_alpha': 0.01},
  mean: 0.54536, std: 0.06266, params: {'reg_lambda': 0, 'reg_alpha': 0.1},
  mean: 0.54367, std: 0.06438, params: {'reg_lambda': 0.25, 'reg_alpha': 0.1},
  mean: 0.54245, std: 0.06696, params: {'reg_lambda': 0.5, 'reg_alpha': 0.1},
  mean: 0.54559, std: 0.06477, params: {'reg_lambda': 0.75, 'reg_alpha': 0.1},
  mean: 0.54343, std: 0.06743, params: {'reg_lambda': 0.1, 'reg_alpha': 0.1},
  mean: 0.54584, std: 0.06484, params: {'reg_lambda': 0, 'reg_alpha': 0.2},
  mean: 0.54511, std: 0.06708, params: {'reg_lambda': 0.25, 'reg_alpha': 0.2},
  mean: 0.54317, std: 0.06781, params: {'reg_lambda': 0.5, 'reg_alpha': 0.2},
  mean: 0.54800, std: 0.06599, params: {'reg_lambda': 0.75, 'reg_alpha': 0.2},
  mean: 0.54487, std: 0.06645, params: {'reg_lambda': 0.1, 'reg_alpha': 0.2}],
 {'reg_alpha': 0.2, 'reg_lambda': 0.75},
 0.54800375464834028)

In [36]:
modelfit(gs4.best_estimator_,X_train, Y_train)


Model Report
Accuracy : 0.7831
F1 score (Train) : 0.783080
CV Score : Mean - 0.5480038 | Std - 0.06598781 | Min - 0.4371981 | Max - 0.6240964

In [37]:
gs4.best_estimator_


Out[37]:
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=0.8,
       gamma=0.05, learning_rate=0.05, max_delta_step=0, max_depth=7,
       min_child_weight=7, missing=None, n_estimators=60, nthread=4,
       objective='multi:softprob', reg_alpha=0.2, reg_lambda=0.75,
       scale_pos_weight=1, seed=123, silent=True, subsample=0.6)

In [39]:
param_test5={
    'reg_alpha':[.15,0.2,.25,.3,.4],
}

gs5 = GridSearchCV(gs4.best_estimator_,param_grid=param_test5, 
                   scoring='accuracy', n_jobs=4,iid=False, cv=skf)
gs5.fit(X_train, Y_train)
gs5.grid_scores_, gs5.best_params_,gs5.best_score_


/Users/littleni/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_search.py:667: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
Out[39]:
([mean: 0.54607, std: 0.06392, params: {'reg_alpha': 0.15},
  mean: 0.54800, std: 0.06599, params: {'reg_alpha': 0.2},
  mean: 0.54631, std: 0.06559, params: {'reg_alpha': 0.25},
  mean: 0.54655, std: 0.06266, params: {'reg_alpha': 0.3},
  mean: 0.54559, std: 0.06224, params: {'reg_alpha': 0.4}],
 {'reg_alpha': 0.2},
 0.54800375464834028)

In [42]:
modelfit(gs5.best_estimator_, X_train, Y_train)


Model Report
Accuracy : 0.7831
F1 score (Train) : 0.783080
CV Score : Mean - 0.5480038 | Std - 0.06598781 | Min - 0.4371981 | Max - 0.6240964

In [43]:
gs5.best_estimator_


Out[43]:
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=0.8,
       gamma=0.05, learning_rate=0.05, max_delta_step=0, max_depth=7,
       min_child_weight=7, missing=None, n_estimators=60, nthread=4,
       objective='multi:softprob', reg_alpha=0.2, reg_lambda=0.75,
       scale_pos_weight=1, seed=123, silent=True, subsample=0.6)

Step 6: Reducing Learning Rate


In [44]:
xgb4 = XGBClassifier(
    learning_rate = 0.025,
    n_estimators=120,
    max_depth=7,
    min_child_weight=7,
    gamma = 0.05,
    subsample=0.6,
    colsample_bytree=0.8,
    reg_alpha=0.2,
    reg_lambda =0.75,
    objective='multi:softmax',
    nthread =4,
    seed = seed,
)
modelfit(xgb4,X_train, Y_train)


Model Report
Accuracy : 0.784
F1 score (Train) : 0.784044
CV Score : Mean - 0.5410048 | Std - 0.06517853 | Min - 0.4311594 | Max - 0.6120482

In [47]:
xgb5 = XGBClassifier(
    learning_rate = 0.00625,
    n_estimators=480,
    max_depth=7,
    min_child_weight=7,
    gamma = 0.05,
    subsample=0.6,
    colsample_bytree=0.8,
    reg_alpha=0.2,
    reg_lambda =0.75,
    objective='multi:softmax',
    nthread =4,
    seed = seed,
)
modelfit(xgb5,X_train, Y_train)


Model Report
Accuracy : 0.7862
F1 score (Train) : 0.786214
CV Score : Mean - 0.5431752 | Std - 0.06553735 | Min - 0.4323671 | Max - 0.6144578

Next we use our tuned final model to do cross validation on the training data set. One of the wells will be used as test data and the rest will be the training data. Each iteration, a different well is chosen.


In [48]:
# Load data 
filename = './facies_vectors.csv'
data = pd.read_csv(filename)

# Change to category data type
data['Well Name'] = data['Well Name'].astype('category')
data['Formation'] = data['Formation'].astype('category')

X_train = data.drop(['Facies', 'Formation','Depth'], axis = 1 ) 
X_train_nowell = X_train.drop(['Well Name'], axis=1)
Y_train = data['Facies' ] - 1

# Final recommended model based on the extensive parameters search
model_final = gs5.best_estimator_
model_final.fit( X_train_nowell , Y_train , eval_metric = 'merror' )


Out[48]:
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=0.8,
       gamma=0.05, learning_rate=0.05, max_delta_step=0, max_depth=7,
       min_child_weight=7, missing=None, n_estimators=60, nthread=4,
       objective='multi:softprob', reg_alpha=0.2, reg_lambda=0.75,
       scale_pos_weight=1, seed=123, silent=True, subsample=0.6)

In [49]:
# Leave one well out for cross validation 
well_names = data['Well Name'].unique()
f1=[]
for i in range(len(well_names)):
    
    # Split data for training and testing

    
    train_X = X_train[X_train['Well Name'] != well_names[i] ]
    train_Y = Y_train[X_train['Well Name'] != well_names[i] ]
    test_X  = X_train[X_train['Well Name'] == well_names[i] ]
    test_Y  = Y_train[X_train['Well Name'] == well_names[i] ]

    train_X = train_X.drop(['Well Name'], axis = 1 ) 
    test_X = test_X.drop(['Well Name'], axis = 1 )

    
    # Train the model based on training data
    


    # Predict on the test set
    predictions = model_final.predict(test_X)

    # Print report
    print ("\n------------------------------------------------------")
    print ("Validation on the leaving out well " + well_names[i])
    conf = confusion_matrix( test_Y, predictions, labels = np.arange(9) )
    print ("\nModel Report")
    print ("-Accuracy: %.6f" % ( accuracy(conf) ))
    print ("-Adjacent Accuracy: %.6f" % ( accuracy_adjacent(conf, adjacent_facies) ))
    print ("-F1 Score: %.6f" % ( f1_score ( test_Y , predictions , labels = np.arange(9), average = 'weighted' ) ))
    f1.append(f1_score ( test_Y , predictions , labels = np.arange(9), average = 'weighted' ))
    facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS',
                     'WS', 'D','PS', 'BS']
    print ("\nConfusion Matrix Results")
    from classification_utilities import display_cm, display_adj_cm
    display_cm(conf, facies_labels,display_metrics=True, hide_zeros=True)
    
print ("\n------------------------------------------------------")
print ("Final Results")
print ("-Average F1 Score: %6f" % (sum(f1)/(1.0*len(f1))))


------------------------------------------------------
Validation on the leaving out well SHRIMPLIN

Model Report
-Accuracy: 0.861996
-Adjacent Accuracy: 0.978769
-F1 Score: 0.861164

Confusion Matrix Results
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS                                                           0
     CSiS         107    11                                       118
     FSiS          13   110                                       123
     SiSh                      15           1           2          18
       MS                       3    44    10           6          63
       WS                             2    54     1     6          63
        D                                         2     3           5
       PS                             1     4     1    63          69
       BS                                               1    11    12

Precision  0.00  0.89  0.91  0.83  0.94  0.78  0.50  0.78  1.00  0.87
   Recall  0.00  0.91  0.89  0.83  0.70  0.86  0.40  0.91  0.92  0.86
       F1  0.00  0.90  0.90  0.83  0.80  0.82  0.44  0.84  0.96  0.86

------------------------------------------------------
Validation on the leaving out well ALEXANDER D

Model Report
-Accuracy: 0.793991
-Adjacent Accuracy: 0.933476
-F1 Score: 0.783970

Confusion Matrix Results
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS                                                           0
     CSiS         105    12                                       117
     FSiS           9    82                                        91
     SiSh                      40           1     2     1          44
       MS                       2    11     6           7          26
       WS                      11     3    32     4    19          69
        D                                   1    13     2          16
       PS                       7           4     1    86          98
       BS                                   2     2           1     5

Precision  0.00  0.92  0.87  0.67  0.79  0.70  0.59  0.75  1.00  0.80
   Recall  0.00  0.90  0.90  0.91  0.42  0.46  0.81  0.88  0.20  0.79
       F1  0.00  0.91  0.89  0.77  0.55  0.56  0.68  0.81  0.33  0.78

------------------------------------------------------
Validation on the leaving out well SHANKLE

Model Report
-Accuracy: 0.839644
-Adjacent Accuracy: 0.986637
-F1 Score: 0.841908

Confusion Matrix Results
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS    61    28                                              89
     CSiS     2    80     7                                        89
     FSiS          13   104                                       117
     SiSh                       5                       2           7
       MS                            15     2     1     1          19
       WS                 1     1          60     1     8          71
        D                                        17                17
       PS                                   5          35          40
       BS                                                           0

Precision  0.97  0.66  0.93  0.83  1.00  0.90  0.89  0.76  0.00  0.86
   Recall  0.69  0.90  0.89  0.71  0.79  0.85  1.00  0.88  0.00  0.84
       F1  0.80  0.76  0.91  0.77  0.88  0.87  0.94  0.81  0.00  0.84

------------------------------------------------------
Validation on the leaving out well LUKE G U

Model Report
-Accuracy: 0.811280
-Adjacent Accuracy: 0.969631
-F1 Score: 0.820805

Confusion Matrix Results
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS                                                           0
     CSiS     3   102    12                                       117
     FSiS          22   107                                       129
     SiSh                      31     1     3                      35
       MS                                   1           1           2
       WS                       4     6    65           8     1    84
        D                                   1    13     5     1    20
       PS                       3     1    14          56          74
       BS                                                           0

Precision  0.00  0.82  0.90  0.82  0.00  0.77  1.00  0.80  0.00  0.84
   Recall  0.00  0.87  0.83  0.89  0.00  0.77  0.65  0.76  0.00  0.81
       F1  0.00  0.85  0.86  0.85  0.00  0.77  0.79  0.78  0.00  0.82

------------------------------------------------------
Validation on the leaving out well KIMZEY A

Model Report
-Accuracy: 0.760820
-Adjacent Accuracy: 0.933941
-F1 Score: 0.742198

Confusion Matrix Results
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS           5     4                                         9
     CSiS          79     6                                        85
     FSiS          17    57                                        74
     SiSh                      40     2     1                      43
       MS                       2    27    11          13          53
       WS                 1     3     1    34     3     9          51
        D                       3           1    21     2          27
       PS                             3     7     4    76          90
       BS                                   1           6           7

Precision  0.00  0.78  0.84  0.83  0.82  0.62  0.75  0.72  0.00  0.74
   Recall  0.00  0.93  0.77  0.93  0.51  0.67  0.78  0.84  0.00  0.76
       F1  0.00  0.85  0.80  0.88  0.63  0.64  0.76  0.78  0.00  0.74

------------------------------------------------------
Validation on the leaving out well CROSS H CATTLE

Model Report
-Accuracy: 0.702595
-Adjacent Accuracy: 0.926148
-F1 Score: 0.709076

Confusion Matrix Results
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   102    49     7                                       158
     CSiS     7   107    27                             1         142
     FSiS           5    39     1                       2          47
     SiSh           1     3    13           7           1          25
       MS           4     3          12     7           2          28
       WS                       1          25           5          31
        D                                         1     1           2
       PS                 4                 9     2    53          68
       BS                                                           0

Precision  0.94  0.64  0.47  0.87  1.00  0.52  0.33  0.82  0.00  0.77
   Recall  0.65  0.75  0.83  0.52  0.43  0.81  0.50  0.78  0.00  0.70
       F1  0.76  0.69  0.60  0.65  0.60  0.63  0.40  0.80  0.00  0.71

------------------------------------------------------
Validation on the leaving out well NOLAN

Model Report
-Accuracy: 0.756627
-Adjacent Accuracy: 0.913253
-F1 Score: 0.752754

Confusion Matrix Results
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS           4                                               4
     CSiS     2   109     6                             1         118
     FSiS          16    49                 1           2          68
     SiSh                 1    19     1     5     1     1          28
       MS           2     1          19    12     1    12          47
       WS     1                            15          14          30
        D                                         3     1           4
       PS                 5     2           7         100     2   116
       BS                                                           0

Precision  0.00  0.83  0.79  0.90  0.95  0.38  0.60  0.76  0.00  0.78
   Recall  0.00  0.92  0.72  0.68  0.40  0.50  0.75  0.86  0.00  0.76
       F1  0.00  0.88  0.75  0.78  0.57  0.43  0.67  0.81  0.00  0.75
/Users/littleni/anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/littleni/anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1115: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no true samples.
  'recall', 'true', average, warn_for)
------------------------------------------------------
Validation on the leaving out well Recruit F9

Model Report
-Accuracy: 0.712500
-Adjacent Accuracy: 0.950000
-F1 Score: 0.832117

Confusion Matrix Results
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS                                                           0
     CSiS                                                           0
     FSiS                                                           0
     SiSh                                                           0
       MS                                                           0
       WS                                                           0
        D                                                           0
       PS                                                           0
       BS                                   4     4    15    57    80

Precision  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00  1.00
   Recall  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.71  0.71
       F1  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.83  0.83

------------------------------------------------------
Validation on the leaving out well NEWBY

Model Report
-Accuracy: 0.771058
-Adjacent Accuracy: 0.950324
-F1 Score: 0.764345

Confusion Matrix Results
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS                                                           0
     CSiS          90     8                                        98
     FSiS          25    55                                        80
     SiSh                      50     1     3           4          58
       MS                 2     2     8    11     1     4          28
       WS                       2     1    70          21     2    96
        D                                        15     1          16
       PS                                   4          52          56
       BS                                   5           9    17    31

Precision  0.00  0.78  0.85  0.93  0.80  0.75  0.94  0.57  0.89  0.79
   Recall  0.00  0.92  0.69  0.86  0.29  0.73  0.94  0.93  0.55  0.77
       F1  0.00  0.85  0.76  0.89  0.42  0.74  0.94  0.71  0.68  0.76

------------------------------------------------------
Validation on the leaving out well CHURCHMAN BIBLE

Model Report
-Accuracy: 0.762376
-Adjacent Accuracy: 0.943069
-F1 Score: 0.750731

Confusion Matrix Results
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS     1     7                                               8
     CSiS          41    14                 1                      56
     FSiS           6    44                             1          51
     SiSh                 1    10           2                      13
       MS                 1     2     9    13           5          30
       WS                       4          79           4          87
        D                 1     1           3    22     7          34
       PS                 3     1     2    12          57          75
       BS                                         2     3    45    50

Precision  1.00  0.76  0.69  0.56  0.82  0.72  0.92  0.74  1.00  0.78
   Recall  0.12  0.73  0.86  0.77  0.30  0.91  0.65  0.76  0.90  0.76
       F1  0.22  0.75  0.77  0.65  0.44  0.80  0.76  0.75  0.95  0.75

------------------------------------------------------
Final Results
-Average F1 Score: 0.785907

Use final model to predict the given test data set


In [50]:
# Load test data
test_data = pd.read_csv('validation_data_nofacies.csv')
test_data['Well Name'] = test_data['Well Name'].astype('category')
X_test = test_data.drop(['Formation', 'Well Name', 'Depth'], axis=1)
# Predict facies of unclassified data
Y_predicted = model_final.predict(X_test)
test_data['Facies'] = Y_predicted + 1
# Store the prediction
test_data.to_csv('Prediction4.csv')

In [51]:
test_data[test_data['Well Name']=='STUART'].head()


Out[51]:
Formation Well Name Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS Facies
0 A1 SH STUART 2808.0 66.276 0.630 3.3 10.65 3.591 1 1.000 2
1 A1 SH STUART 2808.5 77.252 0.585 6.5 11.95 3.341 1 0.978 3
2 A1 SH STUART 2809.0 82.899 0.566 9.4 13.60 3.064 1 0.956 2
3 A1 SH STUART 2809.5 80.671 0.593 9.5 13.25 2.977 1 0.933 2
4 A1 SH STUART 2810.0 75.971 0.638 8.7 12.35 3.020 1 0.911 2

In [52]:
test_data[test_data['Well Name']=='CRAWFORD'].head()


Out[52]:
Formation Well Name Depth GR ILD_log10 DeltaPHI PHIND PE NM_M RELPOS Facies
474 A1 LM CRAWFORD 2972.5 49.675 0.845 3.905 11.175 3.265 2 1.000 8
475 A1 LM CRAWFORD 2973.0 34.435 0.879 3.085 8.175 3.831 2 0.991 8
476 A1 LM CRAWFORD 2973.5 26.178 0.920 2.615 4.945 4.306 2 0.981 8
477 A1 LM CRAWFORD 2974.0 19.463 0.967 0.820 3.820 4.578 2 0.972 8
478 A1 LM CRAWFORD 2974.5 19.260 0.995 0.320 3.630 4.643 2 0.962 8

In [57]:
def make_facies_log_plot(logs, facies_colors):
    #make sure logs are sorted by depth
    logs = logs.sort_values(by='Depth')
    cmap_facies = colors.ListedColormap(
            facies_colors[0:len(facies_colors)], 'indexed')
    
    ztop=logs.Depth.min(); zbot=logs.Depth.max()
    
    cluster=np.repeat(np.expand_dims(logs['Facies'].values,1), 100, 1)
    
    f, ax = plt.subplots(nrows=1, ncols=6, figsize=(8, 12))
    ax[0].plot(logs.GR, logs.Depth, '-g')
    ax[1].plot(logs.ILD_log10, logs.Depth, '-')
    ax[2].plot(logs.DeltaPHI, logs.Depth, '-', color='0.5')
    ax[3].plot(logs.PHIND, logs.Depth, '-', color='r')
    ax[4].plot(logs.PE, logs.Depth, '-', color='black')
    im=ax[5].imshow(cluster, interpolation='none', aspect='auto',
                    cmap=cmap_facies,vmin=1,vmax=9)
    
    divider = make_axes_locatable(ax[5])
    cax = divider.append_axes("right", size="20%", pad=0.05)
    cbar=plt.colorbar(im, cax=cax)
    cbar.set_label((17*' ').join([' SS ', 'CSiS', 'FSiS', 
                                'SiSh', ' MS ', ' WS ', ' D  ', 
                                ' PS ', ' BS ']))
    cbar.set_ticks(range(0,1)); cbar.set_ticklabels('')
    
    for i in range(len(ax)-1):
        ax[i].set_ylim(ztop,zbot)
        ax[i].invert_yaxis()
        ax[i].grid()
        ax[i].locator_params(axis='x', nbins=3)
    
    ax[0].set_xlabel("GR")
    ax[0].set_xlim(logs.GR.min(),logs.GR.max())
    ax[1].set_xlabel("ILD_log10")
    ax[1].set_xlim(logs.ILD_log10.min(),logs.ILD_log10.max())
    ax[2].set_xlabel("DeltaPHI")
    ax[2].set_xlim(logs.DeltaPHI.min(),logs.DeltaPHI.max())
    ax[3].set_xlabel("PHIND")
    ax[3].set_xlim(logs.PHIND.min(),logs.PHIND.max())
    ax[4].set_xlabel("PE")
    ax[4].set_xlim(logs.PE.min(),logs.PE.max())
    ax[5].set_xlabel('Facies')
    
    ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([])
    ax[4].set_yticklabels([]); ax[5].set_yticklabels([])
    ax[5].set_xticklabels([])
    f.suptitle('Well: %s'%logs.iloc[0]['Well Name'], fontsize=14,y=0.94)

In [56]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from mpl_toolkits.axes_grid1 import make_axes_locatable

In [58]:
make_facies_log_plot(
    test_data[test_data['Well Name'] == 'STUART'],
    facies_colors)