In this notebook, we mainly use extreme gradient boosting (XGBoost) to improve the prediction model originally proposed in the TLE November 2016 machine learning tutorial. XGBoost can be viewed as an enhanced version of gradient boosting: it uses a more regularized model formulation to control over-fitting, and it usually performs better. Applications of XGBoost can be found in many Kaggle competitions, and a number of good tutorials are available online.
Our work is organized in the following order:
• Background
• Exploratory Data Analysis
• Data Preparation and Model Selection
• Final Results
The dataset we will use comes from a class exercise from The University of Kansas on Neural Networks and Fuzzy Systems. This exercise is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more information on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).
The dataset consists of log data from nine wells that have been labeled with a facies type based on observation of core. We will use this log data to train a classifier to predict facies types.
This data is from the Council Grove gas reservoir in Southwest Kansas. The Panoma Council Grove Field is predominantly a carbonate gas reservoir encompassing 2700 square miles in southwestern Kansas. The dataset comprises nine wells (4149 examples), with a set of seven predictor variables and a rock facies (class) for each example vector, plus validation (test) data (830 examples from two wells) containing the same seven predictor variables in the feature vector. Facies are based on examination of cores from the nine wells, taken vertically at half-foot intervals. The predictor variables include five wireline log measurements and two geologic constraining variables derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot rate.
The seven predictor variables are:
• Five wireline log curves: gamma ray (GR), resistivity (ILD_log10), photoelectric effect (PE), neutron-density porosity difference (DeltaPHI) and average neutron-density porosity (PHIND). Note that some wells do not have PE (one simple way to handle the missing values is sketched below).
• Two geologic constraining variables: a nonmarine-marine indicator (NM_M) and relative position (RELPOS).
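Since PE is missing in some wells, any workflow that keeps all five log curves needs a strategy for those gaps. The following is only a minimal sketch, not part of the original tutorial: mean imputation is used purely for illustration, and dropping the affected rows or using a regression-based imputation are common alternatives.

import pandas as pd

# Count missing PE values and fill them with the mean PE of the training data.
df = pd.read_csv('./facies_vectors.csv')
print("Rows with missing PE:", df['PE'].isnull().sum())
df['PE'] = df['PE'].fillna(df['PE'].mean())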
The nine discrete facies (classes of rocks) are:
1. Nonmarine sandstone
2. Nonmarine coarse siltstone
3. Nonmarine fine siltstone
4. Marine siltstone and shale
5. Mudstone (limestone)
6. Wackestone (limestone)
7. Dolomite
8. Packstone-grainstone (limestone)
9. Phylloid-algal bafflestone (limestone)
These facies are not truly discrete; they gradually blend into one another, and some have neighboring facies that are rather similar. Mislabeling between such neighbors can be expected to occur. The following table lists the facies, their abbreviated labels, and their approximate neighbors.
Facies  Label  Adjacent Facies
1       SS     2
2       CSiS   1, 3
3       FSiS   2
4       SiSh   5
5       MS     4, 6
6       WS     5, 7, 8
7       D      6, 8
8       PS     6, 7, 9
9       BS     7, 8
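To use these adjacency relations in code, the table can be expressed as a lookup from each facies number to its neighbors. This is just a sketch; the notebook later defines an equivalent zero-indexed adjacent_facies array for the adjacent-accuracy metric.

# Adjacent facies from the table above, keyed by the 1-based facies number.
adjacent = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4, 6],
            6: [5, 7, 8], 7: [6, 8], 8: [6, 7, 9], 9: [7, 8]}

def is_adjacent_correct(true_facies, predicted_facies):
    # A prediction counts as adjacent-correct if it matches the true facies
    # or one of its neighbors.
    return predicted_facies == true_facies or predicted_facies in adjacent[true_facies]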
After the background introduction, we import the pandas library for basic data analysis and manipulation. The matplotlib and seaborn libraries are imported for data visualization.
In [1]:
%matplotlib inline
import pandas as pd
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in newer pandas versions
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import matplotlib.colors as colors
import xgboost as xgb
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score
from classification_utilities import display_cm, display_adj_cm
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import validation_curve
from sklearn.datasets import load_svmlight_files
from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import make_classification
from xgboost.sklearn import XGBClassifier
from scipy.sparse import vstack
seed = 123
np.random.seed(seed)
In [2]:
import pandas as pd
filename = './facies_vectors.csv'
training_data = pd.read_csv(filename)
training_data.head(10)
Out[2]:
In [3]:
training_data['Well Name'] = training_data['Well Name'].astype('category')
training_data['Formation'] = training_data['Formation'].astype('category')
training_data.info()
In [4]:
training_data.describe()
Out[4]:
In [5]:
facies_colors = ['#F4D03F', '#F5B041','#DC7633','#6E2C00','#1B4F72',
'#2E86C1', '#AED6F1', '#A569BD', '#196F3D']
facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS','WS', 'D','PS', 'BS']
facies_counts = training_data['Facies'].value_counts().sort_index()
facies_counts.index = facies_labels
facies_counts.plot(kind='bar',color=facies_colors,title='Distribution of Training Data by Facies')
Out[5]:
In [6]:
sns.heatmap(training_data.corr(), vmax=1.0, square=True)
Out[6]:
Now we are ready to test the XGBoost approach. We will use the confusion matrix and the F1 score (imported above) as classification metrics, together with GridSearchCV, which is an excellent tool for parameter optimization.
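As a quick illustration of these two metrics on made-up labels (not the facies data): the confusion matrix counts each true/predicted class pair, and the weighted F1 score averages the per-class F1 weighted by class frequency.

from sklearn.metrics import confusion_matrix, f1_score

y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 1]
print(confusion_matrix(y_true, y_pred))              # rows = true class, columns = predicted class
print(f1_score(y_true, y_pred, average='weighted'))  # per-class F1 weighted by class support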
In [7]:
import xgboost as xgb
X_train = training_data.drop(['Facies', 'Well Name','Formation','Depth'], axis = 1 )
Y_train = training_data['Facies' ] - 1
dtrain = xgb.DMatrix(X_train, Y_train)
In [9]:
train = X_train.copy()
In [10]:
train['Facies']=Y_train
In [11]:
train.head()
Out[11]:
The accuracy and accuracy_adjacent functions are defined below to quantify the prediction correctness.
In [12]:
def accuracy(conf):
    total_correct = 0.
    nb_classes = conf.shape[0]
    for i in np.arange(0, nb_classes):
        total_correct += conf[i][i]
    acc = total_correct / sum(sum(conf))
    return acc

# Zero-indexed adjacent facies; dtype=object is needed for the ragged list on newer numpy.
adjacent_facies = np.array([[1], [0, 2], [1], [4], [3, 5], [4, 6, 7], [5, 7], [5, 6, 8], [6, 7]], dtype=object)

def accuracy_adjacent(conf, adjacent_facies):
    nb_classes = conf.shape[0]
    total_correct = 0.
    for i in np.arange(0, nb_classes):
        total_correct += conf[i][i]
        for j in adjacent_facies[i]:
            total_correct += conf[i][j]
    return total_correct / sum(sum(conf))
In [13]:
target='Facies'
Before proceeding further, we define a function that will help us create XGBoost models and perform cross-validation.
In [14]:
def modelfit(alg, dtrain, features, useTrainCV=True,
             cv_fold=10, early_stopping_rounds=50):
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgb_param['num_class'] = 9
        xgtrain = xgb.DMatrix(train[features].values, label=train[target].values)
        cvresult = xgb.cv(xgb_param, xgtrain,
                          num_boost_round=alg.get_params()['n_estimators'],
                          nfold=cv_fold, metrics='merror',
                          early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])
    # Fit the algorithm on the data
    alg.fit(dtrain[features], dtrain[target], eval_metric='merror')
    # Predict training set
    dtrain_prediction = alg.predict(dtrain[features])
    dtrain_predprob = alg.predict_proba(dtrain[features])[:, 1]
    # Print model report
    print("\nModel Report")
    print("Accuracy : %.4g" % accuracy_score(dtrain[target], dtrain_prediction))
    print("F1 score (Train) : %f" % f1_score(dtrain[target], dtrain_prediction, average='weighted'))
    # Plot feature importances (newer xgboost versions use alg.get_booster().get_fscore())
    feat_imp = pd.Series(alg.booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')
In [15]:
features =[x for x in X_train.columns]
features
Out[15]:
We are going to perform the following steps:
1. Choose a relatively high learning rate, e.g., 0.1. Usually something between 0.05 and 0.3 works for different problems.
2. Determine the optimum number of trees for this learning rate. XGBoost has a very useful function called cv, which performs cross-validation at each boosting iteration and thus returns the optimum number of trees required (see the short sketch after this list).
3. Tune the tree-based parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree) for the chosen learning rate and number of trees.
4. Tune the regularization parameters (lambda, alpha), which can help reduce model complexity and enhance performance.
5. Lower the learning rate and decide on the optimal parameters.
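The modelfit helper above already wraps this step, but as a minimal standalone sketch of step 2 (assuming train, features and target are defined as above): with early stopping, xgb.cv returns one row per retained boosting round, so the number of rows gives the optimum number of trees.

xgb_param = {'objective': 'multi:softmax', 'num_class': 9, 'eta': 0.1, 'max_depth': 5}
dmat = xgb.DMatrix(train[features].values, label=train[target].values)
cvresult = xgb.cv(xgb_param, dmat, num_boost_round=1000, nfold=10,
                  metrics='merror', early_stopping_rounds=50)
print("Optimum number of trees:", cvresult.shape[0])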
In order to decide on the boosting parameters, we need to set some initial values for the other parameters. Let's take the following values:
1. max_depth = 5
2. min_child_weight = 1
3. gamma = 0
4. subsample, colsample_bytree = 0.8: a commonly used starting value.
5. scale_pos_weight = 1
Please note that all of the above are just initial estimates and will be tuned later. Let's take the default learning rate of 0.1 here and check the optimum number of trees using the cv function of XGBoost. The function defined above will do it for us.
In [15]:
from xgboost import XGBClassifier
xgb1 = XGBClassifier(
learning_rate = 0.1,
n_estimators=1000,
max_depth=5,
min_child_weight=1,
gamma = 0,
subsample=0.8,
colsample_bytree=0.8,
objective='multi:softmax',
nthread =4,
seed = 123,
)
In [16]:
modelfit(xgb1, train, features)
In [17]:
xgb1
Out[17]:
In [18]:
from sklearn.model_selection import GridSearchCV
param_test1={
'max_depth':range(3,10,2),
'min_child_weight':range(1,6,2)
}
gs1 = GridSearchCV(xgb1,param_grid=param_test1,
scoring='accuracy', n_jobs=4,iid=False, cv=5)
gs1.fit(train[features],train[target])
gs1.grid_scores_, gs1.best_params_,gs1.best_score_
Out[18]:
In [17]:
param_test2={
'max_depth':[8,9,10],
'min_child_weight':[1,2]
}
gs2 = GridSearchCV(XGBClassifier(colsample_bylevel=1, colsample_bytree=0.8,
gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=5,
min_child_weight=1, n_estimators=290, nthread=4,
objective='multi:softprob', reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=123,subsample=0.8),param_grid=param_test2,
scoring='accuracy', n_jobs=4,iid=False, cv=5)
gs2.fit(train[features],train[target])
gs2.grid_scores_, gs2.best_params_,gs2.best_score_
Out[17]:
In [18]:
gs2.best_estimator_
Out[18]:
In [19]:
param_test3={
'gamma':[i/10.0 for i in range(0,5)]
}
gs3 = GridSearchCV(XGBClassifier(colsample_bylevel=1, colsample_bytree=0.8,
gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=9,
min_child_weight=1, n_estimators=370, nthread=4,
objective='multi:softprob', reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=123,subsample=0.8),param_grid=param_test3,
scoring='accuracy', n_jobs=4,iid=False, cv=5)
gs3.fit(train[features],train[target])
gs3.grid_scores_, gs3.best_params_,gs3.best_score_
Out[19]:
In [20]:
xgb2 = XGBClassifier(
learning_rate = 0.1,
n_estimators=1000,
max_depth=9,
min_child_weight=1,
gamma = 0.2,
subsample=0.8,
colsample_bytree=0.8,
objective='multi:softmax',
nthread =4,
scale_pos_weight=1,
seed = seed,
)
modelfit(xgb2,train,features)
In [21]:
xgb2
Out[21]:
In [22]:
param_test4={
'subsample':[i/10.0 for i in range(6,10)],
'colsample_bytree':[i/10.0 for i in range(6,10)]
}
gs4 = GridSearchCV(XGBClassifier(colsample_bylevel=1, colsample_bytree=0.8,
gamma=0.2, learning_rate=0.1, max_delta_step=0, max_depth=9,
min_child_weight=1, n_estimators=236, nthread=4,
objective='multi:softprob', reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=123,subsample=0.8),param_grid=param_test4,
scoring='accuracy', n_jobs=4,iid=False, cv=5)
gs4.fit(train[features],train[target])
gs4.grid_scores_, gs4.best_params_,gs4.best_score_
Out[22]:
In [23]:
param_test4b={
'subsample':[i/10.0 for i in range(5,7)],
}
gs4b = GridSearchCV(XGBClassifier(colsample_bylevel=1, colsample_bytree=0.8,
gamma=0.2, learning_rate=0.1, max_delta_step=0, max_depth=9,
min_child_weight=1, n_estimators=236, nthread=4,
objective='multi:softprob', reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=123,subsample=0.8),param_grid=param_test4b,
scoring='accuracy', n_jobs=4,iid=False, cv=5)
gs4b.fit(train[features],train[target])
gs4b.grid_scores_, gs4b.best_params_,gs4b.best_score_
Out[23]:
In [24]:
param_test5={
'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]
}
gs5 = GridSearchCV(XGBClassifier(colsample_bylevel=1, colsample_bytree=0.8,
gamma=0.2, learning_rate=0.1, max_delta_step=0, max_depth=9,
min_child_weight=1, n_estimators=236, nthread=4,
objective='multi:softprob', reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=123,subsample=0.6),param_grid=param_test5,
scoring='accuracy', n_jobs=4,iid=False, cv=5)
gs5.fit(train[features],train[target])
gs5.grid_scores_, gs5.best_params_,gs5.best_score_
Out[24]:
In [25]:
param_test6={
'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05]
}
gs6 = GridSearchCV(XGBClassifier(colsample_bylevel=1, colsample_bytree=0.8,
gamma=0.2, learning_rate=0.1, max_delta_step=0, max_depth=9,
min_child_weight=1, n_estimators=236, nthread=4,
objective='multi:softprob', reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=123,subsample=0.6),param_grid=param_test6,
scoring='accuracy', n_jobs=4,iid=False, cv=5)
gs6.fit(train[features],train[target])
gs6.grid_scores_, gs6.best_params_,gs6.best_score_
Out[25]:
In [26]:
xgb3 = XGBClassifier(
learning_rate = 0.1,
n_estimators=1000,
max_depth=9,
min_child_weight=1,
gamma = 0.2,
subsample=0.6,
colsample_bytree=0.8,
reg_alpha=0.05,
objective='multi:softmax',
nthread =4,
scale_pos_weight=1,
seed = seed,
)
modelfit(xgb3,train,features)
In [27]:
xgb3
Out[27]:
In [8]:
model = XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=0.8,
gamma=0.2, learning_rate=0.1, max_delta_step=0, max_depth=9,
min_child_weight=1, missing=None, n_estimators=122, nthread=4,
objective='multi:softprob', reg_alpha=0.05, reg_lambda=1,
scale_pos_weight=1, seed=123, silent=True, subsample=0.6)
model.fit(X_train, Y_train)
xgb.plot_importance(model)
Out[8]:
In [28]:
xgb4 = XGBClassifier(
learning_rate = 0.01,
n_estimators=5000,
max_depth=9,
min_child_weight=1,
gamma = 0.2,
subsample=0.6,
colsample_bytree=0.8,
reg_alpha=0.05,
objective='multi:softmax',
nthread =4,
scale_pos_weight=1,
seed = seed,
)
modelfit(xgb4,train,features)
In [29]:
xgb4
Out[29]:
Next we use our tuned final model to perform leave-one-well-out cross-validation on the training data set: in each iteration a different well is held out as test data and the remaining wells are used for training.
In [5]:
# Load data
filename = './facies_vectors.csv'
data = pd.read_csv(filename)
# Change to category data type
data['Well Name'] = data['Well Name'].astype('category')
data['Formation'] = data['Formation'].astype('category')
# Leave one well out for cross validation
well_names = data['Well Name'].unique()
f1=[]
for i in range(len(well_names)):
    # Split data for training and testing
    X_train = data.drop(['Facies', 'Formation', 'Depth'], axis=1)
    Y_train = data['Facies'] - 1
    train_X = X_train[X_train['Well Name'] != well_names[i]]
    train_Y = Y_train[X_train['Well Name'] != well_names[i]]
    test_X = X_train[X_train['Well Name'] == well_names[i]]
    test_Y = Y_train[X_train['Well Name'] == well_names[i]]
    train_X = train_X.drop(['Well Name'], axis=1)
    test_X = test_X.drop(['Well Name'], axis=1)
    # Final recommended model based on the extensive parameter search
    model_final = XGBClassifier(base_score=0.5, colsample_bylevel=1,
                                colsample_bytree=0.8, gamma=0.2,
                                learning_rate=0.01, max_delta_step=0, max_depth=9,
                                min_child_weight=1, missing=None, n_estimators=432, nthread=4,
                                objective='multi:softmax', reg_alpha=0.05, reg_lambda=1,
                                scale_pos_weight=1, seed=123, silent=1,
                                subsample=0.6)
    # Train the model on the training data
    model_final.fit(train_X, train_Y, eval_metric='merror')
    # Predict on the test set
    predictions = model_final.predict(test_X)
    # Print report
    print("\n------------------------------------------------------")
    print("Validation on the held-out well " + well_names[i])
    conf = confusion_matrix(test_Y, predictions, labels=np.arange(9))
    print("\nModel Report")
    print("-Accuracy: %.6f" % (accuracy(conf)))
    print("-Adjacent Accuracy: %.6f" % (accuracy_adjacent(conf, adjacent_facies)))
    print("-F1 Score: %.6f" % (f1_score(test_Y, predictions, labels=np.arange(9), average='weighted')))
    f1.append(f1_score(test_Y, predictions, labels=np.arange(9), average='weighted'))
    facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS',
                     'WS', 'D', 'PS', 'BS']
    print("\nConfusion Matrix Results")
    from classification_utilities import display_cm, display_adj_cm
    display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
print("\n------------------------------------------------------")
print("Final Results")
print("-Average F1 Score: %.6f" % (sum(f1) / (1.0 * len(f1))))
In [16]:
# Load data
filename = './facies_vectors.csv'
data = pd.read_csv(filename)
# Change to category data type
data['Well Name'] = data['Well Name'].astype('category')
data['Formation'] = data['Formation'].astype('category')
# Split data for training and testing
X_train_all = data.drop(['Facies', 'Formation','Depth'], axis = 1 )
Y_train_all = data['Facies' ] - 1
X_train_all = X_train_all.drop(['Well Name'], axis = 1)
# Final recommended model based on the extensive parameter search
model_final = XGBClassifier(base_score=0.5, colsample_bylevel=1,
colsample_bytree=0.8, gamma=0.2,
learning_rate=0.01, max_delta_step=0, max_depth=9,
min_child_weight=1, missing=None, n_estimators=432, nthread=4,
objective='multi:softmax', reg_alpha=0.05, reg_lambda=1,
scale_pos_weight=1, seed=123, silent=1,
subsample=0.6)
# Train the model based on training data
model_final.fit(X_train_all , Y_train_all , eval_metric = 'merror' )
Out[16]:
In [17]:
# Leave one well out for cross validation
well_names = data['Well Name'].unique()
f1=[]
for i in range(len(well_names)):
    X_train = data.drop(['Facies', 'Formation', 'Depth'], axis=1)
    Y_train = data['Facies'] - 1
    train_X = X_train[X_train['Well Name'] != well_names[i]]
    train_Y = Y_train[X_train['Well Name'] != well_names[i]]
    test_X = X_train[X_train['Well Name'] == well_names[i]]
    test_Y = Y_train[X_train['Well Name'] == well_names[i]]
    train_X = train_X.drop(['Well Name'], axis=1)
    test_X = test_X.drop(['Well Name'], axis=1)
    #print(test_Y)
    predictions = model_final.predict(test_X)
    # Print report
    print("\n------------------------------------------------------")
    print("Validation on well " + well_names[i])
    conf = confusion_matrix(test_Y, predictions, labels=np.arange(9))
    print("\nModel Report")
    print("-Accuracy: %.6f" % (accuracy(conf)))
    print("-Adjacent Accuracy: %.6f" % (accuracy_adjacent(conf, adjacent_facies)))
    print("-F1 Score: %.6f" % (f1_score(test_Y, predictions, labels=np.arange(9), average='weighted')))
    f1.append(f1_score(test_Y, predictions, labels=np.arange(9), average='weighted'))
    facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS',
                     'WS', 'D', 'PS', 'BS']
    print("\nConfusion Matrix Results")
    from classification_utilities import display_cm, display_adj_cm
    display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
print("\n------------------------------------------------------")
print("Final Results")
print("-Average F1 Score: %.6f" % (sum(f1) / (1.0 * len(f1))))
We now use the final model to predict facies for the provided test data set.
In [18]:
# Load test data
test_data = pd.read_csv('validation_data_nofacies.csv')
test_data['Well Name'] = test_data['Well Name'].astype('category')
X_test = test_data.drop(['Formation', 'Well Name', 'Depth'], axis=1)
# Predict facies of unclassified data
Y_predicted = model_final.predict(X_test)
test_data['Facies'] = Y_predicted + 1
# Store the prediction
test_data.to_csv('Prediction3.csv')
In [19]:
test_data
Out[19]:
Future work: build a more customized objective function. We could also use RandomizedSearchCV instead of GridSearchCV to avoid getting trapped in a local minimum of the parameter search and to further improve the test results.
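As a hedged sketch of that idea (the parameter ranges below are illustrative choices, not tuned values), RandomizedSearchCV samples a fixed number of parameter settings from distributions rather than exhaustively enumerating a grid:

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'max_depth': randint(3, 12),
    'min_child_weight': randint(1, 6),
    'gamma': uniform(0, 0.5),
    'subsample': uniform(0.5, 0.5),         # samples from [0.5, 1.0]
    'colsample_bytree': uniform(0.5, 0.5),
}
rs = RandomizedSearchCV(XGBClassifier(learning_rate=0.1, n_estimators=200,
                                      objective='multi:softmax', seed=123),
                        param_distributions=param_dist, n_iter=30,
                        scoring='accuracy', cv=5, random_state=123)
rs.fit(train[features], train[target])
print(rs.best_params_, rs.best_score_)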