In this notebook, we mainly use extreme gradient boosting (XGBoost) to improve the prediction model originally proposed in the TLE 2016 November machine learning tutorial. XGBoost can be viewed as an enhanced version of gradient boosting that uses a more regularized model formulation to control over-fitting, and it usually performs better in practice. Applications of XGBoost can be found in many Kaggle competitions, and several good tutorials are available online.
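To make the regularization claim concrete, XGBoost minimizes a training loss plus an explicit complexity penalty for each tree (Chen and Guestrin, 2016):

$$\mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$$

where $T$ is the number of leaves of tree $f$ and $w$ is its vector of leaf weights. The $\gamma$ and $\lambda$ terms (plus an optional L1 term $\alpha \lVert w \rVert_1$) correspond to the gamma, reg_lambda, and reg_alpha parameters tuned later in this notebook.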
Our work is organized in the following order:
• Background
• Exploratory Data Analysis
• Data Preparation and Model Selection
• Final Results
The dataset we will use comes from a class exercise from The University of Kansas on Neural Networks and Fuzzy Systems. This exercise is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more information on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).
The dataset we will use is log data from nine wells that have been labeled with a facies type based on observation of core. We will use this log data to train a classifier to predict facies types.
This data is from the Council Grove gas reservoir in Southwest Kansas. The Panoma Council Grove Field is predominantly a carbonate gas reservoir encompassing 2700 square miles in southwestern Kansas. The dataset covers nine wells (4149 examples), each example consisting of a feature vector of seven predictor variables and a rock facies (class) label, plus validation (test) data (830 examples from two wells) with the same seven predictor variables. Facies are based on examination of cores from the nine wells, taken vertically at half-foot intervals. The predictor variables include five wireline log measurements and two geologic constraining variables derived from geologic knowledge; these are essentially continuous variables sampled at a half-foot rate.
The seven predictor variables are:
• Five wireline log curves: gamma ray (GR), resistivity (ILD_log10), photoelectric effect (PE), neutron-density porosity difference (DeltaPHI), and average neutron-density porosity (PHIND). Note that some wells do not have PE.
• Two geologic constraining variables: a nonmarine-marine indicator (NM_M) and relative position (RELPOS).
The nine discrete facies (classes of rocks) are:
1. Nonmarine sandstone
2. Nonmarine coarse siltstone
3. Nonmarine fine siltstone
4. Marine siltstone and shale
5. Mudstone (limestone)
6. Wackestone (limestone)
7. Dolomite
8. Packstone-grainstone (limestone)
9. Phylloid-algal bafflestone (limestone)
These facies aren't entirely discrete; they gradually blend into one another, and some have neighboring facies that are quite similar. Mislabeling within these neighboring facies can be expected to occur. The following table lists the facies, their abbreviated labels, and their approximate neighbors.
Facies  Label  Adjacent Facies
1       SS     2
2       CSiS   1,3
3       FSiS   2
4       SiSh   5
5       MS     4,6
6       WS     5,7
7       D      6,8
8       PS     6,7,9
9       BS     7,8
The first thing we notice about this table is that the adjacency relation is not symmetric: for example, facies 9 lists 7 as adjacent, yet facies 7 does not list 9. We have already contacted the authors regarding this.
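To verify this asymmetry programmatically, a small check over the adjacency table (written 0-indexed, matching the adjacent_facies array defined later in this notebook) flags every one-directional pair:
In [ ]:
# 0-indexed adjacency list: entry i holds the neighbors of facies i+1
# (same data as the adjacent_facies array used by accuracy_adjacent below)
adjacent = [[1], [0,2], [1], [4], [3,5], [4,6,7], [5,7], [5,6,8], [6,7]]
for i, neighbors in enumerate(adjacent):
    for j in neighbors:
        if i not in adjacent[j]:
            print("Facies %d lists %d as adjacent, but %d does not list %d" % (i+1, j+1, j+1, i+1))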
After this background introduction, we import the pandas library for some basic data analysis and manipulation. matplotlib and seaborn are imported for data visualization.
In [1]:
%matplotlib inline
import pandas as pd
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in newer pandas
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import matplotlib.colors as colors
In [4]:
filename = '../facies_vectors.csv'
training_data = pd.read_csv(filename)
training_data
Out[4]:
In [11]:
training_data['Well Name'] = training_data['Well Name'].astype('category')
training_data['Formation'] = training_data['Formation'].astype('category')
training_data.info()
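Since some wells lack PE (as noted above), it may be worth checking where PE is missing before training; a minimal sketch, assuming missing readings appear as NaN in the CSV:
In [ ]:
# Count missing PE readings per well; XGBoost handles NaNs natively,
# but it is useful to know which wells are affected
pe_missing = training_data['PE'].isnull().groupby(training_data['Well Name']).sum()
print(pe_missing[pe_missing > 0])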
In [5]:
facies_colors = ['#F4D03F', '#F5B041','#DC7633','#6E2C00','#1B4F72',
'#2E86C1', '#AED6F1', '#A569BD', '#196F3D']
facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS','WS', 'D','PS', 'BS']
facies_counts = training_data['Facies'].value_counts().sort_index()
facies_counts.index = facies_labels
facies_counts.plot(kind='bar',color=facies_colors,title='Distribution of Training Data by Facies')
Out[5]:
In [6]:
sns.heatmap(training_data.corr(), vmax=1.0, square=True)
Out[6]:
In [7]:
training_data.describe()
Out[7]:
Now we are ready to test the XGBoost approach. Along the way, confusion_matrix and f1_score are imported as classification metrics, together with GridSearchCV, an excellent tool for parameter optimization.
In [17]:
import xgboost as xgb
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score
from classification_utilities import display_cm, display_adj_cm
from sklearn.model_selection import GridSearchCV
In [12]:
X_train = training_data.drop(['Facies', 'Well Name','Formation','Depth'], axis = 1 )
Y_train = training_data['Facies' ] - 1
dtrain = xgb.DMatrix(X_train, Y_train)
The accuracy and accuracy_adjacent functions are defined in the following to quantify the prediction correctness.
In [13]:
def accuracy(conf):
    # Fraction of exactly correct predictions (diagonal of the confusion matrix)
    total_correct = 0.
    nb_classes = conf.shape[0]
    for i in np.arange(0, nb_classes):
        total_correct += conf[i][i]
    acc = total_correct / sum(sum(conf))
    return acc

# 0-indexed adjacency: entry i lists the facies adjacent to facies i+1
adjacent_facies = np.array([[1], [0,2], [1], [4], [3,5], [4,6,7], [5,7], [5,6,8], [6,7]], dtype=object)

def accuracy_adjacent(conf, adjacent_facies):
    # Counts a prediction as correct if it matches the true facies or an adjacent one
    nb_classes = conf.shape[0]
    total_correct = 0.
    for i in np.arange(0, nb_classes):
        total_correct += conf[i][i]
        for j in adjacent_facies[i]:
            total_correct += conf[i][j]
    return total_correct / sum(sum(conf))
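As a quick sanity check (not part of the original tutorial), both metrics can be evaluated on a small hypothetical confusion matrix for the first three facies, where every error lands on an adjacent facies and adjacent accuracy should therefore reach 1.0:
In [ ]:
# Toy 3-class confusion matrix (hypothetical counts for SS, CSiS, FSiS)
toy_conf = np.array([[8, 2, 0],
                     [1, 9, 1],
                     [0, 3, 7]])
print("Accuracy: %.3f" % accuracy(toy_conf))                                          # 24/31 ~ 0.774
print("Adjacent accuracy: %.3f" % accuracy_adjacent(toy_conf, adjacent_facies[:3]))   # 1.0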
Initial model
In [10]:
# Proposed Initial Model
xgb1 = xgb.XGBClassifier(learning_rate=0.1, n_estimators=200, max_depth=5,
                         min_child_weight=1, gamma=0, subsample=0.6,
                         colsample_bytree=0.6, reg_alpha=0, reg_lambda=1,
                         objective='multi:softmax', nthread=4, scale_pos_weight=1, seed=100)
#Fit the algorithm on the data
xgb1.fit(X_train, Y_train,eval_metric='merror')
#Predict training set:
predictions = xgb1.predict(X_train)
#Print model report
# Confusion Matrix
conf = confusion_matrix(Y_train, predictions)
# Print Results
print ("\nModel Report")
print ("-Accuracy: %.6f" % ( accuracy(conf) ))
print ("-Adjacent Accuracy: %.6f" % ( accuracy_adjacent(conf, adjacent_facies) ))
print ("\nConfusion Matrix")
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
# Print Feature Importance
feat_imp = pd.Series(xgb1.get_booster().get_fscore()).sort_values(ascending=False)  # booster() was renamed get_booster() in newer xgboost
feat_imp.plot(kind='bar', title='Feature Importances')
plt.ylabel('Feature Importance Score')
Out[10]:
In [11]:
# Cross Validation parameters
cv_folds = 10
rounds = 100
xgb_param_1 = xgb1.get_xgb_params()
xgb_param_1['num_class'] = 9
# Perform cross-validation
cvresult1 = xgb.cv(xgb_param_1, dtrain, num_boost_round=xgb_param_1['n_estimators'],
stratified = True, nfold=cv_folds, metrics='merror', early_stopping_rounds=rounds)
print ("\nCross Validation Training Report Summary")
print (cvresult1.head())
print (cvresult1.tail())
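Because early_stopping_rounds truncates the returned table at the best-performing round, the length of cvresult1 can suggest an updated number of boosted trees; a small sketch (not in the original tutorial):
In [ ]:
# When early stopping triggers, the CV table ends at the best round,
# so its row count is a data-driven choice for n_estimators
best_n_estimators = cvresult1.shape[0]
print("Suggested n_estimators from CV: %d" % best_n_estimators)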
The typical range for the learning rate is around 0.01 to 0.2, so we vary the learning rate a bit and, at the same time, scan over the number of boosted trees to fit. This will take a little while to finish.
In [12]:
print("Parameter optimization")
grid_search1 = GridSearchCV(xgb1,{'learning_rate':[0.05,0.01,0.1,0.2] , 'n_estimators':[200,400,600,800]},
scoring='accuracy' , n_jobs = 4)
grid_search1.fit(X_train,Y_train)
print("Best Set of Parameters")
# grid_scores_ was replaced by cv_results_ in sklearn.model_selection
grid_search1.cv_results_['mean_test_score'], grid_search1.best_params_, grid_search1.best_score_
Out[12]:
It seems that we need to make the learning rate smaller, which should help reduce overfitting. The number of boosted trees to fit also needs to be updated accordingly.
In [13]:
# Proposed Model with optimized learning rate and number of boosted trees to fit
xgb2 = xgb.XGBClassifier(learning_rate=0.01, n_estimators=400, max_depth=5,
                         min_child_weight=1, gamma=0, subsample=0.6,
                         colsample_bytree=0.6, reg_alpha=0, reg_lambda=1,
                         objective='multi:softmax', nthread=4, scale_pos_weight=1, seed=100)
#Fit the algorithm on the data
xgb2.fit(X_train, Y_train,eval_metric='merror')
#Predict training set:
predictions = xgb2.predict(X_train)
#Print model report
# Confusion Matrix
conf = confusion_matrix(Y_train, predictions )
# Print Results
print ("\nModel Report")
print ("-Accuracy: %.6f" % ( accuracy(conf) ))
print ("-Adjacent Accuracy: %.6f" % ( accuracy_adjacent(conf, adjacent_facies) ))
# Confusion Matrix
print ("\nConfusion Matrix")
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
# Print Feature Importance
feat_imp = pd.Series(xgb2.get_booster().get_fscore()).sort_values(ascending=False)
feat_imp.plot(kind='bar', title='Feature Importances')
plt.ylabel('Feature Importance Score')
Out[13]:
In [14]:
# Cross Validation parameters
cv_folds = 10
rounds = 100
xgb_param_2 = xgb2.get_xgb_params()
xgb_param_2['num_class'] = 9
# Perform cross-validation
cvresult2 = xgb.cv(xgb_param_2, dtrain, num_boost_round=xgb_param_2['n_estimators'],
stratified = True, nfold=cv_folds, metrics='merror', early_stopping_rounds=rounds)
print ("\nCross Validation Training Report Summary")
print (cvresult2.head())
print (cvresult2.tail())
In [15]:
print("Parameter optimization")
grid_search2 = GridSearchCV(xgb2,{'reg_alpha':[0, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10], 'reg_lambda':[0, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10] },
scoring='accuracy' , n_jobs = 4)
grid_search2.fit(X_train,Y_train)
print("Best Set of Parameters")
grid_search2.cv_results_['mean_test_score'], grid_search2.best_params_, grid_search2.best_score_
Out[15]:
In [16]:
# Proposed Model with optimized regularization
xgb3 = xgb.XGBClassifier(learning_rate=0.01, n_estimators=400, max_depth=5,
                         min_child_weight=1, gamma=0, subsample=0.6,
                         colsample_bytree=0.6, reg_alpha=0.1, reg_lambda=0.5,
                         objective='multi:softmax', nthread=4, scale_pos_weight=1, seed=100)
#Fit the algorithm on the data
xgb3.fit(X_train, Y_train,eval_metric='merror')
#Predict training set:
predictions = xgb3.predict(X_train)
#Print model report
# Confusion Matrix
conf = confusion_matrix(Y_train, predictions )
# Print Results
print ("\nModel Report")
print ("-Accuracy: %.6f" % ( accuracy(conf) ))
print ("-Adjacent Accuracy: %.6f" % ( accuracy_adjacent(conf, adjacent_facies) ))
# Confusion Matrix
print ("\nConfusion Matrix")
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
# Print Feature Importance
feat_imp = pd.Series(xgb3.get_booster().get_fscore()).sort_values(ascending=False)
feat_imp.plot(kind='bar', title='Feature Importances')
plt.ylabel('Feature Importance Score')
Out[16]:
In [17]:
print("Parameter optimization")
grid_search3 = GridSearchCV(xgb3,{'max_depth':[2, 5, 8], 'gamma':[0, 1], 'subsample':[0.4, 0.6, 0.8],'colsample_bytree':[0.4, 0.6, 0.8] },
scoring='accuracy' , n_jobs = 4)
grid_search3.fit(X_train,Y_train)
print("Best Set of Parameters")
grid_search3.cv_results_['mean_test_score'], grid_search3.best_params_, grid_search3.best_score_
Out[17]:
In [18]:
# Load data
filename = '../facies_vectors.csv'
data = pd.read_csv(filename)
# Change to category data type
data['Well Name'] = data['Well Name'].astype('category')
data['Formation'] = data['Formation'].astype('category')
# Leave one well out for cross validation
well_names = data['Well Name'].unique()
f1=[]
for i in range(len(well_names)):
    # Split data for training and testing
    X_train = data.drop(['Facies', 'Formation', 'Depth'], axis=1)
    Y_train = data['Facies'] - 1
    train_X = X_train[X_train['Well Name'] != well_names[i]]
    train_Y = Y_train[X_train['Well Name'] != well_names[i]]
    test_X = X_train[X_train['Well Name'] == well_names[i]]
    test_Y = Y_train[X_train['Well Name'] == well_names[i]]
    train_X = train_X.drop(['Well Name'], axis=1)
    test_X = test_X.drop(['Well Name'], axis=1)
    # Final recommended model based on the extensive parameter search
    model_final = xgb.XGBClassifier(learning_rate=0.01, n_estimators=400, max_depth=5,
                                    min_child_weight=1, gamma=0, subsample=0.6,
                                    reg_alpha=0.1, reg_lambda=0.5, colsample_bytree=0.6,
                                    objective='multi:softmax', nthread=4,
                                    scale_pos_weight=1, seed=100)
    # Train the model on the training wells
    model_final.fit(train_X, train_Y, eval_metric='merror')
    # Predict on the held-out well
    predictions = model_final.predict(test_X)
    # Print report
    print ("\n------------------------------------------------------")
    print ("Validation on the left-out well " + well_names[i])
    conf = confusion_matrix(test_Y, predictions, labels=np.arange(9))
    print ("\nModel Report")
    print ("-Accuracy: %.6f" % (accuracy(conf)))
    print ("-Adjacent Accuracy: %.6f" % (accuracy_adjacent(conf, adjacent_facies)))
    print ("-F1 Score: %.6f" % (f1_score(test_Y, predictions, labels=np.arange(9), average='weighted')))
    f1.append(f1_score(test_Y, predictions, labels=np.arange(9), average='weighted'))
    facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS',
                     'WS', 'D', 'PS', 'BS']
    print ("\nConfusion Matrix Results")
    display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
print ("\n------------------------------------------------------")
print ("Final Results")
print ("-Average F1 Score: %.6f" % (sum(f1) / (1.0 * len(f1))))
In [19]:
# Load test data
test_data = pd.read_csv('../validation_data_nofacies.csv')
test_data['Well Name'] = test_data['Well Name'].astype('category')
X_test = test_data.drop(['Formation', 'Well Name', 'Depth'], axis=1)
# Predict facies of unclassified data
Y_predicted = model_final.predict(X_test)
test_data['Facies'] = Y_predicted + 1
# Store the prediction
test_data.to_csv('Prediction1.csv')
In [20]:
test_data
Out[20]:
Future work: design a more customized objective function. We could also use RandomizedSearchCV instead of GridSearchCV to avoid getting trapped in a local minimum and further improve the test results; a minimal sketch follows.
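As an illustration of the RandomizedSearchCV idea, here is a small sketch; the parameter distributions and n_iter are hypothetical choices, not tuned values:
In [ ]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Hypothetical search distributions; uniform(loc, scale) samples from [loc, loc + scale]
param_dist = {'learning_rate': uniform(0.01, 0.19),
              'n_estimators': randint(200, 801),
              'max_depth': randint(2, 9),
              'subsample': uniform(0.4, 0.4),
              'colsample_bytree': uniform(0.4, 0.4)}
X_all = data.drop(['Facies', 'Well Name', 'Formation', 'Depth'], axis=1)
Y_all = data['Facies'] - 1
random_search = RandomizedSearchCV(xgb.XGBClassifier(objective='multi:softmax', nthread=4, seed=100),
                                   param_distributions=param_dist, n_iter=25,
                                   scoring='accuracy', n_jobs=4, random_state=100)
random_search.fit(X_all, Y_all)
print(random_search.best_params_, random_search.best_score_)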