In this python notebook we explore many different machine learning algorithms to outperform the prediction model proposed in the prediction facies from wel logs challenge. Particulary, this is a classification problem that belongs to the area of supervised learning. Our approach involves a series of well-known machine learning and statistical algorithms to minimize the defined prediction error functions.
We will organize our present work in the following areas of analysis:
The dataset we will use comes from a class excercise from The University of Kansas on Neural Networks and Fuzzy Systems. This exercise is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more info on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).
The dataset we will use is log data from nine wells that have been labeled with a facies type based on oberservation of core. We will use this log data to train a classifier to predict facies types.
This data is from the Council Grove gas reservoir in Southwest Kansas. The Panoma Council Grove Field is predominantly a carbonate gas reservoir encompassing 2700 square miles in Southwestern Kansas. This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector and validation (test) data (830 examples from two wells) having the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five from wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.
The seven predictor variables are:
The nine discrete facies (classes of rocks) are:
These facies aren't discrete, and gradually blend into one another. Some have neighboring facies that are rather close. Mislabeling within these neighboring facies can be expected to occur. The following table lists the facies, their abbreviated labels and their approximate neighbors.
Facies | Label | Adjacent Facies |
---|---|---|
1 | SS | 2 |
2 | CSiS | 1,3 |
3 | FSiS | 2 |
4 | SiSh | 5 |
5 | MS | 4,6 |
6 | WS | 5,7 |
7 | D | 6,8 |
8 | PS | 6,7,9 |
9 | BS | 7,8 |
IMPORTANT: We need to install the xgboost package to run this notebook ( %sh pip install xgboost )
In [3]:
import xgboost as xgb
print xgb.__version__
In [4]:
%matplotlib inline
import pandas as pd
from pandas.tools.plotting import scatter_matrix
from pandas import set_option
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import matplotlib.colors as colors
from sklearn import preprocessing
from mpl_toolkits.axes_grid1 import make_axes_locatable
from sklearn.cross_validation import KFold, train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.learning_curve import learning_curve
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.ensemble import RandomForestClassifier
from classification_utilities import display_cm, display_adj_cm
We load the training data to start the exploration stage.
In [5]:
set_option("display.max_rows", 10)
pd.options.mode.chained_assignment = None
filename = '../facies_vectors.csv'
training_data = pd.read_csv(filename)
training_data
Out[5]:
We declare the fields "Formation" and "Well Name" as categorical variables and then map them to integer values.
In [6]:
training_data['Well Name'] = training_data['Well Name'].astype('category')
training_data['Formation'] = training_data['Formation'].astype('category')
training_data.info()
Observation
We could remove the NaN values from the PE variable for further analysis, going from 4149 total values to just 3232 valid values for PE, which has 3232. However, because we will be using XGBoost Algorithm that handles missing data, this won't be necessary.
In [8]:
#PE_mask = training_data['PE'].notnull().values
#training_data = training_data[PE_mask]
We must realize the classification problem is highly imbalanced, therefore making it more challenging to approach.
In [7]:
facies_colors = ['#F4D03F', '#F5B041','#DC7633','#6E2C00','#1B4F72',
'#2E86C1', '#AED6F1', '#A569BD', '#196F3D']
facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS','WS', 'D','PS', 'BS']
facies_counts = training_data['Facies'].value_counts().sort_index()
facies_counts.index = facies_labels
facies_counts.plot(kind='bar',color=facies_colors,title='Distribution of Training Data by Facies')
Out[7]:
We produce correlation plot to observe relationship between variables. The target variable 'Facies' is highly correlated to 'NM_M', 'PE' and 'ILD_log10'.
In [8]:
plt.figure(figsize=(10, 10))
sns.heatmap(training_data.corr(), vmax=1.0, square=True)
Out[8]:
In [9]:
training_data.describe()
Out[9]:
XGboost will be our weapon of choice. XGBoost (or Extreme Gradient Boosting) is a sophisticated algorithm that corresponds to an advanced implementation of gradient boosting algorithm. Despite it's complexity, it is fairly easy to use and very powerful to deal with many different types of irregularities in data. It has been no surprise then, that is has been extensively applied in many machine learning competitions to obtain very promising results.
For the above reasons, to implement the XGBoost algorithm for our regression problem, we will use the above preprocessed data without modifications.
In [11]:
X_train = training_data.drop(['Facies', 'Well Name','Formation','Depth'], axis = 1 )
Y_train = training_data['Facies' ] - 1
dtrain = xgb.DMatrix(X_train, Y_train)
type(Y_train)
Out[11]:
In order to evaluate our classification model accurary we will use the our following defined metrics, based on the confusion matrix once the classification is performed. The first metric only considers misclassification error and the second one takes into account the fact that facies could be misclassified if they belong to a same group with similar geological characteristics.
In [13]:
def accuracy(conf):
total_correct = 0.
nb_classes = conf.shape[0]
for i in np.arange(0,nb_classes):
total_correct += conf[i][i]
acc = total_correct/sum(sum(conf))
return acc
adjacent_facies = np.array([[1], [0,2], [1], [4], [3,5], [4,6,7], [5,7], [5,6,8], [6,7]])
def accuracy_adjacent(conf, adjacent_facies):
nb_classes = conf.shape[0]
total_correct = 0.
for i in np.arange(0,nb_classes):
total_correct += conf[i][i]
for j in adjacent_facies[i]:
total_correct += conf[i][j]
return total_correct / sum(sum(conf))
Although XGBoost is a very straightforward algorithm to implement, it's difficulty arises in dealing with a big number of hyperparameters. For that reason, we need to develop routines to tune this parameters to optimize the algorithm performance on the data prediction.
We will use a Cross-Validation approach to improve our model by tuning parameters at each step. There are three types of parameters to consider:
We must realize that our training data is quite reduced, therefore to evaluate our model generalization
In [15]:
# Cross Validation parameters
cv_folds = 10
rounds = 100
# Proposed Initial Model
xgb1 = xgb.XGBClassifier( learning_rate =0.01, n_estimators=500, max_depth=6,
min_child_weight=1, gamma=0, subsample=0.8,
colsample_bytree=0.8, objective='multi:softmax',
nthread=4, scale_pos_weight=1, seed=27)
xgb_param_1 = xgb1.get_xgb_params()
xgb_param_1['num_class'] = 9
# Perform cross-validation
cvresult = xgb.cv(xgb_param_1, dtrain, num_boost_round=xgb_param_1['n_estimators'],
stratified = True, nfold=cv_folds, metrics='merror', early_stopping_rounds=rounds)
print "\nCross Validation Training Report Summary"
print cvresult.tail()
xgb1.set_params(n_estimators=cvresult.shape[0])
#Fit the algorithm on the data
xgb1.fit(X_train, Y_train,eval_metric='merror')
#Predict training set:
predictions = xgb1.predict(X_train)
#Print model report
# Confusion Matrix
conf = confusion_matrix(Y_train, predictions )
# Print Results
print "\nModel Report"
print "-Accuracy: %.6f" % ( accuracy(conf) )
print "-Adjacent Accuracy: %.6f" % ( accuracy_adjacent(conf, adjacent_facies) )
print "\nConfusion Matrix"
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
# Print Feature Importance
feat_imp = pd.Series(xgb1.booster().get_fscore()).sort_values(ascending=False)
feat_imp.plot(kind='bar', title='Feature Importances')
plt.ylabel('Feature Importance Score')
Out[15]:
By performing cross-validation routines we can use the produced results to measure how well the model generalizes and at the same time tune the hyperparameters. In this case, we will explore how the results look like if we vary the learning rate.
In [49]:
print("Parameter optimization")
grid_search1 = GridSearchCV(xgb1,{'learning_rate':[0.001,0.01,0.05] , 'n_estimators':[100,200,500]},
scoring='accuracy' , n_jobs = 4)
grid_search1.fit(X_train,Y_train)
print("Best Set of Parameters")
grid_search1.grid_scores_, grid_search1.best_params_, grid_search1.best_score_
Out[49]:
In [50]:
# Cross Validation parameters
cv_folds = 10
rounds = 100
# Proposed Initial Model
xgb2 = xgb.XGBClassifier( learning_rate =0.001, n_estimators=500, max_depth=6,
min_child_weight=1, gamma=0, subsample=0.8,
colsample_bytree=0.8, objective='multi:softmax',
nthread=4, scale_pos_weight=1, seed=27)
xgb_param_2 = xgb2.get_xgb_params()
xgb_param_2['num_class'] = 9
# Perform cross-validation
cvresult = xgb.cv(xgb_param_2, dtrain, num_boost_round=xgb_param_2['n_estimators'],
nfold=cv_folds, metrics='merror', early_stopping_rounds=rounds)
print "\nCross Validation Training Report Summary"
print cvresult.tail()
xgb2.set_params(n_estimators=cvresult.shape[0])
#Fit the algorithm on the data
xgb2.fit(X_train, Y_train,eval_metric='merror')
#Predict training set:
predictions = xgb2.predict(X_train)
#Print model report
# Confusion Matrix
conf = confusion_matrix(Y_train, predictions )
# Print Results
print "\nModel Report"
print "-Accuracy: %.6f" % ( accuracy(conf) )
print "-Adjacent Accuracy: %.6f" % ( accuracy_adjacent(conf, adjacent_facies) )
# Confusion Matrix
print "\nConfusion Matrix"
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
# Print Feature Importance
feat_imp = pd.Series(xgb2.booster().get_fscore()).sort_values(ascending=False)
feat_imp.plot(kind='bar', title='Feature Importances')
plt.ylabel('Feature Importance Score')
Out[50]:
In [13]:
print("Parameter optimization")
grid_search1 = GridSearchCV(xgb2,{'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100] },
scoring='accuracy' , n_jobs = 4)
grid_search1.fit(X_train,Y_train)
print("Best Set of Parameters")
grid_search1.grid_scores_, grid_search1.best_params_, grid_search1.best_score_
Out[13]:
In [16]:
#Final Model
cv_folds = 10
rounds = 100
xgb_final = xgb.XGBClassifier( learning_rate =0.001, n_estimators=500, max_depth=6,
min_child_weight=1, gamma=0, subsample=0.8, reg_alpha = 1,
colsample_bytree=0.8, objective='multi:softmax',
nthread=4, scale_pos_weight=1, seed=27)
xgb_param_final = xgb_final.get_xgb_params()
xgb_param_final['num_class'] = 9
# Perform cross-validation
cvresult = xgb.cv(xgb_param_final, dtrain, num_boost_round=xgb_param_final['n_estimators'],
nfold=cv_folds, metrics='merror', early_stopping_rounds=rounds)
print "\nCross Validation Training Report Summary"
print cvresult.tail()
xgb_final.set_params(n_estimators=cvresult.shape[0])
#Fit the algorithm on the data
xgb_final.fit(X_train, Y_train,eval_metric='merror')
#Predict training set:
predictions = xgb_final.predict(X_train)
#Print model report
# Confusion Matrix
print "\nConfusion Matrix"
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
# Print Results
print "\nModel Report"
print "-Accuracy: %.6f" % ( accuracy(conf) )
print "-Adjacent Accuracy: %.6f" % ( accuracy_adjacent(conf, adjacent_facies) )
In [14]:
# Import data
filename = '../facies_vectors.csv'
data = pd.read_csv(filename)
data['Well Name'] = data['Well Name'].astype('category')
data['Formation'] = data['Formation'].astype('category')
# Leave out one well for prediction
well_names = data['Well Name'].unique()
f1=[]
for i in range(len(well_names)):
# Split data
X_train = data.drop(['Facies', 'Formation','Depth'], axis = 1 )
Y_train = data['Facies' ] - 1
train_X = X_train[X_train['Well Name'] != well_names[i] ]
train_Y = Y_train[X_train['Well Name'] != well_names[i] ]
test_X = X_train[X_train['Well Name'] == well_names[i] ]
test_Y = Y_train[X_train['Well Name'] == well_names[i] ]
train_X = train_X.drop(['Well Name'], axis = 1 )
test_X = test_X.drop(['Well Name'], axis = 1 )
# Model
model_final = xgb.XGBClassifier( learning_rate =0.001, n_estimators=500, max_depth=6,
min_child_weight=1, gamma=0, subsample=0.8, reg_alpha = 10,
colsample_bytree=0.8, objective='multi:softmax',
nthread=4, scale_pos_weight=1, seed=27)
#Fit the algorithm on the data
model_final.fit( train_X , train_Y , eval_metric = 'merror' )
# model_final = RandomForestClassifier(n_estimators=1000) # RANDOM FORREST
#model_final.fit(train_X, train_Y)
#Predict training set:
predictions = model_final.predict(test_X)
#Print model report
print "\n------------------------------------------------------"
print "Leaving out well " + well_names[i]
# Confusion Matrix
conf = confusion_matrix( test_Y, predictions, labels = np.arange(9) )
# Print Results
print "\nModel Report"
print "-Accuracy: %.6f" % ( accuracy(conf) )
print "-Adjacent Accuracy: %.6f" % ( accuracy_adjacent(conf, adjacent_facies) )
print "-F1 Score: %.6f" % ( f1_score ( test_Y , predictions , labels = np.arange(9), average = 'weighted' ) )
f1.append(f1_score ( test_Y , predictions , labels = np.arange(9), average = 'weighted' ))
facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS',
'WS', 'D','PS', 'BS']
print "\nConfusion Matrix Results"
from classification_utilities import display_cm, display_adj_cm
display_cm(conf, facies_labels,display_metrics=True, hide_zeros=True)
print "\n------------------------------------------------------"
print "Final Results"
print "-Average F1 Score: %6f" % (sum(f1)/(1.0*len(f1)))
We import the testing data and apply our trained model to make a prediction.
In [108]:
# Load test data
test_data = pd.read_csv('validation_data_nofacies.csv')
test_data['Well Name'] = test_data['Well Name'].astype('category')
X_test = test_data.drop(['Formation', 'Well Name', 'Depth'], axis=1)
# Predict facies of unclassified data
Y_predicted = xgb_final.predict(X_test)
test_data['Facies'] = Y_predicted + 1
# Store the prediction
test_data.to_csv('Prediction.csv')