Boosting
Combine weak learners to build a strong model
How does this work?
Build a base model:
Y = Model1(x) + Error1
Models are abstractions. There will be some error between the predictions and the actual values.
What if this error can be modeled? Say:
Error1 = Model2(x) + Error2
If modeled right, this will improve the accuracy of the predictions.
And we can continue:
Error2 = Model3(x) + Error3
Combining these three steps, we have:
Y = Model1(x) + Model2(x) + Model3(x) + Error3
And what if we find weights (parameters) for these models?
$$ Y = \alpha \, Model1(x) + \beta \, Model2(x) + \gamma \, Model3(x) + Error3 $$
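To make the residual-fitting idea concrete, here is a minimal sketch (not part of the original workflow; the synthetic data and shallow regression trees are illustrative assumptions): each new tree is fit on the residuals of the model so far, and the final prediction is the sum of the three trees.
In [ ]:
# Sketch of the additive idea above: each model fits the previous model's residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
x_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(x_toy).ravel() + rng.normal(scale=0.3, size=200)   # synthetic target (assumption)

model1 = DecisionTreeRegressor(max_depth=2).fit(x_toy, y_toy)
error1 = y_toy - model1.predict(x_toy)          # Error1 = Y - Model1(x)

model2 = DecisionTreeRegressor(max_depth=2).fit(x_toy, error1)
error2 = error1 - model2.predict(x_toy)         # Error2 = Error1 - Model2(x)

model3 = DecisionTreeRegressor(max_depth=2).fit(x_toy, error2)

# Combined prediction: Model1(x) + Model2(x) + Model3(x)
y_hat = model1.predict(x_toy) + model2.predict(x_toy) + model3.predict(x_toy)
print("MSE, single tree :", np.mean((y_toy - model1.predict(x_toy)) ** 2))
print("MSE, three stages:", np.mean((y_toy - y_hat) ** 2))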
Exercise
Run sklearn.ensemble.AdaBoostClassifier
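As a starting point, here is a minimal sketch of the call; the synthetic dataset from make_classification and the hyperparameter values are assumptions, so substitute your own features and target.
In [ ]:
# Sketch only: replace the synthetic data with your own features/target.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=42)
ada.fit(X_tr, y_tr)
print("Test accuracy:", ada.score(X_te, y_te))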
In [ ]:
In [ ]:
xgboost
In [5]:
import xgboost as xgb
import pandas as pd
import numpy as np
In [6]:
#Read the data
df = pd.read_csv("data/historical_loan.csv")
# refine the data
df.years = df.years.fillna(np.mean(df.years))
#Load the preprocessing module
from sklearn import preprocessing
# Encode categorical (object-typed) columns as integer labels
categorical_variables = df.dtypes[df.dtypes == "object"].index.tolist()
for i in categorical_variables:
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(df[i]))
    df[i] = lbl.transform(df[i])
In [11]:
df.head()
Out[11]:
In [12]:
# Setup the features and target
X = df.iloc[:,1:]
y = df.iloc[:,0]
In [13]:
from sklearn.model_selection import train_test_split
In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Details on the various parameters for xgboost can be found here: https://github.com/dmlc/xgboost/blob/master/doc/parameter.md
For example, the sketch_eps parameter used by the approximate tree method roughly translates into O(1 / sketch_eps) bins; compared to directly selecting the number of bins, this comes with a theoretical guarantee on sketch accuracy. The objective parameter specifies the learning task and the corresponding learning objective; binary:logistic, used below, is the option for binary classification with probability output.
In [15]:
# Parameters
params = {}
params["min_child_weight"] = 3           # minimum sum of instance weight needed in a child
params["subsample"] = 0.7                # fraction of rows sampled for each tree
params["colsample_bytree"] = 0.7         # fraction of columns sampled for each tree
params["scale_pos_weight"] = 1           # balance of positive vs. negative class weights
params["silent"] = 0                     # 0 = print training messages
params["max_depth"] = 4                  # maximum depth of each tree
params["nthread"] = 6                    # number of parallel threads
params["gamma"] = 1                      # minimum loss reduction required to make a split
params["objective"] = "binary:logistic"  # binary classification, outputs probabilities
params["eta"] = 0.005                    # learning rate (shrinkage)
params["base_score"] = 0.1               # initial prediction score (global bias)
params["eval_metric"] = "auc"            # evaluation metric for the watchlist
params["seed"] = 123                     # random seed for reproducibility
plst = list(params.items())
num_rounds = 40
In [16]:
xgtrain = xgb.DMatrix(X_train, label=y_train)
watchlist = [(xgtrain, 'train')]
In [17]:
model_xgboost = xgb.train(plst, xgtrain, num_rounds, evals=watchlist)
In [18]:
import matplotlib.pyplot as plt
%matplotlib inline
In [19]:
#Variable Importance Plot
plt.style.use('seaborn-notebook')
xgb.plot_importance(model_xgboost)
Out[19]:
In [21]:
xgb.plot_tree(model_xgboost, num_trees=39)
Out[21]:
In [ ]:
In [22]:
# Prediction using xgboost
In [23]:
xgb_predict = model_xgboost.predict(xgb.DMatrix(X_test))
In [24]:
xgb_predict.shape
Out[24]:
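With the binary:logistic objective, predict returns the probability of the positive class for each test row. A quick check against the held-out labels (a sketch; AUC mirrors the eval_metric set above, and the 0.5 cut-off for hard labels is an assumption):
In [ ]:
# Evaluate the predicted probabilities against the held-out labels
from sklearn.metrics import roc_auc_score, accuracy_score

print("AUC     :", roc_auc_score(y_test, xgb_predict))
print("Accuracy:", accuracy_score(y_test, (xgb_predict > 0.5).astype(int)))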
Exercise
Predict on the test set again, this time using the ntree_limit option of predict.
Run xgboost's cross-validation method. Sample code:
xgb.cv(parameters, train_matrix, num_round, nfold,
       metrics={'error'}, seed=0,
       callbacks=[xgb.callback.print_evaluation(show_stdv=True)])
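Adapting the sample above to the objects already defined in this notebook, one possible call might look like the following (nfold=5, the auc metric and the seed value are assumptions; the print_evaluation callback is the one shown in the sample and may be named differently in newer xgboost releases):
In [ ]:
# Cross-validation sketch using params, xgtrain and num_rounds from earlier cells
cv_results = xgb.cv(params, xgtrain, num_rounds, nfold=5,
                    metrics={'auc'}, seed=123,
                    callbacks=[xgb.callback.print_evaluation(show_stdv=True)])
cv_results.tail()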
In [ ]:
Early Stopping
Source: xgboost docs
If you have a validation set, you can use early stopping to find the optimal number of boosting rounds.
train(..., evals=evals, early_stopping_rounds=10)
The model will train until the validation score stops improving. Validation error needs to decrease at least every early_stopping_rounds to continue training.
If early stopping occurs, the model will have three additional fields: bst.best_score, bst.best_iteration and bst.best_ntree_limit. Note that train() will return a model from the last iteration, not the best one.
This works with both metrics to minimize (RMSE, log loss, etc.) and to maximize (MAP, NDCG, AUC). Note that if you specify more than one evaluation metric the last one in param['eval_metric'] is used for early stopping.
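Applied to this notebook's data, a sketch of early stopping could look like this (using the held-out split as the validation set, 1000 as a generous round cap and 10 rounds of patience, all of which are assumptions):
In [ ]:
# Early-stopping sketch: train until validation AUC stops improving for 10 rounds
xgtest = xgb.DMatrix(X_test, label=y_test)
evals = [(xgtrain, 'train'), (xgtest, 'valid')]

bst = xgb.train(plst, xgtrain, num_boost_round=1000,
                evals=evals, early_stopping_rounds=10)

print(bst.best_score, bst.best_iteration, bst.best_ntree_limit)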