XGBoost is the leading model for working with standard tabular data (the type of data you store in pandas DataFrames, as opposed to more exotic types of data like images and videos).
XGBoost models do well in many Kaggle competitions.
To reach peak accuracy, XGBoost models require more knowledge and model tuning than techniques like Random Forest. After this tutorial, you'll be able to:

  • Follow the full modeling workflow with XGBoost, and
  • Fine-tune XGBoost models for optimal performance

XGBoost is an implementation of the Gradient Boosted Decision Trees algorithm (scikit-learn has another version of this algorithm, but XGBoost has some technical advantages).
What are Gradient Boosted Decision Trees?

Gradient boosting builds models in cycles, and the predictions from all of those models are combined into a single ensemble.
We start each cycle by calculating the errors (residuals) for each observation in the dataset, given the ensemble's current predictions.
We then build a new model to predict those errors.
We add the predictions from this error-predicting model to the ensemble.
To make a prediction, we add up the predictions from all of the models built so far.
Those combined predictions give us new errors, from which we build the next model and add it to the ensemble.
There's one piece outside that cycle: we need some base prediction to start it.
In practice, the initial predictions can be pretty naive.
Even if the predictions are wildly inaccurate, subsequent additions to the ensemble will address those errors.
This process may sound complicated, but the code to use it is straightforward.
We'll fill in some additional explanatory details in the model tuning section below.
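
To make the cycle concrete, here is a minimal hand-rolled sketch of gradient boosting for squared error. It uses scikit-learn's DecisionTreeRegressor as the error-predicting model; the function names and the fixed scaling factor are purely illustrative and are not XGBoost's actual implementation:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_sketch(X, y, n_rounds=100, learning_rate=0.1):
    # A naive base prediction to start the cycle: the mean of the target
    base_prediction = float(np.mean(y))
    prediction = np.full(len(y), base_prediction)
    trees = []
    for _ in range(n_rounds):
        # 1. Calculate the errors (residuals) of the current ensemble
        errors = y - prediction
        # 2. Build a new model to predict those errors
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, errors)
        # 3. Add the new model's scaled predictions to the ensemble
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return base_prediction, trees

def gradient_boost_predict(X, base_prediction, trees, learning_rate=0.1):
    # To predict, sum the base prediction and every tree's scaled contribution
    prediction = np.full(len(X), base_prediction)
    for tree in trees:
        prediction = prediction + learning_rate * tree.predict(X)
    return prediction

Calling gradient_boost_sketch(train_X, train_y) would build an ensemble in the same spirit as XGBoost, though XGBoost adds regularization and many optimizations on top of this basic loop.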


In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

data = pd.read_csv('input/train.csv')
# Drop rows where the target is missing
data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = data.SalePrice
# Keep only the numeric predictors for this example
X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])
train_X, test_X, train_y, test_y = train_test_split(X.values, y.values, test_size=0.25)
# Fill in missing values with the column mean (SimpleImputer's default strategy)
my_imputer = SimpleImputer()
train_X = my_imputer.fit_transform(train_X)
print("First entry of train_X :\n", train_X[:1])
print()
test_X = my_imputer.transform(test_X)
print("First entry of test_X :\n", test_X[:1])


First entry of train_X :
 [[5.000e+00 6.000e+01 8.400e+01 1.426e+04 8.000e+00 5.000e+00 2.000e+03
  2.000e+03 3.500e+02 6.550e+02 0.000e+00 4.900e+02 1.145e+03 1.145e+03
  1.053e+03 0.000e+00 2.198e+03 1.000e+00 0.000e+00 2.000e+00 1.000e+00
  4.000e+00 1.000e+00 9.000e+00 1.000e+00 2.000e+03 3.000e+00 8.360e+02
  1.920e+02 8.400e+01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  1.200e+01 2.008e+03]]

First entry of test_X :
 [[3.85000000e+02 6.00000000e+01 6.96540601e+01 5.31070000e+04
  6.00000000e+00 5.00000000e+00 1.99200000e+03 1.99200000e+03
  0.00000000e+00 9.85000000e+02 0.00000000e+00 5.95000000e+02
  1.58000000e+03 1.07900000e+03 8.74000000e+02 0.00000000e+00
  1.95300000e+03 1.00000000e+00 0.00000000e+00 2.00000000e+00
  1.00000000e+00 3.00000000e+00 1.00000000e+00 9.00000000e+00
  2.00000000e+00 1.99200000e+03 2.00000000e+00 5.01000000e+02
  2.16000000e+02 2.31000000e+02 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 6.00000000e+00
  2.00700000e+03]]

Now we can build and fit a model just as we would in sklearn:


In [7]:
from xgboost import XGBRegressor

my_model = XGBRegressor()
# verbose=False in the fit call below suppresses the output that would
# otherwise be printed with each boosting round.
# Don't forget to examine the parameters displayed when the model is built;
# tuning those parameters properly may improve the model's performance.
my_model.fit(train_X, train_y, verbose=False)


Out[7]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

And now on to evaluating the model and making predictions, just as we would in scikit-learn.


In [8]:
predictions = my_model.predict(test_X)
predictions[:5]


Out[8]:
array([252216.58, 163965.92, 213855.88, 139755.72, 134389.5 ],
      dtype=float32)

In [9]:
from sklearn.metrics import mean_absolute_error

print("Mean Absolute Error:\n", str(mean_absolute_error(predictions, test_y)))


Mean Absolute Error:
 17924.18821703767

Model Tuning

XGBoost has a number of parameters that can dramatically affect your model's accuracy and speed.
Some significant parameters are:

n_estimators and early_stopping_rounds:
n_estimators specifies how many times the modeling cycle is repeated.
Increasing n_estimators moves you along the spectrum from underfitting toward overfitting.
Too low a value causes underfitting, which results in inaccurate predictions on both the training data and new data. Too large a value causes overfitting: accurate predictions on the training data, but inaccurate predictions on new data (which is what we care about).
You can experiment with your dataset to find the ideal value.
Typical values range from 100 to 1000, though this depends a lot on the learning rate discussed below.

early_stopping_rounds offers a way to automatically find the ideal value for n_estimators.
Early stopping tells the program to stop iterating when the validation score stops improving.
One effective technique is to set a relatively high value for n_estimators and then use early_stopping_rounds to figure out when to stop.
Since random chance sometimes causes a single round where validation scores don't improve, you need to specify a number for how many rounds of straight deterioration to allow before stopping.
early_stopping_rounds = 5 is a reasonable value to experiment with: we stop after 5 straight rounds of deteriorating validation scores.
Here is the code to fit with early_stopping_rounds:


In [10]:
my_model = XGBRegressor(n_estimators=1000)
my_model.fit(train_X, train_y, early_stopping_rounds=5,
            eval_set=[(test_X, test_y)], verbose=False)


Out[10]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

When using early_stopping_rounds, you need to set aside some of your data (the eval_set above) for computing the validation scores that determine when to stop.
If you later want to fit a model with all of your data, set n_estimators to whatever value you found to be optimal when run with early stopping.
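
As a rough sketch of that workflow, assuming the fitted model exposes a best_iteration attribute (the attribute name and its indexing can vary between xgboost versions):

# Reuse the round count that early stopping found, then refit on all of the data.
# best_iteration is an assumption about the installed xgboost version, and
# depending on its indexing you may need best_iteration + 1.
best_n_estimators = my_model.best_iteration

final_model = XGBRegressor(n_estimators=best_n_estimators)
# X and y are the full dataset from above; re-impute before fitting
final_model.fit(my_imputer.fit_transform(X), y)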

learning_rate
Here's a subtle but important trick for better XGBoost models:
Instead of getting predictions by simply adding up the predictions from each component model, we multiply each model's predictions by a small number (the learning rate) before adding them in.
This means each tree we add to the ensemble helps us less.
In practice, this reduces the model's propensity to overfit.
So, you can use a higher value of n_estimators without overfitting.
If you use early stopping, the appropriate number of trees will be set automatically.
In general, a small learning rate (and large number of estimators) will yield more accurate XGBoost models, though it will also take the model longer to train since more iterations are needed.


In [13]:
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(train_X, train_y, early_stopping_rounds=5,
            eval_set=[(test_X, test_y)], verbose=False)


Out[13]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.05, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

n_jobs
On larger datasets where runtime is a consideration, you can use parallelism to build your models faster.
It's common to set the parameter n_jobs equal to the number of cores on your machine.
On smaller datasets, this won't help.
The resulting model won't be any better, so micro-optimizing for fitting time is typically nothing but a distraction.
But, it's useful in large datasets where you would otherwise spend a long time waiting during the fit command.
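
For example, on a machine with 4 cores you might write something like the following (the core count here is just a placeholder for your own hardware):

# Same model as above, with tree construction parallelized across 4 cores
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
my_model.fit(train_X, train_y, early_stopping_rounds=5,
             eval_set=[(test_X, test_y)], verbose=False)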
XGBoost has a multitude of other parameters, but these will go a very long way in helping you fine-tune your XGBoost model for optimal performance.
In conclusion, XGBoost is a very effective method for building robust models, especially when working with structured (aka tabular) data.