In this notebook, we will introduce how to use Gradient Boosted Trees in GraphLab Create to tackle a Kaggle competition: forecasting the usage of a city bikeshare system. The link to the competition is here.
The notebook has three parts: a quick start that trains a baseline model and creates a first submission, a feature engineering pass over the datetime and count columns, and a hyperparameter search for the boosted trees model.
To run the following code, download the data from here and make sure you have GraphLab Create 1.3 or later.
In [1]:
import graphlab as gl
gl.canvas.set_target('ipynb')
In [2]:
# load training data
training_sframe = gl.SFrame.read_csv('train.csv', verbose=False)
# train a model
features = ['datetime', 'season', 'holiday', 'workingday', 'weather',
            'temp', 'atemp', 'humidity', 'windspeed']
m = gl.boosted_trees_regression.create(training_sframe,
                                       features=features,
                                       target='count')
# predict on test data
test_sframe = gl.SFrame.read_csv('test.csv', verbose=False)
prediction = m.predict(test_sframe)
The following code saves the prediction to disk, at which point you can submit it to the Kaggle website. (You will need a Kaggle account to submit your result).
In [3]:
def make_submission(prediction, filename='submission.txt'):
    with open(filename, 'w') as f:
        f.write('datetime,count\n')
        submission_strings = test_sframe['datetime'] + ',' + prediction.astype(str)
        for row in submission_strings:
            f.write(row + '\n')

make_submission(prediction, 'submission1.txt')
In [4]:
training_sframe.show()
Looking at the data in Canvas, I quickly realized that we need a few transformations to incorporate our knowledge about the data and the evaluation metric.
First, the target column count, which we are trying to predict, is really the sum of two other columns: registered, the count of registered users, and casual, the count of unregistered users. Training separate models to predict each of these counts should yield better results.
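As a quick sanity check (not part of the original workflow, but cheap to run), we can verify this decomposition directly:

# Verify that 'count' equals 'casual' + 'registered' on every training row.
assert (training_sframe['count'] ==
        training_sframe['casual'] + training_sframe['registered']).all()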
Second, the evaluation metric is the RMSE in the log domain, so we should transform the target columns into the log domain as well.
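For reference, here is a minimal sketch of that metric, often called RMSLE, written for plain Python lists; the authoritative definition is on the Kaggle evaluation page:

import math

def rmsle(predicted, actual):
    """Root mean squared error of log(1 + x)-transformed values."""
    sq_log_errors = [(math.log(1 + p) - math.log(1 + a)) ** 2
                     for p, a in zip(predicted, actual)]
    return math.sqrt(sum(sq_log_errors) / len(sq_log_errors))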
Finally, the datetime column should be split into separate year, month, weekday, and hour columns. Strings are treated as categorical variables, so the trees were treating each datetime as a unique value with no ordering relative to the others.
In [5]:
date_format_str = '%Y-%m-%d %H:%M:%S'

def process_date_column(data_sframe):
    """Split the 'datetime' column of a given SFrame into separate columns."""
    datetime_col = data_sframe['datetime']
    parsed_datetime = datetime_col.str_to_datetime(date_format_str)
    parsed_datetime_sf = parsed_datetime.split_datetime(column_name_prefix='',
                                                        limit=['year', 'month', 'day', 'hour'])
    for col in ['year', 'month', 'day', 'hour']:
        data_sframe[col] = parsed_datetime_sf[col]
    data_sframe['weekday'] = parsed_datetime.apply(lambda x: x.weekday())

process_date_column(training_sframe)
process_date_column(test_sframe)
In [6]:
import math

# Create three new columns: log-casual, log-registered, and log-count
for col in ['casual', 'registered', 'count']:
    training_sframe['log-' + col] = training_sframe[col].apply(lambda x: math.log(1 + x))
In [7]:
new_features = features + ['year', 'month', 'hour', 'weekday']
new_features.remove('datetime')

m1 = gl.boosted_trees_regression.create(training_sframe,
                                        features=new_features,
                                        target='log-casual')
m2 = gl.boosted_trees_regression.create(training_sframe,
                                        features=new_features,
                                        target='log-registered')

def fused_predict(m1, m2, test_sframe):
    """
    Fuse the predictions of two separately trained models.
    The input models are trained in the log domain.
    Return the combined predictions in the original domain.
    """
    p1 = m1.predict(test_sframe).apply(lambda x: math.exp(x) - 1)
    p2 = m2.predict(test_sframe).apply(lambda x: math.exp(x) - 1)
    return (p1 + p2).apply(lambda x: x if x > 0 else 0)

prediction = fused_predict(m1, m2, test_sframe)
We have done the simple feature engineering of transforming the datetime column and the count columns, and made a fused_predict() function that combines the predictions of two models trained separately, in the log domain, on the registered and casual target columns. Use the make_submission() function to see how much improvement we have made. (Again, you will need a Kaggle account to submit your result.)
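For instance, with an arbitrary filename:

make_submission(prediction, 'submission_features.txt')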
In this section, we will use the model_parameter_search() function to search for the best hyperparameters. There are a few important parameters in the gradient boosted trees model:
- max_iterations determines the number of trees in the final model. Usually, the more trees, the higher the prediction accuracy, but both training and prediction time also grow linearly in the number of trees.
- max_depth restricts the depth of each individual tree to prevent overfitting.
- min_child_weight also regularizes the model complexity by restricting the minimum number of observations contained at each leaf node.
- min_loss_reduction sets the minimum reduction of the loss function required for a node split; it works similarly to min_child_weight.
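To make these knobs concrete, here is a single model with all four parameters set explicitly; the values below are illustrative placeholders, not tuned ones:

# Illustrative, untuned parameter values for a single model.
m_example = gl.boosted_trees_regression.create(training_sframe,
                                               features=new_features,
                                               target='log-registered',
                                               max_iterations=500,
                                               max_depth=10,
                                               min_child_weight=5,
                                               min_loss_reduction=1.0)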
In this example, I fixed the number of trees to 500 and tuned max_depth and min_child_weight. Hyperparameter tuning is a task that can be easily parallelized or distributed, and GraphLab Create provides different running environments for such tasks. For demonstration, the following uses the local environment, but you can define environments backed by your local Hadoop cluster or EC2 machines.
Since 1.3, we have simplified the API for model_parameter_search:
In [8]:
def parameter_search(training, validation, target):
    """
    Return the optimal parameters in the given search space.
    The returned parameters have the lowest validation RMSE.
    """
    parameter_grid = {'features': [new_features],
                      'target': [target],
                      'max_depth': [10, 15, 20],
                      'min_child_weight': [5, 10, 20],
                      'step_size': [0.05],
                      'max_iterations': [500]}
    job = gl.model_parameter_search.grid_search.create((training, validation),
                                                       model_factory=gl.boosted_trees_regression.create,
                                                       model_parameters=parameter_grid,
                                                       return_model=False)
    # When the job is done, the result contains an SFrame summarizing
    # the evaluation metrics for each parameter set.
    summary = job.get_results()
    sorted_summary = summary.sort('validation_rmse', ascending=True)
    print sorted_summary
    # Return the parameters with the lowest validation error.
    optimal_params = sorted_summary[['max_depth', 'min_child_weight']][0]
    optimal_rmse = sorted_summary[0]['validation_rmse']
    print 'Optimal parameters: %s' % str(optimal_params)
    print 'RMSE: %s' % str(optimal_rmse)
    return optimal_params
The call to get_results() in the above function is blocking. While waiting, you can open up a new notebook and visualize the progress:
- gl.deploy.jobs to list all the jobs
- gl.deploy.jobs[i].show() to open Canvas and watch the job's progress
In [9]:
training = training_sframe[training_sframe['day'] <= 16]
validation = training_sframe[training_sframe['day'] > 16]

params_log_casual = parameter_search(training,
                                     validation,
                                     target='log-casual')
params_log_registered = parameter_search(training,
                                         validation,
                                         target='log-registered')
Doing hyperparameter search requires us to hold out a validation set from the original training data. For the final submission, we want to train models that take full advantage of the provided training data.
In [10]:
m_log_registered = gl.boosted_trees_regression.create(training_sframe,
                                                      target='log-registered',
                                                      features=new_features,
                                                      **params_log_registered)
m_log_casual = gl.boosted_trees_regression.create(training_sframe,
                                                  target='log-casual',
                                                  features=new_features,
                                                  **params_log_casual)

final_prediction = fused_predict(m_log_registered, m_log_casual, test_sframe)
make_submission(final_prediction, 'submission2.txt')
Now try submitting the new result here. (Again, you will need a Kaggle account to submit your result).
Different kinds of models have different strengths. The boosted trees model is very good at handling tabular data with numerical features, or categorical features with no more than a few hundred categories.
One important note is that tree-based models are not designed to work with very sparse features. When dealing with sparse input data (e.g., high-dimensional categorical features), we can either pre-process the sparse features to generate numerical statistics, or switch to a linear model, which is better suited for such scenarios.
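As a minimal sketch of the linear-model alternative, assuming the same SFrame and feature list as above (this is not part of the competition workflow):

# Hypothetical fallback: a linear model on the same features,
# predicting the log-transformed total count directly.
m_linear = gl.linear_regression.create(training_sframe,
                                       features=new_features,
                                       target='log-count')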