In this notebook, we will introduce how to use Gradient Boosted Trees in GraphLab Create to tackle a Kaggle competition: forecasting the usage of a city bikeshare system. The link to the competition is here.
The notebook has three parts: a quick start that trains a baseline model and creates a first submission, a feature engineering pass over the datetime and count columns, and a hyperparameter search for the boosted trees model.
To run the following code, download the data from here and make sure you have GraphLab Create 1.3 or later.
In [1]:
import graphlab as gl
gl.canvas.set_target('ipynb')
In [2]:
# load training data
training_sframe = gl.SFrame.read_csv('train.csv', verbose=False)
# train a model
features = ['datetime', 'season', 'holiday', 'workingday', 'weather',
            'temp', 'atemp', 'humidity', 'windspeed']
m = gl.boosted_trees_regression.create(training_sframe,
                                       features=features,
                                       target='count')
# predict on test data
test_sframe = gl.SFrame.read_csv('test.csv', verbose=False)
prediction = m.predict(test_sframe)
The following code saves the prediction to disk, at which point you can submit it to the Kaggle website. (You will need a Kaggle account to submit your result).
In [3]:
def make_submission(prediction, filename='submission.txt'):
    with open(filename, 'w') as f:
        f.write('datetime,count\n')
        submission_strings = test_sframe['datetime'] + ',' + prediction.astype(str)
        for row in submission_strings:
            f.write(row + '\n')

make_submission(prediction, 'submission1.txt')
In [4]:
training_sframe.show()
Looking at the data in Canvas, I quickly realized that we need a few transformations to incorporate our knowledge about the data and the evaluation metric.
First, the target column count, which we are trying to predict, is really the sum of two other columns: registered, the count of registered users, and casual, the count of unregistered users. Training separate models to predict each of these counts should yield better results.
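As a quick sanity check (not part of the original workflow, but cheap to run), we can verify this decomposition directly:

# Verify that 'count' equals 'casual' + 'registered' on every training row.
assert (training_sframe['count'] ==
        training_sframe['casual'] + training_sframe['registered']).all()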
Second, the evaluation metric is the RMSE in the log domain, so we should transform the target columns into the log domain as well.
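For reference, here is a minimal sketch of that metric, often called RMSLE, written for plain Python lists; the authoritative definition is on the Kaggle evaluation page:

import math

def rmsle(predicted, actual):
    """Root mean squared error of log(1 + x)-transformed values."""
    sq_log_errors = [(math.log(1 + p) - math.log(1 + a)) ** 2
                     for p, a in zip(predicted, actual)]
    return math.sqrt(sum(sq_log_errors) / len(sq_log_errors))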
Finally, the datetime column should be split into separate year, month, weekday, and hour columns. Strings are treated as categorical variables, so the trees were treating each datetime as a unique value with no ordering relative to the others.
In [5]:
date_format_str = '%Y-%m-%d %H:%M:%S'

def process_date_column(data_sframe):
    """Split the 'datetime' column of a given SFrame into separate columns."""
    datetime_col = data_sframe['datetime']
    parsed_datetime = datetime_col.str_to_datetime(date_format_str)
    parsed_datetime_sf = parsed_datetime.split_datetime(column_name_prefix='',
                                                        limit=['year', 'month', 'day', 'hour'])
    for col in ['year', 'month', 'day', 'hour']:
        data_sframe[col] = parsed_datetime_sf[col]
    data_sframe['weekday'] = parsed_datetime.apply(lambda x: x.weekday())

process_date_column(training_sframe)
process_date_column(test_sframe)
In [6]:
import math

# Create three new columns: log-casual, log-registered, and log-count
for col in ['casual', 'registered', 'count']:
    training_sframe['log-' + col] = training_sframe[col].apply(lambda x: math.log(1 + x))
In [7]:
new_features = features + ['year', 'month', 'hour', 'weekday']
new_features.remove('datetime')

m1 = gl.boosted_trees_regression.create(training_sframe,
                                        features=new_features,
                                        target='log-casual')
m2 = gl.boosted_trees_regression.create(training_sframe,
                                        features=new_features,
                                        target='log-registered')

def fused_predict(m1, m2, test_sframe):
    """
    Fuse the predictions of two separately trained models.
    The input models are trained in the log domain.
    Return the combined predictions in the original domain.
    """
    p1 = m1.predict(test_sframe).apply(lambda x: math.exp(x) - 1)
    p2 = m2.predict(test_sframe).apply(lambda x: math.exp(x) - 1)
    return (p1 + p2).apply(lambda x: x if x > 0 else 0)

prediction = fused_predict(m1, m2, test_sframe)
We have done the simple feature engineering of transforming the datetime column and the count columns, and made a fused_predict() function that combines the predictions of two models trained separately, in the log domain, on the registered and casual target columns. Use the make_submission() function to see how much improvement we have made. (Again, you will need a Kaggle account to submit your result.)
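For instance, with an arbitrary filename:

make_submission(prediction, 'submission_features.txt')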
In this section, we will use the model_parameter_search() function to search for the best hyperparameters. There are a few important parameters in the gradient boosted trees model:
- max_iterations determines the number of trees in the final model. Usually, the more trees, the higher the prediction accuracy, but both training and prediction time also grow linearly in the number of trees.
- max_depth restricts the depth of each individual tree to prevent overfitting.
- min_child_weight also regularizes the model complexity by restricting the minimum number of observations contained at each leaf node.
- min_loss_reduction sets the minimum reduction of the loss function required for a node split; it works similarly to min_child_weight.
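To make these knobs concrete, here is a single model with all four parameters set explicitly; the values below are illustrative placeholders, not tuned ones:

# Illustrative, untuned parameter values for a single model.
m_example = gl.boosted_trees_regression.create(training_sframe,
                                               features=new_features,
                                               target='log-registered',
                                               max_iterations=500,
                                               max_depth=10,
                                               min_child_weight=5,
                                               min_loss_reduction=1.0)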
In this example, I fixed the number of trees to 500 and tuned max_depth and min_child_weight. Hyperparameter tuning is a task that can be easily parallelized or distributed, and GraphLab Create provides different running environments for such tasks. For demonstration, the following uses the local environment, but you can define environments backed by your local Hadoop cluster or EC2 machines.
Since 1.3, we have simplified the API for model_parameter_search:
In [8]:
def parameter_search(training, validation, target):
    """
    Return the optimal parameters in the given search space.
    The returned parameters have the lowest validation RMSE.
    """
    parameter_grid = {'features': [new_features],
                      'target': [target],
                      'max_depth': [10, 15, 20],
                      'min_child_weight': [5, 10, 20],
                      'step_size': [0.05],
                      'max_iterations': [500]}
    job = gl.model_parameter_search.grid_search.create((training, validation),
                                                       model_factory=gl.boosted_trees_regression.create,
                                                       model_parameters=parameter_grid,
                                                       return_model=False)
    # When the job is done, the result contains an SFrame summarizing
    # the evaluation metrics for each parameter set.
    summary = job.get_results()
    sorted_summary = summary.sort('validation_rmse', ascending=True)
    print sorted_summary
    # Return the parameters with the lowest validation error.
    optimal_params = sorted_summary[['max_depth', 'min_child_weight']][0]
    optimal_rmse = sorted_summary[0]['validation_rmse']
    print 'Optimal parameters: %s' % str(optimal_params)
    print 'RMSE: %s' % str(optimal_rmse)
    return optimal_params
The call to get_results() in the above function is blocking. While waiting, you can open up a new notebook and visualize the progress:
- gl.deploy.jobs to list all the jobs
- gl.deploy.jobs[i].show() to open Canvas and watch the job's progress
In [9]:
training = training_sframe[training_sframe['day'] <= 16]
validation = training_sframe[training_sframe['day'] > 16]

params_log_casual = parameter_search(training,
                                     validation,
                                     target='log-casual')
params_log_registered = parameter_search(training,
                                         validation,
                                         target='log-registered')
Doing hyperparameter search requires us to hold out a validation set from the original training data. For the final submission, we want to train models that take full advantage of the provided training data.
In [10]:
m_log_registered = gl.boosted_trees_regression.create(training_sframe,
                                                      target='log-registered',
                                                      features=new_features,
                                                      **params_log_registered)
m_log_casual = gl.boosted_trees_regression.create(training_sframe,
                                                  target='log-casual',
                                                  features=new_features,
                                                  **params_log_casual)

final_prediction = fused_predict(m_log_registered, m_log_casual, test_sframe)
make_submission(final_prediction, 'submission2.txt')
Now try submitting the new result here. (Again, you will need a Kaggle account to submit your result).
Different kinds of models have different strengths. The boosted trees model is very good at handling tabular data with numerical features, or categorical features with no more than a few hundred categories.
One important note is that tree-based models are not designed to work with very sparse features. When dealing with sparse input data (e.g., high-dimensional categorical features), we can either pre-process the sparse features to generate numerical statistics, or switch to a linear model, which is better suited for such scenarios.
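As a minimal sketch of the linear-model alternative, assuming the same SFrame and feature list as above (this is not part of the competition workflow):

# Hypothetical fallback: a linear model on the same features,
# predicting the log-transformed total count directly.
m_linear = gl.linear_regression.create(training_sframe,
                                       features=new_features,
                                       target='log-count')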