Click-Through-Rate Prediction with GraphLab Create Feature Engineering Transformers

Feature engineering is well known to be one of the key ingredients in a successful intelligent application. In this notebook, we introduce what feature engineering is, why it is important, and how to engineer features. To do so, we work through a Kaggle competition on click-through rate (CTR) prediction using Avazu's anonymized dataset.

More specifically, you will learn how to:

  1. Build a click-through-rate predictor using GraphLab Create.
  2. Improve the quality of the predictions using complex feature engineering transformers provided in GraphLab Create.

The specific feature engineering techniques used in this notebook are:

  1. Datetime parsing to extract day and hour features
  2. Interaction terms between the site_id and app_id features
  3. Quadratic features for all-pair interaction terms
  4. One-hot encoding to limit each categorical feature to its most frequent categories
  5. Feature hashing to reduce the overall dimensionality

The goal is not to try to win this competition, but to show that prediction accuracy can be greatly improved with feature engineering tools and some more time.

Note: This notebook requires GraphLab Create 1.3 or later.


In [1]:
import graphlab as gl
gl.canvas.set_target('ipynb')

Here is the description of the dataset (from Kaggle):

For this competition, we have provided 11 days worth of Avazu data to build and test prediction models. Can you find a strategy that beats standard classification algorithms?

Data fields

id: ad identifier
click: 0/1 for non-click/click
hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
C1 -- anonymized categorical variable
banner_pos
site_id
site_domain
site_category
app_id
app_domain
app_category
device_id
device_ip
device_model
device_type
device_conn_type
C14-C21 -- anonymized categorical variables

1. Load the data

We need to load the data. The data is provided as a gzipped-CSV, so we download it from Kaggle's website and then load it locally using SFrame.read_csv. Since the dataset is almost entirely categorical, we set the column_type_hints to str, and then simply change the click column to int.

Since feature engineering involves a lot of iterative steps, and to save disk space and loading time, it is often a good idea to save the SFrame in its native binary format. That way you can restart this IPython session and quickly pick up where you left off.


In [2]:
import os

if not os.path.exists('train.gl'):
    path = 'train.gz' # path to dataset - download from Kaggle, 1.2GB
    train = gl.SFrame.read_csv(path, column_type_hints=str)
    train['click'] = train['click'].astype(int)
    train.save('train.gl')
else:
    train = gl.SFrame('train.gl')


[INFO] This commercial license of GraphLab Create is assigned to engr@turi.com.

[INFO] Start server at: ipc:///tmp/graphlab_server-72603 - Server binary: /Users/bdol/anaconda/envs/clickthrough_test/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1437153352.log
[INFO] GraphLab Server Version: 1.4.830
PROGRESS: Read 336736 lines. Lines per second: 126827
PROGRESS: Read 1346636 lines. Lines per second: 166778
PROGRESS: Read 2355736 lines. Lines per second: 173668
PROGRESS: Read 3364024 lines. Lines per second: 174608
PROGRESS: Read 4372157 lines. Lines per second: 172454
PROGRESS: Read 5045269 lines. Lines per second: 164179
PROGRESS: Read 5717567 lines. Lines per second: 159242
PROGRESS: Read 6725242 lines. Lines per second: 161165
PROGRESS: Read 7732451 lines. Lines per second: 160115
PROGRESS: Read 8741759 lines. Lines per second: 160537
PROGRESS: Read 9751627 lines. Lines per second: 161166
PROGRESS: Read 10759260 lines. Lines per second: 160118
PROGRESS: Read 11766454 lines. Lines per second: 159164
PROGRESS: Read 12777884 lines. Lines per second: 157526
PROGRESS: Read 13450141 lines. Lines per second: 155918
PROGRESS: Read 14459648 lines. Lines per second: 155187
PROGRESS: Read 15465592 lines. Lines per second: 154372
PROGRESS: Read 16471502 lines. Lines per second: 153696
PROGRESS: Read 17479087 lines. Lines per second: 153377
PROGRESS: Read 18486842 lines. Lines per second: 153254
PROGRESS: Read 19494209 lines. Lines per second: 153256
PROGRESS: Read 20501474 lines. Lines per second: 153324
PROGRESS: Read 21510591 lines. Lines per second: 153025
PROGRESS: Read 22517836 lines. Lines per second: 153076
PROGRESS: Read 23526032 lines. Lines per second: 152779
PROGRESS: Read 24196556 lines. Lines per second: 152197
PROGRESS: Read 25203931 lines. Lines per second: 152041
PROGRESS: Read 26209876 lines. Lines per second: 151789
PROGRESS: Read 27217728 lines. Lines per second: 151774
PROGRESS: Read 28223267 lines. Lines per second: 151587
PROGRESS: Read 29228946 lines. Lines per second: 151370
PROGRESS: Read 30232532 lines. Lines per second: 151143
PROGRESS: Read 31238088 lines. Lines per second: 151346
PROGRESS: Read 32245216 lines. Lines per second: 151216
PROGRESS: Read 33251298 lines. Lines per second: 151127
PROGRESS: Read 34258362 lines. Lines per second: 150982
PROGRESS: Read 35264915 lines. Lines per second: 150548
PROGRESS: Read 35938177 lines. Lines per second: 150099
PROGRESS: Read 36611186 lines. Lines per second: 149690
PROGRESS: Read 37617474 lines. Lines per second: 149212
PROGRESS: Read 38624198 lines. Lines per second: 149027
PROGRESS: Read 39630509 lines. Lines per second: 149046
PROGRESS: Finished parsing file /Users/bdol/graphlab/turi.com/src/learn/gallery/notebooks/train.gz
PROGRESS: Parsing completed. Parsed 40428967 lines in 271.284 secs.

In [3]:
train


Out[3]:
id click hour C1 banner_pos site_id site_domain site_category app_id
1000009418151094273 0 14102100 1005 0 1fbe01fe f3845767 28905ebd ecad2386
10000169349117863715 0 14102100 1005 0 1fbe01fe f3845767 28905ebd ecad2386
10000371904215119486 0 14102100 1005 0 1fbe01fe f3845767 28905ebd ecad2386
10000640724480838376 0 14102100 1005 0 1fbe01fe f3845767 28905ebd ecad2386
10000679056417042096 0 14102100 1005 1 fe8cc448 9166c161 0569f928 ecad2386
10000720757801103869 0 14102100 1005 0 d6137915 bb1ef334 f028772b ecad2386
10000724729988544911 0 14102100 1005 0 8fda644b 25d4cfcd f028772b ecad2386
10000918755742328737 0 14102100 1005 1 e151e245 7e091613 f028772b ecad2386
10000949271186029916 1 14102100 1005 0 1fbe01fe f3845767 28905ebd ecad2386
10001264480619467364 0 14102100 1002 0 84c7ba46 c4e18dd6 50e219e0 ecad2386
app_domain app_category device_id device_ip device_model device_type device_conn_type C14 C15 C16
7801e8d9 07d7df22 a99f214a ddd2926e 44956a24 1 2 15706 320 50
7801e8d9 07d7df22 a99f214a 96809ac8 711ee120 1 0 15704 320 50
7801e8d9 07d7df22 a99f214a b3cf8def 8a4875bd 1 0 15704 320 50
7801e8d9 07d7df22 a99f214a e8275b8f 6332421a 1 0 15706 320 50
7801e8d9 07d7df22 a99f214a 9644d0bf 779d90c2 1 0 18993 320 50
7801e8d9 07d7df22 a99f214a 05241af0 8a4875bd 1 0 16920 320 50
7801e8d9 07d7df22 a99f214a b264c159 be6db1d7 1 0 20362 320 50
7801e8d9 07d7df22 a99f214a e6f67278 be74e6fe 1 0 20632 320 50
7801e8d9 07d7df22 a99f214a 37e8da74 5db079b5 1 2 15707 320 50
7801e8d9 07d7df22 c357dbff f1ac7184 373ecbe6 0 0 21689 320 50
C17 C18 C19 C20 C21
1722 0 35 -1 79
1722 0 35 100084 79
1722 0 35 100084 79
1722 0 35 100084 79
2161 0 35 -1 157
1899 0 431 100077 117
2333 0 39 -1 157
2374 3 39 -1 23
1722 0 35 -1 79
2496 3 167 100191 23
[40428967 rows x 24 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

This dataset has more than 40 million rows. Though that is no problem for GraphLab Create, training models on the full dataset could take a while on most laptops. When doing feature engineering and feature selection you want to be able to train lots of models with different feature combinations to see which features are predictive. Most of these experiments will not pan out, so it is useful to sample the large dataset down to a representative size that allows for fast, iterative model training.

2. Sample & Split the data

For this dataset, it is important not to use a random sample of the click data. Since we have 10 days of data in train, let us take one full day as the sample to use for feature engineering. Because click behavior varies over the course of a day, all of the data from a single day is more representative than a random sample of rows.

To do this, we notice that the first day in the train data is 141021 (which is October 21, 2014) followed by 00-23 for the hours in that day (in UTC). A simple way to get the click-stream traffic for that one day is to filter the 'hour' column for values in [14102100 .. 14102123] using SFrame.filter_by.

Since we want to work with oneday across multiple sessions, we save it in SFrame binary format for easy loading later on.


In [4]:
# let's start with a one-day sample (our first bit of Feature Engineering!)
if not os.path.exists('oneday.gl'):
    # this filters the SFrame, returning the SFrame where the hour column is 14102100 to 14102123
    # which is the 2014-10-21 day of clickstream traffic
    oneday = train.filter_by(['141021%02d' % hour for hour in range(24)], 'hour')
    oneday.save('oneday.gl')
else:
    oneday = gl.SFrame('oneday.gl')

In [5]:
oneday


Out[5]:
id click hour C1 banner_pos site_id site_domain site_category app_id
1000009418151094273 0 14102100 1005 0 1fbe01fe f3845767 28905ebd ecad2386
10000169349117863715 0 14102100 1005 0 1fbe01fe f3845767 28905ebd ecad2386
10000371904215119486 0 14102100 1005 0 1fbe01fe f3845767 28905ebd ecad2386
10000640724480838376 0 14102100 1005 0 1fbe01fe f3845767 28905ebd ecad2386
10000679056417042096 0 14102100 1005 1 fe8cc448 9166c161 0569f928 ecad2386
10000720757801103869 0 14102100 1005 0 d6137915 bb1ef334 f028772b ecad2386
10000724729988544911 0 14102100 1005 0 8fda644b 25d4cfcd f028772b ecad2386
10000918755742328737 0 14102100 1005 1 e151e245 7e091613 f028772b ecad2386
10000949271186029916 1 14102100 1005 0 1fbe01fe f3845767 28905ebd ecad2386
10001264480619467364 0 14102100 1002 0 84c7ba46 c4e18dd6 50e219e0 ecad2386
app_domain app_category device_id device_ip device_model device_type device_conn_type C14 C15 C16
7801e8d9 07d7df22 a99f214a ddd2926e 44956a24 1 2 15706 320 50
7801e8d9 07d7df22 a99f214a 96809ac8 711ee120 1 0 15704 320 50
7801e8d9 07d7df22 a99f214a b3cf8def 8a4875bd 1 0 15704 320 50
7801e8d9 07d7df22 a99f214a e8275b8f 6332421a 1 0 15706 320 50
7801e8d9 07d7df22 a99f214a 9644d0bf 779d90c2 1 0 18993 320 50
7801e8d9 07d7df22 a99f214a 05241af0 8a4875bd 1 0 16920 320 50
7801e8d9 07d7df22 a99f214a b264c159 be6db1d7 1 0 20362 320 50
7801e8d9 07d7df22 a99f214a e6f67278 be74e6fe 1 0 20632 320 50
7801e8d9 07d7df22 a99f214a 37e8da74 5db079b5 1 2 15707 320 50
7801e8d9 07d7df22 c357dbff f1ac7184 373ecbe6 0 0 21689 320 50
C17 C18 C19 C20 C21
1722 0 35 -1 79
1722 0 35 100084 79
1722 0 35 100084 79
1722 0 35 100084 79
2161 0 35 -1 157
1899 0 431 100077 117
2333 0 39 -1 157
2374 3 39 -1 23
1722 0 35 -1 79
2496 3 167 100191 23
[4122995 rows x 24 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [6]:
oneday.show()


3. Data Science 101: Create Train / Validation Split

Now that we have sampled and taken a quick look at the data, let us create a train-validation split. Doing this once ensures that we can train several models using different features and evaluate each of them on the same data.

We use a random split of the one-day sample (using SFrame.random_split), with 90% used for training and 10% for validation. We also specify a seed for reproducibility.


In [7]:
train, val = oneday.random_split(0.9, seed=12345)

In [8]:
train.print_rows(5)


+----------------------+-------+----------+------+------------+----------+
|          id          | click |   hour   |  C1  | banner_pos | site_id  |
+----------------------+-------+----------+------+------------+----------+
| 1000009418151094273  |   0   | 14102100 | 1005 |     0      | 1fbe01fe |
| 10000169349117863715 |   0   | 14102100 | 1005 |     0      | 1fbe01fe |
| 10000371904215119486 |   0   | 14102100 | 1005 |     0      | 1fbe01fe |
| 10000640724480838376 |   0   | 14102100 | 1005 |     0      | 1fbe01fe |
| 10000679056417042096 |   0   | 14102100 | 1005 |     1      | fe8cc448 |
+----------------------+-------+----------+------+------------+----------+
+-------------+---------------+----------+------------+--------------+-----------+
| site_domain | site_category |  app_id  | app_domain | app_category | device_id |
+-------------+---------------+----------+------------+--------------+-----------+
|   f3845767  |    28905ebd   | ecad2386 |  7801e8d9  |   07d7df22   |  a99f214a |
|   f3845767  |    28905ebd   | ecad2386 |  7801e8d9  |   07d7df22   |  a99f214a |
|   f3845767  |    28905ebd   | ecad2386 |  7801e8d9  |   07d7df22   |  a99f214a |
|   f3845767  |    28905ebd   | ecad2386 |  7801e8d9  |   07d7df22   |  a99f214a |
|   9166c161  |    0569f928   | ecad2386 |  7801e8d9  |   07d7df22   |  a99f214a |
+-------------+---------------+----------+------------+--------------+-----------+
+-----------+--------------+-------------+------------------+-------+-----+-----+
| device_ip | device_model | device_type | device_conn_type |  C14  | C15 | C16 |
+-----------+--------------+-------------+------------------+-------+-----+-----+
|  ddd2926e |   44956a24   |      1      |        2         | 15706 | 320 |  50 |
|  96809ac8 |   711ee120   |      1      |        0         | 15704 | 320 |  50 |
|  b3cf8def |   8a4875bd   |      1      |        0         | 15704 | 320 |  50 |
|  e8275b8f |   6332421a   |      1      |        0         | 15706 | 320 |  50 |
|  9644d0bf |   779d90c2   |      1      |        0         | 18993 | 320 |  50 |
+-----------+--------------+-------------+------------------+-------+-----+-----+
+------+-----+-----+--------+-----+
| C17  | C18 | C19 |  C20   | C21 |
+------+-----+-----+--------+-----+
| 1722 |  0  |  35 |   -1   |  79 |
| 1722 |  0  |  35 | 100084 |  79 |
| 1722 |  0  |  35 | 100084 |  79 |
| 1722 |  0  |  35 | 100084 |  79 |
| 2161 |  0  |  35 |   -1   | 157 |
+------+-----+-----+--------+-----+
[3710945 rows x 24 columns]

4. Baseline Model

As a baseline model, let us feed all of the categorical features as-is to the logistic classifier. This is not only a great way to get started, but also acts as a reasonable baseline prior to feature engineering and feature selection. To get all the features we call SFrame.column_names and then remove the click column (the target) and the id column, which is not meaningful in this application.


In [9]:
features = train.column_names()
features.remove('click') # cannot have target in features list
features.remove('id')    # id is not useful for training the model
print features


['hour', 'C1', 'banner_pos', 'site_id', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category', 'device_id', 'device_ip', 'device_model', 'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21']

In [10]:
# Baseline: use ALL features as is
baseline = gl.logistic_classifier.create(train, features=features, target='click', validation_set=val)


PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 3710945
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 22
PROGRESS: Number of unpacked features : 22
PROGRESS: Number of coefficients    : 1377440
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 3        | 0.000000  | 11.352040    | 0.825720          | 0.826538            |
PROGRESS: | 2         | 5        | 1.000000  | 17.485349    | 0.836532          | 0.827994            |
PROGRESS: | 3         | 6        | 1.000000  | 21.652770    | 0.730601          | 0.684546            |
PROGRESS: | 4         | 8        | 1.000000  | 29.905691    | 0.876269          | 0.831734            |
PROGRESS: | 5         | 9        | 1.000000  | 34.128402    | 0.884881          | 0.830637            |
PROGRESS: | 6         | 10       | 1.000000  | 38.261015    | 0.887878          | 0.824951            |
PROGRESS: | 7         | 11       | 1.000000  | 42.092548    | 0.889639          | 0.823315            |
PROGRESS: | 8         | 12       | 1.000000  | 46.108497    | 0.890809          | 0.823393            |
PROGRESS: | 9         | 13       | 1.000000  | 51.264788    | 0.891148          | 0.823055            |
PROGRESS: | 10        | 14       | 1.000000  | 55.588145    | 0.891176          | 0.822708            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+

Uh-oh, we might be overfitting. Notice that there are 1.4M coefficients for only 3.7M examples. Also notice how the training accuracy continues to climb while the validation accuracy plateaus and then drifts back down. That is a strong indicator that we are overfitting.
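
If we wanted to stay with this full feature set, one standard mitigation, not pursued in this notebook since the focus is feature engineering, would be stronger regularization. A minimal sketch, assuming we keep the features list from above; the l2_penalty value of 0.1 is only illustrative:

# Sketch only: a larger L2 penalty shrinks the ~1.4M coefficients toward zero,
# which usually narrows the gap between training and validation accuracy.
regularized = gl.logistic_classifier.create(train, features=features,
                                            target='click', validation_set=val,
                                            l2_penalty=0.1)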

Model Evaluation

To evaluate how well we are doing, we will use the Kaggle metric for this competition: log loss (log_loss).
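
Concretely, log loss is the average negative log-likelihood of the observed clicks,

$$\mathrm{log\_loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$

where y_i is the observed click (0 or 1) and p_i is the predicted click probability. The quick helper below implements this in Python, clipping predictions away from 0 and 1 so that the logarithms stay finite.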


In [11]:
import math
def log_loss_raw(target, predicted):
    '''Calculate log_loss between target and predicted and return.'''
    # clip predictions away from 0 and 1 so that log() stays finite
    p = predicted.apply(lambda x: min(0.99999, max(1e-5, x)))
    logp = p.apply(lambda x: math.log(x))
    logmp = p.apply(lambda x: math.log(1 - x))
    return -(target * logp + (1 - target) * logmp).mean()

def eval_model(model, test):
    '''Evaluate a trained model using Kaggle scoring.'''
    return log_loss_raw(test['click'], model.predict(test, output_type='probability'))

In [12]:
print "%0.20f" % eval_model(baseline, val)


0.50184766410603109943

This is a reasonable model to get started with, but not as good as some of the entries on the leaderboard. We definitely need to do some feature engineering.

5. Feature Engineering & Feature Selection

Feature engineering is the process of representing the data using domain knowledge so that a model can use it effectively. Effective feature engineering is part art and part science. Feature selection is the related discipline of deciding which set of features the model should consider during training. Typically both feature engineering and feature selection are iterative processes.

Getting started with feature engineering

For CTR, it is often good practice to ask yourself the following question:

What domain knowledge do we have about what influences user clicks?

The first thing to consider is the time of day. However, we don't have the time in a convenient format for the model, so we need to do some feature engineering to get the 'hour' of the day as a feature.

We want to modify both the train and the val SFrames, since the same transformations must occur in both places. To make this simpler, we create a function for the time-of-day feature that takes an SFrame and returns the modified SFrame.

Day / Hour Feature Engineering - DateTime parsing


In [13]:
def hour_eng(data):
    '''Parse out day/hour from SFrame.
    
    Given an SFrame with an 'hour' column formatted with YYMMDDHH parse this as a datetime
    and then split to just keep hour and day features (as categorical variables)
    '''
    if 'day' in data.column_names():
        # idempotency check, if transformation already applied then simply return
        return data
    
    data['datetime'] = data['hour'].str_to_datetime('%y%m%d%H')
    data.remove_column('hour')
    data = data.split_datetime('datetime', column_name_prefix='', limit=['day', 'hour'])
    data['hour'] = data['hour'].astype(str)
    data['day'] = data['day'].astype(str)
    return data

In [14]:
# to measure how we are doing, we need to apply to both train and validation SFrames
train = hour_eng(train)
val = hour_eng(val)
print train.column_names()


['id', 'click', 'C1', 'banner_pos', 'site_id', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category', 'device_id', 'device_ip', 'device_model', 'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'day', 'hour']

Notice how there is a new 'day' column and a new 'hour' column. These are the new columns/features added by the hour_eng function.

Basic Feature Selection: Pick the day, hour, and site_id features


In [15]:
features = ['site_id', 'day', 'hour']
model = gl.logistic_classifier.create(train, features=features, 
                                      target='click', validation_set=val)


PROGRESS: WARNING: Detected extremely low variance for feature(s) 'day' because all entries are nearly the same.
Proceeding with model training using all features. If the model does not provide results of adequate quality, exclude the above mentioned feature(s) from the input dataset.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 3710945
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 3
PROGRESS: Number of unpacked features : 3
PROGRESS: Number of coefficients    : 2835
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 3        | 0.000000  | 2.658805     | 0.825722          | 0.826538            |
PROGRESS: | 2         | 5        | 1.000000  | 4.812624     | 0.825830          | 0.826577            |
PROGRESS: | 3         | 6        | 1.000000  | 6.006163     | 0.826797          | 0.827346            |
PROGRESS: | 4         | 7        | 1.000000  | 7.212546     | 0.826423          | 0.827150            |
PROGRESS: | 5         | 8        | 1.000000  | 8.439682     | 0.827090          | 0.827618            |
PROGRESS: | 6         | 9        | 1.000000  | 9.742179     | 0.827105          | 0.827650            |
PROGRESS: | 7         | 10       | 1.000000  | 11.064490    | 0.827145          | 0.827533            |
PROGRESS: | 8         | 11       | 1.000000  | 12.385897    | 0.827142          | 0.827538            |
PROGRESS: | 9         | 12       | 1.000000  | 13.655575    | 0.827139          | 0.827662            |
PROGRESS: | 10        | 13       | 1.000000  | 14.918936    | 0.826808          | 0.827528            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+

Note: The classifier has warned us that the column 'day' has low variance. That is expected: since we are training on only one day's worth of data, every row has the same day, so the feature carries no signal and could safely be removed.


In [16]:
print "%0.20f" % eval_model(model, val)


0.42543819222628642684

That is a big improvement and a good first step but we can do better with some more feature engineering.

Interaction Terms: Building new features from existing ones

I believe the click-through rate depends on which site the user is on as well as which app the user is seeing or using, so I want to create a new feature that represents the interaction of the site_id and app_id features. This is a cross-product of the two features: for example, site_id '1fbe01fe' and app_id 'ecad2386' combine into the single category '1fbe01feecad2386'.

Again it is easiest to have a function so this set of transformations can be done on both the training set and the validation set, as below.


In [17]:
def site_app(data):
    # if already has a site_app then consider this operation completed.
    if 'site_app' in data.column_names():
        return data
    
    data['site_app'] = data['site_id'] + data['app_id']
    return data

In [18]:
train = site_app(train)
val = site_app(val)
print train.column_names()


['id', 'click', 'C1', 'banner_pos', 'site_id', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category', 'device_id', 'device_ip', 'device_model', 'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'day', 'hour', 'site_app']

Now let's use this newly generated feature in place of 'site_id' and see how we do. Note that 'site_id' itself is dropped from the feature list, since its signal is now captured by 'site_app'.


In [19]:
features=['site_app', 'day', 'hour']
model = gl.logistic_classifier.create(train, features=features, target='click', validation_set=val)


PROGRESS: WARNING: Detected extremely low variance for feature(s) 'day' because all entries are nearly the same.
Proceeding with model training using all features. If the model does not provide results of adequate quality, exclude the above mentioned feature(s) from the input dataset.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 3710945
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 3
PROGRESS: Number of unpacked features : 3
PROGRESS: Number of coefficients    : 6887
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 3        | 0.000000  | 2.718713     | 0.825732          | 0.826538            |
PROGRESS: | 2         | 5        | 1.000000  | 4.820923     | 0.826295          | 0.826929            |
PROGRESS: | 3         | 6        | 1.000000  | 6.095525     | 0.827564          | 0.827880            |
PROGRESS: | 4         | 7        | 1.000000  | 7.545748     | 0.827019          | 0.827346            |
PROGRESS: | 5         | 8        | 1.000000  | 9.071496     | 0.827644          | 0.828004            |
PROGRESS: | 6         | 9        | 1.000000  | 10.449221    | 0.827791          | 0.828101            |
PROGRESS: | 7         | 10       | 1.000000  | 11.652915    | 0.827726          | 0.828091            |
PROGRESS: | 8         | 11       | 1.000000  | 12.867802    | 0.827680          | 0.828060            |
PROGRESS: | 9         | 12       | 1.000000  | 14.059563    | 0.827815          | 0.828145            |
PROGRESS: | 10        | 13       | 1.000000  | 15.331704    | 0.827629          | 0.828074            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+

In [20]:
print "%0.20f" % eval_model(model, val)


0.40996175377622545710

That is better, but let us see if we can add some more complex feature engineering transformations to get better results.

Quadratic Features: All-pair interaction terms from all features.

With the feature engineering transformers we can go beyond the single interaction described above and generate a feature containing all pairwise combinations of a set of features, using the QuadraticFeatures transformer, as follows.

Note: This step can take quite a while (about an hour on a basic Macbook Pro)


In [21]:
from graphlab import feature_engineering as fe

# QuadraticFeatures requires float, integer, array, or dictionary column types,
# so cast the (string-typed) interaction columns to int first
interaction_columns = ['C1', 'banner_pos', 
                       'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 
                       'C20', 'C21', 'hour']

for feature in interaction_columns:
    train[feature] = train[feature].astype(int)

quad = fe.create(train, fe.QuadraticFeatures(features=interaction_columns))

In [22]:
def apply_quadratic(dataset):
    if 'quadratic_features' in dataset.column_names():
        # operation already performed, do nothing
        return dataset
    for feature in interaction_columns:
        dataset[feature] = dataset[feature].astype(int)
    return quad.transform(dataset)

In [23]:
# this adds a 'quadratic_features' column to both SFrames
train = apply_quadratic(train)
val = apply_quadratic(val)

In [24]:
features=['site_app', 'day', 'hour', 'quadratic_features']
model = gl.logistic_classifier.create(train, features=features, 
                                      target='click', validation_set=val, max_iterations=20)


PROGRESS: WARNING: Detected extremely low variance for feature(s) 'day' because all entries are nearly the same.
Proceeding with model training using all features. If the model does not provide results of adequate quality, exclude the above mentioned feature(s) from the input dataset.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 3710945
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 4
PROGRESS: Number of unpacked features : 94
PROGRESS: Number of coefficients    : 6956
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 5        | 0.000000  | 45.533969    | 0.825720          | 0.826538            |
PROGRESS: | 2         | 7        | 1.000000  | 70.482053    | 0.825720          | 0.826538            |
PROGRESS: | 3         | 8        | 1.000000  | 87.371969    | 0.807487          | 0.807409            |
PROGRESS: | 4         | 10       | 1.000000  | 111.721498   | 0.825720          | 0.826538            |
PROGRESS: | 5         | 11       | 1.000000  | 127.704416   | 0.825724          | 0.826538            |
PROGRESS: | 6         | 12       | 1.000000  | 143.244673   | 0.825916          | 0.826715            |
PROGRESS: | 7         | 13       | 1.000000  | 159.021249   | 0.826398          | 0.827133            |
PROGRESS: | 8         | 14       | 1.000000  | 174.844463   | 0.825627          | 0.826094            |
PROGRESS: | 9         | 15       | 1.000000  | 190.633223   | 0.826568          | 0.826798            |
PROGRESS: | 10        | 16       | 1.000000  | 206.159300   | 0.827544          | 0.827960            |
PROGRESS: | 11        | 18       | 1.000000  | 230.489839   | 0.827624          | 0.827637            |
PROGRESS: | 12        | 19       | 1.000000  | 246.828216   | 0.827161          | 0.827310            |
PROGRESS: | 13        | 20       | 1.000000  | 264.410772   | 0.826654          | 0.826968            |
PROGRESS: | 14        | 21       | 1.000000  | 281.717649   | 0.827167          | 0.827611            |
PROGRESS: | 15        | 22       | 1.000000  | 297.707801   | 0.827813          | 0.828217            |
PROGRESS: | 16        | 23       | 1.000000  | 313.466256   | 0.828205          | 0.828363            |
PROGRESS: | 17        | 25       | 1.000000  | 337.432553   | 0.828368          | 0.828730            |
PROGRESS: | 18        | 26       | 1.000000  | 358.087450   | 0.828440          | 0.828763            |
PROGRESS: | 19        | 27       | 1.000000  | 375.887754   | 0.828519          | 0.828890            |
PROGRESS: | 20        | 28       | 1.000000  | 392.809671   | 0.828580          | 0.828878            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+

In [25]:
print "%0.20f" % eval_model(model, val)


0.40728018889038708483

This is an improvement, so we are on the right track. But let us see if reducing the number of categories kept for each feature (and hence the dimensionality) helps further.

One-Hot Encoding

A one-hot encoder encodes a collection of categorical features using a 1-of-K encoding scheme, and can reduce dimensionality by limiting each feature to a maximum number of categories.
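
To make the output format concrete, here is a minimal sketch on a tiny hypothetical SFrame (the 'color' and 'size' columns are invented for illustration). The fitted encoder emits a single dict-typed encoded_features column whose keys are the indices assigned to each (column, category) pair:

toy = gl.SFrame({'color': ['red', 'blue', 'red'], 'size': ['S', 'L', 'S']})
toy_encoder = fe.create(toy, fe.OneHotEncoder(max_categories=2))
print toy_encoder.transform(toy)
# each row becomes a sparse dict such as {0: 1, 2: 1}, one "hot" index per column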


In [26]:
# Create the OneHotEncoder and fit it to the train dataset.
# Use all of the original categorical features (id, click, and the engineered
# site_app/day/hour columns are excluded), keeping only the 300 most frequent
# categories per feature. quadratic_features is also excluded because it
# already encodes those interactions.
onehot = fe.create(train, fe.OneHotEncoder(
            excluded_features=['id', 'click', 'site_app', 'day', 'hour', 'quadratic_features'], 
            max_categories=300))

In [27]:
# Note: max_categories chooses the top categories based on
# frequency, therefore retaining only the most frequent ones.
def apply_onehot(dataset):
    if 'encoded_features' in dataset.column_names():
        # operation already completed on SFrame
        return dataset
    
    return onehot.transform(dataset)

In [28]:
train = apply_onehot(train)
val = apply_onehot(val)

In [29]:
features=['site_app', 'day', 'hour', 'quadratic_features', 'encoded_features']
model = gl.logistic_classifier.create(train, features=features, target='click', validation_set=val)


PROGRESS: WARNING: Detected extremely low variance for feature(s) 'day' because all entries are nearly the same.
Proceeding with model training using all features. If the model does not provide results of adequate quality, exclude the above mentioned feature(s) from the input dataset.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 3710945
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 5
PROGRESS: Number of unpacked features : 3012
PROGRESS: Number of coefficients    : 9874
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 6        | 0.000000  | 70.576495    | 0.825720          | 0.826538            |
PROGRESS: | 2         | 9        | 5.000000  | 113.657719   | 0.825512          | 0.825931            |
PROGRESS: | 3         | 10       | 5.000000  | 134.748759   | 0.825720          | 0.826538            |
PROGRESS: | 4         | 12       | 1.000000  | 166.412811   | 0.825727          | 0.826548            |
PROGRESS: | 5         | 13       | 1.000000  | 187.360726   | 0.826751          | 0.827164            |
PROGRESS: | 6         | 14       | 1.000000  | 208.052215   | 0.827344          | 0.827671            |
PROGRESS: | 7         | 15       | 1.000000  | 229.716529   | 0.827055          | 0.827637            |
PROGRESS: | 8         | 16       | 1.000000  | 249.929231   | 0.829657          | 0.829674            |
PROGRESS: | 9         | 17       | 1.000000  | 271.027735   | 0.830671          | 0.830581            |
PROGRESS: | 10        | 18       | 1.000000  | 298.356538   | 0.830256          | 0.830135            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+

In [30]:
print "%0.20f" % eval_model(model, val)


0.40357369232907458478

Again we do better by restricting each feature to only its top few categories. Now let us see if we can layer one transformer on top of another, using the FeatureHasher to reduce the overall dimensionality, including that of the quadratic_features column.

Feature Hashing

The feature hasher hashes a collection of features into an n-bit feature space. This greatly reduces dimensionality, at the cost of occasional hash collisions between categories.
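
As a minimal sketch of what this means (again on tiny, invented data), every (column, value) pair is hashed into one of 2^num_bits buckets, so with a small number of bits unrelated features can occasionally collide in the same bucket:

toy = gl.SFrame({'site': ['a', 'b', 'a'], 'app': ['x', 'x', 'y']})
toy_hasher = fe.create(toy, fe.FeatureHasher(num_bits=8))
print toy_hasher.transform(toy)
# each row becomes a sparse dict keyed by hashed bucket indices in [0, 2^8)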


In [31]:
hasher = fe.create(train, fe.FeatureHasher(
          excluded_features=['id', 'click', 'site_app', 'day', 'hour', 'encoded_features'], 
          num_bits=22))

In [32]:
def apply_feature_hasher(dataset):
    if 'hashed_features' in dataset.column_names():
        # feature hasher already performed
        return dataset
    return hasher.transform(dataset)

In [33]:
train = apply_feature_hasher(train)
val = apply_feature_hasher(val)
print train.column_names()


['id', 'click', 'day', 'hour', 'site_app', 'encoded_features', 'hashed_features']

In [34]:
# we drop quadratic_features because it has been hashed by the FeatureHasher, 
# so its signal is encoded in the 'hashed_features' column
features=['site_app', 'day', 'hour', 'encoded_features', 'hashed_features']
model = gl.logistic_classifier.create(train, features=features, target='click', validation_set=val)


PROGRESS: WARNING: Detected extremely low variance for feature(s) 'day' because all entries are nearly the same.
Proceeding with model training using all features. If the model does not provide results of adequate quality, exclude the above mentioned feature(s) from the input dataset.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 3710945
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 5
PROGRESS: Number of unpacked features : 3012
PROGRESS: Number of coefficients    : 9874
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 6        | 0.000000  | 70.896868    | 0.825720          | 0.826538            |
PROGRESS: | 2         | 9        | 5.000000  | 115.045681   | 0.825512          | 0.825931            |
PROGRESS: | 3         | 10       | 5.000000  | 135.466586   | 0.825720          | 0.826538            |
PROGRESS: | 4         | 12       | 1.000000  | 167.251210   | 0.825727          | 0.826548            |
PROGRESS: | 5         | 13       | 1.000000  | 188.308783   | 0.826751          | 0.827164            |
PROGRESS: | 6         | 14       | 1.000000  | 210.500583   | 0.827344          | 0.827671            |
PROGRESS: | 7         | 15       | 1.000000  | 230.813210   | 0.827055          | 0.827637            |
PROGRESS: | 8         | 16       | 1.000000  | 251.941302   | 0.829657          | 0.829674            |
PROGRESS: | 9         | 17       | 1.000000  | 272.294446   | 0.830671          | 0.830581            |
PROGRESS: | 10        | 18       | 1.000000  | 293.590415   | 0.830256          | 0.830135            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+

In [35]:
print "%0.20f" % eval_model(model, val)


0.40357369232907458478

This matches our best score so far.

This log_loss is calculated against the validation set, which was a random 10% split of a single day of clickstream data. That validation set may not be entirely representative, so it would not be surprising if the log_loss on the Kaggle test set is higher. Still, this is a good demonstration of how feature engineering and feature selection can greatly improve a model's results.
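
To actually submit to Kaggle we would need to apply the exact same transformations to the competition's test file and write out one predicted click probability per id. A minimal sketch, assuming the test file has been downloaded as 'test.gz' (it has no click column) and that 'submission.csv' is where we want the output; the functions reused here are the ones defined earlier in this notebook:

test = gl.SFrame.read_csv('test.gz', column_type_hints=str)
test = hour_eng(test)
test = site_app(test)
test = apply_quadratic(test)
test = apply_onehot(test)
test = apply_feature_hasher(test)

submission = gl.SFrame({'id': test['id'],
                        'click': model.predict(test, output_type='probability')})
submission.save('submission.csv', format='csv')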

Recap

In this notebook we covered the following types of feature engineering:

  1. Datetime parsing for extracting day and hour features
  2. Interaction features for site_id and app_id
  3. Quadratic Features for creating full interaction terms between all features in a row.
  4. One-Hot Encoding to reduce the dimensionality within each categorical variable to just the top 300 categories.
  5. Feature Hashing to reduce the overall dimensionality of all the features (including the Quadratic Features)

We have done a fair amount of feature engineering, but have only scratched the surface of the transformers available in GraphLab Create.

Feature Engineering Transformers in GraphLab Create

  1. FeatureHasher: Hashes an input feature space to an n-bit feature space.
  2. QuadraticFeatures: Calculates quadratic interaction terms between features.
  3. FeatureBinner: Feature binning is a method of turning continuous variables into categorical values.
  4. OneHotEncoder: Encode a collection of categorical features using a 1-of-K encoding scheme.
  5. CountThresholder: Map infrequent categorical variables to a new/separate category.
  6. TransformerChain: Sequentially apply a list of transforms.

It is just as easy to write your own custom transformer: simply extend TransformerBase and write arbitrary Python code for the fit and transform methods. This allows for creative combinations of transformations.
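
For example, here is a minimal sketch of a custom transformer that reproduces the site_app interaction feature from earlier. It assumes, per the GraphLab Create documentation of this era, that TransformerBase is importable from graphlab.toolkits.feature_engineering and that fit and transform are the methods to override:

from graphlab.toolkits.feature_engineering import TransformerBase

class SiteAppInteraction(TransformerBase):
    '''Add a site_app column that concatenates site_id and app_id.'''
    def fit(self, dataset):
        # nothing needs to be learned from the data for this transformation
        return self

    def transform(self, dataset):
        dataset['site_app'] = dataset['site_id'] + dataset['app_id']
        return dataset

# it plugs into the same fit/transform workflow as the built-in transformers
transformed = SiteAppInteraction().fit(train).transform(train)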

It is also worth mentioning that we could have chained these transformers together using the TransformerChain class, making it easy to compose complex feature engineering pipelines that are just as simple to apply.
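
A minimal sketch of what that chaining could look like, assuming fe.create accepts a list of transformers and returns a fitted TransformerChain. The parameters mirror the QuadraticFeatures and OneHotEncoder steps above, had we fit them at that earlier point in the pipeline (before the transformed columns were added); train_chained and val_chained are hypothetical output names:

chain = fe.create(train, [
    fe.QuadraticFeatures(features=interaction_columns),
    fe.OneHotEncoder(excluded_features=['id', 'click', 'site_app', 'day', 'hour',
                                        'quadratic_features'],
                     max_categories=300)
])
train_chained = chain.transform(train)
val_chained = chain.transform(val)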

There is a lot more we can do with feature engineering; hopefully this notebook has given you a flavor of how easy and scalable feature engineering is in GraphLab Create!