Click-Through-Rate Prediction with GraphLab Create Feature Engineering Transformers

Feature engineering is well known to be one of the key ingredients in a successful intelligent application. In this notebook, we introduce what feature engineering is, why it is important, and how to engineer features. For this, we work through a Kaggle competition on click-through rate (CTR) prediction using Avazu's anonymized dataset.

More specifically, you will learn how to:

Build a click-through-rate predictor using GraphLab Create.
Improve the quality of the predictions using complex feature engineering transformers provided in GraphLab Create.

The specific feature engineering techniques used in this notebook are:

Date / time features
Interaction Terms (Quadratic Features)
One-Hot Encoding
Feature Hashing

The goal is not to try to win this competition, but instead to show that prediction accuracy can be greatly improved with feature engineering tools and some more time.

Note: This notebook requires GraphLab Create 1.3.

The Avazu dataset



In [1]:

    
import graphlab as gl
gl.canvas.set_target('ipynb')

(from Kaggle):

For this competition, we have provided 11 days worth of Avazu data to build and test prediction models. Can you find a strategy that beats standard classification algorithms?

Data fields

id: ad identifier
click: 0/1 for non-click/click
hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
C1 -- anonymized categorical variable
banner_pos
site_id
site_domain
site_category
app_id
app_domain
app_category
device_id
device_ip
device_model
device_type
device_conn_type
C14-C21 -- anonymized categorical variables

1. Load the data

We need to load the data. The data is provided as a gzipped-CSV, so we download it from Kaggle's website and then load it locally using SFrame.read_csv. Since the dataset is almost entirely categorical, we set the column_type_hints to str, and then simply change the click column to int.

Since feature engineering involves many iterative steps, it is often a good idea to save the SFrame in its native format, which saves disk space and load time. That way you can restart this IPython session and more quickly pick up where you left off.



In [2]:

    
import os

if not os.path.exists('train.gl'):
    path = 'train.gz' # path to dataset - download from Kaggle, 1.2GB
    train = gl.SFrame.read_csv(path, column_type_hints=str)
    train['click'] = train['click'].astype(int)
    train.save('train.gl')
else:
    train = gl.SFrame('train.gl')









    



[INFO] This commercial license of GraphLab Create is assigned to engr@turi.com.

[INFO] Start server at: ipc:///tmp/graphlab_server-72603 - Server binary: /Users/bdol/anaconda/envs/clickthrough_test/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1437153352.log
[INFO] GraphLab Server Version: 1.4.830






    




PROGRESS: Read 336736 lines. Lines per second: 126827
PROGRESS: Read 1346636 lines. Lines per second: 166778
PROGRESS: Read 2355736 lines. Lines per second: 173668
PROGRESS: Read 3364024 lines. Lines per second: 174608
PROGRESS: Read 4372157 lines. Lines per second: 172454
PROGRESS: Read 5045269 lines. Lines per second: 164179
PROGRESS: Read 5717567 lines. Lines per second: 159242
PROGRESS: Read 6725242 lines. Lines per second: 161165
PROGRESS: Read 7732451 lines. Lines per second: 160115
PROGRESS: Read 8741759 lines. Lines per second: 160537
PROGRESS: Read 9751627 lines. Lines per second: 161166
PROGRESS: Read 10759260 lines. Lines per second: 160118
PROGRESS: Read 11766454 lines. Lines per second: 159164
PROGRESS: Read 12777884 lines. Lines per second: 157526
PROGRESS: Read 13450141 lines. Lines per second: 155918
PROGRESS: Read 14459648 lines. Lines per second: 155187
PROGRESS: Read 15465592 lines. Lines per second: 154372
PROGRESS: Read 16471502 lines. Lines per second: 153696
PROGRESS: Read 17479087 lines. Lines per second: 153377
PROGRESS: Read 18486842 lines. Lines per second: 153254
PROGRESS: Read 19494209 lines. Lines per second: 153256
PROGRESS: Read 20501474 lines. Lines per second: 153324
PROGRESS: Read 21510591 lines. Lines per second: 153025
PROGRESS: Read 22517836 lines. Lines per second: 153076
PROGRESS: Read 23526032 lines. Lines per second: 152779
PROGRESS: Read 24196556 lines. Lines per second: 152197
PROGRESS: Read 25203931 lines. Lines per second: 152041
PROGRESS: Read 26209876 lines. Lines per second: 151789
PROGRESS: Read 27217728 lines. Lines per second: 151774
PROGRESS: Read 28223267 lines. Lines per second: 151587
PROGRESS: Read 29228946 lines. Lines per second: 151370
PROGRESS: Read 30232532 lines. Lines per second: 151143
PROGRESS: Read 31238088 lines. Lines per second: 151346
PROGRESS: Read 32245216 lines. Lines per second: 151216
PROGRESS: Read 33251298 lines. Lines per second: 151127
PROGRESS: Read 34258362 lines. Lines per second: 150982
PROGRESS: Read 35264915 lines. Lines per second: 150548
PROGRESS: Read 35938177 lines. Lines per second: 150099
PROGRESS: Read 36611186 lines. Lines per second: 149690
PROGRESS: Read 37617474 lines. Lines per second: 149212
PROGRESS: Read 38624198 lines. Lines per second: 149027
PROGRESS: Read 39630509 lines. Lines per second: 149046
PROGRESS: Finished parsing file /Users/bdol/graphlab/turi.com/src/learn/gallery/notebooks/train.gz
PROGRESS: Parsing completed. Parsed 40428967 lines in 271.284 secs.



In [3]:

    
train









    Out[3]:





    
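+----------------------+-------+----------+------+------------+----------+-------------+---------------+----------+
|          id          | click |   hour   |  C1  | banner_pos | site_id  | site_domain | site_category |  app_id  |
+----------------------+-------+----------+------+------------+----------+-------------+---------------+----------+
| 1000009418151094273  |   0   | 14102100 | 1005 |     0      | 1fbe01fe |   f3845767  |    28905ebd   | ecad2386 |
| 10000169349117863715 |   0   | 14102100 | 1005 |     0      | 1fbe01fe |   f3845767  |    28905ebd   | ecad2386 |
| 10000371904215119486 |   0   | 14102100 | 1005 |     0      | 1fbe01fe |   f3845767  |    28905ebd   | ecad2386 |
| 10000640724480838376 |   0   | 14102100 | 1005 |     0      | 1fbe01fe |   f3845767  |    28905ebd   | ecad2386 |
| 10000679056417042096 |   0   | 14102100 | 1005 |     1      | fe8cc448 |   9166c161  |    0569f928   | ecad2386 |
| 10000720757801103869 |   0   | 14102100 | 1005 |     0      | d6137915 |   bb1ef334  |    f028772b   | ecad2386 |
| 10000724729988544911 |   0   | 14102100 | 1005 |     0      | 8fda644b |   25d4cfcd  |    f028772b   | ecad2386 |
| 10000918755742328737 |   0   | 14102100 | 1005 |     1      | e151e245 |   7e091613  |    f028772b   | ecad2386 |
| 10000949271186029916 |   1   | 14102100 | 1005 |     0      | 1fbe01fe |   f3845767  |    28905ebd   | ecad2386 |
| 10001264480619467364 |   0   | 14102100 | 1002 |     0      | 84c7ba46 |   c4e18dd6  |    50e219e0   | ecad2386 |
+----------------------+-------+----------+------+------------+----------+-------------+---------------+----------+
+------------+--------------+-----------+-----------+--------------+-------------+------------------+-------+-----+-----+
| app_domain | app_category | device_id | device_ip | device_model | device_type | device_conn_type |  C14  | C15 | C16 |
+------------+--------------+-----------+-----------+--------------+-------------+------------------+-------+-----+-----+
|  7801e8d9  |   07d7df22   |  a99f214a |  ddd2926e |   44956a24   |      1      |        2         | 15706 | 320 |  50 |
|  7801e8d9  |   07d7df22   |  a99f214a |  96809ac8 |   711ee120   |      1      |        0         | 15704 | 320 |  50 |
|  7801e8d9  |   07d7df22   |  a99f214a |  b3cf8def |   8a4875bd   |      1      |        0         | 15704 | 320 |  50 |
|  7801e8d9  |   07d7df22   |  a99f214a |  e8275b8f |   6332421a   |      1      |        0         | 15706 | 320 |  50 |
|  7801e8d9  |   07d7df22   |  a99f214a |  9644d0bf |   779d90c2   |      1      |        0         | 18993 | 320 |  50 |
|  7801e8d9  |   07d7df22   |  a99f214a |  05241af0 |   8a4875bd   |      1      |        0         | 16920 | 320 |  50 |
|  7801e8d9  |   07d7df22   |  a99f214a |  b264c159 |   be6db1d7   |      1      |        0         | 20362 | 320 |  50 |
|  7801e8d9  |   07d7df22   |  a99f214a |  e6f67278 |   be74e6fe   |      1      |        0         | 20632 | 320 |  50 |
|  7801e8d9  |   07d7df22   |  a99f214a |  37e8da74 |   5db079b5   |      1      |        2         | 15707 | 320 |  50 |
|  7801e8d9  |   07d7df22   |  c357dbff |  f1ac7184 |   373ecbe6   |      0      |        0         | 21689 | 320 |  50 |
+------------+--------------+-----------+-----------+--------------+-------------+------------------+-------+-----+-----+
+------+-----+-----+--------+-----+
| C17  | C18 | C19 |  C20   | C21 |
+------+-----+-----+--------+-----+
| 1722 |  0  |  35 |   -1   |  79 |
| 1722 |  0  |  35 | 100084 |  79 |
| 1722 |  0  |  35 | 100084 |  79 |
| 1722 |  0  |  35 | 100084 |  79 |
| 2161 |  0  |  35 |   -1   | 157 |
| 1899 |  0  | 431 | 100077 | 117 |
| 2333 |  0  |  39 |   -1   | 157 |
| 2374 |  3  |  39 |   -1   |  23 |
| 1722 |  0  |  35 |   -1   |  79 |
| 2496 |  3  | 167 | 100191 |  23 |
+------+-----+-----+--------+-----+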

[40428967 rows x 24 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

This dataset has 40M+ rows. That is no problem for GraphLab Create, but on most laptops training models on all of it could take a while. When doing feature engineering and feature selection, you want to be able to train many models with different feature combinations to see which features are predictive. Most of these experiments may not be successful, so it is useful to sample the large dataset down to a representative size that allows fast, iterative model training.

2. Sample & Split the data

For this dataset, it is important not to use a random sample of the click data. Since train contains 10 days of data, let us take one day of it as the sample to use for feature engineering. A complete day of traffic is more representative of the data than a random sample.

To do this, we notice that the first day in the train data is 141021 (which is October 21, 2014) followed by 00-23 for the hours in that day (in UTC). A simple way to get the click-stream traffic for that one day is to filter the 'hour' column for values in [14102100 .. 14102123] using SFrame.filter_by.

Since we want to operate on oneday in multiple sessions, we save it in SFrame binary format for easy loading later on.



In [4]:

    
# let's start with a one-day sample (our first bit of Feature Engineering!)
if not os.path.exists('oneday.gl'):
    # this filters the SFrame, returning the SFrame where the hour column is 14102100 to 14102123
    # which is the 2014-10-21 day of clickstream traffic
    oneday = train.filter_by(['141021%02d' % hour for hour in range(24)], 'hour')
    oneday.save('oneday.gl')
else:
    oneday = gl.SFrame('oneday.gl')
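
The list comprehension passed to filter_by builds the 24 hour keys for October 21, 2014. The sketch below reproduces that idea in plain Python on a small list of dicts. The rows and the filter_by_values helper are made up for illustration and are not part of GraphLab Create.

```python
# Build the 24 'hour' keys for 2014-10-21: '14102100' ... '14102123'.
hour_keys = ['141021%02d' % hour for hour in range(24)]

def filter_by_values(rows, values, column):
    """Keep rows whose `column` value is in `values` (illustrative stand-in for SFrame.filter_by)."""
    wanted = set(values)
    return [row for row in rows if row[column] in wanted]

# Toy rows (made up); 'hour' is a string, as in the SFrame loaded with str type hints.
rows = [
    {'id': 'a', 'hour': '14102100'},
    {'id': 'b', 'hour': '14102123'},
    {'id': 'c', 'hour': '14102200'},  # next day, should be dropped
]
oneday_rows = filter_by_values(rows, hour_keys, 'hour')
print([row['id'] for row in oneday_rows])
```

The keys are built as strings because the 'hour' column was read as str. If the column had been loaded as an integer, the keys would need to be integers as well.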



In [5]:

    
oneday









    Out[5]:





    
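+----------------------+-------+----------+------+------------+----------+-------------+---------------+----------+
|          id          | click |   hour   |  C1  | banner_pos | site_id  | site_domain | site_category |  app_id  |
+----------------------+-------+----------+------+------------+----------+-------------+---------------+----------+
| 1000009418151094273  |   0   | 14102100 | 1005 |     0      | 1fbe01fe |   f3845767  |    28905ebd   | ecad2386 |
| 10000169349117863715 |   0   | 14102100 | 1005 |     0      | 1fbe01fe |   f3845767  |    28905ebd   | ecad2386 |
| 10000371904215119486 |   0   | 14102100 | 1005 |     0      | 1fbe01fe |   f3845767  |    28905ebd   | ecad2386 |
| 10000640724480838376 |   0   | 14102100 | 1005 |     0      | 1fbe01fe |   f3845767  |    28905ebd   | ecad2386 |
| 10000679056417042096 |   0   | 14102100 | 1005 |     1      | fe8cc448 |   9166c161  |    0569f928   | ecad2386 |
| 10000720757801103869 |   0   | 14102100 | 1005 |     0      | d6137915 |   bb1ef334  |    f028772b   | ecad2386 |
| 10000724729988544911 |   0   | 14102100 | 1005 |     0      | 8fda644b |   25d4cfcd  |    f028772b   | ecad2386 |
| 10000918755742328737 |   0   | 14102100 | 1005 |     1      | e151e245 |   7e091613  |    f028772b   | ecad2386 |
| 10000949271186029916 |   1   | 14102100 | 1005 |     0      | 1fbe01fe |   f3845767  |    28905ebd   | ecad2386 |
| 10001264480619467364 |   0   | 14102100 | 1002 |     0      | 84c7ba46 |   c4e18dd6  |    50e219e0   | ecad2386 |
+----------------------+-------+----------+------+------------+----------+-------------+---------------+----------+
+------------+--------------+-----------+-----------+--------------+-------------+------------------+-------+-----+-----+
| app_domain | app_category | device_id | device_ip | device_model | device_type | device_conn_type |  C14  | C15 | C16 |
+------------+--------------+-----------+-----------+--------------+-------------+------------------+-------+-----+-----+
|  7801e8d9  |   07d7df22   |  a99f214a |  ddd2926e |   44956a24   |      1      |        2         | 15706 | 320 |  50 |
|  7801e8d9  |   07d7df22   |  a99f214a |  96809ac8 |   711ee120   |      1      |        0         | 15704 | 320 |  50 |
|  7801e8d9  |   07d7df22   |  a99f214a |  b3cf8def |   8a4875bd   |      1      |        0         | 15704 | 320 |  50 |
|  7801e8d9  |   07d7df22   |  a99f214a |  e8275b8f |   6332421a   |      1      |        0         | 15706 | 320 |  50 |
|  7801e8d9  |   07d7df22   |  a99f214a |  9644d0bf |   779d90c2   |      1      |        0         | 18993 | 320 |  50 |
|  7801e8d9  |   07d7df22   |  a99f214a |  05241af0 |   8a4875bd   |      1      |        0         | 16920 | 320 |  50 |
|  7801e8d9  |   07d7df22   |  a99f214a |  b264c159 |   be6db1d7   |      1      |        0         | 20362 | 320 |  50 |
|  7801e8d9  |   07d7df22   |  a99f214a |  e6f67278 |   be74e6fe   |      1      |        0         | 20632 | 320 |  50 |
|  7801e8d9  |   07d7df22   |  a99f214a |  37e8da74 |   5db079b5   |      1      |        2         | 15707 | 320 |  50 |
|  7801e8d9  |   07d7df22   |  c357dbff |  f1ac7184 |   373ecbe6   |      0      |        0         | 21689 | 320 |  50 |
+------------+--------------+-----------+-----------+--------------+-------------+------------------+-------+-----+-----+
+------+-----+-----+--------+-----+
| C17  | C18 | C19 |  C20   | C21 |
+------+-----+-----+--------+-----+
| 1722 |  0  |  35 |   -1   |  79 |
| 1722 |  0  |  35 | 100084 |  79 |
| 1722 |  0  |  35 | 100084 |  79 |
| 1722 |  0  |  35 | 100084 |  79 |
| 2161 |  0  |  35 |   -1   | 157 |
| 1899 |  0  | 431 | 100077 | 117 |
| 2333 |  0  |  39 |   -1   | 157 |
| 2374 |  3  |  39 |   -1   |  23 |
| 1722 |  0  |  35 |   -1   |  79 |
| 2496 |  3  | 167 | 100191 |  23 |
+------+-----+-----+--------+-----+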

[4122995 rows x 24 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.



In [6]:

    
oneday.show()

3. Data Science 101: Create Train / Validation Split

Now that we have sampled and taken a quick look at the data, let us create a train-validation split. Doing this once ensures that we can train several models using different features and use the same data for evaluating each of the models.

We use a random split of the one-day sample (using SFrame.random_split), with 90% used for training and 10% used for validation. We also specify a seed for reproducibility.



In [7]:

    
train, val = oneday.random_split(0.9, seed=12345)
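
To illustrate why the seed matters, here is a minimal plain-Python sketch of a seeded random split. The random_split function below is a hypothetical helper that assigns each row independently with the given probability. It is not necessarily how GraphLab Create implements SFrame.random_split.

```python
import random

def random_split(rows, fraction, seed):
    """Send each row to the first split with probability `fraction` (illustrative helper)."""
    rng = random.Random(seed)  # private generator, so the same seed repeats the same split
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(1000))  # toy data standing in for the one-day sample
train_rows, val_rows = random_split(rows, 0.9, seed=12345)
again_train, again_val = random_split(rows, 0.9, seed=12345)

print(len(train_rows), len(val_rows))
print(train_rows == again_train and val_rows == again_val)  # same seed, same split
```

With this kind of per-row assignment the split sizes are only approximately 90/10, and every model trained later is evaluated on exactly the same validation rows as long as the seed stays fixed.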



In [8]:

    
train.print_rows(5)









    



+----------------------+-------+----------+------+------------+----------+
|          id          | click |   hour   |  C1  | banner_pos | site_id  |
+----------------------+-------+----------+------+------------+----------+
| 1000009418151094273  |   0   | 14102100 | 1005 |     0      | 1fbe01fe |
| 10000169349117863715 |   0   | 14102100 | 1005 |     0      | 1fbe01fe |
| 10000371904215119486 |   0   | 14102100 | 1005 |     0      | 1fbe01fe |
| 10000640724480838376 |   0   | 14102100 | 1005 |     0      | 1fbe01fe |
| 10000679056417042096 |   0   | 14102100 | 1005 |     1      | fe8cc448 |
+----------------------+-------+----------+------+------------+----------+
+-------------+---------------+----------+------------+--------------+-----------+
| site_domain | site_category |  app_id  | app_domain | app_category | device_id |
+-------------+---------------+----------+------------+--------------+-----------+
|   f3845767  |    28905ebd   | ecad2386 |  7801e8d9  |   07d7df22   |  a99f214a |
|   f3845767  |    28905ebd   | ecad2386 |  7801e8d9  |   07d7df22   |  a99f214a |
|   f3845767  |    28905ebd   | ecad2386 |  7801e8d9  |   07d7df22   |  a99f214a |
|   f3845767  |    28905ebd   | ecad2386 |  7801e8d9  |   07d7df22   |  a99f214a |
|   9166c161  |    0569f928   | ecad2386 |  7801e8d9  |   07d7df22   |  a99f214a |
+-------------+---------------+----------+------------+--------------+-----------+
+-----------+--------------+-------------+------------------+-------+-----+-----+
| device_ip | device_model | device_type | device_conn_type |  C14  | C15 | C16 |
+-----------+--------------+-------------+------------------+-------+-----+-----+
|  ddd2926e |   44956a24   |      1      |        2         | 15706 | 320 |  50 |
|  96809ac8 |   711ee120   |      1      |        0         | 15704 | 320 |  50 |
|  b3cf8def |   8a4875bd   |      1      |        0         | 15704 | 320 |  50 |
|  e8275b8f |   6332421a   |      1      |        0         | 15706 | 320 |  50 |
|  9644d0bf |   779d90c2   |      1      |        0         | 18993 | 320 |  50 |
+-----------+--------------+-------------+------------------+-------+-----+-----+
+------+-----+-----+--------+-----+
| C17  | C18 | C19 |  C20   | C21 |
+------+-----+-----+--------+-----+
| 1722 |  0  |  35 |   -1   |  79 |
| 1722 |  0  |  35 | 100084 |  79 |
| 1722 |  0  |  35 | 100084 |  79 |
| 1722 |  0  |  35 | 100084 |  79 |
| 2161 |  0  |  35 |   -1   | 157 |
+------+-----+-----+--------+-----+
[3710945 rows x 24 columns]

4. Baseline Model

As a baseline model, let us use all the categorical features as is with the logistic classifier model. This is a great way to get started and acts as a reasonable baseline prior to feature engineering and feature selection. To get all the features, we call SFrame.column_names and then remove the click column (the target) and the id column, since it isn't meaningful in this application.



In [9]:

    
features = train.column_names()
features.remove('click') # cannot have target in features list
features.remove('id')    # id is not useful for training the model
print features









    



['hour', 'C1', 'banner_pos', 'site_id', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category', 'device_id', 'device_ip', 'device_model', 'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21']



In [10]:

    
# Baseline: use ALL features as is
baseline = gl.logistic_classifier.create(train, features=features, target='click', validation_set=val)









    




PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 3710945
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 22
PROGRESS: Number of unpacked features : 22
PROGRESS: Number of coefficients    : 1377440
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 3        | 0.000000  | 11.352040    | 0.825720          | 0.826538            |
PROGRESS: | 2         | 5        | 1.000000  | 17.485349    | 0.836532          | 0.827994            |
PROGRESS: | 3         | 6        | 1.000000  | 21.652770    | 0.730601          | 0.684546            |
PROGRESS: | 4         | 8        | 1.000000  | 29.905691    | 0.876269          | 0.831734            |
PROGRESS: | 5         | 9        | 1.000000  | 34.128402    | 0.884881          | 0.830637            |
PROGRESS: | 6         | 10       | 1.000000  | 38.261015    | 0.887878          | 0.824951            |
PROGRESS: | 7         | 11       | 1.000000  | 42.092548    | 0.889639          | 0.823315            |
PROGRESS: | 8         | 12       | 1.000000  | 46.108497    | 0.890809          | 0.823393            |
PROGRESS: | 9         | 13       | 1.000000  | 51.264788    | 0.891148          | 0.823055            |
PROGRESS: | 10        | 14       | 1.000000  | 55.588145    | 0.891176          | 0.822708            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+

Uh-oh, we might be overfitting. Notice that there are 1.4M coefficients and only 3.7M examples. Also notice that the training accuracy continues to climb while the validation accuracy does not improve (it even drifts down slightly after iteration 4). That is a strong indicator that we are overfitting.
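
The coefficient count is large because every string-typed column is treated as categorical, so the model needs roughly one coefficient per distinct value in each column. The sketch below estimates that count on made-up rows. The exact bookkeeping (reference categories, the intercept) is an assumption here and may differ in GraphLab Create.

```python
def count_categorical_coefficients(rows, features):
    """Estimate coefficients as the distinct values per feature, plus one intercept."""
    distinct = {name: set() for name in features}
    for row in rows:
        for name in features:
            distinct[name].add(row[name])
    per_feature = {name: len(values) for name, values in distinct.items()}
    total = sum(per_feature.values()) + 1  # +1 for the intercept (assumed)
    return per_feature, total

# Toy rows (made up): one low-cardinality and one high-cardinality column.
rows = [
    {'banner_pos': '0', 'device_ip': 'ddd2926e'},
    {'banner_pos': '0', 'device_ip': '96809ac8'},
    {'banner_pos': '1', 'device_ip': 'b3cf8def'},
    {'banner_pos': '0', 'device_ip': 'e8275b8f'},
]
per_feature, total = count_categorical_coefficients(rows, ['banner_pos', 'device_ip'])
print(per_feature, total)
```

A column whose values are nearly unique per row contributes almost one coefficient per example, which gives the model plenty of room to memorize the training set.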

Model Evaluation

To evaluate how well we are doing, we use the Kaggle metric for this competition: log loss (lower is better). Here is a quick log loss function written in Python.



In [11]:

    
import math
def log_loss_raw(target, predicted):
    '''Calculate log_loss between target and predicted and return.'''
    p = predicted.apply(lambda x: min(0.99999, max(1e-5, x)))
    logp = p.apply(lambda x: math.log(x))
    logmp = p.apply(lambda x: (math.log(1-x)))
    return -(target * logp + (1-target) * logmp).mean()

def eval_model(model, test):
    '''Evaluate a trained model using Kaggle scoring.'''
    return log_loss_raw(test['click'], model.predict(test, output_type='probability'))
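
For reference, the quantity computed above is -mean(y * log(p) + (1 - y) * log(1 - p)), with p clipped away from 0 and 1 so the logarithm stays finite. Here is a plain-Python sketch of the same formula on lists rather than SArrays. The function name and the toy labels and probabilities are made up for illustration.

```python
import math

def log_loss_lists(targets, probabilities, eps=1e-5):
    """Mean negative log-likelihood of 0/1 targets under predicted click probabilities."""
    total = 0.0
    for y, p in zip(targets, probabilities):
        p = min(1.0 - eps, max(eps, p))  # clip so log() never sees 0 or 1
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(targets)

confident_right = log_loss_lists([1, 0], [0.9, 0.1])
confident_wrong = log_loss_lists([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

Confident predictions that turn out wrong are penalized much more heavily than confident predictions that turn out right, which is why log loss can look poor even when accuracy looks acceptable.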



In [12]:

    
print "%0.20f" % eval_model(baseline, val)









    



0.50184766410603109943

This is a good model to get started but not as good as some of the entries in the leaderboard. We definitely need to do some feature engineering.

5. Feature Engineering & Feature Selection

Feature Engineering is the process of representing the data using domain knowledge so a model can effectively use the data. Effective feature engineering is part art and part science. Feature Selection is the related discipline of deciding which set of features the model should consider during training. Typically both feature engineering and feature selection are iterative processes.

Getting started with feature engineering

For CTR, it is often good practice to ask yourself the following question:

What domain knowledge do we have about what influences user clicks?

The first thing to consider is the time of day. However, we don't have the time in a convenient format for the model, so we need to do some feature engineering to get the 'hour' of the day as a feature.

We want to modify both the train and the val SFrames, since we want the same transformations to occur in both places. To make this simpler, we create a function to do the time of day feature, which takes an SFrame and returns the modified SFrame.

Day / Hour Feature Engineering - DateTime parsing



In [13]:

    
def hour_eng(data):
    '''Parse out day/hour from SFrame.
    
    Given an SFrame with an 'hour' column formatted with YYMMDDHH parse this as a datetime
    and then split to just keep hour and day features (as categorical variables)
    '''
    if 'day' in data.column_names():
        # idempotency check, if transformation already applied then simply return
        return data
    
    data['datetime'] = data['hour'].str_to_datetime('%y%m%d%H')
    data.remove_column('hour')
    data = data.split_datetime('datetime', column_name_prefix='', limit=['day', 'hour'])
    data['hour'] = data['hour'].astype(str)
    data['day'] = data['day'].astype(str)
    return data



In [14]:

    
# to measure how we are doing, we need to apply to both train and validation SFrames
train = hour_eng(train)
val = hour_eng(val)
print train.column_names()









    



['id', 'click', 'C1', 'banner_pos', 'site_id', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category', 'device_id', 'device_ip', 'device_model', 'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'day', 'hour']

Notice how there is a new 'day' column and a new 'hour' column. These are the new columns/features added by the hour_eng function.
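
To see what the '%y%m%d%H' format string does to a single value, here is a small sketch using only the Python standard library. The parse_hour_field helper is hypothetical and simply mirrors the day and hour strings that hour_eng keeps.

```python
from datetime import datetime

def parse_hour_field(value):
    """Split a YYMMDDHH string into (day, hour) strings (illustrative helper)."""
    parsed = datetime.strptime(value, '%y%m%d%H')
    return str(parsed.day), str(parsed.hour)

day, hour = parse_hour_field('14102123')
print(day, hour)
```

Converting the parsed integers back to strings keeps both features categorical, so the model learns a separate weight for each hour instead of treating the hour as a number.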

Basic Feature Selection: Pick the day, hour, and site_id features



In [15]:

    
features = ['site_id', 'day', 'hour']
model = gl.logistic_classifier.create(train, features=features, 
                                      target='click', validation_set=val)









    




PROGRESS: WARNING: Detected extremely low variance for feature(s) 'day' because all entries are nearly the same.
Proceeding with model training using all features. If the model does not provide results of adequate quality, exclude the above mentioned feature(s) from the input dataset.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 3710945
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 3
PROGRESS: Number of unpacked features : 3
PROGRESS: Number of coefficients    : 2835
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 3        | 0.000000  | 2.658805     | 0.825722          | 0.826538            |
PROGRESS: | 2         | 5        | 1.000000  | 4.812624     | 0.825830          | 0.826577            |
PROGRESS: | 3         | 6        | 1.000000  | 6.006163     | 0.826797          | 0.827346            |
PROGRESS: | 4         | 7        | 1.000000  | 7.212546     | 0.826423          | 0.827150            |
PROGRESS: | 5         | 8        | 1.000000  | 8.439682     | 0.827090          | 0.827618            |
PROGRESS: | 6         | 9        | 1.000000  | 9.742179     | 0.827105          | 0.827650            |
PROGRESS: | 7         | 10       | 1.000000  | 11.064490    | 0.827145          | 0.827533            |
PROGRESS: | 8         | 11       | 1.000000  | 12.385897    | 0.827142          | 0.827538            |
PROGRESS: | 9         | 12       | 1.000000  | 13.655575    | 0.827139          | 0.827662            |
PROGRESS: | 10        | 13       | 1.000000  | 14.918936    | 0.826808          | 0.827528            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+

Note: The classifier has warned us that the column 'day' has low variance. We should remove it because we are only training on one day's worth of data, so the day is not meaningful.
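
A quick way to catch features like this before training is to count the distinct values in each candidate column. The sketch below does that on made-up rows. The drop_constant_features helper is hypothetical and not part of GraphLab Create.

```python
def drop_constant_features(rows, features):
    """Return the features that take more than one distinct value in `rows`."""
    kept = []
    for name in features:
        if len({row[name] for row in rows}) > 1:
            kept.append(name)
    return kept

# Toy rows (made up): every row comes from the same day.
rows = [
    {'site_id': '1fbe01fe', 'day': '21', 'hour': '0'},
    {'site_id': 'fe8cc448', 'day': '21', 'hour': '0'},
    {'site_id': '1fbe01fe', 'day': '21', 'hour': '1'},
]
print(drop_constant_features(rows, ['site_id', 'day', 'hour']))
```

A feature with a single value cannot help separate clicks from non-clicks, so removing it loses nothing. It would become useful again once the training data spans several days.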



In [16]:

    
print "%0.20f" % eval_model(model, val)









    



0.42543819222628642684

That is a big improvement and a good first step but we can do better with some more feature engineering.

Interaction Terms: Building new features from existing ones

I believe click-through rate depends on which site the user is on, along with which app the user is seeing/using. So I want to create a new feature that represents the interaction of the site_id and app_id features. This results in a cross-product between the two features: each distinct (site_id, app_id) pair becomes its own category.

Again it is easiest to have a function so this set of transformations can be done on both the training set and the validation set, as below.



In [17]:

    
def site_app(data):
    # if already has a site_app then consider this operation completed.
    if 'site_app' in data.column_names():
        return data
    
    data['site_app'] = data['site_id'] + data['app_id']
    return data
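
The same idea can be sketched in plain Python. The interaction helper below is hypothetical and adds a separator between the two values. The notebook concatenates the ids directly, which is unambiguous as long as the ids all have the same length (the ones shown above are 8 characters each), so the separator is only a defensive choice for variable-length values.

```python
def interaction(left, right, sep='_'):
    """Join two categorical values into a single joint category (illustrative helper)."""
    return left + sep + right

# Toy rows (made up); the ids mimic the 8-character hashes shown above.
rows = [
    {'site_id': '1fbe01fe', 'app_id': 'ecad2386'},
    {'site_id': 'fe8cc448', 'app_id': 'ecad2386'},
    {'site_id': '1fbe01fe', 'app_id': 'ecad2386'},
]
site_app_values = [interaction(row['site_id'], row['app_id']) for row in rows]
print(sorted(set(site_app_values)))

# Why a separator can matter for variable-length values:
print('ab' + 'c' == 'a' + 'bc')                          # plain concatenation collides
print(interaction('ab', 'c') == interaction('a', 'bc'))  # separated values stay distinct
```

Only the pairs that actually occur in the data become categories, so the new feature usually has far fewer values than the full product of the two vocabularies.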



In [18]:

    
train = site_app(train)
val = site_app(val)
print train.column_names()









    



['id', 'click', 'C1', 'banner_pos', 'site_id', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category', 'device_id', 'device_ip', 'device_model', 'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'day', 'hour', 'site_app']

Now let's use this newly generated feature in place of 'site_id' and see how we do. The 'day' and 'hour' features are kept as before, so the low-variance warning for 'day' appears again.



In [19]:

    
features=['site_app', 'day', 'hour']
model = gl.logistic_classifier.create(train, features=features, target='click', validation_set=val)









    




PROGRESS: WARNING: Detected extremely low variance for feature(s) 'day' because all entries are nearly the same.
Proceeding with model training using all features. If the model does not provide results of adequate quality, exclude the above mentioned feature(s) from the input dataset.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 3710945
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 3
PROGRESS: Number of unpacked features : 3
PROGRESS: Number of coefficients    : 6887
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 3        | 0.000000  | 2.718713     | 0.825732          | 0.826538            |
PROGRESS: | 2         | 5        | 1.000000  | 4.820923     | 0.826295          | 0.826929            |
PROGRESS: | 3         | 6        | 1.000000  | 6.095525     | 0.827564          | 0.827880            |
PROGRESS: | 4         | 7        | 1.000000  | 7.545748     | 0.827019          | 0.827346            |
PROGRESS: | 5         | 8        | 1.000000  | 9.071496     | 0.827644          | 0.828004            |
PROGRESS: | 6         | 9        | 1.000000  | 10.449221    | 0.827791          | 0.828101            |
PROGRESS: | 7         | 10       | 1.000000  | 11.652915    | 0.827726          | 0.828091            |
PROGRESS: | 8         | 11       | 1.000000  | 12.867802    | 0.827680          | 0.828060            |
PROGRESS: | 9         | 12       | 1.000000  | 14.059563    | 0.827815          | 0.828145            |
PROGRESS: | 10        | 13       | 1.000000  | 15.331704    | 0.827629          | 0.828074            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+



In [20]:

    
print "%0.20f" % eval_model(model, val)









    



0.40996175377622545710

That is better, but let us see if we can add some more complex feature engineering transformations to get better results.

Quadratic Features: All-pair interaction terms from all features.

The Feature Engineering Transformers let us go beyond the single interaction described above and generate a feature containing all pairwise combinations of a set of features, using the QuadraticFeatures transformer, as follows.

Note: This step can take quite a while (about an hour on a basic Macbook Pro)
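
Before running the transformer, it may help to see the idea on a single row. The following is a plain-Python sketch of all-pair interaction terms, not GraphLab's implementation; the function name quadratic_terms is hypothetical, and including each feature paired with itself is an assumption of the sketch.

```python
from itertools import combinations_with_replacement


def quadratic_terms(row, columns):
    """Return {(a, b): row[a] * row[b]} for every unordered pair of columns.

    Self-pairs such as (a, a) are included, which is an assumption of this sketch.
    """
    return {(a, b): row[a] * row[b]
            for a, b in combinations_with_replacement(columns, 2)}


# Values taken from the first sample row of the dataset.
row = {"C1": 1005, "banner_pos": 0, "device_type": 1}
terms = quadratic_terms(row, ["C1", "banner_pos", "device_type"])
```

With n columns this produces n(n+1)/2 terms, so the 13 interaction columns used in this notebook would give 91. The training log for the model that uses 'quadratic_features' reports 94 unpacked features across 'site_app', 'day', 'hour' and 'quadratic_features', which would be consistent with 91 pair terms plus the three other columns, although the notebook does not state how QuadraticFeatures counts pairs.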



In [21]:

    
from graphlab import feature_engineering as fe

# QuadraticFeatures requires float, integer, array, or dictionary column types,
# so the interaction columns are cast to int below before fitting
interaction_columns = ['C1', 'banner_pos', 
                       'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 
                       'C20', 'C21', 'hour']

for feature in interaction_columns:
    train[feature] = train[feature].astype(int)

quad = fe.create(train, fe.QuadraticFeatures(features=interaction_columns))



In [22]:

    
def apply_quadratic(dataset):
    if 'quadratic_features' in dataset.column_names():
        # operation already performed, do nothing
        return dataset
    for feature in interaction_columns:
        dataset[feature] = dataset[feature].astype(int)
    return quad.transform(dataset)



In [23]:

    
# this adds a 'quadratic_features' column to both SFrames
train = apply_quadratic(train)
val = apply_quadratic(val)



In [24]:

    
features=['site_app', 'day', 'hour', 'quadratic_features']
model = gl.logistic_classifier.create(train, features=features, 
                                      target='click', validation_set=val, max_iterations=20)









    




PROGRESS: WARNING: Detected extremely low variance for feature(s) 'day' because all entries are nearly the same.
Proceeding with model training using all features. If the model does not provide results of adequate quality, exclude the above mentioned feature(s) from the input dataset.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 3710945
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 4
PROGRESS: Number of unpacked features : 94
PROGRESS: Number of coefficients    : 6956
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 5        | 0.000000  | 45.533969    | 0.825720          | 0.826538            |
PROGRESS: | 2         | 7        | 1.000000  | 70.482053    | 0.825720          | 0.826538            |
PROGRESS: | 3         | 8        | 1.000000  | 87.371969    | 0.807487          | 0.807409            |
PROGRESS: | 4         | 10       | 1.000000  | 111.721498   | 0.825720          | 0.826538            |
PROGRESS: | 5         | 11       | 1.000000  | 127.704416   | 0.825724          | 0.826538            |
PROGRESS: | 6         | 12       | 1.000000  | 143.244673   | 0.825916          | 0.826715            |
PROGRESS: | 7         | 13       | 1.000000  | 159.021249   | 0.826398          | 0.827133            |
PROGRESS: | 8         | 14       | 1.000000  | 174.844463   | 0.825627          | 0.826094            |
PROGRESS: | 9         | 15       | 1.000000  | 190.633223   | 0.826568          | 0.826798            |
PROGRESS: | 10        | 16       | 1.000000  | 206.159300   | 0.827544          | 0.827960            |
PROGRESS: | 11        | 18       | 1.000000  | 230.489839   | 0.827624          | 0.827637            |
PROGRESS: | 12        | 19       | 1.000000  | 246.828216   | 0.827161          | 0.827310            |
PROGRESS: | 13        | 20       | 1.000000  | 264.410772   | 0.826654          | 0.826968            |
PROGRESS: | 14        | 21       | 1.000000  | 281.717649   | 0.827167          | 0.827611            |
PROGRESS: | 15        | 22       | 1.000000  | 297.707801   | 0.827813          | 0.828217            |
PROGRESS: | 16        | 23       | 1.000000  | 313.466256   | 0.828205          | 0.828363            |
PROGRESS: | 17        | 25       | 1.000000  | 337.432553   | 0.828368          | 0.828730            |
PROGRESS: | 18        | 26       | 1.000000  | 358.087450   | 0.828440          | 0.828763            |
PROGRESS: | 19        | 27       | 1.000000  | 375.887754   | 0.828519          | 0.828890            |
PROGRESS: | 20        | 28       | 1.000000  | 392.809671   | 0.828580          | 0.828878            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+



In [25]:

    
print "%0.20f" % eval_model(model, val)









    



0.40728018889038708483

This is an improvement, so we are on the right track. Now let us see whether limiting the number of categories kept for each feature helps further.

One-Hot Encoding

A one-hot encoder encodes a collection of categorical features using a 1-of-K encoding scheme while decreasing dimensionality by limiting each feature to a max number of categories.
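
The idea can be sketched in plain Python for a single column. This is an illustration rather than GraphLab's implementation: the function names are hypothetical, and encoding a category that did not make the cut as all zeros is an assumption of the sketch.

```python
from collections import Counter


def fit_top_categories(values, max_categories):
    """Learn the most frequent categories of one column, up to max_categories."""
    counts = Counter(values)
    return [category for category, _ in counts.most_common(max_categories)]


def one_hot(value, categories):
    """Encode one value as a 0/1 list aligned with the learned categories.

    A value outside the learned categories becomes all zeros (sketch assumption).
    """
    return [1 if value == category else 0 for category in categories]


column = ["a", "b", "a", "c", "a", "b"]
kept = fit_top_categories(column, max_categories=2)  # "c" is too rare to keep
encoded = [one_hot(v, kept) for v in column]
```

The categories are learned once, on the training data, and then reused unchanged on the validation data. This is the same fit-then-transform split that the cells below follow.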



In [26]:

    
# Create the OneHotEncoder and fit it to the training set.
# Use all features except id, click, and the engineered columns; keep only the 300 most frequent categories per feature.
# The quadratic_features column is excluded because it is encoding the same thing
onehot = fe.create(train, fe.OneHotEncoder(
            excluded_features=['id', 'click', 'site_app', 'day', 'hour', 'quadratic_features'], 
            max_categories=300))



In [27]:

    
# Note: max_categories chooses the top categories by frequency,
# so only the most frequent ones are retained.
def apply_onehot(dataset):
    if 'encoded_features' in dataset.column_names():
        # operation already completed on SFrame
        return dataset
    
    return onehot.transform(dataset)



In [28]:

    
train = apply_onehot(train)
val = apply_onehot(val)



In [29]:

    
features=['site_app', 'day', 'hour', 'quadratic_features', 'encoded_features']
model = gl.logistic_classifier.create(train, features=features, target='click', validation_set=val)









    




PROGRESS: WARNING: Detected extremely low variance for feature(s) 'day' because all entries are nearly the same.
Proceeding with model training using all features. If the model does not provide results of adequate quality, exclude the above mentioned feature(s) from the input dataset.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 3710945
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 5
PROGRESS: Number of unpacked features : 3012
PROGRESS: Number of coefficients    : 9874
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 6        | 0.000000  | 70.576495    | 0.825720          | 0.826538            |
PROGRESS: | 2         | 9        | 5.000000  | 113.657719   | 0.825512          | 0.825931            |
PROGRESS: | 3         | 10       | 5.000000  | 134.748759   | 0.825720          | 0.826538            |
PROGRESS: | 4         | 12       | 1.000000  | 166.412811   | 0.825727          | 0.826548            |
PROGRESS: | 5         | 13       | 1.000000  | 187.360726   | 0.826751          | 0.827164            |
PROGRESS: | 6         | 14       | 1.000000  | 208.052215   | 0.827344          | 0.827671            |
PROGRESS: | 7         | 15       | 1.000000  | 229.716529   | 0.827055          | 0.827637            |
PROGRESS: | 8         | 16       | 1.000000  | 249.929231   | 0.829657          | 0.829674            |
PROGRESS: | 9         | 17       | 1.000000  | 271.027735   | 0.830671          | 0.830581            |
PROGRESS: | 10        | 18       | 1.000000  | 298.356538   | 0.830256          | 0.830135            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+



In [30]:

    
print "%0.20f" % eval_model(model, val)









    



0.40357369232907458478

Again the log loss improves, from 0.4073 to 0.4036, by restricting each feature to its most frequent categories. Now let us see whether we can apply one transformer to the output of another, using the FeatureHasher to reduce the overall dimensionality, including that of the quadratic_features column.

Feature Hashing

The feature hasher hashes multiple features into an n-bit feature space. This process greatly reduces dimensionality.
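
As a rough illustration of the hashing trick, the plain-Python sketch below maps each "column=value" pair of a row to one of 2**num_bits buckets. It is not GraphLab's implementation: the function name hash_features is hypothetical, and md5 is used only as a convenient deterministic stand-in for whatever hash function the library applies.

```python
import hashlib


def hash_features(row, num_bits):
    """Map {column: value} pairs into a sparse dict keyed by bucket index."""
    num_buckets = 1 << num_bits
    hashed = {}
    for column, value in row.items():
        token = "{}={}".format(column, value).encode("utf-8")
        bucket = int(hashlib.md5(token).hexdigest(), 16) % num_buckets
        # Colliding tokens share a bucket, so their counts add up.
        hashed[bucket] = hashed.get(bucket, 0) + 1
    return hashed


# Values taken from the first sample row of the dataset.
row = {"site_category": "28905ebd", "device_type": "1", "C14": "15706"}
hashed = hash_features(row, num_bits=22)
```

The output size is fixed by num_bits no matter how many distinct values the inputs take. The price is that unrelated values can collide in the same bucket; a larger num_bits makes collisions rarer at the cost of a wider feature space.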



In [31]:

    
hasher = fe.create(train, fe.FeatureHasher(
          excluded_features=['id', 'click', 'site_app', 'day', 'hour', 'encoded_features'], 
          num_bits=22))



In [32]:

    
def apply_feature_hasher(dataset):
    if 'hashed_features' in dataset.column_names():
        # feature hasher already performed
        return dataset
    return hasher.transform(dataset)



In [33]:

    
train = apply_feature_hasher(train)
val = apply_feature_hasher(val)
print train.column_names()









    



['id', 'click', 'day', 'hour', 'site_app', 'encoded_features', 'hashed_features']



In [34]:

    
# we drop quadratic_features because it has been hashed by the FeatureHasher, 
# so its signal is encoded in the 'hashed_features' column
features=['site_app', 'day', 'hour', 'encoded_features', 'hashed_features']
model = gl.logistic_classifier.create(train, features=features, target='click', validation_set=val)









    




PROGRESS: WARNING: Detected extremely low variance for feature(s) 'day' because all entries are nearly the same.
Proceeding with model training using all features. If the model does not provide results of adequate quality, exclude the above mentioned feature(s) from the input dataset.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 3710945
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 5
PROGRESS: Number of unpacked features : 3012
PROGRESS: Number of coefficients    : 9874
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 6        | 0.000000  | 70.896868    | 0.825720          | 0.826538            |
PROGRESS: | 2         | 9        | 5.000000  | 115.045681   | 0.825512          | 0.825931            |
PROGRESS: | 3         | 10       | 5.000000  | 135.466586   | 0.825720          | 0.826538            |
PROGRESS: | 4         | 12       | 1.000000  | 167.251210   | 0.825727          | 0.826548            |
PROGRESS: | 5         | 13       | 1.000000  | 188.308783   | 0.826751          | 0.827164            |
PROGRESS: | 6         | 14       | 1.000000  | 210.500583   | 0.827344          | 0.827671            |
PROGRESS: | 7         | 15       | 1.000000  | 230.813210   | 0.827055          | 0.827637            |
PROGRESS: | 8         | 16       | 1.000000  | 251.941302   | 0.829657          | 0.829674            |
PROGRESS: | 9         | 17       | 1.000000  | 272.294446   | 0.830671          | 0.830581            |
PROGRESS: | 10        | 18       | 1.000000  | 293.590415   | 0.830256          | 0.830135            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+



In [35]:

    
print "%0.20f" % eval_model(model, val)









    



0.40357369232907458478

This ties the best score so far: the log loss is identical to that of the previous run.

This log_loss is being calculated against the validation set, which was a random 10% split of the first day of clickstream data. That validation set may not be entirely generalizable, so it would not be surprising if the log_loss on the test set is higher. Still, this is definitely a good demonstration of how using feature engineering and feature selection can greatly improve the results of a model.
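
For readers unfamiliar with the metric, here is a small plain-Python sketch of binary log loss. The eval_model helper used above is defined outside this section, so this assumes it computes the standard definition; the clipping constant eps is an illustrative choice.

```python
import math


def log_loss(labels, probabilities, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted click probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


labels = [0, 0, 1, 0]
confident = log_loss(labels, [0.1, 0.1, 0.9, 0.1])   # mostly right and confident
uninformed = log_loss(labels, [0.5, 0.5, 0.5, 0.5])  # always predicts 0.5
```

Lower is better. Always predicting 0.5 gives log(2), roughly 0.693, and the metric penalizes confident wrong predictions heavily. That is why it can move even when the accuracy columns in the training logs barely change.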

Recap

In this notebook we covered the following types of feature engineering:

Datetime parsing for extracting day and hour features
Interaction features for site_id and app_id
Quadratic Features for creating full interaction terms between all features in a row.
One-Hot Encoding to reduce the dimensionality within each categorical variable to just the top 300 categories.
Feature Hashing to reduce the overall dimensionality of all the features (including the Quadratic Features)

We have done a bunch of feature engineering, but have only introduced the powerful transformers available in GraphLab Create.

Feature Engineering Transformers in GraphLab Create

FeatureHasher: Hashes an input feature space to an n-bit feature space.
QuadraticFeatures: Calculates quadratic interaction terms between features.
FeatureBinner: Feature binning is a method of turning continuous variables into categorical values.
OneHotEncoder: Encode a collection of categorical features using a 1-of-K encoding scheme.
CountThresholder: Map infrequent categorical variables to a new/separate category.
TransformerChain: Sequentially apply a list of transforms.

It is just as easy to write your own custom Transformer: simply extend TransformerBase and write arbitrary Python code for the fit and transform methods. This allows for creative ensembling.

It is also worth mentioning that we could have chained the Transformers together using the TransformerChain class, making it easy for us to compose complex feature engineering transformations that can be used just as simply.
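
The fit/transform pattern behind both ideas can be sketched without GraphLab. Every class below is hypothetical and exists only to illustrate the pattern; none of them is GraphLab's TransformerBase or TransformerChain, and their actual interfaces may differ.

```python
class Chain(object):
    """Apply a list of fit/transform steps in order (hypothetical, not GraphLab's class)."""

    def __init__(self, steps):
        self.steps = list(steps)

    def fit(self, data):
        # Fit each step on the output of the previous one.
        for step in self.steps:
            step.fit(data)
            data = step.transform(data)
        return self

    def transform(self, data):
        for step in self.steps:
            data = step.transform(data)
        return data


class Lowercase(object):
    """Stateless toy step: lowercase every string."""

    def fit(self, data):
        return self

    def transform(self, data):
        return [value.lower() for value in data]


class KeepSeen(object):
    """Toy step: replace values not seen during fit with 'other'."""

    def fit(self, data):
        self.seen = set(data)
        return self

    def transform(self, data):
        return [value if value in self.seen else "other" for value in data]


chain = Chain([Lowercase(), KeepSeen()]).fit(["A", "b"])
result = chain.transform(["a", "B", "C"])  # "C" was never seen during fit
```

Fitting once on the training data and then calling transform on both the training and validation sets mirrors the apply_quadratic, apply_onehot and apply_feature_hasher helpers used in this notebook.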

There is a lot more we can do with Feature Engineering. Hopefully this notebook has given you a flavor of how easy and scalable Feature Engineering is in GraphLab Create!

id	click	hour	C1	banner_pos	site_id	site_domain	site_category	app_id
1000009418151094273	0	14102100	1005	0	1fbe01fe	f3845767	28905ebd	ecad2386
10000169349117863715	0	14102100	1005	0	1fbe01fe	f3845767	28905ebd	ecad2386
10000371904215119486	0	14102100	1005	0	1fbe01fe	f3845767	28905ebd	ecad2386
10000640724480838376	0	14102100	1005	0	1fbe01fe	f3845767	28905ebd	ecad2386
10000679056417042096	0	14102100	1005	1	fe8cc448	9166c161	0569f928	ecad2386
10000720757801103869	0	14102100	1005	0	d6137915	bb1ef334	f028772b	ecad2386
10000724729988544911	0	14102100	1005	0	8fda644b	25d4cfcd	f028772b	ecad2386
10000918755742328737	0	14102100	1005	1	e151e245	7e091613	f028772b	ecad2386
10000949271186029916	1	14102100	1005	0	1fbe01fe	f3845767	28905ebd	ecad2386
10001264480619467364	0	14102100	1002	0	84c7ba46	c4e18dd6	50e219e0	ecad2386

app_domain	app_category	device_id	device_ip	device_model	device_type	device_conn_type	C14	C15	C16
7801e8d9	07d7df22	a99f214a	ddd2926e	44956a24	1	2	15706	320	50
7801e8d9	07d7df22	a99f214a	96809ac8	711ee120	1	0	15704	320	50
7801e8d9	07d7df22	a99f214a	b3cf8def	8a4875bd	1	0	15704	320	50
7801e8d9	07d7df22	a99f214a	e8275b8f	6332421a	1	0	15706	320	50
7801e8d9	07d7df22	a99f214a	9644d0bf	779d90c2	1	0	18993	320	50
7801e8d9	07d7df22	a99f214a	05241af0	8a4875bd	1	0	16920	320	50
7801e8d9	07d7df22	a99f214a	b264c159	be6db1d7	1	0	20362	320	50
7801e8d9	07d7df22	a99f214a	e6f67278	be74e6fe	1	0	20632	320	50
7801e8d9	07d7df22	a99f214a	37e8da74	5db079b5	1	2	15707	320	50
7801e8d9	07d7df22	c357dbff	f1ac7184	373ecbe6	0	0	21689	320	50

C17	C18	C19	C20	C21
1722	0	35	-1	79
1722	0	35	100084	79
1722	0	35	100084	79
1722	0	35	100084	79
2161	0	35	-1	157
1899	0	431	100077	117
2333	0	39	-1	157
2374	3	39	-1	23
1722	0	35	-1	79
2496	3	167	100191	23