Feature engineering is well known to be one of the key ingredients in a successful intelligent application. In this notebook, we introduce what feature engineering is, why it is important, and how to engineer features. For this, we work through a Kaggle competition on click-through rate (CTR) prediction using Avazu's anonymized dataset.
More specifically, you will learn how to:
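The specific feature engineering techniques used in this notebook are: extracting time-of-day features from a timestamp, a hand-built site/app interaction feature, and the QuadraticFeatures, OneHotEncoder, and FeatureHasher transformers.
The goal is not to win this competition, but to show that prediction accuracy can be greatly improved with feature engineering tools and some more time.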
Note: This notebook requires GraphLab Create 1.3.
In [1]:
import graphlab as gl
gl.canvas.set_target('ipynb')
(from Kaggle):
For this competition, we have provided 11 days worth of Avazu data to build and test prediction models. Can you find a strategy that beats standard classification algorithms?
Data fields
id: ad identifier
click: 0/1 for non-click/click
hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
C1 -- anonymized categorical variable
banner_pos
site_id
site_domain
site_category
app_id
app_domain
app_category
device_id
device_ip
device_model
device_type
device_conn_type
C14-C21 -- anonymized categorical variables
We need to load the data. The data is provided as a gzipped CSV, so we download it from Kaggle's website and then load it locally using SFrame.read_csv. Since the dataset is almost entirely categorical, we set column_type_hints to str and then simply change the click column to int.
Since feature engineering involves many iterative steps, and to save disk space and load time, it is often a good idea to save the SFrame in its native format. That way you can restart this IPython session and more quickly pick up where you left off.
In [2]:
import os
if not os.path.exists('train.gl'):
    path = 'train.gz'  # path to dataset - download from Kaggle, 1.2GB
    train = gl.SFrame.read_csv(path, column_type_hints=str)
    train['click'] = train['click'].astype(int)
    train.save('train.gl')
else:
    train = gl.SFrame('train.gl')
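To make the "everything is a string except click" idea concrete without GraphLab Create, here is a small sketch using only Python's standard csv module. The three rows are made-up values in the same column layout, not real Avazu data, and this is not how SFrame.read_csv is implemented.

```python
import csv
import io

# Hypothetical three-row extract in the same layout as the Avazu file
raw = "id,click,hour,C1\n1000,0,14102100,1005\n1001,1,14102100,1005\n1002,0,14102101,1002\n"

rows = list(csv.DictReader(io.StringIO(raw)))  # every value is read as a string
for row in rows:
    row['click'] = int(row['click'])           # only the target becomes numeric

click_total = sum(row['click'] for row in rows)
```

Keeping C1 as the string '1005' rather than the number 1005 is what lets a model treat it as a category instead of a magnitude.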
In [3]:
train
Out[3]:
This dataset has 40M+ rows. Though that is no problem for GraphLab Create, training models on it could take a while on most laptops. When doing feature engineering and feature selection you want to be able to train lots of models with different feature combinations to see which features are predictive. Most of these experiments may not be successful, hence it is useful to sample the large dataset down to a representative size that allows for fast iterative model training.
For this dataset, it is important not to use a random sample of the click data. Since we have 10 days of data in train, let us take one day of the data as the sample to use for feature engineering. All the data from a single day will be more representative of the data than a random sample.
To do this, we notice that the first day in the train data is 141021 (which is October 21, 2014) followed by 00-23 for the hours in that day (in UTC). A simple way to get the click-stream traffic for that one day is to filter the 'hour' column for values in [14102100 .. 14102123] using SFrame.filter_by.
Since we want to operate on oneday in multiple sessions, we save it in SFrame binary format for easy loading later on.
In [4]:
# let's start with a one-day sample (our first bit of Feature Engineering!)
if not os.path.exists('oneday.gl'):
    # this filters the SFrame, returning the SFrame where the hour column is 14102100 to 14102123
    # which is the 2014-10-21 day of clickstream traffic
    oneday = train.filter_by(['141021%02d' % hour for hour in range(24)], 'hour')
    oneday.save('oneday.gl')
else:
    oneday = gl.SFrame('oneday.gl')
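The list comprehension passed to filter_by may be easier to follow in isolation. The sketch below builds the same 24 keys and applies them to a few made-up rows held in a plain Python list; the rows and the names wanted and oneday_rows are illustrative only.

```python
# The 24 string keys used to select 2014-10-21, one per hour of the day
hour_keys = ['141021%02d' % hour for hour in range(24)]

# Hypothetical rows; 'hour' is a string because every column was read as str
rows = [
    {'id': 'a', 'hour': '14102100'},
    {'id': 'b', 'hour': '14102123'},
    {'id': 'c', 'hour': '14102200'},  # next day, should be dropped
]

wanted = set(hour_keys)
oneday_rows = [row for row in rows if row['hour'] in wanted]
```

The keys are strings, not integers, so they match the 'hour' column as it was loaded.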
In [5]:
oneday
Out[5]:
In [6]:
oneday.show()
Now that we have sampled and taken a quick look at the data, let us create a train-validation split. Doing this once ensures that we can train several models using different features and use the same data for evaluating each of the models.
We use a random split of the one-day sample (using SFrame.random_split), with 90% used for training and 10% used for validation. We also specify a seed for reproducibility.
In [7]:
train, val = oneday.random_split(0.9, seed=12345)
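Why the seed matters can be shown with a plain-Python sketch of a seeded split. The function below is a stand-in written for illustration; it is not the implementation of SFrame.random_split, and it will not reproduce the same row assignment as GraphLab Create.

```python
import random

def random_split(rows, fraction, seed):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(1000))
train_a, val_a = random_split(rows, 0.9, seed=12345)
train_b, val_b = random_split(rows, 0.9, seed=12345)
```

With the same seed, train_a and train_b are identical, so every model trained later is scored on exactly the same validation rows.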
In [8]:
train.print_rows(5)
As a baseline model, let us use all categorical features as is with the logistic classifier model. This is not only a great way to get started, but also acts as a reasonable baseline prior to feature engineering and feature selection. To get all the features we call SFrame.column_names and then remove the click column (the target) and the id column, since it isn't meaningful in this application.
In [9]:
features = train.column_names()
features.remove('click') # cannot have target in features list
features.remove('id') # id is not useful for training the model
print features
In [10]:
# Baseline: use ALL features as is
baseline = gl.logistic_classifier.create(train, features=features, target='click', validation_set=val)
Uh-oh, we might be overfitting. Notice how there are 1.4M coefficients and only 3.7M examples. Also notice how the training accuracy continues to climb while the validation accuracy has held steady. That is a strong indicator that we are overfitting.
To evaluate how well we are doing, we will use the Kaggle metric for this competition: log_loss. Here is a quick log_loss function written in Python.
In [11]:
import math
def log_loss_raw(target, predicted):
    '''Calculate log_loss between target and predicted and return.'''
    p = predicted.apply(lambda x: min(0.99999, max(1e-5, x)))
    logp = p.apply(lambda x: math.log(x))
    logmp = p.apply(lambda x: (math.log(1-x)))
    return -(target * logp + (1-target) * logmp).mean()

def eval_model(model, test):
    '''Evaluate a trained model using Kaggle scoring.'''
    return log_loss_raw(test['click'], model.predict(test, output_type='probability'))
In [12]:
print "%0.20f" % eval_model(baseline, val)
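To get a feel for the scale of this metric, here is a plain-list version of the same formula with the same clipping bounds as log_loss_raw. The targets and probabilities are made-up values, and the function name log_loss is chosen here for illustration.

```python
import math

def log_loss(targets, probabilities, eps=1e-5):
    """Mean negative log-likelihood of binary targets, with clipped probabilities."""
    total = 0.0
    for y, p in zip(targets, probabilities):
        p = min(1 - eps, max(eps, p))  # keep log() away from 0 and 1
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(targets)

confident_right = log_loss([1, 0], [0.9, 0.1])  # equals -ln(0.9)
confident_wrong = log_loss([1, 0], [0.1, 0.9])  # equals -ln(0.1)
```

Lower is better, and confident mistakes are punished heavily: the second call scores roughly twenty times worse than the first. The clipping keeps a prediction of exactly 0 or 1 from producing an infinite loss.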
This is a good model to get started but not as good as some of the entries in the leaderboard. We definitely need to do some feature engineering.
Feature Engineering is the process of representing the data using domain knowledge so a model can effectively use the data. Effective feature engineering is part art and part science. Feature Selection is the related discipline of deciding which set of features the model should consider during training. Typically both feature engineering and feature selection are iterative processes.
For CTR, it is often good practice to ask yourself the following question:
What domain knowledge do we have about what influences user clicks?
The first thing to consider is the time of day. However, we don't have the time in a convenient format for the model, so we need to do some feature engineering to get the 'hour' of the day as a feature.
We want to modify both the train and the val SFrames, since we want the same transformations to occur in both places. To make this simpler, we create a function to do the time of day feature, which takes an SFrame and returns the modified SFrame.
In [13]:
def hour_eng(data):
    '''Parse out day/hour from SFrame.

    Given an SFrame with an 'hour' column formatted with YYMMDDHH parse this as a datetime
    and then split to just keep hour and day features (as categorical variables)
    '''
    if 'day' in data.column_names():
        # idempotency check, if transformation already applied then simply return
        return data
    data['datetime'] = data['hour'].str_to_datetime('%y%m%d%H')
    data.remove_column('hour')
    data = data.split_datetime('datetime', column_name_prefix='', limit=['day', 'hour'])
    data['hour'] = data['hour'].astype(str)
    data['day'] = data['day'].astype(str)
    return data
In [14]:
# to measure how we are doing, we need to apply to both train and validation SFrames
train = hour_eng(train)
val = hour_eng(val)
print train.column_names()
Notice how there is a new 'day' column and a new 'hour' column. These are the new columns/features added by the hour_eng function.
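The same YYMMDDHH parsing can be tried on a single value with the standard library. This sketch uses datetime.strptime rather than the SFrame methods, and split_hour is a hypothetical helper name.

```python
from datetime import datetime

def split_hour(stamp):
    """Turn a YYMMDDHH string into (day, hour) string categories."""
    parsed = datetime.strptime(stamp, '%y%m%d%H')
    return str(parsed.day), str(parsed.hour)

day, hour = split_hour('14102123')
```

Here day is '21' and hour is '23'. Both are returned as strings so that a model would treat them as categories rather than as numbers with an ordering.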
In [15]:
features = ['site_id', 'day', 'hour']
model = gl.logistic_classifier.create(train, features=features,
target='click', validation_set=val)
Note: The classifier has warned us that the column 'day' has low variance. We should remove it because we are only training on one day's worth of data, so the day is not meaningful.
In [16]:
print "%0.20f" % eval_model(model, val)
That is a big improvement and a good first step but we can do better with some more feature engineering.
I believe click-through rate depends on which site the user is on, along with which app the user is seeing or using. So I want to create a new feature that represents the interaction of the site_id and app_id features. This results in a cross-product between the two features.
Again it is easiest to have a function so this set of transformations can be done on both the training set and the validation set, as below.
In [17]:
def site_app(data):
    # if already has a site_app then consider this operation completed.
    if 'site_app' in data.column_names():
        return data
    data['site_app'] = data['site_id'] + data['app_id']
    return data
In [18]:
train = site_app(train)
val = site_app(val)
print train.column_names()
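One detail of building an interaction by string concatenation is worth a quick sketch: without a separator, two different pairs can collapse into the same category when the ids vary in length. The ids below are made up to show the effect, and the cross helper with its sep parameter is a hypothetical variant, not what site_app does.

```python
def cross(site_id, app_id, sep='|'):
    """Combine two categorical values into one interaction category."""
    return site_id + sep + app_id

# Hypothetical ids chosen to show why a separator can matter
plain_a = 'ab' + 'c'  # site 'ab', app 'c'
plain_b = 'a' + 'bc'  # site 'a',  app 'bc'

with_sep_a = cross('ab', 'c')
with_sep_b = cross('a', 'bc')
```

plain_a and plain_b are both 'abc', while the separated versions stay distinct. If every id has the same length this collision cannot occur, so plain concatenation is safe in that case.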
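Now let's use this newly generated feature instead of 'site_id' and see how we do. Note that the feature list below still includes 'day'; as the classifier warned above, it carries no signal on a single day of data and could be dropped, whereas 'hour' is kept.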
In [19]:
features=['site_app', 'day', 'hour']
model = gl.logistic_classifier.create(train, features=features, target='click', validation_set=val)
In [20]:
print "%0.20f" % eval_model(model, val)
That is better, but let us see if we can add some more complex feature engineering transformations to get better results.
Using the Feature Engineering Transformers, we can extend beyond the single interaction described above and generate a feature holding all pairwise combinations of a set of features, using the QuadraticFeatures transformer, as follows.
Note: This step can take quite a while (about an hour on a basic Macbook Pro)
In [21]:
from graphlab import feature_engineering as fe
# QuadraticFeatures requires float, integer, array, or dictionary column types,
# so cast the interaction columns to int first
interaction_columns = ['C1', 'banner_pos',
                       'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19',
                       'C20', 'C21', 'hour']
for feature in interaction_columns:
    train[feature] = train[feature].astype(int)
quad = fe.create(train, fe.QuadraticFeatures(features=interaction_columns))
In [22]:
def apply_quadratic(dataset):
    if 'quadratic_features' in dataset.column_names():
        # operation already performed, do nothing
        return dataset
    for feature in interaction_columns:
        dataset[feature] = dataset[feature].astype(int)
    return quad.transform(dataset)
In [23]:
# this adds a 'quadratic_features' column to both SFrames
train = apply_quadratic(train)
val = apply_quadratic(val)
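As a rough picture of what pairwise interaction terms look like, the sketch below multiplies every unordered pair of numeric columns in one made-up row. This is an illustration of the general idea only: the exact output format of QuadraticFeatures, and whether it also pairs a column with itself, is not shown in this notebook, and the function name quadratic_pairs is hypothetical.

```python
from itertools import combinations

def quadratic_pairs(row, columns):
    """Return {(col_a, col_b): value_a * value_b} for every unordered pair of columns."""
    return {(a, b): row[a] * row[b] for a, b in combinations(columns, 2)}

row = {'banner_pos': 1, 'device_type': 4, 'hour': 23}  # hypothetical values
pairs = quadratic_pairs(row, ['banner_pos', 'device_type', 'hour'])

# Number of distinct pairs among the 13 interaction columns used above
pair_count = len(list(combinations(range(13), 2)))
```

Three columns give three pairs, and thirteen columns give 78, which hints at why this step is slow and why the resulting feature is so wide.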
In [24]:
features=['site_app', 'day', 'hour', 'quadratic_features']
model = gl.logistic_classifier.create(train, features=features,
target='click', validation_set=val, max_iterations=20)
In [25]:
print "%0.20f" % eval_model(model, val)
This is an improvement, so we are on the right track. But let us see whether limiting the number of categories for each feature, and so reducing dimensionality, will help.
A one-hot encoder encodes a collection of categorical features using a 1-of-K encoding scheme while decreasing dimensionality by limiting each feature to a max number of categories.
In [26]:
# Create the OneHotEncoder, and fit it to the train dataset
# use ALL features, except id and click. Limit the number of categories per feature to only the 300 most frequent.
# Excluding quadratic_features column because it is encoding the same thing
onehot = fe.create(train, fe.OneHotEncoder(
    excluded_features=['id', 'click', 'site_app', 'day', 'hour', 'quadratic_features'],
    max_categories=300))
In [27]:
# Note: max_categories chooses the top categories based on
# frequency, therefore retaining only the most frequent ones.
def apply_onehot(dataset):
    if 'encoded_features' in dataset.column_names():
        # operation already completed on SFrame
        return dataset
    return onehot.transform(dataset)
In [28]:
train = apply_onehot(train)
val = apply_onehot(val)
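The combination of 1-of-K encoding with a cap on the number of categories can be sketched in plain Python. The helper names fit_one_hot and one_hot are hypothetical, the site ids are made up, and encoding a dropped category as all zeros is an assumption of this sketch rather than documented OneHotEncoder behaviour.

```python
from collections import Counter

def fit_one_hot(values, max_categories):
    """Map the most frequent categories to column indices."""
    top = [value for value, _ in Counter(values).most_common(max_categories)]
    return {value: index for index, value in enumerate(top)}

def one_hot(value, index_of):
    """Return a 1-of-K list; categories outside the kept set encode as all zeros."""
    vector = [0] * len(index_of)
    if value in index_of:
        vector[index_of[value]] = 1
    return vector

sites = ['a', 'a', 'a', 'b', 'b', 'c']  # hypothetical site ids
index_of = fit_one_hot(sites, max_categories=2)
```

With max_categories=2 only 'a' and 'b' get a column, so one_hot('a', index_of) is [1, 0] and the rare 'c' becomes [0, 0]. The mapping is learned once on the training values and then reused, which is the same fit-then-transform pattern the cells above follow.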
In [29]:
features=['site_app', 'day', 'hour', 'quadratic_features', 'encoded_features']
model = gl.logistic_classifier.create(train, features=features, target='click', validation_set=val)
In [30]:
print "%0.20f" % eval_model(model, val)
Again, we are doing much better by restricting ourselves to only the top few categories. Now let us see if we can apply one transformer to another, using the FeatureHasher to reduce dimensionality overall, including reducing the dimensionality of the quadratic_features column.
The feature hasher hashes multiple features into an n-bit feature space. This process greatly reduces dimensionality.
In [31]:
hasher = fe.create(train, fe.FeatureHasher(
excluded_features=['id', 'click', 'site_app', 'day', 'hour', 'encoded_features'],
num_bits=22))
In [32]:
def apply_feature_hasher(dataset):
    if 'hashed_features' in dataset.column_names():
        # feature hasher already performed
        return dataset
    return hasher.transform(dataset)
In [33]:
train = apply_feature_hasher(train)
val = apply_feature_hasher(val)
print train.column_names()
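The hashing trick itself fits in a few lines. The sketch below hashes each "column=value" pair of one made-up row into one of 2**num_bits buckets using MD5 from the standard library; the hash function, the key format, and the name hash_features are all assumptions made for illustration and say nothing about how FeatureHasher is implemented.

```python
import hashlib

def hash_features(row, num_bits):
    """Hash each 'column=value' pair into one of 2**num_bits buckets and count hits."""
    buckets = {}
    for column, value in sorted(row.items()):
        key = ('%s=%s' % (column, value)).encode('utf-8')
        index = int(hashlib.md5(key).hexdigest(), 16) % (2 ** num_bits)
        buckets[index] = buckets.get(index, 0) + 1
    return buckets

row = {'site_id': 'abc', 'device_model': 'xyz', 'C14': '15706'}  # hypothetical values
hashed = hash_features(row, num_bits=22)
tiny = hash_features(row, num_bits=1)
```

However many distinct categories the data contains, the output never has more than 2**num_bits dimensions (about 4.2 million for num_bits=22). The price is that unrelated values can share a bucket, which becomes obvious when num_bits is set to 1.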
In [34]:
# we drop quadratic_features because it has been hashed by the FeatureHasher,
# so its signal is encoded in the 'hashed_features' column
features=['site_app', 'day', 'hour', 'encoded_features', 'hashed_features']
model = gl.logistic_classifier.create(train, features=features, target='click', validation_set=val)
In [35]:
print "%0.20f" % eval_model(model, val)
This log_loss is being calculated against the validation set, which was a random 10% split of the first day of clickstream data. That validation set may not be entirely generalizable, so it would not be surprising if the log_loss on the test set is higher. Still, this is definitely a good demonstration of how using feature engineering and feature selection can greatly improve the results of a model.
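In this notebook we covered the following types of feature engineering: time-of-day extraction from a timestamp, a site/app interaction feature, quadratic features, one-hot encoding with a cap on the number of categories, and feature hashing.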
We have done a bunch of feature engineering, but have only introduced the powerful transformers available in GraphLab Create.
It is just as easy to write your own custom Transformer: simply extend TransformerBase and then write arbitrary Python code for the fit and transform methods. This allows for creative ensembling.
It is also worth mentioning that we could have chained the Transformers together using the TransformerChain class, making it easy for us to compose complex feature engineering transformations that can be used just as simply.
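The fit/transform pattern and the idea of chaining steps can be sketched in plain Python. None of the classes below use TransformerBase or TransformerChain; HourOfDay, KeepTopSites, and Chain are hypothetical names written only to show the shape of the idea on made-up rows.

```python
from collections import Counter

class HourOfDay(object):
    """Stateless step: adds 'hour_of_day' from the last two digits of YYMMDDHH."""
    def fit(self, rows):
        return self  # nothing to learn
    def transform(self, rows):
        return [dict(row, hour_of_day=row['hour'][-2:]) for row in rows]

class KeepTopSites(object):
    """Stateful step: learns the k most frequent site ids, maps the rest to 'other'."""
    def __init__(self, k):
        self.k = k
        self.kept = set()
    def fit(self, rows):
        counts = Counter(row['site_id'] for row in rows)
        self.kept = set(site for site, _ in counts.most_common(self.k))
        return self
    def transform(self, rows):
        return [dict(row, site_id=row['site_id'] if row['site_id'] in self.kept else 'other')
                for row in rows]

class Chain(object):
    """Fits each step on the output of the previous one, then replays them in order."""
    def __init__(self, steps):
        self.steps = steps
    def fit(self, rows):
        for step in self.steps:
            rows = step.fit(rows).transform(rows)
        return self
    def transform(self, rows):
        for step in self.steps:
            rows = step.transform(rows)
        return rows

train_rows = [{'hour': '14102100', 'site_id': 'a'},
              {'hour': '14102101', 'site_id': 'a'},
              {'hour': '14102101', 'site_id': 'b'}]
val_rows = [{'hour': '14102123', 'site_id': 'b'}]

chain = Chain([HourOfDay(), KeepTopSites(k=1)]).fit(train_rows)
val_out = chain.transform(val_rows)
```

The chain is fitted on the training rows only and then applied unchanged to the validation rows, which replaces the pattern of calling each apply_* helper on train and val by hand.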
There is a lot more we can do with Feature Engineering. Hopefully this notebook has given you a flavor of how easy and scalable Feature Engineering is in GraphLab Create!