Introduction

Description

A large number of machine learning algorithms cannot work with categorical data out-of-the-box. One approach to overcoming this obstacle is one-hot encoding. However, if a categorical feature takes k distinct values and there are n objects in a dataset, n * k values will be stored after dense one-hot encoding. If k is high, this can cause a MemoryError. Of course, there are several approaches that do not result in extreme growth of the learning sample size even if categorical features have high cardinality. For example, the hashing trick is implemented in sklearn.feature_extraction.FeatureHasher. In this demo, another memory-efficient approach is shown.
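
For illustration, here is a minimal sketch (not part of the original demo) that contrasts the memory footprint of dense one-hot encoding with the hashing trick; the dataset sizes are made-up numbers chosen to make the point:

import numpy as np
from sklearn.feature_extraction import FeatureHasher

n, k = 100000, 10000  # many objects, a high-cardinality feature
categories = np.random.randint(k, size=n).astype(str)

# Dense one-hot encoding would store `n * k` values:
# 100000 * 10000 * 8 bytes is about 8 GB as float64.

# The hashing trick maps values into a fixed number of columns instead.
hasher = FeatureHasher(n_features=32, input_type='string')
X_hashed = hasher.transform([[c] for c in categories])  # sparse, n x 32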

Suppose that we want to replace a categorical feature with the mean of the target variable conditional on this feature (i.e., group by the feature, average the target within each group, and replace each value of the feature with the average of its group). Is it a good idea? At first glance, yes, it is. Nevertheless, the devil is in the details. The smaller a group of objects with a particular value of the feature is, the higher the contribution of each object's target to the within-group average is; in the extreme case of a single-object group, the new feature coincides with that object's own target. Random noise from the target variable leaks into the new feature, and it is not dampened by averaging over a sample if this sample is not big enough. Hence, the new feature gives an unfair advantage: this advantage is present during training, but it is absent when new examples with unknown targets are provided.

A cure for this problem is to compute aggregates of the target using other objects' targets, but without the target of the object for which the new feature is being computed. For example, the new feature can be generated out-of-fold: the data are split into folds, and the new feature for each fold is computed on all the other folds only. This resembles stacking where the first-stage model does not use any features directly and just applies an aggregating function to the target grouped by the categorical feature under processing.
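
To make the procedure concrete, below is a hand-rolled sketch of out-of-fold mean encoding (the toy data and column names are illustrative; the dsawl classes introduced next encapsulate this logic together with the necessary fallbacks):

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

toy_df = pd.DataFrame({
    'category': ['a', 'a', 'b', 'b', 'b', 'c'],
    'y':        [1.0, 3.0, 2.0, 4.0, 6.0, 5.0]
})
toy_df['encoded'] = np.nan
for fit_indices, encode_indices in KFold(n_splits=3).split(toy_df):
    # Within-group means are computed on the other folds only...
    means = toy_df.iloc[fit_indices].groupby('category')['y'].mean()
    # ...and then mapped onto the current fold. A value that does not
    # occur in the other folds yields NaN here, which is why smoothing
    # and unconditional aggregates are needed in practice.
    toy_df.iloc[encode_indices, toy_df.columns.get_loc('encoded')] = (
        toy_df['category'].iloc[encode_indices].map(means).values
    )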

The classes OutOfFoldTargetEncodingRegressor and OutOfFoldTargetEncodingClassifier, which can be imported from dsawl.target_encoding, are plug-and-play implementations of this trick. They have an sklearn-compatible API, so cross-validation scores measured by means of sklearn utilities are realistic.
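
For instance, a cross-validation score obtained as below should be trustworthy, because the encoding is recomputed from scratch inside every training fold. This is only a sketch: X and y are placeholders for a feature matrix and a target, and source_positions marks the columns to encode, as in the cells that follow.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

from dsawl.target_encoding import OutOfFoldTargetEncodingRegressor

rgr = OutOfFoldTargetEncodingRegressor(LinearRegression, dict())
scores = cross_val_score(
    rgr, X, y,  # placeholders for features and target
    scoring='r2',
    fit_params={'source_positions': [1]}  # columns to target-encode
)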

References

Look at this Kaggle post for additional details.

General Preparations

Import Statements & Random Seed Setting


In [1]:
from typing import List, Tuple

import numpy as np
import pandas as pd

from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

from dsawl.target_encoding import OutOfFoldTargetEncodingRegressor

In [2]:
np.random.seed(361)

Synthetic Dataset Generation


In [3]:
def generate_data(
        slopes: List[float],
        group_size: int,
        noise_stddev: float
        ) -> pd.DataFrame:
    """
    Generate `len(slopes)` * `group_size` examples
    with dependency y = slope * x + noise.
    """
    dfs = []
    for i, slope in enumerate(slopes):
        curr_df = pd.DataFrame(columns=['x', 'category', 'y'])
        curr_df['x'] = range(group_size)
        curr_df['category'] = i
        curr_df['y'] = curr_df['x'].apply(
            lambda x: slope * x + np.random.normal(scale=noise_stddev)
        )
        dfs.append(curr_df)
    df = pd.concat(dfs)
    return df

Let us create a situation where in-fold generation of target-based features leads to overfitting. To do so, make many small categories and set the noise variance to a high value. Such a setup results in leakage of noise from the target into the in-fold generated means. Thus, the regressor learns to exploit this leakage, which is useless on hold-out sets.


In [4]:
slopes = [2, 1, 3, 4, -1, -2, 3, 2, 1, 5, -2, -3, -5, 8, 1, -7, 0, 2, 0]
group_size = 5
noise_stddev = 10

In [5]:
train_df = generate_data(slopes, group_size, noise_stddev)
train_df.head()


Out[5]:
   x  category          y
0  0         0   3.314436
1  1         0  -4.654332
2  2         0   1.642263
3  3         0  15.854888
4  4         0  21.873901

Generate a test set from the same distribution (in a way that preserves the balance between categories).


In [6]:
test_df = generate_data(slopes, group_size, noise_stddev)
test_df.head()


Out[6]:
   x  category         y
0  0         0 -0.682203
1  1         0  9.033670
2  2         0  8.911399
3  3         0  2.345571
4  4         0  3.408391

Benchmark: Training with In-Fold Generated Mean


In [7]:
encoding_df = (
    train_df
    .groupby('category', as_index=False)
    .agg({'y': 'mean'})
    .rename(columns={'y': 'infold_mean'})
)

In [8]:
def get_target_and_features(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    # Note: uses the globally defined `encoding_df` with in-fold means.
    merged_df = df.merge(encoding_df, on='category')
    X = merged_df[['x', 'infold_mean']]
    y = merged_df['y']
    return X, y

In [9]:
X_train, y_train = get_target_and_features(train_df)
X_test, y_test = get_target_and_features(test_df)

In [10]:
rgr = LinearRegression()

In [11]:
rgr.fit(X_train, y_train)
y_hat_train = rgr.predict(X_train)
r2_score(y_train, y_hat_train)


Out[11]:
0.42987610029007461

In [12]:
y_hat = rgr.predict(X_test)
r2_score(y_test, y_hat)


Out[12]:
0.061377490670754042

Overfitting is evident: the train set score far exceeds the test set score.

Out-of-Fold Target Encoding Estimator


In [13]:
X_train, y_train = train_df[['x', 'category']], train_df['y']
X_test, y_test = test_df[['x', 'category']], test_df['y']

In [14]:
splitter = KFold(shuffle=True, random_state=361)
rgr = OutOfFoldTargetEncodingRegressor(
    LinearRegression,  # It is a type, not an instance of the class.
    dict(),  # If needed, pass constructor arguments here as a dictionary.
    # Separating constructor arguments from the estimator type makes code
    # that involves tools such as `GridSearchCV` more consistent.
    splitter=splitter,  # Defines how to make folds for feature generation.
    smoothing_strength=0,  # The new feature can be smoothed towards the unconditional aggregate.
    min_frequency=1,  # The unconditional aggregate is used for rare enough values.
    drop_source_features=True  # Whether to drop the source features used for conditioning.
)

What is shown below is a wrong way to measure the train score: for these predictions, the regressor generates features in-fold on the whole training set, not the out-of-fold features that were used during training.


In [15]:
rgr.fit(X_train, y_train, source_positions=[1])
y_hat_train = rgr.predict(X_train)
r2_score(y_train, y_hat_train)


Out[15]:
0.27857429182727422

Actually, this disparity between the training set and all other sets is the main reason why special estimators are implemented instead of making dsawl.target_encoding.TargetEncoder able to work inside sklearn.pipeline.Pipeline. If you want to use an estimator with target encoding inside a pipeline, pass a pipeline instance as the internal estimator, i.e., as the first argument. For more details, see Appendix II of this demo.

Now, let us look at the right way to measure performance on the train set. In OutOfFoldTargetEncodingRegressor and OutOfFoldTargetEncodingClassifier, the fit_predict method is not just a combination of the fit and predict methods: it is designed specifically for correct work with training sets.


In [16]:
y_hat_train = rgr.fit_predict(X_train, y_train, source_positions=[1])
r2_score(y_train, y_hat_train)


Out[16]:
0.18233620473442524

In [17]:
rgr.fit(X_train, y_train, source_positions=[1])
y_hat = rgr.predict(X_test)
r2_score(y_test, y_hat)


Out[17]:
0.14721978147696491

Thus, there is no blatant overfitting: the train set score and the test set score are close to each other (for some other random seeds the gap between them might be smaller, but even now it is not too considerable). It is also worth highlighting that the test set score is significantly higher than that of the benchmark regressor trained with the in-fold generated mean.

Appendix I. Some Snippets

How to Run Grid Search?


In [18]:
from sklearn.model_selection import GridSearchCV


grid_params = {
    'estimator_params': [{}, {'fit_intercept': False}],
    'smoothing_strength': [0, 10],
    'min_frequency': [0, 10]
}
rgr = GridSearchCV(
    OutOfFoldTargetEncodingRegressor(),
    grid_params
)
rgr.fit(X_train, y_train, source_positions=[1])
rgr.best_params_


Out[18]:
{'estimator_params': {}, 'min_frequency': 0, 'smoothing_strength': 0}

Appendix II. Advanced Integration with Pipelines

To start with, recall that the key reason for difficulties with target encoding is the disparity between the training set and other sets for which predictions are made. The training set is split into folds, whereas the other sets are not, because their targets are not used for target encoding.

If correct measurement of train scores is not crucial for you, you can use this quick-and-dirty trick.


In [19]:
from dsawl.target_encoding import TargetEncoder


class PipelinedTargetEncoder(TargetEncoder):
    pass


PipelinedTargetEncoder.fit_transform = PipelinedTargetEncoder.fit_transform_out_of_fold

# Now instances of `PipelinedTargetEncoder` can be used as transformers in pipelines
# and target encoding inside these pipelines is implemented in an out-of-fold fashion.
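
For example, such an encoder might be plugged into a pipeline as follows. This is only a sketch and not part of the original demo; it assumes that the encoder accepts a source_positions fit parameter in the same way as the estimators above.

pipeline = Pipeline([
    ('encoding', PipelinedTargetEncoder()),
    ('linear_model', LinearRegression())
])
pipeline.fit(X_train, y_train, encoding__source_positions=[1])
# Training uses out-of-fold encoding, but predictions made afterwards
# with `pipeline.predict` rely on encoding fitted on the whole training
# set, so train scores are still measured incorrectly.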

Anyway, let us look at a way that allows working with train scores in a methodologically correct manner.

Suppose that for some reason there is a need to learn a model from the dataset under consideration such that:

  • first of all, both features are scaled,
  • then the categorical feature is target-encoded,
  • then squares, cubes, and interactions of order not higher than three between all terms are included as new features,
  • and, finally, linear regression is run.

This is not so easy, because if a Pipeline instance is passed as the internal estimator, target encoding is the first transformation, yet in this case it must go after scaling. The snippet below demonstrates how to use OutOfFoldTargetEncodingRegressor inside a pipeline that meets the above specification.


In [20]:
splitter = KFold(shuffle=True, random_state=361)
rgr = OutOfFoldTargetEncodingRegressor(
    Pipeline,
    {
        'steps': [
            ('poly_enrichment', PolynomialFeatures()),
            ('linear_model', LinearRegression())
        ],
        'poly_enrichment__degree': 3
    },
    splitter=splitter
)

pipeline = Pipeline([
    ('scaling', StandardScaler()),
    ('target_encoding_regression', rgr)
])

X_train = X_train.astype(np.float64)  # Avoid `DataConversionWarning`.
pipeline.fit(
    X_train, y_train,
    target_encoding_regression__source_positions=[1]
)


Out[20]:
Pipeline(memory=None,
     steps=[('scaling', StandardScaler(copy=True, with_mean=True, with_std=True)), ('target_encoding_regression', OutOfFoldTargetEncodingRegressor(aggregators=None, drop_source_features=True,
                 estimator_params=None, estimator_type=None,
                 min_frequency=1, smoothing_strength=0.0,
                 splitter=KFold(n_splits=3, random_state=None, shuffle=True)))])

In [21]:
r2_score(
    y_train,
    pipeline.fit_predict(
        X_train, y_train,
        target_encoding_regression__source_positions=[1]
    )
)


Out[21]:
0.263271432939716

Actually, the relatively good training score is achieved by chance, and this odd pipeline does not have any advantages over the regular estimator. The cell above just demonstrates how the train score can be measured.