In [18]:

```python
__author__ = 'Nick Dingwall, Chris Potts'
```

**TL;DR** Decision tree models can handle categorical variables without one-hot encoding them. However, popular implementations of decision trees (and random forests) differ as to whether they honor this fact. We show that one-hot encoding can seriously degrade tree-model performance. Our primary comparison is between H2O (which honors categorical variables) and scikit-learn (which requires them to be one-hot encoded).

Many real-world datasets include a mix of continuous and categorical variables. The defining property of the latter is that they do not permit a total ordering. A major advantage of **decision tree models** and their ensemble counterparts, **random forests**, is that they are able to operate on both continuous and categorical variables directly. In contrast, most other popular models (e.g., generalized linear models, neural networks) must instead transform categorical variables into some numerical analog, usually by one-hot encoding them to create a new **dummy variable** for each **level** of the original variable.

One-hot encoding can lead to a huge increase in the dimensionality of the feature representations. For example, one-hot encoding U.S. states adds 49 dimensions to the intuitive feature representation. In addition, one-hot encoding erases important structure in the underlying representation by splitting a single feature into many separate ones. (The naming convention used by many software packages can be misleading: despite their shared prefix, the resulting dummy features are completely separate.)
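To make the transformation concrete, here is what one-hot encoding looks like in pandas (the `state` column and its values are just an illustration):

```python
import pandas as pd

# A single categorical column with three levels.
df = pd.DataFrame({"state": ["CA", "NY", "TX", "CA"]})

# One-hot encoding replaces it with one binary ("dummy") column per level.
dummies = pd.get_dummies(df, columns=["state"])
print(list(dummies.columns))  # ['state_CA', 'state_NY', 'state_TX']
```

Although the three resulting columns share a `state_` prefix, a downstream model treats them as entirely independent features.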

But one-hot encoding also presents two problems that are more particular to tree-based models:

- The resulting sparsity virtually ensures that continuous variables are assigned higher feature importance.
- A single level of a categorical variable must meet a very high bar in order to be selected for splitting early in the tree building. This can degrade predictive performance.

This post substantiates both of these points with a comparison between scikit-learn, which presupposes one-hot encoding, and H2O, which does not. We do this by constructing an artificial dataset with a known relationship between the features and the target, and explaining how these problems arise.

For the most part, this post reports just our technique and central findings, but all of the code is available in `tree_categorical_variables.py`.

In [2]:

```python
from tree_categorical_variables import *
```

For our artificial data experiments, we define a categorical variable $c$ whose levels are partitioned into a positive set $C^+$ and a negative set $C^-$, and a normally-distributed continuous variable $z \sim N(10, 3^2)$, and specify that

$$ y = \begin{cases} 1, & \text{if } c \in C^+ \text{ or } z > 10 \\ 0, & \text{otherwise} \end{cases} $$

To make this more challenging, we'll create some further continuous variables $x_i$ that have varying correlations with $y$.

Let's generate a dataset with $|C^+| = |C^-| = 100$, $100$ additional weakly correlated features, and $10,000$ samples:

In [3]:

```python
data_categorical, data_onehot = generate_dataset(
    num_x=100, n_samples=10000, n_levels=200)
```
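The actual generator is defined in `tree_categorical_variables.py`; the following is only a rough sketch of the data-generating process described above (the function name, the noise model for the $x_i$, and all constants here are our own illustrative choices, not the authors' implementation):

```python
import numpy as np
import pandas as pd

def make_toy_dataset(n_samples=10000, n_levels=200, num_x=100, seed=0):
    """Illustrative stand-in for generate_dataset: y = 1 iff c is a
    'positive' level or z > 10, plus weakly correlated noise features."""
    rng = np.random.RandomState(seed)
    c = rng.randint(n_levels, size=n_samples)   # levels 0..n_levels-1
    z = rng.normal(10, 3, size=n_samples)       # z ~ N(10, 3^2)
    # The first half of the levels play the role of C+.
    y = ((c < n_levels // 2) | (z > 10)).astype(int)
    # Continuous features with weak, varying correlation to y.
    xs = {f"x{i}": rng.normal(size=n_samples) + 0.1 * i / num_x * y
          for i in range(num_x)}
    return pd.DataFrame({"c": c, "z": z, "y": y, **xs})

df = make_toy_dataset(n_samples=1000, num_x=5)
```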

In [4]:

```python
data_categorical.head(10).round(3)
```


In [6]:

```python
from IPython.display import SVG, display
display(SVG("fig/Decision tree visualization.svg"))
```


In [7]:

```python
data_onehot.head(10).round(3)
```


We'll first focus our discussion on single decision trees to keep things simple, and then extend the results to random forests. We conduct two kinds of analysis:

- A baseline model that doesn't include the categorical variable $c$.
- A model that includes $c$.

This allows us to quantify the value of $c$ for the prediction problem. For each experiment, we'll train and evaluate a tree 10 times and average the results.

In [4]:

```python
results_no_c = evaluate_sklearn_model(
    data_onehot,
    feature_names=get_feature_names(data_onehot, include_c=False),
    target_col='y',
    model=DecisionTreeClassifier())
print_auc_mean_std(results_no_c)
```


Our evaluation metric is the **area under the ROC curve** (AUC), which balances the true positive and false positive rates and has a maximum value of 1.0 (perfect classification). A score of 0.73 is respectable but far from stellar. Now let's add $c$ back:

In [5]:

```python
results_with_c = evaluate_sklearn_model(
    data_onehot,
    feature_names=get_feature_names(data_onehot, include_c=True),
    target_col='y',
    model=DecisionTreeClassifier())
print_auc_mean_std(results_with_c)
```


In [6]:

```python
print_sorted_mean_importances(results_with_c)
```


The dummies derived from $c$ receive strikingly little importance. In fact, all of the $x$-based features have higher importance than all of the $c$-based ones.

Next, we run the same experiments with H2O, using `H2ODecisionTree`, which specifies a single-tree forest using all the available features and is therefore equivalent to a single decision tree. As before, we first evaluate the model without $c$:

In [7]:

```python
h2o_results_no_c = evaluate_h2o_model(
    data_categorical,
    feature_names=get_feature_names(data_categorical, include_c=False),
    target_col='y',
    model=H2ODecisionTree())
print_auc_mean_std(h2o_results_no_c)
```


Now let's include $c$:

In [8]:

```python
h2o_results_with_c = evaluate_h2o_model(
    data_categorical,
    feature_names=get_feature_names(data_categorical, include_c=True),
    target_col='y',
    model=H2ODecisionTree())
print_auc_mean_std(h2o_results_with_c)
```


In [9]:

```python
print_sorted_mean_importances(h2o_results_with_c)
```


Finally, let's see what happens if we use H2O with the one-hot encoded data:

In [12]:

```python
h2o_results_with_c_onehot = evaluate_h2o_model(
    data_onehot,
    feature_names=get_feature_names(data_onehot, include_c=True),
    target_col='y',
    model=H2ODecisionTree())
print_auc_mean_std(h2o_results_with_c_onehot)
```


With one-hot encoding, H2O's performance is about the same as that of scikit-learn.

To understand what's causing the difference, we need to study the logic of tree-building algorithms.

At the heart of the tree-building algorithm is a subalgorithm that splits the samples into two bins by selecting a variable and a value. This splitting algorithm considers each of the features in turn, and for each feature selects the value of that feature that minimizes the *impurity* of the bins. We won't get into the details of how this is calculated (and there's more than one way), except to say that you can consider a bin that contains mostly positive or mostly negative samples more pure than one that contains a mixture. There's a nice visualization of the algorithm in the Visual Introduction to Machine Learning.
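As a concrete example, the most widely used impurity measure for classification is the Gini index; a candidate split is scored by the sample-weighted impurity of the two bins it creates. A minimal sketch (our own, for illustration):

```python
import numpy as np

def gini(y):
    """Gini impurity of a vector of 0/1 labels: 1 - p^2 - (1 - p)^2."""
    p = np.mean(y)
    return 1.0 - p**2 - (1.0 - p)**2

def split_impurity(y_left, y_right):
    """Sample-weighted impurity of a candidate split."""
    n = len(y_left) + len(y_right)
    return (len(y_left) * gini(y_left) + len(y_right) * gini(y_right)) / n

print(gini(np.array([1, 1, 1, 1])))  # 0.0 (a pure bin)
print(gini(np.array([0, 1, 0, 1])))  # 0.5 (maximally mixed)
```

The splitting algorithm picks the feature and value whose induced split minimizes `split_impurity`.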

In our case, we'd hope that, when the algorithm considers $z$, it would choose to split at $10$. That is, any example whose value of $z$ is less than $10$ goes into one bin, and any whose value is greater than $10$ goes into the other. It should, in turn, further subdivide the samples assigned to the 'less-than' bin, since we know that some of these are in fact positive.

Binary variables are automatically disadvantaged here, since there is only one way to split the samples: 0s one way, and 1s the other. Low-cardinality categorical variables suffer from the same problem. Another way to look at it: a continuous variable induces an ordering of the samples, and the algorithm can split that ordered list anywhere. A binary variable can only be split in one place, and a categorical variable with $q$ levels can be split in $2^{q-1} - 1$ ways.
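In case the counting is unfamiliar: choosing a subset of the $q$ levels for the left bin gives $2^q$ choices; halving (a split and its mirror image are the same) and discarding the trivial all-or-nothing split leaves $2^{q-1} - 1$. A quick check:

```python
def n_categorical_splits(q):
    """Distinct ways to split q levels into two non-empty bins,
    treating {A, B} and {B, A} as the same split."""
    return 2 ** (q - 1) - 1

print(n_categorical_splits(2))   # 1: a binary variable splits one way
print(n_categorical_splits(3))   # 3
print(n_categorical_splits(30))  # 536870911: already infeasible to enumerate
```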

An important sidenote: we don't actually have to search all the partitions because there are efficient algorithms for both binary classification and regression that are guaranteed to find the optimal split in linear time — see page 310 of the Elements of Statistical Learning. No such guarantee exists for multinomial classification, but there is a heuristic.
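The trick behind that linear-time guarantee for binary targets: order the levels by their mean response, and the optimal bipartition is guaranteed to respect that ordering, so only $q - 1$ ordered splits need to be scanned rather than all $2^{q-1} - 1$ partitions. A sketch (our own minimal implementation, not H2O's):

```python
import numpy as np

def weighted_gini(y_left, y_right):
    """Sample-weighted Gini impurity of a two-bin split of 0/1 labels."""
    n = len(y_left) + len(y_right)
    g = lambda y: 1.0 - np.mean(y)**2 - (1.0 - np.mean(y))**2
    return (len(y_left) * g(y_left) + len(y_right) * g(y_right)) / n

def best_categorical_split(c, y):
    """Optimal binary partition of the levels of c for 0/1 labels y:
    sort levels by positive rate, then scan only the q-1 ordered splits."""
    levels = np.unique(c)
    rates = np.array([y[c == lvl].mean() for lvl in levels])
    order = levels[np.argsort(rates)]
    best_set, best_imp = None, np.inf
    for i in range(1, len(order)):
        left = np.isin(c, order[:i])
        imp = weighted_gini(y[left], y[~left])
        if imp < best_imp:
            best_set, best_imp = set(order[:i]), imp
    return best_set, best_imp

c = np.array(list("aabbdd"))
y = np.array([1, 1, 1, 1, 0, 0])
print(best_categorical_split(c, y))  # level 'd' split off; impurity 0.0
```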

By one-hot encoding a categorical variable, we create many binary variables, and from the splitting algorithm's point of view, they're all independent. This means a categorical variable is already disadvantaged over continuous variables. But there's a further problem: these binary variables are *sparse*. Imagine our categorical variable has 100 levels, each appearing about as often as the others. The best the algorithm can expect to do by splitting on one of its one-hot encoded dummies is to reduce impurity by $\approx 1\%$, since each of the dummies will be 'hot' for around $1\%$ of the samples.
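We can check that $\approx 1\%$ figure numerically (the setup below is our own illustration): 100 equally frequent levels, half of them positive, so the parent node is a 50/50 mix with Gini impurity 0.5:

```python
import numpy as np

rng = np.random.RandomState(0)
n, q = 100000, 100
c = rng.randint(q, size=n)
y = (c < q // 2).astype(int)    # levels 0..49 are the 'positive' set

def gini(y):
    p = y.mean()
    return 1.0 - p**2 - (1.0 - p)**2

def split_gain(mask, y):
    """Impurity decrease from splitting on a boolean condition."""
    w = mask.mean()
    return gini(y) - w * gini(y[mask]) - (1 - w) * gini(y[~mask])

# The best any single one-hot dummy can do: peel off ~1% of the samples.
print(split_gain(c == 0, y))        # ~0.005
# Splitting on the full categorical variable separates the classes exactly.
print(split_gain(c < q // 2, y))    # ~0.5
```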

The result of all this is that, if we start by one-hot encoding a high-cardinality variable, the tree building algorithm is unlikely to select one of its dummies as the splitting variable near the root of the tree, instead choosing continuous variables. In datasets like the one we created here, that leads to inferior performance.

In contrast, by considering all of the levels of $c$ at once, H2O's algorithm is able to select $c$ at the very top of the tree.

The importance score assigned to each feature is a measure of how often that feature was selected, and how much of an effect it had in reducing impurity when it was selected. (We don't consider permutation feature importance here; this might help combat the preference for continuous variables over binary ones, but it will not help with the induced sparsity.)
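Scikit-learn's `feature_importances_` attribute implements exactly this impurity-decrease accounting, normalized to sum to 1. A small illustration on made-up data, where only one feature is informative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)   # only the first feature matters

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
# The informative feature collects essentially all of the importance.
print(tree.feature_importances_.round(3))
```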

H2O assigns about $70\%$ of its importance to $c$, and the remaining $30\%$ to $z$. Scikit-learn, in contrast, assigns less than $10\%$ in total to the one-hot encodings of $c$, $30\%$ to $z$, and almost $60\%$ collectively to the $x_i$, features that are entirely unnecessary for perfectly classifying the data!

As we discussed, this problem is especially profound for high-cardinality categorical variables. If the categorical variables have few levels, then the induced sparsity is less severe and the one-hot encoded versions have a chance of competing with the continuous ones.

The same story plays out when we extend the results to random forests. First, a scikit-learn ensemble trained on the one-hot encoded data:

In [14]:

```python
sklearn_ensemble_results = evaluate_sklearn_model(
    data_onehot,
    feature_names=get_feature_names(data_onehot, include_c=True),
    target_col='y',
    model=RandomForestClassifier(n_estimators=100))
print_auc_mean_std(sklearn_ensemble_results)
```


And an H2O ensemble trained on the categorical data:

In [15]:

```python
h2o_ensemble_results = evaluate_h2o_model(
    data_categorical,
    feature_names=get_feature_names(data_categorical, include_c=True),
    target_col='y',
    model=H2ORandomForestEstimator(ntrees=100))
print_auc_mean_std(h2o_ensemble_results)
```


Finally, we trained scikit-learn's `LogisticRegressionCV` classifier, to compare a linear classifier with the tree-based ones:

In [16]:

```python
sklearn_regression_results = evaluate_sklearn_model(
    data_onehot,
    feature_names=get_feature_names(data_onehot, include_c=True),
    target_col='y',
    model=LogisticRegressionCV())
print_auc_mean_std(sklearn_regression_results)
```


A linear model is at a disadvantage on this dataset: the target is defined by `OR` relations, and the features $x_i$ are all normally-distributed and contribute to a Bernoulli sampling probability for $y$.

We trained a model on the same data using R's `rpart` package, and it achieved near-perfect performance when $c$ was included, and performance similar to scikit-learn and H2O without it. It appears to correctly treat $c$ as a single categorical variable.

However, R's `randomForest` package only allows up to 53 levels per categorical variable (recently increased from 32), and even then it gets very slow as the number of levels increases. It's hard to know why this should be the case for binary classification or regression, since efficient algorithms exist to find the optimal partition; it seems to be performing a somewhat exhaustive search of the partitions of $c$. Please send us a message if you have any insight!

So, what can be done? There is some discussion about incorporating support for categorical features in scikit-learn, but that seems to be a way off. In the meantime, perhaps consider using H2O. Whatever you do, at least be aware that, despite theoretical results suggesting the expressive power of random forests with one-hot encoded features is the same as for categorical variables, the practical results are very different!