Decoding values from OneHotEncoder

Let's say we have encoded a categorical variable using one-hot encoding, i.e. as a binary vector in which each column indicates one of the values. The scikit-learn library provides a class called OneHotEncoder to do this job. However, it is not well documented how to decode the binary vectors back to the original numeric values, i.e. how to perform the inverse transform. In this notebook we'll explore a very simple way to do it. Since OneHotEncoder supports both dense and sparse matrix formats, we'll try both.
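To make the encoding itself concrete, here is a minimal NumPy-only sketch of what one-hot encoding does, independent of scikit-learn. The helper name `one_hot` is hypothetical and used only for illustration; it assumes a 1D array of values.

```python
import numpy as np

def one_hot(values):
    """Hypothetical helper: one-hot encode a 1D array with plain NumPy."""
    # categories: sorted distinct values; codes: position of each value in categories
    categories, codes = np.unique(values, return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for category i
    encoded = np.eye(len(categories))[codes]
    return categories, encoded

categories, encoded = one_hot(np.array([6, 9, 8, 2]))
print(categories)  # sorted distinct values, one per column
print(encoded)     # one row per input value, with a single 1 in each row
```

Each column of the result corresponds to one distinct value, in sorted order, which is the same layout the encoder produces below.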


In [62]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

In [63]:
orig = np.array([6, 9, 8, 2, 5, 4, 5, 3, 3, 6])

Dense matrix


In [64]:
ohe_dense = OneHotEncoder(sparse=False)

In [65]:
# reshape(-1, 1) because the encoder expects a 2D array (samples x features)
encoded_dense = ohe_dense.fit_transform(orig.reshape(-1, 1))
encoded_dense


Out[65]:
array([[ 0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.]])

In [66]:
ohe_dense.active_features_


Out[66]:
array([2, 3, 4, 5, 6, 8, 9])

The key insight is that the active_features_ attribute of the fitted encoder holds the original value that each binary column stands for. Every encoded row contains a single 1, located in the column of its original value, so the dot product of a row with active_features_ picks out exactly that value. We can therefore decode the whole matrix with one matrix-vector product.
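The arithmetic can be checked by hand. The sketch below uses plain NumPy with the column values from Out[66] hard-coded, so it does not depend on the encoder. It also shows an equivalent decoding via argmax, which is an alternative added here for comparison and not something the notebook itself uses; the variable names are illustrative.

```python
import numpy as np

# Column values, copied from the active_features_ output above
column_values = np.array([2, 3, 4, 5, 6, 8, 9])

# Two encoded rows: the first has its 1 in column 4, the second in column 6
rows = np.array([
    [0., 0., 0., 0., 1., 0., 0.],
    [0., 0., 0., 0., 0., 0., 1.],
])

# Dot product: 0*2 + 0*3 + 0*4 + 0*5 + 1*6 + 0*8 + 0*9 = 6 for the first row
via_dot = rows.dot(column_values).astype(int)

# Alternative: find the position of the 1 in each row and index the values
via_argmax = column_values[rows.argmax(axis=1)]

print(via_dot)     # [6 9]
print(via_argmax)  # [6 9]
```

The dot product only makes sense when the original values are numeric, whereas indexing by argmax would also work for non-numeric labels.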


In [67]:
decoded_dense = encoded_dense.dot(ohe_dense.active_features_).astype(int)
decoded_dense


Out[67]:
array([6, 9, 8, 2, 5, 4, 5, 3, 3, 6])

In [68]:
np.allclose(orig, decoded_dense)


Out[68]:
True

Sparse matrix


In [69]:
ohe_sparse = OneHotEncoder(sparse=True)

In [70]:
encoded_sparse = ohe_sparse.fit_transform(orig.reshape(-1, 1))
encoded_sparse


Out[70]:
<10x7 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [71]:
ohe_sparse.active_features_


Out[71]:
array([2, 3, 4, 5, 6, 8, 9])

In [72]:
np.allclose(orig, encoded_sparse.dot(ohe_sparse.active_features_))


Out[72]:
True
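Newer scikit-learn releases no longer expose active_features_ (it was deprecated in 0.20 and removed in 0.22); they provide a categories_ attribute and an inverse_transform() method instead. The sketch below assumes scikit-learn 0.20 or later and was not part of the notebook run above. It relies on the default sparse output so that it does not need the sparse argument, whose name differs between versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

orig = np.array([6, 9, 8, 2, 5, 4, 5, 3, 3, 6])

# Assumes scikit-learn >= 0.20; the encoded output is sparse by default
ohe = OneHotEncoder(categories="auto")
encoded = ohe.fit_transform(orig.reshape(-1, 1))

# categories_ holds one array of sorted distinct values per input column
column_values = ohe.categories_[0]

# Built-in inverse; it returns a 2D array, so flatten it back to 1D
decoded_builtin = ohe.inverse_transform(encoded).ravel()

# The dot-product trick from this notebook, using categories_ instead
decoded_dot = encoded.dot(column_values).astype(int)

print(np.array_equal(orig, decoded_builtin))
print(np.array_equal(orig, decoded_dot))
```

With these versions, categories_[0] plays the role that active_features_ plays in the cells above.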