Decoding values from OneHotEncoder

Let's say we encoded a categorical variable with one-hot encoding, i.e. as a binary vector in which each column indicates one of the possible values. The scikit-learn library provides a class called OneHotEncoder to do this job. However, it is not well documented how to decode the binary vectors back to the original numeric values, i.e. how to perform the inverse transform. In this notebook we'll explore a very simple way to do it. Since OneHotEncoder supports both dense and sparse matrix formats, we'll try both.



In [62]:

    
from sklearn.preprocessing import OneHotEncoder
import numpy as np



In [63]:

    
orig = np.array([6, 9, 8, 2, 5, 4, 5, 3, 3, 6])

Dense matrix



In [64]:

    
ohe_dense = OneHotEncoder(sparse=False)



In [65]:

    
# fit_transform() expects 2D input, so reshape the 1D array into a single column
encoded_dense = ohe_dense.fit_transform(orig.reshape(-1, 1))
encoded_dense









    Out[65]:





array([[ 0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.]])



In [66]:

    
ohe_dense.active_features_









    Out[66]:





array([2, 3, 4, 5, 6, 8, 9])

The key insight is that the active_features_ attribute of the fitted encoder holds the original value that each binary column stands for. Each encoded row contains a single 1, located in the column of its original value, and 0 everywhere else, so the dot product of a row with active_features_ yields exactly that value. We can therefore decode the whole matrix with one dot product against this index.
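To see why the dot product works, here is a small NumPy-only sketch on three hand-built rows. The names `values` and `one_hot` are illustrative and not part of scikit-learn; `values` simply plays the role of active_features_.

```python
import numpy as np

# Column labels, playing the role of active_features_ (illustrative name).
values = np.array([2, 3, 4, 5, 6, 8, 9])

# Hand-built one-hot rows for the original values 6, 9 and 2.
one_hot = np.array([
    [0., 0., 0., 0., 1., 0., 0.],  # 1 in the column labelled 6
    [0., 0., 0., 0., 0., 0., 1.],  # 1 in the column labelled 9
    [1., 0., 0., 0., 0., 0., 0.],  # 1 in the column labelled 2
])

# Each row computes sum(indicator * label); only the selected label survives.
decoded = one_hot.dot(values).astype(int)
print(decoded)  # [6 9 2]

# Equivalent lookup: take the label at the position of the 1 in each row.
decoded_argmax = values[one_hot.argmax(axis=1)]
```

The argmax variant gives the same result here and, unlike the dot product, does not require the labels to be numeric.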



In [67]:

    
decoded_dense = encoded_dense.dot(ohe_dense.active_features_).astype(int)
decoded_dense









    Out[67]:





array([6, 9, 8, 2, 5, 4, 5, 3, 3, 6])



In [68]:

    
np.allclose(orig, decoded_dense)









    Out[68]:





True
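Note that the cells above rely on an older scikit-learn API. Newer releases no longer expose active_features_; to the best of our knowledge, since version 0.20 the fitted encoder provides a `categories_` attribute and an `inverse_transform()` method instead, and the `sparse` keyword was later renamed to `sparse_output`. The following is a sketch under that assumption, using the default constructor so that it does not depend on the keyword name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

orig = np.array([6, 9, 8, 2, 5, 4, 5, 3, 3, 6])

# Default constructor: avoids the `sparse` keyword, whose name differs across releases.
ohe = OneHotEncoder()
encoded = ohe.fit_transform(orig.reshape(-1, 1))

# categories_ holds one array of labels per input column; we have a single column.
labels = ohe.categories_[0]

# Same dot-product trick as above, with categories_ as the index.
decoded_dot = np.asarray(encoded.dot(labels)).astype(int)

# Built-in inverse; it returns a 2D array, so flatten it for comparison.
decoded_builtin = ohe.inverse_transform(encoded).ravel()

print(np.array_equal(orig, decoded_dot), np.array_equal(orig, decoded_builtin))
```

When `inverse_transform()` is available it is probably the more robust choice, since it also handles several input columns and non-numeric categories.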

Sparse matrix



In [69]:

    
ohe_sparse = OneHotEncoder(sparse=True)



In [70]:

    
encoded_sparse = ohe_sparse.fit_transform(orig.reshape(-1, 1))
encoded_sparse









    Out[70]:





<10x7 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>



In [71]:

    
ohe_sparse.active_features_









    Out[71]:





array([2, 3, 4, 5, 6, 8, 9])



In [72]:

    
np.allclose(orig, encoded_sparse.dot(ohe_sparse.active_features_))









    Out[72]:





True
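For the sparse case there is also an alternative to the dot product that reads the column positions directly. The sketch below builds a CSR one-hot matrix by hand with SciPy rather than with OneHotEncoder, so `values`, `cols` and `rows` are illustrative names. It assumes exactly one stored element per row and no explicitly stored zeros.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Column labels (what active_features_ holds in the cells above).
values = np.array([2, 3, 4, 5, 6, 8, 9])
orig = np.array([6, 9, 8, 2, 5, 4, 5, 3, 3, 6])

# Build a CSR one-hot matrix by hand: a single 1.0 per row,
# placed in the column whose label matches the original value.
cols = np.searchsorted(values, orig)
rows = np.arange(len(orig))
encoded = csr_matrix((np.ones(len(orig)), (rows, cols)),
                     shape=(len(orig), len(values)))

# With one stored element per row, .indices lists the column
# of each row's entry in row order, so a lookup decodes everything.
decoded = values[encoded.indices]

# The dot product from the notebook gives the same answer.
decoded_dot = encoded.dot(values).astype(int)
print(np.array_equal(decoded, orig), np.array_equal(decoded_dot, orig))
```

The index lookup never densifies the matrix and also works for non-numeric labels, but it silently breaks if a row has zero or several stored entries, so the dot product remains the simpler default.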