One Hot Encoding Tutorial -- Introduction to natural language processing:

Author: Dr. Rahul Remanan

CEO and Chief Imagination Officer, Moad Computer

This notebook is a modified fork of a tutorial from the Machine Learning Mastery blog.

Part 01 -- Basics of one-hot encoding:

Importing dependent libraries:



In [0]:

    
import numpy as np

Define the input string:



In [2]:

    
data = 'hello world'
print(data)









    



hello world

Define universe of possible input values:



In [0]:

    
alphabet = 'abcdefghijklmnopqrstuvwxyz '

Define a mapping of characters to corresponding integers:



In [0]:

    
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

Integer encoding of the input data:



In [5]:

    
integer_encoded = [char_to_int[char] for char in data]
print(integer_encoded)









    



[7, 4, 11, 11, 14, 26, 22, 14, 17, 11, 3]
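The dictionary lookup above raises a `KeyError` for any character that is not in `alphabet`, such as an uppercase letter or punctuation. The following is a minimal sketch of one way to guard against that, assuming it is acceptable to lowercase the text and silently drop unknown characters. The helper name `encode_text` is hypothetical and is not part of the original notebook.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz '
char_to_int = {c: i for i, c in enumerate(alphabet)}

def encode_text(text):
    """Hypothetical helper: lowercase the text and skip characters not in the alphabet."""
    return [char_to_int[ch] for ch in text.lower() if ch in char_to_int]

# The comma and exclamation mark are dropped; the capitals are lowercased.
safe_encoded = encode_text('Hello, World!')
print(safe_encoded)
```

Dropping unknown characters changes the length of the sequence; mapping them to a dedicated "unknown" index would be an alternative design if positions need to be preserved.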

One hot encoding:



In [0]:

    
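onehot_encoded = []
for value in integer_encoded:
    # Start from an all-zero vector with one slot per character in the alphabet
    letter = [0] * len(alphabet)
    # Set the slot for this character's integer code to 1
    letter[value] = 1
    onehot_encoded.append(letter)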



In [7]:

    
print(onehot_encoded)









    



[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
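The loop above can also be written as a single NumPy indexing step. This is a sketch of an equivalent approach, not the method used in the original notebook; the variable name `onehot_array` is introduced here for illustration.

```python
import numpy as np

alphabet = 'abcdefghijklmnopqrstuvwxyz '
integer_encoded = [7, 4, 11, 11, 14, 26, 22, 14, 17, 11, 3]  # 'hello world'

# Row i of the identity matrix is the one-hot vector for integer i,
# so indexing with the integer codes selects one row per character.
onehot_array = np.eye(len(alphabet), dtype=int)[integer_encoded]
print(onehot_array.shape)
```

Each row contains exactly one 1, located at the column given by the character's integer code.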

Decoding one-hot encoded data -- First character:



In [8]:

    
inverted = int_to_char[np.argmax(onehot_encoded[0])]
print(inverted)

Decoding one-hot encoded data -- Entire one-hot encoded input:



In [0]:

    
decoded = list()
for i in range(len(onehot_encoded)):
  decoded_char = int_to_char[np.argmax(onehot_encoded[i])]
  decoded.append(decoded_char)



In [10]:

    
print(''.join(decoded))









    



hello world
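Decoding every row can also be done in one call by taking `argmax` along axis 1. The sketch below rebuilds the one-hot matrix so that it runs on its own; the names `codes`, `onehot` and `decoded_text` are assumptions made for this example.

```python
import numpy as np

alphabet = 'abcdefghijklmnopqrstuvwxyz '
int_to_char = {i: c for i, c in enumerate(alphabet)}

# Rebuild a one-hot matrix for 'hello world' so the example stands alone.
codes = [7, 4, 11, 11, 14, 26, 22, 14, 17, 11, 3]
onehot = np.eye(len(alphabet), dtype=int)[codes]

# argmax along axis 1 returns the position of the 1 in each row.
decoded_text = ''.join(int_to_char[int(i)] for i in np.argmax(onehot, axis=1))
print(decoded_text)
```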

Part 02a -- One-hot encoding using scikit-learn:

Importing libraries:



In [0]:

    
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

Define the example:



In [12]:

    
data = ['cold', 
        'cold', 
        'warm', 
        'cold', 
        'hot', 
        'hot', 
        'warm', 
        'cold', 
        'warm', 
        'hot']
values = array(data)
print(values)









    



['cold' 'cold' 'warm' 'cold' 'hot' 'hot' 'warm' 'cold' 'warm' 'hot']

Integer encoding:



In [13]:

    
label_encoder = LabelEncoder()
label_encoded = label_encoder.fit_transform(values)
print(label_encoded)









    



[0 0 2 0 1 1 2 0 2 1]
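The integers above follow the sorted order of the unique labels, which `LabelEncoder` stores in its `classes_` attribute after fitting. The short sketch below makes that mapping explicit; the names `demo_encoder` and `label_to_int` are introduced here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

demo_encoder = LabelEncoder()
demo_encoder.fit(['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot'])

# classes_ holds the sorted unique labels; the index of each label is its integer code.
label_to_int = {str(label): code for code, label in enumerate(demo_encoder.classes_)}
print(label_to_int)
```

Because the order is alphabetical, the codes carry no meaning about temperature: `hot` (1) sits between `cold` (0) and `warm` (2).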

Binary encoding:



In [14]:

    
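# Leave the encoder at its default (sparse) output and convert with toarray();
# this avoids the `sparse` argument, which newer scikit-learn releases rename to `sparse_output`.
onehot_encoder = OneHotEncoder()
label_encoded = label_encoded.reshape(len(label_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(label_encoded).toarray()
print(onehot_encoded)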









    



[[1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]
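The two-step route above (integer encoding, then binary encoding) is not the only option. Assuming scikit-learn 0.20 or later, `OneHotEncoder` can be fitted on the string labels directly. The sketch below uses a shortened label list, and the names `string_encoder`, `dense` and `recovered` are assumptions made for this example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2-D input: one row per sample, one column per feature.
labels = np.array(['cold', 'cold', 'warm', 'cold', 'hot']).reshape(-1, 1)

# The default output is a SciPy sparse matrix; toarray() converts it to a dense array.
string_encoder = OneHotEncoder()
dense = string_encoder.fit_transform(labels).toarray()
print(dense)

# inverse_transform maps one-hot rows back to the original labels.
recovered = string_encoder.inverse_transform(dense)
print(recovered.ravel())
```

The column order again follows the sorted categories (`cold`, `hot`, `warm`), so the result should match the layout of the matrix printed above.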

Invert first example:



In [15]:

    
inverted = label_encoder.inverse_transform([argmax(onehot_encoded[0, :])])









    



/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
  if diff:

Output the decoded example:



In [16]:

    
print(inverted)









    



['cold']

Part 02b -- One-hot encoding using Keras:

Importing libraries:



In [17]:

    
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from keras.utils import to_categorical









    



Using TensorFlow backend.

Define the variable:



In [18]:

    
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
values = array(data)
print(values)









    



['cold' 'cold' 'warm' 'cold' 'hot' 'hot' 'warm' 'cold' 'warm' 'hot']

Integer encoding:



In [0]:

    
label_encoder = LabelEncoder()
label_encoded = label_encoder.fit_transform(values)



In [20]:

    
print(label_encoded)









    



[0 0 2 0 1 1 2 0 2 1]



In [21]:

    
# one hot encode
encoded = to_categorical(label_encoded)
print(encoded)
# invert encoding
# inverse_transform expects an array-like, so wrap the index in a list and take the first result
first_index = argmax(encoded[0])
inverted = label_encoder.inverse_transform([first_index])[0]









    



[[1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]






    



/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
  if diff:



In [22]:

    
print(inverted)









    



cold
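The cell above inverts only the first row. The sketch below inverts every row at once. To stay self-contained it builds the one-hot matrix with NumPy instead of `to_categorical`, on the assumption that both produce the same 0/1 layout; the names `encoder`, `codes`, `onehot` and `all_inverted` are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

values = np.array(['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot'])
encoder = LabelEncoder()
codes = encoder.fit_transform(values)

# Stand-in for the to_categorical output (assumption: same 0/1 layout).
onehot = np.eye(len(encoder.classes_))[codes]

# argmax over axis 1 recovers every integer code; inverse_transform maps them to labels.
all_inverted = encoder.inverse_transform(np.argmax(onehot, axis=1))
print(list(all_inverted))
```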

Part 02c -- One-hot encoding using Keras for numerical categories:



In [23]:

    
from numpy import array
from numpy import argmax
from keras.utils import to_categorical
# define example
data = [1, 3, 2, 0, 3, 2, 2, 1, 0, 1]
data = array(data)
print(data)
# one hot encode
encoded = to_categorical(data)
print(encoded)
# invert encoding
inverted = argmax(encoded[0])









    



[1 3 2 0 3 2 2 1 0 1]
[[0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]]
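In the output above the matrix has four columns because the largest label is 3. `to_categorical` takes an optional `num_classes` argument, and when it is omitted the width is inferred as the largest label plus one. The sketch below illustrates why that matters, using a hypothetical NumPy stand-in named `one_hot` rather than Keras itself.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """Hypothetical NumPy stand-in for to_categorical, for illustration only."""
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        num_classes = labels.max() + 1  # inferred from the data when not given
    return np.eye(num_classes, dtype='float32')[labels]

batch = [1, 0, 2]  # this batch happens to contain no category 3
inferred = one_hot(batch)              # width inferred from the batch
fixed = one_hot(batch, num_classes=4)  # width fixed to the full category set
print(inferred.shape, fixed.shape)
```

Fixing the number of classes keeps the encoding width consistent across batches that do not contain every category.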



In [24]:

    
print(inverted)