One Hot Encoding Tutorial -- Introduction to natural language processing:

Author: Dr. Rahul Remanan

CEO and Chief Imagination Officer, Moad Computer

This notebook is a modified fork of a tutorial from the Machine Learning Mastery blog.

Part 01 -- Basics of one-hot encoding:

Importing the required libraries:


In [0]:
import numpy as np

Define the input string:


In [2]:
data = 'hello world'
print(data)


hello world

Define universe of possible input values:


In [0]:
alphabet = 'abcdefghijklmnopqrstuvwxyz '

Define a mapping of characters to corresponding integers:


In [0]:
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

Integer encoding of the input data:


In [5]:
integer_encoded = [char_to_int[char] for char in data]
print(integer_encoded)


[7, 4, 11, 11, 14, 26, 22, 14, 17, 11, 3]
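
Note that the lookup `char_to_int[char]` raises a `KeyError` for any character that is not in `alphabet`, such as an uppercase letter or a punctuation mark. One possible guard, shown here only as a sketch, is to lowercase the input and drop unknown characters before encoding. The helper name `encode_text` is hypothetical and is not part of the original notebook.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz '
char_to_int = {c: i for i, c in enumerate(alphabet)}

def encode_text(text):
    """Hypothetical helper: lowercase the text and skip characters outside the alphabet."""
    return [char_to_int[c] for c in text.lower() if c in char_to_int]

print(encode_text('Hello, World!'))
# [7, 4, 11, 11, 14, 26, 22, 14, 17, 11, 3]
```

Silently dropping characters loses information, so depending on the task it may be preferable to extend `alphabet` instead.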

One-hot encoding:


In [0]:
onehot_encoded = []
for value in integer_encoded:
    # start from an all-zero vector and set the position of this character to 1
    letter = [0] * len(alphabet)
    letter[value] = 1
    onehot_encoded.append(letter)

In [7]:
print(onehot_encoded)


[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
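
The same matrix can be built without an explicit Python loop. The following is a minimal sketch, assuming NumPy is available: indexing the rows of an identity matrix with the integer codes yields one one-hot row per character. The name `onehot_matrix` is introduced here for illustration only.

```python
import numpy as np

alphabet = 'abcdefghijklmnopqrstuvwxyz '
char_to_int = {c: i for i, c in enumerate(alphabet)}
integer_encoded = [char_to_int[char] for char in 'hello world']

# Row i of the identity matrix is the one-hot vector for integer i,
# so indexing with the list of integers stacks the matching rows.
onehot_matrix = np.eye(len(alphabet), dtype=int)[integer_encoded]

print(onehot_matrix.shape)  # (11, 27)
print(onehot_matrix[0])     # a single 1 at index 7, the position of 'h'
```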

Decoding one-hot encoded data -- First character:


In [8]:
inverted = int_to_char[np.argmax(onehot_encoded[0])]
print(inverted)


h

Decoding one-hot encoded data -- Entire input:


In [0]:
decoded = list()
for row in onehot_encoded:
    # argmax returns the index of the single 1 in each one-hot row
    decoded.append(int_to_char[np.argmax(row)])

In [10]:
print(''.join(decoded))


hello world
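
Decoding can also be done for all rows at once. As a sketch under the assumption that the one-hot data is held in a 2-D NumPy array, `np.argmax` with `axis=1` returns the column index of the 1 in every row. The small `onehot` array and the name `decoded_text` are constructed here only so that the block runs on its own.

```python
import numpy as np

alphabet = 'abcdefghijklmnopqrstuvwxyz '
int_to_char = dict(enumerate(alphabet))

# Rebuild a small one-hot matrix for 'hi' so the block is self-contained.
onehot = np.zeros((2, len(alphabet)), dtype=int)
onehot[0, alphabet.index('h')] = 1
onehot[1, alphabet.index('i')] = 1

# argmax along axis 1 gives the position of the 1 in each row.
indices = np.argmax(onehot, axis=1)
decoded_text = ''.join(int_to_char[int(i)] for i in indices)
print(decoded_text)  # hi
```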

Part 02a -- One-hot encoding using scikit-learn:

Importing libraries:


In [0]:
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

Define the example:


In [12]:
data = ['cold', 
        'cold', 
        'warm', 
        'cold', 
        'hot', 
        'hot', 
        'warm', 
        'cold', 
        'warm', 
        'hot']
values = array(data)
print(values)


['cold' 'cold' 'warm' 'cold' 'hot' 'hot' 'warm' 'cold' 'warm' 'hot']

Integer encoding:


In [13]:
label_encoder = LabelEncoder()
label_encoded = label_encoder.fit_transform(values)
print(label_encoded)


[0 0 2 0 1 1 2 0 2 1]
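
`LabelEncoder` assigns integers according to the sorted order of the distinct labels, which is why `cold`, `hot`, and `warm` map to 0, 1, and 2 here. As a rough illustration (not scikit-learn's actual implementation), the same codes can be reproduced with `np.unique`. The names `classes` and `codes` are chosen for this sketch.

```python
import numpy as np

values = np.array(['cold', 'cold', 'warm', 'cold', 'hot',
                   'hot', 'warm', 'cold', 'warm', 'hot'])

# np.unique returns the sorted distinct labels; return_inverse gives,
# for each element, its index into that sorted array.
classes, codes = np.unique(values, return_inverse=True)

print(classes)  # ['cold' 'hot' 'warm']
print(codes)    # [0 0 2 0 1 1 2 0 2 1]
```

Indexing `classes` with an integer code recovers the original label, which is the role `inverse_transform` plays in the cells that follow.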

One-hot (binary) encoding:


In [14]:
# Note: scikit-learn 1.2 and later rename the `sparse` argument to `sparse_output`.
onehot_encoder = OneHotEncoder(sparse=False)
label_encoded = label_encoded.reshape(len(label_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(label_encoded)
print(onehot_encoded)


[[1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]

Invert the first example:


In [15]:
inverted = label_encoder.inverse_transform([argmax(onehot_encoded[0, :])])



Output the decoded example:


In [16]:
print(inverted)


['cold']

Part 02b -- One-hot encoding using Keras:

Importing libraries:


In [17]:
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from keras.utils import to_categorical


Using TensorFlow backend.

Define the variable:


In [18]:
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
values = array(data)
print(values)


['cold' 'cold' 'warm' 'cold' 'hot' 'hot' 'warm' 'cold' 'warm' 'hot']

Integer encoding:


In [0]:
label_encoder = LabelEncoder()
label_encoded = label_encoder.fit_transform(values)

In [20]:
print(label_encoded)


[0 0 2 0 1 1 2 0 2 1]

In [21]:
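# one-hot encode the integer labels
encoded = to_categorical(label_encoded)
print(encoded)
# invert the encoding of the first example;
# inverse_transform expects a 1-D array, so wrap the index in a list
first_label = argmax(encoded[0])
inverted = label_encoder.inverse_transform([first_label])[0]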


[[1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]

In [22]:
print(inverted)


cold

Part 02c -- One-hot encoding using Keras for numerical categories:


In [23]:
from numpy import array
from numpy import argmax
from keras.utils import to_categorical
# define example
data = [1, 3, 2, 0, 3, 2, 2, 1, 0, 1]
data = array(data)
print(data)
# one hot encode
encoded = to_categorical(data)
print(encoded)
# invert encoding
inverted = argmax(encoded[0])


[1 3 2 0 3 2 2 1 0 1]
[[0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]]

In [24]:
print(inverted)


1
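
By default `to_categorical` infers the number of columns from the largest value in the input, so an array that happens not to contain the highest category produces a narrower matrix. Its `num_classes` argument fixes the width. The NumPy sketch below mimics that behaviour without requiring Keras. The helper `one_hot` is hypothetical and is not the Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """Hypothetical stand-in for to_categorical on 1-D non-negative integer labels."""
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        # Infer the width from the largest label present in the input.
        num_classes = int(labels.max()) + 1
    return np.eye(num_classes, dtype='float32')[labels]

batch = [1, 0, 2]                            # category 3 is absent from this batch
print(one_hot(batch).shape)                  # (3, 3)
print(one_hot(batch, num_classes=4).shape)   # (3, 4)
```

Passing the width explicitly keeps the encoded shape consistent across batches or data splits that do not each contain every category.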