Integrating normalization with tokenization


We define two functions, one to tokenize and one to normalize. We tokenize the input and then normalize each token. This first example only illustrates the general program logic. If you try to run it, Python raises an error because the function bodies haven't been written yet.

In [ ]:
def tokenize(input):

def normalize(input):

sample = "Hello, Mom!"
tokens = tokenize(sample)
# print(tokens)
normalized = [normalize(token) for token in tokens]
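
For readers who want to run the skeleton before choosing real strategies, here is a minimal sketch with placeholder bodies. The placeholders are assumptions made for illustration only (the tokenizer returns the whole input as one token and the normalizer returns its argument unchanged), and the parameter names `text` and `token` are ours.

```python
def tokenize(text):
    # Placeholder (assumption): treat the whole input as a single token
    return [text]

def normalize(token):
    # Placeholder (assumption): return the token unchanged
    return token

sample = "Hello, Mom!"
tokens = tokenize(sample)
normalized = [normalize(token) for token in tokens]
print(normalized)
# ['Hello, Mom!']
```

The later cells keep this same two-step shape and only swap in different function bodies.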


Tokenize on white space and normalize as lower case

In [ ]:
def tokenize(input): # tokenize on white space
    return input.split()

def normalize(input): # normalize as lower case
    return (input, input.lower())

sample = "Hello, Mom!"
tokens = tokenize(sample)
# print(tokens)
normalized = [normalize(token) for token in tokens]
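
To see what this cell produces, the following sketch repeats the same logic and prints each (original, normalized) pair. The parameter names `text` and `token` are ours; the behavior is the same as in the cell above.

```python
def tokenize(text):
    # Split on runs of whitespace
    return text.split()

def normalize(token):
    # Pair the original token with its lowercase form
    return (token, token.lower())

sample = "Hello, Mom!"
normalized = [normalize(token) for token in tokenize(sample)]
for original, lowered in normalized:
    print(original, "->", lowered)
# Hello, -> hello,
# Mom! -> mom!
```

Note that the punctuation stays attached to the words, because splitting on whitespace does not separate it from the adjacent word.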

Use NLTK word tokenization and normalize as POS

In [ ]:
import nltk

def tokenize(input): # use NLTK word tokenization
    return nltk.word_tokenize(input)

def normalize(input): # normalize as POS
    pos = nltk.pos_tag([input]) # pos_tag expects a list of tokens, so wrap the single word in a list
    return pos

sample = "Hello, Mom!"
tokens = tokenize(sample)
# print(tokens)
normalized = [normalize(token) for token in tokens]

Use NLTK word tokenization and strip vowels and punctuation to normalize

In [ ]:
import re

def tokenize(input): # use NLTK word tokenization
    return nltk.word_tokenize(input)

def normalize(input): # strip vowels and punctuation
    return (input, re.sub(r'[AEIOUaeiou]|[^\w\s]', '', input))

sample = "Hello, Mom!"
tokens = tokenize(sample)
# print(tokens)
normalized = [normalize(token) for token in tokens]
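
The effect of stripping vowels and punctuation can be checked without NLTK. In the sketch below, the regular-expression tokenizer is only a rough stand-in for `nltk.word_tokenize` (an assumption for illustration, not NLTK's algorithm), and the final filtering step is an optional addition of ours.

```python
import re

def tokenize(text):
    # Rough stand-in for nltk.word_tokenize (assumption):
    # runs of word characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

def normalize(token):
    # Remove ASCII vowels and any character that is neither
    # a word character nor whitespace
    return (token, re.sub(r"[AEIOUaeiou]|[^\w\s]", "", token))

sample = "Hello, Mom!"
tokens = tokenize(sample)
normalized = [normalize(token) for token in tokens]
print(normalized)
# [('Hello', 'Hll'), (',', ''), ('Mom', 'Mm'), ('!', '')]

# Optionally drop tokens that normalize to the empty string
kept = [pair for pair in normalized if pair[1]]
print(kept)
# [('Hello', 'Hll'), ('Mom', 'Mm')]
```

Punctuation tokens normalize to empty strings, so a later step may want to filter them out, as the last two lines do.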