Title: Tokenize Text
Slug: tokenize_text
Summary: How to tokenize unstructured text data for machine learning in Python.
Date: 2016-09-08 12:00
Category: Machine Learning
Tags: Preprocessing, Text
Authors: Chris Albon

Preliminaries


In [1]:
# Load libraries
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the Punkt tokenizer models required by word_tokenize and sent_tokenize
nltk.download('punkt')

Create Text Data


In [9]:
# Create text
string = "The science of today is the technology of tomorrow. Tomorrow is today."

Tokenize Words


In [10]:
# Tokenize words
word_tokenize(string)


Out[10]:
['The',
 'science',
 'of',
 'today',
 'is',
 'the',
 'technology',
 'of',
 'tomorrow',
 '.',
 'Tomorrow',
 'is',
 'today',
 '.']
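Notice that word_tokenize treats punctuation marks as their own tokens. A common follow-up step (a minimal sketch in plain Python, not part of the original recipe) is to filter the token list down to alphabetic tokens only:

```python
# Tokens produced by word_tokenize above
tokens = ['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of',
          'tomorrow', '.', 'Tomorrow', 'is', 'today', '.']

# Keep only alphabetic tokens, dropping punctuation such as '.'
words = [token for token in tokens if token.isalpha()]
```

This leaves just the twelve word tokens, which is often what you want before building features like word counts.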

Tokenize Sentences


In [11]:
# Tokenize sentences
sent_tokenize(string)


Out[11]:
['The science of today is the technology of tomorrow.', 'Tomorrow is today.']
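If NLTK is unavailable, a rough dependency-free alternative (a sketch, and deliberately simplistic: unlike sent_tokenize it will split incorrectly on abbreviations like "Dr.") is to split on sentence-ending punctuation with a regular expression:

```python
import re

string = "The science of today is the technology of tomorrow. Tomorrow is today."

# Split wherever a '.', '!', or '?' is followed by whitespace; the lookbehind
# keeps the punctuation attached to the preceding sentence
sentences = re.split(r'(?<=[.!?])\s+', string)
```

For real text, sent_tokenize's trained Punkt model is the safer choice.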