Title: Tokenize Text
Slug: tokenize_text
Summary: How to tokenize unstructured text data for machine learning in Python.
Date: 2016-09-08 12:00
Category: Machine Learning
Tags: Preprocessing, Text
Authors: Chris Albon

Preliminaries


In [1]:
# Load libraries
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the Punkt tokenizer models required by word_tokenize and sent_tokenize
nltk.download('punkt')

Create Text Data


In [9]:
# Create text
string = "The science of today is the technology of tomorrow. Tomorrow is today."

Tokenize Words


In [10]:
# Tokenize words
word_tokenize(string)


Out[10]:
['The',
 'science',
 'of',
 'today',
 'is',
 'the',
 'technology',
 'of',
 'tomorrow',
 '.',
 'Tomorrow',
 'is',
 'today',
 '.']
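Notice that word_tokenize treats punctuation marks as their own tokens. A common follow-up step (a minimal sketch in plain Python, not part of the original recipe) is to filter the token list down to alphabetic tokens only:

```python
# Tokens produced by word_tokenize above
tokens = ['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of',
          'tomorrow', '.', 'Tomorrow', 'is', 'today', '.']

# Keep only alphabetic tokens, dropping punctuation such as '.'
words = [token for token in tokens if token.isalpha()]
```

This leaves just the twelve word tokens, which is often what you want before building features like word counts.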

Tokenize Sentences


In [11]:
# Tokenize sentences
sent_tokenize(string)


Out[11]:
['The science of today is the technology of tomorrow.', 'Tomorrow is today.']
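If NLTK is unavailable, a rough dependency-free alternative (a sketch, and deliberately simplistic: unlike sent_tokenize it will split incorrectly on abbreviations like "Dr.") is to split on sentence-ending punctuation with a regular expression:

```python
import re

string = "The science of today is the technology of tomorrow. Tomorrow is today."

# Split wherever a '.', '!', or '?' is followed by whitespace; the lookbehind
# keeps the punctuation attached to the preceding sentence
sentences = re.split(r'(?<=[.!?])\s+', string)
```

For real text, sent_tokenize's trained Punkt model is the safer choice.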