Title: Stemming Words
Slug: stemming_words
Summary: How to stem words in unstructured text data for machine learning in Python.
Date: 2016-09-09 12:00
Category: Machine Learning
Tags: Preprocessing Text

Authors: Chris Albon

Preliminaries


In [1]:
# Load library
from nltk.stem.porter import PorterStemmer

Create Text Data


In [2]:
# Create word tokens
tokenized_words = ['i', 'am', 'humbled', 'by', 'this', 'traditional', 'meeting']

Stem Words

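Stemming reduces a word to its stem by identifying and removing affixes (e.g. the -ing suffix of a gerund) while keeping the root meaning of the word. The resulting stem is not always a dictionary word: 'humbled' becomes 'humbl', for example. NLTK's PorterStemmer implements the widely used Porter stemming algorithm.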


In [3]:
# Create stemmer
porter = PorterStemmer()

# Apply stemmer
[porter.stem(word) for word in tokenized_words]


Out[3]:
['i', 'am', 'humbl', 'by', 'thi', 'tradit', 'meet']
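
To make the idea of affix removal more concrete, here is a minimal sketch of a toy suffix stripper. It is not the Porter algorithm: the `naive_stem` helper, the `SUFFIXES` tuple, and the `min_stem_length` parameter are hypothetical names made up for this illustration, and the rule set is deliberately tiny.

```python
# Toy suffix stripper for illustration only (NOT the Porter algorithm).
# SUFFIXES, naive_stem, and min_stem_length are hypothetical names.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


# Same word tokens as above
tokenized_words = ['i', 'am', 'humbled', 'by', 'this', 'traditional', 'meeting']

# Apply the toy stemmer
naive_stems = [naive_stem(word) for word in tokenized_words]
print(naive_stems)
# ['i', 'am', 'humbl', 'by', 'thi', 'traditional', 'meet']
```

The toy version leaves 'traditional' untouched because it only knows three suffixes, whereas the Porter stemmer applies a longer series of rewrite rules and reduces it to 'tradit'.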