Title: Tag Parts Of Speech
Slug: tag_parts_of_speech
Summary: How to tag parts of speech in unstructured text data for machine learning in Python.
Date: 2016-09-09 12:00
Category: Machine Learning
Tags: Preprocessing Text

Authors: Chris Albon

Preliminaries


In [1]:
# Load libraries
from nltk import pos_tag
from nltk import word_tokenize

Create Text Data


In [2]:
# Create text
text_data = "Chris loved outdoor running"

Tag Parts Of Speech


In [3]:
# Use pre-trained part of speech tagger
text_tagged = pos_tag(word_tokenize(text_data))

# Show parts of speech
text_tagged


Out[3]:
[('Chris', 'NNP'), ('loved', 'VBD'), ('outdoor', 'RP'), ('running', 'VBG')]
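
Each element of the result is a `(word, tag)` tuple, so ordinary Python unpacking is enough to work with it. The following is a minimal sketch, not part of the original notebook: it hard-codes the output shown above so it runs without NLTK, and the variable names `words`, `tags`, and `verbs` are illustrative assumptions.

```python
# Tagged output copied from the cell above, hard-coded so this runs without NLTK
text_tagged = [('Chris', 'NNP'), ('loved', 'VBD'), ('outdoor', 'RP'), ('running', 'VBG')]

# Split the list of tuples into parallel tuples of words and tags
words, tags = zip(*text_tagged)

# Keep only the words whose tag starts with "VB" (any verb form)
verbs = [word for word, tag in text_tagged if tag.startswith('VB')]

print(words)
print(tags)
print(verbs)
```

Matching on the `VB` prefix rather than a single tag is a design choice that catches both the past tense (`VBD`) and the gerund (`VBG`) in one pass.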

Common Penn Treebank Parts Of Speech Tags

The output is a list of tuples, each containing a word and its part-of-speech tag. NLTK uses the Penn Treebank part-of-speech tags.

| Tag | Part Of Speech |
|-----|----------------|
| NNP | Proper noun, singular |
| NN  | Noun, singular or mass |
| RB  | Adverb |
| VBD | Verb, past tense |
| VBG | Verb, gerund or present participle |
| JJ  | Adjective |
| PRP | Personal pronoun |
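
One way to make tagged output easier to read is to attach these descriptions to each word. The sketch below is an illustration under stated assumptions: the dictionary `TAG_DESCRIPTIONS` and the helper `describe_tags` are hypothetical names, not part of NLTK, and the dictionary covers only the tags listed in the table, so any other tag falls back to a placeholder string.

```python
# Descriptions for the tags listed in the table above (not the full Penn Treebank set)
TAG_DESCRIPTIONS = {
    'NNP': 'Proper noun, singular',
    'NN': 'Noun, singular or mass',
    'RB': 'Adverb',
    'VBD': 'Verb, past tense',
    'VBG': 'Verb, gerund or present participle',
    'JJ': 'Adjective',
    'PRP': 'Personal pronoun',
}


def describe_tags(tagged, descriptions=TAG_DESCRIPTIONS):
    """Return (word, tag, description) triples for a list of (word, tag) tuples.

    Hypothetical helper: tags missing from `descriptions` get a placeholder.
    """
    return [
        (word, tag, descriptions.get(tag, 'not in table'))
        for word, tag in tagged
    ]


# Tagged output from the example above, hard-coded so this runs without NLTK
text_tagged = [('Chris', 'NNP'), ('loved', 'VBD'), ('outdoor', 'RP'), ('running', 'VBG')]

described = describe_tags(text_tagged)
for word, tag, description in described:
    print(word, tag, description)
```

Using `dict.get` with a default keeps the helper from raising a `KeyError` on tags outside the table, such as the `RP` assigned to "outdoor" in the example.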