Introduction to Natural Language Processing (NLP) using Python's NLTK

One of the most frequent tasks in computational text analysis is quickly summarizing the content of a text. In this lesson we will learn how to summarize text by counting frequent words using Python's nltk. In the process we'll learn a different way to count the words in a text, and we'll cover some common pre-processing steps.

Natural Language Processing is an umbrella term for the many techniques and methods used to process, analyze, and understand natural languages (as opposed to artificial languages like formal logic or Python).

Learning Goals:

The goal of this lesson is to jump right into text analysis and natural language processing. This lesson will demonstrate some neat things you can do with a minimal amount of coding.

In this tutorial you will:

  • Get an introduction to the Python package NLTK, what functionality it offers, and learn some important NLP terms
  • Learn how to do a variety of forms of counting using NLTK, and think about why these might help researchers analyze text
  • Learn about pre-processing steps and why you might or might not do them

Lesson Outline:

  • 0. Assigning Text as a Variable in Python
  • 1. Tokenizing Text and Type-Token Ratios
  • 2. Most Frequent Words
  • 3. Pre-Processing: Lower Case, Removing Stop Words

Key Terms:

  • stop words:
    • The most common words in a language.
  • tokenization:
    • Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
  • token:
    • A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.
  • type:
    • A type is the class of all tokens containing the same character sequence. (A short example follows this list.)
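
To make the type/token distinction concrete, here is a quick illustration (a toy example of our own, not drawn from the lesson text):


In [ ]:
#a toy example: this string contains six tokens but only five types, because "the" repeats
toy_tokens = "the cat sat on the mat".split()

#number of tokens
print(len(toy_tokens))

#number of types (unique tokens)
print(len(set(toy_tokens)))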

Further Resources:

Check out the full range of techniques included in Python's nltk package here: http://www.nltk.org/book/

0. Assigning Text as a Variable in Python

First, we assign a sample sentence, our "text", to a variable called "sentence".

Note: This sentence is a quote about what digital humanities means, from digital humanist Kathleen Fitzpatrick. Source: "On Scholarly Communication and the Digital Humanities: An Interview with Kathleen Fitzpatrick", In the Library with the Lead Pipe


In [ ]:
#assign the desired sentence to the variable called 'sentence.'
sentence = "For me it has to do with the work that gets done at the crossroads of \
digital media and traditional humanistic study. And that happens in two different ways. \
On the one hand, it's bringing the tools and techniques of digital media to bear \
on traditional humanistic questions; on the other, it's also bringing humanistic modes \
of inquiry to bear on digital media."

#print the content
print(sentence)

1. Tokenizing Text and Type-Token Ratios

Reminder: computers are completely naive. The computer has no idea that the variable "sentence" contains natural language, that it is composed of words (it doesn't even know what a word is), that it is a sentence, or that punctuation is different from a letter. Everything is 0s and 1s to a computer. We have to tell it everything else.

We have been splitting text on white spaces to approximate words, with each "word" as an element of a list.

Question: Why is this a problem?

Computational linguists have done a lot of work figuring out better ways to split a text into meaningful chunks that mimic natural languages. Chunking the text into meaningful bits is referred to as tokenizing text, and produces a list of tokens.


In [ ]:
#First import the Python package nltk (Natural Language Tool Kit)
import nltk

#The difficult work is done for us.
# We simply import the function, "word_tokenize", which splits the text into tokens
from nltk import word_tokenize

#create new variable that applies the word_tokenize function to our sentence.
sentence_tokens = word_tokenize(sentence)

#This new variable contains the tokenized text. What datatype is the variable "sentence_tokens"?
print(sentence_tokens)

Notice that each token is either a word or a punctuation mark. Notice also that tokenizing is not a trivial task, nor does it completely solve the problem of splitting text into meaningful units.

For example, one of these tokens is ["'s"], which stands for "is" in the contraction "it is". How would we distinguish this from ["'s"] as a possessive? Should we instead tokenize "it's" to ['it', 'is']? But then we would erase the contraction, which may be important later. There are many questions like this. The problem with natural languages is that they are ambiguous and never strictly follow a set of rules.
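
To see how word_tokenize actually handles these cases, we can try it on a short made-up phrase (an illustrative example of our own; the exact output depends on the tokenizer NLTK uses under the hood):


In [ ]:
#illustrative check: the contraction "It's" and the possessive "Fitzpatrick's"
#both produce the token "'s", so the tokenizer alone does not disambiguate them
print(word_tokenize("It's Kathleen Fitzpatrick's interview."))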

Assuming we're happy with the tokenization, why is it helpful?

We can now summarize the sentence/text in interesting and potentially helpful ways. For example, we can count the number of tokens in the sentence.


In [ ]:
#The number of tokens is the length of the list, or the number of elements in the list
print(len(sentence_tokens))

Question: Is this the same as the number of words? Why?

We might want to remove punctuation, so we can better identify the number of words and do further analyses. We can remove punctuation using list comprehension, and by importing a new package, string, which provides a string of common punctuation characters. This string is accessed as string.punctuation.


In [ ]:
import string
print(string.punctuation)

#assign the string of common punctuation symbols to a variable and turn it into a list
punctuations = list(string.punctuation)

#see what punctuation is included
print(punctuations)

Now we can remove punctuation using list comprehension.


In [ ]:
sentence_tokens_clean = [word for word in sentence_tokens if word not in punctuations]
print(sentence_tokens_clean)

We can also calculate the type-token ratio (TTR). We know what a token is. But many tokens are repeated in a text. For example, in this sentence, the token "the" appears 5 times. "The" is a type. The 5 "the"s in the sentence are tokens. The TTR is simply the number of types divided by the number of tokens. A high TTR indicates a large amount of lexical variation or lexical diversity and a low TTR indicates relatively little lexical variation. The type-token ratio of speech, for example, is less than that of written language. What might we expect of the two novels we have been analyzing?


In [ ]:
set(sentence_tokens_clean)

Question: What do you notice about this output? What datatype is it?


In [ ]:
##EX: Print the type-token ratio for this sentence
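
One possible way to work through this exercise (a sketch, not the only correct approach; any method that divides the number of types by the number of tokens will do):


In [ ]:
#a possible solution sketch: types are the unique tokens, which set() gives us
num_tokens = len(sentence_tokens_clean)
num_types = len(set(sentence_tokens_clean))
print(num_types / num_tokens)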

2. Most Frequent Words

We are often also interested in the most frequent words, which can help us quickly summarize a text. NLTK has a built-in function to do this!


In [ ]:
#apply the nltk function FreqDist to count the number of times each token occurs.
word_frequency = nltk.FreqDist(sentence_tokens)

#this creates an NLTK object
print(word_frequency)

In [ ]:
#print out the 10 most frequent words using the function most_common
print(word_frequency.most_common(10))

The most frequent words do suggest what the sentence is about, in particular the words "humanistic", "digital", "media", and "traditional".

But there are many frequent words that are not helpful in summarizing the text, for example "the", "and", "to", and ".". So the most frequent words do not necessarily help us understand the content of a text.

How can we use a computer to identify important, interesting, or content words in a text? There are many ways to do this. Today, we'll learn one simple way to identify words that will help us summarize the content of a text. We'll explore a different method later.

3. Pre-Processing: Lower Case, Removing Stop Words

First, scholars typically go through a number of pre-processing steps before getting to the actual analysis. We have already removed punctuation, which is a common pre-processing step. Another common step is converting all words to lower-case, so that the word "Humanities" and "humanities" count as the same word. (For some tasks this is appropriate. Think of reasons why we might NOT want to do this.)

To convert to lower case we use the function lower()


In [ ]:
sentence_tokens_clean_lc = [word.lower() for word in sentence_tokens_clean]

#see the result
print(sentence_tokens_clean_lc)

In [ ]:
#Ex: Calculate the type-token ratio for the sentence after you have changed everything to lowercase

Words like "the", "to", and "and" are what text analysts call "stop words." Stop words are the most common words in a language, and while necessary and useful for some analysis purposes, they do not tell us much about the substance of a text. Another common pre-processing step is to simply remove punctuation and stop words. NLTK contains a built-in stop words list, which we use to remove stop words from our list of tokens.


In [ ]:
#import the stopwords list
from nltk.corpus import stopwords

#take a look at what stop words are included:
print(stopwords.words('english'))

Exercise


In [ ]:
####Ex: create a new variable that contains the sentence tokens without the stopwords
#Ex: Count the most frequent 10 words on this new variable and print out the results
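
One way the solution might look (a sketch; the variable names below are our own choices, not required by the exercise):


In [ ]:
#a possible solution sketch: filter stop words out of the lowercased, punctuation-free tokens
english_stopwords = stopwords.words('english')
sentence_tokens_clean_lc_ns = [word for word in sentence_tokens_clean_lc if word not in english_stopwords]

#count and print the 10 most frequent remaining words
word_frequency_clean = nltk.FreqDist(sentence_tokens_clean_lc_ns)
print(word_frequency_clean.most_common(10))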

Better! The 10 most frequent words now give us a pretty good sense of the substance of this sentence. But this won't always be perfect. One solution is to keep adding stop words to our stop word list, but this could go on forever and is not a good solution when processing lots of text.

When we calculate TTR, would we want to do this before or after the pre-processing steps? Why?

We will go over another way of identifying content words next time, and it involves identifying the part of speech of each word.


In [ ]:
####Ex: Calculate the TTR for two of the novels in our data folder. Print the most frequent words for these two novels.
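
A sketch of how this might look for one novel (the file path below is a hypothetical placeholder; substitute the actual paths of the novels in your data folder, and repeat for the second novel):


In [ ]:
#NOTE: 'data/novel1.txt' is a placeholder path, not a file provided by this lesson
with open('data/novel1.txt', 'r', encoding='utf-8') as f:
    novel_text = f.read()

#tokenize, remove punctuation, and lowercase, as we did for the sample sentence
novel_tokens = [word.lower() for word in word_tokenize(novel_text) if word not in punctuations]

#type-token ratio
print(len(set(novel_tokens)) / len(novel_tokens))

#most frequent words after removing stop words
english_stopwords = stopwords.words('english')
novel_tokens_ns = [word for word in novel_tokens if word not in english_stopwords]
print(nltk.FreqDist(novel_tokens_ns).most_common(10))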