Generally speaking, Computational Text Analysis is a set of interpretive methods that seek to understand patterns in human discourse, in part through statistics. More familiar methods, such as close reading, are exceptionally well-suited to the analysis of individual texts, but our research questions typically compel us to look for relationships across texts, sometimes numbering in the thousands or even millions. We have to zoom out in order to perform so-called distant reading. Fortunately for us, computers are well-suited to identifying the kinds of textual relationships that exist at scale.
We will spend the week exploring research questions that computational methods can help to answer and thinking about how these complement -- rather than displace -- other interpretive methods. Before moving to that conceptual level, however, we will familiarize ourselves with the basic tools of the trade.
Natural Language Processing (NLP) is an umbrella term for the methods by which a computer handles human-language text. This includes transforming the text into a numerical form that the computer manipulates natively, as well as the measurements that researchers often perform on it. In this parlance, a natural language is one spoken by humans, as opposed to a formal language, such as Python, which comprises a set of logical operations.
The goal of this lesson is to jump right into text analysis and natural language processing. Rather than starting with the nitty-gritty of programming in Python, this lesson will demonstrate some neat things you can do with a minimal amount of coding. Today, we aim to build intuition about how computers read human text and to learn some of the basic operations we'll perform on that text.
Check out the full range of techniques included in Python's nltk package here: http://www.nltk.org/book/
In [ ]:
print("For me it has to do with the work that gets done at the crossroads of digital media and traditional humanistic study. And that happens in two different ways. On the one hand, it's bringing the tools and techniques of digital media to bear on traditional humanistic questions; on the other, it's also bringing humanistic modes of inquiry to bear on digital media.")
In [ ]:
# Assign the quote to a variable, so we can refer back to it later
# We get to make up the name of our variable, so let's give it a descriptive label: "sentence"
sentence = "For me it has to do with the work that gets done at the crossroads of digital media and traditional humanistic study. And that happens in two different ways. On the one hand, it's bringing the tools and techniques of digital media to bear on traditional humanistic questions; on the other, it's also bringing humanistic modes of inquiry to bear on digital media."
In [ ]:
# Oh, also: anything on a line starting with a hashtag is called a comment,
# and is meant to clarify code for human readers. The computer ignores these lines.
In [ ]:
# Print the contents of the variable 'sentence'
print(sentence)
The above output is how a human would read that sentence. Next we will look at the main way in which a computer "reads", or parses, that sentence.
The first step is typically to tokenize it, or to break it into a series of tokens. Each token roughly corresponds to either a word or a punctuation mark. These smaller units are more straightforward for the computer to handle for tasks like counting.
In [ ]:
# Import the NLTK (Natural Language Tool Kit) package
import nltk
In [ ]:
# Tokenize our sentence!
nltk.word_tokenize(sentence)
In [ ]:
# Create new variable that contains our tokenized sentence
sentence_tokens = nltk.word_tokenize(sentence)
In [ ]:
# Inspect our new variable
# Note the square braces at the beginning and end that indicate we are looking at a list-type object
print(sentence_tokens)
While seemingly simple, tokenization is a non-trivial task.
For example, notice how the tokenizer has handled contractions: a contracted word is divided into two separate tokens! What do you think is the motivation for this? How else might you tokenize them?
Also notice that each token is either a word or a punctuation mark. In practice, it is sometimes useful to remove punctuation marks and at other times to keep them, depending on the task at hand.
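For comparison, here is a quick sketch of one naive alternative: splitting the sentence on white space alone, which leaves contractions whole and punctuation attached to neighboring words.
In [ ]:
# A naive alternative: split on white space only
# Notice that punctuation stays attached ("study.") and contractions stay whole ("it's")
print(sentence.split())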
In the coming days, we will see other tokenizers and have opportunities to explore their reasoning. For now, we will look at a few examples of NLP tasks that tokenization enables.
In [ ]:
# How many tokens are in our list?
len(sentence_tokens)
In [ ]:
# How often does each token appear in our list?
import collections
collections.Counter(sentence_tokens)
In [ ]:
# Assign those token counts to a variable
token_frequency = collections.Counter(sentence_tokens)
In [ ]:
# Get an ordered list of the most frequent tokens
token_frequency.most_common(10)
Some of the most frequent words appear to summarize the sentence: in particular, the words "humanistic", "digital", and "media". However, most of these terms just add noise to the summary: "the", "it", "to", ".", etc.
There are many strategies for identifying the most important words in a text, and we will cover the most popular ones over the next week. Today, we will look at two of them. In the first, we will simply remove the noisy tokens. In the second, we will identify important words using their parts of speech.
Typically, a text goes through a number of pre-processing steps before the actual analysis begins. We have already seen the tokenization step. Pre-processing also commonly includes transforming tokens to lower case and removing stop words and punctuation marks.
Again, pre-processing is a non-trivial process that can have large impacts on the analysis that follows. For instance, what will be the most common token in our example sentence, once we set all tokens to lower case?
In [ ]:
# Let's revisit our original sentence
sentence
In [ ]:
# And now transform it to lower case, all at once
sentence.lower()
In [ ]:
# Okay, let's set our list of tokens to lower case, one at a time
# The syntax of the line below is tricky. Don't worry about it for now.
# We'll spend plenty of time on it tomorrow!
lower_case_tokens = [ word.lower() for word in sentence_tokens ]
In [ ]:
# Inspect
print(lower_case_tokens)
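This also lets us answer the question posed above: which token is most common once everything is lower case? A quick tally, reusing the Counter we imported earlier, will tell us.
In [ ]:
# Which token is most common now that everything is lower case?
collections.Counter(lower_case_tokens).most_common(3)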
In [ ]:
# Import the stopwords list
from nltk.corpus import stopwords
In [ ]:
# Take a look at what stop words are included
print(stopwords.words('english'))
In [ ]:
# Try another language
print(stopwords.words('spanish'))
In [ ]:
# Create a new variable that contains the sentence tokens but NOT the stopwords
tokens_nostops = [ word for word in lower_case_tokens if word not in stopwords.words('english') ]
In [ ]:
# Inspect
print(tokens_nostops)
In [ ]:
# Import a list of punctuation marks
import string
In [ ]:
# Inspect
string.punctuation
In [ ]:
# Remove punctuation marks from token list
tokens_clean = [word for word in tokens_nostops if word not in string.punctuation]
In [ ]:
# See what's left
print(tokens_clean)
In [ ]:
# Count the new token list
word_frequency_clean = collections.Counter(tokens_clean)
In [ ]:
# Most common words
word_frequency_clean.most_common(10)
Better! The ten most frequent words now give us a pretty good sense of the substance of this sentence. But we still have problems. For example, the token "'s" sneaked in there. One fix is to keep adding stop words to our list, but this could go on forever and does not scale well when processing lots of text.
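Just to see what that would look like, here is a minimal sketch that extends the standard stopword list with a couple of extra tokens (the additions here are only examples).
In [ ]:
# Extend the standard stopword list with a few extra tokens of our own
# (the extra entries here are just illustrative examples)
custom_stops = stopwords.words('english') + ["'s", "n't"]
tokens_custom = [word for word in tokens_clean if word not in custom_stops]
collections.Counter(tokens_custom).most_common(10)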
There's another way of identifying content words, and it involves identifying the part of speech of each word.
You may have noticed that stop words are typically short function words, like conjunctions and prepositions. Intuitively, if we could identify the part of speech of each word, we would have another way of identifying which words contribute to the text's subject matter. NLTK can do that too!
NLTK has a POS tagger, which identifies and labels the part of speech (POS) of every token in a text. The particular labels that NLTK uses come from the Penn Treebank corpus, a major resource in corpus linguistics.
You can find a list of all Penn POS tags here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
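You can also look up what a given tag means from within NLTK itself; note that this assumes the 'tagsets' resource has been downloaded (for example, by running nltk.download('tagsets')).
In [ ]:
# Look up the meaning of a Penn Treebank tag from within NLTK
# (assumes the 'tagsets' resource is available, e.g. via nltk.download('tagsets'))
nltk.help.upenn_tagset('JJ')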
Note that, from this point on, the code is going to get a little more complex. Don't worry about the particularities of each line. For now, we will focus on the NLP tasks themselves and the textual patterns they identify.
In [ ]:
# Let's revisit our original list of tokens
print(sentence_tokens)
In [ ]:
# Use the NLTK POS tagger
nltk.pos_tag(sentence_tokens)
In [ ]:
# Assign POS-tagged list to a variable
tagged_tokens = nltk.pos_tag(sentence_tokens)
In [ ]:
# We'll tread lightly here, and just say that we're counting POS tags
tag_frequency = collections.Counter( [ tag for (word, tag) in tagged_tokens ])
In [ ]:
# POS Tags sorted by frequency
tag_frequency.most_common()
The "IN" tag refers to prepositions, so it's no surprise that it should be the most common. However, we can see at a glance now that the sentence contains a lot of adjectives, "JJ". This feels like it tells us something about the rhetorical style or structure of the sentence: certain qualifiers seem to be important to the meaning of the sentence.
Let's dig in to see what those adjectives are.
In [ ]:
# Let's filter our list, so it only keeps adjectives
adjectives = [word for word,pos in tagged_tokens if pos == 'JJ' or pos=='JJR' or pos=='JJS']
In [ ]:
# Inspect
print( adjectives )
In [ ]:
# Tally the frequency of each adjective
adj_frequency = collections.Counter(adjectives)
In [ ]:
# Most frequent adjectives
adj_frequency.most_common(5)
In [ ]:
# Let's do the same for nouns.
nouns = [word for word,pos in tagged_tokens if pos=='NN' or pos=='NNS']
In [ ]:
# Inspect
print(nouns)
In [ ]:
# Tally the frequency of the nouns
noun_frequency = collections.Counter(nouns)
In [ ]:
# Most Frequent Nouns
print(noun_frequency.most_common(5))
And now verbs.
In [ ]:
# And we'll do the verbs in one fell swoop
verbs = [word for word,pos in tagged_tokens if pos == 'VB' or pos=='VBD' or pos=='VBG' or pos=='VBN' or pos=='VBP' or pos=='VBZ']
verb_frequency = collections.Counter(verbs)
print(verb_frequency.most_common(5))
In [ ]:
# If we bring all of this together we get a pretty good summary of the sentence
print(adj_frequency.most_common(3))
print(noun_frequency.most_common(3))
print(verb_frequency.most_common(3))
To illustrate this process on a slightly larger scale, we will do exactly what we did above, but on two unknown novels. Your challenge: guess the novels from their most frequent words.
We will do this in one chunk of code, so another challenge for you, during breaks or over the next few weeks, is to see how much of the following code you can follow (or, in computer science terms, how much of the code you can parse). If the answer is none, not to worry! Tomorrow we will take a step back and work on the nitty-gritty of programming.
In [ ]:
# Read the two text files from your hard drive
# Assign first mystery text to variable 'text1' and second to 'text2'
text1 = open('text1.txt').read()
text2 = open('text2.txt').read()
In [ ]:
# Tokenize both texts
text1_tokens = nltk.word_tokenize(text1)
text2_tokens = nltk.word_tokenize(text2)
In [ ]:
# Set to lower case
text1_tokens_lc = [word.lower() for word in text1_tokens]
text2_tokens_lc = [word.lower() for word in text2_tokens]
In [ ]:
# Remove stopwords
text1_tokens_nostops = [word for word in text1_tokens_lc if word not in stopwords.words('english')]
text2_tokens_nostops = [word for word in text2_tokens_lc if word not in stopwords.words('english')]
In [ ]:
# Remove punctuation using the list of punctuation marks from the string package
text1_tokens_clean = [word for word in text1_tokens_nostops if word not in string.punctuation]
text2_tokens_clean = [word for word in text2_tokens_nostops if word not in string.punctuation]
In [ ]:
# Frequency distribution
text1_word_frequency = collections.Counter(text1_tokens_clean)
text2_word_frequency = collections.Counter(text2_tokens_clean)
In [ ]:
# Guess the novel!
text1_word_frequency.most_common(20)
In [ ]:
# Guess the novel!
text2_word_frequency.most_common(20)
Computational Text Analysis is not simply the processing of texts through computers, but involves reflection on the part of human interpreters. How were you able to tell what each novel was? Do you notice any differences between each novel's list of frequent words?
The patterns that we notice in our computational model often enrich and extend our research questions -- sometimes in surprising ways! What next steps would you take to investigate these novels?
Tallying word frequencies gives us a bird's-eye view of our text, but we lose one important aspect: context. As the dictum goes: "You shall know a word by the company it keeps."
Concordances show us every occurrence of a given word in a text, along with a window of the words that appear before and after it. This is helpful for close reading: we can get at a word's meaning by seeing how it is used. We can also use the logic of shared context to identify which words have similar meanings. To illustrate this, we can compare the way the word "monstrous" is used in our two novels.
In [ ]:
# Transform our raw token lists into NLTK Text objects
text1_nltk = nltk.Text(text1_tokens)
text2_nltk = nltk.Text(text2_tokens)
In [ ]:
# Really they're no different from the raw token lists, but they come with additional useful functions
print(text1_nltk)
print(text2_nltk)
In [ ]:
# Like a concordancer!
text1_nltk.concordance("monstrous")
In [ ]:
text2_nltk.concordance("monstrous")
In [ ]:
# Get words that appear in a similar context to "monstrous"
text1_nltk.similar("monstrous")
In [ ]:
text2_nltk.similar("monstrous")
The methods we have looked at today are the bread and butter of NLP. Before moving on, take a moment to reflect on the model of textuality that these methods rely on. Human-language texts are split into tokens. Most often, these are transformed into simple tallies: "whale" appears 1083 times; "dashwood" appears 249 times. This does not resemble human reading at all! Yet in spite of that, such a list of frequent terms makes a useful summary of the text.
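If you want to check a tally like that yourself, note that a Counter can be looked up like a dictionary. Here is a quick sketch using the frequency counts we built above ("whale" is just an example token).
In [ ]:
# A Counter can be indexed like a dictionary to look up a single token's tally
print(text1_word_frequency['whale'])
print(text2_word_frequency['whale'])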
A few questions in closing: