Intro to low-level NLP - Tokenization, Stopwords, Frequencies, Bigrams

Lynn Cherny, arnicas@gmail


In [ ]:
import itertools
import nltk
import string

In [ ]:
nltk.data.path

In [ ]:
# add our local nltk_data directory to the places NLTK searches for data
nltk.data.path.append("../nltk_data")

In [ ]:
# or use ONLY the local nltk_data directory, replacing the search path entirely
nltk.data.path = ['../nltk_data']

Tokenization

Read in a file to use for practice. The directory is one level above us now, in data/books. You can add other files into the data directory if you want.


In [ ]:
ls ../data/books

In [ ]:
# the "U" here is for universal newline mode, because newlines on Mac are \r\n and on Windows are \n.

with open("../data/books/Austen_Emma.txt", "U") as handle:
    text = handle.read()

In [ ]:
text[0:120]

Before we go further, it might be worth saying that even the lines of a text can be interesting as a visual. Here are a couple of books where every line is a line of pixels, and we've applied a simple search in JS to show lines of dialogue in pink. (The entire analysis is done in the web file book_shape.html -- so it's a little slow to load.)

But usually you want to extract some sense of the content, which means crunching the text itself to get insights about the overall file.


In [ ]:
## if you don't want the newlines in there - replace them all.
text = text.replace('\n', ' ')

In [ ]:
text[0:120]

In [ ]:
## Breaking it up by sentence!  Can be very useful for vis :)
nltk.sent_tokenize(text)[0:10]

In [ ]:
tokens = nltk.word_tokenize(text)
tokens[70:85]  # Notice the punctuation:

In [ ]:
# Notice the difference here:
nltk.wordpunct_tokenize(text)[70:85]

There are other options for tokenization in NLTK. You can test some out here: http://text-processing.com/demo/tokenize/
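
For instance, here is a minimal sketch (not part of the workflow below) of two alternative tokenizers bundled with NLTK, just to show how much the choice of tokenizer changes the output:


In [ ]:
# RegexpTokenizer(r'\w+') keeps only runs of word characters, so punctuation disappears;
# WhitespaceTokenizer just splits on whitespace, so punctuation stays glued to the words.
from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer

sample = text[0:120]
RegexpTokenizer(r'\w+').tokenize(sample), WhitespaceTokenizer().tokenize(sample)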

Doing it in textkit at the command line:

Thanks to the work of Bocoup.com, we have a library that will do some simple text analysis at the command line, wrapping up some of the python functions I'll be showing you. The library is at https://github.com/learntextvis/textkit. Be aware it is under development! Also, some of these commands will be slower than running the code in the notebook.

When I say you can run these at the command line, I mean that in your terminal window you can type the command you see after the !. In a Jupyter notebook, the ! means the line is run as a shell command.

The | is a "pipe." This means take the output from the previous command and make it the input to the next command.


In [ ]:
# run text2words on this book file at this location, pipe the output to the unix "head" command, showing 20 lines
!textkit text2words ../data/books/Austen_Emma.txt | head -n20

In [ ]:
# Pipe the output through the lowercase textkit operation, before showing 20 lines again!
!textkit text2words ../data/books/Austen_Emma.txt | textkit lowercase | head -n20

What if, at this point, we made a word cloud? Let's say we strip out the punctuation and just count the words. I'll do it quickly just to show you... but we'll go a bit further.


In [ ]:
!textkit text2words ../data/books/Austen_Emma.txt | textkit filterpunc | textkit tokens2counts > ../outputdata/simple_emma_counts.csv

In [ ]:
!ls -al ../outputdata/simple_emma_counts.csv

Using the html file simple_wordcloud.html and this data file, we can see something basically useless. You don't have to do this yourself, but if you want to, edit the data path at the bottom of that file to point to ../outputdata/simple_emma_counts.csv.

StopWords

"Stopwords" are words that are usually excluded because they are common connectors (or determiners, or short verbs) that are not considered to carry meaning. BEWARE hidden stopword filtering in libraries you use and always check stopword lists to see if you agree with their contents!


In [ ]:
from nltk.corpus import stopwords
english_stops = stopwords.words('english')

In [ ]:
# Notice they are lowercase.  This means we need to be sure we lowercase our text if we want to match against them.

english_stops

In [ ]:
tokens = nltk.word_tokenize(text)
tokens[0:15]

In [ ]:
# this is how many tokens we have:
len(tokens)

We want to strip out stopwords - use a list comprehension. Notice you need to lowercase the words before you check for membership!


In [ ]:
# try this without .lower in the if-statement and check the size!
# We are using a python list comprehension to remove the tokens from Emma (after lowercasing them!) that are stopwords
tokens = [token.lower() for token in tokens if token.lower() not in english_stops]
len(tokens)

In [ ]:
# now look at the first 15 words:
tokens[0:15]

Let's get rid of punctuation too, which isn't used in most bag-of-words analyses. "Bag of words" means lists of words where the order doesn't matter. That's how most NLP tasks are done!


In [ ]:
import string
string.punctuation

In [ ]:
# Now remove the punctuation and see how much smaller the token list is now:
tokens = [token for token in tokens if token not in string.punctuation]
len(tokens)

In [ ]:
# But there's some awful stuff still in here:
sorted(tokens)[0:20]

Look at the ugliness of some of those tokens! You have a couple of options now: add the ones you want removed to your stopwords list, or remove all very short words, which will get rid of our punctuation problem too.


In [ ]:
[token for token in tokens if len(token) <= 2][0:20]

In [ ]:
# Let's define a small python function that's a pretty common one for text processing.

def clean_tokens(tokens):
    """ Lowercases, takes out punct and stopwords and short strings """
    return [token.lower() for token in tokens if (token not in string.punctuation) and 
                   (token.lower() not in english_stops) and len(token) > 2]

In [ ]:
clean = clean_tokens(tokens)
clean[0:20]

In [ ]:
len(clean)

So now we've reduced our data set from 191739 to 72576, just by removing stopwords, punctuation, and short strings. If we're interested in "meaning", this is a useful removal of noise.

Using textkit at the command line to filter stopwords, punctuation, and short words, and to lowercase:

(We break these pipelines up with intermediate output files (emma_lower.txt) because the commands get long.)


In [ ]:
!textkit text2words ../data/books/Austen_Emma.txt | textkit filterpunc | textkit tokens2lower > ../outputdata/emma_lower.txt

In [ ]:
!head -n5 ../outputdata/emma_lower.txt

In [ ]:
!textkit text2words ../outputdata/emma_lower.txt | textkit filterwords | textkit filterlengths -m 3 > ../outputdata/emma_clean.txt

In [ ]:
!head -n10 ../outputdata/emma_clean.txt

Count Word Frequencies

The obvious thing you want to do next is count frequencies of words in texts - NLTK has you covered. (Or you can do it easily yourself using a Counter object.)
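
For reference, the Counter version is only a couple of lines; this sketch should give the same counts as the NLTK approach in the next cell.


In [ ]:
# Counting the cleaned tokens with collections.Counter instead of NLTK:
from collections import Counter
Counter(clean).most_common(20)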


In [ ]:
from nltk import Text
cleantext = Text(clean)
cleantext.vocab().most_common()[0:20]

In [ ]:
# if you want to know all the vocabulary, without counts, you can remove the [0:10], which just shows the first 10:
list(cleantext.vocab().keys())[0:10]

In [ ]:
# Another way to do this is with nltk.FreqDist, which creates an object with keys that are 
# the vocabulary, and values for the counts:

nltk.FreqDist(clean)['sashed']

If you wanted to save the words and counts to a file to use, you can do it like this:


In [ ]:
wordpairs = cleantext.vocab().most_common()
with open("../outputdata/emma_word_counts.csv", "w") as handle:
    for pair in wordpairs:
        handle.write(pair[0] + "," + str(pair[1]) + "\n")

In [ ]:
!head -n5 ../outputdata/emma_word_counts.csv

Using Textkit at the command line:

Let's save the filtered, lowercased words into intermediate files (emma_lower.txt, then emma_clean.txt):


In [ ]:
!textkit text2words ../data/books/Austen_Emma.txt | textkit filterpunc | textkit tokens2lower > ../outputdata/emma_lower.txt

In [ ]:
!textkit filterwords ../outputdata/emma_lower.txt | textkit filterlengths -m 3 > ../outputdata/emma_clean.txt

In [ ]:
!head -n5 ../outputdata/emma_clean.txt

In [ ]:
!textkit tokens2counts ../outputdata/emma_clean.txt > ../outputdata/emma_word_counts.csv

In [ ]:
!head ../outputdata/emma_word_counts.csv

Now you are ready to make word clouds that are smarter than your average word cloud. Move your counts file into a place where your html can find it. Edit the file "simple_wordcloud.html" to use the name of your file, including the path!

You may still see some words in here you don't love -- names, modal verbs (would, could):

We can actually edit those by hand in the html/js code if you want. Look for the list of stopwords. You can change the color, too, if you want. I've added a few more stops to see how it looks now:

You might want to keep going.

By this point, we already know a lot about how to make texts manageable. A nice example of counting words over time in text appeared in the Washington Post, for SOTU speeches: https://www.washingtonpost.com/graphics/politics/2016-sotu/language/

There have also been a lot of studies of sentence, speech, or book length. I hope that seems easy now: you could tokenize by sentence using nltk and plot those lengths, or just count the words in each speech or book and plot those.
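
As a rough sketch of the sentence-length idea (no plotting library assumed here), you could compute the lengths like this and hand them to whatever charting tool you prefer:


In [ ]:
# Sentence lengths, in words, for the text we loaded earlier; split() gives a quick,
# rough word count.  These numbers could feed a histogram or a timeline.
sentences = nltk.sent_tokenize(text)
sentence_lengths = [len(sentence.split()) for sentence in sentences]
sentence_lengths[0:10]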


In [ ]:

Finding Most Common Pairs of Words ("Bigrams")

Words sometimes occur in common sequences. We call word pairs "bigrams" and word triples "trigrams"; "N-grams" is the general term for sequences of length N.
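
To make the terms concrete, here's a tiny toy example (separate from the Emma workflow) of what nltk.bigrams and nltk.trigrams produce:


In [ ]:
# nltk.bigrams and nltk.trigrams return generators of tuples of adjacent tokens.
toy = "it is a truth universally acknowledged".split()
list(nltk.bigrams(toy)), list(nltk.trigrams(toy))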


In [ ]:
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()

word_fd = nltk.FreqDist(clean) # all the words
bigram_fd = nltk.FreqDist(nltk.bigrams(clean))
finder = BigramCollocationFinder(word_fd, bigram_fd)
scored = finder.score_ngrams(bigram_measures.likelihood_ratio) # a good option here, there are others:
scored[0:50]

In [ ]:
# Trigrams - using raw counts is much faster.
# We also need TrigramAssocMeasures for the scoring calls below.

trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(clean, window_size=15)
finder.apply_freq_filter(2)
#ignored_words = nltk.corpus.stopwords.words('english') # if you hadn't removed them...
# if you want to remove extra words, like character names, you can create the ignored_words list too:
#finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)

In [ ]:
finder.nbest(trigram_measures.raw_freq, 20)

In [ ]:
finder.score_ngrams(trigram_measures.raw_freq)[0:20]

In [ ]:
## This is very slow!  Don't run unless you're serious :)

finder = TrigramCollocationFinder.from_words(clean,
    window_size = 20)
finder.apply_freq_filter(2)
#ignored_words = nltk.corpus.stopwords.words('english') # if you hadn't removed them...
# if you want to remove extra words, like character names, you can create the ignored_words list too:
#finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
finder.apply_word_filter(lambda w: len(w) < 3)  # remove short words
finder.nbest(trigram_measures.likelihood_ratio, 10)

What if we wanted to try non-fiction, to see if there are more interesting results?

We need to read and clean the text for another file. Let's try positive movie reviews, located in data/movie_reviews/all_pos.txt.


In [ ]:
with open("../data/movie_reviews/all_pos.txt", "U") as handle:
    text = handle.read()

In [ ]:
tokens = nltk.word_tokenize(text)  # tokenize them - split into words and punct
clean_posrevs = clean_tokens(tokens)  # clean up stopwords and punct

In [ ]:
clean_posrevs[0:10]

In [ ]:
word_fd = nltk.FreqDist(clean_posrevs)
bigram_fd = nltk.FreqDist(nltk.bigrams(clean_posrevs))
finder = BigramCollocationFinder(word_fd, bigram_fd)
scored = finder.score_ngrams(bigram_measures.likelihood_ratio) # likelihood ratio again; other scoring options exist
scored[0:50]

To see more details about the NLTK Text object methods, read the code/doc here: http://www.nltk.org/_modules/nltk/text.html
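
A couple of those Text methods are worth trying on our cleantext object from earlier. This is just a sketch -- and the concordance context will look a little odd, because we already stripped out the stopwords.


In [ ]:
# concordance() shows a word in context; similar() lists words that appear in similar contexts.
# ('marriage' is just an illustrative choice of word.)
cleantext.concordance('marriage', lines=5)
cleantext.similar('marriage')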

Bigrams in Textkit at the command line:

Create a file with all the word pairs, after making everything lowercase and removing punctuation and basic stopwords:


In [ ]:
!textkit text2words ../data/books/Austen_Emma.txt | textkit filterpunc | textkit tokens2lower > ../outputdata/emma_lower.txt

In [ ]:
!textkit filterwords ../outputdata/emma_lower.txt | textkit filterlengths -m 3 | textkit words2bigrams > ../outputdata/bigrams_emma.txt

In [ ]:
!head -n10 ../outputdata/bigrams_emma.txt

Then count them to get frequencies of the pairs. This may reveal custom stopwords you want to filter out.


In [ ]:
!textkit tokens2counts ../outputdata/bigrams_emma.txt > ../outputdata/bigrams_emma_counts.txt

In [ ]:
!head -n20 ../outputdata/bigrams_emma_counts.txt

Suppose you didn't want the names in there? Custom stopwords can be put in a file, one word per line, and passed to the filterwords command with the --custom argument:


In [ ]:
!textkit filterwords --custom ../data/emma_customstops.txt ../outputdata/emma_lower.txt > ../outputdata/emma_custom_stops.txt

In [ ]:
!textkit filterlengths -m 3 ../outputdata/emma_custom_stops.txt | textkit words2bigrams > ../outputdata/bigrams_emma.txt

In [ ]:
!textkit tokens2counts ../outputdata/bigrams_emma.txt > ../outputdata/bigrams_emma_counts.txt

In [ ]:
!head -n20 ../outputdata/bigrams_emma_counts.txt

You could add more if you wanted.

Parts of Speech (abbreviated POS)

To do this, you need to make sure your nltk_data has the MaxEnt Treebank POS tagger -- you can get it interactively with nltk.download() (on the Models tab) -- but we have it here already in the nltk_data directory.


In [ ]:
text = nltk.word_tokenize("And now I present your cat with something completely different.")
tagged = nltk.pos_tag(text)  # there are a few options for taggers, details in NLTK books
tagged

In [ ]:
nltk.untag(tagged)

The full list of Penn Treebank part-of-speech tags is here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html. A few common ones: NN (noun), NNP (proper noun, singular), VB (verb, base form), VBD (verb, past tense), JJ (adjective), RB (adverb), DT (determiner).

Parts of speech are used in analysis that's "deeper" than bag-of-words approaches. For instance, chunking (parsing for structure) may be used for entity identification and semantics. See http://www.nltk.org/book/ch07.html for a little more info, and the two Perkins NLTK books.
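
As a small taste of chunking (a sketch, not part of this notebook's workflow), NLTK's RegexpParser lets you define chunk patterns over POS tags -- here, a noun phrase ("NP") made of an optional determiner, any adjectives, and a noun:


In [ ]:
# Group DT? JJ* NN.* sequences into "NP" chunks, using the tagged sentence from above.
np_chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>}")
print(np_chunker.parse(tagged))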

Note also that "real linguists" parse a sentence into a syntactic structure, which is usually a tree form.


For instance, try out the Stanford NLP parser visually at http://corenlp.run/.

In TextKit at the command line:

This requires a little more Unix-fu, since Textkit doesn't yet have the capability to count just certain parts of speech. We'll use grep to find all the NNP tokens (proper nouns -- here, mostly character names) and cut to keep the first column (the word).


In [ ]:
!textkit text2words ../data/books/Austen_Emma.txt | textkit tokens2pos | grep NNP | cut -d, -f1 > ../outputdata/emma_nouns.txt

In [ ]:
!textkit tokens2counts ../outputdata/emma_nouns.txt > ../outputdata/emma_NNP_counts.csv

In [ ]:
!head -n10 ../outputdata/emma_NNP_counts.csv

That's all proper names. Maybe not very interesting. Let's look at the verbs now.


In [ ]:
!textkit text2words ../data/books/Austen_Emma.txt | textkit tokens2pos | grep VB | cut -d, -f1 > ../outputdata/emma_verbs.txt

In [ ]:
!textkit tokens2counts ../outputdata/emma_verbs.txt > ../outputdata/emma_VB_counts.csv

In [ ]:
!head -n20 ../outputdata/emma_VB_counts.csv

Keep in mind that you can filter stopwords before you do this too, although in that case you should lowercase the tokens first so they match the stopword list. Also note that "grep VB" matches every tag containing VB -- VBP, VBD, VBG, and so on -- not just the base form. If the tokens2pos output is comma-separated (which the cut -d, above suggests), grepping for ',VB$' would match the base form only.

Suppose you want to make a word cloud of just the verbs... without stopwords, and you want to compare two books by the same author, say Emma and Pride and Prejudice. Let's try it. (I'm using a \ to wrap the line here, so I don't need to use intermediate files for short commands.)


In [ ]:
!textkit text2words ../data/books/Austen_Emma.txt | textkit tokens2lower \
| textkit filterwords | textkit tokens2pos | grep VB | cut -d, -f1 > ../outputdata/emma_verbs.txt

In [ ]:
!textkit tokens2counts ../outputdata/emma_verbs.txt > ../outputdata/emma_VB_counts.csv

In [ ]:
!textkit text2words ../data/books/Austen_Pride.txt | textkit tokens2lower \
| textkit filterwords | textkit tokens2pos | grep VB | cut -d, -f1 > ../outputdata/pride_verbs.txt

In [ ]:
!textkit tokens2counts ../outputdata/pride_verbs.txt > ../outputdata/pride_VB_counts.csv

If you load those files into wc_clouds_bars.html (at the bottom!) and use the provided extra stopwords, you'll get this: underneath the word clouds are simple bar charts, which let you see the top words more precisely (the list is cut off at 150). This is one of the issues with word clouds: they lack precision in their display.

Another option for showing the difference more clearly is in analytic_wordlist.html.

Stemming / Lemmatizing

The goal is to merge data items that are the same at some "root" meaning level, and so reduce the number of features in your data set. From a topic or summarization perspective, you might want "Cats" and "Cat" treated as the same thing. You can really see this in the word clouds above... so many forms of the same word!


In [ ]:
# stemming removes affixes.  The Porter stemmer is the classic choice, though other algorithms (Snowball, Lancaster) exist in NLTK too.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmer.stem('believes')

In [ ]:
# lemmatizing maps a word to its dictionary root form (the lemma), here using WordNet.  It is slower; stemming is more common.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('said', pos='v')  # without pos='v', the default POS is noun and 'said' comes back unchanged

In [ ]:
lemmatizer.lemmatize('cookbooks')

In [ ]:
stemmer.stem('wicked')

In [ ]:
lemmatizer.lemmatize("were", pos="v")  # lemmatizing would allow us to collapse all forms of "be" into one token

In [ ]:
# a compression recipe apparently recommended in Perkins' Python 3 NLTK book? Not sure I agree.
stemmer.stem(lemmatizer.lemmatize('buses'))

Look at some of the clouds above. How would this be useful, do you think?


In [ ]:
def make_verbs_lemmas(filename, outputfile):
    """ Reads a one-verb-per-line file, lemmatizes each verb, and writes out lemma,count lines. """
    from collections import Counter
    with open(filename, 'U') as handle:
        verbs = handle.read().split('\n')
    lemmas = [lemmatizer.lemmatize(verb, pos='v') for verb in verbs]
    counts = Counter(lemmas)
    with open(outputfile, 'w') as handle:
        for key, value in counts.items():
            if key:   # skip the empty string from trailing newlines
                handle.write(key + "," + str(value) + "\n")
    print "wrote ", outputfile

In [ ]:
make_verbs_lemmas("../outputdata/emma_verbs.txt", "../outputdata/emma_lemma_verbs.csv")

In [ ]:
!head -n5 ../outputdata/emma_lemma_verbs.csv

In [ ]:
make_verbs_lemmas("../outputdata/pride_verbs.txt", "../outputdata/pride_lemma_verbs.csv")

In [ ]:
!head -n5 ../outputdata/pride_lemma_verbs.csv

Now let's look at those word clouds. A big improvement, and the counts actually change: look what happened with the Pride one, where a new word takes second place.


In [ ]: