In [ ]:
import itertools
import nltk
import string
In [ ]:
nltk.data.path
In [ ]:
nltk.data.path.append("../nltk_data")
In [ ]:
nltk.data.path = ['../nltk_data']
In [ ]:
ls ../data/books
In [ ]:
# the "U" here is for universal newline mode, because newlines on Mac are \r\n and on Windows are \n.
with open("../data/books/Austen_Emma.txt", "U") as handle:
text = handle.read()
In [ ]:
text[0:120]
Before we go further, it might be worth saying that even the lines of a text can be interesting as a visual. Here are a couple of books where every line is a line of pixels, and we've applied a simple search in JS to show lines of dialogue in pink. (The entire analysis is done in the web file book_shape.html -- so it's a little slow to load.)
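If you want to play with that per-line idea in Python rather than JS, here is a minimal sketch. (The "a line containing a double quote is dialogue" rule is just an assumed heuristic, not what book_shape.html actually does.)
In [ ]:
# Flag each raw line of the file as dialogue if it contains a double quote.
# This is only a rough heuristic - tune it for your text's quoting style.
with open("../data/books/Austen_Emma.txt", "U") as handle:
    lines = handle.readlines()
dialogue_flags = ['"' in line for line in lines]
sum(dialogue_flags), len(lines)  # how many lines look like dialogue, out of how many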
But usually you want to extract some sense of the content, which means crunching the text itself to get insights about the overall file.
In [ ]:
## if you don't want the newlines in there - replace them all.
text = text.replace('\n', ' ')
In [ ]:
text[0:120]
In [ ]:
## Breaking it up by sentence! Can be very useful for vis :)
nltk.sent_tokenize(text)[0:10]
In [ ]:
tokens = nltk.word_tokenize(text)
tokens[70:85] # Notice the punctuation:
In [ ]:
# Notice the difference here:
nltk.wordpunct_tokenize(text)[70:85]
There are other options for tokenization in NLTK. You can test some out here: http://text-processing.com/demo/tokenize/
Thanks to the work of Bocoup.com, we have a library that will do some simple text analysis at the command line, wrapping up some of the python functions I'll be showing you. The library is at https://github.com/learntextvis/textkit. Be aware it is under development! Also, some of these commands will be slower than running the code in the notebook.
When I say you can run these at the command line, what I mean is that in your terminal window you can type the command you see here after the !. The ! in the Jupyter notebook means this is a shell command.
The | is a "pipe." This means take the output from the previous command and make it the input to the next command.
In [ ]:
# run text2words on this book file at this location, pipe the output to the unix "head" command, showing 20 lines
!textkit text2words ../data/books/Austen_Emma.txt | head -n20
In [ ]:
# Pipe the output through the lowercase textkit operation, before showing 20 lines again!
!textkit text2words ../data/books/Austen_Emma.txt | textkit lowercase | head -n20
What if, at this point, we made a word cloud? Let's say we strip out the punctuation and just count the words. I'll do it quickly just to show you... but we'll go a bit further.
In [ ]:
!textkit text2words ../data/books/Austen_Emma.txt | textkit filterpunc | textkit tokens2counts > ../outputdata/simple_emma_counts.csv
In [ ]:
!ls -al ../outputdata/simple_emma_counts.csv
Using the html file simple_wordcloud.html and this data file, we can see something basically useless. You don't have to do this yourself, but if you want to, edit that file to point to the ../outputdata/simple_emma_counts.csv at the bottom.
"Stopwords" are words that are usually excluded because they are common connectors (or determiners, or short verbs) that are not considered to carry meaning. BEWARE hidden stopword filtering in libraries you use and always check stopword lists to see if you agree with their contents!
In [ ]:
from nltk.corpus import stopwords
english_stops = stopwords.words('english')
In [ ]:
# Notice they are lowercase. This means we need to be sure we lowercase our text if we want to match against them.
english_stops
In [ ]:
tokens = nltk.word_tokenize(text)
tokens[0:15]
In [ ]:
# this is how many tokens we have:
len(tokens)
We want to strip out stopwords using a list comprehension. Notice you need to lowercase the words before you check for membership!
In [ ]:
# try this without .lower in the if-statement and check the size!
# We are using a python list comprehension to remove the tokens from Emma (after lowercasing them!) that are stopwords
tokens = [token.lower() for token in tokens if token.lower() not in english_stops]
len(tokens)
In [ ]:
# now look at the first 15 words:
tokens[0:15]
Let's get rid of punctuation too, since it isn't used in most bag-of-words analyses. "Bag of words" means a list of words where the order doesn't matter; that's how a lot of NLP tasks are done!
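As a tiny illustration of the bag-of-words idea (a toy sentence, not our book), all that survives is the counts; the order is gone:
In [ ]:
# A toy bag of words: counts of each word, with the original order thrown away.
from collections import Counter
Counter("the cat sat on the mat near the cat".split())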
In [ ]:
import string
string.punctuation
In [ ]:
# Now remove the punctuation and see how much smaller the token list is now:
tokens = [token for token in tokens if token not in string.punctuation]
len(tokens)
In [ ]:
# But there's some awful stuff still in here:
sorted(tokens)[0:20]
The ugliness of some of those tokens! You have some options now: add the ones you want removed to your stopwords list, or remove all very short words, which will get rid of our punctuation problem too.
In [ ]:
[token for token in tokens if len(token) <= 2][0:20]
In [ ]:
# Let's define a small python function that's a pretty common one for text processing.
def clean_tokens(tokens):
    """ Lowercases, takes out punct and stopwords and short strings """
    return [token.lower() for token in tokens if (token not in string.punctuation) and
            (token.lower() not in english_stops) and len(token) > 2]
In [ ]:
clean = clean_tokens(tokens)
clean[0:20]
In [ ]:
len(clean)
So now we've reduced our data set from 191739 to 72576, just by removing stopwords, punctuation, and short strings. If we're interested in "meaning", this is a useful removal of noise.
In [ ]:
!textkit text2words ../data/books/Austen_Emma.txt | textkit filterpunc | textkit tokens2lower > ../outputdata/emma_lower.txt
In [ ]:
!head -n5 ../outputdata/emma_lower.txt
In [ ]:
!textkit text2words ../outputdata/emma_lower.txt | textkit filterwords | textkit filterlengths -m 3 > ../outputdata/emma_clean.txt
In [ ]:
!head -n10 ../outputdata/emma_clean.txt
In [ ]:
from nltk import Text
cleantext = Text(clean)
cleantext.vocab().most_common()[0:20]
In [ ]:
# if you want to know all the vocabulary, without counts - you can remove the [0:10], which just shows the first 10:
cleantext.vocab().keys()[0:10]
In [ ]:
# Another way to do this is with nltk.FreqDist, which creates an object with keys that are
# the vocabulary, and values for the counts:
nltk.FreqDist(clean)['sashed']
If you wanted to save the words and counts to a file to use, you can do it like this:
In [ ]:
wordpairs = cleantext.vocab().most_common()
with open("../outputdata/emma_word_counts.csv", "w") as handle:
for pair in wordpairs:
handle.write(pair[0] + "," + str(pair[1]) + "\n")
In [ ]:
!head -n5 ../outputdata/emma_word_counts.csv
Let's do the same cleanup with textkit, saving the filtered, lowercased words into emma_clean.txt and then counting them into emma_word_counts.csv:
In [ ]:
!textkit text2words ../data/books/Austen_Emma.txt | textkit filterpunc | textkit tokens2lower > ../outputdata/emma_lower.txt
In [ ]:
!textkit filterwords ../outputdata/emma_lower.txt | textkit filterlengths -m 3 > ../outputdata/emma_clean.txt
In [ ]:
!head -n5 ../outputdata/emma_clean.txt
In [ ]:
!textkit tokens2counts ../outputdata/emma_clean.txt > ../outputdata/emma_word_counts.csv
In [ ]:
!head ../outputdata/emma_word_counts.csv
Now you are ready to make word clouds that are smarter than your average word cloud. Move your counts file into a place where your html can find it. Edit the file "simple_wordcloud.html" to use the name of your file, including the path!
You may still see some words in here you don't love -- names, modal verbs (would, could):
We can actually edit those by hand in the html/js code if you want. Look for the list of stopwords. You can change the color, too, if you want. I've added a few more stops to see how it looks now:
You might want to keep going.
In [ ]:
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
word_fd = nltk.FreqDist(clean) # all the words
bigram_fd = nltk.FreqDist(nltk.bigrams(clean))
finder = BigramCollocationFinder(word_fd, bigram_fd)
scored = finder.score_ngrams(bigram_measures.likelihood_ratio) # likelihood ratio is a good option here; there are others
scored[0:50]
In [ ]:
# Trigrams - using raw counts is much faster.
trigram_measures = nltk.collocations.TrigramAssocMeasures()  # needed for the scoring below
finder = TrigramCollocationFinder.from_words(clean, window_size=15)
finder.apply_freq_filter(2)
#ignored_words = nltk.corpus.stopwords.words('english') # if you hadn't removed them...
# if you want to remove extra words, like character names, you can create the ignored_words list too:
#finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
In [ ]:
finder.nbest(trigram_measures.raw_freq, 20)
In [ ]:
finder.score_ngrams(trigram_measures.raw_freq)[0:20]
In [ ]:
## This is very slow! Don't run unless you're serious :)
finder = TrigramCollocationFinder.from_words(clean, window_size=20)
finder.apply_freq_filter(2)
#ignored_words = nltk.corpus.stopwords.words('english') # if you hadn't removed them...
# if you want to remove extra words, like character names, you can create the ignored_words list too:
#finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
finder.apply_word_filter(lambda w: len(w) < 3) # remove short words
finder.nbest(trigram_measures.likelihood_ratio, 10)
Some more help is here: http://www.nltk.org/howto/collocations.html
In [ ]:
with open("../data/movie_reviews/all_pos.txt", "U") as handle:
text = handle.read()
In [ ]:
tokens = nltk.word_tokenize(text) # tokenize them - split into words and punct
clean_posrevs = clean_tokens(tokens) # clean up stopwords and punct
In [ ]:
clean_posrevs[0:10]
In [ ]:
word_fd = nltk.FreqDist(clean_posrevs)
bigram_fd = nltk.FreqDist(nltk.bigrams(clean_posrevs))
finder = BigramCollocationFinder(word_fd, bigram_fd)
scored = finder.score_ngrams(bigram_measures.likelihood_ratio) # other scoring options exist too
scored[0:50]
To see more details about the NLTK Text object methods, read the code/doc here: http://www.nltk.org/_modules/nltk/text.html
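For a quick taste of what the Text object can do beyond vocab(), here is a sketch; the output depends on your cleaned tokens, so treat it as illustrative.
In [ ]:
# concordance() shows a word in its surrounding contexts; similar() shows words
# that occur in similar contexts. Both print their results rather than returning them.
cleantext.concordance("emma")
cleantext.similar("emma")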
Create a file with all the word pairs, after making everything lowercase and removing punctuation and basic stopwords:
In [ ]:
!textkit text2words ../data/books/Austen_Emma.txt | textkit filterpunc | textkit tokens2lower > ../outputdata/emma_lower.txt
In [ ]:
!textkit filterwords ../outputdata/emma_lower.txt | textkit filterlengths -m 3 | textkit words2bigrams > ../outputdata/bigrams_emma.txt
In [ ]:
!head -n10 ../outputdata/bigrams_emma.txt
Then count them to get frequencies of the pairs. This may reveal custom stopwords you want to filter out.
In [ ]:
!textkit tokens2counts ../outputdata/bigrams_emma.txt > ../outputdata/bigrams_emma_counts.txt
In [ ]:
!head -n20 ../outputdata/bigrams_emma_counts.txt
Suppose you didn't want the names in there? Custom stopwords can be put in a file, one word per line, and passed to the filterwords command with the --custom argument:
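The next cell uses a ready-made ../data/emma_customstops.txt; if you want to build a file like that yourself, here is a minimal sketch (the entries and the output path my_customstops.txt are just made-up examples):
In [ ]:
# Write a custom stopword file for textkit: one word per line.
# These names and the filename are only examples - use your own.
my_stops = ["emma", "harriet", "weston", "knightley", "elton"]
with open("../data/my_customstops.txt", "w") as handle:
    handle.write("\n".join(my_stops) + "\n")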
In [ ]:
!textkit filterwords --custom ../data/emma_customstops.txt ../outputdata/emma_lower.txt > ../outputdata/emma_custom_stops.txt
In [ ]:
!textkit filterlengths -m 3 ../outputdata/emma_custom_stops.txt | textkit words2bigrams > ../outputdata/bigrams_emma.txt
In [ ]:
!textkit tokens2counts ../outputdata/bigrams_emma.txt > ../outputdata/bigrams_emma_counts.txt
In [ ]:
!head -n20 ../outputdata/bigrams_emma_counts.txt
You could add more if you wanted.
In [ ]:
text = nltk.word_tokenize("And now I present your cat with something completely different.")
tagged = nltk.pos_tag(text) # there are a few options for taggers, details in NLTK books
tagged
In [ ]:
nltk.untag(tagged)
The Penn Treebank part of speech tags are documented here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
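You can also look tag definitions up from inside NLTK, assuming you have the "tagsets" data package in your nltk_data:
In [ ]:
# Look up what a tag means (needs the "tagsets" data package downloaded).
nltk.help.upenn_tagset('NNP')
nltk.help.upenn_tagset('VB.*')  # a regex works too - this prints all the verb tags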
Parts of speech are used in analysis that's "deeper" than bag-of-words approaches. For instance, chunking (parsing for structure) may be used for entity identification and semantics. See http://www.nltk.org/book/ch07.html for a little more info, and the two Perkins NLTK books.
Note also that "real linguists" parse a sentence into a syntactic structure, which is usually a tree form.
For instance, try out the Stanford NLP parser visually at http://corenlp.run/.
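To get a small taste of chunking in NLTK itself, here is a toy noun-phrase chunker using the simple grammar from the NLTK book; it's a sketch, not something tuned for real analysis.
In [ ]:
# A toy NP chunker: an optional determiner, any number of adjectives, then a noun.
# 'tagged' is the POS-tagged example sentence from a few cells back.
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)
chunker.parse(tagged)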
In [ ]:
!textkit text2words ../data/books/Austen_Emma.txt | textkit tokens2pos | grep NNP | cut -d, -f1 > ../outputdata/emma_nouns.txt
In [ ]:
!textkit tokens2counts ../outputdata/emma_nouns.txt > ../outputdata/emma_NNP_counts.csv
In [ ]:
!head -n10 ../outputdata/emma_NNP_counts.csv
That's all proper names. Maybe not very interesting. Let's look at the verbs now.
In [ ]:
!textkit text2words ../data/books/Austen_Emma.txt | textkit tokens2pos | grep VB | cut -d, -f1 > ../outputdata/emma_verbs.txt
In [ ]:
!textkit tokens2counts ../outputdata/emma_verbs.txt > ../outputdata/emma_VB_counts.csv
In [ ]:
!head -n20 ../outputdata/emma_VB_counts.csv
Keep in mind that you can filter stopwords before you do this too, although then you should lowercase things first so the stopwords actually match. Also note that "grep VB" will also match the other verb tags, like VBP and VBD!
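If you want tighter control over which verb tags you keep than grep gives you, here is a quick sketch in Python. (Tagging stopword-stripped tokens isn't linguistically ideal; it's just to show the filtering.)
In [ ]:
# Pull verbs out of a POS-tagged sample directly in Python. startswith("VB")
# catches VB, VBD, VBG, VBN, VBP, and VBZ; use tag == "VB" for base forms only.
tagged_sample = nltk.pos_tag(clean[0:500])  # a small sample of the cleaned Emma tokens
verbs = [word for word, tag in tagged_sample if tag.startswith("VB")]
verbs[0:15]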
Suppose you want to make a word cloud of just the verbs... without stopwords, and you want to compare two books by the same author, say Emma and Pride and Prejudice. Let's try it. (I'm using a \ to wrap the line here, so I don't need to use intermediate files for short commands.)
In [ ]:
!textkit text2words ../data/books/Austen_Emma.txt | textkit tokens2lower \
| textkit filterwords | textkit tokens2pos | grep VB | cut -d, -f1 > ../outputdata/emma_verbs.txt
In [ ]:
!textkit tokens2counts ../outputdata/emma_verbs.txt > ../outputdata/emma_VB_counts.csv
In [ ]:
!textkit text2words ../data/books/Austen_Pride.txt | textkit tokens2lower \
| textkit filterwords | textkit tokens2pos | grep VB | cut -d, -f1 > ../outputdata/pride_verbs.txt
In [ ]:
!textkit tokens2counts ../outputdata/pride_verbs.txt > ../outputdata/pride_VB_counts.csv
If you load those files into wc_clouds_bars.html (at the bottom!), and use the provided extra stopwords, you'll get this:
Another option for showing the difference more clearly is in analytic_wordlist.html.
The goal is to merge data items that are the same at some "root" meaning level, and reduce the number of features in your data set. "Cats" and "Cat" might want to be treated as the same thing, from a topic or summarization perspective. You can really see this in the word clouds above...so many forms of the same word!
In [ ]:
# stemming removes affixes. The Porter stemmer is the usual default choice, although other algorithms exist.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmer.stem('believes')
In [ ]:
# lemmatizing transforms to root words using grammar rules. It is slower. Stemming is more common.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('said', pos='v') # if you don't specify the POS, it defaults to noun and 'said' just comes back unchanged.
In [ ]:
lemmatizer.lemmatize('cookbooks')
In [ ]:
stemmer.stem('wicked')
In [ ]:
lemmatizer.lemmatize("were", pos="v") # lemmatizing would allow us to collapse all forms of "be" into one token
In [ ]:
# an apparently recommended compression recipe in Perkins Python 3 NLTK book? Not sure I agree.
stemmer.stem(lemmatizer.lemmatize('buses'))
Look at some of the clouds above. How would this be useful, do you think?
In [ ]:
def make_verbs_lemmas(filename, outputfile):
    """ Read a file of verbs (one per line), lemmatize them, and write word,count pairs to a csv. """
    from collections import Counter
    with open(filename, 'U') as handle:
        emmav = handle.read()
    emmaverbs = emmav.split('\n')
    lemmaverbs = []
    for verb in emmaverbs:
        lemmaverbs.append(lemmatizer.lemmatize(verb, pos='v'))
    counts = Counter(lemmaverbs)
    with open(outputfile, 'w') as handle:
        for key, value in counts.items():
            if key:  # skip the empty string that comes from blank lines
                handle.write(key + "," + str(value) + "\n")
    print "wrote ", outputfile
In [ ]:
make_verbs_lemmas("../outputdata/emma_verbs.txt", "../outputdata/emma_lemma_verbs.csv")
In [ ]:
!head -n5 ../outputdata/emma_lemma_verbs.csv
In [ ]:
make_verbs_lemmas("../outputdata/pride_verbs.txt", "../outputdata/pride_lemma_verbs.csv")
In [ ]:
!head -n5 ../outputdata/pride_lemma_verbs.csv
Now let's look at those word clouds. A big improvement, and the counts actually change. Look what happened with the Pride one, where there's a new second place.
In [ ]: