Solution to Intro to NLP

Ex: Calculate the type-token ratio (TTR) for two novels in our data folder, then print the most frequent words for each.

0. Read in the texts and run pre-processing steps


In [ ]:
import nltk
import string
from nltk import word_tokenize
from nltk.corpus import stopwords

#open and read the novels, save them as variables
with open('../Data/Austen_PrideAndPrejudice.txt', encoding='utf-8') as f:
    austen_string = f.read()
with open('../Data/Alcott_GarlandForGirls.txt', encoding='utf-8') as f:
    alcott_string = f.read()
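
If word_tokenize or the stopword list raises a LookupError about missing resources, the underlying NLTK data packages still need to be fetched. The cell below is a safe-to-re-run fix; the package names are the standard ones for recent NLTK versions.

In [ ]:
#download the tokenizer models and stopword list (no-op if already installed)
nltk.download('punkt')
nltk.download('stopwords')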

In [ ]:
#tokenize the text
austen_list = word_tokenize(austen_string)
alcott_list = word_tokenize(alcott_string)
print(austen_list[:10])
print(alcott_list[:10])

In [ ]:
#pre-processing
#remove punctuation and lowercase. We can do this in one line!
punctuation = list(string.punctuation)
austen_list_clean = [word.lower() for word in austen_list if word not in punctuation]
alcott_list_clean = [word.lower() for word in alcott_list if word not in punctuation]

print(austen_list_clean[:10])
print(alcott_list_clean[:10])
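
One caveat: string.punctuation only lists single ASCII characters, so multi-character tokens produced by the tokenizer (such as -- or the `` and '' quote pairs) survive the filter above. A stricter alternative, sketched here rather than part of the assigned solution, is to keep only purely alphabetic tokens.

In [ ]:
#stricter alternative: keep only tokens made up entirely of letters
austen_list_alpha = [word.lower() for word in austen_list if word.isalpha()]
print(austen_list_alpha[:10])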

1. Type-Token Ratio

Divide the number of unique tokens (the length of the set) by the total number of tokens (the length of the full list).


In [ ]:
print("TTR for Pride and Prejudice")
print(len(set(austen_list_clean))/len(austen_list_clean))
print("TTR for A Garland for Girls")
print(len(set(alcott_list_clean))/len(alcott_list_clean))
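
A caveat when comparing these numbers: TTR falls as a text grows longer, because frequent words keep repeating while new types appear more slowly, so novels of different lengths are not directly comparable. A rough control, sketched below with an arbitrary 10,000-token cutoff that is not part of the original exercise, is to compute TTR over equal-length slices.

In [ ]:
#compare TTR over equal-length samples to control for text length
sample_size = 10000  #arbitrary cutoff, for illustration only
for name, tokens in [('Austen', austen_list_clean), ('Alcott', alcott_list_clean)]:
    sample = tokens[:sample_size]
    print(name, len(set(sample)) / len(sample))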

2. Frequent Words


In [ ]:
austen_word_frequency = nltk.FreqDist(austen_list_clean)
alcott_word_frequency = nltk.FreqDist(alcott_list_clean)

print("Frequent words in Pride and Prejudice")
print(austen_word_frequency.most_common(10))
print("Frequent words in A Garland for Girls")
print(alcott_word_frequency.most_common(10))
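
FreqDist also behaves like a dictionary (it subclasses collections.Counter), so you can look up the count of any single word directly. The words below are arbitrary examples.

In [ ]:
#dictionary-style lookup of individual word counts
print(austen_word_frequency['elizabeth'])
print(alcott_word_frequency['girls'])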

3. Bonus: Plotting word distribution


In [ ]:
austen_word_frequency.plot(50, cumulative=True)
austen_word_frequency.plot(50, cumulative=False)
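
FreqDist.plot() draws with matplotlib, so matplotlib must be installed. In recent NLTK versions it plots on the current axes, which means you can configure the figure first; a sketch assuming that behavior:

In [ ]:
#enlarge the figure so the 50 word labels stay readable
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 5))
austen_word_frequency.plot(50, cumulative=True)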

The most frequent words are, of course, the stop words. These words do not tell us much about the content of the text.

4. Count word frequencies after removing stop words


In [ ]:
stop_words = set(stopwords.words('english'))  #build the set once so membership checks are fast
austen_list_clean_sw = [word for word in austen_list_clean if word not in stop_words]
alcott_list_clean_sw = [word for word in alcott_list_clean if word not in stop_words]

austen_word_frequency_sw = nltk.FreqDist(austen_list_clean_sw)
alcott_word_frequency_sw = nltk.FreqDist(alcott_list_clean_sw)

print("Frequent words in Pride and Prejudice")
print(austen_word_frequency_sw.most_common(20))
print()
print("Frequent words in A Garland for Girls")
print(alcott_word_frequency_sw.most_common(20))

5. Bonus: Concordances (if time)

The NLTK package has many built-in functions for natural language processing. I encourage you to explore the full range of techniques available. I'll go over two more here: concordance() and similar().

The concordance() function lists out every time the specified word appears in the text, along with its surrounding context.


In [ ]:
with open('../Data/Marx_CommunistManifesto.txt', encoding='utf-8') as f:
    marx_string = f.read()
with open('../Data/Machiavelli_ThePrince.txt', encoding='utf-8') as f:
    prince_string = f.read()

marx_list = word_tokenize(marx_string)
prince_list = word_tokenize(prince_string)

marx_nltk = nltk.Text(marx_list)
prince_nltk = nltk.Text(prince_list)
print(marx_nltk)
print(prince_nltk)

In [ ]:
marx_nltk.concordance('people')
prince_nltk.concordance('people')
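
The concordance() method also takes width and lines parameters, which control how much context is printed around each hit and how many matches are shown.

In [ ]:
#wider context per line, fewer matches printed
marx_nltk.concordance('people', width=100, lines=5)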

The text.similar() method takes a word w, finds every context w1 w w2 in which it appears, and then lists other words w' that occur in those same contexts, i.e. w1 w' w2.


In [ ]:
print("Marx")
marx_nltk.similar('people')
print()
print("Machiavelli")
prince_nltk.similar('people')
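
A natural companion to similar() is common_contexts(), which prints the shared w1 _ w2 frames in which all the listed words occur (or a message if there are none). The word pair below is just an illustration; any pair of words that appear in the text will work.

In [ ]:
#contexts shared by two words within the same text
marx_nltk.common_contexts(['people', 'class'])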