Although Long and So's study of modernist haiku motivates this lesson, a substantial portion of their corpus remains under copyright, so they have not made it publicly available. Instead, we will apply their methods to the corpus distributed by Ted Underwood and Jordan Sellers in support of their own literary-historical study of nineteenth- and early-twentieth-century volumes of poetry, comparing volumes that were reviewed in prestigious magazines with those that were not reviewed at all. (The idea is that even a negative review indicates valuable critical engagement.)
In essence, our task will be to learn the vocabulary of literary prestige rather than that of haiku. We will, however, deliberately follow Long and So's methods, since they reflect assumptions about language that are more appropriate to a general introduction.
In [ ]:
import nltk
nltk.download('stopwords')
from sklearn.naive_bayes import MultinomialNB
import pandas
In [ ]:
# Get texts of interest that belong to identifiably different categories
unladen_swallow = 'high air-speed velocity'
swallow_grasping_coconut = 'low air-speed velocity'
In [ ]:
# Transform them into a format scikit-learn can use
columns = ['high','low','air-speed','velocity']
indices = ['unladen', 'coconut']
dtm = [[1,0,1,1],[0,1,1,1]]
dtm_df = pandas.DataFrame(dtm, columns = columns, index = indices)
dtm_df
In [ ]:
# Train the Naive Bayes classifier
nb = MultinomialNB()
nb.fit(dtm,indices)
In [ ]:
# Make a prediction!
unknown_swallow = "high velocity"
unknown_features = [1,0,0,1]
nb.predict([unknown_features])
In [ ]:
# Read Moby Dick
moby_string = open('Melville - Moby Dick.txt').read()
In [ ]:
# Inspect the text
moby_string
In [ ]:
# Make the text lower case
moby_lower = moby_string.lower()
In [ ]:
# Tokenize Moby Dick
moby_tokens = moby_lower.split()
In [ ]:
# Check out the tokens
moby_tokens
In [ ]:
# Just how long is Moby Dick anyway?
len(moby_tokens)
In [ ]:
# Create a list comprehension, including an 'if' statement
just_whales = [token for token in moby_tokens if token=='whale']
In [ ]:
# Hast seen the White Whale?
just_whales
In [ ]:
# Make a new list
maritime = ['ship','harpoon','sail']
In [ ]:
# Multiply it
maritime * 2
In [ ]:
# Another list
whaling = ['whiteness','whale','ambergris']
In [ ]:
# Concatenate
maritime + whaling
In their paper, Long and So describe their pre-processing as consisting of three major steps: stop word removal, lemmatization of nouns, and feature selection (based on document frequency). In this workshop, we will focus on the first and third steps, since they integrate seamlessly with our workflow and since Underwood and Sellers use them as well.
Lemmatization -- the transformation of words into their dictionary forms; e.g. plural nouns become singular -- is particularly useful to Long and So, since they partly aim to study imagery. That is, they find it congenial to collapse the words mountains and mountain into the same token, since they express a similar image. For an introduction to lemmatization (and a related technique, stemming), see the NLTK book: http://www.nltk.org/book/ch03.html#sec-normalizing-text
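As a minimal illustration (not part of our workflow below), NLTK's WordNetLemmatizer maps a plural noun to its dictionary form; it requires the WordNet data to be downloaded first.
In [ ]:
# A quick sketch of lemmatization with NLTK's WordNetLemmatizer
# (we will not use lemmatization in this workshop)
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('mountains')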
Note that, due to copyright restrictions, the volumes' word order has not been retained, although their word counts have been. Fortunately, our methods do not require word-order information.
Underwood and Sellers's literary corpus has been divided into three folders: "reviewed", "random", and "canonic". (The last of these contains volumes by canonic poets who did not have the opportunity to be reviewed, such as Emily Dickinson.)
In [ ]:
import os
In [ ]:
# Assign file paths to each set of poems
review_path = 'poems/reviewed/'
random_path = 'poems/random/'
In [ ]:
# Get lists of text files in each directory
review_files = os.listdir(review_path)
random_files = os.listdir(random_path)
In [ ]:
# Inspect
review_files
In [ ]:
# Read-in texts as strings from each location
review_texts = [open(review_path+file_name).read() for file_name in review_files]
random_texts = [open(random_path+file_name).read() for file_name in random_files]
In [ ]:
# Inspect
review_texts[0]
In [ ]:
# Collect all texts in single list
all_texts = review_texts + random_texts
In [ ]:
# Get all file names together
all_file_names = review_files + random_files
In [ ]:
# Keep track of classes with labels
all_labels = ['reviewed'] * len(review_texts) + ['random'] * len(random_texts)
In [ ]:
## EX. How many file names are listed in the directory for reviewed texts?
## EX. How many texts got read into 'review_texts'? Does it match the number of files in the directory?
In [ ]:
# When passed the argument stop_words='english', scikit-learn uses this built-in list of English stop words
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
In [ ]:
# Inspect
ENGLISH_STOP_WORDS
In [ ]:
# How many are here?
len(ENGLISH_STOP_WORDS)
In [ ]:
# NLTK has its own collection of stop words
from nltk.corpus import stopwords
In [ ]:
# Pull up NLTK's list of English-language stop words
stopwords.words('english')
In [ ]:
# How many stop words are in the list?
len(stopwords.words('english'))
In [ ]:
# NLTK has stop word lists for many languages
stopwords.words('spanish')
In [ ]:
tokenized_sentence = ['what', 'is', 'the', 'air-speed', 'velocity', 'of', 'an', 'unladen', 'swallow']
In [ ]:
# Remove stopwords from tokenized sentence
[word for word in tokenized_sentence if word not in stopwords.words('english')]
In [ ]:
## Q. Stop words are typically the most frequent words in a language, yet do not convey semantic meaning.
## Does this make sense based on the words in NLTK's list of English stop words?
## What about other languages with which you are familiar?
stopword_languages = ['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian',\
'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']
## EX. Use either sklearn or NLTK's stopword list to remove those words from the 'token_list' below.
## EX. How many tokens did you remove from the 'token_list' in total? What percent were removed?
## CHALLENGE. Stop words are often instrumental in language detection for unknown texts.
## How might you write a program to do this?
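One possible approach to the challenge above (a rough sketch, not the only way): count how many of a text's tokens appear in each language's stop word list, and guess the language with the most matches.
In [ ]:
# A rough sketch of stop-word-based language detection:
# the language whose stop words appear most often in the text wins
def guess_language(tokens, languages = stopword_languages):
    overlaps = {}
    for language in languages:
        language_stops = set(stopwords.words(language))
        overlaps[language] = len([token for token in tokens if token in language_stops])
    return max(overlaps, key = overlaps.get)

guess_language(tokenized_sentence)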
In [ ]:
token_list = ['in', 'a', 'station', 'of', 'the', 'metro',\
'the', 'apparition', 'of', 'these', 'faces', 'in', 'the', 'crowd',\
'petals', 'on', 'a', 'wet', 'black', 'bough']
At this point, we transform our texts into a Document-Term Matrix in the same manner we have employed previously. However, it is important to note that neither Long and So nor Underwood and Sellers use all of the words that appear in their respective corpora when constructing their matrices. The process of choosing which words to comprise the columns is referred to as feature selection.
While there are several approaches one may take when selecting features, both of the literary studies under consideration use document frequency as the deciding criterion. The intuition is that a word that appears in a single text out of hundreds will not carry much weight when trying to determine the text's class membership.
In order to be selected as a feature, Long and So require that a word appear in at least 2 texts, whereas Underwood and Sellers require that a word appear in about a quarter of all texts. Although this is quite a large difference (a minimum of 2 texts vs. ~180 texts), it perhaps makes sense, since the texts are of very different lengths: individual haiku vs. entire volumes of poetry. The latter will have much greater vocabulary overlap.
The process of feature selection is intimately tied to the object under study and the statistical model chosen.
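As a toy illustration (invented texts, not drawn from the corpus), scikit-learn's CountVectorizer performs this kind of feature selection through its min_df argument: with min_df = 2, only words that appear in at least two documents are kept as features.
In [ ]:
# Toy example of document-frequency feature selection:
# 'the' appears in three texts and 'white' in two, so only they survive min_df = 2
from sklearn.feature_extraction.text import CountVectorizer
toy_texts = ['the white whale', 'the white sail', 'the harpoon']
toy_cv = CountVectorizer(min_df = 2)
toy_cv.fit_transform(toy_texts)
toy_cv.get_feature_names()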
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
In [ ]:
# Initialize the function that will transform our list of texts to a DTM
# 'min_df' and 'max_features' are arguments that enable flexible feature selection
# 'binary' tells CountVectorizer only to record whether a word appeared in a text or not
cv = CountVectorizer(stop_words = 'english', min_df=180, binary = True, max_features = None)
In [ ]:
# Transform our texts to DTM
cv.fit_transform(all_texts)
In [ ]:
# Transform our texts to a dense DTM
cv.fit_transform(all_texts).toarray()
In [ ]:
# Assign this to a variable
dtm = cv.fit_transform(all_texts).toarray()
In [ ]:
# Get the column headings
cv.get_feature_names()
In [ ]:
# Assign to a variable
feature_list = cv.get_feature_names()
In [ ]:
# Place this in a dataframe for readability
dtm_df = pandas.DataFrame(dtm, columns = feature_list, index = all_file_names)
In [ ]:
# Check out the dataframe
dtm_df
In [ ]:
# Get the dataframe's dimensions (# texts, # features)
dtm_df.shape
In [ ]:
## EX. Re-initialize the CountVectorizer function above with the argument min_df = 1.
## How many unique words are there in the total vocabulary of the corpus?
## EX. Repeat the exercise above with min_df = 360. (That is, words are only included if they appear
## in at least half of all documents.) What is the size of the vocabulary now?
## Does the list of these very common words look as you would expect?
Long and So selected a classification algorithm, Naive Bayes, that relies specifically on Bayes' Theorem to model relationships between textual features and categories in our corpus of poetry volumes. (See the link for more information about the method and its assumptions.)
Two ways we learn about the model are through its feature weights and its predictions on new texts. The algorithm can explicitly report which category each word leans toward and how strongly. Based on those weights, it then makes predictions about the category membership of previously unseen volumes of poetry.
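As a quick refresher, Bayes' Theorem lets us reverse a conditional probability: the probability of a class given a word can be computed from the probability of the word given the class, together with prior probabilities. The numbers below are invented purely for illustration.
In [ ]:
# Toy illustration of Bayes' Theorem with made-up numbers:
# P(reviewed | word) = P(word | reviewed) * P(reviewed) / P(word)
p_reviewed = 0.5               # prior: half of our volumes are 'reviewed'
p_word_given_reviewed = 0.2    # hypothetical chance of the word in a 'reviewed' volume
p_word = 0.15                  # hypothetical chance of the word in any volume
p_word_given_reviewed * p_reviewed / p_word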
In [ ]:
from sklearn.naive_bayes import MultinomialNB
In [ ]:
# Train the classifier and assign it to a variable
nb = MultinomialNB()
nb.fit(dtm, all_labels)
In [ ]:
# Hand-waving the underlying statistics here...
# For a binary classifier, this reports the words whose probability under the
# given class is highest relative to their probability under the other class
def most_informative_features(text_class, vectorizer = cv, classifier = nb, top_n = 10):
    import numpy as np
    feature_names = vectorizer.get_feature_names()
    # Index of the requested class (and, by elimination, of the other class)
    class_index = np.where(classifier.classes_ == text_class)[0][0]
    # Convert each class's log-probabilities over features back to probabilities
    class_prob_distro = np.exp(classifier.feature_log_prob_[class_index])
    alt_class_prob_distro = np.exp(classifier.feature_log_prob_[1 - class_index])
    # Odds ratio: how much more likely each word is under this class than the other
    odds_ratios = class_prob_distro / alt_class_prob_distro
    odds_with_fns = sorted(zip(odds_ratios, feature_names), reverse = True)
    return odds_with_fns[:top_n]
In [ ]:
# Returns feature name and odds ratio for a given class
most_informative_features('reviewed')
In [ ]:
# Similarly, for words that indicate 'random' class membership
most_informative_features('random')
In [ ]:
# Let's load up two poems that aren't in the training set and make predictions
dickinson_canonic = """Because I could not stop for Death –
He kindly stopped for me –
The Carriage held but just Ourselves –
And Immortality.
We slowly drove – He knew no haste
And I had put away
My labor and my leisure too,
For His Civility –
We passed the School, where Children strove
At Recess – in the Ring –
We passed the Fields of Gazing Grain –
We passed the Setting Sun –
Or rather – He passed us –
The Dews drew quivering and chill –
For only Gossamer, my Gown –
My Tippet – only Tulle –
We paused before a House that seemed
A Swelling of the Ground –
The Roof was scarcely visible –
The Cornice – in the Ground –
Since then – ‘tis Centuries – and yet
Feels shorter than the Day
I first surmised the Horses’ Heads
Were toward Eternity – """
anthem_patriotic = """O! say can you see, by the dawn's early light,
What so proudly we hailed at the twilight's last gleaming,
Whose broad stripes and bright stars through the perilous fight,
O'er the ramparts we watched, were so gallantly streaming?
And the rockets' red glare, the bombs bursting in air,
Gave proof through the night that our flag was still there;
O! say does that star-spangled banner yet wave
O'er the land of the free and the home of the brave?"""
In [ ]:
# Transform these into DTMs with the same feature-columns as previously
unknown_dtm = cv.transform([dickinson_canonic,anthem_patriotic]).toarray()
In [ ]:
# What does the classifier think?
nb.predict(unknown_dtm)
In [ ]:
# Although our classification is binary, Bayes' theorem assigns
# a probability of membership to each category
# Just how confident is our classifier of its predictions?
nb.predict_proba(unknown_dtm)
In [ ]:
## Q. What kinds of patterns do you notice among the 'most informative features'?
## Try looking at the top fifty most informative words for each category.
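If you want a longer view for the question above, the helper function accepts a top_n argument:
In [ ]:
# The top fifty most informative words for the 'reviewed' class
# (repeat with 'random' to see the other side)
most_informative_features('reviewed', top_n = 50)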
In their study of critical taste, Underwood and Sellers find not only that literary standards change very slowly, but also that contemporary evaluations of 'canonicity' resemble those of the nineteenth century.
In order to test this idea, the authors trained a classifier on nineteenth- and early-twentieth-century volumes of poetry that received reviews in a prestigious magazine versus those that didn't. The authors then used the classifier to predict a category for volumes of poetry that went unreviewed (in several cases because they were unpublished) but are now included in Norton anthologies.
How closely does critical evaluation today match that of a century ago?
In [ ]:
## EX. Import and process the 'canonic' (albeit unreviewed) volumes of poetry.
## Use the poetry classifier to predict whether they might have been reviewed.
## Does the output make sense? Is it consistent with Underwood and Sellers's findings?
canonic_path = 'poems/canonic/'
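A possible starting point for the exercise above (a sketch; adapt as you see fit), reusing the vectorizer and classifier trained earlier:
In [ ]:
# Read in the 'canonic' volumes and predict their categories
canonic_files = os.listdir(canonic_path)
canonic_texts = [open(canonic_path + file_name).read() for file_name in canonic_files]
canonic_dtm = cv.transform(canonic_texts).toarray()
nb.predict(canonic_dtm)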
Just how good is our classifier? We can evaluate it by randomly selecting texts from each category and setting them aside before training. We then see how well the classifier predicts their (known) categories.
Remember that if the classifier is trying to predict membership for just two categories, we would expect it to be correct about 50% of the time based on random chance. As a rule of thumb, if this kind of classifier has 65% accuracy or better under cross-validation, it has often identified a meaningful pattern.
In [ ]:
# Randomize the order of our texts
import numpy
randomized_review = numpy.random.permutation(review_texts)
randomized_random = numpy.random.permutation(random_texts)
In [ ]:
# We'll train our classifier on the first 90% of texts in the randomized list
# Then, we'll test it using the last 10%
training_set = list(randomized_review[:324]) + list(randomized_random[:324])
test_set = list(randomized_review[324:]) + list(randomized_random[324:])
training_labels = ['reviewed'] * 324 + ['random'] * 324
test_labels = ['reviewed'] * 36 + ['random'] * 36
In [ ]:
# Transform training and test texts into DTMs
# Note that 'min_df' has been adjusted to one quarter of the size of the training set
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words = 'english', min_df = 162, binary=True)
training_dtm = cv.fit_transform(training_set)
test_dtm = cv.transform(test_set)
In [ ]:
# Train, Predict, Evaluate
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
nb = MultinomialNB()
nb.fit(training_dtm, training_labels)
predictions = nb.predict(test_dtm)
accuracy_score(test_labels, predictions)
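The cell above performs a single train/test split. For a fuller picture, one possible approach (a sketch, not part of the original lesson) is k-fold cross-validation, which repeats the split-train-evaluate cycle across several folds; scikit-learn's cross_val_score automates this when the vectorizer and classifier are combined in a pipeline.
In [ ]:
# A sketch of 10-fold cross-validation: wrapping the vectorizer and classifier in a
# pipeline rebuilds the DTM from scratch for every fold
# (min_df = 162 is still roughly a quarter of each training fold)
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
pipeline = make_pipeline(CountVectorizer(stop_words = 'english', min_df = 162, binary = True),
                         MultinomialNB())
scores = cross_val_score(pipeline, all_texts, all_labels, cv = 10)
scores.mean()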
In [ ]:
## CHALLENGE: In fact, when Underwood and Sellers cross-validate, they do so by setting aside a single
## author's texts (one or more) from the training set and making a prediction for that author alone.
## After doing this for all authors, they tally the number of texts that were correctly predicted
## to calculate their overall accuracy. Implement this.
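One hedged sketch of how the challenge might look, assuming a hypothetical list all_authors (one author label per text, aligned with all_texts) that you would need to build yourself, for instance from the file names:
In [ ]:
# Sketch of leave-one-author-out cross-validation
# 'all_authors' is hypothetical: a list of author names parallel to 'all_texts'
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

labels_array = np.array(all_labels)
correct = 0
for train_idx, test_idx in LeaveOneGroupOut().split(all_texts, all_labels, groups = all_authors):
    fold_cv = CountVectorizer(stop_words = 'english', min_df = 162, binary = True)
    train_dtm = fold_cv.fit_transform([all_texts[i] for i in train_idx])
    test_dtm = fold_cv.transform([all_texts[i] for i in test_idx])
    fold_nb = MultinomialNB().fit(train_dtm, labels_array[train_idx])
    correct += sum(fold_nb.predict(test_dtm) == labels_array[test_idx])
correct / len(all_texts)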