In [ ]:
# Show graphs in notebook
%matplotlib inline

# Import libraries
import pandas as pd  # tabular data analysis library
import numpy as np  # mathematical operations library
import os  # library for manipulating the file system and Bash
from sklearn.feature_extraction.text import CountVectorizer 
import re  # regular expressions library
import matplotlib.pyplot as plt  # plotting base library
import seaborn as sns  # plotting extension library
from bs4 import BeautifulSoup  # html/lxml parsing library
from datascience import *

Character Space

This notebook recreates results discussed in:

  • Moretti, Franco. "'Operationalizing': or, the function of measurement in modern literary theory". Stanford Literary Lab Pamphlet 6. 2013

In his study, Moretti offers several ways to measure character space. The simplest is each character's relative share of the dialogue in a play: presumably the main characters speak more and peripheral characters speak less.

The statistical moves we make here are counting the raw number of words spoken by each character and then normalizing those counts, that is, converting them into a fraction of all words in the play.
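
As a minimal sketch of the normalization step, the toy example below counts words per speaker and divides each count by the total. The speakers and their lines are made up for illustration, and the helper name `dialogue_shares` is hypothetical rather than part of this notebook.

```python
def dialogue_shares(dialogue):
    """Return each speaker's share of all words, as a fraction between 0 and 1."""
    # Raw counts: number of whitespace-separated tokens per speaker
    counts = {name: len(text.split()) for name, text in dialogue.items()}
    total = sum(counts.values())
    # Normalize: divide each raw count by the total word count
    return {name: count / total for name, count in counts.items()}


# Made-up toy data (not taken from any of the plays)
toy_dialogue = {
    'A': 'one two three four five six',
    'B': 'one two three',
    'C': 'one',
}

toy_shares = dialogue_shares(toy_dialogue)
for name, share in toy_shares.items():
    print(name, round(share, 2))
```

Because the shares are fractions of one total, they sum to 1, which makes plays of different lengths comparable.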

Before we can focus on the statistical tasks at hand, we need to parse the raw text files to figure out who said what. Unfortunately, that's the hard part! We'll walk through the first play step by step, and I'll move quickly through the ones after.

Jean Racine's Phèdre


In [ ]:
# Read the text of the play from its file on the hard-drive
with open('data/phedre.txt', 'r') as f:
    phedre = f.read()

print(phedre[:200])  # print first 200 characters

In [ ]:
# Create a list, where each entry is a line from the play. We'll split on double line breaks.
# Each line starts with the name of the speaker.
phedre_list = phedre.split('\n\n')

# Create a regex pattern to match words we don't want to start the line
pattern = re.compile(r'ACT|SCENE|Scene')

# Keep only the dialogue turns: entries that don't start with the words above
# and that span more than one line (speaker name, then speech)
phedre_list = [x.strip() for x in phedre_list if pattern.match(x) is None and '\n' in x.strip()]

# Print first three dialogue turns
phedre_list[:3]

Now that we have the dialogue texts in a list, we can attribute dialogue words to each character.
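
The attribution step can be sketched on a single made-up turn, assuming (as in the list above) that the first line names the speaker and the remaining lines are the speech. The helper name `split_turn` and the sample text are hypothetical.

```python
def split_turn(turn):
    """Split one dialogue turn into (speaker, speech).

    Assumes the first line names the speaker and the rest is the speech.
    """
    first_line, _, speech = turn.partition('\n')
    speaker = first_line.split()[0]  # first token of the speaker line
    return speaker, speech


# Made-up turn for illustration (not from the play)
toy_turn = 'SPEAKER\nFirst line of speech.\nSecond line of speech.'

speaker, speech = split_turn(toy_turn)
word_count = len(speech.split())  # whitespace-separated tokens in the speech
print(speaker, word_count)
```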

"character-space turns smoothly into “word-space”—“the number of words allocated to a particular character”—and, by counting the words each character utters, we can determine how much textual space it occupies." [2]


In [ ]:
# Create a dictionary where each key is the name of a character
# and each entry is a single string of words spoken by them

# Initiate empty dict
dialogue_dict_phedre = {}

# Iterate through list of turns in the dialogue list
for line in phedre_list:
    
    # Get the name of the character
    char = line.split('\n')[0].split()[0]
    
    # Get the dialogue text
    dialogue = '\n'.join(line.split('\n')[1:])
    
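    # Add dialogue text to that character, separating turns with a newline
    # so the last word of one turn doesn't fuse with the first word of the next
    if char not in dialogue_dict_phedre:
        dialogue_dict_phedre[char] = dialogue
    else:
        dialogue_dict_phedre[char] += '\n' + dialogue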

        
# Print first 200 characters of Phaedra's dialogue
print(dialogue_dict_phedre['PHAEDRA'][:200])

In [ ]:
def plot_character_space(dialogue):
    
    # Create counter to get all words in all dialogue
    total_words = 0
    for char in dialogue.keys():
        total_words += len(dialogue[char].split())
        
    # Create a list of records (one dict per character) with each character's share of dialogue
    dialogue_share = []
    for char in dialogue.keys():
        dialogue_share.append({'Character': char.title(), 'Dialogue Share': len(dialogue[char].split()) / total_words * 100}) 
        
    my_table = Table.from_records(dialogue_share).sort('Dialogue Share', descending=True)
    my_table.bar(column_for_categories='Character')
    plt.xticks(range(len(my_table.columns[0])), my_table.columns[0], rotation=90)

In [ ]:
plot_character_space(dialogue_dict_phedre)

Macbeth


In [ ]:
# Read in text
with open('data/macbeth.txt', 'r') as f:
    macbeth = f.read()

# Get cast
pattern = re.compile(r'<[A-Z ]*>')
cast = list(set(re.findall(pattern, macbeth)))
cast = [x.replace('>', '').replace('<', '') for x in cast]

# Make dialogue dict
soup = BeautifulSoup(macbeth, 'lxml')
dialogue_dict_macbeth = {}
for c in cast:
    dialogue = [x.text for x in soup.find_all(c.lower().split()[0])]
    dialogue = '\n'.join([re.sub(r'<.*>', '', x).strip() for x in dialogue])
    dialogue_dict_macbeth[c] = dialogue

# Plot
plot_character_space(dialogue_dict_macbeth)

Othello


In [ ]:
# Read in text
with open('data/othello.txt', 'r') as f:
    othello = f.read()

# Get cast
pattern = re.compile(r'<[A-Z ]*>')
cast = list(set(re.findall(pattern, othello)))
cast = [x.replace('>', '').replace('<', '') for x in cast]

# Make dialogue dict
soup = BeautifulSoup(othello, 'lxml')
dialogue_dict_othello = {}
for c in cast:
    dialogue = [x.text for x in soup.find_all(c.lower().split()[0])]
    dialogue = '\n'.join([re.sub(r'<.*>', '', x).strip() for x in dialogue])
    dialogue_dict_othello[c] = dialogue

# Plot
plot_character_space(dialogue_dict_othello)

Antigone


In [ ]:
# Read in text
with open('data/antigone.txt', 'r') as f:
    antigone = f.read()

# Split lines
antigone_list = antigone.split('\n\n')

# Make dialogue dict
dialogue_dict_antigone = {}
for line in antigone_list:
    if ' ' not in line:
        continue  # skip empty or one-word chunks; line.index(' ') would raise ValueError
    dex = line.index(' ')
    char = line[:dex]
    if char not in dialogue_dict_antigone:
        dialogue_dict_antigone[char] = line[dex:]
    else:
        dialogue_dict_antigone[char] += line[dex:]

# Plot
plot_character_space(dialogue_dict_antigone)

Operationalizing Tragic Collision: Most Distinctive Words

The code below looks complicated, but all it does is count how many times each character says each word in the entire text. If a character never says a word, the count is simply 0. We then sum these counts to get the number of times each word is spoken in the text. If we're interested in the most distinctive words, we want to compare how many times a character says a specific word with how many times it is spoken in the entire text.

We'll make an 'EXPECTED' column that tells us how many times our target character should have said the word if its use were proportional to that character's share of the dialogue. Then we'll add a column for the ratio between the observed occurrences and the expected occurrences.

TL;DR: This code tells us which words a specific character uses more frequently than expected, given that character's share of the dialogue.
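
The arithmetic behind the ratio can be sketched with plain Python on made-up dialogue. This is an illustration under simplifying assumptions: it tokenizes on whitespace rather than with `CountVectorizer`, and the helper name `observed_expected_ratio` is hypothetical.

```python
from collections import Counter


def observed_expected_ratio(dialogue, character, word):
    """Ratio of observed to expected uses of `word` by `character`.

    Expected uses = (total uses of the word) * (character's share of all words).
    Raises ZeroDivisionError if the word never occurs in the dialogue.
    """
    # Word counts per speaker, using simple whitespace tokenization
    counts = {name: Counter(text.lower().split()) for name, text in dialogue.items()}
    corpus_total = sum(sum(c.values()) for c in counts.values())
    char_total = sum(counts[character].values())
    word_total = sum(c[word] for c in counts.values())

    expected = word_total * char_total / corpus_total
    observed = counts[character][word]
    return observed / expected


# Made-up toy data: X speaks 4 of the 10 words, and 3 of the 4 uses of "law"
toy_dialogue = {
    'X': 'law law law city',
    'Y': 'law city city city city city',
}

ratio = observed_expected_ratio(toy_dialogue, 'X', 'law')
print(ratio)
```

Here X holds 40% of the words, so X is expected to say "law" 4 × 0.4 = 1.6 times. X actually says it 3 times, giving a ratio of 3 / 1.6 = 1.875.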

"To do this, the Literary Lab follows an approach (which we call Most Distinctive Words) in several steps. First, we establish how often a word occurs in the corpus, and hence how often a specific character is expected to use it given the amount of words at its disposal; then we count how often the character actually utters the word, and calculate the ratio between actual and expected frequency; the higher the ratio, the greater the deviation from the average, and the more typical the word is of that character." [10]


In [ ]:
def get_mdw(dialogue_dict, character, group=False):
    # Boot up the dtm-maker
    cv = CountVectorizer()

    # Create the dtm
    dtm = cv.fit_transform(dialogue_dict.values()).toarray()

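    # Put the dtm into human-readable format
    # (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
    word_list = cv.get_feature_names_out()
    
    dtm_df = pd.DataFrame(dtm, columns=word_list, index=list(dialogue_dict.keys()))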

    # Create new dataframe
    mdw_df = pd.DataFrame()

    # Add a column for the character's observed word counts
    mdw_df[character] = dtm_df.loc[character]

    if group == False:
        # Add a column for the total counts of each word in the play
        mdw_df['WORD_TOTAL'] = dtm_df.sum()
    else:
        # Add a column for the total counts of each word for the characters in the defined group
        mdw_df['WORD_TOTAL'] = dtm_df.loc[group].sum()

    # Calculate the character's share of the total dialogue
    char_space = sum(mdw_df[character])/float(sum(mdw_df['WORD_TOTAL']))

    # Add a new column in which we calculate an "expected" number of times
    # the character would utter each word, based on its overall use in the play
    # (or in the group) and the character's share of the dialogue.

    mdw_df[character + '_EXPECTED'] = mdw_df['WORD_TOTAL']*char_space

    # How much more/less frequently does the character use the word than expected?
    mdw_df['OBS-EXP_RATIO'] = mdw_df[character]/(mdw_df[character + '_EXPECTED'])
    
    # Keep words used more than expected and more than 5 times overall, then sort
    # by the Observed/Expected Ratio to show the character's 20 "Most Distinctive Words"
    return mdw_df[(mdw_df['OBS-EXP_RATIO']>1)&(mdw_df['WORD_TOTAL']>5)].sort_values('OBS-EXP_RATIO', ascending=False).head(20)

In [ ]:
get_mdw(dialogue_dict_antigone, 'ANTIGONE')

In [ ]:
get_mdw(dialogue_dict_antigone, 'CREON')

Here's what Moretti had as most distinctive words:

But Moretti notes that these are Antigone's and Creon's most distinctive words as compared to the rest of the text (all the characters in the text). What if we are interested only in the relationship between the two characters? We can look at the most distinctive words given the dialogue of only Antigone and Creon in the same way, just leaving out the rest of the dialogue:
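
To see why the choice of baseline matters, here is a small sketch using made-up word counts for three hypothetical speakers. The helper name `obs_exp_ratio` and its `group` argument are illustrative assumptions that mirror the idea of restricting the comparison to a subset of characters.

```python
def obs_exp_ratio(counts, character, word, group=None):
    """Observed/expected ratio for `word`, measured against a chosen baseline.

    `counts` maps speaker -> {word: count}. If `group` is None, the baseline
    is every speaker; otherwise only the speakers listed in `group`.
    """
    names = list(counts) if group is None else list(group)
    baseline_total = sum(sum(counts[n].values()) for n in names)
    char_total = sum(counts[character].values())
    word_total = sum(counts[n].get(word, 0) for n in names)

    expected = word_total * char_total / baseline_total
    return counts[character].get(word, 0) / expected


# Made-up counts: A and B both say "law" twice; C speaks a lot but never says it
toy_counts = {
    'A': {'law': 2, 'other': 8},
    'B': {'law': 2, 'other': 8},
    'C': {'law': 0, 'other': 20},
}

whole_play = obs_exp_ratio(toy_counts, 'A', 'law')
pair_only = obs_exp_ratio(toy_counts, 'A', 'law', group=['A', 'B'])
print(whole_play, pair_only)
```

Against the whole toy text, "law" looks distinctive for A (ratio 2.0) only because C never uses it. Against B alone the ratio drops to 1.0, since the two use the word equally often.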


In [ ]:
get_mdw(dialogue_dict_antigone, 'ANTIGONE', group=['ANTIGONE', 'CREON'])

In [ ]:
get_mdw(dialogue_dict_antigone, 'CREON', group=['ANTIGONE', 'CREON'])

Here's what Moretti had:

Challenge

Experiment with looking at the most distinctive words for characters in the other plays we looked at (Phèdre, Macbeth, and Othello).

HINT: You should only have to write one line per text!


In [ ]:
## YOUR CODE HERE

What are the most distinctive words for the main characters of Phèdre, Macbeth, and Othello? If you've read the text, does this confirm your opinion of it? Does it add anything new?


In [ ]:


If you've already taken Data 8, or your Python text parsing skills are already advanced, try this one:

I've placed two more text files in the data folder for the two remaining dramas Moretti plots: Friedrich Schiller's Don Carlos and Henrik Ibsen's Ghosts. Write some code to plot the character space!


In [ ]:
!ls data

In [ ]:
## YOUR CODE HERE

In [ ]:
plot_character_space(dialogue_dict_doncarlos)
plot_character_space(dialogue_dict_ghosts)