In [ ]:
# Let's assign a string to a new variable
# Using the triple quotation mark, we can simply paste a passage in between
# and Python will treat it as a continuous string
first_sonnet = """From fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory"""
In [ ]:
# Note that when we evaluate 'first_sonnet' (rather than print it),
# we see the character that represents a line break: '\n'
first_sonnet
In [ ]:
# A familiar string method
first_sonnet.split()
In [ ]:
# Let's assign the list of tokens to a variable
sonnet_tokens = first_sonnet.split()
In [ ]:
# And find out how many words there are in the quatrain
len(sonnet_tokens)
In [ ]:
# Let's pull out the tokens from the second line
sonnet_tokens[6:13]
In [ ]:
# How long is each word in sonnet_tokens?
[len(token) for token in sonnet_tokens]
In [ ]:
# And why not assign that to a variable...
token_lengths = [len(token) for token in sonnet_tokens]
In [ ]:
# ... so we can do something fun, like get the average word length
sum(token_lengths) / len(token_lengths)
In [ ]:
## EX. Retrieve the word 'thereby' from the list of 'sonnet_tokens' by calling its index.
Beyond a simple list, we often find it useful to organize information into lists of lists. That is, a list in which each entry is itself a list of elements. For example, we may not want to treat a poem as a flat list of words but instead would like to group words into their constituent lines.
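As a quick sketch of the idea (toy data, not yet the sonnet itself), indexing into a list of lists takes one index per level of nesting:

```python
# A toy list of lists: each inner list holds the words of one "line"
lines = [['From', 'fairest', 'creatures'], ['That', 'thereby', "beauty's"]]

# Index once for an inner list, twice for a single word
first_line = lines[0]     # ['From', 'fairest', 'creatures']
one_word = lines[1][1]    # 'thereby'
```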
In [ ]:
# There's a twist!
first_sonnet.split('\n')
In [ ]:
# Assign the list of whole lines to a new variable
sonnet_lines = first_sonnet.split('\n')
In [ ]:
# How long is this list?
len(sonnet_lines)
In [ ]:
# Create a list of lists!
[line.split() for line in sonnet_lines]
In [ ]:
# Assign this to a variable
tokens_by_line = [line.split() for line in sonnet_lines]
In [ ]:
# Check its length
len(tokens_by_line)
In [ ]:
# Pull out the second line
tokens_by_line[1]
In [ ]:
# How long is that second line?
len(tokens_by_line[1])
In [ ]:
# Pull up an individual word
tokens_by_line[1][3]
In [ ]:
## EX. Retrieve the word 'thereby' from the list of 'tokens_by_line' by calling its indices.
## EX. Find the average number of words per line in 'tokens_by_line'.
We've started to grapple with the weirdly complicated idea of lists of lists and their utility for textual study. In fact, these translate rather easily into the very familiar idea of the spreadsheet. Very often, our data (whether numbers or text) can be represented as rows and columns. Once in that format, many mathematical operations come naturally.
Pandas is a popular and flexible package whose primary feature is its data type, the DataFrame. A dataframe is essentially a spreadsheet, like you would find in Excel, but it integrates seamlessly into an NLP workflow and has a few tricks up its sleeve.
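For instance, here is a minimal sketch of a DataFrame built from a dictionary of columns, an alternative to the list-of-lists construction we use below (the labels here are invented):

```python
import pandas

# Each dictionary key becomes a column label; each list becomes a column
df = pandas.DataFrame({'Eggs': [1, 4, 7], 'Bacon': [2, 5, 8]})

# Columns and rows can now be called up by label
eggs_column = df['Eggs']    # the 'Eggs' column
first_row = df.loc[0]       # the row labeled 0
```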
In [ ]:
# Get ready!
import pandas
In [ ]:
# Create a list of three sub-lists, each with three entries
square_list = [[1,2,3],[4,5,6],[7,8,9]]
In [ ]:
# Let's slice it by row
# Note that we would have to do some acrobatics in order to slice by column!
square_list[:2]
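Those "acrobatics" might look something like this sketch, where a list comprehension has to visit every row just to pull out one position:

```python
square_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Slicing by row is built in
first_two_rows = square_list[:2]                  # [[1, 2, 3], [4, 5, 6]]

# Slicing by column requires a pass over every row
first_column = [row[0] for row in square_list]    # [1, 4, 7]
```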
In [ ]:
# Create a dataframe from that list
pandas.DataFrame(square_list)
In [ ]:
# Let's create a couple of lists for our column and row labels
column_names = ['Eggs', 'Bacon', 'Sausage']
row_names = ['Served','With','Spam']
In [ ]:
# A-ha!
pandas.DataFrame(square_list, columns = column_names, index = row_names)
In [ ]:
# Assign this to a variable
spam_df = pandas.DataFrame(square_list, columns = column_names, index=row_names)
In [ ]:
# Call up a column of the dataframe
spam_df['Eggs']
In [ ]:
# Make that column into a list
list(spam_df['Eggs'])
In [ ]:
# Get the indices for the entries in the column
spam_df['Eggs'].index
In [ ]:
# Call up a row from the indices
spam_df.loc['Served']
In [ ]:
# Call up a couple of rows, using a list of indices!
spam_df.loc[['Spam','Served']]
In [ ]:
# Get a specific entry by calling both row and column
spam_df.loc['Spam']['Eggs']
In [ ]:
# Temporarily re-order the dataframe by values in the 'Eggs' column
spam_df.sort_values('Eggs', ascending=False)
In [ ]:
# Create a new column
spam_df['Lobster Thermidor aux Crevettes'] = [10,11,12]
In [ ]:
# Inspect
spam_df
In [ ]:
## EX. Call up the entries (5) and (6) from the middle of the dataframe 'spam_df' individually
## CHALLENGE: Call up both entries at the same time
In [ ]:
# Slice out a column
spam_df['Bacon']
In [ ]:
# Evaluate whether each element in the column is equal to 5
spam_df['Bacon']==5
In [ ]:
# Use that evaluation to subset the table
spam_df[spam_df['Bacon']==5]
In [ ]:
## EX. Slice 'spam_df' to contain only rows in which 'Sausage' is greater than 5
In [ ]:
# Our dataframe
spam_df
In [ ]:
# Pandas will produce a few descriptive statistics for each column
spam_df.describe()
In [ ]:
# Multiply entries of the dataframe by 10
spam_df*10
In [ ]:
# Add 10 to each entry
spam_df+10
In [ ]:
# Of course our dataframe hasn't changed
spam_df
In [ ]:
# What if we just want to add the values in the column?
sum(spam_df['Bacon'])
In [ ]:
# We can also perform operations among columns
# Pandas knows to match up individual entries in each column
spam_df['Bacon']/spam_df['Eggs']
In his study, Moretti offers several measures of the concept of character. The simplest of these is the relative share of dialogue belonging to each character in a play. Presumably the main characters will speak more and peripheral characters will speak less.
The statistical moves we will make here involve not only counting the raw number of words spoken by each character but also normalizing those counts. That is, converting them into fractions of all words in the play.
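A sketch of that normalization step, using made-up counts rather than the real Antigone numbers:

```python
# Hypothetical raw word counts for three characters
word_counts = [600, 300, 100]

# The total number of words in the play
total_words = sum(word_counts)

# Each character's share as a percentage of the whole
shares = [count / total_words * 100 for count in word_counts]   # [60.0, 30.0, 10.0]
```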
In order to focus on the statistical tasks at hand, we will begin by importing a spreadsheet in which each row is labeled with an individual character's name. Its columns contain metadata about the character herself, as well as a single column containing all of her dialogue as a string.
In [ ]:
# Read spreadsheet from the hard drive
dialogue_df = pandas.read_csv('antigone_dialogue.csv', index_col=0)
In [ ]:
# Take a look
dialogue_df
In [ ]:
# Pulling out a single column acts like a list -- with labels
dialogue_df['NAMED_CHARACTER']
In [ ]:
# If we wish, we can use metadata to subset our dataframe
dialogue_df[dialogue_df['NAMED_CHARACTER']=='named']
In [ ]:
# Check out the first element of the dialogue column
# (.iloc retrieves by position, since our row labels are character names)
dialogue_df['DIALOGUE'].iloc[0]
In [ ]:
# Create a list of lists; split each character's dialogue into a list of tokens
dialogue_tokens = [character.split() for character in dialogue_df['DIALOGUE']]
In [ ]:
# A list of lists!
dialogue_tokens
In [ ]:
# How many tokens are in each list?
dialogue_len = [len(tokens) for tokens in dialogue_tokens]
In [ ]:
# Check the numbers of tokens per character
dialogue_len
In [ ]:
# Assign this as a new column in the dataframe
dialogue_df['WORDS_SPOKEN'] = dialogue_len
In [ ]:
# Let's visualize!
# Tells Jupyter to produce images in notebook
%pylab inline
# Makes images look good
style.use('ggplot')
In [ ]:
# Visualize using the 'plot' method from Pandas
dialogue_df['WORDS_SPOKEN'].plot(kind='bar')
In [ ]:
## Moretti had not simply plotted the number of words spoken by each character
## but the percentage of all words in the play belonging to that character.
## He also had sorted the columns of his diagram by their height.
## EX. Calculate the share of each character's dialogue as a percentage of the total
## number of words in the play.
## EX. Reorganize the dataframe such that these percentages appear in descending order.
## EX. Visualize the ordered share of each character's dialogue as a bar chart.
This script uses a data type, a method, and an operation that are all closely related to ones we've seen. The dictionary resembles a list or a DataFrame. The string method index reverses our slicing move: instead of retrieving a character from a string by its position, it reports the position where a character appears. The for-loop bears a close resemblance to the list comprehension, although it doesn't necessarily produce a list.
Try playing around with them to see what they do!
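For example, here is a sketch of all three at once, using a made-up line of dialogue:

```python
# str.index reports the position of the first match -- the reverse of slicing by position
line = 'ANTIGONE My own flesh and blood'
first_space = line.index(' ')     # 8
speaker = line[:first_space]      # 'ANTIGONE'

# A dictionary maps keys to values; a for-loop builds it up entry by entry
speech_counts = {}
for name in ['ANTIGONE', 'ISMENE', 'ANTIGONE']:
    if name not in speech_counts:
        speech_counts[name] = 1
    else:
        speech_counts[name] = speech_counts[name] + 1
# speech_counts is now {'ANTIGONE': 2, 'ISMENE': 1}
```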
In [ ]:
# Read the text of Antigone from a file on your hard drive
antigone_text = open('antigone.txt', 'r').read()
# Create a list by splitting the string wherever a double line break occurs
antigone_list = antigone_text.split('\n\n')
# Create a new, empty dictionary
dialogue_dict = {}
# Iterate through each of the play's lines
for line in antigone_list:
    # Find the first space in each line
    index_first_space = line.index(' ')
    # Slice the line, preceding the first space
    character_name = line[:index_first_space]
    # Check whether the character is in our dictionary yet
    if character_name not in dialogue_dict.keys():
        # If not, create a new entry whose value is a slice of the line *after* the first space
        dialogue_dict[character_name] = line[index_first_space:]
    else:
        # If so, add the slice of line to the existing value
        dialogue_dict[character_name] = dialogue_dict[character_name] + line[index_first_space:]
# Get ready!
import pandas
# Convert dictionary to DataFrame; instruct pandas that each dictionary entry is a row ('index')
dialogue_df = pandas.DataFrame.from_dict(dialogue_dict, orient='index')
# Add label to spreadsheet column
dialogue_df.columns = ['DIALOGUE']
# Export as csv; save to hard drive
dialogue_df.to_csv('antigone_dialogue_new.csv')
In [ ]:
## EX. The text of Hamlet is also contained within the folder for this notebook ('hamlet.txt').
## Perform Moretti's character space analysis on that play.
## Note that the dialogue is formatted slightly differently in our copy of Hamlet than it
## was in Antigone. This means that you will need to tweak the script above if you wish
## to use it for Hamlet. In reality it is very often the case that a script has to be
## tailored to different applications!