In [ ]:
# Let's assign a string to a new variable
# Using the triple quotation mark, we can simply paste a passage in between
# and Python will treat it as a continuous string
first_sonnet = """From fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory"""
In [ ]:
# Note that when we evaluate 'first_sonnet' (rather than print it),
# we see the character that represents a line break: '\n'
first_sonnet
In [ ]:
# A familiar string method
first_sonnet.split()
In [ ]:
# Let's assign the list of tokens to a variable
sonnet_tokens = first_sonnet.split()
In [ ]:
# And find out how many words there are in the quatrain
len(sonnet_tokens)
In [ ]:
# Let's pull out the tokens from the second line
sonnet_tokens[6:13]
In [ ]:
# How long is each word in sonnet_tokens?
[len(token) for token in sonnet_tokens]
In [ ]:
# And why not assign that to a variable...
token_lengths = [len(token) for token in sonnet_tokens]
In [ ]:
# ... so we can do something fun, like get the average word length
sum(token_lengths) / len(token_lengths)
In [ ]:
## EX. Retrieve the word 'thereby' from the list of 'sonnet_tokens' by calling its index.
Beyond a simple list, we often find it useful to organize information into lists of lists. That is, a list in which each entry is itself a list of elements. For example, we may not want to treat a poem as a flat list of words but instead would like to group words into their constituent lines.
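As a quick sketch of the idea (toy data, not yet the sonnet itself), indexing into a list of lists takes one index per level of nesting:

```python
# A toy list of lists: each inner list holds the words of one "line"
lines = [['From', 'fairest', 'creatures'], ['That', 'thereby', "beauty's"]]

# Index once for an inner list, twice for a single word
first_line = lines[0]     # ['From', 'fairest', 'creatures']
one_word = lines[1][1]    # 'thereby'
```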
In [ ]:
# There's a twist!
first_sonnet.split('\n')
In [ ]:
# Assign the list of whole lines to a new variable
sonnet_lines = first_sonnet.split('\n')
In [ ]:
# How long is this list?
len(sonnet_lines)
In [ ]:
# Create a list of lists!
[line.split() for line in sonnet_lines]
In [ ]:
# Assign this to a variable
tokens_by_line = [line.split() for line in sonnet_lines]
In [ ]:
# Check its length
len(tokens_by_line)
In [ ]:
# Pull out the second line
tokens_by_line[1]
In [ ]:
# How long is that second line?
len(tokens_by_line[1])
In [ ]:
# Pull up an individual word
tokens_by_line[1][3]
In [ ]:
## EX. Retrieve the word 'thereby' from the list of 'tokens_by_line' by calling its indices.
## EX. Find the average number of words per line in 'tokens_by_line'.
We've started to grapple with the weirdly complicated idea of lists of lists and their utility for textual study. In fact, these translate rather easily into the very familiar idea of the spreadsheet. Very often, our data (whether numbers or text) can be represented as rows and columns. Once in that format, many mathematical operations come naturally.
Pandas is a popular and flexible package whose primary feature is its data type, the DataFrame. A dataframe is essentially a spreadsheet, like you would find in Excel, but it integrates seamlessly into an NLP workflow and has a few tricks up its sleeve.
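For instance, here is a minimal sketch of a DataFrame built from a dictionary of columns, an alternative to the list-of-lists construction we use below (the labels here are invented):

```python
import pandas

# Each dictionary key becomes a column label; each list becomes a column
df = pandas.DataFrame({'Eggs': [1, 4, 7], 'Bacon': [2, 5, 8]})

# Columns and rows can now be called up by label
eggs_column = df['Eggs']    # the 'Eggs' column
first_row = df.loc[0]       # the row labeled 0
```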
In [ ]:
# Get ready!
import pandas
In [ ]:
# Create a list of three sub-lists, each with three entries
square_list = [[1,2,3],[4,5,6],[7,8,9]]
In [ ]:
# Let's slice it by row
# Note that we would have to do some acrobatics in order to slice by column!
square_list[:2]
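Those "acrobatics" might look something like this sketch, where a list comprehension has to visit every row just to pull out one position:

```python
square_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Slicing by row is built in
first_two_rows = square_list[:2]                  # [[1, 2, 3], [4, 5, 6]]

# Slicing by column requires a pass over every row
first_column = [row[0] for row in square_list]    # [1, 4, 7]
```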
In [ ]:
# Create a dataframe from that list
pandas.DataFrame(square_list)
In [ ]:
# Let's create a couple of lists for our column and row labels
column_names = ['Eggs', 'Bacon', 'Sausage']
row_names = ['Served','With','Spam']
In [ ]:
# A-ha!
pandas.DataFrame(square_list, columns = column_names, index = row_names)
In [ ]:
# Assign this to a variable
spam_df = pandas.DataFrame(square_list, columns = column_names, index=row_names)
In [ ]:
# Call up a column of the dataframe
spam_df['Eggs']
In [ ]:
# Make that column into a list
list(spam_df['Eggs'])
In [ ]:
# Get the indices for the entries in the column
spam_df['Eggs'].index
In [ ]:
# Call up a row from the indices
spam_df.loc['Served']
In [ ]:
# Call up a couple of rows, using a list of indices!
spam_df.loc[['Spam','Served']]
In [ ]:
# Get a specific entry by calling both row and column
spam_df.loc['Spam']['Eggs']
In [ ]:
# Temporarily re-order the dataframe by values in the 'Eggs' column
spam_df.sort_values('Eggs', ascending=False)
In [ ]:
# Create a new column
spam_df['Lobster Thermidor aux Crevettes'] = [10,11,12]
In [ ]:
# Inspect
spam_df
In [ ]:
## EX. Call up the entries (5) and (6) from the middle of the dataframe 'spam_df' individually
## CHALLENGE: Call up both entries at the same time
In [ ]:
# Slice out a column
spam_df['Bacon']
In [ ]:
# Evaluate whether each element in the column is equal to 5
spam_df['Bacon']==5
In [ ]:
# Use that evaluation to subset the table
spam_df[spam_df['Bacon']==5]
In [ ]:
## EX. Slice 'spam_df' to contain only rows in which 'Sausage' is greater than 5
In [ ]:
# Our dataframe
spam_df
In [ ]:
# Pandas will produce a few descriptive statistics for each column
spam_df.describe()
In [ ]:
# Multiply entries of the dataframe by 10
spam_df*10
In [ ]:
# Add 10 to each entry
spam_df+10
In [ ]:
# Of course our dataframe hasn't changed
spam_df
In [ ]:
# What if we just want to add the values in the column?
sum(spam_df['Bacon'])
In [ ]:
# We can also perform operations among columns
# Pandas knows to match up individual entries in each column
spam_df['Bacon']/spam_df['Eggs']
In his study, Moretti offers several measures of the concept of character. The simplest of these is the relative share of dialogue belonging to each character in a play. Presumably the main characters will speak more and peripheral characters will speak less.
The statistical moves we will make here involve not only counting the raw number of words spoken by each character but also normalizing those counts. That is, converting them into fractions of all words in the play.
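A sketch of that normalization step, using made-up counts rather than the real Antigone numbers:

```python
# Hypothetical raw word counts for three characters
word_counts = [600, 300, 100]

# The total number of words in the play
total_words = sum(word_counts)

# Each character's share as a percentage of the whole
shares = [count / total_words * 100 for count in word_counts]   # [60.0, 30.0, 10.0]
```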
In order to focus on the statistical tasks at hand, we will begin by importing a spreadsheet in which each row is labeled with an individual character's name. Its columns contain metadata about the character herself, as well as a single column containing all of her dialogue as a string.
In [ ]:
# Read spreadsheet from the hard drive
dialogue_df = pandas.read_csv('antigone_dialogue.csv', index_col=0)
In [ ]:
# Take a look
dialogue_df
In [ ]:
# Pulling out a single column acts like a list -- with labels
dialogue_df['NAMED_CHARACTER']
In [ ]:
# If we wish, we can use metadata to subset our dataframe
dialogue_df[dialogue_df['NAMED_CHARACTER']=='named']
In [ ]:
# Check out the first element of the dialogue column
# (.iloc retrieves by position, since our row labels are character names)
dialogue_df['DIALOGUE'].iloc[0]
In [ ]:
# Create a list of lists; split each character's dialogue into a list of tokens
dialogue_tokens = [character.split() for character in dialogue_df['DIALOGUE']]
In [ ]:
# A list of lists!
dialogue_tokens
In [ ]:
# How many tokens are in each list?
dialogue_len = [len(tokens) for tokens in dialogue_tokens]
In [ ]:
# Check the numbers of tokens per character
dialogue_len
In [ ]:
# Assign this as a new column in the dataframe
dialogue_df['WORDS_SPOKEN'] = dialogue_len
In [ ]:
# Let's visualize!
# Tells Jupyter to produce images in notebook
%pylab inline
# Makes images look good
style.use('ggplot')
In [ ]:
# Visualize using the 'plot' method from Pandas
dialogue_df['WORDS_SPOKEN'].plot(kind='bar')
In [ ]:
## Moretti had not simply plotted the number of words spoken by each character
## but the percentage of all words in the play belonging to that character.
## He also had sorted the columns of his diagram by their height.
## EX. Calculate the share of each character's dialogue as a percentage of the total
## number of words in the play.
## EX. Reorganize the dataframe such that these percentages appear in descending order.
## EX. Visualize the ordered share of each character's dialogue as a bar chart.
This script uses a data type, a method, and an operation that are all closely related to ones we've seen. The dictionary resembles a list or a DataFrame. The string method index reverses our slicing move: instead of retrieving a character from a string by its position, it reports the position where a character appears. The for-loop bears a close resemblance to the list comprehension, although it doesn't necessarily produce a list.
Try playing around with them to see what they do!
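For example, here is a sketch of all three at once, using a made-up line of dialogue:

```python
# str.index reports the position of the first match -- the reverse of slicing by position
line = 'ANTIGONE My own flesh and blood'
first_space = line.index(' ')     # 8
speaker = line[:first_space]      # 'ANTIGONE'

# A dictionary maps keys to values; a for-loop builds it up entry by entry
speech_counts = {}
for name in ['ANTIGONE', 'ISMENE', 'ANTIGONE']:
    if name not in speech_counts:
        speech_counts[name] = 1
    else:
        speech_counts[name] = speech_counts[name] + 1
# speech_counts is now {'ANTIGONE': 2, 'ISMENE': 1}
```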
In [ ]:
# Read the text of Antigone from a file on your hard drive
antigone_text = open('antigone.txt', 'r').read()
# Create a list by splitting the string wherever a double line break occurs
antigone_list = antigone_text.split('\n\n')
# Create a new, empty dictionary
dialogue_dict = {}
# Iterate through each of the play's lines
for line in antigone_list:
    # Find the first space in each line
    index_first_space = line.index(' ')
    # Slice the line, preceding the first space
    character_name = line[:index_first_space]
    # Check whether the character is in our dictionary yet
    if character_name not in dialogue_dict.keys():
        # If not, create a new entry whose value is a slice of the line *after* the first space
        dialogue_dict[character_name] = line[index_first_space:]
    else:
        # If so, add the slice of line to the existing value
        dialogue_dict[character_name] = dialogue_dict[character_name] + line[index_first_space:]
# Get ready!
import pandas
# Convert dictionary to DataFrame; instruct pandas that each dictionary entry is a row ('index')
dialogue_df = pandas.DataFrame.from_dict(dialogue_dict, orient='index')
# Add label to spreadsheet column
dialogue_df.columns = ['DIALOGUE']
# Export as csv; save to hard drive
dialogue_df.to_csv('antigone_dialogue_new.csv')
In [ ]:
## EX. The text of Hamlet is also contained within the folder for this notebook ('hamlet.txt').
## Perform Moretti's character space analysis on that play.
## Note that the dialogue is formatted slightly differently in our copy of Hamlet than it
## was in Antigone. This means that you will need to tweak the script above if you wish
## to use it for Hamlet. In reality it is very often the case that a script has to be
## tailored to different applications!