Stylometry

This notebook reproduces several findings from Emily Thornbury's chapter "The Poet Alone" in her book Becoming a Poet in Anglo-Saxon England, in particular Fig. 4.5 on page 170.

First, however, we're going to think about what we might do with lists of strings. After all, how else can we count features of a string unless we can somehow make a list of items out of it?

Lists

Here's a list:


In [ ]:
["þæt", "wearð", "underne"]

How do I know?


In [ ]:
type(["þæt", "wearð", "underne"])

We can assign these to variables too!


In [ ]:
first_hemistich = ["þæt", "wearð", "underne"]
second_hemistich = ["eorðbuendum"]
print(first_hemistich)
print(second_hemistich)

And combine them with the + operator, which concatenates lists:


In [ ]:
print(first_hemistich + second_hemistich)

Let's assign that to first_line:


In [ ]:
first_line = first_hemistich + second_hemistich
print(first_line)

You can get the length of a list using the len function:


In [ ]:
len(first_line)

You can index lists with brackets []. Let's try to get the first word of the first line:


In [ ]:
print(first_line[1])

Don't forget: Python (like many other languages) starts counting from 0.

In [ ]:
print(first_line[0])

You can get ranges (slices) using a colon :


In [ ]:
print(first_line[:2])
print(type(first_line[:2]))
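
A slice takes the form start:stop, where the start index is included and the stop index is excluded. A minimal sketch of the common variants, redefining first_line so the cell runs on its own (the variable names other than first_line are only illustrative):

```python
# Same four words as the line built earlier in the notebook
first_line = ["þæt", "wearð", "underne", "eorðbuendum"]

first_two = first_line[:2]     # indices 0 and 1; the stop index 2 is excluded
middle = first_line[1:3]       # indices 1 and 2
from_third = first_line[2:]    # index 2 through the end of the list
last_word = first_line[-1]     # negative indices count back from the end

print(first_two)
print(middle)
print(from_third)
print(last_word)
```

Note that a slice always returns a list, even when it holds a single item, while plain indexing returns the item itself.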

Challenge 1

  • Concatenate the first three lines of Christ and Satan.
  • Retrieve the third element from the combined list.
  • Retrieve the fourth through sixth elements from the combined list.

In [ ]:
first_line = ['þæt', 'wearð', 'underne', 'eorðbuendum,']
second_line = ['þæt', 'meotod', 'hæfde', 'miht', 'and', 'strengðo']
third_line = ['ða', 'he', 'gefestnade', 'foldan', 'sceatas.']

In [ ]:


List Comprehension

For now, think of a list comprehension as a fast way to sift out items from a list, instead of writing a for loop that appends to a new one.


In [ ]:
[word for word in first_line if "e" in word]

INSTEAD OF


In [ ]:
has_e = []

for word in first_line:
    if "e" in word:
        has_e.append(word)

has_e

Now you know why list comprehensions are one of the best parts of Python!

Especially for text analysis, these will come in handy when we want to parse and sift through text.
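
A comprehension can also transform each item, not only filter. The general shape is [expression for item in list if condition], where the condition is optional. A small sketch, assuming the same first_line as above (word_lengths and eth_upper are illustrative names):

```python
first_line = ["þæt", "wearð", "underne", "eorðbuendum"]

# Transform only: one output value per input word
word_lengths = [len(word) for word in first_line]

# Transform and filter together: uppercase only the words containing "ð"
eth_upper = [word.upper() for word in first_line if "ð" in word]

print(word_lengths)
print(eth_upper)
```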

Challenge 2

  • Concatenate the first three lines of Christ and Satan.
  • Create a new list that contains only the words whose last letter is "e"
  • Create a new list that contains the first letter of each word.
  • Create a new list that contains only words longer than two letters.

In [ ]:


Word Frequencies


In [ ]:
with open('data/christ-and-satan.txt', 'r') as f:
    christ_and_satan = f.read()

In [ ]:
tokens = christ_and_satan.split()

In [ ]:
tokens

Looks like a decent start. But we still have verse numbering in there, as well as some punctuation. What if we just want the words?


In [ ]:
from string import punctuation, digits

In [ ]:
punctuation

In [ ]:
digits

Does it feel like time for a list comprehension? It should.

Challenge 3

Write a list comprehension to remove line numbers and punctuation.


In [ ]:


Python comes with the convenient Counter class from the collections module. It returns a dictionary-like object that maps each key to its frequency.


In [ ]:
from collections import Counter
cs_dict = Counter(tokens)

In [ ]:
cs_dict

In [ ]:
cs_dict.keys()

In [ ]:
cs_dict.values()

In [ ]:
cs_dict.most_common()

Believe it or not, "and" was already one of the most frequent words 1000 years ago :)
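
A few Counter behaviors are worth seeing on a small scale before working with the whole poem. This sketch uses a made-up token list (toy_tokens is not taken from the text file):

```python
from collections import Counter

# Hypothetical tokens, chosen only to make the counts easy to check by eye
toy_tokens = ["and", "þæt", "and", "he", "þæt", "and"]
toy_counts = Counter(toy_tokens)

print(toy_counts["and"])          # look up the frequency of one word
print(toy_counts["wearð"])        # a missing key gives 0 rather than a KeyError
print(toy_counts.most_common(2))  # the two most frequent (word, count) pairs
```

Passing a number to most_common limits the result to that many entries, which is handy when the full list is long.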

Challenge 4

  • A common measure of lexical diversity for a given text is its Type-Token Ratio: the ratio of the number of unique words (types) to the number of all words (tokens) in the text.
  • Calculate the Type-Token Ratio for Christ and Satan.

In [ ]:


Visualization


In [ ]:
%matplotlib inline
from datascience import *
import numpy as np

words, frequency = zip(*cs_dict.items())

t = Table(["Words", "Frequency"])
t.append_column("Words", words)
t.append_column("Frequency", frequency)
top_table = t.sort("Frequency", descending=True).take(np.arange(5))
top_table.bar(column_for_categories="Words")

Ad Hoc Stylometry

We can now put together our knowledge of strings, list comprehensions, and plotting frequencies to look at frequency of alliteration letters. Remember: Alliteration is the repetition of a sound at the beginning of two or more words in the same line.

Let's start by looking at the first letter of every word in the whole text:


In [ ]:
cs_tokens = christ_and_satan.lower().split()
first_letters = [x[0] if x[0] not in ['a','e','i','o','u','y'] else 'a' for x in cs_tokens]
first_l_dict = Counter(first_letters)
first_l_freq = first_l_dict.most_common()
print(first_l_freq)
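
The comprehension above replaces every initial vowel with 'a'. In Old English verse any vowel can alliterate with any other vowel, so counting them as a single class is presumably the intent. The same logic can be written as a small helper, which may be easier to read and reuse. This is only a sketch: alliteration_key is a hypothetical name, and like the cell above it does not treat æ as a vowel.

```python
# The same six letters the notebook's comprehension checks against
VOWELS = set("aeiouy")

def alliteration_key(word):
    """Return the word's first letter, with every listed vowel collapsed to 'a'."""
    first = word[0].lower()
    return "a" if first in VOWELS else first

sample = ["eorðbuendum", "and", "underne", "meotod", "miht"]
keys = [alliteration_key(word) for word in sample]
print(keys)
```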

In [ ]:
# plot
letters, frequency = zip(*first_l_dict.items())

t = Table(["Letters", "Frequency"])
t.append_column("Letters", letters)
t.append_column("Frequency", frequency)
top_table = t.sort("Frequency", descending=True).take(np.arange(5))
top_table.bar(column_for_categories="Letters")

Cool! But we need it within a line, and Thornbury specifically does it for each Fitt. What's a "Fitt"? It's a further division in poetry constituted by a group of lines. Luckily this is nicely delimited by double line breaks (\n\n).


In [ ]:
cs_fitts = christ_and_satan.split('\n\n')
cs_fitts
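
To see how the two levels of splitting fit together, here is a sketch on a made-up string with two "fitts" (toy_text is illustrative, not from the poem): a double line break separates fitts, and a single line break separates lines within a fitt.

```python
# Hypothetical text: two fitts, the first with 2 lines and the second with 3
toy_text = "line one\nline two\n\nline three\nline four\nline five"

toy_fitts = toy_text.split("\n\n")                     # one string per fitt
toy_lines = [fitt.split("\n") for fitt in toy_fitts]   # list of lines per fitt

print(len(toy_fitts))
print([len(lines) for lines in toy_lines])
```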

In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize = (10,10))

# iterate through fitts
for i in range(len(cs_fitts)):
    
    # lowercase the string and get the tokens for each line back
    fitt_tokens = [l.split() for l in cs_fitts[i].lower().split('\n')]
    
    # collect letter of most freq alliteration
    most_freq_allit = []
    
    # cycle through lines
    for l in fitt_tokens:
        
        # get first letter of all words in line
        first_letters = [x[0] if x[0] not in ['a','e','i','o','u','y'] else 'a' for x in l]
        
        # count freq of all first letters
        allit_freq = Counter(first_letters).most_common()
        try:
            # append most freq letter (alliterated letter) to list for all lines
            most_freq_allit.append(allit_freq[0][0])
        except IndexError:
            # an empty line has no tokens, so there is nothing to record
            pass
    
    # use Counter to get the most common alliterations
    allit_freq = Counter(most_freq_allit).most_common()

    # need keys for x axis
    common_keys = [x[0] for x in allit_freq]
    
    # need values for y axes
    common_values = [x[1] for x in allit_freq]
    
    # normalize so we can compare across Fitts despite different numbers of lines
    normed_values = [x[1]/sum(common_values) for x in allit_freq]
    
    # add up to get cumulative alliteration of the four most preferred patterns
    cumulative_values = np.cumsum(normed_values)

    # add the Fitt to the plot
    plt.xticks(range(4), ['1st','2nd','3rd','4th'], rotation='vertical')
    plt.plot(cumulative_values[:4], color = plt.cm.bwr(i*.085), lw=3)
plt.legend(labels=['Fitt '+str(i+1) for i in range(12)], loc=0)
plt.show()
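
The normalization and cumulative-sum steps in the loop above are easier to follow on a small example. A minimal sketch, using made-up letters for a ten-line fitt (not data from the poem) and itertools.accumulate as a standard-library stand-in for np.cumsum:

```python
from collections import Counter
from itertools import accumulate

# Hypothetical most-frequent initial letter for each of ten lines
line_letters = ["a", "s", "a", "h", "a", "s", "w", "a", "m", "a"]

ranked = Counter(line_letters).most_common()    # (letter, count), most frequent first
total = sum(count for _, count in ranked)       # number of lines counted
shares = [count / total for _, count in ranked] # each letter's share of the lines
cumulative = list(accumulate(shares))           # running total of the shares

for (letter, _), running in zip(ranked, cumulative):
    print(letter, round(running, 2))
```

The first value is the share of lines that use the fitt's favorite letter, the second adds the runner-up, and so on. A curve that rises quickly therefore means the fitt leans heavily on a few letters.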

Homework: Acrostics

In poetry, an acrostic is a message created by taking certain letters in a pattern over lines. One 9th-century German writer, Otfrid of Weissenburg, was notorious for his early use of acrostics, one instance of which is in the text below: Salomoni episcopo Otfridus. His message can be found by taking the first character of every other line. Print Otfrid's message!

Source: http://titus.uni-frankfurt.de/texte/etcs/germ/ahd/otfrid/otfri.htm


In [ ]:
text = '''si sálida gimúati      sálomones gúati, 
     ther bíscof ist nu édiles      kóstinzero sédales; 
     allo gúati gidúe thio sín,      thio bíscofa er thar hábetin, 
     ther ínan zi thiu giládota,      in hóubit sinaz zuívalta! 
     lékza ih therera búachi      iu sentu in suábo richi, 
     thaz ir irkíaset ubar ál,      oba siu frúma wesan scal; 
     oba ir hiar fíndet iawiht thés      thaz wírdig ist thes lésannes: 
     iz iuer húgu irwállo,      wísduames fóllo. 
     mir wárun thio iuo wízzi      ju ófto filu núzzi, 
     íueraz wísduam;      thes duan ih míhilan ruam. 
     ófto irhugg ih múates      thes mánagfalten gúates, 
     thaz ír mih lértut hárto      íues selbes wórto. 
     ni thaz míno dohti      giwérkon thaz io móhti, 
     odo in thén thingon      thio húldi so gilángon; 
     iz datun gómaheiti,      thio íues selbes gúati, 
     íueraz giráti,      nales míno dati. 
     emmizen nu ubar ál      ih druhtin férgon scal, 
     mit lón er iu iz firgélte      joh sínes selbes wórte; 
     páradyses résti      gébe iu zi gilústi; 
     ungilónot ni biléip      ther gotes wízzode kleip. 
     in hímilriches scóne      so wérde iz iu zi lóne 
     mit géltes ginúhti,      thaz ír mir datut zúhti. 
     sínt in thesemo búache,      thes gómo theheiner rúache; 
     wórtes odo gúates,      thaz lích iu iues múates: 
     chéret thaz in múate      bi thia zúhti iu zi gúate, 
     joh zellet tház ana wánc      al in íuweran thanc. 
     ofto wírdit, oba gúat      thes mannes júngoro giduat, 
     thaz es líwit thráto      ther zúhtari gúato. 
     pétrus ther rícho      lono iu es blídlicho, 
     themo zi rómu druhtin gráp      joh hús inti hóf gap; 
     óbana fon hímile      sént iu io zi gámane 
     sálida gimýato      selbo kríst ther gúato! 
     oba ih irbálden es gidár,      ni scal ih firlázan iz ouh ál, 
     nub ih ío bi iuih gerno      gináda sina férgo, 
     thaz hóh er iuo wírdi      mit sínes selbes húldi, 
     joh iu féstino in thaz múat      thaz sinaz mánagfalta gúat; 
     firlíhe iu sines ríches,      thes hohen hímilriches, 
     bi thaz ther gúato hiar io wíaf      joh émmizen zi góte riaf; 
     rihte íue pédi thara frúa      joh míh gifúage tharazúa, 
     tház wir unsih fréwen thar      thaz gotes éwiniga jár, 
     in hímile unsih blíden,      thaz wízi wir bimíden; 
     joh dúe uns thaz gimúati      thúruh thio síno guati! 
     dúe uns thaz zi gúate      blídemo múate! 
     mit héilu er gibóran ward,      ther io thia sálida thar fand, 
     uuanta es ni brístit furdir      (thes gilóube man mír), 
     nirfréwe sih mit múatu      íamer thar mit gúatu. 
     sélbo krist ther guato      firlíhe uns hiar gimúato, 
     wir íamer fro sin múates      thes éwinigen gúates!'''

In [ ]:
# HINT: remember what % does, (maybe) lookup enumerate
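
As a refresher on the two tools named in the hint, here is a sketch on a neutral list (items is made up and unrelated to the poem). enumerate pairs each item with its position, and i % 2 is the remainder after dividing the position by 2, so it alternates between 0 and 1.

```python
items = ["zero", "one", "two", "three", "four"]

# enumerate yields (position, item) pairs, with positions starting at 0
for i, item in enumerate(items):
    print(i, i % 2, item)

# the remainder can then serve as a filter condition
even_position_items = [item for i, item in enumerate(items) if i % 2 == 0]
odd_position_items = [item for i, item in enumerate(items) if i % 2 == 1]
print(even_position_items)
print(odd_position_items)
```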

Otfrid was more skillful than to settle for the first letter of every other line. What happens if you extract the last letter of the last word of each line, for every other line starting on the second line?


In [ ]:
# HINT: first remove punctuation, tab is represented by \t
from string import punctuation