Dialogue Markup

An attempt to extract the dialogue from Middlemarch, to check how many quotations were of dialogue. Props to Franco Moretti for the idea.



In [1]:

    
import spacy
import textacy
import pandas as pd
import re
from collections import Counter



In [2]:

    
with open('../middlemarch.txt') as f: 
    mm = f.read()



In [3]:

    
dialogueObjs = list(re.finditer('"([A-Za-z].*?("|\n\n))', mm, re.DOTALL))



In [4]:

    
dialogue = [dia.group(0) for dia in dialogueObjs]



In [8]:

    
totalDialogueChars = sum([len(dia.group(0)) for dia in dialogueObjs])
totalDialogueChars









    Out[8]:





588552



In [9]:

    
totalTextChars = len(mm)
totalTextChars









    Out[9]:





1793449



In [27]:

    
percentDialogue = (totalDialogueChars / totalTextChars) * 100
percentDialogue









    Out[27]:





32.816768137817135



In [11]:

    
dialogueLocs = [dia.span() for dia in dialogueObjs]



In [13]:

    
df = pd.read_json('../txt/e4.json')



In [14]:

    
allLocs = df['Locations in A'].values



In [15]:

    
# Make a big list of all possible character offsets in which dialogue occurs.  
dialogueListList = [list(range(loc[0], loc[1])) for loc in dialogueLocs]
dialogueList = []
for item in dialogueListList: 
    dialogueList += item



In [16]:

    
# Check to see whether the start or the end of a critical quotation appears in our big list.  
inDialogue = []
dialogueQuotes = []
for locSet in allLocs: 
    for loc in locSet: 
        if loc[0] in dialogueList or loc[1] in dialogueList: 
            inDialogue.append(1)
            dialogueQuotes.append(mm[loc[0]:loc[1]])
        else: 
            inDialogue.append(0)



In [17]:

    
dialogueCount = Counter(inDialogue)
dialogueCount









    Out[17]:





Counter({0: 229, 1: 71})



In [18]:

    
totalQuotations = dialogueCount[0] + dialogueCount[1]
totalQuotations









    Out[18]:





300



In [25]:

    
# Percentage of dialogue. Unadjusted for non-quotes
percentDialogueInQuotes = (dialogueCount[1] / (totalQuotations) ) * 100
percentDialogueInQuotes









    Out[25]:





23.666666666666668



In [28]:

    
print('Of %s critical quotations, %s, or %s percent, are of dialogue. The novel is about %s percent dialogue.' % (totalQuotations, dialogueCount[1], percentDialogueInQuotes, percentDialogue))









    



Of 300 critical quotations, 71, or 23.666666666666668 percent, are of dialogue. The novel is about 32.816768137817135 percent dialogue.



In [ ]: