Crunch-Shake

Introduction
Preliminaries
Parsing
Processing
Analysis </strong>

Introduction

crunch-shake is a library aimed to help analyze plays/scripts for gender disparities. Given a script, first you have to parse it to the format specified by the library. Then you can do fun stuff like seeing what are the most common words that females or males used, run network analysis to see who are the most important characters, create a graph of plays and even run the bechdel test.

Preliminaries

First lets take a look at the play we will be parsing, Romeo and Juliet by William Shakespeare. Ever wanted to know who was the more important of the romantic duo, Romeo or Juliet? (Hint: it does not dispell any notions that we live in a patriarchy.) I've taken the play from MIT's website.



In [1]:

    
from utils import file_to_list

romeo_juliet_raw = file_to_list("plays/romeo_and_juliet_entire_play.html")

# Showing the beggining
for line in romeo_juliet_raw[:10]:
    print(line, end="")









    



<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
 "http://www.w3.org/TR/REC-html40/loose.dtd">
 <html>
 <head>
 <title>Romeo and Juliet: Entire Play
 </title>
 <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
 <LINK rel="stylesheet" type="text/css" media="screen"
       href="/shake.css">
 </HEAD>

So obviously there's some stuff here thats not really relevant to us; lets look at some stuff in the middle of the play



In [2]:

    
# Showing the middle portion
for line in romeo_juliet_raw[2992:3007]:
    print(line, end="")









    



<A NAME=speech81><b>ROMEO</b></a>
<blockquote>
<A NAME=2.4.177>And stay, good nurse, behind the abbey wall:</A><br>
<A NAME=2.4.178>Within this hour my man shall be with thee</A><br>
<A NAME=2.4.179>And bring thee cords made like a tackled stair;</A><br>
<A NAME=2.4.180>Which to the high top-gallant of my joy</A><br>
<A NAME=2.4.181>Must be my convoy in the secret night.</A><br>
<A NAME=2.4.182>Farewell; be trusty, and I'll quit thy pains:</A><br>
<A NAME=2.4.183>Farewell; commend me to thy mistress.</A><br>
</blockquote>

<A NAME=speech82><b>Nurse</b></a>
<blockquote>
<A NAME=2.4.184>Now God in heaven bless thee! Hark you, sir.</A><br>
</blockquote>

In additions to dialogue we also have to watch out for act and scene information.



In [3]:

    
for line in romeo_juliet_raw[3315:3317]:
    print(line, end="")









    



<H3>ACT III</h3>
<h3>SCENE I. A public place.</h3>

As well as information regarding when characters enter and exit. These stage directions can happen between dialogues, or within a dialogue (indicating a character should enter/exit while another is speaking).



In [4]:

    
# in between dialogue
for line in romeo_juliet_raw[1778:1782]:
    print(line, end="")









    



<h3>SCENE I. A lane by the wall of Capulet's orchard.</h3>
<p><blockquote>
<i>Enter ROMEO</i>
</blockquote>



In [5]:

    
# within a dialogue
for line in romeo_juliet_raw[3257:3273]:
    print(line, end="")









    



<A NAME=speech3><b>FRIAR LAURENCE</b></a>
<blockquote>
<A NAME=2.6.9>These violent delights have violent ends</A><br>
<A NAME=2.6.10>And in their triumph die, like fire and powder,</A><br>
<A NAME=2.6.11>Which as they kiss consume: the sweetest honey</A><br>
<A NAME=2.6.12>Is loathsome in his own deliciousness</A><br>
<A NAME=2.6.13>And in the taste confounds the appetite:</A><br>
<A NAME=2.6.14>Therefore love moderately; long love doth so;</A><br>
<A NAME=2.6.15>Too swift arrives as tardy as too slow.</A><br>
<p><i>Enter JULIET</i></p>
<A NAME=2.6.16>Here comes the lady: O, so light a foot</A><br>
<A NAME=2.6.17>Will ne'er wear out the everlasting flint:</A><br>
<A NAME=2.6.18>A lover may bestride the gossamer</A><br>
<A NAME=2.6.19>That idles in the wanton summer air,</A><br>
<A NAME=2.6.20>And yet not fall; so light is vanity.</A><br>
</blockquote>

So this the text I'm aiming to parse. Luckily regular expressions are well suited to this task. For this particular play, I've prepared the matchers, found in mit_shakespeare_regex.py. Let's go ahead and try it out.



In [6]:

    
from mit_shakespeare_regex import matcher

line1 = romeo_juliet_raw[1943]
print(line1)









    



<A NAME=2.2.46>By any other name would smell as sweet;</A><br>



In [7]:

    
# Since line 1 is a piece of dialogue, matcher.dialogue should return an object when it searches the line
matcher.dialogue.search(line1)









    Out[7]:





<_sre.SRE_Match object; span=(0, 62), match='<A NAME=2.2.46>By any other name would smell as s>



In [8]:

    
# Since this line does not indicate which character is speaking, it should return None (so nothing)
matcher.character.search(line1)



In [9]:

    
# A line that matcher.character will match
line2 = romeo_juliet_raw[1935]
matcher.character.search(line2)









    Out[9]:





<_sre.SRE_Match object; span=(0, 33), match='<A NAME=speech6><b>JULIET</b></a>'>

The last thing we need before we begin is a gender file specifying the gender of each character in the play. This has to be done by hand.



In [10]:

    
from utils import json_file_to_dict

gender = json_file_to_dict("plays/romeo_and_juliet_entire_play_gender.json")
print(gender)









    



{'MERCUTIO': 'M', 'PARIS': 'M', 'CAPULET': 'M', 'THIRD MUSICIAN': 'M', 'GREGORY': 'M', 'SERVANT': 'M', 'FIRST SERVANT': 'M', 'BALTHASAR': 'M', 'PETER': 'M', 'PAGE': 'M', 'LADY MONTAGUE': 'F', 'FIRST WATCHMAN': 'M', 'BENVOLIO': 'M', 'SECOND MUSICIAN': 'M', 'MONTAGUE': 'M', 'FIRST MUSICIAN': 'M', 'MUSICIAN': 'M', 'TYBALT': 'M', 'SAMPSON': 'M', 'SECOND CAPULET': 'M', 'CHORUS': 'N', 'THIRD WATCHMAN': 'M', 'FRIAR LAURENCE': 'M', 'FIRST CITIZEN': 'M', 'PRINCE': 'M', 'LADY CAPULET': 'F', 'ABRAHAM': 'M', 'ROMEO': 'M', 'SECOND WATCHMAN': 'M', 'NURSE': 'F', 'FRIAR JOHN': 'M', 'APOTHECARY': 'M', 'SECOND SERVANT': 'M', 'JULIET': 'F'}

Parsing

Now we have everything necessary to start using crunch-shake to parse the text. First we need to get the speaking characers in the text. (I get it directly from the play, you might be wondering why not just use the gender file? Well I actually used get_speaking_characters to generate the gender file.)



In [11]:

    
from parse import get_speaking_characters

speaking = get_speaking_characters(romeo_juliet_raw, matcher.character)
print(speaking)









    



{'FIRST SERVANT', 'BENVOLIO', 'FIRST MUSICIAN', 'TYBALT', 'MERCUTIO', 'CHORUS', 'SECOND MUSICIAN', 'PETER', 'FIRST CITIZEN', 'PRINCE', 'LADY CAPULET', 'ROMEO', 'APOTHECARY', 'GREGORY', 'FIRST WATCHMAN', 'CAPULET', 'THIRD MUSICIAN', 'SERVANT', 'BALTHASAR', 'PAGE', 'LADY MONTAGUE', 'THIRD WATCHMAN', 'MONTAGUE', 'SECOND CAPULET', 'MUSICIAN', 'PARIS', 'SAMPSON', 'FRIAR LAURENCE', 'ABRAHAM', 'JULIET', 'SECOND WATCHMAN', 'NURSE', 'FRIAR JOHN', 'SECOND SERVANT'}



In [12]:

    
from parse import parse_raw_text

play_lines = parse_raw_text(romeo_juliet_raw, speaking, matcher)
for line in play_lines[:20]:
    print(line)









    



Act(act=1)
Scene(scene=1)
Enter SAMPSON and GREGORY, of the house of Capulet, armed with swords and bucklers : ["Enter - ['SAMPSON', 'GREGORY', 'CAPULET']"] : None
SAMPSON
1.1 :Gregory, o' my word, we'll not carry coals. : None
GREGORY
1.1 :No, for then we should be colliers. : None
SAMPSON
1.1 :I mean, an we be in choler, we'll draw. : None
GREGORY
1.1 :Ay, while you live, draw your neck out o' the collar. : None
SAMPSON
1.1 :I strike quickly, being moved. : None
GREGORY
1.1 :But thou art not quickly moved to strike. : None
SAMPSON
1.1 :A dog of the house of Montague moves me. : None
GREGORY
1.1 :To move is to stir; and to be valiant is to stand: : None
1.1 :therefore, if thou art moved, thou runn'st away. : None

Processing

Now that we have the play in a format our library can understand, lets move to the processing part. Process will extract useful information from the play, that will be used in our analysis. The first piece of information we extract is the 'adj' object which gives us the number of play lines when a character spoke to another character. The other object 'act_scene_start_end' gives the starting and ending line number for each scene (inclusive, exclusive).



In [13]:

    
from process import process

adj, act_scene_start_end = process(speaking, play_lines)

# adj gives the line number where one character spoke in the precense of another. 
# Lets see all the times when romeo said something in the precense of Juliet.
romeo_to_juliet = adj['ROMEO']['JULIET']
print(romeo_to_juliet)
print()
print("Number of times Romeo said something in the presence of Juliet :", len(romeo_to_juliet))









    



[843, 844, 845, 849, 850, 851, 852, 853, 854, 855, 856, 857, 858, 914, 915, 916, 917, 924, 928, 929, 933, 934, 938, 939, 945, 954, 955, 959, 1083, 1085, 1086, 1087, 1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103, 1104, 1105, 1106, 1107, 1108, 1112, 1113, 1114, 1115, 1116, 1117, 1118, 1119, 1126, 1141, 1142, 1143, 1148, 1149, 1150, 1151, 1152, 1158, 1165, 1166, 1167, 1168, 1172, 1173, 1174, 1178, 1179, 1180, 1181, 1185, 1186, 1187, 1188, 1189, 1214, 1215, 1221, 1228, 1240, 1244, 1249, 1262, 1263, 1264, 1286, 1291, 1292, 1293, 1294, 1305, 1306, 1307, 1311, 1316, 1321, 1326, 1327, 1336, 1916, 1917, 1918, 1919, 1920, 1921, 2674, 2675, 2676, 2677, 2678, 2679, 2687, 2688, 2689, 2690, 2691, 2692, 2693, 2694, 2695, 2708, 2721, 2730, 2731, 2732, 2736, 2737, 2744, 2745, 3748, 3749, 3750, 3751, 3752, 3753, 3754, 3755, 3756, 3757, 3758, 3759, 3760, 3761, 3762, 3763, 3764, 3765, 3769, 3770, 3776, 3777, 3778, 3779, 3793, 3794, 3795, 3796, 3797, 3798, 3799, 3800, 3801, 3802, 3807, 3819, 3820, 3821, 3822, 3823, 3824, 3825, 3826, 3827, 3828, 3829, 3830, 3831, 3832, 3834, 3835, 3836, 3837, 3838, 3839, 3840, 3841, 3842, 3843, 3844, 3845, 3846, 3847, 3848, 3849, 3850, 3851, 3852, 3853, 3854, 3855, 3856, 3857, 3858, 3859, 3860, 3861, 3862, 3863, 3864, 3865, 3867, 3868]

Number of times Romeo said something in the presence of Juliet : 224



In [14]:

    
# Exercise: Replace None with the correct numerical value

print("Number of times Juliet said something in the presence of Romeo :", None)









    



Number of times Juliet said something in the presence of Romeo : None



In [15]:

    
# Gives the starting line and the ending line + 1 for each scene
print(act_scene_start_end)
print()
print("Number of scenes in Romeo and Juliet :", len(act_scene_start_end))









    



[(0, 350), (350, 490), (490, 635), (635, 786), (786, 1020), (1020, 1080), (1080, 1350), (1350, 1470), (1470, 1779), (1779, 1883), (1883, 1933), (1933, 2214), (2214, 2389), (2389, 2617), (2617, 2665), (2665, 3000), (3000, 3166), (3166, 3239), (3239, 3309), (3309, 3361), (3361, 3562), (3562, 3674), (3674, 3717), (3717, 4137)]

Number of scenes in Romeo and Juliet : 24

Analysis

Ok now we are all set to start our analysis. First let's create a graph of the romeo and juliet from adj



In [16]:

    
from analysis import create_graph

adj_num = { speaker : { spoken : len(adj[speaker][spoken]) 
        for spoken in adj[speaker] } 
        for speaker in adj }
# create_graph uses the network x library, which addition to doing network analysis, can also draw graphs.
graph = create_graph(adj_num)

Now let's start some network analysis.

ranking characters



In [17]:

    
from analysis import get_characters_by_importance

# Important for page rank algorithmn
reciprocal_graph = create_graph(adj_num, reciprocal=True)

characters_by_importance = get_characters_by_importance(
    play_lines, 
    speaking, 
    graph,
    reciprocal_graph
)

print(characters_by_importance)









    



[('APOTHECARY', 0.01585736539475655), ('FRIAR JOHN', 0.02321550475947984), ('MUSICIAN', 0.023702921309142005), ('THIRD MUSICIAN', 0.02370984304914233), ('SECOND MUSICIAN', 0.02705325596939725), ('ABRAHAM', 0.03242844178476222), ('SECOND WATCHMAN', 0.041869755401344536), ('FIRST MUSICIAN', 0.04397455651981348), ('THIRD WATCHMAN', 0.04477726344922253), ('GREGORY', 0.049313905243656704), ('SECOND CAPULET', 0.05000591048459262), ('FIRST SERVANT', 0.053943094491733325), ('FIRST CITIZEN', 0.0579841246283659), ('LADY MONTAGUE', 0.05886208079077443), ('CHORUS', 0.060120522045912436), ('SECOND SERVANT', 0.06042591300931741), ('SAMPSON', 0.06183860496708061), ('PAGE', 0.07851049904984327), ('SERVANT', 0.0798751438042504), ('FIRST WATCHMAN', 0.08392680072758302), ('BALTHASAR', 0.12485718674269185), ('PETER', 0.1370512067688623), ('TYBALT', 0.14437695980343654), ('MONTAGUE', 0.16811136622347814), ('PARIS', 0.17308954688014092), ('PRINCE', 0.22463136951106422), ('LADY CAPULET', 0.29913785568265694), ('MERCUTIO', 0.3612351059008487), ('BENVOLIO', 0.3766634432634548), ('NURSE', 0.45508380447592245), ('CAPULET', 0.5459063866032349), ('FRIAR LAURENCE', 0.570213990909113), ('JULIET', 0.7104949203375388), ('ROMEO', 0.9907407407407407)]

How are the characters ranked? Well here's the default weight that the current alogrithm gives to each metric used to rank characters

lines_by_character , 0.625
out_degree , 0.125
page_rank , 0.125
betweeness, 0.125

lines_by_character : number of lines character speaks out_degree : the fraction of other characters this character is connected to page_rank : how many important people does this character speak to betweenness, the sum of the fraction of all-pairs shortest paths that pass through the character

By this default setting (which I can about by messing with character rankings for romeo and juiet and all's well that ends well, so take it with a grain of salt), romeo comes up on top with juliet as second.

What changes if we change the metric weights?



In [18]:

    
# order of metrics [lines_by_character, out_degree, page_rank, betweenness]
metrics_weight = [0, 0, 1, 0] # Just using page rank

characters_by_importance = get_characters_by_importance(
    play_lines, 
    speaking, 
    graph,
    reciprocal_graph,
    metrics_weight=metrics_weight
)

print(characters_by_importance)









    



[('MUSICIAN', 0.02798681675796601), ('THIRD MUSICIAN', 0.028042190677968617), ('SECOND MUSICIAN', 0.03027969011843936), ('SECOND WATCHMAN', 0.030491812273937123), ('APOTHECARY', 0.032632343637355225), ('ABRAHAM', 0.033392675890298235), ('FIRST MUSICIAN', 0.034930819508506725), ('FIRST CITIZEN', 0.035874990068413534), ('THIRD WATCHMAN', 0.03741200737591531), ('SECOND CAPULET', 0.037423170007782974), ('LADY MONTAGUE', 0.03897943499721942), ('FRIAR JOHN', 0.0424778507120043), ('SECOND SERVANT', 0.045700792997794636), ('GREGORY', 0.045927363953610904), ('SERVANT', 0.04894154551427329), ('FIRST SERVANT', 0.04964446411830971), ('SAMPSON', 0.05625568069525053), ('PETER', 0.05715050950577004), ('PAGE', 0.06534200296684048), ('CHORUS', 0.0702887951036829), ('FIRST WATCHMAN', 0.10042096000891407), ('BALTHASAR', 0.11862691083904153), ('PARIS', 0.1332299114041219), ('TYBALT', 0.14211594455303272), ('LADY CAPULET', 0.18302177449767654), ('MONTAGUE', 0.20029465308121233), ('NURSE', 0.33504290500310485), ('PRINCE', 0.34090045555782966), ('MERCUTIO', 0.3854249970129457), ('CAPULET', 0.42161757378918047), ('JULIET', 0.447962958873778), ('BENVOLIO', 0.48808276909583703), ('FRIAR LAURENCE', 0.5765064251028214), ('ROMEO', 1.0)]

I've tried and there's basically no way to get Juliet to be number 1

Moving on

Vocabulary differences

Now let's see what vocabulary female characters prefer over male characters and vice verse.



In [19]:

    
from analysis import vocab_difference

diff = vocab_difference(play_lines, gender)

# words frequented by gender 1
print("gender1", diff[:25])

# words frequented by gender 2
print("gender2", diff[-25:])









    



gender1 ['quoth', 'woful', 'mother', 'Whats', 'Lord', 'fourteen', 'ever', 'husband', 'age', 'hes', 'madam', 'words', 'counsel', 'Madam', 'Tybalts', 'Laurence', 'news', 'behold', 'thousand', 'day', 'Marry', 'Peter', 'into', 'only', 'weep']
gender2 ['blessed', 'silver', 'exile', 'thank', 'dreams', 'rough', 'flowers', 'care', 'bite', 'friends', 'hit', 'ground', 'itself', 'heads', 'tender', 'Mercutios', 'fire', 'fingers', 'maids', 'churchyard', 'reason', 'beauty', 'while', 'read', 'far']

Can you guess which is gender 1 and which is gender 2?

Bechdel test

Finally we come to the Bechdel test, how does Romeo and Juliet do on it? It does have a female character in the title, so it shouldn't do too bad.



In [25]:

    
from analysis import bechdel_test   

# First we need to reset characters by importance
characters_by_importance = get_characters_by_importance(
    play_lines, 
    speaking, 
    graph,
    reciprocal_graph
)

bechdel_scenes = bechdel_test(play_lines, characters_by_importance, adj,
            gender, act_scene_start_end)
print(bechdel_scenes)
print(len(bechdel_scenes[0]))









    



([False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, True, False, False, True], 0.16666666666666666)
24

While the play overall does pass the Bechdel test, it does so poorly with only 3 out of 24 scenes passing. This is because although Juliet is a main character, whenever females talk to each other, its likely to include references to males or marriage.