crunch-shake is a library for analyzing plays and scripts for gender disparities. Given a script, you first parse it into the format the library expects. Then you can do fun stuff like finding the words most commonly used by female or male characters, running network analysis to see who the most important characters are, creating a graph of the play, and even running the Bechdel test.
First, let's take a look at the play we will be parsing: Romeo and Juliet by William Shakespeare. Ever wanted to know who is the more important of the romantic duo, Romeo or Juliet? (Hint: the answer does not dispel any notion that we live in a patriarchy.) I've taken the play from MIT's website.
In [1]:
from utils import file_to_list
romeo_juliet_raw = file_to_list("plays/romeo_and_juliet_entire_play.html")
# Showing the beginning
for line in romeo_juliet_raw[:10]:
    print(line, end="")
So obviously there's some stuff here that's not really relevant to us; let's look at some lines from the middle of the play.
In [2]:
# Showing the middle portion
for line in romeo_juliet_raw[2992:3007]:
    print(line, end="")
In addition to dialogue, we also have to watch out for act and scene information.
In [3]:
for line in romeo_juliet_raw[3315:3317]:
    print(line, end="")
We also have to handle stage directions that say when characters enter and exit. These can occur between speeches or within a speech (indicating that a character should enter or exit while another is speaking).
In [4]:
# in between dialogue
for line in romeo_juliet_raw[1778:1782]:
    print(line, end="")
In [5]:
# within a dialogue
for line in romeo_juliet_raw[3257:3273]:
    print(line, end="")
So this is the text I'm aiming to parse. Luckily, regular expressions are well suited to this task. For this particular play, I've prepared the matchers, found in mit_shakespeare_regex.py. Let's go ahead and try them out.
In [6]:
from mit_shakespeare_regex import matcher
line1 = romeo_juliet_raw[1943]
print(line1)
In [7]:
# Since line 1 is a piece of dialogue, matcher.dialogue should return an object when it searches the line
matcher.dialogue.search(line1)
Out[7]:
In [8]:
# Since this line does not indicate which character is speaking, it should return None (so nothing)
matcher.character.search(line1)
In [9]:
# A line that matcher.character will match
line2 = romeo_juliet_raw[1935]
matcher.character.search(line2)
Out[9]:
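For readers wondering what such matchers might look like, here is a minimal sketch. The HTML shapes of the sample lines and all of the pattern names (`character_re`, `dialogue_re`, `stage_re`) are assumptions made for illustration; they are not the contents of mit_shakespeare_regex.py.

```python
import re

# Hypothetical patterns for illustration; NOT the ones in mit_shakespeare_regex.py.
character_re = re.compile(r"<A NAME=speech\d+><b>(?P<name>[^<]+)</b>", re.IGNORECASE)
dialogue_re = re.compile(
    r"<A NAME=(?P<act>\d+)\.(?P<scene>\d+)\.(?P<line>\d+)>(?P<text>[^<]*)</A>",
    re.IGNORECASE,
)
stage_re = re.compile(r"<i>(?P<direction>[^<]+)</i>", re.IGNORECASE)

# Sample lines written in the assumed markup
sample_character = "<A NAME=speech1><b>ROMEO</b></a>"
sample_dialogue = "<A NAME=2.2.2>But, soft! what light through yonder window breaks?</A><br>"
sample_stage = "<p><i>Enter ROMEO</i></p>"

print(character_re.search(sample_character).group("name"))
print(dialogue_re.search(sample_dialogue).group("text"))
# A dialogue line carries no speaker name, so the character pattern finds nothing
print(character_re.search(sample_dialogue))
print(stage_re.search(sample_stage).group("direction"))
```

Named groups keep the downstream parsing code readable: the parser can ask for `group("name")` or `group("text")` instead of relying on group positions.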
The last thing we need before we begin is a gender file specifying the gender of each character in the play. This has to be done by hand.
In [10]:
from utils import json_file_to_dict
gender = json_file_to_dict("plays/romeo_and_juliet_entire_play_gender.json")
print(gender)
Now we have everything necessary to start using crunch-shake to parse the text. First we need to get the speaking characters in the text. (I get them directly from the play. You might be wondering why I don't just use the gender file. Well, I actually used get_speaking_characters to generate the gender file.)
In [11]:
from parse import get_speaking_characters
speaking = get_speaking_characters(romeo_juliet_raw, matcher.character)
print(speaking)
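Generating a gender file to fill in by hand could look roughly like the sketch below. It assumes the gender file is a flat JSON object mapping each character name to a label and that the speaking characters form a plain collection of names; neither is confirmed here, and `write_gender_template` is a hypothetical helper, not part of crunch-shake.

```python
import json
import os
import tempfile


def write_gender_template(characters, path, placeholder="UNKNOWN"):
    """Write a JSON object mapping each character name to a placeholder label."""
    template = {name: placeholder for name in sorted(characters)}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(template, handle, indent=2)
    return template


# Toy stand-in for the output of get_speaking_characters (assumed shape)
toy_speaking = {"ROMEO", "JULIET", "NURSE"}
template_path = os.path.join(tempfile.mkdtemp(), "gender_template.json")
template = write_gender_template(toy_speaking, template_path)
print(template)
```

Each placeholder is then replaced by hand with the appropriate label.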
In [12]:
from parse import parse_raw_text
play_lines = parse_raw_text(romeo_juliet_raw, speaking, matcher)
for line in play_lines[:20]:
    print(line)
Now that we have the play in a format our library can understand, let's move on to the processing part. process extracts useful information from the play that will be used in our analysis. The first piece of information is the 'adj' object, which gives the line numbers of the play lines in which one character spoke in the presence of another. The other object, 'act_scene_start_end', gives the starting and ending line number of each scene (start inclusive, end exclusive).
In [13]:
from process import process
adj, act_scene_start_end = process(speaking, play_lines)
# adj gives the line numbers where one character spoke in the presence of another.
# Let's see all the times when Romeo said something in the presence of Juliet.
romeo_to_juliet = adj['ROMEO']['JULIET']
print(romeo_to_juliet)
print()
print("Number of times Romeo said something in the presence of Juliet :", len(romeo_to_juliet))
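To make "spoke in the presence of" concrete, here is a toy sketch that tracks who is on stage and records the index of every speech for each listener present. The `(kind, name)` event tuples and the function `build_presence_adjacency` are hypothetical; they are neither the library's parsed format nor the implementation of process.

```python
from collections import defaultdict


def build_presence_adjacency(events):
    """Map speaker -> listener -> list of event indices at which the speaker
    spoke while the listener was on stage."""
    on_stage = set()
    adjacency = defaultdict(lambda: defaultdict(list))
    for index, (kind, name) in enumerate(events):
        if kind == "enter":
            on_stage.add(name)
        elif kind == "exit":
            on_stage.discard(name)
        elif kind == "speak":
            for listener in on_stage - {name}:
                adjacency[name][listener].append(index)
    return adjacency


toy_events = [
    ("enter", "ROMEO"),
    ("speak", "ROMEO"),   # nobody else is on stage, so nothing is recorded
    ("enter", "JULIET"),
    ("speak", "ROMEO"),   # index 3: Romeo speaks with Juliet present
    ("speak", "JULIET"),  # index 4: Juliet speaks with Romeo present
    ("exit", "JULIET"),
    ("speak", "ROMEO"),   # Juliet has left, so nothing is recorded
]
toy_adj = build_presence_adjacency(toy_events)
print(toy_adj["ROMEO"]["JULIET"])   # [3]
print(toy_adj["JULIET"]["ROMEO"])   # [4]
```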
In [14]:
# Exercise: Replace None with the correct numerical value
print("Number of times Juliet said something in the presence of Romeo :", None)
In [15]:
# Gives the starting line and the ending line + 1 for each scene
print(act_scene_start_end)
print()
print("Number of scenes in Romeo and Juliet :", len(act_scene_start_end))
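One nice property of (inclusive, exclusive) ranges is that they plug straight into Python slices and never overlap. The sketch below uses made-up data and assumes each scene is stored as a `(start, end)` pair; the real act_scene_start_end may be structured differently.

```python
# Toy data; the real act_scene_start_end may be structured differently.
toy_lines = ["line 0", "line 1", "line 2", "line 3", "line 4"]
toy_scene_ranges = [(0, 2), (2, 5)]  # (start inclusive, end exclusive)

# A half-open range maps directly onto a slice
scenes = [toy_lines[start:end] for start, end in toy_scene_ranges]
# The length of each scene is simply end - start, with no off-by-one correction
scene_lengths = [end - start for start, end in toy_scene_ranges]

print(scenes)
print(scene_lengths)
```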
In [16]:
from analysis import create_graph
adj_num = {
    speaker: {spoken: len(adj[speaker][spoken]) for spoken in adj[speaker]}
    for speaker in adj
}
# create_graph uses the NetworkX library, which, in addition to doing network analysis, can also draw graphs.
graph = create_graph(adj_num)
In [17]:
from analysis import get_characters_by_importance
# Important for the PageRank algorithm
reciprocal_graph = create_graph(adj_num, reciprocal=True)
characters_by_importance = get_characters_by_importance(
    play_lines,
    speaking,
    graph,
    reciprocal_graph
)
print(characters_by_importance)
How are the characters ranked? The current algorithm gives a default weight to each of the following metrics:
lines_by_character : the number of lines the character speaks
out_degree : the fraction of other characters this character is connected to
page_rank : how many important people this character speaks to
betweenness : the sum of the fraction of all-pairs shortest paths that pass through the character
By this default setting (which I came up with by messing with the character rankings for Romeo and Juliet and All's Well That Ends Well, so take it with a grain of salt), Romeo comes out on top, with Juliet second.
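Of the four metrics, out_degree is the easiest to work out by hand. The sketch below computes it on toy data in plain Python, assuming the definition given above (number of characters spoken to, divided by the number of other characters). The function name is hypothetical, and the library itself works through NetworkX rather than through code like this.

```python
def out_degree_fraction(adj_num):
    """Fraction of the other characters that each character speaks in front of."""
    characters = set(adj_num)
    for listeners in adj_num.values():
        characters.update(listeners)
    others = len(characters) - 1
    return {
        name: (
            sum(1 for count in adj_num.get(name, {}).values() if count > 0) / others
            if others
            else 0.0
        )
        for name in characters
    }


# Made-up counts in the same nested shape as adj_num
toy_adj_num = {
    "ROMEO": {"JULIET": 5, "NURSE": 2},
    "JULIET": {"ROMEO": 4},
    "NURSE": {},
}
toy_out_degree = out_degree_fraction(toy_adj_num)
print(toy_out_degree["ROMEO"], toy_out_degree["JULIET"], toy_out_degree["NURSE"])
```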
What changes if we change the metric weights?
In [18]:
# order of metrics [lines_by_character, out_degree, page_rank, betweenness]
metrics_weight = [0, 0, 1, 0] # Just using page rank
characters_by_importance = get_characters_by_importance(
    play_lines,
    speaking,
    graph,
    reciprocal_graph,
    metrics_weight=metrics_weight
)
print(characters_by_importance)
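To see why the weights matter, here is one plausible way a weight vector could combine several metrics into a single ranking: scale each metric by its maximum, take the weighted sum, and sort. This is only a sketch with made-up numbers; it is not confirmed to be how get_characters_by_importance works, and `rank_by_weighted_metrics` is a hypothetical name.

```python
def rank_by_weighted_metrics(metrics, weights):
    """metrics: {character: [m1, m2, ...]}; weights: one weight per metric."""
    n_metrics = len(weights)
    # Scale each metric by its maximum so that no metric dominates through units alone
    maxima = [
        max(values[i] for values in metrics.values()) or 1 for i in range(n_metrics)
    ]
    scores = {
        name: sum(w * values[i] / maxima[i] for i, w in enumerate(weights))
        for name, values in metrics.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Made-up numbers in the order [lines_by_character, out_degree, page_rank, betweenness]
toy_metrics = {
    "A": [100, 0.9, 0.20, 0.30],
    "B": [80, 0.7, 0.25, 0.10],
    "C": [20, 0.3, 0.05, 0.00],
}
equal_ranking = rank_by_weighted_metrics(toy_metrics, [1, 1, 1, 1])
page_rank_only = rank_by_weighted_metrics(toy_metrics, [0, 0, 1, 0])
print(equal_ranking)    # ['A', 'B', 'C']
print(page_rank_only)   # ['B', 'A', 'C']
```

In this toy example, B only overtakes A when every metric except page rank is switched off.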
I've tried, and there's basically no way to get Juliet to be number 1.
In [19]:
from analysis import vocab_difference
diff = vocab_difference(play_lines, gender)
# words frequented by gender 1
print("gender1", diff[:25])
# words frequented by gender 2
print("gender2", diff[-25:])
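A ranking like this can be produced by comparing how often each gender uses each word. The sketch below is a rough illustration on toy data: it assumes spoken lines are available as `(speaker, text)` pairs and that gender labels are simple strings, and it is not the implementation of vocab_difference.

```python
import re
from collections import Counter


def vocab_difference_sketch(spoken_lines, gender, first, second):
    """Rank words by (relative frequency in `first`) - (relative frequency in `second`)."""
    counts = {first: Counter(), second: Counter()}
    for speaker, text in spoken_lines:
        label = gender.get(speaker)
        if label in counts:
            counts[label].update(re.findall(r"[a-z']+", text.lower()))
    totals = {label: sum(c.values()) or 1 for label, c in counts.items()}
    vocabulary = set(counts[first]) | set(counts[second])
    diff = {
        word: counts[first][word] / totals[first] - counts[second][word] / totals[second]
        for word in vocabulary
    }
    # Words leaning most towards `first` come first, those leaning towards `second` last
    return sorted(diff.items(), key=lambda item: (-item[1], item[0]))


toy_gender = {"X": "F", "Y": "M"}
toy_spoken = [("X", "night night love"), ("Y", "sword sword love")]
toy_diff = vocab_difference_sketch(toy_spoken, toy_gender, "F", "M")
print(toy_diff[0][0], toy_diff[-1][0])
```

Using relative rather than absolute frequencies keeps the comparison fair when one group speaks far more lines than the other.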
Finally, we come to the Bechdel test. How does Romeo and Juliet do on it? It does have a female character in the title, so it shouldn't do too badly.
In [25]:
from analysis import bechdel_test
# First we need to reset characters by importance
characters_by_importance = get_characters_by_importance(
    play_lines,
    speaking,
    graph,
    reciprocal_graph
)
bechdel_scenes = bechdel_test(play_lines, characters_by_importance, adj,
                              gender, act_scene_start_end)
print(bechdel_scenes)
print(len(bechdel_scenes[0]))
While the play overall does pass the Bechdel test, it does so poorly, with only 3 out of 24 scenes passing. This is because, although Juliet is a main character, whenever female characters talk to each other, the conversation is likely to include references to males or marriage.
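For intuition, a per-scene check in the spirit of the Bechdel test can be sketched as follows: a scene passes if a woman says at least one line to another woman that avoids a list of male-related words. The `(speaker, listeners, text)` tuples, the "F"/"M" labels, the word list and the function name are all assumptions made for this toy; bechdel_test itself takes different inputs and may apply different criteria.

```python
import re


def scene_passes_bechdel(scene, gender, male_words):
    """True if some line is spoken by a woman to at least one other woman
    and contains none of the words in `male_words`."""
    for speaker, listeners, text in scene:
        if gender.get(speaker) != "F":
            continue
        if not any(gender.get(name) == "F" for name in listeners if name != speaker):
            continue
        words = set(re.findall(r"[a-z']+", text.lower()))
        if not words & male_words:
            return True
    return False


toy_gender = {"JULIET": "F", "NURSE": "F", "ROMEO": "M"}
# Male character names plus a few hand-picked terms (an arbitrary choice for this toy)
toy_male_words = {"romeo", "he", "him", "husband", "marriage"}

scene_a = [
    ("NURSE", ["JULIET"], "Romeo waits for you"),
    ("JULIET", ["NURSE"], "What says he of our marriage"),
]
scene_b = [("JULIET", ["NURSE"], "The clock struck nine")]

print(scene_passes_bechdel(scene_a, toy_gender, toy_male_words))  # False
print(scene_passes_bechdel(scene_b, toy_gender, toy_male_words))  # True
```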