Analyzing Beauty and the Beast with Python

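In the recently published paper A quantitative analysis of gendered compliments in Disney Princess films, Carmen Fought and Karen Eisenhauer found that female characters have a larger share of the dialogue in this Disney classic than in Disney's more recent films. The script of Beauty and the Beast is available online, so I immediately redid their analysis in Python.
In addition, at the end of this article I added an analysis of Toy Story. That script is formatted completely differently, and 91% of its dialogue comes from male characters.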

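Click a cell below, then click the Run icon in the toolbar above to execute the code block and see its output. While a cell is running, the In [ ] marker to its left changes to In [*].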


In [ ]:
from __future__ import division

import re
from collections import defaultdict

import requests
import pandas as pd
import matplotlib

%matplotlib inline
matplotlib.style.use('ggplot')

In [ ]:
# Load the script which comes as a text file

script_url = 'http://www.fpx.de/fp/Disney/Scripts/BeautyAndTheBeast.txt'
script = requests.get(script_url).text

Let's look at the beginning of the script:


In [ ]:
# Let's look at the beginning of the script

script.splitlines()[:20]

Then pick an arbitrary passage from the middle:


In [ ]:
# Let's look at a random place

script.splitlines()[500:520]

This looks easy to parse, because the speaker and the dialogue are separated by a colon (:).


In [ ]:
# seems fairly easy to parse since 
# each new speaking line has : and begins with all caps

def remove_spaces(line):
    # remove the weird spaces
    return re.sub(' +',' ',line)

def remove_paren(line):
    # remove directions that are not spoken
    return re.sub(r'\([^)]*\)', '', line)


lines = []
line = ''
for row in script.splitlines():
    if ': ' in row and row[:3].upper() == row[:3]:
        line = remove_spaces(line)
        line = remove_paren(line)
        lines.append(line)
        line = row
    elif '          ' in row:
        line = line + ' ' + row.lstrip()
# don't forget the last line; clean it the same way as the others
line = remove_spaces(line)
line = remove_paren(line)
lines.append(line)
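
To see what the parsing loop does without downloading anything, here is a minimal sketch that applies the same two conditions to a made-up three-row sample. The function name `parse_script` and the sample text are illustrative assumptions, not part of the original notebook, and the sketch strips stage directions before collapsing spaces so that no double space is left behind.

```python
import re

def remove_spaces(line):
    # collapse runs of spaces into a single space
    return re.sub(' +', ' ', line)

def remove_paren(line):
    # drop parenthesised stage directions
    return re.sub(r'\([^)]*\)', '', line)

def parse_script(text):
    """Group raw rows into one cleaned string per speaking line (hypothetical helper)."""
    parsed = []
    current = ''
    for row in text.splitlines():
        if ': ' in row and row[:3].upper() == row[:3]:
            # a new speaker starts: flush the previous speaking line first
            if current:
                parsed.append(remove_spaces(remove_paren(current)).strip())
            current = row
        elif ' ' * 10 in row:
            # rows containing a long run of spaces continue the current dialogue
            current = current + ' ' + row.lstrip()
    if current:
        parsed.append(remove_spaces(remove_paren(current)).strip())
    return parsed

# Made-up sample, not taken from the real script
sample = (
    "BELLE:    (smiling) Good morning!\n"
    "          How are you today?\n"
    "GASTON:   I am   fine.\n"
)
parsed_sample = parse_script(sample)
print(parsed_sample)
```

Skipping the flush while `current` is still empty means no blank entry is produced at the start, so a later clean-up pass would have nothing to remove for this sample.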

In [ ]:
lines[:15]

Let's see what the end looks like:


In [ ]:
# How does the end look

lines[-5:]

In [ ]:
# Remove any empty lines

print (len(lines))
lines = [l for l in lines if len(l) > 0]
print (len(lines))

Now we find all the roles and count how many times each one appears (the number of speaking lines):


In [ ]:
# now figure out the roles and how many times they appear

roles = defaultdict(int)

for line in lines:
    # take advantage of the fact that the speaker is always listed before the :
    speaker = line.split(':')[0]
    roles[speaker] = roles[speaker] + 1
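
As an alternative to the `defaultdict` loop, the same tally can be sketched with `collections.Counter`, which also offers `most_common()` for a sorted view. The sample lines below are made up for illustration and only mimic the 'SPEAKER: dialogue' shape.

```python
from collections import Counter

# Made-up lines in the same 'SPEAKER: dialogue' shape
sample_lines = [
    'BELLE: Good morning!',
    'GASTON: Hello, Belle.',
    'BELLE: Bonjour, Gaston.',
]

# split(':', 1) cuts only at the first colon, so the speaker is element 0
speaker_counts = Counter(l.split(':', 1)[0] for l in sample_lines)

# most_common() returns (speaker, count) pairs, highest count first
print(speaker_counts.most_common())
```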

In [ ]:
len(roles)

Let's look at how often each role speaks:


In [ ]:
# take a look at the relative frequency of each role
roles

It looks like one stray line ("to think about") slipped in because it happens to satisfy the parsing condition. We ignore it for now.


In [ ]:
# Looks like there is one bum line ('to think about')
# But I'll ignore that for now.

# Quickly eye ball which roles are female and which are possibly mixed groups.

females = ['WOMAN 1',
           'WOMAN 2',
           'WOMAN 3',
           'WOMAN 4',
           'WOMAN 5',
           'OLD CRONIES',
           'MRS. POTTS',
           'BELLE',
           'BIMBETTE 1',
           'BIMBETTE 2',
           'BIMBETTE 3']

groups = ['MOB',
          'ALL',
          'BOTH']

Label each line of dialogue by the speaker's sex and count the lines for each sex:


In [ ]:
# Mark each line of dialogue by sex and count them

sex_lines = {'Male':   0,
             'Female': 0}

for line in lines:
    # Extract speaker 
    speaker = line.split(':')[0]
    
    if speaker in females:
        sex_lines['Female'] += 1
        
    elif speaker not in groups:
        sex_lines['Male'] += 1

print (sex_lines)
print (sex_lines['Male']/(sex_lines['Male'] + sex_lines['Female']))
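
A quick way to check the classification logic is to run it on a handful of made-up lines where the expected counts are obvious. The sketch below wraps the counting in a hypothetical helper, `count_lines_by_sex`, which is not part of the original notebook. Speakers listed in `groups` are meant to be left out of both totals.

```python
def count_lines_by_sex(lines, females, groups):
    """Count speaking lines per sex, skipping mixed-group speakers (hypothetical helper)."""
    counts = {'Male': 0, 'Female': 0}
    for line in lines:
        speaker = line.split(':', 1)[0]
        if speaker in females:
            counts['Female'] += 1
        elif speaker not in groups:
            # anyone who is neither female nor a group is treated as male
            counts['Male'] += 1
    return counts

# Made-up sample lines, chosen so the result is easy to verify by eye
sample_lines = [
    'BELLE: Hello.',
    'GASTON: Hello.',
    'MOB: We are a crowd.',
    'LUMIERE: Welcome.',
]
counts = count_lines_by_sex(sample_lines, females={'BELLE'}, groups={'MOB'})
male_share = counts['Male'] / (counts['Male'] + counts['Female'])
print(counts, male_share)
```

The 'MOB' line contributes to neither total, so the male share is computed over three lines rather than four.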

We show the result in a chart:


In [ ]:
# Quick graphical representation 

# a list with one dict becomes a one-row frame whose columns are the dict keys
df = pd.DataFrame([sex_lines])
df.plot(kind='bar')

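Maybe male and female characters differ noticeably in how long they speak? Let's take a look.
This time we count the words in the dialogue rather than the number of lines: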


In [ ]:
# Maybe men and women talk for different lengths? This counts words instead of lines

sex_words = {'Male':   0,
             'Female': 0}

for line in lines:
    speaker = line.split(':')[0]
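    # split only at the first colon so colons inside the dialogue are kept
    dialogue = line.split(':', 1)[1]
    # tokenize on any whitespace; split() ignores leading and repeated spaces
    word_count = len(dialogue.split())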
                    
    if speaker in females:
        sex_words['Female'] += word_count
    elif speaker not in groups:
        sex_words['Male'] += word_count

print (sex_words)
print (sex_words['Male']/(sex_words['Male'] + sex_words['Female']))
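
How the dialogue is tokenized affects the word totals. `str.split(' ')` and `str.split()` can give different counts when the text has a leading space or repeated spaces, which is common after splitting on the colon. A small illustration on a made-up string:

```python
# Made-up dialogue with a leading space and a double space
dialogue = ' Good  morning, Papa!'

# split(' ') keeps empty strings for every extra space
print(dialogue.split(' '))   # ['', 'Good', '', 'morning,', 'Papa!']

# split() with no argument splits on runs of whitespace and drops empties
print(dialogue.split())      # ['Good', 'morning,', 'Papa!']
```

With `split(' ')` every line that starts with a space is counted as at least one word too long, so `split()` is the safer choice for word counts.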

And plot this as well:


In [ ]:
# Quick graphical representation 

# a list with one dict becomes a one-row frame whose columns are the dict keys
df = pd.DataFrame([sex_words])
df.plot(kind='bar')

Below is the bonus Toy Story analysis:


In [ ]:
# Bonus toy story analysis

url = 'http://www.dailyscript.com/scripts/toy_story.html'
toy_story_script = requests.get(url).text

# toy_story_script.splitlines()[250:350]

lines = []
speaker = ''
dialogue = ''
for row in toy_story_script.splitlines()[90:]:
    if '                     ' in row: 
        if ':' not in speaker:
            lines.append( {'Speaker': remove_paren(speaker).strip(),
                           'Dialogue': remove_paren(dialogue).strip() } )
        
        speaker = remove_spaces(row.strip())
        dialogue = ''
    elif '            ' in row:
        dialogue = dialogue + ' ' + remove_spaces(row)
lines.append( {'Speaker': remove_paren(speaker).strip(),
               'Dialogue': remove_paren(dialogue).strip() } )

roles = defaultdict(int)

for line in lines:
    speaker = line['Speaker']
    roles[speaker] = roles[speaker] + 1

toy_story_df = pd.DataFrame(lines[1:])
toy_story_df.head()

toy_story_df.Speaker.value_counts()

def what_sex(speaker):
    if speaker in ["SID'S MOM", 'MRS. DAVIS', 'HANNAH', 'BO PEEP']:
        return 'Female'
    return 'Male'

toy_story_df['Sex'] = toy_story_df['Speaker'].apply(what_sex)

sex_df = toy_story_df.groupby('Sex').size()
sex_df.plot(kind='bar')
sex_df
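
The Toy Story script has no 'SPEAKER:' prefix, so the parsing relies on indentation instead: deeply indented rows name the speaker, and less deeply indented rows carry the dialogue. The following is a minimal sketch of that idea on a made-up sample. The function name `parse_indented_script`, its default indent widths, and the sample text are assumptions for illustration and are not guaranteed to match the real file.

```python
import re

def parse_indented_script(text, speaker_indent=21, dialogue_indent=12):
    """Return {'Speaker', 'Dialogue'} dicts from an indentation-based script (hypothetical helper)."""
    records = []
    speaker, dialogue = None, ''
    for row in text.splitlines():
        if row.startswith(' ' * speaker_indent):
            # a new speaker row: store the previous speaker's dialogue first
            if speaker is not None:
                records.append({'Speaker': speaker, 'Dialogue': dialogue.strip()})
            # drop parenthesised notes such as (O.S.) from the speaker name
            speaker = re.sub(r'\([^)]*\)', '', row).strip()
            dialogue = ''
        elif row.startswith(' ' * dialogue_indent) and speaker is not None:
            # a dialogue row: append it with runs of spaces collapsed
            dialogue += ' ' + re.sub(' +', ' ', row.strip())
    if speaker is not None:
        records.append({'Speaker': speaker, 'Dialogue': dialogue.strip()})
    return records

# Made-up sample in an assumed layout, not taken from the real script
sample = (
    ' ' * 21 + 'WOODY\n'
    + ' ' * 12 + 'Hello there,\n'
    + ' ' * 12 + 'partner.\n'
    + ' ' * 21 + 'BUZZ (O.S.)\n'
    + ' ' * 12 + 'Greetings.\n'
)
records = parse_indented_script(sample)
print(records)
```

Starting with `speaker = None` avoids emitting an empty first record, so the resulting list can be passed to `pd.DataFrame` without slicing off the first element.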

In [ ]:
def word_count(dialogue):
    return len(dialogue.split())

toy_story_df['Word Count'] = toy_story_df['Dialogue'].apply(word_count)

word_df = toy_story_df.groupby('Sex')['Word Count'].sum()
word_df.plot(kind='bar')
word_df
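
To turn grouped totals like these into shares, one option is to divide the grouped Series by its own sum. The sketch below uses a small made-up frame with the same column names as `toy_story_df`. The numbers are invented for illustration and are not results from the script.

```python
import pandas as pd

# Made-up rows with the same column names as toy_story_df
sample_df = pd.DataFrame({
    'Sex': ['Male', 'Male', 'Female', 'Male'],
    'Word Count': [10, 20, 5, 15],
})

# total words per sex, then each total as a fraction of all words
words_by_sex = sample_df.groupby('Sex')['Word Count'].sum()
share_by_sex = words_by_sex / words_by_sex.sum()
print(share_by_sex)
```

The same division works on the line counts from `groupby('Sex').size()`.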