Neal Caren - University of North Carolina, Chapel Hill
Do men and women come up in different contexts in the newspaper? One quick way to answer that question is to compare the words in sentences that discuss women with the words in sentences that discuss men. Here's an example of how to do this sort of analysis using Python.
The data comes from last week's (February 27, 2013-March 6, 2013) New York Times. I downloaded all the articles available through LexisNexis, excluding only the corrections and paid obituaries. This totals 1,379 articles, or about 200 per day. Using a modified version of an old Python script, I removed all the metadata, put the text of each article in its own file, and placed all of the text files in a folder called articles. It is not the most efficient way to go about it, but sometimes text data comes that way, so I thought it would be useful to set it up that way for didactic purposes.
We begin by loading a few modules.
The only module that you might need to install is nltk, which is a powerful suite for text processing and analysis. For this analysis, I'm only using the NLTK function that splits text into sentences. glob is a useful module for retrieving the contents of a directory, and string.punctuation is just a string with all the ASCII punctuation marks, that is !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~.
In [35]:
from __future__ import division
import glob
import nltk
from string import punctuation
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
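As a quick sanity check (my own toy example, not part of the original pipeline), the Punkt tokenizer splits raw text into a list of sentences and generally handles abbreviations like 'Mr.' correctly:

# Toy example: the loaded tokenizer splits text into sentences.
sample = "Mr. Smith went to Washington. His sister stayed home."
print tokenizer.tokenize(sample)
# Should print something like:
# ['Mr. Smith went to Washington.', 'His sister stayed home.']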
The heart of the analysis will be figuring out whether a sentence is talking about a man, a woman, both, or neither. As a first pass, I'm going to assume that the sentence is talking about a man if it uses words like "he", "dad", or "Mr.", and is probably talking about a woman if it uses words like "she", "mother", or "Ms.". It isn't perfect, but depending on the text, it can be quite useful. Rather than start from scratch, I build off of Danielle Sucher's list from her Jailbreak the Patriarchy browser plugin.
In [36]:
#Two lists of words that are used when a man or woman is present, based on Danielle Sucher's https://github.com/DanielleSucher/Jailbreak-the-Patriarchy
male_words=set(['guy','spokesman','chairman',"men's",'men','him',"he's",'his','boy','boyfriend','boyfriends','boys','brother','brothers','dad','dads','dude','father','fathers','fiance','gentleman','gentlemen','god','grandfather','grandpa','grandson','groom','he','himself','husband','husbands','king','male','man','mr','nephew','nephews','priest','prince','son','sons','uncle','uncles','waiter','widower','widowers'])
female_words=set(['heroine','spokeswoman','chairwoman',"women's",'actress','women',"she's",'her','aunt','aunts','bride','daughter','daughters','female','fiancee','girl','girlfriend','girlfriends','girls','goddess','granddaughter','grandma','grandmother','herself','ladies','lady','lady','mom','moms','mother','mothers','mrs','ms','niece','nieces','priestess','princess','queens','she','sister','sisters','waitress','widow','widows','wife','wives','woman'])
I'm storing them as sets rather than lists because later on I want to look at whether or not words in a sentence overlap with these words, and Python will return the intersection of sets, but not lists.
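For example, sets have an intersection method that plain lists lack, which is exactly the operation the next function relies on. A quick illustration:

# Sets support intersection directly; lists do not.
print set(['he','said','hello']).intersection(set(['he','his','him']))
# set(['he'])
# A list such as ['he','said','hello'] has no .intersection() method.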
The function below takes a sentence's word set and returns the gender of the person being talked about, if any, based on the number of words the sentence has in common with either the male or female word lists.
In [37]:
def gender_the_sentence(sentence_words):
    mw_length=len(male_words.intersection(sentence_words))
    fw_length=len(female_words.intersection(sentence_words))

    if mw_length>0 and fw_length==0:
        gender='male'
    elif mw_length==0 and fw_length>0:
        gender='female'
    elif mw_length>0 and fw_length>0:
        gender='both'
    else:
        gender='none'
    return gender
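To see how the function behaves, here are a few made-up word sets (not drawn from the Times data):

# Made-up examples, just to show the four possible return values
print gender_the_sentence(set(['his','dog','barked']))      # 'male'
print gender_the_sentence(set(['her','dog','barked']))      # 'female'
print gender_the_sentence(set(['his','sister','laughed']))  # 'both'
print gender_the_sentence(set(['the','dog','barked']))      # 'none'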
I don't really care about proper nouns, especially people's names (e.g. it is boring that 'Boehner' is always male), so I need a way to identify them. To do that, I'm going to count how many times a word's first letter is capitalized and how many times it isn't. With a large enough text, and if you ignore the first words of sentences, this is a pretty robust way to identify proper nouns.
In [38]:
def is_it_proper(word):
    if word[0]==word[0].upper():
        case='upper'
    else:
        case='lower'

    word_lower=word.lower()
    try:
        proper_nouns[word_lower][case] = proper_nouns[word_lower].get(case,0)+1
    except Exception,e:
        #This is triggered when the word hasn't been seen yet
        proper_nouns[word_lower]= {case:1}
Note that here I'm using .get() to retrieve the values stored in the proper noun dictionary. This is one way to avoid error messages when the key isn't in the dictionary. Here, proper_nouns[word_lower].get(case,0) returns the count for that combination of word and capitalization if it has been seen before and 0 if it has not. The except is only triggered when the word hasn't been seen at all yet.
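Here is the same .get() pattern on a throwaway dictionary, just to show why no error is raised:

# Throwaway example of the .get() pattern used above
counts={'upper':3}
print counts.get('upper',0)   # 3
print counts.get('lower',0)   # 0, rather than a KeyError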
I'm going to keep track of the words in each sentence with a couple of counters. This function doesn't return anything, but it does increment the word_freq, word_counter, and sentence_counter dictionaries.
In [39]:
def increment_gender(sentence_words,gender):
    sentence_counter[gender]+=1
    word_counter[gender]+=len(sentence_words)
    for word in sentence_words:
        word_freq[gender][word]=word_freq[gender].get(word,0)+1
And so we begin. I set up the counters to store the various quantities of interest. These are the ones that are modified in the increment_gender function. Some of the values probably don't need to be entered now, particularly for the word and sentence counters, but starting with zeroes helps remind me what they are for.
In [40]:
sexes=['male','female','none','both']
sentence_counter={sex:0 for sex in sexes}
word_counter={sex:0 for sex in sexes}
word_freq={sex:{} for sex in sexes}
proper_nouns={}
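For reference, this is roughly what the freshly initialized counters look like (a quick illustration, not output from the actual run):

# sentence_counter and word_counter start at zero for each gender category;
# word_freq holds an empty dictionary per category that will map word -> count.
print sentence_counter   # {'male': 0, 'female': 0, 'none': 0, 'both': 0} (order may vary)
print word_freq['male']  # {}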
I've stored all the articles as text files in a directory called articles, and I want to grab all their names.
In [41]:
file_list=glob.glob('articles/*.txt')
The basic idea is to read each file, split it into sentences, and then process each sentence. The processing begins by splitting the sentence into words and removing punctuation. Then, for each word that doesn't begin the sentence, I record whether or not it is capitalized as part of the hunt for proper nouns. Next, I estimate whether the sentence is likely talking about a man or a woman, based on occurrences of words from the gender lists. Finally, I add each word that is used to the appropriate gender's word frequency counter. So the sentence "She is lovely." would add 'she', 'is', and 'lovely' to our count of words used when talking about a female. It would also increment the lowercase counters for 'is' and 'lovely'.
In [42]:
for file_name in file_list:
    #Open the file
    text=open(file_name,'rb').read()

    #Split into sentences
    sentences=tokenizer.tokenize(text)

    for sentence in sentences:
        #word tokenize and strip punctuation
        sentence_words=sentence.split()
        sentence_words=[w.strip(punctuation) for w in sentence_words
                        if len(w.strip(punctuation))>0]

        #figure out how often each word is capitalized
        [is_it_proper(word) for word in sentence_words[1:]]

        #lower case it
        sentence_words=set([w.lower() for w in sentence_words])

        #Figure out if there are gendered words in the sentence by computing the length of the intersection of the sets
        gender=gender_the_sentence(sentence_words)

        #Increment some counters
        increment_gender(sentence_words,gender)
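To make the pipeline concrete, here is the single made-up sentence from above walked through the same steps:

# Tracing one toy sentence through the same steps as the loop above
sentence="She is lovely."
words=[w.strip(punctuation) for w in sentence.split()
       if len(w.strip(punctuation))>0]
print words                         # ['She', 'is', 'lovely']
lowered=set([w.lower() for w in words])
print lowered                       # set(['she', 'is', 'lovely']) (order may vary)
print gender_the_sentence(lowered)  # 'female'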
After all the articles are parsed, it is time to start analyzing the word frequencies.
First, I create a set consisting of all words which were capitalized more often than not.
In [43]:
proper_nouns=set([word for word in proper_nouns if
                  proper_nouns[word].get('upper',0) /
                  (proper_nouns[word].get('upper',0) +
                   proper_nouns[word].get('lower',0))>.50])
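The rule is easier to see with toy counts (these numbers are invented for illustration): a word capitalized more often than not gets flagged as a proper noun.

# Invented counts, just to illustrate the majority-capitalization rule
toy_counts={'boehner':{'upper':12},
            'hair':{'upper':1,'lower':40}}
toy_proper=set([w for w in toy_counts if
                toy_counts[w].get('upper',0) /
                (toy_counts[w].get('upper',0) +
                 toy_counts[w].get('lower',0))>.50])
print toy_proper   # set(['boehner'])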
I don't really care about rare words, so I select the top 1,000 words, based on frequencies, from both the male and female word dictionaries. From that list, I subtract the words used to identify the sentence as either male or female along with the proper nouns.
In [44]:
common_words=set([w for w in sorted(word_freq['female'],
                                    key=word_freq['female'].get,reverse=True)[:1000]]+
                 [w for w in sorted(word_freq['male'],
                                    key=word_freq['male'].get,reverse=True)[:1000]])

common_words=list(common_words-male_words-female_words-proper_nouns)
I compute how likely each word is to appear in a male-subject sentence versus a female-subject sentence. (My first instinct was to create ratios, but they are undefined when a word is never used to talk about the sex in the denominator.) I also need to control for the fact that there is likely an imbalance in how many words are written about men and women. If 'hair' is mentioned in 10 male-subject sentences and 10 female-subject sentences, that could be taken as a sign of parity, but not if there are a total of 100 male-subject sentences (10%) and only 20 female-subject sentences (50%). I'll score 'hair' as 16.7% male, which is (10%)/(50%+10%). Later on, if we want, we can recover the ratio by computing (100-16.7)/16.7, which is 5x, the same as 50%/10%.
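Here is that hypothetical 'hair' example worked through in code (the counts are invented, as in the paragraph above):

# The hypothetical 'hair' example: mentioned in 10 of 100 male-subject
# sentences (10%) and 10 of 20 female-subject sentences (50%).
# (Relies on the __future__ division import from the first cell.)
male_rate=10/100
female_rate=10/20
score=male_rate/(male_rate+female_rate)
print '%.1f%% male' % (100*score)   # 16.7% male
print (1-score)/score               # 5.0, the same as 50%/10%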
In [45]:
male_percent={word:(word_freq['male'].get(word,0)/word_counter['male'])/
                   (word_freq['female'].get(word,0)/word_counter['female']+
                    word_freq['male'].get(word,0)/word_counter['male'])
              for word in common_words}
Based on our counters, we can print out some basic statistics about the overall rates of coverage.
In [46]:
print '%.1f%% gendered' % (100*(sentence_counter['male']+sentence_counter['female'])/
(sentence_counter['male']+sentence_counter['female']+sentence_counter['both']+sentence_counter['none']))
print '%s sentences about men.' % sentence_counter['male']
print '%s sentences about women.' % sentence_counter['female']
print '%.1f sentences about men for each sentence about women.' % (sentence_counter['male']/sentence_counter['female'])
Finally, I print out the words that are disproportionately found in the male and female subject sentences. For the 50 most distinctive female and male words, I print the ratio of gendered percentages along with the count of the number of male-subject and female-subject sentences that contained the word. This script isn't particularly pretty, but it gets the job done.
In [47]:
header ='Ratio\tMale\tFemale\tWord'

print 'Male words'
print header
for word in sorted(male_percent,key=male_percent.get,reverse=True)[:50]:
    try:
        ratio=male_percent[word]/(1-male_percent[word])
    except:
        ratio=100
    print '%.1f\t%02d\t%02d\t%s' % (ratio,word_freq['male'].get(word,0),word_freq['female'].get(word,0),word)

print '\n'*2
print 'Female words'
print header
for word in sorted(male_percent,key=male_percent.get,reverse=False)[:50]:
    try:
        ratio=(1-male_percent[word])/male_percent[word]
    except:
        ratio=100
    print '%.1f\t%01d\t%01d\t%s' % (ratio,word_freq['male'].get(word,0),word_freq['female'].get(word,0),word)
My quick interpretation: if your knowledge of men's and women's roles in society came just from reading last week's New York Times, you would think that men play sports and run the government, while women do feminine and domestic things. To be honest, I was a little shocked at how stereotypical the words used in the female-subject sentences were.
Now, this is only data from one week, and certainly some of the findings are driven by that. Coverage of suffrage, for example, was presumably driven by the 100th anniversary of the Woman Suffrage Procession. Similarly, the male list is also tied to recent news events, as one would expect from data from a newspaper. These lists also just report the extreme words, many of which were only used in a handful of articles. A more rigorous analysis would probably look at the complete distribution of words.
I should also add that after I ran this analysis for the first time, I noticed a few words, like 'spokesman' and 'actress', that should have been included on the original lists.
If you wanted to output the full table, you could easily write it to a tab delimited file.
In [48]:
outfile_name='gender.tsv'
tsv_outfile=open(outfile_name,'wb')

header='percent_male\tmale_count\tfemale_count\tword\n'
tsv_outfile.write(header)

for word in common_words:
    row = '%.2f\t%01d\t%01d\t%s\n' % (100*male_percent[word],word_freq['male'].get(word,0),word_freq['female'].get(word,0),word)
    tsv_outfile.write(row)

tsv_outfile.close()
As an addendum, we can look at the most popular words. In this case, we will look at the 100 most frequently used words, and then compare what proportion of male subject sentences had those words and what proportion of female subject sentences had those words.
In [49]:
all_words=[w for w in word_freq['none']]+[w for w in word_freq['both']]+[w for w in word_freq['male']]+[w for w in word_freq['female']]
all_words={w:(word_freq['male'].get(w,0)+word_freq['female'].get(w,0)+word_freq['both'].get(w,0)+word_freq['none'].get(w,0)) for w in set(all_words)}

print 'word\tMale\tFemale'
for word in sorted(all_words,key=all_words.get,reverse=True)[:100]:
    print '%s\t%.1f%%\t%.1f%%' % (word,100*word_freq['male'].get(word,0)/sentence_counter['male'],100*word_freq['female'].get(word,0)/sentence_counter['female'])
While there are a couple of interesting findings here, for the most part the basic building blocks of sentences appear fairly similarly in the male and female subject sentences. Now, this is just based on word frequencies, and a more nuanced examination would probably uncover additional findings of interest. For example, my guess is that 'work', near the bottom of this list, is used not only more frequently in the female-subject sentences, but also in a different context and as a different part of speech (e.g. 'men work', 'women juggle home and work responsibilities'). Comparing word frequencies only gets you so far, but it is a pretty quick and easy way to conduct some preliminary data analysis.