Words Associated with each Gender (through PMI)

In this notebook we compute PMI scores for the vocabulary obtained in the previous notebook.

By Eduardo Graells-Garrido.


In [1]:
from __future__ import print_function, unicode_literals, division
from cytoolz.dicttoolz import valmap
from collections import Counter

import json
import gzip
import numpy as np
import pandas as pd
import dbpedia_config
import dbpedia_config

In [2]:
target_folder = dbpedia_config.TARGET_FOLDER

First, we load a list of English stopwords. We also add some stopwords that we found in the dataset while exploring word frequency.

Note that the stopwords are stored in the file stopwords_en.txt in our target folder (in the case of the English edition; the filename depends on the configured language).


In [3]:
with open('{0}/stopwords_{1}.txt'.format(target_folder, dbpedia_config.MAIN_LANGUAGE), 'r') as f:
    stopwords = f.read().split()

stopwords.extend('Monday Tuesday Wednesday Thursday Friday Saturday Sunday'.lower().split())
stopwords.extend('January February March April May June July August September October November December'.lower().split())
stopwords.extend('one two three four five six seven eight nine ten'.lower().split())

len(stopwords)


Out[3]:
252

We also load our person data.


In [4]:
person_data = pd.read_csv('{0}/person_data_en.csv.gz'.format(target_folder), encoding='utf-8', index_col='uri')


Out[4]:
wikidata_entity class gender edition_count available_english available_editions birth_year death_year same_as label
uri
http://dbpedia.org/resource/Melanie_Paschke Q452439 http://dbpedia.org/ontology/Athlete female 8 True en|fr|de|it|sv|wikidata|cs|pl 1970 NaN http://fr.dbpedia.org/resource/Melanie_Paschke Melanie Paschke
http://dbpedia.org/resource/A._Laurent Q278894 http://dbpedia.org/ontology/Person male 15 True en|el|eo|hy|zh|pt|no|de|fr|wikidata|pl|uk|nds|... NaN NaN http://pt.dbpedia.org/resource/A._Laurent A. Laurent
http://dbpedia.org/resource/Addison_S._McClure Q24217 http://dbpedia.org/ontology/Politician male 3 True en|de|wikidata 1839 1903 http://wikidata.dbpedia.org/resource/Q24217 Addison S. McClure
http://dbpedia.org/resource/Pina_Conti Q7194774 http://dbpedia.org/ontology/Person female 2 True en|wikidata NaN NaN http://wikidata.dbpedia.org/resource/Q7194774 Pina Conti
http://dbpedia.org/resource/Ri_Kum-suk Q274956 http://dbpedia.org/ontology/Athlete female 5 True en|fr|de|wikidata|ja 1978 NaN http://fr.dbpedia.org/resource/Ri_Kum-suk Ri Kum-suk

In [11]:
N = person_data.gender.value_counts()
N


Out[11]:
male                  777429
female                142381
transgender female         2
Name: gender, dtype: int64

Next, we load our vocabulary. We consider only words that appear in biographies of both genders, so that comparing their association with each gender makes sense.


In [8]:
with gzip.open('{0}/vocabulary.json.gz'.format(target_folder), 'rb') as f:
    vocabulary = valmap(Counter, json.load(f))

common_words = list(set(vocabulary['male'].keys()) & set(vocabulary['female'].keys()))
len(common_words)


Out[8]:
277999

In [20]:
def word_iter():
    for w in common_words:
        if w in stopwords:
            continue
        yield {'male': vocabulary['male'][w], 'female': vocabulary['female'][w], 'word': w}

words = pd.DataFrame.from_records(word_iter(), index='word')

Now we estimate PMI. Recall that PMI is:

$$\mbox{PMI}(c, w) = \log \frac{p(c, w)}{p(c) p(w)}$$

where $c$ is a class (a gender) and $w$ is a word (or bigram, in our case). To normalize PMI we divide by $-\log p(c, w)$.
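As a sanity check, the normalized PMI for a single word can be computed by hand. This sketch uses the counts from the "biennials" row shown later in this notebook (7 female, 36 male occurrences) and the gender totals from the value counts above; the two transgender-female biographies are omitted from the total here, which changes the result only negligibly.

```python
import numpy as np

# Counts mirror the "biennials" row (7 female, 36 male occurrences);
# class sizes come from the gender value counts above (transgender-female
# entries omitted, which barely affects the denominators).
counts = {'male': 36, 'female': 7}
class_sizes = {'male': 777429, 'female': 142381}

total = sum(class_sizes.values())
p_w = sum(counts.values()) / total      # p(w)
p_c = class_sizes['female'] / total     # p(c) for c = female
p_cw = counts['female'] / total         # p(c, w)

pmi = np.log(p_cw / (p_c * p_w))        # plain PMI
npmi = pmi / -np.log(p_cw)              # normalized by -log p(c, w)
# npmi is close to the pmi_female value for "biennials" in the table below.
```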


In [22]:
p_c = N / N.sum()
p_c


Out[22]:
male                  0.845204
female                0.154794
transgender female    0.000002
Name: gender, dtype: float64

In [23]:
words['p_w'] = (words['male'] + words['female']) / N.sum()
words['p_w'].head(5)


Out[23]:
word
biennials           0.000047
verplank            0.000010
soestdijk           0.000007
megumi_yokota       0.000008
kwame_kilpatrick    0.000008
Name: p_w, dtype: float64

In [30]:
words['p_male_w'] = words['male'] / N.sum()
words['p_female_w'] = words['female'] / N.sum()

In [31]:
words['pmi_male'] = np.log(words['p_male_w'] / (words['p_w'] * p_c['male'])) / -np.log(words['p_male_w'])
words['pmi_female'] = np.log(words['p_female_w'] / (words['p_w'] * p_c['female'])) / -np.log(words['p_female_w'])

In [32]:
words.head()


Out[32]:
female male p_w p_male_y p_female_y p_male_w p_female_w pmi_male pmi_female
word
biennials 7 36 0.000047 0.000039 0.000008 0.000039 0.000008 -0.000937 0.004274
verplank 1 8 0.000010 0.000009 0.000001 0.000009 0.000001 0.004325 -0.024145
soestdijk 2 4 0.000007 0.000004 0.000002 0.000004 0.000002 -0.019220 0.058828
megumi_yokota 5 2 0.000008 0.000002 0.000005 0.000002 0.000005 -0.083182 0.126145
kwame_kilpatrick 3 4 0.000008 0.000004 0.000003 0.000004 0.000003 -0.031707 0.080609

Now we are ready to explore PMI. Recall that PMI overweights words with extremely low frequencies, so we need to set a minimum frequency threshold. In our previous paper we required a word to appear in at least 1% of biographies, but this time we have more biographies, and with a 1% threshold fewer than 200 words remain for women.

Hence, this time we lower the threshold to 0.1%.
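The filter itself is a simple boolean mask on the `p_w` column followed by a sort, as sketched below on a toy frame with the same columns (the values here are illustrative, not from the notebook's data).

```python
import pandas as pd

# Toy frame with the notebook's column names; values are made up.
words_toy = pd.DataFrame(
    {'p_w': [0.04, 0.0005, 0.002], 'pmi_female': [0.53, 0.9, 0.29]},
    index=['actress', 'rare_word', 'feminist'])

min_p = 0.001
frequent = words_toy[words_toy.p_w > min_p]
# 'rare_word' has the highest PMI but is dropped by the frequency threshold,
# which is exactly the low-frequency overweighting we want to avoid.
top = frequent.sort_values(by='pmi_female', ascending=False)
```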


In [84]:
min_p = 0.001

In [91]:
top_female = words[words.p_w > min_p].sort_values(by=['pmi_female'], ascending=False)
top_female.head(10)


Out[91]:
female male p_w p_male_y p_female_y p_male_w p_female_w pmi_male pmi_female
word
actress 33469 3461 0.040150 0.003763 0.036387 0.003763 0.036387 -0.393954 0.533343
women_s 19521 2256 0.023675 0.002453 0.021223 0.002453 0.021223 -0.349232 0.455864
female 9627 1509 0.012107 0.001641 0.010466 0.001641 0.010466 -0.285457 0.377238
her_husband 7250 1002 0.008971 0.001089 0.007882 0.001089 0.007882 -0.284408 0.358486
women 10849 3583 0.015690 0.003895 0.011795 0.003895 0.011795 -0.220814 0.355913
woman 7477 2894 0.011275 0.003146 0.008129 0.003146 0.008129 -0.192344 0.319695
first_woman 3077 137 0.003494 0.000149 0.003345 0.000149 0.003345 -0.338985 0.319655
miss 5032 1368 0.006958 0.001487 0.005471 0.001487 0.005471 -0.211152 0.312034
pageant 1898 158 0.002235 0.000172 0.002063 0.000172 0.002063 -0.276578 0.288791
feminist 1889 169 0.002237 0.000184 0.002054 0.000184 0.002054 -0.271031 0.287644

In [92]:
top_male = words[words.p_w > min_p].sort_values(by=['pmi_male'], ascending=False)
top_male.head(10)


Out[92]:
female male p_w p_male_y p_female_y p_male_w p_female_w pmi_male pmi_female
word
played 12268 181306 0.210450 0.197112 0.013338 0.197112 0.013338 0.063242 -0.206849
football 742 41227 0.045628 0.044821 0.000807 0.044821 0.000807 0.048417 -0.304619
footballer_who 158 29429 0.032166 0.031995 0.000172 0.031995 0.000172 0.047302 -0.388361
served 10066 125870 0.147787 0.136843 0.010944 0.136843 0.010944 0.045875 -0.163313
league 3429 65122 0.074527 0.070799 0.003728 0.070799 0.003728 0.044134 -0.202015
major_league 82 20595 0.022480 0.022390 0.000089 0.022390 0.000089 0.043221 -0.392956
john 5018 74777 0.086751 0.081296 0.005455 0.081296 0.005455 0.041132 -0.172854
football_player 449 23833 0.026399 0.025911 0.000488 0.025911 0.000488 0.040928 -0.278667
first_class 195 18834 0.020688 0.020476 0.000212 0.020476 0.000212 0.040601 -0.320970
son 4648 68820 0.079873 0.074820 0.005053 0.074820 0.005053 0.039658 -0.169212

We will save the top-200 words from both lists and then manually annotate them according to the following categories:

  • F: Family
  • R: Relationship
  • G: Gender
  • O: Other

We will add that categorization to the column "cat", and we will process it in the following notebook.


In [93]:
top_male.head(200).to_csv('{0}/top-200-pmi-male.csv'.format(target_folder), encoding='utf-8')
top_female.head(200).to_csv('{0}/top-200-pmi-female.csv'.format(target_folder), encoding='utf-8')
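A hypothetical sketch of what the annotation round-trip could look like: after editing the exported CSVs by hand, a "cat" column with one of F/R/G/O per word would be read back and mapped to readable names. The file layout follows the notebook; the data and the `cat_name` helper column here are made up for illustration.

```python
import pandas as pd

# Stand-in for a manually annotated top-200 file; the real frame would be
# loaded with pd.read_csv from the CSVs saved above.
annotated = pd.DataFrame(
    {'word': ['actress', 'her_husband', 'played'],
     'cat': ['G', 'F', 'O']}).set_index('word')

# Map the single-letter codes to the category names defined above.
category_names = {'F': 'Family', 'R': 'Relationship',
                  'G': 'Gender', 'O': 'Other'}
annotated['cat_name'] = annotated['cat'].map(category_names)
```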