Annotated PMI Keywords with Categories

In this notebook we evaluate how the top PMI keywords differ by gender. We load the PMI data produced in the previous notebooks, which now carries a manually annotated category for each keyword, and test whether the distribution of categories differs between genders using a chi-square test.

Code by Eduardo Graells-Garrido. Annotation by Claudia Wagner and Eduardo Graells-Garrido.


In [1]:
import pandas as pd
import re
import numpy as np
import dbpedia_config
from scipy.stats import chisquare

In [2]:
target_folder = dbpedia_config.TARGET_FOLDER

Load Data

Some keywords end in a possessive 's, which the code below expects to find stored as a trailing _s. Here we compile a pattern that restores the original form.


In [3]:
# Matches a trailing "_s", the stored form of a possessive 's
apost = re.compile(r'_s$')
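
To make the effect of this pattern concrete, here is a minimal sketch of the substitution applied later to the word column. The input strings are assumed examples of the stored form (the CSV contents are not shown in this notebook); only keywords that end in _s are changed.

```python
import re

# Same pattern as in the cell above
apost = re.compile(r"_s$")

# Assumed stored forms, chosen to mirror words shown in the outputs below
print(apost.sub("'s", "women_s"))      # women's
print(apost.sub("'s", "her_husband"))  # her_husband (no trailing _s, unchanged)
```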

Here we load the DataFrame from the previous notebook. Note the additional column cat, which holds the annotated category of each keyword.


In [4]:
female_pmi = pd.read_csv('{0}/top-200-pmi-female.csv'.format(target_folder), encoding='utf-8')
female_pmi.word = female_pmi.word.map(lambda x: apost.sub('\'s', x))
female_pmi.head()


Out[4]:
word female male p_w p_male_y p_female_y p_male_w p_female_w pmi_male pmi_female cat
0 actress 33469 3461 0.040150 0.003763 0.036387 0.003763 0.036387 -0.393954 0.533343 O
1 women's 19521 2256 0.023675 0.002453 0.021223 0.002453 0.021223 -0.349232 0.455864 G
2 female 9627 1509 0.012107 0.001641 0.010466 0.001641 0.010466 -0.285457 0.377238 G
3 her_husband 7250 1002 0.008971 0.001089 0.007882 0.001089 0.007882 -0.284408 0.358486 R
4 women 10849 3583 0.015690 0.003895 0.011795 0.003895 0.011795 -0.220814 0.355913 G

In [5]:
female_pmi.cat.value_counts() / female_pmi.shape[0] * 100.0


Out[5]:
O    82
G     8
F     6
R     4
Name: cat, dtype: float64

In [6]:
male_pmi = pd.read_csv('{0}/top-200-pmi-male.csv'.format(target_folder), encoding='utf-8')
male_pmi.word = male_pmi.word.map(lambda x: apost.sub('\'s', x))
male_pmi.head()


Out[6]:
word female male p_w p_male_y p_female_y p_male_w p_female_w pmi_male pmi_female cat
0 played 12268 181306 0.210450 0.197112 0.013338 0.197112 0.013338 0.063242 -0.206849 O
1 football 742 41227 0.045628 0.044821 0.000807 0.044821 0.000807 0.048417 -0.304619 O
2 footballer_who 158 29429 0.032166 0.031995 0.000172 0.031995 0.000172 0.047302 -0.388361 O
3 served 10066 125870 0.147787 0.136843 0.010944 0.136843 0.010944 0.045875 -0.163313 O
4 league 3429 65122 0.074527 0.070799 0.003728 0.070799 0.003728 0.044134 -0.202015 O

In [7]:
male_pmi.cat.value_counts() / male_pmi.shape[0] * 100.0


Out[7]:
O    97.0
G     2.5
F     0.5
Name: cat, dtype: float64

Test Proportions and Effect Size

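We compare the category proportions of the two genders with a chi-square test, using the female proportions as the expected values, and then estimate Cohen's w as the effect size. Note that the cells below operate on percentages (summing to 100) rather than proportions (summing to 1), so the value in Out[10] is ten times Cohen's w on the usual proportion scale.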


In [8]:
m_proportions = []
f_proportions = []

m_count = male_pmi.cat.value_counts() / male_pmi.shape[0] * 100.0
f_count = female_pmi.cat.value_counts() / female_pmi.shape[0] * 100.0

for c in ('F', 'G', 'O', 'R'):
    m_proportions.append(m_count[c] if c in m_count.index else 0.0)
    f_proportions.append(f_count[c] if c in f_count.index else 0.0)
    
m_proportions, f_proportions


Out[8]:
([0.5, 2.5, 97.0, 0.0], [6.0, 8.0, 82.0, 4.0])
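
The loop above fills in 0.0 for any category that is missing for one gender (here, R for the male keywords). A more compact way to get the same alignment is Series.reindex with fill_value. The sketch below is only an illustration: male_cat is a hypothetical stand-in for male_pmi.cat, built to match the percentages in Out[7] under the assumption that the list has 200 rows.

```python
import pandas as pd

cats = ["F", "G", "O", "R"]

# Hypothetical stand-in for male_pmi.cat (assumes 200 annotated keywords)
male_cat = pd.Series(["O"] * 194 + ["G"] * 5 + ["F"] * 1)

# Percentages per category, in a fixed order, with 0.0 for absent categories
m_pct = male_cat.value_counts(normalize=True).reindex(cats, fill_value=0.0) * 100.0
print(m_pct.tolist())
```

Applying the same expression to female_pmi.cat would give the second list, with both lists guaranteed to share the category order in cats.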

In [9]:
chisquare(m_proportions, f_proportions)


Out[9]:
Power_divergenceResult(statistic=15.566819105691058, pvalue=0.0013910792609485086)
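
As a check on the result above, the statistic and p-value can be reproduced by hand. The following is a minimal sketch, not part of the original notebook: it hard-codes the percentages from Out[8], computes Pearson's statistic as the sum of (observed - expected)^2 / expected with the male values as observed and the female values as expected, and evaluates the closed-form survival function of the chi-square distribution with 3 degrees of freedom (four categories minus one). The variable names are illustrative.

```python
import math

# Percentages copied from Out[8], in category order F, G, O, R
m_pct = [0.5, 2.5, 97.0, 0.0]   # observed (male)
f_pct = [6.0, 8.0, 82.0, 4.0]   # expected (female)

# Pearson's chi-square statistic
stat = sum((o - e) ** 2 / e for o, e in zip(m_pct, f_pct))

# Survival function of the chi-square distribution, valid for exactly 3 degrees of freedom
pvalue = (math.erfc(math.sqrt(stat / 2))
          + math.sqrt(2 * stat / math.pi) * math.exp(-stat / 2))

print(stat, pvalue)
```

Both values should agree with Out[9] up to floating-point rounding. Because the inputs are percentages rather than keyword counts, the statistic scales with the number of keywords per list, so running the same test on raw counts would give a different value.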

In [10]:
p0 = np.array(m_proportions)
p1 = np.array(f_proportions)
np.sqrt(np.sum(np.power(p1 - p0, 2) / p1))


Out[10]:
3.945480845941475
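
Cohen's w is conventionally defined on proportions that sum to 1. As a minimal sketch (not part of the original notebook, with the percentages from Out[8] hard-coded and illustrative variable names), the same formula applied after dividing by 100 gives the effect size on that scale:

```python
import math

# Percentages copied from Out[8], in category order F, G, O, R
m_pct = [0.5, 2.5, 97.0, 0.0]
f_pct = [6.0, 8.0, 82.0, 4.0]

# Convert percentages to proportions
p0 = [v / 100.0 for v in m_pct]   # male
p1 = [v / 100.0 for v in f_pct]   # female, used as the reference distribution

# Cohen's w: square root of the sum of (p1 - p0)^2 / p1
w = math.sqrt(sum((b - a) ** 2 / b for a, b in zip(p0, p1)))
print(round(w, 4))  # 0.3945
```

By Cohen's conventional benchmarks (0.1 small, 0.3 medium, 0.5 large), this value falls between a medium and a large effect.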

Word clouds

To visualize and explore the distribution of words per category, we use word clouds built with the matta library. Font size encodes the PMI value and color encodes the annotated category.


In [14]:
import matta
matta.init_javascript(path='https://rawgit.com/carnby/matta/master/matta/libs')


Out[14]:
matta Javascript code added.

In [12]:
matta.wordcloud(dataframe=female_pmi[['word', 'pmi_female', 'cat']], text='word',
                typeface='Lato', font_weight='bold',
                font_size={'value': 'pmi_female'},
                font_color={'palette': 'Set2', 'n_colors': 4, 'value': 'cat', 'scale': 'ordinal'})



In [13]:
matta.wordcloud(dataframe=male_pmi[['word', 'pmi_male', 'cat']], text='word',
                typeface='Lato', font_weight='bold',
                font_size={'value': 'pmi_male'},
                font_color={'palette': 'Set2', 'n_colors': 4, 'value': 'cat', 'scale': 'ordinal'})