Annotated PMI Keywords with Categories

In this notebook we evaluate how the PMI keywords differ between genders. We load the PMI data produced by the previous notebooks, now annotated with a category for each keyword, and test whether the category distributions differ using a chi-square test.

Code by Eduardo Graells-Garrido. Annotation by Claudia Wagner and Eduardo Graells-Garrido.



In [1]:

    
import pandas as pd
import re
import numpy as np
import dbpedia_config
from scipy.stats import chisquare



In [2]:

    
target_folder = dbpedia_config.TARGET_FOLDER

Load Data

Some keywords end in a possessive 's that was stored as a trailing _s. Here we restore the apostrophe to recover their original form.



In [3]:

    
# a trailing "_s" marks a possessive whose apostrophe was replaced
apost = re.compile(r'_s$')
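As a quick illustration of what this pattern does, the sketch below applies it to a few tokens. The raw form `women_s` is an assumption inferred from the restored `women's` shown later, and `restore_apostrophe` is a hypothetical helper name used only here.

```python
import re

# Same pattern as in the notebook: a trailing "_s" marks a possessive.
apost = re.compile(r'_s$')

def restore_apostrophe(token):
    """Replace a trailing '_s' with "'s"; other tokens pass through unchanged."""
    return apost.sub("'s", token)

# 'women_s' is an assumed raw form; the other two tokens have no trailing "_s".
examples = ['women_s', 'her_husband', 'served']
restored = [restore_apostrophe(t) for t in examples]
print(restored)
```

Because the pattern is anchored with `$`, underscores inside a bigram such as `her_husband` are left untouched.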

Here we load the DataFrame from the previous notebook. Note the additional column cat, which holds the annotated category of each keyword.



In [4]:

    
female_pmi = pd.read_csv('{0}/top-200-pmi-female.csv'.format(target_folder), encoding='utf-8')
female_pmi.word = female_pmi.word.map(lambda x: apost.sub('\'s', x))
female_pmi.head()









    Out[4]:






  
    
      
          word  female  male       p_w  p_male_y  p_female_y  p_male_w  p_female_w   pmi_male  pmi_female  cat
0      actress   33469  3461  0.040150  0.003763    0.036387  0.003763    0.036387  -0.393954    0.533343    O
1      women's   19521  2256  0.023675  0.002453    0.021223  0.002453    0.021223  -0.349232    0.455864    G
2       female    9627  1509  0.012107  0.001641    0.010466  0.001641    0.010466  -0.285457    0.377238    G
3  her_husband    7250  1002  0.008971  0.001089    0.007882  0.001089    0.007882  -0.284408    0.358486    R
4        women   10849  3583  0.015690  0.003895    0.011795  0.003895    0.011795  -0.220814    0.355913    G



In [5]:

    
female_pmi.cat.value_counts() / female_pmi.shape[0] * 100.0









    Out[5]:





O    82
G     8
F     6
R     4
Name: cat, dtype: float64



In [6]:

    
male_pmi = pd.read_csv('{0}/top-200-pmi-male.csv'.format(target_folder), encoding='utf-8')
male_pmi.word = male_pmi.word.map(lambda x: apost.sub('\'s', x))
male_pmi.head()









    Out[6]:






  
    
      
             word  female    male       p_w  p_male_y  p_female_y  p_male_w  p_female_w  pmi_male  pmi_female  cat
0          played   12268  181306  0.210450  0.197112    0.013338  0.197112    0.013338  0.063242   -0.206849    O
1        football     742   41227  0.045628  0.044821    0.000807  0.044821    0.000807  0.048417   -0.304619    O
2  footballer_who     158   29429  0.032166  0.031995    0.000172  0.031995    0.000172  0.047302   -0.388361    O
3          served   10066  125870  0.147787  0.136843    0.010944  0.136843    0.010944  0.045875   -0.163313    O
4          league    3429   65122  0.074527  0.070799    0.003728  0.070799    0.003728  0.044134   -0.202015    O



In [7]:

    
male_pmi.cat.value_counts() / male_pmi.shape[0] * 100.0









    Out[7]:





O    97.0
G     2.5
F     0.5
Name: cat, dtype: float64

Test Proportions and Effect Size

We compare the category percentages of the two genders with a chi-square test, taking the female distribution as the expected one, and then estimate Cohen's w as the effect size.



In [8]:

    
m_proportions = []
f_proportions = []

m_count = male_pmi.cat.value_counts() / male_pmi.shape[0] * 100.0
f_count = female_pmi.cat.value_counts() / female_pmi.shape[0] * 100.0

for c in ('F', 'G', 'O', 'R'):
    m_proportions.append(m_count[c] if c in m_count.index else 0.0)
    f_proportions.append(f_count[c] if c in f_count.index else 0.0)
    
m_proportions, f_proportions









    Out[8]:





([0.5, 2.5, 97.0, 0.0], [6.0, 8.0, 82.0, 4.0])
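The loop above aligns both distributions to the same category order and fills in missing categories with zero. A minimal sketch of an equivalent approach uses `value_counts(normalize=True)` followed by `reindex`. The toy series `cats` is a stand-in for `male_pmi.cat`, rebuilt from the percentages in Out[7] under the assumption that the list has 200 rows.

```python
import pandas as pd

categories = ['F', 'G', 'O', 'R']

# Toy stand-in for male_pmi.cat (assumption: 200 rows, split as in Out[7]).
cats = pd.Series(['O'] * 194 + ['G'] * 5 + ['F'] * 1)

# Percentage per category, reordered to a fixed category list;
# categories absent from the data (here 'R') are filled with 0.0.
percentages = (cats.value_counts(normalize=True) * 100.0).reindex(categories, fill_value=0.0)
aligned = percentages.tolist()
print(aligned)
```

Using a fixed category list for both genders keeps the two vectors in the same order, which the chi-square test below relies on.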



In [9]:

    
chisquare(m_proportions, f_proportions)









    Out[9]:





Power_divergenceResult(statistic=15.566819105691058, pvalue=0.0013910792609485086)
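Note that the chi-square statistic scales with the total of the input frequencies, so the result above, computed on percentages that sum to 100, corresponds to a sample of 100 keywords per gender. The sketch below repeats the test on counts, assuming each list holds exactly 200 keywords (suggested by the file names, not verified here). The variable names are illustrative.

```python
import numpy as np
from scipy.stats import chisquare

n_keywords = 200  # assumption: each top-200 list has exactly 200 rows

# Percentages from Out[8], in the order F, G, O, R.
m_percent = np.array([0.5, 2.5, 97.0, 0.0])
f_percent = np.array([6.0, 8.0, 82.0, 4.0])

# Convert percentages back to counts under the assumed list size.
m_counts = m_percent / 100.0 * n_keywords
f_counts = f_percent / 100.0 * n_keywords

# Male values are the observed frequencies, female values the expected ones,
# matching the argument order used in the notebook.
result_percent = chisquare(m_percent, f_percent)
result_counts = chisquare(m_counts, f_counts)
print(result_percent.statistic, result_percent.pvalue)
print(result_counts.statistic, result_counts.pvalue)
```

Under this assumption the statistic doubles and the p-value shrinks, so the conclusion of a significant difference is unchanged.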



In [10]:

    
p0 = np.array(m_proportions)
p1 = np.array(f_proportions)
np.sqrt(np.sum(np.power(p1 - p0, 2) / p1))









    Out[10]:





3.945480845941475
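Cohen's w is usually reported on proportions that sum to 1. Since the value above is computed on percentages, it is larger by a factor of 10 (the square root of 100). A minimal sketch of both computations, using the values from Out[8]; `cohens_w` is a hypothetical helper name.

```python
import numpy as np

# Percentages from Out[8], in the order F, G, O, R.
m_percent = np.array([0.5, 2.5, 97.0, 0.0])
f_percent = np.array([6.0, 8.0, 82.0, 4.0])

def cohens_w(observed, expected):
    """Cohen's w: square root of the summed squared differences over the expected values."""
    return np.sqrt(np.sum((observed - expected) ** 2 / expected))

# Same computation as in the notebook (percentage scale).
w_percent = cohens_w(m_percent, f_percent)

# Same computation on proportions that sum to 1.
w_fraction = cohens_w(m_percent / 100.0, f_percent / 100.0)
print(w_percent, w_fraction)
```

On the proportion scale the result is about 0.39, which sits between Cohen's conventional thresholds for a medium (0.3) and a large (0.5) effect.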

Word clouds

To visualize and explore the distributions of words per category we use word clouds. In particular we use the matta library.



In [14]:

    
import matta
matta.init_javascript(path='https://rawgit.com/carnby/matta/master/matta/libs')









    Out[14]:





matta Javascript code added.



In [12]:

    
# select columns with a list; a tuple may be read as a single MultiIndex key
matta.wordcloud(dataframe=female_pmi.loc[:, ['word', 'pmi_female', 'cat']], text='word',
                typeface='Lato', font_weight='bold',
                font_size={'value': 'pmi_female'},
                font_color={'palette': 'Set2', 'n_colors': 4, 'value': 'cat', 'scale': 'ordinal'})



In [13]:

    
# select columns with a list; a tuple may be read as a single MultiIndex key
matta.wordcloud(dataframe=male_pmi.loc[:, ['word', 'pmi_male', 'cat']], text='word',
                typeface='Lato', font_weight='bold',
                font_size={'value': 'pmi_male'},
                font_color={'palette': 'Set2', 'n_colors': 4, 'value': 'cat', 'scale': 'ordinal'})
