In this notebook we evaluate how the top PMI keywords differ between genders. We load the PMI data produced in the previous notebooks, which now carries an annotated category for each keyword, and use a chi-square test to check whether the distribution of those categories differs by gender.
Code by Eduardo Graells-Garrido. Annotation by Claudia Wagner and Eduardo Graells-Garrido.
In [1]:
import pandas as pd
import re
import numpy as np
import dbpedia_config
from scipy.stats import chisquare
In [2]:
target_folder = dbpedia_config.TARGET_FOLDER
Some bigrams end in a possessive whose apostrophe was stored as a trailing _s. Here we compile a pattern to restore the original 's form.
In [3]:
apost = re.compile('_s$')
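As a quick illustration of what this pattern does, the sketch below applies it to a few strings. The sample words are hypothetical and are not taken from the data; only the pattern itself comes from the notebook.

```python
import re

# Same pattern as in the notebook: "_s" anchored at the end of the string.
apost = re.compile('_s$')

# Hypothetical sample entries, for illustration only.
samples = ['women_s', 'men_s rights', 'her husband']

# Only a trailing "_s" is rewritten; an "_s" in the middle is left alone.
restored = [apost.sub("'s", w) for w in samples]
print(restored)  # ["women's", 'men_s rights', 'her husband']
```

Because the pattern is anchored with $, an _s that occurs anywhere other than the end of the string is not touched.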
Here we load the DataFrame from the previous notebook. Note that there is an additional column cat.
In [4]:
female_pmi = pd.read_csv('{0}/top-200-pmi-female.csv'.format(target_folder), encoding='utf-8')
female_pmi.word = female_pmi.word.map(lambda x: apost.sub('\'s', x))
female_pmi.head()
Out[4]:
In [5]:
female_pmi.cat.value_counts() / female_pmi.shape[0] * 100.0
Out[5]:
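The percentage computation above can also be written with the normalize option of value_counts. The sketch below uses a small hypothetical Series in place of the real cat column, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the `cat` column; the real values come from the CSV.
cats = pd.Series(['F', 'G', 'G', 'O', 'R', 'G', 'F', 'G'], name='cat')

# Form used in the notebook: counts divided by the number of rows.
manual = cats.value_counts() / cats.shape[0] * 100.0

# Equivalent form: let pandas normalize the counts to fractions first.
builtin = cats.value_counts(normalize=True) * 100.0

print(manual.sort_index())
print(builtin.sort_index())
```

Both forms give the same percentages as long as the column has no missing values, since value_counts drops NaN by default while shape[0] counts every row.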
In [6]:
male_pmi = pd.read_csv('{0}/top-200-pmi-male.csv'.format(target_folder), encoding='utf-8')
male_pmi.word = male_pmi.word.map(lambda x: apost.sub('\'s', x))
male_pmi.head()
Out[6]:
In [7]:
male_pmi.cat.value_counts() / male_pmi.shape[0] * 100.0
Out[7]:
In [8]:
m_proportions = []
f_proportions = []
m_count = male_pmi.cat.value_counts() / male_pmi.shape[0] * 100.0
f_count = female_pmi.cat.value_counts() / female_pmi.shape[0] * 100.0
for c in ('F', 'G', 'O', 'R'):
    m_proportions.append(m_count[c] if c in m_count.index else 0.0)
    f_proportions.append(f_count[c] if c in f_count.index else 0.0)
m_proportions, f_proportions
Out[8]:
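The loop above puts both distributions in the same category order and fills in 0.0 for any category that does not occur. A minimal sketch of the same alignment using Series.reindex follows; the percentages are hypothetical, and 'R' is left out on purpose to show the fill behavior.

```python
import pandas as pd

categories = ['F', 'G', 'O', 'R']

# Hypothetical percentages; category 'R' is deliberately absent.
m_count = pd.Series({'G': 50.0, 'F': 30.0, 'O': 20.0})

# reindex orders the values by `categories` and fills absent labels with 0.0.
m_proportions = m_count.reindex(categories, fill_value=0.0).tolist()
print(m_proportions)  # [30.0, 50.0, 20.0, 0.0]
```

Either approach works; reindex simply makes the fixed ordering and the zero fill explicit in one call.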
In [9]:
chisquare(m_proportions, f_proportions)
Out[9]:
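Note that the chi-square statistic depends on the scale of its inputs, so passing percentages is not the same as passing raw counts. The sketch below illustrates this with hypothetical counts for 200 keywords per gender; these numbers are made up for the example and are not the notebook's results.

```python
from scipy.stats import chisquare

# Hypothetical category counts in the order F, G, O, R; each list sums to 200.
m_counts = [20, 100, 50, 30]
f_counts = [40, 80, 50, 30]

# The same data expressed as percentages, as in the cell above.
m_pct = [c * 100.0 / 200 for c in m_counts]
f_pct = [c * 100.0 / 200 for c in f_counts]

# chisquare(f_obs, f_exp): the second argument is treated as the expected distribution.
res_counts = chisquare(m_counts, f_counts)
res_pct = chisquare(m_pct, f_pct)

stat_counts = res_counts.statistic
stat_pct = res_pct.statistic
print(stat_counts, res_counts.pvalue)
print(stat_pct, res_pct.pvalue)
```

With these inputs the statistic on counts is twice the statistic on percentages (15 versus 7.5), because the counts sum to 200 and the percentages to 100. The p-value changes accordingly, so it is worth checking which scale is intended before interpreting the result.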
In [10]:
p0 = np.array(m_proportions)
p1 = np.array(f_proportions)
np.sqrt(np.sum(np.power(p1 - p0, 2) / p1))
Out[10]:
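The notebook does not name the quantity computed above. It has the form of Cohen's w effect size, which is the square root of the chi-square statistic when the inputs are proportions that sum to 1. Assuming that is the intent, the sketch below shows how the value changes between percentages and proportions, using the same hypothetical distributions as a toy example.

```python
import numpy as np

# Hypothetical distributions in percent (order F, G, O, R).
p0_pct = np.array([10.0, 50.0, 25.0, 15.0])  # observed (male)
p1_pct = np.array([20.0, 40.0, 25.0, 15.0])  # expected (female)

# Expression from the cell above, applied to percentages.
w_pct = np.sqrt(np.sum(np.power(p1_pct - p0_pct, 2) / p1_pct))

# Same expression applied to proportions that sum to 1.
p0 = p0_pct / 100.0
p1 = p1_pct / 100.0
w_prop = np.sqrt(np.sum(np.power(p1 - p0, 2) / p1))

print(w_pct, w_prop)
```

On percentages the result is ten times the value on proportions, so the usual reference thresholds for w only apply to the latter. The expression also divides by p1, so a category with a zero share in the expected distribution would cause a division by zero.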
To visualize and explore the distributions of words per category we use word clouds. In particular we use the matta library.
In [14]:
import matta
matta.init_javascript(path='https://rawgit.com/carnby/matta/master/matta/libs')
Out[14]:
In [12]:
matta.wordcloud(dataframe=female_pmi.loc[:, ('word', 'pmi_female', 'cat')], text='word',
                typeface='Lato', font_weight='bold',
                font_size={'value': 'pmi_female'},
                font_color={'palette': 'Set2', 'n_colors': 4, 'value': 'cat', 'scale': 'ordinal'})
In [13]:
matta.wordcloud(dataframe=male_pmi.loc[:, ('word', 'pmi_male', 'cat')], text='word',
                typeface='Lato', font_weight='bold',
                font_size={'value': 'pmi_male'},
                font_color={'palette': 'Set2', 'n_colors': 4, 'value': 'cat', 'scale': 'ordinal'})