Comparing perspectives on gender

Judith Butler has argued that gender is a performative concept, which implies an audience. But different audiences may perceive the performance in different ways. This notebook gathers a few (very tentative) experiments that try to illustrate the different conceptions of gender implicit in books by men and by women.

The underlying data used here is a collection of roughly 78,000 characters from 1800 to 1999, of which about 28,000 are drawn from books written by women. This is itself a subset of a larger collection.


In [2]:
import pandas as pd
import numpy as np
import csv
from collections import Counter
from scipy.stats import pearsonr

In [3]:
metadata = pd.read_csv('../metadata/balanced_character_subset.csv')
timeslice = metadata[(metadata.firstpub >= 1800) & (metadata.firstpub < 2000)]
print('Number of characters: ', len(timeslice.gender))
print('Number identified as women or girls:', sum(timeslice.gender == 'f'))
print('Number drawn from books written by women:', sum(timeslice.authgender == 'f'))


Number of characters:  78268
Number identified as women or girls: 39134
Number drawn from books written by women: 28183

Using a separate script (reproduce_character_models.py), I have trained six different models on subsets of 3000 characters drawn from this larger set. Each training set is divided equally between masculine and feminine characters. Three of the training sets are drawn from books by men; three from books by women.
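
The training itself happens in reproduce_character_models.py; the cell below is only a rough sketch of how one such balanced sample could be drawn from the metadata loaded above. It assumes masculine characters are coded 'm' (by analogy with the 'f' used earlier), and it covers only the sampling step, not feature extraction or model fitting; the helper name is just for illustration.


In [ ]:
# A minimal sketch (not the code in reproduce_character_models.py) of
# drawing one balanced training sample: 3000 characters from books by
# a single authorial gender, split evenly between feminine and masculine
# characters. Assumes masculine characters are coded 'm' in the metadata.

def sample_balanced_training_set(df, authgender, n=3000, seed=0):
    pool = df[df.authgender == authgender]
    feminine = pool[pool.gender == 'f'].sample(n // 2, random_state=seed)
    masculine = pool[pool.gender == 'm'].sample(n // 2, random_state=seed)
    return pd.concat([feminine, masculine])

# e.g. one of the three samples drawn from books by women
womensample = sample_balanced_training_set(timeslice, 'f', seed=1)
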

Let's start by comparing the coefficients of these models. This is not going to be terribly rigorous, quantitatively. I just want to get a sense of a few words that tend to be used differently by men and women, so I can flesh out my observation that these models could--in principle--be considered different "perspectives" on gender.


In [4]:
# We're going to load the features of six models, treating
# them simply as ranked lists of words. Words at the beginning
# of each list tend to be associated with masculine characters;
# words toward the end tend to be associated with feminine characters.
# We could of course use the actual coefficients instead of simple
# ranking, but I'm not convinced that adding and subtracting
# coefficients has a firmer mathematical foundation than
# adding and subtracting ranks.

# In order to compare these lists, we will start by filtering out words
# that don't appear in all six lists. This is a dubious choice,
# but see below for a better way of measuring similarity between
# models based on their predictions.

rootpath = 'models/'
masculineperspectives = []
feminineperspectives = []
for letter in ['A', 'B', 'C']:
    feminineperspectives.append(rootpath + 'onlywomenwriters' + letter + '.coefs.csv')
    masculineperspectives.append(rootpath + 'onlymalewriters' + letter + '.coefs.csv')

def intersection_of_models(fpaths, mpaths):
    # note: list.extend() returns None, so we concatenate instead
    paths = fpaths + mpaths
    words = []
    for p in paths:
        thislist = []
        with open(p, encoding = 'utf-8') as f:
            reader = csv.reader(f)
            for row in reader:
                if len(row) > 0:
                    thislist.append(row[0])
        words.append(thislist)

    shared_features = set.intersection(set(words[0]), set(words[1]), set(words[2]),
                                   set(words[3]), set(words[4]), set(words[5]))
    
    filtered_features = []
    for i in range(6):
        newlist = []
        for w in words[i]:
            if w in shared_features:
                newlist.append(w)
            
        filtered_features.append(newlist)
    
    feminine_lists = filtered_features[0 : 3]
    masculine_lists = filtered_features[3 : 6]
    
    return feminine_lists, masculine_lists
    
                                     
feminine_lists, masculine_lists = intersection_of_models(feminineperspectives, masculineperspectives)

# now let's create a consensus ranking for both groups of writers

def get_consensus(three_lists):
    '''
    Given three lists, constructs a consensus ranking for each
    word. We normalize to a 0-1 scale--not strictly necessary,
    since all lists are the same length, but it may be more
    legible than raw ranks.
    '''
    assert len(three_lists) == 3
    assert len(three_lists[0]) == len(three_lists[1]) == len(three_lists[2])
    
    denominator = len(three_lists[0]) * 3
    # we multiply the denominator by three
    # because there are going to be three lists
    
    sum_of_ranks = Counter()
    for alist in three_lists:
        for index, word in enumerate(alist):
            sum_of_ranks[word] += index / denominator
    
    return sum_of_ranks

feminine_rankings = get_consensus(feminine_lists)
masculine_rankings = get_consensus(masculine_lists)

# Now we're going to sort words based on the DIFFERENCE
# between feminine and masculine perspectives. 

# Negative scores will be words that are strongly associated with
# men (for women) and women (for men).

# Scores near zero will be words that are around the same position
# in both models of gender.

# Strongly positive scores will be words strongly associated with
# women (for women) and men (for men).

wordrank_pairs = []

for word, ranking in feminine_rankings.items():
    if word not in masculine_rankings:
        print('error: missing from masculine rankings:', word)
    else:
        difference = ranking - masculine_rankings[word]
        wordrank_pairs.append((difference, word))

wordrank_pairs.sort()

In [5]:
# The first fifty entries have negative scores,
# strongly associated with men (for women) and women (for men).

wordrank_pairs[0: 50]

# as you'll see there's a lot of courtship and
# romance here


Out[5]:
[(-0.8916876574307305, 'love'),
 (-0.710495382031906, 'attentions'),
 (-0.704450041981528, 'free'),
 (-0.6740554156171286, 'was-tell'),
 (-0.6530646515533165, 'chosen'),
 (-0.6485306465155332, 'loved'),
 (-0.6203190596137698, 'was-marry'),
 (-0.6147774979009236, 'was-held'),
 (-0.6058774139378673, 'suspicions'),
 (-0.6020151133501261, 'asked'),
 (-0.5968094038623006, 'liked'),
 (-0.5924433249370278, 'firm'),
 (-0.5867338371116709, 'patient'),
 (-0.5838790931989923, 'better'),
 (-0.5732997481108313, 'was-saw'),
 (-0.5662468513853904, 'was-forgotten'),
 (-0.5645675902602854, 'accused'),
 (-0.5583543240973972, 'gone'),
 (-0.5516372795969773, 'stroked'),
 (-0.5459277917716204, 'slept'),
 (-0.543744752308984, 'was-watching'),
 (-0.5413937867338372, 'was-mean'),
 (-0.5326616288832915, 'arms'),
 (-0.5296389588581024, 'moving'),
 (-0.5151973131821999, 'fond'),
 (-0.5126784214945426, 'past'),
 (-0.5123425692695214, 'gazed'),
 (-0.5037783375314862, 'was-seen'),
 (-0.4979009235936188, 'visited'),
 (-0.4953820319059614, 'name'),
 (-0.48816120906801, 'loves'),
 (-0.48782535684298894, 'held'),
 (-0.4864819479429051, 'was-holding'),
 (-0.4827875734676742, 'forget'),
 (-0.476910159529807, 'brought'),
 (-0.472544080604534, 'talking'),
 (-0.4695214105793451, 'drawing'),
 (-0.4651553316540723, 'was-regarded'),
 (-0.46230058774139376, 'buried'),
 (-0.4589420654911839, 'was-marrying'),
 (-0.4582703610411418, 'bear'),
 (-0.45071368597816963, 'standing'),
 (-0.4502099076406383, 'married'),
 (-0.4485306465155331, 'understood'),
 (-0.4475230898404702, 'thrown'),
 (-0.44399664147774975, 'leaving'),
 (-0.44265323257766587, 'was-touch'),
 (-0.43912678421494544, 'path'),
 (-0.4381192275398824, 'despised'),
 (-0.43660789252728804, 'consciousness')]

In [6]:
# The last fifty entries have positive scores,
# strongly associated with women (for women) and men (for men).

# To keep the most important words at the top of the list,
# I reverse it.

positive = wordrank_pairs[-50 : ]
positive.reverse()
for pair in positive:
    print(pair)

# Much harder to characterize, and I won't actually characterize
# this list in the article, but between you and me, I would say 
# there's a lot of effort, endeavoring, and thinking here.

# "Jaw," "chin" and "head" are also interesting. Perhaps in some weird way
# they are signs of effort? "She set her jaw ..." Again, I'm not going
# to actually infer anything from that -- just idly speculating.


(0.7509655751469353, 'spend')
(0.634592779177162, 'jaw')
(0.6241813602015114, 'conscience')
(0.5890848026868178, 'account')
(0.5669185558354324, 'chair')
(0.5667506297229219, 'wrote')
(0.5655751469353484, 'drove')
(0.5608732157850546, 'sent')
(0.5521410579345088, 'busy')
(0.543408900083963, 'was-caught')
(0.5424013434089001, 'endeavoured')
(0.5365239294710327, 'was-tired')
(0.5355163727959696, 'palm')
(0.5353484466834592, 'thoughts')
(0.533165407220823, 'attendants')
(0.5323257766582704, 'chin')
(0.5306465155331654, 'history')
(0.5251049538203191, 'gift')
(0.5195633921074727, 'help')
(0.5185558354324097, 'assumed')
(0.5109991603694374, 'attack')
(0.5036104114189757, 'thought')
(0.5026028547439128, 'palms')
(0.49874055415617136, 'carried')
(0.495549958018472, 'tried')
(0.4953820319059613, 'was-want')
(0.4947103274559195, 'was-treat')
(0.49454240134340893, 'was-relieved')
(0.4945424013434089, 'think')
(0.48984047019311505, 'years')
(0.4879932829554996, 'half')
(0.4849706129303107, 'brain')
(0.48110831234256923, 'imagination')
(0.48110831234256923, 'committed')
(0.4789252728799328, 'wondered')
(0.4743912678421495, 'pursued')
(0.47069689336691867, 'receive')
(0.47052896725440807, 'set')
(0.4701931150293871, 'custom')
(0.46935348446683467, 'supposed')
(0.46414777497900916, 'head')
(0.4624685138539043, 'forced')
(0.46179680940386225, 'listening')
(0.45910999160369437, 'grabbed')
(0.456255247691016, 'remarked')
(0.45591939546599497, 'effort')
(0.4554156171284635, 'was-reassured')
(0.453568429890848, 'promised')
(0.4475230898404703, 'was-joined')
(0.4463476070528968, 'explain')

Uncertainty

How stable and reliable are these differences?

We can find out by testing each of the nine possible pairings between our three masculine models and our three feminine models. The answer is that, for words at the top of the list like "love," the differences are pretty robust. They become rapidly less robust as you move down the list, so we should characterize them cautiously.


In [16]:
def get_variation(word, masculine_lists, feminine_lists):
    # Difference in normalized rank across all nine pairings of a
    # masculine-perspective list with a feminine-perspective list.
    # Note the sign here is masculine rank minus feminine rank --
    # flipped relative to wordrank_pairs above; what matters is the
    # spread of the nine values.
    differences = []
    for m in masculine_lists:
        for f in feminine_lists:
            d = (m.index(word) / len(m)) - (f.index(word) / len(f))
            differences.append(d)
    return differences

print('love')
print(get_variation('love', masculine_lists, feminine_lists))
print('\nwas-marry')
print(get_variation('was-marry', masculine_lists, feminine_lists))
print('\nspend')
print(get_variation('spend', masculine_lists, feminine_lists))
print('\nconscience')
print(get_variation('conscience', masculine_lists, feminine_lists))
print('\nimagination')
print(get_variation('imagination', masculine_lists, feminine_lists))


love
[0.9622166246851385, 0.9546599496221662, 0.8035264483627204, 0.9385390428211586, 0.9309823677581863, 0.7798488664987405, 0.9405541561712846, 0.9329974811083123, 0.7818639798488665]

was-marry
[0.4931989924433249, 0.8901763224181359, 0.8841309823677581, 0.5083123425692695, 0.9052896725440805, 0.8992443324937027, 0.07153652392947107, 0.4685138539042822, 0.4624685138539043]

spend
[-0.7622166246851386, -0.6816120906801008, -0.5994962216624685, -0.8619647355163729, -0.781360201511335, -0.6992443324937028, -0.8720403022670026, -0.7914357682619647, -0.7093198992443325]

conscience
[-0.8282115869017632, -0.3858942065491183, -0.6554156171284634, -0.8967254408060453, -0.45440806045340043, -0.7239294710327455, -0.7627204030226701, -0.3204030226700252, -0.5899244332493703]

imagination
[-0.7279596977329974, -0.6811083123425692, -0.6846347607052896, -0.4921914357682619, -0.4453400503778337, -0.4488664987405541, -0.31335012594458433, -0.2664987405541561, -0.27002518891687655]
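
To put a rough number on that spread, we can summarize the nine pairwise differences for each word (same sign convention as get_variation above: masculine rank minus feminine rank).


In [ ]:
# Mean and standard deviation of the nine pairwise differences
# for the same words examined above.
for word in ['love', 'was-marry', 'spend', 'conscience', 'imagination']:
    diffs = get_variation(word, masculine_lists, feminine_lists)
    print(word, round(np.mean(diffs), 3), '+/-', round(np.std(diffs), 3))
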

Comparing the average similarity between models

Okay. The quantitative methodology above was not super-rigorous. I was just trying to get a rough sense of a few words that have notably different gender implications for writers who are men and writers who are women. Let's try to compare these six models a little more rigorously by looking at the predictions they make.

A separate function in reproduce_character_models has already gone through all six of the models used above and applied them to a balanced test set (balanced_test_set) composed of 1000 characters from books by women and 1000 characters from books by men. (The characters themselves are also equally balanced by gender.) We now compare pairs of predictions about these characters, to see whether models based on books by women agree with each other more than they agree with models based on books by men, and vice versa.
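
For reference, here is roughly what one of those prediction files looks like; the code below relies only on a 'docid' index and a 'logistic' column holding each model's prediction for a character.


In [ ]:
# Peek at one prediction file to show the columns used below.
example = pd.read_csv('predictions/onlywomenwritersA.results', index_col = 'docid')
print(example['logistic'].head())
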


In [55]:
def model_correlation(firstpath, secondpath):
    one = pd.read_csv(firstpath, index_col = 'docid')
    two = pd.read_csv(secondpath, index_col = 'docid')
    justpredictions = pd.concat([one['logistic'], two['logistic']], axis=1, keys=['one', 'two'])
    justpredictions.dropna(inplace = True)
    r, p = pearsonr(justpredictions.one, justpredictions.two)
    return r

def compare_amongst_selves(listofpredictions):
    r_scores = []
    already_done = []
    for path in listofpredictions:
        for otherpath in listofpredictions:
            if path == otherpath:
                continue
            elif (path, otherpath) in already_done:
                continue
            else:
                r = model_correlation(path, otherpath)
                r_scores.append(r)
                already_done.append((otherpath, path))
                # no need to compare a to b AND b to a
    return r_scores

def average_r(r_scores):
    '''
    Technically, you don't directly average r scores; you apply
    Fisher's transformation (z = arctanh(r)) to get z scores first,
    average those, and transform back. In practice, this makes only
    a tiny difference here, but we might as well do it properly.
    '''
    z_scores = []
    for r in r_scores:
        z = np.arctanh(r)
        z_scores.append(z)
    mean_z = sum(z_scores) / len(z_scores)
    mean_r = np.tanh(mean_z)
    return mean_r

rootpath = 'predictions/'
masculineperspectives = []
feminineperspectives = []
for letter in ['A', 'B', 'C']:
    feminineperspectives.append(rootpath + 'onlywomenwriters' + letter + '.results')
    masculineperspectives.append(rootpath + 'onlymalewriters' + letter + '.results')

f_compare = compare_amongst_selves(feminineperspectives)
print(f_compare)
print("similarity among models of characters by women:", average_r(f_compare))
m_compare = compare_amongst_selves(masculineperspectives)
print(m_compare)
print("similarity among models of characters by men:", average_r(m_compare))


[0.57028174743687998, 0.59669847910151441, 0.61282836587295142]
similarity among models of characters by women: 0.593548944757
[0.67252242404799922, 0.65358399158794922, 0.67195355393313982]
similarity among models of characters by men: 0.666111448115

In [57]:
def compare_against_each_other(listofmasculinemodels, listoffemininemodels):
    r_scores = []
    
    for m in listofmasculinemodels:
        for f in listoffemininemodels:
            r = model_correlation(m, f)
            r_scores.append(r)
            
    return r_scores

both_compared = compare_against_each_other(masculineperspectives, feminineperspectives)
print(both_compared)
print('similarity between pairs of models that cross')
print('the gender boundary: ', average_r(both_compared))


[0.52090949748622484, 0.56643384953528197, 0.55423932332355386, 0.53094621866695213, 0.55722508876328036, 0.54946407323176882, 0.53756463252398334, 0.55990130469704724, 0.55846360992514921]
similarity between pairs of models that cross
the gender boundary:  0.548508482942

Conclusions

So we end up with three different correlation coefficients. Models based on books by men agree with each other rather strongly; this corresponds to other evidence that men tend to write conventionally gendered characters, which are easy to sort.

Models of gender based on books by women tend to vary more from one random sample to another, suggesting patterns that are not quite as clearly marked. And when we compare a model based on characters written by women to one based on characters written by men, the correlation is weakest of all. Men and women don't entirely agree about definitions of gender.

I have also printed the raw scores above so you can get a quick and dirty sense of the uncertainty. We're not being super-systematic about this, and we only have six models, but I think there's a meaningful separation between the three comparisons we're making.
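
One quick way to eyeball that separation is to print the range of each group of correlations computed above:


In [ ]:
# Min and max of each group of pairwise correlations from the cells above.
for label, scores in [('within books by women', f_compare),
                      ('within books by men', m_compare),
                      ('across the gender boundary', both_compared)]:
    print(label, round(min(scores), 3), '-', round(max(scores), 3))
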

