Experimenting with estimating the gender of mailing list participants.


In [1]:
%matplotlib inline


/home/sb/anaconda/envs/bigbang/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')

Import the BigBang modules as needed. These should be in your Python environment if you've installed BigBang correctly.


In [2]:
import bigbang.mailman as mailman
import bigbang.graph as graph
import bigbang.process as process
from bigbang.parse import get_date
from bigbang.archive import Archive
reload(process)

import pandas as pd
import datetime
import matplotlib.pyplot as plt
import numpy as np
import math
import pytz
import pickle
import os

from bigbang import parse
from gender_detector import GenderDetector

pd.options.display.mpl_style = 'default' # pandas has a set of preferred graph formatting options


/home/sb/anaconda/envs/bigbang/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2885: FutureWarning: 
mpl_style had been deprecated and will be removed in a future version.
Use `matplotlib.pyplot.style.use` instead.

  exec(code_obj, self.user_global_ns, self.user_ns)

Now let's load the data for analysis.


In [3]:
urls = ["http://www.ietf.org/mail-archive/text/ietf-privacy/",
        "http://lists.w3.org/Archives/Public/public-privacy/"]
mlists = [(url, mailman.open_list_archives(url,"../../archives")) for url in urls]
#activities = [Archive.get_activity(Archive(ml)) for ml in mlists]


Opening 47 archive files
Opening 29 archive files

For each of our lists, we'll clean up the names, find the first name if there is one, and guess its gender. Pandas groups the data together for comparison. We keep count of the names we find that are ambiguous, for the next step.


In [4]:
detector = GenderDetector('us')

gender_ambiguous_names = {}

def guess_gender(name):
    if not name:
        return 'name unknown'
    try:
        if detector.guess(name) == 'unknown':
            if name in gender_ambiguous_names:
                gender_ambiguous_names[name] += 1
            else:
                gender_ambiguous_names[name] = 1
        
        return detector.guess(name)
    except:
        return 'error'

def ml_shortname(url):
    return url.rstrip("/").split("/")[-1]

series = []  
for (url, ml) in mlists:
    activity = Archive.get_activity(Archive(ml)).sum(0)
    activityFrame = pd.DataFrame(activity, columns=['Message Count'])
    
    activityFrame['Name'] = activityFrame.index.map(lambda x: parse.clean_from(x))    
    activityFrame['First Name'] = activityFrame['Name'].map(lambda x: parse.guess_first_name(x))
    activityFrame['Guessed Gender'] = activityFrame['First Name'].map(guess_gender)
    
    activityFrame.to_csv(('senders_guessed_gender-%s.csv' % ml_shortname(url)),encoding='utf-8')
    
    counts = activityFrame.groupby('Guessed Gender')['Message Count'].count()
    counts.name=url
    series.append(counts)

pd.DataFrame(series)


/home/sb/projects/bigbang-tmp/bigbang/bigbang/archive.py:74: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  self.data.sort(columns='Date', inplace=True)
Out[4]:
error female male name unknown unknown
http://www.ietf.org/mail-archive/text/ietf-privacy/ NaN 6.0 68.0 12.0 31.0
http://lists.w3.org/Archives/Public/public-privacy/ 2.0 28.0 159.0 42.0 60.0

Let's quickly visualize the names that couldn't be guessed with our estimator and their distribution.


In [5]:
ser = pd.Series(gender_ambiguous_names)
ser.sort_values(ascending=False).plot(kind='bar')


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3144f94350>
/home/sb/anaconda/envs/bigbang/lib/python2.7/site-packages/matplotlib/font_manager.py:1288: UserWarning: findfont: Font family [u'monospace'] not found. Falling back to Bitstream Vera Sans
  (prop.get_family(), self.defaultFamily[fontext]))

This distribution may vary by the particular list, but it seems to be a power distribution. That is, with a fairly small supplement of manually providing genders for the names/identities on the list, we can very signficantly improve the fraction of messages with an estimated gender.

With a couple minutes of manual work from someone familiar with the group, I've created an updated CSV that contains a manually-entered gender column, in addition to the automated guess. Let's see how much of a difference that makes.


In [6]:
url = "http://lists.w3.org/Archives/Public/public-privacy/"
csv_guessed = ('senders_guessed_gender-%s.csv' % ml_shortname(url))
csv_manual = ('senders_manual_gender-%s.csv' % ml_shortname(url))

guessed = pd.read_csv(csv_guessed)
manual = pd.read_csv(csv_manual)

In [7]:
def combined_gender(row):
    if str(row['Manual Gender']) != 'nan':
        return row['Manual Gender']
    else:
        return row['Guessed Gender']
manual['Combined Gender'] = manual.apply(combined_gender, axis=1)

combined_series = manual.groupby('Combined Gender')['Message Count'].sum()
guessed_series = manual.groupby('Guessed Gender')['Message Count'].sum()
compared_counts = pd.DataFrame({'Manual':combined_series, 'Guessed':guessed_series})
compared_counts.plot(kind='bar')


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3141e2b110>

In [8]:
figure,axes = plt.subplots(ncols=2, figsize=(8,4))
guessed_series.rename("Guessed").plot(kind='pie', ax=axes[0])
combined_series.rename("Manual").plot(kind='pie', ax=axes[1])
axes[0].set_aspect('equal')
axes[1].set_aspect('equal')



In [ ]:


In [ ]: