2. Gender Detection

Figuring out genders from names

We're going to use 3 different methods, all of which use a similar philosophy. Essentially, each of these services have build databases from datasets where genders are known or can be identified. For example, national census data and social media profiles.

  • GenderDetector can be run locally, but only provides "male", "female" or "unknown", and has a limitted number of names in the database.
  • genderize.io and Gender API are web services that allow us to query names and return genders
    • Each of these services provides a "probability" that the gender is correct (so if "Jamie" shows up 80 times in their data as a female name, and 20 times as a male name, they'll say it's "female" with a probability of 0.8)
    • They also tell us how certain we can be of that gender by telling us how many times that name shows up (in the above example, the count would be 100. This is useful because some names might only have 1 or 2 entries, in which case a 100% probability of being male would be less reliable than a name that has 1000 entries.

The web APIs have superior data, but the problem is that they are services that require you to pay if you make more than a certain number of queries in a short period of time. The owners of both services have generously provided me with enough queries to do this research for free.

Getting names to query

First, we'll take the names from our pubmed queries and collapse them into sets. We don't really need to query the name "John" a thousand times - once will do. I'm going to loop through the csv we wrote out in the last section and pull the fourth column, which contains our author name.


In [ ]:
import os
os.chdir("../data/pubdata")

In [ ]:
names = []
with open("comp.csv") as infile:
    for line in infile:
        names.append(line.split(",")[5])

Then we'll convert the list to a set, which is an unordered array of unique values (so it removes duplicates)


In [ ]:
print(len(names))

names = set(names)
print(len(names))

Here's a function that does the same thing.


In [ ]:
def get_unique_names(csv_file):
    names = []
    with open(csv_file) as infile:
        for line in infile:
            names.append(line.split(",")[5])
            
    return set(names)

The set.union() function will merge 2 sets into a single set, so we'll do this with our other datasets.


In [ ]:
names = names.union(get_unique_names("bio.csv"))

print(len(all_names))

Getting genders from names

GenderDetector

First up - GenderDetector. The usage is pretty straighforward:


In [ ]:
from gender_detector import GenderDetector
detector = GenderDetector('us')
print(detector.guess("kevin"))
print(detector.guess("melanie"))
print(detector.guess("ajasja"))

In [ ]:
gender_dict = {}
counter = 0


for name in names:
    try:
        gender = detector.guess(name)
        gender_dict[name] = gender
    except:
        print(name)

In [ ]:
print(len(gender_dict))

In [ ]:
print(sum([1 for x in gender_dict if gender_dict[x] == 'unknown']))
print(sum([1 for x in gender_dict if gender_dict[x] != 'unknown']))

Output datasets


In [ ]:
import json

with open("GenderDetector_genders.json", "w+") as outfile:
    outfile.write(json.dumps(gender_dict, indent=4))

Genderize.io

This one is a bit more complicated, since we have to make a call to the web api, and then parse the json that's returned. Happily, someone already wrote a python package to do most of the work. We can query 10 names at a time rather than each one individually, and we'll get back a list of dictionaries, one for each query:

[{u'count': 1037, u'gender': u'male', u'name': u'James', u'probability': 0.99},
 {u'count': 234, u'gender': u'female', u'name': u'Eva', u'probability': 1.0},
 {u'gender': None, u'name': u'Thunderhorse'}]

I will turn that into a dictionary of dictionaries, where the name is the key, and the other elements are stored under them. Eg:

{
u'James':{
    u'count': 1037,
    u'gender': u'male',
    u'probability': 0.99
    },
u'Eva':{
    u'count': 234, 
    u'gender': u'female',
    u'probability': 1.0
    },
u'Thunderhorse':{
    u'count: 0,
    u'gender': None,
    u'probability': None
    }
}

Note:

I've got an API key stored in a separate file called api_keys.py (that I'm not putting on git because you can't have my queries!) that looks like this:

genderize_key = "s0m3numb3rsandl3tt3rs"
genderAPI_key = "0th3rnumb3rsandl3tt3rs"

You can get a key from both services for free, but you'll be limited in the number of queries you can make. Just make a similar file, or add them in below in place of the proper variables.


In [ ]:
from api_keys import genderize_key
from genderize import Genderize

all_names = list(all_names)

genderize = Genderize(
        user_agent='Kevin_Bonham',
        api_key=genderize_key)

genderize_dict = {}

for i in range(0, len(all_names), 10):
    
    query = all_names[i:i+10]
    genders = genderize.get(query)

    for gender in genders:
        n = gender["name"]
        g = gender["gender"]
        if g != None:
            p = gender["probability"]
            c = gender["count"]
        else:
            p = None
            c = 0

        genderize_dict[n] = {"gender":g, "probability":p, "count": c}
    
with open("genderize_genders.json", "w+") as outfile:
    outfile.write(json.dumps(genderize_dict, indent=4))

In [ ]:
print(len(genderize_dict))
print(sum([1 for x in genderize_dict if genderize_dict[x]["gender"] == 'unknown']))
print(sum([1 for x in genderize_dict if genderize_dict[x]["gender"] != 'unknown']))

Gender-API

This is a similar service, but I didn't find a python package for it. Thankfully, it's pretty easy too. The following code is for python2, but you can find the python3 code on the website. The vaule that gets returned comes in the form of a dictionary as well:

{u'accuracy': 99,
 u'duration': u'26ms',
 u'gender': u'male',
 u'name': u'markus',
 u'samples': 26354}

Which I'll convert to the same keys and value types used from genderize above (eg. "probability" instead of "accuracy", "count" instead of "samples", and 0.99 instead of 99),


In [ ]:
from api_keys import genderAPI_key
import urllib2


genderAPI_dict = {}
counter = 0

for i in range(counter, len(all_names), 20):
    
    names = all_names[i:i+20]
    query = ";".join(names)
    
    data = json.load(urllib2.urlopen("https://gender-api.com/get?key={}&name={}".format(genderAPI_key, query)))
    for r in data['result']:
        n = r["name"]
        g = r["gender"]

        if g != u"unknown":
            p = float(r["accuracy"]) / 100
            c = r["samples"]
        else:
            p = None
            c = 0

        genderAPI_dict[n] = {"gender":g, "probability":p, "count": c}
    
    
with open("../data/pubs/genderAPI_genders.json", "w+") as outfile:
    outfile.write(json.dumps(genderAPI_dict, indent=4))

If you want to do this without going through this notebook and you have a python2 installation, you can use the included gender_detection.py. The first argument should be genderize or genderapi depending on which method you want to use, (or if nothing, it will try to use GenderDetector). The second argument should be a path to an output file (like genders.json), and then the rest of the arguments should be the csv files output from the previous notebook. The script will pull all the names together into a set, and then use the relevant API or GenderDetector.

$ python2 gender_detection.py genderize path/to/dataset1.csv path/to/dataset2.csv