Now that we've got our author dataset and gender inferences, it's time to do some exploratory data analysis. I'm going to do this in Julia. I wrote two functions, importauthors() and getgenderprob(), that take the CSV files from write_names_to_file (from the XML parsing step) and create Julia DataFrames.
In [44]:
include("../src/dataimport.jl")
Out[44]:
In [53]:
using DataFrames
bio = importauthors("../data/pubdata/bio.csv", "bio")
comp = importauthors("../data/pubdata/comp.csv", "comp")
comp[1:5, :]
Out[53]:
Now let's combine all the data - we can subset it again later. I'm also going to clear the bio and comp variables to free up some memory.
We'll also use the getgenderprob() function to add two columns for each API: the probability that the author is female (P), and the number of times that name showed up in the respective database (Count), which gives us some sense of how certain we can be in the result.
Finally, we'll use pool!, which makes the representation of factored data (data that has distinct rather than continuous values) a bit more efficient in memory, and will make queries faster later on.
In [54]:
alldata = vcat(bio, comp)
bio = 0
comp = 0
alldata[:izeP], alldata[:izeCount] = getgenderprob(
alldata, "../data/genders/genderize_genders.json", :Author_First_Name)
alldata[:apiP], alldata[:apiCount] = getgenderprob(
alldata, "../data/genders/genderAPI_genders.json", :Author_First_Name)
pool!(alldata)
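To make the idea behind pooling concrete, here is a minimal sketch in plain Julia (assuming a Julia 1.x session). It is not how pool! is implemented; `pool_values` is a hypothetical helper and the data are made up. It only shows the general trick: keep each distinct value once and store a small integer reference per row.

```julia
# Toy illustration of pooling. `pool_values` is a hypothetical helper,
# not part of DataFrames. Assumes at most 255 distinct values (UInt8 refs).
function pool_values(column::Vector{String})
    pool = unique(column)                                   # distinct values, in order of first appearance
    index = Dict(v => UInt8(i) for (i, v) in enumerate(pool))
    refs = [index[v] for v in column]                       # one small integer per row
    return pool, refs
end

dataset = ["bio", "bio", "comp", "bio", "comp"]             # made-up column
pool, refs = pool_values(dataset)

# The original column can be rebuilt from the two pieces.
restored = pool[refs]
```

Comparisons such as "which rows are bio" can then be done on the integer references rather than on the strings, which is the intuition behind the faster queries.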
In Julia, we can subset our dataframes pretty easily. For example, we can pull the rows for our different datasets back out.
In [55]:
alldata = alldata[!isna(alldata[:Journal]), :] # remove rows where there's no Journal
biodata = alldata[alldata[:Dataset] .== "bio", :] # get all columns for rows where the Dataset column is "bio"
compdata = alldata[alldata[:Dataset] .== "comp", :]
biodata[1:5, :] # get the first 5 rows, and all columns
Out[55]:
Now we're going to take a look using the plotting package Plots, which lets us switch between different plotting backends (here GR, loaded through StatPlots). First, we need to reshape the data a little bit to make it easier to plot.
In [57]:
using StatPlots
gr()
izemeans = by(alldata, [:Dataset], df -> DataFrame(MeanPF = mean(dropna(df[:izeP]))))
apimeans = by(alldata, [:Dataset], df -> DataFrame(MeanPF = mean(dropna(df[:apiP]))))
izemeans[:method] = "genderize"
apimeans[:method] = "genderAPI"
allmeans = vcat(izemeans, apimeans)
Out[57]:
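The grouped means above follow a simple pattern: split by dataset, drop the missing guesses, average what is left. As a rough sketch of that pattern without DataFrames, assuming a Julia 1.x session (where `missing` and `skipmissing` play the role that `NA` and `dropna` play in this notebook), with made-up data and a hypothetical helper `groupmeans`:

```julia
using Statistics

# Made-up toy data: P(female) per author, `missing` where no guess was available.
datasets = ["bio", "bio", "comp", "comp", "bio"]
probs    = [0.9, missing, 0.2, 0.4, 0.1]

# Hypothetical helper: mean of the non-missing values within each group.
function groupmeans(groups, vals)
    result = Dict{String,Float64}()
    for g in unique(groups)
        ingroup = vals[groups .== g]               # values belonging to this group
        result[g] = mean(skipmissing(ingroup))     # ignore missing guesses
    end
    return result
end

means = groupmeans(datasets, probs)
```

Note that names without a guess are dropped from the average rather than counted as 0, so the mean describes only the authors each service could classify.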
The rather complicated expression below builds a matrix with one column per dataset and one row per calling method, which can be passed to "groupedbar" to make the plot.
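To see what shape that kind of expression produces, here is a toy version in plain Julia with made-up numbers (assuming a Julia 1.x session). It uses `unique` where the notebook uses `levels`, and plain vectors where the notebook uses DataFrame columns.

```julia
# Made-up long-format means, ordered like allmeans:
# the genderize rows first, then the genderAPI rows.
dataset = ["bio", "comp", "bio", "comp"]
meanpf  = [0.40, 0.30, 0.35, 0.25]

# Build one vector per dataset, then glue them side by side with hcat.
cols = [meanpf[dataset .== d] for d in unique(dataset)]
ys = hcat(cols...)

size(ys)   # (2, 2): rows are methods, columns are datasets
```

Each column becomes one series in the grouped bar plot, and each row becomes one group along the x-axis.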
In [80]:
ys = hcat([allmeans[allmeans[:Dataset] .== x, :MeanPF] for x in levels(allmeans[:Dataset])]...)
groupedbar(ys, bar_position=:dodge,
ylims=(0,1), xticks=([1,2],["genderize", "genderAPI"]),
lab=["Bio", "comp"],
xlabel="Gender Calling Method",
ylabel="Proportion Female",
title="Proportion of Female Authors")
Out[80]:
In [70]:
genderize_byposition = by(alldata, [:Position, :Dataset], df -> mean(dropna(df[:izeP])))
genderapi_byposition = by(alldata, [:Position, :Dataset], df -> mean(dropna(df[:apiP])))
Out[70]:
In [84]:
ys = hcat([genderize_byposition[genderize_byposition[:Dataset] .== x, :x1] for x in levels(genderize_byposition[:Dataset])]...)
groupedbar(ys, bar_position=:dodge,
ylims=(0,0.6), xticks=(1:5,levels(genderize_byposition[:Position])),
lab=levels(genderize_byposition[:Dataset]),
xlabel="Author Position",
ylabel="Proportion Female",
title="By Position, Genderize.io")
Out[84]:
In [86]:
ys = hcat([genderapi_byposition[genderapi_byposition[:Dataset] .== x, :x1] for x in levels(genderapi_byposition[:Dataset])]...)
groupedbar(ys, bar_position=:dodge,
ylims=(0,0.6), xticks=(1:5,levels(genderapi_byposition[:Position])),
lab=levels(genderapi_byposition[:Dataset]),
xlabel="Author Position",
ylabel="Proportion Female",
title="By Position, GenderAPI")
Out[86]:
The good news:
The bad news: this recapitulates previously published findings that women are under-represented in biology publishing.
New finding: It seems to be worse in computational biology than in all of biology, though not by as much as I expected.
Using genderize, it looks like women are better represented than when using genderAPI. Which one is better?
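A few things to consider: (1) how many of our names each service can guess, (2) what proportion of authors each service can guess, and (3) for names that can be guessed, how certain we can be that the gender assignment is correct.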
To compare the two services, we'll start by reshaping our dataframe to show the stats for each unique name, including the number of times each name shows up.
In [88]:
names = by(biodata, [:Author_First_Name, :izeP, :apiP], df -> DataFrame(
izeCount = mean(df[:izeCount]),
apiCount = mean(df[:apiCount]),
Frequency = length(df[:izeCount])
)
)
names[1:5, :]
Out[88]:
One difference that's immediately apparent is that Genderize has guesses for initials, while genderAPI doesn't.
In [90]:
initials = names[map(x->length(x), names[:Author_First_Name]) .== 1, :]
initials[1:26, :]
Out[90]:
You might also have noticed that a lot of names come through as initials, so this difference could matter quite a bit.
In [91]:
# 1. how many of our names can the service guess?
println("Gender-API: $(length(names[names[:apiCount] .!= 0, :Author_First_Name]) / length(names[:Author_First_Name]))")
println("Genderize.io: $(length(names[names[:izeCount] .!= 0, :Author_First_Name]) / length(names[:Author_First_Name]))")
So it looks like GenderAPI can guess a lot more of the unique names, but this doesn't take into consideration how many times each name shows up. Maybe Genderize covers more of the frequent names.
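A toy example shows why the two ratios can diverge. The numbers below are made up, and the sketch assumes a Julia 1.x session: a service that only knows the two most common names covers a minority of unique names but nearly all authors.

```julia
# Made-up name table: how often each unique name appears among authors,
# and whether a (hypothetical) service returned a guess for it.
freq    = [100, 50, 2, 1, 1]
guessed = [true, true, false, false, false]

# Share of unique names with a guess: every name counts once.
name_coverage = count(guessed) / length(guessed)

# Share of authors with a guess: every name is weighted by its frequency.
author_coverage = sum(freq[guessed]) / sum(freq)
```

Here `name_coverage` is 2 out of 5 names, while `author_coverage` is 150 out of 154 authors.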
In [92]:
# 2. what proportion of authors can the service guess? (this is a different question)
println("Gender-API: $(length(biodata[biodata[:apiCount] .!= 0, :Author_First_Name]) / length(biodata[:Author_First_Name]))")
println("Genderize.io: $(length(biodata[biodata[:izeCount] .!= 0, :Author_First_Name]) / length(biodata[:Author_First_Name]))")
Here it seems that genderize has the upper hand. Then again, remember all of those names that are just initials. What if we take those out of the mix?
In [93]:
println("Gender-API: $(length(biodata[(biodata[:apiCount] .!= 0) & (map(x->length(x), biodata[:Author_First_Name]) .!= 1), :Author_First_Name]) / length(biodata[:Author_First_Name]))")
println("Genderize.io: $(length(biodata[(biodata[:izeCount] .!= 0) & (map(x->length(x), biodata[:Author_First_Name]) .!= 1), :Author_First_Name]) / length(biodata[:Author_First_Name]))")
It looks like most of genderize's advantage comes from the fact that it's guessing on initials. It's unclear whether we should include these names. There's evidence that women are more likely to use initials when publishing than men. At the same time, most of genderize's guesses for gender based on initials skew towards male. The combination of these things would lead me to expect genderize to underpredict female authorship, yet we saw above that the genderize guesses skew female compared to GenderAPI.
What happens to the genderize data when we drop the initials?
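The comparison in the next few cells boils down to computing the same mean twice, once over all names and once with single-character names filtered out. A minimal sketch of that filter in plain Julia, assuming a Julia 1.x session, with made-up names and probabilities and a hypothetical helper `isinitial`:

```julia
using Statistics

# Made-up first names and P(female) guesses; `missing` means no guess.
firstnames = ["Maria", "J", "Wei", "A", "Sam"]
pfemale    = [0.98, 0.30, 0.40, 0.35, missing]

# Hypothetical helper: a name counts as an initial when it is one character long.
isinitial(name) = length(name) == 1

# Mean over every name that has a guess.
mean_all = mean(skipmissing(pfemale))

# Same mean after dropping the initials.
mean_noinitials = mean(skipmissing(pfemale[.!isinitial.(firstnames)]))
```

If the two means are close on the real data, the guesses for initials are not what drives the overall estimate.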
In [94]:
# Bio data, first author
println(mean(dropna(biodata[biodata[:Position] .== "first", :izeP])))
println(mean(dropna(biodata[(biodata[:Position] .== "first") & (map(x->length(x), biodata[:Author_First_Name]) .!= 1), :izeP])))
In [95]:
# Comp data, first author
println(mean(dropna(compdata[compdata[:Position] .== "first", :izeP])))
println(mean(dropna(compdata[(compdata[:Position] .== "first") & (map(x->length(x), compdata[:Author_First_Name]) .!= 1), :izeP])))
In [96]:
# Bio data, last author
println(mean(dropna(biodata[biodata[:Position] .== "last", :izeP])))
println(mean(dropna(biodata[(biodata[:Position] .== "last") & (map(x->length(x), biodata[:Author_First_Name]) .!= 1), :izeP])))
In [97]:
# Comp data, last author
println(mean(dropna(compdata[compdata[:Position] .== "last", :izeP])))
println(mean(dropna(compdata[(compdata[:Position] .== "last") & (map(x->length(x), compdata[:Author_First_Name]) .!= 1), :izeP])))
So it doesn't look like leaving in the predictions based on initials is substantially swaying the results one way or the other. What else might explain the difference?
In [99]:
# 3. for names that can be guessed, how certain can we be that the gender assignment is correct?
println("Gender-API: $(mean(biodata[biodata[:apiCount] .!= 0, :apiCount]))")
println("Genderize.io: $(mean(biodata[biodata[:izeCount] .!= 0, :izeCount]))")
# Excluding initials
println("Genderize no initials: $(mean(biodata[(biodata[:izeCount] .!= 0) & (map(x->length(x), biodata[:Author_First_Name]) .!= 1), :izeCount]))")
In [105]:
n = names[(names[:apiCount] .> 0) & (names[:izeCount] .> 0), :]
n[1:5, :]
Out[105]:
In [111]:
scatter(n, :Frequency, :izeCount,
lab="genderize.io", α=0.5,
yaxis=("Count", :log10),
xaxis=("Name Frequency", :log10))
scatter!(n, :Frequency, :apiCount, lab="genderAPI", α=0.5)
Out[111]: