Since the GenderAPI data seems a bit more comprehensive (see here), that's what we'll use going forward. This first block recapitulates what I did in the beginning of the last notebook.
In [4]:
using StatPlots
include("../src/dataimport.jl") # `importauthors()` and `getgenderprob()` functions
bio = importauthors("../data/pubdata/bio.csv", "bio")
comp = importauthors("../data/pubdata/comp.csv", "comp")
alldata = vcat(bio, comp)
bio = 0 # to free up memory
comp = 0
alldata[:Pfemale], alldata[:Count] = getgenderprob(alldata, "../data/genders/genderAPI_genders.json", :Author_First_Name)
pool!(alldata)
alldata = alldata[!isna(alldata[:Journal]), :] # remove rows where there's no Journal
alldata[1:5, :]
Out[4]:
In [6]:
means = by(alldata, [:Dataset], df -> DataFrame(MeanPF = mean(dropna(df[:Pfemale]))))
Out[6]:
In [20]:
bar(means, :MeanPF,
xaxis=("Dataset", ([1,2], means[:Dataset])),
yaxis=("Percent Female", (0, 0.6), 0:0.1:0.6),
legend=false,
grid=false,
title="Proportion of Female Authors")
Out[20]:
Currently, the author positions are ordered in the dataframe in alphabetical order (first, last, other, penultimate, second), so I'm going to define a "less than" function to do a custom sort (thanks Stack Overflow!). This function needs to return true for the call x < y whenever x comes before y in the following order: ["first", "second", "other", "penultimate", "last"].
In [24]:
# map each author position to its rank in the desired order
order = Dict(key => ix for (ix, key) in enumerate(["first", "second", "other", "penultimate", "last"]))
function authororder(pos1, pos2)
    return order[pos1] < order[pos2]
end
println(authororder("first", "second")) # true
println(authororder("second", "first")) # false
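The same idea can be tried on a plain vector before touching the dataframe. This is a minimal sketch, assuming a recent Julia 1.x; the names `position_rank`, `position_lt` and `positions` are made up for illustration and are not part of the notebook's code.

```julia
# Rank each author position; a lower rank sorts first.
position_rank = Dict(pos => ix for (ix, pos) in
    enumerate(["first", "second", "other", "penultimate", "last"]))

# "Less than" comparator: true when `a` should come before `b`.
position_lt(a, b) = position_rank[a] < position_rank[b]

# Start from the alphabetical order and sort with the custom comparator.
positions = ["first", "last", "other", "penultimate", "second"]
sorted_positions = sort(positions, lt=position_lt)
println(sorted_positions)
```

Passing the comparator through `lt` leaves the strings themselves untouched, so nothing has to be recoded as a number just to get the order right.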
So now we can sort the dataframe using our custom function and the lt keyword.
In [31]:
sort!(alldata, cols=:Position, lt=authororder)
alldata[1:5, 1:7]
Out[31]:
In [32]:
byposition = by(alldata, [:Position, :Dataset], df -> mean(dropna(df[:Pfemale])))
sort!(byposition, cols=:Position, lt=authororder)
ys = hcat([byposition[byposition[:Dataset] .== x, :x1] for x in levels(byposition[:Dataset])]...)
groupedbar(ys, bar_position=:dodge,
xaxis=("Author Position", (1:5, levels(alldata[:Position]))),
yaxis=("Percent Female", (0, 0.6), 0:0.1:0.6),
legend=false,
grid=false,
title="Proportion of Female Authors")
Out[32]:
To reiterate, these data suggest:
New finding: It seems to be worse in computational biology than in all of biology, though not by as much as I expected.
Let's see how this holds up. We can't do the "normal" sorts of statistics that folks often do (like t-tests, chi-squared tests, etc.), since we're not taking a random sample of a population; we're looking at the whole population. An alternative is to use bootstrap analysis, where we randomly resample the data a bunch of times and get statistics on those samples.
Happily, Julia has a bootstrap package that does a lot of the work for us.
In [9]:
using Bootstrap
# bootstrap the mean of `v` with `n` resamples and return a confidence interval at level `i`
function getbootstrap(v, n=1000, i=0.95)
    bs = boot_basic(v, mean, n)
    return ci_basic(bs, i)
end
Out[9]:
This function takes a vector of values v (like, say, a group of gender probabilities), and then samples them with replacement n times. In other words, if I had a vector t = [1, 2, 3, 4, 5] and I called getbootstrap(t), it would go through and take 5 samples at random... sometimes it would be [5, 2, 4, 3, 3], sometimes [1, 1, 1, 1, 1], etc. It would do this 1000 times, getting the mean each time.
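To make the resampling concrete, here is a hand-rolled sketch that uses only the standard library and assumes a recent Julia 1.x. It is not how the Bootstrap package is implemented: the helper names `resampled_means` and `percentile_interval` are hypothetical, and it reports a simple percentile interval rather than the basic interval that `ci_basic` returns, so the numbers will differ slightly.

```julia
using Random, Statistics

# Hypothetical helper: resample `v` with replacement `n` times,
# keeping the mean of each resample.
function resampled_means(v, n=1000; rng=Random.default_rng())
    return [mean(rand(rng, v, length(v))) for _ in 1:n]
end

# Percentile interval: the range covering the middle `level` share
# of the resampled means.
function percentile_interval(means, level=0.95)
    alpha = (1 - level) / 2
    return quantile(means, alpha), quantile(means, 1 - alpha)
end

t = [1, 2, 3, 4, 5]
means = resampled_means(t, 1000; rng=MersenneTwister(42))
lower, upper = percentile_interval(means)
println("mean = ", mean(t), ", 95% interval = (", lower, ", ", upper, ")")
```

The seeded `MersenneTwister` is only there to make the run repeatable; drop the `rng` argument to get fresh resamples each time.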
In [17]:
t = [1,2,3,4,5]
getbootstrap(t)
Out[17]:
The estimate for the mean is 3, just as it should be, but I also get a confidence interval: the 95% interval for the mean runs from 1.8 to 4.2.
I can use this to look at the author genders:
In [18]:
by(alldata, [:Dataset, :Position]) do df
    ci = getbootstrap(dropna(df[:Pfemale]))
    return DataFrame(Mean=ci.t0, Lower=interval(ci)[1], Upper=interval(ci)[2])
end
Out[18]:
That code takes the data, subsets it by :Dataset and :Position (that's the first two columns) and then returns the mean and 95% confidence intervals for each subset.
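The split-then-summarize pattern can also be spelled out without a dataframe. The sketch below assumes a recent Julia 1.x and uses a handful of invented rows (they are not from the real data); `rows`, `groups` and `groupmeans` are illustrative names only, and it computes just the mean per group, not the confidence interval.

```julia
using Statistics

# Toy rows standing in for the dataframe: (dataset, position, pfemale).
# The values are invented for illustration and are not from the real data.
rows = [
    ("bio",  "first", 0.9),
    ("bio",  "first", 0.1),
    ("bio",  "last",  0.2),
    ("comp", "first", 0.4),
    ("comp", "last",  0.0),
    ("comp", "last",  0.5),
]

# Split: collect the probabilities for each (dataset, position) pair.
groups = Dict{Tuple{String,String},Vector{Float64}}()
for (dataset, position, p) in rows
    push!(get!(groups, (dataset, position), Float64[]), p)
end

# Apply and combine: one summary value per group.
groupmeans = Dict(key => mean(vals) for (key, vals) in groups)

for key in sort(collect(keys(groupmeans)))
    println(key, " => ", groupmeans[key])
end
```

In the notebook, `by` does the splitting and the `do` block plays the role of the summary step, returning three values per group instead of one.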
We can make generic functions for this, which you can find in src/bootstrapping.jl.
In [19]:
function bystats(df::DataFrame, by1::Symbol, n=1000, i=0.95)
    by(df, by1) do df2
        ci = getbootstrap(dropna(df2[:Pfemale]), n, i)
        return DataFrame(Mean=ci.t0, Lower=interval(ci)[1], Upper=interval(ci)[2])
    end
end
function bystats(df::DataFrame, cols::Vector{Symbol}, n=1000, i=0.95)
    by(df, cols) do df2
        ci = getbootstrap(dropna(df2[:Pfemale]), n, i)
        return DataFrame(Mean=ci.t0, Lower=interval(ci)[1], Upper=interval(ci)[2])
    end
end
Out[19]:
In [27]:
big3 = alldata[
(alldata[:Journal] .== utf8("Nature"))|
(alldata[:Journal] .== utf8("Science"))|
(alldata[:Journal] .== utf8("Cell")), :]
big3byposition = bystats(big3, [:Position, :Dataset])
plot(big3byposition, x=:Position, y=:Mean, color=:Dataset, ymin=:Lower, ymax=:Upper,
Scale.color_discrete_manual(my_colors...),
Geom.bar(position = :dodge), Guide.title("Authors in Nature, Science and Cell"),
Guide.YLabel("Percent Female"),
Scale.x_discrete(levels=["first", "second", "other", "penultimate", "last"]),
Theme(bar_spacing=2mm))
Out[27]:
In [28]:
plos = alldata[map(x->contains(x, "PLoS"), alldata[:Journal]), :]
plosbyposition = by(plos, [:Position, :Dataset], df -> mean(dropna(df[:Pfemale])))
plot(plosbyposition, x=:Position, y=:x1, color=:Dataset,
Scale.color_discrete_manual(my_colors...),
Geom.bar(position = :dodge), Guide.title("Authors in PLoS Journals"),
Guide.YLabel("Percent Female"),
Scale.x_discrete(levels=["first", "second", "other", "penultimate", "last"]),
Theme(bar_spacing=2mm))
Out[28]:
In [29]:
bioids = Set(levels(alldata[alldata[:Dataset] .== "bio", :ID]))
compids = Set(levels(alldata[alldata[:Dataset] .== "comp", :ID]))
println("There are $(length(bioids)) articles in the \"bio\" dataset")
println("There are $(length(compids)) articles in the \"comp\" dataset")
dif = length(setdiff(compids, bioids))
println("There are $dif articles in the \"comp\" dataset that aren't in the \"bio\" dataset")
In [32]:
plosfocus = alldata[(alldata[:Dataset] .== "bio")&
((alldata[:Journal] .== utf8("PLoS Biol."))|
(alldata[:Journal] .== utf8("PLoS Comput. Biol."))), :]
plosfocusbyposition = bystats(plosfocus, [:Position, :Journal])
plot(plosfocusbyposition, x=:Position, y=:Mean, color=:Journal,
Scale.color_discrete_manual(my_colors...),
Geom.bar(position = :dodge), Guide.title("Authors in Specified PLoS Journals"),
Guide.YLabel("Percent Female"),
Scale.x_discrete(levels=["first", "second", "other", "penultimate", "last"]),
Theme(bar_spacing=2mm))
Out[32]:
In [33]:
comppubs = alldata[(alldata[:Dataset] .== "bio")&
((alldata[:Journal] .== utf8("PLoS Comput. Biol."))| # impact factor 4.62
(alldata[:Journal] .== utf8("Nucleic Acids Res."))| # 9.112
(alldata[:Journal] .== utf8("BMC Bioinformatics"))| # 2.576
(alldata[:Journal] .== utf8("Bioinformatics"))), :] # 4.981
genpubs = alldata[(alldata[:Dataset] .== "bio")&
((alldata[:Journal] .== utf8("Proc. Natl. Acad. Sci. U.S.A."))| # impact factor 9.674
(alldata[:Journal] .== utf8("BMC Biol."))| # 9.112
(alldata[:Journal] .== utf8("PLoS Biol."))| # 9.343
(alldata[:Journal] .== utf8("Biol. Lett."))), :] # 3.248
comppubs[:JournalSet] = "comp"
genpubs[:JournalSet] = "gen"
pubs = vcat(comppubs, genpubs)
comppubs = 0
genpubs = 0
pubsbyposition = bystats(pubs, [:Position, :JournalSet])
plot(pubsbyposition, x=:Position, y=:Mean, color=:JournalSet,
Scale.color_discrete_manual(my_colors...),
Geom.bar(position = :dodge), Guide.title("Authors in Computational vs General Bio Publications"),
Guide.YLabel("Percent Female"),
Scale.x_discrete(levels=["first", "second", "other", "penultimate", "last"]),
Theme(bar_spacing=2mm))
Out[33]:
In [34]:
alldata[:Year] = map(x -> Dates.year(Date(x)), alldata[:Date])
byyear = bystats(alldata, [:Year, :Dataset])
plot(byyear, x=:Year, y=:Mean, color=:Dataset,
Scale.color_discrete_manual(my_colors...),
Geom.point, Geom.line, Guide.title("Female Authors Over Time"),
Guide.YLabel("Percent Female"), Guide.XLabel("Year"))
Out[34]:
What about in various author positions?
In [35]:
biobyyear = bystats(alldata[alldata[:Dataset] .== "bio", :], [:Year, :Position])
plot(biobyyear, x=:Year, y=:Mean, color=:Position,
Scale.color_discrete_manual(my_colors...),
Geom.point, Geom.line, Guide.title("Female Authors Over Time in Biology"),
Guide.YLabel("Percent Female"), Guide.XLabel("Year"))
In [36]:
compbyyear = bystats(alldata[alldata[:Dataset] .== "comp", :], [:Year, :Position])
plot(compbyyear, x=:Year, y=:Mean, color=:Position,
Scale.color_discrete_manual(my_colors...),
Geom.point, Geom.line, Guide.title("Female Authors Over Time In Computational Biology"),
Guide.YLabel("Percent Female"), Guide.XLabel("Year"))
Why does all the data suck before ~2002?
In [15]:
println("There are $(length(alldata[(alldata[:Year] .< 2002) & (alldata[:Dataset] .== "bio"), :Year])) pubs in the data set before 2002")
println("There are $(length(alldata[(alldata[:Year] .>= 2002) & (alldata[:Dataset] .== "bio"), :Year])) pubs in the data set from 2002 onward")
for year in 1997:2015
    println("There are $(length(alldata[(alldata[:Year] .== year) & (alldata[:Dataset] .== "bio"), :Year])) pubs in the data set from $year")
end
Previous work suggests that women are far less likely to publish in computer science. Unfortunately, PubMed doesn't index computer science research.
The arXiv has preprints in many fields, including computer science. The sorts of papers posted there are likely to be different, so we can't compare directly to the stuff on PubMed, but there's also quantitative biology...
In [37]:
arxivcs = importauthors("../data/pubs/arxivcs.csv", "arxivcs")
arxivbio = importauthors("../data/pubs/arxivbio.csv", "arxivbio")
arxiv = vcat(arxivbio, arxivcs)
arxivcs = 0
arxivbio = 0
pool!(arxiv)
arxiv = arxiv[!isna(arxiv[:Author_Name]), :]
arxiv[:Pfemale], arxiv[:Count] = getgenderprob(arxiv, "../data/genders/genderAPI_genders.json", :Author_Name)
arxivbyposition = bystats(arxiv, [:Dataset, :Position])
Out[37]:
In [39]:
plot(arxivbyposition, x=:Position, y=:Mean, color=:Dataset,
Scale.color_discrete_manual(my_colors...),
Guide.title("Female Authors in arXiv"),
Geom.bar(position = :dodge),
Scale.x_discrete(levels=["first", "second", "other", "penultimate", "last"]),
Theme(bar_spacing=2mm))
Out[39]:
In [40]:
arxiv[:Year] = map(x -> Dates.year(Date(x)), arxiv[:Date])
arxivbyyear = bystats(arxiv, [:Year, :Dataset])
plot(arxivbyyear, x=:Year, y=:Mean, color=:Dataset,
Scale.color_discrete_manual(my_colors...),
Geom.point, Geom.line, Guide.title("Female Authors in arXiv Over Time"),
Guide.YLabel("Percent Female"), Guide.XLabel("Year"), Guide.yticks(ticks=collect(0:0.1:0.5)))
Out[40]: