Gender Stats

Since the GenderAPI data seems a bit more comprehensive (see here), that's what we'll use going forward. This first block recapitulates what I did in the beginning of the last notebook.


In [4]:
using StatPlots


include("../src/dataimport.jl") # `importauthors()` and `getgenderprob()` functions

bio = importauthors("../data/pubdata/bio.csv", "bio")
comp = importauthors("../data/pubdata/comp.csv", "comp")
alldata = vcat(bio, comp)
bio = 0 # to free up memory
comp = 0

alldata[:Pfemale], alldata[:Count] = getgenderprob(alldata, "../data/genders/genderAPI_genders.json", :Author_First_Name)

pool!(alldata)
alldata = alldata[!isna(alldata[:Journal]), :] # remove rows where there's no Journal

alldata[1:5, :]


Out[4]:
Row | ID | Date | Journal | Author_First_Name | Author_Last_Name | Author_Initials | Position | Title | Dataset | Pfemale | Count
1 | 26466425 | 2015-10-15 | Southeast Asian J. Trop. Med. Public Health | Suwit | Chotinun | NA | first | PREVALENCE AND ANTIMICROBIAL RESISTANCE OF SALMONELLA ISOLATED FROM CARCASSES, PROCESSING FACILITIES AND THE ENVIRONMENT SURROUNDING SMALL SCALE POULTRY SLAUGHTERHOUSES IN THAILAND. | bio | 0.020000000000000018 | 306
2 | 26466425 | 2015-10-15 | Southeast Asian J. Trop. Med. Public Health | Prapas | Patchanee | NA | last | PREVALENCE AND ANTIMICROBIAL RESISTANCE OF SALMONELLA ISOLATED FROM CARCASSES, PROCESSING FACILITIES AND THE ENVIRONMENT SURROUNDING SMALL SCALE POULTRY SLAUGHTERHOUSES IN THAILAND. | bio | 0.030000000000000027 | 58
3 | 26466425 | 2015-10-15 | Southeast Asian J. Trop. Med. Public Health | Suvichai | Rojanasthien | NA | second | PREVALENCE AND ANTIMICROBIAL RESISTANCE OF SALMONELLA ISOLATED FROM CARCASSES, PROCESSING FACILITIES AND THE ENVIRONMENT SURROUNDING SMALL SCALE POULTRY SLAUGHTERHOUSES IN THAILAND. | bio | NA | 0
4 | 26466425 | 2015-10-15 | Southeast Asian J. Trop. Med. Public Health | Pakpoom | Tadee | NA | penultimate | PREVALENCE AND ANTIMICROBIAL RESISTANCE OF SALMONELLA ISOLATED FROM CARCASSES, PROCESSING FACILITIES AND THE ENVIRONMENT SURROUNDING SMALL SCALE POULTRY SLAUGHTERHOUSES IN THAILAND. | bio | 0.020000000000000018 | 116
5 | 26466425 | 2015-10-15 | Southeast Asian J. Trop. Med. Public Health | Fred | Unger | NA | other | PREVALENCE AND ANTIMICROBIAL RESISTANCE OF SALMONELLA ISOLATED FROM CARCASSES, PROCESSING FACILITIES AND THE ENVIRONMENT SURROUNDING SMALL SCALE POULTRY SLAUGHTERHOUSES IN THAILAND. | bio | 0.040000000000000036 | 49394

In [6]:
means = by(alldata, [:Dataset], df -> DataFrame(MeanPF = mean(dropna(df[:Pfemale]))))


Out[6]:
Row | Dataset | MeanPF
1 | bio | 0.34048754243415824
2 | comp | 0.29633429724211385

In [20]:
bar(means, :MeanPF,
        xaxis=("Dataset", ([1,2], means[:Dataset])),
        yaxis=("Percent Female", (0, 0.6), 0:0.1:0.6),
        legend=false,
        grid=false,
        title="Proportion of Female Authors")


Out[20]:

Currently, the author positions are ordered in the dataframe in alphabetical order (first, last, other, penultimate, second), so I'm going to define a "less than" function to do a custom sort (thanks Stack Overflow!). This function needs to return true for the call x < y for the strings in the following order: ["first", "second", "other", "penultimate", "last"].


In [24]:
order = Dict(key => ix for (ix, key) in enumerate(["first", "second", "other", "penultimate", "last"]))


function authororder(pos1, pos2)
    return order[pos1] < order[pos2]
end

println(authororder("first", "second"))
println(authororder("second", "first"))


true
false
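Before applying this to the dataframe, here is a minimal sketch of the same idea on a plain vector, assuming a recent Julia (1.x) rather than the version this notebook ran on. The names `posrank`, `isbefore` and `shuffled` are made up for illustration; `sort` with the `lt` and `by` keywords is part of Base.

```julia
# Rank each author position; a smaller rank sorts first.
positions = ["first", "second", "other", "penultimate", "last"]
posrank = Dict(pos => ix for (ix, pos) in enumerate(positions))

# Comparator with the same shape as `authororder`: true when a comes before b.
isbefore(a, b) = posrank[a] < posrank[b]

shuffled = ["last", "other", "first", "penultimate", "second", "first"]

# `lt` swaps in the custom "less than"; `by` maps each value to its rank instead.
sorted_lt = sort(shuffled, lt=isbefore)
sorted_by = sort(shuffled, by=pos -> posrank[pos])

println(sorted_lt)
println(sorted_by)
```

Both calls should give the same ordering; `by` is often the simpler choice when the order can be expressed as a number per value.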

So now we can sort the dataframe using our custom function and the lt keyword.


In [31]:
sort!(alldata, cols=:Position, lt=authororder)
alldata[1:5, 1:7]


Out[31]:
Row | ID | Date | Journal | Author_First_Name | Author_Last_Name | Author_Initials | Position
1 | 26466425 | 2015-10-15 | Southeast Asian J. Trop. Med. Public Health | Suwit | Chotinun | NA | first
2 | 26466421 | 2015-10-15 | Southeast Asian J. Trop. Med. Public Health | Chariya | Chomvarin | NA | first
3 | 26466418 | 2015-10-15 | Southeast Asian J. Trop. Med. Public Health | Meng-Bin | Tang | NA | first
4 | 26460400 | 2015-10-11 | Ann. Hum. Genet. | Christopher | Steele | D | first
5 | 26255944 | 2015-08-10 | Mutat. Res. | Sara | Skiöld | NA | first

In [32]:
byposition = by(alldata, [:Position, :Dataset], df -> mean(dropna(df[:Pfemale])))
sort!(byposition, cols=:Position, lt=authororder)
ys = hcat([byposition[byposition[:Dataset] .== x, :x1] for x in levels(byposition[:Dataset])]...)

groupedbar(ys, bar_position=:dodge,
        xaxis=("Author Position", (1:5, levels(alldata[:Position]))),
        yaxis=("Percent Female", (0, 0.6), 0:0.1:0.6),
        legend=false,
        grid=false,
        title="Proportion of Female Authors")


Out[32]:

To reiterate, these data suggest:

  • Women are less likely to be authors than men
  • Women are less likely to be first authors than second authors
  • Women are less likely to be last authors than first authors
  • This recapitulates previously published data showing that women are under-represented in biology publishing

New finding: It seems to be worse in computational biology than in all of biology, though not by as much as I expected.

Additional Analysis

Let's see how this holds up. We can't do the "normal" sorts of statistics that folks often do (like t-tests, chi-squared, etc.), since we're not taking a random sample of a population; we're looking at the whole population. An alternative is to use bootstrap analysis, where we randomly resample the data a bunch of times and get statistics on those samples.

Happily, Julia has a bootstrap package that does a lot of the work for us.


In [9]:
using Bootstrap

function getbootstrap(v, n=1000, i=0.95)
    bs = boot_basic(v, mean, n)
    return ci_basic(bs, i)
end


Out[9]:
getbootstrap (generic function with 3 methods)

This function takes a vector of values v (like, say, a group of gender probabilities), and then samples them with replacement n times. In other words, if I had a vector t = [1, 2, 3, 4, 5] and I called getbootstrap(t), it would go through and take 5 samples at random... sometimes it would be [5, 2, 4, 3, 3], sometimes [1, 1, 1, 1, 1] etc. It would do this 1000 times, getting the mean each time.


In [17]:
t = [1,2,3,4,5]
getbootstrap(t)


Out[17]:
Bootstrap Confidence Interval
  Estimate: 3.0
  Interval: [1.7999999999999998,4.2]
  Level:    0.95
  Method:   basic

The estimate for the mean is 3, just as it should be, but I also get a confidence interval: based on the resampled means, we can be roughly 95% confident that the mean lies between 1.8 and 4.2.
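To make the resampling concrete, here is a hand-rolled sketch using only the standard library, assuming a recent Julia (1.x). `manualbootstrap` is a hypothetical helper, not part of the Bootstrap package, and I'm assuming the "basic" method reflects the resampled quantiles around the observed mean, which is the textbook definition of a basic bootstrap interval.

```julia
using Random, Statistics

# Hypothetical helper (not from Bootstrap.jl): resample `v` with replacement
# `n` times and return the observed mean plus a basic bootstrap interval.
function manualbootstrap(v; n=1000, level=0.95, rng=MersenneTwister(42))
    estimate = mean(v)
    # Each resample draws length(v) values from v, with replacement.
    resampled = [mean(rand(rng, v, length(v))) for _ in 1:n]
    alpha = 1 - level
    qlo, qhi = quantile(resampled, [alpha / 2, 1 - alpha / 2])
    # Basic interval: reflect the resampled quantiles around the observed mean.
    return (estimate=estimate, lower=2 * estimate - qhi, upper=2 * estimate - qlo)
end

ci = manualbootstrap([1, 2, 3, 4, 5])
println(ci)
```

The exact bounds depend on the random seed, so they won't match the output above digit for digit, but they should land in the same neighborhood.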

I can use this to look at the author genders:


In [18]:
by(alldata, [:Dataset, :Position]) do df
    ci = getbootstrap(dropna(df[:Pfemale]))
    return DataFrame(Mean=ci.t0, Lower=interval(ci)[1], Upper=interval(ci)[2])
end


Out[18]:
Row | Dataset | Position | Mean | Lower | Upper
1 | bio | first | 0.375574751013497 | 0.37334818017881577 | 0.3777158577042995
2 | bio | last | 0.24458769214547718 | 0.2425361884965018 | 0.24670336730005965
3 | bio | other | 0.36808847346683826 | 0.3665970657841157 | 0.36957724436593176
4 | bio | penultimate | 0.2793692129629628 | 0.2767264309764308 | 0.2817461016414139
5 | bio | second | 0.3785907859078593 | 0.3759988278057385 | 0.38095996902826196
6 | comp | first | 0.31633865814696477 | 0.3120813307403805 | 0.3204951034865952
7 | comp | last | 0.2072917800762114 | 0.20321504717837066 | 0.21104291417165685
8 | comp | other | 0.33070648163576183 | 0.3279606829163788 | 0.3336850914046319
9 | comp | penultimate | 0.2361738867009997 | 0.23164189423118536 | 0.2408402864932704
10 | comp | second | 0.32179097154072606 | 0.3167785468947146 | 0.32652382412729547

That code takes the data, subsets it by :Dataset and :Position (that's the first two columns) and then returns the mean and 95% confidence intervals for each subset.
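The split-apply-combine step can be sketched without DataFrames at all. This is only an illustration on made-up rows, assuming a recent Julia (1.x), where `missing` plays the role that `NA` plays in this notebook; `rows`, `groups` and `groupmeans` are hypothetical names.

```julia
using Statistics

# Toy rows standing in for the author table (values are made up for illustration).
rows = [
    (dataset="bio",  position="first", pfemale=0.9),
    (dataset="bio",  position="first", pfemale=0.1),
    (dataset="bio",  position="last",  pfemale=0.2),
    (dataset="comp", position="first", pfemale=0.4),
    (dataset="comp", position="first", pfemale=missing),
]

# Split: collect the non-missing values for each (dataset, position) key.
groups = Dict{Tuple{String,String},Vector{Float64}}()
for r in rows
    ismissing(r.pfemale) && continue
    push!(get!(groups, (r.dataset, r.position), Float64[]), r.pfemale)
end

# Apply and combine: one mean per group.
groupmeans = Dict(key => mean(vals) for (key, vals) in groups)
println(groupmeans)
```

In the notebook, `by` does the splitting and combining, and the bootstrap takes the place of the plain `mean`.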

We can make generic functions for this, which you can find in src/bootstrapping.jl.


In [19]:
function bystats(df::DataFrame, by1::Symbol, n=1000, i=0.95)
    by(df, by1) do df2
        ci = getbootstrap(dropna(df2[:Pfemale]), n, i)
        return DataFrame(Mean=ci.t0, Lower=interval(ci)[1], Upper=interval(ci)[2])
    end
end

function bystats(df::DataFrame, cols::Vector{Symbol}, n=1000, i=0.95)
    by(df, cols) do df2
        ci = getbootstrap(dropna(df2[:Pfemale]), n, i)
        return DataFrame(Mean=ci.t0, Lower=interval(ci)[1], Upper=interval(ci)[2])
    end
end


Out[19]:
bystats (generic function with 6 methods)

In [27]:
big3 = alldata[
    (alldata[:Journal] .== utf8("Nature"))|
    (alldata[:Journal] .== utf8("Science"))|
    (alldata[:Journal] .== utf8("Cell")), :]


big3byposition = bystats(big3, [:Position, :Dataset])


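# NOTE: from here on the plots use Gadfly syntax (Geom, Guide, Scale, Theme), so this
# assumes Gadfly was loaded and a `my_colors` palette was defined earlier in the session.
plot(big3byposition, x=:Position, y=:Mean, color=:Dataset, ymin=:Lower, ymax=:Upper,
            Scale.color_discrete_manual(my_colors...),
            Geom.bar(position = :dodge), Guide.title("Authors in Nature, Science and Cell"),
            Guide.YLabel("Percent Female"),
            Scale.x_discrete(levels=["first", "second", "other", "penultimate", "last"]),
            Theme(bar_spacing=2mm))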


Out[27]:
[Figure: "Authors in Nature, Science and Cell". Grouped bar chart; x-axis: Position (first, second, other, penultimate, last); y-axis: Percent Female; color: Dataset (bio, comp).]

In [28]:
plos = alldata[map(x->contains(x, "PLoS"), alldata[:Journal]), :]

plosbyposition = by(plos, [:Position, :Dataset], df -> mean(dropna(df[:Pfemale])))

plot(plosbyposition, x=:Position, y=:x1, color=:Dataset,
Scale.color_discrete_manual(my_colors...),
                Geom.bar(position = :dodge), Guide.title("Authors in PLoS Journals"),
                Guide.YLabel("Percent Female"),
                Scale.x_discrete(levels=["first", "second", "other", "penultimate", "last"]),
                Theme(bar_spacing=2mm))


Out[28]:
[Figure: "Authors in PLoS Journals". Grouped bar chart; x-axis: Position (first, second, other, penultimate, last); y-axis: Percent Female; color: Dataset (bio, comp).]

Based on Journal Specialty

Another way to do this is to split by journals that tend to publish computational biology articles vs those that are more generalist. Here we'll only use the articles in the "bio" dataset to avoid double-dipping (the "comp" dataset is almost entirely a subset of "bio").


In [29]:
bioids = Set(levels(alldata[alldata[:Dataset] .== "bio", :ID]))
compids = Set(levels(alldata[alldata[:Dataset] .== "comp", :ID]))

println("There are $(length(bioids)) articles in the \"bio\" dataset")
println("There are $(length(compids)) articles in the \"comp\" dataset")

dif = length(setdiff(compids, bioids))

println("There are $dif articles in the \"comp\" dataset that aren't in the \"bio\" dataset")


There are 202816 articles in the "bio" dataset
There are 42880 articles in the "comp" dataset
There are 236 articles in the "comp" dataset that aren't in the "bio" dataset
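So only 236 of the 42880 "comp" articles (about 0.55%) are missing from "bio". The set operations involved are easy to check on a toy example; the IDs below are made up, and `setdiff` and `intersect` are standard Base functions.

```julia
# Made-up article IDs, just to show what setdiff and intersect return.
bio_ids = Set([101, 102, 103, 104])
comp_ids = Set([103, 104, 105])

# IDs that are in "comp" but not in "bio".
only_comp = setdiff(comp_ids, bio_ids)

# IDs that appear in both sets.
shared = intersect(comp_ids, bio_ids)

println("only in comp: ", length(only_comp))
println("shared: ", length(shared))
```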

In [32]:
plosfocus = alldata[(alldata[:Dataset] .== "bio")&
    ((alldata[:Journal] .== utf8("PLoS Biol."))|
    (alldata[:Journal] .== utf8("PLoS Comput. Biol."))), :]

plosfocusbyposition = bystats(plosfocus, [:Position, :Journal])

plot(plosfocusbyposition, x=:Position, y=:Mean, color=:Journal,
                Scale.color_discrete_manual(my_colors...),
                Geom.bar(position = :dodge), Guide.title("Authors in Specified PLoS Journals"),
                Guide.YLabel("Percent Female"),
                Scale.x_discrete(levels=["first", "second", "other", "penultimate", "last"]),
                Theme(bar_spacing=2mm))


Out[32]:
[Figure: "Authors in Specified PLoS Journals". Grouped bar chart; x-axis: Position (first, second, other, penultimate, last); y-axis: Percent Female; color: Journal (PLoS Biol., PLoS Comput. Biol.).]

In [33]:
comppubs = alldata[(alldata[:Dataset] .== "bio")&
    ((alldata[:Journal] .== utf8("PLoS Comput. Biol."))| # impact factor 4.62
    (alldata[:Journal] .== utf8("Nucleic Acids Res."))| # 9.112
    (alldata[:Journal] .== utf8("BMC Bioinformatics"))| # 2.576
    (alldata[:Journal] .== utf8("Bioinformatics"))), :] # 4.981

genpubs = alldata[(alldata[:Dataset] .== "bio")&
    ((alldata[:Journal] .== utf8("Proc. Natl. Acad. Sci. U.S.A."))| # impact factor 9.674
    (alldata[:Journal] .== utf8("BMC Biol."))| # 9.112
    (alldata[:Journal] .== utf8("PLoS Biol."))| # 9.343
    (alldata[:Journal] .== utf8("Biol. Lett."))), :] # 3.248

comppubs[:JournalSet] = "comp"
genpubs[:JournalSet] = "gen"

pubs = vcat(comppubs, genpubs)
comppubs = 0
genpubs = 0


pubsbyposition = bystats(pubs, [:Position, :JournalSet])

plot(pubsbyposition, x=:Position, y=:Mean, color=:JournalSet,
                Scale.color_discrete_manual(my_colors...),
                Geom.bar(position = :dodge), Guide.title("Authors in Computational vs General Bio Publications"),
                Guide.YLabel("Percent Female"),
                Scale.x_discrete(levels=["first", "second", "other", "penultimate", "last"]),
                Theme(bar_spacing=2mm))


Out[33]:
[Figure: "Authors in Computational vs General Bio Publications". Grouped bar chart; x-axis: Position (first, second, other, penultimate, last); y-axis: Percent Female; color: JournalSet (comp, gen).]

Subsetting Conclusions

With a few exceptions, each of these subsets shows a similar trend: women are less likely to be authors in computational biology publications.

Changes over time

Previous work suggests that female authorship has been increasing over time. Let's see if the trends hold over time as well.


In [34]:
alldata[:Year] = map(x -> Dates.year(Date(x)), alldata[:Date])

byyear = bystats(alldata, [:Year, :Dataset])

plot(byyear, x=:Year, y=:Mean, color=:Dataset,
        Scale.color_discrete_manual(my_colors...),
        Geom.point, Geom.line, Guide.title("Female Authors Over Time"),
        Guide.YLabel("Percent Female"), Guide.XLabel("Year"))


Out[34]:
[Figure: "Female Authors Over Time". Line plot with points; x-axis: Year; y-axis: Percent Female; color: Dataset (bio, comp).]

What about in various author positions?


In [35]:
biobyyear = bystats(alldata[alldata[:Dataset] .== "bio", :], [:Year, :Position])

plot(biobyyear, x=:Year, y=:Mean, color=:Position,
        Scale.color_discrete_manual(my_colors...),
        Geom.point, Geom.line, Guide.title("Female Authors Over Time in Biology"),
        Guide.YLabel("Percent Female"), Guide.XLabel("Year"))


LoadError: ArgumentError: quantiles are undefined in presence of NaNs
while loading In[35], in expression starting on line 1

 in quantile! at statistics.jl:545
 in quantile at statistics.jl:609
 in ci_basic at /Users/KBLaptop/.julia/v0.4/Bootstrap/src/ci.jl:55
 in getbootstrap at In[9]:5
 in anonymous at In[19]:10
 in map at /Users/KBLaptop/.julia/v0.4/DataFrames/src/groupeddataframe/grouping.jl:181
 in bystats at In[19]:9 (repeats 2 times)

In [36]:
compbyyear = bystats(alldata[alldata[:Dataset] .== "comp", :], [:Year, :Position])

plot(compbyyear, x=:Year, y=:Mean, color=:Position,
        Scale.color_discrete_manual(my_colors...),
        Geom.point, Geom.line, Guide.title("Female Authors Over Time In Computational Biology"),
        Guide.YLabel("Percent Female"), Guide.XLabel("Year"))


LoadError: ArgumentError: quantiles are undefined in presence of NaNs
while loading In[36], in expression starting on line 1

 in quantile! at statistics.jl:545
 in quantile at statistics.jl:609
 in ci_basic at /Users/KBLaptop/.julia/v0.4/Bootstrap/src/ci.jl:55
 in getbootstrap at In[9]:5
 in anonymous at In[19]:10
 in map at /Users/KBLaptop/.julia/v0.4/DataFrames/src/groupeddataframe/grouping.jl:181
 in bystats at In[19]:9 (repeats 2 times)
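One plausible cause of this error (an assumption, I haven't confirmed it against the data) is that some Year/Position groups have no usable Pfemale values at all, so the mean of an empty vector comes out as NaN and the quantile step refuses to continue. Here is a minimal sketch of a guard, assuming a recent Julia (1.x) and a hand-rolled bootstrap rather than the Bootstrap package; `guardedbootstrap` and `minsize` are hypothetical names.

```julia
using Random, Statistics

# The mean of an empty Float64 vector is NaN, which quantile will not accept.
println(isnan(mean(Float64[])))

# Hypothetical guarded variant: return `nothing` for groups that are too small,
# otherwise compute the observed mean and a basic bootstrap interval.
function guardedbootstrap(v; n=1000, level=0.95, minsize=2, rng=MersenneTwister(1))
    length(v) < minsize && return nothing
    estimate = mean(v)
    resampled = [mean(rand(rng, v, length(v))) for _ in 1:n]
    alpha = 1 - level
    qlo, qhi = quantile(resampled, [alpha / 2, 1 - alpha / 2])
    return (estimate=estimate, lower=2 * estimate - qhi, upper=2 * estimate - qlo)
end

empty_group = guardedbootstrap(Float64[])
small_group = guardedbootstrap([0.1, 0.9, 0.5])
println(empty_group)
println(small_group)
```

Groups that come back as `nothing` could then be dropped before plotting, or the early years could simply be filtered out up front.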

Why does all the data suck before ~2002?


In [15]:
println("There are $(length(alldata[(alldata[:Year] .< 2002) & (alldata[:Dataset] .== "bio"), :Year])) pubs in the data set before 2002")
println("There are $(length(alldata[(alldata[:Year] .>= 2002) & (alldata[:Dataset] .== "bio"), :Year])) pubs in the data set after 2002")

for year in 1997:2015
    println("There are $(length(alldata[(alldata[:Year] .== year) & (alldata[:Dataset] .== "bio"), :Year])) pubs in the data set from $year")
end


There are 65587 pubs in the data set before 2002
There are 1046177 pubs in the data set after 2002
There are 6518 pubs in the data set from 1997
There are 9846 pubs in the data set from 1998
There are 11956 pubs in the data set from 1999
There are 15208 pubs in the data set from 2000
There are 22059 pubs in the data set from 2001
There are 24398 pubs in the data set from 2002
There are 33229 pubs in the data set from 2003
There are 45461 pubs in the data set from 2004
There are 54317 pubs in the data set from 2005
There are 58886 pubs in the data set from 2006
There are 65584 pubs in the data set from 2007
There are 73427 pubs in the data set from 2008
There are 84511 pubs in the data set from 2009
There are 99106 pubs in the data set from 2010
There are 110234 pubs in the data set from 2011
There are 119325 pubs in the data set from 2012
There are 124923 pubs in the data set from 2013
There are 133125 pubs in the data set from 2014
There are 19651 pubs in the data set from 2015

Data from the arXiv

Previous work suggests that women are far less likely to publish in computer science. Unfortunately, PubMed doesn't index computer science research.

The arXiv has preprints in many fields, including computer science. The sorts of papers posted there are likely to be different, so we can't compare directly to the stuff on PubMed, but the arXiv also has a quantitative biology section, which gives us a comparison within the same source.


In [37]:
arxivcs = importauthors("../data/pubs/arxivcs.csv", "arxivcs")
arxivbio = importauthors("../data/pubs/arxivbio.csv", "arxivbio")

arxiv = vcat(arxivbio, arxivcs)
arxivcs = 0
arxivbio = 0

pool!(arxiv)
arxiv = arxiv[!isna(arxiv[:Author_Name]), :]

arxiv[:Pfemale], arxiv[:Count] = getgenderprob(arxiv, "../data/genders/genderAPI_genders.json", :Author_Name)


arxivbyposition = bystats(arxiv, [:Dataset, :Position])


Out[37]:
Row | Dataset | Position | Mean | Lower | Upper
1 | arxivbio | first | 0.18387958963282955 | 0.17795801835853145 | 0.19044420446364302
2 | arxivbio | last | 0.1477431463997271 | 0.14119141735866242 | 0.1544600443635538
3 | arxivbio | other | 0.2645364647713224 | 0.25159851668726796 | 0.2773817676143384
4 | arxivbio | penultimate | 0.19579630895420383 | 0.18280562200956957 | 0.20920189678742315
5 | arxivbio | second | 0.20970474967907585 | 0.20037011736658727 | 0.21925201723821752
6 | arxivcs | first | 0.15733544355165988 | 0.15493641749857975 | 0.15950626572518473
7 | arxivcs | last | 0.15534637948196642 | 0.1528472001609591 | 0.15801725543766032
8 | arxivcs | other | 0.1884365247155947 | 0.18213651464814276 | 0.194720980569818
9 | arxivcs | penultimate | 0.1753966411268479 | 0.1698807948301217 | 0.18095756907360122
10 | arxivcs | second | 0.17537953336346893 | 0.17152487916826045 | 0.17898814284224077

In [39]:
plot(arxivbyposition, x=:Position, y=:Mean, color=:Dataset,
        Scale.color_discrete_manual(my_colors...),
        Guide.title("Female Authors in arXiv"),
        Geom.bar(position = :dodge),
        Scale.x_discrete(levels=["first", "second", "other", "penultimate", "last"]),
        Theme(bar_spacing=2mm))


Out[39]:
[Figure: "Female Authors in arXiv". Grouped bar chart; x-axis: Position (first, second, other, penultimate, last); y-axis: Mean; color: Dataset (arxivbio, arxivcs).]

In [40]:
arxiv[:Year] = map(x -> Dates.year(Date(x)), arxiv[:Date])
arxivbyyear = bystats(arxiv, [:Year, :Dataset])

plot(arxivbyyear, x=:Year, y=:Mean, color=:Dataset,
        Scale.color_discrete_manual(my_colors...),
        Geom.point, Geom.line, Guide.title("Female Authors in arXiv Over Time"), # title previously said "PLoS Journals"
        Guide.YLabel("Percent Female"), Guide.XLabel("Year"), Guide.yticks(ticks=collect(0:0.1:0.5)))


Out[40]:
[Figure: line plot with points of mean Pfemale by year for the arXiv data; x-axis: Year; y-axis: Percent Female (ticks 0.0 to 0.5); color: Dataset (arxivbio, arxivcs).]