bioRxiv license choices, before & after the blog post

Did the blog post, "The licensing of bioRxiv preprints", have an effect on which licensing option authors choose for their bioRxiv preprints? The blog post was published on December 05, 2016.

While the blog post was popular on Twitter, it has received only 349 pageviews in total (as of March 29, 2017). So let's dive into the data.

Import packages and load data

See notebooks 1 & 2 for how the dataset was created. bioRxiv preprint information was obtained from PrePubMed's data repository.


In [1]:
import os
import json
import math

import pandas
import altair
import vega
from statsmodels.stats.proportion import proportions_ztest

In [2]:
path = os.path.join('data', 'preprints.tsv')
preprint_df = pandas.read_table(path, parse_dates=['Date'])

Licenses over time (aggregated monthly)

Below we recreate the first figure from the blog post with the current data (up to March 25, 2017). There appears to be an increase in CC BY licensing near the end of 2016, but it's difficult to tell at this scale.


In [3]:
path = os.path.join('figure', 'license-vs-time', 'vega-lite-spec.json')
with open(path) as read_file:
    spec_text = read_file.read()
spec = json.loads(spec_text)
vega.VegaLite(spec, preprint_df)


Compute weekly license distribution frequencies


In [4]:
# Add a year_week column
ts_to_yearweek = lambda x: '{0}-{1:02d}'.format(*x.isocalendar())
preprint_df['year_week'] = preprint_df.Date.map(ts_to_yearweek)
preprint_df.head(2)


Out[4]:
DOI Date License year_week
0 10.1101/000026 2014-09-08 CC BY 2014-37
1 10.1101/000042 2013-12-01 CC BY 2013-48
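To make the year_week labels concrete, here is a minimal standard-library sketch of the same formatting. The helper name year_week is an assumption introduced for illustration; the notebook itself uses the ts_to_yearweek lambda on pandas timestamps. The sketch also shows that the ISO year can differ from the calendar year near January 1.

```python
from datetime import date


def year_week(d):
    """Return an ISO 'YYYY-WW' label, mirroring the notebook's ts_to_yearweek."""
    iso_year, iso_week, _weekday = d.isocalendar()
    return '{0}-{1:02d}'.format(iso_year, iso_week)


# The blog post date falls in ISO week 49 of 2016
blog_label = year_week(date(2016, 12, 5))

# January 1, 2017 was a Sunday, so it still belongs to the last ISO week of 2016
new_year_label = year_week(date(2017, 1, 1))

print(blog_label, new_year_label)
```

Because the week number is zero-padded, the labels sort chronologically as plain strings, which is what allows a string comparison such as year_week >= '2016-26' to select a date range.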

In [5]:
# Compute frequencies
weekly_percents = pandas.crosstab(preprint_df.year_week, preprint_df.License, normalize='index')
weekly_percents = weekly_percents.applymap('{:.1%}'.format)
weekly_totals = preprint_df.year_week.value_counts().rename('total')
week_df = weekly_percents.join(weekly_totals).reset_index()
week_df.tail(2)


Out[5]:
year_week CC BY CC BY-NC CC BY-NC-ND CC BY-ND None total
186 2017-22 23.8% 8.5% 32.3% 3.6% 31.8% 223
187 2017-23 20.4% 13.0% 22.2% 7.4% 37.0% 54
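The row-wise normalization requested with normalize='index' can be spelled out by hand. The sketch below uses made-up toy records (not the real dataset) and only the standard library, so the variable names and values are assumptions for illustration: each week's license counts are divided by that week's total, so every row sums to 100%.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (year_week, license)
records = [
    ('2016-49', 'CC BY'), ('2016-49', 'CC BY'),
    ('2016-49', 'None'), ('2016-49', 'CC BY-NC-ND'),
    ('2016-50', 'CC BY'), ('2016-50', 'None'),
]

# Count licenses within each week
counts = defaultdict(Counter)
for week, license_name in records:
    counts[week][license_name] += 1

# Divide each count by its week's total (row-wise normalization)
weekly_shares = {}
for week, counter in counts.items():
    total = sum(counter.values())
    weekly_shares[week] = {name: n / total for name, n in counter.items()}

for week in sorted(weekly_shares):
    formatted = {name: '{:.1%}'.format(share)
                 for name, share in weekly_shares[week].items()}
    print(week, formatted)
```

Percentages from weeks with few preprints, such as the final row above with a total of 54, are noisier than those from weeks with several hundred.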

In [6]:
# Export as TSV
path = os.path.join('data', 'weeks.tsv')
week_df.to_csv(path, sep='\t', index=False)

Licenses over time (aggregated weekly)

The blog post was published on December 05, 2016, which was the first day (Monday) of the 49th ISO week of 2016. Below we modify the vega-lite specification to display the weekly-aggregated distributions. As the plot illustrates, there appears to be a slight shift towards more open licenses starting in week 2016-49.


In [7]:
# Find week of blog post (year, week, day)
blog_ts = pandas.Timestamp('2016-12-05')
blog_ts.isocalendar()


Out[7]:
(2016, 49, 1)

In [8]:
# Modify vega-lite spec to plot weekly distributions
spec = json.loads(spec_text)
del spec['encoding']['x']['timeUnit']
del spec['encoding']['x']['scale']
del spec['encoding']['x']['axis']['format']
del spec['encoding']['x']['axis']['labelAngle']
spec['encoding']['x']['type'] = 'ordinal'
spec['encoding']['x']['field'] = 'year_week'
spec['width'] = 600

# Start midway through 2016
df = preprint_df.query("year_week >= '2016-26'")
vega.VegaLite(spec, df)
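The spec edits above use del, which raises a KeyError if a key is missing from the loaded file. A more tolerant variant might look like the sketch below. The monthly_spec dictionary is a hypothetical stand-in (the real vega-lite-spec.json is not shown here), and to_weekly is a helper name introduced only for illustration.

```python
import copy

# Hypothetical stand-in for the monthly spec; the real file's contents are not shown
monthly_spec = {
    'mark': 'area',
    'encoding': {
        'x': {
            'field': 'Date',
            'type': 'temporal',
            'timeUnit': 'yearmonth',
            'axis': {'format': '%Y', 'labelAngle': 0, 'title': None},
        },
    },
}


def to_weekly(spec, field='year_week', width=600):
    """Return a copy of spec with an ordinal weekly x-axis, leaving spec untouched."""
    weekly = copy.deepcopy(spec)
    x = weekly['encoding']['x']
    # pop with a default ignores keys that are absent, unlike del
    for key in ('timeUnit', 'scale'):
        x.pop(key, None)
    for key in ('format', 'labelAngle'):
        x.get('axis', {}).pop(key, None)
    x['type'] = 'ordinal'
    x['field'] = field
    weekly['width'] = width
    return weekly


weekly_spec = to_weekly(monthly_spec)
print(weekly_spec['encoding']['x'])
```

Working on a deep copy keeps the monthly spec reusable, which serves the same purpose as re-parsing spec_text in the cell above.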


Compare eras: 100 days before and after the blog post

Next, we compare the distribution of licenses in the 100 days preceding the blog post to the distribution in the 100 days following it.


In [9]:
td = pandas.Timedelta('100 days')
before_df = preprint_df[(preprint_df.Date >= blog_ts - td) & (preprint_df.Date < blog_ts)].copy()
after_df = preprint_df[(preprint_df.Date >= blog_ts) & (preprint_df.Date < blog_ts + td)].copy()
before_df['era'] = 'before'
after_df['era'] = 'after'
era_df = pandas.concat([before_df, after_df])
era_df = pandas.crosstab(era_df.License, era_df.era, margins=True)[['before', 'after', 'All']].reset_index()

# Preprint counts, 100 days before and after
era_df


Out[9]:
era License before after All
0 CC BY 206 434 640
1 CC BY-NC 130 217 347
2 CC BY-NC-ND 589 672 1261
3 CC BY-ND 118 119 237
4 None 458 604 1062
5 All 1501 2046 3547
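The two eras are half-open intervals, so the publication day itself counts as "after". The following standard-library sketch spells out the boundaries; the function name era is an assumption for illustration, and it works on plain dates rather than the pandas timestamps used in the notebook.

```python
from datetime import date, timedelta

blog_date = date(2016, 12, 5)
window = timedelta(days=100)


def era(d):
    """Classify a date as 'before', 'after', or None (outside both windows)."""
    if blog_date - window <= d < blog_date:
        return 'before'
    if blog_date <= d < blog_date + window:
        return 'after'
    return None


# First day of the 'before' window and last day of the 'after' window
first_before = blog_date - window
last_after = blog_date + window - timedelta(days=1)
print(first_before, last_after)
```

Each window therefore covers exactly 100 calendar days, and no preprint can be counted in both.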

In [10]:
def prop_test(series):
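    """
    Two-proportion z-test comparing one license's share after versus
    before the blog post. `series` is a row of era_df; the era totals
    are read from the final ('All') row of era_df.
    """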
    count = series.after, series.before
    nobs = era_df.after.iloc[-1], era_df.before.iloc[-1]
    z_score, p_value = proportions_ztest(count, nobs)
    before = count[1] / nobs[1]
    row = {
        'License': series.License,
        'before': '{:.1%}'.format(before),
        'after': '{:.1%}'.format(count[0] / nobs[0]),
        'impact': round(count[0] - before * nobs[0], 2),
        'z_score': round(z_score, 2),
        'p_value': '{:.2e}'.format(p_value),
        'nlog10_p_value': round(-math.log10(p_value), 2),
    }
    return pandas.Series(row)

test_df = (
    era_df
    .iloc[:-1, ]
    .apply(prop_test, axis='columns')
    .sort_values('z_score')
    [['License', 'before', 'after', 'z_score', 'p_value', 'nlog10_p_value', 'impact']]
)

# Results from the comparison of proportions
test_df


Out[10]:
License before after z_score p_value nlog10_p_value impact
2 CC BY-NC-ND 39.2% 32.8% -3.93 8.43e-05 4.07 -130.86
3 CC BY-ND 7.9% 5.8% -2.41 1.60e-02 1.80 -41.84
4 None 30.5% 29.5% -0.64 5.24e-01 0.28 -20.30
1 CC BY-NC 8.7% 10.6% 1.93 5.40e-02 1.27 39.80
0 CC BY 13.7% 21.2% 5.73 1.01e-08 8.00 153.20
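For readers who want to see the arithmetic behind the CC BY row, here is a hand-rolled sketch of a pooled two-proportion z-test using only the standard library. It assumes the pooled-variance form of the test (which is what proportions_ztest uses by default, as far as I know) and takes the counts from the table of preprint counts above. The function name two_proportion_ztest is introduced for illustration.

```python
import math


def two_proportion_ztest(count_a, n_a, count_b, n_b):
    """Pooled two-sided z-test for the difference between two proportions."""
    p_a, p_b = count_a / n_a, count_b / n_b
    # Pooled proportion under the null hypothesis of equal rates
    pooled = (count_a + count_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# CC BY: 434 of 2046 preprints after, 206 of 1501 before
z, p_value = two_proportion_ztest(434, 2046, 206, 1501)

# 'impact': observed CC BY count after, minus the count expected at the old rate
impact = 434 - (206 / 1501) * 2046

print(round(z, 2), '{:.2e}'.format(p_value), round(impact, 2))
```

The impact figure is a counterfactual estimate: it is only meaningful under the assumption that the pre-post CC BY rate would otherwise have persisted.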

In [11]:
# Bonferroni significance threshold on the -log10 scale (alpha = 0.05)
-math.log10(0.05 / len(test_df))


Out[11]:
2.0
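As a quick check of which licenses clear a Bonferroni-corrected cutoff, the sketch below applies the correction to the p-values from the comparison table. The family-wise alpha of 0.05 is an assumption here (it matches the p < 0.01 cutoff quoted in the findings for five tests), and the dictionary simply copies the rounded p-values shown above.

```python
import math

# p-values copied from the comparison-of-proportions table
p_values = {
    'CC BY-NC-ND': 8.43e-05,
    'CC BY-ND': 1.60e-02,
    'None': 5.24e-01,
    'CC BY-NC': 5.40e-02,
    'CC BY': 1.01e-08,
}

alpha = 0.05  # assumed family-wise error rate
threshold = alpha / len(p_values)        # per-test cutoff after correction
nlog10_threshold = -math.log10(threshold)

# Licenses whose change is significant after correction
significant = sorted(name for name, p in p_values.items() if p < threshold)
print(nlog10_threshold, significant)
```

Equivalently, a row is significant when its nlog10_p_value exceeds the threshold on the -log10 scale.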

Findings

Licensing choices appear to have become more open following the blog post. In the 100 days prior, only 13.7% of preprints were CC BY (the only open license of the bunch). Over the next 100 days, 21.2% of preprints were CC BY. While the effect size is not huge, the difference is highly significant (p = 1.01 × 10^-8). Assuming licensing choices would have remained constant absent the blog post, the shift corresponds to roughly 153 additional CC BY preprints in the 100 days following its release.

The increase in CC BY licensing appears to correspond to a decrease in CC BY-NC-ND licensing, which declined in prevalence from 39.2% to 32.8%. Overall, CC BY and CC BY-NC increased in prevalence, while CC BY-NC-ND, CC BY-ND, and no license decreased. Only the changes for CC BY and CC BY-NC-ND were significant at the Bonferroni-corrected level of p < 0.01; the changes for CC BY-ND (p = 0.016) and CC BY-NC (p = 0.054) fell short of it, and the proportion of unlicensed preprints barely moved (p = 0.52). Unfortunately, the prevalence of unlicensed preprints remained at ~30%. Unlicensed preprints represent a major concern for the reusability and archivability of the scientific record.

Motivations

In the comments for my blog post, I mentioned that bioRxiv should switch the order of their license selector to display open licenses first. As Jessica Polka pointed out on March 21, 2017, bioRxiv adopted this suggestion at an unknown date following the blog post:

It looks like bioRxiv has reversed the order of licenses! Would love to know when this happened and if it has caused any change in license choice.

Thanks Jessica for motivating this analysis! #ASAPBio

Limitations

This analysis does not take into account changes in the composition of subjects or authors submitting preprints. Such relative changes could confound the comparison, for example if preprints gained popularity in more open-minded subject areas relative to close-minded ones.