Did the blog post, "The licensing of bioRxiv preprints", have an effect on which licensing option authors choose for their bioRxiv preprints? The blog post was published on December 05, 2016.
While the blog post was popular on Twitter, it has only recieved a total of 349 pageviews (as of March 29, 2017). So let's dive into the data.
See notebooks 1 & 2 for how dataset was created. bioRxiv preprint information was obtained from PrePubMed's data repository.
In [1]:
import os
import json
import math
import pandas
import altair
import vega
from statsmodels.stats.proportion import proportions_ztest
In [2]:
path = os.path.join('data', 'preprints.tsv')
preprint_df = pandas.read_table(path, parse_dates=['Date'])
Below we recreate the first figure from the blog post with the current data (up to March 25, 2017). There appears to be an increase of CC BY licensing near the end of 2016, but it's difficult to tell at this scale.
In [3]:
path = os.path.join('figure', 'license-vs-time', 'vega-lite-spec.json')
with open(path) as read_file:
spec_text = read_file.read()
spec = json.loads(spec_text)
vega.VegaLite(spec, preprint_df)
In [4]:
# Add a year_week column
ts_to_yearweek = lambda x: '{0}-{1:02d}'.format(*x.isocalendar())
preprint_df['year_week'] = preprint_df.Date.map(ts_to_yearweek)
preprint_df.head(2)
Out[4]:
In [5]:
# Compute frequencies
weekly_percents = pandas.crosstab(preprint_df.year_week, preprint_df.License, normalize='index')
weekly_percents = weekly_percents.applymap('{:.1%}'.format)
weekly_totals = preprint_df.year_week.value_counts().rename('total')
week_df = weekly_percents.join(weekly_totals).reset_index()
week_df.tail(2)
Out[5]:
In [6]:
# Export as TSV
path = os.path.join('data', 'weeks.tsv')
week_df.to_csv(path, sep='\t', index=False)
The blog was released on December 05, 2016, which was the first day of the 49th week of 2016. Below we modify the vega-lite specification to display the weekly-aggregated distributions. As the plot illustrates, there appears to be a slight shift towards more open licenses starting on 2016-49.
In [7]:
# Find week of blog post (year, week, day)
blog_ts = pandas.Timestamp('2016-12-05')
blog_ts.isocalendar()
Out[7]:
In [8]:
# Modify vega-lite spec to plot weekly distributions
spec = json.loads(spec_text)
del spec['encoding']['x']['timeUnit']
del spec['encoding']['x']['scale']
del spec['encoding']['x']['axis']['format']
del spec['encoding']['x']['axis']['labelAngle']
spec['encoding']['x']['type'] = 'ordinal'
spec['encoding']['x']['field'] = 'year_week'
spec['width'] = 600
# Start midway through 2016
df = preprint_df.query("year_week >= '2016-26'")
vega.VegaLite(spec, df)
In [9]:
td = pandas.Timedelta('100 days')
before_df = preprint_df[(preprint_df.Date >= blog_ts - td) & (preprint_df.Date < blog_ts)].copy()
after_df = preprint_df[(preprint_df.Date >= blog_ts) & (preprint_df.Date < blog_ts + td)].copy()
before_df['era'] = 'before'
after_df['era'] = 'after'
era_df = pandas.concat([before_df, after_df])
era_df = pandas.crosstab(era_df.License, era_df.era, margins=True)[['before', 'after', 'All']].reset_index()
# Preprint counts, 100 days before and after
era_df
Out[9]:
In [10]:
def prop_test(series):
"""
Compare proportions after to before.
"""
count = series.after, series.before
nobs = era_df.after.iloc[-1], era_df.before.iloc[-1]
z_score, p_value = proportions_ztest(count, nobs)
before = count[1] / nobs[1]
row = {
'License': series.License,
'before': '{:.1%}'.format(before),
'after': '{:.1%}'.format(count[0] / nobs[0]),
'impact': round(count[0] - before * nobs[0], 2),
'z_score': round(z_score, 2),
'p_value': '{:.2e}'.format(p_value),
'nlog10_p_value': round(-math.log10(p_value), 2),
}
return pandas.Series(row)
test_df = (
era_df
.iloc[:-1, ]
.apply(prop_test, axis='columns')
.sort_values('z_score')
[['License', 'before', 'after', 'z_score', 'p_value', 'nlog10_p_value', 'impact']]
)
# Results from the comparison of proportions
test_df.sort_values('z_score')
Out[10]:
In [11]:
# Bonferroni significance threshold
-math.log10(0.5 / len(test_df))
Out[11]:
Licensing options appear to have become more open following the blog post. In the 100 days prior, only 13.7% of preprints were CC BY (the only open license of the bunch). However, over the next 100 days, 21.1% of preprints were CC BY. While the effect size is not huge, the difference is highly significant (p = 1.67 × 10-8). Assuming licensing choices would have remained constant absent the blog post, the post resulted in 151 additional CC BY preprints in the 100 days following its release.
The increase in CC BY licensing appears to correspond to a decrease in CC BY-NC-ND licensing, which declined in prevalence from 39.2% to 33.2%. Overall, CC BY and CC BY-NC increased in prevalence, while CC BY-NC-ND, CC BY-ND, and no license decreased in prevalence. All of these changes were significant at p < 0.01 besides the proportion of unlicensed preprints. Unfortunately, the prevalence of unlicensed preprints remained at ~30%. Unlicensed preprints represent a major concern for the reusability and archivability of the scientific record.
In the comments for my blog post, I mentioned that bioRxiv should switch the order of their license selector to display open licenses first. As Jessica Polka pointed out on March 21, 2017, bioRxiv adopted this suggestion at an unknown date following the blog post:
It looks like bioRxiv has reversed the order of licenses! Would love to know when this happened and if it has caused any change in license choice.
Thanks Jessica for motivating this analysis! #ASAPBio
This analysis does not take into account changes to the composition of subjects or authors submitting preprints. Such relative changes could confound this analysis — for example, if preprints gained popularity in more open-minded subject areas relative to close-minded subject areas.