When will monthly arXiv submissions hit 10,000?


In [1]:
from collections import Counter
import itertools as it
from IPython.display import display

In [2]:
import pandas
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

In [3]:
%matplotlib inline
%config InlineBackend.figure_formats = ['svg']

Use pandas to load the data from the CSV file.


In [4]:
data = pandas.read_csv('get_monthly_submissions')

In [5]:
data


Out[5]:
month submissions historical_delta
0 1991-07 0 -2
1 1991-08 27 -1
2 1991-09 58 0
3 1991-10 76 0
4 1991-11 64 0
5 1991-12 78 0
6 1992-01 88 -105
7 1992-02 124 -10
8 1992-03 117 -3
9 1992-04 184 -41
10 1992-05 226 -11
11 1992-06 232 -5
12 1992-07 274 -22
13 1992-08 223 -4
14 1992-09 315 -21
15 1992-10 372 -29
16 1992-11 437 -16
17 1992-12 394 -10
18 1993-01 364 -36
19 1993-02 412 -16
20 1993-03 495 -9
21 1993-04 488 -17
22 1993-05 531 -7
23 1993-06 520 -16
24 1993-07 608 -35
25 1993-08 516 -25
26 1993-09 502 -16
27 1993-10 642 -36
28 1993-11 698 -16
29 1993-12 723 -15
... ... ... ...
273 2014-04 7855 0
274 2014-05 7975 0
275 2014-06 7874 0
276 2014-07 8549 0
277 2014-08 7119 0
278 2014-09 8676 0
279 2014-10 8871 0
280 2014-11 8006 0
281 2014-12 8770 0
282 2015-01 7912 0
283 2015-02 8054 0
284 2015-03 9191 0
285 2015-04 8367 0
286 2015-05 8172 0
287 2015-06 9217 0
288 2015-07 8995 0
289 2015-08 7983 0
290 2015-09 9318 0
291 2015-10 9223 0
292 2015-11 9472 0
293 2015-12 9376 0
294 2016-01 8251 0
295 2016-02 9142 0
296 2016-03 9746 0
297 2016-04 8948 0
298 2016-05 9792 0
299 2016-06 9644 0
300 2016-07 8911 0
301 2016-08 9016 0
302 2016-09 372 0

303 rows × 3 columns


In [6]:
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
measured_month_names = list(it.chain(*[["{} '{}".format(month, year % 100)
                                        for month in month_names]
                                       for year in range(1991, 2016 + 1)]))[6:-4]

We'll use all but the latest month's data, since submissions are still being added to September's total.


In [7]:
ax = data[:-1].plot(y='submissions')
tick_locations = [0, len(data) - 2]
ax.set_xticks(tick_locations)
ax.set_xticklabels(np.array(measured_month_names)[tick_locations])


Out[7]:
[<matplotlib.text.Text at 0x7f877e786d68>,
 <matplotlib.text.Text at 0x7f877e78a5f8>]

There's lots of noise, but one might expect part of that comes from yearly cycles (some months might be more popular months to submit papers in due to the academic calendar), so let's start by modeling how the yearly total grows to simplify our picture.

Yearly growth modeling


In [8]:
yearly_totals = np.array([sum(data['submissions'][12*n:12*n+12])
                          for n in range(len(data) // 12)])
start_years = np.array([1991 + n for n in range(len(yearly_totals))])

In [9]:
plt.plot(start_years, yearly_totals)


Out[9]:
[<matplotlib.lines.Line2D at 0x7f877e208a90>]

It looks like we might be seeing some exponential growth in more recent years, but the growth looks pretty linear at the start. Let's look at it on a semilog plot.


In [10]:
plt.semilogy(start_years, yearly_totals)


Out[10]:
[<matplotlib.lines.Line2D at 0x7f877e1c7278>]

Sure enough, the semilog plot looks pretty linear in recent years, so we might try and find an exponential fit to the yearly data. We'll fit the data with a variety of start points and see what the earliest start point that gives us a reasonable goodness-of-fit result is as a crude way of determining what year the exponential trend began.

We need at least three data points to get meaningful fits, so our most recent start time for the regressions will be back two years from our most recent year.


In [11]:
regressions = [stats.linregress(start_years[n:], np.log(yearly_totals[n:]))
               for n in range(len(yearly_totals) - 2)]

In [12]:
regressions[0]


Out[12]:
LinregressResult(slope=0.13127509891504549, intercept=-252.55958069326965, rvalue=0.91389611628770318, pvalue=1.7656919346863394e-10, stderr=0.012158840331414851)

In [13]:
slopes, intercepts, rvalues, pvalues, stderrs = [*zip(*regressions)]

In [14]:
plt.plot(start_years[:-2], rvalues, label='1-R value')
plt.legend()


Out[14]:
<matplotlib.legend.Legend at 0x7f877e111d30>

In [15]:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(start_years[:-2], pvalues, label='P value')
ax.plot(start_years[:-2], stderrs, label='Standard error')
ax.set_ylim(0, 0.013)
ax.legend()


Out[15]:
<matplotlib.legend.Legend at 0x7f877d250dd8>

In [16]:
(start_years[np.argmax(rvalues)], start_years[np.argmin(stderrs)],
 start_years[np.argmin(pvalues)])


Out[16]:
(1999, 1999, 1998)

These metrics tell us we should probably think of the exponential growth starting around 1999.

Let's plot our fit against the data.

First in semilog


In [17]:
plt.semilogy(start_years, yearly_totals, label='data')
plt.semilogy(start_years, np.exp(slopes[np.argmax(rvalues)]*start_years
                                 + intercepts[np.argmax(rvalues)]), label='fit')
plt.legend()


Out[17]:
<matplotlib.legend.Legend at 0x7f877d23f5f8>

Then normally.


In [18]:
plt.plot(start_years, yearly_totals, label='data')
plt.plot(start_years, np.exp(slopes[np.argmax(rvalues)]*start_years
                             + intercepts[np.argmax(rvalues)]), label='fit')
plt.legend(loc=2)


Out[18]:
<matplotlib.legend.Legend at 0x7f877d0ef7f0>

The fits certainly look decent to the eye!

Monthly fluctuations

Now we would like to know how the month-to-month fluctuations within a year function. One way to model these fluctuations is to assume that the relative productivities of different months within a year is roughly constant across all years. Let's calculate the fraction of yearly output for each month (going up through June 2016, since we don't have a yearly total for July 2016 through June 2017).


In [19]:
monthly_fracs = data['submissions'][:12*len(yearly_totals)].reshape((-1, 12)).T / yearly_totals

In [20]:
monthly_fracs.shape


Out[20]:
(12, 25)

In [21]:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(np.repeat(list(range(12)), 25), monthly_fracs.reshape(-1))
ax.fill_between(list(range(12)), np.percentile(monthly_fracs, 25, axis=1),
                np.percentile(monthly_fracs, 75, axis=1), alpha=0.5)
ax.plot(list(range(12)), np.median(monthly_fracs, axis=1))
ax.set_xticks(list(range(12)))
ax.set_xticklabels(month_names[6:] + month_names[:6])
ax.set_xlim(-0.5, 11.5)


Out[21]:
(-0.5, 11.5)

It looks like we have some significant oscillations from month-to-month as measured by the width of the inner quartiles (shaded region), as well as some outliers. The median is plotted as the solid blue line.

Let's see how well combining the median and inner quartile region for the monthly fractions with our yearly exponential fit matches the data.


In [22]:
modeled_start_years = np.array([1991 + n for n in range(len(yearly_totals) + 1)])
modeled_yearly_totals = np.exp(slopes[np.argmax(rvalues)]*modeled_start_years
                               + intercepts[np.argmax(rvalues)])

In [23]:
monthly_fracs_quartiles = np.percentile(monthly_fracs, [25, 50, 75], axis=1)

In [24]:
monthly_fracs_quartiles.shape


Out[24]:
(3, 12)

In [25]:
modeled_yearly_totals.shape


Out[25]:
(26,)

In [26]:
modeled_quartiles = (monthly_fracs_quartiles[:,np.newaxis,:]
                     * modeled_yearly_totals[np.newaxis,:,np.newaxis]).reshape((3,-1))

In [27]:
modeled_quartiles.shape


Out[27]:
(3, 312)

In [28]:
modeled_month_names = list(it.chain(*[["{} '{}".format(month, year % 100)
                                  for month in month_names]
                                 for year in range(1991, 2017 + 1)]))[6:-6]

In [29]:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.fill_between(range(modeled_quartiles.shape[1]), modeled_quartiles[0],
                modeled_quartiles[2], alpha=0.5)
ax.plot(data['submissions'][:-1], color='g')
tick_locations = [0, modeled_quartiles.shape[1] - 1]
ax.set_xticks(tick_locations)
ax.set_xticklabels(np.array(modeled_month_names)[tick_locations])
ax.set_xlim(tick_locations)


Out[29]:
(0, 311)

In [30]:
tick_locations = range(227, modeled_quartiles.shape[1], 12)
xmin = tick_locations[0]
xmax = tick_locations[-1]
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.fill_between(range(modeled_quartiles.shape[1]), np.log10(modeled_quartiles[0]),
                np.log10(modeled_quartiles[2]), alpha=0.5)
ax.plot(np.log10(data['submissions'][:-1]), color='g', linewidth=2)
ax.hlines(4, xmin, xmax, linestyle='--')
ax.set_xticks(tick_locations)
ax.set_xticklabels(np.array(modeled_month_names)[tick_locations])
ax.set_yticks([3.75, 3.875, 4])
ax.set_yticklabels(['$10^{3.75}$', '$10^{3.875}$', '$10^4$'])
ax.set_xlim(xmin, xmax)
ax.set_ylim([3.7, 4.1])


Out[30]:
(3.7, 4.1)

This looks decent, although there is clearly a lot of noise that we haven't accounted for. We do seem to roughly track the fluctuations with our simple model, though, so we'll run with it.

The prediction

Now let's assume that in the coming months the number of monthly submissions will be the predicted yearly total times a fraction sampled uniformly from the fractions of yearly submissions the month in question has had historically. This allows us to calculate a probability for each upcoming month to have a number of submissions in excess of 10,000. The probability that a given month will be the first to have 10,000 submissions will then be the probability that none of the preceeding months had 10,000 submissions times the probability that the given month will. Using this, we can come up with probabilities for upcoming months to be the first month to have 10,000 submissions.


In [31]:
class html_table(list):
    def __init__(self, list_data, first_row=None, first_col=None):
        super().__init__(list_data)
        self.first_row = first_row
        self.first_col = first_col

    def _repr_html_(self):
        html = '<table>'
        if self.first_row is not None:
            html += '<tr>'
            if self.first_col is not None:
                html += '<td></td>'
            for entry in self.first_row:
                html += '<td><strong>{}</strong></td>'.format(entry)
            html += '</tr>'
        for n, row in enumerate(self):
            html += '<tr>'
            if self.first_col is not None:
                html += '<td><strong>{}</strong></td>'.format(self.first_col[n])
            for entry in row:
                html += '<td>{}</td>'.format(entry)
            html += '</tr>'
        html += '</table>'
        return html

In [32]:
modeled_samples = (monthly_fracs * modeled_yearly_totals[-1])

In [33]:
counts = Counter(np.where(modeled_samples >= 10000)[0])

In [34]:
month_probs = [(n, counts[n]/25) for n in counts if n > 1]

In [35]:
fig = plt.figure(figsize=(8,4))
ax = fig.add_subplot(1, 1, 1)
ax.scatter(*zip(*month_probs), color='r', s=64)
ax.set_xticks(range(2, 12))
ax.set_xticklabels(modeled_month_names[-10:])
ax.set_yticks(np.linspace(0, 1, 6))
ax.set_yticklabels(['$0$'] + [r'$\frac{' + str(n) + '}{5}$' for n in range(1, 5)] + ['$1$'],
                  fontsize=14)
ax.set_xlim(1.5, 11.5)
ax.set_ylabel('Probability > 10,000')
plt.show()
display(html_table([['{:.0f}%'.format(100*prob) for prob in list(zip(*month_probs))[1]]],
                   modeled_month_names[-10:]))


Sep '16Oct '16Nov '16Dec '16Jan '17Feb '17Mar '17Apr '17May '17Jun '17
60%84%72%64%20%16%84%52%92%92%

In [36]:
first_probs = []
months = []
prob_attained = 0.
for month, prob in month_probs:
    prob_this_month = (1 - prob_attained) * prob
    first_probs.append(prob_this_month)
    months.append(month)
    prob_attained = prob_attained + prob_this_month
cumulative_probs = list(it.accumulate(first_probs))

Our model predicts that the arXiv will almost certainly have a month with at least 10,000 submissions by the end of the year:


In [37]:
fig = plt.figure(figsize=(8,4))
ax = fig.add_subplot(1, 1, 1)
ax.scatter(months, first_probs, color='r', s=64, label='First', zorder=1)
ax.fill_between(np.repeat(np.linspace(1.5, 11.5, 11), 2)[1:-1],
                np.repeat(cumulative_probs, 2), color='b', alpha=0.5,
                label='Cumulative', zorder=0)
ax.set_xticks(range(2, 12))
ax.set_xticklabels(modeled_month_names[-10:])
ax.set_yticks(np.linspace(0, 1, 6))
ax.set_yticklabels(['$0$'] + [r'$\frac{' + str(n) + '}{5}$' for n in range(1, 5)] + ['$1$'],
                  fontsize=14)
ax.set_xlim(1.5, 11.5)
ax.set_ylim(0, 1)
ax.set_ylabel('Probability first to 10,000')
ax.legend()
plt.show()
display(html_table([['{:.3f}%'.format(100*prob) for prob in first_probs],
                    ['{:.3f}%'.format(100*prob) for prob in cumulative_probs]],
                   modeled_month_names[-10:],
                   ['First', 'Cumulative']))


Sep '16Oct '16Nov '16Dec '16Jan '17Feb '17Mar '17Apr '17May '17Jun '17
First60.000%33.600%4.608%1.147%0.129%0.083%0.364%0.036%0.031%0.002%
Cumulative60.000%93.600%98.208%99.355%99.484%99.566%99.931%99.967%99.997%100.000%

In [ ]: