One million is a lot

This notebook presents an analysis of data from the first million page views on my blog, Probably Overthinking It.

Copyright 2015 Allen Downey

MIT License: http://opensource.org/licenses/MIT


In [1]:
%matplotlib inline

In [7]:
import pandas as pd

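def read_table(filename):
    # read_html returns one DataFrame per <table> element in the page
    with open(filename) as fp:
        t = pd.read_html(fp)
    # index 5 selects the table used in the rest of this notebook
    return t[5]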

In [13]:
table1 = read_table('blogger1.html')
table1.shape


Out[13]:
(100, 8)

In [14]:
table2 = read_table('blogger2.html')
table2.shape


Out[14]:
(20, 8)

In [18]:
table = pd.concat([table1, table2], ignore_index=True)
table.shape


Out[18]:
(120, 9)

In [19]:
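import string
# characters stripped from the end of a cell: letters and spaces
chars = string.ascii_letters + ' '

def convert(s):
    # drop the trailing text and parse the leading number
    return int(s.rstrip(chars))

def clean(s):
    # keep the text before the first 'Edit';
    # partition returns the whole string if 'Edit' is absent
    return s.partition('Edit')[0]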
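
The two helpers are easiest to follow on concrete strings. The following is a minimal sketch on made-up cell contents: the sample strings are assumptions about what the Blogger table cells look like, not data from the notebook, and `clean_title` is a hypothetical name for a variant of `clean`.

```python
import string

# Letters and space: the characters stripped from the right of a cell.
CHARS = string.ascii_letters + ' '

def convert(s):
    # rstrip removes any trailing run of characters found in CHARS,
    # leaving the leading digits to parse.
    return int(s.rstrip(CHARS))

def clean_title(s):
    # Hypothetical variant of clean(): partition keeps the whole
    # string when 'Edit' is missing, whereas s[:s.find('Edit')]
    # would slice with -1 and drop the last character.
    return s.partition('Edit')[0]

# Made-up cell contents, for illustration only.
print(convert('723 views'))
print(convert('3 comments'))
print(clean_title('Bayes meets FourierEdit | View'))
print(clean_title('Bayes meets Fourier'))
```

Note that `rstrip` treats its argument as a set of characters rather than a suffix, so the same call handles any trailing word.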

In [20]:
table['title'] = table[1].apply(clean)
table.title


Out[20]:
0                                   One million is a lot
1                    When will I win the Great Bear Run?
2                                    Bayes meets Fourier
3                First babies are more likely to be late
4                Bayesian analysis of gluten sensitivity
5                             Bayes theorem in real life
6                   The Inspection Paradox is Everywhere
7                                 Orange is the new stat
8                     Will Millennials Ever Get Married?
9                                     Bayesian Billiards
10                           The Sleeping Beauty Problem
11             Hypothesis testing is only mostly useless
12                 Two hour marathon by 2041 -- probably
13      Bayesian survival analysis for "Game of Thrones"
14            Statistical inference is only mostly wrong
15          Upcoming talk on survival analysis in Python
16            Bayesian analysis of match rates on Tinder
17       Godless freshmen: now more Nones than Catholics
18              Bayesian predictions for Super Bowl XLIX
19                    Statistics tutorials at PyCon 2015
20                                The Rock Hyrax Problem
21     The World Cup Problem Part 2: Germany v. Argen...
22              The World Cup Problem: Germany v. Brazil
23     On efficient algorithms for finding the goddam...
24                             Two hour marathon in 2041
25                         Bayesian election forecasting
26        Regression with Python, pandas and StatsModels
27         New study: vaccines prevent disease and death
28                     An exercise in hypothesis testing
29               More likely to be killed by a terrorist
                             ...                        
90                          Girl Named Florida solutions
91                     The red-haired girl named Florida
92                             Somebody bet on the Bayes
93                      All your Bayes are belong to us!
94                  My favorite Bayes's Theorem problems
95                              The Blinky Monty Problem
96                    Repeated tests: how bad can it be?
97                         The Jimmy Nut Company problem
98                       Upcoming webcast: Only One Test
99                                News flash: OJ did it.
100                        Postcard from NKS Summer Camp
101           A hierarchical Bayesian model of pond scum
102                         More hypotheses, less trivia
103                              There is only one test!
104                                  Statistics Workshop
105    Think Stats will be published by O'Reilly in June
106                            Two Hour Marathon in 2045
107                    Bayesianness is next to Godliness
108                                    Survival analysis
109              Freshman hordes more godless than ever!
110                            Predicting marathon times
111                                BQ is unfair to women
112                                 Moving the goalposts
113                                        The BQ Effect
114             Are first babies more likely to be late?
115     Yet another reason SAT scores are non-predictive
116                           Are you popular? Hint: no.
117                              Obesity epidemic cured!
118                       Observer effect in relay races
119                             Proofiness and elections
Name: title, dtype: object

In [21]:
table['plusses'] = table[4].fillna(0)
table.plusses.head()


Out[21]:
0    0
1    1
2    7
3    2
4    9
Name: plusses, dtype: float64

In [24]:
table['comments'] = table[5].apply(convert)
table.comments.head()


Out[24]:
0    0
1    1
2    1
3    3
4    1
Name: comments, dtype: int64

In [25]:
table['views'] = table[6].apply(convert)
table.views


Out[25]:
0           0
1         723
2        2363
3         944
4        3110
5        2514
6       30484
7        2131
8         589
9        1273
10       2348
11       1816
12       2891
13      32406
14       4666
15       1242
16       7602
17       1491
18       2254
19       1193
20        648
21       1789
22       3040
23        819
24       3090
25       1621
26       6456
27       1834
28       1057
29       1536
        ...  
90       9454
91       1153
92       2332
93      48836
94      34384
95       3367
96       3797
97       1929
98        885
99          0
100         0
101      2162
102      1520
103      4246
104       203
105      1445
106      1745
107      1083
108      2849
109      1379
110      3847
111       815
112       513
113      3066
114    130722
115     17876
116      1468
117       289
118       725
119       396
Name: views, dtype: int64

In [26]:
table['date'] = pd.to_datetime(table[7])
table.date.head()


Out[26]:
0   2015-11-01
1   2015-10-26
2   2015-10-23
3   2015-09-23
4   2015-09-01
Name: date, dtype: datetime64[ns]

In [34]:
table = table[table.views > 0]
table.shape


Out[34]:
(115, 13)

In [38]:
# number the remaining posts from oldest (1) to newest
table.index = range(len(table), 0, -1)
table.title


Out[38]:
115                  When will I win the Great Bear Run?
114                                  Bayes meets Fourier
113              First babies are more likely to be late
112              Bayesian analysis of gluten sensitivity
111                           Bayes theorem in real life
110                 The Inspection Paradox is Everywhere
109                               Orange is the new stat
108                   Will Millennials Ever Get Married?
107                                   Bayesian Billiards
106                          The Sleeping Beauty Problem
105            Hypothesis testing is only mostly useless
104                Two hour marathon by 2041 -- probably
103     Bayesian survival analysis for "Game of Thrones"
102           Statistical inference is only mostly wrong
101         Upcoming talk on survival analysis in Python
100           Bayesian analysis of match rates on Tinder
99       Godless freshmen: now more Nones than Catholics
98              Bayesian predictions for Super Bowl XLIX
97                    Statistics tutorials at PyCon 2015
96                                The Rock Hyrax Problem
95     The World Cup Problem Part 2: Germany v. Argen...
94              The World Cup Problem: Germany v. Brazil
93     On efficient algorithms for finding the goddam...
92                             Two hour marathon in 2041
91                         Bayesian election forecasting
90        Regression with Python, pandas and StatsModels
89         New study: vaccines prevent disease and death
88                     An exercise in hypothesis testing
87               More likely to be killed by a terrorist
86        Bayesian solution to the Lincoln index problem
                             ...                        
30                    Estimating the age of renal tumors
29                   Comment on "Racism and Meritocracy"
28                          Girl Named Florida solutions
27                     The red-haired girl named Florida
26                             Somebody bet on the Bayes
25                      All your Bayes are belong to us!
24                  My favorite Bayes's Theorem problems
23                              The Blinky Monty Problem
22                    Repeated tests: how bad can it be?
21                         The Jimmy Nut Company problem
20                       Upcoming webcast: Only One Test
19            A hierarchical Bayesian model of pond scum
18                          More hypotheses, less trivia
17                               There is only one test!
16                                   Statistics Workshop
15     Think Stats will be published by O'Reilly in June
14                             Two Hour Marathon in 2045
13                     Bayesianness is next to Godliness
12                                     Survival analysis
11               Freshman hordes more godless than ever!
10                             Predicting marathon times
9                                  BQ is unfair to women
8                                   Moving the goalposts
7                                          The BQ Effect
6               Are first babies more likely to be late?
5       Yet another reason SAT scores are non-predictive
4                             Are you popular? Hint: no.
3                                Obesity epidemic cured!
2                         Observer effect in relay races
1                               Proofiness and elections
Name: title, dtype: object

In [39]:
dates = table.date.sort_values()
diffs = dates.diff()
diffs.head()


Out[39]:
1      NaT
2   6 days
3   7 days
4   7 days
5   9 days
Name: date, dtype: timedelta64[ns]

In [40]:
diffs.dropna().describe()


Out[40]:
count                        114
mean     15 days 09:41:03.157894
std      20 days 04:36:55.930513
min              1 days 00:00:00
25%              5 days 00:00:00
50%             10 days 00:00:00
75%             17 days 18:00:00
max            180 days 00:00:00
Name: date, dtype: object
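
The gap calculation can also be sketched without pandas. This example uses the five dates shown in Out[26]; `gaps_in_days` is a hypothetical helper, not part of the notebook.

```python
from datetime import date

def gaps_in_days(dates):
    # Sort chronologically, then subtract each date from its successor.
    ordered = sorted(dates)
    return [(b - a).days for a, b in zip(ordered, ordered[1:])]

# The five dates copied from Out[26].
recent = [date(2015, 11, 1), date(2015, 10, 26), date(2015, 10, 23),
          date(2015, 9, 23), date(2015, 9, 1)]

print(gaps_in_days(recent))
```

A list of n dates yields n - 1 gaps, which is why `describe()` reports a count of 114 for 115 posts.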

In [41]:
table.sort_values(by=['views'], ascending=False)[['title', 'views', 'date']].head(20)


Out[41]:
                                                 title   views        date
6             Are first babies more likely to be late?  130722  2011-02-07
25                    All your Bayes are belong to us!   48836  2011-10-27
24                My favorite Bayes's Theorem problems   34384  2011-10-20
103   Bayesian survival analysis for "Game of Thrones"   32406  2015-03-25
110               The Inspection Paradox is Everywhere   30484  2015-08-18
41                     Bayesian statistics made simple   23892  2012-03-14
5     Yet another reason SAT scores are non-predictive   17876  2011-02-02
72                     Are your data normal? Hint: no.   16152  2013-08-07
36                  Freshman hordes even more godless!   10826  2012-01-29
34                                    Think Complexity   10670  2012-01-23
54                 Secularization in America: part six    9773  2012-07-10
28                        Girl Named Florida solutions    9454  2011-11-10
55               Secularization in America: part seven    7705  2012-07-11
100         Bayesian analysis of match rates on Tinder    7602  2015-02-10
90      Regression with Python, pandas and StatsModels    6456  2014-09-14
57   Are first babies more likely to be late, revis...    5776  2013-01-08
78                Correlation is evidence of causation    4911  2014-02-20
102         Statistical inference is only mostly wrong    4666  2015-03-02
17                             There is only one test!    4246  2011-05-31
65                          The Price is Right Problem    4062  2013-04-22

In [56]:
table.sort_values(by=['views'], ascending=True)[['title', 'views', 'date']].head(20)


Out[56]:
                                                 title  views        date
16                                 Statistics Workshop    203  2011-05-17
3                              Obesity epidemic cured!    289  2011-01-17
1                             Proofiness and elections    396  2011-01-04
45                        Fog warning system: part two    504  2012-04-20
8                                 Moving the goalposts    513  2011-02-24
108                 Will Millennials Ever Get Married?    589  2015-07-13
96                              The Rock Hyrax Problem    648  2014-12-04
62                Belly Button Biodiversity: Part Four    675  2013-03-22
46                      Fog warning system: part three    704  2012-04-25
115                When will I win the Great Bear Run?    723  2015-10-26
2                       Observer effect in relay races    725  2011-01-10
60                 Belly Button Biodiversity: Part Two    783  2013-02-08
9                                BQ is unfair to women    815  2011-03-02
93   On efficient algorithms for finding the goddam...    819  2014-10-04
70             Belly Button Biodiversity: The End Game    839  2013-05-30
20                     Upcoming webcast: Only One Test    885  2011-08-16
50               Secularization in America: part three    927  2012-06-22
61               Belly Button Biodiversity: Part Three    932  2013-02-18
113            First babies are more likely to be late    944  2015-09-23
32                      Frank is a scoundrel, probably    947  2012-01-05

In [55]:
import thinkstats2
import thinkplot

cdf = thinkstats2.Cdf(table.views)

thinkplot.PrePlot(1)
thinkplot.Cdf(cdf, complement=True)
thinkplot.Config(xlabel='Number of page views', xscale='log',
                 ylabel='CCDF', yscale='log',
                 legend=False)
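
With `complement=True`, the plot shows the complementary CDF: the fraction of posts with more than a given number of views. Here is a minimal sketch of that quantity in plain Python, assuming the usual definition CCDF(x) = 1 - P(X <= x). `ccdf` is a hypothetical helper rather than a thinkstats2 function, and the sample values are the first few nonzero counts from Out[25].

```python
def ccdf(values):
    # Map each distinct value x to the fraction of values strictly greater than x.
    n = len(values)
    return {x: sum(v > x for v in values) / n for x in sorted(set(values))}

# First five nonzero view counts from Out[25].
views = [723, 2363, 944, 3110, 2514]

for x, p in ccdf(views).items():
    print(x, p)
```

The largest value always maps to 0, which a log-scaled y axis cannot display.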



In [45]:
table.sort_values(by=['comments'], ascending=False)[['title', 'comments', 'date']].head(5)


Out[45]:
                                    title  comments        date
25       All your Bayes are belong to us!        56  2011-10-27
106           The Sleeping Beauty Problem        53  2015-06-12
28           Girl Named Florida solutions        25  2011-11-10
110  The Inspection Paradox is Everywhere        23  2015-08-18
54    Secularization in America: part six        14  2012-07-10

In [44]:
table.sort_values(by=['plusses'], ascending=False)[['title', 'plusses', 'date']].head(5)


Out[44]:
                                                 title  plusses        date
110               The Inspection Paradox is Everywhere      909  2015-08-18
25                    All your Bayes are belong to us!       59  2011-10-27
103   Bayesian survival analysis for "Game of Thrones"       54  2015-03-25
67   Software engineering practices for graduate st...       34  2013-05-06
102         Statistical inference is only mostly wrong       31  2015-03-02
