One million is a lot

This notebook presents an analysis of data from the first million page views on my blog, Probably Overthinking It.

MIT License: http://opensource.org/licenses/MIT



In [1]:

    
%matplotlib inline



In [7]:

    
import pandas as pd

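def read_table(filename):
    """Read the sixth HTML table from a saved page."""
    # Use a context manager so the file is closed after parsing.
    with open(filename) as fp:
        tables = pd.read_html(fp)
    # read_html returns a list of DataFrames, one per <table> element.
    return tables[5]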



In [13]:

    
table1 = read_table('blogger1.html')
table1.shape









    Out[13]:





(100, 8)



In [14]:

    
table2 = read_table('blogger2.html')
table2.shape









    Out[14]:





(20, 8)



In [18]:

    
table = pd.concat([table1, table2], ignore_index=True)
table.shape









    Out[18]:





(120, 9)



In [19]:

    
import string
chars = string.ascii_letters + ' '

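def convert(s):
    # Strip trailing letters and spaces, then parse the number.
    return int(s.rstrip(chars))

def clean(s):
    # Keep the title text that precedes the 'Edit' link text.
    i = s.find('Edit')
    # find() returns -1 when 'Edit' is absent; keep the whole string then.
    return s if i == -1 else s[:i]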
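
As a rough illustration of what the two helpers above do, the sketch below applies the same stripping logic to a few made-up cell strings. The sample strings are assumptions, not values copied from the Blogger export, and this version of `clean` also guards against a missing 'Edit' marker.

```python
import string

# Characters stripped from the right-hand side: ASCII letters and spaces.
chars = string.ascii_letters + ' '

def convert(s):
    """Drop a trailing text label and parse what is left as an int."""
    return int(s.rstrip(chars))

def clean(s):
    """Keep the text before the first 'Edit' marker, if there is one."""
    i = s.find('Edit')
    return s if i == -1 else s[:i]

# Hypothetical cell contents (assumed format, for illustration only).
examples = ['723 views', '3 comments', '0']
parsed = [convert(s) for s in examples]
print(parsed)   # [723, 3, 0]

title = clean('Bayes meets FourierEdit | View')
print(title)    # Bayes meets Fourier
```

Note that `rstrip` only removes characters from the end of the string, so a label that precedes the number would make `int` raise a `ValueError`.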



In [20]:

    
table['title'] = table[1].apply(clean)
table.title









    Out[20]:





0                                   One million is a lot
1                    When will I win the Great Bear Run?
2                                    Bayes meets Fourier
3                First babies are more likely to be late
4                Bayesian analysis of gluten sensitivity
5                             Bayes theorem in real life
6                   The Inspection Paradox is Everywhere
7                                 Orange is the new stat
8                     Will Millennials Ever Get Married?
9                                     Bayesian Billiards
10                           The Sleeping Beauty Problem
11             Hypothesis testing is only mostly useless
12                 Two hour marathon by 2041 -- probably
13      Bayesian survival analysis for "Game of Thrones"
14            Statistical inference is only mostly wrong
15          Upcoming talk on survival analysis in Python
16            Bayesian analysis of match rates on Tinder
17       Godless freshmen: now more Nones than Catholics
18              Bayesian predictions for Super Bowl XLIX
19                    Statistics tutorials at PyCon 2015
20                                The Rock Hyrax Problem
21     The World Cup Problem Part 2: Germany v. Argen...
22              The World Cup Problem: Germany v. Brazil
23     On efficient algorithms for finding the goddam...
24                             Two hour marathon in 2041
25                         Bayesian election forecasting
26        Regression with Python, pandas and StatsModels
27         New study: vaccines prevent disease and death
28                     An exercise in hypothesis testing
29               More likely to be killed by a terrorist
                             ...                        
90                          Girl Named Florida solutions
91                     The red-haired girl named Florida
92                             Somebody bet on the Bayes
93                      All your Bayes are belong to us!
94                  My favorite Bayes's Theorem problems
95                              The Blinky Monty Problem
96                    Repeated tests: how bad can it be?
97                         The Jimmy Nut Company problem
98                       Upcoming webcast: Only One Test
99                                News flash: OJ did it.
100                        Postcard from NKS Summer Camp
101           A hierarchical Bayesian model of pond scum
102                         More hypotheses, less trivia
103                              There is only one test!
104                                  Statistics Workshop
105    Think Stats will be published by O'Reilly in June
106                            Two Hour Marathon in 2045
107                    Bayesianness is next to Godliness
108                                    Survival analysis
109              Freshman hordes more godless than ever!
110                            Predicting marathon times
111                                BQ is unfair to women
112                                 Moving the goalposts
113                                        The BQ Effect
114             Are first babies more likely to be late?
115     Yet another reason SAT scores are non-predictive
116                           Are you popular? Hint: no.
117                              Obesity epidemic cured!
118                       Observer effect in relay races
119                             Proofiness and elections
Name: title, dtype: object



In [21]:

    
table['plusses'] = table[4].fillna(0)
table.plusses.head()









    Out[21]:





0    0
1    1
2    7
3    2
4    9
Name: plusses, dtype: float64



In [24]:

    
table['comments'] = table[5].apply(convert)
table.comments.head()









    Out[24]:





0    0
1    1
2    1
3    3
4    1
Name: comments, dtype: int64



In [25]:

    
table['views'] = table[6].apply(convert)
table.views









    Out[25]:





0           0
1         723
2        2363
3         944
4        3110
5        2514
6       30484
7        2131
8         589
9        1273
10       2348
11       1816
12       2891
13      32406
14       4666
15       1242
16       7602
17       1491
18       2254
19       1193
20        648
21       1789
22       3040
23        819
24       3090
25       1621
26       6456
27       1834
28       1057
29       1536
        ...  
90       9454
91       1153
92       2332
93      48836
94      34384
95       3367
96       3797
97       1929
98        885
99          0
100         0
101      2162
102      1520
103      4246
104       203
105      1445
106      1745
107      1083
108      2849
109      1379
110      3847
111       815
112       513
113      3066
114    130722
115     17876
116      1468
117       289
118       725
119       396
Name: views, dtype: int64



In [26]:

    
table['date'] = pd.to_datetime(table[7])
table.date.head()









    Out[26]:





0   2015-11-01
1   2015-10-26
2   2015-10-23
3   2015-09-23
4   2015-09-01
Name: date, dtype: datetime64[ns]



In [34]:

    
table = table[table.views > 0]
table.shape









    Out[34]:





(115, 13)



In [38]:

    
# Number the posts so that the oldest post gets index 1.
table.index = range(len(table), 0, -1)
table.title









    Out[38]:





115                  When will I win the Great Bear Run?
114                                  Bayes meets Fourier
113              First babies are more likely to be late
112              Bayesian analysis of gluten sensitivity
111                           Bayes theorem in real life
110                 The Inspection Paradox is Everywhere
109                               Orange is the new stat
108                   Will Millennials Ever Get Married?
107                                   Bayesian Billiards
106                          The Sleeping Beauty Problem
105            Hypothesis testing is only mostly useless
104                Two hour marathon by 2041 -- probably
103     Bayesian survival analysis for "Game of Thrones"
102           Statistical inference is only mostly wrong
101         Upcoming talk on survival analysis in Python
100           Bayesian analysis of match rates on Tinder
99       Godless freshmen: now more Nones than Catholics
98              Bayesian predictions for Super Bowl XLIX
97                    Statistics tutorials at PyCon 2015
96                                The Rock Hyrax Problem
95     The World Cup Problem Part 2: Germany v. Argen...
94              The World Cup Problem: Germany v. Brazil
93     On efficient algorithms for finding the goddam...
92                             Two hour marathon in 2041
91                         Bayesian election forecasting
90        Regression with Python, pandas and StatsModels
89         New study: vaccines prevent disease and death
88                     An exercise in hypothesis testing
87               More likely to be killed by a terrorist
86        Bayesian solution to the Lincoln index problem
                             ...                        
30                    Estimating the age of renal tumors
29                   Comment on "Racism and Meritocracy"
28                          Girl Named Florida solutions
27                     The red-haired girl named Florida
26                             Somebody bet on the Bayes
25                      All your Bayes are belong to us!
24                  My favorite Bayes's Theorem problems
23                              The Blinky Monty Problem
22                    Repeated tests: how bad can it be?
21                         The Jimmy Nut Company problem
20                       Upcoming webcast: Only One Test
19            A hierarchical Bayesian model of pond scum
18                          More hypotheses, less trivia
17                               There is only one test!
16                                   Statistics Workshop
15     Think Stats will be published by O'Reilly in June
14                             Two Hour Marathon in 2045
13                     Bayesianness is next to Godliness
12                                     Survival analysis
11               Freshman hordes more godless than ever!
10                             Predicting marathon times
9                                  BQ is unfair to women
8                                   Moving the goalposts
7                                          The BQ Effect
6               Are first babies more likely to be late?
5       Yet another reason SAT scores are non-predictive
4                             Are you popular? Hint: no.
3                                Obesity epidemic cured!
2                         Observer effect in relay races
1                               Proofiness and elections
Name: title, dtype: object




In [39]:

    
dates = table.date.sort_values()
diffs = dates.diff()
diffs.head()









    Out[39]:





1      NaT
2   6 days
3   7 days
4   7 days
5   9 days
Name: date, dtype: timedelta64[ns]
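
The same gap computation can be sketched with the standard library alone. The three dates below are the earliest post dates that appear in the tables in this notebook; everything else is illustrative.

```python
from datetime import date

# The three earliest post dates listed in this notebook, deliberately unsorted.
post_dates = [date(2011, 1, 17), date(2011, 1, 4), date(2011, 1, 10)]

# Sort chronologically, then subtract each date from the one that follows it.
ordered = sorted(post_dates)
gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
print(gaps)   # [6, 7]
```

The list has one fewer element than the input, which corresponds to the leading `NaT` in the pandas result: the first post has no predecessor to subtract.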



In [40]:

    
diffs.dropna().describe()









    Out[40]:





count                        114
mean     15 days 09:41:03.157894
std      20 days 04:36:55.930513
min              1 days 00:00:00
25%              5 days 00:00:00
50%             10 days 00:00:00
75%             17 days 18:00:00
max            180 days 00:00:00
Name: date, dtype: object



In [41]:

    
table.sort_values(by=['views'], ascending=False)[['title', 'views', 'date']].head(20)









    Out[41]:






  
    
      
                                                 title   views        date
6             Are first babies more likely to be late?  130722  2011-02-07
25                    All your Bayes are belong to us!   48836  2011-10-27
24                My favorite Bayes's Theorem problems   34384  2011-10-20
103   Bayesian survival analysis for "Game of Thrones"   32406  2015-03-25
110               The Inspection Paradox is Everywhere   30484  2015-08-18
41                     Bayesian statistics made simple   23892  2012-03-14
5     Yet another reason SAT scores are non-predictive   17876  2011-02-02
72                     Are your data normal? Hint: no.   16152  2013-08-07
36                  Freshman hordes even more godless!   10826  2012-01-29
34                                    Think Complexity   10670  2012-01-23
54                 Secularization in America: part six    9773  2012-07-10
28                        Girl Named Florida solutions    9454  2011-11-10
55               Secularization in America: part seven    7705  2012-07-11
100         Bayesian analysis of match rates on Tinder    7602  2015-02-10
90      Regression with Python, pandas and StatsModels    6456  2014-09-14
57   Are first babies more likely to be late, revis...    5776  2013-01-08
78                Correlation is evidence of causation    4911  2014-02-20
102         Statistical inference is only mostly wrong    4666  2015-03-02
17                             There is only one test!    4246  2011-05-31
65                          The Price is Right Problem    4062  2013-04-22



In [56]:

    
table.sort_values(by=['views'], ascending=True)[['title', 'views', 'date']].head(20)









    Out[56]:






  
    
      
                                                 title  views        date
16                                 Statistics Workshop    203  2011-05-17
3                              Obesity epidemic cured!    289  2011-01-17
1                             Proofiness and elections    396  2011-01-04
45                        Fog warning system: part two    504  2012-04-20
8                                 Moving the goalposts    513  2011-02-24
108                 Will Millennials Ever Get Married?    589  2015-07-13
96                              The Rock Hyrax Problem    648  2014-12-04
62                Belly Button Biodiversity: Part Four    675  2013-03-22
46                      Fog warning system: part three    704  2012-04-25
115                When will I win the Great Bear Run?    723  2015-10-26
2                       Observer effect in relay races    725  2011-01-10
60                 Belly Button Biodiversity: Part Two    783  2013-02-08
9                                BQ is unfair to women    815  2011-03-02
93   On efficient algorithms for finding the goddam...    819  2014-10-04
70             Belly Button Biodiversity: The End Game    839  2013-05-30
20                     Upcoming webcast: Only One Test    885  2011-08-16
50               Secularization in America: part three    927  2012-06-22
61               Belly Button Biodiversity: Part Three    932  2013-02-18
113            First babies are more likely to be late    944  2015-09-23
32                      Frank is a scoundrel, probably    947  2012-01-05



In [55]:

    
import thinkstats2
import thinkplot

cdf = thinkstats2.Cdf(table.views)

thinkplot.PrePlot(1)
thinkplot.Cdf(cdf, complement=True)
thinkplot.Config(xlabel='Number of page views', xscale='log',
                 ylabel='CCDF', yscale='log',
                 legend=False)
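
For readers without `thinkstats2`, here is a minimal sketch of the quantity being plotted, assuming that `complement=True` plots one minus the empirical CDF. The helper name `empirical_ccdf` is hypothetical, and the sample is just a handful of view counts taken from the tables in this notebook, not the full data set.

```python
from collections import Counter

def empirical_ccdf(values):
    """Return sorted unique values and, for each, the fraction of
    observations that are strictly greater than it."""
    counts = Counter(values)
    n = len(values)
    xs = sorted(counts)
    ccdf = []
    seen = 0
    for x in xs:
        seen += counts[x]            # observations less than or equal to x
        ccdf.append(1 - seen / n)    # fraction strictly greater than x
    return xs, ccdf

# A few view counts copied from the tables in this notebook.
views = [130722, 48836, 34384, 32406, 30484, 203, 289, 396]
xs, ccdf = empirical_ccdf(views)
for x, p in zip(xs, ccdf):
    print(x, p)
```

The last value of the CCDF is always zero, so that point cannot appear on a log-scaled y-axis.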



In [45]:

    
table.sort_values(by=['comments'], ascending=False)[['title', 'comments', 'date']].head(5)









    Out[45]:






  
    
      
                                    title  comments        date
25       All your Bayes are belong to us!        56  2011-10-27
106           The Sleeping Beauty Problem        53  2015-06-12
28           Girl Named Florida solutions        25  2011-11-10
110  The Inspection Paradox is Everywhere        23  2015-08-18
54    Secularization in America: part six        14  2012-07-10



In [44]:

    
table.sort_values(by=['plusses'], ascending=False)[['title', 'plusses', 'date']].head(5)









    Out[44]:






  
    
      
                                                 title  plusses        date
110               The Inspection Paradox is Everywhere      909  2015-08-18
25                    All your Bayes are belong to us!       59  2011-10-27
103   Bayesian survival analysis for "Game of Thrones"       54  2015-03-25
67   Software engineering practices for graduate st...       34  2013-05-06
102         Statistical inference is only mostly wrong       31  2015-03-02


