This is a data analysis script used to produce findings in the paper, which you can run entirely from the files in this GitHub repository. This notebook produces part of the analysis for all languages, and the notebook 4-3-reverts-per-page-enwiki-plots
is an independent replication of this analysis in R that contains the English Wikipedia plots included in the paper. Note that the R notebook cannot be run on mybinder due to memory requirements, while this one can be.
This entire notebook can be run from the beginning with Kernel -> Restart & Run All in the menu bar. It takes less than 1 minute to run on a laptop with a Core i5-2540M processor.
In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import glob
import datetime
import pickle
%matplotlib inline
In [2]:
start = datetime.datetime.now()
In [3]:
!unxz --keep --force ../../datasets/parsed_dataframes/df_all_2016.pickle.xz
In [4]:
!ls ../../datasets/parsed_dataframes/*.pickle
In [5]:
with open("../../datasets/parsed_dataframes/df_all_2016.pickle", "rb") as f:
df_all = pickle.load(f)
In [6]:
df_all.sample(2).transpose()
Out[6]:
Grouping by these three columns creates a simple and useful intersection for this metric. If there is only one revert for a language/page ID/botpair_sorted set, then the reverting bot's revert was necessarily unreciprocated by the reverted bot. If there are two reverts, then the most likely scenario is that the reverting bot's revert was followed by a revert from the reverted bot, although it could also mean that the reverting bot reverted the reverted bot twice. Higher counts imply heavy back-and-forth reverting between two bots on a single page.
We count the number of reverts with the same language, page ID, and sorted bot pair, then assign that value to reverts_per_page_botpair_sorted
for every revert matching these three columns. Note that this initial analysis is conducted in 0-load-process-data.ipynb
, but we have included it again here for clarity.
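The grouping-and-counting logic described above can be sketched on toy data (the bot pair labels, page IDs, and revision IDs below are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical reverts: three between the same bot pair on page 1,
# and one unreciprocated revert on page 2.
toy = pd.DataFrame({
    "language": ["en", "en", "en", "en"],
    "rev_page": [1, 1, 1, 2],
    "botpair_sorted": ["BotA rv BotB"] * 3 + ["BotC rv BotD"],
    "rev_id": [10, 11, 12, 13],
})

# Count reverts per language / page / sorted bot pair, mirroring the
# groupby in the cells below.
counts = (toy.groupby(["language", "rev_page", "botpair_sorted"])["rev_id"]
             .count()
             .reset_index()
             .rename(columns={"rev_id": "reverts_per_page_botpair_sorted"}))
print(counts)
# Page 1 gets a count of 3 (back-and-forth), page 2 a count of 1 (unreciprocated).
```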
In [7]:
groupby_lang_page_bps = df_all.groupby(["language", "rev_page", "botpair_sorted"])
In [8]:
df_groupby = (pd.DataFrame(groupby_lang_page_bps['rev_id'].count())
                .reset_index()
                .rename(columns={"rev_id": "reverts_per_page_botpair_sorted"}))
df_groupby.sample(25)
Out[8]:
In [9]:
df_all = df_all.drop(columns="reverts_per_page_botpair_sorted")
df_all = pd.merge(df_all, df_groupby, how='left',
                  on=["language", "rev_page", "botpair_sorted"])
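An equivalent one-step alternative to the groupby-then-merge approach is pandas' `groupby(...).transform('count')`, which broadcasts each group's size directly back onto the original rows. A sketch on made-up data (the bot pair labels and IDs are illustrative, not from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "language": ["en", "en", "en", "en"],
    "rev_page": [1, 1, 1, 2],
    "botpair_sorted": ["BotA rv BotB"] * 3 + ["BotC rv BotD"],
    "rev_id": [10, 11, 12, 13],
})

# transform('count') returns a Series aligned to the original index,
# so no separate merge step is needed.
toy["reverts_per_page_botpair_sorted"] = (
    toy.groupby(["language", "rev_page", "botpair_sorted"])["rev_id"]
       .transform("count"))
print(toy["reverts_per_page_botpair_sorted"].tolist())
# → [3, 3, 3, 1]
```

The merge-based version used in the notebook has the advantage of producing `df_groupby` as a standalone table that can be inspected on its own, which is done with `.sample(25)` above.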
For example, 528,104 reverts were not reciprocated at all. 25,528 reverts were part of a chain of 2 reverts between the same two bots on the same page in the same language, 3,987 reverts were part of such a chain of 3 reverts, and so on.
In [10]:
df_all.query("page_namespace == 0").reverts_per_page_botpair_sorted.value_counts().sort_index()
Out[10]:
In [11]:
df_all.query("page_namespace == 0").reverts_per_page_botpair_sorted.value_counts().sum()
Out[11]:
In [12]:
import matplotlib.ticker
sns.set(font_scale=1.5, style="whitegrid")
fig, ax = plt.subplots(figsize=[8,6])
df_all.query("page_namespace == 0").reverts_per_page_botpair_sorted.value_counts().sort_index().plot(kind='bar', ax=ax)
ax.set_yscale('log')
ax.set_ylim((pow(10,0),pow(10,6)))
ax.set_ylabel("Number of reverts (log scale)")
ax.set_xlabel("Number of reverts on page between the same two bots")
ax.yaxis.set_major_formatter(matplotlib.ticker.FormatStrFormatter('%d'))
In [13]:
df_all.query("page_namespace == 0 and language=='en'").reverts_per_page_botpair_sorted.value_counts().sort_index()
Out[13]:
In [14]:
sns.set(font_scale=1.5, style="whitegrid")
df_all.query("page_namespace == 0 and language == 'en'").reverts_per_page_botpair_sorted.value_counts().sort_index().plot(kind='bar')
Out[14]:
In [15]:
df_all.query("page_namespace == 0 and language=='en'").reverts_per_page_botpair_sorted.value_counts().sum()
Out[15]:
In [16]:
len(df_all.query("page_namespace == 0 and language=='en'"))
Out[16]:
In [17]:
gb = df_all.query("reverts_per_page_botpair_sorted > 500").groupby(["language", "page_namespace", "rev_page", "botpair_sorted"])
In [18]:
gb['rev_id'].count()
Out[18]:
From a manual lookup:
page_id page_title
In [19]:
len(df_all.query("language == 'en' and rev_page == 4626266"))
Out[19]:
In [20]:
len(df_all.query("language == 'en' and rev_page == 11238105"))
Out[20]:
In [21]:
len(df_all.query("language == 'en' and rev_page == 5964327"))
Out[21]:
In [22]:
df_all.query("language == 'en' and rev_page == 5971841").groupby("botpair")['time_to_revert_days'].median()
Out[22]:
In [23]:
end = datetime.datetime.now()
time_to_run = end - start
minutes, seconds = divmod(time_to_run.seconds, 60)
print("Total runtime:", minutes, "minutes,", seconds, "seconds")