Section 5.1: proportion of all bot edits to articles that are bot-bot reverts

This is the data analysis script for the analysis presented in section 5.1. It can be run entirely from the files in this GitHub repository. It loads datasets/monthly_bot_edits/[language]wiki_20170427.tsv and datasets/monthly_bot_reverts/[language]wiki_20170420.tsv.



In [1]:

    
import pandas as pd
import seaborn as sns
import mwapi
import numpy as np
import glob
%matplotlib inline

Import datasets

The datasets are loaded into dictionaries of pandas dataframes, keyed by language code. A single tidy dataframe would be a better design.
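As a rough sketch of the tidy alternative, assuming each per-language dataframe has the same columns, a dictionary of dataframes can be stacked into one dataframe with a language column. The helper name `to_tidy` and the small example frames are made up for illustration and are not part of this notebook's analysis.

```python
import pandas as pd

def to_tidy(df_dict):
    """Stack a {language: dataframe} dict into one dataframe with a 'lang' column.

    Hypothetical helper for illustration only.
    """
    return (
        pd.concat(df_dict, names=["lang", "row"])  # dict keys become the outer index level
        .reset_index(level="lang")                 # move the language into a column
        .reset_index(drop=True)                    # discard the per-file row numbers
    )

# Tiny made-up frames standing in for the per-language TSV files
example = {
    "en": pd.DataFrame({"month": [200401, 200402], "page_namespace": [0, 0], "n": [10, 20]}),
    "de": pd.DataFrame({"month": [200401], "page_namespace": [0], "n": [5]}),
}
tidy = to_tidy(example)
print(tidy)
```

With this layout, per-language totals become a single `tidy.groupby("lang")["n"].sum()` rather than a loop over dictionary keys.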



In [2]:

    
!ls -lah ../../datasets/monthly_bot_edits/









    



total 248K
drwxrwxr-x 2 staeiou staeiou 4.0K Sep 10 16:56 .
drwxrwxr-x 7 staeiou staeiou 4.0K Sep 10 16:43 ..
-rw-rw-r-- 1 staeiou staeiou  32K Sep 10 16:43 dewiki_20170427.tsv
-rw-rw-r-- 1 staeiou staeiou  44K Sep 10 16:56 enwiki_20170427.tsv
-rw-rw-r-- 1 staeiou staeiou  35K Sep 10 16:43 eswiki_20170427.tsv
-rw-rw-r-- 1 staeiou staeiou  37K Sep 10 16:43 frwiki_20170427.tsv
-rw-rw-r-- 1 staeiou staeiou  27K Sep 10 16:43 jawiki_20170427.tsv
-rw-rw-r-- 1 staeiou staeiou  25K Sep 10 16:43 ptwiki_20170427.tsv
-rw-rw-r-- 1 staeiou staeiou  29K Sep 10 16:43 zhwiki_20170427.tsv



In [3]:

    
!ls -lah ../../datasets/monthly_bot_reverts/









    



total 412K
drwxrwxr-x 2 staeiou staeiou 4.0K Sep 10 16:43 .
drwxrwxr-x 7 staeiou staeiou 4.0K Sep 10 16:43 ..
-rw-rw-r-- 1 staeiou staeiou  57K Sep 10 16:43 dewiki_20170420.tsv
-rw-rw-r-- 1 staeiou staeiou  79K Sep 10 16:43 enwiki_20170420.tsv
-rw-rw-r-- 1 staeiou staeiou  61K Sep 10 16:43 eswiki_20170420.tsv
-rw-rw-r-- 1 staeiou staeiou  60K Sep 10 16:43 frwiki_20170420.tsv
-rw-rw-r-- 1 staeiou staeiou  47K Sep 10 16:43 jawiki_20170420.tsv
-rw-rw-r-- 1 staeiou staeiou  46K Sep 10 16:43 ptwiki_20170420.tsv
-rw-rw-r-- 1 staeiou staeiou  44K Sep 10 16:43 zhwiki_20170420.tsv

Monthly bot edit counts are in the format: month (YYYYMM), page namespace, and total number of bot edits in that language's namespace that month (n).



In [17]:

    
!head ../../datasets/monthly_bot_edits/enwiki_20170427.tsv









    



month	page_namespace	n
200212	0	32284
200212	1	4
200212	2	2
200212	3	1
200303	0	5
200304	0	8
200305	0	116
200305	2	2
200306	0	6086

Monthly bot revert counts are a bit more complicated and in a slightly different format:

month (YYYYMM01)
page namespace
number of total reverts by all editors (reverts)
number of reverts by bot accounts (bot_reverts)
number of edits by bots that were reverted (bot_reverteds)
number of reverts by bots of edits made by bots (bot2bot_reverts)
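To make the revert format concrete, here is a minimal sketch that parses a few rows in this layout from an in-memory string instead of the real file. The rows are copied from the enwiki file preview in this notebook; the `month_start` column name is an assumption added for illustration.

```python
import io
import pandas as pd

# A few rows in the monthly_bot_reverts layout (tab-separated, month as YYYYMM01)
sample = (
    "month\tpage_namespace\treverts\tbot_reverts\tbot_reverteds\tbot2bot_reverts\n"
    "20010701\t0\t1\t0\t0\t0\n"
    "20011001\t0\t8\t0\t0\t0\n"
    "20011001\t1\t1\t0\t0\t0\n"
)
df = pd.read_csv(io.StringIO(sample), sep="\t")

# The month is stored as an integer; parse it as the first day of that month
df["month_start"] = pd.to_datetime(df["month"].astype(str), format="%Y%m%d")

# Keep only the article namespace (ns0)
ns0 = df[df["page_namespace"] == 0]
print(ns0)
```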



In [18]:

    
!head ../../datasets/monthly_bot_reverts/enwiki_20170420.tsv









    



month	page_namespace	reverts	bot_reverts	bot_reverteds	bot2bot_reverts
20010701	0	1	0	0	0
20010801	0	1	0	0	0
20011001	0	8	0	0	0
20011001	1	1	0	0	0
20011001	2	6	0	0	0
20011001	4	1	0	0	0
20011001	5	1	0	0	0
20011101	0	70	0	0	0
20011101	1	1	0	0	0

Processing

We're going to do this in a pretty messy, not-best-practice way: two dictionaries of dataframes, one per dataset. Best practice would be a single tidy dataframe with nice hierarchical indexes.



In [4]:

    
import os

df_edits_dict = {}
for filename in glob.glob("../../datasets/monthly_bot_edits/??wiki_2017042?.tsv"):
    # Language code is the first two characters of the file name, e.g. "en" in enwiki_20170427.tsv
    lang_code = os.path.basename(filename)[0:2]
    df_edits_dict[lang_code] = pd.read_csv(filename, sep="\t")
    df_edits_dict[lang_code] = df_edits_dict[lang_code].drop_duplicates()



In [5]:

    
for lang, lang_df in df_edits_dict.items():
    print(lang, len(lang_df))









    



de 2213
zh 2017
fr 2589
en 3176
ja 1954
es 2390
pt 1771



In [6]:

    
import os

df_rev_dict = {}
for filename in glob.glob("../../datasets/monthly_bot_reverts/??wiki_2017042?.tsv"):
    # Language code is the first two characters of the file name, e.g. "en" in enwiki_20170420.tsv
    lang_code = os.path.basename(filename)[0:2]
    df_rev_dict[lang_code] = pd.read_csv(filename, sep="\t")
    df_rev_dict[lang_code] = df_rev_dict[lang_code].drop_duplicates()



In [7]:

    
for lang, lang_df in df_rev_dict.items():
    print(lang, len(lang_df))









    



de 2641
zh 2150
fr 2858
en 3486
ja 2293
pt 2184
es 2859



In [8]:

    
langs = ["de", "en", "es", "fr", "ja", "pt", "zh"]

Preview the dataframes in the dictionaries



In [9]:

    
df_edits_dict['en'][0:5]









    Out[9]:

    month  page_namespace      n
0  200212               0  32284
1  200212               1      4
2  200212               2      2
3  200212               3      1
4  200303               0      5


In [10]:

    
df_rev_dict['en'][0:5]









    Out[10]:

      month  page_namespace  reverts  bot_reverts  bot_reverteds  bot2bot_reverts
0  20010701               0        1            0              0                0
1  20010801               0        1            0              0                0
2  20011001               0        8            0              0                0
3  20011001               1        1            0              0                0
4  20011001               2        6            0              0                0
Clean and combine the two datasets

Convert dates

Remember that the two datasets represent months differently: YYYYMM for edits and YYYYMM01 for reverts. Gotta fix that before they can be joined.



In [11]:

    
def truncate_my(s):
    """
    Truncate YYYYMMDD format to YYYYMM. For use with df.apply()
    """
    s = str(s)
    return int(s[0:6])

Test function



In [12]:

    
truncate_my(20100101)









    Out[12]:





201001

Yay!
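As an aside, the same truncation can be done without string conversion, since dropping the last two digits of a YYYYMMDD integer is integer division by 100. The sketch below is an alternative, not what this notebook runs, and checks that the two approaches agree on a few made-up values.

```python
import pandas as pd

def truncate_my(s):
    """Copy of the notebook's helper: truncate YYYYMMDD to YYYYMM via strings."""
    s = str(s)
    return int(s[0:6])

months = pd.Series([20100101, 20040601, 20011001])

# Vectorized alternative: integer-divide by 100 to drop the day digits
truncated = months // 100

# Both approaches should give identical results
same = truncated.tolist() == months.apply(truncate_my).tolist()
print(truncated.tolist(), same)
```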

Apply the transformation



In [13]:

    
for lang in langs:
    df_edits_dict[lang] = df_edits_dict[lang].set_index('month')
    df_rev_dict[lang]['month_my'] = df_rev_dict[lang]['month'].apply(truncate_my)
    df_rev_dict[lang] = df_rev_dict[lang].set_index('month_my')

Combine the datasets, looking at only articles / ns0



In [14]:

    
combi_ns0_dict = {}
combi_dict = {}
for lang in langs:
    print(lang)

    # Keep only articles (namespace 0) and align both datasets on their YYYYMM index
    rev_ns0 = df_rev_dict[lang].query("page_namespace == 0")
    edits_ns0 = df_edits_dict[lang].query("page_namespace == 0")
    combi_ns0_dict[lang] = pd.concat([rev_ns0, edits_ns0], axis=1, join='outer')
    combi_ns0_dict[lang]['bot_edits'] = combi_ns0_dict[lang]['n']
    # Monthly proportion of bot edits that are bot-bot reverts
    combi_ns0_dict[lang]['prop_bot2bot_rv'] = combi_ns0_dict[lang]['bot2bot_reverts']/combi_ns0_dict[lang]['bot_edits']









    



de
en
es
fr
ja
pt
zh
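To illustrate what the outer join does, here is a minimal sketch on made-up data where the two frames only partly share months. Months present in just one dataset get NaN in the other dataset's columns, which also turns integer columns into floats. That is one plausible reason most columns in the combined dataframes appear as floats.

```python
import pandas as pd

# Made-up monthly values, indexed by YYYYMM; only 200402 appears in both frames
reverts = pd.DataFrame({"bot2bot_reverts": [0, 3]}, index=[200401, 200402])
edits = pd.DataFrame({"n": [100, 50]}, index=[200402, 200403])

# Outer join keeps the union of months, filling gaps with NaN
combined = pd.concat([reverts, edits], axis=1, join="outer")
combined["prop_bot2bot_rv"] = combined["bot2bot_reverts"] / combined["n"]
print(combined)
```

Only the shared month (200402) gets a defined proportion here. Later sums skip the NaN entries by default, so unmatched months still contribute whatever counts they do have.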

Preview the combined dataframe

FYI, all things Wikipedia database related (especially bots) are generally way less consistent before 2004.



In [15]:

    
combi_ns0_dict['en'][29:39]









    Out[15]:

           month  page_namespace  reverts  bot_reverts  bot_reverteds  bot2bot_reverts  page_namespace        n  bot_edits  prop_bot2bot_rv
200401  20040101               0     4335           15              2                0             0.0    496.0      496.0              0.0
200402  20040201               0     6331           12              4                0             0.0   2362.0     2362.0              0.0
200403  20040301               0     9046            5             14                0             0.0   3308.0     3308.0              0.0
200404  20040401               0     8514            6              1                0             0.0    766.0      766.0              0.0
200405  20040501               0     8918            9              5                0             0.0   1454.0     1454.0              0.0
200406  20040601               0     7176           26             16                0             0.0  38237.0    38237.0              0.0
200407  20040701               0     9841           25             19                0             0.0  54523.0    54523.0              0.0
200408  20040801               0    12213           58             78                0             0.0  38098.0    38098.0              0.0
200409  20040901               0    17096           24            104                0             0.0  28547.0    28547.0              0.0
200410  20041001               0    21477          123            135                0             0.0  44076.0    44076.0              0.0
The results: proportion of bot-bot reverts out of all bot edits, articles/ns0 only



In [16]:

    
sum_dict = {}
for lang in langs:
    #print(lang)
    sum_dict[lang] = combi_ns0_dict[lang][['bot_edits','bot2bot_reverts']].sum()
    print(lang, "ns0 proportion:", (sum_dict[lang]['bot2bot_reverts']/sum_dict[lang]['bot_edits']*100).round(4), "%")
    print(sum_dict[lang])
    print("")









    



de ns0 proportion: 0.5043 %
bot_edits          10143166.0
bot2bot_reverts       51156.0
dtype: float64

en ns0 proportion: 0.4968 %
bot_edits          45250860.0
bot2bot_reverts      224821.0
dtype: float64

es ns0 proportion: 0.3771 %
bot_edits          14329191.0
bot2bot_reverts       54032.0
dtype: float64

fr ns0 proportion: 0.3895 %
bot_edits          13066804.0
bot2bot_reverts       50891.0
dtype: float64

ja ns0 proportion: 0.5437 %
bot_edits          5471866
bot2bot_reverts      29749
dtype: int64

pt ns0 proportion: 0.635 %
bot_edits          8853658.0
bot2bot_reverts      56217.0
dtype: float64

zh ns0 proportion: 0.4817 %
bot_edits          5580197.0
bot2bot_reverts      26882.0
dtype: float64
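As a quick sanity check, the reported percentages can be reproduced directly from the totals printed above. Note that each figure is a ratio of sums (all bot-bot reverts divided by all bot edits), not an average of the monthly proportions. The `totals` dictionary below simply restates the printed numbers.

```python
# (bot2bot_reverts, bot_edits) totals copied from the output above
totals = {
    "de": (51156, 10143166),
    "en": (224821, 45250860),
    "es": (54032, 14329191),
    "fr": (50891, 13066804),
    "ja": (29749, 5471866),
    "pt": (56217, 8853658),
    "zh": (26882, 5580197),
}

# Percentage of all ns0 bot edits that are bot-bot reverts, per language
proportions = {
    lang: round(b2b / edits * 100, 4)
    for lang, (b2b, edits) in totals.items()
}
for lang, pct in proportions.items():
    print(lang, "ns0 proportion:", pct, "%")
```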


