Section 5.1: proportion of all bot edits to articles that are bot-bot reverts

This notebook runs the analysis presented in section 5.1. You can run it entirely from the files in this GitHub repository. It loads datasets/monthly_bot_edits/[language]wiki_20170427.tsv and datasets/monthly_bot_reverts/[language]wiki_20170420.tsv.


In [1]:
import pandas as pd
import seaborn as sns
import mwapi
import numpy as np
import glob
%matplotlib inline

Import datasets

The datasets are stored in a dictionary of pandas dataframes, keyed by language code. A better approach would be a single tidy dataframe.


In [2]:
!ls -lah ../../datasets/monthly_bot_edits/


total 248K
drwxrwxr-x 2 staeiou staeiou 4.0K Sep 10 16:56 .
drwxrwxr-x 7 staeiou staeiou 4.0K Sep 10 16:43 ..
-rw-rw-r-- 1 staeiou staeiou  32K Sep 10 16:43 dewiki_20170427.tsv
-rw-rw-r-- 1 staeiou staeiou  44K Sep 10 16:56 enwiki_20170427.tsv
-rw-rw-r-- 1 staeiou staeiou  35K Sep 10 16:43 eswiki_20170427.tsv
-rw-rw-r-- 1 staeiou staeiou  37K Sep 10 16:43 frwiki_20170427.tsv
-rw-rw-r-- 1 staeiou staeiou  27K Sep 10 16:43 jawiki_20170427.tsv
-rw-rw-r-- 1 staeiou staeiou  25K Sep 10 16:43 ptwiki_20170427.tsv
-rw-rw-r-- 1 staeiou staeiou  29K Sep 10 16:43 zhwiki_20170427.tsv

In [3]:
!ls -lah ../../datasets/monthly_bot_reverts/


total 412K
drwxrwxr-x 2 staeiou staeiou 4.0K Sep 10 16:43 .
drwxrwxr-x 7 staeiou staeiou 4.0K Sep 10 16:43 ..
-rw-rw-r-- 1 staeiou staeiou  57K Sep 10 16:43 dewiki_20170420.tsv
-rw-rw-r-- 1 staeiou staeiou  79K Sep 10 16:43 enwiki_20170420.tsv
-rw-rw-r-- 1 staeiou staeiou  61K Sep 10 16:43 eswiki_20170420.tsv
-rw-rw-r-- 1 staeiou staeiou  60K Sep 10 16:43 frwiki_20170420.tsv
-rw-rw-r-- 1 staeiou staeiou  47K Sep 10 16:43 jawiki_20170420.tsv
-rw-rw-r-- 1 staeiou staeiou  46K Sep 10 16:43 ptwiki_20170420.tsv
-rw-rw-r-- 1 staeiou staeiou  44K Sep 10 16:43 zhwiki_20170420.tsv

Monthly bot edit counts have three columns: month (YYYYMM), page namespace, and the total number of bot edits made in that namespace of that language's wiki during that month (n).


In [17]:
!head ../../datasets/monthly_bot_edits/enwiki_20170427.tsv


month	page_namespace	n
200212	0	32284
200212	1	4
200212	2	2
200212	3	1
200303	0	5
200304	0	8
200305	0	116
200305	2	2
200306	0	6086

Monthly bot revert counts are a bit more complicated and in a slightly different format:

  • month (YYYYMM01)
  • page namespace
  • number of total reverts by all editors (reverts)
  • number of reverts by bot accounts (bot_reverts)
  • number of edits by bots that were reverted (bot_reverteds)
  • number of reverts by bots of edits made by bots (bot2bot_reverts)

In [18]:
!head ../../datasets/monthly_bot_reverts/enwiki_20170420.tsv


month	page_namespace	reverts	bot_reverts	bot_reverteds	bot2bot_reverts
20010701	0	1	0	0	0
20010801	0	1	0	0	0
20011001	0	8	0	0	0
20011001	1	1	0	0	0
20011001	2	6	0	0	0
20011001	4	1	0	0	0
20011001	5	1	0	0	0
20011101	0	70	0	0	0
20011101	1	1	0	0	0
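Read literally, these definitions imply that bot-bot reverts are a subset of bot reverts, which are a subset of all reverts, so every row should satisfy bot2bot_reverts <= bot_reverts <= reverts. Here is a minimal sanity check under that assumption. `check_revert_counts` is a hypothetical helper that is not part of the original notebook, and the sample rows are copied from the head output above.

```python
import io
import pandas as pd

# A few rows copied from the enwiki head output above (tab-separated).
SAMPLE_TSV = (
    "month\tpage_namespace\treverts\tbot_reverts\tbot_reverteds\tbot2bot_reverts\n"
    "20010701\t0\t1\t0\t0\t0\n"
    "20011001\t0\t8\t0\t0\t0\n"
    "20011101\t0\t70\t0\t0\t0\n"
)

def check_revert_counts(df):
    """Return rows that violate bot2bot_reverts <= bot_reverts <= reverts.

    Hypothetical helper; the inequality is an assumption based on the
    column descriptions, not something the dataset documents.
    """
    bad = (df["bot2bot_reverts"] > df["bot_reverts"]) | (df["bot_reverts"] > df["reverts"])
    return df[bad]

sample = pd.read_csv(io.StringIO(SAMPLE_TSV), sep="\t")
violations = check_revert_counts(sample)
print(len(violations), "rows violate the expected ordering")
```

An empty result does not prove the data are clean, but a non-empty one would be worth investigating before computing proportions.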

Processing

We're going to do this in a fairly messy way, with two dictionaries of dataframes. Best practice would be a single tidy dataframe with nice hierarchical indexes.


In [4]:
df_edits_dict = {}
for filename in glob.glob("../../datasets/monthly_bot_edits/??wiki_2017042?.tsv"):
    # The directory prefix is 33 characters long, so the two-letter
    # language code sits at positions 33-34 of the path
    lang_code = filename[33:35]
    df_edits_dict[lang_code] = pd.read_csv(filename, sep="\t")
    df_edits_dict[lang_code] = df_edits_dict[lang_code].drop_duplicates()
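Slicing the path at fixed positions works only while the directory prefix keeps the same length. As a sketch of a less brittle alternative, the language code can be taken from the file name itself. `lang_from_path` is a hypothetical helper, and it assumes every file is named [language]wiki_[date].tsv.

```python
import os

def lang_from_path(path):
    """Return the language code from a path like '.../enwiki_20170427.tsv'.

    Hypothetical helper; assumes the file name follows the
    [language]wiki_[date].tsv pattern used in this repository.
    """
    name = os.path.basename(path)      # e.g. 'enwiki_20170427.tsv'
    return name.split("wiki_", 1)[0]   # e.g. 'en'

example = "../../datasets/monthly_bot_edits/enwiki_20170427.tsv"
print(lang_from_path(example))
```

This gives the same result as the fixed slice for the paths used here, and it would keep working if the datasets moved to a different directory.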

In [5]:
for lang, lang_df in df_edits_dict.items():
    print(lang, len(lang_df))


de 2213
zh 2017
fr 2589
en 3176
ja 1954
es 2390
pt 1771

In [6]:
df_rev_dict = {}
for filename in glob.glob("../../datasets/monthly_bot_reverts/??wiki_2017042?.tsv"):
    # The directory prefix is 35 characters long here, so the two-letter
    # language code sits at positions 35-36 of the path
    lang_code = filename[35:37]
    df_rev_dict[lang_code] = pd.read_csv(filename, sep="\t")
    df_rev_dict[lang_code] = df_rev_dict[lang_code].drop_duplicates()

In [7]:
for lang, lang_df in df_rev_dict.items():
    print(lang, len(lang_df))


de 2641
zh 2150
fr 2858
en 3486
ja 2293
pt 2184
es 2859

In [8]:
langs = ["de", "en", "es", "fr", "ja", "pt", "zh"]
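For comparison, here is a rough sketch of the tidy alternative: one dataframe for all languages, indexed by language, month, and namespace. `load_tidy` is a hypothetical helper and not part of the original analysis. It is shown with small in-memory stand-ins rather than the real files.

```python
import io
import pandas as pd

def load_tidy(sources):
    """Stack per-language TSVs into one frame indexed by (lang, month, page_namespace).

    `sources` maps a language code to a path or file-like object.
    Hypothetical helper, sketched for illustration only.
    """
    frames = []
    for lang, src in sources.items():
        df = pd.read_csv(src, sep="\t").drop_duplicates()
        df["lang"] = lang  # record which wiki each row came from
        frames.append(df)
    tidy = pd.concat(frames, ignore_index=True)
    return tidy.set_index(["lang", "month", "page_namespace"]).sort_index()

# In-memory stand-ins for two files. The en rows come from the head output
# shown earlier; the de row is made up for illustration.
en = io.StringIO("month\tpage_namespace\tn\n200212\t0\t32284\n200212\t1\t4\n")
de = io.StringIO("month\tpage_namespace\tn\n200212\t0\t10\n")

tidy_edits = load_tidy({"en": en, "de": de})
print(tidy_edits)
```

With this layout, a per-language operation becomes a `groupby("lang")` or an index lookup instead of a loop over dictionary keys.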

Preview the dataframes in the dictionaries


In [9]:
df_edits_dict['en'][0:5]


Out[9]:
month page_namespace n
0 200212 0 32284
1 200212 1 4
2 200212 2 2
3 200212 3 1
4 200303 0 5

In [10]:
df_rev_dict['en'][0:5]


Out[10]:
month page_namespace reverts bot_reverts bot_reverteds bot2bot_reverts
0 20010701 0 1 0 0 0
1 20010801 0 1 0 0 0
2 20011001 0 8 0 0 0
3 20011001 1 1 0 0 0
4 20011001 2 6 0 0 0

Clean and combine the two datasets

Convert dates

Remember that the two datasets represent months differently (YYYYMM for edits, YYYYMM01 for reverts)? That has to be fixed before they can be combined.


In [11]:
def truncate_my(s):
    """
    Truncate YYYYMMDD format to YYYYMM. For use with Series.apply()
    """
    s = str(s)
    return int(s[0:6])

Test function


In [12]:
truncate_my(20100101)


Out[12]:
201001

Yay!

Apply the transformation


In [13]:
for lang in langs:
    df_edits_dict[lang] = df_edits_dict[lang].set_index('month')
    df_rev_dict[lang]['month_my'] = df_rev_dict[lang]['month'].apply(truncate_my)
    df_rev_dict[lang] = df_rev_dict[lang].set_index('month_my')
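The same truncation can also be done without a per-row function, assuming the month column holds integers (as it does when `pd.read_csv` parses these files). This is only a sketch of an equivalent, using a few example values:

```python
import pandas as pd

months = pd.Series([20010701, 20100101, 20170401])

# Integer division by 100 drops the two trailing day digits,
# turning YYYYMMDD into YYYYMM for the whole column at once.
month_my = months // 100
print(month_my.tolist())
```

For the `20100101` test value used above, this yields `201001`, matching `truncate_my`.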

Combine the datasets, looking at only articles / ns0


In [14]:
combi_ns0_dict = {}
combi_dict = {}
for lang in langs:
    print(lang)

    # Line up article (ns0) reverts and edits side by side on the shared YYYYMM index
    combi_ns0_dict[lang] = pd.concat(
        [df_rev_dict[lang].query("page_namespace == 0"),
         df_edits_dict[lang].query("page_namespace == 0")],
        axis=1, join='outer')
    combi_ns0_dict[lang]['bot_edits'] = combi_ns0_dict[lang]['n']
    # Monthly proportion of bot edits that are bot-bot reverts
    combi_ns0_dict[lang]['prop_bot2bot_rv'] = combi_ns0_dict[lang]['bot2bot_reverts']/combi_ns0_dict[lang]['bot_edits']


de
en
es
fr
ja
pt
zh
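The concat above aligns the two frames on their month index and keeps every column from both, which is why page_namespace appears twice in the combined dataframe. As a sketch of an alternative, an explicit outer merge on the month key does the same alignment with only the columns of interest. The rows below are made up for illustration; only the column names come from the datasets.

```python
import pandas as pd

# Made-up ns0 rows for illustration.
reverts = pd.DataFrame({
    "month_my": [200401, 200402, 200403],
    "bot2bot_reverts": [0, 3, 5],
})
edits = pd.DataFrame({
    "month": [200401, 200402],
    "n": [496, 2362],
})

# Outer merge keeps months that appear in either dataset.
combined = reverts.merge(
    edits.rename(columns={"month": "month_my", "n": "bot_edits"}),
    on="month_my",
    how="outer",
).set_index("month_my")

# Months with no recorded bot edits get NaN rather than a division error.
combined["prop_bot2bot_rv"] = combined["bot2bot_reverts"] / combined["bot_edits"]
print(combined)
```

Because the join is an outer join, a month present in only one dataset still shows up, with NaN in the columns from the other dataset.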

Preview the combined dataframe

FYI, all things Wikipedia database related (especially bots) are generally way less consistent before 2004.


In [15]:
combi_ns0_dict['en'][29:39]


Out[15]:
month page_namespace reverts bot_reverts bot_reverteds bot2bot_reverts page_namespace n bot_edits prop_bot2bot_rv
200401 20040101 0 4335 15 2 0 0.0 496.0 496.0 0.0
200402 20040201 0 6331 12 4 0 0.0 2362.0 2362.0 0.0
200403 20040301 0 9046 5 14 0 0.0 3308.0 3308.0 0.0
200404 20040401 0 8514 6 1 0 0.0 766.0 766.0 0.0
200405 20040501 0 8918 9 5 0 0.0 1454.0 1454.0 0.0
200406 20040601 0 7176 26 16 0 0.0 38237.0 38237.0 0.0
200407 20040701 0 9841 25 19 0 0.0 54523.0 54523.0 0.0
200408 20040801 0 12213 58 78 0 0.0 38098.0 38098.0 0.0
200409 20040901 0 17096 24 104 0 0.0 28547.0 28547.0 0.0
200410 20041001 0 21477 123 135 0 0.0 44076.0 44076.0 0.0

The results: proportion of bot-bot reverts out of all bot edits, articles/ns0 only


In [16]:
sum_dict = {}
for lang in langs:
    #print(lang)
    # Pool across all months: total bot-bot reverts divided by total bot edits
    sum_dict[lang] = combi_ns0_dict[lang][['bot_edits','bot2bot_reverts']].sum()
    print(lang, "ns0 proportion:", (sum_dict[lang]['bot2bot_reverts']/sum_dict[lang]['bot_edits']*100).round(4), "%")
    print(sum_dict[lang])
    print("")


de ns0 proportion: 0.5043 %
bot_edits          10143166.0
bot2bot_reverts       51156.0
dtype: float64

en ns0 proportion: 0.4968 %
bot_edits          45250860.0
bot2bot_reverts      224821.0
dtype: float64

es ns0 proportion: 0.3771 %
bot_edits          14329191.0
bot2bot_reverts       54032.0
dtype: float64

fr ns0 proportion: 0.3895 %
bot_edits          13066804.0
bot2bot_reverts       50891.0
dtype: float64

ja ns0 proportion: 0.5437 %
bot_edits          5471866
bot2bot_reverts      29749
dtype: int64

pt ns0 proportion: 0.635 %
bot_edits          8853658.0
bot2bot_reverts      56217.0
dtype: float64

zh ns0 proportion: 0.4817 %
bot_edits          5580197.0
bot2bot_reverts      26882.0
dtype: float64
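Note that these percentages are pooled: each one is the total number of bot-bot reverts divided by the total number of bot edits across all months, not an average of the monthly proportions. As a quick check, the figures can be recomputed from the totals printed above and put in one table for easier comparison. This is only a convenience sketch; the totals are copied from the output above.

```python
import pandas as pd

# Totals copied from the output above (articles / ns0 only).
totals = pd.DataFrame(
    {
        "bot_edits": [10143166, 45250860, 14329191, 13066804, 5471866, 8853658, 5580197],
        "bot2bot_reverts": [51156, 224821, 54032, 50891, 29749, 56217, 26882],
    },
    index=["de", "en", "es", "fr", "ja", "pt", "zh"],
)

# Pooled percentage: total bot-bot reverts over total bot edits, times 100.
totals["pct_bot2bot"] = (totals["bot2bot_reverts"] / totals["bot_edits"] * 100).round(4)
print(totals.sort_values("pct_bot2bot", ascending=False))
```

In every language, bot-bot reverts make up well under 1% of bot edits to articles, ranging from about 0.38% (es) to about 0.64% (pt).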

