Loading and processing datasets of all bot-bot reverts for all languages

This notebook loads and parses the reverted_bot2bot datasets for seven languages, which were created from the Wikipedia revision history dumps by the scripts called in the Makefile in the root directory of the repository. This is the first notebook that can be run entirely from the files in this GitHub repository. It generates:

  • /datasets/parsed_dataframes/df_all_2016.pickle
  • /datasets/parsed_dataframes/df_all_2016.pickle.xz

These datasets are used for the analyses in section 5 (5-*.ipynb) and as the basis of the comment parsing analyses in sections 7 and 8.

This entire notebook can be run from the beginning with Kernel -> Restart & Run All in the menu bar. On a laptop with a Core i5-2540M processor, the processing takes about 5 minutes, then another 5 minutes to compress the output to xz; the full timed run recorded at the end of this notebook took about 28 minutes.


In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import glob
import datetime
%matplotlib inline

pd.set_option("display.max_columns", 100)  # show wide dataframes in full

In [2]:
start = datetime.datetime.now()

Data processing

Initial datasets of bot-bot reverts are stored in datasets/reverted_bot2bot/ as bzip2-compressed TSV files, one per language. We first load them into a dict of dataframes ({language: dataframe}), then combine them into a single tidy unified dataframe.


In [3]:
!ls -lah ../../datasets/reverted_bot2bot/*.bz2


-rw-rw-r-- 1 staeiou staeiou 6.3M Sep 10 17:29 ../../datasets/reverted_bot2bot/dewiki_20170420.tsv.bz2
-rw-rw-r-- 1 staeiou staeiou  43M Sep 10 17:29 ../../datasets/reverted_bot2bot/enwiki_20170420.tsv.bz2
-rw-rw-r-- 1 staeiou staeiou 7.8M Sep 10 17:29 ../../datasets/reverted_bot2bot/eswiki_20170420.tsv.bz2
-rw-rw-r-- 1 staeiou staeiou 8.1M Sep 10 17:29 ../../datasets/reverted_bot2bot/frwiki_20170420.tsv.bz2
-rw-rw-r-- 1 staeiou staeiou 4.1M Sep 10 17:29 ../../datasets/reverted_bot2bot/jawiki_20170420.tsv.bz2
-rw-rw-r-- 1 staeiou staeiou 6.1M Sep 10 17:29 ../../datasets/reverted_bot2bot/ptwiki_20170420.tsv.bz2
-rw-rw-r-- 1 staeiou staeiou 4.3M Sep 10 17:29 ../../datasets/reverted_bot2bot/zhwiki_20170420.tsv.bz2

In [4]:
!bunzip2 -kf ../../datasets/reverted_bot2bot/*.bz2
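
As an aside, pandas can read the bzip2 archives directly, so decompressing here is a convenience rather than a requirement; a sketch (compression is also inferred from the .bz2 extension by default):

df_en = pd.read_csv("../../datasets/reverted_bot2bot/enwiki_20170420.tsv.bz2",
                    sep="\t", compression="bz2")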

In [5]:
!ls -lah ../../datasets/reverted_bot2bot/*.tsv


-rw-rw-r-- 1 staeiou staeiou  21M Sep 10 17:29 ../../datasets/reverted_bot2bot/dewiki_20170420.tsv
-rw-rw-r-- 1 staeiou staeiou 165M Sep 10 17:29 ../../datasets/reverted_bot2bot/enwiki_20170420.tsv
-rw-rw-r-- 1 staeiou staeiou  26M Sep 10 17:29 ../../datasets/reverted_bot2bot/eswiki_20170420.tsv
-rw-rw-r-- 1 staeiou staeiou  28M Sep 10 17:29 ../../datasets/reverted_bot2bot/frwiki_20170420.tsv
-rw-rw-r-- 1 staeiou staeiou  14M Sep 10 17:29 ../../datasets/reverted_bot2bot/jawiki_20170420.tsv
-rw-rw-r-- 1 staeiou staeiou  20M Sep 10 17:29 ../../datasets/reverted_bot2bot/ptwiki_20170420.tsv
-rw-rw-r-- 1 staeiou staeiou  15M Sep 10 17:29 ../../datasets/reverted_bot2bot/zhwiki_20170420.tsv

In [6]:
glob.glob("../../datasets/reverted_bot2bot/??wiki_20170420.tsv")


Out[6]:
['../../datasets/reverted_bot2bot/enwiki_20170420.tsv',
 '../../datasets/reverted_bot2bot/eswiki_20170420.tsv',
 '../../datasets/reverted_bot2bot/frwiki_20170420.tsv',
 '../../datasets/reverted_bot2bot/dewiki_20170420.tsv',
 '../../datasets/reverted_bot2bot/jawiki_20170420.tsv',
 '../../datasets/reverted_bot2bot/ptwiki_20170420.tsv',
 '../../datasets/reverted_bot2bot/zhwiki_20170420.tsv']

In [7]:
df_dict = {}
for filename in glob.glob("../../datasets/reverted_bot2bot/??wiki_20170420.tsv"):
    lang_code = filename[32:34]  # two-letter language code, e.g. ".../enwiki_..." -> "en"
    df_dict[lang_code] = pd.read_csv(filename, sep="\t")
    df_dict[lang_code] = df_dict[lang_code].drop_duplicates()
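
The positional slice filename[32:34] depends on the exact length of the directory prefix; a path-independent equivalent (assuming the ??wiki_20170420.tsv naming convention) would be:

import os

lang_code = os.path.basename(filename)[:2]  # "enwiki_20170420.tsv" -> "en"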

See what we've got


In [8]:
for lang, lang_df in df_dict.items():
    print(lang, len(lang_df))


ja 45149
es 89077
pt 70973
en 512562
fr 96480
de 69690
zh 51536

In [9]:
df_dict['en'][0:2].transpose()


Out[9]:
0 1
rev_id 273691771 136526894
rev_timestamp 20090227173507 20070607044209
rev_user 6505923 4534303
rev_user_text Kbdankbot PbBot
rev_page 5040439 3046554
rev_sha1 qj45ne2z4yfexmpaz5wfnbm2yrmqt4j 3xtnw7u4w9h6cg1smw97mqnr1en6a55
rev_minor_edit False False
rev_deleted False False
rev_parent_id 2.59117e+08 1.20932e+08
archived False False
reverting_id 273789753 269875438
reverting_timestamp 20090228021925 20090210230337
reverting_user 1215485 7328338
reverting_user_text Cydebot Yobot
reverting_page 5040439 3046554
reverting_sha1 od1e685o9f5j7dtpd9zp2fhew3agjeq f8ye1kj73h0ymwqj6rlsimwk3jd9tef
reverting_minor_edit True True
reverting_deleted False False
reverting_parent_id 2.73692e+08 1.36527e+08
reverting_archived False False
reverting_comment Robot - Speedily moving category 9th-century i... Removing {{1911 talk}} per [[Wikipedia:Templat...
rev_revert_offset 1 1
revisions_reverted 1 1
reverted_to_rev_id 259117355 120932016
page_namespace 0 1

Combining into one tidy dataframe for all languages


In [10]:
# Start with an empty dataframe that has the same columns as the per-language ones
df_all = df_dict['en'].copy()
df_all = df_all.drop(df_all.index, axis=0)

for lang, lang_df in df_dict.items():
    lang_df['language'] = lang   # tag each row with its language code
    df_all = pd.concat([df_all, lang_df])
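
The same combination can be written as a single concat; this sketch uses assign, which also avoids mutating the frames stored in df_dict:

df_all = pd.concat(
    [lang_df.assign(language=lang) for lang, lang_df in df_dict.items()])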

In [11]:
df_all['language'].value_counts()


Out[11]:
en    512562
fr     96480
es     89077
pt     70973
de     69690
zh     51536
ja     45149
Name: language, dtype: int64

Number of bot-bot reverts, up to 2017-04-20


In [12]:
len(df_all)


Out[12]:
935467

Namespace type


In [13]:
def namespace_type(item):
    """
    Classifies namespace type. To be used with df.apply() on ['page_namespace']
    """
    if int(item) == 0:
        return 'article'
    elif int(item) == 14:
        return 'category'
    elif int(item) % 2 == 1:
        # odd-numbered namespaces are talk namespaces in MediaWiki
        return 'other talk'
    else:
        return 'other page'
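
For example, applied to a few namespace numbers:

[namespace_type(ns) for ns in (0, 1, 4, 14)]
# ['article', 'other talk', 'other page', 'category']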

In [14]:
df_all['namespace_type'] = df_all['page_namespace'].apply(namespace_type)

In [15]:
df_all['namespace_type'].value_counts()


Out[15]:
article       566731
category      183779
other page    117600
other talk     67357
Name: namespace_type, dtype: int64

Datetime parsing


In [16]:
def get_year(timestamp):
    return timestamp.year

In [17]:
df_all['reverting_timestamp_dt'] = pd.to_datetime(df_all['reverting_timestamp'], format="%Y%m%d%H%M%S")

df_all['reverted_timestamp_dt'] = pd.to_datetime(df_all['rev_timestamp'], format="%Y%m%d%H%M%S")

df_all = df_all.set_index('reverting_timestamp_dt')

# set_index() consumed the column, so re-create it for later use
df_all['reverting_timestamp_dt'] = pd.to_datetime(df_all['reverting_timestamp'], format="%Y%m%d%H%M%S")

df_all['time_to_revert'] = df_all['reverting_timestamp_dt'] - df_all['reverted_timestamp_dt']

# convert the timedeltas to fractional hours and days
df_all['time_to_revert_hrs'] = df_all['time_to_revert'].astype('timedelta64[s]')/(60*60)

df_all['time_to_revert_days'] = df_all['time_to_revert'].astype('timedelta64[s]')/(60*60*24)

df_all['reverting_year'] = df_all['reverting_timestamp_dt'].apply(get_year)

df_all['time_to_revert_days_log10'] = df_all['time_to_revert_days'].apply(np.log10)

df_all['time_to_revert_hrs_log10'] = df_all['time_to_revert_hrs'].apply(np.log10)
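
As an aside, the year could also be computed with the vectorized .dt accessor instead of apply; an equivalent one-liner:

df_all['reverting_year'] = df_all['reverting_timestamp_dt'].dt.year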

Filter all datasets to the same time bounds: 2001 to 2016


In [18]:
# partial-string slicing on the DatetimeIndex; the end date is inclusive through 2016-12-31 23:59:59
df_all = df_all.loc["2001-01-01":"2016-12-31"]

Counts per year: en


In [19]:
df_all[df_all['language']=='en'].reverting_year.value_counts().sort_index()


Out[19]:
2004         2
2005       130
2006      3118
2007     17038
2008     33110
2009     36378
2010     30731
2011     63161
2012     47983
2013    201171
2014     20570
2015     25880
2016     23638
Name: reverting_year, dtype: int64

Counts per year: all


In [20]:
df_all.reverting_year.value_counts().sort_index()


Out[20]:
2004       302
2005      1597
2006      6354
2007     29409
2008     54958
2009     81467
2010     68328
2011    146709
2012    103033
2013    342052
2014     26688
2015     36693
2016     27355
Name: reverting_year, dtype: int64

Other processing and metadata

Comment parsing: removing text in brackets/parens

Function for removing text within square brackets or parentheses, which is useful for aggregating comment messages.


In [21]:
# adapted from http://stackoverflow.com/questions/14596884/remove-text-between-and-in-python

def remove_brackets(test_str):
    """
    Takes a string and returns that string with text in brackets and parentheses removed
    """
    test_str = str(test_str)
    ret = ''
    skip1c = 0  # depth inside square brackets
    skip2c = 0  # depth inside parentheses
    for i in test_str:
        if i == '[':
            skip1c += 1
        elif i == '(':
            skip2c += 1
        elif i == ']' and skip1c > 0:
            skip1c -= 1
        elif i == ')' and skip2c > 0:
            skip2c -= 1
        elif skip1c == 0 and skip2c == 0:
            ret += i

    # collapse any leftover runs of whitespace
    return " ".join(ret.split())

In [22]:
df_all['reverting_comment_nobracket'] = df_all['reverting_comment'].apply(remove_brackets)

Functions for calculating botpair and botpair_sorted


In [23]:
def concat_botpair(row):
    """
    Concatenate the reverting and reverted user names. To be used with df.apply()
    on the entire row
    """
    return str(row['reverting_user_text']) + " rv " + str(row['rev_user_text'])

def sorted_botpair(row):
    """
    Returns the string representation of the sorted pair of bot names (reverting
    and reverted). To be used with df.apply() on the entire row. sorted() orders
    strings by Unicode code point, but the exact order doesn't matter: all we
    need is a consistent way of uniquely keying each pair.
    """
    return str(sorted([row['reverting_user_text'], row['rev_user_text']]))
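
A quick illustration with a hypothetical row (a plain dict works, since the functions only index by column name):

row = {'reverting_user_text': 'Cydebot', 'rev_user_text': 'Kbdankbot'}
concat_botpair(row)   # 'Cydebot rv Kbdankbot'
sorted_botpair(row)   # "['Cydebot', 'Kbdankbot']"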

In [24]:
df_all['botpair'] = df_all.apply(concat_botpair, axis=1)

In [25]:
df_all['botpair_sorted'] = df_all.apply(sorted_botpair, axis=1)

Reverts per page per botpair

This analysis is also replicated in R in reverts_per_page_R.ipynb


In [26]:
gb_lpb = df_all.groupby(["language", "rev_page", "botpair"])
gb_lpb_s = df_all.groupby(["language", "rev_page", "botpair_sorted"])

In [27]:
df_lpb = pd.DataFrame(gb_lpb['rev_id'].count()).reset_index().rename(columns={"rev_id":"reverts_per_page_botpair"})
df_lpb[0:5]


Out[27]:
language rev_page botpair reverts_per_page_botpair
0 de 61 RedBot rv EmausBot 1
1 de 81 DumZiBoT rv CarsracBot 1
2 de 81 MerlIwBot rv ZéroBot 1
3 de 82 Alecs.bot rv SieBot 1
4 de 101 Xqbot rv Dinamik-bot 1

In [28]:
df_lpb_s = pd.DataFrame(gb_lpb_s['rev_id'].count()).reset_index().rename(columns={"rev_id":"reverts_per_page_botpair_sorted"})
df_lpb_s[0:5]


Out[28]:
language rev_page botpair_sorted reverts_per_page_botpair_sorted
0 de 61 ['EmausBot', 'RedBot'] 1
1 de 81 ['CarsracBot', 'DumZiBoT'] 1
2 de 81 ['MerlIwBot', 'ZéroBot'] 1
3 de 82 ['Alecs.bot', 'SieBot'] 1
4 de 101 ['Dinamik-bot', 'Xqbot'] 1

In [29]:
df_all = pd.merge(df_all, df_lpb, how='left',
         left_on=["language", "rev_page", "botpair"],
         right_on=["language", "rev_page", "botpair"])

df_all = pd.merge(df_all, df_lpb_s, how='left',
         left_on=["language", "rev_page", "botpair_sorted"],
         right_on=["language", "rev_page", "botpair_sorted"])
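
Equivalently, the same per-group counts could have been attached in one step with groupby().transform, a sketch that skips the intermediate dataframes and the merges (the botpair_sorted column works the same way):

df_all['reverts_per_page_botpair'] = (
    df_all.groupby(["language", "rev_page", "botpair"])['rev_id'].transform('count'))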

Check time to revert for negatives


In [30]:
len(df_all.query("time_to_revert_days < 0"))


Out[30]:
0

In [31]:
len(df_all.query("time_to_revert_days > 0"))


Out[31]:
924945

In [32]:
df_all.query("time_to_revert_days < 0").groupby("language")['rev_id'].count()


Out[32]:
Series([], Name: rev_id, dtype: int64)

In [33]:
df_all.query("time_to_revert_days > 0").groupby("language")['rev_id'].count()


Out[33]:
language
de     69268
en    502910
es     88771
fr     96446
ja     45147
pt     70951
zh     51452
Name: rev_id, dtype: int64

Final data format


In [34]:
len(df_all)


Out[34]:
924945

In [35]:
df_all.sample(2).transpose()


Out[35]:
380107 712281
archived False False
language en fr
page_namespace 4 14
rev_deleted False False
rev_id 529614482 27360970
rev_minor_edit False True
rev_page 4626266 435683
rev_parent_id 5.29614e+08 1.65655e+07
rev_revert_offset 1 1
rev_sha1 dukzvpj8lf9az6bafkejb5jyqrih5gz iyps8lr1z8wletbwuqa1jg8hdkzxq7f
rev_timestamp 20121224160445 20080314141530
rev_user 9738871 115455
rev_user_text SDPatrolBot Escarbot
reverted_to_rev_id 529614091 16565485
reverting_archived False False
reverting_comment Empty. rm [[Special:Contributions/Natokerkotha... robot Retire: [[el:Κατηγορία:Θάνατοι το 1403]]
reverting_deleted False False
reverting_id 529615157 30902377
reverting_minor_edit True True
reverting_page 4626266 435683
reverting_parent_id 5.29614e+08 2.7361e+07
reverting_sha1 fdq6jlab7kr949hd0g4f1gw15hd1qhg l1ztben3w76qodpjn2o2sz3p369pjux
reverting_timestamp 20121224161212 20080623043426
reverting_user 6327251 327431
reverting_user_text HBC AIV helperbot7 Alexbot
revisions_reverted 1 1
namespace_type other page category
reverted_timestamp_dt 2012-12-24 16:04:45 2008-03-14 14:15:30
reverting_timestamp_dt 2012-12-24 16:12:12 2008-06-23 04:34:26
time_to_revert 0 days 00:07:27 100 days 14:18:56
time_to_revert_hrs 0.124167 2414.32
time_to_revert_days 0.00517361 100.596
reverting_year 2012 2008
time_to_revert_days_log10 -2.28621 2.00258
time_to_revert_hrs_log10 -0.905995 3.38279
reverting_comment_nobracket Empty. rm . robot Retire:
botpair HBC AIV helperbot7 rv SDPatrolBot Alexbot rv Escarbot
botpair_sorted ['HBC AIV helperbot7', 'SDPatrolBot'] ['Alexbot', 'Escarbot']
reverts_per_page_botpair 278 1
reverts_per_page_botpair_sorted 278 1

Output to file


In [36]:
!rm ../../datasets/parsed_dataframes/df_all_2016.p*


rm: cannot remove '../../datasets/parsed_dataframes/df_all_2016.p*': No such file or directory

In [37]:
df_all.to_pickle("../../datasets/parsed_dataframes/df_all_2016.pickle")

In [38]:
!xz -k -e -9 ../../datasets/parsed_dataframes/df_all_2016.pickle

In [39]:
df_all.to_csv("../../datasets/parsed_dataframes/df_all_2016.tsv", sep="\t")

In [40]:
!xz -k -e -9 ../../datasets/parsed_dataframes/df_all_2016.tsv
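
Downstream notebooks (e.g. the section 5 analyses) can load the pickle back with:

df_all = pd.read_pickle("../../datasets/parsed_dataframes/df_all_2016.pickle")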

In [41]:
end = datetime.datetime.now()

time_to_run = end - start
minutes = time_to_run.seconds // 60  # .seconds is the within-day remainder; fine for runs under a day
seconds = time_to_run.seconds % 60
print("Total runtime: ", minutes, "minutes, ", seconds, "seconds")


Total runtime:  28 minutes,  11 seconds