Section 5.1: Descriptive statistics on the bot-bot revert dataset

This is the first data analysis script used to produce findings in the paper. It can be run entirely from the files in this GitHub repository.

This entire notebook can be run from the beginning with Kernel -> Restart & Run All in the menu bar. It takes about 1 minute to run on a laptop running a Core i5-2540M processor.


In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pickle
import datetime
%matplotlib inline

In [2]:
start = datetime.datetime.now()

Data processing


In [3]:
!unxz -kf ../../datasets/parsed_dataframes/df_all_2016.pickle.xz

In [4]:
!ls -lah ../../datasets/parsed_dataframes/*


-rw-rw-r-- 1 staeiou staeiou 398M Sep 10 17:46 ../../datasets/parsed_dataframes/df_all_2016.pickle
-rw-rw-r-- 1 staeiou staeiou  98M Sep 10 17:46 ../../datasets/parsed_dataframes/df_all_2016.pickle.xz
-rw-rw-r-- 1 staeiou staeiou 507M Sep 10 17:53 ../../datasets/parsed_dataframes/df_all_2016.tsv
-rw-rw-r-- 1 staeiou staeiou 105M Sep 10 17:53 ../../datasets/parsed_dataframes/df_all_2016.tsv.xz
-rw-rw-r-- 1 staeiou staeiou 407M Sep 10 18:12 ../../datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle
-rw-rw-r-- 1 staeiou staeiou  98M Sep 10 18:12 ../../datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle.xz
-rw-rw-r-- 1 staeiou staeiou 558M Sep 10 18:19 ../../datasets/parsed_dataframes/df_all_comments_parsed_2016.tsv
-rw-rw-r-- 1 staeiou staeiou 105M Sep 10 18:19 ../../datasets/parsed_dataframes/df_all_comments_parsed_2016.tsv.xz
-rw-r--r-- 1 staeiou staeiou  12M Sep 10 18:44 ../../datasets/parsed_dataframes/possible_botfights.pickle
-rw-r--r-- 1 staeiou staeiou 2.5M Sep 10 18:39 ../../datasets/parsed_dataframes/possible_botfights.pickle.xz
-rw-r--r-- 1 staeiou staeiou  18M Sep 10 18:44 ../../datasets/parsed_dataframes/possible_botfights.tsv
-rw-r--r-- 1 staeiou staeiou 2.7M Sep 10 18:39 ../../datasets/parsed_dataframes/possible_botfights.tsv.xz

In [5]:
with open("../../datasets/parsed_dataframes/df_all_2016.pickle", "rb") as f:
    df_all = pickle.load(f)
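
As an aside, pandas can read and write xz-compressed pickles directly, which would make the separate decompression step optional. The following is a minimal sketch on a throwaway DataFrame in a temporary directory. The file name `toy.pickle.xz` and its contents are made up for illustration, and the approach has not been checked here against the real `df_all_2016.pickle.xz`.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({"language": ["en", "pt"], "rev_id": [1, 2]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy.pickle.xz")
    # Compression is inferred from the ".xz" extension on both write and read.
    toy.to_pickle(path)
    roundtrip = pd.read_pickle(path)

print(roundtrip.equals(toy))
```

Decompressing once with `unxz -k`, as this notebook does, keeps both files on disk and avoids paying the decompression cost on every load.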

In [6]:
len(df_all)


Out[6]:
924945

Format of dataset


In [7]:
df_all.sample(2).transpose()


Out[7]:
257762 190199
archived False False
language en pt
page_namespace 0 0
rev_deleted False False
rev_id 505057470 14262319
rev_minor_edit True True
rev_page 16811020 652663
rev_parent_id 5.0451e+08 1.39473e+07
rev_revert_offset 1 1
rev_sha1 pgx38ss7493qq3m09w8s284q141yna5 3qfqf373f8mmj9rjevkvojmcqigih8p
rev_timestamp 20120731065441 20090218014509
rev_user 11969601 372437
rev_user_text ZéroBot CarsracBot
reverted_to_rev_id 504509717 13947282
reverting_archived False False
reverting_comment [[User:Addbot|Bot:]] Migrating 1 interwiki lin... Bot: Modificando: [[ar:مغامرات عدنان]]
reverting_deleted False False
reverting_id 545299797 14671599
reverting_minor_edit True True
reverting_page 16811020 652663
reverting_parent_id 5.05057e+08 1.42623e+07
reverting_sha1 2xujpdce40lewj70g53fgx8u247ukgl 59k17crtpl8kdchta88rk5amep8snxy
reverting_timestamp 20130318232427 20090323134430
reverting_user 6569922 498058
reverting_user_text Addbot Almabot
revisions_reverted 1 1
namespace_type article article
reverted_timestamp_dt 2012-07-31 06:54:41 2009-02-18 01:45:09
reverting_timestamp_dt 2013-03-18 23:24:27 2009-03-23 13:44:30
time_to_revert 230 days 16:29:46 33 days 11:59:21
time_to_revert_hrs 5536.5 803.989
time_to_revert_days 230.687 33.4995
reverting_year 2013 2009
time_to_revert_days_log10 2.36302 1.52504
time_to_revert_hrs_log10 3.74323 2.90525
reverting_comment_nobracket Migrating 1 interwiki links, now provided by on Bot: Modificando:
botpair Addbot rv ZéroBot Almabot rv CarsracBot
botpair_sorted ['Addbot', 'ZéroBot'] ['Almabot', 'CarsracBot']
reverts_per_page_botpair 1 1
reverts_per_page_botpair_sorted 1 1
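
The derived columns in this sample (`time_to_revert`, its hour, day, and log10 variants, `botpair`, and `botpair_sorted`) are computed in an earlier processing step that is not shown in this notebook. The sketch below shows one way they could be reproduced from the raw fields for the two sampled rows. It is an assumption about how the columns relate to each other, inferred from the values displayed above, not the original processing code.

```python
import numpy as np
import pandas as pd

# The two rows from the sample above (timestamps and bot names only).
sample = pd.DataFrame({
    "rev_timestamp": ["20120731065441", "20090218014509"],
    "reverting_timestamp": ["20130318232427", "20090323134430"],
    "rev_user_text": ["ZéroBot", "CarsracBot"],
    "reverting_user_text": ["Addbot", "Almabot"],
})

# MediaWiki-style timestamps are YYYYMMDDHHMMSS strings.
fmt = "%Y%m%d%H%M%S"
reverted_dt = pd.to_datetime(sample["rev_timestamp"], format=fmt)
reverting_dt = pd.to_datetime(sample["reverting_timestamp"], format=fmt)

# Time between the reverted edit and the edit that reverted it.
sample["time_to_revert"] = reverting_dt - reverted_dt
sample["time_to_revert_hrs"] = sample["time_to_revert"].dt.total_seconds() / 3600
sample["time_to_revert_days"] = sample["time_to_revert_hrs"] / 24
sample["time_to_revert_days_log10"] = np.log10(sample["time_to_revert_days"])

# Directed pair ("A rv B" means A reverted B) and an order-independent pair.
sample["botpair"] = sample["reverting_user_text"] + " rv " + sample["rev_user_text"]
sample["botpair_sorted"] = [
    sorted(pair)
    for pair in zip(sample["reverting_user_text"], sample["rev_user_text"])
]

print(sample[["time_to_revert", "time_to_revert_hrs", "botpair"]])
```

The sorted pair is useful when counting interactions between two bots regardless of which one did the reverting.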

Descriptive statistics

Number of bot-bot reverts per language over time, articles only

EGBF looked at bot-bot reverts from 2001 to 2010. How have things changed since 2010?

Paper section:


In [8]:
gb = df_all[df_all['page_namespace']==0].groupby(["language","reverting_year"])

In [9]:
sns.set(font_scale=1.5)
gb['rev_id'].count().unstack().transpose()


Out[9]:
language de en es fr ja pt zh
reverting_year
2004 6.0 2.0 NaN 2.0 284.0 1.0 5.0
2005 376.0 108.0 492.0 227.0 29.0 205.0 20.0
2006 435.0 1112.0 796.0 738.0 117.0 722.0 187.0
2007 1379.0 5488.0 1689.0 2189.0 903.0 3179.0 634.0
2008 3230.0 13317.0 3112.0 3718.0 2189.0 2696.0 2097.0
2009 4710.0 16951.0 3415.0 4340.0 3198.0 5308.0 2755.0
2010 4967.0 13366.0 7109.0 5792.0 5277.0 4850.0 4452.0
2011 10386.0 30629.0 16895.0 14393.0 7642.0 15410.0 6318.0
2012 9554.0 26208.0 9131.0 9735.0 6888.0 7930.0 5742.0
2013 19222.0 99641.0 21166.0 14203.0 6852.0 17896.0 19089.0
2014 482.0 11310.0 1969.0 275.0 827.0 232.0 859.0
2015 148.0 13416.0 1145.0 399.0 236.0 1517.0 675.0
2016 43.0 13245.0 1180.0 170.0 330.0 178.0 196.0
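
The counts in the table above display as floats, and `es` shows `NaN` for 2004, because `unstack()` fills any (language, year) combination with no rows with `NaN`, which forces the column to a float type. The sketch below reproduces that behavior on a small made-up frame and shows `unstack(fill_value=0)` as one possible way to get zeros instead. The toy data is purely illustrative.

```python
import pandas as pd

# Made-up data: "es" has no rows for 2004, mirroring the gap in the real table.
toy = pd.DataFrame({
    "language": ["de", "de", "es", "es", "es"],
    "reverting_year": [2004, 2005, 2005, 2005, 2005],
    "rev_id": [1, 2, 3, 4, 5],
})

counts = toy.groupby(["language", "reverting_year"])["rev_id"].count()

# Default unstack: missing combinations become NaN.
with_nan = counts.unstack().transpose()

# fill_value=0 records "no reverts" explicitly for missing combinations.
filled = counts.unstack(fill_value=0).transpose()

print(with_nan)
print(filled)
```

Whether zero is the right fill depends on the plot: on a log-scaled y axis, a zero count cannot be drawn, so leaving the gap as `NaN` may be preferable there.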

Plot


In [10]:
sns.set(font_scale=1.5)
sns.set_style("whitegrid")

groupby_unstack = gb['revisions_reverted'].count().unstack().transpose()
ax = groupby_unstack.plot(kind='line', logy=True, figsize=[10,6], colormap="Accent")    

plt.xlim(2004,2018)

plt.ylabel("Number of bot-bot reverts (log scaled)")
plt.xlabel("Year of reverting edit")
#plt.suptitle("Bot-bot reverts per language by reverting year, articles only")

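leg = plt.legend()
# Thicken the legend lines. get_lines() avoids the legendHandles attribute,
# which newer matplotlib releases renamed to legend_handles.
for legobj in leg.get_lines():
    legobj.set_linewidth(8.0)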
    
plt.savefig("reverts-yearly-counts.pdf", bbox_inches='tight', dpi=600)


Number of bot-bot reverts per language, all years, articles only


In [11]:
gb['rev_id'].count().unstack().transpose().sum()


Out[11]:
language
de     54938.0
en    244793.0
es     68099.0
fr     56181.0
ja     34772.0
pt     60124.0
zh     43029.0
dtype: float64

Total number of bot-bot reverts, all 7 languages, all years, articles only


In [12]:
gb['rev_id'].count().unstack().transpose().sum().sum()


Out[12]:
561936.0

Number of bot-bot reverts per language by namespace type, all years


In [13]:
gb_lang_nstype = df_all.groupby(["language", "namespace_type"])

gb_lang_nstype['revisions_reverted'].count().unstack().transpose()


Out[13]:
language de en es fr ja pt zh
namespace_type
article 54938 244793 68099 56181 34772 60124 43029
category 10590 106231 19079 21123 10075 9929 6725
other page 1051 108361 1556 614 277 536 1456
other talk 2689 43525 37 18528 23 362 242

Plot


In [14]:
sns.set(font_scale=2)
sns.set_style("whitegrid")

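# factorplot was renamed catplot, and its size argument renamed height,
# in seaborn 0.9; this form requires seaborn >= 0.9.
g = sns.catplot(data=df_all,
                x='language',
                y=None,
                hue='namespace_type',
                kind='count',
                height=8,
                palette="Accent",
                aspect=1)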

plt.savefig("reverts-namespace-counts.pdf", bbox_inches='tight', dpi=600)


Number of bot-bot reverts per language, all years, all namespaces


In [15]:
gb_lang_nstype['revisions_reverted'].count().unstack().transpose().sum()


Out[15]:
language
de     69268
en    502910
es     88771
fr     96446
ja     45147
pt     70951
zh     51452
dtype: int64

Number of bot-bot reverts by namespace type, all 7 languages, all years, all namespaces


In [16]:
gb_lang_nstype['revisions_reverted'].count().unstack().sum()


Out[16]:
namespace_type
article       561936
category      183752
other page    113851
other talk     65406
dtype: int64

Proportion of bot-bot reverts by namespace type, all 7 languages, all years, all namespaces


In [17]:
df_all['namespace_type'].value_counts(normalize=True)


Out[17]:
article       0.607535
category      0.198663
other page    0.123089
other talk    0.070713
Name: namespace_type, dtype: float64

Proportion of bot-bot reverts outside of the main/article namespace:

Referenced in paper section 5.1


In [18]:
1 - df_all['namespace_type'].value_counts(normalize=True)['article']


Out[18]:
0.3924654979485267

Runtime


In [19]:
end = datetime.datetime.now()

time_to_run = end - start
# total_seconds() also covers runs longer than a day, unlike .seconds
minutes, seconds = divmod(int(time_to_run.total_seconds()), 60)
print("Total runtime: ", minutes, "minutes, ", seconds, "seconds")


Total runtime:  0 minutes,  42 seconds
