This is a data analysis script used to produce findings in the paper, which you can run entirely from the files in this GitHub repository. This notebook produces part of the analysis for all languages, and the notebook 4-3-reverts-per-page-enwiki-plots
is an independent replication of this analysis in R that contains the English Wikipedia plots included in the paper. Note that the R notebook cannot be run on mybinder due to memory requirements, while this one can be.
This entire notebook can be run from the beginning with Kernel -> Restart & Run All in the menu bar. It takes less than 1 minute to run on a laptop with a Core i5-2540M processor.
In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import glob
import datetime
import pickle
%matplotlib inline
In [2]:
start = datetime.datetime.now()
In [3]:
!unxz --keep --force ../../datasets/parsed_dataframes/df_all_2016.pickle.xz
In [4]:
!ls ../../datasets/parsed_dataframes/*.pickle
In [5]:
with open("../../datasets/parsed_dataframes/df_all_2016.pickle", "rb") as f:
df_all = pickle.load(f)
In [6]:
df_all.sample(2).transpose()
Out[6]:
Grouping by these three columns creates a simple and useful intersection for this metric. If there is only one revert for a language/page ID/botpair_sorted set, then the reverting bot's revert was necessarily unreciprocated by the reverted bot. If there are two reverts, then the most likely scenario is that the reverting bot's revert was followed by a revert from the reverted bot, although it could also mean that the reverting bot reverted the reverted bot twice. Higher counts imply heavy back-and-forth reverting between two bots on a single page.
We count the number of reverts with the same language, page ID, and sorted bot pair, then assign that value to reverts_per_page_botpair_sorted
for every revert matching these three columns. Note that this initial analysis is conducted in 0-load-process-data.ipynb
, but we have included it again here for clarity.
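The grouping-and-counting logic described above can be sketched on toy data (the bot pair labels, page IDs, and revision IDs below are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical reverts: three between the same bot pair on page 1,
# and one unreciprocated revert on page 2.
toy = pd.DataFrame({
    "language": ["en", "en", "en", "en"],
    "rev_page": [1, 1, 1, 2],
    "botpair_sorted": ["BotA rv BotB"] * 3 + ["BotC rv BotD"],
    "rev_id": [10, 11, 12, 13],
})

# Count reverts per language / page / sorted bot pair, mirroring the
# groupby in the cells below.
counts = (toy.groupby(["language", "rev_page", "botpair_sorted"])["rev_id"]
             .count()
             .reset_index()
             .rename(columns={"rev_id": "reverts_per_page_botpair_sorted"}))
print(counts)
# Page 1 gets a count of 3 (back-and-forth), page 2 a count of 1 (unreciprocated).
```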
In [7]:
groupby_lang_page_bps = df_all.groupby(["language", "rev_page", "botpair_sorted"])
In [8]:
df_groupby = (pd.DataFrame(groupby_lang_page_bps['rev_id'].count())
                .reset_index()
                .rename(columns={"rev_id": "reverts_per_page_botpair_sorted"}))
df_groupby.sample(25)
Out[8]:
In [9]:
df_all = df_all.drop(columns="reverts_per_page_botpair_sorted")
df_all = pd.merge(df_all, df_groupby, how='left',
                  on=["language", "rev_page", "botpair_sorted"])
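An equivalent one-step alternative to the groupby-then-merge approach is pandas' `groupby(...).transform('count')`, which broadcasts each group's size directly back onto the original rows. A sketch on made-up data (the bot pair labels and IDs are illustrative, not from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "language": ["en", "en", "en", "en"],
    "rev_page": [1, 1, 1, 2],
    "botpair_sorted": ["BotA rv BotB"] * 3 + ["BotC rv BotD"],
    "rev_id": [10, 11, 12, 13],
})

# transform('count') returns a Series aligned to the original index,
# so no separate merge step is needed.
toy["reverts_per_page_botpair_sorted"] = (
    toy.groupby(["language", "rev_page", "botpair_sorted"])["rev_id"]
       .transform("count"))
print(toy["reverts_per_page_botpair_sorted"].tolist())
# → [3, 3, 3, 1]
```

The merge-based version used in the notebook has the advantage of producing `df_groupby` as a standalone table that can be inspected on its own, which is done with `.sample(25)` above.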
For example, 528,104 reverts were not reciprocated at all. 25,528 reverts were part of a chain of 2 reverts between the same two bots on the same page in the same language, 3,987 reverts were part of such a chain of 3 reverts, and so on.
In [10]:
df_all.query("page_namespace == 0").reverts_per_page_botpair_sorted.value_counts().sort_index()
Out[10]:
In [11]:
df_all.query("page_namespace == 0").reverts_per_page_botpair_sorted.value_counts().sum()
Out[11]:
In [12]:
import matplotlib.ticker
sns.set(font_scale=1.5, style="whitegrid")
fig, ax = plt.subplots(figsize=[8,6])
df_all.query("page_namespace == 0").reverts_per_page_botpair_sorted.value_counts().sort_index().plot(kind='bar', ax=ax)
ax.set_yscale('log')
ax.set_ylim((pow(10,0),pow(10,6)))
ax.set_ylabel("Number of reverts (log scale)")
ax.set_xlabel("Number of reverts on page between the same two bots")
ax.yaxis.set_major_formatter(matplotlib.ticker.FormatStrFormatter('%d'))
In [13]:
df_all.query("page_namespace == 0 and language=='en'").reverts_per_page_botpair_sorted.value_counts().sort_index()
Out[13]:
In [14]:
sns.set(font_scale=1.5, style="whitegrid")
df_all.query("page_namespace == 0 and language == 'en'").reverts_per_page_botpair_sorted.value_counts().sort_index().plot(kind='bar')
Out[14]:
In [15]:
df_all.query("page_namespace == 0 and language=='en'").reverts_per_page_botpair_sorted.value_counts().sum()
Out[15]:
In [16]:
len(df_all.query("page_namespace == 0 and language=='en'"))
Out[16]:
In [17]:
gb = df_all.query("reverts_per_page_botpair_sorted > 500").groupby(["language", "page_namespace", "rev_page", "botpair_sorted"])
In [18]:
gb['rev_id'].count()
Out[18]:
From a manual lookup:
page_id page_title
In [19]:
len(df_all.query("language == 'en' and rev_page == 4626266"))
Out[19]:
In [20]:
len(df_all.query("language == 'en' and rev_page == 11238105"))
Out[20]:
In [21]:
len(df_all.query("language == 'en' and rev_page == 5964327"))
Out[21]:
In [22]:
df_all.query("language == 'en' and rev_page == 5971841").groupby("botpair")['time_to_revert_days'].median()
Out[22]:
In [23]:
end = datetime.datetime.now()
time_to_run = end - start
minutes, seconds = divmod(time_to_run.seconds, 60)
print("Total runtime:", minutes, "minutes,", seconds, "seconds")