This is a data analysis script used to produce findings in the paper. You can run it entirely from the files in this GitHub repository. It is an R Jupyter notebook that requires the IRKernel and the packages data.table and ggplot2. It cannot be run in mybinder because of its memory requirements. 4-3-reverts-per-page-exploratory.ipynb is an independent reproduction of this notebook, but this notebook contains the plots for English Wikipedia that appear in the paper.
This entire notebook can be run from the beginning with Kernel -> Restart & Run All in the menu bar. It takes about 10 minutes to run on a laptop with a Core i5-2540M processor.
In [1]:
library(data.table)
library(ggplot2)
dt = data.table(read.table("../../datasets/reverted_bot2bot/enwiki_20170420.tsv.bz2", sep="\t", header=T, quote="",
comment.char=""))
dt$rev_ts = as.POSIXct(format(dt$rev_timestamp, scientific=F), format="%Y%m%d%H%M%S")
dt$rev_month = as.Date(paste(format(dt$rev_ts, "%Y-%m-"), "01", sep=""))
dt$rev_day = as.Date(dt$rev_ts)
dt$reverting_ts = as.POSIXct(format(dt$reverting_timestamp, scientific=F), format="%Y%m%d%H%M%S")
dt$reverting_month = as.Date(paste(format(dt$reverting_ts, "%Y-%m-"), "01", sep=""))
dt$reverting_day = as.Date(dt$reverting_ts)
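The timestamp handling above can be illustrated on toy values. This is a minimal sketch, assuming the timestamp columns are read in as numbers of the form YYYYMMDDHHMMSS; the two values below are made up, and `tz = "UTC"` is an assumption added here so the result does not depend on the local time zone (the cell above does not set it).

```r
# Toy stand-ins for dt$rev_timestamp (hypothetical values, not from the dataset)
ts_num <- c(20170420123456, 20091231235959)

# scientific = FALSE keeps all 14 digits instead of printing e.g. 2.017042e+13
ts_chr <- format(ts_num, scientific = FALSE)

# Parse the digit string into a date-time; tz = "UTC" is an assumption of this sketch
ts <- as.POSIXct(ts_chr, format = "%Y%m%d%H%M%S", tz = "UTC")

# First day of the month and calendar day, built the same way as in the cell above
month_start <- as.Date(paste(format(ts, "%Y-%m-"), "01", sep = ""))
day <- as.Date(ts)

print(data.frame(ts_chr, ts, month_start, day))
```

Without `scientific = FALSE`, a large numeric timestamp can be formatted in exponent notation, which the `%Y%m%d%H%M%S` pattern cannot parse.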
In [2]:
dt.by_page = setkey(dt, rev_page, rev_user_text, reverting_user_text)
page_reverts = dt[
page_namespace == 0,
list(reverts=length(rev_id),
bots_involved=length(unique(c(rev_user_text, reverting_user_text))),
reverting_bots=length(unique(reverting_user_text)),
reverted_bots=length(unique(rev_user_text)),
first_revert=min(reverting_ts),
last_revert=max(reverting_ts)),
list(rev_page)]
In [3]:
page_reverts[order(page_reverts$reverts, decreasing=T),][1:10,]
In [4]:
ggplot(page_reverts, aes(x=reverts)) +
geom_density(adjust=10) +
scale_x_log10()
In [5]:
ggplot(page_reverts[reverts >= 5,], aes(x=reverts)) +
geom_density(adjust=3) +
scale_x_log10()
In [6]:
ggplot(page_reverts[reverts >= 10,], aes(x=reverts)) +
geom_histogram(binwidth=1)
In [7]:
page_reverts[reverts >= 10,][1:10,]
In [8]:
page_bot_pairs = dt[
page_namespace == 0,
list(reverts=length(unique(rev_id)),
first_revert=min(reverting_ts),
last_revert=max(reverting_ts)),
list(rev_page, bots=paste(pmin(as.character(reverting_user_text), as.character(rev_user_text)),
pmax(as.character(reverting_user_text), as.character(rev_user_text))))]
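The `pmin`/`pmax` trick above builds an order-independent key, so that "A reverts B" and "B reverts A" land in the same group. A minimal sketch on made-up data (the bot names and IDs below are hypothetical):

```r
library(data.table)

# Toy revert records: BotA and BotB trade reverts on page 1
toy <- data.table(
  rev_page            = c(1, 1, 1, 2),
  rev_id              = c(101, 102, 103, 104),
  rev_user_text       = c("BotA", "BotB", "BotA", "BotC"),
  reverting_user_text = c("BotB", "BotA", "BotB", "BotA")
)

# Put the alphabetically smaller name first so direction does not matter
toy[, bots := paste(pmin(rev_user_text, reverting_user_text),
                    pmax(rev_user_text, reverting_user_text))]

# Count distinct reverted revisions per page and unordered bot pair
pairs <- toy[, list(reverts = length(unique(rev_id))), by = list(rev_page, bots)]
print(pairs)
```

Here all three reverts on page 1 collapse into the single pair "BotA BotB", regardless of which bot did the reverting.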
In [9]:
p = ggplot(page_bot_pairs, aes(x=reverts)) +
theme_bw() +
geom_bar() +
scale_y_log10("frequency (log scaled)") +
scale_x_continuous("bot-pair article reverts")
png("enwiki_bot_pair_article_reverts.png", height=1600, width=1600, res=400)
print(p)
dev.off()
print(p)
table(page_bot_pairs$reverts)
sum(page_bot_pairs[reverts > 2,]$reverts)
sum(page_bot_pairs$reverts)
In [10]:
reverts_duration = page_bot_pairs[,
list(
duration.mean=mean(as.numeric(difftime(last_revert, first_revert, units="secs"))),
duration.median=median(as.numeric(difftime(last_revert, first_revert, units="secs")))),
list(reverts)]
p = ggplot(reverts_duration[reverts >= 2,],
aes(x=reverts, y=duration.median)) +
theme_bw() +
geom_line(color="gray", linetype=2) +
geom_point() +
scale_y_log10("Median revert-pair duration (log scaled)",
breaks=c(60*60*24, 60*60*24*7, 60*60*24*30, 60*60*24*365, 60*60*24*365*2),
labels=c("day", "week", "month", "year", "2 years")) +
scale_x_continuous("bot-pair article reverts", breaks=c(1, 5, 10, 15, 20, 25, 30, 35, 40))
png("enwiki_bot_revert_pair_duration.png", height=1600, width=1600, res=400)
print(p)
dev.off()
print(p)
In [11]:
page_bot_pairs[order(page_bot_pairs$reverts, decreasing=T),][1:10,]
This is a gold mine!
The longest single-page mutual bot-on-bot revert sequence lasted 41 reverts and it continued over the course of 2 and a half years. It happened on "List of Mathematicians (X)" between Mathbot and FrescoBot. Mathbot updates the lists of mathematicians based on categorizations in Wikipedia. FrescoBot fixes link syntax. When the target of the link and the label are the same, it simplifies the link. Like clockwork, FrescoBot writes out a link of the structure [[
Actually, it's a tie, and honestly this second case might be more interesting. AnomieBOT and Cyberbot II also had a 41-revert sequence on a single page, but theirs played out over the course of only 4 days! On the article about "Foreign relations of the Central African Republic", AnomieBOT claimed to be "rescuing orphaned refs" -- adding a reference to a dead link by using the Internet Archive to provide a copy of the old referenced PDF titled "International Criminal Court: Background – Situation in the Central African Republic". Every time AnomieBOT "rescued" the link, Cyberbot II swung by and removed the reference with the confusing comment "Rescuing 1 sources". This case is arguably worse than FrescoBot and Mathbot because it spanned many pages. The bots had similar fights on the biography of the songwriter Rico Love (35 reverts), the broadcaster Dougie Vipond (31 reverts), and the song "Seasons Change" (31 reverts). The list keeps going. All told, these bots reverted each other 396 times on 15 pages -- constantly adding links to Internet Archive pages and then removing them again.
In [12]:
ggplot(page_bot_pairs, aes(x=reverts)) +
geom_histogram(binwidth=1)
In [13]:
ggplot(page_bot_pairs[reverts > 2,], aes(x=reverts)) +
geom_histogram(binwidth=1)
There's definitely some shape to this. Most of the mass is at low counts, but there is still some density beyond roughly 20 reverts. Let's look at 3-10, 10-20, 20-30, and 30-40.
In [14]:
page_bot_pairs[reverts > 2 & reverts < 10,][1:10,]
This is mostly fighting, but some of the fights are less obvious given so few interactions.
In [15]:
page_bot_pairs[reverts >= 12 & reverts < 25,
list(reverts=sum(reverts),
duration.median=median(as.numeric(difftime(last_revert, first_revert, units="days")))),
list(bots)]
OK, I'm satisfied. Most single-page revert activity between bots that involves more than 2 edits is a fight. Let's look at how many reverts are accounted for.
In [16]:
length(unique(dt[page_namespace == 0,]$rev_id))
In [17]:
sum(page_bot_pairs[reverts > 2,]$reverts)
lol. So, maybe 1.4%.
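The percentage is the reverts that fall in page/bot-pair groups with more than 2 reverts, divided by all reverts. A minimal sketch of that arithmetic on made-up counts (the table below is hypothetical, and it uses the sum of per-pair counts as the denominator, whereas the notebook uses the number of unique `rev_id` values):

```r
library(data.table)

# Toy per-page bot-pair counts (hypothetical, not the real data)
toy_pairs <- data.table(
  bots    = c("BotA BotB", "BotA BotC", "BotB BotC", "BotC BotD"),
  reverts = c(5, 1, 2, 1)
)

total_reverts <- sum(toy_pairs$reverts)                 # denominator
fight_reverts <- sum(toy_pairs[reverts > 2]$reverts)    # reverts in groups with > 2 reverts
share_pct <- 100 * fight_reverts / total_reverts

print(share_pct)
```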
In [18]:
page_bot_pairs[reverts >= 12 & reverts < 25 & (bots == "Mathbot Yobot"),]