In [1]:
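library(data.table)
library(ggplot2)
# Load the bot-bot revert dataset; quote and comment parsing are disabled so fields containing quotes or '#' are read verbatim
dt = data.table(read.table("../../datasets/reverted_bot2bot/enwiki_20170420.tsv.bz2", sep="\t", header=TRUE, quote="",
                           comment.char=""))
# Timestamps of the reverted revisions: parse the YYYYMMDDHHMMSS values, then derive the month start and the day
dt$rev_ts = as.POSIXct(format(dt$rev_timestamp, scientific=FALSE), format="%Y%m%d%H%M%S")
dt$rev_month = as.Date(paste(format(dt$rev_ts, "%Y-%m-"), "01", sep=""))
dt$rev_day = as.Date(dt$rev_ts)
# Same derivation for the timestamps of the reverting revisions
dt$reverting_ts = as.POSIXct(format(dt$reverting_timestamp, scientific=FALSE), format="%Y%m%d%H%M%S")
dt$reverting_month = as.Date(paste(format(dt$reverting_ts, "%Y-%m-"), "01", sep=""))
dt$reverting_day = as.Date(dt$reverting_ts)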
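As a small illustration of the timestamp handling in the cell above, the following sketch parses a single made-up timestamp. The value `ts_raw` is hypothetical, and the explicit `tz="UTC"` is an assumption added here so the result does not depend on the local timezone; the original cell does not set a timezone.

```r
# Hypothetical timestamp in YYYYMMDDHHMMSS form, stored as a number
ts_raw = 20170420123456

# scientific=FALSE keeps all 14 digits instead of switching to exponent notation
ts_chr = format(ts_raw, scientific=FALSE)

# Assumption: interpret the timestamp as UTC (not specified in the notebook)
ts = as.POSIXct(ts_chr, format="%Y%m%d%H%M%S", tz="UTC")

# First day of the month and calendar day, derived the same way as above
month_start = as.Date(paste(format(ts, "%Y-%m-"), "01", sep=""))
day = as.Date(ts)

print(ts_chr)
print(month_start)
print(day)
```

Pinning the timezone may matter for reverts close to midnight, where the derived day could otherwise differ between machines.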
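Build the page_reverts dataframe, restricted to the article namespace (page_namespace == 0) and grouped by page id (rev_page), finding for each page: the number of reverts, the number of distinct bots involved, the number of distinct reverting and reverted bots, and the times of the first and last revert.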
In [2]:
dt.by_page = setkey(dt, rev_page, rev_user_text, reverting_user_text)
page_reverts = dt[
    page_namespace == 0,
    list(reverts=length(rev_id),
         # as.character() guards against factor columns, whose integer codes would otherwise be combined by c()
         bots_involved=length(unique(c(as.character(rev_user_text), as.character(reverting_user_text)))),
         reverting_bots=length(unique(reverting_user_text)),
         reverted_bots=length(unique(rev_user_text)),
         first_revert=min(reverting_ts),
         last_revert=max(reverting_ts)),
    list(rev_page)]
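To make the grouping above easier to follow, here is a minimal sketch on a toy table. The table `toy` and its values are invented for illustration; `.N` and `uniqueN()` are data.table shorthands for the `length(...)` and `length(unique(...))` calls used in the cell above.

```r
library(data.table)

# Hypothetical toy data: three reverts on page 1, one revert on page 2
toy = data.table(
    rev_page = c(1, 1, 1, 2),
    rev_id = c(10, 11, 12, 20),
    rev_user_text = c("BotA", "BotB", "BotA", "BotC"),
    reverting_user_text = c("BotB", "BotA", "BotB", "BotA"))

# One row per page: number of reverts and number of distinct bots on either side
toy_reverts = toy[,
    list(reverts = .N,
         bots_involved = uniqueN(c(rev_user_text, reverting_user_text))),
    by = rev_page]

print(toy_reverts)
```

Page 1 ends up with three reverts between two bots, and page 2 with a single revert between two bots.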
In [3]:
page_reverts[order(page_reverts$reverts, decreasing=T),][1:10,]
In [4]:
ggplot(page_reverts, aes(x=reverts)) +
geom_density(adjust=10) +
scale_x_log10()
KDE histogram of number of bot-bot reverts per page, filtered to >= 5 reverts per page
In [5]:
ggplot(page_reverts[reverts >= 5,], aes(x=reverts)) +
geom_density(adjust=3) +
scale_x_log10()
KDE histogram of number of bot-bot reverts per page, filtered to >= 10 reverts per page
In [6]:
ggplot(page_reverts[reverts >= 10,], aes(x=reverts)) +
geom_density(adjust=2) +
scale_x_log10()
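The three density plots above differ in their `adjust` value as well as in their filter. As a rough sketch of what that parameter does: `adjust` multiplies the default kernel bandwidth, so larger values give smoother curves. The example uses base R's `density()` on made-up counts rather than the ggplot2 layer, on the assumption that the layer's default settings behave the same way.

```r
# Hypothetical per-page revert counts, log10-transformed to mimic scale_x_log10()
x = log10(c(1, 1, 2, 2, 3, 5, 8, 13, 21, 41))

d_default = density(x)             # default bandwidth (bw.nrd0)
d_smooth = density(x, adjust=10)   # bandwidth multiplied by 10

# The ratio of the two bandwidths equals the adjust factor
print(d_smooth$bw / d_default$bw)
```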
In [7]:
page_reverts[reverts >= 10,][1:10,]
Find top bot-bot revert pairs per page
In [8]:
page_bot_pairs = dt[
    page_namespace == 0,
    list(reverts=length(unique(rev_id)),
         first_revert=min(reverting_ts),
         last_revert=max(reverting_ts)),
    # Group by page and by the unordered pair of bots (names sorted so A-B and B-A share a key)
    list(rev_page, bots=paste(pmin(as.character(reverting_user_text), as.character(rev_user_text)),
                              pmax(as.character(reverting_user_text), as.character(rev_user_text))))]
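The `pmin`/`pmax` combination in the cell above builds an order-independent key for each pair of bots. A minimal sketch with made-up vectors (the names `reverting`, `reverted` and `pair_key` are illustrative only):

```r
# Hypothetical revert records: the first two rows are the same pair in opposite roles
reverting = c("FrescoBot", "Mathbot", "AnomieBOT")
reverted = c("Mathbot", "FrescoBot", "Cyberbot II")

# Elementwise min and max on character vectors sort each pair alphabetically,
# so the key does not depend on which bot did the reverting
pair_key = paste(pmin(reverting, reverted), pmax(reverting, reverted))

print(pair_key)
```

Since some bot names contain spaces, a separator that is unlikely to appear in usernames (passed through the `sep` argument of `paste`) may make the keys easier to split again later.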
In [9]:
page_bot_pairs[order(page_bot_pairs$reverts, decreasing=T),][1:10,]
This is a gold mine!
The longest single-page mutual bot-on-bot revert sequence lasted 41 reverts and it continued over the course of 2 and a half years. It happened on "List of Mathematicians (X)" between Mathbot and FrescoBot. Mathbot updates the lists of mathematicians based on categorizations in Wikipedia. FrescoBot fixes link syntax. When the target of the link and the label are the same, it simplifies the link. Like clockwork, FrescoBot writes out a link of the structure [[
Actually, it's a tie, and honestly, this second case might be more interesting. AnomieBOT and Cyberbot II also had an 82 revert sequence on a single page, but it lasted for 41 reverts over the course of only 4 days! On the article about "Foreign relations of the Central African Republic", AnomieBOT claimed to be "rescuing orphaned refs" -- adding a reference to a dead link by using the Internet Archive to provide a copy of the old referenced PDF titled "International Criminal Court: Background – Situation in the Central African Republic". Every time that AnomieBOT "rescued" the link, Cyberbot II swung by and removed the reference with the confusing comment "Rescuing 1 sources".
This case is arguably worse than FrescoBot and Mathbot because it spanned many pages. The bots had similar fights on the biography of the songwriter Rico Love (35 reverts), the broadcaster Dougie Vipond (31 reverts), and the song "Seasons Change" (31 reverts). The list keeps going. All told, these bots reverted each other 396 times on 15 pages -- constantly adding links to Internet Archive pages and then removing them again.
Visualize KDE distribution of AnomieBOT and Cyberbot II's reverts over time
In [10]:
# Density of reverts over time, restricted to reverts between the two bots
ggplot(dt[reverting_user_text %in% c("AnomieBOT", "Cyberbot II") &
          rev_user_text %in% c("AnomieBOT", "Cyberbot II"),],
       aes(x=reverting_ts)) +
    geom_density()
# Total number of reverts and pages involved for this pair
dt[reverting_user_text %in% c("AnomieBOT", "Cyberbot II") &
   rev_user_text %in% c("AnomieBOT", "Cyberbot II"),
   list(n=length(unique(rev_id)), pages=length(unique(rev_page)))]
In [11]:
ggplot(page_bot_pairs, aes(x=reverts)) +
geom_density()
In [12]:
ggplot(page_bot_pairs[reverts > 2,], aes(x=reverts)) +
geom_density()
There's definitely some shape to this. Even beyond roughly 20 reverts, there is still some density. Let's look at 3-10, 10-20, 20-30, and 30-40.
In [13]:
page_bot_pairs[reverts > 2 & reverts < 10,][1:10,]
This is mostly fighting, but some of the fights are less obvious given so few interactions.
In [14]:
page_bot_pairs[reverts >= 10 & reverts < 20,][1:10,]
OK, I'm satisfied. Most single-page revert activity between bots that involves more than 2 edits is a fight. Let's look at how many reverts are accounted for.
In [15]:
length(unique(dt$rev_id))
In [16]:
sum(page_bot_pairs[reverts >= 2,]$reverts)
lol. So, maybe