In [1]:
library(data.table)
library(ggplot2)
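# Load the bot-bot revert dataset (bzip2-compressed TSV); quote and comment parsing are disabled
dt = data.table(read.table("../../datasets/reverted_bot2bot/enwiki_20170420.tsv.bz2", sep="\t", header=T, quote="",
                           comment.char=""))
# Timestamps are YYYYMMDDHHMMSS numbers in UTC (the MediaWiki convention); parsing them as UTC
# keeps the month and day columns consistent, since as.Date() on a POSIXct defaults to UTC
dt$rev_ts = as.POSIXct(format(dt$rev_timestamp, scientific=F), format="%Y%m%d%H%M%S", tz="UTC")
dt$rev_month = as.Date(paste(format(dt$rev_ts, "%Y-%m-"), "01", sep=""))
dt$rev_day = as.Date(dt$rev_ts)
# Same conversion for the reverting edit
dt$reverting_ts = as.POSIXct(format(dt$reverting_timestamp, scientific=F), format="%Y%m%d%H%M%S", tz="UTC")
dt$reverting_month = as.Date(paste(format(dt$reverting_ts, "%Y-%m-"), "01", sep=""))
dt$reverting_day = as.Date(dt$reverting_ts)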
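The timestamp conversion can be checked in isolation. The following is a minimal sketch on two invented values (not taken from the dataset), assuming the timestamps are 14-digit YYYYMMDDHHMMSS numbers in UTC. The names `raw_ts`, `parsed`, `month_start` and `day` are illustrative only.

```r
# Two made-up timestamps on either side of midnight, stored as numbers
raw_ts <- c(20130305235959, 20130306000001)

# format(..., scientific = FALSE) prevents output such as "2.01e+13"
ts_chr <- format(raw_ts, scientific = FALSE)

# Parse explicitly as UTC so the result does not depend on the local timezone
parsed <- as.POSIXct(ts_chr, format = "%Y%m%d%H%M%S", tz = "UTC")

# First day of the month, and the calendar day
month_start <- as.Date(format(parsed, "%Y-%m-01"))
day <- as.Date(parsed)  # as.Date() on a POSIXct uses UTC by default

print(data.frame(ts_chr, parsed, month_start, day))
```

If the timestamps were parsed in a local timezone instead, `as.Date()` could assign some late-evening edits to the following day, which is why the sketch fixes `tz = "UTC"`.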
What does this dataframe look like?
In [2]:
head(dt)
Which months had the most bot-bot reverts in articles (namespace 0)?
In [3]:
month_reverts = dt[page_namespace == 0,list(reverts=length(rev_id)), list(reverting_month)]
month_reverts[order(month_reverts$reverts, decreasing=T)][1:10,]
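The same group-and-rank pattern can be tried on a small table. This is a sketch on invented rows, using data.table's `.N` (rows per group) in place of `length(rev_id)`; `toy` and `toy_counts` are hypothetical names, and the values are not from the dataset.

```r
library(data.table)

# Invented rows standing in for dt
toy <- data.table(
  rev_id = 1:6,
  page_namespace = c(0, 0, 0, 1, 0, 0),
  reverting_month = as.Date(c("2013-03-01", "2013-03-01", "2013-02-01",
                              "2013-03-01", "2013-02-01", "2013-03-01"))
)

# Keep namespace 0 only, then count rows per month; .N is the group size
toy_counts <- toy[page_namespace == 0, .(reverts = .N), by = reverting_month]

# order(-reverts) sorts descending; head() returns at most 10 rows
# and does not pad with NA rows when there are fewer than 10 groups
print(head(toy_counts[order(-reverts)], 10))
```

One design note: indexing with `[1:10,]` returns rows of `NA` when the table has fewer than ten rows, so `head()` is the safer choice for small filtered subsets.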
In the "doom month" of March 2013, which bots made the most bot-bot reverts?
In [4]:
doom_reverts = dt[
    page_namespace == 0 & reverting_month == "2013-03-01",
    list(reverts=length(rev_id)),
    list(reverting_user_text)]
doom_reverts[order(doom_reverts$reverts, decreasing=T)][1:10,]
In this month, who is Addbot reverting?
In [5]:
addbot_reverts = dt[
    page_namespace == 0 & reverting_month == "2013-03-01" & reverting_user_text == "Addbot",
    list(reverteds=length(rev_id)),
    list(rev_user_text)]
addbot_reverts[order(addbot_reverts$reverteds, decreasing=T)][1:10,]
How many reverts did Addbot make per day between March 5 and March 24, 2013?
In [6]:
addbot_doom_month = dt[
    page_namespace == 0 & reverting_user_text == "Addbot" & reverting_ts >= "2013-03-05" & reverting_ts < "2013-03-25",
    list(reverts=length(rev_id)),
    list(day=as.Date(reverting_ts))]
ggplot(addbot_doom_month, aes(x=day, y=reverts)) +
    theme_bw() +
    geom_line()
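Grouping by day only produces rows for days that have at least one revert, so a line plot would draw straight across any missing days. A minimal sketch of filling those gaps with zeros follows, on invented counts; `toy_daily`, `all_days` and `filled` are hypothetical names.

```r
library(data.table)

# Invented daily counts with two days missing (March 7 and 8)
toy_daily <- data.table(
  day = as.Date(c("2013-03-05", "2013-03-06", "2013-03-09")),
  reverts = c(4L, 10L, 2L)
)

# Every calendar day in the range of interest
all_days <- data.table(day = seq(as.Date("2013-03-05"), as.Date("2013-03-09"), by = "day"))

# Join so that each calendar day gets a row; unmatched days have NA counts
filled <- toy_daily[all_days, on = "day"]

# Replace the NA counts with explicit zeros
filled[is.na(reverts), reverts := 0L]

print(filled)
```

Whether this matters for the period plotted above depends on whether Addbot reverted on every day in the range, which the sketch does not establish.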
For the same reverts, how long had each reverted edit been in place before Addbot reverted it?
In [7]:
addbot_time_to_revert = dt[
    page_namespace == 0 & reverting_user_text == "Addbot" & reverting_ts >= "2013-03-05" & reverting_ts < "2013-03-25",
    list(time_to_revert=as.numeric(difftime(reverting_ts, rev_ts, units="secs")))]
ggplot(addbot_time_to_revert, aes(x=time_to_revert)) +
    theme_bw() +
    geom_density() +
    scale_x_log10(breaks=c(60*60*24, 60*60*24*7, 60*60*24*30, 60*60*24*365, 60*60*24*365*5),
                  labels=c("day", "week", "month", "year", "5-year"))
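A density plot on a log axis can be complemented with a few summary numbers. The sketch below uses invented time-to-revert values in seconds, not results from the dataset; `toy_secs`, `day_secs` and `toy_days` are hypothetical names.

```r
# Invented time-to-revert values, in seconds (1 hour, 1 day, 1 week, 30 days, 365 days)
toy_secs <- c(3600, 86400, 604800, 2592000, 31536000)

# Seconds per day, the same unit used for the axis breaks above
day_secs <- 60 * 60 * 24
toy_days <- toy_secs / day_secs

# Median and quartiles, expressed in days
print(median(toy_days))
print(quantile(toy_days, probs = c(0.25, 0.5, 0.75)))

# Share of edits reverted within one week
print(mean(toy_secs <= 7 * day_secs))
```

On the real data the same calls could be applied to `addbot_time_to_revert$time_to_revert`. Note that a log scale cannot show zero or negative durations, so any such rows would be dropped from the plot with a warning.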