This is a data analysis script for parsing comments, as described in section 7. It can be run entirely from the files in this GitHub repository. It loads datasets/parsed_dataframes/df_all_2016.pickle.xz
and creates the following files (compressing them in .xz format):
datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle
datasets/parsed_dataframes/possible_botfights.pickle
datasets/parsed_dataframes/possible_botfights.tsv
This entire notebook can be run from the beginning with Kernel -> Restart & Run All in the menu bar. On a laptop with a Core i5-2540M processor, the main analysis takes about 5 minutes, plus another 5 minutes to xz-compress the output files (if compressed versions do not already exist).
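For reference, here is a minimal sketch (not a cell in this notebook) of how the main output could be loaded downstream, assuming the compressed pickle has already been unpacked with unxz:

import pandas as pd

# Load the dataframe of parsed comments that this notebook produces
df = pd.read_pickle("datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle")
print(df['bottype'].value_counts().head())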
In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import glob
import pickle
import datetime
%matplotlib inline
In [2]:
start = datetime.datetime.now()
In [3]:
!ls ../../datasets/parsed_dataframes/*.pickle.xz
In [4]:
!unxz --keep --force ../../datasets/parsed_dataframes/df_all_2016.pickle.xz
In [5]:
with open("../../datasets/parsed_dataframes/df_all_2016.pickle", "rb") as f:
    df_all = pickle.load(f)
In [6]:
df_all.sample(2).transpose()
Out[6]:
There are two functions used to parse comments. comment_categorization()
runs first and applies a series of pattern matches to each comment. If no match is found, interwiki_confirm()
is called, which checks for language codes in certain patterns that indicate interwiki links. A short worked example follows the first definition below.
In [7]:
def comment_categorization(row):
    """
    Takes a row from a pandas dataframe or dict and returns a string with a
    kind of activity based on metadata. Used with df.apply(). Mostly parses
    comments, but makes some use of usernames too.
    """
    reverting_user = str(row['reverting_user_text'])
    reverted_user = str(row['rev_user_text'])
    langcode = str(row['language'])

    if reverting_user.find("HBC AIV") >= 0:
        return 'AIV helperbot'

    try:
        comment = str(row['reverting_comment'])
    except Exception:
        return 'other'

    # Normalize case and whitespace for case-insensitive matching
    comment_lower = comment.lower().strip()
    comment_lower = " ".join(comment_lower.split())

    if comment == 'nan':
        return "deleted revision"

    if reverting_user == 'Cyberbot II' and reverted_user == 'AnomieBOT' and comment.find("tagging/redirecting to OCC") >= 0:
        return 'botfight: Cyberbot II vs AnomieBOT date tagging'
    if reverting_user == 'AnomieBOT' and reverted_user == 'Cyberbot II' and comment.find("{{Deadlink}}") >= 0:
        return 'botfight: Cyberbot II vs AnomieBOT date tagging'
    if reverting_user == 'RussBot' and reverted_user == 'Cydebot':
        return 'botfight: Russbot vs Cydebot category renaming'
    if reverting_user == 'Cydebot' and reverted_user == 'RussBot':
        return 'botfight: Russbot vs Cydebot category renaming'
    elif comment.find("Undoing massive unnecessary addition of infoboxneeded by a (now blocked) bot") >= 0:
        return "botfight: infoboxneeded"
    elif comment_lower.find("commonsdelinker") >= 0 and reverting_user.find("CommonsDelinker") == -1:
        return "botfight: reverting CommonsDelinker"
    elif comment.find("Reverted edits by [[Special:Contributions/ImageRemovalBot") >= 0:
        return "botfight: 718bot vs ImageRemovalBot"
    elif comment_lower.find("double redirect") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("double-redirect") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("has been moved; it now redirects to") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("correction du redirect") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("redirect tagging") >= 0:
        return "redirect tagging/sorting"
    elif comment_lower.find("sorting redirect") >= 0:
        return "redirect tagging/sorting"
    elif comment_lower.find("redirecciones") >= 0 and comment_lower.find("categoría") >= 0:
        return "category redirect cleanup"
    elif comment_lower.find("change redirected category") >= 0:
        return "category redirect cleanup"
    elif comment_lower.find("redirected category") >= 0:
        return "category redirect cleanup"
    elif comment.find("[[User:Addbot|Bot:]] Adding ") >= 0:
        return "template tagging"
    elif comment_lower.find("interwiki") >= 0:
        return "interwiki link cleanup -- method1"
    elif comment_lower.find("langlinks") >= 0:
        return "interwiki link cleanup -- method1"
    elif comment_lower.find("iw-link") >= 0:
        return "interwiki link cleanup -- method1"
    elif comment_lower.find("changing category") >= 0:
        return "moving category"
    elif comment_lower.find("recat per") >= 0:
        return "moving category"
    elif comment_lower.find("moving category") >= 0:
        return "moving category"
    elif comment_lower.find("move category") >= 0:
        return "moving category"
    elif comment_lower.find("re-categorisation") >= 0:
        return "moving category"
    elif comment_lower.find("recatégorisation") >= 0:
        return "moving category"
    elif comment_lower.find("updating users status to") >= 0:
        return "user online status update"
    elif comment_lower.find("{{copy to wikimedia commons}} either because the file") >= 0:
        return "template cleanup"
    elif comment_lower.find("removing a protection template") >= 0:
        return "protection template cleanup"
    elif comment_lower.find("removing categorization template") >= 0:
        return "template cleanup"
    elif comment_lower.find("rm ibid template per") >= 0:
        return "template cleanup"
    elif comment_lower.find("page is not protected") >= 0:
        return "template cleanup"
    elif comment_lower.find("removing protection template") >= 0:
        return "template cleanup"
    elif comment_lower.find("correcting transcluded template per tfd") >= 0:
        return "template cleanup"
    elif comment_lower.find("removing orphan t") >= 0:
        return "template cleanup"
    elif comment_lower.find("non-applicable orphan") >= 0:
        return "template cleanup"
    elif comment_lower.find("plantilla") >= 0 and comment_lower.find("huérfano") >= 0:
        return "template cleanup"
    elif comment_lower.find("removed orphan t") >= 0:
        return "template cleanup"
    elif comment_lower.find("sandbox") >= 0:
        return "clearing sandbox"
    elif comment_lower.find("archiving") >= 0:
        return "archiving"
    elif comment_lower.find("duplicate on commons") >= 0:
        return "commons image migration"
    elif comment_lower.find("user:mathbot/changes to mathlists") >= 0:
        return "botfight: mathbot mathlist updates"
    elif reverting_user == 'MathBot' or reverted_user == 'MathBot':
        return "botfight: mathbot mathlist updates"
    elif comment_lower.find("link syntax") >= 0:
        return "fixing links"
    elif comment_lower.find("links syntax") >= 0:
        return "fixing links"
    elif comment_lower.find("no broken #section links left") >= 0:
        return "fixing links"
    elif comment_lower.find("removing redlinks") >= 0:
        return "fixing links"
    elif comment_lower.find("to wikidata") >= 0:
        return "interwiki link cleanup -- method1"
    elif comment.find("言語間") >= 0:
        return "interwiki link cleanup -- method1"
    elif comment_lower.find("interproyecto") >= 0:
        return "interwiki link cleanup -- method1"
    elif comment.find("语言链接") >= 0:
        return "interwiki link cleanup -- method1"
    elif comment.find("interling") >= 0:
        return "interwiki link cleanup -- method1"
    elif comment.find("interlang") >= 0:
        return "interwiki link cleanup -- method1"
    elif comment.find("双重重定向") >= 0 or comment.find("雙重重定向") >= 0:
        return "fixing double redirect"
    elif comment.find("二重リダイレクト") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("doppelten redirect") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("doppelte weiterleitung") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("redirectauflösung") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("doble redirección") >= 0 or comment_lower.find("redirección doble") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("redireccionamento duplo") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("duplo redirecionamento") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("suppression bandeau") >= 0:
        return "template cleanup"
    elif comment_lower.find("archiviert") >= 0:
        return "archiving"
    elif comment_lower.find("revert") >= 0:
        return "other w/ revert in comment"
    elif comment_lower.find("rv ") >= 0 or comment_lower.find("rv") == 0:
        return "other w/ revert in comment"
    elif comment_lower.find(" per ") >= 0:
        return "other w/ per justification"
    elif comment_lower.find(" según") >= 0:
        return "other w/ per justification"
    elif comment_lower.find("suite à discussion") >= 0:
        return "other w/ per justification"
    elif comment_lower.find("suite à conservation") >= 0:
        return "other w/ per justification"
    elif comment_lower.find("conforme pedido") >= 0:
        return "other w/ per justification"
    else:
        return interwiki_confirm(comment, langcode)
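As a quick illustration (not a cell from the original notebook), the categorizer can be exercised on a hand-built dict standing in for a dataframe row; the keys below are the columns the function actually reads, while the bot names and comment are made up. This row matches one of the early patterns, so the interwiki_confirm() fallback defined in the next cell is never reached:

# Hypothetical row: keys match the dataframe columns used above
sample_row = {'reverting_user_text': 'ExampleBot',  # made-up usernames
              'rev_user_text': 'OtherBot',
              'language': 'en',
              'reverting_comment': 'Robot: Fixing double redirect to [[Example]]'}

print(comment_categorization(sample_row))  # prints: fixing double redirect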
In [8]:
def interwiki_confirm(comment, langcode):
    """
    Takes a comment string and searches for language codes bordered on
    one side by a punctuation mark from [](){},: and on the other side
    by any punctuation mark or a space. The beginning and end of the
    comment string count as a space, not a punctuation mark.
    Does not recognize the current langcode.
    """
    import string

    with open("../../datasets/lang_codes.tsv", "r") as f:
        lang_codes = f.read().split("\n")
    lang_codes.pop()  # drop the blank '' that ends up at the end of the list
    lang_codes.remove(langcode)

    try:
        comment = str(comment)
        comment = comment.lower()
        comment = comment.replace(": ", ":")
        comment = " " + comment + " "  # pad start and end of string with non-punctuation
    except Exception:
        return 'other'

    for lang_code in lang_codes:
        lang_code_pos = comment.find(lang_code)
        lang_code_len = len(lang_code)
        char_before = " "
        char_after = " "
        if lang_code_pos >= 0:
            char_before = comment[lang_code_pos - 1]
            char_after = comment[lang_code_pos + lang_code_len]
        if char_before in string.punctuation and char_after in "[]{}(),:":
            return 'interwiki link cleanup -- method2'
        elif char_after in string.punctuation and char_before in "[]{}(),:":
            return 'interwiki link cleanup -- method2'
        elif char_before == " " and char_after in "[]{}(),:":
            return 'interwiki link cleanup -- method2'
        elif char_after == " " and char_before in "[]{}(),:":
            return 'interwiki link cleanup -- method2'
    return 'other'
Testing interwiki_confirm
In [9]:
tests_yes = ["Robot adding [[es:Test]]",
             "adding es:Test",
             "linking es, it, en",
             "modifying fr:",
             "modifying:zh",
             "modifying: ja"]

tests_no = ["test",
            "discuss policies on enwiki vs eswiki",
            "it is done",
            "per [[en:WP:AIV]]",
            "it's not its",
            "its not it's",
            "modifying it all",
            "modifying italy"]

print("Should return interwiki link cleanup -- method2")
for test in tests_yes:
    print("\t", interwiki_confirm(test, 'en'))

print("Should return other")
for test in tests_no:
    print("\t", interwiki_confirm(test, 'en'))
Apply categorization
In [10]:
%%time
df_all['bottype'] = df_all.apply(comment_categorization, axis=1)
In [11]:
def bottype_group(bottype):
    if bottype == "interwiki link cleanup -- method2":
        return "interwiki link cleanup -- method2"
    elif bottype == "interwiki link cleanup -- method1":
        return "interwiki link cleanup -- method1"
    elif bottype.find("botfight") >= 0:
        return 'botfight'
    elif bottype == 'other':
        return 'not classified'
    elif bottype == 'fixing double redirect':
        return 'fixing double redirect'
    elif bottype == 'protection template cleanup':
        return 'protection template cleanup'
    elif bottype.find("category") >= 0:
        return 'category work'
    elif bottype.find("template") >= 0:
        return 'template work'
    elif bottype == "other w/ revert in comment":
        return "other w/ revert in comment"
    else:
        return "other classified"
In [12]:
df_all['bottype_group'] = df_all['bottype'].apply(bottype_group)
Much of what we're interested in involves articles, which are in namespace 0.
In [13]:
df_all_ns0 = df_all[df_all['page_namespace']==0].copy()
In [14]:
type_counts = df_all_ns0['bottype'].value_counts().rename("count")
type_percent = df_all_ns0['bottype'].value_counts(normalize=True).rename("percent") * 100
type_percent = type_percent.round(2).astype(str) + "%"
pd.concat([type_counts, type_percent], axis=1)
Out[14]:
In [15]:
counts_dict = {}
for lang in df_all_ns0['language'].unique():
    df_lang_ns0 = df_all_ns0[df_all_ns0['language'] == lang]
    type_counts = df_lang_ns0['bottype'].value_counts().rename("count")
    type_percent = df_lang_ns0['bottype'].value_counts(normalize=True).rename("percent") * 100
    type_percent = type_percent.round(2).astype(str) + "%"
    counts_dict[lang] = pd.concat([type_counts, type_percent], axis=1)
In [16]:
df_all_ns0['language'].unique()
Out[16]:
In [17]:
counts_dict['en']
Out[17]:
In [18]:
counts_dict['ja']
Out[18]:
In [19]:
counts_dict['zh']
Out[19]:
In [20]:
counts_dict['de']
Out[20]:
In [21]:
counts_dict['fr']
Out[21]:
In [22]:
counts_dict['pt']
Out[22]:
In [23]:
counts_dict['es']
Out[23]:
In [24]:
gb_lang_bottype = df_all.query("page_namespace == 0").groupby(["language", "bottype_group"])
In [25]:
gb_lang_bottype['rev_id'].count().unstack()
Out[25]:
In [26]:
df_all[0:2].transpose()
Out[26]:
In [27]:
df_all.to_pickle("../../datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle")
In [28]:
!xz -9 -e --keep ../../datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle
In [29]:
df_all.to_csv("../../datasets/parsed_dataframes/df_all_comments_parsed_2016.tsv", sep="\t")
In [30]:
!xz -9 -e --keep ../../datasets/parsed_dataframes/df_all_comments_parsed_2016.tsv
In [31]:
end = datetime.datetime.now()
In [32]:
time_to_run = end - start
minutes = int(time_to_run.seconds/60)
seconds = time_to_run.seconds % 60
print("Total runtime: ", minutes, "minutes, ", seconds, "seconds")