In this notebook, we explore the prevalence and efficacy of methods for moderating personal attacks on Wikipedia.
There is an explicit policy on Wikipedia against making personal attacks. Any user who observes or experiences a personal attack can place a standardized warning message on the offending user's talk page via a set of templates. In these templates, users have the option to cite the page on which the attack occurred, but they generally do not cite the actual revision where the attack was introduced.
In addition to warnings, Wikipedia administrators have the ability to suspend the editing rights of users who violate the policy on personal attacks. This action is known as blocking a user. The duration and extent of the block are variable and at the discretion of the administrator. The interface admins use for blocking users requires providing a reason for the block. The reasons generally come from a small set of choices in a drop-down list and map to the different Wikipedia policies. We are only interested in blocks where the admin selected the "[WP:No personal attacks|Personal attacks] or [WP:Harassment|harassment]" reason. Note that there is a separate policy on Wikipedia against harassment, which encompasses behaviors such as legal threats and posting personal information; we do not address it in this study. Unfortunately, admins generally do not cite the page or revision the incident occurred on when blocking a user. Administrators tend to block users in response to personal attacks reported on the Administrators' Noticeboard for Incidents, but it is also not uncommon for them to block users in response to attacks they observe during their other activities.

Finally, administrators have the ability to delete the revision containing the attack from public view. Note that we only work with comments that have not been deleted.
To get data on warnings, we generated a dataset of all public user talk page diffs, identified the diffs that contained a warning, and parsed the information in the template. From the template we can extract the warning level and, when the warning user chose to cite it, the page on which the attack occurred.
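As an illustration (not the actual extraction pipeline), a warning diff could be detected and parsed roughly as follows; the `uw-npa` template family and the `diff_text` argument are assumptions here:

import re

# Hypothetical sketch: match {{subst:uw-npa2|Article name}}-style personal
# attack warning templates in the text added by a user talk page diff.
# The 'uw-npa' levels 1-4 and the optional cited page are assumptions.
NPA_TEMPLATE = re.compile(r'\{\{(?:subst:)?uw-npa(?P<level>[1-4])(?:\|(?P<page>[^|}]*))?')

def parse_warning(diff_text):
    m = NPA_TEMPLATE.search(diff_text)
    if m is None:
        return None
    return {'level': int(m.group('level')), 'cited_page': m.group('page')}

parse_warning('{{subst:uw-npa2|Some article}} ~~~~')
# -> {'level': 2, 'cited_page': 'Some article'}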
Data on block events comes from the public logging table. Each record provides the blocked user, whether they are anonymous, the timestamp of the block, and the reason the administrator selected.
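A minimal sketch of the filtering step, assuming the raw logging-table dump exposes MediaWiki's `log_type` and `log_comment` fields (the column names, file path, and exact reason string are assumptions; the actual loader lives in load_utils):

# Hypothetical sketch: keep only block events whose admin-supplied reason
# was the "personal attacks or harassment" drop-down option.
raw_log = pd.read_csv('logging.tsv', sep='\t')  # illustrative path
npa_blocks = raw_log[
    (raw_log['log_type'] == 'block')
    & raw_log['log_comment'].str.contains('personal attacks|harassment',
                                          case=False, na=False)
]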
We also ran our machine learning model for detecting personal attacks over two datasets of comments (diffs): all comments made by blocked users and all comments made in 2015.
This notebook attempts to provide insight into each of the following questions:

- What are the summary stats of warnings and blocks?
- How do blocked users behave after being blocked?
- What are the "politics" of blocking a user?
- How comprehensive is the moderation?
Note: there is an interesting interplay between the ground truth data and the ML scores. The ground truth comes from the Wikipedia community itself but may be incomplete; the ML scores are based on evaluations from people outside the community and are a bit noisy, but they are fully comprehensive.
In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from load_utils import *
In [2]:
# Load scored diffs and moderation event data
d = load_diffs()
df_block_events, df_blocked_user_text = load_block_events_and_users()
df_warn_events, df_warned_user_text = load_warn_events_and_users()
moderated_users = [('warned', df_warned_user_text), ('blocked', df_blocked_user_text)]
moderation_events = [('warned', df_warn_events), ('blocked', df_block_events)]
In [3]:
print('# block events: %s' % df_block_events.shape[0])
print('# warn events: %s' % df_warn_events.shape[0])
There have been 27,343 instances of a user being blocked and 36,520 instances of a user being warned for personal attacks in the history of Wikipedia. Warnings come in different levels, and most take the form "if you continue to attack users, you may be blocked", so we expect a large difference between the two counts. We will dig deeper into the relationship between warning and blocking later.
In [4]:
print('# block events')
print(df_block_events.groupby('anon').size().to_frame())
print()
print('# warn events')
print(df_warn_events.groupby('anon').size())
Almost half of all block events involved the blocking of an anonymous user, and half of warn events were addressed to anons. Later we will investigate whether anons or registered users are disproportionately blocked.
In [5]:
print('# blocked users')
print(df_block_events.groupby('anon').user_text.nunique())
print()
print('# warned users')
print(df_warn_events.groupby('anon').user_text.nunique())
Here we see that there are more moderation events than moderated users, since users can be blocked or warned multiple times.
In [6]:
print('fraction of blocked users with a public user talk comment')
print(d['blocked'].groupby('author_anon').user_text.nunique() / df_block_events.groupby('anon').user_text.nunique())
Only 55% of blocked users have a public comment in the user or user talk namespaces. This may be because users can get blocked for personal attacks made in the main namespace, and because the offending revisions can be deleted from public view.
In [7]:
df_block_events.assign(block_count = 1)\
    .groupby(['user_text', 'anon'], as_index = False)['block_count'].sum()\
    .groupby(['anon', 'block_count']).size()
Out[7]:
The vast majority of blocked users only get blocked once. Again, note that blocks are usually not indefinite but result in a temporary suspension of certain editing rights. Unsurprisingly, registered users are more likely to be reblocked than anons.
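The notebook does not examine block durations directly, but as a sketch of how one might, assuming the loader had kept the duration string recorded in each block log entry in a hypothetical `duration` column:

# Hypothetical sketch: block log entries record a duration string
# (e.g. '31 hours', '1 week', 'indefinite'). The 'duration' column is an
# assumption; df_block_events may not carry this field.
(df_block_events['duration'] == 'indefinite').mean()  # share of indefinite blocks
df_block_events['duration'].value_counts().head(10)   # most common durations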
In [8]:
df_block_events.year.value_counts().sort_index().plot(label = 'block')
df_warn_events.year.value_counts().sort_index().plot(label = 'warn')
plt.xlabel('year')
plt.ylabel('count')
plt.xlim((2004, 2015))
plt.legend()
Out[8]:
It looks like the warning template was introduced in 2007. The number of block events per year looks a lot like the graph of participation in Wikipedia per year, although there has been a slight uptick in blocks in recent years.
In [9]:
b = set(df_blocked_user_text['user_text'])
w = set(df_warned_user_text['user_text'])
len(w.intersection(b)) / len(w)
Out[9]:
Only 11% of warned users have been blocked. Let's also check what fraction of blocked users ever got warned. This is a bit tricky since our group of blocked users includes users who were blocked for harassment, which is quite different.
In [10]:
len(w.intersection(b)) / len(b)
Out[10]:
A first estimate is that 14% of blocked users have been warned. This is probably an underestimate because the set of blocked users contains harassers who may never have made a personal attack. We can also check what fraction of blocked users who made an attack that our algorithm detected have been warned.
In [11]:
a = set(d['blocked'].query('pred_recipient_score > 0.7')['user_text'])
len(a.intersection(b).intersection(w)) / len(a.intersection(b))
Out[11]:
31% of blocked editors whom our model confirmed made a personal attack were warned beforehand.
In conclusion, we see that warning and blocking are fairly different mechanisms: most people who get warned do not get blocked, and most people who get blocked were never warned. A clearer policy on when each mechanism applies would add transparency to the moderation process.
In [12]:
dfs = []
for k in range(1, 5):
    for event, data in moderation_events:
        # users moderated at least k times
        df_k = data.assign(blocked = 1)\
                   .groupby('user_text', as_index = False)['blocked'].sum()\
                   .query('blocked >= %f' % k)
        # users moderated at least k+1 times
        df_k1 = data.assign(blocked = 1)\
                    .groupby('user_text', as_index = False)['blocked'].sum()\
                    .query('blocked >= %f' % (k + 1))\
                    .assign(again = 1)
        df = df_k.merge(df_k1, how = 'left', on = 'user_text')\
                 .assign(again = lambda x: x.again.fillna(0))\
                 .assign(k = k, event = event)
        dfs.append(df)

sns.pointplot(x = 'k', y = 'again', hue = 'event', data = pd.concat(dfs))
plt.xlabel('k')
plt.ylabel('P(warned/blocked again | warned/blocked at least k times)')
Out[12]:
Every time a user gets blocked, the chance that they will get blocked again increases. This could be explained by the fact that there is some set of persistently toxic users/accounts, who keep coming back after being blocked, while the less persistent ones get discouraged by the blocks. It may also be that being blocked tends to lead to more toxic behavior.
The methodology for this is a bit involved: we rank each user's block events chronologically, collect the attacking comments (model score above the threshold) each user made between their k-th and (k+1)-th block, and then measure the fraction of those attackers who went on to receive a (k+1)-th block.
In [13]:
K = 6
sample = 'blocked'
threshold = 0.5
In [14]:
events = {}
# null events set
e = d[sample][['user_text']].drop_duplicates()
e['timestamp'] = pd.to_datetime('1900')
events[0] = e
# rank block events
ranked_events = df_block_events.copy()
ranks = df_block_events\
    .groupby('user_text')['timestamp']\
    .rank()
ranked_events['rank'] = ranks
for k in range(1, K):
    e = ranked_events.query("rank == %d" % k)[['user_text', 'timestamp']]
    events[k] = e
In [15]:
attacks = {}
for k in range(0, K - 1):
    # comments the user made after their k-th block event...
    c = d[sample].merge(events[k], how = 'inner', on = 'user_text')
    c = c.query('timestamp < rev_timestamp')
    del c['timestamp']
    # ...and before their (k+1)-th block event, if there was one
    c = c.merge(events[k+1], how = 'left', on = 'user_text')
    c['timestamp'] = c['timestamp'].fillna(pd.to_datetime('2100'))
    c = c.query('rev_timestamp < timestamp')
    # keep only the comments the model scores as attacks
    c = c.query('pred_recipient_score > %f' % threshold)
    attacks[k] = c
In [16]:
blocked_users = {i:set(events[i]['user_text']) for i in events.keys()}
attackers = {i:set(attacks[i]['user_text']) for i in attacks.keys()}
In [17]:
dfs_sns = []
for k in range(1, K - 1):
    u_a = attackers[k]        # users who attacked after their k-th block
    u_b = blocked_users[k+1]  # users who received a (k+1)-th block
    u_ab = u_a.intersection(u_b)
    n_a = len(u_a)
    n_ab = len(u_ab)
    print('k:', k, n_ab / n_a)
    dfs_sns.append(pd.DataFrame({'blocked': [1] * n_ab, 'k': [k] * n_ab}))
    dfs_sns.append(pd.DataFrame({'blocked': [0] * (n_a - n_ab), 'k': [k] * (n_a - n_ab)}))
In [18]:
sns.pointplot(x = 'k', y = 'blocked', data = pd.concat(dfs_sns))
plt.xlabel('k')
plt.ylabel('P(blocked | attacked and blocked k times already)')
Out[18]:
The probability of being blocked after making a personal attack increases as a function of how many times the user has been blocked before. This could indicate heightened scrutiny of previously blocked users by administrators. The pattern could also arise if users who continue to attack after being blocked make more frequent or more toxic attacks and are hence more likely to be discovered.
In [19]:
dfs = []
step = 0.2
ts = np.arange(0.4, 0.81, step)
moderated_users = [('warn', df_warned_user_text), ('blocked', df_blocked_user_text)]
for t in ts:
    for (event_type, users) in moderated_users:
        dfs.append(
            d['2015'].query('pred_recipient_score >= %f and pred_recipient_score <= %f' % (t, t + step))
                [['user_text', 'author_anon']]
                .drop_duplicates()
                .merge(users, how = 'left', on = 'user_text')
                .assign(blocked = lambda x: x.blocked.fillna(0),
                        threshold = t, event = event_type)
        )
df = pd.concat(dfs)
df['author_anon']=df['author_anon'].astype(str)
In [20]:
g = sns.factorplot(x="threshold", y="blocked", hue="author_anon", col="event", data=df, hue_order=["False", "True"])
Anons are less likely to be blocked or warned! That is not what I expected.
Methodology:
Consider all registered editors who made a comment in 2015. Select those who made a comment with an attack score of 0.5 or higher. Count how many days they were active before Jan 1, 2015, and compute block probability as a function of the number of active days.
In [21]:
attackers = d['2015'].query('not author_anon and pred_recipient_score > 0.5')[['user_text']].drop_duplicates()
In [22]:
# get days active
d_tenure = pd.read_csv('../../data/long_term_users.tsv', sep = '\t')
d_tenure.columns = ['user_text', 'n']
attackers = attackers.merge(d_tenure, how = 'left', on = 'user_text')
attackers['n'] = attackers['n'].fillna(0)
In [23]:
# bin days active
thresholds = np.percentile(attackers['n'], np.arange(0, 100.01, 10))
thresholds = sorted(set(thresholds.astype(int)))
bins = []
for i in range(len(thresholds) - 1):
    label = '%d-%d' % (thresholds[i], thresholds[i+1] - 1)
    rnge = range(thresholds[i], thresholds[i+1])
    bins.append((label, rnge))
# override the percentile-based bins with fixed, more interpretable ones
bins = [('<7', range(0, 8)), ('8-365', range(8, 366)), ('365<', range(366, 4500))]

def map_count(x):
    for label, rnge in bins:
        if x in rnge:
            return label

attackers['binned_n'] = attackers['n'].apply(map_count)
In [24]:
# get if blocked
blocked_users_2015 = df_block_events.query("timestamp > '2014-12-31'")[['user_text']].drop_duplicates()
blocked_users_2015['blocked'] = 1
attackers = attackers.merge(blocked_users_2015, how='left', on='user_text')
attackers['blocked'] = attackers['blocked'].fillna(0)
In [25]:
#plot
o = [e[0] for e in bins]
sns.pointplot(x='binned_n', y = 'blocked', data=attackers, order = o)
plt.ylabel('P(blocked | n days active prior to 2015)')
plt.xlabel('n days active prior to 2015')
Out[25]:
New editors are the least likely to be blocked for attacks. Editors with 8-365 active days are the most likely to be blocked for attacks. Experienced editors are less likely than editors with medium experience to be blocked. Note that although the CIs overlap, all differences are significant at alpha = 0.05.
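The significance test itself is not shown in this notebook; as a sketch, one pairwise comparison (here the '<7' vs. '8-365' bins) could be checked with a two-proportion z-test from statsmodels:

from statsmodels.stats.proportion import proportions_ztest

# Sketch of a single pairwise test between two tenure bins, using the
# 'attackers' frame from the cells above: compare the number of blocked
# attackers against the number of attackers in each bin.
counts = attackers.groupby('binned_n')['blocked'].agg(['sum', 'size'])
count = counts.loc[['<7', '8-365'], 'sum'].values   # blocked attackers per bin
nobs = counts.loc[['<7', '8-365'], 'size'].values   # attackers per bin
stat, pval = proportions_ztest(count, nobs)
print('z = %.2f, p = %.4f' % (stat, pval))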
In [26]:
dfs = []
ts = np.arange(0.5, 0.91, 0.1)
moderated_users = [('warn', df_warned_user_text), ('blocked', df_blocked_user_text)]
for t in ts:
    for (event_type, users) in moderated_users:
        dfs.append(
            d['2015'].query('pred_recipient_score >= %f' % t)[['user_text', 'author_anon']]
                .drop_duplicates()
                .merge(users, how = 'left', on = 'user_text')
                .assign(blocked = lambda x: x.blocked.fillna(0),
                        threshold = t, event = event_type)
        )
df = pd.concat(dfs)
sns.pointplot(x = 'threshold', y = 'blocked', hue='event', data = df)
plt.ylabel('fraction of attacking users warned/blocked')
plt.savefig('../../paper/figs/fraction_of_attacking_users_warned_and_blocked.png')
Most users who have made at least one attack have never been warned or blocked.
In [27]:
dfs = []
ts = np.arange(0.5, 0.91, 0.1)
moderation_events = [('warn', df_warn_events), ('blocked', df_block_events)]
def get_delta(x):
    # time elapsed between the attacking comment and the moderation event
    if x['timestamp'] is not None and x['rev_timestamp'] is not None:
        return x['timestamp'] - x['rev_timestamp']
    else:
        return pd.Timedelta('0 seconds')

for t in ts:
    for (event_type, events) in moderation_events:
        dfs.append(
            d['2015'].query('pred_recipient_score >= %f' % t)
                .loc[:, ['user_text', 'rev_id', 'rev_timestamp']]
                .merge(events, how = 'left', on = 'user_text')
                .assign(delta = lambda x: get_delta(x))
                # a comment counts as moderated if an event followed within 7 days
                .assign(blocked = lambda x: (x['delta'] < pd.Timedelta('7 days')) & (x['delta'] > pd.Timedelta('0 seconds')))
                .drop_duplicates(subset = ['rev_id'])
                .assign(threshold = t, event = event_type)
        )
ax = sns.pointplot(x='threshold', y='blocked', hue='event', data = pd.concat(dfs))
plt.xlabel('threshold')
plt.ylabel('fraction of attacking comments followed by a moderation event')
Out[27]:
Most attacking comments do not lead to the user being warned/blocked within the next 7 days.
In [28]:
dfs = []
for t in ts:
    for event_type, users in moderated_users:
        dfs.append(
            d['2015'].query('pred_recipient_score >= %f' % t)
                .merge(users, how = 'left', on = 'user_text')
                .assign(blocked = lambda x: x.blocked.fillna(0),
                        threshold = t, event = event_type)
        )
df = pd.concat(dfs)
sns.pointplot(x = 'threshold', y = 'blocked', hue = 'event', data = df)
plt.ylabel('fraction of attacking comments from warned/blocked users')
Out[28]:
Most attacks come from users who have been warned or blocked for harassment at some point.
In [29]:
def remap(x):
    # bucket attack counts: 0-4 stay as-is, then '5-10' and '10+'
    if x < 5:
        return str(int(x))
    if x < 10:
        return '5-10'
    else:
        return '10+'

t = 0.5
d_temp = d['2015'].assign(attack = lambda x: x.pred_recipient_score >= t)\
    .groupby('user_text', as_index = False)['attack'].sum()\
    .rename(columns = {'attack': 'num_attacks'})\
    .merge(df_blocked_user_text, how = 'left', on = 'user_text')\
    .assign(
        blocked = lambda x: x.blocked.fillna(0),
        num_attacks = lambda x: x.num_attacks.apply(remap),
        threshold = t)
ax = sns.pointplot(x='num_attacks', y= 'blocked', data=d_temp, hue = 'threshold', order = ('0', '1', '2', '3', '4', '5-10', '10+'))
plt.ylabel('fraction blocked')
Out[29]:
The more attacks a user makes, the more likely it is that they will have been blocked at least once.