Regression analysis notebook for a study of the effect of harassment on newcomer retention in Wikipedia. See the research project page for an overview.
In [1]:
%matplotlib inline
import pandas as pd
from dateutil.relativedelta import relativedelta
import statsmodels.formula.api as sm
import requests
from io import StringIO
import math
Pick a harassment threshold from [0.01, 0.425, 0.75, 0.85]. WARNING: some results are very sensitive to the threshold! High thresholds result in harassment having a positive impact on t2 activity. Construct a sample that is the concatenation of a random sample and all users who received harassment in t1.
In [2]:
threshold = 0.425
In [3]:
# Features computed in ./Harassment and Newcomer Retention Data Munging.ipynb
df_random = pd.read_csv("../../data/retention/random_user_sample_features.csv")
df_attacked = pd.read_csv("../../data/retention/attacked_user_sample_features.csv")
In [4]:
# include all harassed newcomers in the sample
df_reg = pd.concat([df_random, df_attacked[df_attacked['m1_num_attack_received_%.3f' % threshold] > 0]])
df_reg = df_reg.drop_duplicates(subset = ['user_id'])
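The sample construction above can be checked on a toy example. This is only a sketch: the two small frames below are made-up stand-ins for the real feature files, and only the column-name pattern and the concat/drop_duplicates steps mirror the notebook.

```python
import pandas as pd

threshold = 0.425
# Column name pattern used in this notebook: threshold formatted to 3 decimals.
col = 'm1_num_attack_received_%.3f' % threshold

# Toy stand-ins for the two feature files (values are made up for illustration).
toy_random = pd.DataFrame({'user_id': [1, 2, 3], col: [0, 2, 0]})
toy_attacked = pd.DataFrame({'user_id': [2, 4, 5], col: [2, 1, 0]})

# Keep only attacked-sample users with at least one attack at this threshold,
# then append them to the random sample.
toy_reg = pd.concat([toy_random, toy_attacked[toy_attacked[col] > 0]])
# User 2 appears in both samples; drop_duplicates keeps the first occurrence.
toy_reg = toy_reg.drop_duplicates(subset=['user_id'])

sample_user_ids = sorted(toy_reg['user_id'].tolist())
print(col, sample_user_ids)
```

Note that the resulting sample over-represents harassed users relative to a purely random sample, which is the intent here but matters when reading raw counts.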
In [5]:
df_reg.shape
Out[5]:
In [6]:
df_reg['m1_harassment_received'] = (df_reg['m1_num_attack_received_%.3f' % threshold] > 0).apply(int)
df_reg['m1_harassment_made'] = (df_reg['m1_num_attack_made_%.3f' % threshold] > 0).apply(int)
In [7]:
df_reg['m1_harassment_received'].value_counts()
Out[7]:
In [8]:
df_reg.shape
Out[8]:
In [9]:
column_map = {
'm1_num_days_active': 'm1_days_active',
'm2_num_days_active' : 'm2_days_active',
'm1_harassment_received': 'm1_received_harassment',
'm1_harassment_made': 'm1_made_harassment',
'm1_fraction_ns0_deleted': 'm1_fraction_ns0_deleted',
'm1_fraction_ns0_reverted': 'm1_fraction_ns0_reverted',
'm1_num_warnings_recieved': 'm1_warnings',
}
df_reg = df_reg.rename(columns=column_map)
In [10]:
def regress(df, f, family = 'linear'):
    """Fit formula f on df and return the coefficient table."""
    if family == 'linear':
        results = sm.ols(formula=f, data=df).fit()
        return results.summary().tables[1]
    elif family == 'logistic':
        results = sm.logit(formula=f, data=df).fit(disp=0)
        return results.summary().tables[1]
    else:
        return

def get_latex_table(results, f, family = 'linear'):
    """
    Mess of a function for turning a statsmodels SimpleTable
    into a nice latex table string. The formula f is passed in
    explicitly and used as the table caption.
    """
    results = pd.read_csv(StringIO(results.as_csv()))
    # Map the raw statsmodels column headers to short names
    if family == 'linear':
        column_map = {
            results.columns[0]: "",
            ' coef ' : 'coef',
            'P>|t| ': "p-val",
            ' t ': "z-stat",
            ' [95.0% Conf. Int.]': "95% CI"
        }
    elif family == 'logistic':
        column_map = {
            results.columns[0]: "",
            ' coef ' : 'coef',
            'P>|z| ': "p-val",
            ' z ': "z-stat",
            ' [95.0% Conf. Int.]': "95% CI"
        }
    else:
        return
    results = results.rename(columns=column_map)
    results.index = results[""]
    del results[""]
    results = results[['coef', "z-stat", "p-val", "95% CI"]]
    results['coef'] = results['coef'].apply(lambda x: round(float(x), 2))
    results['z-stat'] = results['z-stat'].apply(lambda x: round(float(x), 1))
    results['p-val'] = results['p-val'].apply(lambda x: round(float(x), 3))
    results['95% CI'] = results['95% CI'].apply(reformat_ci)
    header = """
\\begin{table}[h]
\\begin{center}
"""
    footer = """
\\end{center}
\\caption{%s}
\\label{tab:}
\\end{table}
"""
    # Escape the formula so it can be used inside the LaTeX caption
    caption = f.replace("_", "\\_").replace("~", "\\texttildelow\\")
    latex = header + results.to_latex() + footer % caption
    print(latex)
    return results

def reformat_ci(s):
    """Turn a whitespace-separated 'low high' string into '[low, high]'."""
    ci = s.strip().split()
    ci = (round(float(ci[0]), 1), round(float(ci[1]), 1))
    return "[%.1f, %.1f]" % ci
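As a quick check of the confidence-interval helper, the sketch below restates the same logic as reformat_ci and applies it to a made-up cell value. The input string is hypothetical; it assumes the CI cell arrives as two whitespace-separated numbers, which is what the helper's split() implies.

```python
def reformat_ci(s):
    # Same logic as the helper above: split on whitespace, round to one decimal.
    low, high = (round(float(x), 1) for x in s.strip().split())
    return "[%.1f, %.1f]" % (low, high)

# Hypothetical cell contents, not taken from a real regression output.
example = reformat_ci("   -1.234    0.567 ")
print(example)
```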
In [11]:
f ="m2_days_active ~ m1_received_harassment"
regress(df_reg, f)
Out[11]:
In [12]:
f= "m2_days_active ~ m1_days_active + m1_received_harassment"
regress(df_reg, f)
Out[12]:
The first regression shows that newcomers who are harassed in m1 tend to be more active in m2, which taken at face value would suggest that harassment does not have a chilling effect on continued newcomer activity. However, this result is an artifact of harassed newcomers being more active in general. After controlling for the level of activity in m1, so that we compare users with comparable m1 activity, those who were harassed are significantly less active in m2.
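The sign flip described here can be reproduced on synthetic data. The sketch below uses made-up numbers and plain NumPy least squares rather than the notebook's statsmodels calls; the data-generating rule (activity carries over at 0.5 per day, harassment costs 2 days) is an assumption chosen purely for illustration.

```python
import numpy as np

# Made-up data: harassed users (h = 1) are far more active in m1.
m1 = np.array([2.0, 4.0, 6.0, 20.0, 24.0, 28.0])
h = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
# Assumed rule: activity carries over, harassment costs 2 days.
m2 = 0.5 * m1 - 2.0 * h

def ols(y, *columns):
    """Least-squares coefficients for y ~ 1 + columns (intercept first)."""
    X = np.column_stack([np.ones_like(y)] + list(columns))
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    return coef

naive = ols(m2, h)           # m2 ~ h
controlled = ols(m2, m1, h)  # m2 ~ m1 + h

print("naive harassment coef:      %.2f" % naive[1])
print("controlled harassment coef: %.2f" % controlled[2])
```

Without the m1 control, the harassment coefficient absorbs the higher baseline activity of harassed users and comes out positive; with the control it recovers the negative effect built into the toy data.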
In [13]:
f="m1_received_harassment ~ is_female"
regress(df_reg.query("has_gender == 1"), f, family = 'logistic')
Out[13]:
In [14]:
f="m2_days_active ~ m1_days_active + m1_received_harassment + m1_received_harassment : is_female"
regress(df_reg.query("has_gender == 1"), f)
Out[14]:
For our gender analysis, we reduce our sample to the set of users who reported a gender. First, we observe that newcomers who end up reporting a female gender are more likely to receive harassment in m1. To investigate whether the impact of receiving harassment differs across genders, we ran the same regression as in RQ1, but restricted to users who supplied a gender and with an added interaction term between gender and our measure of harassment in m1. Within this restricted sample, users who received harassment again show reduced activity in m2. The regression results for the interaction term between harassment and gender indicate that the impact is not significantly different for males and females.
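To make the role of the interaction term concrete, here is a minimal sketch on made-up data with the same formula shape (m1 activity, harassment, harassment : is_female). The built-in effect sizes are assumptions for illustration and say nothing about the actual estimates; the point is only that the interaction coefficient measures the extra effect of harassment for women on top of the baseline harassment coefficient.

```python
import numpy as np

# Made-up users: m1 days active, harassed (0/1), female (0/1).
m1 = np.array([2.0, 5.0, 8.0, 3.0, 6.0, 9.0, 4.0, 7.0])
h = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0])
female = np.array([0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0])
# Assumed rule: harassment costs 2 days for everyone, plus 1 extra day for women.
m2 = 0.5 * m1 - 2.0 * h - 1.0 * h * female

# Design matrix for m2 ~ 1 + m1 + h + h:female
X = np.column_stack([np.ones_like(m1), m1, h, h * female])
intercept, b_m1, b_h, b_h_female = np.linalg.lstsq(X, m2, rcond=None)[0]

print("harassment effect (men):   %.2f" % b_h)
print("extra effect for women:    %.2f" % b_h_female)
print("harassment effect (women): %.2f" % (b_h + b_h_female))
```

An interaction coefficient that is statistically indistinguishable from zero, as reported above, means the data do not show a different harassment effect across genders.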
In [15]:
f="m2_days_active ~ m1_days_active + m1_received_harassment + m1_received_harassment : m1_made_harassment + m1_received_harassment : m1_warnings"
regress(df_reg, f)
Out[15]:
A serious potential confound in our analyses is that the users who receive harassment may just be bad-faith newcomers or sock puppets: they get attacked for their misbehavior and reduce their activity in m2 because they get blocked or because they never intended to stick around past their own attacks. To reduce this confound, we control for whether the user harassed anyone in m1 and for whether they received a user warning of any type. The results show that even users who received harassment but did not harass anyone or receive a user warning show reduced activity in m2.
Halfak et al. examine how user warnings, deletions, and reverts correlate with newcomer retention. Here we add those features and see how they compare to our measure of harassment.
In [16]:
f = "m2_days_active ~ m1_days_active + m1_fraction_ns0_deleted + m1_fraction_ns0_reverted "
regress(df_reg.query("m1_num_ns0_edits > 0"), f)
Out[16]:
In [17]:
f = "m2_days_active ~ m1_days_active + m1_received_harassment + m1_warnings + m1_fraction_ns0_deleted + m1_fraction_ns0_reverted "
regress(df_reg.query("m1_num_ns0_edits > 0"), f)
Out[17]:
WIP: Receiving harassment is worse for a newcomer than receiving 11 warning messages or having all of their first month's work deleted or reverted.
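A comparison of this kind comes from dividing regression coefficients: the coefficient on the binary harassment indicator divided by the per-warning coefficient gives the number of warnings with an equivalent association. The sketch below shows the arithmetic with placeholder coefficients; the helper name and the numbers are hypothetical and are not the estimates from the regression above.

```python
def equivalent_count(coef_binary, coef_per_unit):
    """How many units of the per-unit predictor match one binary event."""
    return coef_binary / coef_per_unit

# Placeholder coefficients for illustration only, not the notebook's estimates.
coef_harassment = -3.0  # change in m2 days active if harassed in m1
coef_warning = -0.5     # change in m2 days active per warning received

n_warnings = equivalent_count(coef_harassment, coef_warning)
print(n_warnings)
```

For the deleted and reverted fractions, which range from 0 to 1, the coefficient itself is the change associated with having all ns0 edits deleted or reverted, so it can be compared with the harassment coefficient directly.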