Harassment and Newcomer Retention (Paper)

Regression analysis notebook for a study of the effect of harassment on newcomer retention in Wikipedia. See the research project page for an overview.



In [1]:

    
%matplotlib inline
import pandas as pd
from dateutil.relativedelta import relativedelta
import statsmodels.formula.api as sm
import requests
from io import StringIO
import math

Load Data and take sample

Pick a harassment threshold from [0.01, 0.425, 0.75, 0.85]. WARNING: the results are very sensitive to the threshold! High thresholds result in harassment having a positive impact on t2 activity. Construct a sample that is the concatenation of a random sample and all users who received harassment in t1.



In [2]:

    
threshold = 0.425
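
Because the results are sensitive to the threshold, it can help to see which feature columns each candidate threshold selects. The sketch below only reproduces the column-name formatting used in this notebook. The helper name `attack_columns` is hypothetical, and it is an assumption that a column exists in the CSV files for every listed threshold.

```python
# Candidate thresholds listed in the note above.
thresholds = [0.01, 0.425, 0.75, 0.85]

def attack_columns(threshold):
    """Return the (received, made) feature column names for a threshold.

    Hypothetical helper: it mirrors the '%.3f' formatting used in this notebook.
    """
    received = 'm1_num_attack_received_%.3f' % threshold
    made = 'm1_num_attack_made_%.3f' % threshold
    return received, made

for t in thresholds:
    print(t, attack_columns(t))
```

Note that the `%.3f` format pads short thresholds with zeros, so 0.01 maps to a column name ending in `0.010`.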



In [3]:

    
# Features computed in ./Harassment and Newcomer Retention Data Munging.ipynb
df_random = pd.read_csv("../../data/retention/random_user_sample_features.csv")
df_attacked = pd.read_csv("../../data/retention/attacked_user_sample_features.csv")



In [4]:

    
# include all harassed newcomers in the sample
df_reg = pd.concat([df_random, df_attacked[df_attacked['m1_num_attack_received_%.3f' % threshold] > 0]])
df_reg = df_reg.drop_duplicates(subset = ['user_id'])
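
The sample construction above can be illustrated on toy data. This is a minimal sketch with made-up values; the frames `df_random_toy` and `df_attacked_toy` are hypothetical stand-ins for the two feature files, not the real data.

```python
import pandas as pd

# Toy stand-ins for the two feature files (values are made up).
col = 'm1_num_attack_received_0.425'
df_random_toy = pd.DataFrame({'user_id': [1, 2, 3], col: [0, 1, 0]})
df_attacked_toy = pd.DataFrame({'user_id': [2, 4, 5], col: [1, 3, 0]})

# Keep only users attacked at this threshold, then append them to the random sample.
harassed = df_attacked_toy[df_attacked_toy[col] > 0]
df_toy = pd.concat([df_random_toy, harassed])

# User 2 appears in both frames; drop_duplicates keeps the first occurrence.
df_toy = df_toy.drop_duplicates(subset=['user_id'])
print(df_toy['user_id'].tolist())
```

Because every harassed user is added, harassed users are over-represented relative to a purely random sample, which matters when reading raw proportions off the combined frame.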



In [5]:

    
df_reg.shape









    Out[5]:





(105492, 40)



In [6]:

    
df_reg['m1_harassment_received'] = (df_reg['m1_num_attack_received_%.3f' % threshold] > 0).apply(int)
df_reg['m1_harassment_made'] = (df_reg['m1_num_attack_made_%.3f' % threshold] > 0).apply(int)



In [7]:

    
df_reg['m1_harassment_received'].value_counts()









    Out[7]:





0    99943
1     5549
Name: m1_harassment_received, dtype: int64



In [8]:

    
df_reg.shape









    Out[8]:





(105492, 42)



In [9]:

    
column_map = {
        'm1_num_days_active': 'm1_days_active',
        'm2_num_days_active' : 'm2_days_active',
        'm1_harassment_received': 'm1_received_harassment',
        'm1_harassment_made': 'm1_made_harassment',
        'm1_fraction_ns0_deleted': 'm1_fraction_ns0_deleted',
        'm1_fraction_ns0_reverted': 'm1_fraction_ns0_reverted',
        'm1_num_warnings_recieved': 'm1_warnings',
        }
        
df_reg = df_reg.rename(columns=column_map)

Regression Analysis



In [10]:

    
def regress(df, f, family = 'linear'):
    if family == 'linear':
        results = sm.ols(formula=f, data=df).fit()
        return results.summary().tables[1]

    elif family == 'logistic':
        results = sm.logit(formula=f, data=df).fit(disp=0)
        return results.summary().tables[1]
    else:
        return
    

def get_latex_table(results, f, family = 'linear'):
    """
    Mess of a function for turning a statsmodels SimpleTable
    into a nice latex table string. The regression formula f
    is used as the table caption.
    """
    
    results = pd.read_csv(StringIO(results.as_csv()))
    
    if family == 'linear':
        # OLS reports a t statistic; it is shown under the shared "z-stat" header
        column_map = {
            results.columns[0]: "",
            '   coef   ' : 'coef',
            'P>|t| ': "p-val",
            '    t    ': "z-stat",
            ' [95.0% Conf. Int.]': "95% CI"
        }

    elif family == 'logistic':
        column_map = {
            results.columns[0]: "",
            '   coef   ' : 'coef',
            'P>|z| ': "p-val",
            '    z    ': "z-stat",
            ' [95.0% Conf. Int.]': "95% CI"
        }
    else:
        return
        
        
    results = results.rename(columns=column_map)
    results.index = results[""]
    del results[""]
    results = results[['coef', "z-stat", "p-val", "95% CI"]]
    results['coef'] = results['coef'].apply(lambda x: round(float(x), 2))
    results['z-stat'] = results['z-stat'].apply(lambda x: round(float(x), 1))
    results['p-val'] = results['p-val'].apply(lambda x: round(float(x), 3))
    results['95% CI'] = results['95% CI'].apply(reformat_ci)
    header = """
\\begin{table}[h]
\\begin{center}
    """
    footer = """
\\end{center}
\\caption{%s}
\\label{tab:}
\\end{table}
    """
    # escape the characters in the formula that are special in LaTeX
    f = f.replace("_", "\\_").replace("~", "\\texttildelow\\")
    latex = header + results.to_latex() + footer % f
    print(latex)
    return results
        
    
def reformat_ci(s):
    ci = s.strip().split()
    ci = (round(float(ci[0]), 1), round(float(ci[1]), 1))
    return "[%.1f, %.1f]" % ci
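
As a quick check of how `reformat_ci` treats confidence-interval strings, the helper is repeated below so the snippet is self-contained. The two input strings are intervals that appear in the regression outputs of this notebook.

```python
def reformat_ci(s):
    # Same logic as the helper above: split on whitespace, round to one decimal.
    ci = s.strip().split()
    ci = (round(float(ci[0]), 1), round(float(ci[1]), 1))
    return "[%.1f, %.1f]" % ci

print(reformat_ci("     0.192     0.218"))
print(reformat_ci("    -1.393    -0.016"))
```

Rounding to one decimal is compact, but it can hide whether an interval excludes zero: the upper bound -0.016 is rendered as -0.0.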

RQ1: Do newcomers in general show reduced activity after experiencing harassment?



In [11]:

    
f ="m2_days_active ~ m1_received_harassment"
regress(df_reg, f)









    Out[11]:






                            coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept                   0.2052      0.007     31.109   0.000      0.192     0.218


  m1_received_harassment      2.8905      0.029    100.519   0.000      2.834     2.947



In [12]:

    
f= "m2_days_active ~ m1_days_active + m1_received_harassment"
regress(df_reg, f)









    Out[12]:






                            coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept                  -0.7172      0.005   -131.834   0.000     -0.728    -0.706


  m1_days_active              0.5945      0.002    326.691   0.000      0.591     0.598


  m1_received_harassment     -0.3474      0.023    -15.395   0.000     -0.392    -0.303

The first regression shows that newcomers who are harassed in m1 tend to be more active in m2, which taken at face value would suggest that harassment does not have a chilling effect on continued newcomer activity. However, this result is an artifact of harassed newcomers being more active in general. After controlling for the level of activity in m1, we see that among users of comparable activity levels in m1, those who get harassed are significantly less active in m2.
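
The sign reversal between the two regressions can be reproduced on synthetic data. The sketch below is an illustration under made-up assumptions: the data-generating process, the coefficients 0.6 and -0.35, and the helper `ols` are all hypothetical, and it uses NumPy least squares rather than the statsmodels calls in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Hypothetical process: more active users are more likely to be harassed,
# and harassment itself lowers later activity.
m1_days = rng.poisson(3.0, size=n)
p_harassed = 1.0 / (1.0 + np.exp(-(-4.0 + 0.5 * m1_days)))
harassed = rng.binomial(1, p_harassed)
m2_days = 0.6 * m1_days - 0.35 * harassed + rng.normal(0.0, 1.0, size=n)

def ols(y, *columns):
    """Least-squares coefficients, intercept first (hypothetical helper)."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

naive = ols(m2_days, harassed)              # m2 ~ harassed
adjusted = ols(m2_days, m1_days, harassed)  # m2 ~ m1 + harassed

print("naive harassment coefficient:", naive[1])
print("adjusted harassment coefficient:", adjusted[2])
```

In this simulation the unadjusted coefficient is positive only because harassed users were more active to begin with; once m1 activity is included, the estimate recovers the negative effect built into the data.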

RQ2: Does a newcomer's gender affect how they behave after experiencing harassment?



In [13]:

    
f="m1_received_harassment ~ is_female"
regress(df_reg.query("has_gender == 1"), f, family = 'logistic')









    Out[13]:






               coef      std err       z       P>|z|  [95.0% Conf. Int.] 


  Intercept     -1.9697      0.055    -35.684   0.000     -2.078    -1.862


  is_female      0.3866      0.123      3.146   0.002      0.146     0.627



In [14]:

    
f="m2_days_active ~ m1_days_active + m1_received_harassment + m1_received_harassment : is_female"
regress(df_reg.query("has_gender == 1"), f)









    Out[14]:






                                      coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept                            -0.9389      0.060    -15.626   0.000     -1.057    -0.821


  m1_days_active                        0.7568      0.011     66.062   0.000      0.734     0.779


  m1_received_harassment               -0.8463      0.218     -3.888   0.000     -1.273    -0.420


  m1_received_harassment:is_female     -0.7046      0.351     -2.007   0.045     -1.393    -0.016

For our gender analysis, we reduce our sample to the set of users who reported a gender. First, we observe that newcomers who end up reporting a female gender are more likely to receive harassment in m1. To investigate whether the impact of receiving harassment differs across genders, we ran the same regression as in RQ1, but restricted our analysis to users who supplied a gender and added an interaction term between gender and our measure of harassment in m1. Within this restricted sample, we again see that users who received harassment have reduced activity in m2. The coefficient on the interaction term between harassment and gender is negative but only borderline significant (p = 0.045, with a confidence interval whose upper bound is close to zero), so we do not find strong evidence that the impact differs between males and females.
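
To make the logistic regression easier to read, its coefficients can be converted to an odds ratio and to fitted probabilities. The sketch below uses only the coefficients printed in Out[13]; the helper `sigmoid` and the variable names are introduced here for illustration.

```python
import math

# Coefficients copied from Out[13] (users with a reported gender).
intercept = -1.9697
beta_female = 0.3866

def sigmoid(x):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))

odds_ratio = math.exp(beta_female)
p_not_female = sigmoid(intercept)
p_female = sigmoid(intercept + beta_female)

print("odds ratio: %.2f" % odds_ratio)
print("P(harassed | is_female = 0): %.3f" % p_not_female)
print("P(harassed | is_female = 1): %.3f" % p_female)
```

These are in-sample fitted probabilities. Because the sample deliberately includes all harassed newcomers, they should not be read as population rates.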

RQ3: How do good faith newcomers behave after experiencing harassment?



In [15]:

    
f="m2_days_active ~ m1_days_active + m1_received_harassment +  m1_received_harassment : m1_made_harassment + m1_received_harassment : m1_warnings"
regress(df_reg, f)









    Out[15]:






                                               coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept                                     -0.7153      0.005   -131.042   0.000     -0.726    -0.705


  m1_days_active                                 0.5933      0.002    321.065   0.000      0.590     0.597


  m1_received_harassment                        -0.2668      0.025    -10.839   0.000     -0.315    -0.219


  m1_received_harassment:m1_made_harassment      0.2411      0.056      4.294   0.000      0.131     0.351


  m1_received_harassment:m1_warnings            -0.1599      0.011    -13.997   0.000     -0.182    -0.138

A serious potential confound in our analyses could be that the users who receive harassment are just bad faith newcomers or sock-puppets. They get attacked for their misbehavior and reduce their activity in m2 because they get blocked or because they never intended to stick around past their own attacks. To reduce this confound, we control for whether the user harassed anyone in m1 and for whether they received a user warning of any type. The results show that even users who received harassment but did not harass anyone or receive a user warning show reduced activity in m2.
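
The interaction terms can be read as adjustments to the harassment coefficient. The sketch below adds up the coefficients printed in Out[15] for a few user profiles; the helper `harassment_effect` is hypothetical and is only meant to show the arithmetic.

```python
# Coefficients copied from Out[15].
b_harassment = -0.2668
b_harassment_x_made = 0.2411
b_harassment_x_warnings = -0.1599

def harassment_effect(made_harassment, warnings):
    """Implied change in m2_days_active associated with receiving harassment.

    made_harassment is 0 or 1; warnings is the number of warnings received in m1.
    """
    return (b_harassment
            + b_harassment_x_made * made_harassment
            + b_harassment_x_warnings * warnings)

print(harassment_effect(0, 0))  # no attacks made, no warnings
print(harassment_effect(1, 0))  # also made an attack
print(harassment_effect(0, 2))  # received two warnings
```

The first profile corresponds to the good faith case discussed above: with both interaction terms at zero, the implied effect is the main harassment coefficient, which is negative.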

RQ4: How does experiencing harassment compare to previously studied barriers to newcomer socialization?

Halfaker et al. examine how user warnings, deletions, and reverts correlate with newcomer retention. Here we add those features and see how they compare to our measure of harassment.



In [16]:

    
f = "m2_days_active ~ m1_days_active +  m1_fraction_ns0_deleted + m1_fraction_ns0_reverted "
regress(df_reg.query("m1_num_ns0_edits > 0"), f)









    Out[16]:






                              coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept                    -0.7493      0.009    -80.887   0.000     -0.767    -0.731


  m1_days_active                0.5965      0.002    300.840   0.000      0.593     0.600


  m1_fraction_ns0_deleted      -0.0926      0.036     -2.543   0.011     -0.164    -0.021


  m1_fraction_ns0_reverted     -0.0579      0.015     -3.832   0.000     -0.088    -0.028



In [17]:

    
f = "m2_days_active ~ m1_days_active + m1_received_harassment + m1_warnings +  m1_fraction_ns0_deleted + m1_fraction_ns0_reverted "
regress(df_reg.query("m1_num_ns0_edits > 0"), f)









    Out[17]:






                              coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept                    -0.7628      0.009    -81.987   0.000     -0.781    -0.745


  m1_days_active                0.6130      0.002    269.827   0.000      0.609     0.617


  m1_received_harassment       -0.3963      0.031    -12.589   0.000     -0.458    -0.335


  m1_warnings                  -0.0872      0.007    -12.498   0.000     -0.101    -0.073


  m1_fraction_ns0_deleted      -0.0807      0.036     -2.223   0.026     -0.152    -0.010


  m1_fraction_ns0_reverted      0.0340      0.016      2.111   0.035      0.002     0.066

WIP: Receiving harassment is worse for a newcomer than receiving 11 warning messages or having all their first month's work deleted or reverted.
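
One way to make this comparison concrete is to express the harassment coefficient in units of the other coefficients. The sketch below uses only the coefficients printed in Out[17]; treating the linear coefficients as directly comparable is an assumption of this illustration. With these particular coefficients the ratio comes out at roughly 4.5 warnings, so the number of warnings quoted in the note may reflect a different run or threshold.

```python
# Coefficients copied from Out[17].
b_harassment = -0.3963
b_warning = -0.0872
b_fraction_deleted = -0.0807
b_fraction_reverted = 0.0340

# Number of warnings whose combined coefficient equals the harassment coefficient.
warnings_equivalent = b_harassment / b_warning

# Implied change in m2_days_active when the deleted or reverted fraction is 1.0.
all_deleted = b_fraction_deleted * 1.0
all_reverted = b_fraction_reverted * 1.0

print("warnings equivalent to harassment: %.1f" % warnings_equivalent)
print("all ns0 edits deleted: %.4f" % all_deleted)
print("all ns0 edits reverted: %.4f" % all_reverted)
```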
