Section 5.2: Time to revert

This data analysis script produces the findings reported in this section of the paper. It can be run entirely from the files in this GitHub repository.

The entire notebook can be run from the beginning with Kernel -> Restart & Run All in the menu bar. It takes about 2 minutes to run on a laptop with a Core i5-2540M processor.



In [1]:

    
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import glob
import pickle
import datetime
%matplotlib inline



In [2]:

    
start = datetime.datetime.now()

Load data



In [3]:

    
!unxz -kf ../../datasets/parsed_dataframes/df_all_2016.pickle.xz



In [4]:

    
!ls ../../datasets/parsed_dataframes/*









    



../../datasets/parsed_dataframes/df_all_2016.pickle
../../datasets/parsed_dataframes/df_all_2016.pickle.xz
../../datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle.xz
../../datasets/parsed_dataframes/possible_botfights.pickle.xz
../../datasets/parsed_dataframes/possible_botfights.tsv.xz



In [5]:

    
with open("../../datasets/parsed_dataframes/df_all_2016.pickle", "rb") as f:
    df_all = pickle.load(f)
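
The decompress-then-load steps above can also be done without shelling out to unxz, which may help on systems where that tool is not installed. The following is a minimal sketch using Python's standard-library lzma module; load_xz_pickle is a hypothetical helper name, not something defined in this repository.

```python
import lzma
import pickle


def load_xz_pickle(path):
    """Decompress an .xz file on the fly and unpickle its contents.

    Hypothetical helper (not part of the repository). Only use it on
    files you trust, since unpickling can execute arbitrary code.
    """
    with lzma.open(path, "rb") as f:
        return pickle.load(f)
```

Under these assumptions, calling load_xz_pickle("../../datasets/parsed_dataframes/df_all_2016.pickle.xz") would stand in for cells 3 and 5, at the cost of decompressing the file again on every run.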

Any negative time to revert values?



In [6]:

    
df_all.query("time_to_revert_days < 0").groupby("language")['time_to_revert_days'].count()









    Out[6]:





Series([], Name: time_to_revert_days, dtype: int64)

Descriptive statistics for time to revert

Descriptive statistics for time to revert in days, across all languages, articles only



In [7]:

    
df_all.query("page_namespace == 0")['time_to_revert_days'].describe()









    Out[7]:





count    561936.000000
mean        237.645457
std         411.595084
min           0.000012
25%           6.082454
50%          61.933466
75%         289.782436
max        4823.938021
Name: time_to_revert_days, dtype: float64

Descriptive statistics for time to revert in days, by language, articles only



In [8]:

    
ns0_gb_language = df_all.query("page_namespace == 0").groupby("language")
ns0_gb_language['time_to_revert_days'].describe().unstack()









    Out[8]:





       language
count  de           54938.000000
       en          244793.000000
       es           68099.000000
       fr           56181.000000
       ja           34772.000000
       pt           60124.000000
       zh           43029.000000
mean   de             177.219367
       en             355.094704
       es             153.518941
       fr             121.620191
       ja             107.053649
       pt             177.575667
       zh             120.720974
std    de             323.367816
       en             504.773676
       es             287.060992
       fr             282.023358
       ja             252.744285
       pt             338.002085
       zh             198.609084
min    de               0.000012
       en               0.000012
       es               0.000023
       fr               0.000023
       ja               0.000231
       pt               0.000023
       zh               0.000116
25%    de               3.367940
       en              16.899039
       es               4.496042
       fr               2.934815
       ja               1.888440
       pt               6.571725
       zh               4.884954
50%    de              29.779259
       en             138.333750
       es              47.901678
       fr              21.316944
       ja              13.191661
       pt              56.841302
       zh              33.041887
75%    de             210.346296
       en             506.773461
       es             172.992627
       fr              82.405093
       ja              79.564262
       pt             134.406580
       zh             213.728391
max    de            2812.659664
       en            4823.938021
       es            2718.973171
       fr            2862.087998
       ja            2850.540775
       pt            2798.985463
       zh            2869.869861
dtype: float64

Mean time to revert (days) by language and year



In [9]:

    
ns0_gb_language_year = df_all.query("page_namespace == 0").groupby(["language", "reverting_year"])
ns0_gb_language_year['time_to_revert_days'].mean().unstack()









    Out[9]:







  
    
reverting_year        2004        2005        2006        2007        2008        2009        2010        2011        2012        2013        2014        2015        2016
language
de                2.437581   28.855724   28.558832   33.218967   50.478627   43.831775   34.720251   38.194312   32.745193  425.013958  265.680970  588.075954  715.966491
en               63.480868   25.859759   40.299529   58.425741   94.761004  163.235266  268.411703  206.503663  233.984229  419.697644  549.862104  642.093301  742.180750
es                     NaN   14.931381   28.025336   39.068812   60.031915   33.716371   28.144100   60.045785   75.463717  271.023569  386.191639  602.155278  819.415971
fr                9.640359   12.541849   30.581841   31.957423   45.370803   40.599968   46.552897   57.867140   37.486068  313.290995  649.454563  508.514669  552.050484
ja                5.396280   16.650297   34.843058   20.513915   27.822040   35.431424   37.521447   34.647785   33.117964  376.683202  235.401395  104.429029   98.152000
pt                9.191991   23.780369   41.967216   90.338954   75.424798  126.001256   38.875652   55.937894   47.312297  397.827497  386.471007  524.410205  289.767724
zh               22.467035   57.266240   32.101106   19.854394   43.018599   61.714529   59.993002   46.967085   32.936245  197.100142  225.110042  175.436261  445.105027

Median time to revert (days) by language and year



In [10]:

    
ns0_gb_language_year = df_all.query("page_namespace == 0").groupby(["language", "reverting_year"])
ns0_gb_language_year['time_to_revert_days'].median().unstack()









    Out[10]:







  
    
reverting_year        2004        2005        2006        2007        2008        2009        2010        2011        2012        2013        2014        2015        2016
language
de                2.439618    6.311354   12.703090    8.252627   11.214311   14.526528    6.202546    7.732135    5.972309  279.297998  189.250058  575.239363  791.212373
en               63.480868   10.168513   12.095544   13.318663   26.052465   79.193634   42.580388   37.459572   14.159444  300.279653  266.566950  201.228762  298.773356
es                     NaN    5.390694    9.429537   11.184734   14.064867    6.995683    2.036887   61.646007    7.451910  175.852917  234.889444  325.515509  473.861169
fr                9.640359    5.506470   13.408519    7.887407   12.352766   10.870451    8.908362   20.019815    6.189792  150.262350  629.622998   43.618079  167.325457
ja                3.662130    9.422280   17.108843    8.956308   10.870405    8.286950    7.931516    5.906661    6.568229  232.283553  155.465810   28.741464    4.430874
pt                9.191991    5.794606   23.212772   39.109039   16.065278   46.074034    8.320255   60.852969    7.503605   78.069688  273.977500  498.168785    0.717535
zh               14.220000   27.748524   10.899815    7.685845   12.552546    9.607488    8.915816    4.883733    6.289398  174.701632  273.385903   30.850069  303.214566

Time to revert histogram and Kernel Density Estimate for enwiki, articles only



In [11]:

    
sns.set(font_scale=2)

g = sns.FacetGrid(df_all.query("page_namespace == 0 and language == 'en'"), size=8, aspect=2)
g.map(sns.distplot, "time_to_revert_hrs_log10", bins=25)

g.ax.set_xticks([np.log10(1/60),np.log10(1), np.log10(24), np.log10(24*7), np.log10(24*30), np.log10(24*365)])
g.ax.set_xticklabels(["minute", "hour", "day", "week", "month", "year"])

g.ax.set_xlim(np.log10(1/90), np.log10(24*365*5))

g.ax.set_ylabel("Kernel Density Estimate value")
g.ax.set_xlabel("Time to revert (log10 scaled)")









    Out[11]:





<matplotlib.text.Text at 0x7fefaa2cbeb8>
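
The tick positions in the cell above are easier to read once the x axis is understood as log10 of a duration in hours. The sketch below rebuilds those positions with the standard library and converts a value in days onto the same axis. It assumes that time_to_revert_hrs_log10 is log10 of the time to revert in hours, which is what the column name suggests but is not shown in this notebook; days_to_hrs_log10 and TICK_HOURS are illustrative names.

```python
import math

# Duration of each labeled tick, in hours (same values as the plotting cell).
TICK_HOURS = {
    "minute": 1 / 60,
    "hour": 1,
    "day": 24,
    "week": 24 * 7,
    "month": 24 * 30,
    "year": 24 * 365,
}

# Position of each tick on the log10(hours) axis.
tick_positions = {label: math.log10(hours) for label, hours in TICK_HOURS.items()}


def days_to_hrs_log10(days):
    """Convert a positive duration in days to log10(hours).

    Assumption: this mirrors how time_to_revert_hrs_log10 was derived.
    """
    return math.log10(days * 24)
```

Under that assumption, the enwiki median of 138.333750 days reported above falls between the "month" and "year" ticks.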

Time to revert KDE for all languages, articles only, combined plot



In [19]:

    
sns.set(font_scale=3.25, rc={'lines.linewidth': 4})
pal = sns.color_palette("hls", 10)
g = sns.FacetGrid(df_all.query("page_namespace == 0"), palette=pal, hue="language", size=8, aspect=3.5)
g.map(sns.kdeplot, "time_to_revert_hrs_log10")
#g.add_legend()
leg = plt.legend()
for legobj in leg.legendHandles:
    legobj.set_linewidth(8.0)

g.ax.set_xlim(np.log10(1/90), np.log10(24*365*5))

g.ax.set_ylabel("KDE probability")
g.ax.set_xlabel("Time to revert (log10 scaled)")
#plt.suptitle("Distribution of time to revert for article bot-bot reverts across languages")

g.ax.set_xticks([np.log10(1/60),np.log10(1), np.log10(24), np.log10(24*7), np.log10(24*30), np.log10(24*365)])
g.ax.set_xticklabels(["minute", "hour", "day", "week", "month", "year"])

plt.savefig("time-to-revert-log-articles.pdf", bbox_inches='tight', dpi=600)

Time to revert histogram and KDE for all languages, articles only, paneled plot



In [13]:

    
sns.set(font_scale=1.25)
pal = sns.color_palette("husl", 7)
g = sns.FacetGrid(df_all.query("page_namespace == 0"),
                  palette=pal, row="language", size=3, aspect=4, sharex=False, sharey=False)
g.map(sns.distplot, "time_to_revert_hrs_log10")

xticks = [np.log10(1/60),np.log10(1), np.log10(24), np.log10(24*7), np.log10(24*30), np.log10(24*365)]
xticklabels = ["minute", "hour", "day", "week", "month", "year"]

for ax in g.axes.flatten():
    ax.set_xticks(xticks)
    ax.set_xticklabels(xticklabels)
    ax.set_xlim(np.log10(1/90), np.log10(24*365*5))

Mean days to revert by language and reverting year (articles only, 95% CI bars)



In [20]:

    
sns.set(font_scale=3)

g = sns.factorplot(data=df_all[df_all['page_namespace']==0],
                   x='language',
                   y='time_to_revert_days',
                   hue='reverting_year',
                   kind='bar',
                   size=10,
                   aspect = 3)

#plt.suptitle("Mean days to revert by language and reverting year (articles only, 95% CI bars)")
plt.savefig("mean-days-to-revert-articles.pdf", bbox_inches='tight', dpi=600)

Median days to revert by language and reverting year (articles only, 95% CI bars)



In [15]:

    
sns.set(font_scale=3)

g = sns.factorplot(data=df_all[df_all['page_namespace']==0],
                   x='language',
                   y='time_to_revert_days',
                   hue='reverting_year',
                   kind='bar',
                   estimator=np.median,                   
                   size=10,
                   aspect = 3)

plt.suptitle("Median days to revert by language and reverting year (articles only, 95% CI bars)")









    Out[15]:





<matplotlib.text.Text at 0x7fef8a8e1780>
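
The 95% CI bars in the two bar plots are, as far as we understand seaborn's defaults, obtained by bootstrapping: resampling each group with replacement and recomputing the estimator many times. The following is a rough standard-library sketch of a percentile bootstrap for the median. It illustrates the general technique and is not seaborn's actual implementation; bootstrap_ci and its parameters are hypothetical names.

```python
import random
import statistics


def bootstrap_ci(values, estimator=statistics.median, n_boot=1000, ci=95, seed=0):
    """Approximate percentile-bootstrap confidence interval for an estimator.

    Illustrative sketch only; seaborn's own procedure may differ in detail.
    """
    rng = random.Random(seed)
    values = list(values)
    n = len(values)
    # Recompute the estimator on n_boot resamples drawn with replacement.
    stats = sorted(estimator(rng.choices(values, k=n)) for _ in range(n_boot))
    tail = (100 - ci) / 200  # fraction of resamples cut off in each tail
    low = stats[int(tail * n_boot)]
    high = stats[min(n_boot - 1, int((1 - tail) * n_boot))]
    return low, high
```

For a heavily right-skewed sample such as time to revert, the interval around the median is typically much less sensitive to a few very long reverts than the interval around the mean.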

Runtime



In [16]:

    
end = datetime.datetime.now()

time_to_run = end - start
minutes = int(time_to_run.seconds/60)
seconds = time_to_run.seconds % 60
print("Total runtime: ", minutes, "minutes, ", seconds, "seconds")









    



Total runtime:  2 minutes,  7 seconds
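
One caveat about the runtime cell: timedelta.seconds holds only the seconds component of the duration and excludes whole days, so it would under-report a run lasting longer than 24 hours. That does not matter for a 2-minute notebook, but a more robust variant could be sketched as follows; format_runtime is a hypothetical helper name.

```python
import datetime


def format_runtime(delta):
    """Format a timedelta as whole minutes and seconds, including any days."""
    total_seconds = int(delta.total_seconds())
    minutes, seconds = divmod(total_seconds, 60)
    return f"Total runtime: {minutes} minutes, {seconds} seconds"
```

It would be called as print(format_runtime(end - start)) in place of the last four lines of the cell.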
