This is a data analysis script for the analysis presented in section 5.1, which you can run entirely from the files in this GitHub repository. It loads datasets/monthly_bot_edits/[language]wiki_20170427.tsv and datasets/monthly_bot_reverts/[language]wiki_20170420.tsv.
In [1]:
import pandas as pd
import seaborn as sns
import mwapi
import numpy as np
import glob
%matplotlib inline
In [2]:
!ls -lah ../../datasets/monthly_bot_edits/
In [3]:
!ls -lah ../../datasets/monthly_bot_reverts/
Monthly bot edit counts are in the format: month (YYYYMM), page namespace, and total number of bot edits in that language's namespace that month (n).
In [17]:
!head ../../datasets/monthly_bot_edits/enwiki_20170427.tsv
Monthly bot revert counts are a bit more complicated and in a slightly different format. In particular, the month column is written as YYYYMMDD rather than YYYYMM:
In [18]:
!head ../../datasets/monthly_bot_reverts/enwiki_20170420.tsv
In [4]:
df_edits_dict = {}
for filename in glob.glob("../../datasets/monthly_bot_edits/??wiki_2017042?.tsv"):
    # The directory prefix is 33 characters long, so the two-letter
    # language code sits at positions 33-34 of the relative path.
    lang_code = filename[33:35]
    df_edits_dict[lang_code] = pd.read_csv(filename, sep="\t")
    df_edits_dict[lang_code] = df_edits_dict[lang_code].drop_duplicates()
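The fixed slice above only works while the notebook sits exactly two directories below the repository root. A minimal sketch of a path-independent alternative follows; the helper name `lang_code_from_path` is hypothetical and not part of the original notebook, and it assumes the file names keep the `[language]wiki_[date].tsv` pattern.

```python
import os


def lang_code_from_path(path):
    """Return the language code from a '<lang>wiki_<date>.tsv' file name.

    Hypothetical helper, not part of the original notebook.
    """
    base = os.path.basename(path)      # e.g. 'enwiki_20170427.tsv'
    return base.split("wiki_", 1)[0]   # everything before 'wiki_'


print(lang_code_from_path("../../datasets/monthly_bot_edits/enwiki_20170427.tsv"))
```

Because it works on the file name rather than on character offsets, the same helper could be used for both the monthly_bot_edits and monthly_bot_reverts directories.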
In [5]:
for lang, lang_df in df_edits_dict.items():
    print(lang, len(lang_df))
In [6]:
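df_rev_dict = {}
for filename in glob.glob("../../datasets/monthly_bot_reverts/??wiki_2017042?.tsv"):
    # The directory prefix is 35 characters long here, so the two-letter
    # language code sits at positions 35-36 of the relative path.
    lang_code = filename[35:37]
    df_rev_dict[lang_code] = pd.read_csv(filename, sep="\t")
    df_rev_dict[lang_code] = df_rev_dict[lang_code].drop_duplicates()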
In [7]:
for lang, lang_df in df_rev_dict.items():
    print(lang, len(lang_df))
In [8]:
langs = ["de", "en", "es", "fr", "ja", "pt", "zh"]
In [9]:
df_edits_dict['en'][0:5]
Out[9]:
In [10]:
df_rev_dict['en'][0:5]
Out[10]:
In [11]:
def truncate_my(s):
    """
    Truncate YYYYMMDD format to YYYYMM. For use with df.apply()
    """
    s = str(s)
    return int(s[0:6])
Test function
In [12]:
truncate_my(20100101)
Out[12]:
Yay!
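For reference, the same truncation can be done with integer arithmetic instead of string slicing. This is only a sketch under the assumption that the input is always an eight-digit YYYYMMDD integer; the name `truncate_my_int` is hypothetical and is not used elsewhere in this notebook.

```python
def truncate_my_int(yyyymmdd):
    """Drop the two day digits from a YYYYMMDD integer (hypothetical variant)."""
    return int(yyyymmdd) // 100


# Compare against the string-slicing approach on a few sample dates.
for value in (20100101, 20170427, 20011231):
    assert truncate_my_int(value) == int(str(value)[0:6])

print(truncate_my_int(20100101))
```

On an integer pandas column, the same floor division can be applied to the whole Series at once, which avoids calling a Python function per row.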
In [13]:
for lang in langs:
    df_edits_dict[lang] = df_edits_dict[lang].set_index('month')
    df_rev_dict[lang]['month_my'] = df_rev_dict[lang]['month'].apply(truncate_my)
    df_rev_dict[lang] = df_rev_dict[lang].set_index('month_my')
In [14]:
combi_ns0_dict = {}
combi_dict = {}
for lang in langs:
    print(lang)
    combi_ns0_dict[lang] = pd.concat(
        [
            df_rev_dict[lang].query("page_namespace == 0"),
            df_edits_dict[lang].query("page_namespace == 0"),
        ],
        axis=1,
        join='outer',
    )
    combi_ns0_dict[lang]['bot_edits'] = combi_ns0_dict[lang]['n']
    combi_ns0_dict[lang]['prop_bot2bot_rv'] = combi_ns0_dict[lang]['bot2bot_reverts']/combi_ns0_dict[lang]['bot_edits']
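To illustrate what the outer join in the cell above does when a month appears in one table but not the other, here is a small sketch. The numbers are made up for illustration and are not taken from the datasets; only the column names `n` and `bot2bot_reverts` come from the notebook.

```python
import pandas as pd

# Made-up monthly counts, indexed by YYYYMM (illustration only).
edits = pd.DataFrame({"n": [100, 200, 300]}, index=[201001, 201002, 201003])
reverts = pd.DataFrame({"bot2bot_reverts": [5, 3]}, index=[201001, 201003])

# join='outer' keeps every month found in either table; months missing
# from one side get NaN in that side's columns.
combined = pd.concat([reverts, edits], axis=1, join="outer")
combined["prop_bot2bot_rv"] = combined["bot2bot_reverts"] / combined["n"]

print(combined)
# NaN values are skipped by .sum() by default.
print(combined["bot2bot_reverts"].sum())
```

In this toy case the month with edits but no recorded reverts ends up with a NaN proportion rather than zero. Whether such months should be filled with 0 is a judgment call that the original notebook does not make.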
In [15]:
combi_ns0_dict['en'][29:39]
Out[15]:
In [16]:
sum_dict = {}
for lang in langs:
    #print(lang)
    sum_dict[lang] = combi_ns0_dict[lang][['bot_edits','bot2bot_reverts']].sum()
    print(lang, "ns0 proportion:", (sum_dict[lang]['bot2bot_reverts']/sum_dict[lang]['bot_edits']*100).round(4), "%")
    print(sum_dict[lang])
    print("")
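Note that the proportion printed above is a pooled figure: total bot-bot reverts divided by total bot edits across all months. That is not the same as averaging the monthly `prop_bot2bot_rv` values. The sketch below shows the difference with made-up numbers, not values from the datasets.

```python
# (bot_edits, bot2bot_reverts) per month; made-up numbers for illustration.
monthly = [(100, 5), (200, 0), (300, 3)]

total_edits = sum(edits for edits, _ in monthly)
total_reverts = sum(reverts for _, reverts in monthly)

# Pooled percentage: every edit counts equally, as in the cell above.
pooled_pct = total_reverts / total_edits * 100

# Mean of monthly percentages: every month counts equally.
mean_monthly_pct = sum(r / e for e, r in monthly) / len(monthly) * 100

print(round(pooled_pct, 4), round(mean_monthly_pct, 4))
```

The pooled figure gives more weight to months with many bot edits, while the mean of monthly proportions weights every month equally regardless of volume.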