Section 7.3: Sample diffs

This notebook creates the tables of sample diffs used for validation, as described in section 7.3. It can be run entirely from the files in this GitHub repository. It loads datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle.xz and creates the following files (a sketch for loading the per-language pickle follows the list):

  • datasets/sample_tables/df_[language]_ns0_sample_dict.pickle
  • analysis/main/sample_tables/[language]/ns0/[language]_ns0_sample_all.html
  • analysis/main/sample_tables/[language]/ns0/[language]_ns0_sample_[bottype].html
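
For reference, here is how the per-language pickle can be inspected after a full run; a minimal sketch, assuming it is run from the repository root once the notebook has finished:

In [ ]:
import pickle

# Each pickle maps bot revert type -> DataFrame of sampled reverts with diffs.
with open("datasets/sample_tables/df_fr_ns0_sample_dict.pickle", "rb") as f:
    fr_sample_dict = pickle.load(f)

for bottype, sample_df in fr_sample_dict.items():
    print(bottype, len(sample_df))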

This entire notebook can be run from the beginning with Kernel -> Restart & Run All in the menu bar. On a laptop with a Core i5-2540M processor, it takes about an hour to run, as it collects data from the Wikipedia API.

IF YOU RUN THIS, YOU MUST REPLACE user_agent_email WITH YOUR E-MAIL


In [1]:
user_agent_email = "REPLACE THIS WITH YOUR EMAIL plz kthxbye"

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import glob
import pickle
import mwapi
%matplotlib inline

In [3]:
import datetime

In [4]:
start = datetime.datetime.now()

Load data


In [5]:
!unxz --keep --force ../../datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle.xz
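
If unxz is not available on the system, the same decompression can be done with Python's standard-library lzma module; a minimal sketch equivalent to the shell command above:

In [ ]:
import lzma
import shutil

# Decompress the .xz archive next to itself, keeping the original
# (the equivalent of `unxz --keep --force`).
src = "../../datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle.xz"
with lzma.open(src, "rb") as fin, open(src[:-3], "wb") as fout:
    shutil.copyfileobj(fin, fout)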

In [6]:
with open("../../datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle", "rb") as f:
    df_all = pickle.load(f)

Final data format


In [7]:
df_all[0:2].transpose()


Out[7]:
0 1
archived False False
language fr fr
page_namespace 0 0
rev_deleted False False
rev_id 88656915 70598552
rev_minor_edit True True
rev_page 4419903 412311
rev_parent_id 8.85978e+07 6.75069e+07
rev_revert_offset 1 1
rev_sha1 lgtqatftj6rma9ezkyy56rsqethdoqf 0zw28ur2rlxg207ms6w3krqd4qzozq3
rev_timestamp 20130211173947 20110930180432
rev_user 1019240 414968
rev_user_text MerlIwBot Luckas-bot
reverted_to_rev_id 88597754 67506906
reverting_archived False False
reverting_comment r2.7.2+) (robot Retire : [[cbk-zam:Tortellá]] robot Retire: [[hy:Հակատանկային կառավարվող հրթ...
reverting_deleted False False
reverting_id 89436503 70750839
reverting_minor_edit True True
reverting_page 4419903 412311
reverting_parent_id 8.86569e+07 7.05986e+07
reverting_sha1 gjz9jni8w2jiccksgid7tbofddevhu0 myxsvdiky34vgddnhrclg9237cus7nn
reverting_timestamp 20130302203329 20111004215328
reverting_user 757129 1019240
reverting_user_text EmausBot MerlIwBot
revisions_reverted 1 1
namespace_type article article
reverted_timestamp_dt 2013-02-11 17:39:47 2011-09-30 18:04:32
reverting_timestamp_dt 2013-03-02 20:33:29 2011-10-04 21:53:28
time_to_revert 19 days 02:53:42 4 days 03:48:56
time_to_revert_hrs 458.895 99.8156
time_to_revert_days 19.1206 4.15898
reverting_year 2013 2011
time_to_revert_days_log10 1.2815 0.618987
time_to_revert_hrs_log10 2.66171 1.9992
reverting_comment_nobracket r2.7.2+) robot Retire:
botpair EmausBot rv MerlIwBot MerlIwBot rv Luckas-bot
botpair_sorted ['EmausBot', 'MerlIwBot'] ['Luckas-bot', 'MerlIwBot']
reverts_per_page_botpair 1 1
reverts_per_page_botpair_sorted 1 1
bottype interwiki link cleanup -- method2 interwiki link cleanup -- method2
bottype_group interwiki link cleanup -- method2 interwiki link cleanup -- method2
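
As a quick illustration of this format, the sketch below (not part of the original pipeline, and assuming df_all is loaded as above) counts reverts per classified bot revert type for one language and namespace, using only columns shown in the table:

In [ ]:
# Tally bot-bot reverts per revert type in French Wikipedia articles (ns0),
# using the language / page_namespace / bottype columns from the table above.
fr_ns0 = df_all.query("language == 'fr' and page_namespace == 0")
print(fr_ns0['bottype'].value_counts())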

In [8]:
%%bash
rm -rf sample_tables
mkdir sample_tables

declare -a arr=("de" "en" "es" "fr" "ja" "pt" "zh")

for i in "${arr[@]}"
do
   # Create per-language output directories for the article namespace (ns0)
   mkdir -p sample_tables/$i/ns0
done

find sample_tables/


sample_tables/
sample_tables/pt
sample_tables/pt/ns0
sample_tables/ja
sample_tables/ja/ns0
sample_tables/zh
sample_tables/zh/ns0
sample_tables/de
sample_tables/de/ns0
sample_tables/fr
sample_tables/fr/ns0
sample_tables/es
sample_tables/es/ns0
sample_tables/en
sample_tables/en/ns0

In [9]:
import difflib

session = {}
for lang in df_all['language'].unique():
    session[lang] = mwapi.Session('https://' + str(lang) + '.wikipedia.org', user_agent="Research script by " + user_agent_email)

In [10]:
def get_revision(rev_id, language):
    """Fetch the wikitext of a single revision from the Wikipedia API."""
    try:
        rev_get = session[language].get(action='query', prop='revisions', rvprop="content", revids=rev_id)
        rev_pages = rev_get['query']['pages']
        # The pages dict has a single entry; return its revision's content
        for row in rev_pages.items():
            return row[1]['revisions'][0]['*']
    except Exception:
        # Deleted/suppressed revisions and API errors come back as NaN
        return np.nan
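
For a quick sanity check of this helper, a sketch using the rev_id and language from the first sample row shown above (illustrative, not part of the original run; requires the sessions from In [9]):

In [ ]:
# Fetch the wikitext of reverted revision 88656915 from frwiki and show the
# first 200 characters; prints nan if the revision is unavailable.
wikitext = get_revision(88656915, 'fr')
print(str(wikitext)[:200])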

In [11]:
def get_diff(row):
    """Compute a unified diff between the reverted and reverting revisions."""
    try:
        reverted_content = row['reverted_content'].split("\n")
        reverting_content = row['reverting_content'].split("\n")

        diff = difflib.unified_diff(reverted_content, reverting_content)

        return '<br/>'.join(list(diff))

    except Exception:
        # Rows where either revision could not be retrieved come back as NaN
        return np.nan
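
The two helpers compose as follows; a sketch that fetches both sides of one revert pair from the sample above and diffs them, mirroring what get_lang_diffs() does below (illustrative, not part of the original run):

In [ ]:
# Take the first French ns0 revert, fetch both revisions, and build the diff.
row = df_all.query("language == 'fr' and page_namespace == 0").iloc[0].copy()
row['reverted_content'] = get_revision(row['rev_id'], 'fr')
row['reverting_content'] = get_revision(row['reverting_id'], 'fr')
print(str(get_diff(row))[:300])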

In [12]:
def get_diff_api(row):
    #print(row)
    rev_id = row['rev_id']
    reverting_id = row['reverting_id']
    #print(rev_id, reverting_id)
    rev_get = session.get(action='compare', fromrev=rev_id, torev=reverting_id)
    #print(rev_get)
    return rev_get['compare']['*']

In [13]:
!mkdir -p ../../datasets/sample_tables
!mkdir -p sample_tables

In [14]:
def get_lang_diffs(lang):
    print("-----------")
    print(lang)
    print("-----------")
    import os
    pd.options.display.max_colwidth = None  # don't truncate diff text in the HTML tables

    # All bot-bot reverts for this language in the article namespace (ns0)
    df_lang_ns0 = df_all.query("language == @lang and page_namespace == 0").copy()

    df_lang_ns0_sample_dict = {}
    for bottype in df_lang_ns0['bottype'].unique():
        print(bottype)
        type_df = df_lang_ns0[df_lang_ns0['bottype'] == bottype]
        
        # Sample ~1% of rows for very common revert types, 100 rows for
        # moderately common types, and every row for rare types
        if len(type_df) > 10000:
            type_df_sample = type_df.sample(round(len(type_df)/100))
        elif len(type_df) > 100:
            type_df_sample = type_df.sample(100)
        else:
            type_df_sample = type_df.copy()

        # Retrieve the wikitext of both revisions, then build a unified diff
        type_df_sample['reverting_content'] = type_df_sample['reverting_id'].apply(get_revision, language=lang)
        type_df_sample['reverted_content'] = type_df_sample['rev_id'].apply(get_revision, language=lang)

        type_df_sample['diff'] = type_df_sample.apply(get_diff, axis=1)

        df_lang_ns0_sample_dict[bottype] = type_df_sample
        
    with open("../../datasets/sample_tables/df_" + lang + "_ns0_sample_dict.pickle", "wb") as f: 
        pickle.dump(df_lang_ns0_sample_dict, f)
    
    
    for bottype, bottype_df in df_lang_ns0_sample_dict.items():

        bottype_file = bottype.replace(" ", "_")
        bottype_file = bottype_file.replace("/", "_")
        filename = "sample_tables/" + lang + "/ns0/" + lang + "_ns0_sample_" + bottype_file + ".html"

        bottype_df[['reverting_id','reverting_user_text',
                                 'rev_user_text',
                                 'reverting_comment',
                                 'diff']].to_html(filename, escape=False)

        with open(filename, 'r+') as f:
            content = f.read()
            f.seek(0, 0)
            f.write("<a name='" + bottype + "'><h1>" + bottype + "</h1></a>\r\n")
            f.write(content)
            

    call_s = "cat sample_tables/" + lang + "/ns0/*.html > sample_tables/" + lang + "/ns0/" + lang + "_ns0_sample_all.html"
    os.system(call_s)   

            
    with open("sample_tables/" + lang + "/ns0/" + lang + "_ns0_sample_all.html", 'r+') as f:
        content = f.read()
        f.seek(0, 0)
        f.write("<head><meta charset='UTF-8'></head>\r\n<body>")
        f.write("""<style>
                    .dataframe {
                        border:1px solid #C0C0C0;
                        border-collapse:collapse;
                        padding:5px;
                        table-layout:fixed;
                    }
                    .dataframe th {
                        border:1px solid #C0C0C0;
                        padding:5px;
                        background:#F0F0F0;
                    }
                    .dataframe td {
                        border:1px solid #C0C0C0;
                        padding:5px;
                    }
                </style>""")
        f.write("<table class='dataframe'>")
        f.write("<thead><tr><th>Bot type</th><th>Total count in " + lang + "wiki ns0</th><th>Number of sample diffs</th>")

        # Table of contents: one row per bot type with total and sample counts
        for bottype, bottype_df in df_lang_ns0_sample_dict.items():

            len_df = str(len(df_lang_ns0[df_lang_ns0['bottype']==bottype]))
            len_sample = str(len(bottype_df))

            toc_str = "<tr><td><a href='#" + bottype + "'>" + bottype + "</a></td>\r\n"
            toc_str += "<td>" + len_df + "</td>"
            toc_str += "<td>" + len_sample + "</td></tr>"
            f.write(toc_str)
        f.write("</table>")
        f.write(content)

In [15]:
for lang in df_all['language'].unique():
    get_lang_diffs(lang)


-----------
fr
-----------
interwiki link cleanup -- method2
interwiki link cleanup -- method1
other
other w/ per justification
fixing double redirect
moving category
other w/ revert in comment
botfight: reverting CommonsDelinker
clearing sandbox
-----------
en
-----------
moving category
interwiki link cleanup -- method1
interwiki link cleanup -- method2
fixing double redirect
template cleanup
other
other w/ per justification
botfight: Russbot vs Cydebot category renaming
botfight: 718bot vs ImageRemovalBot
protection template cleanup
redirect tagging/sorting
other w/ revert in comment
template tagging
category redirect cleanup
botfight: infoboxneeded
botfight: reverting CommonsDelinker
botfight: Cyberbot II vs AnomieBOT date tagging
commons image migration
clearing sandbox
botfight: mathbot mathlist updates
fixing links
-----------
es
-----------
interwiki link cleanup -- method1
fixing double redirect
interwiki link cleanup -- method2
other
other w/ revert in comment
botfight: reverting CommonsDelinker
template cleanup
category redirect cleanup
other w/ per justification
clearing sandbox
-----------
zh
-----------
interwiki link cleanup -- method2
fixing double redirect
other w/ revert in comment
interwiki link cleanup -- method1
other
other w/ per justification
botfight: reverting CommonsDelinker
-----------
de
-----------
interwiki link cleanup -- method1
interwiki link cleanup -- method2
other
fixing double redirect
other w/ revert in comment
template cleanup
botfight: reverting CommonsDelinker
botfight: Russbot vs Cydebot category renaming
other w/ per justification
botfight: 718bot vs ImageRemovalBot
moving category
protection template cleanup
category redirect cleanup
-----------
ja
-----------
interwiki link cleanup -- method2
other
interwiki link cleanup -- method1
other w/ revert in comment
other w/ per justification
fixing double redirect
-----------
pt
-----------
interwiki link cleanup -- method1
interwiki link cleanup -- method2
other
fixing double redirect
clearing sandbox
other w/ revert in comment
other w/ per justification
botfight: reverting CommonsDelinker

How long did this take to run?


In [16]:
end = datetime.datetime.now()

In [17]:
time_to_run = end - start
minutes = int(time_to_run.seconds/60)
seconds = time_to_run.seconds % 60
print("Total runtime: ", minutes, "minutes, ", seconds, "seconds")


Total runtime:  56 minutes,  56 seconds

In [ ]: