This is a data analysis script for creating tables of sample diffs for validation, as described in section 7.3. It can be run entirely from the files in this GitHub repository. It loads datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle.xz
and creates the following files:
datasets/sample_tables/df_[language]_ns0_sample_dict.pickle
analysis/main/sample_tables/[language]/ns0/[language]_ns0_sample_all.html
analysis/main/sample_tables/[language]/ns0/[language]_ns0_sample_[bottype].html
This entire notebook can be run from the beginning with Kernel -> Restart & Run All in the menu bar. On a laptop running a Core i5-2540M processor, it takes about 45 minutes to run, as it collects data from the Wikipedia API.
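Each per-language pickle is an ordinary Python dict mapping bot revert types to sampled dataframes, so the output can be inspected after a run. A minimal sketch, assuming the English file has already been generated and the script is run from the repository root:

import pickle

# Load the per-language sample dict produced by this notebook ('en' used as an example).
with open("datasets/sample_tables/df_en_ns0_sample_dict.pickle", "rb") as f:
    en_sample_dict = pickle.load(f)

# Keys are bot revert types; values are the sampled dataframes with diff columns.
for bottype, sample_df in en_sample_dict.items():
    print(bottype, len(sample_df))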
Replace user_agent_email below with your e-mail address before running.
In [1]:
user_agent_email = "REPLACE THIS WITH YOUR EMAIL plz kthxbye"
In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import glob
import pickle
import mwapi
%matplotlib inline
In [3]:
import datetime
In [4]:
start = datetime.datetime.now()
In [5]:
!unxz --keep --force ../../datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle.xz
In [6]:
with open("../../datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle", "rb") as f:
    df_all = pickle.load(f)
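As a quick sanity check before the long-running API steps, you can confirm that the columns this notebook relies on are present; a minimal sketch (the column list covers only the fields used below):

# Columns this notebook uses later; verify they exist before the long API run.
required_cols = ['language', 'page_namespace', 'bottype', 'rev_id', 'reverting_id',
                 'rev_user_text', 'reverting_user_text', 'reverting_comment']
missing = [c for c in required_cols if c not in df_all.columns]
assert not missing, "Missing columns: " + str(missing)
print(len(df_all), "rows,", df_all['language'].nunique(), "languages")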
In [7]:
df_all[0:2].transpose()
Out[7]:
In [8]:
%%bash
rm -rf sample_tables
mkdir sample_tables
declare -a arr=("de" "en" "es" "fr" "ja" "pt" "zh")
for i in "${arr[@]}"
do
    # create per-language output directories for the ns0 sample tables
    mkdir sample_tables/$i/
    mkdir sample_tables/$i/ns0
done
find sample_tables/
In [9]:
import mwapi
import difflib
session = {}
for lang in df_all['language'].unique():
    session[lang] = mwapi.Session('https://' + str(lang) + '.wikipedia.org',
                                  user_agent="Research script by " + user_agent_email)
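A lightweight way to confirm the sessions work is a siteinfo query against one of them; a minimal sketch, assuming 'en' is among the languages in df_all:

# Quick connectivity check: ask English Wikipedia for its site name.
print(session['en'].get(action='query', meta='siteinfo')['query']['general']['sitename'])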
In [10]:
def get_revision(rev_id, language):
    """Fetch the wikitext of a single revision from the given language's Wikipedia API."""
    try:
        rev_get = session[language].get(action='query', prop='revisions', rvprop="content", revids=rev_id)
        rev_pages = rev_get['query']['pages']
        for row in rev_pages.items():
            return row[1]['revisions'][0]['*']
    except Exception:
        # Deleted/suppressed revisions (or API errors) come back as NaN.
        return np.nan
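To spot-check the helper, fetch one revision by ID; the revision ID below is purely an illustrative placeholder, so substitute any real one from the dataset:

# Fetch a single revision's wikitext; 12345 is a placeholder rev_id, not from the dataset.
sample_text = get_revision(12345, 'en')
print(sample_text[:200] if isinstance(sample_text, str) else sample_text)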
In [11]:
def get_diff(row):
    """Build a unified diff between the reverted and reverting revision texts."""
    try:
        reverted_content = row['reverted_content'].split("\n")
        reverting_content = row['reverting_content'].split("\n")
        diff = difflib.unified_diff(reverted_content, reverting_content)
        # Join with <br/> so the diff renders line-by-line in the HTML tables.
        return '<br/>'.join(list(diff))
    except Exception:
        return np.nan
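A toy example of what ends up in the diff column; the revision texts here are made up:

# Made-up revision texts, just to illustrate the stored format.
toy_row = {'reverted_content': "First line\nOld second line",
           'reverting_content': "First line\nNew second line"}
print(get_diff(toy_row))  # a unified diff of the two texts, joined with <br/> tags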
In [12]:
def get_diff_api(row):
    """Alternative helper: ask the Wikipedia API to compute the diff server-side."""
    rev_id = row['rev_id']
    reverting_id = row['reverting_id']
    # session is a dict keyed by language, so look up the session for this row's wiki.
    rev_get = session[row['language']].get(action='compare', fromrev=rev_id, torev=reverting_id)
    return rev_get['compare']['*']
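A spot-check of the API-side diff on a single row; this assumes the first English ns0 row's revisions have not been deleted, since this helper has no error handling:

# Try the compare API on one row; the result is the body of an HTML diff table.
example_row = df_all.query("language == 'en' and page_namespace == 0").iloc[0]
print(get_diff_api(example_row)[:300])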
In [13]:
!mkdir -p ../../datasets/sample_tables
!mkdir -p sample_tables
In [14]:
def get_lang_diffs(lang):
    print("-----------")
    print(lang)
    print("-----------")
    import os
    pd.options.display.max_colwidth = -1

    df_lang_ns0 = df_all.query("language == '" + lang + "' and page_namespace == 0").copy()

    # Sample each bot revert type: 1% of very large types, 100 rows of mid-sized
    # types, and every row of small types.
    df_lang_ns0_sample_dict = {}
    for bottype in df_lang_ns0['bottype'].unique():
        print(bottype)
        type_df = df_lang_ns0[df_lang_ns0['bottype'] == bottype]
        if len(type_df) > 10000:
            type_df_sample = type_df.sample(round(len(type_df)/100))
        elif len(type_df) > 100:
            type_df_sample = type_df.sample(100)
        else:
            type_df_sample = type_df.copy()

        # Pull both revisions' wikitext from the API and build a unified diff.
        type_df_sample['reverting_content'] = type_df_sample['reverting_id'].apply(get_revision, language=lang)
        type_df_sample['reverted_content'] = type_df_sample['rev_id'].apply(get_revision, language=lang)
        type_df_sample['diff'] = type_df_sample.apply(get_diff, axis=1)
        df_lang_ns0_sample_dict[bottype] = type_df_sample

    with open("../../datasets/sample_tables/df_" + lang + "_ns0_sample_dict.pickle", "wb") as f:
        pickle.dump(df_lang_ns0_sample_dict, f)

    # Write one HTML table per bot type, each prefixed with an anchor heading.
    for bottype, bottype_df in df_lang_ns0_sample_dict.items():
        bottype_file = bottype.replace(" ", "_")
        bottype_file = bottype_file.replace("/", "_")
        filename = "sample_tables/" + lang + "/ns0/" + lang + "_ns0_sample_" + bottype_file + ".html"
        bottype_df[['reverting_id', 'reverting_user_text',
                    'rev_user_text',
                    'reverting_comment',
                    'diff']].to_html(filename, escape=False)
        with open(filename, 'r+') as f:
            content = f.read()
            f.seek(0, 0)
            f.write("<a name='" + bottype + "'><h1>" + bottype + "</h1></a>\r\n")
            f.write(content)

    # Concatenate the per-type tables, then prepend a styled table of contents.
    call_s = "cat sample_tables/" + lang + "/ns0/*.html > sample_tables/" + lang + "/ns0/" + lang + "_ns0_sample_all.html"
    os.system(call_s)

    with open("sample_tables/" + lang + "/ns0/" + lang + "_ns0_sample_all.html", 'r+') as f:
        content = f.read()
        f.seek(0, 0)
        f.write("<head><meta charset='UTF-8'></head>\r\n<body>")
        f.write("""<style>
            .dataframe {
                border:1px solid #C0C0C0;
                border-collapse:collapse;
                padding:5px;
                table-layout:fixed;
            }
            .dataframe th {
                border:1px solid #C0C0C0;
                padding:5px;
                background:#F0F0F0;
            }
            .dataframe td {
                border:1px solid #C0C0C0;
                padding:5px;
            }
            </style>""")
        f.write("<table class='dataframe'>")
        f.write("<thead><tr><th>Bot type</th><th>Total count in " + lang + "wiki ns0</th><th>Number of sample diffs</th></tr></thead>")
        for bottype, bottype_df in df_lang_ns0_sample_dict.items():
            len_df = str(len(df_lang_ns0[df_lang_ns0['bottype'] == bottype]))
            len_sample = str(len(bottype_df))
            toc_str = "<tr><td><a href='#" + bottype + "'>" + bottype + "</a></td>\r\n"
            toc_str += "<td>" + len_df + "</td>"
            toc_str += "<td>" + len_sample + "</td></tr>"
            f.write(toc_str)
        f.write("</table>")
        f.write(content)
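Before looping over every language, it can be worth smoke-testing the function on a single wiki; any code from df_all['language'].unique() works, and 'pt' here is just an example:

get_lang_diffs('pt')  # example single-language run; the loop below covers all languages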
In [15]:
for lang in df_all['language'].unique():
    get_lang_diffs(lang)
In [16]:
end = datetime.datetime.now()
In [17]:
time_to_run = end - start
minutes, seconds = divmod(int(time_to_run.total_seconds()), 60)
print("Total runtime: ", minutes, "minutes, ", seconds, "seconds")
In [ ]: