This is a data analysis script for parsing comments, as described in section 7. It can be run entirely from the files in this GitHub repository. It loads datasets/parsed_dataframes/df_all_2016.pickle.xz
and creates the following files (compressing them in .xz format):
datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle
datasets/parsed_dataframes/possible_botfights.pickle
datasets/parsed_dataframes/possible_botfights.tsv
This entire notebook can be run from the beginning with Kernel -> Restart & Run All in the menu bar. On a laptop with a Core i5-2540M processor, the main analysis takes about 5 minutes, plus another 5 minutes to xz-compress the output files (if compressed versions do not already exist).
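For reference, here is a minimal sketch (not a cell in this notebook) of how the main output could be loaded downstream, assuming the compressed pickle has already been unpacked with unxz:

import pandas as pd

# Load the dataframe of parsed comments that this notebook produces
df = pd.read_pickle("datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle")
print(df['bottype'].value_counts().head())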
In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import glob
import pickle
import datetime
%matplotlib inline
In [2]:
start = datetime.datetime.now()
In [3]:
!ls ../../datasets/parsed_dataframes/*.pickle.xz
In [4]:
!unxz --keep --force ../../datasets/parsed_dataframes/df_all_2016.pickle.xz
In [5]:
with open("../../datasets/parsed_dataframes/df_all_2016.pickle", "rb") as f:
    df_all = pickle.load(f)
In [6]:
df_all.sample(2).transpose()
Out[6]:
There are two functions used to parse comments. comment_categorization()
runs first and applies a series of pattern matches to each comment. If no match is found, interwiki_confirm()
is called, which checks for language codes in certain patterns that indicate interwiki links. A short worked example follows the first definition below.
In [7]:
def comment_categorization(row):
    """
    Takes a row from a pandas dataframe or dict and returns a string with a
    kind of activity based on metadata. Used with df.apply(). Mostly parses
    comments, but makes some use of usernames too.
    """
    reverting_user = str(row['reverting_user_text'])
    reverted_user = str(row['rev_user_text'])
    langcode = str(row['language'])

    if reverting_user.find("HBC AIV") >= 0:
        return 'AIV helperbot'

    try:
        comment = str(row['reverting_comment'])
    except Exception:
        return 'other'

    # Normalize case and whitespace for case-insensitive matching
    comment_lower = comment.lower().strip()
    comment_lower = " ".join(comment_lower.split())

    if comment == 'nan':
        return "deleted revision"

    if reverting_user == 'Cyberbot II' and reverted_user == 'AnomieBOT' and comment.find("tagging/redirecting to OCC") >= 0:
        return 'botfight: Cyberbot II vs AnomieBOT date tagging'
    if reverting_user == 'AnomieBOT' and reverted_user == 'Cyberbot II' and comment.find("{{Deadlink}}") >= 0:
        return 'botfight: Cyberbot II vs AnomieBOT date tagging'
    if reverting_user == 'RussBot' and reverted_user == 'Cydebot':
        return 'botfight: Russbot vs Cydebot category renaming'
    if reverting_user == 'Cydebot' and reverted_user == 'RussBot':
        return 'botfight: Russbot vs Cydebot category renaming'
    elif comment.find("Undoing massive unnecessary addition of infoboxneeded by a (now blocked) bot") >= 0:
        return "botfight: infoboxneeded"
    elif comment_lower.find("commonsdelinker") >= 0 and reverting_user.find("CommonsDelinker") == -1:
        return "botfight: reverting CommonsDelinker"
    elif comment.find("Reverted edits by [[Special:Contributions/ImageRemovalBot") >= 0:
        return "botfight: 718bot vs ImageRemovalBot"
    elif comment_lower.find("double redirect") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("double-redirect") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("has been moved; it now redirects to") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("correction du redirect") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("redirect tagging") >= 0:
        return "redirect tagging/sorting"
    elif comment_lower.find("sorting redirect") >= 0:
        return "redirect tagging/sorting"
    elif comment_lower.find("redirecciones") >= 0 and comment_lower.find("categoría") >= 0:
        return "category redirect cleanup"
    elif comment_lower.find("change redirected category") >= 0:
        return "category redirect cleanup"
    elif comment_lower.find("redirected category") >= 0:
        return "category redirect cleanup"
    elif comment.find("[[User:Addbot|Bot:]] Adding ") >= 0:
        return "template tagging"
    elif comment_lower.find("interwiki") >= 0:
        return "interwiki link cleanup -- method1"
    elif comment_lower.find("langlinks") >= 0:
        return "interwiki link cleanup -- method1"
    elif comment_lower.find("iw-link") >= 0:
        return "interwiki link cleanup -- method1"
    elif comment_lower.find("changing category") >= 0:
        return "moving category"
    elif comment_lower.find("recat per") >= 0:
        return "moving category"
    elif comment_lower.find("moving category") >= 0:
        return "moving category"
    elif comment_lower.find("move category") >= 0:
        return "moving category"
    elif comment_lower.find("re-categorisation") >= 0:
        return "moving category"
    elif comment_lower.find("recatégorisation") >= 0:
        return "moving category"
    elif comment_lower.find("updating users status to") >= 0:
        return "user online status update"
    elif comment_lower.find("{{copy to wikimedia commons}} either because the file") >= 0:
        return "template cleanup"
    elif comment_lower.find("removing a protection template") >= 0:
        return "protection template cleanup"
    elif comment_lower.find("removing categorization template") >= 0:
        return "template cleanup"
    elif comment_lower.find("rm ibid template per") >= 0:
        return "template cleanup"
    elif comment_lower.find("page is not protected") >= 0:
        return "template cleanup"
    elif comment_lower.find("removing protection template") >= 0:
        return "template cleanup"
    elif comment_lower.find("correcting transcluded template per tfd") >= 0:
        return "template cleanup"
    elif comment_lower.find("removing orphan t") >= 0:
        return "template cleanup"
    elif comment_lower.find("non-applicable orphan") >= 0:
        return "template cleanup"
    elif comment_lower.find("plantilla") >= 0 and comment_lower.find("huérfano") >= 0:
        return "template cleanup"
    elif comment_lower.find("removed orphan t") >= 0:
        return "template cleanup"
    elif comment_lower.find("sandbox") >= 0:
        return "clearing sandbox"
    elif comment_lower.find("archiving") >= 0:
        return "archiving"
    elif comment_lower.find("duplicate on commons") >= 0:
        return "commons image migration"
    elif comment_lower.find("user:mathbot/changes to mathlists") >= 0:
        return "botfight: mathbot mathlist updates"
    elif reverting_user == 'MathBot' or reverted_user == 'MathBot':
        return "botfight: mathbot mathlist updates"
    elif comment_lower.find("link syntax") >= 0:
        return "fixing links"
    elif comment_lower.find("links syntax") >= 0:
        return "fixing links"
    elif comment_lower.find("no broken #section links left") >= 0:
        return "fixing links"
    elif comment_lower.find("removing redlinks") >= 0:
        return "fixing links"
    elif comment_lower.find("to wikidata") >= 0:
        return "interwiki link cleanup -- method1"
    elif comment.find("言語間") >= 0:
        return "interwiki link cleanup -- method1"
    elif comment_lower.find("interproyecto") >= 0:
        return "interwiki link cleanup -- method1"
    elif comment.find("语言链接") >= 0:
        return "interwiki link cleanup -- method1"
    elif comment.find("interling") >= 0:
        return "interwiki link cleanup -- method1"
    elif comment.find("interlang") >= 0:
        return "interwiki link cleanup -- method1"
    elif comment.find("双重重定向") >= 0 or comment.find("雙重重定向") >= 0:
        return "fixing double redirect"
    elif comment.find("二重リダイレクト") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("doppelten redirect") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("doppelte weiterleitung") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("redirectauflösung") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("doble redirección") >= 0 or comment_lower.find("redirección doble") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("redireccionamento duplo") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("duplo redirecionamento") >= 0:
        return "fixing double redirect"
    elif comment_lower.find("suppression bandeau") >= 0:
        return "template cleanup"
    elif comment_lower.find("archiviert") >= 0:
        return "archiving"
    elif comment_lower.find("revert") >= 0:
        return "other w/ revert in comment"
    elif comment_lower.find("rv ") >= 0 or comment_lower.find("rv") == 0:
        return "other w/ revert in comment"
    elif comment_lower.find(" per ") >= 0:
        return "other w/ per justification"
    elif comment_lower.find(" según") >= 0:
        return "other w/ per justification"
    elif comment_lower.find("suite à discussion") >= 0:
        return "other w/ per justification"
    elif comment_lower.find("suite à conservation") >= 0:
        return "other w/ per justification"
    elif comment_lower.find("conforme pedido") >= 0:
        return "other w/ per justification"
    else:
        return interwiki_confirm(comment, langcode)
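As a quick illustration (not a cell from the original notebook), the categorizer can be exercised on a hand-built dict standing in for a dataframe row; the keys below are the columns the function actually reads, while the bot names and comment are made up. This row matches one of the early patterns, so the interwiki_confirm() fallback defined in the next cell is never reached:

# Hypothetical row: keys match the dataframe columns used above
sample_row = {'reverting_user_text': 'ExampleBot',  # made-up usernames
              'rev_user_text': 'OtherBot',
              'language': 'en',
              'reverting_comment': 'Robot: Fixing double redirect to [[Example]]'}

print(comment_categorization(sample_row))  # prints: fixing double redirect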
In [8]:
def interwiki_confirm(comment, langcode):
    """
    Takes a comment string and searches for language codes bordered on
    one side by a punctuation mark from [](){},: and on the other side
    by any punctuation mark or a space. The beginning and end of the
    comment string count as a space, not a punctuation mark.
    Does not recognize the current langcode.
    """
    import string

    with open("../../datasets/lang_codes.tsv", "r") as f:
        lang_codes = f.read().split("\n")
    lang_codes.pop()  # drop the blank '' that ends up at the end of the list
    lang_codes.remove(langcode)

    try:
        comment = str(comment)
        comment = comment.lower()
        comment = comment.replace(": ", ":")
        comment = " " + comment + " "  # pad start and end of string with non-punctuation
    except Exception:
        return 'other'

    for lang_code in lang_codes:
        lang_code_pos = comment.find(lang_code)
        lang_code_len = len(lang_code)
        char_before = " "
        char_after = " "
        if lang_code_pos >= 0:
            char_before = comment[lang_code_pos - 1]
            char_after = comment[lang_code_pos + lang_code_len]
        if char_before in string.punctuation and char_after in "[]{}(),:":
            return 'interwiki link cleanup -- method2'
        elif char_after in string.punctuation and char_before in "[]{}(),:":
            return 'interwiki link cleanup -- method2'
        elif char_before == " " and char_after in "[]{}(),:":
            return 'interwiki link cleanup -- method2'
        elif char_after == " " and char_before in "[]{}(),:":
            return 'interwiki link cleanup -- method2'
    return 'other'
Testing interwiki_confirm
In [9]:
tests_yes = ["Robot adding [[es:Test]]",
             "adding es:Test",
             "linking es, it, en",
             "modifying fr:",
             "modifying:zh",
             "modifying: ja"]

tests_no = ["test",
            "discuss policies on enwiki vs eswiki",
            "it is done",
            "per [[en:WP:AIV]]",
            "it's not its",
            "its not it's",
            "modifying it all",
            "modifying italy"]

print("Should return interwiki link cleanup -- method2")
for test in tests_yes:
    print("\t", interwiki_confirm(test, 'en'))

print("Should return other")
for test in tests_no:
    print("\t", interwiki_confirm(test, 'en'))
Apply categorization
In [10]:
%%time
df_all['bottype'] = df_all.apply(comment_categorization, axis=1)
In [11]:
def bottype_group(bottype):
    if bottype == "interwiki link cleanup -- method2":
        return "interwiki link cleanup -- method2"
    elif bottype == "interwiki link cleanup -- method1":
        return "interwiki link cleanup -- method1"
    elif bottype.find("botfight") >= 0:
        return 'botfight'
    elif bottype == 'other':
        return 'not classified'
    elif bottype == 'fixing double redirect':
        return 'fixing double redirect'
    elif bottype == 'protection template cleanup':
        return 'protection template cleanup'
    elif bottype.find("category") >= 0:
        return 'category work'
    elif bottype.find("template") >= 0:
        return 'template work'
    elif bottype == "other w/ revert in comment":
        return "other w/ revert in comment"
    else:
        return "other classified"
In [12]:
df_all['bottype_group'] = df_all['bottype'].apply(bottype_group)
Much of what we're interested in involves articles, which are in namespace 0.
In [13]:
df_all_ns0 = df_all[df_all['page_namespace']==0].copy()
In [14]:
type_counts = df_all_ns0['bottype'].value_counts().rename("count")
type_percent = df_all_ns0['bottype'].value_counts(normalize=True).rename("percent") * 100
type_percent = type_percent.round(2).astype(str) + "%"
pd.concat([type_counts, type_percent], axis=1)
Out[14]:
In [15]:
counts_dict = {}
for lang in df_all_ns0['language'].unique():
    df_lang_ns0 = df_all_ns0[df_all_ns0['language'] == lang]
    type_counts = df_lang_ns0['bottype'].value_counts().rename("count")
    type_percent = df_lang_ns0['bottype'].value_counts(normalize=True).rename("percent") * 100
    type_percent = type_percent.round(2).astype(str) + "%"
    counts_dict[lang] = pd.concat([type_counts, type_percent], axis=1)
In [16]:
df_all_ns0['language'].unique()
Out[16]:
In [17]:
counts_dict['en']
Out[17]:
In [18]:
counts_dict['ja']
Out[18]:
In [19]:
counts_dict['zh']
Out[19]:
In [20]:
counts_dict['de']
Out[20]:
In [21]:
counts_dict['fr']
Out[21]:
In [22]:
counts_dict['pt']
Out[22]:
In [23]:
counts_dict['es']
Out[23]:
In [24]:
gb_lang_bottype = df_all.query("page_namespace == 0").groupby(["language", "bottype_group"])
In [25]:
gb_lang_bottype['rev_id'].count().unstack()
Out[25]:
In [26]:
df_all[0:2].transpose()
Out[26]:
In [27]:
df_all.to_pickle("../../datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle")
In [28]:
!xz -9 -e --keep ../../datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle
In [29]:
df_all.to_csv("../../datasets/parsed_dataframes/df_all_comments_parsed_2016.tsv", sep="\t")
In [30]:
!xz -9 -e --keep ../../datasets/parsed_dataframes/df_all_comments_parsed_2016.tsv
In [31]:
end = datetime.datetime.now()
In [32]:
time_to_run = end - start
minutes = int(time_to_run.seconds/60)
seconds = time_to_run.seconds % 60
print("Total runtime: ", minutes, "minutes, ", seconds, "seconds")