Section 7.2: Comment parsing

This is a data analysis script for parsing comments, as described in section 7, which can be run entirely from the files in this GitHub repository. It loads datasets/parsed_dataframes/df_all_2016.pickle.xz and creates the following files (also compressing them in .xz format):

  • datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle
  • datasets/parsed_dataframes/possible_botfights.pickle
  • datasets/parsed_dataframes/possible_botfights.tsv

This entire notebook can be run from the beginning with Kernel -> Restart & Run All in the menu bar. On a laptop running a Core i5-2540M processor, the main analysis takes about 5 minutes; xz-compressing the output files (if compressed versions do not already exist) takes substantially longer, and the full run recorded at the end of this notebook took about 34 minutes.
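
As an aside, pandas can read these xz-compressed pickles directly, inferring the compression from the .xz extension, so the outputs can be loaded without manually decompressing them first. A minimal sketch (path relative to the repository root):

import pandas as pd

# compression is inferred from the .xz file extension
df = pd.read_pickle("datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle.xz")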


In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import glob
import pickle
import datetime
%matplotlib inline

In [2]:
start = datetime.datetime.now()

Load data

This dataset was created in load-process-data.ipynb


In [3]:
!ls ../../datasets/parsed_dataframes/*.pickle.xz


../../datasets/parsed_dataframes/df_all_2016.pickle.xz

In [4]:
!unxz --keep --force ../../datasets/parsed_dataframes/df_all_2016.pickle.xz

In [5]:
with open("../../datasets/parsed_dataframes/df_all_2016.pickle", "rb") as f:
    df_all = pickle.load(f)

Initial data format


In [6]:
df_all.sample(2).transpose()


Out[6]:
775423 19796
archived False False
language fr ja
page_namespace 14 0
rev_deleted False False
rev_id 85590818 45845042
rev_minor_edit True True
rev_page 3650242 955522
rev_parent_id 8.34555e+07 4.21405e+07
rev_revert_offset 1 1
rev_sha1 9xzyibbces5d01plwyelt04u54pdn4v 49bzyv8ikprxne502rmtkr2w8vlb06l
rev_timestamp 20121121133147 20130117052257
rev_user 620954 273540
rev_user_text MastiBot Xqbot
reverted_to_rev_id 83455513 42140472
reverting_archived False False
reverting_comment Retrait de 1 liens interlangues, désormais fou... ボット: 言語間リンク 1 件をウィキデータ上の ([[d:Q7830895]] に転記)
reverting_deleted False False
reverting_id 91666043 46820756
reverting_minor_edit True True
reverting_page 3650242 955522
reverting_parent_id 8.55908e+07 4.5845e+07
reverting_sha1 l1gs736x7f5t1nm4tqzvyq0juosp0xt 03hj9wufn75kyckp6cfcb3mw0cnvr6u
reverting_timestamp 20130404003416 20130321231808
reverting_user 1504326 397108
reverting_user_text Addbot EmausBot
revisions_reverted 1 1
namespace_type category article
reverted_timestamp_dt 2012-11-21 13:31:47 2013-01-17 05:22:57
reverting_timestamp_dt 2013-04-04 00:34:16 2013-03-21 23:18:08
time_to_revert 133 days 11:02:29 63 days 17:55:11
time_to_revert_hrs 3203.04 1529.92
time_to_revert_days 133.46 63.7467
reverting_year 2013 2013
time_to_revert_days_log10 2.12535 1.80446
time_to_revert_hrs_log10 3.50556 3.18467
reverting_comment_nobracket Retrait de 1 liens interlangues, désormais fou... ボット: 言語間リンク 1 件をウィキデータ上の
botpair Addbot rv MastiBot EmausBot rv Xqbot
botpair_sorted ['Addbot', 'MastiBot'] ['EmausBot', 'Xqbot']
reverts_per_page_botpair 1 1
reverts_per_page_botpair_sorted 1 1

Comments analysis

Comment parsing functions

Two functions are used to parse comments. comment_categorization() runs first and applies a series of pattern matches to each comment. If no match is found, interwiki_confirm() is called, which checks for language codes in patterns that indicate interwiki links.


In [7]:
def comment_categorization(row):
    """
    Takes a row from a pandas dataframe or dict and returns a string with a
    kind of activity based on metadata. Used with df.apply(). Mostly parses
    comments, but makes some use of usernames too.
    """
    
    reverting_user = str(row['reverting_user_text'])
    
    reverted_user = str(row['rev_user_text'])
    
    langcode = str(row['language'])
    
    if reverting_user.find("HBC AIV") >= 0:
        return 'AIV helperbot'
    
    try:
        comment = str(row['reverting_comment'])
    except Exception as e:
        return 'other'
    
    comment_lower = comment.lower().strip()
    comment_lower = " ".join(comment_lower.split())
 
    if comment == 'nan':  # str() of a missing/NaN comment is the string 'nan'
        return "deleted revision"
    
    if reverting_user == 'Cyberbot II' and reverted_user == 'AnomieBOT' and comment.find("tagging/redirecting to OCC") >= 0:
        return 'botfight: Cyberbot II vs AnomieBOT date tagging'
        
    if reverting_user == 'AnomieBOT' and reverted_user == 'Cyberbot II' and comment.find("{{Deadlink}}") >= 0:
        return 'botfight: Cyberbot II vs AnomieBOT date tagging'                

    if reverting_user == 'RussBot' and reverted_user == 'Cydebot':
        return 'botfight: Russbot vs Cydebot category renaming'  

    if reverting_user == 'Cydebot' and reverted_user == 'RussBot':
        return 'botfight: Russbot vs Cydebot category renaming'  
    
    elif comment.find("Undoing massive unnecessary addition of infoboxneeded by a (now blocked) bot") >= 0:
        return "botfight: infoboxneeded"
    
    elif comment_lower.find("commonsdelinker") >=0 and reverting_user.find("CommonsDelinker") == -1:
        return "botfight: reverting CommonsDelinker"
        
    elif comment.find("Reverted edits by [[Special:Contributions/ImageRemovalBot") >= 0:
        return "botfight: 718bot vs ImageRemovalBot"
    
    elif comment_lower.find("double redirect") >= 0:
        return "fixing double redirect"
    
    elif comment_lower.find("double-redirect") >= 0:
        return "fixing double redirect"

    elif comment_lower.find("has been moved; it now redirects to") >= 0:
        return "fixing double redirect"
    
    elif comment_lower.find("correction du redirect") >= 0:
        return "fixing double redirect"   
        
    elif comment_lower.find("redirect tagging") >= 0:
        return "redirect tagging/sorting"
    
    elif comment_lower.find("sorting redirect") >= 0:
        return "redirect tagging/sorting"
    
    elif comment_lower.find("redirecciones") >= 0 and comment_lower.find("categoría") >= 0:
        return "category redirect cleanup"    
    
    elif comment_lower.find("change redirected category") >= 0:
        return "category redirect cleanup"
    
    elif comment_lower.find("redirected category") >=0:
        return "category redirect cleanup"
    
    elif comment.find("[[User:Addbot|Bot:]] Adding ") >= 0:
        return "template tagging"
    
    elif comment_lower.find("interwiki") >= 0:
        return "interwiki link cleanup -- method1"
    
    elif comment_lower.find("langlinks") >= 0:
        return "interwiki link cleanup -- method1"
    
    elif comment_lower.find("iw-link") >= 0:
        return "interwiki link cleanup -- method1"
    
    elif comment_lower.find("changing category") >= 0:
        return "moving category"
    
    elif comment_lower.find("recat per") >= 0:
        return "moving category"
    
    elif comment_lower.find("moving category") >= 0:
        return "moving category"

    elif comment_lower.find("move category") >= 0:
        return "moving category"
    
    elif comment_lower.find("re-categorisation") >= 0:
        return "moving category"
    
    elif comment_lower.find("recatégorisation") >= 0:
        return "moving category"   
    
    elif comment_lower.find("Updating users status to") >= 0:
        return "user online status update"
    
    elif comment_lower.find("{{Copy to Wikimedia Commons}} either because the file") >= 0:
        return "template cleanup"
        
    elif comment_lower.find("removing a protection template") >= 0:
        return "protection template cleanup"
    
    elif comment_lower.find("removing categorization template") >= 0:
        return "template cleanup"    
    
    elif comment_lower.find("rm ibid template per") >= 0:
        return "template cleanup"      
    
    elif comment_lower.find("page is not protected") >= 0:
        return "template cleanup"          
    
    elif comment_lower.find("removing protection template") >= 0:
        return "template cleanup"    
    
    elif comment_lower.find("correcting transcluded template per tfd") >= 0:
        return "template cleanup"   
    
    elif comment_lower.find("removing orphan t") >= 0:
        return "template cleanup"
    
    elif comment_lower.find("non-applicable orphan") >= 0:
        return "template cleanup"
    
    elif comment_lower.find("plantilla") >= 0 and comment_lower.find("huérfano") >= 0:
        return "template cleanup"
    
    elif comment_lower.find("removed orphan t") >= 0:
        return "template cleanup"    
    
    elif comment_lower.find("sandbox") >= 0:
        return "clearing sandbox"
    
    elif comment_lower.find("archiving") >= 0:
        return "archiving"
    
    elif comment_lower.find("duplicate on commons") >= 0:
        return "commons image migration"
    
    elif comment_lower.find("user:mathbot/changes to mathlists") >= 0:
        return "botfight: mathbot mathlist updates"
    
    elif reverting_user == 'MathBot' or reverted_user == 'MathBot':
        return "botfight: mathbot mathlist updates"
    
    elif comment_lower.find("link syntax") >= 0:
        return "fixing links"
    
    elif comment_lower.find("links syntax") >= 0:
        return "fixing links" 
    
    elif comment_lower.find("no broken #section links left") >= 0:
        return "fixing links"  
    
    elif comment_lower.find("removing redlinks") >= 0:
        return "fixing links" 
    
    elif comment_lower.find("to wikidata") >= 0:
        return "interwiki link cleanup -- method1"
    
    elif comment.find("言語間") >=0:
        return "interwiki link cleanup -- method1"
        
    elif comment_lower.find("interproyecto") >=0:
        return "interwiki link cleanup -- method1"    
        
    elif comment.find("语言链接") >=0:
        return "interwiki link cleanup -- method1"  
    
    elif comment.find("interling") >=0:
        return "interwiki link cleanup -- method1"  
    
    elif comment.find("interlang") >=0:
        return "interwiki link cleanup -- method1"      
    
    elif comment.find("双重重定向") >=0 or comment.find("雙重重定向") >= 0:
        return "fixing double redirect"   

    elif comment.find("二重リダイレクト") >=0:
        return "fixing double redirect"  
    
    elif comment_lower.find("doppelten redirect") >=0:
        return "fixing double redirect"  
    
    elif comment_lower.find("doppelte weiterleitung") >=0:
        return "fixing double redirect"      
    
    
    elif comment_lower.find("redirectauflösung") >=0:
        return "fixing double redirect"      
    
    elif comment_lower.find("doble redirección") >=0 or comment_lower.find("redirección doble") >= 0:
        return "fixing double redirect"  
    
    elif comment_lower.find("redireccionamento duplo") >=0:
        return "fixing double redirect"  

    elif comment_lower.find("duplo redirecionamento") >=0:
        return "fixing double redirect"      
    
    elif comment_lower.find("suppression bandeau") >= 0:
        return "template cleanup"
    
    elif comment_lower.find("archiviert") >= 0:
        return "archiving"

    elif comment_lower.find("revert") >= 0:
        return "other w/ revert in comment"  
    
    elif comment_lower.find("rv ") >= 0 or comment_lower.find("rv") == 0:
        return "other w/ revert in comment"  
    
    elif comment_lower.find(" per ") >= 0:
        return "other w/ per justification"  
    
    elif comment_lower.find(" según") >= 0:
        return "other w/ per justification"      
 
    elif comment_lower.find("suite à discussion") >= 0:
        return "other w/ per justification"  
    
    elif comment_lower.find("suite à conservation") >= 0:
        return "other w/ per justification"     
    
    elif comment_lower.find("conforme pedido") >= 0:
        return "other w/ per justification"
    
    else:
        return interwiki_confirm(comment, langcode)
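
As a quick sanity check (not part of the original notebook), comment_categorization() can be called directly on a plain dict containing the four fields it reads; the values here are invented for illustration:

example_row = {"reverting_user_text": "EmausBot",
               "rev_user_text": "VolkovBot",
               "language": "ja",
               "reverting_comment": "Bot: Migrating 1 interwiki links, now provided by Wikidata"}

print(comment_categorization(example_row))  # interwiki link cleanup -- method1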

In [8]:
def interwiki_confirm(comment, langcode):
    """
    Takes a comment string and searches for language codes bordered on
    one side by a punctuation mark from []{}(),: and on the other side
    by any punctuation mark or a space. The beginning and end of the
    comment string count as spaces, not punctuation marks.
    
    The language code of the wiki being analyzed (langcode) is excluded,
    so that e.g. "it" in an Italian-language comment is not matched.
    """
    import string
    
    with open("../../datasets/lang_codes.tsv", "r") as f:
        lang_codes = f.read().split("\n")
        
    lang_codes.pop()  # the trailing newline yields a blank '' entry
    
    if langcode in lang_codes:
        lang_codes.remove(langcode)
    
    try:
        comment = str(comment)
        comment = comment.lower()
        comment = comment.replace(": ", ":")  # treat "modifying: ja" like "modifying:ja"
        comment = " " + comment + " "  # pad start and end of string with non-punctuation
    except Exception as e:
        return 'other'
    
    for lang_code in lang_codes:
        
        lang_code_pos = comment.find(lang_code)
        lang_code_len = len(lang_code)
        
        if lang_code_pos >= 0:
            char_before = comment[lang_code_pos-1]
            char_after = comment[lang_code_pos+lang_code_len]
            
            if char_before in string.punctuation and char_after in "[]{}(),:":
                return 'interwiki link cleanup -- method2'
            
            elif char_after in string.punctuation and char_before in "[]{}(),:":
                return 'interwiki link cleanup -- method2'
            
            elif char_before == " " and char_after in "[]{}(),:":
                return 'interwiki link cleanup -- method2'
            
            elif char_after == " " and char_before in "[]{}(),:":
                return 'interwiki link cleanup -- method2'
    return 'other'
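
One performance caveat: interwiki_confirm() re-opens and re-reads lang_codes.tsv on every call, and it is called once for every row the earlier patterns fail to match. A sketch of caching that file read with functools.lru_cache (assuming the same one-code-per-line layout; the helper name is ours):

import functools

@functools.lru_cache(maxsize=1)
def load_lang_codes():
    # the file is read on the first call only; later calls return the cached tuple
    with open("../../datasets/lang_codes.tsv", "r") as f:
        return tuple(code for code in f.read().split("\n") if code)

interwiki_confirm() could then build its per-call list from load_lang_codes() instead of opening the file itself.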

Testing interwiki_confirm()


In [9]:
tests_yes = ["Robot adding [[es:Test]]",
             "adding es:Test",
             "linking es, it, en",
             "modifying fr:",
             "modifying:zh",
             "modifying: ja"]

tests_no = ["test", 
            "discuss policies on enwiki vs eswiki", 
            "it is done", 
            "per [[en:WP:AIV]]",
            "it's not its", 
            "its not it's",
            "modifying it all",
            "modifying italy"]

print("Should return interwiki link cleanup -- method2")
for test in tests_yes:
    print("\t", interwiki_confirm(test, 'en'))

print("Should return other")
for test in tests_no:
    print("\t", interwiki_confirm(test, 'en'))


Should return interwiki link cleanup -- method2
	 interwiki link cleanup -- method2
	 interwiki link cleanup -- method2
	 interwiki link cleanup -- method2
	 interwiki link cleanup -- method2
	 interwiki link cleanup -- method2
	 interwiki link cleanup -- method2
Should return other
	 other
	 other
	 other
	 other
	 other
	 other
	 other
	 other

Apply categorization


In [10]:
%%time
df_all['bottype'] = df_all.apply(comment_categorization, axis=1)


CPU times: user 3min 28s, sys: 3.07 s, total: 3min 31s
Wall time: 3min 31s

Consolidate groups


In [11]:
def bottype_group(bottype):
    if bottype == "interwiki link cleanup -- method2":
        return "interwiki link cleanup -- method2"
    
    elif bottype == "interwiki link cleanup -- method1":
        return "interwiki link cleanup -- method1"
    
    elif bottype.find("botfight") >= 0:
        return 'botfight'
    
    elif bottype == 'other':
        return 'not classified'
    
    elif bottype == 'fixing double redirect':
        return 'fixing double redirect'
    
    elif bottype == 'protection template cleanup':
        return 'protection template cleanup'
    
    elif bottype.find("category") >= 0:
        return 'category work'
    
    elif bottype.find("template") >= 0:
        return 'template work'
    
    elif bottype == "other w/ revert in comment":
        return "other w/ revert in comment"
    
    else:
        return "other classified"

In [12]:
df_all['bottype_group'] = df_all['bottype'].apply(bottype_group)

Analysis

Most of what we're interested in happens in articles, which are in namespace 0.


In [13]:
df_all_ns0 = df_all[df_all['page_namespace']==0].copy()

Bottype counts and percentages across all languages in the dataset, articles only


In [14]:
type_counts = df_all_ns0['bottype'].value_counts().rename("count")
type_percent = df_all_ns0['bottype'].value_counts(normalize=True).rename("percent") * 100
type_percent = type_percent.round(2).astype(str) + "%"

pd.concat([type_counts, type_percent], axis=1)


Out[14]:
count percent
interwiki link cleanup -- method2 244759 43.56%
interwiki link cleanup -- method1 159747 28.43%
fixing double redirect 130182 23.17%
other 13792 2.45%
protection template cleanup 2841 0.51%
other w/ revert in comment 2435 0.43%
botfight: Russbot vs Cydebot category renaming 2105 0.37%
moving category 1717 0.31%
template cleanup 1387 0.25%
other w/ per justification 776 0.14%
botfight: mathbot mathlist updates 514 0.09%
category redirect cleanup 444 0.08%
botfight: Cyberbot II vs AnomieBOT date tagging 301 0.05%
redirect tagging/sorting 297 0.05%
botfight: reverting CommonsDelinker 257 0.05%
botfight: 718bot vs ImageRemovalBot 171 0.03%
botfight: infoboxneeded 98 0.02%
fixing links 81 0.01%
template tagging 24 0.0%
clearing sandbox 5 0.0%
commons image migration 3 0.0%

Bottype counts and percentages for each language, articles only


In [15]:
counts_dict = {}
for lang in df_all_ns0['language'].unique():

    df_lang_ns0 = df_all_ns0[df_all_ns0['language']==lang]
    
    type_counts = df_lang_ns0['bottype'].value_counts().rename("count")
    type_percent = df_lang_ns0['bottype'].value_counts(normalize=True).rename("percent") * 100
    type_percent = type_percent.round(2).astype(str) + "%"

    counts_dict[lang]=pd.concat([type_counts, type_percent], axis=1)

In [16]:
df_all_ns0['language'].unique()


Out[16]:
array(['ja', 'es', 'pt', 'en', 'fr', 'de', 'zh'], dtype=object)

In [17]:
counts_dict['en']


Out[17]:
count percent
fixing double redirect 110513 45.15%
interwiki link cleanup -- method1 83718 34.2%
interwiki link cleanup -- method2 37085 15.15%
protection template cleanup 2831 1.16%
other 2616 1.07%
botfight: Russbot vs Cydebot category renaming 2095 0.86%
moving category 1442 0.59%
template cleanup 1249 0.51%
other w/ revert in comment 1009 0.41%
botfight: mathbot mathlist updates 514 0.21%
category redirect cleanup 337 0.14%
botfight: Cyberbot II vs AnomieBOT date tagging 301 0.12%
redirect tagging/sorting 297 0.12%
botfight: reverting CommonsDelinker 230 0.09%
other w/ per justification 179 0.07%
botfight: 718bot vs ImageRemovalBot 170 0.07%
botfight: infoboxneeded 98 0.04%
fixing links 81 0.03%
template tagging 24 0.01%
commons image migration 3 0.0%
clearing sandbox 1 0.0%

In [18]:
counts_dict['ja']


Out[18]:
count percent
interwiki link cleanup -- method2 27631 79.46%
interwiki link cleanup -- method1 5042 14.5%
other 1787 5.14%
fixing double redirect 294 0.85%
other w/ revert in comment 11 0.03%
other w/ per justification 7 0.02%

In [19]:
counts_dict['zh']


Out[19]:
count percent
interwiki link cleanup -- method2 23618 54.89%
interwiki link cleanup -- method1 14661 34.07%
fixing double redirect 3690 8.58%
other 794 1.85%
other w/ revert in comment 257 0.6%
other w/ per justification 6 0.01%
botfight: reverting CommonsDelinker 3 0.01%

In [20]:
counts_dict['de']


Out[20]:
count percent
interwiki link cleanup -- method2 35894 65.34%
interwiki link cleanup -- method1 16573 30.17%
other 1405 2.56%
fixing double redirect 989 1.8%
other w/ revert in comment 28 0.05%
protection template cleanup 10 0.02%
other w/ per justification 10 0.02%
botfight: Russbot vs Cydebot category renaming 10 0.02%
moving category 7 0.01%
botfight: reverting CommonsDelinker 5 0.01%
template cleanup 5 0.01%
botfight: 718bot vs ImageRemovalBot 1 0.0%
category redirect cleanup 1 0.0%

In [21]:
counts_dict['fr']


Out[21]:
count percent
interwiki link cleanup -- method2 41154 73.25%
interwiki link cleanup -- method1 10082 17.95%
fixing double redirect 3247 5.78%
other 1025 1.82%
other w/ per justification 398 0.71%
moving category 268 0.48%
other w/ revert in comment 3 0.01%
botfight: reverting CommonsDelinker 2 0.0%
clearing sandbox 2 0.0%

In [22]:
counts_dict['pt']


Out[22]:
count percent
interwiki link cleanup -- method2 41576 69.15%
interwiki link cleanup -- method1 13626 22.66%
other 2970 4.94%
fixing double redirect 1888 3.14%
other w/ per justification 36 0.06%
other w/ revert in comment 26 0.04%
botfight: reverting CommonsDelinker 1 0.0%
clearing sandbox 1 0.0%

In [23]:
counts_dict['es']


Out[23]:
count percent
interwiki link cleanup -- method2 37801 55.51%
interwiki link cleanup -- method1 16045 23.56%
fixing double redirect 9561 14.04%
other 3195 4.69%
other w/ revert in comment 1101 1.62%
other w/ per justification 140 0.21%
template cleanup 133 0.2%
category redirect cleanup 106 0.16%
botfight: reverting CommonsDelinker 16 0.02%
clearing sandbox 1 0.0%

Consolidation results


In [24]:
gb_lang_bottype = df_all.query("page_namespace == 0").groupby(["language", "bottype_group"])

In [25]:
gb_lang_bottype['rev_id'].count().unstack()


Out[25]:
bottype_group  botfight  category work  fixing double redirect  \
language
de                 16.0            8.0                   989.0
en               3408.0         1779.0                110513.0
es                 16.0          106.0                  9561.0
fr                  2.0          268.0                  3247.0
ja                  NaN            NaN                   294.0
pt                  1.0            NaN                  1888.0
zh                  3.0            NaN                  3690.0

bottype_group  interwiki link cleanup -- method1  interwiki link cleanup -- method2  \
language
de                                       16573.0                            35894.0
en                                       83718.0                            37085.0
es                                       16045.0                            37801.0
fr                                       10082.0                            41154.0
ja                                        5042.0                            27631.0
pt                                       13626.0                            41576.0
zh                                       14661.0                            23618.0

bottype_group  not classified  other classified  other w/ revert in comment  \
language
de                     1405.0              10.0                        28.0
en                     2616.0             561.0                      1009.0
es                     3195.0             141.0                      1101.0
fr                     1025.0             400.0                         3.0
ja                     1787.0               7.0                        11.0
pt                     2970.0              37.0                        26.0
zh                      794.0               6.0                       257.0

bottype_group  protection template cleanup  template work
language
de                                    10.0            5.0
en                                  2831.0         1273.0
es                                     NaN          133.0
fr                                     NaN            NaN
ja                                     NaN            NaN
pt                                     NaN            NaN
zh                                     NaN            NaN

Final data format


In [26]:
df_all[0:2].transpose()


Out[26]:
0 1
archived False False
language ja ja
page_namespace 14 0
rev_deleted False False
rev_id 31654187 36768330
rev_minor_edit True True
rev_page 649447 10
rev_parent_id 3.1402e+07 3.67548e+07
rev_revert_offset 1 1
rev_sha1 hi8o3p1yka5v6fb8hdux0771qjxfewp nvmvm6tmzlze06abk9tvb4xiy1tp2ph
rev_timestamp 20100418121115 20110316095543
rev_user 105371 105371
rev_user_text VolkovBot VolkovBot
reverted_to_rev_id 31401978 36754779
reverting_archived False False
reverting_comment ボット: 言語間リンク 1 件を[[Wikipedia:ウィキデータ|ウィキデータ]]上の ... r2.6.4) (ロボットによる 追加: [[ksh:Sprooch]]
reverting_deleted False False
reverting_id 47290084 36779316
reverting_minor_edit True True
reverting_page 649447 10
reverting_parent_id 3.16542e+07 3.67683e+07
reverting_sha1 9i554cxlbugjcgq0ker7p6qsiscktbt j6mpty2xkvsejika47aq7m1rj9on882
reverting_timestamp 20130410053616 20110317101016
reverting_user 397108 397108
reverting_user_text EmausBot EmausBot
revisions_reverted 1 1
namespace_type category article
reverted_timestamp_dt 2010-04-18 12:11:15 2011-03-16 09:55:43
reverting_timestamp_dt 2013-04-10 05:36:16 2011-03-17 10:10:16
time_to_revert 1087 days 17:25:01 1 days 00:14:33
time_to_revert_hrs 26105.4 24.2425
time_to_revert_days 1087.73 1.0101
reverting_year 2013 2011
time_to_revert_days_log10 3.03652 0.00436616
time_to_revert_hrs_log10 4.41673 1.38458
reverting_comment_nobracket ボット: 言語間リンク 1 件を上の に転記 r2.6.4)
botpair EmausBot rv VolkovBot EmausBot rv VolkovBot
botpair_sorted ['EmausBot', 'VolkovBot'] ['EmausBot', 'VolkovBot']
reverts_per_page_botpair 1 1
reverts_per_page_botpair_sorted 1 1
bottype interwiki link cleanup -- method1 interwiki link cleanup -- method2
bottype_group interwiki link cleanup -- method1 interwiki link cleanup -- method2

Export data


In [27]:
df_all.to_pickle("../../datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle")

In [28]:
!xz -9 -e --keep ../../datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle
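
Alternatively, recent pandas versions can write the xz-compressed pickle in a single step, inferring the compression from the .xz suffix, though this does not reproduce the xz -9 -e settings used above:

df_all.to_pickle("../../datasets/parsed_dataframes/df_all_comments_parsed_2016.pickle.xz")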

In [29]:
df_all.to_csv("../../datasets/parsed_dataframes/df_all_comments_parsed_2016.tsv", sep="\t")

In [30]:
!xz -9 -e --keep ../../datasets/parsed_dataframes/df_all_comments_parsed_2016.tsv

How long did this take to run?


In [31]:
end = datetime.datetime.now()

In [32]:
time_to_run = end - start
minutes, seconds = divmod(int(time_to_run.total_seconds()), 60)
print("Total runtime: ", minutes, "minutes, ", seconds, "seconds")


Total runtime:  33 minutes,  53 seconds
