Building and testing the enwiki revert classifier

In this notebook, I'm going to build a revscores classification model for reverts in Wikipedia.

Revision sample

To build a classifier, we'll need labeled data. To gather it, I used Quarry. http://quarry.wmflabs.org/query/1621 queries for a random sample of revisions that were saved between 2 and 30 days ago.

Labels and features

I've written a script to help gather both the features and the reverted status (the label) of these revisions.

Gathers a set of features and reverted status for a set of revisions and
prints a TSV to stdout of the format:

<feature_value1>\t<feature_value2>\t...\t<reverted>

Usage:
    features_reverted -h | --help
    features_reverted --api=<url> --language=<clspath> <features> [--rev_pages=<path>]

Options:
    -h --help             Prints out this documentation
    <features>            The ClassPath to a list of features to extract.
    --api=<url>           The url of the API to use to extract features
    --language=<clspath>  The ClassPath to a language to use (required for some
                          features)
    --rev_pages=<path>    The location of a file containing rev_ids and
                          page_ids to extract. [default: <stdin>]

I've already prepared a list of features to extract in ores.models.enwiki.features, so that can just be used by the script.


In [1]:
import sys
sys.path.insert(0, "../")  # Makes ores package accessible
from ores.models.enwiki import features
features


Out[1]:
[<log(added_badwords_ratio + 1)>,
 <log(added_misspellings_ratio + 1)>,
 <log(longest_repeated_char_added + 1)>,
 <log(longest_token_added + 1)>,
 <log(numeric_chars_added + 1)>,
 <log(prev_words + 1)>,
 <log(proportion_of_markup_added + 1)>,
 <log(proportion_of_numeric_added + 1)>,
 <log(proportion_of_symbolic_added + 1)>,
 <log(proportion_of_uppercase_added + 1)>,
 <log(seconds_since_last_page_edit + 1)>,
 <log(words_added + 1)>,
 <log(words_removed + 1)>,
 <log(user_age_in_seconds + 1)>,
 <user_is_anon>,
 <user_is_bot>,
 <day_of_week_in_utc>,
 <hour_of_day_in_utc>,
 <is_custom_comment>,
 <is_content_namespace>,
 <is_section_comment>]
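
Most of the count-like features in this list are wrapped in log(x + 1). Here is a minimal sketch of what that transform does. It assumes a natural logarithm (the notebook does not say which base revscores uses), and both the helper name `log_plus_one` and the example counts are made up for illustration:

```python
import math

def log_plus_one(x):
    """Return log(x + 1); math.log1p is the numerically stable stdlib form."""
    return math.log1p(x)

# Hypothetical raw values for a count feature such as words_added.
raw_counts = [0, 1, 10, 100, 10000]
scaled = [log_plus_one(c) for c in raw_counts]

for raw, value in zip(raw_counts, scaled):
    print("{0:>6} -> {1:.3f}".format(raw, value))
```

Adding 1 keeps a count of zero well defined (it maps to 0), and the logarithm compresses the long tail so that a handful of very large edits does not dominate the scale of the feature.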

OK. Time to gather features and reverted status. To do so, I run the following:

$ cat datasets/enwiki.rev_pages.5k.tsv | tail -n+2 | \
  ./features_reverted \
      --api=https://en.wikipedia.org/w/api.php \
      --language=revscores.languages.english \
      ores.models.enwiki.features > \
  datasets/enwiki.features_reverted.5k.tsv

A few of the revisions won't be found -- which is fine. We'll just have to work with the remaining data. So, let's load it in!


In [2]:
feature_scores = []
with open("../datasets/enwiki.features_reverted.combined.tsv") as f:
    for line in f:
        parts = line.strip().split("\t")
        values = parts[:-1]
        reverted = parts[-1] == "True"
        
        feature_values = []
        for feature, value in zip(features, values):
            
            if feature.returns == bool:
                feature_values.append(value == "True")
            else:
                feature_values.append(feature.returns(value))
    
        feature_scores.append((feature_values, reverted))

len(feature_scores)


Out[2]:
10882

Training the models

Now we'll use the extracted features to train some models. For this section, I'll use two Support Vector Classifiers: one with a linear kernel and one with an RBF kernel.


In [3]:
from ores.models.enwiki import linear_svc, rbf_svc
from random import shuffle

shuffle(feature_scores) # Randomize the set

train_set = feature_scores[5000:] # Remaining 5882 training observations
test_set = feature_scores[:5000]  # 5000 test observations

print(linear_svc.train(train_set))
print(rbf_svc.train(train_set))


{'seconds_elapsed': 10.569149255752563}
{'seconds_elapsed': 12.478917360305786}
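
Because the shuffle above is unseeded, every run of the notebook trains and tests on a different split. If a repeatable split is wanted, something like the following sketch could be used. The helper `split_observations` and its `seed` parameter are hypothetical and not part of ores or revscores, and the stand-in data only mimics the shape of `feature_scores`:

```python
import random

def split_observations(observations, test_size, seed=None):
    """Shuffle a copy of observations and split off test_size items for testing."""
    rng = random.Random(seed)
    shuffled = list(observations)  # leave the caller's list untouched
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

# Stand-in data: 10882 (feature_values, reverted) pairs, matching the count loaded above.
observations = [([i], i % 20 == 0) for i in range(10882)]
train_set, test_set = split_observations(observations, test_size=5000, seed=0)
print(len(train_set), len(test_set))  # 5882 5000
```

Using a private `random.Random` instance keeps the seed from affecting any other code that relies on the module-level generator.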

Testing the models

First, the linear kernel. To start, here are the probabilities it assigns to the test revisions that were actually reverted.


In [4]:
list(score['probability'] for score in linear_svc.score([f for f, s in test_set if s]))


Out[4]:
[{False: 0.45203363083433101, True: 0.79321484581690627},
 {False: 0.50932505011694551, True: 0.27761191093052312},
 {False: 0.45756936701301909, True: 0.77945621862484049},
 {False: 0.50917443360441927, True: 0.28286010208053081},
 {False: 0.45890250487007683, True: 0.7758184619205708},
 {False: 0.45764850493722703, True: 0.77924406499400778},
 {False: 0.45492299634105016, True: 0.78628986042354243},
 {False: 0.51009872970665759, True: 0.24933993009228389},
 {False: 0.49611488897478406, True: 0.55551526503286253},
 {False: 0.50211684976955329, True: 0.46301657317758987},
 {False: 0.45844653041707945, True: 0.77707814067542047},
 {False: 0.46048298615827699, True: 0.77132257386924463},
 {False: 0.45609104551796337, True: 0.78333474038208961},
 {False: 0.42893952300372346, True: 0.83420197860349532},
 {False: 0.51101351862291955, True: 0.21278589571485618},
 {False: 0.46159916544084417, True: 0.76802055502709143},
 {False: 0.46283236927466809, True: 0.76424294459125652},
 {False: 0.45528468740015848, True: 0.78538479577481457},
 {False: 0.45895416349620416, True: 0.77567471617557826},
 {False: 0.51043993622082529, True: 0.23612539738910898},
 {False: 0.44981020814803563, True: 0.79820001323028855},
 {False: 0.45789814829233233, True: 0.77857172739736569},
 {False: 0.4821241310488632, True: 0.67931160287809156},
 {False: 0.45887020521891764, True: 0.77590823172275403},
 {False: 0.45719560196600306, True: 0.78045189747750776},
 {False: 0.51065139941587478, True: 0.22768995488859289},
 {False: 0.45565413339592725, True: 0.78445111940670298},
 {False: 0.49292733164726049, True: 0.59215032492125919},
 {False: 0.45736286444687424, True: 0.78000760733818331},
 {False: 0.45789618861557935, True: 0.77857702355014069},
 {False: 0.45416835304076869, True: 0.78815007163181328},
 {False: 0.50994971606892858, True: 0.25496310458064325},
 {False: 0.50996003884471086, True: 0.25457639143574579},
 {False: 0.45801625573518384, True: 0.77825199505676168},
 {False: 0.45757086644190165, True: 0.77945220330767018},
 {False: 0.45728790083808712, True: 0.78020698508255704},
 {False: 0.46205394018982809, True: 0.76664364780285288},
 {False: 0.51099500494401573, True: 0.21356236341381513},
 {False: 0.41896234911145602, True: 0.84659094722687311},
 {False: 0.51011276740602862, True: 0.24880564809812844},
 {False: 0.50946763592306243, True: 0.27256923987463511},
 {False: 0.42631412545471681, True: 0.83769525566440062},
 {False: 0.46101597125919391, True: 0.76975934345232477},
 {False: 0.50197895239785884, True: 0.46560175104752061},
 {False: 0.48596644904208852, True: 0.65331748996415406},
 {False: 0.45942887673236588, True: 0.77434379108807749},
 {False: 0.45967871371520663, True: 0.77363602448538482},
 {False: 0.46226410972395926, True: 0.76600099070025662},
 {False: 0.46271113108244211, True: 0.76462055040683274},
 {False: 0.43551100724677227, True: 0.82459864921796577},
 {False: 0.45415583566983325, True: 0.78818061123775984},
 {False: 0.45755469370596386, True: 0.77949550335441042},
 {False: 0.50955790429599102, True: 0.26933872593939384},
 {False: 0.51047110190996725, True: 0.23489422284273187},
 {False: 0.45145420114909823, True: 0.7945415101373825},
 {False: 0.51106320702709607, True: 0.21069406096861387},
 {False: 0.50867199163405308, True: 0.29980708788132432},
 {False: 0.50985298986711425, True: 0.25856649052970815},
 {False: 0.45619012039604462, True: 0.78307972679071025},
 {False: 0.46272433894047188, True: 0.76457948048304403},
 {False: 0.50642182538710878, True: 0.36647991814275088},
 {False: 0.46164259613888609, True: 0.76788986342979537},
 {False: 0.45387075667576299, True: 0.78887338823834596},
 {False: 0.47353328933253613, True: 0.72441967272727315},
 {False: 0.45883580418381648, True: 0.77600375106340169},
 {False: 0.45782055758590412, True: 0.7787811987002875},
 {False: 0.45758027383545091, True: 0.77942700742670357},
 {False: 0.44741874199055948, True: 0.80325890555008972},
 {False: 0.5061219987472374, True: 0.37437672525858118},
 {False: 0.46461328329400559, True: 0.75853246434762234},
 {False: 0.46230705288863388, True: 0.76586918112995317},
 {False: 0.50592241627006795, True: 0.37951990833611499},
 {False: 0.4537799684187816, True: 0.78909291346400234},
 {False: 0.4770937687063419, True: 0.7075338914897229},
 {False: 0.50213993444753624, True: 0.46258121522367412},
 {False: 0.45226266643959573, True: 0.7926849254022037},
 {False: 0.45991350107970719, True: 0.77296623125695418},
 {False: 0.4583565545109497, True: 0.77732478980134578},
 {False: 0.45711684842675093, True: 0.7806603709548231},
 {False: 0.46174482272698064, True: 0.76758157780486058},
 {False: 0.4257834086025658, True: 0.83837983440118069},
 {False: 0.45864990526269828, True: 0.77651831097270752},
 {False: 0.46349165508788681, True: 0.76216510823819505},
 {False: 0.45904234536399219, True: 0.77542885055246114},
 {False: 0.51088728412115147, True: 0.21804883572602246},
 {False: 0.50935259663847543, True: 0.27664338764840929},
 {False: 0.45639536456170027, True: 0.78254922685770523},
 {False: 0.45914869977912331, True: 0.77513149186286823},
 {False: 0.45840515198316995, True: 0.77719164854256362},
 {False: 0.50909478907927652, True: 0.28560330597432748},
 {False: 0.5104704713076772, True: 0.23491917515361332},
 {False: 0.50822402060430694, True: 0.31422798317671663},
 {False: 0.50337473277918643, True: 0.43815401376285068},
 {False: 0.43719875621987975, True: 0.82191050716211456},
 {False: 0.4322381940052224, True: 0.8295443469872722},
 {False: 0.45683147096627669, True: 0.78141200384943754},
 {False: 0.48586932100737512, True: 0.65403154503618133},
 {False: 0.45727875843434829, True: 0.78023127228605427},
 {False: 0.4588117672063256, True: 0.77607043761987027},
 {False: 0.50423178738213636, True: 0.41977167043570368},
 {False: 0.46131990498495457, True: 0.76885690275670426},
 {False: 0.46289164585957554, True: 0.76405781946027851},
 {False: 0.50798257554449711, True: 0.32174639409080596},
 {False: 0.45826718029462876, True: 0.77756916893542027},
 {False: 0.46060880251802727, True: 0.77095575264279059},
 {False: 0.50692858829670606, True: 0.35264665977030651},
 {False: 0.45950369891888332, True: 0.77413235954079562},
 {False: 0.50098171756782661, True: 0.4835415414113553},
 {False: 0.50735299515590393, True: 0.34056479684030161},
 {False: 0.50254499775371264, True: 0.45481875809373401},
 {False: 0.46134475646633166, True: 0.76878275546882591},
 {False: 0.45527349486943952, True: 0.78541293602056239},
 {False: 0.4594231001880949, True: 0.77436009548103735},
 {False: 0.4588701946461442, True: 0.77590826109393629},
 {False: 0.45639566195471482, True: 0.78254845600427669},
 {False: 0.45607941420809717, True: 0.78336463336266149},
 {False: 0.45495683430251366, True: 0.78620556177292433},
 {False: 0.51033921185729803, True: 0.24007641417922362},
 {False: 0.4591257184968871, True: 0.77519582233252127},
 {False: 0.50543104861276122, True: 0.39181222570024143},
 {False: 0.45746088200287782, True: 0.77974628419891279},
 {False: 0.51011572103951541, True: 0.24869313053653677},
 {False: 0.50935583805393803, True: 0.27652924297364528},
 {False: 0.46176313798818042, True: 0.76752624539310077},
 {False: 0.51071396944667669, True: 0.22515672612125107},
 {False: 0.46228219599465042, True: 0.76594549743700702},
 {False: 0.45872802710201044, True: 0.77630240451984567},
 {False: 0.4632329171093964, True: 0.76298552009936227},
 {False: 0.50955270924126639, True: 0.26952545627669844},
 {False: 0.4388004294282663, True: 0.81926650296513814},
 {False: 0.4318760776630815, True: 0.83007109293827996},
 {False: 0.46166525029381378, True: 0.76782162574346935},
 {False: 0.51035197660667686, True: 0.23957805646770758},
 {False: 0.4503001068267134, True: 0.79712577711608223},
 {False: 0.45484741458064548, True: 0.78647787552593285},
 {False: 0.46267649851573511, True: 0.76472816301193247},
 {False: 0.50822232749651708, True: 0.31428131149554894},
 {False: 0.46424879483072728, True: 0.75972672990053813},
 {False: 0.50851169102172977, True: 0.30503961975579491},
 {False: 0.45534197355969952, True: 0.78524063321823256},
 {False: 0.458700530019099, True: 0.77637845327388288},
 {False: 0.46303662286562158, True: 0.763603645138626},
 {False: 0.47268974406807956, True: 0.72810774954147117},
 {False: 0.45506064087207698, True: 0.78594647322691891},
 {False: 0.45349547499166065, True: 0.78977740027650767},
 {False: 0.45860630612016517, True: 0.77663859840876392},
 {False: 0.45902824120252794, True: 0.77546821687559142},
 {False: 0.43553829150269935, True: 0.82455596401483011},
 {False: 0.46068991055859959, True: 0.77071856424051421},
 {False: 0.46061709527995548, True: 0.77093152748958882},
 {False: 0.51052663441907897, True: 0.2326902015730726},
 {False: 0.4619143433786751, True: 0.76706828424554174},
 {False: 0.50784358337545699, True: 0.32599693634676585},
 {False: 0.459337152120537, True: 0.77460236625347156},
 {False: 0.5050772715063, True: 0.40035077518697954},
 {False: 0.45729894213659594, True: 0.78017764513180676},
 {False: 0.40510473246367684, True: 0.8604899573763064},
 {False: 0.5089243243407694, True: 0.29140175255239614},
 {False: 0.45753339893354256, True: 0.77955248701644464},
 {False: 0.44612480797131082, True: 0.80587447021984882},
 {False: 0.44131063033772316, True: 0.81492857409864039},
 {False: 0.45512152038457848, True: 0.78579418699786718},
 {False: 0.45516580398681905, True: 0.78568325676108053},
 {False: 0.45778929469282664, True: 0.77886547011535123},
 {False: 0.50822813542965362, True: 0.31409834118704322},
 {False: 0.46232853351829828, True: 0.76580318493366362},
 {False: 0.45865123156901938, True: 0.77651464943818371},
 {False: 0.46065844774473047, True: 0.77081063932787031},
 {False: 0.5102609757641926, True: 0.24311610884114337},
 {False: 0.46112666885859566, True: 0.76943159467121947},
 {False: 0.45484243701647714, True: 0.78649024413931135},
 {False: 0.45857683895659956, True: 0.77671981229340004},
 {False: 0.51086789438357694, True: 0.21885076138690127},
 {False: 0.45752197793474481, True: 0.77958303501869386},
 {False: 0.45977701048233433, True: 0.77335615922813361},
 {False: 0.5113762592129234, True: 0.19724469944130402},
 {False: 0.51092890822649972, True: 0.21632154611395699},
 {False: 0.49595103649669808, True: 0.55757036748851763},
 {False: 0.4595032220106563, True: 0.77413370862194386},
 {False: 0.44289626984542785, True: 0.81205816044694523},
 {False: 0.50836970254654767, True: 0.30960673024483487},
 {False: 0.51084071427907229, True: 0.21997201152264886},
 {False: 0.45917437095827585, True: 0.77505958166708822},
 {False: 0.49630422469539304, True: 0.55311494548242679},
 {False: 0.49756625458244941, True: 0.53637840561753658},
 {False: 0.45518130152217628, True: 0.78564440409201219},
 {False: 0.46032469586902008, True: 0.77178216656205922},
 {False: 0.45973703883953432, True: 0.77347006006919528},
 {False: 0.51044863350544678, True: 0.23578223293777989},
 {False: 0.45448454576598757, True: 0.78737523180323798},
 {False: 0.4295389578330191, True: 0.83337854649158383},
 {False: 0.45780141351816661, True: 0.77883281170999419},
 {False: 0.48007339167141094, True: 0.69151670204804616},
 {False: 0.45703217895293191, True: 0.7808839964251737},
 {False: 0.46116378794216284, True: 0.76932145483194714},
 {False: 0.45871510789677611, True: 0.77633814260289857},
 {False: 0.50931801591463899, True: 0.27785879735799607},
 {False: 0.51043519264811543, True: 0.23631242720288359},
 {False: 0.46174516573338825, True: 0.76758054182152458},
 {False: 0.5108129769102383, True: 0.22111280870094149},
 {False: 0.45982658609584637, True: 0.77321470879104026},
 {False: 0.46256482760033096, True: 0.76507439095704588},
 {False: 0.45331055702669465, True: 0.79021955342562589},
 {False: 0.45438554795789632, True: 0.78761853442035734},
 {False: 0.45922537181851519, True: 0.77491656105608464},
 {False: 0.51094677827556856, True: 0.21557755250280497},
 {False: 0.50968129826606601, True: 0.26487405724289292},
 {False: 0.45541324455050608, True: 0.78506096432898553},
 {False: 0.46166777925532482, True: 0.76781400528895427},
 {False: 0.45833118806716955, True: 0.7773942129398459},
 {False: 0.50580110394833477, True: 0.38260298291472533},
 {False: 0.45867426383249116, True: 0.77645104225548633},
 {False: 0.43728587167515992, True: 0.82176907200835503},
 {False: 0.44706416497391627, True: 0.8039838592487869},
 {False: 0.46119759901899604, True: 0.76922102567729067},
 {False: 0.45602749533417403, True: 0.78349795107347031},
 {False: 0.5097858041699358, True: 0.26104811815592605},
 {False: 0.46045403993282263, True: 0.77140677708066063},
 {False: 0.51065086154154193, True: 0.22771165682385824},
 {False: 0.45661005484799722, True: 0.78199109541008449},
 {False: 0.51070062121726356, True: 0.22569859443377402},
 {False: 0.51078794238202885, True: 0.22213947022567151},
 {False: 0.46252155642089987, True: 0.76520823790475434},
 {False: 0.49002520425526475, True: 0.62030292965889711},
 {False: 0.50974325986851488, True: 0.26261064907468645},
 {False: 0.45824205032870735, True: 0.77763777162149905},
 {False: 0.45200876084593694, True: 0.7932721980013504},
 {False: 0.4595330980276397, True: 0.7740491594173593},
 {False: 0.45932725725429718, True: 0.77463021961504308},
 {False: 0.46412652675268362, True: 0.76012434508472226},
 {False: 0.4661959505864598, True: 0.7531861799964843},
 {False: 0.45887417744198661, True: 0.77589719625425047},
 {False: 0.5106209543570962, True: 0.22891634865378524},
 {False: 0.51063981050636897, True: 0.22815725916081747},
 {False: 0.4538168505983019, True: 0.78900379655206454},
 {False: 0.46567396643231668, True: 0.75497895439677853},
 {False: 0.46013820798697286, True: 0.77232092221517534},
 {False: 0.46311564088496771, True: 0.76335526205899418},
 {False: 0.45693331049745406, True: 0.78114445927964427},
 {False: 0.45552939473217036, True: 0.78476741211338241},
 {False: 0.45987477311600322, True: 0.77307702607883033},
 {False: 0.45691716637981933, True: 0.78118692217004737},
 {False: 0.46401596124242062, True: 0.76048261672246109},
 {False: 0.46183351010501528, True: 0.767313363091776},
 {False: 0.51069485490406941, True: 0.22593243311749811},
 {False: 0.45680597222982305, True: 0.78147887402629623},
 {False: 0.44493306689427981, True: 0.8082122937051317},
 {False: 0.44781244021404387, True: 0.80244655395511},
 {False: 0.50895177616831255, True: 0.29047460176126172},
 {False: 0.4348857055436498, True: 0.82557013239478494},
 {False: 0.49721897315189728, True: 0.54111617530349132},
 {False: 0.46670227621048466, True: 0.7514185139296804},
 {False: 0.46106536482689914, True: 0.7696132328908839},
 {False: 0.4608491327267018, True: 0.77025130242788098},
 {False: 0.45864023782767216, True: 0.77654499569951552},
 {False: 0.4441596315508361, True: 0.80969441765499184},
 {False: 0.46301743788522043, True: 0.76366386102470274},
 {False: 0.46488931578925052, True: 0.75761900342659827},
 {False: 0.46061369709317751, True: 0.77094145512067247},
 {False: 0.45541128368873623, True: 0.78506591216752508},
 {False: 0.49959905371363567, True: 0.50640650485173533},
 {False: 0.48972251125253985, True: 0.62300010631361868},
 {False: 0.43202678888021273, True: 0.82985234118196338},
 {False: 0.51069132881069479, True: 0.22607535289181072},
 {False: 0.45136632992345538, True: 0.7947409726798933},
 {False: 0.45645949595103108, True: 0.78238284984674544},
 {False: 0.46149180215129421, True: 0.7683429115950996},
 {False: 0.45731226648936474, True: 0.78014222641774422},
 {False: 0.46173756669570365, True: 0.76760349074576584}]

In [5]:
import matplotlib.pyplot as plt

stats = linear_svc.test(test_set)
print("mean.accuracy = {0}".format(stats['mean.accuracy']))
print("auc = {0}".format(stats['auc']))
print(stats['table'])


plt.figure()
plt.plot(stats['roc']['fpr'],
         stats['roc']['tpr'], 
         label='ROC curve')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()


mean.accuracy = 0.9424
auc = 0.8426881425674402
{(False, True): 898, (True, False): 65, (False, False): 3814, (True, True): 223}
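
The `auc` value summarizes the ROC curve plotted above as a single number: the area under it. One common way to approximate that area from lists of false and true positive rates is the trapezoidal rule, sketched below. Whether revscores computes `stats['auc']` this way is an assumption, and the ROC points are made up rather than taken from the model:

```python
def trapezoid_auc(fpr, tpr):
    """Approximate the area under an ROC curve with the trapezoidal rule.

    Expects fpr sorted in non-decreasing order, with tpr aligned to it.
    """
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2.0
        area += width * mean_height
    return area

# Made-up ROC points, not taken from the models in this notebook.
fpr = [0.0, 0.1, 0.4, 1.0]
tpr = [0.0, 0.6, 0.8, 1.0]
print(trapezoid_auc(fpr, tpr))

# The dashed diagonal in the plot corresponds to a random classifier.
print(trapezoid_auc([0.0, 1.0], [0.0, 1.0]))  # 0.5
```

An area of 0.5 matches the dashed diagonal, and values closer to 1 mean the curve bends further toward the top-left corner.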

Now for the RBF kernel.


In [11]:
stats = rbf_svc.test(test_set)
print("mean.accuracy = {0}".format(stats['mean.accuracy']))
print("auc = {0}".format(stats['auc']))
print(stats['table'])

plt.figure()
plt.plot(stats['roc']['fpr'],
         stats['roc']['tpr'], 
         label='ROC curve')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()


mean.accuracy = 0.9428
auc = 0.796488513607733
{(False, True): 616, (True, False): 122, (False, False): 4098, (True, True): 164}
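
The two tables printed above can also be compared directly. The notebook does not say which element of each key is the actual label and which is the prediction, so this sketch only computes plain accuracy, which depends solely on the diagonal cells and is the same under either reading. The helper `accuracy_from_table` is hypothetical:

```python
def accuracy_from_table(table):
    """Fraction of observations on the diagonal of a 2x2 confusion table."""
    correct = table[(True, True)] + table[(False, False)]
    total = sum(table.values())
    return correct / float(total)

# Tables copied from the linear and RBF test output above.
linear_table = {(False, True): 898, (True, False): 65,
                (False, False): 3814, (True, True): 223}
rbf_table = {(False, True): 616, (True, False): 122,
             (False, False): 4098, (True, True): 164}

print("linear: {0:.4f}".format(accuracy_from_table(linear_table)))  # linear: 0.8074
print("rbf:    {0:.4f}".format(accuracy_from_table(rbf_table)))     # rbf:    0.8524
```

These figures differ from the `mean.accuracy` values printed above. The notebook does not define that statistic, so it may well be computed in some other way.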

And there you have it.

