In this notebook, we'll examine the use of a LinearSVC scorer. First, we'll train and test a machine learning model. Next, we'll construct a scorer from that model and generate some scores. Finally, we'll serialize the model into a file for re-use.
Before we get too far, I'm going to import "pprint" (Pretty Print) to make it a little easier to read the data structures we're working with. I'll also be generating some test data, so I'll need a source of random noise.
In [1]:
from pprint import pprint
from random import normalvariate
import sys
sys.path.insert(0, "..")
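As a quick aside, `normalvariate(mu, sigma)` draws a single sample from a Gaussian distribution; averaging many draws recovers the mean. The parameters below mirror the training data we'll generate later in this notebook:

```python
from random import normalvariate

# Draw 10,000 samples from a Gaussian centered at 0.05 with sigma 0.03
samples = [normalvariate(0.05, 0.03) for _ in range(10000)]

# The sample mean lands very close to the distribution mean
mean = sum(samples) / len(samples)
print(round(mean, 2))  # → 0.05
```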
In [2]:
from revscoring.features import (proportion_of_badwords_added,
                                 proportion_of_markup_added)
Next, we get the scorer and LinearSVC model classes out of the "scorers" module.
In [3]:
from revscoring.scorers import MLScorer, MLScorerModel, LinearSVCModel
Since LinearSVC implements MLScorer, it has a "MODEL" class variable that points to a model class that we can construct. When we construct it, we give it the set of features we plan to use.
In [4]:
model = LinearSVCModel([proportion_of_badwords_added,
                        proportion_of_markup_added])
Now that we have a model, we're ready to do some training.
In [5]:
train_set = [((normalvariate(0.05, 0.03), normalvariate(0.001, 0.002)), True) for i in range(5000)] + \
            [((normalvariate(0.001, 0.002), normalvariate(0.05, 0.03)), False) for i in range(5000)]
stats = model.train(train_set)
pprint(stats)
Now that we've trained the model, we should test it to make sure that it does a good job of making predictions. While I'm using fake data here, this test phase should really be done with a sample of data that was withheld from training.
In [6]:
test_set = [
    ([.052, .001], True),
    ([.049, .000], True),
    ([.073, .000], True),
    ([.041, .002], True),
    ([.053, .001], False),  # This is an anomalous observation and will be mis-predicted
    ([.001, .101], False),
    ([.000, .107], False),
    ([.002, .090], False)
]
pprint(model.test(test_set))
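As noted above, in real use the test set should be withheld from training rather than made up. A minimal sketch of that holdout pattern, using the same kind of synthetic observations and only the standard library (no revscoring involved):

```python
from random import normalvariate, shuffle

# Build labeled synthetic observations like the ones used above
observations = (
    [((normalvariate(0.05, 0.03), normalvariate(0.001, 0.002)), True) for _ in range(5000)] +
    [((normalvariate(0.001, 0.002), normalvariate(0.05, 0.03)), False) for _ in range(5000)]
)

# Shuffle, then withhold 20% of the observations for testing
shuffle(observations)
split = int(len(observations) * 0.8)
train_set, test_set = observations[:split], observations[split:]

print(len(train_set), len(test_set))  # → 8000 2000
```

The shuffle matters: without it, the holdout slice would contain only `False` observations.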
A scorer's job is to combine a feature extractor with a model so that scores can be requested directly. So, in order to construct our scorer, we'll need to build an extractor first. Since our features use some language features, we'll need to provide a language to the extractor.
In [7]:
from mw.api import Session
from revscoring.extractors import APIExtractor
from revscoring.languages import english
extractor = APIExtractor(Session("https://en.wikipedia.org/w/api.php"), language=english)
Now that we have a trained model and an extractor, we can combine them to construct a scorer.
In [8]:
scorer = MLScorer(extractor, model)
Now we can use the scorer to score new revisions. Note that this was trained and tested on data that I made up, so it might not work that well.
In [9]:
pprint(list(scorer.score([639744702, 639746884])))
And there we have it.
In [10]:
from io import BytesIO
# Create a file. We'll use a fake file for this demonstration.
f = BytesIO()
# Ask the model to dump itself into the file.
model.dump(f)
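The dump()/load() calls presumably wrap Python's pickle serialization. The same file round-trip pattern can be illustrated with only the standard library; the dictionary here is just a stand-in for a real trained model:

```python
import pickle
from io import BytesIO

# A stand-in object for the trained model
fake_model = {"weights": [0.9, -1.1], "intercept": 0.02}

# Dump into an in-memory file, rewind, and load it back
f = BytesIO()
pickle.dump(fake_model, f)
f.seek(0)
restored = pickle.load(f)

print(restored == fake_model)  # → True
```

In a real deployment you'd pass a file opened in binary mode (`open(path, "wb")` / `open(path, "rb")`) instead of a BytesIO buffer.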
OK. Now that 'f' contains the serialized model, we can read it back into a model and rebuild the scorer.
In [11]:
# Rewind the BytesIO file to the beginning so that we can read it.
f.seek(0)
# Use load() on the model class to read the file back in.
new_model = MLScorerModel.load(f)
# Rebuild the scorer
scorer = MLScorer(extractor, new_model)
# Score some revisions again.
pprint(list(scorer.score([639744702, 639746884])))
And there you have it. We have constructed an MLScorer model, trained it, tested it, and stored the whole thing in a file so that we can make use of it later.