LinearSVC Scorer demonstration

In this notebook, we'll examine the use of a LinearSVC scorer. First, we'll train and test a machine learning model. Next, we'll construct a scorer from that model and generate some scores. Finally, we'll serialize the model to a file for re-use.

Before we get too far, I'm going to import "pprint" (Pretty Print) to make it a little easier to read the data structures we are working with. I'll also be generating some synthetic data, so I'll need a source of random noise.


In [1]:
from pprint import pprint
from random import normalvariate
import sys
sys.path.insert(0, "..")
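
As a quick sanity check on the noise source, normalvariate(mu, sigma) draws a single float from a normal distribution. For example:

# Three example draws centered on 0.05 with standard deviation 0.03.
pprint([normalvariate(0.05, 0.03) for _ in range(3)])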

Part 1: Training and testing the model

First, we get some features for the classifier. In this case, I've arbitrarily chosen two features that return floating point values.


In [2]:
from revscoring.features import (proportion_of_badwords_added,
                                proportion_of_markup_added)

Next, we get the LinearSVC model class (along with the scorer and model base classes we'll need later) out of the "scorers" module.


In [3]:
from revscoring.scorers import MLScorer, MLScorerModel, LinearSVCModel

Since LinearSVC implements MLScorer, it has a "MODEL" class variable that points to a model class that we can construct directly. When we construct the model, we give it the set of features we plan to use.


In [4]:
model = LinearSVCModel([proportion_of_badwords_added,
                         proportion_of_markup_added])

Now that we have a model, we're ready to do some training.


In [5]:
# Generate 5,000 synthetic observations per class: "True" observations have
# high badword proportions; "False" observations have high markup proportions.
train_set = [((normalvariate(0.05, 0.03), normalvariate(0.001, 0.002)), True) for i in range(5000)] + \
            [((normalvariate(0.001, 0.002), normalvariate(0.05, 0.03)), False) for i in range(5000)]
stats = model.train(train_set)
pprint(stats)


{'seconds_elapsed': 2.034245252609253}

Now that we've trained the model, we should test it to make sure it does a good job of making predictions. While I'm using fabricated data here, a real test phase should use a sample of data that was withheld from training.
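
For real data, the split might look like this minimal sketch (standard library only; the 80/20 ratio is an arbitrary example):

from random import shuffle

# Shuffle the labeled observations, then hold the last 20% back for testing.
observations = list(train_set)
shuffle(observations)
cutoff = int(len(observations) * 0.8)
training_sample, testing_sample = observations[:cutoff], observations[cutoff:]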


In [6]:
test_set = [
    ([.052, .001], True),
    ([.049, .000], True),
    ([.073, .000], True),
    ([.041, .002], True),
    ([.053, .001], False), # This is an anomalous observation and will be mis-predicted
    ([.001, .101], False),
    ([.000, .107], False),
    ([.002, .090], False)
]
pprint(model.test(test_set))


{'auc': 0.8125,
 'mean.accuracy': 0.5,
 'roc': {'fpr': [0.0, 0.25, 0.25, 0.25, 0.25, 0.5, 0.75, 1.0],
         'thresholds': [0.9999951120955598,
                        0.99193265274203946,
                        0.99121593614332948,
                        0.99008809411589405,
                        0.97458010603277967,
                        8.8607807278771145e-06,
                        1.8299857286638758e-06,
                        7.4447688971139952e-07],
         'tpr': [0.25, 0.25, 0.5, 0.75, 1.0, 1.0, 1.0, 1.0]},
 'table': {(False, False): 3, (False, True): 1, (True, True): 4}}
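
Reading the output: 'table' is a small confusion matrix keyed by (actual, predicted) pairs, so the anomalous observation shows up as the lone (False, True) cell. A quick sketch of reading raw accuracy out of it:

# The confusion table from the test output above.
table = {(False, False): 3, (False, True): 1, (True, True): 4}
correct = sum(n for (actual, predicted), n in table.items()
              if actual == predicted)
print(correct / sum(table.values()))  # 0.875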

Part 2: Constructing a scorer from the model

A scorer's job is to combine a feature extractor with a model so that scores can be requested directly. So, in order to construct our scorer, we'll need to build an extractor first. Since some of our features depend on language-specific resources (e.g., the badwords behind proportion_of_badwords_added), we'll need to provide a language to the extractor.


In [7]:
from mw.api import Session
from revscoring.extractors import APIExtractor
from revscoring.languages import english

extractor = APIExtractor(Session("https://en.wikipedia.org/w/api.php"), language=english)


WARNING:mw.api.session:Sending requests with default User-Agent.  Set 'user_agent' on api.Session to quiet this message.
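
The warning is harmless for a demo, but, as it suggests, passing a user_agent when constructing the Session quiets it. A sketch (the agent string is just an example value):

extractor = APIExtractor(
    Session("https://en.wikipedia.org/w/api.php",
            user_agent="revscoring demo <wiki@example.com>"),  # example value
    language=english)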

Now that we have a trained model and an extractor, we can combine them to construct a scorer.


In [8]:
scorer = MLScorer(extractor, model)

Now we can use the scorer to score new revisions. Note that the model was trained and tested on data that I made up, so it might not work all that well.


In [9]:
pprint(list(scorer.score([639744702, 639746884])))


[{'prediction': True,
  'probability': {False: 0.40110323543096332, True: 0.59889676456903684}},
 {'prediction': False,
  'probability': {False: 0.99942594986920574, True: 0.00057405013079438205}}]

And there we have it.
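
Since each score is a plain dict, downstream code can threshold on the probabilities directly. A small sketch, assuming the rev_id/score pairing shown above (the 0.9 cutoff is an arbitrary example):

rev_ids = [639744702, 639746884]
scores = list(scorer.score(rev_ids))

# Keep only revisions the model assigns at least 90% probability of
# belonging to the "True" class.
flagged = [rev_id for rev_id, score in zip(rev_ids, scores)
           if score['probability'][True] >= 0.9]
print(flagged)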

Part 3: Storing the model for later

Now, for the final part of this demo, we'll serialize the model to a file so that we can make use of it later. First, we store the model in a (fake) file using the model's dump() function.


In [10]:
from io import BytesIO

# Create a file.  We'll use a fake file for this demonstration. 
f = BytesIO()

# Ask the model to dump itself into the file. 
model.dump(f)
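
In practice, you'd dump to a real file on disk rather than a BytesIO buffer. A sketch (the filename is just an example):

# Hypothetical on-disk equivalent of the BytesIO dump above.
with open("demo.linear_svc.model", "wb") as model_file:
    model.dump(model_file)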

OK. Now that 'f' contains the model, we can read it back into a model and rebuild the scorer.


In [11]:
# Rewind the BytesIO file to the beginning so that we can read it.
f.seek(0)

# Use load() on the model class to read the file back in.
new_model = MLScorerModel.load(f)

# Rebuild the scorer
scorer = MLScorer(extractor, new_model)

# Score some revisions again.
pprint(list(scorer.score([639744702, 639746884])))


[{'prediction': True,
  'probability': {False: 0.40110323543096332, True: 0.59889676456903684}},
 {'prediction': False,
  'probability': {False: 0.99942594986920574, True: 0.00057405013079438205}}]
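
Loading from a real file works the same way. A sketch, assuming the example filename from the dump sketch above:

# Hypothetical on-disk equivalent of loading from the BytesIO buffer.
with open("demo.linear_svc.model", "rb") as model_file:
    new_model = MLScorerModel.load(model_file)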

Conclusion

And there you have it. We constructed an MLScorer model, trained it, tested it, used it to score live revisions, and stored the whole thing in a file so that we can make use of it later.