In [1]:
from revscoring.extractors import api
import mwapi
extractor = api.Extractor(mwapi.Session("https://en.wikipedia.org",
                                        user_agent="Revscoring feature demo ahalfaker@wikimedia.org"))
The following line demonstrates a simple feature extraction. We'll extract two features: wikitext.revision.chars, the number of characters in the entire revision; and wikitext.revision.diff.chars_added, the number of characters added. Note that we wrap the call in a list() because it returns a generator.
In [2]:
from revscoring.features import wikitext
list(extractor.extract(123456789, [wikitext.revision.chars,
                                   wikitext.revision.diff.chars_added]))
Out[2]:
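To see why the list() wrapper matters, here is a small stand-alone sketch. The function fake_extract and its values are hypothetical stand-ins invented for illustration; they are not part of revscoring and make no network calls. The sketch only shows the general behavior of a Python generator: nothing is computed until it is consumed, and it can be consumed only once.

```python
def fake_extract(rev_id, features):
    """Hypothetical stand-in for an extract call: lazily yields one value per feature."""
    made_up_values = {"chars": 100, "chars_added": 10}  # invented numbers, not real data
    for name in features:
        yield made_up_values[name]


gen = fake_extract(1, ["chars", "chars_added"])
print(gen)           # a generator object; no values have been produced yet

values = list(gen)   # list() drives the generator to completion
print(values)

leftover = list(gen) # the generator is now exhausted, so this list is empty
print(leftover)
```

If you need the values more than once, keep the list rather than the generator.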
In [3]:
from revscoring import Feature
chars_added_ratio_explicit = Feature(
    "chars_added_ratio_explicit",
    lambda a, c: a / max(c, 1),  # Prevents divide by zero
    depends_on=[wikitext.revision.diff.chars_added,
                wikitext.revision.chars],
    returns=float)
list(extractor.extract(123456789, [chars_added_ratio_explicit]))
Out[3]:
There are easier ways to do this, though. revscoring.Feature overloads simple mathematical operators so that you can do math with features and get a feature back. revscoring.features.modifiers contains a set of basic functions that do the same. This code roughly corresponds to what's going on above.
In [4]:
from revscoring.features import modifiers
chars_added_ratio_implicit = (wikitext.revision.diff.chars_added /
                              modifiers.max(wikitext.revision.chars, 1))
list(extractor.extract(123456789, [chars_added_ratio_implicit]))
Out[4]:
While the implicit pattern is quicker and easier than the explicit pattern, the name of the resulting feature cannot be customized.
In [5]:
chars_added_ratio_explicit, chars_added_ratio_implicit
Out[5]:
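For intuition about how the implicit pattern can work, here is a minimal sketch of operator overloading on a feature-like object. ToyFeature, toy_max, and _as_feature are hypothetical names made up for this example; this is not revscoring's implementation, and the real naming scheme may differ. The point is only that dividing two objects can return a new object whose name is derived from its operands, which is why that name isn't chosen by you.

```python
class ToyFeature:
    """Hypothetical, simplified feature; NOT revscoring's Feature class."""

    def __init__(self, name, process, depends_on=()):
        self.name = name
        self.process = process
        self.depends_on = list(depends_on)

    def __truediv__(self, other):
        # Dividing features returns a new feature with an auto-generated name.
        other = _as_feature(other)
        return ToyFeature("({0} / {1})".format(self.name, other.name),
                          lambda a, b: a / b,
                          depends_on=[self, other])

    def solve(self, values):
        """Compute the value, reading base features from a dict keyed by name."""
        if self.name in values:
            return values[self.name]
        args = [dep.solve(values) for dep in self.depends_on]
        return self.process(*args)

    def __repr__(self):
        return "<{0}>".format(self.name)


def _as_feature(value):
    """Wrap a plain number as a constant feature."""
    if isinstance(value, ToyFeature):
        return value
    return ToyFeature(str(value), lambda: value)


def toy_max(feature, floor):
    """Hypothetical analogue of a max() modifier."""
    floor = _as_feature(floor)
    return ToyFeature("max({0}, {1})".format(feature.name, floor.name),
                      max, depends_on=[feature, floor])


# Base features have no process; their values must be supplied to solve().
chars = ToyFeature("chars", None)
chars_added = ToyFeature("chars_added", None)

ratio = chars_added / toy_max(chars, 1)
print(ratio)
result = ratio.solve({"chars": 0, "chars_added": 5})  # max(0, 1) avoids dividing by zero
print(result)
```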
In [6]:
list(extractor.extract(662953550, [wikitext.revision.diff.datasources.segments_added,
                                   wikitext.revision.diff.datasources.segments_removed]))
Out[6]:
OK. Let's define a new feature for counting the number of templates added. I'll make use of mwparserfromhell to do this. See the docs.
In [7]:
import mwparserfromhell as mwp
templates_added = Feature("templates_added",
                          # Sum the number of templates found in each added segment
                          lambda add_segments: sum(len(mwp.parse(s).filter_templates())
                                                   for s in add_segments),
                          depends_on=[wikitext.revision.diff.datasources.segments_added],
                          returns=int)
list(extractor.extract(662953550, [templates_added]))
Out[7]:
In [8]:
from revscoring.dependencies import draw
print(draw(templates_added))
In the tree structure above, you can see how our new feature depends on wikitext.revision.diff.segments_added which depends on wikitext.revision.diff.operations which depends (as you might imagine) on the current and parent revision. Some features can get quite complicated.
In [9]:
print(draw(wikitext.revision.diff.number_prop_delta_sum))
The dependency injection system will only solve a unique dependency once for a given tree. So, even though <revision.parent.text> appears twice above, it will only be extracted once and then cached. This allows multiple features to share large sections of their dependency trees, and therefore minimizes resource usage.
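The solve-once-then-cache behavior described above can be sketched with a small recursive solver. Dependent, solve, and all the names and strings below are hypothetical and invented for illustration; this is not how revscoring is implemented, just one simple way to get the same effect with a shared cache.

```python
class Dependent:
    """Hypothetical node in a dependency tree; NOT revscoring's class."""

    def __init__(self, name, process, depends_on=()):
        self.name = name
        self.process = process
        self.depends_on = list(depends_on)


def solve(dependent, cache, calls):
    """Solve a dependent, reusing any value already stored in the cache."""
    if dependent.name in cache:
        return cache[dependent.name]
    args = [solve(dep, cache, calls) for dep in dependent.depends_on]
    calls.append(dependent.name)  # record each time a process actually runs
    value = dependent.process(*args)
    cache[dependent.name] = value
    return value


# Two made-up "features" that both need the parent revision's text.
parent_text = Dependent("parent.text", lambda: "old text")
current_text = Dependent("revision.text", lambda: "old text plus more")
parent_chars = Dependent("parent.chars", len, [parent_text])
chars_delta = Dependent("chars_delta",
                        lambda cur, par: len(cur) - len(par),
                        [current_text, parent_text])

cache, calls = {}, []  # one cache shared across all requested features
results = [solve(f, cache, calls) for f in (parent_chars, chars_delta)]
print(results)
print(calls)  # "parent.text" appears once even though two features depend on it
```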
In [10]:
try:
    list(extractor.extract(2, [wikitext.revision.diff.words_added]))
except Exception as e:
    print(e)
In [11]:
try:
    list(extractor.extract(262721924, [wikitext.revision.diff.words_added]))
except Exception as e:
    print(e)
In [12]:
from revscoring.features import revision_oriented
try:
    list(extractor.extract(172665816, [revision_oriented.revision.comment_matches("foo")]))
except Exception as e:
    print(e)
In [13]:
from revscoring.features import temporal
try:
    list(extractor.extract(591839757, [revision_oriented.revision.user.text_matches("foo")]))
except Exception as e:
    print(e)