Feature engineering

This notebook will teach you how to extract feature values using revscoring's built-in feature library, as well as how to build your own features.

Set up the feature extractor

This line constructs a "feature extractor" that uses Wikipedia's API. We'll need to use it later, so we'll construct it first.


In [29]:
import sys
sys.path.append("/usr/local/lib/python3.4/dist-packages/")
sys.path.append("/usr/local/lib/python3.4/dist-packages/revscoring/")
sys.path.append("/usr/local/lib/python3.4/dist-packages/more_itertools/")
sys.path.append("/usr/local/lib/python3.4/dist-packages/deltas/")

In [30]:
!sudo pip3 install dependencies deltas


Requirement already satisfied (use --upgrade to upgrade): dependencies in /usr/local/lib/python3.4/dist-packages
Requirement already satisfied (use --upgrade to upgrade): deltas in /usr/local/lib/python3.4/dist-packages
Cleaning up...

In [7]:
from revscoring.extractors import api
import mwapi

extractor = api.Extractor(mwapi.Session("https://en.wikipedia.org",
                                        user_agent="Revscoring feature demo ahalfaker@wikimedia.org"))

Extract features

The following cell demonstrates a simple feature extraction. We'll extract two features: wikitext.revision.chars, the number of characters in the entire revision; and wikitext.revision.diff.chars_added, the number of characters added by the edit. Note that we wrap the call in list() because it returns a generator.


In [5]:
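from revscoring.features import wikitext
# Revision ID assumed to match the one used in the cells that follow
list(extractor.extract(123456789, [wikitext.revision.chars, wikitext.revision.diff.chars_added]))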

Defining a custom feature

The next block defines a new feature and sets the dependencies to be the two features we just extracted. This feature represents the proportion of characters in the current version of the page that the current edit is responsible for adding.


In [8]:
from revscoring import Feature

chars_added_ratio_explicit = Feature(
    "chars_added_ratio_explicit", 
    lambda a,c: a/max(c, 1), # Prevents divide by zero
    depends_on=[wikitext.revision.diff.chars_added, 
                wikitext.revision.chars],
    returns=float)

list(extractor.extract(123456789, [chars_added_ratio_explicit]))


Out[8]:
[0.0002550369803621525]

There are easier ways to do this, though. revscoring.Feature overloads simple mathematical operators so that you can do math with features and get a feature back. revscoring.features.modifiers contains a set of basic functions that do the same. The code below roughly corresponds to the explicit definition above.


In [9]:
from revscoring.features import modifiers

chars_added_ratio_implicit = (wikitext.revision.diff.chars_added /
                              modifiers.max(wikitext.revision.chars, 1))

list(extractor.extract(123456789, [chars_added_ratio_implicit]))


Out[9]:
[0.0002550369803621525]

While the implicit pattern is quicker and easier than the explicit pattern, the name of the resulting feature cannot be customized.


In [10]:
chars_added_ratio_explicit, chars_added_ratio_implicit


Out[10]:
(<feature.chars_added_ratio_explicit>,
 <feature.(wikitext.revision.diff.chars_added / max(wikitext.revision.chars, 1))>)
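
To see why a feature built with operators ends up with a generated name, here is a minimal sketch of operator overloading on a feature-like class. ToyFeature and toy_max are hypothetical stand-ins written for this illustration; they are not revscoring's implementation, only a rough model of the idea.

```python
class ToyFeature:
    """Hypothetical stand-in for a feature: a name plus a function of a context dict."""

    def __init__(self, name, process):
        self.name = name
        self.process = process

    def __truediv__(self, other):
        # Dividing two features returns a new feature whose name is
        # generated from the operands, so the caller never chooses it.
        return ToyFeature(
            "({0} / {1})".format(self.name, other.name),
            lambda ctx: self.process(ctx) / other.process(ctx))

    def __repr__(self):
        return "<feature.{0}>".format(self.name)


def toy_max(feature, floor):
    # Plays the role of a max() modifier: returns a feature, not a number.
    return ToyFeature(
        "max({0}, {1})".format(feature.name, floor),
        lambda ctx: max(feature.process(ctx), floor))


chars_added = ToyFeature("chars_added", lambda ctx: ctx["added"])
chars = ToyFeature("chars", lambda ctx: ctx["total"])

ratio = chars_added / toy_max(chars, 1)
print(ratio)                                    # <feature.(chars_added / max(chars, 1))>
print(ratio.process({"added": 5, "total": 0}))  # 5.0
```

The name is assembled inside the division operator, which is why only the explicit Feature(...) form lets you pick it yourself.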

Extracting datasources

There is also a set of revscoring.Datasource objects that are part of the dependency injection system. These "datasources" represent the data needed for feature generation. We can extract them just like revscoring.Feature objects.


In [11]:
list(extractor.extract(662953550, [wikitext.revision.diff.datasources.segments_added,
                                   wikitext.revision.diff.datasources.segments_removed]))


Out[11]:
[['Ideology and policies',
  'Political scientists [[Robert Ford]] and [[Matthew Goodwin]] characterised UKIP as "a radical right party".{{sfn|Ford|Goodwin|2014|p=13}}\n\n',
  '{{fact}}',
  '{{fact}}',
  '{{fact}}',
  '{{fact}}',
  '{{fact}}',
  '{{fact}}',
  '{{fact}}',
  '{{fact}}',
  '{{fact}}',
  '{{fact}}'],
 ['Policies']]

OK. Let's define a new feature that counts the number of templates added. I'll use mwparserfromhell to parse each added segment. See the mwparserfromhell docs.


In [12]:
import mwparserfromhell as mwp

templates_added = Feature("templates_added", 
                          # Sum the template counts of all added segments
                          lambda add_segments: sum(len(mwp.parse(s).filter_templates()) for s in add_segments),
                          depends_on=[wikitext.revision.diff.datasources.segments_added],
                          returns=int)
list(extractor.extract(662953550, [templates_added]))


Out[12]:
[11]
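
Counting templates is not the same as counting segments that contain a template, and the two only agree when every segment holds at most one template. The sketch below shows the difference on made-up segments. So that it runs without extra packages, it swaps in a naive regular expression that counts "{{" openers (an assumption for this illustration only); mwparserfromhell's filter_templates() is the more robust choice on real wikitext.

```python
import re

# Naive stand-in for a real parser: counts "{{" openers.
# It can miscount on real wikitext (e.g. triple-brace parameters); illustration only.
TEMPLATE_OPEN = re.compile(r"\{\{")


def count_templates(segment):
    return len(TEMPLATE_OPEN.findall(segment))


# Made-up segments; the last one contains two templates.
segments = ["Ideology and policies",
            "{{fact}}",
            "Claim one.{{fact}} Claim two.{{sfn|Ford|Goodwin|2014|p=13}}"]

templates = sum(count_templates(s) for s in segments)
segments_with_templates = sum(count_templates(s) > 0 for s in segments)

print(templates)                # 3
print(segments_with_templates)  # 2
```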

Debugging

There are some facilities in place to help you make sense of issues when they arise. The most important is the draw function.


In [13]:
from revscoring.dependencies import draw
print(draw(templates_added))


 - <feature.templates_added>
	 - <datasource.wikitext.revision.diff.segments_added>
		 - <datasource.wikitext.revision.diff.operations>
			 - <datasource.tokenized(datasource.revision.parent.text)>
				 - <datasource.revision.parent.text>
			 - <datasource.tokenized(datasource.revision.text)>
				 - <datasource.revision.text>

In the tree structure above, you can see how our new feature depends on wikitext.revision.diff.segments_added, which depends on wikitext.revision.diff.operations, which depends (as you might imagine) on the text of the current and parent revisions. Some features can get quite complicated.


In [14]:
print(draw(wikitext.revision.diff.number_prop_delta_sum))


 - <feature.wikitext.revision.diff.number_prop_delta_sum>
	 - <datasource.values(<datasource.wikitext.revision.diff.number_prop_delta>)>
		 - <datasource.wikitext.revision.diff.number_prop_delta>
			 - <datasource.wikitext.revision.parent.number_frequency>
				 - <datasource.wikitext.revision.parent.numbers>
					 - <datasource.tokenized(datasource.revision.parent.text)>
						 - <datasource.revision.parent.text>
			 - <datasource.wikitext.revision.diff.number_delta>
				 - <datasource.wikitext.revision.parent.number_frequency>
					 - <datasource.wikitext.revision.parent.numbers>
						 - <datasource.tokenized(datasource.revision.parent.text)>
							 - <datasource.revision.parent.text>
				 - <datasource.wikitext.revision.number_frequency>
					 - <datasource.wikitext.revision.numbers>
						 - <datasource.tokenized(datasource.revision.text)>
							 - <datasource.revision.text>

The dependency injection system will only solve a unique dependency once for a given tree. So, even though <revision.parent.text> appears twice above, it will only be extracted once and then cached. This lets multiple features share large sections of their dependency trees and therefore minimizes resource usage.
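
The solve-once-then-cache behavior can be sketched in a few lines. Dependent and solve below are hypothetical names invented for this example, and the cache is keyed by name on the assumption that names are unique; revscoring's own solver is more involved, so treat this as a model of the idea rather than its implementation.

```python
class Dependent:
    """Hypothetical stand-in for a datasource or feature."""

    def __init__(self, name, process, depends_on=()):
        self.name = name
        self.process = process
        self.depends_on = list(depends_on)


def solve(dependent, cache, calls):
    """Solves `dependent`, computing each unique dependency at most once."""
    if dependent.name in cache:
        return cache[dependent.name]
    args = [solve(d, cache, calls) for d in dependent.depends_on]
    calls.append(dependent.name)  # record every actual computation
    cache[dependent.name] = dependent.process(*args)
    return cache[dependent.name]


parent_text = Dependent("revision.parent.text", lambda: "a 1 b 22")
numbers = Dependent("parent.numbers",
                    lambda text: [t for t in text.split() if t.isdigit()],
                    depends_on=[parent_text])
number_count = Dependent("parent.number_count", len, depends_on=[numbers])
char_count = Dependent("parent.chars", len, depends_on=[parent_text])

# Both features share one cache, so the shared root is solved only once.
cache, calls = {}, []
values = [solve(f, cache, calls) for f in (number_count, char_count)]
print(values)                               # [2, 8]
print(calls.count("revision.parent.text"))  # 1
```

Sharing one cache across all requested features is what makes the expensive root (here, the parent revision's text) a one-time cost.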

Errors during extraction

Expect a revscoring.Extractor to raise an exception if a resource it needs is missing during extraction. The error messages are intended to convey clearly what went wrong.


In [15]:
try:
    list(extractor.extract(2, [wikitext.revision.diff.words_added]))
except Exception as e:
    print(e)


RevisionNotFound: Could not find revision ({revision}:2)

In [11]:
try:
    list(extractor.extract(262721924, [wikitext.revision.diff.words_added]))
except Exception as e:
    print(e)


TextDeleted: Text deleted (<revision.text>)

In [12]:
from revscoring.features import revision_oriented
try:
    list(extractor.extract(172665816, [revision_oriented.revision.comment_matches("foo")]))
except Exception as e:
    print(e)


CommentDeleted: Comment deleted (<revision.comment>)

In [13]:
from revscoring.features import temporal
try:
    list(extractor.extract(591839757, [revision_oriented.revision.user.text_matches("foo")]))
except Exception as e:
    print(e)


UserDeleted: User deleted ({revision.user})
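
When extracting features for many revisions, it is usually preferable to record these errors per revision and keep going rather than let one missing revision stop the batch. The sketch below shows that pattern. The exception classes are defined locally and only borrow the names printed above, since the notebook does not show where the real ones are imported from or how they are related; fake_extract and extract_all are likewise hypothetical.

```python
class MissingResource(Exception):
    """Hypothetical base class for 'resource unavailable' errors."""


class RevisionNotFound(MissingResource):
    pass


class TextDeleted(MissingResource):
    pass


def fake_extract(rev_id, features):
    # Stand-in for extractor.extract(); fails for two made-up revision IDs.
    if rev_id == 2:
        raise RevisionNotFound("Could not find revision {0}".format(rev_id))
    if rev_id == 3:
        raise TextDeleted("Text deleted for revision {0}".format(rev_id))
    return [float(rev_id) for _ in features]


def extract_all(rev_ids, features, extract=fake_extract):
    """Returns (values by rev_id, error name by rev_id)."""
    values, errors = {}, {}
    for rev_id in rev_ids:
        try:
            # list() sits inside the try so errors raised lazily are caught too.
            values[rev_id] = list(extract(rev_id, features))
        except MissingResource as e:
            # Skip this revision but remember why, so the batch keeps going.
            errors[rev_id] = type(e).__name__
    return values, errors


values, errors = extract_all([1, 2, 3], ["feature_a", "feature_b"])
print(values)  # {1: [1.0, 1.0]}
print(errors)  # {2: 'RevisionNotFound', 3: 'TextDeleted'}
```

Catching a narrow base class instead of Exception keeps genuine bugs (for example a TypeError in a custom feature) visible.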