XML Processing Example: Extract link count changes

This notebook details how to use the mwxml python library to efficiently process an entire Wikipedia-sized historical XML dump. In this example, we'll extract image link-count change events from the history of Dutch Wikipedia.


In [1]:
import mwxml

Step 1: Gather the paths to all of the dump files

On tool labs, the XML dumps are available in /public/dumps/public/. We're going to use python's glob library to get the paths of the Dutch Wikipedia dump (December 02, 2015) that contains the text of all revisions.


In [2]:
import glob

paths = glob.glob('/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history*.xml*.bz2')
paths


Out[2]:
['/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history4.xml.bz2',
 '/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history2.xml.bz2',
 '/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history3.xml.bz2',
 '/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history1.xml.bz2']

Step 2: Define the image link extractor.

Here we're using a regular expression to extract image links from the revision text of articles. Nothing fancy here.


In [3]:
import re

EXTS = ["png", "gif", "jpg", "jpeg"]
# [[(file|image):<file>.<ext>]]
IMAGE_LINK_RE = re.compile(r"\[\[" + 
                           r"(file|image|afbeelding|bestand):" +  # Group 1
                           r"([^\]]+.(" + "|".join(EXTS) + r"))" +  # Group 2 & 3
                           r"(|[^\]]+)?" +  # Group 4
                           r"\]\]")

def extract_image_links(text):
  for m in IMAGE_LINK_RE.finditer(text):
    yield m.group(2)

Step 3: Run the XML dump processor on the paths

This is the part that mwxml can help you do easily. You need to define a process_dump function that takes two arguements: dump : mwxml.Dump and a path : str

In the example, below, we iterate through the pages in the dump, and keep track of how many image links we saw in the last revision with last_count. If the delta isn't 0, we yield some values. It's very important that the process_dump function either yields something or returns an iterable. We'll explain why in a moment.


In [4]:
def process_dump(dump, path):
  for page in dump:
    last_count = 0
    for revision in page:
      image_links = list(extract_image_links(revision.text or ""))
      delta = len(image_links) - last_count
      if delta != 0:
        yield revision.id, revision.timestamp, delta
      last_count = len(image_links)

OK. Now that everything is defined, it's time to run the code. mwxml has a map() function that applied the process_dump function each of the XML dump file in paths -- in parallel -- using python's multiprocessing library and collects all of the yielded values in a generator. As the code below demonstrates, it's easy to collect this output and write it to a new output file or print it out to the console (not recommended for large amounts of output).


In [5]:
count = 0
for rev_id, rev_timestamp, delta in mwxml.map(process_dump, paths):
    print("\t".join(str(v) for v in [rev_id, rev_timestamp, delta]))
    count += 1
    if count > 15:
        break


5968207	2006-11-30T21:34:17Z	3
8992798	2007-08-19T03:19:00Z	-1
8996924	2007-08-19T12:38:26Z	1
9000899	2007-08-19T16:05:50Z	-3
5969056	2006-11-30T22:08:43Z	1
14712696	2008-11-29T21:22:49Z	-1
3110580	2006-02-09T11:33:12Z	27
8336191	2007-06-14T21:10:42Z	-27
16705457	2009-05-06T18:33:02Z	1
16750330	2009-05-09T11:59:22Z	1
16785196	2009-05-11T21:13:18Z	-1
16705928	2009-05-06T19:04:06Z	1
16738884	2009-05-08T18:06:57Z	-1
27871	2003-01-12T22:49:21Z	1
1161182	2005-05-25T11:45:28Z	-1
1162073	2005-05-25T11:46:50Z	1

Conclusion

That's it! And we only wrote ~25 lines of code.


In [ ]: