This notebook shows how to use the mwxml Python library to efficiently process an entire Wikipedia-sized historical XML dump. In this example, we'll extract image link-count change events from the history of Dutch Wikipedia.
In [1]:
import mwxml
In [2]:
import glob
paths = glob.glob('/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history*.xml*.bz2')
paths
In [3]:
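import re

EXTS = ["png", "gif", "jpg", "jpeg"]

# Matches [[(file|image|afbeelding|bestand):<file>.<ext>|<options>]]
IMAGE_LINK_RE = re.compile(
    r"\[\[" +
    r"(file|image|afbeelding|bestand):" +  # Group 1: namespace prefix
    r"([^\]|]+\.(" + "|".join(EXTS) + r"))" +  # Group 2 & 3: file name, extension
    r"(\|[^\]]+)?" +  # Group 4: optional "|options" part
    r"\]\]",
    re.IGNORECASE)  # prefixes and extensions appear in mixed case (File:, .JPG)


def extract_image_links(text):
    """Yield the file name of each image link found in the wikitext."""
    for m in IMAGE_LINK_RE.finditer(text):
        yield m.group(2)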
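A quick way to sanity-check this kind of pattern is to run it over a hand-written snippet of wikitext. The block below is a minimal, self-contained sketch: the regex is a simplified variant written for illustration (escaped dot and pipe, case-insensitive matching), and the sample text is made up.

```python
import re

# Assumption: a simplified pattern for illustration, not necessarily the
# exact regex used elsewhere in this notebook.
EXTS = ["png", "gif", "jpg", "jpeg"]
IMAGE_LINK_RE = re.compile(
    r"\[\[(?:file|image|afbeelding|bestand):"   # namespace prefix
    r"([^\]|]+\.(?:" + "|".join(EXTS) + r"))"   # group 1: file name with extension
    r"(?:\|[^\]]*)?"                            # optional "|options|caption"
    r"\]\]",
    re.IGNORECASE,
)


def extract_image_links(text):
    """Yield the file name of each image link in the text."""
    for m in IMAGE_LINK_RE.finditer(text):
        yield m.group(1)


# Made-up wikitext: two image links, one plain link, one non-image file.
sample = (
    "Intro [[Bestand:Molen.jpg|thumb|Een molen]] text "
    "[[File:Map.png]] and [[Amsterdam]] and [[Bestand:Notes.pdf]]."
)
links = list(extract_image_links(sample))
print(links)
```

Only the two links whose names end in one of the listed extensions are picked up; the plain article link and the PDF are skipped.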
This is the part that mwxml makes easy. You need to define a process_dump function that takes two arguments: dump (an mwxml.Dump) and path (a str).
In the example below, we iterate through the pages in the dump and use last_count to keep track of how many image links we saw in the previous revision. If the delta isn't 0, we yield the revision ID, the timestamp, and the delta. It's very important that the process_dump function either yields something or returns an iterable. We'll explain why in a moment.
In [4]:
def process_dump(dump, path):
    for page in dump:
        # Every page starts from zero image links
        last_count = 0
        for revision in page:
            image_links = list(extract_image_links(revision.text or ""))
            delta = len(image_links) - last_count
            if delta != 0:
                yield revision.id, revision.timestamp, delta
            last_count = len(image_links)
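To see the delta logic in isolation, without a real dump, the same loop can be run over a few hand-made revisions. This is only a sketch: FakeRevision, count_image_links, and link_count_deltas are hypothetical stand-ins invented for this illustration and are not part of mwxml, and the counter is deliberately crude.

```python
from collections import namedtuple

# Hypothetical stand-in for a revision object (illustration only, not mwxml).
FakeRevision = namedtuple("FakeRevision", ["id", "timestamp", "text"])


def count_image_links(text):
    # Crude stand-in counter for this sketch: counts "[[Bestand:" prefixes.
    return text.count("[[Bestand:")


def link_count_deltas(pages):
    """Yield (rev_id, timestamp, delta) whenever the image link count changes."""
    for page in pages:
        last_count = 0
        for revision in page:
            current = count_image_links(revision.text or "")
            delta = current - last_count
            if delta != 0:
                yield revision.id, revision.timestamp, delta
            last_count = current


# A "page" here is simply a list of revisions in chronological order.
page = [
    FakeRevision(1, "2015-01-01T00:00:00Z", "No images yet."),
    FakeRevision(2, "2015-01-02T00:00:00Z", "[[Bestand:A.png]] [[Bestand:B.jpg]]"),
    FakeRevision(3, "2015-01-03T00:00:00Z", "[[Bestand:A.png]] [[Bestand:B.jpg]] more text"),
    FakeRevision(4, "2015-01-04T00:00:00Z", "[[Bestand:A.png]]"),
    FakeRevision(5, "2015-01-05T00:00:00Z", None),  # missing text counts as empty
]
events = list(link_count_deltas([page]))
for event in events:
    print(event)
```

Revisions 1 and 3 produce no event because the count did not change; revisions 2, 4, and 5 yield deltas of 2, -1, and -1.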
OK. Now that everything is defined, it's time to run the code. mwxml has a map() function that applies the process_dump function to each of the XML dump files in paths, in parallel, using Python's multiprocessing library, and collects all of the yielded values in a generator. This is why process_dump has to yield or return an iterable: map() iterates over its result to gather the values. As the code below demonstrates, it's easy to collect this output and write it to a new output file or print it out to the console (not recommended for large amounts of output).
In [5]:
count = 0
for rev_id, rev_timestamp, delta in mwxml.map(process_dump, paths):
    print("\t".join(str(v) for v in [rev_id, rev_timestamp, delta]))
    count += 1
    if count > 15:
        break
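For a full run, writing to a file is usually more practical than printing. The sketch below shows one way to do that with the standard csv module. The write_events helper, the column names, and the output file name are assumptions made for this illustration; the demo feeds it hand-made tuples instead of real map() output so that it runs without any dump files.

```python
import csv
import os
import tempfile


def write_events(events, out_path):
    """Write (rev_id, timestamp, delta) tuples as TSV and return the row count."""
    written = 0
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["rev_id", "timestamp", "delta"])  # header row
        for rev_id, rev_timestamp, delta in events:
            writer.writerow([rev_id, str(rev_timestamp), delta])
            written += 1
    return written


# Demo with made-up events written to a temporary directory.
demo_events = [
    (1, "2015-01-01T00:00:00Z", 2),
    (2, "2015-01-02T00:00:00Z", -1),
]
out_path = os.path.join(tempfile.mkdtemp(), "image_link_deltas.tsv")
n_written = write_events(demo_events, out_path)

with open(out_path) as f:
    lines = f.read().splitlines()
print(n_written, lines)
```

Because the helper accepts any iterable, the generator returned by mwxml.map(process_dump, paths) could be passed in directly in place of demo_events, so rows are written as they arrive rather than being held in memory.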