Purpose

We'd like a bot to monitor new pages in subsurfwiki.org. New pages should be examined for various features, given a score, and the worst offenders marked for deletion. Other pages with a poor score can simply be listed for patrol.

Approach

Use the mwclient library to list new pages, then visit those pages and parse them. We'll build lists and send them to an HTML page using some sort of template and (say) Bootstrap styling. We can just put that page on the server, to be visited from a subsurfwiki.org/reports page.


In [23]:
import mwclient
user = 'exbot'
password = 'Castafiore2001'
url = 'subsurfwiki.org'
path = '/mediawiki/'
site = mwclient.Site(url,path)
site.login(user,password)

Pseudocode

  1. Set up and log in
  2. Get a list of the new pages
  3. Step over the list and evaluate for various criteria (below)
  4. Rank the new pages according to their scores
  5. Put the ranked list, perhaps in sections, on a web page
  6. Email the result to certain people
  7. Consider adding special section for subsurface pages, sending to Juli

Criteria

  • Length
  • Incorrectly formatted titles
  • No references
  • No categories
  • Underlining, colours, linebreaks, and other poor formatting

New pages


In [19]:
new_pages = [new_page['title'] for new_page in site.recentchanges() if new_page['type'] == u'new' and new_page['ns'] == 0]

In [18]:
new_pages


Out[18]:
[u'Page 4',
 u'Page 3',
 u'Page 2',
 u'Page 1',
 u'Things you can do',
 u'Unsolved problems',
 u'Sequence stratigraphy calibration app',
 u'Seismic quality',
 u"Jean le Rond d'Alembert",
 u'Frackability',
 u'Uncertainty in spectral decomposition',
 u'Gabor uncertainty',
 u'Crude price/NG',
 u'Geophysics Hackathon 2013',
 u'Seismic processing']

In [21]:
import re
results = {}
for p in new_pages:
    page = site.Pages[p]
    # Skip redirects and subpages
    if page.redirect or ('/' in p):
        continue
    # Length: 0 bytes scores 0, 1000 bytes or more scores 5
    results[p] = [min(int(page.length/200),5)]
    # Categories: 1 per cat, up to max of 3
    results[p].append(min(len([c for c in page.categories()]),3))
    # Images: 1 per image, up to a max of 3
    results[p].append(min(len([c for c in page.images()]),3))
    # Backlinks: 1 per link, up to a max of 3
    results[p].append(min(len([c for c in page.backlinks()]),3))
    # Links: 1 per link, up to a max of 3
    results[p].append(min(len([c for c in page.links()]),3))
    # References: 1 per ref, up to a max of 3
    results[p].append(min(len(re.findall(r"<ref>",page.edit())),3))

    # Title case name: 0 or 1
    # This could be much improved!
    words = filter(None,re.split(r'[ -,\.]',p))
    titlecase = sum([word.istitle() for word in words[1:]])
    total = len(words)-1.05  # Protect against div by 0
    proportion = int( 5 * titlecase / total )
    results[p].append(proportion)  # Anything above 1 could well be title case

In [22]:
results


Out[22]:
{u'Frackability': [4, 2, 0, 1, 1, 1, 0],
 u'Gabor uncertainty': [5, 2, 0, 2, 1, 2, 0],
 u'Geophysics Hackathon 2013': [5, 3, 3, 3, 3, 0, 2],
 u"Jean le Rond d'Alembert": [5, 2, 2, 2, 0, 3, 2],
 u'Page 1': [0, 3, 1, 0, 0, 0, 0],
 u'Page 2': [0, 3, 1, 0, 0, 0, 0],
 u'Page 3': [0, 2, 1, 0, 0, 0, 0],
 u'Page 4': [0, 3, 1, 0, 0, 0, 0],
 u'Seismic processing': [5, 0, 0, 1, 1, 0, 0],
 u'Seismic quality': [5, 1, 2, 1, 3, 0, 0],
 u'Sequence stratigraphy calibration app': [5, 1, 0, 1, 2, 0, 0],
 u'Things you can do': [5, 0, 0, 0, 3, 2, 0],
 u'Uncertainty in spectral decomposition': [5, 3, 3, 0, 3, 3, 0]}

Results

Now we 'just' need to parse the results and build a display. The display can simply be a list.


In [44]:
good, bad = {}, {}
best_score = 0
for p in results:
    score = sum(results[p])
    if score >= 10:
        good[p] = results[p]
        if score > best_score:
            best = p
            best_score = score
    else:
        bad[p] = results[p]

worst_new_pages = sorted(bad, key=lambda x : sum(bad[x]))
worst_new_pages


Out[44]:
[u'Page 3',
 u'Page 4',
 u'Page 2',
 u'Page 1',
 u'Seismic processing',
 u'Frackability',
 u'Sequence stratigraphy calibration app']

In [45]:
best


Out[45]:
u'Geophysics Hackathon 2013'

In [31]:
possibly_titlecase = [r for r in results if results[r][6]>=2]
possibly_titlecase


Out[31]:
[u"Jean le Rond d'Alembert", u'Geophysics Hackathon 2013']

At this point, I think I need to depart from the world of Notebooks, and get onto the server.