Purpose

We'd like a bot to monitor new pages in subsurfwiki.org. New pages should be examined for various features, given a score, and the worst offenders marked for deletion. Other pages with a poor score can simply be listed for patrol.

Approach

Use the mwclient library to list new pages, then visit those pages and parse them. We'll build lists and send them to an HTML page using some sort of template and (say) Bootstrap styling. We can just put that page on the server, to be visited from a subsurfwiki.org/reports page.



In [23]:

    
import mwclient
user = 'exbot'
password = 'Castafiore2001'
url = 'subsurfwiki.org'
path = '/mediawiki/'
site = mwclient.Site(url,path)
site.login(user,password)

Pseudocode

Set up and log in
Get a list of the new pages
Step over the list and evaluate for various criteria (below)
Rank the new pages according to their scores
Put the ranked list, perhaps in sections, on a web page
Email the result to certain people
Consider adding special section for subsurface pages, sending to Juli

Criteria

Length
Incorrectly formatted titles
No references
No categories
Underlining, colours, linebreaks, and other poor formatting

New pages



In [19]:

    
new_pages = [new_page['title'] for new_page in site.recentchanges() if new_page['type'] == u'new' and new_page['ns'] == 0]



In [18]:

    
new_pages









    Out[18]:





[u'Page 4',
 u'Page 3',
 u'Page 2',
 u'Page 1',
 u'Things you can do',
 u'Unsolved problems',
 u'Sequence stratigraphy calibration app',
 u'Seismic quality',
 u"Jean le Rond d'Alembert",
 u'Frackability',
 u'Uncertainty in spectral decomposition',
 u'Gabor uncertainty',
 u'Crude price/NG',
 u'Geophysics Hackathon 2013',
 u'Seismic processing']



In [21]:

    
import re
results = {}
for p in new_pages:
    page = site.Pages[p]
    # Skip redirects and subpages
    if page.redirect or ('/' in p):
        continue
    # Length: 0 bytes scores 0, 1000 bytes or more scores 5
    results[p] = [min(int(page.length/200),5)]
    # Categories: 1 per cat, up to max of 3
    results[p].append(min(len([c for c in page.categories()]),3))
    # Images: 1 per image, up to a max of 3
    results[p].append(min(len([c for c in page.images()]),3))
    # Backlinks: 1 per link, up to a max of 3
    results[p].append(min(len([c for c in page.backlinks()]),3))
    # Links: 1 per link, up to a max of 3
    results[p].append(min(len([c for c in page.links()]),3))
    # References: 1 per ref, up to a max of 3
    results[p].append(min(len(re.findall(r"<ref>",page.edit())),3))

    # Title case name: 0 or 1
    # This could be much improved!
    words = filter(None,re.split(r'[ -,\.]',p))
    titlecase = sum([word.istitle() for word in words[1:]])
    total = len(words)-1.05  # Protect against div by 0
    proportion = int( 5 * titlecase / total )
    results[p].append(proportion)  # Anything above 1 could well be title case



In [22]:

    
results









    Out[22]:





{u'Frackability': [4, 2, 0, 1, 1, 1, 0],
 u'Gabor uncertainty': [5, 2, 0, 2, 1, 2, 0],
 u'Geophysics Hackathon 2013': [5, 3, 3, 3, 3, 0, 2],
 u"Jean le Rond d'Alembert": [5, 2, 2, 2, 0, 3, 2],
 u'Page 1': [0, 3, 1, 0, 0, 0, 0],
 u'Page 2': [0, 3, 1, 0, 0, 0, 0],
 u'Page 3': [0, 2, 1, 0, 0, 0, 0],
 u'Page 4': [0, 3, 1, 0, 0, 0, 0],
 u'Seismic processing': [5, 0, 0, 1, 1, 0, 0],
 u'Seismic quality': [5, 1, 2, 1, 3, 0, 0],
 u'Sequence stratigraphy calibration app': [5, 1, 0, 1, 2, 0, 0],
 u'Things you can do': [5, 0, 0, 0, 3, 2, 0],
 u'Uncertainty in spectral decomposition': [5, 3, 3, 0, 3, 3, 0]}

Results

Now we 'just' need to parse the results and build a display. The display can simply be a list.



In [44]:

    
good, bad = {}, {}
best_score = 0
for p in results:
    score = sum(results[p])
    if score >= 10:
        good[p] = results[p]
        if score > best_score:
            best = p
            best_score = score
    else:
        bad[p] = results[p]

worst_new_pages = sorted(bad, key=lambda x : sum(bad[x]))
worst_new_pages









    Out[44]:





[u'Page 3',
 u'Page 4',
 u'Page 2',
 u'Page 1',
 u'Seismic processing',
 u'Frackability',
 u'Sequence stratigraphy calibration app']



In [45]:

    
best









    Out[45]:





u'Geophysics Hackathon 2013'



In [31]:

    
possibly_titlecase = [r for r in results if results[r][6]>=2]
possibly_titlecase









    Out[31]:





[u"Jean le Rond d'Alembert", u'Geophysics Hackathon 2013']

At this point, I think I need to depart from the world of Notebooks, and get onto the server.