We'd like a bot to monitor new pages on subsurfwiki.org. Each new page should be examined for various features and given a score; the worst offenders will be marked for deletion, while other pages with a poor score can simply be listed for patrol.
We'll use the mwclient library to list new pages, then visit those pages and parse them. We'll build lists and send them to an HTML page using some sort of template and (say) Bootstrap styling. We can then put that page on the server, to be visited from a subsurfwiki.org/reports page.
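Before getting into the wiki side, here's a minimal sketch of the templating step using only the standard library. The template strings, Bootstrap classes, and the `render_report` function are assumptions for illustration, not the final report markup:

```python
from string import Template

# Hypothetical templates; a real report would carry full Bootstrap boilerplate.
ROW = Template('<tr><td><a href="https://subsurfwiki.org/wiki/$title">$title</a></td>'
               '<td>$score</td></tr>')
PAGE = Template('<table class="table table-striped">$rows</table>')

def render_report(scores):
    """Render a dict of {title: score} as an HTML table fragment,
    worst (lowest-scoring) pages first."""
    rows = ''.join(ROW.substitute(title=t, score=s)
                   for t, s in sorted(scores.items(), key=lambda kv: kv[1]))
    return PAGE.substitute(rows=rows)

html = render_report({'Gabbro': 3, 'Seismic attribute': 12})
```

Writing `html` to a file in the web root would then give us the subsurfwiki.org/reports page.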
In [23]:
import mwclient

user = 'exbot'
password = 'Castafiore2001'
url = 'subsurfwiki.org'
path = '/mediawiki/'

site = mwclient.Site(url, path=path)
site.login(user, password)
In [19]:
new_pages = [rc['title'] for rc in site.recentchanges()
             if rc['type'] == 'new' and rc['ns'] == 0]
In [18]:
new_pages
Out[18]:
In [21]:
import re
results = {}
for p in new_pages:
    page = site.Pages[p]
    # Skip redirects and subpages
    if page.redirect or ('/' in p):
        continue
    # Length: 0 bytes scores 0, 1000 bytes or more scores 5
    results[p] = [min(int(page.length / 200), 5)]
    # Categories: 1 per cat, up to a max of 3
    results[p].append(min(len(list(page.categories())), 3))
    # Images: 1 per image, up to a max of 3
    results[p].append(min(len(list(page.images())), 3))
    # Backlinks: 1 per link, up to a max of 3
    results[p].append(min(len(list(page.backlinks())), 3))
    # Links: 1 per link, up to a max of 3
    results[p].append(min(len(list(page.links())), 3))
    # References: 1 per ref, up to a max of 3
    results[p].append(min(len(re.findall(r"<ref>", page.edit())), 3))
    # Title case name: 0 to 5
    # This could be much improved!
    # NB the hyphen must be escaped: '[ -,\.]' is a character range
    words = [w for w in re.split(r'[ \-,.]', p) if w]
    titlecase = sum(word.istitle() for word in words[1:])
    total = max(len(words) - 1, 1)  # Protect against div by 0
    proportion = int(5 * titlecase / total)
    results[p].append(proportion)  # Anything above 1 could well be title case
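As the comment says, the title-case check could be much improved. One obvious refinement is to ignore short stopwords, which are conventionally lowercase even in title-case names. A sketch, where the stopword list and function name are my own assumptions:

```python
import re

# Assumed stopword list; extend to taste.
STOPWORDS = {'a', 'an', 'and', 'in', 'of', 'or', 'the', 'to'}

def titlecase_score(title, scale=5):
    """Score 0..scale for how title-case a page name looks,
    ignoring the first word and common stopwords."""
    words = [w for w in re.split(r'[ \-,.]', title) if w]
    candidates = [w for w in words[1:] if w.lower() not in STOPWORDS]
    if not candidates:
        return 0  # Single-word titles can't be title case
    capped = sum(w.istitle() for w in candidates)
    return int(scale * capped / len(candidates))
```

With this, "The Age of the Earth" scores 5 because "of" and "the" no longer drag the proportion down.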
In [22]:
results
Out[22]:
In [44]:
good, bad = {}, {}
best_score = 0
for p in results:
    score = sum(results[p])
    if score >= 10:
        good[p] = results[p]
        if score > best_score:
            best = p
            best_score = score
    else:
        bad[p] = results[p]

worst_new_pages = sorted(bad, key=lambda x: sum(bad[x]))
worst_new_pages
Out[44]:
In [45]:
best
Out[45]:
In [31]:
possibly_titlecase = [r for r in results if results[r][6] >= 2]
possibly_titlecase
Out[31]:
At this point, I think I need to depart from the world of Notebooks and get onto the server.
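On the server, marking the worst offenders is just a matter of prepending a deletion template to their wikitext. Building the marked-up text is pure string work; the save call in the comment is an assumption, since the exact read/write method names vary between mwclient versions:

```python
def mark_for_deletion(text, reason='Low-scoring new page'):
    """Prepend a {{delete}} template, unless the page is already marked."""
    if text.lstrip().startswith('{{delete'):
        return text  # Idempotent: don't mark twice
    return '{{delete|%s}}\n%s' % (reason, text)

# On the live wiki, something like (check your mwclient version's API):
# page = site.Pages[title]
# page.save(mark_for_deletion(page.edit()), summary='Bot: marking for deletion')
```

The idempotency check matters because the bot will run repeatedly and should not stack deletion templates on pages awaiting review.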