We'll try implementing the clustering method described in chapter 3 of Toby Segaran's book, Programming Collective Intelligence, but using Chicago City Council documents as our corpus.
In [9]:
import os, os.path
os.environ['DJANGO_SETTINGS_MODULE'] = 'core.settings.loc'
import sys; sys.path.append('..')
# above assumes you're working out of the repository and that the app code
# is in a directory adjacent to the one containing this file.
import django
django.setup()
from cityhallmonitor.models import *
Instead of reading from a feed, we'll read from the scraped documents. For now, we'll do this using Django, although it could be instructive to do it again with raw SQL.
The exercise proposes to cluster related blogs. The City Hall Monitor (CHM) corpus doesn't have such clear thematic divides. Sponsors are the best alternative I can think of. But how much variation is there in what sponsors sponsor?
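To make "variation between sponsors" concrete, here is a minimal sketch of a Pearson-based distance of the kind the book's clustering chapter relies on, applied to per-sponsor count vectors. The three sponsors and their counts are hypothetical, and returning 1.0 for a constant vector is a choice made here, not something taken from the corpus.

```python
from math import sqrt

def pearson_distance(v1, v2):
    """Return 1 - Pearson correlation, so similar vectors get a small distance."""
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1_sq = sum(x * x for x in v1)
    sum2_sq = sum(y * y for y in v2)
    p_sum = sum(x * y for x, y in zip(v1, v2))
    num = p_sum - (sum1 * sum2 / n)
    den = sqrt((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n))
    if den == 0:
        # One vector is constant, so correlation is undefined;
        # treating it as "uncorrelated" is an assumption of this sketch.
        return 1.0
    return 1.0 - num / den

# Hypothetical counts per matter type (say: ordinances, resolutions, claims).
sponsor_a = [10, 4, 0]
sponsor_b = [20, 8, 0]   # same mix as sponsor_a, twice the volume
sponsor_c = [0, 3, 12]   # a very different mix

print(pearson_distance(sponsor_a, sponsor_b))
print(pearson_distance(sponsor_a, sponsor_c))
```

Because the measure ignores overall volume, a prolific sponsor and an occasional one with the same mix of matter types come out close together, which seems like the right behavior for this question.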
In [16]:
# there's probably a clever way to do this with Django query sets but I don't have the patience right now
from collections import Counter
import csv

fields = ['person']
fields.extend([x[0] for x in MatterType.objects.values_list('name').distinct()])
with open("sponsor_type_counts.csv", "w") as f:
    w = csv.DictWriter(f, fieldnames=fields)
    w.writeheader()  # header row
    rows = active = inactive = 0
    for p in Person.objects.all():
        rows += 1
        mt_names = [m.matter_type.name for m in p.matters.all()]
        row = dict((name, 0) for name in fields)  # default every column to 0
        row.update(Counter(mt_names))
        if set(row.values()) != set([0]):  # at least one nonzero count
            active += 1
            row['person'] = p.full_name
            w.writerow(row)
        else:
            inactive += 1
print("Created sponsor_type_counts.csv with {} rows, skipped {} inactive 'people'".format(active, inactive))
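For comparison, here is a rough sketch of how the same counts might be produced with raw SQL, as mentioned earlier. It runs against an in-memory SQLite database so it stands alone. Every table and column name below is hypothetical; the tables Django actually created for this app will be named and shaped differently.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, full_name TEXT);
    CREATE TABLE matter_type (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE matter (id INTEGER PRIMARY KEY, matter_type_id INTEGER);
    CREATE TABLE sponsor (person_id INTEGER, matter_id INTEGER);
    INSERT INTO person VALUES (1, 'Sponsor A'), (2, 'Sponsor B'), (3, 'Sponsor C');
    INSERT INTO matter_type VALUES (1, 'Ordinance'), (2, 'Resolution');
    INSERT INTO matter VALUES (1, 1), (2, 1), (3, 2);
    INSERT INTO sponsor VALUES (1, 1), (1, 2), (1, 3), (2, 3);
""")

# One row per (person, matter type) pair that actually occurs.
# The inner joins drop people with no sponsorships, much like the
# "inactive" check in the Django loop.
query = """
    SELECT p.full_name, mt.name, COUNT(*)
    FROM sponsor s
    JOIN person p ON p.id = s.person_id
    JOIN matter m ON m.id = s.matter_id
    JOIN matter_type mt ON mt.id = m.matter_type_id
    GROUP BY p.full_name, mt.name
    ORDER BY p.full_name, mt.name
"""

# Pivot the long result into {person: {matter_type: count}}.
counts = {}
for full_name, type_name, n in conn.execute(query):
    counts.setdefault(full_name, {})[type_name] = n

print(counts)
```

The grouped query returns only nonzero pairs, so writing the wide CSV would still need the zero-filling step used in the loop above.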
In [7]:
import agate
sponsorship = agate.Table.from_csv('sponsor_type_counts.csv', row_names=lambda r: "%(person)s" % r)
In [9]:
sponsorship.rows['Arena, John']
Out[9]: