Discovering Groups

Working through Chapter 3 of Programming Collective Intelligence

We'll try implementing the clustering method described in chapter 3 of Toby Segaran's book, but with Chicago City Council documents as our corpus.


In [9]:
import os, os.path
os.environ['DJANGO_SETTINGS_MODULE'] = 'core.settings.loc'
import sys; sys.path.append('..')
# above assumes you're working out of the repository and that the app code 
# is in a directory adjacent to the one containing this file.
import django
django.setup()
from cityhallmonitor.models import *

Counting the words in a Feed

Instead of reading from a feed, we'll read from the scraped documents. For now, we'll do this using Django, although it could be instructive to do it again with raw SQL.

The exercise in the book clusters related blogs. The CHM corpus doesn't have such clear thematic divides; grouping by sponsor is the best alternative I can think of. But how much variation is there in what sponsors sponsor?
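To make "variation between sponsors" concrete, here is a minimal sketch of the kind of distance measure the chapter relies on: one minus the Pearson correlation between two count vectors. The function name `pearson_distance` and the three sponsor vectors are made up for illustration, and returning 1.0 when a vector has no variation is my own assumption, not necessarily what the book does.

```python
from math import sqrt

def pearson_distance(v1, v2):
    """Return 1 - Pearson r for two equal-length sequences of numbers.

    0 means the vectors rise and fall together; values above 1 mean
    they are negatively correlated.
    """
    n = len(v1)
    if n == 0 or n != len(v2):
        raise ValueError("vectors must be non-empty and the same length")
    mean1 = sum(v1) / n
    mean2 = sum(v2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(v1, v2))
    var1 = sum((a - mean1) ** 2 for a in v1)
    var2 = sum((b - mean2) ** 2 for b in v2)
    denom = sqrt(var1 * var2)
    if denom == 0:
        # Assumption: a flat vector is treated as uncorrelated with anything.
        return 1.0
    return 1.0 - cov / denom

# Hypothetical counts per matter type for three made-up sponsors.
sponsor_a = [10, 2, 0, 5]
sponsor_b = [20, 4, 0, 10]   # same mix as A, twice the volume
sponsor_c = [0, 7, 9, 1]     # a very different mix

d_ab = pearson_distance(sponsor_a, sponsor_b)
d_ac = pearson_distance(sponsor_a, sponsor_c)
print(d_ab, d_ac)
```

Because the measure ignores overall volume, a prolific sponsor and an occasional one with the same mix of matter types come out close together, which seems like the right behavior for comparing sponsors.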

An aside: counts by sponsor and type


In [16]:
# there's probably a clever way to do this with Django query sets but I don't have the patience right now

from collections import Counter
import csv
fields = ['person']
fields.extend([x[0] for x in MatterType.objects.values_list('name').distinct()])

out_path = "sponsor_type_counts.csv"
if not os.path.isfile(out_path): # skip the rebuild if the file already exists
    with open(out_path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=fields)
        w.writeheader()
        active = inactive = 0
        for p in Person.objects.all():
            mt_names = [m.matter_type.name for m in p.matters.all()]
            row = dict((name, 0) for name in fields) # set defaults
            row.update(Counter(mt_names))
            if set(row.values()) != set([0]): # at least one sponsored matter
                active += 1
                row['person'] = p.full_name
                w.writerow(row)
            else:
                inactive += 1
    print("Created sponsor_type_counts.csv with {} rows, skipped {} inactive 'people'".format(active, inactive))


Created sponsor_type_counts.csv with 88 rows, skipped 63 inactive 'people'
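The cell above needs the Django models to run, so here is the same counting logic as a standalone sketch on made-up data: one row per person, one column per matter type, with a `Counter` overwriting zero defaults. The sponsor names, matter types, and the helper `count_row` are all hypothetical.

```python
from collections import Counter

# Hypothetical (sponsor, matter type) pairs standing in for the database.
sponsorships = [
    ("Sponsor A", "Ordinance"),
    ("Sponsor A", "Ordinance"),
    ("Sponsor A", "Resolution"),
    ("Sponsor B", "Claim"),
]
matter_types = ["Ordinance", "Resolution", "Claim"]
people = ["Sponsor A", "Sponsor B", "Sponsor C"]

def count_row(person):
    """Build one output row: the person's name plus a count per matter type."""
    row = {"person": person}
    row.update({t: 0 for t in matter_types})  # zero defaults
    row.update(Counter(t for p, t in sponsorships if p == person))
    return row

rows = [count_row(p) for p in people]
# Keep only people who sponsored at least one matter.
active_rows = [r for r in rows if any(r[t] for t in matter_types)]
print(active_rows)
```

Setting the `person` key up front, rather than defaulting it to 0 and overwriting it later, keeps the "is this person active?" check limited to the count columns.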

In [7]:
import agate
sponsorship = agate.Table.from_csv('sponsor_type_counts.csv', row_names=lambda r: "%(person)s" % r)

In [9]:
sponsorship.rows['Arena, John']


Out[9]:
<agate.rows.Row at 0x106caf7f0>