The Italian Senate is clogged by computer-generated amendments. This notebook aims to cluster similar amendments in an automated fashion, so that the appropriate Senate procedures can be used to get rid of them in one sweep.
We begin as usual with some imports, some Jupyter magic and some useful constants.
In [1]:
from __future__ import division
import os
import re
from itertools import combinations
import xml.etree.ElementTree as ET
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
import numpy as np
In [2]:
%matplotlib inline
In [3]:
DATA_FOLDER = 'data'
NAMESPACE = {'an': 'http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD03'}
ALPHANUM_REGEX = re.compile(r'\W+', re.UNICODE)
The problem we want to solve is an unsupervised clustering in an unknown number of clusters. The usual algorithm used to solve it is some variation of hierarchical clustering combined with some heuristics to "cut" the resulting dendrogram at a certain height to produce the predicted clusters.
All variations of hierarchical clustering require us to define some distance metric between elements. In our case, elements are free texts, so we use a distance related to Jaccard Similarity on the tokens of the text, where a token is a contiguous string of alphanumeric characters.
In [4]:
def to_tokens(s):
    return set(ALPHANUM_REGEX.sub(' ', s).lower().split())
In [5]:
def jaccard_distance(x, y):
    return 1 - (len(x['tokens'] & y['tokens']) / len(x['tokens'] | y['tokens']))
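As a quick sanity check, here is what these two helpers produce on made-up variations of the amendment text shown further below (a standalone sketch reproducing the notebook's definitions):

```python
import re

ALPHANUM_REGEX = re.compile(r'\W+', re.UNICODE)

def to_tokens(s):
    # Split on runs of non-alphanumeric characters, lowercase everything
    return set(ALPHANUM_REGEX.sub(' ', s).lower().split())

def jaccard_distance(x, y):
    # One minus the Jaccard similarity of the two token sets
    return 1 - (len(x['tokens'] & y['tokens']) / len(x['tokens'] | y['tokens']))

x = {'tokens': to_tokens(u'Sopprimere gli articoli da 1 a 10.')}
y = {'tokens': to_tokens(u'Sopprimere gli articoli da 1 a 9.')}
# 6 tokens shared out of 8 in the union -> distance 0.25
print(jaccard_distance(x, y))  # 0.25
```

Identical texts get distance 0, texts with no token in common get distance 1, and small edits (here, "10" vs "9") land close to 0.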
Using the XML data downloaded by the Scrapy spider, we build a list called amendments.
Each element of the list is a dictionary whose structure is exemplified by the following:
{
    '_id': '1.100',
    'authors': ['SACCONI', "D'ASCOLA", 'AIELLO', 'ALBERTINI', ..., 'DI BIAGIO'],
    'raw': 'Sopprimere gli articoli da 1 a 10.',
    'tokens': set(['1', '10', 'a', 'articoli', 'da', 'gli', 'sopprimere'])
}
In [6]:
amendments = []
for filename in os.listdir(DATA_FOLDER):
    tree = ET.parse(os.path.join(DATA_FOLDER, filename))
    _id = tree.find('.//an:FRBRnumber', NAMESPACE).get('value')
    authors = [el.text for el in tree.findall('.//an:docProponent', NAMESPACE)]
    raw = ' '.join(tree.find('.//an:amendmentContent', NAMESPACE).itertext())
    tokens = to_tokens(raw)
    amendments.append({'_id': _id, 'authors': authors, 'raw': raw, 'tokens': tokens})
To check if the algorithm is working correctly, we restrict ourselves to the first hundred amendments.
In [7]:
first_amendments = amendments[:100]
first_distances = [jaccard_distance(x, y) for x, y in combinations(first_amendments, 2)]
We now compute a hierarchical clustering on these first hundred elements, and we visualize the result as a dendrogram.
In [8]:
Z_first = linkage(first_distances, method='complete')
In [9]:
plt.figure(figsize=(25, 50))
plt.title('Z_first')
dendrogram(
    Z_first,
    orientation='right',
    leaf_font_size=12.,
)
plt.show()
It appears that the algorithm found several clusters, highlighted by different colors. Let's inspect the last green one:
In [10]:
for i in [77, 72, 68, 64, 60, 56, 52, 48, 92, 89, 84, 80, 96]:
    print(u'{i}: {snippet}'.format(i=i, snippet=first_amendments[i]['raw'][:76]))
We see that, in fact, all amendments of this cluster are variations of a single one.
Let's now try with the second-to-last red cluster:
In [11]:
for i in [78, 73, 69, 65, 61, 57, 53, 49, 93, 90, 85, 81]:
    print(u'{i}: {snippet}'.format(i=i, snippet=first_amendments[i]['raw'][:76]))
Again, all amendments in this cluster are variations of a single one. Moreover, they differ from the previous cluster by the addition of the last sentence, which is why the hierarchical clustering algorithm will eventually merge the two clusters.
To double check, let's try with amendments 6 and 97, which are not part of the same cluster:
In [12]:
for i in [6, 97]:
    print(u'{i}: {snippet}'.format(i=i, snippet=first_amendments[i]['raw'][:76]))
It appears that, in fact, the text of these two amendments is significantly different.
Finally, let's run the algorithm on all amendments at once.
In [13]:
distances = [jaccard_distance(x, y) for x, y in combinations(amendments, 2)]
In [14]:
Z_all = linkage(distances, method='complete')
In [15]:
plt.figure(figsize=(25, 10))
plt.title('Z_all')
dendrogram(
    Z_all,
    leaf_font_size=0.,
)
plt.show()
We observe several interesting clusters, like the green one appearing roughly at two thirds from the left, and the five red ones appearing just before the middle. Moreover, this shows that the algorithm can be applied to several thousand amendments in a couple of minutes.
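The notebook stops at the dendrogram, but the "cut at a certain height" step mentioned at the start can be done with scipy's fcluster. A sketch on toy token sets (the words and the 0.5 threshold are made up for illustration, not taken from the notebook):

```python
from itertools import combinations
from scipy.cluster.hierarchy import linkage, fcluster

# Toy token sets standing in for amendments: two obvious groups
token_sets = [
    {'sopprimere', 'gli', 'articoli'},
    {'sopprimere', 'gli', 'articoli', 'tutti'},
    {'aggiungere', 'il', 'comma'},
    {'aggiungere', 'il', 'comma', 'seguente'},
]
distances = [1 - len(a & b) / len(a | b)
             for a, b in combinations(token_sets, 2)]
Z = linkage(distances, method='complete')

# Cut the dendrogram at height 0.5: with complete linkage, elements end
# up in the same flat cluster only if all their pairwise distances are
# below the threshold
labels = fcluster(Z, t=0.5, criterion='distance')
print(labels)  # the first two elements share a label, as do the last two
```

On a real run the threshold would have to be tuned by inspecting dendrograms like the ones above.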