We would like to compare and contrast case clustering based on the opinion text (natural language processing) vs. based on the citation structure (network community detection).

Community detection on the network

  • modularity
  • walktrap
  • SBM (todo)

Clustering on the opinion texts

  • compute TF-IDF vectors of opinions
    • k-means on tf-idf vectors
    • Gaussian mixture models (TODO)
    • spectral clustering on similarity matrix (TODO)
  • topic modeling (TODO)
    • LDA
    • nonnegative matrix factorization

Relational topic models (see the Chang and Blei paper) (TODO)

TODO

  • match clusters from different algos
  • find representatives for clusters
    • top tf-idf words
    • 'most central case' in community (is this a thing?)
  • more CD algos
    • fix the number of communities to match the number of NLP clusters
  • more NLP based clustering algos

Notes

borrowing some code from http://brandonrose.org/clustering


In [8]:
repo_directory = '/Users/iaincarmichael/Dropbox/Research/law/law-net/'
data_dir = '/Users/iaincarmichael/data/courtlistener/'

import numpy as np
import sys
import matplotlib.pyplot as plt



# graph package
import igraph as ig


# stats (numpy is already imported above)
import pandas as pd
from sklearn.cluster import KMeans


# our code
sys.path.append(repo_directory + 'code/')


sys.path.append(repo_directory + 'vertex_metrics_experiment/code/')
from bag_of_words import load_tf_idf


# which network to download data for
network_name = 'scotus' # 'federal', 'ca1', etc


# some sub directories that get used
raw_dir = data_dir + 'raw/'
subnet_dir = data_dir + network_name + '/'
text_dir = subnet_dir + 'textfiles/'
nlp_dir = subnet_dir + 'nlp/'


# jupyter notebook settings
%load_ext autoreload
%autoreload 2
%matplotlib inline



In [2]:
# load the graph
G = ig.Graph.Read_GraphML(subnet_dir + network_name +'_network.graphml')

largest connected component

Restrict our attention to the largest connected component of the network. Also, since we are missing some text files from 2016, we ignore 2016.


In [106]:
# limit ourselves to cases up to and including 2015 since we are missing some textfiles from 2016
G = G.subgraph(G.vs.select(year_le=2015))

# make graph undirected
Gud = G.copy()
Gud = Gud.as_undirected()

# get largest connected component (the graph is undirected,
# so strong and weak connectivity coincide)
components = Gud.clusters(mode='STRONG')
g = components.subgraphs()[np.argmax(components.sizes())]

# CL ids of cases in largest connected component
CLids = g.vs['name']

graph clustering

Do community detection on network

modularity


In [107]:
%%time 

# modularity clustering
cd_modularity = g.community_fastgreedy()

mod_clust = cd_modularity.as_clustering()

mod_clust.summary()


CPU times: user 1min 50s, sys: 1.54 s, total: 1min 52s
Wall time: 2min 6s
Out[107]:
'Clustering with 27539 elements and 172 clusters'

In [108]:
graph_clusters = pd.Series(mod_clust.membership, index=g.vs['name'])

walktrap


In [109]:
# %time cd_walktrap = g.community_walktrap()

# wt_clust = cd_walktrap.as_clustering()

# wt_clust.summary()

NLP clustering

load tf-idf vectors


In [6]:
tfidf_matrix, op_id_to_bow_id = load_tf_idf(nlp_dir)

K-means clustering on tf-idf


In [ ]:
%%time

# set number of clusters
num_clusters = 30

# run kmeans
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)

nlp_clusters = km.labels_.tolist()
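The choice of 30 clusters above is ad hoc. One common check (a sketch on synthetic data, not part of the original analysis; on the real tf-idf matrix one would likely subsample first for speed) is to scan k and compare silhouette scores:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# toy data standing in for a subsample of the tf-idf matrix
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# silhouette score rewards tight, well-separated clusters
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print('best k by silhouette:', best_k)
```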

Compare NLP vs graph clustering


In [113]:
# index by opinion id, ordered to match the rows of the tf-idf matrix
op_ids = sorted(op_id_to_bow_id, key=op_id_to_bow_id.get)
clusters = pd.DataFrame(index=op_ids, columns=['nlp', 'graph'])

# add in NLP clusters
clusters['nlp'] = nlp_clusters


# add in communities 
clusters['graph'] = graph_clusters

# nodes not included in community detection (i.e. outside the
# largest connected component) get their own cluster
clusters['graph'] = clusters['graph'].fillna(max(graph_clusters) + 1)

# fix the dtype (np.int is deprecated; use the builtin int)
clusters['graph'] = clusters['graph'].astype(int)

In [114]:
clusters


Out[114]:
nlp graph
145658 18 2
89370 14 0
89371 14 0
89372 29 0
89373 4 0
89374 19 0
89375 1 3
89376 19 0
89377 6 0
89378 5 0
89379 19 2
103549 18 0
103548 15 0
103541 4 0
103540 26 0
103543 0 0
103542 18 0
103545 14 2
103544 7 2
103547 11 0
103546 14 2
88395 19 0
88394 14 0
88397 19 2
88396 19 0
88391 5 2
88390 22 2
88398 4 0
97159 17 0
97158 21 2
... ... ...
86191 5 0
86190 23 0
86193 6 0
86192 19 3
103099 29 2
103098 16 0
103097 9 0
103096 18 0
96208 6 0
96209 12 0
96200 10 3
96201 20 4
96202 24 0
96203 24 0
96204 13 3
96205 8 2
96206 5 0
96207 25 2
88971 25 3
88970 4 3
88977 25 2
88976 24 0
88979 19 2
88978 24 0
103095 0 0
103094 0 2
103093 0 0
103092 12 0
103091 28 0
103090 20 2

27539 rows × 2 columns
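Beyond eyeballing the table, the agreement between the two labelings can be quantified with permutation-invariant clustering metrics (a sketch with toy labels standing in for `clusters['nlp']` and `clusters['graph']`):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# toy stand-ins for the nlp and graph cluster labels
nlp_labels   = [0, 0, 1, 1, 2, 2]
graph_labels = [1, 1, 0, 0, 0, 2]

# both scores ignore the label names and only measure agreement
ari = adjusted_rand_score(nlp_labels, graph_labels)
nmi = normalized_mutual_info_score(nlp_labels, graph_labels)
print('ARI:', ari, 'NMI:', nmi)
```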


In [115]:
# TODO: match clusters
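One way to match clusters (a sketch, not the notebook's eventual method: maximize the total overlap between paired clusters with the Hungarian algorithm via `scipy.optimize.linear_sum_assignment`, applied to the contingency table of the two labelings):

```python
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix

# toy stand-ins for the nlp and graph cluster labels
nlp_labels   = [0, 0, 0, 1, 1, 2, 2]
graph_labels = [2, 2, 2, 0, 0, 0, 1]

# C[i, j] = number of cases in nlp cluster i and graph cluster j
C = contingency_matrix(nlp_labels, graph_labels)

# Hungarian algorithm; negate counts so minimizing maximizes overlap
row_ind, col_ind = linear_sum_assignment(-C)
matching = dict(zip(row_ind, col_ind))
print('nlp -> graph matching:', matching)
```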

In [ ]: