We would like to compare and contrast case clusterings based on the opinion text (natural language processing) and on the citation structure (network community detection).
Community detection on the network
Clustering on the opinion texts
Relational topic models (see the Chang and Blei paper) (TODO)
Borrowing some code from http://brandonrose.org/clustering
In [8]:
repo_directory = '/Users/iaincarmichael/Dropbox/Research/law/law-net/'
data_dir = '/Users/iaincarmichael/data/courtlistener/'
import sys
import numpy as np
import matplotlib.pyplot as plt
# graph package
import igraph as ig
# stats
import pandas as pd
from sklearn.cluster import KMeans
# our code
sys.path.append(repo_directory + 'code/')
sys.path.append(repo_directory + 'vertex_metrics_experiment/code/')
from bag_of_words import load_tf_idf
# which network to download data for
network_name = 'scotus' # 'federal', 'ca1', etc
# some sub directories that get used
raw_dir = data_dir + 'raw/'
subnet_dir = data_dir + network_name + '/'
text_dir = subnet_dir + 'textfiles/'
nlp_dir = subnet_dir + 'nlp/'
# jupyter notebook settings
%load_ext autoreload
%autoreload 2
%matplotlib inline
In [2]:
# load the graph
G = ig.Graph.Read_GraphML(subnet_dir + network_name +'_network.graphml')
In [106]:
# limit ourselves to cases up to and including 2015, since we are missing some text files from 2016
G = G.subgraph(G.vs.select(year_le=2015))
# undirected copy of the graph (as_undirected returns a new graph)
Gud = G.as_undirected()
# get the largest connected component
components = Gud.clusters(mode='STRONG')  # STRONG and WEAK coincide for undirected graphs
g = components.subgraphs()[np.argmax(components.sizes())]
# CL ids of cases in largest connected component
CLids = g.vs['name']
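As a quick sanity check, we can see what fraction of the cases survive the restriction to the largest connected component (this uses only the variables defined above):
In [ ]:
# what fraction of the cases are in the largest connected component?
print('%d of %d cases in the largest component (%.1f%%)' %
      (g.vcount(), Gud.vcount(), 100.0 * g.vcount() / Gud.vcount()))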
In [107]:
%%time
# greedy modularity maximization (fastgreedy)
cd_modularity = g.community_fastgreedy()
mod_clust = cd_modularity.as_clustering()
mod_clust.summary()
Out[107]:
In [108]:
graph_clusters = pd.Series(mod_clust.membership, index=g.vs['name'])
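Before comparing against the text clusters, it helps to glance at the community size distribution; greedy modularity methods often return a few large communities and a long tail of small ones. A quick look using the series we just built:
In [ ]:
# community sizes, largest first
community_sizes = graph_clusters.value_counts()
print(community_sizes.head(10))
# histogram of community sizes
community_sizes.hist(bins=50)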
In [109]:
# %time cd_walktrap = g.community_walktrap()
# wt_clust = cd_walktrap.as_clustering()
# wt_clust.summary()
In [6]:
tfidf_matrix, op_id_to_bow_id = load_tf_idf(nlp_dir)
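A quick look at the TF-IDF matrix dimensions (op_id_to_bow_id presumably maps each CourtListener opinion id to its row in the matrix):
In [ ]:
# one row per opinion, one column per term
print('tf-idf matrix: %d opinions x %d terms' % tfidf_matrix.shape)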
In [ ]:
%%time
# set number of clusters
num_clusters = 30
# run k-means (fix the seed so the clustering is reproducible)
km = KMeans(n_clusters=num_clusters, random_state=0)
km.fit(tfidf_matrix)
nlp_clusters = km.labels_.tolist()
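To sanity check the K-means clusters, we can print the highest-weight terms in each cluster centroid, as in the brandonrose.org tutorial. This is a sketch: it assumes we also have `vocab`, the list of terms aligned with the columns of tfidf_matrix (e.g. from the fitted TfidfVectorizer's get_feature_names()); load_tf_idf may or may not return it.
In [ ]:
# top terms per k-means cluster, ranked by centroid weight
# NOTE: `vocab` is assumed here: the term list aligned with the columns of tfidf_matrix
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(num_clusters):
    top_terms = [vocab[ind] for ind in order_centroids[i, :10]]
    print('cluster %d: %s' % (i, ', '.join(top_terms)))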
In [113]:
# opinion ids ordered by their row in the tf-idf matrix, so the index lines
# up with km.labels_ (assumes op_id_to_bow_id maps opinion id -> row index)
op_ids = sorted(op_id_to_bow_id, key=op_id_to_bow_id.get)
clusters = pd.DataFrame(index=op_ids, columns=['nlp', 'graph'])
# add in NLP clusters
clusters['nlp'] = nlp_clusters
# add in communities
clusters['graph'] = graph_clusters
# nodes outside the largest connected component were not assigned a
# community; put them all into one extra cluster
clusters['graph'].fillna(max(graph_clusters) + 1, inplace=True)
# cast community labels to int
clusters['graph'] = clusters['graph'].astype(int)
In [114]:
clusters
Out[114]:
In [115]:
# TODO: match clusters
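A simple starting point for matching the two labelings is a contingency table, plus label-permutation-invariant agreement scores (adjusted Rand index, normalized mutual information) from sklearn. A minimal sketch:
In [ ]:
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
# cross-tabulate NLP clusters (rows) against graph communities (columns)
print(pd.crosstab(clusters['nlp'], clusters['graph']))
# agreement scores that ignore the arbitrary cluster label ids
print('ARI: %.3f' % adjusted_rand_score(clusters['nlp'], clusters['graph']))
print('NMI: %.3f' % normalized_mutual_info_score(clusters['nlp'], clusters['graph']))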