This notebook cleans the SCOTUS citation network.

SCOTUS hears on the order of 100 cases a year, yet a few years have several thousand .json files. Many of these .json files correspond to 'applications' to SCOTUS rather than argued cases, and we would like to remove them from the network.

We remove a case if:

  • it has zero degree (within the SCOTUS network), AND
  • its opinion text contains the word 'denied' or 'certiorari' (sketched as a predicate just below)
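
A minimal sketch of this rule as a predicate (a hypothetical helper for illustration only; `degree` is assumed to be the case's degree in the SCOTUS network and `text` its opinion text):

    def should_remove(degree, text):
        # zero degree within the SCOTUS network AND the text mentions 'denied' or 'certiorari'
        return degree == 0 and bool(re.search(r'denied|certiorari', text, re.IGNORECASE))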

Outline:

  • Load libraries
  • Read in the SCOTUS edge list (this should include every potential case)
  • Add desired metadata to each vertex
  • Find potential cases to kick out
  • Remove 'bad' cases
  • Save the cleaned network

Problems:

TODO:

  • Make sure we are removing all the cases we want to
    • Try more text searches, e.g. 'granted'
    • Look at case length
  • Make sure we are not removing any actual cases
    • Some cases might have zero degree within the SCOTUS network but could cite/be cited by cases outside the SCOTUS network. ADD this functionality!!
  • Implement this better: there is no need to create the igraph network until the end (a rough sketch follows this list)
    • SCOTUS adjacency list as a dict
    • SCOTUS adjacency list with metadata as a dict of dicts
    • Add in only the cases we want
  • Create a SCOTUS-specific edge list text file
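
A rough sketch of the 'build it as a dict of dicts first' idea from the TODO above. Everything here is illustrative: `edge_pairs` stands in for an iterable of (citing, cited) case ids from the edge list, and `get_date(case_id)` stands in for reading 'date_filed' from the cluster .json:

    scotus_adj = {}   # case id -> {'date': ..., 'cites': set of cited case ids}
    for citing, cited in edge_pairs:
        scotus_adj.setdefault(citing, {'date': None, 'cites': set()})['cites'].add(cited)
        scotus_adj.setdefault(cited, {'date': None, 'cites': set()})
    for case_id in scotus_adj:
        scotus_adj[case_id]['date'] = get_date(case_id)
    # filter out unwanted cases here, then build the igraph network once at the very end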

In [2]:
#!/usr/bin/env python3

# Standard library
import glob
import json
import os
import re
import sys
import time

# Third-party
import numpy as np
import pandas as pd
from pandas import DataFrame, read_csv
from bs4 import *
from igraph import *  # Graph, summary, etc. are used unqualified below

In [3]:
# Read in the edge list (this should include every potential SCOTUS case)

scotus_network = Graph.Read_Lgl('../../data/created/scotus/original/scotus_net_all_lgl.txt',
                                names='name', directed=True)
summary(scotus_network)  # summary() prints the graph summary itself; wrapping it in print() adds a stray 'None'


IGRAPH DN-- 63744 244496 -- 
+ attr: name (v)
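
The per-case `vs.select()` calls in the cells below scan all ~64k vertices on every lookup. A name-to-index map built once avoids that; this cell is an added sketch, not part of the original notebook:

In [ ]:
# Build a name -> vertex index map once, so later lookups are O(1) dict gets
# instead of full scans of the vertex sequence
name_to_idx = {v['name']: v.index for v in scotus_network.vs}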

In [5]:
# Add in metadata: the date each case was filed, from the cluster .json files
case_list_cl = [f.split('.')[0] for f in os.listdir("../../data/downloaded/clusters/scotus")]
loop_start = time.time()
for i in case_list_cl:
    filename = "../../data/downloaded/clusters/scotus/" + str(i) + ".json"
    with open(filename, encoding='utf-8') as data_file:
        cluster_data = json.load(data_file)  # open() already decodes; json.load() needs no encoding argument

    date = cluster_data['date_filed']

    # NOTE: vs.select() scans every vertex, so this lookup is O(V) per case,
    # which is why the cell was interrupted (see traceback below)
    scotus_network.vs.select(name='id' + str(i))['date'] = date

loop_end = time.time()
print('the loop took ' + str(loop_end - loop_start) + "s")


---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-5-0125ba8b0e39> in <module>()
     10 
     11 
---> 12     scotus_network.vs.select(name = 'id'+ str(i))['date'] = date
     13 
     14 loop_end = time.time()

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/igraph/__init__.py in select(self, *args, **kwds)
   3501             else:
   3502                 values = vs[attr]
-> 3503             filtered_idxs=[i for i, v in enumerate(values) if func(v, value)]
   3504             vs = vs.select(filtered_idxs)
   3505 

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/igraph/__init__.py in <listcomp>(.0)
   3501             else:
   3502                 values = vs[attr]
-> 3503             filtered_idxs=[i for i, v in enumerate(values) if func(v, value)]
   3504             vs = vs.select(filtered_idxs)
   3505 

KeyboardInterrupt: 
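
A faster variant of the interrupted metadata loop above, sketched under the assumption that the `name_to_idx` map from the earlier sketch has been built:

In [ ]:
# Same metadata pass, but using the precomputed name_to_idx map instead of vs.select()
loop_start = time.time()
for i in case_list_cl:
    filename = "../../data/downloaded/clusters/scotus/" + str(i) + ".json"
    with open(filename, encoding='utf-8') as data_file:
        cluster_data = json.load(data_file)

    idx = name_to_idx.get('id' + str(i))
    if idx is not None:
        scotus_network.vs[idx]['date'] = cluster_data['date_filed']

loop_end = time.time()
print('the loop took ' + str(loop_end - loop_start) + "s")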

In [ ]:
# Find cases that have zero degree AND whose text contains one of:
#   'denied'
#   'certiorari'
#   'certiorari denied'
#   'certiorari granted'

# case_list_op is built the same way as case_list_cl, but from the opinions
# directory (it was not defined in the cells above)
case_list_op = [f.split('.')[0] for f in os.listdir("../../data/downloaded/opinions/scotus")]

denied_all = []              # deg = 0, contains 'denied'
certiorari_all = []          # deg = 0, contains 'certiorari'
certiorari_denied_all = []   # deg = 0, contains 'certiorari denied'
certiorari_granted_all = []  # deg = 0, contains 'certiorari granted'
years_all = []               # 'tag-year-id' strings for later inspection


k = 0
loop_start = time.time()
for i in case_list_op:
    current_vertex = scotus_network.vs.select(name = 'id'+ str(i))
    degree = current_vertex.degree()[0]
    
    if degree == 0:
        filename_op = "../../data/downloaded/opinions/scotus/" + str(i) + ".json"
        with open(filename_op, encoding='utf-8') as data_file:
            op_data = json.load(data_file)

        # Fall back through the available text fields until one is non-empty
        text = op_data['html']
        if len(text) == 0:
            text = op_data['html_with_citations']
        if len(text) == 0:
            text = op_data['plain_text']
        if len(text) == 0:
            text = op_data['html_lawbox']
        if len(text) == 0:
            print('case ' + str(i) + ' has no text')
            
            
        year = current_vertex['date'][0].split('-')[0]


        if re.search(r'denied', text, re.IGNORECASE):
            denied_all.append(i)
            years_all.append('d-' + year + '-' + str(i))

        if re.search(r'certiorari', text, re.IGNORECASE):
            certiorari_all.append(i)
            years_all.append('c-' + year + '-' + str(i))

        if re.search(r'certiorari denied', text, re.IGNORECASE):
            certiorari_denied_all.append(i)
            years_all.append('cd-' + year + '-' + str(i))

        if re.search(r'certiorari granted', text, re.IGNORECASE):
            certiorari_granted_all.append(i)
            years_all.append('cg-' + year + '-' + str(i))
    
    k = k+1
                
loop_end = time.time()
print('the loop took ' + str(loop_end - loop_start) + "s")
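
A quick sanity check on the searches (toward the 'make sure we are removing all the cases we want to' TODO); this cell is an added sketch:

In [ ]:
# How many zero-degree cases matched each search term?
print('denied:             ' + str(len(denied_all)))
print('certiorari:         ' + str(len(certiorari_all)))
print('certiorari denied:  ' + str(len(certiorari_denied_all)))
print('certiorari granted: ' + str(len(certiorari_granted_all)))
print('will be kicked out: ' + str(len(set(denied_all) | set(certiorari_all))))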

In [ ]:
# Kick out the cases we don't want

# Remove cases that have zero degree and contain either 'denied' or 'certiorari'
cases_to_kick_out = list(set(denied_all) | set(certiorari_all))

# Delete all of the matching vertices in a single call instead of one select()/delete per case
bad_names = ['id' + str(i) for i in cases_to_kick_out]
bad_vertices = scotus_network.vs.select(name_in=bad_names)
scotus_network.delete_vertices(bad_vertices)
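
To confirm the removal did what we expect, an added check: the vertex count should have dropped by the number of kicked-out cases.

In [ ]:
# Vertex count should now be the original 63744 minus len(cases_to_kick_out)
summary(scotus_network)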

In [ ]:
# Save the cleaned network (written in GML format, despite the .txt extension)
scotus_network.write_gml('../../data/created/scotus/clean/scotus_net_clean.txt')
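
The last TODO asks for a SCOTUS-specific edge list text file. A sketch, assuming the same LGL format the network was read from; the output path here is a placeholder:

In [ ]:
# Also write the cleaned network as an LGL edge list text file (path is a placeholder)
scotus_network.write_lgl('../../data/created/scotus/clean/scotus_net_clean_lgl.txt',
                         names='name', weights=None)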