This notebook downloads and cleans the SCOTUS subnetwork data. It can be modified to create any jurisdiction subnetwork, as well as the federal appellate subnetwork.

You have to modify the two paths in the cell below for your own computer.

  • repo_directory is the path to the cloned GitHub repo
  • data_dir is the path to the data directory
    • I suggest putting this outside the code repo and not on Dropbox, since these files can get large (on the order of tens of GBs for the text data).

This code is a little janky and subject to change.

outline

  • import code
  • set up the data directory folder and subfolders
  • download data from CourtListener and the Supreme Court Database (SCDB)
  • clean the network case metadata and edgelist
  • make the network with metadata and save it as a graphml file
  • set up the NLP data (you can skip this for the purpose of network analysis)

In [2]:
# modify these for your own computer
repo_directory = '/Users/iaincarmichael/Dropbox/Research/law/law-net/'

data_dir = '/Users/iaincarmichael/data/courtlistener/'

network_name is the subnetwork you want to work with. It can be either a single jurisdiction (scotus, ca1, etc.) or a collection of jurisdictions (such as the federal appellate courts). Currently the federal appellate courts are implemented as 'federal'.

network_name is used in the make_network_data.py file. You can modify the get_courts function in this file to create other collections of courts.
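
For illustration, a get_courts variant for a new collection of courts might look something like the sketch below. This is an assumption about the function's shape, not the actual implementation in make_network_data.py; the strings are CourtListener's lowercase court identifiers.

# hypothetical sketch of a get_courts variant -- the real function is in
# make_network_data.py and may differ
def get_courts(network_name):
    if network_name == 'federal':
        # SCOTUS plus the federal appellate courts
        return ['scotus'] + ['ca' + str(i) for i in range(1, 12)] + ['cadc', 'cafc']
    else:
        # otherwise treat network_name as a single jurisdiction
        return [network_name]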


In [5]:
# which network to download data for
network_name = 'scotus' # 'federal', 'ca1', etc

In [6]:
import sys

# graph package
import igraph as ig

# our code
sys.path.append(repo_directory + 'code/')
from setup_data_dir import setup_data_dir, make_subnetwork_directory
from pipeline.download_data import download_bulk_resource, download_master_edgelist, download_scdb
from helpful_functions import case_info

sys.path.append(repo_directory + 'vertex_metrics_experiment/code/')
from make_network_data import *
from make_graph import make_graph
from bag_of_words import make_tf_idf


# some sub directories that get used
raw_dir = data_dir + 'raw/'
subnet_dir = data_dir + network_name + '/'
text_dir = subnet_dir + 'textfiles/'


# jupyter notebook settings
%load_ext autoreload
%autoreload 2
%matplotlib inline


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

set up the data directory


In [ ]:
setup_data_dir(data_dir)

In [ ]:
make_subnetwork_directory(data_dir, network_name)
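
For orientation, after these two calls (and the downloads below) the data directory should contain roughly the following layout; the exact subfolders depend on setup_data_dir and make_subnetwork_directory:

data_dir/
    raw/                bulk CourtListener downloads (raw_dir)
    scdb/               SCDB data
    scotus/             the subnetwork directory (subnet_dir)
        textfiles/      opinion text files (text_dir)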

data download

get opinion and cluster files from CourtListener

opinion/cluster files are saved in data_dir/raw/court/


In [ ]:
download_op_and_cl_files(data_dir, network_name)
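
If you ever need to fetch a bulk archive by hand, the request is a single HTTP download. The URL pattern below is an assumption about CourtListener's bulk data API, not taken from download_data.py:

# sketch only: fetch a CourtListener bulk opinion archive by hand --
# the URL pattern is an assumption, not pipeline code
import urllib

bulk_url = 'https://www.courtlistener.com/api/bulk-data/opinions/' + network_name + '.tar.gz'
urllib.urlretrieve(bulk_url, raw_dir + network_name + '_opinions.tar.gz')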

get the master edgelist from CL

master edgelist is saved in data_dir/raw/


In [ ]:
download_master_edgelist(data_dir)

download SCDB data

SCDB data is saved in data_dir/scdb/


In [ ]:
download_scdb(data_dir)

network data

make the case metadata and edgelist

  • add the raw case metadata data frame to the raw/ folder
  • remove cases missing SCDB ids
  • remove the Detroit Lumber case (its syllabus disclaimer is quoted in nearly every SCOTUS opinion, which creates spurious citation edges)
  • get the edgelist of cases within the desired subnetwork
  • save the case metadata and edgelist to subnet_dir/

In [ ]:
# create the raw case metadata data frame in the raw/ folder
make_subnetwork_raw_case_metadata(data_dir, network_name)

In [ ]:
# create clean case metadata and edgelist from raw data
clean_metadata_and_edgelist(data_dir, network_name)
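
To make the cleaning steps above concrete, here is a rough pandas sketch of what this stage amounts to. The file and column names ('scdb_id', 'citing', 'cited') are assumptions; the real logic lives in make_network_data.py:

# rough sketch of the cleaning stage -- file/column names are assumptions
import pandas as pd

def clean_sketch(raw_metadata_path, master_edgelist_path):
    cases = pd.read_csv(raw_metadata_path, index_col='id')
    edges = pd.read_csv(master_edgelist_path)

    # remove cases that are missing SCDB ids
    # (the Detroit Lumber case would also be dropped at this point)
    cases = cases.dropna(subset=['scdb_id'])

    # keep only edges whose endpoints both survive the cleaning
    keep = set(cases.index)
    edges = edges[edges['citing'].isin(keep) & edges['cited'].isin(keep)]

    return cases, edges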

make graph

creates the network with the desired case metadata and saves it as a .graphml file in subnet_dir/


In [12]:
make_graph(subnet_dir, network_name)


/Users/iaincarmichael/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2827: DtypeWarning: Columns (5) have mixed types. Specify dtype option on import or set low_memory=False.
  if self.run_code(code, result):
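
For orientation, building and saving a directed citation network with igraph looks roughly like the sketch below; the file names and columns are assumptions, not make_graph's actual interface:

# minimal sketch of constructing the network -- not make_graph itself
import igraph as ig
import pandas as pd

metadata = pd.read_csv(subnet_dir + 'case_metadata.csv')
edges = pd.read_csv(subnet_dir + 'edgelist.csv')

g = ig.Graph(directed=True)
g.add_vertices(metadata['id'].astype(str).tolist())
g.add_edges(list(zip(edges['citing'].astype(str), edges['cited'].astype(str))))

# attach case metadata as vertex attributes
for col in ['court', 'date', 'issueArea']:
    g.vs[col] = metadata[col].tolist()

g.write_graphml(subnet_dir + network_name + '_network.graphml')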

NLP data

make case text files

grabs the opinion text for each case in the network and saves each one as a text file in subnet_dir/textfiles/ (text_dir above)


In [ ]:
# make the text files for the given subnetwork
make_network_textfiles(data_dir, network_name)
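
As a sketch of what the extraction involves: each bulk opinion file is a JSON record whose text can live in one of several fields. The field names and priority below are assumptions about the bulk data format, not the code in make_network_textfiles:

# sketch: pull plain text out of a CourtListener opinion JSON file --
# field names/priority are assumptions about the bulk data format
import json

def opinion_text(opinion_path):
    with open(opinion_path) as f:
        op = json.load(f)
    return op.get('plain_text') or op.get('html_with_citations') or ''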

make tf-idf matrix

creates the tf-idf matrix for the corpus of cases in the network and saves it to subnet_dir + 'nlp/'


In [ ]:
make_tf_idf(text_dir, subnet_dir + 'nlp/')
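
If you want a standalone version of this step, the standard scikit-learn recipe over the saved text files looks like this (a sketch; make_tf_idf in bag_of_words.py may tokenize, normalize, or save differently):

# sketch: tf-idf over the opinion text files with scikit-learn --
# not necessarily what make_tf_idf does internally
import glob
from sklearn.feature_extraction.text import TfidfVectorizer

files = sorted(glob.glob(text_dir + '*.txt'))
vectorizer = TfidfVectorizer(input='filename', stop_words='english')
tfidf_matrix = vectorizer.fit_transform(files)  # sparse, n_cases x n_terms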

Load network


In [7]:
# load the graph
G = ig.Graph.Read_GraphML(subnet_dir + network_name +'_network.graphml')


/Users/iaincarmichael/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:2: RuntimeWarning: Could not add vertex ids, there is already an 'id' vertex attribute at foreign-graphml.c:443
  from ipykernel import kernelapp as app

In [8]:
G.summary()


Out[8]:
'IGRAPH DN-- 27885 234312 -- \n+ attr: court (v), date (v), id (v), issueArea (v), name (v), num_words (v), year (v)'
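
With the graph loaded, the vertex attributes listed in the summary can be queried directly, for example:

# inspect the metadata attached to the first case
G.vs[0].attributes()

# count cases from a given year (assumes 'year' loaded as a number)
len(G.vs.select(year_eq=1950))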