The goal is to scrape the most recent judgments and opinions from InfoCuria - Case-law of the Court of Justice.
The scraping is done in two steps:
1. Scrape the search results page and collect the links to the individual documents.
2. Scrape each linked document page and extract the text of the ruling.
After that, the text of the rulings is saved into a single file that can be used for later analysis.
In [103]:
# Import libraries that we need for the web scraping.
import urllib.request
from bs4 import BeautifulSoup

# Scrape all HTML from a webpage.
def scrapewebpage(url):
    # Open URL and get HTML.
    web = urllib.request.urlopen(url)
    # Make sure there weren't any errors opening the URL.
    if web.getcode() == 200:
        html = web.read()
        return(html)
    else:
        print("Error %s reading %s" % (web.getcode(), url))

# Helper function that scrapes the webpage and turns it into soup.
def makesoup(url):
    html = scrapewebpage(url)
    return(BeautifulSoup(html, "lxml"))
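Some servers reject requests that arrive with the default urllib user agent, or hang without answering. If that happens, a slightly more defensive variant of scrapewebpage can help. This is only a sketch: the scrapewebpage_safe name, the browser-like User-Agent string, and the 30-second timeout are illustrative choices, not part of the original notebook.

# Sketch of a more defensive scraper: sends a browser-like User-Agent header
# and gives up after a timeout instead of hanging indefinitely.
import urllib.request

def scrapewebpage_safe(url, timeout=30):
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    web = urllib.request.urlopen(request, timeout=timeout)
    if web.getcode() == 200:
        return web.read()
    print("Error %s reading %s" % (web.getcode(), url))
    return None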
Get the most recent judgments and opinions from the table with the search results. In the page source, this table is marked up as <table class="detail_table_documents">. The table has many columns, but we're only interested in the column "Name of the parties" and the links to each document.
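To make the class names in the next cell concrete, here is a small sketch that parses a simplified, made-up row in the same shape as the results table. The markup below is illustrative only; the real InfoCuria rows contain more columns and attributes.

# Parse a simplified, made-up row shaped like the results table,
# to show which cells the class names refer to (not real InfoCuria markup).
from bs4 import BeautifulSoup

sample_row = """
<table class="detail_table_documents">
  <tr>
    <td class="table_cell_nom_usuel">Some Party v Other Party</td>
    <td class="table_cell_links_eurlex"><a href="http://example.org/doc">Document</a></td>
  </tr>
</table>
"""
soup = BeautifulSoup(sample_row, "lxml")
row = soup.find("tr")
print(row.find("td", "table_cell_nom_usuel").get_text())             # Name of the parties
print(row.find("td", "table_cell_links_eurlex").find("a")["href"])   # Link to the document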
In [104]:
# Scrape the page with most recent judgments and opinions.
judgments_page = makesoup("http://curia.europa.eu/juris/documents.jsf?language=en&jur=C&cit=none%252CC%252CCJ%252CR%252C2008E%252C%252C%252C%252C%252C%252C%252C%252C%252C%252Ctrue%252Cfalse%252Cfalse&td=%24mode%3D8D%24from%3D2018.2.1%24to%3D2018.2.8%3B%3B%3BPUB1%2CPUB3%3BNPUB1%3B%3BORDALL&ordreTri=dateDesc&pcs=O&redirection=doc&page=1")
In [105]:
# Extract the table from the page.
table = judgments_page.find("table", "detail_table_documents")
In [106]:
# Create an empty list to store the links.
links = []
# Go through row by row in the table. Note that tr is a row, td is a column.
for row in table.find_all("tr"):
    name = row.find("td", "table_cell_nom_usuel")  # Find the first column (td) with the class "table_cell_nom_usuel".
    link = row.find("td", "table_cell_links_eurlex")  # Find the first column (td) with the class "table_cell_links_eurlex".
    if name:
        # Print the name of the parties in the ruling.
        print(name.get_text())
    if link:
        links.append(link.find("a")["href"])  # Find <a href="link"> and add the link to the list.
        #print(link.find("a")["href"])
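If the party names should stay attached to their links for the later analysis, the two lookups above can also be collected into pairs. This is just a sketch of a variation; the notebook itself only keeps the list of links.

# Sketch: keep the party names together with their document links as (name, link) pairs.
# Assumes the same `table` object as above.
cases = []
for row in table.find_all("tr"):
    name = row.find("td", "table_cell_nom_usuel")
    link = row.find("td", "table_cell_links_eurlex")
    if name and link and link.find("a"):
        cases.append((name.get_text(strip=True), link.find("a")["href"]))
print(cases[:3])  # Preview the first three pairs.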
In [107]:
# How many links did we get?
len(links)
Out[107]:
Each individual page (see example) consists only of text, which makes it much easier to scrape than the previous step. We only need to find <div id="document_content">, which contains the text.
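One thing to watch for: if a link ever points to a page with a different layout, the find call returns None and get_text() raises an AttributeError. A more forgiving variant of the loop in the next cell could check for that first. This is a sketch, not the notebook's original code.

# Sketch: a more forgiving version of the scraping loop in the next cell.
# It skips pages that have no <div id="document_content"> instead of crashing.
import time

text_list = []
for link in links:
    page = makesoup(link)
    content = page.find(id="document_content")
    if content:
        text_list.append(content.get_text())
    else:
        print("No document_content found on " + link)
    time.sleep(1.0)  # Be polite: wait a second between requests.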
In [108]:
# Import time to be able to delay the scraping.
import time
# Create an empty list to store the texts.
text_list = []
# Scrape all the links in the list.
for link in links:
    print("Scraping " + link)
    page = makesoup(link)  # Scrape each individual page in the list.
    text = page.find(id="document_content").get_text()  # Find <div id="document_content"> and extract its text.
    text_list.append(text)  # Add the text to the list.
    time.sleep(1.0)  # Delay 1.0 second between requests.
print("Done.")
In [109]:
# How many texts did we get?
# It should be the same as the number of links, otherwise something is probably wrong.
len(text_list)
Out[109]:
In [110]:
# All texts are kept separate in the list text_list.
# This will join all texts together into one big text separated by a new line (\n).
one_big_text = "\n".join(text_list)
# Save the text to a single file.
file = open("all_texts.txt", "w", encoding="utf-8")
file.write(one_big_text)
file.close()
print("File saved!")