The goal is to scrape the most recent judgments and opinions from InfoCuria - Case-law of the Court of Justice.
The scraping is done in two steps:
1. Scrape the search results page and collect the links to the individual documents.
2. Scrape each linked document page and extract the text of the ruling.
After that, the text of the rulings is saved into a single file that can be used for later analysis.
In [103]:
# Import libraries that we need for the web scraping.
import urllib.request
from bs4 import BeautifulSoup

# Scrape all HTML from a webpage.
def scrapewebpage(url):
    # Open URL and get HTML.
    web = urllib.request.urlopen(url)
    # Make sure there weren't any errors opening the URL.
    if web.getcode() == 200:
        html = web.read()
        return(html)
    else:
        print("Error %s reading %s" % (web.getcode(), url))

# Helper function that scrapes the webpage and turns it into soup.
def makesoup(url):
    html = scrapewebpage(url)
    return(BeautifulSoup(html, "lxml"))
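Some servers reject requests that arrive with the default urllib user agent, or hang without answering. If that happens, a slightly more defensive variant of scrapewebpage can help. This is only a sketch: the scrapewebpage_safe name, the browser-like User-Agent string, and the 30-second timeout are illustrative choices, not part of the original notebook.

# Sketch of a more defensive scraper: sends a browser-like User-Agent header
# and gives up after a timeout instead of hanging indefinitely.
import urllib.request

def scrapewebpage_safe(url, timeout=30):
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    web = urllib.request.urlopen(request, timeout=timeout)
    if web.getcode() == 200:
        return web.read()
    print("Error %s reading %s" % (web.getcode(), url))
    return None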
Get the most recent judgments and opinions from the table with the search results. In the page source, this table is marked up as <table class="detail_table_documents">. The table has many columns, but we're only interested in the column "Name of the parties" and the links to each document.
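To make the class names in the next cell concrete, here is a small sketch that parses a simplified, made-up row in the same shape as the results table. The markup below is illustrative only; the real InfoCuria rows contain more columns and attributes.

# Parse a simplified, made-up row shaped like the results table,
# to show which cells the class names refer to (not real InfoCuria markup).
from bs4 import BeautifulSoup

sample_row = """
<table class="detail_table_documents">
  <tr>
    <td class="table_cell_nom_usuel">Some Party v Other Party</td>
    <td class="table_cell_links_eurlex"><a href="http://example.org/doc">Document</a></td>
  </tr>
</table>
"""
soup = BeautifulSoup(sample_row, "lxml")
row = soup.find("tr")
print(row.find("td", "table_cell_nom_usuel").get_text())             # Name of the parties
print(row.find("td", "table_cell_links_eurlex").find("a")["href"])   # Link to the document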
In [104]:
# Scrape the page with most recent judgments and opinions.
judgments_page = makesoup("http://curia.europa.eu/juris/documents.jsf?language=en&jur=C&cit=none%252CC%252CCJ%252CR%252C2008E%252C%252C%252C%252C%252C%252C%252C%252C%252C%252Ctrue%252Cfalse%252Cfalse&td=%24mode%3D8D%24from%3D2018.2.1%24to%3D2018.2.8%3B%3B%3BPUB1%2CPUB3%3BNPUB1%3B%3BORDALL&ordreTri=dateDesc&pcs=O&redirection=doc&page=1")
In [105]:
# Extract the table from the page.
table = judgments_page.find("table", "detail_table_documents")
In [106]:
# Create an empty list to store the links.
links = []
# Go through row by row in the table. Note that tr is a row, td is a column.
for row in table.find_all("tr"):
    name = row.find("td", "table_cell_nom_usuel")  # Find the first column (td) with the class "table_cell_nom_usuel".
    link = row.find("td", "table_cell_links_eurlex")  # Find the first column (td) with the class "table_cell_links_eurlex".
    if name:
        # Print the name of the parties in the ruling.
        print(name.get_text())
    if link:
        links.append(link.find("a")["href"])  # Find <a href="link"> and add the link to the list.
        #print(link.find("a")["href"])
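If the party names should stay attached to their links for the later analysis, the two lookups above can also be collected into pairs. This is just a sketch of a variation; the notebook itself only keeps the list of links.

# Sketch: keep the party names together with their document links as (name, link) pairs.
# Assumes the same `table` object as above.
cases = []
for row in table.find_all("tr"):
    name = row.find("td", "table_cell_nom_usuel")
    link = row.find("td", "table_cell_links_eurlex")
    if name and link and link.find("a"):
        cases.append((name.get_text(strip=True), link.find("a")["href"]))
print(cases[:3])  # Preview the first three pairs.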
In [107]:
# How many links did we get?
len(links)
Out[107]:
Each individual page (see example) consists only of text, which makes it much easier to scrape than the previous step. We only need to find <div id="document_content">, which contains the text.
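One thing to watch for: if a link ever points to a page with a different layout, the find call returns None and get_text() raises an AttributeError. A more forgiving variant of the loop in the next cell could check for that first. This is a sketch, not the notebook's original code.

# Sketch: a more forgiving version of the scraping loop in the next cell.
# It skips pages that have no <div id="document_content"> instead of crashing.
import time

text_list = []
for link in links:
    page = makesoup(link)
    content = page.find(id="document_content")
    if content:
        text_list.append(content.get_text())
    else:
        print("No document_content found on " + link)
    time.sleep(1.0)  # Be polite: wait a second between requests.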
In [108]:
# Import time to be able to delay the scraping.
import time
# Create an empty list to store the texts.
text_list = []
# Scrape all the links in the list.
for link in links:
    print("Scraping " + link)
    page = makesoup(link)  # Scrape each individual page in the list.
    text = page.find(id="document_content").get_text()  # Find <div id="document_content"> and extract its text.
    text_list.append(text)  # Add the text to the list.
    time.sleep(1.0)  # Delay 1.0 second between requests.
print("Done.")
In [109]:
# How many texts did we get?
# It should be the same as the number of links, otherwise something is probably wrong.
len(text_list)
Out[109]:
In [110]:
# All texts are kept separate in the list text_list.
# This will join all texts together into one big text separated by a new line (\n).
one_big_text = "\n".join(text_list)
# Save the text to a single file.
file = open("all_texts.txt", "w", encoding="utf-8")
file.write(one_big_text)
file.close()
print("File saved!")