This tutorial shows you how to use ETK to extract information for all soccer teams in Italy. Suppose that you want to construct a list of records containing team name, home city, latitude and longitude for every team in Italy.

We start with a Wikipedia page that lists all soccer teams in Italy: https://en.wikipedia.org/wiki/List_of_football_clubs_in_Italy. The page has a table for each division. Each table contains the team name and home city, as well as other information that we will ignore for now. The tables don’t contain the latitude and longitude of the cities. You will notice that most city names in the table are links to other Wikipedia pages, and we could get the latitudes and longitudes from there. In this tutorial we will use a different approach: linking the city names to geonames.org, a geographical database that covers cities worldwide.
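
To make the goal concrete, here is a minimal sketch of the target record and of the linking idea, written in plain Python rather than with ETK. The `TeamRecord` class, the `link_city` helper and the two-entry `gazetteer` are hypothetical names introduced only for illustration, and the coordinates are rounded approximations, not values taken from geonames.org.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class TeamRecord:
    """One output record: a team, its home city and the city's coordinates."""
    team: str
    city: str
    latitude: Optional[float] = None   # filled in once the city is linked
    longitude: Optional[float] = None


def link_city(record: TeamRecord,
              gazetteer: Dict[str, Tuple[float, float]]) -> TeamRecord:
    """Copy coordinates from the gazetteer when the city name is known."""
    coords = gazetteer.get(record.city.strip().lower())
    if coords is not None:
        record.latitude, record.longitude = coords
    return record


# Hypothetical stand-in for a geonames lookup table (approximate values).
gazetteer = {
    "rome": (41.9, 12.5),
    "milan": (45.46, 9.19),
}

lazio = link_city(TeamRecord(team="Lazio", city="Rome"), gazetteer)
empoli = link_city(TeamRecord(team="Empoli", city="Empoli"), gazetteer)
print(lazio)
print(empoli)  # coordinates stay None: the city is missing from the toy gazetteer
```

A city that cannot be linked keeps `None` coordinates, so unlinked records are easy to find and review later.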

Part 1: Extracting The Team Tables

Look at the page, and you will notice that the teams are scattered over multiple tables, one for each division. Fortunately, all the tables have the same structure, which will make our job easier.
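
Because every table shares the same header row, one routine can handle all of them. The sketch below assumes a simplified representation, a list of rows where each row is a list of cell strings; this is not the structure ETK's table extractor returns, and the function name `team_city_pairs` is hypothetical. It finds the columns by header text instead of hard-coding their positions.

```python
from typing import List, Tuple


def team_city_pairs(rows: List[List[str]]) -> List[Tuple[str, str]]:
    """Return (team, home city) pairs from one table; rows[0] is the header."""
    header = [cell.strip().lower() for cell in rows[0]]
    team_col = header.index("team")
    city_col = header.index("home city")
    return [(row[team_col], row[city_col]) for row in rows[1:]]


header_row = ["Team", "Home city", "Stadium", "Capacity", "2017–18 season"]

# Two toy tables built from rows of the first table on the page.
table_one = [
    header_row,
    ["Atalanta", "Bergamo", "Stadio Atleti Azzurri d'Italia", "21,300", "7th in Serie A"],
    ["Bologna", "Bologna", "Stadio Renato Dall'Ara", "38,279", "15th in Serie A"],
]
table_two = [
    header_row,
    ["Cagliari", "Cagliari", "Sardegna Arena", "16,233", "16th in Serie A"],
]

# The same function is applied to every table without any per-table code.
all_pairs = [pair for table in (table_one, table_two)
             for pair in team_city_pairs(table)]
print(all_pairs)
```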

Defining an ETK module

An ETK module groups all the extraction code for a project into a reusable unit. Large projects often consist of multiple ETK modules, one for each kind of document. In this tutorial we will have only one module.
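
The idea of one module per kind of document can be sketched in plain Python. The class below is a hypothetical stand-in and not ETK's own base class; the method names `document_selector` and `process_document`, and the `run_modules` helper, are assumptions chosen for this illustration.

```python
from typing import Any, Dict, List


class TeamTableModule:
    """Hypothetical module that only handles documents from one dataset."""

    def document_selector(self, doc: Dict[str, Any]) -> bool:
        # Decide whether this module should handle the document.
        return doc.get("dataset") == "italy_team"

    def process_document(self, doc: Dict[str, Any]) -> List[Dict[str, Any]]:
        # Real extraction code would go here; this sketch only records the URL.
        return [{"source_url": doc["url"]}]


def run_modules(modules: List[Any], doc: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Send the document to every module whose selector accepts it."""
    results: List[Dict[str, Any]] = []
    for module in modules:
        if module.document_selector(doc):
            results.extend(module.process_document(doc))
    return results


team_doc = {
    "dataset": "italy_team",
    "url": "https://en.wikipedia.org/wiki/List_of_football_clubs_in_Italy",
}
other_doc = {"dataset": "something_else", "url": "https://example.org/"}

print(run_modules([TeamTableModule()], team_doc))
print(run_modules([TeamTableModule()], other_doc))  # no module accepts it
```

Keeping the selection test next to the extraction code means a new kind of document can be supported by adding a module, without touching the existing ones.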

First, we load the dependencies used throughout this tutorial. We also create an ETK instance, etk, which we will use for the rest of the process.


In [19]:
import requests
import json
import jsonpath_ng.ext as jex
import re
import sys
sys.path.append('../')
from etk.extractors.table_extractor import TableExtractor
from etk.extractors.glossary_extractor import GlossaryExtractor
from etk.etk import ETK
from etk.knowledge_graph_schema import KGSchema

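# Load the knowledge graph schema from the master config file.
with open('./resources/master_config.json') as f:
    kg_schema = KGSchema(json.load(f))
etk = ETK(kg_schema=kg_schema)
# Use the extended jsonpath_ng parser for JSON path expressions.
etk.parser = jex.parse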

Reading the HTML file

We load a saved copy of the soccer teams page from a local file and keep the URL it came from. We also create a cdr, a dictionary with raw_content, url and dataset fields. We will use it in the second part of this tutorial.


In [1]:
url = 'https://en.wikipedia.org/wiki/List_of_football_clubs_in_Italy'

# Read a saved copy of the page instead of fetching it over the network.
with open('./resources/italy_teams.html', mode='r', encoding='utf-8') as f:
    html_page = f.read()

cdr = {
    'raw_content': html_page,
    'url': url,
    'dataset': 'italy_team'
}

print('The first 600 chars of the html page:\n')
print(html_page[:600])


The first 600 chars of the html page:

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of football clubs in Italy - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_football_clubs_in_Italy","wgTitle":"List of football clubs in Italy","wgCurRevisionId":859334329,"wgRevisionId":859334329,"wgArticl
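
The cell above reads a saved copy of the page. If no saved copy exists yet, one way to create it is sketched below using only the standard library. The helper names `load_page` and `make_cdr` and the User-Agent string are hypothetical; the sketch assumes the page is UTF-8 encoded, as its `<meta charset>` tag indicates.

```python
import os
import urllib.request


def load_page(url: str, cache_path: str) -> str:
    """Return the page HTML, downloading it only if no local copy exists."""
    if os.path.exists(cache_path):
        with open(cache_path, mode="r", encoding="utf-8") as f:
            return f.read()
    # Some sites reject requests without a User-Agent; this value is made up.
    request = urllib.request.Request(
        url, headers={"User-Agent": "etk-tutorial-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        html = response.read().decode("utf-8")
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, mode="w", encoding="utf-8") as f:
        f.write(html)
    return html


def make_cdr(html: str, url: str, dataset: str) -> dict:
    """Build a document dict with the same three fields as the cell above."""
    return {"raw_content": html, "url": url, "dataset": dataset}


# Example usage (performs a download only when the cache file is missing):
# page_url = 'https://en.wikipedia.org/wiki/List_of_football_clubs_in_Italy'
# html_page = load_page(page_url, './resources/italy_teams.html')
# cdr = make_cdr(html_page, page_url, 'italy_team')
```

Working from a cached file keeps the rest of the tutorial reproducible even if the live Wikipedia page changes.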

Extracting the tables

Extracting the tables from a web page is easy because ETK has a table extractor. We divide this phase into two parts.

The first part is to create an instance of TableExtractor, and use that instance to extract the raw tables.


In [21]:
my_table_extractor = TableExtractor()
# Keep only the first 14 extracted tables; any later tables are ignored.
tables_in_page = my_table_extractor.extract(html_page)[:14]
print('Number of tables in this page:', len(tables_in_page), '\n')
print('The first table in the page is shown below:\n')
print(json.dumps(tables_in_page[0].value, indent=2))


Number of tables in this page: 14 

The first table in the page is shown below:

{
  "features": {
    "no_of_rows": 21,
    "no_of_cells": 105,
    "max_cols_in_a_row": 5,
    "ratio_of_img_tags_to_cells": 0.0,
    "ratio_of_href_tags_to_cells": 0.7238095238095238,
    "ratio_of_input_tags_to_cells": 0.0,
    "ratio_of_select_tags_to_cells": 0.0,
    "ratio_of_colspan_tags_to_cells": 0.0,
    "ratio_of_colons_to_cells": 0.0,
    "avg_cell_len": 14.942857142857143,
    "avg_row_len": 78.71428571428571,
    "avg_row_len_dev": 8.490409488646232,
    "avg_col_len": 313.8,
    "avg_col_len_dev": 3.8774340214067022,
    "no_of_cols_containing_num": 2,
    "no_of_cols_empty": 0
  },
  "rows": [
    {
      "cells": [
        {
          "cell": "<th>Team\n</th>",
          "text": "Team",
          "id": "row_0_col_0"
        },
        {
          "cell": "<th>Home city\n</th>",
          "text": "Home city",
          "id": "row_0_col_1"
        },
        {
          "cell": "<th>Stadium\n</th>",
          "text": "Stadium",
          "id": "row_0_col_2"
        },
        {
          "cell": "<th>Capacity\n</th>",
          "text": "Capacity",
          "id": "row_0_col_3"
        },
        {
          "cell": "<th>2017\u201318 season\n</th>",
          "text": "2017\u201318 season",
          "id": "row_0_col_4"
        }
      ],
      "text": "Team | Home city | Stadium | Capacity | 2017\u201318 season",
      "html": "<html><body><table><th>Team\n</th>\n<th>Home city\n</th>\n<th>Stadium\n</th>\n<th>Capacity\n</th>\n<th>2017\u201318 season\n</th>\n</table></body></html>",
      "id": "row_0"
    },
    {
      "cells": [
        {
          "cell": "<td><a href=\"/wiki/Atalanta_B.C.\" title=\"Atalanta B.C.\">Atalanta</a>\n</td>",
          "text": "Atalanta",
          "id": "row_1_col_0"
        },
        {
          "cell": "<td><a href=\"/wiki/Bergamo\" title=\"Bergamo\">Bergamo</a>\n</td>",
          "text": "Bergamo",
          "id": "row_1_col_1"
        },
        {
          "cell": "<td><a href=\"/wiki/Stadio_Atleti_Azzurri_d%27Italia\" title=\"Stadio Atleti Azzurri d'Italia\">Stadio Atleti Azzurri d'Italia</a>\n</td>",
          "text": "Stadio Atleti Azzurri d'Italia",
          "id": "row_1_col_2"
        },
        {
          "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004213000000000000\u2660</span>21,300\n</td>",
          "text": "7004213000000000000\u2660 21,300",
          "id": "row_1_col_3"
        },
        {
          "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">7th in Serie A</a>\n</td>",
          "text": "7th in Serie A",
          "id": "row_1_col_4"
        }
      ],
      "text": "Atalanta | Bergamo | Stadio Atleti Azzurri d'Italia | 7004213000000000000\u2660 21,300 | 7th in Serie A",
      "html": "<html><body><table><td><a href=\"/wiki/Atalanta_B.C.\" title=\"Atalanta B.C.\">Atalanta</a>\n</td>\n<td><a href=\"/wiki/Bergamo\" title=\"Bergamo\">Bergamo</a>\n</td>\n<td><a href=\"/wiki/Stadio_Atleti_Azzurri_d%27Italia\" title=\"Stadio Atleti Azzurri d'Italia\">Stadio Atleti Azzurri d'Italia</a>\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004213000000000000\u2660</span>21,300\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">7th in Serie A</a>\n</td>\n</table></body></html>",
      "id": "row_1"
    },
    {
      "cells": [
        {
          "cell": "<td><a href=\"/wiki/Bologna_F.C._1909\" title=\"Bologna F.C. 1909\">Bologna</a>\n</td>",
          "text": "Bologna",
          "id": "row_2_col_0"
        },
        {
          "cell": "<td><a href=\"/wiki/Bologna\" title=\"Bologna\">Bologna</a>\n</td>",
          "text": "Bologna",
          "id": "row_2_col_1"
        },
        {
          "cell": "<td><a href=\"/wiki/Stadio_Renato_Dall%27Ara\" title=\"Stadio Renato Dall'Ara\">Stadio Renato Dall'Ara</a>\n</td>",
          "text": "Stadio Renato Dall'Ara",
          "id": "row_2_col_2"
        },
        {
          "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004382790000000000\u2660</span>38,279\n</td>",
          "text": "7004382790000000000\u2660 38,279",
          "id": "row_2_col_3"
        },
        {
          "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">15th in Serie A</a>\n</td>",
          "text": "15th in Serie A",
          "id": "row_2_col_4"
        }
      ],
      "text": "Bologna | Bologna | Stadio Renato Dall'Ara | 7004382790000000000\u2660 38,279 | 15th in Serie A",
      "html": "<html><body><table><td><a href=\"/wiki/Bologna_F.C._1909\" title=\"Bologna F.C. 1909\">Bologna</a>\n</td>\n<td><a href=\"/wiki/Bologna\" title=\"Bologna\">Bologna</a>\n</td>\n<td><a href=\"/wiki/Stadio_Renato_Dall%27Ara\" title=\"Stadio Renato Dall'Ara\">Stadio Renato Dall'Ara</a>\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004382790000000000\u2660</span>38,279\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">15th in Serie A</a>\n</td>\n</table></body></html>",
      "id": "row_2"
    },
    {
      "cells": [
        {
          "cell": "<td><a href=\"/wiki/Cagliari_Calcio\" title=\"Cagliari Calcio\">Cagliari</a>\n</td>",
          "text": "Cagliari",
          "id": "row_3_col_0"
        },
        {
          "cell": "<td><a href=\"/wiki/Cagliari\" title=\"Cagliari\">Cagliari</a>\n</td>",
          "text": "Cagliari",
          "id": "row_3_col_1"
        },
        {
          "cell": "<td><a href=\"/wiki/Sardegna_Arena\" title=\"Sardegna Arena\">Sardegna Arena</a>\n</td>",
          "text": "Sardegna Arena",
          "id": "row_3_col_2"
        },
        {
          "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004162330000000000\u2660</span>16,233\n</td>",
          "text": "7004162330000000000\u2660 16,233",
          "id": "row_3_col_3"
        },
        {
          "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">16th in Serie A</a>\n</td>",
          "text": "16th in Serie A",
          "id": "row_3_col_4"
        }
      ],
      "text": "Cagliari | Cagliari | Sardegna Arena | 7004162330000000000\u2660 16,233 | 16th in Serie A",
      "html": "<html><body><table><td><a href=\"/wiki/Cagliari_Calcio\" title=\"Cagliari Calcio\">Cagliari</a>\n</td>\n<td><a href=\"/wiki/Cagliari\" title=\"Cagliari\">Cagliari</a>\n</td>\n<td><a href=\"/wiki/Sardegna_Arena\" title=\"Sardegna Arena\">Sardegna Arena</a>\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004162330000000000\u2660</span>16,233\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">16th in Serie A</a>\n</td>\n</table></body></html>",
      "id": "row_3"
    },
    {
      "cells": [
        {
          "cell": "<td><a href=\"/wiki/A.C._ChievoVerona\" title=\"A.C. ChievoVerona\">Chievo</a>\n</td>",
          "text": "Chievo",
          "id": "row_4_col_0"
        },
        {
          "cell": "<td><a href=\"/wiki/Verona\" title=\"Verona\">Verona</a>\n</td>",
          "text": "Verona",
          "id": "row_4_col_1"
        },
        {
          "cell": "<td><a href=\"/wiki/Stadio_Marc%27Antonio_Bentegodi\" title=\"Stadio Marc'Antonio Bentegodi\">Stadio Marc'Antonio Bentegodi</a>\n</td>",
          "text": "Stadio Marc'Antonio Bentegodi",
          "id": "row_4_col_2"
        },
        {
          "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004384020000000000\u2660</span>38,402\n</td>",
          "text": "7004384020000000000\u2660 38,402",
          "id": "row_4_col_3"
        },
        {
          "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">13th in Serie A</a>\n</td>",
          "text": "13th in Serie A",
          "id": "row_4_col_4"
        }
      ],
      "text": "Chievo | Verona | Stadio Marc'Antonio Bentegodi | 7004384020000000000\u2660 38,402 | 13th in Serie A",
      "html": "<html><body><table><td><a href=\"/wiki/A.C._ChievoVerona\" title=\"A.C. ChievoVerona\">Chievo</a>\n</td>\n<td><a href=\"/wiki/Verona\" title=\"Verona\">Verona</a>\n</td>\n<td><a href=\"/wiki/Stadio_Marc%27Antonio_Bentegodi\" title=\"Stadio Marc'Antonio Bentegodi\">Stadio Marc'Antonio Bentegodi</a>\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004384020000000000\u2660</span>38,402\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">13th in Serie A</a>\n</td>\n</table></body></html>",
      "id": "row_4"
    },
    {
      "cells": [
        {
          "cell": "<td><a href=\"/wiki/Empoli_F.C.\" title=\"Empoli F.C.\">Empoli</a>\n</td>",
          "text": "Empoli",
          "id": "row_5_col_0"
        },
        {
          "cell": "<td><a href=\"/wiki/Empoli\" title=\"Empoli\">Empoli</a>\n</td>",
          "text": "Empoli",
          "id": "row_5_col_1"
        },
        {
          "cell": "<td><a href=\"/wiki/Stadio_Carlo_Castellani\" title=\"Stadio Carlo Castellani\">Stadio Carlo Castellani</a>\n</td>",
          "text": "Stadio Carlo Castellani",
          "id": "row_5_col_2"
        },
        {
          "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004162840000000000\u2660</span>16,284\n</td>",
          "text": "7004162840000000000\u2660 16,284",
          "id": "row_5_col_3"
        },
        {
          "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_B\" title=\"2017\u201318 Serie B\">Serie B Champions</a>\n</td>",
          "text": "Serie B Champions",
          "id": "row_5_col_4"
        }
      ],
      "text": "Empoli | Empoli | Stadio Carlo Castellani | 7004162840000000000\u2660 16,284 | Serie B Champions",
      "html": "<html><body><table><td><a href=\"/wiki/Empoli_F.C.\" title=\"Empoli F.C.\">Empoli</a>\n</td>\n<td><a href=\"/wiki/Empoli\" title=\"Empoli\">Empoli</a>\n</td>\n<td><a href=\"/wiki/Stadio_Carlo_Castellani\" title=\"Stadio Carlo Castellani\">Stadio Carlo Castellani</a>\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004162840000000000\u2660</span>16,284\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_B\" title=\"2017\u201318 Serie B\">Serie B Champions</a>\n</td>\n</table></body></html>",
      "id": "row_5"
    },
    {
      "cells": [
        {
          "cell": "<td><a href=\"/wiki/ACF_Fiorentina\" title=\"ACF Fiorentina\">Fiorentina</a>\n</td>",
          "text": "Fiorentina",
          "id": "row_6_col_0"
        },
        {
          "cell": "<td><a href=\"/wiki/Florence\" title=\"Florence\">Florence</a>\n</td>",
          "text": "Florence",
          "id": "row_6_col_1"
        },
        {
          "cell": "<td><a href=\"/wiki/Stadio_Artemio_Franchi\" title=\"Stadio Artemio Franchi\">Stadio Artemio Franchi</a>\n</td>",
          "text": "Stadio Artemio Franchi",
          "id": "row_6_col_2"
        },
        {
          "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004431470000000000\u2660</span>43,147\n</td>",
          "text": "7004431470000000000\u2660 43,147",
          "id": "row_6_col_3"
        },
        {
          "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">8th in Serie A</a>\n</td>",
          "text": "8th in Serie A",
          "id": "row_6_col_4"
        }
      ],
      "text": "Fiorentina | Florence | Stadio Artemio Franchi | 7004431470000000000\u2660 43,147 | 8th in Serie A",
      "html": "<html><body><table><td><a href=\"/wiki/ACF_Fiorentina\" title=\"ACF Fiorentina\">Fiorentina</a>\n</td>\n<td><a href=\"/wiki/Florence\" title=\"Florence\">Florence</a>\n</td>\n<td><a href=\"/wiki/Stadio_Artemio_Franchi\" title=\"Stadio Artemio Franchi\">Stadio Artemio Franchi</a>\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004431470000000000\u2660</span>43,147\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">8th in Serie A</a>\n</td>\n</table></body></html>",
      "id": "row_6"
    },
    {
      "cells": [
        {
          "cell": "<td><a href=\"/wiki/Frosinone_Calcio\" title=\"Frosinone Calcio\">Frosinone</a>\n</td>",
          "text": "Frosinone",
          "id": "row_7_col_0"
        },
        {
          "cell": "<td><a href=\"/wiki/Frosinone\" title=\"Frosinone\">Frosinone</a>\n</td>",
          "text": "Frosinone",
          "id": "row_7_col_1"
        },
        {
          "cell": "<td><a href=\"/wiki/Stadio_Benito_Stirpe\" title=\"Stadio Benito Stirpe\">Stadio Benito Stirpe</a>\n</td>",
          "text": "Stadio Benito Stirpe",
          "id": "row_7_col_2"
        },
        {
          "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004162270000000000\u2660</span>16,227\n</td>",
          "text": "7004162270000000000\u2660 16,227",
          "id": "row_7_col_3"
        },
        {
          "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_B\" title=\"2017\u201318 Serie B\">Serie B Playoff winner</a>\n</td>",
          "text": "Serie B Playoff winner",
          "id": "row_7_col_4"
        }
      ],
      "text": "Frosinone | Frosinone | Stadio Benito Stirpe | 7004162270000000000\u2660 16,227 | Serie B Playoff winner",
      "html": "<html><body><table><td><a href=\"/wiki/Frosinone_Calcio\" title=\"Frosinone Calcio\">Frosinone</a>\n</td>\n<td><a href=\"/wiki/Frosinone\" title=\"Frosinone\">Frosinone</a>\n</td>\n<td><a href=\"/wiki/Stadio_Benito_Stirpe\" title=\"Stadio Benito Stirpe\">Stadio Benito Stirpe</a>\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004162270000000000\u2660</span>16,227\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_B\" title=\"2017\u201318 Serie B\">Serie B Playoff winner</a>\n</td>\n</table></body></html>",
      "id": "row_7"
    },
    {
      "cells": [
        {
          "cell": "<td><a href=\"/wiki/Genoa_C.F.C.\" title=\"Genoa C.F.C.\">Genoa</a>\n</td>",
          "text": "Genoa",
          "id": "row_8_col_0"
        },
        {
          "cell": "<td><a href=\"/wiki/Genoa\" title=\"Genoa\">Genoa</a>\n</td>",
          "text": "Genoa",
          "id": "row_8_col_1"
        },
        {
          "cell": "<td><a href=\"/wiki/Stadio_Luigi_Ferraris\" title=\"Stadio Luigi Ferraris\">Stadio Luigi Ferraris</a>\n</td>",
          "text": "Stadio Luigi Ferraris",
          "id": "row_8_col_2"
        },
        {
          "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004366850000000000\u2660</span>36,685\n</td>",
          "text": "7004366850000000000\u2660 36,685",
          "id": "row_8_col_3"
        },
        {
          "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">12th in Serie A</a>\n</td>",
          "text": "12th in Serie A",
          "id": "row_8_col_4"
        }
      ],
      "text": "Genoa | Genoa | Stadio Luigi Ferraris | 7004366850000000000\u2660 36,685 | 12th in Serie A",
      "html": "<html><body><table><td><a href=\"/wiki/Genoa_C.F.C.\" title=\"Genoa C.F.C.\">Genoa</a>\n</td>\n<td><a href=\"/wiki/Genoa\" title=\"Genoa\">Genoa</a>\n</td>\n<td><a href=\"/wiki/Stadio_Luigi_Ferraris\" title=\"Stadio Luigi Ferraris\">Stadio Luigi Ferraris</a>\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004366850000000000\u2660</span>36,685\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">12th in Serie A</a>\n</td>\n</table></body></html>",
      "id": "row_8"
    },
    {
      "cells": [
        {
          "cell": "<td><a href=\"/wiki/Inter_Milan\" title=\"Inter Milan\">Internazionale</a>\n</td>",
          "text": "Internazionale",
          "id": "row_9_col_0"
        },
        {
          "cell": "<td><a href=\"/wiki/Milan\" title=\"Milan\">Milan</a>\n</td>",
          "text": "Milan",
          "id": "row_9_col_1"
        },
        {
          "cell": "<td><a href=\"/wiki/San_Siro\" title=\"San Siro\">San Siro</a>\n</td>",
          "text": "San Siro",
          "id": "row_9_col_2"
        },
        {
          "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004800180000000000\u2660</span>80,018\n</td>",
          "text": "7004800180000000000\u2660 80,018",
          "id": "row_9_col_3"
        },
        {
          "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">4th in Serie A</a>\n</td>",
          "text": "4th in Serie A",
          "id": "row_9_col_4"
        }
      ],
      "text": "Internazionale | Milan | San Siro | 7004800180000000000\u2660 80,018 | 4th in Serie A",
      "html": "<html><body><table><td><a href=\"/wiki/Inter_Milan\" title=\"Inter Milan\">Internazionale</a>\n</td>\n<td><a href=\"/wiki/Milan\" title=\"Milan\">Milan</a>\n</td>\n<td><a href=\"/wiki/San_Siro\" title=\"San Siro\">San Siro</a>\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004800180000000000\u2660</span>80,018\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">4th in Serie A</a>\n</td>\n</table></body></html>",
      "id": "row_9"
    },
    {
      "cells": [
        {
          "cell": "<td><a href=\"/wiki/Juventus_F.C.\" title=\"Juventus F.C.\">Juventus</a>\n</td>",
          "text": "Juventus",
          "id": "row_10_col_0"
        },
        {
          "cell": "<td><a href=\"/wiki/Turin\" title=\"Turin\">Turin</a>\n</td>",
          "text": "Turin",
          "id": "row_10_col_1"
        },
        {
          "cell": "<td><a href=\"/wiki/Juventus_Stadium\" title=\"Juventus Stadium\">Juventus Stadium</a>\n</td>",
          "text": "Juventus Stadium",
          "id": "row_10_col_2"
        },
        {
          "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004415070000000000\u2660</span>41,507\n</td>",
          "text": "7004415070000000000\u2660 41,507",
          "id": "row_10_col_3"
        },
        {
          "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">Serie A Champions</a>\n</td>",
          "text": "Serie A Champions",
          "id": "row_10_col_4"
        }
      ],
      "text": "Juventus | Turin | Juventus Stadium | 7004415070000000000\u2660 41,507 | Serie A Champions",
      "html": "<html><body><table><td><a href=\"/wiki/Juventus_F.C.\" title=\"Juventus F.C.\">Juventus</a>\n</td>\n<td><a href=\"/wiki/Turin\" title=\"Turin\">Turin</a>\n</td>\n<td><a href=\"/wiki/Juventus_Stadium\" title=\"Juventus Stadium\">Juventus Stadium</a>\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004415070000000000\u2660</span>41,507\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">Serie A Champions</a>\n</td>\n</table></body></html>",
      "id": "row_10"
    },
    {
      "cells": [
        {
          "cell": "<td><a href=\"/wiki/S.S._Lazio\" title=\"S.S. Lazio\">Lazio</a>\n</td>",
          "text": "Lazio",
          "id": "row_11_col_0"
        },
        {
          "cell": "<td><a href=\"/wiki/Rome\" title=\"Rome\">Rome</a>\n</td>",
          "text": "Rome",
          "id": "row_11_col_1"
        },
        {
          "cell": "<td><a href=\"/wiki/Stadio_Olimpico\" title=\"Stadio Olimpico\">Stadio Olimpico</a>\n</td>",
          "text": "Stadio Olimpico",
          "id": "row_11_col_2"
        },
        {
          "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004726980000000000\u2660</span>72,698\n</td>",
          "text": "7004726980000000000\u2660 72,698",
          "id": "row_11_col_3"
        },
        {
          "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">5th in Serie A</a>\n</td>",
          "text": "5th in Serie A",
          "id": "row_11_col_4"
        }
      ],
      "text": "Lazio | Rome | Stadio Olimpico | 7004726980000000000\u2660 72,698 | 5th in Serie A",
      "html": "<html><body><table><td><a href=\"/wiki/S.S._Lazio\" title=\"S.S. Lazio\">Lazio</a>\n</td>\n<td><a href=\"/wiki/Rome\" title=\"Rome\">Rome</a>\n</td>\n<td><a href=\"/wiki/Stadio_Olimpico\" title=\"Stadio Olimpico\">Stadio Olimpico</a>\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004726980000000000\u2660</span>72,698\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">5th in Serie A</a>\n</td>\n</table></body></html>",
      "id": "row_11"
    },
    {
      "cells": [
        {
          "cell": "<td><a href=\"/wiki/A.C._Milan\" title=\"A.C. Milan\">Milan</a>\n</td>",
          "text": "Milan",
          "id": "row_12_col_0"
        },
        {
          "cell": "<td>Milan\n</td>",
          "text": "Milan",
          "id": "row_12_col_1"
        },
        {
          "cell": "<td>San Siro\n</td>",
          "text": "San Siro",
          "id": "row_12_col_2"
        },
        {
          "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004800180000000000\u2660</span>80,018\n</td>",
          "text": "7004800180000000000\u2660 80,018",
          "id": "row_12_col_3"
        },
        {
          "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">6th in Serie A</a>\n</td>",
          "text": "6th in Serie A",
          "id": "row_12_col_4"
        }
      ],
      "text": "Milan | Milan | San Siro | 7004800180000000000\u2660 80,018 | 6th in Serie A",
      "html": "<html><body><table><td><a href=\"/wiki/A.C._Milan\" title=\"A.C. Milan\">Milan</a>\n</td>\n<td>Milan\n</td>\n<td>San Siro\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004800180000000000\u2660</span>80,018\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">6th in Serie A</a>\n</td>\n</table></body></html>",
      "id": "row_12"
    },
    {
      "cells": [
        {
          "cell": "<td><a href=\"/wiki/S.S.C._Napoli\" title=\"S.S.C. Napoli\">Napoli</a>\n</td>",
          "text": "Napoli",
          "id": "row_13_col_0"
        },
        {
          "cell": "<td><a href=\"/wiki/Naples\" title=\"Naples\">Naples</a>\n</td>",
          "text": "Naples",
          "id": "row_13_col_1"
        },
        {
          "cell": "<td><a href=\"/wiki/Stadio_San_Paolo\" title=\"Stadio San Paolo\">Stadio San Paolo</a>\n</td>",
          "text": "Stadio San Paolo",
          "id": "row_13_col_2"
        },
        {
          "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004602400000000000\u2660</span>60,240\n</td>",
          "text": "7004602400000000000\u2660 60,240",
          "id": "row_13_col_3"
        },
        {
          "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">2nd in Serie A</a>\n</td>",
          "text": "2nd in Serie A",
          "id": "row_13_col_4"
        }
      ],
      "text": "Napoli | Naples | Stadio San Paolo | 7004602400000000000\u2660 60,240 | 2nd in Serie A",
      "html": "<html><body><table><td><a href=\"/wiki/S.S.C._Napoli\" title=\"S.S.C. Napoli\">Napoli</a>\n</td>\n<td><a href=\"/wiki/Naples\" title=\"Naples\">Naples</a>\n</td>\n<td><a href=\"/wiki/Stadio_San_Paolo\" title=\"Stadio San Paolo\">Stadio San Paolo</a>\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004602400000000000\u2660</span>60,240\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">2nd in Serie A</a>\n</td>\n</table></body></html>",
      "id": "row_13"
    },
    {
      "cells": [
        {
          "cell": "<td><a href=\"/wiki/Parma_Calcio_1913\" title=\"Parma Calcio 1913\">Parma</a>\n</td>",
          "text": "Parma",
          "id": "row_14_col_0"
        },
        {
          "cell": "<td><a href=\"/wiki/Parma\" title=\"Parma\">Parma</a>\n</td>",
          "text": "Parma",
          "id": "row_14_col_1"
        },
        {
          "cell": "<td><a href=\"/wiki/Stadio_Ennio_Tardini\" title=\"Stadio Ennio Tardini\">Stadio Ennio Tardini</a>\n</td>",
          "text": "Stadio Ennio Tardini",
          "id": "row_14_col_2"
        },
        {
          "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004279060000000000\u2660</span>27,906\n</td>",
          "text": "7004279060000000000\u2660 27,906",
          "id": "row_14_col_3"
        },
        {
          "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_B\" title=\"2017\u201318 Serie B\">2nd in Serie B</a>\n</td>",
          "text": "2nd in Serie B",
          "id": "row_14_col_4"
        }
      ],
      "text": "Parma | Parma | Stadio Ennio Tardini | 7004279060000000000\u2660 27,906 | 2nd in Serie B",
      "html": "<html><body><table><td><a href=\"/wiki/Parma_Calcio_1913\" title=\"Parma Calcio 1913\">Parma</a>\n</td>\n<td><a href=\"/wiki/Parma\" title=\"Parma\">Parma</a>\n</td>\n<td><a href=\"/wiki/Stadio_Ennio_Tardini\" title=\"Stadio Ennio Tardini\">Stadio Ennio Tardini</a>\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004279060000000000\u2660</span>27,906\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_B\" title=\"2017\u201318 Serie B\">2nd in Serie B</a>\n</td>\n</table></body></html>",
      "id": "row_14"
    },
    {
      "cells": [
        {
          "cell": "<td><a href=\"/wiki/A.S._Roma\" title=\"A.S. Roma\">Roma</a>\n</td>",
          "text": "Roma",
          "id": "row_15_col_0"
        },
        {
          "cell": "<td>Rome\n</td>",
          "text": "Rome",
          "id": "row_15_col_1"
        },
        {
          "cell": "<td>Stadio Olimpico\n</td>",
          "text": "Stadio Olimpico",
          "id": "row_15_col_2"
        },
        {
          "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004726980000000000\u2660</span>72,698\n</td>",
          "text": "7004726980000000000\u2660 72,698",
          "id": "row_15_col_3"
        },
        {
          "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">3rd in Serie A</a>\n</td>",
          "text": "3rd in Serie A",
          "id": "row_15_col_4"
        }
      ],
      "text": "Roma | Rome | Stadio Olimpico | 7004726980000000000\u2660 72,698 | 3rd in Serie A",
      "html": "<html><body><table><td><a href=\"/wiki/A.S._Roma\" title=\"A.S. Roma\">Roma</a>\n</td>\n<td>Rome\n</td>\n<td>Stadio Olimpico\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004726980000000000\u2660</span>72,698\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">3rd in Serie A</a>\n</td>\n</table></body></html>",
      "id": "row_15"
    },
    {
      "cells": [
        {
          "cell": "<td><a href=\"/wiki/U.C._Sampdoria\" title=\"U.C. Sampdoria\">Sampdoria</a>\n</td>",
          "text": "Sampdoria",
          "id": "row_16_col_0"
        },
        {
          "cell": "<td><a href=\"/wiki/Genoa\" title=\"Genoa\">Genoa</a>\n</td>",
          "text": "Genoa",
          "id": "row_16_col_1"
        },
        {
          "cell": "<td><a href=\"/wiki/Stadio_Luigi_Ferraris\" title=\"Stadio Luigi Ferraris\">Stadio Luigi Ferraris</a>\n</td>",
          "text": "Stadio Luigi Ferraris",
          "id": "row_16_col_2"
        },
        {
          "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004366850000000000\u2660</span>36,685\n</td>",
          "text": "7004366850000000000\u2660 36,685",
          "id": "row_16_col_3"
        },
        {
          "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">10th in Serie A</a>\n</td>",
          "text": "10th in Serie A",
          "id": "row_16_col_4"
        }
      ],
      "text": "Sampdoria | Genoa | Stadio Luigi Ferraris | 7004366850000000000\u2660 36,685 | 10th in Serie A",
      "html": "<html><body><table><td><a href=\"/wiki/U.C._Sampdoria\" title=\"U.C. Sampdoria\">Sampdoria</a>\n</td>\n<td><a href=\"/wiki/Genoa\" title=\"Genoa\">Genoa</a>\n</td>\n<td><a href=\"/wiki/Stadio_Luigi_Ferraris\" title=\"Stadio Luigi Ferraris\">Stadio Luigi Ferraris</a>\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004366850000000000\u2660</span>36,685\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">10th in Serie A</a>\n</td>\n</table></body></html>",
      "id": "row_16"
    },
    {
      "cells": [
        {
          "cell": "<td><a href=\"/wiki/U.S._Sassuolo_Calcio\" title=\"U.S. Sassuolo Calcio\">Sassuolo</a>\n</td>",
          "text": "Sassuolo",
          "id": "row_17_col_0"
        },
        {
          "cell": "<td><a href=\"/wiki/Sassuolo\" title=\"Sassuolo\">Sassuolo</a>\n</td>",
          "text": "Sassuolo",
          "id": "row_17_col_1"
        },
        {
          "cell": "<td><a href=\"/wiki/Mapei_Stadium_%E2%80%93_Citt%C3%A0_del_Tricolore\" title=\"Mapei Stadium \u2013 Citt\u00e0 del Tricolore\">Mapei Stadium \u2013 Citt\u00e0 del Tricolore</a><br/><small>(<a href=\"/wiki/Reggio_Emilia\" title=\"Reggio Emilia\">Reggio Emilia</a>)</small>\n</td>",
          "text": "Mapei Stadium \u2013 Citt\u00e0 del Tricolore ( Reggio Emilia )",
          "id": "row_17_col_2"
        },
        {
          "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004237170000000000\u2660</span>23,717\n</td>",
          "text": "7004237170000000000\u2660 23,717",
          "id": "row_17_col_3"
        },
        {
          "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">11th in Serie A</a>\n</td>",
          "text": "11th in Serie A",
          "id": "row_17_col_4"
        }
      ],
      "text": "Sassuolo | Sassuolo | Mapei Stadium \u2013 Citt\u00e0 del Tricolore ( Reggio Emilia ) | 7004237170000000000\u2660 23,717 | 11th in Serie A",
      "html": "<html><body><table><td><a href=\"/wiki/U.S._Sassuolo_Calcio\" title=\"U.S. Sassuolo Calcio\">Sassuolo</a>\n</td>\n<td><a href=\"/wiki/Sassuolo\" title=\"Sassuolo\">Sassuolo</a>\n</td>\n<td><a href=\"/wiki/Mapei_Stadium_%E2%80%93_Citt%C3%A0_del_Tricolore\" title=\"Mapei Stadium \u2013 Citt\u00e0 del Tricolore\">Mapei Stadium \u2013 Citt\u00e0 del Tricolore</a><br/><small>(<a href=\"/wiki/Reggio_Emilia\" title=\"Reggio Emilia\">Reggio Emilia</a>)</small>\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004237170000000000\u2660</span>23,717\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">11th in Serie A</a>\n</td>\n</table></body></html>",
      "id": "row_17"
    },
    {
      "cells": [
        {
          "cell": "<td><a class=\"mw-redirect\" href=\"/wiki/S.P.A.L._2013\" title=\"S.P.A.L. 2013\">SPAL</a>\n</td>",
          "text": "SPAL",
          "id": "row_18_col_0"
        },
        {
          "cell": "<td><a href=\"/wiki/Ferrara\" title=\"Ferrara\">Ferrara</a>\n</td>",
          "text": "Ferrara",
          "id": "row_18_col_1"
        },
        {
          "cell": "<td><a href=\"/wiki/Stadio_Paolo_Mazza\" title=\"Stadio Paolo Mazza\">Stadio Paolo Mazza</a>\n</td>",
          "text": "Stadio Paolo Mazza",
          "id": "row_18_col_2"
        },
        {
          "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004130200000000000\u2660</span>13,020\n</td>",
          "text": "7004130200000000000\u2660 13,020",
          "id": "row_18_col_3"
        },
        {
          "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">17th in Serie A</a>\n</td>",
          "text": "17th in Serie A",
          "id": "row_18_col_4"
        }
      ],
      "text": "SPAL | Ferrara | Stadio Paolo Mazza | 7004130200000000000\u2660 13,020 | 17th in Serie A",
      "html": "<html><body><table><td><a class=\"mw-redirect\" href=\"/wiki/S.P.A.L._2013\" title=\"S.P.A.L. 2013\">SPAL</a>\n</td>\n<td><a href=\"/wiki/Ferrara\" title=\"Ferrara\">Ferrara</a>\n</td>\n<td><a href=\"/wiki/Stadio_Paolo_Mazza\" title=\"Stadio Paolo Mazza\">Stadio Paolo Mazza</a>\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004130200000000000\u2660</span>13,020\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">17th in Serie A</a>\n</td>\n</table></body></html>",
      "id": "row_18"
    },
    {
      "cells": [
        {
          "cell": "<td><a href=\"/wiki/Torino_F.C.\" title=\"Torino F.C.\">Torino</a>\n</td>",
          "text": "Torino",
          "id": "row_19_col_0"
        },
        {
          "cell": "<td>Turin\n</td>",
          "text": "Turin",
          "id": "row_19_col_1"
        },
        {
          "cell": "<td><a href=\"/wiki/Stadio_Olimpico_Grande_Torino\" title=\"Stadio Olimpico Grande Torino\">Stadio Olimpico Grande Torino</a>\n</td>",
          "text": "Stadio Olimpico Grande Torino",
          "id": "row_19_col_2"
        },
        {
          "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004279940000000000\u2660</span>27,994\n</td>",
          "text": "7004279940000000000\u2660 27,994",
          "id": "row_19_col_3"
        },
        {
          "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">9th in Serie A</a>\n</td>",
          "text": "9th in Serie A",
          "id": "row_19_col_4"
        }
      ],
      "text": "Torino | Turin | Stadio Olimpico Grande Torino | 7004279940000000000\u2660 27,994 | 9th in Serie A",
      "html": "<html><body><table><td><a href=\"/wiki/Torino_F.C.\" title=\"Torino F.C.\">Torino</a>\n</td>\n<td>Turin\n</td>\n<td><a href=\"/wiki/Stadio_Olimpico_Grande_Torino\" title=\"Stadio Olimpico Grande Torino\">Stadio Olimpico Grande Torino</a>\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004279940000000000\u2660</span>27,994\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">9th in Serie A</a>\n</td>\n</table></body></html>",
      "id": "row_19"
    },
    {
      "cells": [
        {
          "cell": "<td><a href=\"/wiki/Udinese_Calcio\" title=\"Udinese Calcio\">Udinese</a>\n</td>",
          "text": "Udinese",
          "id": "row_20_col_0"
        },
        {
          "cell": "<td><a href=\"/wiki/Udine\" title=\"Udine\">Udine</a>\n</td>",
          "text": "Udine",
          "id": "row_20_col_1"
        },
        {
          "cell": "<td><a href=\"/wiki/Stadio_Friuli\" title=\"Stadio Friuli\">Stadio Friuli-Dacia Arena</a>\n</td>",
          "text": "Stadio Friuli-Dacia Arena",
          "id": "row_20_col_2"
        },
        {
          "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004251320000000000\u2660</span>25,132\n</td>",
          "text": "7004251320000000000\u2660 25,132",
          "id": "row_20_col_3"
        },
        {
          "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">14th in Serie A</a>\n</td>",
          "text": "14th in Serie A",
          "id": "row_20_col_4"
        }
      ],
      "text": "Udinese | Udine | Stadio Friuli-Dacia Arena | 7004251320000000000\u2660 25,132 | 14th in Serie A",
      "html": "<html><body><table><td><a href=\"/wiki/Udinese_Calcio\" title=\"Udinese Calcio\">Udinese</a>\n</td>\n<td><a href=\"/wiki/Udine\" title=\"Udine\">Udine</a>\n</td>\n<td><a href=\"/wiki/Stadio_Friuli\" title=\"Stadio Friuli\">Stadio Friuli-Dacia Arena</a>\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004251320000000000\u2660</span>25,132\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">14th in Serie A</a>\n</td>\n</table></body></html>",
      "id": "row_20"
    }
  ],
  "context_before": ". List of Italian Football Championship clubs clubs. For a complete list of clubs see 2018\u201319 Serie A",
  "context_after": "Team Home city Stadium Capacity",
  "fingerprint": "-018-020-10th-11th-12th-13-132-13th-147-14th-15th-16-16th-17th-18-1909-1913-2013-2017-21-227-23-233-240-25-27-279-27Antonio_Bentegodi-27Ara-27Italia-284-2nd-300-36-38-3rd-402-41-43-4th-507-5th-60-685-698-6th-7004130200000000000-7004162270000000000-7004162330000000000-7004162840000000000-7004213000000000000-7004237170000000000-7004251320000000000-7004279060000000000-7004279940000000000-7004366850000000000-7004382790000000000-7004384020000000000-7004415070000000000-7004431470000000000-7004602400000000000-7004726980000000000-7004800180000000000-717-72-7th-80-8th-906-9318_Serie_A-9318_Serie_B-93_Citt-994-9th-A-A0_del_Tricolore-ACF-ACF_Fiorentina-Antonio-Ara-Arena-Artemio-Atalanta-Atalanta_B-Atleti-Azzurri-B-Benito-Bentegodi-Bergamo-Bologna-Bologna_F-C-C3-Cagliari-Cagliari_Calcio-Calcio-Capacity-Carlo-Castellani-Champions-Chievo-ChievoVerona-Citt\u00e0-Dacia-Dall-E2-Emilia-Empoli-Empoli_F-Ennio-F-Ferrara-Ferraris-Fiorentina-Florence-Franchi-Friuli-Frosinone-Frosinone_Calcio-Genoa-Genoa_C-Grande-Home-Inter-Inter_Milan-Internazionale-Italia-Juventus-Juventus_F-Juventus_Stadium-L-Lazio-Luigi-Mapei-Mapei_Stadium_-Marc-Mazza-Milan-Naples-Napoli-Olimpico-P-Paolo-Parma-Parma_Calcio_1913-Playoff-Reggio-Reggio_Emilia-Renato-Roma-Rome-S-SPAL-Sampdoria-San-San_Siro-Sardegna-Sardegna_Arena-Sassuolo-Serie-Siro-Stadio-Stadio_Artemio_Franchi-Stadio_Atleti_Azzurri_d-Stadio_Benito_Stirpe-Stadio_Carlo_Castellani-Stadio_Ennio_Tardini-Stadio_Friuli-Stadio_Luigi_Ferraris-Stadio_Marc-Stadio_Olimpico-Stadio_Olimpico_Grande_Torino-Stadio_Paolo_Mazza-Stadio_Renato_Dall-Stadio_San_Paolo-Stadium-Stirpe-Tardini-Team-Torino-Torino_F-Tricolore-Turin-U-Udine-Udinese-Udinese_Calcio-Verona-_1909-_2013-_ChievoVerona-_Lazio-_Milan-_Napoli-_Roma-_Sampdoria-_Sassuolo_Calcio-a-align-br-center-city-class-d-del-display-href-in-mw-none-redirect-season-small-sortkey-span-style-table-td-text-th-title-tr-wiki-winner",
  "html": "<table><tr><th>Team\n</th><th>Home city\n</th><th>Stadium\n</th><th>Capacity\n</th><th>2017\u201318 season\n</th></tr><tr><td><a href=\"/wiki/Atalanta_B.C.\" title=\"Atalanta B.C.\">Atalanta</a>\n</td><td><a href=\"/wiki/Bergamo\" title=\"Bergamo\">Bergamo</a>\n</td><td><a href=\"/wiki/Stadio_Atleti_Azzurri_d%27Italia\" title=\"Stadio Atleti Azzurri d'Italia\">Stadio Atleti Azzurri d'Italia</a>\n</td><td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004213000000000000\u2660</span>21,300\n</td><td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">7th in Serie A</a>\n</td></tr><tr><td><a href=\"/wiki/Bologna_F.C._1909\" title=\"Bologna F.C. 1909\">Bologna</a>\n</td><td><a href=\"/wiki/Bologna\" title=\"Bologna\">Bologna</a>\n</td><td><a href=\"/wiki/Stadio_Renato_Dall%27Ara\" title=\"Stadio Renato Dall'Ara\">Stadio Renato Dall'Ara</a>\n</td><td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004382790000000000\u2660</span>38,279\n</td><td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">15th in Serie A</a>\n</td></tr><tr><td><a href=\"/wiki/Cagliari_Calcio\" title=\"Cagliari Calcio\">Cagliari</a>\n</td><td><a href=\"/wiki/Cagliari\" title=\"Cagliari\">Cagliari</a>\n</td><td><a href=\"/wiki/Sardegna_Arena\" title=\"Sardegna Arena\">Sardegna Arena</a>\n</td><td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004162330000000000\u2660</span>16,233\n</td><td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">16th in Serie A</a>\n</td></tr><tr><td><a href=\"/wiki/A.C._ChievoVerona\" title=\"A.C. 
ChievoVerona\">Chievo</a>\n</td><td><a href=\"/wiki/Verona\" title=\"Verona\">Verona</a>\n</td><td><a href=\"/wiki/Stadio_Marc%27Antonio_Bentegodi\" title=\"Stadio Marc'Antonio Bentegodi\">Stadio Marc'Antonio Bentegodi</a>\n</td><td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004384020000000000\u2660</span>38,402\n</td><td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">13th in Serie A</a>\n</td></tr><tr><td><a href=\"/wiki/Empoli_F.C.\" title=\"Empoli F.C.\">Empoli</a>\n</td><td><a href=\"/wiki/Empoli\" title=\"Empoli\">Empoli</a>\n</td><td><a href=\"/wiki/Stadio_Carlo_Castellani\" title=\"Stadio Carlo Castellani\">Stadio Carlo Castellani</a>\n</td><td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004162840000000000\u2660</span>16,284\n</td><td><a href=\"/wiki/2017%E2%80%9318_Serie_B\" title=\"2017\u201318 Serie B\">Serie B Champions</a>\n</td></tr><tr><td><a href=\"/wiki/ACF_Fiorentina\" title=\"ACF Fiorentina\">Fiorentina</a>\n</td><td><a href=\"/wiki/Florence\" title=\"Florence\">Florence</a>\n</td><td><a href=\"/wiki/Stadio_Artemio_Franchi\" title=\"Stadio Artemio Franchi\">Stadio Artemio Franchi</a>\n</td><td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004431470000000000\u2660</span>43,147\n</td><td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">8th in Serie A</a>\n</td></tr><tr><td><a href=\"/wiki/Frosinone_Calcio\" title=\"Frosinone Calcio\">Frosinone</a>\n</td><td><a href=\"/wiki/Frosinone\" title=\"Frosinone\">Frosinone</a>\n</td><td><a href=\"/wiki/Stadio_Benito_Stirpe\" title=\"Stadio Benito Stirpe\">Stadio Benito Stirpe</a>\n</td><td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004162270000000000\u2660</span>16,227\n</td><td><a href=\"/wiki/2017%E2%80%9318_Serie_B\" title=\"2017\u201318 Serie B\">Serie B Playoff winner</a>\n</td></tr><tr><td><a href=\"/wiki/Genoa_C.F.C.\" 
title=\"Genoa C.F.C.\">Genoa</a>\n</td><td><a href=\"/wiki/Genoa\" title=\"Genoa\">Genoa</a>\n</td><td><a href=\"/wiki/Stadio_Luigi_Ferraris\" title=\"Stadio Luigi Ferraris\">Stadio Luigi Ferraris</a>\n</td><td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004366850000000000\u2660</span>36,685\n</td><td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">12th in Serie A</a>\n</td></tr><tr><td><a href=\"/wiki/Inter_Milan\" title=\"Inter Milan\">Internazionale</a>\n</td><td><a href=\"/wiki/Milan\" title=\"Milan\">Milan</a>\n</td><td><a href=\"/wiki/San_Siro\" title=\"San Siro\">San Siro</a>\n</td><td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004800180000000000\u2660</span>80,018\n</td><td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">4th in Serie A</a>\n</td></tr><tr><td><a href=\"/wiki/Juventus_F.C.\" title=\"Juventus F.C.\">Juventus</a>\n</td><td><a href=\"/wiki/Turin\" title=\"Turin\">Turin</a>\n</td><td><a href=\"/wiki/Juventus_Stadium\" title=\"Juventus Stadium\">Juventus Stadium</a>\n</td><td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004415070000000000\u2660</span>41,507\n</td><td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">Serie A Champions</a>\n</td></tr><tr><td><a href=\"/wiki/S.S._Lazio\" title=\"S.S. Lazio\">Lazio</a>\n</td><td><a href=\"/wiki/Rome\" title=\"Rome\">Rome</a>\n</td><td><a href=\"/wiki/Stadio_Olimpico\" title=\"Stadio Olimpico\">Stadio Olimpico</a>\n</td><td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004726980000000000\u2660</span>72,698\n</td><td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">5th in Serie A</a>\n</td></tr><tr><td><a href=\"/wiki/A.C._Milan\" title=\"A.C. 
Milan\">Milan</a>\n</td><td>Milan\n</td><td>San Siro\n</td><td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004800180000000000\u2660</span>80,018\n</td><td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">6th in Serie A</a>\n</td></tr><tr><td><a href=\"/wiki/S.S.C._Napoli\" title=\"S.S.C. Napoli\">Napoli</a>\n</td><td><a href=\"/wiki/Naples\" title=\"Naples\">Naples</a>\n</td><td><a href=\"/wiki/Stadio_San_Paolo\" title=\"Stadio San Paolo\">Stadio San Paolo</a>\n</td><td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004602400000000000\u2660</span>60,240\n</td><td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">2nd in Serie A</a>\n</td></tr><tr><td><a href=\"/wiki/Parma_Calcio_1913\" title=\"Parma Calcio 1913\">Parma</a>\n</td><td><a href=\"/wiki/Parma\" title=\"Parma\">Parma</a>\n</td><td><a href=\"/wiki/Stadio_Ennio_Tardini\" title=\"Stadio Ennio Tardini\">Stadio Ennio Tardini</a>\n</td><td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004279060000000000\u2660</span>27,906\n</td><td><a href=\"/wiki/2017%E2%80%9318_Serie_B\" title=\"2017\u201318 Serie B\">2nd in Serie B</a>\n</td></tr><tr><td><a href=\"/wiki/A.S._Roma\" title=\"A.S. Roma\">Roma</a>\n</td><td>Rome\n</td><td>Stadio Olimpico\n</td><td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004726980000000000\u2660</span>72,698\n</td><td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">3rd in Serie A</a>\n</td></tr><tr><td><a href=\"/wiki/U.C._Sampdoria\" title=\"U.C. 
Sampdoria\">Sampdoria</a>\n</td><td><a href=\"/wiki/Genoa\" title=\"Genoa\">Genoa</a>\n</td><td><a href=\"/wiki/Stadio_Luigi_Ferraris\" title=\"Stadio Luigi Ferraris\">Stadio Luigi Ferraris</a>\n</td><td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004366850000000000\u2660</span>36,685\n</td><td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">10th in Serie A</a>\n</td></tr><tr><td><a href=\"/wiki/U.S._Sassuolo_Calcio\" title=\"U.S. Sassuolo Calcio\">Sassuolo</a>\n</td><td><a href=\"/wiki/Sassuolo\" title=\"Sassuolo\">Sassuolo</a>\n</td><td><a href=\"/wiki/Mapei_Stadium_%E2%80%93_Citt%C3%A0_del_Tricolore\" title=\"Mapei Stadium \u2013 Citt\u00e0 del Tricolore\">Mapei Stadium \u2013 Citt\u00e0 del Tricolore</a><br/><small>(<a href=\"/wiki/Reggio_Emilia\" title=\"Reggio Emilia\">Reggio Emilia</a>)</small>\n</td><td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004237170000000000\u2660</span>23,717\n</td><td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">11th in Serie A</a>\n</td></tr><tr><td><a class=\"mw-redirect\" href=\"/wiki/S.P.A.L._2013\" title=\"S.P.A.L. 
2013\">SPAL</a>\n</td><td><a href=\"/wiki/Ferrara\" title=\"Ferrara\">Ferrara</a>\n</td><td><a href=\"/wiki/Stadio_Paolo_Mazza\" title=\"Stadio Paolo Mazza\">Stadio Paolo Mazza</a>\n</td><td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004130200000000000\u2660</span>13,020\n</td><td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">17th in Serie A</a>\n</td></tr><tr><td><a href=\"/wiki/Torino_F.C.\" title=\"Torino F.C.\">Torino</a>\n</td><td>Turin\n</td><td><a href=\"/wiki/Stadio_Olimpico_Grande_Torino\" title=\"Stadio Olimpico Grande Torino\">Stadio Olimpico Grande Torino</a>\n</td><td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004279940000000000\u2660</span>27,994\n</td><td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">9th in Serie A</a>\n</td></tr><tr><td><a href=\"/wiki/Udinese_Calcio\" title=\"Udinese Calcio\">Udinese</a>\n</td><td><a href=\"/wiki/Udine\" title=\"Udine\">Udine</a>\n</td><td><a href=\"/wiki/Stadio_Friuli\" title=\"Stadio Friuli\">Stadio Friuli-Dacia Arena</a>\n</td><td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004251320000000000\u2660</span>25,132\n</td><td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">14th in Serie A</a>\n</td></tr></table>",
  "text": "Team | Home city | Stadium | Capacity | 2017\u201318 season\nAtalanta | Bergamo | Stadio Atleti Azzurri d'Italia | 7004213000000000000\u2660 21,300 | 7th in Serie A\nBologna | Bologna | Stadio Renato Dall'Ara | 7004382790000000000\u2660 38,279 | 15th in Serie A\nCagliari | Cagliari | Sardegna Arena | 7004162330000000000\u2660 16,233 | 16th in Serie A\nChievo | Verona | Stadio Marc'Antonio Bentegodi | 7004384020000000000\u2660 38,402 | 13th in Serie A\nEmpoli | Empoli | Stadio Carlo Castellani | 7004162840000000000\u2660 16,284 | Serie B Champions\nFiorentina | Florence | Stadio Artemio Franchi | 7004431470000000000\u2660 43,147 | 8th in Serie A\nFrosinone | Frosinone | Stadio Benito Stirpe | 7004162270000000000\u2660 16,227 | Serie B Playoff winner\nGenoa | Genoa | Stadio Luigi Ferraris | 7004366850000000000\u2660 36,685 | 12th in Serie A\nInternazionale | Milan | San Siro | 7004800180000000000\u2660 80,018 | 4th in Serie A\nJuventus | Turin | Juventus Stadium | 7004415070000000000\u2660 41,507 | Serie A Champions\nLazio | Rome | Stadio Olimpico | 7004726980000000000\u2660 72,698 | 5th in Serie A\nMilan | Milan | San Siro | 7004800180000000000\u2660 80,018 | 6th in Serie A\nNapoli | Naples | Stadio San Paolo | 7004602400000000000\u2660 60,240 | 2nd in Serie A\nParma | Parma | Stadio Ennio Tardini | 7004279060000000000\u2660 27,906 | 2nd in Serie B\nRoma | Rome | Stadio Olimpico | 7004726980000000000\u2660 72,698 | 3rd in Serie A\nSampdoria | Genoa | Stadio Luigi Ferraris | 7004366850000000000\u2660 36,685 | 10th in Serie A\nSassuolo | Sassuolo | Mapei Stadium \u2013 Citt\u00e0 del Tricolore ( Reggio Emilia ) | 7004237170000000000\u2660 23,717 | 11th in Serie A\nSPAL | Ferrara | Stadio Paolo Mazza | 7004130200000000000\u2660 13,020 | 17th in Serie A\nTorino | Turin | Stadio Olimpico Grande Torino | 7004279940000000000\u2660 27,994 | 9th in Serie A\nUdinese | Udine | Stadio Friuli-Dacia Arena | 7004251320000000000\u2660 25,132 | 14th in Serie A\n"
}

Next, we use JSON paths to pull the individual fields out of each row of the extracted tables.

Aside: ETK uses JSON paths to access data in JSON documents. Take a look at the excellent and short introduction to JSON paths: http://goessner.net/articles/JsonPath/
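
To make the path used in the next cell concrete, here is a plain-Python reading of `$.cells[0:4].text`, applied to a trimmed copy of the Roma row shown above. This is only a sketch of what the path selects. It does not use ETK or a JSON path library, and the helper name `select_cell_texts` is made up for illustration.

```python
# A trimmed copy of one row as produced by the table extractor (see the output above).
row = {
    "cells": [
        {"text": "Roma", "id": "row_15_col_0"},
        {"text": "Rome", "id": "row_15_col_1"},
        {"text": "Stadio Olimpico", "id": "row_15_col_2"},
        {"text": "7004726980000000000\u2660 72,698", "id": "row_15_col_3"},
        {"text": "3rd in Serie A", "id": "row_15_col_4"},
    ],
    "id": "row_15",
}


def select_cell_texts(row, start=0, stop=4):
    """Plain-Python equivalent of the JSON path $.cells[start:stop].text."""
    return [cell["text"] for cell in row["cells"][start:stop]]


texts = select_cell_texts(row)
print(texts)
```

The slice stops before index 4, so the season column is left out and only team, home city, stadium and capacity are returned.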


In [22]:
all_json_path = '$.cells[0:4].text'
docs = list()
for table in tables_in_page:

    # skipping the first row, the heading
    for row in table.value['rows'][1:]:
        doc = etk.create_document(row)
        row_values = doc.select_segments(all_json_path)

        # add the information we extracted in the knowledge graph of the doc.
        doc.kg.add_value('team', value=row_values[0].value)
        doc.kg.add_value('city_name', value=row_values[1].value)
        doc.kg.add_value('stadium', value=row_values[2].value)
        capacity_split = re.split(' |,', row_values[3].value)
        if capacity_split[-1] != '':
            capacity = int(capacity_split[-2] + capacity_split[-1]) if len(capacity_split) > 1 else int(
                capacity_split[-1])
            doc.kg.add_value('capacity', value=capacity)
        docs.append(doc)

print('Number of rows extracted from that page', len(docs), '\n')
print('Sample rows(5):')
for doc in docs[:5]:
    print(doc.kg.value, '\n')


Number of rows extracted from that page 258 

Sample rows(5):
{'team': [{'value': 'Atalanta', 'key': 'atalanta'}], 'city_name': [{'value': 'Bergamo', 'key': 'bergamo'}], 'stadium': [{'value': "Stadio Atleti Azzurri d'Italia", 'key': "stadio atleti azzurri d'italia"}], 'capacity': [{'value': 21300, 'key': '21300'}]} 

{'team': [{'value': 'Bologna', 'key': 'bologna'}], 'city_name': [{'value': 'Bologna', 'key': 'bologna'}], 'stadium': [{'value': "Stadio Renato Dall'Ara", 'key': "stadio renato dall'ara"}], 'capacity': [{'value': 38279, 'key': '38279'}]} 

{'team': [{'value': 'Cagliari', 'key': 'cagliari'}], 'city_name': [{'value': 'Cagliari', 'key': 'cagliari'}], 'stadium': [{'value': 'Sardegna Arena', 'key': 'sardegna arena'}], 'capacity': [{'value': 16233, 'key': '16233'}]} 

{'team': [{'value': 'Chievo', 'key': 'chievo'}], 'city_name': [{'value': 'Verona', 'key': 'verona'}], 'stadium': [{'value': "Stadio Marc'Antonio Bentegodi", 'key': "stadio marc'antonio bentegodi"}], 'capacity': [{'value': 38402, 'key': '38402'}]} 

{'team': [{'value': 'Empoli', 'key': 'empoli'}], 'city_name': [{'value': 'Empoli', 'key': 'empoli'}], 'stadium': [{'value': 'Stadio Carlo Castellani', 'key': 'stadio carlo castellani'}], 'capacity': [{'value': 16284, 'key': '16284'}]} 
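
The capacity cell carries Wikipedia's hidden sort key (for example `7004213000000000000♠`) in front of the visible number, which is why the cell above splits the text before converting it. As a hedged alternative, the sketch below takes the last whitespace-separated token and strips the thousands separators. `parse_capacity` is a hypothetical helper, not part of ETK. Unlike a split on both space and comma, it should also cope with values below 1,000, which contain no comma.

```python
import re


def parse_capacity(cell_text):
    """Return the visible capacity as an int, or None if the cell has no number."""
    tokens = cell_text.split()
    if not tokens:
        return None
    # The visible number is the last token; the sort key, if any, comes before it.
    digits = re.sub(r"[^0-9]", "", tokens[-1])
    return int(digits) if digits else None


print(parse_capacity("7004213000000000000\u2660 21,300"))  # 21300
print(parse_capacity("7002500000000000000\u2660 500"))     # 500 (made-up cell without a comma)
print(parse_capacity(""))                                  # None
```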

The extracted tables are now stored in your JSON document.

Next, we load a dict that maps each city name to all geonames records for cities with that name and a population greater than 25,000.


In [23]:
file_name = './resources/cities_ppl_25000.json'
# the context manager closes the file even if parsing fails
with open(file_name, 'r') as file:
    city_dataset = json.load(file)
city_list = list(city_dataset.keys())
print('There are', len(city_list), 'cities with population greater than or equal to 25,000.\n')
print('City list samples(20):\n')
print(city_list[:20])


There are 15117 cities with population greater than or equal to 25,000.

City list samples(20):

['Marion', 'Fes', 'Fes al Bali', 'Gravina in Puglia', 'Nawada', 'Pensacola', 'Pedro Betancourt', 'Uriangato', 'Fiditi', 'Wilkes-Barre', 'Kafue', 'Chipata', 'Sawangan', 'Tuxpan de Rodriguez Cano', 'Rosny-sous-Bois', 'Caete', 'Kafr ad Dawwar', 'Reynoldsburg', 'Simferopol', 'Ouargla']

Identifying the city names in geonames and linking to geonames

There are many ways to do this step. We will do it using the ETK glossary extractor to illustrate how to use other extractors and how to chain the results of one extractor as input to other extractors.

Using data from the geonames.org web site, we prepared a list of all cities in the world with population greater than 25,000. We use this small glossary to make the code run faster, but you may want to try it with the full list of cities.

First, we need to load the glossary in ETK. We use the default tokenizer to tokenize the strings. The code below sets ngrams to 3, so that city names of up to three tokens can be matched; setting ngrams to zero would instead let the extractor choose the n-gram length automatically.


In [14]:
my_glossary_extractor = GlossaryExtractor(glossary=city_list, extractor_name='tutorial_glossary',
                                          tokenizer=etk.default_tokenizer, ngrams=3,
                                          case_sensitive=False)
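
To see why the n-gram setting matters, consider a simplified stand-in for glossary matching written with the standard library only. It is not ETK's implementation (`match_glossary` and its behaviour are assumptions made for illustration): it lowercases whitespace tokens, builds every n-gram of up to `max_ngrams` tokens, and keeps those found in the glossary.

```python
def match_glossary(text, glossary, max_ngrams=3):
    """Return glossary entries found in text, trying spans of 1..max_ngrams tokens."""
    lookup = {entry.lower(): entry for entry in glossary}
    tokens = text.lower().split()
    matches = []
    for n in range(1, max_ngrams + 1):
        for start in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[start:start + n])
            if candidate in lookup:
                matches.append(lookup[candidate])
    return matches


# Names taken from the city list sample printed above.
glossary = ["Marion", "Gravina in Puglia", "Rosny-sous-Bois"]

print(match_glossary("gravina in puglia", glossary))
print(match_glossary("Gravina in Puglia", glossary, max_ngrams=2))
```

With a limit of two tokens the three-token name is missed, which is the kind of case a limit of 3 covers.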

Now we are going to use the glossary to extract from the Home city column all the strings that match names in geonames. This method will allow us to extract the geonames city name from cells that may contain extraneous information.

To run the glossary extractor over all cells containing Home city we use a JSON path that selects these cells across all tables. Our list of extractions has the names of cities that we know appear in geonames. Often, different cities in the world have the same name (e.g., Paris, France and Paris, Texas). To get the latitude and longitude, we need to identify the correct city. We know all the cities are in Italy, so we can easily filter.
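
The filter in the next cell, `$."<city>"[?(@.country == "Italy")]`, reads as: look up the list of records stored under the city name and keep those whose `country` is Italy. A plain-Python sketch of the same idea follows. It assumes the dataset maps each city name to a list of records with the fields that appear in this tutorial's final result sample (`population`, `state`, `country`, `latitude`, `longitude`). The Bergamo record is copied from that sample, while the second record is invented to show a name clash, and `records_in_country` is a hypothetical helper.

```python
city_dataset = {
    "Bergamo": [
        # Record as it appears in the tutorial's final result sample.
        {"population": 114162, "state": "Lombardia", "country": "Italy",
         "latitude": "45.69601", "longitude": "9.66721"},
        # Invented record, present only to illustrate two cities sharing a name.
        {"population": 30000, "state": "Example State", "country": "Exampleland",
         "latitude": "0.0", "longitude": "0.0"},
    ]
}


def records_in_country(dataset, city_name, country):
    """Return the records stored under city_name whose country matches."""
    return [rec for rec in dataset.get(city_name, []) if rec.get("country") == country]


matches = records_in_country(city_dataset, "Bergamo", "Italy")
print(matches[0]["latitude"], matches[0]["longitude"])
```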


In [15]:
hit_count = 0
for doc in docs:
    city_json_path = '$.cells[1].text'
    row_values = doc.select_segments(city_json_path)

    # use the city field of the doc, run the GlossaryExtractor
    extractions = doc.extract(my_glossary_extractor, row_values[0])
    if extractions:
        path = '$."' + extractions[0].value + '"[?(@.country == "Italy")]'
        jsonpath_expr = jex.parse(path)
        city_match = jsonpath_expr.find(city_dataset)
        if city_match:
            hit_count += 1

            # add corresponding values of city_dataset into knowledge graph of the doc
            for field in city_match[0].value:
                doc.kg.add_value(field, value=city_match[0].value[field])
print('There are', hit_count, 'hits for city_list.\n')
print('Final result sample:\n')
print(json.dumps(docs[0].kg.value, indent=2))


There are 138 hits for city_list.

Final result sample:

{
  "team": [
    {
      "value": "Atalanta",
      "key": "atalanta"
    }
  ],
  "city_name": [
    {
      "value": "Bergamo",
      "key": "bergamo"
    }
  ],
  "stadium": [
    {
      "value": "Stadio Atleti Azzurri d'Italia",
      "key": "stadio atleti azzurri d'italia"
    }
  ],
  "capacity": [
    {
      "value": 21300,
      "key": "21300"
    }
  ],
  "population": [
    {
      "value": 114162,
      "key": "114162"
    }
  ],
  "state": [
    {
      "value": "Lombardia",
      "key": "lombardia"
    }
  ],
  "country": [
    {
      "value": "Italy",
      "key": "italy"
    }
  ],
  "latitude": [
    {
      "value": "45.69601",
      "key": "45.69601"
    }
  ],
  "longitude": [
    {
      "value": "9.66721",
      "key": "9.66721"
    }
  ]
}
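
Only 138 of the 258 rows were linked, so it can be useful to see which city names found no Italian record in the glossary. The sketch below works on plain dicts shaped like `doc.kg.value` above. `unmatched_cities` is a hypothetical helper, and the second sample record is made up to represent a row without a match.

```python
def unmatched_cities(kg_values):
    """Return city names from records that carry no 'latitude' field."""
    missing = []
    for kg in kg_values:
        if "latitude" not in kg:
            missing.extend(item["value"] for item in kg.get("city_name", []))
    return missing


sample = [
    {"city_name": [{"value": "Bergamo", "key": "bergamo"}],
     "latitude": [{"value": "45.69601", "key": "45.69601"}]},
    # Made-up record standing in for a row whose city was not linked.
    {"city_name": [{"value": "Exampleville", "key": "exampleville"}]},
]

print(unmatched_cities(sample))
```

On the tutorial data this could be called as `unmatched_cities([doc.kg.value for doc in docs])`.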

Part 2: ETK Module


In [18]:
import os
import sys
import json
import requests
import jsonpath_ng.ext as jex
import re
from etk.etk import ETK
from etk.document import Document
from etk.etk_module import ETKModule
from etk.knowledge_graph_schema import KGSchema
from etk.utilities import Utility
from etk.extractors.table_extractor import TableExtractor
from etk.extractors.glossary_extractor import GlossaryExtractor


class ItalyTeamsModule(ETKModule):
    def __init__(self, etk):
        ETKModule.__init__(self, etk)
        self.my_table_extractor = TableExtractor()

        file_name = './resources/cities_ppl_25000.json'
        # the context manager closes the file even if parsing fails
        with open(file_name, 'r') as file:
            self.city_dataset = json.load(file)
        self.city_list = list(self.city_dataset.keys())

        self.my_glossary_extractor = GlossaryExtractor(glossary=self.city_list, extractor_name='tutorial_glossary',
                                                       tokenizer=etk.default_tokenizer, ngrams=3,
                                                       case_sensitive=False)

    def process_document(self, cdr_doc: Document):
        new_docs = list()
        doc_json = cdr_doc.cdr_document

        if 'raw_content' in doc_json and doc_json['raw_content'].strip() != '':
            tables_in_page = self.my_table_extractor.extract(
                doc_json['raw_content'])[:14]
            for table in tables_in_page:

                # skipping the first row, the heading
                for row in table.value['rows'][1:]:
                    # use the ETK instance held by the module, not the global `etk`
                    doc = self.etk.create_document(row)
                    all_json_path = '$.cells[0:4].text'
                    row_values = doc.select_segments(all_json_path)
                    # add the information we extracted in the knowledge graph of the doc.
                    doc.kg.add_value('team', value=row_values[0].value)
                    doc.kg.add_value('city_name', value=row_values[1].value)
                    doc.kg.add_value('stadium', value=row_values[2].value)
                    capacity_split = re.split(' |,', row_values[3].value)
                    if capacity_split[-1] != '':
                        capacity = int(capacity_split[-2] + capacity_split[-1]) if len(capacity_split) > 1 else int(
                            capacity_split[-1])
                        doc.kg.add_value('capacity', value=capacity)

                    city_json_path = '$.cells[1].text'
                    row_values = doc.select_segments(city_json_path)

                    # use the city field of the doc, run the GlossaryExtractor
                    extractions = doc.extract(
                        self.my_glossary_extractor, row_values[0])
                    if extractions:
                        path = '$."' + \
                            extractions[0].value + '"[?(@.country == "Italy")]'
                        jsonpath_expr = jex.parse(path)
                        city_match = jsonpath_expr.find(self.city_dataset)
                        if city_match:
                            # add corresponding values of city_dataset into knowledge graph of the doc
                            for field in city_match[0].value:
                                doc.kg.add_value(
                                    field, value=city_match[0].value[field])
                    new_docs.append(doc)
        return new_docs

    def document_selector(self, doc) -> bool:
        return doc.cdr_document.get("dataset") == "italy_team"


if __name__ == "__main__":
    url = 'https://en.wikipedia.org/wiki/List_of_football_clubs_in_Italy'

    html_page = open('./resources/italy_teams.html', mode='r', encoding='utf-8').read()

    cdr = {
        'raw_content': html_page,
        'url': url,
        'dataset': 'italy_team'
    }
    kg_schema = KGSchema(json.load(open('./resources/master_config.json')))
    etk = ETK(modules=ItalyTeamsModule, kg_schema=kg_schema)
    etk.parser = jex.parse
    cdr_doc = Document(etk, cdr_document=cdr, mime_type='json', url=cdr['url'])
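    # process_ems returns the input page document first, followed by the documents
    # created for each table row; keep only the per-row documents
    results = etk.process_ems(cdr_doc)[1:]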
    print('Total docs:', len(results))
    print("Sample result:\n")
    print(json.dumps(results[0].value, indent=2))


Total docs: 258
Sample result:

{
  "cells": [
    {
      "cell": "<td><a href=\"/wiki/Atalanta_B.C.\" title=\"Atalanta B.C.\">Atalanta</a>\n</td>",
      "text": "Atalanta",
      "id": "row_1_col_0"
    },
    {
      "cell": "<td><a href=\"/wiki/Bergamo\" title=\"Bergamo\">Bergamo</a>\n</td>",
      "text": "Bergamo",
      "id": "row_1_col_1"
    },
    {
      "cell": "<td><a href=\"/wiki/Stadio_Atleti_Azzurri_d%27Italia\" title=\"Stadio Atleti Azzurri d'Italia\">Stadio Atleti Azzurri d'Italia</a>\n</td>",
      "text": "Stadio Atleti Azzurri d'Italia",
      "id": "row_1_col_2"
    },
    {
      "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004213000000000000\u2660</span>21,300\n</td>",
      "text": "7004213000000000000\u2660 21,300",
      "id": "row_1_col_3"
    },
    {
      "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">7th in Serie A</a>\n</td>",
      "text": "7th in Serie A",
      "id": "row_1_col_4"
    }
  ],
  "text": "Atalanta | Bergamo | Stadio Atleti Azzurri d'Italia | 7004213000000000000\u2660 21,300 | 7th in Serie A",
  "html": "<html><body><table><td><a href=\"/wiki/Atalanta_B.C.\" title=\"Atalanta B.C.\">Atalanta</a>\n</td>\n<td><a href=\"/wiki/Bergamo\" title=\"Bergamo\">Bergamo</a>\n</td>\n<td><a href=\"/wiki/Stadio_Atleti_Azzurri_d%27Italia\" title=\"Stadio Atleti Azzurri d'Italia\">Stadio Atleti Azzurri d'Italia</a>\n</td>\n<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004213000000000000\u2660</span>21,300\n</td>\n<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">7th in Serie A</a>\n</td>\n</table></body></html>",
  "id": "row_1",
  "provenances": [
    {
      "@id": 0,
      "@type": "kg_provenance_record",
      "reference_type": "constant",
      "value": "Atalanta"
    },
    {
      "@id": 1,
      "@type": "kg_provenance_record",
      "reference_type": "constant",
      "value": "Bergamo"
    },
    {
      "@id": 2,
      "@type": "kg_provenance_record",
      "reference_type": "constant",
      "value": "Stadio Atleti Azzurri d'Italia"
    },
    {
      "@id": 3,
      "@type": "kg_provenance_record",
      "reference_type": "constant",
      "value": "21300"
    },
    {
      "@id": 4,
      "@type": "extraction_provenance_record",
      "method": "tutorial_glossary",
      "confidence": 1.0,
      "origin_record": {
        "path": "cells.[1].text",
        "start_char": 0,
        "end_char": 7
      }
    },
    {
      "@id": 5,
      "@type": "kg_provenance_record",
      "reference_type": "constant",
      "value": "114162"
    },
    {
      "@id": 6,
      "@type": "kg_provenance_record",
      "reference_type": "constant",
      "value": "Lombardia"
    },
    {
      "@id": 7,
      "@type": "kg_provenance_record",
      "reference_type": "constant",
      "value": "Italy"
    },
    {
      "@id": 8,
      "@type": "kg_provenance_record",
      "reference_type": "constant",
      "value": "45.69601"
    },
    {
      "@id": 9,
      "@type": "kg_provenance_record",
      "reference_type": "constant",
      "value": "9.66721"
    }
  ],
  "knowledge_graph": {
    "team": [
      {
        "value": "Atalanta",
        "key": "atalanta"
      }
    ],
    "city_name": [
      {
        "value": "Bergamo",
        "key": "bergamo"
      }
    ],
    "stadium": [
      {
        "value": "Stadio Atleti Azzurri d'Italia",
        "key": "stadio atleti azzurri d'italia"
      }
    ],
    "capacity": [
      {
        "value": 21300,
        "key": "21300"
      }
    ],
    "population": [
      {
        "value": 114162,
        "key": "114162"
      }
    ],
    "state": [
      {
        "value": "Lombardia",
        "key": "lombardia"
      }
    ],
    "country": [
      {
        "value": "Italy",
        "key": "italy"
      }
    ],
    "latitude": [
      {
        "value": "45.69601",
        "key": "45.69601"
      }
    ],
    "longitude": [
      {
        "value": "9.66721",
        "key": "9.66721"
      }
    ]
  },
  "doc_id": "cd0741601f62119dbd8869375ea6758acb1921b9c5cf8dbf7d09e05360c21b50"
}
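
The goal stated at the start of this tutorial was a list of records with team name, home city, latitude and longitude. The sample result above stores each field as a list of value/key entries under "knowledge_graph", so one more step is needed to flatten it. The following is a minimal sketch of that step, not part of ETK: the helper names first_value and to_record are hypothetical, and the sample dictionary is a trimmed copy of the output shown above.

```python
from typing import Any, Dict, Optional


def first_value(kg: Dict[str, Any], field: str) -> Optional[Any]:
    """Return the first 'value' stored under a knowledge graph field, or None if absent."""
    entries = kg.get(field) or []
    return entries[0]["value"] if entries else None


def to_record(result: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one result dictionary into a team/city/latitude/longitude record."""
    kg = result.get("knowledge_graph", {})
    latitude = first_value(kg, "latitude")
    longitude = first_value(kg, "longitude")
    return {
        "team": first_value(kg, "team"),
        "city": first_value(kg, "city_name"),
        # coordinates are stored as strings in the sample output, so convert them
        "latitude": float(latitude) if latitude is not None else None,
        "longitude": float(longitude) if longitude is not None else None,
    }


# trimmed copy of the sample result printed above
sample = {
    "knowledge_graph": {
        "team": [{"value": "Atalanta", "key": "atalanta"}],
        "city_name": [{"value": "Bergamo", "key": "bergamo"}],
        "latitude": [{"value": "45.69601", "key": "45.69601"}],
        "longitude": [{"value": "9.66721", "key": "9.66721"}],
    }
}

record = to_record(sample)
print(record)
```

With the results list from the cell above, the same helper could be applied as [to_record(doc.value) for doc in results]. Rows whose city was not matched in the glossary would get None for latitude and longitude.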

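Note that the text of the capacity cell in the sample output is "7004213000000000000\u2660 21,300": the table extractor keeps the text of Wikipedia's hidden sortkey span in front of the visible number. The sketch below shows one way to recover the integer capacity from such a string. The helper name parse_capacity is hypothetical, and the regular expression assumes the visible number is the last thing in the cell.

```python
import re
from typing import Optional


def parse_capacity(cell_text: str) -> Optional[int]:
    """Return the trailing number in a capacity cell as an int, or None if there is none."""
    # match a run of digits and commas at the very end of the text,
    # which skips the hidden sort key that precedes the visible number
    match = re.search(r"(\d[\d,]*)\s*$", cell_text)
    if match is None:
        return None
    # drop the thousands separators before converting
    return int(match.group(1).replace(",", ""))


print(parse_capacity("7004213000000000000\u2660 21,300"))  # 21300
print(parse_capacity("500"))                                # 500
print(parse_capacity(""))                                   # None
```

Returning None for cells without a trailing number lets the caller skip the capacity field instead of raising an error on an unexpected cell.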