Building a corpus with title + text for a selected set of URLs


In [1]:
import org.archive.archivespark._
import org.archive.archivespark.functions._
import org.archive.archivespark.specific.warc._

Loading the dataset

In this example, the web archive dataset will be loaded from local WARC / CDX files (created in this recipe). However, any other Data Specification (DataSpec) could be used here as well to load records of different types from different local or remote sources.


In [2]:
val warcPath = "/data/helgeholzmann-de.warc.gz"
val cdxPath = warcPath + "/*.cdx.gz"

In [3]:
val records = ArchiveSpark.load(WarcSpec.fromFiles(cdxPath, warcPath))
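
Other DataSpecs follow the same pattern. For instance, ArchiveSpark also ships a WaybackSpec that fetches records remotely from the Internet Archive's Wayback Machine via its CDX server. The following sketch is written from memory, and its parameter names are an assumption that may differ between ArchiveSpark versions:

// A sketch (parameter names are assumptions): load captures of
// helgeholzmann.de remotely from the Wayback Machine instead of
// reading local WARC / CDX files
val remote = ArchiveSpark.load(WaybackSpec("helgeholzmann.de", matchPrefix = true, from = 201905, to = 201906, blocksPerPage = 5, pages = 50))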

Filtering records

We can filter out videos, images, stylesheets and all other files except webpages (MIME type text/html), as well as webpages that were not available when they were crawled (we keep only records with status code 200).

It is important to note that this filtering is based solely on metadata, so up to this point ArchiveSpark has not touched the actual web archive records. This metadata-first processing is the core efficiency feature of ArchiveSpark.


In [4]:
val pages = records.filter(r => r.mime == "text/html" && r.status == 200)

The following counts show that we filtered out a large portion of the dataset, which makes the subsequent processing much more efficient:


In [5]:
records.count


Out[5]:
48

In [6]:
pages.count


Out[6]:
6

A peek at the first record of the filtered dataset (in pretty JSON format) shows that it indeed consists of HTML pages with a successful status:


In [7]:
pages.peekJson


Out[7]:
{
    "record" : {
        "redirectUrl" : "-",
        "timestamp" : "20190528152652",
        "digest" : "sha1:HCHVDRUSN7WDGNZFJES2Y4KZADQ6KINN",
        "originalUrl" : "https://www.helgeholzmann.de/",
        "surtUrl" : "de,helgeholzmann)/",
        "mime" : "text/html",
        "compressedSize" : 2087,
        "meta" : "-",
        "status" : 200
    }
}

Select relevant records based on a given set of URLs

We now load the desired URLs into a Spark RDD. In this example, the list of URLs (here only one) is specified in code, but it could also be loaded from a file or another source (a sketch of this follows below). The URLs are then converted into the canonical SURT format using a function from the Sparkling library:


In [8]:
val urls = Set("https://www.helgeholzmann.de/publications").map(org.archive.archivespark.sparkling.util.SurtUtil.fromUrl)

In [9]:
urls


Out[9]:
Set(de,helgeholzmann)/publications)
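
As noted above, the URLs could just as well come from a file. A minimal sketch, assuming a hypothetical /data/urls.txt with one URL per line:

// A sketch: read URLs from a (hypothetical) text file and convert them
// to SURT format, exactly as done for the hard-coded set above
val urlsFromFile = sc.textFile("/data/urls.txt")
  .map(org.archive.archivespark.sparkling.util.SurtUtil.fromUrl)
  .collect.toSet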

In order to make this data available to Spark across all nodes of our computing environment, we use a broadcast variable. (If the set of URLs is too big, a join operation should be used here instead of a broadcast, as sketched below; for a full example, see the recipe on Extracting embedded resources from webpages.)


In [10]:
val selectedUrls = sc.broadcast(urls)
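
The join-based alternative mentioned above could look roughly as follows. This is a sketch only; the linked recipe shows the complete pattern:

// A sketch of the join-based selection for large URL sets: key both sides
// by SURT URL, join, and keep only the matching records
val urlsRdd = sc.parallelize(urls.toSeq).map(url => (url, true))
val selectedPages = pages.map(r => (r.surtUrl, r)).join(urlsRdd).map(_._2._1)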

Filter the pages in our dataset


In [11]:
val filtered = pages.filter(r => selectedUrls.value.contains(r.surtUrl))

In [12]:
filtered.count


Out[12]:
1

Enrich the dataset with the desired information (title + text)

To access the content of an HTML page, ArchiveSpark comes with an Html Enrichment Function:


In [13]:
filtered.enrich(Html).peekJson


Out[13]:
{
    "record" : {
        "redirectUrl" : "-",
        "timestamp" : "20190528152831",
        "digest" : "sha1:XRVCBHVKAC6NQ4N24OCF4S2ABYUOJW3H",
        "originalUrl" : "https://www.helgeholzmann.de/publications",
        "surtUrl" : "de,helgeholzmann)/publications",
        "mime" : "text/html",
        "compressedSize" : 4280,
        "meta" : "-",
        "status" : 200
    },
    "payload" : {
        "string" : {
            "html" : {
                "html" : "<html>\r\n<head>\r\n    <title>Helge Holzmann - @helgeho</title>\r\n    <link rel=\"shortcut icon\" href=\"/images/favicon.png\">\r\n    <link rel=\"stylesheet\" href=\"/css/font-awesome.min.css\">\r\n    <link rel=\"stylesheet\" href=\"/css/academicons.min.css\">\r\n    <link rel=\"stylesheet\" href=\"/css...

As we can see, by default Html extracts the body of the page. To customize this, it provides different ways to specify which tags to extract:

  • Html.first("title") will extract the (first) title tag instead
  • Html.all("a") will extract all anchors / hyperlinks (the result is a list instead of a single item)
  • Html("p", 2) will extract the third paragraph of the page (index 2 = third match)

For more details as well as additional Enrichment Functions, please read the docs.


In [14]:
filtered.enrich(Html.first("title")).peekJson


Out[14]:
{
    "record" : {
        "redirectUrl" : "-",
        "timestamp" : "20190528152831",
        "digest" : "sha1:XRVCBHVKAC6NQ4N24OCF4S2ABYUOJW3H",
        "originalUrl" : "https://www.helgeholzmann.de/publications",
        "surtUrl" : "de,helgeholzmann)/publications",
        "mime" : "text/html",
        "compressedSize" : 4280,
        "meta" : "-",
        "status" : 200
    },
    "payload" : {
        "string" : {
            "html" : {
                "title" : "<title>Helge Holzmann - @helgeho</title>"
            }
        }
    }
}

As we are only interested in the text without the HTML tags (<title>), we need to use the HtmlText Enrichment Function. By default, HtmlText depends on the default version of Html, hence it would extract the text of the body, i.e., the complete text of the page. In order to change this dependency so that only the title is extracted, we can use the .on/.of methods that all Enrichment Functions provide. We give this new Enrichment Function a name (Title) so that we can reuse it later:


In [15]:
val Title = HtmlText.of(Html.first("title"))

In [16]:
filtered.enrich(Title).peekJson


Out[16]:
{
    "record" : {
        "redirectUrl" : "-",
        "timestamp" : "20190528152831",
        "digest" : "sha1:XRVCBHVKAC6NQ4N24OCF4S2ABYUOJW3H",
        "originalUrl" : "https://www.helgeholzmann.de/publications",
        "surtUrl" : "de,helgeholzmann)/publications",
        "mime" : "text/html",
        "compressedSize" : 4280,
        "meta" : "-",
        "status" : 200
    },
    "payload" : {
        "string" : {
            "html" : {
                "title" : {
                    "text" : "Helge Holzmann - @helgeho"
                }
            }
        }
    }
}

In addition to the title, we would also like to have the full text of the page. This will be our final dataset, so we assign it to a new variable (enriched):


In [17]:
val BodyText = HtmlText.of(Html.first("body"))

In [18]:
val enriched = filtered.enrich(Title).enrich(BodyText)

In [19]:
enriched.peekJson


Out[19]:
{
    "record" : {
        "redirectUrl" : "-",
        "timestamp" : "20190528152831",
        "digest" : "sha1:XRVCBHVKAC6NQ4N24OCF4S2ABYUOJW3H",
        "originalUrl" : "https://www.helgeholzmann.de/publications",
        "surtUrl" : "de,helgeholzmann)/publications",
        "mime" : "text/html",
        "compressedSize" : 4280,
        "meta" : "-",
        "status" : 200
    },
    "payload" : {
        "string" : {
            "html" : {
                "title" : {
                    "text" : "Helge Holzmann - @helgeho"
                },
                "body" : {
                    "text" : "Home Research Publications Private Projects Contact Helge Holzmann I am a researcher and PhD candidate at the L3S Research Center in Hannover, Germany. My main research inte...

Save the created corpus

The dataset can either be saved in JSON format, as shown in the peek operations above, which ArchiveSpark supports natively, or it can be converted to a custom format and saved as raw text (using Spark's saveAsTextFile):

Save as JSON

By adding a .gz extension to the path, ArchiveSpark will automatically compress the output using Gzip:


In [20]:
enriched.saveAsJson("/data/title-text_dataset.json.gz")
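
Conversely, leaving out the .gz extension would write plain, uncompressed JSON (a sketch):

// A sketch: the same save without Gzip compression
enriched.saveAsJson("/data/title-text_dataset.json")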

Save in a custom format

The Enrichment Functions (Title and BodyText) can be used as accessors to read the corresponding values, so we can create a tab-separated format as follows:


In [21]:
val tsv = enriched.map{r =>
    // replace tab and newlines with a space
    val title = r.valueOrElse(Title, "").replaceAll("[\\t\\n]", " ")
    val text = r.valueOrElse(BodyText, "").replaceAll("[\\t\\n]", " ")
    // concatenate URL, timestamp, title and text with a tab
    Seq(r.originalUrl, r.timestamp, title, text).mkString("\t")
}

In [22]:
tsv.peek


Out[22]:
https://www.helgeholzmann.de/publications	20190528152831	Helge Holzmann - @helgeho	Home Research Publications Private Projects Contact Helge Holzmann I am a researcher and PhD candidate at the L3S Research Center in Hannover, Germany. My main research interest is on Web archives and related topics, such as big data processing, graph analysis and information retrieval. @helgeho on Twitter helgeho on GitHub Helge on arXiv Email me! Publications 2017 short H. Holzmann, Emily Novak Gustainis and Vinay Goel. Universal Distant Reading through Metadata Proxies with ArchiveSpark. 5th IEEE International Conference on Big Data (BigData). Boston, MA, USA. December 2017. H. Holzmann, W. Nejdl and A. Anand. Exploring Web Archives Through Temporal Anchor Texts. 7th International ACM C...

In [23]:
tsv.saveText("/data/title-text_dataset.tsv.gz")


Out[23]:
1