In [1]:
import org.archive.archivespark._
import org.archive.archivespark.functions._
import org.archive.archivespark.specific.warc._
In this example, the web archive dataset will be loaded from local WARC / CDX files (created in this recipe). However, any other Data Specification (DataSpec) could be used here too, in order to load your records of different types and from different local or remote sources.
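For instance, the same site could also be loaded remotely through the Wayback Machine. A minimal hedged sketch (the WaybackSpec parameters shown here follow the ArchiveSpark docs and may differ between versions):
// hedged sketch: load captures of helgeholzmann.de remotely via the Wayback Machine
// (from / to are partial Wayback timestamps, here years; not part of this recipe)
val remoteRecords = ArchiveSpark.load(WaybackSpec("helgeholzmann.de", matchPrefix = true, from = 2019, to = 2020))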
In [2]:
val warcPath = "/data/helgeholzmann-de.warc.gz"
val cdxPath = warcPath + "/*.cdx.gz"
In [3]:
val records = ArchiveSpark.load(WarcSpec.fromFiles(cdxPath, warcPath))
We can filter out videos, images, stylesheets and all other files except webpages (MIME type text/html), as well as webpages that were unavailable when they were crawled (i.e., we keep only records with status code 200).
It is important to note that this filtering is based solely on metadata, so up to this point ArchiveSpark does not even touch the actual web archive records; this lazy access is ArchiveSpark's core efficiency feature.
In [4]:
val pages = records.filter(r => r.mime == "text/html" && r.status == 200)
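Because this filter runs on the CDX metadata alone, additional conditions can be chained at essentially no extra cost. A small hedged variation (the 2019 prefix is an arbitrary example, not part of this recipe):
// metadata-only refinement: keep successful HTML pages captured in 2019
// (timestamp is the 14-digit capture time from the CDX index)
val pages2019 = records.filter { r =>
  r.mime == "text/html" && r.status == 200 && r.timestamp.startsWith("2019")
}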
The following counts show that we filtered out a very large portion of the records, which makes the subsequent processing much more efficient:
In [5]:
records.count
Out[5]:
In [6]:
pages.count
Out[6]:
A peek at the first record of the filtered dataset (in pretty JSON format) shows that it indeed consists of HTML pages with a successful status code:
In [7]:
pages.peekJson
Out[7]:
We now define the set of URLs to be selected. In this example, the list of URLs (here only one) is specified in code, but it could also be loaded from a file or other sources. The URLs are converted into the canonical SURT format, using a function from the Sparkling library:
In [8]:
val urls = Set("https://www.helgeholzmann.de/publications").map(org.archive.archivespark.sparkling.util.SurtUtil.fromUrl)
In [9]:
urls
Out[9]:
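To illustrate what SurtUtil.fromUrl produces, here is the expected canonicalization (shown as a comment; the exact output is an assumption based on the standard SURT form):
// SURT reverses the host parts and drops the scheme and "www":
org.archive.archivespark.sparkling.util.SurtUtil.fromUrl("https://www.helgeholzmann.de/publications")
// expected: "de,helgeholzmann)/publications"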
In order to make this data available to Spark (across all nodes of our computing environment), we use a broadcast variable. (If the set of URLs is too big, a join operation should be used instead of a broadcast, as sketched below; for a full example, see the recipe on Extracting embedded resources from webpages.)
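For reference, a minimal sketch of that join-based alternative, using the pages RDD defined above (variable names here are hypothetical):
// pair both sides by SURT URL and join, keeping only matching records
val urlRdd = sc.parallelize(urls.toSeq).map(surt => (surt, ()))
val filteredByJoin = pages.map(r => (r.surtUrl, r)).join(urlRdd).map(_._2._1)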
In [10]:
val selectedUrls = sc.broadcast(urls)
In [11]:
val filtered = pages.filter(r => selectedUrls.value.contains(r.surtUrl))
In [12]:
filtered.count
Out[12]:
In [13]:
filtered.enrich(Html).peekJson
Out[13]:
As we can see, by default Html extracts the body of the page. To customize this, it provides different ways to specify which tags to extract:

- Html.first("title") will extract the (first) title tag instead
- Html.all("a") will extract all anchors / hyperlinks (the result is a list instead of a single item)
- Html("p", 2) will extract the third paragraph of the page (index 2 = third match)

For more details as well as additional Enrichment Functions, please read the docs.
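For example, Html.all can be combined with other Enrichment Functions. A hedged sketch (HtmlAttribute and .ofEach as used in the embedded-resources recipe) that extracts the link targets of all anchors:
// extract the href attribute of every anchor on the page
val Links = HtmlAttribute("href").ofEach(Html.all("a"))
filtered.enrich(Links).peekJson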
In [14]:
filtered.enrich(Html.first("title")).peekJson
Out[14]:
As we are only interested in the text without the HTML tags (<title>...</title>), we need to use the HtmlText Enrichment Function. By default, HtmlText depends on the default version of Html, hence it would extract the text of the body, i.e., the complete text of the page. In order to change this dependency and get only the title, we can use the .on/.of method that all Enrichment Functions provide. We give this new Enrichment Function a name (Title) to reuse it later:
In [15]:
val Title = HtmlText.of(Html.first("title"))
In [16]:
filtered.enrich(Title).peekJson
Out[16]:
In addition to the title, we would also like to have the full text of the page. This will be our final dataset, so we assign it to a new variable (enriched):
In [17]:
val BodyText = HtmlText.of(Html.first("body"))
In [18]:
val enriched = filtered.enrich(Title).enrich(BodyText)
In [19]:
enriched.peekJson
Out[19]:
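Finally, we can save this dataset in JSON format. As the given filename ends in .gz, the output will automatically be compressed with Gzip: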
In [20]:
enriched.saveAsJson("/data/title-text_dataset.json.gz")
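Instead of JSON, the dataset can also be converted into a more common format, such as tab-separated values (TSV), by mapping each record to a line of text: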
In [21]:
val tsv = enriched.map { r =>
  // replace tabs and newlines with a space so each record stays on one line
  val title = r.valueOrElse(Title, "").replaceAll("[\\t\\n]", " ")
  val text = r.valueOrElse(BodyText, "").replaceAll("[\\t\\n]", " ")
  // concatenate URL, timestamp, title and text, separated by tabs
  Seq(r.originalUrl, r.timestamp, title, text).mkString("\t")
}
In [22]:
tsv.peek
Out[22]:
In [23]:
tsv.saveText("/data/title-text_dataset.tsv.gz")