In [1]:
import org.archive.archivespark._
import org.archive.archivespark.functions._
import org.archive.archivespark.specific.warc._
In this example, the web archive dataset will be loaded from local WARC / CDX files (created in this recipe). However, any other Data Specification (DataSpec) could be used here instead, to load records of other types or from different local or remote sources.
In [2]:
val warcPath = "/data/helgeholzmann-de.warc.gz"
val cdxPath = warcPath.stripSuffix(".warc.gz") + ".cdx.gz"
In [3]:
val records = ArchiveSpark.load(WarcSpec.fromFiles(cdxPath, warcPath))
Embeds are specific to webpages, so we can filter out videos, images, stylesheets and any other files except for webpages (MIME type text/html), as well as webpages that were unavailable when they were crawled (keeping only status code 200).
It is important to note that this filtering is done only based on metadata, so up to this point ArchiveSpark does not even touch the actual web archive records, which is the core efficiency feature of ArchiveSpark.
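This metadata-level predicate can be sketched on plain Scala collections. `CdxMeta` and its fields are illustrative stand-ins for the CDX metadata that ArchiveSpark exposes on its records, not the library's actual types:

```scala
// Simplified stand-in for a CDX metadata record (hypothetical type,
// for illustration only).
case class CdxMeta(surtUrl: String, mime: String, status: Int)

val metas = Seq(
  CdxMeta("de,helgeholzmann)/", "text/html", 200),
  CdxMeta("de,helgeholzmann)/css/style.css", "text/css", 200),
  CdxMeta("de,helgeholzmann)/missing", "text/html", 404)
)

// Same predicate as the notebook cell: HTML pages that were online at crawl time
val pages = metas.filter(r => r.mime == "text/html" && r.status == 200)
```

Because only these small metadata fields are consulted, no WARC payload has to be read to evaluate the filter.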
In [4]:
val pages = records.filter(r => r.mime == "text/html" && r.status == 200)
By looking at the first record in our remaining dataset, we can see that this indeed is of type text/html and was online (status 200) at the time of crawl:
In [5]:
pages.peekJson
Out[5]:
A web archive may contain multiple captures of the same page. We remove these duplicates by keeping only the earliest capture of each distinct record, identified by its content digest:
In [6]:
val earliest = pages.distinctValue(_.digest) {(a, b) => if (a.time.isBefore(b.time)) a else b}.cache
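The semantics of `distinctValue(_.digest)` with an earliest-wins reduce function can be mimicked on plain collections. `Capture` and its fields are illustrative stand-ins, not ArchiveSpark types:

```scala
import java.time.Instant

// Hypothetical stand-in for an archived capture (digest identifies content).
case class Capture(digest: String, time: Instant)

val captures = Seq(
  Capture("sha1:AAA", Instant.parse("2019-05-01T00:00:00Z")),
  Capture("sha1:AAA", Instant.parse("2019-01-01T00:00:00Z")), // earlier duplicate
  Capture("sha1:BBB", Instant.parse("2019-03-01T00:00:00Z"))
)

val earliest = captures
  .groupBy(_.digest) // one group per content digest
  .values
  .map(_.reduce((a, b) => if (a.time.isBefore(b.time)) a else b)) // keep earliest
  .toSeq
```

Each digest survives exactly once, represented by its earliest capture.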
In this example we want to extract stylesheets, hence we are interested in link tags with the attribute rel="stylesheet". Similarly, we could also extract images or other resources.
We first need to define the required Enrichment Function to enrich our metadata with the URLs (in SURT format) of the embedded stylesheets.
In [7]:
val Stylesheets = Html.all("link").mapMulti("stylesheets") { linkTags => linkTags.filter(_.contains("rel=\"stylesheet\""))}
val StylesheetUrls = SURT.of(HtmlAttribute("href").ofEach(Stylesheets))
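The filter passed to `mapMulti` above is a plain substring check on the raw tag strings that `Html.all("link")` yields; sketched here on sample tags (the HTML snippets are made up for illustration):

```scala
// Raw <link> tags as the Html enrichment function would extract them
val linkTags = Seq(
  """<link rel="stylesheet" href="/css/main.css">""",
  """<link rel="icon" href="/favicon.ico">"""
)

// Keep only tags that declare rel="stylesheet"
val stylesheets = linkTags.filter(_.contains("rel=\"stylesheet\""))
```

The `href` attribute of each surviving tag is then extracted and normalized to SURT format by the second enrichment function.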
In [8]:
earliest.enrich(StylesheetUrls).peekJson
Out[8]:
At this point, we have to access the original records dataset again, as the stylesheets are not among the filtered pages. A join operation is used to filter the records in the dataset and keep only the previously extracted stylesheet files. As a join is performed on the keys of the datasets, we introduce a dummy value (true) here to make the URL the key of the records. For more information, please read the Spark Programming Guide.
In [11]:
val stylesheetUrls = earliest.flatMapValues(StylesheetUrls.multi).distinct.map(url => (url, true))
In [12]:
val stylesheets = records.map(r => (r.surtUrl, r)).join(stylesheetUrls).map{case (url, (record, dummy)) => record}
As above, we again remove duplicates from the stylesheet dataset:
In [15]:
val distinctStylesheets = stylesheets.distinctValue(_.digest) {(a, b) => if (a.time.isBefore(b.time)) a else b}.cache
In [16]:
distinctStylesheets.peekJson
Out[16]:
In [14]:
distinctStylesheets.saveAsWarc("stylesheets.warc.gz", WarcMeta(publisher = "Internet Archive"))
Another option is to enrich the metadata of the stylesheets with their actual content and save it as JSON:
In [17]:
val enriched = distinctStylesheets.enrich(StringContent)
In [18]:
enriched.peekJson
Out[18]:
By adding a .gz extension to the output path, the data will be automatically compressed with GZip.
In [17]:
enriched.saveAsJson("stylesheets.json.gz")
To learn how to convert and save the dataset to some custom format, please see the recipe on Extracting title + text from a selected set of URLs.
For more recipes, please check the ArchiveSpark documentation.