Extracting embedded resources from webpages



In [1]:

    
import org.archive.archivespark._
import org.archive.archivespark.functions._
import org.archive.archivespark.specific.warc._

Loading the dataset

In this example, the web archive dataset will be loaded from local WARC / CDX files (created in this recipe). However, any other Data Specification (DataSpec) could be used here too, in order to load your records of different types and from different local or remote sources.



In [2]:

    
val warcPath = "/data/helgeholzmann-de.warc.gz"
val cdxPath = warcPath + "/*.cdx.gz"



In [3]:

    
val records = ArchiveSpark.load(WarcSpec.fromFiles(cdxPath, warcPath))

Filtering irrelevant records

Embeds are specific to webpages, so we can filter out videos, images, stylesheets and any other files except for webpages (mime type text/html), as well as webpages that were unavailable when they were crawled either (status code == 200).

It is important to note that this filtering is done only based on metadata, so up to this point ArchiveSpark does not even touch the actual web archive records, which is the core efficiency feature of ArchiveSpark.



In [4]:

    
val pages = records.filter(r => r.mime == "text/html" && r.status == 200)

By looking at the first record in our remaining dataset, we can see that this indeed is of type text/html and was online (status 200) at the time of crawl:



In [5]:

    
pages.peekJson









    Out[5]:





{
    "record" : {
        "redirectUrl" : "-",
        "timestamp" : "20190528152652",
        "digest" : "sha1:HCHVDRUSN7WDGNZFJES2Y4KZADQ6KINN",
        "originalUrl" : "https://www.helgeholzmann.de/",
        "surtUrl" : "de,helgeholzmann)/",
        "mime" : "text/html",
        "compressedSize" : 2087,
        "meta" : "-",
        "status" : 200
    }
}

Removing duplicates

In order to save processing time, we remove duplicate websites (based on the digest in the CDX records) and only keep the earliest snapshot for each distinct content. This will be cached, so that we do not need to compute it every time we want to access that collection.



In [6]:

    
val earliest = pages.distinctValue(_.digest) {(a, b) => if (a.time.isBefore(b.time)) a else b}.cache

Extracting embedded resources

In this example we want to extract stylesheets, hence we are interested in link tags with attribute rel="stylesheet". Similarly, we could also extract images or other resources.

We first need to define the required Enrichment Function to enrich our metadata with the URLs (in SURT format) of the embedded stylesheets.



In [7]:

    
val Stylesheets = Html.all("link").mapMulti("stylesheets") { linkTags => linkTags.filter(_.contains("rel=\"stylesheet\""))}
val StylesheetUrls = SURT.of(HtmlAttribute("href").ofEach(Stylesheets))



In [8]:

    
earliest.enrich(StylesheetUrls).peekJson









    Out[8]:





{
    "record" : {
        "redirectUrl" : "-",
        "timestamp" : "20190528152831",
        "digest" : "sha1:XRVCBHVKAC6NQ4N24OCF4S2ABYUOJW3H",
        "originalUrl" : "https://www.helgeholzmann.de/publications",
        "surtUrl" : "de,helgeholzmann)/publications",
        "mime" : "text/html",
        "compressedSize" : 4280,
        "meta" : "-",
        "status" : 200
    },
    "payload" : {
        "string" : {
            "html" : {
                "link" : {
                    "stylesheets" : [
                        {
                            "attributes" : {
                                "href" : {
                                    "SURT" : "de,helgeholzmann)/images/favicon.png"
                                }
                            }
       ...

Identifying the relevant embeds / stylesheets in the dataset

At this point, we have to access the original dataset records again, as the stylesheets are not among the filtered pages. A join operation is used to filter the records in the dataset and keep only the previously extracted stylesheet files. As a join is performed on the keys in the dataset, we introduce a dummy value (true) here to make the URL the key of the records. For more information please read the Spark Programming Guide.



In [11]:

    
val stylesheetUrls = earliest.flatMapValues(StylesheetUrls.multi).distinct.map(url => (url, true))



In [12]:

    
val stylesheets = records.map(r => (r.surtUrl, r)).join(stylesheetUrls).map{case (url, (record, dummy)) => record}

Similar to above, we again remove duplicates in the stylesheet dataset:



In [15]:

    
val distinctStylesheets = stylesheets.distinctValue(_.digest) {(a, b) => if (a.time.isBefore(b.time)) a else b}.cache



In [16]:

    
distinctStylesheets.peekJson









    Out[16]:





{
    "record" : {
        "redirectUrl" : "-",
        "timestamp" : "20190528152655",
        "digest" : "sha1:ETFODABRIVFG5WELMU2Y3UE7U66RQXD5",
        "originalUrl" : "https://www.helgeholzmann.de/css/academicons.min.css",
        "surtUrl" : "de,helgeholzmann)/css/academicons.min.css",
        "mime" : "text/css",
        "compressedSize" : 1497,
        "meta" : "-",
        "status" : 200
    }
}

Saving the relevant embeds

There are different options to save the embeds datasets. One way would be to save the embeds as WARC records as follows:



In [14]:

    
distinctStylesheets.saveAsWarc("stylesheets.warc.gz", WarcMeta(publisher = "Internet Archive"))

Another option is to enrich the metadata of the stylesheets with their actual content and save it as JSON:



In [17]:

    
val enriched = distinctStylesheets.enrich(StringContent)



In [18]:

    
enriched.peekJson









    Out[18]:





{
    "record" : {
        "redirectUrl" : "-",
        "timestamp" : "20190528152655",
        "digest" : "sha1:ETFODABRIVFG5WELMU2Y3UE7U66RQXD5",
        "originalUrl" : "https://www.helgeholzmann.de/css/academicons.min.css",
        "surtUrl" : "de,helgeholzmann)/css/academicons.min.css",
        "mime" : "text/css",
        "compressedSize" : 1497,
        "meta" : "-",
        "status" : 200
    },
    "payload" : {
        "string" : "@font-face{font-family:'Academicons';src:url('../fonts/academicons.eot?v=1.7.0');src:url('../fonts/academicons.eot?v=1.7.0') format('embedded-opentype'), url('../fonts/academicons.ttf?v=1.7.0') format('truetype'), url('../fonts/academicons.woff?v=1.7.0') format('woff'), url('../fonts/academicons.svg?v=1.7.0#academicons') format('svg');...

By adding a .gz extension to the output path, the data will be automatically compressed with GZip.



In [17]:

    
enriched.saveAsJson("stylesheets.json.gz")

To learn how to convert and save the dataset to some custom format, please see the recipe on Extracting title + text from a selected set of URLs.

For more recipes, please check the ArchiveSpark documentation.