Downloading a web archive dataset as WARC/CDX from the Wayback Machine


In [1]:
import org.archive.archivespark._
import org.archive.archivespark.functions._
import org.archive.archivespark.specific.warc._

Loading the dataset from the Wayback Machine

ArchiveSpark provides a Data Specification (DataSpec) that loads records remotely from the Wayback Machine, with metadata fetched from the CDX Server. For more details about this and other DataSpecs, please read the ArchiveSpark documentation.

The following example loads all archived resources under the domain helgeholzmann.de (matchPrefix = true) captured between May and June 2019 (from = 201905, to = 201906), requesting 5 blocks per page for at most 50 pages (please read the CDX Server documentation for more information on these parameters):


In [2]:
val records = ArchiveSpark.load(WarcSpec.fromWayback("helgeholzmann.de", matchPrefix = true, from = 201905, to = 201906, blocksPerPage = 5, pages = 50))
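
To make the parameters above more concrete, the following sketch builds the kind of CDX Server query that such a request could translate to. The object CdxQuery and its build method are hypothetical helpers written for illustration and are not part of ArchiveSpark. The query parameters (url, matchType, from, to, pageSize, page) are those of the public Wayback CDX Server API; the assumption that blocksPerPage corresponds to pageSize is ours and is not confirmed by this recipe.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Hypothetical helper (not part of ArchiveSpark): builds a CDX Server query URL.
object CdxQuery {
  // Public CDX Server endpoint of the Wayback Machine.
  val Endpoint = "http://web.archive.org/cdx/search/cdx"

  def build(url: String, matchPrefix: Boolean, from: Long, to: Long,
            blocksPerPage: Int, page: Int): String = {
    val params = Seq(
      "url" -> url,
      "matchType" -> (if (matchPrefix) "prefix" else "exact"),
      "from" -> from.toString,
      "to" -> to.toString,
      "pageSize" -> blocksPerPage.toString, // assumed mapping of blocksPerPage
      "page" -> page.toString               // zero-based page index
    )
    // URL-encode every value and join the pairs into a query string.
    val query = params.map { case (key, value) =>
      s"$key=${URLEncoder.encode(value, StandardCharsets.UTF_8.name)}"
    }.mkString("&")
    s"$Endpoint?$query"
  }
}
```

Calling CdxQuery.build("helgeholzmann.de", matchPrefix = true, 201905L, 201906L, 5, 0) yields the URL of the first result page. Opening such a URL in a browser can be a quick way to check what a parameter combination returns before starting a Spark job.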

Peeking at the first record as JSON

As usual, the records in this dataset can be printed as JSON, and all common operations as well as Enrichment Functions can be applied, as shown in other recipes.


In [3]:
records.peekJson


Out[3]:
{
    "record" : {
        "redirectUrl" : "-",
        "timestamp" : "20190528152652",
        "digest" : "HCHVDRUSN7WDGNZFJES2Y4KZADQ6KINN",
        "originalUrl" : "https://www.helgeholzmann.de/",
        "surtUrl" : "de,helgeholzmann)/",
        "mime" : "warc/revisit",
        "compressedSize" : 771,
        "meta" : "-",
        "status" : -1
    }
}

This is a revisit record (mime warc/revisit), which marks a duplicate capture in the Wayback Machine. When downloaded locally, it will be stored as the original text/html resource.
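
As a minimal sketch of how revisit records can be told apart from regular captures, the following self-contained example checks the mime field shown in the JSON output. The case class CdxMeta and the object Revisits are hypothetical, simplified stand-ins for illustration and do not reproduce ArchiveSpark's own record types.

```scala
// Hypothetical, simplified model of the metadata fields shown in the JSON output.
final case class CdxMeta(
  originalUrl: String,
  timestamp: String,
  mime: String,
  status: Int,
  digest: String
)

object Revisits {
  // In CDX metadata, a revisit is flagged by the mime value "warc/revisit".
  def isRevisit(record: CdxMeta): Boolean = record.mime == "warc/revisit"

  // Splits records into (revisits, regular captures).
  def split(records: Seq[CdxMeta]): (Seq[CdxMeta], Seq[CdxMeta]) =
    records.partition(isRevisit)
}
```

For the record above, Revisits.isRevisit returns true, because its mime field is warc/revisit and not the MIME type of the underlying resource.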

Saving as WARC / CDX

Save the records as local .warc.gz and .cdx.gz files. Adding the .gz extension to the path compresses the output with gzip:


In [4]:
records.saveAsWarc("/data/helgeholzmann-de.warc.gz", WarcFileMeta(publisher = "Helge Holzmann"), generateCdx = true)


Out[4]:
48
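
To inspect a gzip-compressed CDX file outside of Spark, the JDK's GZIPInputStream is sufficient. The sketch below is a hypothetical helper (GzipPeek is not part of ArchiveSpark) that returns the first lines of a gzip-compressed text file. It assumes the path points to a single file; depending on the Spark setup, the output location may instead contain several part files, so the exact path to pass in is left open here.

```scala
import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPInputStream

// Hypothetical helper (not part of ArchiveSpark): peeks into a gzip-compressed text file.
object GzipPeek {
  // Returns at most n lines from the start of the file at the given path.
  def firstLines(path: String, n: Int): List[String] = {
    val reader = new BufferedReader(new InputStreamReader(
      new GZIPInputStream(new FileInputStream(path)), StandardCharsets.UTF_8))
    try {
      // toList forces the lazy iterator before the reader is closed.
      Iterator.continually(reader.readLine()).takeWhile(_ != null).take(n).toList
    } finally {
      reader.close()
    }
  }
}
```

Reading only a few lines keeps the check cheap even for large files, since decompression stops once enough lines have been collected.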

Loading from WARC / CDX

Now, with the dataset available in local WARC / CDX files, we can load it from there:


In [5]:
val records = ArchiveSpark.load(WarcSpec.fromFilesWithCdx("/data/helgeholzmann-de.warc.gz"))

In [6]:
records.peekJson


Out[6]:
{
    "record" : {
        "redirectUrl" : "-",
        "timestamp" : "20190528152652",
        "digest" : "sha1:HCHVDRUSN7WDGNZFJES2Y4KZADQ6KINN",
        "originalUrl" : "https://www.helgeholzmann.de/",
        "surtUrl" : "de,helgeholzmann)/",
        "mime" : "text/html",
        "compressedSize" : 2087,
        "meta" : "-",
        "status" : 200
    }
}
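
Compared with the record peeked from the Wayback Machine in Out[3], the locally loaded record in Out[6] reports the mime type text/html and status 200 in place of warc/revisit and -1, and its digest carries a sha1: prefix. When matching remote and local records by digest, the prefix therefore has to be taken into account. The following is a minimal sketch of such a comparison; DigestCompare is a hypothetical helper for illustration and not part of ArchiveSpark.

```scala
// Hypothetical helper (not part of ArchiveSpark): compares digests with or without prefix.
object DigestCompare {
  // Removes an optional "sha1:" algorithm prefix from a digest string.
  def normalize(digest: String): String = digest.stripPrefix("sha1:")

  // Two digests are considered equal if they match after normalization.
  def same(a: String, b: String): Boolean = normalize(a) == normalize(b)
}
```

With this helper, the digests from Out[3] and Out[6] compare as equal, which confirms that both records refer to the same payload.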