Loading WARC / Generating CDX (enabling more efficient processing)

ArchiveSpark gains its efficiency through a two-step loading approach, which accesses only metadata for common operations like filtering, sorting, grouping, etc. Only if content is required, to apply additional filters or to derive new information from a record, will ArchiveSpark access the actual records. For web archives, the required metadata is commonly provided by CDX records. In the following, we show how to generate these CDX records from a collection of (W)ARC(.gz) files.
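The idea can be illustrated with a minimal sketch (assuming the imports below and a dataset records loaded with a CDX-backed DataSpec, as shown at the end of this recipe; HtmlText is one of ArchiveSpark's built-in Enrichment Functions):

// filtering on a metadata field only reads the compact CDX records
val successful = records.filter(r => r.status == 200)
// enriching with extracted text is the point at which the WARC payload is read
val enriched = successful.enrich(HtmlText)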


In [1]:
import org.archive.archivespark._
import org.archive.archivespark.functions._
import org.archive.archivespark.specific.warc._

Loading the dataset from (W)ARC(.gz) files (without CDX)

In this example, the web archive dataset will be loaded from local WARC files only (created in this recipe), without the corresponding CDX files. This is a lot slower than using CDX metadata records, but sometimes necessary if CDX files are not available.


In [2]:
val warc = ArchiveSpark.load(WarcSpec.fromFiles("/data/helgeholzmann-de.warc.gz/*.warc.gz"))

Taking a look at the first record

As we can see below, although loaded directly from WARC, the records are internally represented in the same format as datasets with provided CDX data. Hence, we can apply the same operations as well as the same Enrichment Functions (see the sketch after the output below); however, the processing will be less efficient than with CDX records available.


In [3]:
warc.peekJson


Out[3]:
{
    "record" : {
        "redirectUrl" : "-",
        "timestamp" : "20190528152652",
        "digest" : "sha1:HCHVDRUSN7WDGNZFJES2Y4KZADQ6KINN",
        "originalUrl" : "https://www.helgeholzmann.de/",
        "surtUrl" : "de,helgeholzmann)/",
        "mime" : "text/html",
        "compressedSize" : 2087,
        "meta" : "-",
        "status" : 200
    }
}
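For instance, a metadata filter followed by an Enrichment Function works here just as it would with CDX records (a hedged sketch; mime and status are the record fields shown in the output above, and HtmlText comes with the functions package imported at the top):

val pages = warc.filter(r => r.mime == "text/html" && r.status == 200)
pages.enrich(HtmlText).peekJson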

Counting the records in this dataset takes a long time, as all headers and contents have to be read and parsed:


In [4]:
warc.count


Out[4]:
48

(this can take hours / days for large datasets)

Generating CDX

We can now generate and save the CDX records corresponding to our dataset for a more efficient use of this dataset with ArchiveSpark in the future:


In [5]:
warc.saveAsCdx("/data/helgeholzmann-de.cdx.gz")

(by adding .gz to the path, the output will automatically be compressed using GZip)
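As a quick sanity check (a minimal sketch, assuming the notebook exposes the SparkContext as sc, as Spark notebooks usually do): the generated CDX is plain text with one line per record and can be inspected with plain Spark, which decompresses .gz files transparently.

sc.textFile("/data/helgeholzmann-de.cdx.gz").first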

Re-loading dataset with CDX records

As we now have CDX records for our dataset, we can load it more efficiently by providing the CDX location along with the WARC path to a suitable Data Specification (DataSpec):


In [6]:
val records = ArchiveSpark.load(WarcSpec.fromFiles("/data/helgeholzmann-de.cdx.gz", "/data/helgeholzmann-de.warc.gz"))

Counting, as well as most other operations provided by Spark and ArchiveSpark, will now be more efficient.


In [7]:
records.count


Out[7]:
48

(this usually takes seconds / minutes, depending on the size of the dataset)
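
Since operations like filtering, grouping, and counting work on metadata, they now only read the compact CDX records instead of parsing the WARC files. A minimal sketch (using the mime field shown in the peek output above) that tallies the records per MIME type without touching the WARC files:

records.map(r => r.mime).countByValue()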