ArchiveSpark gains its efficiency through a two-step loading approach: common operations like filtering, sorting, and grouping access only metadata. Only if content is required, e.g., to apply additional filters or to derive new information from a record, does ArchiveSpark access the actual records. The required metadata for web archives is commonly provided by CDX records. In the following we show how to generate these CDX records from a collection of (W)ARC(.gz) files.
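To illustrate the idea, the following sketch shows a metadata-only filter that never touches the WARC payload. The paths as well as the record fields status and mime are assumptions made for this example and are not part of this recipe:

// Sketch (assumed paths / field names): filtering on CDX metadata only
import org.archive.archivespark._
import org.archive.archivespark.specific.warc._

val rdd = ArchiveSpark.load(WarcSpec.fromFiles("/data/example.cdx.gz", "/data/example.warc.gz"))
// this filter is evaluated against the CDX metadata, so no WARC record is read
val htmlOk = rdd.filter(r => r.status == 200 && r.mime == "text/html")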
In [1]:
import org.archive.archivespark._
import org.archive.archivespark.functions._
import org.archive.archivespark.specific.warc._
In this example, the web archive dataset will be loaded from local WARC files only (created in this recipe), without the corresponding CDX files. This is a lot slower than using CDX metadata records, but sometimes necessary if CDX files are not available.
In [2]:
val warc = ArchiveSpark.load(WarcSpec.fromFiles("/data/helgeholzmann-de.warc.gz/*.warc.gz"))
As we can see, although the dataset was loaded directly from WARC, the records are internally represented in the same format as datasets with provided CDX data. Hence, we can apply the same operations as well as Enrichment Functions; however, the processing will be less efficient than with available CDX records.
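As a sketch of what such an operation could look like, an Enrichment Function like HtmlText (imported above from org.archive.archivespark.functions) might be applied to the WARC-only dataset as follows; treat this as an illustrative example rather than a step of this recipe:

// Sketch: enriching the WARC-only dataset works the same as with CDX, just slower
val enriched = warc.enrich(HtmlText)   // extracts the text of each HTML page
enriched.peekJson                      // inspect one enriched record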
In [3]:
warc.peekJson
Out[3]:
In [4]:
warc.count
Out[4]:
(this can take hours / days for large datasets)
We can now generate and save the CDX records corresponding to our dataset, so that it can be used more efficiently with ArchiveSpark in the future:
In [5]:
warc.saveAsCdx("/data/helgeholzmann-de.cdx.gz")
(by adding .gz to the path, the output will automatically be compressed using GZip)
Now that we have CDX records for our dataset, we can load it more efficiently by passing the CDX location to a suitable Data Specification (DataSpec):
In [6]:
val records = ArchiveSpark.load(WarcSpec.fromFiles("/data/helgeholzmann-de.cdx.gz", "/data/helgeholzmann-de.warc.gz"))
Counting, as well as most other operations provided by Spark and ArchiveSpark, will now be more efficient.
In [7]:
records.count
Out[7]:
(this usually takes seconds / minutes, depending on the size of the dataset)
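To benefit from the CDX-backed dataset, a typical next step is to filter on metadata first and only then enrich and save the result. A minimal sketch, assuming the record fields status and mime and an output path chosen for this example:

// Sketch: metadata-first processing on the CDX-backed dataset
val pages = records.filter(r => r.status == 200 && r.mime == "text/html") // CDX metadata only
val withText = pages.enrich(HtmlText)                                     // now the WARC payload is accessed
withText.saveAsJson("/data/helgeholzmann-de-text.json.gz")                // example path; .gz compresses the output, as above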