In [1]:
import org.archive.archivespark._
import org.archive.archivespark.functions._
import org.archive.archivespark.specific.warc._
ArchiveSpark provides a Data Specification (DataSpec) to load records remotely from the Wayback Machine, with metadata fetched from the CDX server. For more details on this and other DataSpecs, please read the docs.
The following example loads all archived resources under the domain helgeholzmann.de (matchPrefix = true) between May and June 2019, with 5 blocks per page and a maximum of 50 pages (please read the CDX server documentation for more information on these parameters):
In [2]:
val records = ArchiveSpark.load(WarcSpec.fromWayback("helgeholzmann.de", matchPrefix = true, from = 201905, to = 201906, blocksPerPage = 5, pages = 50))
As usual, the records in this dataset can be printed as JSON, and all common operations as well as Enrichment Functions can be applied as shown in other recipes.
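For instance, an Enrichment Function can be applied directly to this remotely loaded dataset. The following is a minimal sketch using HtmlText and Html from org.archive.archivespark.functions; the value name Title is an illustrative choice:

```scala
// Sketch: enrich each record with the text content of its HTML <title> tag.
// HtmlText and Html are Enrichment Functions shipped with ArchiveSpark.
val Title = HtmlText.of(Html.first("title"))
val enriched = records.enrich(Title)

// Peek at one enriched record as JSON to inspect the derived field.
enriched.peekJson
```

Note that enriching remote records triggers downloads from the Wayback Machine, so it is usually more efficient to save the records locally first (as shown below) and enrich the local copy.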
In [3]:
records.peekJson
Out[3]:
This revisit record, which marks a duplicate in the Wayback Machine, will be stored as the original text/html resource when downloaded locally.
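Before saving, the dataset can be narrowed down using common Spark operations on the CDX metadata fields. A small sketch, assuming the mime and status accessors exposed on Wayback records:

```scala
// Sketch: keep only successful text/html captures.
// mime and status come from the CDX metadata of each record.
val pages = records.filter(r => r.mime == "text/html" && r.status == 200)
```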
Save the records as local .warc.gz and .cdx.gz files (by adding the .gz extension to the path, the output will be compressed using GZip):
In [4]:
records.saveAsWarc("/data/helgeholzmann-de.warc.gz", WarcFileMeta(publisher = "Helge Holzmann"), generateCdx = true)
Out[4]:
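The generated CDX index is a plain GZip-compressed text file, so it can be inspected with ordinary Spark. A sketch, assuming the output path of the saveAsWarc call above and that sc is the notebook's SparkContext:

```scala
// Sketch: read the generated CDX file (Spark decompresses .gz transparently)
// and print a few index lines to verify the output.
val cdx = sc.textFile("/data/helgeholzmann-de.cdx.gz")
cdx.take(3).foreach(println)
```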
In [5]:
val records = ArchiveSpark.load(WarcSpec.fromFilesWithCdx("/data/helgeholzmann-de.warc.gz"))
In [6]:
records.peekJson
Out[6]: