Autonomía digital y tecnológica

Código e ideas para una internet distribuida

Linkoteca. WARC


The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive’s ARC_IA File Format[4] that has traditionally been used to store «web crawls» as sequences of content blocks harvested from the World Wide Web. The WARC format generalizes the older format to better support the harvesting, access, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, and later-date transformations.

From the discussion about Working with ARCHIVE.ORG, we learn that it is important to save not just files but also HTTP headers.

To download a file and save the request and response data to a WARC file, run this:

wget «http://www.archiveteam.org/» –warc-file=»at»

This will download the file to index.html, but it will also create a file at-00000.warc.gz. This is a gzipped WARC file that contains the request and response headers (of the initial redirect and of the Wiki homepage) and the html data.

If you want to have an uncompressed WARC file, use the –no-warc-compression option:

wget «http://www.archiveteam.org/» –warc-file=»at» –no-warc-compression