Autonomía digital y tecnológica

Code and ideas for a distributed internet

Linkoteca. wget


I used wget, which is available on any linux-ish system (I ran it on the same Ubuntu server that hosts the sites).

wget --mirror -p --html-extension --convert-links -e robots=off -P . http://url-to-site

That command doesn’t throttle the requests, so it could cause problems if the server is under heavy load (a politer variant is sketched after the option list). Here’s what that line does:

--mirror: turns on recursion etc… rather than just downloading the single file at the root of the URL, it’ll now suck down the entire site.
-p: download all prerequisites (supporting media etc…) rather than just the HTML.
--html-extension: adds .html to the downloaded filenames, to make sure the archive plays nicely on whatever system you’re going to view it on.
--convert-links: rewrites the URLs in the downloaded HTML files to point to the downloaded files rather than to the live site. This makes it nice and portable, with everything living in a self-contained directory.
-e robots=off: runs the command "robots=off" as if it were part of your .wgetrc, telling wget to ignore the site’s robots.txt. This is strictly Not a Good Thing To Do, but if you own the site, this is OK. If you don’t own the site being archived, you should obey all robots.txt files or you’ll be a Very Bad Person.
-P .: sets the download directory. I left it at the default "." (which means "here"), but this is where you could pass in a directory path to tell wget where to save the archived site. Handy if you’re doing this on a regular basis (say, as a cron job or something… a sketch of that follows below).
http://url-to-site: this is the full URL of the site to download. You’ll likely want to change this.
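If the target server really is under load, wget has built-in politeness options. The variant below is a sketch rather than part of the original note, and the 2-second wait and 200 KB/s cap are arbitrary example values:

wget --mirror -p --html-extension --convert-links -e robots=off --wait=2 --random-wait --limit-rate=200k -P . http://url-to-site

--wait pauses between successive requests, --random-wait varies that pause, and --limit-rate caps the download bandwidth.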
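And since the -P note mentions running this on a schedule, here is what a crontab entry could look like; this is an illustration only, and the 3 a.m. run time, the /var/archives/mysite path and the log file are made-up placeholders:

0 3 * * * wget --mirror -p --html-extension --convert-links -e robots=off -P /var/archives/mysite http://url-to-site >> /var/log/site-archive.log 2>&1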

From the discussion about Working with ARCHIVE.ORG, we learn that it is important to save not just files but also HTTP headers.

To download a file and save the request and response data to a WARC file, run this:

wget "http://www.archiveteam.org/" --warc-file="at"

This will download the file to index.html, but it will also create a file at-00000.warc.gz. This is a gzipped WARC file that contains the request and response headers (of the initial redirect and of the Wiki homepage) and the html data.

If you want to have an uncompressed WARC file, use the --no-warc-compression option:

wget "http://www.archiveteam.org/" --warc-file="at" --no-warc-compression
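As an aside that isn’t in the quoted material: WARC record headers are plain text, so you can peek at what the commands above produced with ordinary tools, e.g.:

zcat at-00000.warc.gz | less

grep -a "^WARC-Type:" at-00000.warc

The first command pages through the compressed archive; the second lists the record types (request, response, and so on) in the uncompressed one.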

When the Internet Archive (IA) first started doing their thing, they came across a problem: how do you actually save all of the information related to a website as it existed at a point in time? IA wanted to capture it all, including headers, images, stylesheets, etc.

After a lot of revision, the smart folks there built a specification for a file format named WARC, for Web ARChive. The details aren’t super important, but the gist is that it preserves everything, including headers, in a verifiable, indexed, checksummed format.
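The two ideas above can be combined: wget can crawl a site recursively and write everything it fetches into a WARC at the same time. The command below is a sketch along those lines, not something from the quoted posts; the URL is the same placeholder as before, the "site" prefix is arbitrary, and --warc-cdx additionally asks wget for a CDX index of the records:

wget --recursive --level=inf --page-requisites --warc-file=site --warc-cdx http://url-to-site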

wget --recursive --no-clobber --page-requisites --html-extension --convert-links --domains website.org --no-parent www.website.org/tutorials/html/

This command downloads the Web site www.website.org/tutorials/html/.

The options are:

--recursive: download the entire Web site.
--domains website.org: don’t follow links outside website.org.
--no-parent: don’t follow links outside the directory tutorials/html/.
--page-requisites: get all the elements that compose the page (images, CSS and so on).
--html-extension: save files with the .html extension.
--convert-links: convert links so that they work locally, off-line.
--no-clobber: don’t overwrite any existing files (used in case the download is interrupted and resumed).
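As a purely additional note, the same command can be written with wget’s short options, where -r is --recursive, -nc is --no-clobber, -p is --page-requisites, -E is --html-extension (called --adjust-extension in newer wget releases), -k is --convert-links, -D is --domains and -np is --no-parent:

wget -r -nc -p -E -k -D website.org -np www.website.org/tutorials/html/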