Creating a WARC web archive using wget
I've been tinkering with keeping offline copies of websites (mostly mine), and have always used either wget or httrack. I wasn't aware of the WARC format until recently, so I thought I'd try creating a few WARC archives. wget, as it happens, has WARC support built in via the --warc-file option. I added that to my usual set of switches and put it all in a shell script, like so.

    #!/bin/sh
    # Usage: warc-archive.sh https://example.com warc-file-name
    # $1 is the URL to mirror, $2 is the base name for the WARC file.
    wget \
      --mirror \
      --warc-file="$2" \
      --warc-cdx \
      --page-requisites \
      --html-extension \
      --execute robots=off \
      --directory-prefix=. \
      --wait=1 \
      --random-wait \
      "$1"

This creates a compressed, self-contained WARC file along with a mirrored set of files comprising the entire site. ...
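To sanity-check what ended up in the archive, one option is to list the URLs recorded in the WARC headers. The following is a minimal sketch, not part of the original script: the function name `list_warc_uris` is made up for illustration, and it assumes a `gzip` on the PATH and a `grep` that supports `-a` (GNU and BSD grep both do). It also assumes the archive is named `warc-file-name.warc.gz`, since wget appends `.warc.gz` to the name given to --warc-file.

```shell
#!/bin/sh
# list_warc_uris FILE.warc.gz   (hypothetical helper, for illustration only)
# Print each distinct WARC-Target-URI header value found in a
# gzip-compressed WARC file, one per line, sorted.
list_warc_uris() {
    # gzip -dc decompresses every gzip member in the file to stdout.
    # grep -a treats the stream as text even though payloads may be binary.
    # tr strips the carriage returns that end WARC header lines.
    # sed drops the header name and any angle brackets around the URI
    # (an assumption: some writers are reported to emit them).
    gzip -dc "$1" \
        | grep -a '^WARC-Target-URI:' \
        | tr -d '\r' \
        | sed -e 's/^WARC-Target-URI: *//' -e 's/^<//' -e 's/>$//' \
        | sort -u
}

# Example: list_warc_uris warc-file-name.warc.gz
```

This matches header lines by plain text, so a page whose body happens to contain a line starting with WARC-Target-URI: would show up as a false positive. For anything beyond a quick check, a dedicated WARC reader is probably the better tool.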