Creating a WARC web archive using wget

I’ve been tinkering with keeping offline copies of websites (mostly mine), and have always used either wget or httrack. I wasn’t aware of the WARC format until recently, so I thought I’d try creating a few WARC archives. wget, as it happens, has WARC support built in via the --warc-file option. I added that to my usual set of switches and put it all in a shell script, like so:

    #!/bin/sh
    # warc-archive.sh https://example.com warc-file-name
    wget \
      --mirror \
      --warc-file="$2" \
      --warc-cdx \
      --page-requisites \
      --html-extension \
      --execute robots=off \
      --directory-prefix=. \
      --wait=1 \
      --random-wait \
      "$1"

This creates a compressed, self-contained WARC file along with a mirrored set of files comprising the entire site. ...
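One quick way to sanity-check the result is to list the URLs the archive recorded. The following is a minimal sketch, not part of the original script: it assumes gzip, tr, sed and sort are available, and that each record carries its URL in a WARC-Target-URI header. The function name list_warc_uris is made up for illustration.

```shell
#!/bin/sh
# list_warc_uris FILE.warc.gz
# Hypothetical helper: prints the unique target URIs found in the
# record headers of a gzip-compressed WARC file, one per line.
list_warc_uris() {
    # gzip -dc also handles files made of several concatenated gzip members.
    gzip -dc "$1" |
        # WARC header lines end in CRLF; drop the carriage returns.
        tr -d '\r' |
        # LC_ALL=C lets sed pass over binary payload bytes without complaint.
        LC_ALL=C sed -n 's/^WARC-Target-URI: *//p' |
        LC_ALL=C sort -u
}
```

After running the archive script with warc-file-name as its second argument, calling list_warc_uris warc-file-name.warc.gz should print the captured URLs. This is a plain text scan rather than a real WARC parser, so a payload line that happens to start with WARC-Target-URI: would also be matched.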

March 2, 2024 · 187 words