I’ve been tinkering with keeping offline copies of websites (mostly mine), and have always used either wget or httrack. I wasn’t aware of the WARC format until recently, so I thought I’d try creating a few WARC archives.
wget, as it happens, has WARC support built in via the --warc-file option. I added that to my usual set of switches and put it all in a shell script, like so:
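#!/bin/sh
# Usage: warc-archive.sh https://example.com warc-file-name
# Mirrors the site given as $1 and records the crawl in $2.warc.gz.
# --adjust-extension is the current name for --html-extension,
# which has been deprecated since wget 1.12.
# Both arguments are quoted so URLs containing & or ? survive intact.
wget \
--mirror \
--warc-file="$2" \
--warc-cdx \
--page-requisites \
--adjust-extension \
--execute robots=off \
--directory-prefix=. \
--wait=1 \
--random-wait \
"$1"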
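This creates a gzip-compressed, self-contained WARC file (warc-file-name.warc.gz) and, because of --warc-cdx, a CDX index next to it (warc-file-name.cdx), along with the usual mirrored set of files comprising the entire site.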
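To check that the archive actually captured something, a rough tally of the record types inside it is usually enough. The sketch below is my own helper, not part of wget: the name warc_summary is made up, and it assumes the file is gzip-compressed (wget's default) and that your grep accepts -a, as GNU and BSD grep do. It matches WARC-Type header lines as plain text, so a page body that happens to contain such a line would be counted too.

```shell
#!/bin/sh
# warc-summary.sh warc-file-name.warc.gz
# Hypothetical helper: count the records of each type in a gzipped WARC file.

warc_summary() {
    # gzip -dc writes the decompressed archive to stdout; it also copes with
    # files made of several concatenated gzip members.
    # WARC headers end in CRLF, so strip the carriage returns before matching.
    # grep -a forces text mode, since archived payloads may be binary.
    gzip -dc "$1" | tr -d '\r' | grep -a '^WARC-Type: ' | sort | uniq -c
}

# Run the summary only when a file name was passed on the command line.
if [ "$#" -gt 0 ]; then
    warc_summary "$1"
fi
```

Run it as sh warc-summary.sh warc-file-name.warc.gz. A crawl that worked should list a warcinfo record plus matching numbers of request and response records; a file with no response records means wget fetched nothing.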