Baty.net

A blog about everything by Jack Baty 👋

Tag: Archiving

Creating a WARC web archive using wget

I’ve been tinkering with keeping offline copies of websites (mostly mine), and have always used either wget or httrack. I wasn’t aware of the WARC format until recently, so I thought I’d try creating a few WARC archives.

wget, as it happens, has WARC support built in via the --warc-file option. I added that to my usual set of switches and put it all in a shell script, like so:

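#!/bin/sh
# warc-archive.sh https://example.com warc-file-name
# $1 is the site URL; $2 is the WARC file name (wget appends .warc.gz).
# --adjust-extension is the current name for the older --html-extension.

wget \
	--mirror \
	--warc-file="$2" \
	--warc-cdx \
	--page-requisites \
	--adjust-extension \
	--execute robots=off \
	--directory-prefix=. \
	--wait=1 \
	--random-wait \
	"$1"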

This creates a compressed, self-contained WARC file (warc-file-name.warc.gz) and a CDX index (warc-file-name.cdx), along with a mirrored set of files comprising the entire site.
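
To check what actually ended up in the archive, something like the following might help. It is only a rough sketch: list_warc_uris is a name I made up for illustration, and it assumes gzip and a grep that supports -a (GNU and BSD grep both do). It pulls the WARC-Target-URI header out of each record, so a line in a page body that happens to start with the same text would also match.

```shell
#!/bin/sh
# list_warc_uris FILE.warc.gz
# Hypothetical helper (not part of wget): print each distinct
# WARC-Target-URI header value found in a gzipped WARC file.
list_warc_uris() {
	# gzip -dc handles the concatenated gzip members of a .warc.gz file.
	gzip -dc "$1" |
		# -a treats binary payloads as text so header lines still match.
		grep -a '^WARC-Target-URI:' |
		# Drop the header name, keeping only the value.
		sed 's/^WARC-Target-URI: *//' |
		# Remove the trailing CR and any angle brackets around the URI.
		tr -d '\r<>' |
		sort -u
}
```

After defining the function in a shell (or sourcing the file), run it as `list_warc_uris warc-file-name.warc.gz` to get one URL per line.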