Crawling backwards in time: recovering lost websites

Frank McCown of Old Dominion University: Warrick - Tool for Reconstructing a Website
Warrick is a command-line utility for reconstructing or recovering a website that has been lost due to a hard drive crash, fire, failed backup, etc. Warrick will search the Internet Archive, Google, MSN, and Yahoo for stored pages and images and will save them to your filesystem.
Aaron Swartz: arcget: Retrieve a site from the Internet Archive

Servers die. Companies collapse. URLs change. The Web is a very messy place. Thankfully, the Internet Archive is there to record it all.

But once it's in there, how do you get it back? Sure, the Wayback Machine is nice for getting a couple pages, but anything more than that and it's a royal pain. Wouldn't it be nice if there were some easy way to get back that data? arcget is that easy way.

arcget asks the Internet Archive for all the files it has of that site, then goes through and tries to find a working copy of each one. It gets it, strips out the modifications made by the Wayback Machine, and places it in a properly named file.
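
For a rough idea of what those steps look like in code, here is a minimal Python sketch. It is not arcget's actual implementation. It assumes the Wayback Machine's CDX endpoint for listing captures and the `id_` URL modifier for fetching a capture without the Wayback Machine's rewriting. The helper names (`cdx_query_url`, `raw_snapshot_url`, `local_path`, `plan`) are made up for illustration, and nothing here touches the network; it only works out what to download and where to save it.

```python
import posixpath
from urllib.parse import urlencode, urlsplit

# Assumed endpoint: the Wayback Machine CDX server.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(site):
    """Build a query asking for one successful capture per URL under `site`."""
    params = {
        "url": site,
        "matchType": "prefix",       # everything below this prefix
        "output": "json",            # rows as JSON, header row first
        "filter": "statuscode:200",  # skip redirects and errors
        "collapse": "urlkey",        # one row per distinct URL
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def raw_snapshot_url(timestamp, original_url):
    """URL of a capture with the `id_` modifier (original bytes, no rewriting)."""
    return f"https://web.archive.org/web/{timestamp}id_/{original_url}"


def local_path(original_url, index_name="index.html"):
    """Map an original URL to a relative file path (query strings are ignored)."""
    parts = urlsplit(original_url)
    path = parts.path
    if path == "" or path.endswith("/"):
        path += index_name  # directory-style URLs become index files
    # Normalise from a single leading slash so ".." cannot escape the site folder.
    safe = posixpath.normpath("/" + path.lstrip("/"))
    return parts.netloc + safe


def plan(rows):
    """Turn CDX JSON rows (header row first) into (download_url, file_path) pairs."""
    if not rows:
        return []
    header, captures = rows[0], rows[1:]
    ts, orig = header.index("timestamp"), header.index("original")
    return [(raw_snapshot_url(r[ts], r[orig]), local_path(r[orig])) for r in captures]
```

To actually recover a site you would fetch `cdx_query_url("example.com")`, parse the JSON, pass the rows to `plan`, and then download each pair, for instance with `urllib.request.urlretrieve` after creating the parent directories.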

