Gojomo

2003-10-27
Robots Attack White House, Film at 11

Some have suggested that something fishy is going on with the White House webserver, and the "robots.txt" file it uses to discourage automated web crawlers from visiting portions of the site. The Democratic National Committee sees devious historical revisionism, naturally.

Further wild speculation is available at Dan Gillmor's eJournal, especially in the comments.

But I'm pretty sure that this is just a clumsy mistake, or a misguided reaction to some haywire crawler, rather than any intentional manipulation.

In fact, I work for the Internet Archive on web crawling technology, and just 5 days ago word was relayed to me, via email, that the White House webmaster wanted us to extensively crawl their site. They even wanted us to ignore most of their robots.txt "Disallow" directives -- because aside from the first 4 directives, "all the prohibitions were on links to plain text versions of the formatted pages."

Now, it's awkward for crawler operators to manually override the directives we typically respect, on a site-by-site basis. Rather than expressing such wishes in private communications, we would prefer that whitehouse.gov begin its robots.txt file with a narrow expression of their legitimate exclusions...

User-agent: ia_archiver
Disallow: /cgi-bin
Disallow: /search
Disallow: /query.html
Disallow: /help

...and then continue their robots.txt file with additional alternative directives for other crawlers, or all crawlers ("*"). Then we would know clearly that, except for the 4 listed URL-prefixes, our crawling of the site is encouraged.
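
To make the effect of that layout concrete, here is a rough sketch using Python's standard urllib.robotparser module as a stand-in; it is not the Internet Archive's crawler code. The first record is the one suggested above. The catch-all ("*") record, the "/hypothetical-section/text" prefix, the "SomeOtherBot" name, and the helper names load_rules and is_allowed are all assumptions made up for illustration, not the actual contents of the whitehouse.gov file.

```python
from urllib.robotparser import RobotFileParser

# The first record is the one suggested in the post. The catch-all record and
# its "/hypothetical-section/text" prefix are invented for illustration only.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /cgi-bin
Disallow: /search
Disallow: /query.html
Disallow: /help

User-agent: *
Disallow: /cgi-bin
Disallow: /search
Disallow: /query.html
Disallow: /help
Disallow: /hypothetical-section/text
"""

SITE = "http://www.whitehouse.gov"


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, agent: str, path: str) -> bool:
    """Ask whether `agent` may fetch `path` on SITE under the parsed rules."""
    return parser.can_fetch(agent, SITE + path)


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    for agent in ("ia_archiver", "SomeOtherBot"):
        for path in ("/search", "/hypothetical-section/text/page.html"):
            verdict = "allowed" if is_allowed(rules, agent, path) else "disallowed"
            print(f"{agent:12} {path:40} {verdict}")
```

Under these made-up rules, a crawler identifying itself as ia_archiver is kept out of only the 4 listed prefixes, while any other crawler falls through to the "*" record and is also kept out of the plain-text pages. No private email is needed to tell the two cases apart.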

But awkwardness does not imply scheming, and there was no hint of sinister intent in their expressed wishes. Instead, we were told "we could scoop everything up, no problem" -- a genuine desire to have whitehouse.gov material archived, on a topic-neutral basis. We suggested that the White House webmaster make the clarifying robots.txt changes described above, but I haven't yet seen a confirmation that our suggestion reached the right people.

So rather than squinting to see something sinister here, I'd suggest giving the whitehouse.gov team the benefit of the doubt. From what I've seen, they want their site crawled, archived, and searchable -- and their robots.txt should eventually stabilize to confirm that fact.

