But I'm pretty sure that this is just a clumsy mistake, or a misguided reaction to some haywire crawler, rather than any intentional manipulation.
In fact, I work for the Internet Archive on web crawling technology, and just 5 days ago I was relayed word, via email, that the White House webmaster wanted us to crawl their site extensively. They even wanted us to ignore most of their robots.txt "Disallow" directives -- because aside from the first 4 directives, "all the prohibitions were on links to plain text versions of the formatted pages."
Now, it's awkward for crawler operators to manually override the directives we typically respect, on a site-by-site basis. Rather than expressing such wishes in private communications, we would prefer that whitehouse.gov begin its robots.txt file with a narrow expression of their legitimate exclusions...
...and then continue their robots.txt file with additional, alternative directives for other crawlers, or for all crawlers ("*"). Then we would know clearly that, except for the 4 listed URL prefixes, crawling of their site is encouraged.
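To illustrate how crawlers interpret such a two-section robots.txt, here is a minimal sketch using Python's standard urllib.robotparser. The paths are placeholders for illustration (the actual whitehouse.gov directives aren't reproduced here), and "ia_archiver" stands in for a named crawler agent token: a crawler that finds a section naming its own user-agent obeys only that section, while all other crawlers fall through to the "*" section.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical two-section robots.txt: a narrow set of exclusions
# for the named archive crawler, then broader exclusions for everyone else.
robots_lines = """\
User-agent: ia_archiver
Disallow: /cgi-bin/

User-agent: *
Disallow: /cgi-bin/
Disallow: /text/
""".splitlines()

rfp = RobotFileParser()
rfp.parse(robots_lines)

# The named crawler obeys only its own (narrow) section:
print(rfp.can_fetch("ia_archiver", "http://example.gov/text/page.html"))   # allowed
print(rfp.can_fetch("ia_archiver", "http://example.gov/cgi-bin/script"))   # disallowed

# Any other crawler falls through to the "*" section:
print(rfp.can_fetch("SomeOtherBot", "http://example.gov/text/page.html"))  # disallowed
```

With a file laid out this way, no private email override is needed: the narrow per-agent section itself says "crawl everything except these few prefixes."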
But awkwardness does not imply scheming, and there was no hint of sinister intent in their expressed wishes. Instead, we were told "we could scoop everything up, no problem" -- a genuine desire to have whitehouse.gov material archived, on a topic-neutral basis. We suggested that the White House webmaster make the clarifying robots.txt changes described above, but I haven't yet seen confirmation that our suggestion reached the right people.
So rather than squinting to see something sinister here, I'd suggest giving the whitehouse.gov team the benefit of the doubt. From what I've seen, they want their site crawled, archived, and searchable -- and their robots.txt should eventually stabilize to confirm that fact.