Rookie blunder on Google Sitemap stats

I was quite pleased to learn last week that by proving 'control' over a website, I could view detailed statistics about how the Google crawler sees it.

The signup process involved putting a blank file with an arbitrary name at the site root. When a Google probe found the file there, that confirmed that you, the Google account holder who had just requested the filename, controlled the site.
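Here is a minimal sketch of that handshake as described above, in Python. It is not Google's code: the names `issue_token_filename` and `naive_verify`, the `verify-<hex>.html` naming scheme, and the `fetch_status` callable (which stands in for an HTTP request and returns a status code for a path) are all assumptions made for illustration.

```python
import secrets
from typing import Callable

# Hypothetical stand-in for an HTTP GET: takes a URL path, returns a status code.
FetchStatus = Callable[[str], int]


def issue_token_filename() -> str:
    """Pick an unguessable filename for the account holder to upload.

    The 'verify-<hex>.html' pattern is an assumption, not the real scheme.
    """
    return f"verify-{secrets.token_hex(8)}.html"


def naive_verify(fetch_status: FetchStatus, filename: str) -> bool:
    """Treat an OK response for the requested file as proof of control."""
    return fetch_status("/" + filename) == 200
```

The only evidence this check looks at is the status code for one path, so it is only as trustworthy as the site's habit of answering honestly for paths that do not exist.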

I signed up and happily viewed some data on a site I control. Later, though, while offline, I recalled that many sites will give an OK response to *any* URL path requested of them. These "soft 404s" can cause some confusion for web crawlers, which wind up collecting pages of negligible value. Would a site that gave a "soft 404" OK for any path let anyone claim the right to view stats at Google?

As this is an old and well-known problem in crawling, I figured Google had accounted for it -- for example, they could probe a site with random paths, determine that it gives false OK indications, and then require a more rigorous test in those cases. But I didn't check that they actually did this.
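The random-path probe I had in mind could look something like the sketch below. This is my own guess at a reasonable defense, not what Google does or did; `looks_like_soft_404`, `verify_control`, the probe count, and the `fetch_status` callable (a stand-in for an HTTP request that returns a status code for a path) are all hypothetical.

```python
import secrets
from typing import Callable

# Hypothetical stand-in for an HTTP GET: takes a URL path, returns a status code.
FetchStatus = Callable[[str], int]


def looks_like_soft_404(fetch_status: FetchStatus, probes: int = 3) -> bool:
    """Request a few random paths that should not exist.

    If any of them comes back OK, the site's OK responses prove nothing.
    """
    for _ in range(probes):
        random_path = "/" + secrets.token_hex(16) + ".html"
        if fetch_status(random_path) == 200:
            return True
    return False


def verify_control(fetch_status: FetchStatus, filename: str) -> str:
    """Accept the file check only when the site answers honestly for junk paths."""
    if fetch_status("/" + filename) != 200:
        return "not verified"
    if looks_like_soft_404(fetch_status):
        # A stronger test would go here, e.g. checking the file's contents.
        return "needs stronger proof"
    return "verified"
```

A site that says OK to everything lands in the "needs stronger proof" branch instead of being verified, which is the whole point: the file check alone only means something on sites that return a real 404 for missing pages.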

Well, others did check -- and found that Google had made a rookie mistake, ignoring the prevalence of soft 404s and allowing anyone to view the crawler stats for sites like eBay, AOL, and even Google's own Orkut. (The flaw has now reportedly been fixed.)

This was Google's second security flub in just the last week: Google Base initially launched with a cross-site scripting vulnerability -- of the same sort as had bitten Google's AdWords site just last month.

Give 'em a few more launches, they'll eventually get this right. And it is important that Google does -- because the theme of most of their recent launches has been linking more web content and behavioral data than ever to precise human identities.

I think there's a master plan at work -- more on this in a future post. But both the 'false claim of ownership' and cross-site-scripting exploits are forms of identity theft, and if Google is cross-referencing all your web trails to a single identity, then that identity is going to be a very attractive target for hijacking.
