Monday, April 14, 2008

Googlebot, WTF are you doing?

"WTF?". That's what my coworker Maikel thought when he was looking into his web site's access logs last week. He showed me that Googlebot was doing thousands of searches on his site. Searches that often returned no results. Googlebot was crawling URLs that no page points to. How could this be? Wtf was going on?
Only two days later my RSS reader provided the answer: Google has announced that they have started filling in forms on a "small number" of "high-quality" web sites to discover more content. Clearly, this is what was happening on Maikel's site.

In 48 hours' time, Googlebot performed over 3,100 search requests on the site. On average, that is about one search request per minute. The searches look really natural; they seem to be titles of songs, but most of them do not exist on the site.
Here are a few example queries Googlebot did:
  1. Hooked On A Problem
  2. Stop Playing Wit Me
  3. Looking For Something
  4. Que Paso (hey Baby)
  5. (many more)

Here are some entries from the log file:

- - [12/Apr/2008:09:00:13 -0400] "GET /search.php?naam=Then%20You%20Turn%20Away&keuze=2&stage=results HTTP/1.1" 200 6831 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
- - [12/Apr/2008:09:01:00 -0400] "GET /search.php?naam=Biggest%20Part%20Of%20Me&keuze=2&stage=results HTTP/1.1" 200 7516 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
- - [12/Apr/2008:09:01:23 -0400] "GET /search.php?naam=To%20China%20With%20Love&keuze=2&stage=results HTTP/1.1" 200 6831 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
- - [12/Apr/2008:09:03:20 -0400] "GET /search.php?naam=Follow%20Me%20(part%201)&keuze=2&stage=results HTTP/1.1" 200 6831 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
- - [12/Apr/2008:09:04:29 -0400] "GET /search.php?naam=Marilyn%20Mongos%20Rache&keuze=2&stage=results HTTP/1.1" 200 6833 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
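Curious whether Googlebot is doing the same thing to your own site? A quick check is to scan the access log for Googlebot hits on the search script. Here is a minimal sketch in Python, assuming Apache combined-format log lines and a /search.php?naam=... URL scheme like the one above:

```python
import re
from collections import Counter
from urllib.parse import unquote

# Matches the request part of a log line that hits the search form.
# The "naam" parameter name is taken from the log entries above.
SEARCH_RE = re.compile(r'"GET /search\.php\?naam=([^&"]+)')

def googlebot_searches(lines):
    """Return the decoded search terms Googlebot requested."""
    terms = []
    for line in lines:
        if "Googlebot" not in line:
            continue  # only count requests from Googlebot's user agent
        m = SEARCH_RE.search(line)
        if m:
            terms.append(unquote(m.group(1)))
    return terms

def search_counts(lines):
    """Tally how often each term was searched."""
    return Counter(googlebot_searches(lines))
```

Feeding it an open log file, e.g. `search_counts(open("access.log"))`, gives a quick overview of what the bot has been typing into the form.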

It is great that, in the eyes of Google, Maikel's site is a "high-quality site", but I'm not sure what to think of their almost brute-force approach to harvesting content from such sites. Not only does it put extra load on the site's search engine; IMHO it is really awfully close to what some of the search engine spammers do: targeted scraping of websites.
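To be fair, Google's announcement does say the form crawling respects robots.txt, so a site owner who doesn't want a search form probed this way should be able to opt out. A sketch (the /search.php path matches the logs above; whether your own form lives there is an assumption):

```
User-agent: *
Disallow: /search.php
```

Of course, that also keeps legitimate search crawling away from those URLs, which may or may not be what you want.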
If a website owner wants all his pages in the Google index, why not leave it up to the website owner to submit XML sitemaps (or make them available through robots.txt) containing all the valid keywords / pages? That feels a whole lot more logical to me.
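That mechanism already exists in the sitemaps.org protocol, including the robots.txt hookup. A minimal sketch (the domain and URLs are made up):

```
# robots.txt
Sitemap: http://example.com/sitemap.xml
```

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/song.php?id=123</loc></url>
  <url><loc>http://example.com/song.php?id=124</loc></url>
</urlset>
```

With this, the site owner decides exactly which pages are valid, instead of Googlebot guessing song titles into a search form.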

I wonder how long it'll take before a website runs into problems and sues Google because of this new 'feature'.
