Monday, April 14, 2008

Googlebot, WTF are you doing?

"WTF?" That's what my coworker Maikel thought when he was looking through the access logs of his web site, Lyrics.net, last week. He showed me that Googlebot was running thousands of searches on his site, and many of them returned no results. Googlebot was crawling URLs that no page links to. How could this be? WTF was going on?
Only two days later my RSS reader provided the answer: Google announced that it has started filling in forms on a "small number" of "high-quality" web sites to get at the information behind them. Clearly, this is what was happening on Lyrics.net.

Within 48 hours, Googlebot performed over 3,100 search requests on Lyrics.net. That works out to roughly one search request per minute. The searches look really natural; they seem to be song titles, but most of those songs do not exist on Lyrics.net.
Here are a few of the queries Googlebot ran:
  1. Hooked On A Problem
  2. Stop Playing Wit Me
  3. Looking For Something
  4. Que Paso (hey Baby)
  5. (many more)

Here are some entries from the log file:
lyrics.net-usa-access_log.1208005200:66.249.72.1 - - [12/Apr/2008:09:00:13 -0400] "GET /search.php?naam=Then%20You%20Turn%20Away&keuze=2&stage=results HTTP/1.1" 200 6831 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
lyrics.net-usa-access_log.1208005200:66.249.72.1 - - [12/Apr/2008:09:01:00 -0400] "GET /search.php?naam=Biggest%20Part%20Of%20Me&keuze=2&stage=results HTTP/1.1" 200 7516 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
lyrics.net-usa-access_log.1208005200:66.249.72.1 - - [12/Apr/2008:09:01:23 -0400] "GET /search.php?naam=To%20China%20With%20Love&keuze=2&stage=results HTTP/1.1" 200 6831 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
lyrics.net-usa-access_log.1208005200:66.249.72.1 - - [12/Apr/2008:09:03:20 -0400] "GET /search.php?naam=Follow%20Me%20(part%201)&keuze=2&stage=results HTTP/1.1" 200 6831 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
lyrics.net-usa-access_log.1208005200:66.249.72.1 - - [12/Apr/2008:09:04:29 -0400] "GET /search.php?naam=Marilyn%20Mongos%20Rache&keuze=2&stage=results HTTP/1.1" 200 6833 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
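
If you want to check your own logs for the same behaviour, something like the following Python sketch could pull the search terms out of entries like the ones above. It assumes the combined log format shown here and the `naam` query parameter that Lyrics.net's `search.php` uses; the function name `googlebot_queries` is made up for this example. Note that it only looks at the user-agent string, which anyone can fake, so treat the result as an indication rather than proof.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Picks the requested URL and the user-agent out of a combined-format log line.
LOG_RE = re.compile(
    r'"GET (?P<url>\S+) HTTP/[\d.]+" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_queries(lines, path="/search.php", param="naam"):
    """Return the decoded search terms requested by a Googlebot user-agent."""
    queries = []
    for line in lines:
        match = LOG_RE.search(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        parts = urlsplit(match.group("url"))
        if parts.path != path:
            continue
        values = parse_qs(parts.query).get(param)
        if values:
            queries.append(values[0])  # parse_qs already decodes %20 etc.
    return queries

# Two of the log entries quoted above, used as sample input.
sample = [
    'lyrics.net-usa-access_log.1208005200:66.249.72.1 - - [12/Apr/2008:09:00:13 -0400] '
    '"GET /search.php?naam=Then%20You%20Turn%20Away&keuze=2&stage=results HTTP/1.1" 200 6831 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    'lyrics.net-usa-access_log.1208005200:66.249.72.1 - - [12/Apr/2008:09:03:20 -0400] '
    '"GET /search.php?naam=Follow%20Me%20(part%201)&keuze=2&stage=results HTTP/1.1" 200 6831 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

if __name__ == "__main__":
    for query in googlebot_queries(sample):
        print(query)
```

Feeding it a whole log file is a matter of passing the open file object instead of `sample`, since the function simply iterates over lines.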


It is great that Google considers Lyrics.net a "high-quality site", but I'm not sure what to think of this almost brute-force approach to harvesting content from such sites. Not only does it put extra load on Lyrics.net's search engine, IMHO it also comes awfully close to what some search engine spammers do: targeted scraping of websites.
If a website owner wants all his pages in the Google index, why not leave it up to him to submit XML sitemaps (or make them available through robots.txt) containing all the valid keywords / pages? That feels a whole lot more logical to me.
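
To give an idea of how little work the sitemap route is, here is a rough Python sketch that builds a minimal XML sitemap from a list of page URLs. The `example.com` URLs and the `build_sitemap` name are placeholders I made up; a real site would fill the list from its own database of valid pages.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a minimal sitemap document (as a string) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url  # special characters are escaped on output
    return ET.tostring(urlset, encoding="unicode")

if __name__ == "__main__":
    # Hypothetical page URLs, purely for illustration.
    pages = [
        "http://www.example.com/song/1",
        "http://www.example.com/song/2",
    ]
    print(build_sitemap(pages))
    # robots.txt can then point crawlers at the file with a line such as:
    #   Sitemap: http://www.example.com/sitemap.xml
```

This leaves out optional fields such as `lastmod`, and a single sitemap file is capped at 50,000 URLs, so a large site would need to split the list over several files.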

I wonder how long it'll take before a website runs into problems and sues Google because of this new 'feature'.
