Wednesday, September 27, 2006

Google's Evil Crawler

Today my ISP, PimpMedia, contacted me about a web crawler that was making an extreme number of requests in a very short time. The requests were coming from the IP address, which is owned by CMP Media. The crawler identified itself as Internet Explorer running on Windows XP. Here is the first line from the log:

 - - [27/Sep/2006:13:48:50 +0200] "GET /robots.txt HTTP/1.1" 200 76 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)"

I asked myself: what is this crawler that pretends to be Internet Explorer? Why is it scraping my website? And why does it need to make up to 5 requests per second?
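A rough sketch of how a burst like this could be spotted in an access log, assuming the log uses the Apache combined format that the line above appears to follow. The names `LOG_PATTERN`, `requests_per_second`, and `busy_clients` are made up for illustration, and the default threshold of 5 simply mirrors the rate mentioned above.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format, as the log line quoted above suggests.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def requests_per_second(lines):
    """Count requests per (client IP, timestamp) pair.

    Log timestamps have one-second resolution, so each distinct
    timestamp string acts as a one-second bucket.
    """
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:  # silently skip lines that do not fit the assumed format
            counts[(match.group("ip"), match.group("time"))] += 1
    return counts


def busy_clients(lines, threshold=5):
    """Return the (ip, second) buckets with at least `threshold` requests."""
    return {
        key: count
        for key, count in requests_per_second(lines).items()
        if count >= threshold
    }
```

Feeding it the lines of an access log (for example `busy_clients(open("access.log"))`, where the file name is hypothetical) would list each client and second in which the threshold was reached. Note that it groups by IP only, so a crawler lying about its user agent is still counted.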

I searched Google for the class C range of the IP address. It returned quite a lot of results in which people complain about similar behavior on their sites. Here is a good example at DigitalPoint. People are blaming McColo for the scraping, but nobody was able to tell why McColo would do this. So I continued my search.

Once again, I searched Google. This time I searched for the whole IP address and went through the list of results. The 10th result showed something surprising: looking at the cached version of the page, I saw in the right-hand menu: "Your IP:". Hmm. If I visit the actual page, it shows my own IP address.
Er... that makes me believe that the crawler that visited my website belongs to Google. But why does Google pretend to be Internet Explorer on Windows XP? Why doesn't the robot identify itself as Googlebot, and why is it making such an insane number of requests? Clearly, I'm not the first person complaining, but it seems that I'm the first person blaming Google for this.
Am I right? Does Google have "stealth" crawlers that basically perform a DoS attack on your website? Or is there another explanation? But if it isn't Google, how do you explain that this IP address shows up on the cached page in Google?
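One way to test the "is this really Google?" question is the reverse-then-forward DNS check that Google describes for verifying Googlebot: resolve the IP to a host name, check that the name sits under `googlebot.com` or `google.com`, then resolve that name back and confirm it returns the same IP. The sketch below is only an illustration under those assumptions. The function name `is_verified_googlebot` and the injectable `reverse_lookup` / `forward_lookup` parameters are invented here (they make the check testable offline), and the default lookups handle IPv4 only.

```python
import socket

# Assumption: host names of genuine Google crawlers end in one of these domains.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` passes a reverse-then-forward DNS check.

    `reverse_lookup` maps an IP to a host name and `forward_lookup` maps a
    host name to a list of IPs. Both default to real DNS queries.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip)
        # Step 1: the PTR record must point into a Google-owned domain.
        if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
            return False
        # Step 2: the name must resolve back to the same IP, otherwise
        # anyone controlling their own reverse DNS could fake step 1.
        return ip in forward_lookup(host)
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses;
        # a failed lookup means the address cannot be verified.
        return False
```

A `False` result would not prove that the address has nothing to do with Google, only that it does not verify as an officially announced crawler.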
To be continued...