
Monday, August 06 2012

Bad Bots Must Be Punished.

I periodically look through my web server logs to pick out things that are not as they should be. You might recall from previous blog entries that I operate spam traps and so on. Last night I noticed in my server logs that some critter calling itself MJ12bot was going where no legitimate bot belongs. But it's apparently trying to be a good bot, because it leaves its calling card:

173.242.125.206 - - [06/Aug/2012:01:28:42 -0600] "GET /robots.txt HTTP/1.0" 200 1247 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.3; http://www.majestic12.co.uk/bot.php?+)"

So off I go to that URI, and find that the folks who run the thing have said "If you have reason to believe that MJ12bot did NOT obey your robots.txt commands, then please let us know via email..." And so I did.

We discussed the matter via email a bit, and it seems probable that their bot encountered some kind of network error when it tried to grab my robots.txt file. Not an error response from my server, but a failure to even contact my server. To my way of thinking, in a case like that a properly designed bot will try again to get that file, and will not crawl the site until it gets either the file or a verifiable 404 Not Found. Not MJ12bot, though. The network failure is treated as if it were a 404, and is taken to mean that the whole darn site is wide open to them. Here's what their guy Alex said:

Sadly it's very difficult for us to diagnose this case - as you can see from your logs our bot grabs robots.txt, so we are not intentionally breaking your directives, it's just if bot could not get robots.txt then it could not obey it :(

Huh? Your bot encounters a network error and that gives you license to crawl my site in violation of my terms of service? It seems to me that if you know your crawler is broken in that way, which you do now, and you continue to run it knowing it's broken in that way, then what you're doing is willful negligence and that makes it intentional.
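
For what it's worth, the behavior argued for above (try again after a network failure, and hold off crawling until there is either a robots.txt or a real 404) might be sketched in Python roughly as follows. This is only an illustration, not MJ12bot's code or anyone else's: the function name `fetch_robots`, the three state labels, and the retry count and delays are all assumptions made up for the example.

```python
import time
import urllib.error
import urllib.request


def fetch_robots(url, attempts=3, delay=5.0,
                 opener=urllib.request.urlopen, sleep=time.sleep):
    """Try to fetch robots.txt, retrying when no usable answer arrives.

    Returns ("rules", body) when the file was retrieved,
            ("absent", None) when the server answered 404,
            ("unknown", None) when neither could be established.
    All names and defaults here are hypothetical.
    """
    for attempt in range(attempts):
        try:
            with opener(url, timeout=10) as response:
                return ("rules", response.read())
        except urllib.error.HTTPError as err:
            if err.code == 404:
                return ("absent", None)
            # Any other HTTP error is neither the file nor a 404: retry.
        except OSError:
            # URLError, timeouts and socket errors all derive from OSError.
            # The request never completed, which is NOT the same as a 404.
            pass
        if attempt < attempts - 1:
            sleep(delay * (2 ** attempt))  # simple exponential backoff
    return ("unknown", None)
```

Under this sketch only "rules" and "absent" would permit crawling at all; "unknown" means leave the site alone and come back later.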

No worries here. I've informed the folks behind the thing that their bot is no longer welcome here and any connections it makes will be considered trespass. The fun part? When their bot comes around it will not see my web site. It will instead see a very, very long joke that will be delivered very, very slowly. How slowly? From start to finish it will take anywhere from an hour and a half to more than six hours.
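
The application itself isn't shown here, so the following is not it. It is just a rough Python sketch of the general idea of stretching a response over a chosen length of time. The name `drip`, the chunk size, and the way the pause is computed are all assumptions for illustration.

```python
import time


def drip(text, total_seconds, chunk_size=16, sleep=time.sleep):
    """Yield text in small chunks, pausing before each one so that the
    whole delivery takes roughly total_seconds. Hypothetical sketch."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if not chunks:
        return
    pause = total_seconds / len(chunks)  # spread the wait evenly
    for chunk in chunks:
        sleep(pause)
        yield chunk
```

An hour and a half is 5400 seconds, so something like `drip(joke, 5400)` feeding a streaming response would keep a visitor waiting about that long, assuming the server framework writes each chunk out as it is yielded.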

If you've seen a bad bot in your logs and want to punish it in this way, feel free to hit my contact form to inquire about it. It's a freebie if all you need is the application itself and very minimal installation/configuration instructions. After all: Bad Bots Must Be Punished!

→ committed: 8/6/2012 17:36:54


Comments: 6    Trackbacks: 0

 

Alex Chudnovsky wrote at 2012-08-16 22:56:

Hi,

This is the Alex that you've quoted...

I've just checked our conversation in support system and note that we've responded to your query within 90 minutes of it being raised.

One issue with your site that we've raised was the fact that you do not have robots.txt present on http://artsackett.com/ - you are redirecting all requests to http://www.artsackett.com/robots.txt - this means that there is no actual robots.txt on http://artsackett.com/ - robots.txt standard (http://www.robotstxt.org) says that robots.txt file should be present on actual hostname.

Support for redirects is only recommended (not required). Our bot attempts to support this recommendation and in most cases it works, but there is a lot more that can go wrong when robots.txt is redirected.

It was not possible to give you exact answer why on that particular occasion your log did not show robots.txt being requested (though plenty of examples in your own log shown that our bot did fetch robots.txt and obeyed it).

Your site was added to our nocrawl list since you clearly did not want our bot to visit it, I assume our bot obeyed that instruction since you've never contacted us after that.

You don't have to like us, but I think you are wrong referring to our bot as "bad" - we've responded to your query very quickly and even though it was not possible to answer it exactly we've done what you requested - stopped crawling your site.

While regretably it appears that our bot somehow disobeyed your robots.txt in one particular case I don't think it's a good reason to label us as "bad" people.

In July 2012 we've crawled about 33,000,000,000 URLs from around 200,000,000 root domains. Number of MJ12bot support queries that we've received in this timeframe - 6 (Six), so that's approximately 1 query per 5 bln URLs crawled from ~35 mln websites. There are maybe a dozen companies in the whole world who have such a large scale web crawl, sadly they don't share details about issues with robots.txt, but it's a good opportunity to set some benchmark numbers.

Alex

Art Sackett wrote at 2012-08-17 00:45:

Hi Alex. I expected to hear from you soon. :-) You seem to show up wherever your idiot bot is spoken of in less than glowing terms. Search engines... darn things are everywhere.

Yes indeed, you did respond promptly to my queries and for that I laud your efforts. Good on ya for that. And I thank you for dropping by to confirm for the world (all three people who read my blog) that your bot behaves badly.

The matter at hand here is not HTTP 301 redirects, how many resources your MJ12 bots have requested or how many complaints were generated by MJ12bot as a result. The matter at hand here is that your MJ12bot went where it does not belong, you are aware of it having done so, and you are taking no action to prevent it from doing so on other sites. Those were the points I made both in my private email to you and publicly here in my blog. Willful negligence is willful negligence.

I've never seen any other legitimate web crawler venturing into parts of my site or other sites where they are not supposed to go. They do not, MJ12bot does. You should fix that. It's easily enough done: Before venturing any further into a site, you have either a robots.txt that is of the file size the HTTP response Content-Length header said it would be, or you have an HTTP 404 response to the request for it. Nothing else will do. Lacking either, go no further.
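
Read strictly, the rule in the paragraph above is a small decision function. Here is a minimal Python sketch of it, assuming the crawler already has the status code, response headers and body in hand. `may_crawl` and its arguments are hypothetical names, and treating a missing or unparseable Content-Length as "go no further" is one strict interpretation of the rule, not something a standard demands.

```python
def may_crawl(status, headers, body):
    """Return True only when the robots.txt request ended in one of the
    two acceptable outcomes: a 404, or a 200 whose body is exactly as
    long as the Content-Length header said it would be.

    status is None when no HTTP response was received at all.
    """
    if status == 404:
        return True  # verifiably no robots.txt
    if status == 200:
        # Header names are case-insensitive, so normalize before lookup.
        lowered = {name.lower(): value for name, value in headers.items()}
        declared = lowered.get("content-length", "").strip()
        if not declared.isdigit():
            return False  # size cannot be verified
        return int(declared) == len(body)
    return False  # network failure, 5xx, anything else
```

A True result for a 200 only means the rules were fully received; the crawler still has to obey whatever they say.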

Fix the thing, or don't, it's your call. But don't try to pretend that you don't know it's broken and behaving badly. Right now and until it's fixed, MJ12bot is a bad bot.

Alex Chudnovsky wrote at 2012-08-17 10:26:

Hi,

I thought it would be right to leave a reply since you've chosen to post details of our private emails without even telling us. I'll let visitors of your blog make up their own minds whether we match the pattern of bad guys or not.

I've provided in my reply statistics of how often we get queries regarding our bot operations (a lot of those queries relate to mistake made by webmaster because robots.txt standard is often wrongly interpreted) - very infrequently compared to the level of crawl that we do.

There is no wilful negligence on our part - as soon as it became clear you are not happy with our bot visiting your site in the first place we've added it to no crawl list.

I don't think we have any more to say on this subject, anybody who thinks we are "bad" can just try to contact us via bot email address with their concerns and decide by themselves based on our response to any problems raised.

Alex

Art Sackett wrote at 2012-08-17 15:03:

Let us dispense with the garbage and focus on the important: "While regretably [sic] it appears that our bot somehow disobeyed your robots.txt in one particular case I ..." don't feel a need to directly address the problem.

Here's the thing, Mr. Chudnovsky: MJ12bot crawled my site without having successfully retrieved either robots.txt or an HTTP 404 response to the request for it, and it retrieved content it should not have. This is irrefutable evidence of faulty software, and you have acknowledged this evidence. From the point of acknowledgement forward, it is willful negligence every time MJ12bot fails the same way and violates the terms of service of another web site.

That content that MJ12bot wrongly retrieved from my site is there solely to identify bad bots that ignore robots.txt, specifically email address harvesters. That content has been there since 2003 and no legitimate search engine spider has ever mistakenly crawled that content. MJ12bot ignored my robots.txt and retrieved content it should not have. You acknowledge that MJ12bot did so, but then claim that I am wrong to say that MJ12bot is a bad bot. You claim that it is my fault that MJ12bot ignored my robots.txt.

Until it is fixed MJ12bot is a bad bot.

Anonymous Coward wrote at 2012-08-17 16:30:

Hi,

We've been through this extensively when we tried to deal with your query - your log file has shown multiple MJ12bot taking robots.txt and obeying it, but in one case it did not - the extract from your log did not show robots.txt actually being requested from your site and our bot always requests it before crawling a batch of URLs.

Since your site is not robots.txt compliant (using robots.txt redirect as I've explained above) chances of bots not getting it right increase - that's entirely your setup that we have no control over even though we try our best to support non-required parts of robots.txt standard.

I think it's unreasonable to expect us to keep complete track of all the network traffic we receive every day (over 20 TB); we simply could not guess what could have possibly happened in that case. If you search for Yahoo or Google "disobeys robots.txt" you'll see plenty of posts about it.

They may not have done so on your site because normally their operations are very reliable just like ours, but odd things can happen when crawling billions of URLs, we've done our best to investigate your problem and when you could not be satisfied we've added your site to no crawl list.

As I said we've added your site to our nocrawl list and our bot should not crawl it at all. Let us know next time you come across with "bad guys" who crawl billions of urls every month AND:

1) respond to your email query within hours
2) add site to nocrawl list as soon as it was requested
3) spend their time answering you on public blog

I doubt you find many such "bad guys" - real bad guys won't even bother with robots.txt, they will fake user-agent, they won't have real contact address, they will ignore you completely and keep crawling your site.

Now those are really bad guys - I'll let visitors of your blog decide if we are one of them.

Alex

Art Sackett wrote at 2012-08-17 21:43:

The salient point here: "... but in one case it did not..."

Maybe MJ12bot tried, but at the time the internet didn't cooperate so the request never made it as far as my server. If that's what happened your MJ12bot did not handle the error properly. MJ12bot went where it was not supposed to go. And you've known for ten days now that MJ12bot ignored my robots.txt and you're still protesting that it is somehow my fault.

HTTP redirection is not relevant here. I don't know why you'd ever make an HTTP/1.1 request of my server with the "Host: artsackett.com" header when in fact there is no host artsackett.com and no way to resolve any A, AAAA, or CNAME record for it. artsackett.com is a domain name only. There are no wildcard DNS resource records for it. You can't get here from there without foolish heuristics.

I don't know what kind of mental or emotional hangup is between your acknowledgment that MJ12bot did a wrong thing and your insistence that it only ever does right things. MJ12bot did a wrong thing. That is the only relevant point: MJ12bot did a wrong thing. I'm not saying that I believe you to be dishonest or that I believe MJ12bot is evil. I'm saying that MJ12bot has done a wrong thing and that you refuse to address the matter even after acknowledging that MJ12bot did a wrong thing. Blaming others for MJ12bot doing a wrong thing is not addressing the matter -- it's deflecting accountability.

I'm tired of repeating myself. I'm not going to approve your pending comment because it is irrelevant, even less relevant than some malarkey about HTTP redirects. Come on back around when you can report a bug fix.

