Protect Your Site with a Blackhole for Bad Bots

Started by bill, November 01, 2010, 08:39:43 AM


bill

Here's an interesting open source script that will help you keep the bots at bay.

Quote: Protect Your Site with a Blackhole for Bad Bots

The concept is simple: include a hidden link to a robots.txt-forbidden directory somewhere on your pages. Bots that ignore or disobey your robots rules will crawl the link and fall into the trap, which then performs a WHOIS lookup and records the event in the blackhole data file. Once added to the blacklist data file, bad bots are immediately denied access to your site. I call it the "one-strike" rule: bots have one chance to follow the robots.txt protocol, check the site's robots.txt file, and obey its directives. Failure to comply results in immediate banishment. The best part is that the Blackhole only affects bad bots: normal users never see the hidden link, and good bots obey the robots rules in the first place.
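The one-strike flow in that quote can be sketched roughly as follows. This is a hypothetical Python sketch of the logic only (the real script is PHP, and the file name, trap URL, and function names here are illustrative assumptions, not the script's actual code):

```python
import os

# Illustrative blacklist data file and trap URL; the real script uses
# its own names for these.
BLACKLIST_FILE = "blackhole.dat"
TRAP_URL = "/blackhole/"   # hidden link target, forbidden in robots.txt

def is_blacklisted(ip, path=BLACKLIST_FILE):
    """Return True if this IP has already fallen into the trap."""
    try:
        with open(path) as f:
            return ip in {line.split()[0] for line in f if line.strip()}
    except FileNotFoundError:
        return False

def record_offender(ip, user_agent, path=BLACKLIST_FILE):
    """Append the offending IP (and its UA) to the blacklist data file."""
    with open(path, "a") as f:
        f.write(f"{ip} {user_agent}\n")

def handle_request(ip, user_agent, requested_path):
    """One-strike rule: blacklisted IPs get 403; hitting the trap URL
    records the IP and bans it from then on; everyone else gets through."""
    if is_blacklisted(ip):
        return 403                 # banished: one strike and out
    if requested_path == TRAP_URL:
        record_offender(ip, user_agent)
        return 403
    return 200
```

A normal visitor never requests the trap URL (the link is hidden), so only rule-breaking bots ever end up in the data file.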

In five easy steps, you can set up your own Blackhole to trap bad bots and protect your site from evil scripts, bandwidth thieves, content scrapers, spammers, and other malicious behavior.

...

Torben

I'm not a big fan of bot traps.

The usual reasons for using bot traps are to reduce bandwidth usage, stop content scrapers and stop spammers.

In my opinion it's not architecturally optimal to try to solve these problems with an IP filter at the PHP level. The payoff from running the filter on every page request just isn't high enough.

Bandwidth doesn't cost anything if you have a good hosting partner. However, if you have a very dynamic site it may cost you in performance. The big bandwidth thieves are usually search engine bots. It's easy to pick out bots from search engines you have no interest in and filter them out. Just do it on the webserver (Apache) or the firewall, not in PHP.
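Filtering unwanted search-engine bots at the Apache level might look something like this .htaccess sketch (the bot names here are just examples; substitute whatever user agents actually show up in your logs):

```apache
# Hypothetical sketch: tag unwanted search-engine bots by User-Agent
# and deny them at the webserver, before PHP ever runs.
SetEnvIfNoCase User-Agent "sogou|mj12bot|ahrefsbot" unwanted_bot
Order Allow,Deny
Allow from all
Deny from env=unwanted_bot
```

Because the check happens in Apache, every request from those bots is rejected without touching PHP or the database.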

When it comes to content scraping I don't care what kind of filtering you have in place. If it is publicly available it can be scraped.

When it comes to blog spam, the Akismet spam filter does a really good job on WordPress installations. However, if your website is getting hammered by spam bots you can stop most of them by denying access to no-referrer POST requests in .htaccess at the webserver level.

# DENY ACCESS TO NO-REFERRER REQUESTS
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{REQUEST_METHOD} POST
RewriteCond %{REQUEST_URI} wp-comments-post\. [NC]
RewriteCond %{HTTP_REFERER} !mywebsite\. [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule .* - [F,L]
</IfModule>

Last but not least there is the issue of page load time and caching. You cannot use IP filtering at the PHP level and use page caching at the same time.


Travoli

Great system, Bill. I like that idea a lot. Torben makes good points, but I'd probably trade the speed for the ability to block scrapers instantly.

ukgimp

>>If it is publicly available it can be scraped.

Taught and hired a few genius scrapers. I have seen some magical shit. I have also fucked up people's servers with wild scraping (50 threads), but that's for a private discussion :-)

bill

I guess all the real tin-foil-hat crowd are doing whitelisting rather than blacklisting. ;)

That script certainly isn't going to thwart the pros, but it will catch a lot of the script-kiddie and general riffraff/nuisance traffic. I don't think users of a script like this would normally have access to Apache or the firewall. This would be something you could use on a shared hosting account, though.

4Eyes

Might be useful for me.

I do a lot of eBay aff stuff at the moment, and they are useless at discounting clicks from spiders and rogue bots. It just takes one of these to get through and your quality score gets hammered.

We do a fair bit to block them at the moment, but there are still a few that sneak through - this might help plug the last few holes.

cheers :)

Rumbas

>On top of that I often get the link credit

That would be my take as well. Scrape all you want, but provide me with a link.

Drastic

Don't regular SEs ignore robots.txt from time to time?

bill

I just noticed that this script whitelists all the SE bot UAs. That's going to let a lot of crap through in addition to the legit SEs.

Quote: Initially, the Blackhole blocked any bot that disobeyed the robots.txt directives. Unfortunately, as discussed in the comments, Googlebot, Yahoo, and other major search bots do not always obey robots rules. And while blocking Yahoo! Slurp is debatable, blocking Google, MSN/Bing, et al would just be dumb. Thus, the Blackhole now "whitelists" any user agent identifying as any of the following:

  • googlebot (Google)
  • msnbot (MSN/Bing)
  • yandex (Yandex)
  • teoma (Ask)
  • slurp (Yahoo)
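In other words, the whitelist amounts to a substring match on the User-Agent header, something like this sketch (Python here for illustration; the real script is PHP and these names are assumptions, not its actual code):

```python
# Illustrative version of the UA whitelist check described in the quote.
WHITELISTED_BOTS = ("googlebot", "msnbot", "yandex", "teoma", "slurp")

def is_whitelisted(user_agent):
    """Case-insensitive substring match on the User-Agent header."""
    ua = user_agent.lower()
    return any(bot in ua for bot in WHITELISTED_BOTS)
```

Since it only inspects the UA string, anything that merely *claims* to be one of these crawlers sails straight past the trap.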

TallTroll

So, host your scraper in .ru, deliver your UA as Yandex, and you're through. Still, it's a good idea, just not appropriate for every site.
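For what it's worth, the usual answer to that kind of UA spoofing (not something this script does) is to verify a claimed crawler with a reverse-DNS lookup plus a forward confirmation. A rough Python sketch; the domain-suffix list is an assumption you'd want to check against each engine's own documentation:

```python
import socket

# Reverse-DNS verification sketch: a genuine crawler's IP should
# reverse-resolve to its engine's domain, and that hostname should
# forward-resolve back to the same IP. Suffixes below are assumptions.
VERIFIED_SUFFIXES = (
    ".googlebot.com", ".google.com",             # Google
    ".search.msn.com",                           # MSN/Bing
    ".yandex.ru", ".yandex.net", ".yandex.com",  # Yandex
    ".crawl.yahoo.net",                          # Yahoo Slurp
)

def hostname_is_verified(hostname):
    """Does the reverse-DNS hostname end in a known crawler domain?"""
    return hostname.endswith(VERIFIED_SUFFIXES)

def verify_crawler(ip):
    """Reverse lookup, suffix check, then forward-confirm the hostname."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not hostname_is_verified(hostname):
        return False
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
```

A spoofer can set any UA it likes, but it can't make its IP reverse-resolve into Google's or Yandex's domain.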

Kali

Good as a warning, but I wouldn't add auto-blocking to it; certainly worth investigating any IPs that fall foul of the trap, though.

There are definitely bad scrapers out there as well as those that will give you links.