Author Topic: Protect Your Site with a Blackhole for Bad Bots  (Read 40475 times)

bill

  • Devil's Avocado
  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1286
  • Avast!
Protect Your Site with a Blackhole for Bad Bots
« on: November 01, 2010, 08:39:43 AM »
Here's an interesting open source script that will help you keep the bots at bay.

Quote
Protect Your Site with a Blackhole for Bad Bots

The concept is simple: include a hidden link to a robots.txt-forbidden directory somewhere on your pages. Bots that ignore or disobey your robots rules will crawl the link and fall into the trap, which then performs a WHOIS Lookup and records the event in the blackhole data file. Once added to the blacklist data file, bad bots are immediately denied access to your site. I call it the “one-strike” rule: bots have one chance to follow the robots.txt protocol, check the site’s robots.txt file, and obey its directives. Failure to comply results in immediate banishment. The best part is that the Blackhole only affects bad bots: normal users never see the hidden link, and good bots obey the robots rules in the first place.

In five easy steps, you can set up your own Blackhole to trap bad bots and protect your site from evil scripts, bandwidth thieves, content scrapers, spammers, and other malicious behavior.

...
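The trap itself is only a handful of lines of PHP. Here's a rough sketch of the idea, my own simplification with a made-up data file name, not the actual script:

<?php
// Two pieces, shown together for brevity. "blackhole.dat" is a
// hypothetical name for the blacklist data file.
$ip  = $_SERVER['REMOTE_ADDR'];
$dat = dirname(__FILE__) . '/blackhole.dat';

// Piece 1, included on every normal page: deny anyone already trapped.
if (file_exists($dat)) {
    foreach (file($dat, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        if (strpos($line, $ip . ' ') === 0) {
            header('HTTP/1.1 403 Forbidden');
            exit('You have been banned from this site.');
        }
    }
}

// Piece 2, the trap page itself (the hidden, robots.txt-forbidden URL):
// anything that lands here ignored robots.txt, so record it. The real
// script also runs its WHOIS lookup at this point.
$record = $ip . ' ' . date('c') . ' ' . $_SERVER['HTTP_USER_AGENT'] . "\n";
file_put_contents($dat, $record, FILE_APPEND | LOCK_EX);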

Torben

  • Global Moderator
  • Sr. Member
  • *****
  • Posts: 305
Re: Protect Your Site with a Blackhole for Bad Bots
« Reply #1 on: November 01, 2010, 09:49:02 AM »
I'm not a big fan of bot traps.

The usual reasons for using bot traps are to reduce bandwidth usage, stop content scrapers and stop spammers.

In my opinion it’s not architecturally optimal to try to solve these problems with an IP filter at the PHP level. The payoff from running the filter on every page request just isn’t high enough.

Bandwidth doesn't cost anything if you have a good hosting partner, although on a very dynamic site heavy crawling may cost you in performance. The big bandwidth thieves are usually search engine bots, and it's easy to pick out the bots from search engines you have no interest in and filter them out. Just do it on the webserver (Apache) or the firewall, not in PHP.
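For example, something like this at the Apache level blocks crawlers you get nothing from (the bot names here are placeholders, substitute whatever shows up in your own logs):

# BLOCK UNWANTED CRAWLERS BY USER AGENT
<IfModule mod_rewrite.c>
RewriteEngine On
# placeholder UA substrings only
RewriteCond %{HTTP_USER_AGENT} (SomeBot|OtherCrawler) [NC]
RewriteRule .* - [F,L]
</IfModule>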

When it comes to content scraping I don’t care what kind of filtering you have in place. If it is publicly available it can be scraped.

When it comes to blog spam, the Akismet spam filter does a really good job on WordPress installations. However, if your website is getting hammered by spam bots you can stop most of them by denying access to no-referrer requests in .htaccess at the webserver level, like this:

# DENY ACCESS TO NO-REFERRER REQUESTS
<IfModule mod_rewrite.c>
RewriteEngine On
# POST requests to the WordPress comment handler...
RewriteCond %{REQUEST_METHOD} POST
RewriteCond %{REQUEST_URI} wp-comments-post\.php [NC]
# ...with a referrer that isn't your own site, or no user agent at all,
RewriteCond %{HTTP_REFERER} !mywebsite\. [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^$
# get a 403 Forbidden
RewriteRule .* - [F,L]
</IfModule>

 
Last but not least, there is the issue of page load time and caching. You cannot use IP filtering at the PHP level and page caching at the same time: once pages are served from a full-page cache, your PHP filter never even runs.


Travoli

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1205
Re: Protect Your Site with a Blackhole for Bad Bots
« Reply #2 on: November 01, 2010, 08:42:12 PM »
Great system, Bill. I like that idea a lot. Torben makes good points, but I'd probably trade the speed for the ability to block scrapers instantly.

ukgimp

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2241
Re: Protect Your Site with a Blackhole for Bad Bots
« Reply #3 on: November 01, 2010, 09:05:07 PM »
>>If it is publicly available it can be scraped.

Taught and hired a few genius scrapers. I have seen some magical sh##. I have also f###ed up people's servers with wild scraping, 50 threads, but that's for a private discussion :-)

bill

  • Devil's Avocado
  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1286
  • Avast!
Re: Protect Your Site with a Blackhole for Bad Bots
« Reply #4 on: November 02, 2010, 04:32:39 AM »
I guess all the real tin-foil hat crowd are doing whitelisting rather than blacklisting. ;)

That script certainly isn't going to thwart the pros, but it will catch a lot of the script-kiddie and general riffraff/nuisance traffic. I don't think users of a script like this would normally have access to Apache or the firewall, but it is something you could use on a shared hosting account.

4Eyes

  • Hero Member
  • *****
  • Posts: 817
Re: Protect Your Site with a Blackhole for Bad Bots
« Reply #5 on: November 02, 2010, 05:37:01 AM »
Might be useful for me.

I do a lot of eBay affiliate stuff at the moment, and they are useless at discounting clicks from spiders and rogue bots. It only takes one of these getting through for your quality score to get hammered.

We do a fair bit to block them at the moment, but a few still sneak through - this might help plug the last few holes.

cheers :)

Rumbas

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2105
  • Viking Wrath
Re: Protect Your Site with a Blackhole for Bad Bots
« Reply #6 on: November 02, 2010, 12:08:51 PM »
>On top of that I often get the link credit

That would be my take as well. Scrape all you want, but provide me with a link.

Drastic

  • Need a bigger hammer...
  • Global Moderator
  • Hero Member
  • *****
  • Posts: 3084
  • Resident Redneck
Re: Protect Your Site with a Blackhole for Bad Bots
« Reply #7 on: November 02, 2010, 12:41:37 PM »
Don't regular SEs ignore robots.txt from time to time?

bill

  • Devil's Avocado
  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1286
  • Avast!
Re: Protect Your Site with a Blackhole for Bad Bots
« Reply #8 on: November 03, 2010, 12:21:01 AM »
I just noticed that this script whitelists all the SE bot UAs. That's going to let a lot of crap through in addition to the legit SEs.

Quote
Initially, the Blackhole blocked any bot that disobeyed the robots.txt directives. Unfortunately, as discussed in the comments, Googlebot, Yahoo, and other major search bots do not always obey robots rules. And while blocking Yahoo! Slurp is debatable, blocking Google, MSN/Bing, et al would just be dumb. Thus, the Blackhole now “whitelists” any user agent identifying as any of the following:
  • googlebot (Google)
  • msnbot (MSN/Bing)
  • yandex (Yandex)
  • teoma (Ask)
  • slurp (Yahoo)

TallTroll

  • Sr. Member
  • ****
  • Posts: 272
Re: Protect Your Site with a Blackhole for Bad Bots
« Reply #9 on: November 03, 2010, 01:58:17 AM »
So host your scraper in .ru, deliver your UA as Yandex, and you're through. Still, it's a good idea, just not appropriate for every site.
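The robust counter to that kind of spoofing is to verify a claimed search bot with a reverse and forward DNS lookup rather than trusting the UA header (Google documents exactly this verification method for Googlebot). A rough sketch, my own code rather than anything in the Blackhole script:

<?php
// Verify a claimed search bot via reverse + forward DNS instead of
// trusting the User-Agent string (illustrative sketch only).
function is_verified_bot($ip, array $bot_domains) {
    $host = gethostbyaddr($ip); // e.g. crawl-66-249-66-1.googlebot.com
    if ($host === false || $host === $ip) {
        return false; // no usable PTR record
    }
    $ok = false;
    foreach ($bot_domains as $domain) {
        if (substr($host, -strlen($domain)) === $domain) {
            $ok = true; // hostname ends in an official bot domain
            break;
        }
    }
    // Forward-confirm: the hostname must resolve back to the same IP.
    return $ok && gethostbyname($host) === $ip;
}

// Example, for a visitor whose UA claims to be Googlebot:
// is_verified_bot($_SERVER['REMOTE_ADDR'], array('.googlebot.com', '.google.com'));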

Kali

  • Newbie
  • *
  • Posts: 10
Re: Protect Your Site with a Blackhole for Bad Bots
« Reply #10 on: November 04, 2010, 01:33:46 AM »
Good as a warning, but I wouldn't add auto-blocking to it - certainly worth investigating any IPs that fall foul of the trap, though.

There are definitely bad scrapers out there as well as those that will give you links.