http://search.cpan.org/~awncorp/Scrappy-0.62/lib/Scrappy.pm
Seems to do it all and is easy to use if you are a Perl DIYer.
Here is a bit from its author:
http://ana.im/press/2010/09/scrappy/
Looks like it covers the ABCs of scraping pretty well! 'Happy scraping', as they put it.
It looks like the all-in-one package for quicker scraping.
Do you know if it's a tree-based parser? They can get pretty slow on large documents. I'm experimenting with a few and found PHP's DOM too slow. onsgmls (used by the X/HTML validator) seems like a decent option.
It is built on other modules, LWP and Web::Scraper. It looks like Web::Scraper does tree-based parsing via HTML::Tree. One guy reviewing HTML::Tree said he got 2x faster results by pre-stripping markup he didn't need, like <span> tags and whole tables, before building the tree.
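To illustrate that pre-stripping trick, here's a minimal pure-Perl sketch. The helper name `pre_strip` and the exact regexes are mine, not from the review; regexes on HTML are fragile in general, but for pre-defined sites with known markup they're fine as a speed hack, since the tree parser then has far fewer nodes to build.

```perl
# Hypothetical pre-stripper: remove markup we don't need
# before handing the HTML to a tree parser, so the tree stays small.
sub pre_strip {
    my ($html) = @_;
    $html =~ s{<table\b[^>]*>.*?</table>}{}gis;  # drop table blocks entirely
    $html =~ s{</?span\b[^>]*>}{}gi;             # unwrap <span> tags, keep their text
    return $html;
}
```

You'd call it as `my $clean = pre_strip($raw_html);` and feed `$clean` to HTML::TreeBuilder (or whatever parser) instead of the raw page.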
Yep, that would help. Anything tree-based is OK, I guess, if you're not wandering the web fetching at random and stick to pre-defined sites that aren't 2MB of links.
I have a half-built template stripper that looks at tag depths and tries to target the unique page content... onsgmls (OpenSP) seems to work well on the parsing side.
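For what it's worth, the tag-depth idea can be sketched without a real parser. This is a hypothetical toy (the function name `text_by_depth` and the tokenizing regex are my own, not from the half-built stripper): walk the tags, track nesting depth, and bucket text by the depth it occurs at. The unique page content tends to cluster at one depth, while template chrome (nav lists, footers) sits at others.

```perl
# Toy depth bucketing: split HTML into tags and text, track tag depth,
# and collect visible text keyed by the depth at which it occurs.
sub text_by_depth {
    my ($html) = @_;
    my $depth  = 0;
    my %bucket;
    my %void = map { $_ => 1 } qw(br img hr meta link input);
    for my $tok ($html =~ m{(<[^>]+>|[^<]+)}g) {
        if ($tok =~ m{^</}) {
            $depth--;                                 # closing tag
        }
        elsif ($tok =~ m{^<(\w+)[^>]*?(/?)>}) {
            my ($name, $self_close) = (lc $1, $2);
            $depth++ unless $self_close || $void{$name};  # self-closing/void tags don't nest
        }
        elsif ($tok =~ /\S/) {                        # text node
            (my $text = $tok) =~ s/^\s+|\s+$//g;
            push @{ $bucket{$depth} }, $text;
        }
    }
    return \%bucket;
}
```

A real stripper would then pick the depth (or subtree) with the highest text density and discard the rest; this toy only does the bucketing step, and it ignores comments, CDATA, and badly-nested tags that onsgmls would handle properly.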