Perl Scrappy, sure makes crawling/parsing simple

littleman · February 28, 2011, 07:36:14 AM

http://search.cpan.org/~awncorp/Scrappy-0.62/lib/Scrappy.pm

Seems to do it all and is easy to use if you are a Perl DIYer.

Here is a bit from its author:
http://ana.im/press/2010/09/scrappy/

BoL · February 28, 2011, 12:37:49 PM

Looks like it covers the ABCs of scraping pretty well! 'Happy scraping', as they put it.

It looks like the all-in-one package for quicker scraping.

Do you know if it's a tree-based parser? They can get pretty slow on large documents. I'm experimenting with a few and found PHP's DOM too slow. onsgmls (used by the X/HTML validator) seems like a decent option.
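
To illustrate the difference from a tree-based parser, here is a minimal sketch of event-based parsing that pulls links out of a page without building a document tree. It uses Python's standard html.parser rather than Scrappy, PHP's DOM or onsgmls, and the names LinkCollector and extract_links are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags while streaming through the input.

    No document tree is built; only the list of links is kept in memory.
    (Hypothetical helper for illustration, not part of any scraping library.)
    """

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every href found on an <a> tag, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Since feed() can be called repeatedly with chunks of a page, a large document does not have to be held as a full tree, which is where tree-based parsers tend to get slow.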

littleman · February 28, 2011, 07:06:04 PM

It is built on other modules, LWP and Web::Scraper. It looks like Web::Scraper is built on tree-based parsing (HTML::Tree). One guy doing a review of HTML::Tree said that he was able to get 2x faster results by pre-stripping tags like <span> and tables before parsing.
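
A rough sketch of what that kind of pre-stripping could look like, written in Python for illustration. The reviewer's actual code isn't shown, so the tag list and the strip_tags name are assumptions. It removes the chosen tags but keeps the text inside them, so the tree builder has fewer nodes to create.

```python
import re


def strip_tags(html, tag_names=("span", "table", "tbody", "tr", "td", "th")):
    """Remove opening and closing tags for the given names, keeping inner text.

    Assumption: the default tag list is only an example. This is a regex
    shortcut, so it can misfire on attribute values that contain '>'.
    """
    names = "|".join(re.escape(name) for name in tag_names)
    # \b stops "th" from matching "thead" and "td" from matching a longer name.
    pattern = re.compile(r"</?(?:%s)\b[^>]*>" % names, re.IGNORECASE)
    return pattern.sub("", html)
```

The cleaned string would then be handed to whatever tree parser is in use. Whether this gives a 2x speedup will depend on how much of the page is made up of the stripped tags.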

BoL · February 28, 2011, 08:49:26 PM

Yep, that would help. Anything tree-based is OK, I guess, if you're not wandering the web while fetching and you stick to pre-defined sites that aren't 2MB of links.

I have a half built template stripper that looks at tag depths & tries to target the unique page content... onsgmls (opensp) seems to work well in regards to the parsing aspect.
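
For anyone curious about the tag-depth idea, here is a minimal Python sketch of one way it could work. This is not the actual stripper; the heuristic (pick the nesting depth that holds the most text) and the names DepthTextParser, text_by_depth and densest_depth are assumptions made up for the example.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DepthTextParser(HTMLParser):
    """Record each non-empty text chunk together with its tag nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []  # list of (depth, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append((self.depth, text))


def text_by_depth(html):
    """Return the (depth, text) chunks and the total text length per depth."""
    parser = DepthTextParser()
    parser.feed(html)
    parser.close()
    totals = defaultdict(int)
    for depth, text in parser.chunks:
        totals[depth] += len(text)
    return parser.chunks, dict(totals)


def densest_depth(html):
    """Assumed heuristic: the depth holding the most text is the page content."""
    _, totals = text_by_depth(html)
    return max(totals, key=totals.get) if totals else None
```

A real template stripper would need more than this, for example skipping script and style content and comparing several pages from the same site to spot the repeated navigation blocks.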
