Perl Scrappy, sure makes crawling/parsing simple

Started by littleman, February 28, 2011, 07:36:14 AM


littleman


BoL

Looks like it covers the ABCs of scraping pretty well! 'Happy scraping', as they put it.

It looks like the all-in-one package for quicker scraping.

Do you know if it's a tree-based parser? They can get pretty slow on large documents. I'm experimenting with a few and found PHP's DOM too slow. onsgmls (used by the X/HTML validator) seems like a decent option.
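To illustrate the alternative to a tree-based parser, here is a minimal sketch of event-driven (streaming) parsing. It uses Python's standard html.parser rather than any of the tools named in this thread, and the names LinkCollector and extract_links are made up for the example. The parser reacts to each tag as it goes by and never builds a full document tree, which is what keeps memory and time down on large pages.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags as the markup streams past.

    No document tree is built; only the list of links is kept in memory.
    (Hypothetical helper, not part of Scrappy or Web::Scraper.)
    """

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every non-empty href found in the given HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


sample = '<p>See <a href="/one">one</a> and <a href="/two">two</a>.</p>'
print(extract_links(sample))
```

On a large document, feed() can also be called repeatedly with chunks as they arrive, so the whole page never has to sit in memory at once.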


littleman

It is built on other modules, LWP and Web::Scraper.  It looks like Web::Scraper is built on tree-based parsing (HTML::Tree).  One guy doing a review of HTML::Tree said that he was able to get 2x faster results by pre-stripping tags like <span> and tables.
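A rough sketch of that pre-stripping idea, in Python rather than Perl, with several assumptions: the function name pre_strip is made up, <span> tags are unwrapped (their text is kept), and whole <table> blocks are dropped. The review mentioned above may well have done it differently. The point is only that the tree builder gets fewer nodes to allocate.

```python
import re

# Matches opening and closing <span> tags, with or without attributes.
SPAN_TAGS = re.compile(r"</?span\b[^>]*>", re.IGNORECASE)

# Matches a whole <table>...</table> block. Non-greedy, so nested tables
# are NOT handled correctly; this is a quick pre-filter, not a parser.
TABLE_BLOCKS = re.compile(
    r"<table\b[^>]*>.*?</table\s*>", re.IGNORECASE | re.DOTALL
)


def pre_strip(html):
    """Shrink markup before handing it to a tree-based parser.

    Assumption for this sketch: table content is not wanted at all,
    while text inside <span> should survive.
    """
    html = TABLE_BLOCKS.sub("", html)
    return SPAN_TAGS.sub("", html)


page = '<p><span class="x">hi</span></p><table><tr><td>nav</td></tr></table>'
print(pre_strip(page))
```

Whether this is safe depends on the site: if the data you want lives inside a table, only the span unwrapping applies.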

BoL

Yep, that would help. Anything tree-based is OK, I guess, if you're not wandering the web fetching arbitrary pages and are sticking to pre-defined sites that aren't 2MB of links.

I have a half-built template stripper that looks at tag depths & tries to target the unique page content... onsgmls (opensp) seems to work well for the parsing side.
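One possible reading of the tag-depth idea, as a hypothetical Python sketch (this is not the stripper described above, and the heuristic is an assumption): track the stack of open tags while streaming through the page, total up the visible text found under each tag path, and treat the path holding the most text as the likely unique content. The names DepthTextProfiler and densest_path are invented for the example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DepthTextProfiler(HTMLParser):
    """Totals visible text length per tag path, e.g. 'html/body/div/p'."""

    def __init__(self):
        super().__init__()
        self.stack = []         # currently open tags, outermost first
        self.text_by_path = {}  # path string -> characters of text seen

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Ignore script/style bodies; they are not visible content.
        if self.stack and self.stack[-1] in ("script", "style"):
            return
        text = data.strip()
        if text:
            path = "/".join(self.stack)
            self.text_by_path[path] = self.text_by_path.get(path, 0) + len(text)


def densest_path(html):
    """Return the tag path holding the most text, or None if there is none."""
    profiler = DepthTextProfiler()
    profiler.feed(html)
    profiler.close()
    if not profiler.text_by_path:
        return None
    return max(profiler.text_by_path, key=profiler.text_by_path.get)


doc = ('<html><body><div><a href="#">Home</a></div>'
       '<div><p>This paragraph is the actual article text.</p></div>'
       '</body></html>')
print(densest_path(doc))
```

A real template stripper would likely compare several pages from the same site and discard paths whose text repeats across them; this sketch only looks at one page.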