Language Detection / Template Extraction

BoL:
I'm in need of two tools, or at least some inspiration on best practice.

1st is detecting the languages used on a web page. Some tests show that lang attributes are accurate only 80% of the time, so I need something more robust that actually looks at the content. I'm aware of a technique that looks at two- or three-character combinations, which apparently works well, perhaps combined with popular words from each language. Has anyone seen an implementation (with code or explanation) that works well?

2nd is somewhat related: evaluating one or more web pages from a domain and detecting the main content area of a page. Has anyone seen anything that claims to work well? (Code or explanation would be great.)
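
For the first question, the character-combination idea can be sketched roughly as follows. This is only a minimal illustration, not any particular library's method: it builds a character-trigram profile per language and picks the profile with the highest cosine similarity. The `SAMPLES` texts, the `trigrams`, `cosine` and `detect` names, and the choice of cosine similarity are all assumptions made for the example; a usable detector needs much larger training text per language.

```python
from collections import Counter
import math

# Tiny illustrative training samples (assumption: real profiles need far more text).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the way "
          "that we think about the things they have said",
    "fr": "le renard brun saute par dessus le chien paresseux et nous pensons "
          "que les choses sont dans la maison avec des amis",
    "de": "der schnelle braune fuchs springt hinter den faulen hund und wir "
          "denken dass die dinge nicht in dem haus mit den freunden sind",
}


def trigrams(text):
    """Count overlapping character trigrams after lowercasing and
    collapsing whitespace; the text is padded with one space on each side."""
    cleaned = " " + " ".join(text.lower().split()) + " "
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# One trigram profile per language, built once from the samples.
PROFILES = {lang: trigrams(text) for lang, text in SAMPLES.items()}


def detect(text):
    """Return (best_language, scores) for the given text."""
    query = trigrams(text)
    scores = {lang: cosine(query, profile) for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    for sentence in ("the dog is over the fox",
                     "le chien et la maison",
                     "der hund und die freunde"):
        print(sentence, "->", detect(sentence)[0])
```

Short inputs give few trigrams to compare, so in practice the score gap between the top two languages is worth checking before trusting the result.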

littleman:
BoL, that's an interesting challenge. I have never had to deal with sorting languages.

I poked around a bit to see what was out there.
https://github.com/optimaize/language-detector
http://pear.php.net/package/Text_LanguageDetect
(you may have already known about these...)

BoL:
Thanks, littleman, those both look very helpful. Having read around, it seems detection accuracy can get as high as 99%, at least for the major languages. I'll dig around to see how those compare.

For the template extraction problem, I have an idea of how one would work, but there doesn't seem to be much (public) code out there, and it is much more of an SE-specific problem than language detection.
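
The thread does not spell out an approach, so the following is only one possible sketch, under the assumption that template text (navigation, footers) repeats across pages of the same domain while main content does not. It extracts visible text blocks with Python's standard `html.parser` and drops any block seen on more than a given share of the pages. `BlockExtractor`, `text_blocks`, `main_content` and the `max_share` threshold are hypothetical names and values chosen for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class BlockExtractor(HTMLParser):
    """Collect visible text chunks, skipping script and style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text and not self._skip_depth:
            self.blocks.append(text)


def text_blocks(html):
    """Return the visible text blocks of one HTML page, in document order."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def main_content(pages, max_share=0.5):
    """For each page, keep the blocks that appear on at most
    max_share of all pages; more widely repeated blocks count as template."""
    per_page = [text_blocks(html) for html in pages]
    # Count each block once per page, however often it repeats within a page.
    seen_on = Counter(block for blocks in per_page for block in set(blocks))
    limit = max_share * len(pages)
    return [[b for b in blocks if seen_on[b] <= limit] for blocks in per_page]


if __name__ == "__main__":
    template = ("<html><body><div>Home | About</div><p>{}</p>"
                "<div>Copyright Example</div></body></html>")
    pages = [template.format(t) for t in
             ("First article text.", "Second article text.", "Third article text.")]
    print(main_content(pages))
```

Exact-match comparison misses template blocks that vary slightly per page (dates, user names), and it needs several pages from the domain to work at all; comparing DOM paths or fuzzy-matching blocks would be natural refinements.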

Rooftop:
We do some stuff with the API at detectlanguage.com. It works well, although we have some issues with the sample size in our data (our issue, not theirs).
