Th3 Core

Why We Are Here => Web Development => Topic started by: BoL on May 18, 2018, 08:54:59 PM

Title: Language Detection / Template Extraction
Post by: BoL on May 18, 2018, 08:54:59 PM
I'm in need of two tools, or at least some inspiration for best practice.

The 1st is detecting the languages used on a web page. Some tests show that lang attributes are accurate only about 80% of the time, so I want something more robust that actually looks at the content. I'm aware of a technique that looks at two- and three-character combos, which apparently works well, and perhaps also at popular words from each language. Has anyone seen an implementation (with code or explanation) that works well?
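For reference, the character-combo idea can be sketched as follows. This is a minimal illustration, not the method used by any particular library: it builds a trigram frequency profile per language and picks the profile most similar (by cosine similarity) to the input. The names trigram_profile, cosine, detect_language, and the tiny SAMPLES texts are all made up for the example; a usable detector would need far larger training text per language.

```python
from collections import Counter
import math

def trigram_profile(text):
    """Count overlapping character trigrams in lowercased, whitespace-normalized text."""
    cleaned = " ".join(text.lower().split())
    padded = f" {cleaned} "  # pad so word boundaries at the ends are counted too
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy training samples (hypothetical; real profiles need much more text).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the house where they live",
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist das haus in dem sie wohnen",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux et c'est la maison où ils habitent",
}
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}

def detect_language(text):
    """Return the language code whose profile is most similar to the text."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))
```

Calling detect_language on the visible text of a page (rather than trusting its lang attribute) gives a content-based guess. Short inputs are unreliable with this approach, so it is probably best applied to a page's full text.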

The 2nd is somewhat related: evaluating 1 or more web pages from a domain and being able to detect the main content area of a page. Has anyone seen anything that claims to work well? (Code or explanation would be great.)
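One possible starting point for the second problem, offered as a rough sketch under stated assumptions rather than a known-good solution: fetch several pages from the same domain, split each into text blocks, and treat blocks that repeat across most pages (navigation, footers) as template. What is left is a candidate for the main content. The names BlockCollector, text_blocks, main_content, and the max_share threshold are hypothetical, and the approach needs at least two pages to compare.

```python
from collections import Counter
from html.parser import HTMLParser

class BlockCollector(HTMLParser):
    """Collect visible text chunks, skipping script and style contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and not self._skip_depth:
            self.blocks.append(text)

def text_blocks(html):
    """Return the list of visible text blocks in an HTML string."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks

def main_content(pages, max_share=0.5):
    """For each page, keep only blocks seen on at most max_share of all pages."""
    per_page = [text_blocks(html) for html in pages]
    # Count on how many pages each block occurs (once per page).
    page_counts = Counter(block for blocks in per_page for block in set(blocks))
    limit = max_share * len(pages)
    return [[b for b in blocks if page_counts[b] <= limit] for blocks in per_page]
```

Exact text matching is the simplest possible comparison. Templates that vary slightly per page (for example a "logged in as" line) would slip through, so a real version would likely compare DOM paths or fuzzy-match blocks instead.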
Title: Re: Language Detection / Template Extraction
Post by: littleman on May 18, 2018, 09:41:43 PM
BoL, that's an interesting challenge. I have never had to deal with sorting languages.

I poked around a bit to see what was out there.
https://github.com/optimaize/language-detector
http://pear.php.net/package/Text_LanguageDetect
(you may have already known about these...)
Title: Re: Language Detection / Template Extraction
Post by: BoL on May 22, 2018, 08:26:49 AM
Thanks Littleman, those both look very helpful. Having read around, it seems detection accuracy can get as high as 99%, at least for the major languages. I'll dig around to see how those compare.

For the template extraction problem, I have an idea of how one would work, but there doesn't seem to be much (public) code out there, and it is much more of an SE-specific problem than language detection.