Author Topic: Language Detection / Template Extraction (Read 3177 times)

BoL · « **on:** May 18, 2018, 08:54:59 PM »

I'm in need of two tools, or at least some inspiration on best practice.

1st is detecting the languages used on a web page. Some tests show that lang attributes are only accurate 80% of the time, so I want something more robust that actually looks at the content. I'm aware of a technique that looks at two- or three-character combinations, which apparently works well, and perhaps also at the most common words of each language. Has anyone seen an implementation (with code or an explanation) that works well?

2nd is somewhat related: evaluating one or more web pages from a domain and detecting the main content area of each page. Has anyone seen anything that claims to work well? (Code or an explanation would be great.)

littleman · « **Reply #1 on:** May 18, 2018, 09:41:43 PM »

BoL, that's an interesting challenge. I've never had to deal with sorting languages.

I poked around a bit to see what was out there.
https://github.com/optimaize/language-detector
http://pear.php.net/package/Text_LanguageDetect
(you may have already known about these...)
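
For reference, the two- or three-character approach from the opening post can be sketched roughly like this. It is only a toy illustration, not how either library above works internally: the `SAMPLES` text, the `detect` helper and the cosine scoring are all made up for the example, and a real detector would build its profiles from far more text per language.

```python
from collections import Counter
import math
import re


def trigram_profile(text):
    """Count overlapping character trigrams; spaces mark word boundaries."""
    cleaned = re.sub(r"[^\w\s]|\d", " ", text.lower())  # drop punctuation and digits
    padded = " " + " ".join(cleaned.split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy training text; far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is what "
          "the people of the town have said about it",
    "de": "der schnelle braune fuchs springt über den faulen hund und das "
          "ist was die leute der stadt darüber gesagt haben",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et "
          "c'est ce que les gens de la ville ont dit à ce sujet",
}
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}


def detect(text):
    """Return (best_language, scores) for the given text."""
    profile = trigram_profile(text)
    scores = {lang: cosine(profile, ref) for lang, ref in PROFILES.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    for sentence in (
        "this is what the people said about the dog",
        "der hund und die leute der stadt",
        "les gens de la ville ont dit",
    ):
        print(detect(sentence)[0], "<-", sentence)
```

Short inputs will be unreliable with profiles this small, so the score dictionary is returned as well in case you want to reject low-confidence guesses.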

BoL · « **Reply #2 on:** May 22, 2018, 08:26:49 AM »

Thanks Littleman, those both look very helpful. Having read around, it seems detection can get as good as 99%, at least for the major languages. I'll dig around to see how those compare.

For the template extraction problem, I have an idea of how one would work, but there doesn't seem to be much (public) code out there, and it is much more of an SE-specific problem than language detection.
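
One possible starting point, as a rough sketch only: treat any text block that repeats across most pages of a domain as template, and whatever is left as main content. This is an assumed heuristic for illustration, not a tested method, and the names `BlockExtractor`, `text_blocks`, `main_content` and the `max_share` threshold are made up for the example. It uses only the Python standard library.

```python
from collections import Counter
from html.parser import HTMLParser


class BlockExtractor(HTMLParser):
    """Collect visible text chunks, skipping script and style contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text and not self._skip_depth:
            self.blocks.append(text)


def text_blocks(html):
    """Return the visible text chunks of one HTML page, in document order."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def main_content(pages, max_share=0.5):
    """Per page, keep blocks seen on at most max_share of all pages."""
    per_page = [text_blocks(html) for html in pages]
    if len(pages) < 2:
        return per_page  # nothing to compare against, so keep everything
    seen_on = Counter(block for blocks in per_page for block in set(blocks))
    limit = max_share * len(pages)
    return [[b for b in blocks if seen_on[b] <= limit] for blocks in per_page]


if __name__ == "__main__":
    nav, footer = "<div>Home | About</div>", "<div>Copyright Example</div>"
    pages = [
        f"<html><body>{nav}<p>{body}</p>{footer}</body></html>"
        for body in ("First article text.", "Second article text.", "Third one.")
    ]
    for blocks in main_content(pages):
        print(blocks)
```

Exact string matching is the weak point here: navigation that changes slightly per page (a highlighted current item, a date in the footer) will slip through, so comparing DOM paths or fuzzy-matching blocks would be the next thing to try.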

Rooftop · « **Reply #3 on:** June 26, 2018, 03:53:17 PM »

We do some stuff with the API at detectlanguage.com. It works well, although we have some issues with the sample size in our data (our issue, not theirs).
