Author Topic: Language Detection / Template Extraction

BoL

Language Detection / Template Extraction
« on: May 18, 2018, 08:54:59 PM »
I'm looking for two tools, or at least some pointers on best practice.

The first is detecting the languages used on a web page. Some tests show that lang attributes are accurate 80% of the time, so I want something more robust that actually looks at the content. I'm aware of a technique that looks at two- and three-character combinations, which apparently works well, and perhaps also at popular words from each language. Has anyone seen an implementation (with code or explanation) that works well?

The second is somewhat related: evaluating one or more web pages from a domain and detecting the main content area of a page. Has anyone seen anything that claims to work well? Code or an explanation would be great.
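
To make the character-combination idea from the first question concrete, here is a minimal sketch of one common variant: ranked character trigram profiles compared with an "out-of-place" distance. Everything in it is an assumption for illustration: the function names, the three tiny training sentences in `SAMPLES`, and the cutoff of 300 n-grams. A real detector would build its profiles from much larger corpora per language.

```python
from collections import Counter


def ngram_profile(text, n=3, top=300):
    """Map each of the `top` most frequent character n-grams to its rank."""
    cleaned = " ".join(text.lower().split())
    padded = f" {cleaned} "  # pad so word boundaries count as n-grams too
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:top])}


def out_of_place(doc_profile, lang_profile, max_penalty):
    """Sum of rank differences; n-grams unknown to the language cost max_penalty."""
    total = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            total += abs(rank - lang_profile[gram])
        else:
            total += max_penalty
    return total


# Hypothetical, far-too-small training text; illustration only.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog runs "
          "after the fox through the woods while the people watch them with "
          "great interest",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann "
          "läuft der hund dem fuchs durch den wald nach während die leute "
          "ihnen mit großem interesse zusehen",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et puis "
          "le chien court après le renard à travers les bois pendant que les "
          "gens les regardent avec grand intérêt",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}


def detect_language(text, top=300):
    """Return the language code whose profile is closest to the text."""
    doc = ngram_profile(text, top=top)
    scores = {lang: out_of_place(doc, prof, top) for lang, prof in PROFILES.items()}
    return min(scores, key=scores.get)
```

Calling `detect_language("the dog runs through the woods")` scores the text against each profile and returns the code with the lowest distance. On a web page you would run this on the extracted visible text, and very short inputs will be unreliable with any profile size.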

littleman

Re: Language Detection / Template Extraction
« Reply #1 on: May 18, 2018, 09:41:43 PM »
BoL, that's an interesting challenge. I have never had to deal with sorting languages.

I poked around a bit to see what was out there.
https://github.com/optimaize/language-detector
http://pear.php.net/package/Text_LanguageDetect
(you may have already known about these...)

BoL

Re: Language Detection / Template Extraction
« Reply #2 on: May 22, 2018, 08:26:49 AM »
Thanks, Littleman, those both look very helpful. Having read around, it seems detection can get as good as 99%, at least for the major languages. I'll dig around to see how those compare.

For the template extraction problem, I have an idea of how one would work, but there doesn't seem to be much (public) code out there, and it is much more of a SE-specific problem than language detection.
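
One possible approach to the template problem, sketched under stated assumptions and not necessarily the idea mentioned above: fetch several pages from the same domain, split each into visible text blocks, and treat any block that repeats across most pages as template (navigation, footer). What remains is a rough guess at the main content. The names `BlockExtractor`, `text_blocks`, `main_content` and the `max_share` threshold are hypothetical, and the sketch uses only Python's standard `html.parser`.

```python
from collections import Counter
from html.parser import HTMLParser


class BlockExtractor(HTMLParser):
    """Collect visible text chunks, skipping script and style contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text and not self._skip_depth:
            self.blocks.append(text)


def text_blocks(html):
    """Return the list of visible text blocks in one HTML document."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def main_content(pages, max_share=0.5):
    """For each page, keep blocks seen on at most `max_share` of all pages."""
    per_page = [text_blocks(html) for html in pages]
    seen_on = Counter()
    for blocks in per_page:
        seen_on.update(set(blocks))  # count each block once per page
    limit = max_share * len(pages)
    return [[b for b in blocks if seen_on[b] <= limit] for blocks in per_page]
```

This needs at least a few pages per domain to say anything useful; with a single page every block counts as repeated and nothing is kept. It also compares exact text only, so template blocks that vary slightly per page (dates, "logged in as" lines) would need fuzzier matching or a comparison of DOM paths instead.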

Rooftop

Re: Language Detection / Template Extraction
« Reply #3 on: June 26, 2018, 03:53:17 PM »
We do some stuff with the API at detectlanguage.com. It works well, although we have some issues with the sample size in our data (our issue, not theirs).