I'm in need of two tools or at least some inspiration for best practice
1st is detecting languages used on web page, as some tests show that lang attributes are accurate 80% of the time, so something more robust that actually looks at the content. I'm aware of a technique that looks at two-three character combos which apparently works well, also perhaps popular words from each language. Anyone seen an implementation (with code or explanation) that works well?
2nd is somewhat related, evaluating 1 or more web pages from a domain and being able to detect the main content area of a page. Seen anything that claims to work well (code or explanation would be great)