Author Topic: Extracting themes from text  (Read 7077 times)

Rooftop

Extracting themes from text
« on: September 05, 2014, 05:33:48 PM »
Can anyone point me towards a solution for this?

We're looking for a system to identify themes in documents. It doesn't have to be massively granular, just at the level of "fashion", "programming", "medical" or similar. It needs to be cheap on a per-lookup basis and able to handle 100k documents per day. Speed isn't a massive issue.

I remember there being some APIs around for this some years ago, but my memory is failing me. 

BoL

« Reply #1 on: September 05, 2014, 11:28:05 PM »
>API

OpenCalais is free and pretty good. It definitely does part-of-speech tagging well and can classify people, places, etc.

Other than that, there's http://code.google.com/p/word2vec/ ... written in C, fast but maybe hard to adapt.
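
For themes as coarse as "fashion" / "programming" / "medical", a rough baseline can be as simple as counting keyword hits per theme. The following is only a minimal sketch, not how OpenCalais or word2vec work: the THEME_KEYWORDS lists and the function names are made up for illustration, and real vocabularies would have to be curated or learned from labelled documents.

```python
import re
from collections import Counter

# Hypothetical seed vocabularies (assumption, not from any real API).
THEME_KEYWORDS = {
    "fashion": {"dress", "designer", "runway", "fabric", "style"},
    "programming": {"code", "function", "compiler", "python", "bug"},
    "medical": {"patient", "doctor", "symptom", "treatment", "clinical"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def score_themes(text):
    """Return, per theme, how many tokens in the text hit its keyword set."""
    counts = Counter(tokenize(text))
    return {
        theme: sum(counts[word] for word in words)
        for theme, words in THEME_KEYWORDS.items()
    }


def best_theme(text):
    """Return the highest-scoring theme, or None if no keyword matched."""
    scores = score_themes(text)
    theme, score = max(scores.items(), key=lambda item: item[1])
    return theme if score > 0 else None
```

Something like best_theme("The compiler found a bug in my Python code") would come back as "programming". It costs nothing per lookup, so 100k documents a day is not a problem, but accuracy depends entirely on the keyword lists.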

Rooftop

« Reply #2 on: September 08, 2014, 10:00:24 AM »
OpenCalais - that is the one I was trying to remember.  Fantastic - thanks BoL

JasonD

« Reply #3 on: September 09, 2014, 02:17:47 PM »
OpenCalais is superb but there are alternatives too.

The best known in our industry, if it's per URL, is probably Majestic's topic rank thingamajig. I've found it VERY good. There are other routes that have no direct cost too; they are a touch "unusual" but very accurate.

PS Sorry for so much Majestic love today!

martinibuster

« Reply #4 on: December 22, 2014, 11:42:31 PM »
Majestic's Topical Trust Rank is based on inbound links. It's really useful but it doesn't identify themes in document text. It's all about the inlinks.

ergophobe

« Reply #5 on: December 23, 2014, 03:16:23 AM »
Dixon said they put a lot of time and research energy into figuring out how to do textual analysis to pull the topic from the text, and in the end nothing they could come up with worked as well as checking the inbound links and looking at the anchor text and the neighborhood. Turns out having computers pull meaning from text is still very, very hard.
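
To make the anchor-text idea concrete, here is a minimal sketch of guessing a page's topic from the words other sites use when linking to it. This is an illustration only and not Majestic's actual method, which isn't described here; the STOPWORDS list and the anchor_terms function are assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical stopword list; generic anchors like "click here" say nothing
# about the topic of the target page.
STOPWORDS = {"the", "a", "an", "and", "of", "for", "to", "in", "on",
             "here", "click", "this"}


def anchor_terms(anchor_texts, top_n=3):
    """Return the top_n words that appear in the most inbound anchors."""
    counts = Counter()
    for anchor in anchor_texts:
        # Use a set so each word counts once per link, not once per mention.
        words = set(re.findall(r"[a-z]+", anchor.lower())) - STOPWORDS
        counts.update(words)
    return counts.most_common(top_n)
```

Fed the anchors "cheap running shoes", "running shoes review", "click here" and "best shoes for running", the two top terms are "running" and "shoes", each found in three links. The words on the page itself are never examined.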

martinibuster

« Reply #6 on: December 23, 2014, 05:46:59 AM »
That's very interesting, thanks for sharing that, Tom.

There's TrustRank and then there's Topical TrustRank. Turns out that TrustRank (TR) is unreliable because ultimately it's about quantity (plus a built-in bias, etc.). What makes Topical TrustRank (TTR) so useful is that it's about the quality of relevance. Playing around with Majestic's Topical TrustRank, I realized they went very far in nailing down relevance. With TTR you can instantly understand the relevance factor of the inlinks. We know there is no factor called Trust, although, confusingly to some, there are Trust Factors. TTR nails down the relevance of inlinks. Relevance is, in my opinion, one of the most important factors for ranking. There are other factors for determining your "trustworthiness", which is another way of saying whether you are spam or not spam. But I think that once you get past that door, you're subject to whether you're relevant for a query or not.

As for the trustworthy part, I don't believe TrustRank scores can measure that, because those are largely quantity metrics that are relative (subjective) and can be gamed. The relevance part is a little harder to game. I find it quite useful to at least have the relevance piece of the puzzle, and Majestic's TTR is quite remarkable and useful in that respect. I'm not sure it's well recognized how useful that bit of information is.
« Last Edit: December 23, 2014, 05:51:14 AM by martinibuster »

Brad

« Reply #7 on: December 23, 2014, 01:12:26 PM »
Hmm, it's sounding like Majestic is halfway towards being able to make a general search engine, or at least has the technology to do it.

JasonD

« Reply #8 on: December 23, 2014, 01:13:39 PM »
> half way

7/8ths IMO :)

Rooftop

« Reply #9 on: December 23, 2014, 02:15:33 PM »
> half way

> 7/8ths IMO :)

I was thinking along the same lines when they switched domains from majesticseo to majestic.

JasonD

« Reply #10 on: December 23, 2014, 02:21:05 PM »
>MajesticSEO to Majestic

I actually don't think they'll launch as a search engine. However, I can see 2015 being transformational for them, with a huge push to get their data into more corporates to be used as a search engine, whether that is for vertical searches, internal search, or even purely detailed and encompassing research / market intelligence.

However, I do see 2016 as being a target for an IPO and possible AIM listing.