Author Topic: Mojeek hits 4 Billion 2021  (Read 6712 times)

Brad

  • Inner Core
  • Hero Member
  • *
  • Posts: 4154
  • What, me worry?
    • View Profile

Re: Mojeek hits 4 Billion 2021
« Reply #1 on: July 02, 2021, 02:13:26 PM »
This reminds me:

With Google getting rid of the Android search auction, I'm hoping Mojeek might get listed as one of the search engine options. I think Mojeek qualifies for English and maybe French and German. It's one way to get more users. This applies to the EU (and the UK, I think). We in the US will continue under Google/Android dominance while US regulators keep wondering why their AOL CDs no longer work.

BoL

  • Inner Core
  • Hero Member
  • *
  • Posts: 1208
    • View Profile
Re: Mojeek hits 4 Billion 2021
« Reply #2 on: July 02, 2021, 04:49:58 PM »
That 2 million figure for new pages will end up being closer to 10 million shortly as new servers get switched on. We were a bit delayed by the new servers arriving a couple of months late from Asia.

Some algo improvements are on the way for longer tail stuff, and there are quite a few features in the pipeline. Good to see greater recognition that you can count the number of SEs with their own English indexes on one hand.

ergophobe

  • Inner Core
  • Hero Member
  • *
  • Posts: 9292
    • View Profile
Re: Mojeek hits 4 Billion 2021
« Reply #3 on: July 02, 2021, 10:51:02 PM »
>>10 million

So that means adding 3-4 billion pages a year. To ask a perhaps dumb question, is that a little or a lot?
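For anyone checking my arithmetic, here is a rough sketch of it. It assumes the "10 million" figure means new pages per day (the unit isn't stated above) and takes the 4 billion from the thread title; the function names are just mine.

```python
# Assumption: "10 million" is new pages per DAY; the thread does not state the unit.
PAGES_PER_DAY = 10_000_000
DAYS_PER_YEAR = 365
CURRENT_INDEX = 4_000_000_000  # "4 Billion" from the thread title


def pages_per_year(pages_per_day: int, days: int = DAYS_PER_YEAR) -> int:
    """Pages added in a year at a constant daily rate."""
    return pages_per_day * days


def years_to_reach(target: int, current: int, pages_per_day: int) -> float:
    """Years needed to grow from `current` to `target` at a constant daily rate."""
    return (target - current) / pages_per_year(pages_per_day)


yearly = pages_per_year(PAGES_PER_DAY)
print(f"{yearly:,} pages per year")
print(f"{years_to_reach(2 * CURRENT_INDEX, CURRENT_INDEX, PAGES_PER_DAY):.2f} years to double")
```

At a constant rate that lands in the 3-4 billion range, so the index would roughly double in about a year.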

Put another way, at what point would you say the index is deep enough and broad enough and the challenge is less about crawling more of the web and more about relevancy and surfacing good results?

BoL

  • Inner Core
  • Hero Member
  • *
  • Posts: 1208
    • View Profile
Re: Mojeek hits 4 Billion 2021
« Reply #4 on: July 03, 2021, 08:51:17 AM »
>is that a little or a lot?

Hard to tell for sure. We at least know we can update every page in the index within a small time frame, so getting bigger doesn't cause staleness issues. You'd think with 4bn pages that most of the authoritative ones are there, and index size isn't the be all and end all. If index size were the goal, indexing all of Twitter and FB could raise index size by orders of magnitude. Looking at post id's on twitter they're into the quintillions.

Increasing the size does help in crawling deeper into authoritative sites.

I've been trying to estimate Bing's size and have seen estimates that it's about 40bn (for comparison, they do index more languages), but it's possibly higher than that. On Mojeek you do get exact result counts but on Bing you don't; we could probably correlate the two over a few thousand queries.
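A minimal sketch of what that correlation could look like, under a big assumption: that result counts scale roughly in proportion to index size. The function name and the sample counts are placeholders made up for illustration, not real measurements from either engine.

```python
from statistics import median


def estimate_index_size(known_index_size, paired_counts):
    """Estimate another engine's index size from per-query result counts.

    paired_counts: list of (known_engine_hits, other_engine_hits), one per query.
    Assumes hits scale linearly with index size, which is only an approximation.
    """
    ratios = [
        other / known
        for known, other in paired_counts
        if known > 0 and other > 0  # skip queries with no usable counts
    ]
    if not ratios:
        raise ValueError("need at least one query with non-zero counts on both engines")
    # The median is less sensitive than the mean to a few wildly off counts.
    return known_index_size * median(ratios)


# Placeholder counts, NOT real data: (hits on known engine, hits on other engine)
sample = [(100, 200), (50, 150), (10, 40)]
estimate = estimate_index_size(4_000_000_000, sample)
print(f"{estimate:,.0f}")
```

With estimated counts on one side, the output is only as good as those estimates, which is why a few thousand queries would be needed rather than three.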

>more about relevancy and surfacing good results?

That's the higher priority for us in the short term. E.g., compare 'war of the worlds' with and without quotes. There's a push in the pipeline to make the unquoted results more like the quoted ones, among other algo changes.
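To illustrate the quoted vs. unquoted difference in general terms, here is a toy sketch. It is not Mojeek's actual matching code; the two functions and the sample documents are hypothetical, and tokenizing is just lowercase plus whitespace splitting.

```python
def tokenize(text):
    """Lowercase and split on whitespace (no stemming, no punctuation handling)."""
    return text.lower().split()


def matches_all_terms(query, doc):
    """Unquoted-style match: every query word appears somewhere in the document."""
    doc_terms = set(tokenize(doc))
    return all(term in doc_terms for term in tokenize(query))


def matches_phrase(query, doc):
    """Quoted-style match: the query words appear consecutively and in order."""
    q, d = tokenize(query), tokenize(doc)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


# Hypothetical documents for illustration only
docs = [
    "The War of the Worlds is a novel by H. G. Wells",
    "A history of the war that shaped both worlds",
]

for doc in docs:
    print(matches_all_terms("war of the worlds", doc),
          matches_phrase("war of the worlds", doc))
```

Both documents pass the unquoted check, but only the first passes the phrase check, so an unquoted ranking has to do extra work to prefer it.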

ergophobe

  • Inner Core
  • Hero Member
  • *
  • Posts: 9292
    • View Profile
Re: Mojeek hits 4 Billion 2021
« Reply #5 on: July 03, 2021, 06:17:43 PM »
>>post id's on twitter

That's a lot of what I was asking about. Sometimes I think having all that in the Google index reduces quality rather than improving it. In an ideal world, you would crawl as aggressively as Google, but throw a lot more garbage in the trash can. Google has such a jump on indexing and so much power, I don't think anyone could reasonably compete there but, again, I'm not sure anyone needs to. An index 10% the size of Google's but twice as good at surfacing relevant results might be enough to take Google down, but an index that's 2X the size of Google's and 10% worse at relevancy could not compete on search alone and would need some other attraction (privacy, fewer ads). Which, obviously, are all things you know and which Mojeek seems to be pursuing.

>>deeper into authoritative sites

That's another part. I remember at least 10 years ago at Pubcon the head of SEO for MS (I think Duane Forrester... in that era) saying that they had over 100m URLs. A lot of that was due to poor canonicalization and internationalization and they were actively trying to figure out how to reduce that number because Google and Live(?) couldn't keep up with crawling it all.

>> 'war of the worlds' with and without quotes.

The unquoted isn't bad. It at least surfaces the book. But yes, it's definitely too biased toward World Wars in the unquoted version.

In my test searches on Mojeek, I find it quite good for common searches, and quite far behind Google for obscure searches. I think that's where the size of the crawl is still hurting you.

I did a bunch of Yosemite searches because I know fairly well what they typically look like on Google and Bing.

[yosemite hiking] - I'd actually prefer the Mojeek results. My friend Russ's site is #1 and that is where I send everyone who asks (I had a site on Yosemite hiking that is so out of date and so bad, I send everyone to Russ; it really is the best resource for what most people are seeking with that query).

[yosemite lodging] - again, I would say the Mojeek results are better than Google in terms of delivering what people actually want with that query.

But if I switch to one of my favorite John Muir quotes... Let's say I can't remember it and I put in a phrase I do remember in quotes
"Storms are fine speakers"

Mojeek doesn't seem to have that exact phrase in its index and the results don't get me any closer to finding the quote.
https://www.mojeek.com/search?q=%22storms+are+fine+speakers%22

Google's results are not perfect.
https://www.google.com/search?q=%22storms+are+fine+speakers%22&oq=%22storms+are+fine+speakers%22

I would take the results I get and re-order them as 6, 9, 8 and would not include the #1 result at all or at least push it well off the front page. There's room for improvement. If I had gotten this result from Google in 2009 (before Farmer/Panda and Penguin, when Google was just swimming in spam), I would have been thrilled. Now I'm satisfied. I suspect that in 2033 (12 years from now), I would be rather disappointed with this.

That's just a few examples. My sense is that Mojeek does well on relevancy and ranking when it starts with a good set of results that it can rank, but then falls down as soon as the result set gets thin or non-existent.

All of which is, I guess, the long version of my initial question.

[update]

PS since I know the suspense is killing you....

"Storms are fine speakers, and tell all they know, but their voices of lightning, torrent, and rushing wind are much less numerous than the nameless still, small voices too low for human ears; and because we are poor listeners we fail to catch much that is fairly within reach."
https://vault.sierraclub.org/john_muir_exhibit/writings/the_mountains_of_california/chapter_11.aspx

PPS, since you didn't ask, my other favorite quote is recounted by Sam Hall Young who visited Muir while Muir was working 16-hour days on his fruit farm in the Bay Area (Martinez), his health withering from overwork. The bolded part was taped to my computer monitor for many years:

Quote
Eagerly he questioned me of my travels and of the "progress" of the glaciers and woods of Alaska. Beyond a few short mountain trips he had seen nothing for two years of his beloved wilds.

Passionately he voiced his discontent: "I am losing the precious days. I am degenerating into a machine for making money. I am learning nothing in this trivial world of men. I must break away and get out into the mountains to learn the news."
https://archive.org/stream/alaskadayswithjo00younuoft/alaskadayswithjo00younuoft_djvu.txt
« Last Edit: July 03, 2021, 06:41:15 PM by ergophobe »

BoL

  • Inner Core
  • Hero Member
  • *
  • Posts: 1208
    • View Profile
Re: Mojeek hits 4 Billion 2021
« Reply #6 on: July 04, 2021, 03:24:10 AM »
Ergo, cheers for that- great feedback and thanks for the examples.

FWIW here's a current test algo (still a work in progress) for the WOTW query: https://www.mojeek.com/search?q=war+of+the+worlds&m=202

Will take a look at the quote you provided and see whether there are good results for your query.

Quotes like that should be something that Mojeek can do and perhaps you're right in that we lack the pages.

The main difference with Mojeek is that it's a boolean search engine (with stemming), whereas what Google is doing nowadays is much more like a vector space model, if it is still that given BERT et al. There are definite advantages and disadvantages to both. I've heard plenty of people grumble that they type something into Google and get annoyed when their typed in words are inferred to mean other words. On the flip side, there are definitely times when including synonyms is helpful. At least it's good to have both versions available.
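A toy sketch of the two retrieval styles, for illustration only. It is neither Mojeek's nor Google's implementation: the stemmer is a crude suffix stripper I made up, the vector space side is plain term-count cosine similarity, and it does not model synonym inference at all. It only shows the strict (in or out) vs. graded (scored) difference.

```python
import math
from collections import Counter


def stem(word):
    """Very crude suffix stripping, for illustration only (not a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def terms(text):
    """Lowercase, split on whitespace, and stem each word."""
    return [stem(w) for w in text.lower().split()]


def boolean_match(query, doc):
    """Boolean AND with stemming: the document is either in the result set or not."""
    doc_terms = set(terms(doc))
    return all(t in doc_terms for t in terms(query))


def cosine_score(query, doc):
    """Vector space model: cosine similarity between term-count vectors."""
    q, d = Counter(terms(query)), Counter(terms(doc))
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0


query = "storm speaker"
for doc in ("storms are fine speakers", "storms are loud"):
    print(doc, boolean_match(query, doc), round(cosine_score(query, doc), 3))
```

The second document drops out of the boolean result set entirely but still gets a non-zero cosine score, which is the kind of partial matching that helps on thin result sets and annoys people when they wanted their exact words.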

« Last Edit: July 04, 2021, 04:15:23 AM by BoL »

ergophobe

  • Inner Core
  • Hero Member
  • *
  • Posts: 9292
    • View Profile
Re: Mojeek hits 4 Billion 2021
« Reply #7 on: July 04, 2021, 10:23:57 PM »
>>get annoyed when their typed in words are inferred to mean other words

I feel like I hear this a lot among some of the crankier older types on WMW. The problem is that they aren't paying attention to the cases where it helps them and makes the SERPs better. Some level of stemming is required for decent search. Remember when results for singular and plural were radically different?

There are cases where it still makes sense to differentiate, of course, if you can successfully infer intent from it.

But my general feeling is that most of the people yelling at their computer monitor in the morning because Google tries to infer meaning and sometimes gets it wrong are the same people who are going to spend the afternoon yelling at their TV because the referee made a bad call. You can't really build a business strategy around that.

Brad

  • Inner Core
  • Hero Member
  • *
  • Posts: 4154
  • What, me worry?
    • View Profile
Re: Mojeek hits 4 Billion 2021
« Reply #8 on: July 04, 2021, 11:54:30 PM »
> longer tail

This would be welcome.  Part of G's early reputation came off that long tail.

> stemming

Good point, ergo. Also, popular terminology for the same thing changes over time (e.g. what used to be "software" is now "apps").

> crankier older types

I thought we all moved to The Core.  hhh


PS.  I do get a little traffic from Mojeek, which is great.  People are using it.