The Core

Why We Are Here => Traffic => Topic started by: rcjordan on March 14, 2022, 10:38:31 PM

Title: Mojeek Surpasses 5 Billion Pages | Mojeek Blog
Post by: rcjordan on March 14, 2022, 10:38:31 PM

https://blog.mojeek.com/2022/03/five-billion-pages.html
Title: Re: Mojeek Surpasses 5 Billion Pages | Mojeek Blog
Post by: ergophobe on March 14, 2022, 10:57:29 PM
Is that a big number? At what point is it less a question of coverage and more a question of search quality?

Also, notice how the cycle time is dropping. It took 22 months to go from 2 to 3 billion, 14 months to go from 3 to 4 billion, 8 months to go from 4 to 5 billion.
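
As a rough check on the "cycle time is dropping" point, here is a minimal sketch that turns the three intervals quoted above into average pages added per month. It assumes each step added exactly one billion pages at a steady rate, which is a simplification; the variable and function names are only illustrative.

```python
# Months taken for each one-billion-page step, as quoted in the post above.
steps = {"2 to 3 billion": 22, "3 to 4 billion": 14, "4 to 5 billion": 8}

def pages_per_month(months, pages_added=1_000_000_000):
    """Average pages added per month, assuming steady growth within a step."""
    return pages_added / months

rates = {label: pages_per_month(months) for label, months in steps.items()}

for label, rate in rates.items():
    print(f"{label}: roughly {rate / 1e6:.0f} million pages per month")
```

On those assumptions the average crawl-and-index rate nearly triples between the first step and the last.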

I was going to check some obscure sites to see whether they were indexed or not, but couldn't get the site: operator to work. It turns out that it still needs a search term. So site:nytimes.com returns 0 results but site:nytimes nytimes returns over two million results (vs 5.5 million for Google and 2.4 million for Bing). So it seems like the index size is comparable to Bing.

I also searched on a small, obscure, low-traffic site (one of mine). Mojeek and Bing returned about the same number of pages; Google indexes another 20%. Essentially all the same though. I'm not exactly an objective witness, but I can say that the Mojeek results correctly sort the most popular pages as 1 & 2. For overall richness, Bing is the best. You could argue both ways which is the "better" result. Google gives by far the worst result. Actually not even in the running with Mojeek and Bing.

That last result points to my first question. The smaller indexes both yield better results. Looking at the Google result, I would say that some of the pages omitted from the Mojeek index are rightfully omitted - they're really stub pages that by all rights should never show up in a search result. So adding those pages reduces, not increases, the quality of the index.
Title: Re: Mojeek Surpasses 5 Billion Pages | Mojeek Blog
Post by: Brad on March 14, 2022, 11:41:10 PM
The quality has improved at 5 billion vs 2 billion which is about when I started following Mojeek.  Freshness has improved too.

There is a nice balance on Mojeek, it's not always the same warhorse sites that come up but a mix of better known sites and obscure sites that still answer the query.
Title: Re: Mojeek Surpasses 5 Billion Pages | Mojeek Blog
Post by: BoL on March 15, 2022, 06:29:15 PM
>big number

Guess it depends what you're looking for. Sometimes it's big enough for the deeper pages, other times it's just duplicate content and low quality pages like you say. We put links to other search engines in case you want to look elsewhere.

When the UK gov reviewed the ad/search space, (IIRC) they placed Bing's index in the 40-50 billion range and Google about 3x more. OFC when you factor in duplicate content, canonicals etc the number verges towards meaningless.

We do get some issues with access, like FB blocks our bot and most others. Maybe some searches would require those pages, maybe not. We also only index specific languages so it's not a directly comparable number to other index sizes.

I've been working on language detection of documents the past month or so. Turns out even if you're monolingual there are some fairly solid clues about the language of a document, though there are some tougher differentiations like Norwegian/Danish or Dutch/Afrikaans.
Title: Re: Mojeek Surpasses 5 Billion Pages | Mojeek Blog
Post by: ergophobe on March 15, 2022, 09:04:10 PM
Thanks BoL. That is the sort of context I was wondering about when asking whether or not it's a big number.

The 100+ billion in Google's index does not strike me as a worthwhile target. What I haven't tried with Mojeek are really obscure queries. Now that I'm not an active scholar, I do far fewer of those.

To me as a scholar the killer feature in Google has always been the very rich Google Books index of old, public-domain books. Nothing else is close. Other repos are just as big or bigger (BNF for example), but nobody has it searchable. I have to believe that's a money loser for Google and it seems like a mostly abandoned project. I wonder what percentage of the Google index is comprised of Google Books and Google Scholar.
Title: Re: Mojeek Surpasses 5 Billion Pages | Mojeek Blog
Post by: BoL on March 16, 2022, 07:01:12 AM
Was a bit off with my numbers, but these are estimates anyway
"We found that Google's index is larger than that of Bing in terms of number of
pages in the index. Based on submissions from these parties, Google's index
contains around [500-600 billion] pages and Microsoft's index contains around
[100-200 billion] pages."
https://assets.publishing.service.gov.uk/media/5fe4957c8fa8f56aeff87c12/Appendix_I_-_search_quality_v.3_WEB_.pdf

>target
Would agree, there are some algo test improvements on the way, as we know we have certain pages in our index that should rank better for specific searches

>scholar
Interesting, they started that project quite early on didn't they? Not sure if they ran into legal issues... definitely a massive project to undertake.

Plenty complaining about G quality nowadays but they are in a league of their own. DDG seems to be/have been doing well, most people don't know it's Bing reskinned. Brave could be interesting but not sure how much they rely on their old scraped CTR data. It's quite hard to compare to one another since most are relying on G/B in one way or another. Impressive that Mojeek is mainly the work of one person though.
Title: Re: Mojeek Surpasses 5 Billion Pages | Mojeek Blog
Post by: Brad on March 16, 2022, 11:31:32 AM
This.

>mainly the work of one person

Mojeek is a great proof of concept that it no longer takes tens of billions of dollars to make a spidering general search engine. Others have to be/are watching this and thinking about starting their own.  As regulation and bad reputation take Google down a few more notches I think we will see more start up search engines breaking out of bedrooms and garages.
Title: Re: Mojeek Surpasses 5 Billion Pages | Mojeek Blog
Post by: ergophobe on March 16, 2022, 04:30:29 PM
Quote from: BoL on March 16, 2022, 07:01:12 AM
>scholar
Interesting, they started that project quite early on didn't they? Not sure if they ran into legal issues... definitely a massive project to undertake.

Yes, they cut deals with a lot of libraries to start digitizing their collections and made huge progress, but then started running into legal hurdles. In the end, Google won their copyright suit, but it took ten years. I think Google Scholar has always been on safe footing because it's mostly titles and abstracts, and those are generally publicly available on the publisher sites and then you pay for the full paper. So that works well for publishers.

The Google Books vs The Authors Guild case is interesting. Google's original model for Books was to show snippets and then charge for full text and it wasn't clear what relationships they would cut with publishers and authors, how it would work for books with living authors and out-of-business publishers and many other issues. The fundamental question was whether or not showing snippets of books violated Fair Use provisions. Fair Use is ill-defined. It's like trademark infringement. Until you go to court, you can't know. There have been efforts to codify it more, but it's vague.


So think about that for a second and what the stakes were for Google and all search engines. If the courts had ruled that excerpting snippets from books was illegal, logically the same would apply to web pages. So the case went to the heart of search engines as we know them. Ultimately, the courts decided that Google was acting within the limits of Fair Use, but just barely. In the meantime, 10 years had gone by and a lot of the energy had gone out of the project. Still, they launched a Books update recently and they do have a catalog of 30,000,000 books, making it one of the larger "libraries" in the world. That is a bit deceptive when comparing it to a real library. They copied every book in several libraries and many many books have multiple copies.

If you want all the nitty gritty, check out the Atlantic article which starts with an overview of Fair Use and then gets to the Google case
https://www.theatlantic.com/technology/archive/2015/10/fair-use-transformative-leval-google-books/411058/

I'm not sure how all this figures into Google's page count. But 30,000,000 books with an average of 333 pages (I have downloaded several 600 page monsters) is... 10 billion pages. I guess it's not that much in light of the numbers you just posted. But, you know, 10 billion pages here, 10 billion pages there. Before you know it, it starts to add up.
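
The back-of-envelope figure above can be checked in a few lines. This is only a sketch: the 333-page average is the poster's own guess, and comparing against the 500-600 billion estimate quoted earlier in the thread assumes book pages are counted in that index at all, which nobody here knows.

```python
# Figures taken from the posts above; the average page count is a guess.
books = 30_000_000
avg_pages_per_book = 333

total_pages = books * avg_pages_per_book
print(f"{total_pages:,} pages, about {total_pages / 1e9:.2f} billion")

# Estimated range for Google's index from the report quoted earlier in the thread.
index_low, index_high = 500_000_000_000, 600_000_000_000

# Share of the index that book pages would represent, IF they were counted in it.
share_of_low = total_pages / index_low
share_of_high = total_pages / index_high
print(f"Between {share_of_high:.1%} and {share_of_low:.1%} of the estimated index")
```

So even on generous assumptions, Books would be around 2% of the estimated total.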

Seriously, I think Google Books is one of the best, most useful things Google has done. The only shame was that the state of OCR was poor when they started. But even with their back catalog, they seem to rescan it frequently. It still struggles with figuring out things like two-column displays. So sometimes I will get excited to find a couple of references to the odd "word1 word2" phrase I've typed in, only to find that word1 is at the right of one column and word2 is at the left of the next, but in the logical flow of the text they are hundreds of words apart.
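
The two-column problem described above can be reproduced with a toy example. This is a sketch with made-up text, not a description of how Google Books actually processes pages: it only shows why reading a two-column page straight across creates phrase matches that do not exist in the logical reading order.

```python
# Made-up two-column page: each list holds the lines of one column.
left_column = ["the quick brown", "fox jumps over", "the lazy dog"]
right_column = ["lorem ipsum dolor", "sit amet elit", "sed do tempor"]

def read_as_single_wide_column(left, right):
    """Faulty reading: each left-hand line runs straight into the line beside it."""
    return " ".join(f"{left_line} {right_line}"
                    for left_line, right_line in zip(left, right))

def read_column_by_column(left, right):
    """Logical reading: the whole left column, then the whole right column."""
    return " ".join(left + right)

naive = read_as_single_wide_column(left_column, right_column)
logical = read_column_by_column(left_column, right_column)

phrase = "brown lorem"
print(phrase in naive)    # the phrase "exists" when columns are merged
print(phrase in logical)  # but not in the real flow of the text
```

In the merged reading the last word of a left-hand line sits next to the first word of the right-hand line, so a phrase search hits even though the two words are far apart in the actual text.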

Still, it has literally made millions of books from the 1800s and earlier searchable, most of which don't even have indexes even if you have access to the physical book. It has literally transformed historical scholarship.

As one anecdote - there is a phrase I have tried to find a good definition for since 1998 when I first assigned a text to my students. I or a student found rough analogs, but never a good definition. This year we tried again. Either Google added some books or, more likely, they rescanned some books and improved the OCR. We found a nice definition in a tri-lingual (French, German, Latin) medical dictionary from 1664 and, from there, in a couple other medical dictionaries from 1659 and 1694. All without leaving our desks. That was unthinkable to a scholar in 1990.
Title: Re: Mojeek Surpasses 5 Billion Pages | Mojeek Blog
Post by: BoL on March 16, 2022, 05:12:43 PM
ergo,

Definitely an unprecedented task and feature they can offer, thanks for the context. I'd agree it's one of the better things they've done but not something I have used much. Sounds more like their 'organizing the world's information' ideal rather than the 'do no evil' one.

RE: OCR, I'd guess they have the means to correct it nowadays but whether it's a priority or not... it'd be good to sort.

brad
>Mojeek is a great proof of concept

Agree, there is a bit of hassle getting bot access nowadays because of all the other ones and their own agendas... but proof enough you don't need 10 figure numbers to get things up and running.
Title: Re: Mojeek Surpasses 5 Billion Pages | Mojeek Blog
Post by: ergophobe on March 16, 2022, 05:50:52 PM
Quote from: BoL on March 16, 2022, 05:12:43 PM
RE: OCR, I'd guess they have the means to correct it nowadays but whether it's a priority or not... it'd be good to sort.

I don't know as it is a priority, but I have seen massive improvements. At first, you could use GB to find old books, but not text within old books. Then it started to improve and it just got better and better. By the time I stopped being a full-time historian in 2012, I was often doing hundreds of searches per day in GB and the results seemed to be getting better and better. By that I mean that formerly as often as not when you clicked through to the image, the "hit" would have been due to a faulty transcription. Over time that became more a problem with edge cases. I do way way less searching in GB now, so I do not have the same feel for it, but I certainly feel that the text searching has continued to improve over the last ten years.

The edge cases are still a challenge, but rightly so. I'm thinking of things where an old text uses a "u" where we use a "v" so the word is spelled "enuironment" and, of course, your search for "environment" fails. Or, as I mentioned, multi-column texts where the columns are close together and GB parses it as a single wide column.
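
One common way to soften the u/v problem is to fold both letters to a single form in the query and in the text before comparing. The sketch below assumes that approach purely for illustration (there is no indication Google Books does this), using an old spelling such as "enuironment"; the function names are hypothetical. Other historical pairs such as i/j could be added to the table the same way.

```python
# Fold "u" to "v" so that old and modern spellings compare equal.
# This is an illustrative choice; it also merges words that really differ.
FOLD_TABLE = str.maketrans({"u": "v"})

def fold_spelling(text):
    """Lowercase the text and fold u/v into one letter."""
    return text.lower().translate(FOLD_TABLE)

def plain_match(query, text):
    """Ordinary case-insensitive substring search."""
    return query.lower() in text.lower()

def folded_match(query, text):
    """Substring search after folding both sides."""
    return fold_spelling(query) in fold_spelling(text)

old_text = "the enuironment of the towne"

print(plain_match("environment", old_text))   # plain search misses it
print(folded_match("environment", old_text))  # folded search finds it
```

The trade-off is precision: after folding, any two words that differ only by u versus v become indistinguishable.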

Probably not a priority, but a backwater side project for Google might have as many people working on it as other whole companies.

>>'organizing the world's information'

That may be so, but I always thought there were other reasons for the GB effort. One is that Google was trying to decode how to recognize quality. The advantage of the GB corpus is that they know for certain that almost all of that text has gone through a traditional editorial process.

So I have wondered whether the project exists and the OCR transcriptions keep improving because it's an excellent training environment where, in general, you know that text is grammatical, spelled correctly and so forth.
Title: Re: Mojeek Surpasses 5 Billion Pages | Mojeek Blog
Post by: BoL on March 16, 2022, 06:11:18 PM
>The edge cases are still a challenge, but rightly so. I'm thinking of things where an old text uses a "u" where we use a "v" so the word is spelled "enuironment" and, of course, your search for "environment" fails. Or, as I mentioned, multi-column texts where the columns are close together and GB parses it as a single wide column.

'old' definitely could be an issue for them, as well as many languages having 'legacy' scripts before being normalised into latin and such. The time angle definitely means moving goalposts. Still, if anyone has the resources to suss it, they do.

Wonder if they have any theories about language use pre-internet vs now, maybe even if it's a ranking factor. Probably need to discount emojis in that TF/IDF comparison!

Unprecedented though isn't it, the fact you're able to search through those texts within the last couple of decades. 
Title: Re: Mojeek Surpasses 5 Billion Pages | Mojeek Blog
Post by: ergophobe on March 16, 2022, 09:30:17 PM
>>Unprecedented though

Now that I've thoroughly derailed this into a Google Books convo... yes, it is unprecedented and I see it more and more as my grad students are search natives. The big shift was 2016, which would mean on average students born around 1990.

In 1998, I felt compelled to actually have a full one-hour lecture showing grad students the internet. Of course they knew you could buy things on the internet and find porn, but most had completely discounted it as a research tool. I knew one Master's student in the early 1990s who still did not know how to use the computer catalog. By 2000, it was obvious that the internet was important and students had only vague memories of the card catalog. But search was not a native behavior.

From 1998 to 2012, I would have a brief intro to regular expressions. I would show them that if they had all their notes saved in plain text and had a GREP engine, they could use regular expressions to find things they could never find with normal search. There were also dictionaries and other resources coming online, mostly from France for whatever reason, that allowed regex searches. Ho hum. No interest.
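
To make the plain-text-notes idea concrete, here is a small sketch using Python's `re` module in place of a GREP engine. The notes and the pattern are invented for illustration: the pattern catches several spelling variants of one word that a normal literal search would need three separate queries to find.

```python
import re

# Made-up research notes, one per line, as they might sit in a plain-text file.
notes = [
    "fol. 12: the dictionarie gives a short definition",
    "Compare the Dictionary entry, which is longer",
    "No entry in the dictionnaire for this phrase",
    "Unrelated note about travel costs",
]

# "diction", an optional second "n", then "a" and one of three endings.
pattern = re.compile(r"\bdictionn?a(?:ry|rie|ire)\b", re.IGNORECASE)

hits = [line for line in notes if pattern.search(line)]

for line in hits:
    print(line)
```

A literal search for "dictionary" would find only the second note; the pattern finds the first three and skips the fourth.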

In 2014, students were intrigued, but didn't dive in. In 2016, though, they were blown away and made regex part of their regular workflow. This year, one of the students who does some farming too, had a bull born during the course and named her new bull Regex.

All that to say that it's not just that it's unprecedented for older scholars like me. I'm blown away by the possibilities. But more importantly, I see that search as a native behavior has changed the way my students think and the way they approach problems.

I was an outlier in history because of my connection with WMW and the SEO worlds, my undergrad computer science training and so forth. All very odd for a historian. That meant that for many years, I was just way way more savvy than students even 20 years younger than me.

Now that's not so. It was one of my students who found the Google Books reference to the troublesome phrase, screenshotted it and posted it to the course Slack channel. Those search behaviors are now more comfortable to them than they are to me. What's unfamiliar to them is the more old school wandering through a library pulling books off the shelf at random.

Anyway, quite far off topic, but the point being that it's not just the information that's unprecedented, but the style of thinking and the way of approaching problems.
Title: Re: Mojeek Surpasses 5 Billion Pages | Mojeek Blog
Post by: BoL on March 16, 2022, 10:00:01 PM
> I was an outlier in history because of my connection with WMW and the SEO worlds, my undergrad computer science training and so forth. All very odd for a historian. That meant that for many years, I was just way way more savvy than students even 20 years younger than me.

Handy knowledge to have and to be fair some of that knowledge is now obsolete, not like many 'webmasters' have to worry about CSS (with a theme) or FTP. Probably still matters to an extent from the search crawler PoV. Tbf I watch a lot of physics videos and they have physics ambassadors to the wider community, academia needs to have wider skills to ensure you reach the wider community?

>Now that's not so. It was one of my students who found the Google Books reference to the troublesome phrase, screenshotted it and posted it to the course Slack channel. Those search behaviors are now more comfortable to them than they are to me. What's unfamiliar to them is the more old school wandering through a library pulling books off the shelf at random.

There's definitely a generational shift in how people find information. Recall a thread (maybe also on here) about how filesystems seemed like a foreign concept to the younger generation. I'd partly attribute it to them not having the cruft that we (maybe) no longer need, like the search engine dealing with synonyms. Depends on the search though.

>Anyway, quite far off topic, but the point being that it's not just the information that's unprecedented, but the style of thinking and the way of approaching problems.

Yes, the fact that all options are available hopefully. As frustrating as synonym crunching can be, it's helpful, as is verbatim searching. I mention that because >95% searches are done on Google and that is the main complaint I see. 'Each to their own' seems like a decent mantra, as long as the information can be found in a reasonable manner it's there to be found... which circles back to that SEO stuff : )

As an added thing since I've been working on language detection stuff, there doesn't seem to be a single site on this planet that can simply tell me the alphabets of each language, even wiki pages have alphabets split into different sections, other websites have them as images. It's 2022 and it seems like a reasonable objective.
Title: Re: Mojeek Surpasses 5 Billion Pages | Mojeek Blog
Post by: ergophobe on March 17, 2022, 04:28:32 AM
So you need something like https://fileformat.info but for alphabets instead of charsets?

From earlier comments, I think you mean for cases where you don't have language stipulated in the headers or document and you're trying to detect based on intrinsic factors, right?

https://www.sciencedirect.com/science/article/pii/S1742287612000291
https://help.relativity.com/RelativityOne/Content/Relativity/Analytics/Language_identification.htm


>> some of that knowledge is now obsolete

Ha ha! It's almost all obsolete! But very few historians would have ever worked with regular expressions or relational databases or things like that, which gave me a whole set of tools that were uncommon in the field.
Title: Re: Mojeek Surpasses 5 Billion Pages | Mojeek Blog
Post by: BoL on March 17, 2022, 07:54:03 AM
I'm using an n-gram method, but also the training data (has to be of the right kind of license) sometimes has mixed language. Knowing a language's alphabet/orthography helps ignore or weed out the problem. omniglot.com looks pretty solid but a lot of their data of interest is in images and partially in spreadsheets. It isn't a huge deal, just bouncing between various sources when it seems there could've been one.
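
For readers unfamiliar with the approach, here is a minimal sketch of character-trigram language identification with an alphabet filter on the training data. It is not Mojeek's implementation: the training sentences are made up and tiny, the alphabets are hand-typed and incomplete, and cosine similarity is just one possible way to compare profiles. All names are hypothetical.

```python
from collections import Counter
from math import sqrt
import string

# Hand-typed, simplified alphabets (an assumption; real orthographies vary).
ALPHABETS = {
    "english": set(string.ascii_lowercase),
    "dutch": set(string.ascii_lowercase) | set("ëéèïöü"),
}

# Made-up training data; the last English line is deliberately mixed-script.
TRAINING = {
    "english": [
        "the quick brown fox jumps over the lazy dog",
        "this is a short sentence in the english language",
        "hello and привет to everyone",
    ],
    "dutch": [
        "de snelle bruine vos springt over de luie hond",
        "dit is een korte zin in de nederlandse taal",
    ],
}

def fits_alphabet(line, alphabet):
    """True if every letter in the line belongs to the given alphabet."""
    return all(ch in alphabet for ch in line.lower() if ch.isalpha())

def trigrams(text):
    """Count overlapping three-character sequences in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def build_profile(lines, alphabet):
    """Sum trigram counts over training lines that pass the alphabet filter."""
    profile = Counter()
    for line in lines:
        if fits_alphabet(line, alphabet):
            profile.update(trigrams(line))
    return profile

def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

PROFILES = {lang: build_profile(lines, ALPHABETS[lang])
            for lang, lines in TRAINING.items()}

def detect(text):
    """Return the language whose profile is most similar to the text."""
    sample = trigrams(text)
    return max(PROFILES, key=lambda lang: cosine(sample, PROFILES[lang]))

print(detect("the dog is in the english language"))
print(detect("de vos is een hond"))
```

The alphabet filter drops the mixed-script line before it can pollute the English profile. It does nothing for pairs that share an alphabet, which is where the trigram statistics have to do all the work.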

>set of tools

Programming-like tools are great for so many fields
Title: Re: Mojeek Surpasses 5 Billion Pages | Mojeek Blog
Post by: ergophobe on March 25, 2022, 03:12:40 AM
I think with 5 billion pages indexed, it's time for a subtle language shift. This is a recent reply comment to two visitor questions on my blog