Brave Search Index at 10 Billion

Started by Brad, May 09, 2022, 01:20:51 PM

Brad

Interesting interview with Brave Search

https://dkb.io/post/brave-search-interview

1. Brave claims 10B+ pages in its index.  They don't say what language those pages are in.  Brave Search is the successor to the Cliqz index.  Cliqz was a German company, so I'm guessing many of those pages are non-English.

2. Brave discovers new pages via an opt-in discovery program available only to Brave browser users.  So it sounds like Brave is not following links.

3. Brave does not want a huge index and they ain't going to get it with just a select bunch of human finders.  Brave is a shallow engine, not good for long tail.

4. Brave search uses a backfill blended with their own search index.  They use Bing as backfill for web based searches and Google backfill (opt-in only) for searches from within the Brave Browser.

I'm not sure what I think about this.  Anyway it was a good interview.  (Not sure where I found this.)

BoL

Good interview.

Impressive getting a 10bn index without a named crawler.

>Google backfill (opt-in only) for searches from within the Brave Browser.

Managed to replicate this yesterday.
- I think you have to turn on 'Web Discovery Project' in Brave browser.
- Turn on `brave://net-export/` to monitor the network
- Do a quoted search for something you know isn't in their index
- Turn on Google backfill when asked on Brave Search

You can see in the network log that it simply calls Google and fetches the results. It'll also call Brave's servers, perhaps just to store the query/serp (like Cliqz did) or maybe even to crawl it later.
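For anyone who would rather not eyeball the log, here is a rough sketch of how the saved net-export file could be checked for Google requests. It assumes the export is a JSON file with a top-level "events" list whose entries may carry a "params" object with a "url" string; check that against your own capture. The function names and the sample data are made up for illustration, not taken from a real log.

```python
import json
from collections import Counter
from urllib.parse import urlsplit


def hosts_in_netlog(log):
    """Count the hostname of every URL found in a parsed net-export log.

    Assumes log["events"] is a list of dicts and that request events
    carry their URL in event["params"]["url"] (an assumption, see above).
    """
    counts = Counter()
    for event in log.get("events", []):
        params = event.get("params")
        if not isinstance(params, dict):
            continue  # many events have no params at all
        url = params.get("url")
        if isinstance(url, str):
            host = urlsplit(url).hostname
            if host:
                counts[host] += 1
    return counts


def backfill_hosts(counts, needle="google."):
    """Keep only the hosts whose name contains the needle."""
    return {host: n for host, n in counts.items() if needle in host}


# Hand-made sample in the assumed shape, NOT a real capture.
sample = {
    "events": [
        {"params": {"url": "https://search.brave.com/search?q=test"}},
        {"params": {"url": "https://www.google.com/search?q=test"}},
        {"params": {"url": "https://www.google.com/search?q=test"}},
        {"params": {"load_flags": 0}},
        {"type": 1},
    ]
}

counts = hosts_in_netlog(sample)
google_hits = backfill_hosts(counts)
print(google_hits)
```

On a real capture you would swap the sample for `json.load(open(path))`, with `path` pointing at wherever you saved the export.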

Brad

I wasn't sure what I thought of Brave Search before.  A lot of that comes from my not understanding how their ranking system works.

But thinking again, as the saying goes, "a thousand roads lead to Rome."  The result seems to be a viable search engine for everyday consumer users.  It gives you the sort of first couple of pages of serps that you would expect from Bing or Google without being either.  Using backfill does not bother me; lots of old time search portals used backfill.  The fact that they do have their own index is a good thing for the Web and there is nothing wrong with trying something different.

It's not a deep research engine but it does not intend to be.

They must be relying on adoption of their browser for growth.

BoL

>browser for growth

I think initially it was the BAT crowd, but they've probably picked up a lot of new users with that DDG drama wrt filtering. Their results are a lot like Google's in places.

IMO all things considered they're a better proposition than DDG and the other metas, just not sure on their claims of independence.

ergophobe

BoL - I know you're not a neutral observer, but do you see the metas as a viable strategy for chipping away at the dominance of the megas (Google and Bing, maybe more of a giga and a mega)?

Does using a meta that mostly feeds with Bing data and, in this case (Brave search with Brave browser), even backfills with Google data really contribute to diversifying the search ecosystem?

It reminds me of a time a couple years ago when I had two friends in the same month smugly tell me that they hate Amazon, so they refuse to buy from them and buy all their books on Abe instead to support competition. They were aghast when I told them that Abe has been a fully owned subsidiary of Amazon since 2008.

Obviously, that's different. Google doesn't own DDG, but is DDG actually competition for Bing and Google? The second MS decides that DDG hurts Bing more than it hurts Google, don't they just cut it off? So I don't really see it as having a trajectory that can seriously impact the dominance of the large players.

And finally... one of the articles linked here recently proposed that the competition to Google would not likely be a general purpose search engine but a series of niche engines targeted at one hobby or industry. The obvious question, though, is when a search engine is that niche, how does anyone find the search engine? I feel like on some level there are thousands of such search engines already. But I usually find them via Google search.

Brad

EG - You might find this interesting.

A look at search engines with their own indexes
https://seirdy.one/2021/03/10/search-engines-with-own-indexes.html

BoL

>contribute to diversifying the search ecosystem?

Hard to quantify I guess, each of the non G/B ones have unique propositions, even if it's simply a Bing reskin that plants trees.

DDG are quite sleek with their marketing. IIRC they still don't have a single organic listing of their own, and their '400 sources' relate to their instant answers, which were previously open sourced and then brought in house.

Brave are interesting and, as the article alludes to, a bit more transparent, though perhaps not when it comes to crawling 10 billion pages without an announced user agent.

True independents face the largest task, as the article mentions. Results can be of a poorer standard (a lot poorer in places, for sure), but they're offering an alternate window.

Brave/DDG have been good at highlighting the dominance of Google and raising awareness, so I'd lean towards yes, they're a force for good for diversity, even if they're not directly offering it. I'm usually just knee deep in coding; the link Brad shared comes from a guy that has a good handle on most things search.


ergophobe

Thanks guys.

I suspect that when Google search dominance collapses it will happen, as is often the case with revolutions, slowly, then all at once.

"How did you go bankrupt?" Bill asked.

"Two ways," Mike said. "Gradually and then suddenly."

"What brought it on?"

"Friends," said Mike. "I had a lot of friends.

  -- Ernest Hemingway. The Sun Also Rises (1926)

Brad

>diversifying the search ecosystem

A bit.  With DDG it got a lot of people searching with something other than Google.  But DDG's biggest contribution was around privacy. DDG helped by changing the conversation and hitting Google where they were vulnerable and not letting go.

>just cut it off

Yeah.  DDG is not a long term solution. 

>niche engines

I think it's going to be a combination of new general engines with their own indexes.  These will be varied: everything from Mojeek spidering everything it can to setups like Brave cherry picking and providing smaller indexes with backfill.  I can see some of these using Mojeek as backfill someday like they use Bing now.

But niche engines will be in there too.  Indexes of non-commercial pages could become a popular niche.  I can see non-commercial engines like Magnolia search providing an API to engines like Brave.  Also engines you use more for surfing the web. 

So my guess is it will be both general and niche engines together.  The cat is out of the bag though.  5 years ago nobody was launching a new search engine with its own index.   Today, lots of people are trying their hand at building search engines and more crop up every day.

rcjordan

When moving to FF, I used Brave search as content for new tabs because G was putting portal crap on their search page.  As it happens, this gives me dual search options: Brave if the search is shallow, G (using the action bar) if deep.  Handy way to wean myself off G a bit.

Brad

Using Brave search is profoundly better than using Google for the health of the web.  But Brave is still dependent on Google, not just for backfill, but for reverse engineering sort order in the serps.  If Google ever went down Brave would have nothing to copy.   And since they are down-stream of Google there will always be a delay before Brave can catch up to changes in Google's serp.

This is just me, but I don't think Brave has a real crawler like Googlebot, Bingbot and Mojeekbot.  I think it's more of a fetcher; it just fetches pages from Google serps.  It's all just a bit dodgy.

Still, with that said, they have reverse engineered something that is usable with an attractive serp.  I use it as a backup engine.

rcjordan

I'll switch the tab to mojeek and give BOL a thrill.

BoL

Feel free to switch around! Brave & Mojeek have links to each other so it's painless to try both.

Brave's index is remarkably similar to Google's. Probably a huge coincidence...

Brad


BoL

>Remarkably

Definitely interesting. Wonder what the science says about two totally independent search engines producing almost identical results.

Of course there's the Cliqz historical data, which is tied to G as you know, and the backfill thing that they do with Brave browsers scraping G results.

Actually surprised no one has come out and called foul on that. Or on their crawl blocking being based on Googlebot rules. Guess the mainstream tech crowd have no hope of anyone toppling or even competing with G / Bing.
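On the "almost identical results" point, one rough way to put a number on it is to take the top results from two engines for the same query and measure how much they overlap. The sketch below is only an illustration: the function names are made up, the URLs are placeholders rather than real results, and plain set overlap ignores ranking order, so treat it as a first pass rather than a proper comparison.

```python
def jaccard(results_a, results_b):
    """Share of distinct URLs that appear in both result lists."""
    set_a, set_b = set(results_a), set(results_b)
    if not set_a and not set_b:
        return 1.0  # two empty lists are trivially identical
    return len(set_a & set_b) / len(set_a | set_b)


def overlap_at_k(results_a, results_b, k=10):
    """Fraction of the top-k slots in which the two engines share a URL."""
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    return len(top_a & top_b) / k


# Placeholder result lists for one query, NOT real engine output.
engine_one = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
    "https://example.com/d",
]
engine_two = [
    "https://example.com/a",
    "https://example.com/c",
    "https://example.com/e",
    "https://example.com/f",
]

print(jaccard(engine_one, engine_two))
print(overlap_at_k(engine_one, engine_two, k=4))
```

You would want this averaged over a lot of queries, long tail ones especially, before reading anything into it.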