The truth is my capabilities end around HTML 3. I'd love to make a search engine but it's beyond me.
I look at these posts as Idea Virus posts: maybe somebody with more money or more skill will come along, think about this, and follow through. My goal with this one was to point out some off-the-shelf resources that seem to be lying around, plus a few thoughts about what they might be used for.
My second goal with the DIY search service, which I didn't explain very well, is part of "Brad's Ongoing Guerrilla Insurgency Against Google" (BOGIAG*).
Webmasters helped Google get started by putting Google search boxes on their websites. My thought was: "What if every blogger/webmaster put up a search box for a non-Google engine?" Then, progressing, "What if every webmaster could put together their own modular search engine, with few being exactly alike (combined spider, directory, RSS engine?), and put them on their websites?" Then, "Well, what if one could provide a service to make it easy for webmasters to put together a modular search engine, a bit like Rollyo, a bit like Eurekster, but better, and would anyone use it?"
Hence the post.
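To make the "modular" part a little more concrete, here is a rough sketch of one way the pieces could be glued together: take the ranked results from each module (spider, directory, RSS) and interleave them round-robin, dropping duplicate URLs. Everything in it is an assumption for illustration only: the `merge_results` name, the example URLs, and the choice of round-robin merging. It is not how Rollyo, Eurekster, or any real engine works.

```python
from itertools import zip_longest


def merge_results(*sources):
    """Interleave ranked result lists round-robin, skipping duplicate URLs.

    Each source is a list of URLs, best result first. This is a
    hypothetical merging rule, chosen only because it is simple.
    """
    seen = set()
    merged = []
    # zip_longest pads shorter lists with None so no source is cut off.
    for tier in zip_longest(*sources):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Made-up results from three made-up modules.
spider = ["https://a.example/", "https://b.example/"]
directory = ["https://b.example/", "https://c.example/"]
feeds = ["https://d.example/"]

merged = merge_results(spider, directory, feeds)
```

Because each webmaster could pick different modules and a different merge rule, few of these engines would end up exactly alike, which is the point.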
>>size of the database
I think you are right, LM. I'm in the middle of a three-week test of Mojeek.com as my default engine. On long multi-word queries it sometimes fails, but it keeps surprising me. With DuckDuckGo and Bing-based engines, I pretty much know which trusted sites Bing will bring up for reviews and best-of tech lists. But with Mojeek, I'm getting some real gems out of what would be "the long tail" on a major search engine. I'm kinda amazed.
Aside
*BOGIAG is utilizing lots of tiny elements to get around Google as Gatekeeper, mainly for blogs. These include: Indieweb.org elements like webmentions, syndication to social media for traffic, RSS, old-time blogrolls, curated micro-directories, maybe webrings, site searches to link several of our domains together, search boxes from anyone other than Google, search feeds, maybe one exclusive subject category on our blogs that Google and only Google is excluded from in robots.txt, etc. Anything that is cheap, easy, off the shelf, low risk and kinda fun. Sounds a little bit crazy to us, but to the younger set not so crazy. They like the idea of reviving a retro-web of many search engines, many directories, blogrolls - anything to break the monopolies.
/Aside
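For the robots.txt item in the aside, a minimal sketch of what "Google and only Google is excluded" from one category might look like, checked with Python's standard `urllib.robotparser`. The `/no-google/` path and the `example.com` domain are made up for illustration, and this assumes Google's main crawler identifies itself as `Googlebot`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot is kept out of one category,
# every other crawler may fetch everything.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /no-google/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if `agent` may fetch `path` under the rules above."""
    return parser.can_fetch(agent, "https://example.com" + path)


google_blocked = not allowed("Googlebot", "/no-google/some-post.html")
google_elsewhere = allowed("Googlebot", "/other/some-post.html")
others_allowed = allowed("Bingbot", "/no-google/some-post.html")
```

Keep in mind that robots.txt is only a request that well-behaved crawlers honor, so this keeps the category out of a compliant crawler's reach rather than locking it down.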