AI search engines fail accuracy test, study finds 60% error rate | TechSpot

Started by rcjordan, March 12, 2025, 11:33:25 PM

ergophobe

My hopes of accurately sourcing all those apocryphal Twain quotes are waning.

Who knows; something to revisit in 10 years.

And realistically, at this point, the tasks they gave in the article are not things I would use an LLM for. On the other hand, asking it to expand an article on wellness retreats or asking it to provide three ideas for expanding said article worked really well. Since, personally, I'm offended by the BS language used by 99% of spas and wellness retreats and ChatGPT and Claude have no such scruples, it was better than what I could write.

The 1min.ai "content expander" worked quite well for this purpose. But, again, the purpose was to spew out flowery BS, not to provide accurate citations on research science.

rcjordan

I've not used AI for searching through sites, but 1min.ai's "Summarizer" selection of bots did a very good job bulletizing the medical articles I gave them. Note, however, that I sent the bots to individual articles. I've yet to try assigning one of the bots 3-4 articles and prompting it to provide a merged summary.


Having done this type of medical research before, I can vouch that I completed this project in approx 1/10th the time it would have taken to do it the old, manual way.

ergophobe

>> very good job bulletizing the medical articles

Yup. I'm not a scientist, but I'm pretty good at reading that type of stuff. Still, if I just want the general idea or have a specific question I want answered about an article ("Does this article say that A is effective for X?"), a first-pass read with an LLM typically works quite well.

The more constrained the context and the fewer steps involved, the better (if you remember my experiments seeing how well the chatbots do with very simple but multi-step math of raising a number to a power)

Rupert

I have been struggling with some basic data manipulation.  I take a photo of some stats, turn it into text, and my phone spits it out as 2 arrays, one for each column.

I ask AI (ChatGPT or Claude so far) to align, trim, and sort the 2 arrays. I kept having to go back with "the last 5 names are not alphabetically sorted", "you seem to have missed 3 names", or "you have mixed up the data, please refer back to the original array."

I now ask it to align and trim, but not sort. Google spreadsheets does that correctly every time. I have twice asked it to tell me how to ask for the end result, and next time, it still gets it wrong.

Very disillusioned with the ability of AI. A 5 year old could do better. I have spent longer making it work than I would have spent just typing them out.

This was the last command it told me to use:

"Please match these two lists in their original order, then sort the names by surname. Format them as 'First Name, Surname, Value' without currency symbols or decimals, so I can paste into Google Sheets. Ensure no names are missing."


"Please match these two lists in their original order (do not sort by value):
List 1: [Names in order]
List 2: [Values in order]
Please maintain the exact order of names as shown in List 1 and pair each name with the corresponding value from List 2 in sequence."


So it repeats itself in part.....
... Make sure you live before you die.

ergophobe

It's best to think of an LLM as a being with vast knowledge and a super short attention span.

This goes back to the tests I posted where I was trying to get chatbots to do simple math like 1.1^18 and the errors were enormous.

The problem I listed above has 18 steps, which should be fine, plus it's floating point math, which is a challenge for computers. LLM chatbots have more and more problems as the number of steps increases, even if each step is super simple: trivial for a math library in any high-level programming language (though very challenging in assembly language), or for an eighth grader given enough time and paper. This gets better if you have access to the more expensive models, by the way. One of the main differences is the ability to stay on task through many steps without getting distracted (aka hallucinating).
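For reference, here is a minimal Python sketch of the same calculation done procedurally. The function and variable names are mine and purely illustrative. It runs the 18 multiplications in a loop, once in floating point and once in exact rational arithmetic, so the two results can be compared.

```python
from fractions import Fraction

def power_by_repeated_multiplication(base, exponent):
    """Multiply `base` into a running product `exponent` times."""
    result = 1
    for _ in range(exponent):
        result = result * base
    return result

# Floating point: fast, with a tiny rounding error possible at each step.
approx = power_by_repeated_multiplication(1.1, 18)

# Exact rational arithmetic: 1.1 is held as 11/10, so nothing is rounded.
exact = power_by_repeated_multiplication(Fraction(11, 10), 18)

print(round(approx, 6))  # 5.559917
print(float(exact))
```

The floating point error here is far too small to explain the enormous errors I was seeing from the chatbots.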

Sorting a list is also a lot of steps. With a poor algorithm, a sort quickly eats up time (and sometimes memory) in traditional computing as the list grows. It seems simple because Excel can do it or a five year old can do it (I think you're exaggerating that last one - I didn't know my alphabet, let alone how to sort a list alphabetically, when I was five). Excel probably has a very good sorting algorithm built in, but the LLMs probably don't.
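To show how little is involved on the procedural side, here is a minimal Python sketch with made-up names in "Surname, First name" form. Python's built-in `sorted` is only a stand-in here; I don't know which algorithm Excel or Google Sheets actually uses.

```python
# Hypothetical sample data in "Surname, First name" form.
names = ["Smith, John", "Arupa, Vikram", "jones, Amy", "Baker, Tom"]

def surname_key(name):
    """Return the surname, trimmed and case-folded, to use as the sort key."""
    surname = name.split(",")[0]
    return surname.strip().casefold()

sorted_names = sorted(names, key=surname_key)

# A sort should never drop entries, so check the count as well.
assert len(sorted_names) == len(names)

print(sorted_names)
# ['Arupa, Vikram', 'Baker, Tom', 'jones, Amy', 'Smith, John']
```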

Looking at your prompt, that's a mouthful and pretty conceptually complex. I think you would get much better results with more stepwise prompts that make fewer assumptions about what the LLM understands (like that it understands the word "match" in the same way you do - I think that is potentially open to many interpretations; if you think of a matching game where you, say, match the French word to the English word, one of the lists has to get resorted to achieve that).

This seems like a job for Excel, but if you want to keep trying with an LLM, I would give it a very direct description of the data sets, how you want to combine them, then an example.

My goal would be a big, long, rather specific prompt like this:

"The first list is a list of names of people. The second list is a list of the value of the person. First remove the currency symbols from the values list. Then taking each list in order, create a new list. The first row of the new list should have the first entry from the name list in the left column and the first entry from the value list in the right column. Repeat this for all names and values in order. So the first row of the new list should have "Smith, John" in the left column and "123" in the right column, and the second row should have "Arupa, Vikram" in the left column and "456" in the right column."

You would get to a working long prompt by breaking it into small steps and checking the result.

So first pass would be something like:
"There are two lists below. The first list is a list of names of people. The second list is a list of the value of the person. First remove the currency symbols from the values list then show the new lists of names and values."

See if the above will print both lists without modifying the names but with the currency symbols removed from the values. When that works, add a micro step.

That said, I have gotten some excellent results this way, but I have also gone down rabbit holes where I simply could not get a reasonable answer, and eventually the prompt got so detailed and complicated that the results started getting worse.
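For comparison, the pairing step that the long prompt describes is a few lines of procedural code. This is a minimal Python sketch, assuming the two lists have already been pasted in from the phone. The third name, all of the values, and the helper name `clean_value` are hypothetical.

```python
import re

# Hypothetical sample data; real lists would come from the phone's text output.
names = ["Smith, John", "Arupa, Vikram", "  Jones, Amy "]
values = ["£123.00", "$456", " 78.50"]

def clean_value(raw):
    """Drop currency symbols and decimals, keeping the whole-number part."""
    digits = re.sub(r"[^0-9.]", "", raw)  # keep digits and the decimal point
    return digits.split(".")[0]

# Refuse to continue if an entry went missing on the way in.
if len(names) != len(values):
    raise ValueError(f"{len(names)} names but {len(values)} values")

# Pair entries strictly in their original order; nothing is sorted here.
rows = [(name.strip(), clean_value(value)) for name, value in zip(names, values)]

for name, value in rows:
    print(f"{name}\t{value}")  # tab-separated, so it pastes into a spreadsheet
```

The length check is the part the chatbots kept getting wrong: with it, a dropped name is an error instead of a silent gap.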

Rupert

Ok, so I am not alone. It makes me worry how it can take over the driving of a car successfully though, if it starts hallucinating so quickly.

And yes, probably exaggerating with 5 years old... there are some bright kids about though :)

I agree, breaking it down into simple requests, one at a time, does seem to work best. I don't understand why it would struggle with a list. I would expect AI to use an Excel-type function for a command "sort", much as I do. That is definitely where it repeatedly fails.

And as for dropping values completely, it's still a computer, it should not do that.

I clearly expect too much from it, and expect it to still be able to do what we now consider basic computing. How can it manage big data sets if it loses a large percentage of the data? (my example is 3 dropped out of 42 total, not exactly a huge dataset)

rcjordan

OTOH, my buddy/mentor Dave sent me this last week  ...I'm gobsmacked.

====
I started messing with Grok for purely childish political reasons.

On the van I stripped a bolt that holds down the injector. Forums, Facebook, many conversations, lots of bad answers. There was a TimeSert kit, but it's no longer made as a kit, so you have to get the tools portion of the kit and then source the inserts separately. I managed to work through all of this. It took the better part of the day.

Then I asked Grok:

which time-sert kit do i use to repair injector hold down on Mercedes OM647 engine?

This is the correct answer:

To repair the injector hold-down bolt threads on a Mercedes OM647 engine, you should use the TIME-SERT Injector Hold Down Bolt Repair Kit specifically designed for the OM647, OM648, and OM642 engines. This kit is widely recommended for addressing stripped or damaged M6 threads, which are common in the injector hold-down bolts of these engines.

The specific TIME-SERT insert size for this repair is the M6x1.0x17.0 (TIME-SERT part number TS16107). This insert matches the fine M6 thread used in the OM647 engine's injector hold-down bolt bores. The kit typically includes the necessary tools—such as a drill bit, tap, and insert driver—along with the inserts themselves, ensuring a precise and durable repair.

You can find this kit from retailers like IDParts.com, where it's listed as the "Injector Hold Down Bolt Repair Kit (OM647)(OM648)(OM642)." It's tailored for the OM647 engine (among others), which is a 2.7L inline-5 diesel commonly found in vehicles like the T1N Sprinter and certain Mercedes models. Be sure to confirm the kit includes the M6x1.0x17.0 inserts, as this is the critical specification for the OM647.

For best results, follow the TIME-SERT installation process carefully, ensuring the damaged threads are properly drilled and tapped to accept the insert. This repair is preferred over alternatives like Helicoil for high-stress applications like injector bolts, as TIME-SERT provides a stronger, more reliable thread.

ergophobe

Rupert, I think in some way, RC's post is your answer.

What the LLMs do well is collate and connect information. In the world of biology, they sometimes talk of "associative intelligence."

When they sent a chimp into space, the systems went haywire and the chimp was able to successfully pilot the craft back to earth. The story is told in Roger Fouts's amazing book, Next of Kin, a top-10 non-fiction read for me, and at https://en.wikipedia.org/wiki/Ham_(chimpanzee)

What's the point? The point is that a chimp performed in a way that probably was better than what a human could have done, but could not possibly put a list in alphabetical order.

I think that you are looking at cognitive skills from the point of view of a human with access to procedural software. The thing is, humans with access to procedural software are still better at tasks optimized for humans with access to procedural software.

So it's only in your mind that a sort in Excel requires a high level of cognition and driving a car does not, by which I mean to say that it is not baked into the nature of cognition, but based on a perception of cognition.

>>  I would expect AI to use an excel type function for a command "sort", much as I do

And yet, it won't. That would require it to know when to invoke a sorting library. I'm sure we'll get there soon, but for the time being, since sorting is a solved problem, it's not really a priority to have an LLM do it well.

As I mentioned, I would have thought that it could multiply 1.1 * 1.1 eighteen times, but it can't.

I've seen news items that they are working on getting LLMs to be able to invoke math and stats libraries, like a human programmer would do in C or something, but it isn't fundamental to the models and so far they aren't good at outsourcing that.
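The idea of an LLM handing work off to a library can be sketched roughly like this. It is a toy illustration, not any vendor's actual API: the request format, the `TOOLS` table, and `run_tool_request` are all hypothetical. The assumption is that the model only has to produce a small structured request, and ordinary procedural code does the precise part.

```python
import json
import math

# Hypothetical registry mapping tool names to ordinary library functions.
TOOLS = {
    "sort": lambda args: sorted(args["items"], key=str.casefold),
    "power": lambda args: math.pow(args["base"], args["exponent"]),
}

def run_tool_request(request_json):
    """Execute one request of the form {"tool": ..., "args": {...}}."""
    request = json.loads(request_json)
    tool = TOOLS.get(request["tool"])
    if tool is None:
        raise ValueError(f"unknown tool: {request['tool']}")
    return tool(request["args"])

# Pretend the model emitted these strings instead of doing the work itself.
sort_request = '{"tool": "sort", "args": {"items": ["Smith", "Arupa", "Jones"]}}'
power_request = '{"tool": "power", "args": {"base": 1.1, "exponent": 18}}'

print(run_tool_request(sort_request))   # ['Arupa', 'Jones', 'Smith']
print(run_tool_request(power_request))
```

The hard part, as I said, is the model knowing when to write such a request, not the library call itself.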

>>  I dont understand why it would struggle with a list

For reasons I mentioned above, lists are hard. If you've ever programmed in assembly language, you know that lists and arrays are seriously difficult. Even in a language like C there is a lot of room for crashing a program. Lists appear easy because humans are good at lists and so are modern high-level programming languages. But that doesn't inherently mean that a new kind of intelligence with completely different cognitive strengths should also be good at lists.

>> its still a computer, it should not do that.

Again, this is based on a conception of what a computer is based on procedural or functional languages.

Imagine for a second that we had invented LLMs first and then invented procedural software and apps like Excel and Word. Your complaint would be like me saying, "I mean, the stupid thing can do hard things like sort huge tables, but it can't even construct a basic sentence with the proper grammar." LLMs can't sort, but they have revolutionary language abilities. But remember that those abilities are driven by what is essentially a probability and prediction engine and you are asking it to do a type of "thinking" that is not at all based on probability and prediction, but on precision, specifically precise data manipulation.

This just isn't the way LLMs "think" or the way they are intended to think. The hallucination problem is, I've heard, a requirement to create the illusion of creativity. Figuring out how to make LLMs apparently creative without hallucination is a big challenge. So it's no surprise that they are not great at precise data manipulation.

Or, to exaggerate a bit, it's like grabbing a screwdriver and walking up to a board and swearing because the damn thing doesn't cut for shit.

Rupert

Well put. I am still surprised it does not access those libraries.