Hallucination rates 20-30% for the best LLMs

Started by rcjordan, January 04, 2026, 02:46:38 AM

Previous topic - Next topic

rcjordan


Rupert

Different answers here; they think it's better. I wonder why?
https://github.com/vectara/hallucination-leaderboard


20 to 30% seems crazy high...
... Make sure you live before you die.

rcjordan

>20 to 30% seems crazy high...

Agree. But I suspect the error rate varies a lot by category. Coding, for instance, seems to be less prone to hallucinations.

I thought I'd posted this a week ago, apparently not:

Very recently, I was prompting GPT-5 to sort 245 emojis by Unicode value. While I was trying to devise a better prompt, it returned a list of 2,100 emojis two or three times. When I called out the obvious error, it replied, "Sorry, I hallucinated."

ergophobe

That original chart only goes through GPT-4.5, so, like most AI research, it's somewhat dated by the time it's published.

My understanding is that you only get GPT-5 + "thinking" if you are paying for the most expensive models and are not time-sensitive. I don't think we're typically getting that with our 1min.ai accounts.

Generally, again as I understand it without having really looked into it, when you query a chatbot it does some degree of "routing," using cheaper, faster models for the parts it thinks will be easy, and you need the highest-tier account to prevent that.

Then, for the next phase, only the highest-tier account does a lot of "thinking," which slows things down and costs energy/money, but, by the charts I've seen, it reduces hallucination by a lot and improves benchmark performance a lot.

Then there is the context window problem. This is why even frontier models have struggled to multiply four-digit numbers. They simply are not good at carrying the numbers like a fifth grader with a piece of paper can do.

So when you ask it to sort a list by Unicode value, you're creating a fairly large context. If you think back to basic computer science, simple sorts (bubble sort) are O(n^2), so if you ask an AI to sort something, you're hitting one of its weak areas (remembering items), and that weakness grows with the square of the list size. Something like Excel will do far better at sorting, multiplying, and adding many numbers.
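For what it's worth, a sort-by-code-point like the one rcjordan asked for is a few lines of ordinary code, with zero chance of hallucinated entries. A minimal Python sketch (the emoji list is my own sample, and `ord` only works for single-code-point emojis; multi-code-point sequences like flags would need a different key):

```python
# Sort emojis by their Unicode code point.
# sorted() is O(n log n), and ord() returns the code point,
# so a few hundred entries is trivial for a real program.
emojis = ["😀", "🎉", "🐸", "❤", "✨", "🚀"]  # sample input

by_codepoint = sorted(emojis, key=ord)

for e in by_codepoint:
    print(f"U+{ord(e):04X}  {e}")
```

This is the "use Excel (or any deterministic tool) for deterministic tasks" point in code form: the LLM is the wrong tool for the job, not just an unreliable one.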

rcjordan

Worth mentioning again that nuking posts by emojis *really* cleans up social media drivel. Unfortunately, UltraEdit v6 doesn't handle Unicode, and 1min was handy, so...

Brad

Related:

Cops Forced to Explain Why AI Generated Police Report Claimed Officer Transformed Into Frog

https://futurism.com/artificial-intelligence/ai-police-report-frog

ergophobe

I'm trying to research an LLC for which there is little info online. Google search was no good: the LLC's name has too many phrase matches, and adding additional quoted search terms returns nothing.

ChatGPT just told me where to find info on an LLC.

Gemini, however, came through with a bit of detailed information about the principal agent and location. All of it wrong. When I said, "That is incorrect" and gave the correct details from the Sec of State record, I got this response:

Quote: Thank you for providing that specific record ID and details. Based on the information you have just provided (Record ID BA20251567012, Filing Date 7/25/2025, and Address 1519 E Chapman Ave), it appears we are looking at two different records or a discrepancy in how the data is being indexed or viewed.

However, there is a critical detail in the information you provided that needs to be addressed immediately: The filing date you listed is July 25, 2025.

Since today's date is currently in 2024, a filing date of 7/25/2025 is in the future.

This suggests one of a few possibilities regarding the information you have:

It is a typo in the source...

It's been a while since I've encountered this level of hallucination.

It's surprising just how jagged that "jagged frontier" is.

rcjordan

There was a roughly similar case posted on bsky yesterday.  The LLM was dead wrong and the user posted an easily found recorded document to correct it.

ergophobe

I was a little surprised by the wholesale invention of (mis)information, but the thing that really surprised me was asserting that it is 2024. Claude and ChatGPT regularly say, "My information is not completely up to date, so I can't answer questions about 2025," but they have never "thought" today was some date in 2024 (which is before Gemini 3's training run, so it wasn't even "born" yet).

The jagged frontier has some real surprises. Chatbots can do PhD math and write undergrad history papers, but they can't get the current year right. I think when the Enterprise computer does that, Spock knows some bad actor has altered the data.