> the chess manual is not sufficient; we need millions of examples of chess games
A concise explanation of something I _knew_ but didn't know how to express easily.
I know that LLMs don't have "epistemics" as we call it; you can likely train one to talk like a rationalist, but that's not the same thing as being a rationalist. One valuable skill for a rationalist is a calibrated sense of uncertainty, but I don't think LLMs have a sense of uncertainty: you can ask one to tell you its subjective probability as a percentage, and it can respond with a number, but that number will be derived the same way it derives everything else. Maybe it says "30%" but that's a sample from its top k tokens, perhaps something like
30% (23%)
50% (17%)
25% (15%)
40% (12%)
65% (8%)
10% (7%)
...
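To make that concrete, here is a minimal sketch, using made-up numbers matching the list above, of why the stated percentage is just another sampled token and can change from call to call:

```python
# Minimal sketch (hypothetical numbers): the "subjective probability" an LLM
# reports is sampled from its next-token distribution, the same way every
# other token is produced.
import random

# Hypothetical top-k distribution over the token emitted in answer to
# "What is your confidence, as a percentage?"
top_k = {"30%": 0.23, "50%": 0.17, "25%": 0.15, "40%": 0.12, "65%": 0.08, "10%": 0.07}

def sample_stated_confidence(dist: dict[str, float]) -> str:
    """Sample one 'confidence' answer, weighted by token probability."""
    tokens, weights = zip(*dist.items())
    return random.choices(tokens, weights=weights, k=1)[0]

# The reported "confidence" varies run to run, because it is generated,
# not introspected.
for _ in range(5):
    print(sample_stated_confidence(top_k))
```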
A more interesting question is whether it's possible to somehow measure an LLM's "uncertainty", meaning a sense of wandering outside its training data, in order to detect likely hallucinations. I don't have time to look at that paper you cited in relation to this ("A Survey of Confidence Estimation and Calibration in Large Language Models"), do you know if it mentions any valuable discoveries? It's not as simple as just looking at uncertain predictions of the next word, for the LLM may have a top next-token prediction list like
large (33%)
big (27%)
oversized (16%)
sizeable (14%)
...
but this isn't uncertainty, it's just English having lots of synonyms. Also interesting is the question of whether, if such "uncertainty detection" is possible, an LLM could be trained on that sense of uncertainty in order to produce outputs that are more sensible than hallucinations. This would have to be a different training process than standard pretraining, since pretraining just makes it predict the next word, which doesn't depend on uncertainty (uncertain or not, it will make its best guess either way).
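Here is a minimal sketch of that synonym problem (the numbers and the meaning-grouping are made up for illustration, loosely in the spirit of the semantic-clustering ideas some calibration papers explore): raw next-token entropy looks high, but once tokens are grouped by meaning the apparent uncertainty vanishes.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Shannon entropy (in bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token distribution, spread across synonyms
# (the "..." above filled in with one more synonym so it sums to 1).
next_token = {"large": 0.33, "big": 0.27, "oversized": 0.16,
              "sizeable": 0.14, "huge": 0.10}

print(entropy(next_token.values()))   # ~2.2 bits: looks "uncertain"

# Group tokens by (hand-labelled, hypothetical) meaning before measuring.
meaning = {t: "LARGE" for t in next_token}   # all five mean the same thing
grouped = defaultdict(float)
for tok, p in next_token.items():
    grouped[meaning[tok]] += p

print(entropy(grouped.values()))      # 0.0 bits: no semantic uncertainty
```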
I likewise don't have time to read "Hallucination is Inevitable: An Innate Limitation of Large Language Models" but the abstract says "we show that LLMs cannot learn all the computable functions and will therefore inevitably hallucinate" which not only seems like a non sequitur, but it proves too much, inviting a comparison to humans. Humans also cannot learn all the computable functions, and technically also hallucinate (e.g. in dreams and daydreams) but there are some humans who have a sense of uncertainty (and of modesty and of fear, etc.) strong enough to bound the ridiculousness of what they say and do.
> It's not as simple as just looking at uncertain predictions of the next word, for the LLM may have a top next-token prediction list like
You have it precisely. This is a limitation they acknowledge in the paper. There is no way to correlate the confidence with semantic meaning; it is just the likelihood of that particular sequence of tokens appearing together.
> Humans also cannot learn all the computable functions, and technically also hallucinate
Yes, I think the ultimate takeaway is that a system with no understanding has no means of error correction. We technically hallucinate, but we also have the means to self-correct. So any system without that capability will simply compound errors over time.
Dakara, you’re right: diagonalisation (and friends) guarantees some prompt will always break any computable model. But “unsolvable” shouldn’t be read as “un‑usable.” Aviation, cryptography, and autonomous‑vehicle software all live under impossible‑proof ceilings; they stay safe by compartmentalising risk. We can do the same with LLMs.
1. Not all hallucinations are the same
• Factual fibs – a made‑up citation.
• Spec breaks – XML when JSON was required.
• Logic gaps – missing proof steps.
• Open‑ended BS – creative nonsense people asked for.
Name the error, pick the matching guardrail (retrieval checks, deterministic parsers, symbolic verifiers, or simply “user‑beware”), and you’re already ahead of the doom narrative.
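For the "spec breaks" bucket, the guardrail can be entirely deterministic. A minimal sketch, assuming a `generate()` wrapper around whatever model you use (the name is illustrative):

```python
import json

def generate(prompt: str) -> str:
    """Placeholder for whatever model call you actually use."""
    raise NotImplementedError

def json_guardrail(prompt: str, max_attempts: int = 3) -> dict:
    """Deterministic parser as a guardrail for spec breaks:
    accept output only if it parses as a JSON object, else regenerate."""
    for _ in range(max_attempts):
        raw = generate(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue                      # spec break: try again
        if isinstance(parsed, dict):
            return parsed                 # spec satisfied: hand it onward
    raise RuntimeError("No valid JSON after retries; escalate to a human.")
```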
2. The layered stack we’re building (still in lab)
• Short‑term memory – holds current logits and attention; a quick FFT flags "attention jitter." Spikes trigger an immediate regenerate.
• Long‑term memory – a provenance‑stamped fact graph; every claim is grounded against it before we let the text out.
• Fourier diagnostics – watch for drift and brittle bindings, auto‑refreshing stale facts.
• Meta‑cognitive loop – runs a structured self‑audit: publish, regenerate, or escalate.
• Human sovereignty guard (HEPTA pattern: Human‑AI‑Human alternation) – forces a subject‑matter expert to sign off on anything high‑stakes.
• Domain adapters (tight ontologies like ICD‑10 or SEC codes) – shrink the input space so we can realistically chase "five nines" accuracy in those narrow lanes.
No real‑world pilots yet—just synthetic evals—but the early numbers are promising.
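To illustrate just the meta‑cognitive loop, here is a simplified, purely illustrative sketch of the publish/regenerate/escalate decision (the names, checks, and threshold are made up, not the actual implementation):

```python
from dataclasses import dataclass

@dataclass
class AuditResult:
    grounded: bool        # did every claim match the fact graph?
    jitter_score: float   # attention-jitter diagnostic, 0..1 (hypothetical scale)
    high_stakes: bool     # does the task fall under the human sign-off rule?

def meta_cognitive_step(audit: AuditResult, jitter_threshold: float = 0.3) -> str:
    """Illustrative publish / regenerate / escalate decision."""
    if audit.high_stakes:
        return "escalate"     # human expert signs off (sovereignty guard)
    if not audit.grounded or audit.jitter_score > jitter_threshold:
        return "regenerate"   # failed grounding or a jitter spike: retry
    return "publish"

print(meta_cognitive_step(AuditResult(grounded=True, jitter_score=0.1, high_stakes=False)))  # publish
```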
3. Metrics over metaphors
We track: hallucination‑per‑task, stale‑fact rate, attention‑jitter score, and ensemble coherence. The goal isn’t zero; it’s “below the risk budget the regulator or business owner will sign off on.” Numbers, not vibes.
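"Below the risk budget" is, in code, nothing more exotic than a threshold check on a measured rate; a trivial, hypothetical example:

```python
def within_risk_budget(task_had_hallucination: list[bool], budget: float) -> bool:
    """True if the measured hallucination-per-task rate is under the agreed budget.
    (Hypothetical log format; the goal is not zero, just below budget.)"""
    rate = sum(task_had_hallucination) / len(task_had_hallucination)
    return rate <= budget

# e.g. 3 flagged tasks out of 200 against a 2% budget -> acceptable
print(within_risk_budget([True] * 3 + [False] * 197, budget=0.02))  # True
```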
4. Quick reality checks
Bigger models aren’t automatically safer. Raw o‑series models hallucinate 30‑50 % on open prompts; wrap them in the stack above and they calm down—like turning raw uranium into fuel rods.
Shiny demos do raise funding, but teams with P&L responsibility already pay for lower false‑claim rates. The market will discipline snake oil.
TL;DR – Perfection is banned by math. Engineered reliability is not. Treat an LLM like radioactive material—powerful, but always behind shielding—and it becomes an asset, not a mind prison. Happy to share design docs or evaluation scripts once the next batch of numbers lands.
> Where we differ is what follows. "Unsolvable" ≠ "un‑usable."
I don't think we disagree here. I've stated that as long as you have a process to handle the errors, such as human review or external verifiers, you may be able to apply AI to your use case.
I do significantly emphasize the following perspectives, as I believe they often go understated or unacknowledged due to the overhype of the technology.
AI is the only tool in existence that pretends to be something it is not. No other tool tempts you into erroneous use the way it does. The majority of everyday use by most individuals lacks any process to eliminate hallucinations. People don't perceive hallucinations as being present and therefore do nothing to mitigate them. The pretense of confident intelligence is misleading with respect to the tool's correct use.
The other fundamental issue, which separates it from other domains where we can compartmentalize areas of high-risk erroneous output, is that we don't have any good method to measure the confidence of the output. We don't know how the amount and quality of the training data relate to the prompt we have issued.
We can achieve very high quality rates when we know where those errors will occur. Humans also have the ability to self-reflect and to know when they are uncertain; this remains challenging for LLMs, which show a higher level of general unpredictability.
The strongest point I want to make is not that LLMs are unusable, but that their use is significantly precarious and that they aren't sufficient to do exactly the same tasks we give to humans.
Just one small point where we might still differ: how tools present themselves. It's up to the user to establish the tool's proper use, read the manual, and do the research. I've surely seen people try to hammer in a screw. I bet the hammer must have been misrepresenting its purpose, or maybe it's the screw's fault? :(
It's a simple example, but isn't it always the human (user) who is to blame for incorrect use, not the tool? Or let's take one step back... it could also be the designer's fault for not making the user aware of the limitations.
I think in a lot of aspects we do agree though. I always just try to look at it from the positive angle - i.e., how can we address it. But articles like yours are important for awareness, and I appreciate your work.
You're welcome to read my article on Cognitive Sovereignty - maybe we are aligned in our ideas after all.
> it could also be the designer's fault for not making the user aware of the limitations.
This is where I place most of the responsibility. If a tool is nonintuitive in its proper use, that should be on the builder/provider, etc.
Individuals should also take responsibility, but the information available about AI is confusing and conflicting for the everyday person who wants to understand what this technology actually does and how they should, or should not, use it.
> You're welcome to read my article on Cognitive Sovereignty - maybe we are aligned in our ideas after all.
Very interesting read, thank you for sharing. A couple of questions if I may:
1) So, when AI companies talk about 'reasoning models', do they flat out lie? Because AI doesn't reason; it only extrapolates and interpolates data. If that is the case, how does Apple's paper fit in, which says that 'models break down at a certain level of complexity'? If there is no reasoning, then reasoning cannot break down?
2) Could data quality and quantity become so huge in certain areas that the problem doesn't matter anymore and promises can be kept? As you said, any sufficiently advanced pattern matching is indistinguishable from intelligence. I am wondering whether this fundamental flaw will ultimately crash the party or whether we will just steamroll over it.
3) You said the model doesn't know the probability. But shouldn't it be possible to derive that based on the underlying statistical process? A confidence metric could be assigned to the output that indicates where the output lies within the curve, similar to the R-Squared of a simple regression?
1) Most are using a loose definition of reasoning which includes heuristics, pattern-matching, algorithms, etc. I've started to use the term "understanding" sometimes in place of reasoning to make the distinction clearer. But because of this frequent conflation of terms, it is sometimes unclear whether an author also considers "understanding" to be part of what they call reasoning.
2) I think the key here is as you say "in certain areas". So, in certain domains we will likely have the error rate low enough that it is acceptable, but it will never be a zero error rate like a deterministic calculator. We probably don't ever want them hands-off managing financial records for example.
3) Yes, some of the papers I mention discuss using the internal probabilities. The details were a bit much to include here, but making practical use of those numbers is a lot more problematic than one might think in concept. The papers note that probability does not always align with being correct: an answer with many possible valid phrasings can have a low probability for each of its tokens, while an incorrect answer can have high confidence simply because it is a common pattern.
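To illustrate that caveat, here is a minimal sketch of the kind of confidence metric such papers discuss, a length-normalised probability of the whole answer, with made-up numbers showing exactly the failure mode described above:

```python
import math

def sequence_confidence(token_logprobs: list[float]) -> float:
    """Length-normalised probability of an answer
    (geometric mean of the per-token probabilities)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical numbers: a correct answer with many valid phrasings
# spreads probability thin across its tokens...
correct_but_paraphrasable = [math.log(0.25)] * 8
# ...while a wrong-but-common pattern can score as very "confident".
wrong_but_common = [math.log(0.9)] * 8

print(sequence_confidence(correct_but_paraphrasable))  # ~0.25
print(sequence_confidence(wrong_but_common))           # ~0.9
```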
Since this would also increase inference costs, probably nobody wants to do it unless it turns out to be really effective, and I suspect that just isn't the case so far.
You're making the assumption that the AIs are rather "neutral" -- that is, they don't have their own awareness or motivation. That might be true if the AI in question is acting alone, but they probably aren't. Sooner or later the AIs (all of them) are going to be invested by evil spirits, and they do have awareness and motivation. It might even be true now.
More than a Rube Goldberg machine, LLMs are "humans all the way down," meaning there are always humans in the loop somewhere.
I don't even like using the word "hallucination."
It seems wrong to anthropomorphize computer errors.
I also like saying "regurgitative" rather than "generative."
I agree, but it is the term everyone understands now.