> the chess manual is not sufficient; we need millions of examples of chess games
A concise explanation of something I _knew_ but didn't know how to express easily.
I know that LLMs don't have "epistemics" as we call it; you can likely train one to talk like a rationalist, but that's not the same thing as being a rationalist. One valuable skill for a rationalist is a calibrated sense of uncertainty, but I don't think LLMs have a sense of uncertainty: you can ask one to tell you its subjective probability as a percentage, and it can respond with a number, but that number will be derived the same way it derives everything else. Maybe it says "30%" but that's a sample from its top k tokens, perhaps something like
30% (23%)
50% (17%)
25% (15%)
40% (12%)
65% (8%)
10% (7%)
...
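To make that concrete, here is a minimal sketch, using made-up numbers matching the list above, of why the stated percentage is just another sampled token and can change from call to call:

```python
# Minimal sketch (hypothetical numbers): the "subjective probability" an LLM
# reports is sampled from its next-token distribution, the same way every
# other token is produced.
import random

# Hypothetical top-k distribution over the token emitted in answer to
# "What is your confidence, as a percentage?"
top_k = {"30%": 0.23, "50%": 0.17, "25%": 0.15, "40%": 0.12, "65%": 0.08, "10%": 0.07}

def sample_stated_confidence(dist: dict[str, float]) -> str:
    """Sample one 'confidence' answer, weighted by token probability."""
    tokens, weights = zip(*dist.items())
    return random.choices(tokens, weights=weights, k=1)[0]

# The reported "confidence" varies run to run, because it is generated,
# not introspected.
for _ in range(5):
    print(sample_stated_confidence(top_k))
```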
A more interesting question is whether it's possible to somehow measure an LLM's "uncertainty", meaning a sense of wandering outside its training data, in order to detect likely hallucinations. I don't have time to look at that paper you cited in relation to this ("A Survey of Confidence Estimation and Calibration in Large Language Models"), do you know if it mentions any valuable discoveries? It's not as simple as just looking at uncertain predictions of the next word, for the LLM may have a top next-token prediction list like
large (33%)
big (27%)
oversized (16%)
sizeable (14%)
...
but this isn't uncertainty, it's just English having lots of synonyms. Also interesting is the question of whether, if such "uncertainty detection" is possible, an LLM could be trained on that sense of uncertainty in order to produce outputs that are more sensible than hallucinations. This would have to be a different training process than standard pretraining, since pretraining just makes it predict the next word, which doesn't depend on uncertainty (uncertain or not, it will make its best guess either way).
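Here is a minimal sketch of that synonym problem (the numbers and the meaning-grouping are made up for illustration, loosely in the spirit of the semantic-clustering ideas some calibration papers explore): raw next-token entropy looks high, but once tokens are grouped by meaning the apparent uncertainty vanishes.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Shannon entropy (in bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token distribution, spread across synonyms
# (the "..." above filled in with one more synonym so it sums to 1).
next_token = {"large": 0.33, "big": 0.27, "oversized": 0.16,
              "sizeable": 0.14, "huge": 0.10}

print(entropy(next_token.values()))   # ~2.2 bits: looks "uncertain"

# Group tokens by (hand-labelled, hypothetical) meaning before measuring.
meaning = {t: "LARGE" for t in next_token}   # all five mean the same thing
grouped = defaultdict(float)
for tok, p in next_token.items():
    grouped[meaning[tok]] += p

print(entropy(grouped.values()))      # 0.0 bits: no semantic uncertainty
```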
I likewise don't have time to read "Hallucination is Inevitable: An Innate Limitation of Large Language Models" but the abstract says "we show that LLMs cannot learn all the computable functions and will therefore inevitably hallucinate" which not only seems like a non sequitur, but it proves too much, inviting a comparison to humans. Humans also cannot learn all the computable functions, and technically also hallucinate (e.g. in dreams and daydreams) but there are some humans who have a sense of uncertainty (and of modesty and of fear, etc.) strong enough to bound the ridiculousness of what they say and do.
> It's not as simple as just looking at uncertain predictions of the next word, for the LLM may have a top next-token prediction list like
You have it precisely. This is a limitation they acknowledge in the paper. There is no way to correlate the confidence with semantic meaning; it is just the likelihood of that particular sequence of tokens appearing together.
> Humans also cannot learn all the computable functions, and technically also hallucinate
Yes, I think the ultimate takeaway is that a system with no understanding has no means of error correction. We technically hallucinate, but we also have the means to self-correct. So any system without that capability will simply compound errors over time.
Dakara, you’re right: diagonalisation (and friends) guarantees some prompt will always break any computable model. But “unsolvable” shouldn’t be read as “un‑usable.” Aviation, cryptography, and autonomous‑vehicle software all live under impossible‑proof ceilings; they stay safe by compartmentalising risk. We can do the same with LLMs.
1. Not all hallucinations are the same
• Factual fibs – a made‑up citation.
• Spec breaks – XML when JSON was required.
• Logic gaps – missing proof steps.
• Open‑ended BS – creative nonsense people asked for.
Name the error, pick the matching guardrail (retrieval checks, deterministic parsers, symbolic verifiers, or simply “user‑beware”), and you’re already ahead of the doom narrative.
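For the "spec breaks" bucket, the guardrail can be entirely deterministic. A minimal sketch, assuming a `generate()` wrapper around whatever model you use (the name is illustrative):

```python
import json

def generate(prompt: str) -> str:
    """Placeholder for whatever model call you actually use."""
    raise NotImplementedError

def json_guardrail(prompt: str, max_attempts: int = 3) -> dict:
    """Deterministic parser as a guardrail for spec breaks:
    accept output only if it parses as a JSON object, else regenerate."""
    for _ in range(max_attempts):
        raw = generate(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue                      # spec break: try again
        if isinstance(parsed, dict):
            return parsed                 # spec satisfied: hand it onward
    raise RuntimeError("No valid JSON after retries; escalate to a human.")
```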
2. The layered stack we’re building (still in lab)
• Short‑term memory – holds current logits and attention; a quick FFT flags "attention jitter." Spikes trigger an immediate regenerate.
• Long‑term memory – a provenance‑stamped fact graph; every claim is grounded against it before we let the text out.
• Fourier diagnostics – watch for drift and brittle bindings, auto‑refreshing stale facts.
• Meta‑cognitive loop – runs a structured self‑audit: publish, regenerate, or escalate.
• Human sovereignty guard (HEPTA pattern: Human‑AI‑Human alternation) – forces a subject‑matter expert to sign off on anything high‑stakes.
• Domain adapters (tight ontologies like ICD‑10 or SEC codes) – shrink the input space so we can realistically chase "five nines" accuracy in those narrow lanes.
No real‑world pilots yet—just synthetic evals—but the early numbers are promising.
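To illustrate just the meta‑cognitive loop, here is a simplified, purely illustrative sketch of the publish/regenerate/escalate decision (the names, checks, and threshold are made up, not the actual implementation):

```python
from dataclasses import dataclass

@dataclass
class AuditResult:
    grounded: bool        # did every claim match the fact graph?
    jitter_score: float   # attention-jitter diagnostic, 0..1 (hypothetical scale)
    high_stakes: bool     # does the task fall under the human sign-off rule?

def meta_cognitive_step(audit: AuditResult, jitter_threshold: float = 0.3) -> str:
    """Illustrative publish / regenerate / escalate decision."""
    if audit.high_stakes:
        return "escalate"     # human expert signs off (sovereignty guard)
    if not audit.grounded or audit.jitter_score > jitter_threshold:
        return "regenerate"   # failed grounding or a jitter spike: retry
    return "publish"

print(meta_cognitive_step(AuditResult(grounded=True, jitter_score=0.1, high_stakes=False)))  # publish
```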
3. Metrics over metaphors
We track: hallucination‑per‑task, stale‑fact rate, attention‑jitter score, and ensemble coherence. The goal isn’t zero; it’s “below the risk budget the regulator or business owner will sign off on.” Numbers, not vibes.
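"Below the risk budget" is, in code, nothing more exotic than a threshold check on a measured rate; a trivial, hypothetical example:

```python
def within_risk_budget(task_had_hallucination: list[bool], budget: float) -> bool:
    """True if the measured hallucination-per-task rate is under the agreed budget.
    (Hypothetical log format; the goal is not zero, just below budget.)"""
    rate = sum(task_had_hallucination) / len(task_had_hallucination)
    return rate <= budget

# e.g. 3 flagged tasks out of 200 against a 2% budget -> acceptable
print(within_risk_budget([True] * 3 + [False] * 197, budget=0.02))  # True
```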
4. Quick reality checks
Bigger models aren’t automatically safer. Raw o‑series models hallucinate 30‑50 % on open prompts; wrap them in the stack above and they calm down—like turning raw uranium into fuel rods.
Shiny demos do raise funding, but teams with P&L responsibility already pay for lower false‑claim rates. The market will discipline snake oil.
TL;DR – Perfection is banned by math. Engineered reliability is not. Treat an LLM like radioactive material—powerful, but always behind shielding—and it becomes an asset, not a mind prison. Happy to share design docs or evaluation scripts once the next batch of numbers lands.
> Where we differ is what follows. "Unsolvable" ≠ "un‑usable."
I don't think we disagree here. I've stated that as long as you have a process to handle the errors, such as human review or external verifiers, you may be able to apply AI to your use case.
I do significantly emphasize the following perspectives, as I believe they often go understated or unacknowledged due to the overhype of the technology.
AI is the only tool in existence that pretends to be something it is not. No other tool tempts you into erroneous use the way it does. The majority of everyday use by most individuals lacks any process to eliminate hallucinations. People don't perceive hallucinations as being present and therefore do nothing to mitigate them. The pretense of confident intelligence is misleading with respect to the tool's correct use.
The other fundamental issue, which separates it from other domains where we can compartmentalize areas of high-risk erroneous output, is that we don't have any good method to measure the confidence of the output. We don't know how the amount and quality of the training data relate to the prompt we have issued.
We can achieve very high quality rates when we know where those errors will occur. Humans also have the ability to self-reflect and to know when they are uncertain; this remains challenging for LLMs, which show a higher level of general unpredictability.
The strongest point I want to make is not that LLMs are unusable, but that their use is significantly precarious and that they aren't sufficient to do exactly the same tasks we give to humans.
Just one small point where we might still differ: how tools present themselves. It's up to the user to establish the tool's proper use, read the manual, and do the research. I've surely seen people try to hammer in a screw. I bet the hammer must have been misrepresenting its purpose, or maybe it's the screw's fault? :(
It's a simple example, but isn't it always the human (user) who is to blame for incorrect use, not the tool? Or let's take one step back... it could also be the designer's fault for not making the user aware of the limitations.
I think in a lot of aspects we do agree though. I always just try to look at it from the positive angle - i.e., how can we address it. But articles like yours are important for awareness, and I appreciate your work.
You're welcome to read my article on Cognitive Sovereignty - maybe we are aligned in our ideas after all.
> it could also be the designer's fault for not making the user aware of the limitations.
This is where I place most of the responsibility. If a tool is nonintuitive in its proper use, that should be on the builder/provider, etc.
Individuals should also take responsibility, but the information available about AI is confusing and conflicting for the everyday person who wants to understand what this technology actually does and how they should, or should not, use it.
> You're welcome to read my article on Cognitive Sovereignty - maybe we are aligned in our ideas after all.
Very interesting read, thank you for sharing. A couple of questions if I may:
1) So, when AI companies talk about 'reasoning models', do they flat out lie? Because AI doesn't reason; it only extrapolates and interpolates data. If that is the case, how does Apple's paper fit in, which says that 'models break down at a certain level of complexity'? If there is no reasoning, then reasoning cannot break down?
2) Could data quality and quantity become so huge in certain areas that the problem doesn't matter anymore and promises can be kept? As you said, any sufficiently advanced pattern matching is indistinguishable from intelligence. I am wondering whether this fundamental flaw will ultimately crash the party or whether we will just steamroll over it.
3) You said the model doesn't know the probability. But shouldn't it be possible to derive that based on the underlying statistical process? A confidence metric could be assigned to the output that indicates where the output lies within the curve, similar to the R-Squared of a simple regression?
1) Most are using a loose definition of reasoning which includes heuristics, pattern-matching, algorithms, etc. I've started to use the term "understanding" sometimes in place of reasoning to make the distinction clearer. But because of this frequent conflation of terms, it is sometimes unclear whether an author also considers "understanding" to be part of what they call reasoning.
2) I think the key here is as you say "in certain areas". So, in certain domains we will likely have the error rate low enough that it is acceptable, but it will never be a zero error rate like a deterministic calculator. We probably don't ever want them hands-off managing financial records for example.
3) Yes, some of the papers I mention discuss using the internal probabilities. The details were a bit much to include here, but making practical use of those numbers is a lot more problematic than one might think in concept. The papers note that probability does not always align with being correct: an answer with many possible valid phrasings can have a low probability for each of its tokens, while an incorrect answer can have high confidence simply because it is a common pattern.
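To illustrate that caveat, here is a minimal sketch of the kind of confidence metric such papers discuss, a length-normalised probability of the whole answer, with made-up numbers showing exactly the failure mode described above:

```python
import math

def sequence_confidence(token_logprobs: list[float]) -> float:
    """Length-normalised probability of an answer
    (geometric mean of the per-token probabilities)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Hypothetical numbers: a correct answer with many valid phrasings
# spreads probability thin across its tokens...
correct_but_paraphrasable = [math.log(0.25)] * 8
# ...while a wrong-but-common pattern can score as very "confident".
wrong_but_common = [math.log(0.9)] * 8

print(sequence_confidence(correct_but_paraphrasable))  # ~0.25
print(sequence_confidence(wrong_but_common))           # ~0.9
```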
Since this would also increase inference costs, probably nobody wants to do it unless it turns out to be really effective, and I suspect that just isn't the case so far.
You're making the assumption that the AIs are rather "neutral" -- that is, they don't have their own awareness or motivation. That might be true if the AI in question is acting alone, but they probably aren't. Sooner or later the AIs (all of them) are going to be invested by evil spirits, and they do have awareness and motivation. It might even be true now.
More than a Rube Goldberg machine, LLMs are "humans all the way down," meaning there are always humans in the loop somewhere.
I don't even like using the word "hallucination."
It seems wrong to anthropomorphize computer errors.
I also like saying "regurgitative" rather than "generative."
I agree, but it is the term everyone understands now.