Why Don't LLMs Ask For Calculators?
Notes From the Desk: No. 37 - 2025.02.15
Notes From the Desk are periodic informal posts that summarize recent topics of interest or other brief notable commentary.
Why Don't LLMs Ask For Calculators?
A simple observation of LLM behavior can tell us much more than might be apparent at first. Why don't LLMs ask for calculators for work they cannot do? The answer reveals a great deal about LLM reasoning.
Recently making the rounds on social media is the following chart, which shows the current progress of one of the top LLMs at performing multiplication, measured as accuracy by the number of digits in the operands.
The chart is being shared to emphasize the significant progress made since the introduction of reasoning models; previously, anything above 3x3 digits was shown in red.
However, we get a bit more clarity if we highlight only the tests that passed with 100% accuracy. The context is math; anything under 100% is essentially useless. Who wants a calculator with any level of random failure?
But it gets much worse. This accuracy chart was based on only 45 iterations! At that sample size, it is likely that not a single test would meet 100% if held to calculator standards.
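To get a rough sense of why 45 samples cannot establish calculator-grade reliability, consider a back-of-the-envelope check (a minimal sketch of my own, not taken from the benchmark): a model whose true per-problem accuracy is well below 100% will still produce a perfect 45/45 run surprisingly often.

```python
# Probability that a model with a given true per-problem accuracy scores a
# perfect 45/45, i.e. shows up as a "100%" cell on a 45-iteration benchmark.
for true_accuracy in (0.99, 0.97, 0.95, 0.90):
    p_perfect_run = true_accuracy ** 45
    print(f"true accuracy {true_accuracy:.0%} -> P(45/45 correct) ~ {p_perfect_run:.1%}")
```

A model that is wrong 1% of the time still posts a perfect run about 64% of the time, and one that is wrong 3% of the time does so about a quarter of the time, so a green cell at this sample size says little about calculator-grade reliability.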
So, should we expect LLMs to be perfect at multiplication? When LLMs are criticized for their poor math ability, the predominant rebuttal is "well, I couldn't do that either," as is evident in this thread with its many such replies.
However, what would you do if asked to perform such a task? You are intelligent; you would ask for a calculator, a reliable instrument that completes the work efficiently. You wouldn't just attempt it in your head or with pencil and paper. You would do this because you understand both your own capability and the capability of the calculator. Yet nobody asked why the LLM didn't ask for a calculator.
So, why don't LLMs ask for a calculator, offer to write code to compute the result, or at least explain that the task should be done on a calculator for accuracy? Do LLMs know that they are bad at math? Yes.
Do LLMs know what a calculator is? Also yes. And yet they still cannot put these two concepts together and realize they are likely giving you wrong answers, even though we know both concepts are within the training data. Any human would intuit the correct course of action. It is blindingly apparent to us, but not to the LLM.
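For perspective on how small the missing step is, here is a minimal sketch (the operands are arbitrary examples of mine, not taken from the benchmark) of what "offering code instead of an answer" amounts to; exact multi-digit multiplication is a one-liner once the work is delegated to code.

```python
# Python integers have arbitrary precision, so the exact product is trivial
# once the task is handed to code rather than to token prediction.
a = 9876543210987654321   # 19-digit operand, chosen arbitrarily
b = 1234567890123456789   # 19-digit operand, chosen arbitrarily
print(a * b)              # exact result, with no chance of a near-miss answer
```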
We are continuously told that LLMs are performing at PhD level. Yet here we have a very interesting case that demonstrates something that is sometimes hard to find: an incongruity between training data and applied reasoning. This incongruity exists because the internet is not full of math problems where individuals respond by asking for a calculator. Therefore, the LLM does not ask for a calculator.
Now, some might interject and say that we could, of course, train the LLM to ask for a calculator, or provide code, or use an agent, and so on. However, that would not make it intelligent. Humans don't need training for such a decision, that is, choosing which tool to use or which path to take once the capabilities of the different options are understood. Resolving the path to the best solution is intelligence.
Humans require no training at all for basic calculators, as they are such intuitive instruments. We use them simply because we understand the capability they provide.
The LLM has no self-reflection on the knowledge it holds and no understanding of concepts beyond what can be assembled from patterns in language. It just so happens that language patterns overlap with many types of tasks in such a way that LLMs appear to understand. However, when language does not provide the pattern, the LLM fails to perceive it.
What we want out of LLMs is understanding, and these results are further confirmation that it does not exist. What is important here is not the failure at math; the implications are much greater. Math is just a domain in which we can easily and objectively observe such failures, but the problem is pervasive across all tasks given to LLMs.
In other domains the errors are more difficult to detect, but they are there, and they come masked in eloquent language, the product of fine-tuning that makes every response sound confident and intelligent.
"I'm not against any use, except for one, where an LM does the work without a human validating everything with a magnifying glass. A language model isn't a rational agent, it doesn't reason, and it produces output it doesn't comprehend itself. These are real LMs. But those hype generators don't like real LMs. They like imaginary ones and fool everyone into thinking that these either already exist or will soon be released." - Andriy Burkov, author of The Hundred-Page Language Models Book
Would you let the success of your company be determined by a roll of the dice? LLMs are simply unsuitable for tasks requiring reliability. Under supervision, LLMs have some legitimate uses, but reliable automation and truly intelligent reasoning are not on the horizon. How did randomly correct answers become a multi-billion-dollar industry?
“The fact that something that has ingested the entirety of human literature can't figure out how to generalize multiplication past 13 digits is actually a sign of the fact that it has no understanding of what a multiplication algorithm is. …”
— Chomba Bupe, commenting on this math benchmark
If you wish to explore further, here are more simple LLM failures I have documented previously that also demonstrate evident gaps in reasoning.
Twilight Zone - A Nice Place To Visit (AGI)
An excellent allegory for the utopian dreamers of AGI comes from a 1960s episode of The Twilight Zone, “A Nice Place To Visit”. The plot reveals the unexpected outcome of wish-granting power, which is currently the heavily sought-after prize of AGI/ASI.
I have clipped together a short summary of the episode here so that you can get the meaning in only a couple of minutes, but the original full episode is certainly worth a watch. As demonstrated above, it seems LLMs are not on the path to AGI; however, we should spend some time contemplating whether that is a path we would ever want to be on, should it become achievable.
No compass through the dark exists without hope of reaching the other side and the belief that it matters …
2025.02.20 - minor updates for clarity based on some feedback
Counter,
prediction really is modeling, and having a model really is one step from intelligence.
It's true that LLMs are not humans and don't work like humans, but they can model our culture.
Math breaks LLMs because it's dense; you can't 'hide infinite competence' in a little box like you can with some other things (like summarization, knowledge extraction, etc.).
That just means we have yet another layer of resource management to handle :P perhaps we always will.
Uploading consciousness really is just predicting culture, and to the extent that anyone needs AI, it's here.
Enjoy
This seems like a matter of training. Early LLMs didn't know how to answer questions either; they just completed text, where you would start writing something and it would finish it. But then we built corpora for "Instruction Tuning" to show example transcripts of chats and bias the system to generate content that was more useful in a general context, per the benchmarks that have been defined.

We have the start of something similar with tool use or function calling, where the models identify when they need to use a tool, such as a calculator. Early examples of this are generally pretty good. Likewise, reasoning models are increasingly getting guided transcripts to train on how to break down problems to get more accurate results, leveraging the same underlying training data (except for the reasoning training data) as more vanilla LLMs. The internet at large, and books and such, don't often call out step-by-step instructions with "now plug this into a calculator," and to the extent they do, models didn't know how to literally do that with a function call, other than output the text, until recently.

LLM-based approaches might reach some limit where they can't address certain tasks like maths, but right now that limit seems to be data. If we can guide them to know how to approach a problem from a combination of reasoning, using external tools, introspection or feedback, and iteratively addressing the problem, we can approach something similar to humans on a wide variety of problem domains. This is why there is a bunch of hype about "Agentic Systems."
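As a rough illustration of the function-calling pattern described in this comment (the tool schema, names, and numbers below are hypothetical, not taken from any particular API), the idea is that the model emits a structured tool request and the host executes the exact computation:

```python
import json

# Hypothetical calculator tool that the host application exposes to the model.
def multiply(a: int, b: int) -> int:
    """Exact arbitrary-precision multiplication, performed by ordinary code."""
    return a * b

TOOLS = {"multiply": multiply}

# Hypothetical model output: a structured tool call instead of a guessed answer.
model_output = '{"tool": "multiply", "arguments": {"a": 123456789, "b": 987654321}}'

call = json.loads(model_output)
result = TOOLS[call["tool"]](**call["arguments"])
print(result)  # 121932631112635269, computed exactly by the host, not by the model
```

Whether such trained-in tool use amounts to the understanding the article asks for is precisely where this comment and the article disagree.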