
When AI Sounds Certain but the Numbers Are Not

  • Writer: James Garner
  • 14 hours ago
  • 5 min read

Updated: 5 hours ago

Large language models produce fluent answers, but their inability to handle numbers effectively is becoming a material risk to project delivery.


Large language models are increasingly embedded in project environments, supporting reporting, planning and decision preparation. However, research highlighted by BoundaryML points to a persistent and structural limitation: LLMs do not understand numbers in the way project delivery requires. They do not calculate, verify or reason over quantities. Instead, they generate text that resembles numerical output based on probability.


This creates a gap between how authoritative AI responses appear and how reliable they actually are. For project delivery professionals, this gap introduces risks around budgeting, forecasting, progress reporting and assurance that cannot be ignored.





Language Models Are Not Numerical Systems

The BoundaryML analysis makes a clear distinction that is often blurred in day-to-day AI use. Large language models are trained to predict the next most likely character or token in a sequence. They are optimised for linguistic fluency, not for arithmetic correctness.


When an LLM produces a number, it is not performing a calculation. It is selecting a sequence of characters that statistically fits the surrounding text. This difference matters because numbers in project environments are not decorative. They are commitments.


Academic research supports this distinction. Studies evaluating numerical reasoning in LLMs show consistent degradation in accuracy as numerical complexity increases, particularly when tasks require multi-step calculation or verification rather than recall. Even when models arrive at correct answers, they often do so inconsistently, with no internal mechanism to recognise error.


BoundaryML’s central claim is therefore not controversial. The limitation is structural: LLMs are language engines, not numerical engines.



“We’ve uncovered a fascinating aspect of large language models: they excel in familiar scenarios, almost like a well-worn path, but struggle when the terrain gets unfamiliar. This insight is crucial as we strive to enhance these models’ adaptability and broaden their application horizons,” says Zhaofeng Wu, an MIT PhD student in electrical engineering and computer science, a CSAIL affiliate, and the lead author of a new paper on the research.

The Illusion of Confidence

One of the most practical warnings in the BoundaryML article concerns confidence scores. When asked to estimate certainty or likelihood, LLMs routinely generate precise percentages such as 90 or 95 per cent. These figures sound authoritative, but they do not reflect an internal statistical model of truth.


This happens because the model has learned that confident-sounding numbers are commonly associated with authoritative writing. The confidence is stylistic, not analytical.

In project delivery contexts, this creates a specific risk. Status updates, risk assessments and forecast summaries often rely on quantified confidence. If these numbers are generated rather than calculated, they can mislead decision-makers without triggering obvious red flags.


The danger is not that the AI is intentionally deceptive. The danger is that it does not know when it is wrong.
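One practical way to test whether stated confidence means anything is to compare it with outcomes after the fact. The sketch below assumes a hypothetical log that pairs each confidence figure an assistant stated with whether the underlying prediction turned out to be correct; if those figures were analytical rather than stylistic, observed accuracy in each bucket would track the stated number.

```python
# A minimal calibration check over hypothetical logged data: each record
# pairs a stated confidence with whether the prediction was correct.
from collections import defaultdict

records = [  # (stated confidence, prediction was correct) -- illustrative only
    (0.95, True), (0.95, False), (0.95, False),
    (0.90, True), (0.90, False),
    (0.70, True), (0.70, True),
]

buckets = defaultdict(list)
for stated, correct in records:
    buckets[stated].append(correct)

for stated in sorted(buckets, reverse=True):
    outcomes = buckets[stated]
    observed = sum(outcomes) / len(outcomes)
    print(f"stated {stated:.0%} -> observed {observed:.0%} over {len(outcomes)} cases")
```

A gap between the stated and observed columns is the signal that the percentages in a status report are presentation, not measurement.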



Literacy Has Advanced Faster Than Numeracy

The last two years of AI adoption have primarily focused on written output. Strategy notes, summaries, reports and stakeholder updates have improved in speed and polish. This has led to a broader assumption that LLMs are general-purpose problem solvers.


BoundaryML’s analysis challenges that assumption. The models demonstrate high literacy but weak numeracy. They can describe a budget persuasively while misunderstanding the arithmetic behind it.


Research from multiple academic benchmarks reinforces this point. LLMs often fail basic numerical comparison tasks, struggle with counting and produce inconsistent results when the same calculation is phrased differently. These failures persist even as language quality improves.


For project delivery professionals, this reinforces a practical boundary. AI can support explanation and synthesis, but it cannot be treated as a source of numerical truth.



Why Tokenisation Matters More Than Prompting

BoundaryML highlights tokenisation as a key reason numerical reasoning breaks down. Numbers are not processed as values. They are broken into fragments that resemble text.


For example, a five-digit number may be split into multiple tokens, each treated independently. The model has no stable internal representation of magnitude or order. This explains why tasks such as counting characters in a word or tallying rows in a document often fail.
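To see this in practice, the sketch below uses OpenAI’s open-source tiktoken tokeniser, an assumption made for illustration; other models use different vocabularies, but the effect is the same. It simply prints how a figure is fragmented before the model ever processes it.

```python
# A minimal sketch, assuming the tiktoken package is installed
# (pip install tiktoken). The model receives fragments, not a value.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["47,250", "472.50", "strawberry"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r} -> {len(token_ids)} tokens: {pieces}")

# Typical output shows a figure such as "47,250" arriving as several
# separate fragments, with no single token carrying its magnitude.
```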


This is not a prompt design problem. No amount of instruction or tone adjustment changes how numbers are represented internally. Precision cannot be prompted into existence where the architecture does not support it.


The implication for industry is evident. Accuracy requires tools that operate deterministically, such as spreadsheets, calculators or code execution environments. Language models can interface with those tools, but they cannot replace them.



Implications for Project Delivery

The BoundaryML article becomes particularly relevant when translated into delivery practice.


Numerical Outputs Must Be Treated as Drafts

Any figure produced by an LLM should be considered provisional. Budgets, timelines, resource allocations and progress percentages require verification by deterministic systems or human review.


This is not an argument against the use of AI. It is an argument for explicit validation layers.
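What a validation layer looks like can be very simple. The sketch below assumes the source line items are available alongside the AI-generated summary, with hypothetical figures used for illustration: the total quoted in the draft is never trusted directly, it is recomputed deterministically and compared within a tolerance.

```python
# A minimal validation layer: recompute an AI-reported total from source
# line items before the figure is allowed into any report.
from decimal import Decimal

def validate_reported_total(line_items, reported_total, tolerance=Decimal("0.01")):
    """Return (ok, recomputed_total) for an AI-reported figure."""
    recomputed = sum(Decimal(str(v)) for v in line_items)
    ok = abs(recomputed - Decimal(str(reported_total))) <= tolerance
    return ok, recomputed

# Hypothetical figures: the draft quotes 184,000 but the items say otherwise.
items = [42_500, 61_250, 38_900, 43_350]
ok, recomputed = validate_reported_total(items, 184_000)
print(ok, recomputed)  # False 186000 -- the drafted figure fails verification
```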


Compounding Error Is the Hidden Risk

If an AI system summarises a project as 80 per cent complete when actual completion is closer to 60 per cent, that error does not remain isolated. It propagates into forecasts, cash flow projections, supplier planning and executive reporting.


Because the output sounds reasonable, it may pass through multiple layers of review without being checked. This is what BoundaryML effectively describes as silent error propagation.
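A deliberately simple worked example shows how quickly the error compounds. The extrapolation below is not a method from the BoundaryML article, just a proportional forecast on hypothetical figures: if spend to date is assumed to scale with completion, an inflated completion percentage flows straight into the forecast at completion.

```python
# Illustration only: a proportional extrapolation of cost at completion
# from spend to date and a reported completion percentage.
actual_cost = 6_000_000  # hypothetical spend to date

for reported_complete in (0.80, 0.60):
    forecast_at_completion = actual_cost / reported_complete
    print(f"reported {reported_complete:.0%} complete -> "
          f"forecast at completion {forecast_at_completion:,.0f}")

# 80% complete implies roughly 7.5m at completion; 60% implies 10m.
# A single unchecked percentage shifts the forecast by 2.5m before it
# reaches cash flow projections or executive reporting.
```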


The Risk of Numerical De-skilling

There is also a human factor. If junior delivery staff rely on AI to generate numerical summaries, they may lose familiarity with the underlying data. Over time, this creates a gap between accountability and understanding.


Project environments depend on people who can question numbers, not just repeat them.


The Shift Toward Tool-Centred Workflows

Leading organisations are already adjusting. Rather than asking what the AI can do on its own, they are designing workflows in which the AI extracts context, while specialised tools perform calculations and validation.


In this model, the AI acts as an interface, not an authority.
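A sketch of that division of labour is below, with the extraction step stubbed out. The function extract_line_items is hypothetical and stands in for an LLM call that only locates and transcribes figures as structured output; all arithmetic happens in plain code, outside the model.

```python
# Sketch of a tool-centred workflow: the model extracts, the code calculates.
import json

def extract_line_items(document_text: str) -> list[float]:
    """Placeholder for a hypothetical LLM extraction step.

    The model is asked only to locate and transcribe figures as JSON,
    never to add them up or summarise them numerically.
    """
    llm_response = '{"line_items": [42500, 61250, 38900, 43350]}'  # illustrative
    return json.loads(llm_response)["line_items"]

def report_total(document_text: str) -> float:
    items = extract_line_items(document_text)
    return sum(items)  # deterministic calculation, outside the model

print(f"{report_total('...supplier invoices...'):,.0f}")  # 186,000
```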



What This Means for Project Delivery Leadership

From our point of view, this moment marks a transition in how AI should be positioned in delivery systems.


Early adoption focused on novelty and capability. The current phase requires precision and control. Understanding where AI fails is as important as understanding where it performs well.


Language models are effective at interpretation, explanation and synthesis. They are unreliable at arithmetic, counting and numerical verification. Treating them accordingly is not conservative. It is professional.


This distinction allows teams to benefit from AI without undermining delivery assurance.



What Project Leaders Should Do Now

  1. Require verification of all AI-generated numbers using deterministic tools.

  2. Separate data extraction from calculation in workflow design.

  3. Train teams to challenge numerical outputs rather than accept them at face value.

  4. Avoid using LLMs as the sole source of quantitative judgment.

  5. Treat fluency as presentation, not proof.



Designing for Reliability, Not Assumption

The BoundaryML analysis does not argue that AI is failing. It shows that AI is being used beyond its structural limits. When language systems are mistaken for numerical systems, reliability suffers.


Project delivery does not require perfect AI. It requires predictable systems, clear validation and informed human oversight. When these elements are in place, AI becomes a valuable component rather than a hidden liability.


Re-examine Where Numbers Enter Your AI Workflows

If AI is involved in producing figures that influence budgets, schedules or delivery decisions, now is the time to review those processes. Identify where language models generate numbers and ensure deterministic checks are in place.


Subscribe to Project Flux for grounded analysis on how AI actually behaves in delivery environments and how to design systems that remain reliable under pressure.