The Illusion of Thinking - Apple's Study in Limits, Not a Verdict on Intelligence
- Yoshi Soornack
- Jun 9
- 3 min read

Apple’s latest AI paper, The Illusion of Thinking, takes aim at a question that cuts to the core of current AI progress: do large language/reasoning models really reason, or are we just projecting logic onto pattern-matching machines?
To answer it, the team built a controlled suite of logic puzzles. Think Tower of Hanoi, River Crossing and Blocks World. They then pushed a range of models through them, scaling up complexity bit by bit. Unlike benchmark tasks, which are often muddied by training contamination or final-answer bias, this was a clean probe: each puzzle grows incrementally harder, offering a lens on how reasoning copes under strain.
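To get a feel for why that incremental scaling bites, here is a minimal sketch (ours, not the paper's) of how fast the optimal Tower of Hanoi plan grows as disks are added; the other puzzles scale in analogous, if gentler, ways:

```python
def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Return the optimal move sequence for n disks (2**n - 1 moves)."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)   # clear the top n-1 disks onto the spare peg
            + [(src, dst)]                       # move the largest disk
            + hanoi_moves(n - 1, aux, src, dst)) # stack the n-1 disks back on top

# Each extra disk doubles the length of the plan the model must hold.
for n in range(1, 11):
    print(n, len(hanoi_moves(n)))  # 1, 3, 7, 15, ..., 1023
```

The point is not the puzzle itself but the curve: the plan doubles at every step, so a model's reasoning effort has to keep pace with it.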
And what they found was striking: as problems get more complex, models don’t just perform worse. They often stop trying.
The Collapse Curve
Early on, large reasoning models (LRMs) like o3, DeepSeek-R1 and Claude 3.7 Sonnet do well. They trace steps, plan actions, solve problems. But as puzzle complexity increases (more moves, deeper planning), something odd happens. Their chains of thought get shorter. They use fewer tokens. Accuracy collapses.
Not because they’ve hit a system limit.
Not because the tokens ran out.
But seemingly because they choose not to continue.
The paper paints this as a form of “giving up”. We see it as a kind of learned frugality. The model senses the edge of its pattern-matching competence and decides (in its own statistical way) not to waste effort spiralling into failure. This emergent characteristic is the opposite of 'overthinking', which we also see when LRMs are faced with simpler tasks.
And here’s where the paper becomes less a critique of reasoning, and more a study of its boundaries.
Reasoning: Real, but Shallow
The central question isn’t does the model reason?
It’s how deep is that reasoning, and what is it built on?
Many commentators fall into the trap of binary thinking. Either LRMs reason like humans or they don’t reason at all. But that misses a more interesting middle ground: models do reason, but often through a layer of heuristics learned from training data.
It’s a form of shallow, surface-level planning. Good enough for everyday tasks, but brittle when structure or recursion is required. They imitate the outward signs of logic, but don’t hold a generalised planning engine underneath. Not yet.
The Spatial Mismatch
There’s also something worth saying about the choice of puzzles. Tower of Hanoi and Block World aren’t just logic challenges, they’re spatial simulations. Humans solve them by visualising moves, manipulating imagined objects. LLMs, by contrast, operate in text space, without diagrams, scratchpads, or structured memory.
So part of the failure here may stem not from a lack of reasoning, but from a lack of tools to support reasoning. In the wild, we already see how models like o3 or Claude 3.7 improve drastically when given access to:
external memory,
code interpreters,
interactive environments,
or even just simple visual planning tools.
The Apple study removes all that scaffolding. It asks the model to juggle logic, space, and structure using only next-word prediction. That alone is a plausible explanation for a drop in performance.
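As a rough illustration of what that scaffolding buys (a sketch, not the paper's setup): given a code interpreter, a model doesn't have to enumerate every move inside its chain of thought; it can emit the standard recursive solver once and let the tool do the bookkeeping.

```python
def solve_hanoi(n, src="A", aux="B", dst="C"):
    """Standard recursive Tower of Hanoi solver.

    Emitting this once and executing it offloads the 2**n - 1 moves
    that would otherwise have to be planned token by token.
    """
    if n == 0:
        return
    solve_hanoi(n - 1, src, dst, aux)
    print(f"move disk {n}: {src} -> {dst}")
    solve_hanoi(n - 1, aux, src, dst)

solve_hanoi(8)  # 255 moves printed, none of them reasoned out in text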
Could it Be a Strategic Retreat?
But let’s revisit that moment of collapse. The “giving up.”
What if that’s not a flaw, but a sign of computational maturity?
A form of early meta-reasoning: recognising that a problem is beyond its reach and withdrawing, instead of burning through resources and compounding failure.
Call it statistical frustration. Or maybe just good judgement.
It suggests an emergent behaviour we’d want in future systems: the ability to assess difficulty, to decline tasks gracefully, to know when help is needed. Not all failures are equal. Some are reckless, others are wise.
Not an Illusion, But a Glimpse
So, does the Apple study debunk LLM reasoning?
Not really. It shines a light on what kind of reasoning we’ve built so far: shallow, heuristic-driven, and highly dependent on prompt design, scaffolding, and context.
It doesn’t prove that models don’t think. It just shows they don’t persist well under pressure, not without support.
And in that sense, the paper is useful not as a judgement, but as a diagnostic: a mirror on the boundaries of today’s systems, and an invitation to build smarter scaffolds, deeper planners, and more dynamic agents that don’t just mimic thought, but extend it.
Rabbit Hole
Read the full paper here