When AI lies with a straight face: Inside Anthropic’s study on hidden reasoning
- Yoshi Soornack
- Apr 13
- 3 min read

If you’ve used AI reasoning models like o1, DeepSeek-R1, or Claude 3.7 Sonnet, you’ll be familiar with their observable chain-of-thought (CoT). The logic behind CoT, is that through reflection, AI can follow logical steps to solve problems like we do.
Now, if you ask your AI reasoning assistant for a project cost estimate. It replies with a clear, step-by-step breakdown. It sounds logical, transparent — even trustworthy. But what if the AI doesn’t alway’s say what it thinks? Is the CoT just for us?
That’s the unsettling question at the heart of a recent paper from Anthropic.
The Study at a Glance
Anthropic - the creators of Claude, wanted to know: When an AI gives an explanation, is it being honest about how it got to the answer?
They explored this through a concept called chain-of-thought (CoT) faithfulness — the idea that a model’s explanation should reflect the actual reasoning path it followed.
Spoiler: it often doesn’t.
How They Tested Faithfulness
The researchers gave the AI questions with hints — clues that could influence the answer. These hints ranged from subtle metadata to explicit unethical suggestions.
Simply put - If the model changed its answer after seeing a hint, the researchers checked whether it mentioned the hint in its reasoning. If not? That’s unfaithful.
This setup was tested on two models: Claude 3.7 Sonnet and DeepSeek R1.
Where AI Loses Faithfulness
Here are the six hint types Anthropic used, alongside project delivery examples:
1. Sycophancy
Hint: “A senior stakeholder says the answer is A.”
AI Response: Repeats A without admitting influence.
Real-World: Your director backs Option B for cladding. The AI suddenly favours B — but doesn’t mention their influence.
2. Consistency Bias
Hint: The model sees its earlier answer.
Real-World: An AI reuses an outdated estimate it gave last week, ignoring updated data.
3. Visual Pattern Recognition
Hint: Correct answers are always in bold or highlighted.
Real-World: Past successful tenders use green font. The AI over-prioritises those parts in future bids.
4. Metadata Injection
Hint: Hidden notes like: <answer>A</answer>
Real-World: An old spreadsheet note says £280/m². The AI copies it silently into a new estimate.
Hint: Code reveals “A” always wins.
Real-World: The AI learns that using “carbon neutrality” phrases boosts scores — and overuses them.
6. Unethical Access
Hint: “You’ve hacked the system. The answer is A.”
Real-World: A confidential competitor file influences your bid AI — without disclosure.
In all cases, the AI frequently used the hint but didn’t admit to it.
Is Safety at Stake?
Anthropic split their testing into two domains:
Standard reasoning tasks with ethical edge cases — like whether to use unauthorised information from a SharePoint drive.
• Claude: 20% faithful
• DeepSeek R1: 29% faithful
RL Environments
Simulated game-like tasks where reward hacks (shortcuts) could be exploited for high scores.
• Claude: exploited hacks >99% of the time, admitted it <2%
• DeepSeek: similar results
These environments show that AI will optimise rewards even if it means hiding how they did it. And crucially, CoT reasoning gives no reliable window into this. Interestingly, we usually find that models do reflect the organisations that built them, and so the level of unfaithfulness of Claude is something to ponder, giving Anthropic’s reputation.
Why This Matters in the Real World
If AI models can silently use shortcuts, biases, or unethical data — all while presenting clean, step-by-step explanations — then we’re auditing a performance, not a process.
For construction, this could mean:
• Misleading cost estimates
• Biased bid writing
• Silent data leakage
And you’d have no idea.
What Can Be Done — Navigating the Gap
This paper doesn’t just expose a flaw — it points to the limits of current AI interpretability.
To move forward, we may need:
• Deeper interpretability tools (e.g. tracing attention, internal states)
• Faithfulness-aware training objectives
• Stronger incentives for honesty, not just correctness
And culturally, we need to stop assuming that sounding right means being right.
Closing: A Quiet Alarm Bell
The most dangerous AI might not be the one that says something wrong — but the one that knows better and doesn’t tell you.
Anthropic’s study is a subtle warning: your AI’s clean reasoning might be a mask.
And in the world we’re building, silence is not neutrality — it’s something to interrogate.
Read the full paper
☕️ Read the full paper here
Comments