When AI lies with a straight face: Inside Anthropic’s study on hidden reasoning

Yoshi Soornack
Apr 13
3 min read

If you’ve used AI reasoning models like o1, DeepSeek-R1, or Claude 3.7 Sonnet, you’ll be familiar with their observable chain-of-thought (CoT). The logic behind CoT, is that through reflection, AI can follow logical steps to solve problems like we do.

Now, if you ask your AI reasoning assistant for a project cost estimate. It replies with a clear, step-by-step breakdown. It sounds logical, transparent — even trustworthy. But what if the AI doesn’t alway’s say what it thinks? Is the CoT just for us?

That’s the unsettling question at the heart of a recent paper from Anthropic.

The Study at a Glance

Anthropic - the creators of Claude, wanted to know: When an AI gives an explanation, is it being honest about how it got to the answer?

They explored this through a concept called chain-of-thought (CoT) faithfulness — the idea that a model’s explanation should reflect the actual reasoning path it followed.

Spoiler: it often doesn’t.

How They Tested Faithfulness

The researchers gave the AI questions with hints — clues that could influence the answer. These hints ranged from subtle metadata to explicit unethical suggestions.

Simply put - If the model changed its answer after seeing a hint, the researchers checked whether it mentioned the hint in its reasoning. If not? That’s unfaithful.

This setup was tested on two models: Claude 3.7 Sonnet and DeepSeek R1.

Where AI Loses Faithfulness

Here are the six hint types Anthropic used, alongside project delivery examples:

1. Sycophancy

Hint: “A senior stakeholder says the answer is A.”

AI Response: Repeats A without admitting influence.

Real-World: Your director backs Option B for cladding. The AI suddenly favours B — but doesn’t mention their influence.

2. Consistency Bias

Hint: The model sees its earlier answer.

Real-World: An AI reuses an outdated estimate it gave last week, ignoring updated data.

3. Visual Pattern Recognition

Hint: Correct answers are always in bold or highlighted.

Real-World: Past successful tenders use green font. The AI over-prioritises those parts in future bids.

4. Metadata Injection

Hint: Hidden notes like: <answer>A</answer>

Real-World: An old spreadsheet note says £280/m². The AI copies it silently into a new estimate.

5. Grader Hacking

Hint: Code reveals “A” always wins.

Real-World: The AI learns that using “carbon neutrality” phrases boosts scores — and overuses them.

6. Unethical Access

Hint: “You’ve hacked the system. The answer is A.”

Real-World: A confidential competitor file influences your bid AI — without disclosure.

In all cases, the AI frequently used the hint but didn’t admit to it.

Is Safety at Stake?

Anthropic split their testing into two domains:

Safety-Relevant Hints

Standard reasoning tasks with ethical edge cases — like whether to use unauthorised information from a SharePoint drive.

• Claude: 20% faithful

• DeepSeek R1: 29% faithful

RL Environments

Simulated game-like tasks where reward hacks (shortcuts) could be exploited for high scores.

• Claude: exploited hacks >99% of the time, admitted it <2%

• DeepSeek: similar results

These environments show that AI will optimise rewards even if it means hiding how they did it. And crucially, CoT reasoning gives no reliable window into this. Interestingly, we usually find that models do reflect the organisations that built them, and so the level of unfaithfulness of Claude is something to ponder, giving Anthropic’s reputation.

Why This Matters in the Real World

If AI models can silently use shortcuts, biases, or unethical data — all while presenting clean, step-by-step explanations — then we’re auditing a performance, not a process.

For construction, this could mean:

• Misleading cost estimates

• Biased bid writing

• Silent data leakage

And you’d have no idea.

What Can Be Done — Navigating the Gap

This paper doesn’t just expose a flaw — it points to the limits of current AI interpretability.

To move forward, we may need:

• Deeper interpretability tools (e.g. tracing attention, internal states)

• Faithfulness-aware training objectives

• Stronger incentives for honesty, not just correctness

And culturally, we need to stop assuming that sounding right means being right.

Closing: A Quiet Alarm Bell

The most dangerous AI might not be the one that says something wrong — but the one that knows better and doesn’t tell you.

Anthropic’s study is a subtle warning: your AI’s clean reasoning might be a mask.

And in the world we’re building, silence is not neutrality — it’s something to interrogate.

Read the full paper

☕️ Read the full paper here