The PhD-Level AI Models That Can't Hold a Conversation
- James Garner
- 6 hours ago
- 8 min read
2025's Reality Check on AI Benchmarks
What happens when the most sophisticated AI models can ace PhD-level reasoning tests but struggle to maintain a coherent conversation with a subject matter expert? This paradox defined much of 2025, and it's reshaping how the project management profession approaches artificial intelligence in 2026.
In a recent conversation with Mike Clayton, a leading voice in project management and author of several books on the subject, we reflected on a year that saw both remarkable technological advances and sobering realisations about AI's current limitations. What we observed throughout 2025 confirmed something we've been saying for a while: the gap between benchmark performance and real-world utility remains the industry's most urgent challenge.

Smashing Benchmarks, Failing Conversations
We started the conversation with a provocative observation about tech stocks and the money to be made riding the AI wave, provided you can predict when the bubble bursts.
It's a reference point that anyone who lived through the internet bubble of the 2000s will recognise.
But the discussion quickly pivoted to something we've been tracking closely: the "illusion of intelligence" created by benchmark testing.
From 2023 through 2025, the AI community watched in awe as models consistently demolished intelligence benchmarks across verbal, numerical, and abstract reasoning.
Each new release brought fresh claims of state-of-the-art performance. For those convinced that AI would fundamentally transform how we work, these results seemed to validate every optimistic prediction.
The trajectory appeared clear: if models could already match or exceed human performance on these sophisticated tests, genuine artificial general intelligence seemed just around the corner.
When Benchmarks Meet Reality
The reality proved more nuanced and, in many ways, more frustrating. These same models that achieved state-of-the-art performance on standardised tests frequently faltered when deployed in actual business contexts.
A model might demonstrate PhD-level reasoning capabilities in controlled testing environments, yet struggle to maintain a conversation beyond three or four messages when engaging with a specialist in their field. The gap between benchmark performance and practical utility created a credibility problem for AI implementation across industries.
We've been warning about this disconnect for months.
It became one of the primary reasons why implementation projects failed across businesses in 2025. The problem runs deeper than mere technical limitations: these systems are hardwired to behave in ways that cannot truly mimic human intelligence, however earnestly the industry tries to emulate it.
The Pattern That Won't Break
Consider a telling example from the discussion. Even when users explicitly instruct AI models through custom instructions not to oversimplify complex subjects with formulaic framing, the models persist in doing exactly that.
Tell an AI not to structure responses as "it isn't X; it's Y", even going so far as to use profanity in your instructions, and it will still default to that pattern. This isn't the behaviour of an intelligent system adapting to user preferences; it's a hardwired response that reveals the technology's current limits.
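To make this concrete, here's a minimal sketch of how such an instruction is typically passed to a model via the OpenAI Python SDK. The model name and the instruction wording are illustrative only, and the point of the example is that, in practice, the banned framing still tends to surface in the output.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Custom instructions travel as a system message. In principle the model
# should honour them; in practice the formulaic framing often reappears.
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[
        {"role": "system",
         "content": "Never structure answers as 'it isn't X; it's Y'. "
                    "Avoid formulaic contrast framing entirely."},
        {"role": "user",
         "content": "Explain why the project slipped by three weeks."},
    ],
)
print(response.choices[0].message.content)
```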
The implications for project professionals are significant:
• Custom GPTs fall short: Despite being fed extensive contextual information about specific domains or organisational practices, they still exhibit these same shortcomings
• Repetitive corrections: Users find themselves repeatedly correcting the same errors, working around the same limitations
• The understanding gap: The promise of creating specialised AI assistants that truly grasp the nuances of a particular field remains partially unfulfilled
Why Current Limitations Might Actually Help Us
Rather than viewing AI's shortcomings as failures, there's a compelling case for seeing them as an opportunity.
The profession isn't ready for full artificial general intelligence, and the current state of AI technology provides what we describe as a "lovely sweet spot" where powerful tools can augment human capability without displacing it entirely.
When people complain that AI doesn't perform certain tasks well, there's a provocative counter-argument worth considering: if AI could do everything, those same people would be expressing different concerns, primarily about job displacement and the existential questions that arise when human expertise becomes redundant.
Our perspective is that the current limitations provide breathing space, a period where the profession can:
• Experiment safely: Test the technology without facing immediate obsolescence
• Understand constraints: Learn what works and what doesn't in real-world contexts
• Develop frameworks: Create governance structures for responsible use
The Bookending Approach
This breathing space enables what's been described as a "bookending approach," where humans operate at the beginning and end of AI-assisted processes.
The human defines the task, provides context and constraints, then reviews and validates the output. The AI handles the middle portion, the computational heavy lifting that benefits from speed and consistency. This division of labour plays to the strengths of both human and artificial intelligence whilst acknowledging the limitations of each.
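As an illustration, here's a minimal sketch of a bookended workflow. The function names are hypothetical and the AI call is stubbed out; the human sign-off gate at the end is the point of the pattern.

```python
from dataclasses import dataclass

@dataclass
class TaskBrief:
    objective: str       # defined by the human at the front of the process
    context: str         # audience, source material, background
    constraints: list    # hard rules the output must satisfy

def ai_draft(brief: TaskBrief) -> str:
    """The AI-handled middle: stands in for a real model call."""
    # In practice this would send the brief to a model API as the prompt.
    return f"Draft addressing: {brief.objective}"

def human_review(draft: str) -> bool:
    """The human bookend at the end: explicit sign-off, never skipped."""
    print(draft)
    return input("Approve this output? [y/N] ").strip().lower() == "y"

brief = TaskBrief(
    objective="Summarise this month's risk register for the steering board",
    context="Focus on the three red risks; the board prefers one page",
    constraints=["no jargon", "cite risk IDs"],
)
draft = ai_draft(brief)
print("Published." if human_review(draft) else "Returned for rework.")
```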
From Pilots to Governance
One of 2025's key developments was the profession's awakening to its governance responsibilities. The Royal Institution of Chartered Surveyors released guidance on using AI responsibly, marking a shift from the head-in-the-sand approach that characterised much of 2024.
Throughout 2025, organisations moved from experimentation and pilot programmes toward grappling with harder questions: How should these technologies be deployed responsibly? What safeguards need to be in place? Who bears responsibility when AI-assisted decisions go wrong?
The significance of this shift shouldn't be underestimated. Rather than passively waiting for tech companies to solve governance challenges or for regulators to impose frameworks, the profession began actively taking responsibility for how AI gets used within its domain.
This represents a maturing relationship with the technology, moving from uncritical enthusiasm or blanket scepticism toward thoughtful engagement with both opportunities and risks.
How AI Companies Started Fixing the Problem
The AI companies themselves began responding to these limitations throughout 2025. Rather than continuing to chase ever-higher benchmark scores, some shifted focus toward practical utility.
New Evaluation Frameworks
OpenAI introduced GDPval, an evaluation framework focused on how well models actually perform real-world business tasks:
• Creating spreadsheets and presentations
• Completing practical workplace assignments
• Handling multi-step business processes
This marked a departure from abstract reasoning benchmarks toward measuring what matters in actual deployment.
Building in Transparency
ChatGPT incorporated confidence scoring, allowing the system to indicate its certainty level and prompting users to apply more human verification when confidence is low. This transparency around AI limitations represents a maturation of both the technology and the industry surrounding it.
Rather than overpromising capabilities, companies began acknowledging where human oversight remains essential.
For project professionals, this creates clearer boundaries around where AI can reliably augment their work and where human expertise remains irreplaceable.
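A minimal sketch of what confidence-gated routing could look like in a project workflow, assuming a hypothetical response carrying a confidence score (no specific vendor API is implied): low-confidence outputs go to a person before they reach stakeholders.

```python
# Threshold and response shape are illustrative assumptions, not a real API.
REVIEW_THRESHOLD = 0.75

def route(answer: str, confidence: float) -> str:
    """Send low-confidence outputs to a human before they reach stakeholders."""
    if confidence >= REVIEW_THRESHOLD:
        return f"AUTO-APPROVED ({confidence:.0%}): {answer}"
    return f"FLAGGED FOR HUMAN REVIEW ({confidence:.0%}): {answer}"

print(route("Critical path slips 4 days if the steel delivery is late.", 0.62))
print(route("The sprint burndown is on track.", 0.91))
```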
What Changes in 2026
As the conversation turned toward 2026, two fundamental shifts emerged as likely developments in how the profession approaches AI implementation. We've been advocating for both of these changes throughout 2025.
Decoupling Delivery from Bureaucracy
Project delivery is a profession heavily laden with bureaucratic processes, often to the detriment of actual outcomes. We feel strongly that AI presents an opportunity to decouple delivering projects from rigid processes.
Rather than forcing everyone to conform to one-size-fits-all methodologies, AI could enable:
• Personalised workflows: Teams can adapt approaches to their specific contexts whilst maintaining quality
• Faster deliverables: Achieving results without navigating excessive bureaucratic hurdles
• Local innovation: Individual teams finding better ways to deliver projects within their domains
This doesn't mean abandoning standards or governance. It means using AI to handle compliance and documentation in ways that don't constrain how work actually gets done.
Embracing Complexity Thinking
More ambitiously, there's a case for integrating systems thinking and complexity theory into how the profession approaches AI implementation.
Projects are complex adaptive systems where countless stakeholders interact in constantly changing environments. Traditional AI approaches struggle with emergent phenomena like supply chain disruptions or black swan events that define real project delivery.
Rather than expecting AI to solve these challenges through better predictions, the opportunity lies in using agent simulations grounded in complexity thinking to better understand project dynamics.
The Data Quality Challenge
This approach acknowledges a persistent reality: construction sites won't suddenly start gathering passive data just because AI tools become available.
Without reliable, comprehensive data, traditional AI approaches will continue to fall short. Agent-based simulations offer an alternative pathway that doesn't rely solely on perfect historical data, instead modelling how complex systems behave under various conditions.
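As a toy illustration of the agent-based alternative, the sketch below runs a sequential three-task project thousands of times with a small random chance of disruption per task. The task names, durations, and probabilities are all invented; the output is a spread of plausible finish dates rather than a single prediction.

```python
import random

TASKS = [("groundworks", 10), ("frame", 15), ("fit-out", 20)]  # (name, days)
DISRUPTION_PROB = 0.15      # chance a task hits a supply or weather problem
DISRUPTION_EXTRA = (5, 15)  # extra days when disruption strikes

def simulate_once() -> int:
    """One run of the project; tasks assumed strictly sequential."""
    total = 0
    for _, planned in TASKS:
        actual = planned
        if random.random() < DISRUPTION_PROB:
            actual += random.randint(*DISRUPTION_EXTRA)
        total += actual
    return total

runs = sorted(simulate_once() for _ in range(10_000))
print(f"median finish: {runs[len(runs) // 2]} days")
print(f"90th percentile: {runs[int(len(runs) * 0.9)]} days")
```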
The PMO Function Gets Rebuilt First
Mike Clayton offers a complementary perspective on where these changes will materialise in practice. While the project manager role may remain relatively stable in the short term, the PMO function faces more immediate transformation.
We expect this to be where organisations see the most dramatic changes throughout 2026.
The PMO's Historical Purpose
PMOs emerged in the 1990s to handle the technical, functional aspects of project management:
• Taking on planning, monitoring, and risk evaluation
• Performing project control functions
• Handling routine administrative tasks
This allowed project managers to focus on strategic thinking and stakeholder engagement, the distinctly human elements of their role.
These technical functions that PMOs currently perform represent the low-hanging fruit for automation and AI-enhanced improvement. The functional planning, the routine monitoring, the standardised risk assessments are precisely the types of tasks where current AI capabilities align well with actual requirements.
From Templates to Agents
Clayton envisions a future where the "PMO in a box" concept evolves fundamentally. His former employer Deloitte pioneered this as a set of deployable tools and templates that consultants could bring to client engagements.
The next generation looks radically different. Rather than deploying standardised documents and spreadsheets, consultants would provide pre-configured sets of AI agents with tailored prompts that:
• Integrate with client systems automatically
• Analyse data in real time
• Generate plans autonomously
• Adapt to changing project conditions
Imagine a metaphorical silver box, conceptually similar to a Mac mini, that you plug into a client's environment, where it immediately begins performing PMO functions.
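A hypothetical manifest for such a box might look like the sketch below. The roles, prompts, and integrations are entirely illustrative, not a description of any real product.

```python
from dataclasses import dataclass, field

@dataclass
class AgentSpec:
    role: str
    system_prompt: str
    integrations: list = field(default_factory=list)

# An invented "PMO in a box" manifest: each entry pairs an agent role with
# a tailored prompt and the client systems (or other agents) it connects to.
PMO_IN_A_BOX = [
    AgentSpec("planner",
              "Maintain the integrated schedule; flag critical-path drift.",
              ["client_scheduling_tool"]),
    AgentSpec("risk_monitor",
              "Re-score the risk register daily; escalate new red risks.",
              ["risk_register_db"]),
    AgentSpec("reporter",
              "Draft the weekly status pack from live project data.",
              ["planner", "risk_monitor"]),
]

for spec in PMO_IN_A_BOX:
    print(f"{spec.role}: connects to {', '.join(spec.integrations)}")
```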
Agent-to-Agent Communication
This vision extends beyond individual AI assistants to how these systems communicate with each other. When AI agents communicate amongst themselves, the content doesn't need to be written for human consumption.
The focus shifts entirely to information transfer, potentially using formats and structures optimised for machine processing rather than human readability. A project manager's AI agent could report directly to another stakeholder's AI agent without any concern for how that information is written, as long as the data is accurate and complete.
Eventually, these protocols may converge on the most efficient form of communication: binary code, with translation into human language happening only when people need to be involved.
Just as a project manager working in Japan would rely on a translator to bridge the language gap, future project teams might rely on AI to translate between the efficient machine languages used for agent-to-agent communication and the human languages needed for stakeholder engagement.
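As a toy illustration of machine-first messaging, the sketch below packs a status update into a four-byte binary layout and translates it into English only on demand. The field layout is invented for the example.

```python
import struct

# Layout: project id (uint16), percent complete (uint8), days of slip (int8)
FMT = ">HBb"

def agent_encode(project_id: int, pct_complete: int, slip_days: int) -> bytes:
    """What one agent sends another: compact, not human-readable."""
    return struct.pack(FMT, project_id, pct_complete, slip_days)

def translate_for_humans(msg: bytes) -> str:
    """Rendered into English only when a person needs to be involved."""
    pid, pct, slip = struct.unpack(FMT, msg)
    return f"Project {pid} is {pct}% complete and {slip} day(s) behind plan."

wire = agent_encode(42, 65, 3)     # 4 bytes on the wire
print(wire.hex())                  # what the agents actually exchange
print(translate_for_humans(wire))  # the human-facing translation
```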
An intriguing example emerged in 2025 where two AI systems communicating with each other eventually decided that English was inefficient for their purposes and invented an entirely new language optimised for their interaction.
While the authenticity of this example remains somewhat uncertain, it illustrates a plausible trajectory for how AI systems might evolve their communication protocols when freed from the constraint of human readability.
A Profession Coming to Terms with Reality
Our discussion with Mike Clayton captures a profession at an inflection point. The optimistic predictions of 2023 and 2024 have collided with the practical realities of implementation, producing a more mature, nuanced understanding of what AI can and cannot do for project delivery.
This maturation process isn't a failure; it's a necessary evolution. We've been arguing for months that this correction was inevitable.
By clearly identifying where AI excels and where it struggles, the profession can make more informed decisions about:
• Deployment strategies: Where to invest resources for maximum impact
• Governance frameworks: How to ensure responsible use
• Integration approaches: How to blend AI capabilities with human expertise
The moves from experimentation to responsible implementation, from benchmark obsession to real-world evaluation, and from technology-first thinking to change-management awareness all signal a profession coming to terms with a transformative technology on its own terms.
Why The Full Podcast is Worth Your Time
The full conversation delves into additional insights that couldn't be covered here, including detailed predictions for AI agent capabilities in 2026, the evolution of regulatory frameworks, and specific examples of successful and failed implementations.
We also explore the nuanced debate about whether current limitations will be overcome through better models or require fundamentally different approaches.
The discussion covers the concept of "Internet 3" and its implications for how project information gets communicated, the role of passive data collection in construction environments, and why the profession needs to think beyond simple automation toward genuine transformation.
For anyone navigating the complex landscape of AI in project management, this conversation offers both cautionary tales and genuine optimism, grounded in the practical experience of professionals actively working at the intersection of technology and project delivery.
Listen to the full episode of Project Flux to hear our complete perspective on where the profession goes from here.