Cupertino, June 6, 2025 — Just days before the tech giant’s highly anticipated Worldwide Developers Conference (WWDC), Apple has made headlines with a startling revelation in artificial intelligence research. A newly released paper, “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” argues that even the most advanced AI models struggle, and ultimately fail, when presented with sufficiently complex reasoning tasks.
The Core Finding: Collapse Under Complexity
While Large Reasoning Models (LRMs) and Large Language Models (LLMs) such as Claude 3.7 Sonnet and DeepSeek-V3 have shown promise on standard AI benchmarks, Apple’s research team discovered that their performance deteriorates rapidly when faced with increased complexity.
“They exhibit a counter-intuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget,” the study noted.
This finding points to a fundamental limitation in current-generation AI reasoning, despite apparent improvements in natural language understanding and general task execution.
The Testing Ground: Puzzles That Broke the Models
To investigate, researchers created a framework of puzzles and logic tasks, dividing them into three complexity categories:
- Low Complexity
- Medium Complexity
- High Complexity
Sample tasks included:
- Checkers Jumping
- River Crossing
- Blocks World
- Tower of Hanoi
The models were then tested across this spectrum. They performed adequately on the simpler tasks, but both Claude 3.7 Sonnet (with and without ‘Thinking’ enabled) and the DeepSeek variants saw accuracy collapse on high-complexity problems.
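The appeal of these puzzles is that their difficulty can be dialed up precisely. In Tower of Hanoi, for instance, adding a single disk doubles the length of the shortest solution, so a few extra disks push a problem from trivial to enormous. A minimal Python sketch illustrating that scaling (an illustration of the puzzle itself, not Apple’s actual test harness):

```python
def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Return the optimal move list for an n-disk Tower of Hanoi."""
    if n == 0:
        return []
    # Move n-1 disks out of the way, move the largest, then restack.
    return (hanoi_moves(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi_moves(n - 1, aux, src, dst))

# The minimum solution length is 2^n - 1, so difficulty grows
# exponentially with disk count:
for n in (3, 7, 10):
    print(n, len(hanoi_moves(n)))  # 7, 127, and 1023 moves
```

This is why a model that solves the 3-disk version flawlessly can still collapse entirely a few disks later: each increment compounds the amount of sustained, error-free reasoning required.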
Implications for the AI Industry
This study throws a wrench into the narrative of rapidly advancing AI reasoning, suggesting that today’s most capable systems may be hitting hard ceilings when faced with genuinely complex problems. For a company like Apple, often seen as lagging behind peers such as Google and OpenAI in AI innovation, publishing this research signals a focus on scientific transparency rather than immediate commercial hype.
Why This Matters
The paper’s implications are profound:
- AI reasoning is not scaling linearly with problem difficulty.
- Token limits are not the bottleneck—models stop “thinking” even when resources are available.
- This could explain why LLMs make basic mistakes despite vast knowledge bases.
As WWDC begins, Apple is expected to unveil its AI roadmap, possibly including partnerships, on-device AI capabilities, or integrated features built on Siri and iOS. Whether the company will offer solutions to the issues its own research has exposed remains to be seen.