Rethinking the Hype: Why Billion-Dollar AIs Struggle with Simple Children’s Puzzles
A recent research paper from Apple has significantly challenged the prevailing view in the tech community that large language models (LLMs) and their latest derivative, large reasoning models (LRMs), possess reliable reasoning abilities. Reactions have varied: some were shocked, others unsurprised. Venture capitalist Josh Wolfe captured the sentiment on X, writing that “Apple [had] just GaryMarcus’d LLM reasoning ability,” coining a verb for the act of critically probing an AI system and exposing the gap between its inflated reputation and its actual capacity for reasoning, understanding, or general intelligence.
The Apple paper demonstrates that leading models such as ChatGPT, Claude, and DeepSeek may appear intelligent but falter as problems grow more complex. Essentially, these models excel at recognizing patterns but stumble when confronted with novel situations that fall outside their training distribution, even though, as the paper points out, they were specifically designed for reasoning tasks. While the paper leaves one issue unresolved, its main assertions are compelling enough that LLM advocates are already beginning to acknowledge these deficiencies, cautiously expressing hope for better outcomes in the future.
This paper echoes and amplifies an argument I have been making since 1998: neural networks of various kinds can generalize within the data distribution they are trained on, but that generalization tends to break down outside it. For instance, I once trained an earlier-generation model on simple mathematical problems using only even-numbered training examples; it could generalize to new even-numbered problems but failed on anything involving odd numbers.
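To make the distinction concrete, here is a minimal, hypothetical sketch of that kind of experiment, not the original 1998 setup, and it assumes numpy and scikit-learn are available: a small network is trained to copy 8-bit numbers but only ever sees even ones, so the lowest bit is always zero in training. It typically handles unseen even numbers, yet reliably fails on odd ones.

```python
# Hypothetical sketch of in-distribution vs. out-of-distribution generalization
# (not the original 1998 experiment); assumes numpy and scikit-learn.
import numpy as np
from sklearn.neural_network import MLPRegressor

def to_bits(n, width=8):
    """Represent an integer as a vector of its low 'width' bits."""
    return np.array([(n >> i) & 1 for i in range(width)], dtype=float)

rng = np.random.default_rng(0)
train_nums = rng.choice(np.arange(0, 256, 2), size=100, replace=False)  # even numbers only
X_train = np.stack([to_bits(n) for n in train_nums])

# Learn the identity function on the training set (inputs are also the targets).
model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
model.fit(X_train, X_train)

def exact_match_rate(nums):
    X = np.stack([to_bits(n) for n in nums])
    preds = (model.predict(X) > 0.5).astype(float)
    return (preds == X).all(axis=1).mean()

train_set = {int(n) for n in train_nums}
unseen_even = [n for n in range(0, 256, 2) if n not in train_set]
odd = list(range(1, 256, 2))

print("unseen even numbers:", exact_match_rate(unseen_even))  # typically close to 1.0
print("odd numbers:        ", exact_match_rate(odd))          # typically close to 0.0
```

Because the low-order bit is never set during training, the network learns nothing about it, and every odd number trips it up, even though the mapping it was asked to learn is trivially simple.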
More than 25 years later, the pattern holds: these systems perform well on tasks similar to their training data, but tend to break down as they move further away from it, as the more rigorous tests in the Apple research show. This limitation is arguably the single most significant drawback of LLMs.
The enduring hope has been that “scaling” these models by making them larger would solve the problem. The new Apple study firmly counters that notion: it challenged some of the most advanced and expensive models with classic puzzles, such as the Tower of Hanoi, and found that the deep issues persist. Combined with numerous expensive failed efforts to build systems at the level of GPT-5, this casts a long shadow over that optimism.
The Tower of Hanoi is a classic puzzle involving three pegs and a stack of discs; the objective is to transfer all the discs from one peg to another, moving one disc at a time and never placing a larger disc on a smaller one. Strikingly, the Apple research found that leading generative models could barely handle seven discs, achieving less than 80% accuracy, and essentially failed at eight. It is a considerable embarrassment that LLMs cannot solve this puzzle reliably.
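For reference, the puzzle itself yields to a textbook recursion that solves any number of discs in exactly 2^n - 1 moves. The short Python sketch below is that standard recursive algorithm; the peg labels and printed output are illustrative, not anything taken from the Apple paper:

```python
# Standard recursive Tower of Hanoi solution (peg labels are illustrative).
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Append the moves that shift n discs from source to target."""
    if moves is None:
        moves = []
    if n > 0:
        hanoi(n - 1, source, spare, target, moves)   # clear the n-1 smaller discs out of the way
        moves.append((source, target))               # move the largest remaining disc
        hanoi(n - 1, spare, target, source, moves)   # stack the smaller discs back on top
    return moves

moves = hanoi(8)
print(len(moves))    # 255 moves, i.e. 2**8 - 1
print(moves[:3])     # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

A dozen lines of conventional code handle eight discs, or eighty, without error; this is exactly the kind of well-defined procedure the paper contrasts with the models’ unreliable performance.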
As co-lead author Iman Mirzadeh shared with me, “it’s not just about ‘solving’ the puzzle. We have an experiment where we provide the model with the solution algorithm, and it still fails. Based on what we observe from their thought processes, they do not demonstrate logical and intelligent reasoning.”
The paper also reinforces arguments made by Arizona State University computer scientist Subbarao Kambhampati, who has pointed out that people often anthropomorphize these systems, mistakenly attributing human-like reasoning processes to them. He has previously demonstrated that they share the same weaknesses documented by Apple.
If billion-dollar AI systems cannot solve problems that AI pioneer Herb Simon addressed using classical methods back in 1957, the prospects for models like Claude or o3 to achieve artificial general intelligence (AGI) seem exceedingly remote. So, what is the unresolved issue I hinted at? Well, humans also have their limitations. In puzzles such as the Tower of Hanoi, many people struggle with versions involving eight discs.
This is precisely why we developed computers and calculators: to reliably compute solutions to complex problems. AGI does not aim to perfectly replicate human reasoning but rather to merge human adaptability with computational power and reliability. We do not want an AGI that falters at basic arithmetic simply because humans sometimes do.
When people ask why I am optimistic about AI—contrary to the common misconception that I oppose it—I highlight the potential advancements in science and technology we could realize by combining the causal reasoning of our best scientists with the immense computational power of modern digital systems.
Ultimately, the Apple paper underscores that these heavily hyped LLMs are no substitute for well-defined conventional algorithms. Nor can they play chess or fold proteins as well as purpose-built algorithms and dedicated systems.
This has significant implications for businesses: one cannot simply drop a model like o3 or Claude into a complicated problem and expect dependable results. For society, it means we cannot fully rely on generative AI; its outputs remain unpredictable.
A striking revelation from the new research is that an LLM may excel on simpler test cases (like the Tower of Hanoi with four discs), giving the misleading impression that it has acquired a valid, generalizable solution when it has not. Certainly, LLMs will remain useful, especially for coding, brainstorming, and writing, provided humans stay in the loop.
However, anyone believing that LLMs are a straightforward path to the type of AGI capable of fundamentally transforming society for the better is mistaken.