@www.marktechpost.com
//
Apple researchers are challenging the perceived reasoning capabilities of Large Reasoning Models (LRMs), sparking debate within the AI community. A recent paper from Apple, titled "The Illusion of Thinking," suggests that these models, which generate intermediate thinking steps like Chain-of-Thought reasoning, struggle with fundamental reasoning tasks. The research indicates that current evaluation methods relying on math and code benchmarks are insufficient, as they often suffer from data contamination and fail to assess the structure or quality of the reasoning process.
To address these shortcomings, Apple researchers introduced controllable puzzle environments, including the Tower of Hanoi, River Crossing, Checker Jumping, and Blocks World, allowing for precise manipulation of problem complexity. These puzzles require diverse reasoning abilities, such as constraint satisfaction and sequential planning, and are free from data contamination. The Apple paper concluded that state-of-the-art LRMs ultimately fail to develop generalizable problem-solving capabilities, with accuracy collapsing to zero beyond certain complexities across different environments.

However, the Apple research has faced criticism. Experts such as Professor Seok Joon Kwon argue that Apple's lack of high-performance hardware, such as a large GPU-based cluster comparable to those operated by Google or Microsoft, could be a factor in its findings. Some argue that the models perform better on familiar puzzles, suggesting that their success may be linked to training exposure rather than genuine problem-solving skill. Others, such as Alex Lawsen and "C. Opus," argue that the results don't support claims about fundamental reasoning limitations, but rather highlight engineering challenges related to token limits and evaluation methods.
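Part of the appeal of these puzzles as benchmarks is that ground truth is cheap to compute while difficulty is a single dial. The paper's own harness isn't reproduced here, but as an illustrative sketch, the classic recursive Tower of Hanoi solver shows how the disk count n controls complexity, with the optimal solution length growing as 2^n - 1:

```python
def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Return the optimal move sequence for n disks as (disk, from, to) tuples."""
    if n == 0:
        return []
    # Move n-1 disks out of the way, move the largest, then restack on top of it.
    return (hanoi_moves(n - 1, src, dst, aux)
            + [(n, src, dst)]
            + hanoi_moves(n - 1, aux, src, dst))

# Complexity is one dial: each extra disk roughly doubles the solution length.
for n in (3, 5, 7):
    print(n, len(hanoi_moves(n)))  # 7, 31, 127 moves
```

A reference sequence like this can be compared against a model's proposed moves at any chosen complexity, which is what makes accuracy-versus-complexity curves straightforward to produce.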
@machinelearning.apple.com
//
Apple researchers have released a new study questioning the capabilities of Large Reasoning Models (LRMs), casting doubt on the industry's pursuit of Artificial General Intelligence (AGI). The research paper, titled "The Illusion of Thinking," reveals that these models, including those from OpenAI, Google DeepMind, Anthropic, and DeepSeek, experience a 'complete accuracy collapse' when faced with complex problems. Unlike existing evaluations primarily focused on mathematical and coding benchmarks, this study evaluates the reasoning traces of these models, offering insights into how LRMs "think".
Researchers tested various models, including OpenAI's o3-mini, DeepSeek-R1, and Claude 3.7 Sonnet, using puzzles like the Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World. These environments allowed the researchers to vary complexity while keeping the logical structure consistent. The team discovered that standard language models surprisingly outperformed LRMs in low-complexity scenarios, while LRMs only demonstrated advantages in medium-complexity tasks; all models suffered a performance collapse on highly complex tasks. The study suggests that the so-called reasoning of LRMs may be closer to sophisticated pattern matching, which is fragile and prone to failure when challenged with significant complexity. Apple's research team thus identified three distinct performance regimes: low-complexity tasks where standard models outperform LRMs, medium-complexity tasks where LRMs show advantages, and high-complexity tasks where all models collapse.

Separately, Apple has begun integrating generative AI into its own apps and experiences. The new Foundation Models framework gives app developers access to the on-device foundation language model.
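A further reason these environments suit trace-level evaluation is that a model's proposed solution can be checked mechanically, move by move, rather than only by final answer. The study's actual scoring code is not shown here; this is an illustrative legality checker for Tower of Hanoi:

```python
def is_valid_solution(n, moves):
    """Check that a list of (disk, src, dst) moves legally solves n-disk Hanoi."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # bottom-to-top stacks
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # the moved disk is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a larger disk may not sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))  # all disks on the goal peg

print(is_valid_solution(2, [(1, "A", "B"), (2, "A", "C"), (1, "B", "C")]))  # True
```

Checkers like this make it possible to see not just whether a model reaches the goal, but where in its reasoning trace the first illegal move appears.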
Dashveenjit Kaur@TechHQ
//
Dell Technologies has secured a contract with the U.S. Department of Energy to construct the next-generation NERSC-10 supercomputer, a project powered by NVIDIA's Vera Rubin architecture. This new system, dubbed "Doudna" after Nobel laureate Jennifer Doudna, a pioneer in CRISPR gene-editing technology, is poised to be a major federal investment in scientific computing infrastructure. Energy Secretary Chris Wright announced the contract during a visit to Lawrence Berkeley National Laboratory, emphasizing that the deployment in 2026 is crucial for maintaining American technological leadership amidst increasing global competition in AI and quantum computing.
The "Doudna" supercomputer, also known as NERSC-10, aims to significantly accelerate scientific research across multiple domains, including fusion energy, astronomy, and life sciences. Designed to serve 11,000 researchers, it integrates artificial intelligence, quantum workflows, and real-time data streaming from experimental facilities. Unlike traditional supercomputers, Doudna’s architecture emphasizes coherent memory access between CPUs and GPUs, facilitating efficient data sharing between heterogeneous processors, which is essential for modern AI-accelerated scientific workflows.

The Doudna system is expected to deliver a 10x increase in scientific output compared to its predecessor, Perlmutter, while consuming only 2-3x the power, translating to a 3-5x improvement in performance per watt. Nick Wright, advanced technologies group lead and Doudna chief architect at NERSC, stated, "We’re not just building a faster computer, we’re building a system that helps researchers think bigger and discover sooner." NVIDIA's Vera Rubin platform introduces hardware-level optimizations designed for the convergence of simulation, machine learning, and quantum algorithm development.
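The performance-per-watt figure follows directly from the two quoted ranges; a quick check of the arithmetic:

```python
# 10x the scientific output of Perlmutter at 2-3x the power draw.
output_gain = 10.0
power_low, power_high = 2.0, 3.0

best = output_gain / power_low    # 5.0x perf per watt (at 2x power)
worst = output_gain / power_high  # ~3.33x perf per watt (at 3x power)
print(f"perf/watt improvement: {worst:.1f}x to {best:.1f}x")
```

This matches the stated 3-5x efficiency range, with the low end rounded from 10/3.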
@www.quantamagazine.org
//
References : Quanta Magazine, www.trails.umd.edu
Researchers are making strides in AI reasoning and efficiency, tackling both complex problem-solving and the energy consumption of these systems. One promising area is reversible computing, where programs can run backward as easily as forward, in principle saving energy by never deleting information. Michael Frank, a researcher interested in the physical limits of computation, found that reversible computing could sustain computational progress as conventional computing slows against those limits. Christof Teuscher at Portland State University emphasized the potential for significant power savings with this approach.
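The idea can be made concrete with a toy example. Because the energy argument hinges on never erasing information, every step of a reversible program must have an exact inverse. A minimal sketch (illustrative only, not drawn from Frank's work): a mixing step of the form (x, y) → (x, y + x² mod 2³²), which can be undone precisely, so the whole computation can run backward to its starting state.

```python
def forward(x, y):
    """One reversible mixing step: no information is destroyed."""
    return x, (y + x * x) % 2**32

def backward(x, y):
    """Exact inverse of forward(): running the computation in reverse."""
    return x, (y - x * x) % 2**32

state = (7, 1000)
for _ in range(5):
    state = forward(*state)   # run forward five steps
for _ in range(5):
    state = backward(*state)  # undo them, step by step
print(state)  # (7, 1000) -- back at the starting state
```

An irreversible step like `y = 0` has no such inverse; avoiding operations of that kind is what lets reversible machines, in theory, sidestep the erasure cost identified by Landauer's principle.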
An evolution of the LLM-as-a-Judge paradigm is also emerging. Meta AI has introduced the J1 framework, which shifts LLMs from passive generators to active, deliberative evaluators through self-evaluation. This approach, detailed in "J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning," addresses the growing need for rigorous and scalable evaluation as AI systems become more capable and widely deployed. By reframing judgment as a structured reasoning task trained through reinforcement learning, J1 aims to produce models that deliver consistent, interpretable, and high-fidelity evaluations.

Meanwhile, Soheil Feizi, an associate professor at the University of Maryland, has received a $1 million federal grant to advance foundational research in reasoning AI models. The funding, stemming from a Presidential Early Career Award for Scientists and Engineers (PECASE), will support his work on defending large language models (LLMs) against attacks, identifying weaknesses in how these models learn, encouraging transparent, step-by-step logic, and understanding the "reasoning tokens" that drive decision-making. Feizi plans to explore approaches such as live activation probing and novel reinforcement-learning designs, aiming to turn theoretical advances into practical, real-world applications.
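The general recipe J1 describes, judgment as a reasoning task rewarded for agreeing with a verifiable label, can be sketched at a high level. The function names and the reward scheme below are illustrative assumptions, not the paper's implementation:

```python
import random

def judge(prompt, answer_a, answer_b):
    """Stand-in for an LLM judge: emits a reasoning trace, then a verdict."""
    trace = f"Comparing candidate answers for {prompt!r} ..."  # chain-of-thought placeholder
    verdict = random.choice(["A", "B"])                        # a real model decides here
    return trace, verdict

def reward(verdict, gold):
    """Verifiable reward: +1 when the verdict matches the known-better answer."""
    return 1.0 if verdict == gold else 0.0

# RL training would sample many (prompt, answer pair, gold) tuples and
# reinforce the traces whose verdicts earn reward; here we score one rollout.
trace, verdict = judge("What is 2+2?", "4", "5")
print(reward(verdict, gold="A"))
```

The key design choice this sketch reflects is that the judge is trained on its verdicts, not its prose: the reasoning trace is incentivized only indirectly, by whatever reward the final judgment earns.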