Top Mathematics discussions

NishMath - #reasoning

@www.marktechpost.com //
Apple researchers are challenging the perceived reasoning capabilities of Large Reasoning Models (LRMs), sparking debate within the AI community. A recent paper from Apple, titled "The Illusion of Thinking," suggests that these models, which generate intermediate thinking steps like Chain-of-Thought reasoning, struggle with fundamental reasoning tasks. The research indicates that current evaluation methods relying on math and code benchmarks are insufficient, as they often suffer from data contamination and fail to assess the structure or quality of the reasoning process.

To address these shortcomings, Apple researchers introduced controllable puzzle environments, including the Tower of Hanoi, River Crossing, Checker Jumping, and Blocks World, allowing for precise manipulation of problem complexity. These puzzles require diverse reasoning abilities, such as constraint satisfaction and sequential planning, and are free from data contamination. The Apple paper concluded that state-of-the-art LRMs ultimately fail to develop generalizable problem-solving capabilities, with accuracy collapsing to zero beyond certain complexities across different environments.
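The Tower of Hanoi shows why such puzzles make complexity easy to control. The number of disks `n` is the only knob, the optimal solution needs 2^n - 1 moves, and a simulator can check every step of a proposed answer, not just the final state. The minimal Python sketch below illustrates this. It is not Apple's evaluation code. The function names and the `(disk, from_peg, to_peg)` move format are assumptions made for illustration.

```python
from typing import List, Tuple

# A move is (disk, from_peg, to_peg); pegs are numbered 0, 1, 2.
# This move format is an assumption made for this sketch.
Move = Tuple[int, int, int]


def optimal_solution(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> List[Move]:
    """Return the classic recursive solution for n disks (2**n - 1 moves)."""
    if n == 0:
        return []
    return (
        optimal_solution(n - 1, src, dst, aux)    # park n-1 disks on the spare peg
        + [(n, src, dst)]                         # move the largest disk
        + optimal_solution(n - 1, aux, src, dst)  # restack the n-1 disks on top
    )


def verify(n: int, moves: List[Move]) -> bool:
    """Simulate the moves and reject any illegal step, not just a wrong end state."""
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 holds disks n..1, largest at bottom
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # the named disk is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a larger disk cannot go on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))


for n in range(1, 6):
    solution = optimal_solution(n)
    print(n, len(solution), verify(n, solution))
```

Each printed line shows the disk count, the move count (1, 3, 7, 15, 31), and `True`. Because a step-by-step checker can locate the first illegal move, environments like this can assess the reasoning trace as well as the final answer.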

However, the Apple research has drawn criticism. Some experts, such as Professor Seok Joon Kwon, argue that Apple's lack of high-performance computing infrastructure, such as a large GPU cluster comparable to those operated by Google or Microsoft, may have influenced its findings. Others note that the models perform better on familiar puzzles, which suggests their success may reflect exposure during training rather than genuine problem-solving skill. Still others, such as Alex Lawsen and "C. Opus," argue that the Apple results do not demonstrate fundamental reasoning limitations. In their view, the results reflect engineering issues such as output-token limits and the design of the evaluation.
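The token-limit argument comes down to arithmetic. A complete Tower of Hanoi answer lists 2^n - 1 moves, so a fully written-out solution grows exponentially with the number of disks. The sketch below illustrates this with two hypothetical figures: 10 output tokens per move and a 64,000-token output budget. Neither number comes from the papers, and real tokenizers and model limits differ.

```python
def tokens_needed(n_disks: int, tokens_per_move: int) -> int:
    """Output tokens required to write out every move of the optimal solution."""
    return (2 ** n_disks - 1) * tokens_per_move


def first_size_over_budget(budget: int, tokens_per_move: int) -> int:
    """Smallest disk count whose full move list no longer fits in the budget."""
    n = 1
    while tokens_needed(n, tokens_per_move) <= budget:
        n += 1
    return n


# Hypothetical figures chosen only for illustration.
TOKENS_PER_MOVE = 10
OUTPUT_BUDGET = 64_000

for n in (5, 10, 12, 13, 15):
    print(n, tokens_needed(n, TOKENS_PER_MOVE))
print("first size over budget:", first_size_over_budget(OUTPUT_BUDGET, TOKENS_PER_MOVE))
```

Under these assumptions, 12 disks fit (40,950 tokens) but 13 disks need 81,910. A model could therefore "fail" at 13 disks simply because the complete answer cannot be written out, regardless of whether it understands the solution.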

References:
  • TheSequence: The Sequence Research #663: The Illusion of Thinking, Inside the Most Controversial AI Paper of Recent Weeks
  • chatgptiseatingtheworld.com: Research: Did Apple researchers overstate “The Illusion of Thinking” in reasoning models. Opus, Lawsen think so.
  • www.marktechpost.com: Apple Researchers Reveal Structural Failures in Large Reasoning Models Using Puzzle-Based Evaluation
  • arstechnica.com: New Apple study challenges whether AI models truly “reason” through problems
  • 9to5Mac: New paper pushes back on Apple’s LLM ‘reasoning collapse’ study

Carl Franzen@AI News | VentureBeat //
Mistral AI has launched its first reasoning model, Magistral, signaling a commitment to open-source AI development. The Magistral family features two models: Magistral Small, a 24-billion parameter model available with open weights under the Apache 2.0 license, and Magistral Medium, a proprietary model accessible through an API. This dual release strategy aims to cater to both enterprise clients seeking advanced reasoning capabilities and the broader AI community interested in open-source innovation.

Mistral's decision to release Magistral Small under the permissive Apache 2.0 license marks a significant return to its open-source roots. The license allows anyone to use, modify, and redistribute the model, including for commercial purposes. This lets startups and established companies build and deploy their own applications on top of Mistral’s latest reasoning architecture without licensing fees or vendor lock-in. The release also counters concerns that Mistral was drifting toward a closed ecosystem, reaffirming Mistral’s commitment to giving the open community cutting-edge tools.

According to internal benchmarks released by Mistral, Magistral Medium is competitive in the reasoning arena. The company compared it against its predecessor, Mistral Medium 3, and against models from DeepSeek. Separately, the Handoffs feature of Mistral's Agents API supports multi-agent workflows in which one agent can pass a task to another. This lets developers split complex problems into modular steps, as in a MarkTechPost tutorial in which agents collaborate to answer inflation-related questions.
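The handoff pattern can be sketched without any vendor SDK. In the minimal Python example below, each agent either answers or names another agent to take over, and a small loop follows those handoffs. It does not use Mistral's Agents API. The agent names, the keyword routing, and the inflation calculation are hypothetical stand-ins for what model-backed agents would do.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# An agent returns (answer, next_agent). A non-None next_agent is a handoff.
AgentFn = Callable[[str], Tuple[Optional[str], Optional[str]]]


@dataclass
class Agent:
    name: str
    handle: AgentFn


def router(query: str) -> Tuple[Optional[str], Optional[str]]:
    # Hypothetical keyword routing; a real system would let a model decide.
    if "inflation" in query.lower():
        return None, "economics"
    return "I can only route inflation questions in this sketch.", None


def economics(query: str) -> Tuple[Optional[str], Optional[str]]:
    # Hand the numeric work to a dedicated calculator agent.
    return None, "calculator"


def calculator(query: str) -> Tuple[Optional[str], Optional[str]]:
    # Hypothetical fixed scenario: 3% annual inflation over 5 years.
    value = 100 * (1.03 ** 5)
    return f"Something costing $100 today would cost about ${value:.2f} after 5 years of 3% inflation.", None


def run(agents: Dict[str, Agent], start: str, query: str, max_hops: int = 5) -> str:
    """Follow handoffs until an agent answers or the hop limit is reached."""
    current = start
    for _ in range(max_hops):
        answer, next_agent = agents[current].handle(query)
        if answer is not None:
            return f"[{current}] {answer}"
        if next_agent is None:
            return "No agent produced an answer."
        current = next_agent
    return "No answer within the hop limit."


AGENTS = {a.name: a for a in (
    Agent("router", router),
    Agent("economics", economics),
    Agent("calculator", calculator),
)}

print(run(AGENTS, "router", "How does inflation affect $100 over 5 years?"))
```

The hop limit guards against agents handing a task back and forth forever. Bounding the number of handoffs like this is a reasonable precaution in any multi-agent workflow.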

References:
  • Simon Willison's Weblog: Mistral's first reasoning model is out today, in two sizes. There's a 24B Apache 2 licensed open-weights model called Magistral Small (actually Magistral-Small-2506), and a larger API-only model called Magistral Medium.
  • THE DECODER: Mistral launches Europe's first reasoning model Magistral but lags behind competitors
  • AI News | VentureBeat: The company is signaling that the future of reasoning AI will be both powerful and, in a meaningful way, open to all.
  • www.marktechpost.com: How to Create Smart Multi-Agent Workflows Using the Mistral Agents API’s Handoffs Feature
  • TestingCatalog: Mistral AI debuts Magistral models focused on advanced reasoning
  • www.artificialintelligence-news.com: Mistral AI has pulled back the curtain on Magistral, their first model specifically built for reasoning tasks.
  • www.infoworld.com: Mistral AI unveils Magistral reasoning model
  • the-decoder.com: The French start-up Mistral is launching its first reasoning model on the market with Magistral. It is designed to enable logical thinking in European languages.
  • Simon Willison: Mistral's first reasoning LLM - Magistral - was released today and is available in two sizes, an open weights (Apache 2) 24B model called Magistral Small and an API/hosted only model called Magistral Medium. My notes here, including running Small locally with Ollama and accessing Medium via my llm-mistral plugin
  • SiliconANGLE: Mistral AI debuts new Magistral series of reasoning LLMs.
  • MarkTechPost: Mistral AI Releases Magistral Series: Advanced Chain-of-Thought LLMs for Enterprise and Open-Source Applications
  • WhatIs: What differentiates Mistral AI reasoning model Magistral
  • AlternativeTo: Mistral AI debuts Magistral: a transparent, multilingual reasoning model family, including open-source Magistral Small available on Hugging Face and enterprise-focused Magistral Medium available on various platforms.

Mark Tyson@tomshardware.com //
OpenAI has recently launched its newest reasoning model, o3-pro, making it available to ChatGPT Pro and Team subscribers, as well as through OpenAI’s API. Enterprise and Edu subscribers will gain access the following week. The company touts o3-pro as a significant upgrade, emphasizing its enhanced capabilities in mathematics, science, and coding, and its improved ability to utilize external tools.

OpenAI has also cut the price of o3 by 80% and priced o3-pro 87% below its predecessor, o1-pro, making advanced reasoning far more accessible to developers. The price changes come as AI providers compete more aggressively on both performance and affordability. In OpenAI's expert evaluations, reviewers consistently preferred o3-pro over the standard o3 model in every tested category, especially science, programming, and business tasks.
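Percentage cuts are easier to compare when expressed as multipliers on what a fixed budget buys. The short Python sketch below applies the 80% and 87% reductions mentioned above to a hypothetical base price of $10.00 per million tokens. The base price is only an illustration, not OpenAI's actual price list.

```python
from fractions import Fraction


def price_after_cut(price: Fraction, cut_percent: int) -> Fraction:
    """New price after a percentage reduction, kept exact with Fraction."""
    return price * (100 - cut_percent) / 100


def budget_multiplier(cut_percent: int) -> Fraction:
    """How many times more tokens the same budget buys after the cut."""
    return Fraction(100, 100 - cut_percent)


BASE_PRICE = Fraction(10)  # hypothetical $ per million tokens

for cut in (80, 87):
    new_price = price_after_cut(BASE_PRICE, cut)
    print(f"{cut}% cut: ${float(new_price):.2f} per million tokens, "
          f"{float(budget_multiplier(cut)):.1f}x the tokens per dollar")
```

An 80% cut means paying one fifth of the old price, so the same budget buys five times as many tokens. An 87% cut works out to roughly 7.7 times as many.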

Built on the same underlying architecture as o3, o3-pro is tuned to be more reliable, especially on complex tasks that require long-range reasoning. It supports tools such as web browsing, code execution, vision analysis, and memory. The extra processing can make responses slower, but OpenAI suggests the tradeoff is worthwhile for the most challenging questions, "where reliability matters more than speed, and waiting a few minutes is worth the tradeoff."

References:
  • Maginative: OpenAI’s new o3-pro model is now available in ChatGPT and the API, offering top-tier performance in math, science, and coding—at a dramatically lower price.
  • AI News | VentureBeat: OpenAI's most powerful reasoning model, o3, is now 80% cheaper, making it more affordable for businesses, researchers, and individual developers.
  • Latent.Space: OpenAI just dropped the price of their o3 model by 80% today and launched o3-pro.
  • THE DECODER: OpenAI has lowered the price of its o3 language model by 80 percent, CEO Sam Altman said.
  • Simon Willison's Weblog: OpenAI's Adam Groth explained that the engineers have optimized inference, allowing a significant price reduction for the o3 model.
  • AI News | VentureBeat: OpenAI released the latest in its o-series of reasoning model that promises more reliable and accurate responses for enterprises.
  • bsky.app: The OpenAI API is back to running at 100% again, plus we dropped o3 prices by 80% and launched o3-pro - enjoy!
  • Sam Altman: We are past the event horizon; the takeoff has started. Humanity is close to building digital superintelligence, and at least so far it’s much less weird than it seems like it should be.
  • siliconangle.com: OpenAI’s newest reasoning model o3-pro surpasses rivals on multiple benchmarks, but it’s not very fast
  • bsky.app: OpenAI has launched o3-pro. The new model is available to ChatGPT Pro and Team subscribers and in OpenAI’s API now, while Enterprise and Edu subscribers will get access next week. If you use reasoning models like o1 or o3, try o3-pro, which is much smarter and better at using external tools.
  • The Algorithmic Bridge: OpenAI o3-Pro Is So Good That I Can’t Tell How Good It Is
  • datafloq.com: What is OpenAI o3 and How is it Different than other LLMs?
  • www.marketingaiinstitute.com: [The AI Show Episode 153]: OpenAI Releases o3-Pro, Disney Sues Midjourney, Altman: “Gentle Singularity” Is Here, AI and Jobs & News Sites Getting Crushed by AI Search

Carl Franzen@AI News | VentureBeat //
Mistral AI has launched Magistral, its inaugural reasoning large language model (LLM), available in two distinct versions. Magistral Small, a 24-billion-parameter model, is offered with open weights under the Apache 2.0 license, which lets developers freely use, modify, and redistribute the model for commercial or non-commercial purposes. This model can be run locally using tools like Ollama. The other version, Magistral Medium, is available exclusively through Mistral’s API and is tailored for enterprise clients. It provides traceable reasoning, which is crucial for compliance in highly regulated sectors such as legal, financial, healthcare, and government.

Mistral is positioning Magistral as a powerful tool for both professional and creative applications. The company highlights Magistral's ability to perform "transparent, multilingual reasoning," making it suitable for tasks involving complex calculations, programming logic, decision trees, and rule-based systems. Additionally, Mistral is promoting Magistral for creative writing, touting its capacity to generate coherent or, if desired, uniquely eccentric content. Users can experiment with Magistral Medium through the "Thinking" mode within Mistral's Le Chat platform, with options for "Pure Thinking" and a high-speed "10x speed" mode powered by Cerebras.

Benchmark results suggest that Magistral Medium is competitive in the reasoning arena. On the AIME-24 mathematics benchmark, the model achieved 73.6% accuracy, and Mistral compared it against its predecessor, Mistral Medium 3, as well as DeepSeek's models. Mistral's release of Magistral Small under the Apache 2.0 license is seen as a reaffirmation of its commitment to open-source principles. The move contrasts with the company's earlier release of Medium 3 as a proprietary offering, which had raised concerns about a shift toward a more closed ecosystem.

References:
  • AI News | VentureBeat: Mistral's first reasoning model, Magistral, launches with large and small Apache 2.0 version.
  • Simon Willison: Mistral's first reasoning LLM - Magistral - was released today and is available in two sizes, an open weights (Apache 2) 24B model called Magistral Small and an API/hosted only model called Magistral Medium. My notes here, including running Small locally with Ollama and accessing Medium via my llm-mistral plugin
  • Simon Willison's Weblog: Magistral — the first reasoning model by Mistral AI
  • the-decoder.com: Mistral launches Europe's first reasoning model Magistral but lags behind competitors
  • SiliconANGLE: Mistral AI debuts new Magistral series of reasoning LLMs
  • MarkTechPost: Mistral AI Releases Magistral Series: Advanced Chain-of-Thought LLMs for Enterprise and Open-Source Applications
  • TestingCatalog: Mistral AI debuts Magistral models focused on advanced reasoning
  • siliconangle.com: Mistral AI SAS today introduced Magistral, a new lineup of reasoning-optimized large language models. The LLM series includes two algorithms on launch.
  • www.artificialintelligence-news.com: Mistral AI challenges big tech with reasoning model
  • WhatIs: What differentiates Mistral AI reasoning model Magistral

@medium.com //
DeepSeek's latest AI model, R1-0528, is making waves in the AI community thanks to its strong performance on math and reasoning tasks. Despite a name similar to its predecessor's, the new model is substantially different, with a markedly improved performance profile. The original R1, released in January, generated "unprecedented levels of demand." It shot to the top of the App Store past closed-model rivals and overloaded DeepSeek's API to the point that the company had to stop accepting payments.

The most notable improvement in DeepSeek R1-0528 is in mathematical reasoning. On the AIME 2025 test, the model's accuracy rose from 70% to 87.5%, surpassing Gemini 2.5 Pro and putting it in close competition with OpenAI's o3. This improvement is attributed to "enhanced thinking depth": the model uses significantly more tokens per question and works through longer chains of reasoning. That extra room lets it check its own work, recognize errors, and course-correct during problem-solving.
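Framing the AIME 2025 change as an error rate makes the size of the jump clearer. The calculation below is simple arithmetic on the two accuracy figures reported above, not a number published by DeepSeek:

```latex
\begin{aligned}
e_{\text{before}} &= 1 - 0.70 = 0.30 \\
e_{\text{after}}  &= 1 - 0.875 = 0.125 \\
\frac{e_{\text{before}} - e_{\text{after}}}{e_{\text{before}}} &= \frac{0.175}{0.30} \approx 0.583
\end{aligned}
```

In other words, the share of questions the model gets wrong falls by roughly 58%.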

DeepSeek's success is challenging established closed models and intensifying competition in the AI landscape. DeepSeek-R1-0528 retains a Mixture-of-Experts (MoE) architecture, now at very large scale. Because only a small subset of experts is activated for each token, this sparse design lets the model draw on many specialized sub-networks while keeping per-token compute efficient. The context window remains 128K tokens, and RoPE scaling or other techniques could extend it further. DeepSeek's rise is underscored by benchmarks showing it outperforming some of the industry’s leading models, including OpenAI’s ChatGPT. The release of a distilled variant, R1-0528-Qwen3-8B, further broadens access to the technology.
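Sparse activation is what keeps a very large MoE model practical. A gating function scores every expert for each token, but only the top-k experts actually run. The pure-Python sketch below shows generic top-k softmax gating with toy scalar "experts." The expert count, the value of k, and the router scores are hypothetical, and DeepSeek's actual routing differs in its details.

```python
import math
from typing import Callable, List

Expert = Callable[[float], float]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x: float, gate_logits: List[float], experts: List[Expert], k: int = 2):
    """Run only the k highest-scoring experts and mix their outputs.

    Returns the combined output and the indices of the experts that ran.
    """
    top = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)[:k]
    weights = softmax([gate_logits[i] for i in top])  # renormalize over chosen experts
    output = sum(w * experts[i](x) for w, i in zip(weights, top))
    return output, top


# Eight toy experts; expert i simply scales its input by (i + 1).
EXPERTS: List[Expert] = [lambda x, s=i + 1: s * x for i in range(8)]

# Hypothetical router scores for one token.
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]

y, active = moe_forward(1.0, logits, EXPERTS, k=2)
print("active experts:", active, "of", len(EXPERTS))
print("output:", round(y, 4))
```

Only experts 4 and 1 execute for this token. The other six contribute no compute, which is how an MoE model can hold far more parameters than it uses on any single token.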

References:
  • : The 'Minor Upgrade' That's Anything But: DeepSeek R1-0528 Deep Dive
  • TheSequence: The Sequence Radar #554 : The New DeepSeek R1-0528 is Very Impressive
  • lambda.ai: DeepSeek-R1-0528: The Open-Source Titan Now Live on Lambda’s Inference API
  • thezvi.substack.com: DeepSeek-r1-0528 Did Not Have a Moment

@www.marktechpost.com //
DeepSeek has released a major update to its R1 reasoning model, dubbed DeepSeek-R1-0528, marking a significant step forward in open-source AI. The update boasts enhanced performance in complex reasoning, mathematics, and coding, positioning it as a strong competitor to leading commercial models like OpenAI's o3 and Google's Gemini 2.5 Pro. The model's weights, training recipes, and comprehensive documentation are openly available under the MIT license, fostering transparency and community-driven innovation. This release allows researchers, developers, and businesses to access cutting-edge AI capabilities without the constraints of closed ecosystems or expensive subscriptions.

The DeepSeek-R1-0528 update brings several core improvements. The model's parameter count has increased from 671 billion to 685 billion, enabling it to process and store more intricate patterns. Enhanced chain-of-thought layers deepen the model's reasoning capabilities, making it more reliable in handling multi-step logic problems. Post-training optimizations have also been applied to reduce hallucinations and improve output stability. In practical terms, the update introduces JSON outputs, native function calling, and simplified system prompts, all designed to streamline real-world deployment and enhance the developer experience.
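With native function calling, the model does not run any code itself. It returns the name of a tool and JSON-encoded arguments, and the application validates and executes that request. The Python sketch below shows only the application side. The `get_exchange_rate` tool, its rates, and the shape of the `tool_call` dictionary are hypothetical and are not DeepSeek's API format.

```python
import json
from typing import Any, Callable, Dict


def get_exchange_rate(base: str, quote: str) -> float:
    # Hypothetical tool the application exposes to the model; rates are made up.
    rates = {("EUR", "USD"): 1.10, ("USD", "EUR"): 0.91}
    return rates[(base, quote)]


TOOLS: Dict[str, Callable[..., Any]] = {"get_exchange_rate": get_exchange_rate}


def dispatch(tool_call: Dict[str, str]) -> str:
    """Validate a model-issued tool call and return the result as JSON text."""
    name = tool_call["name"]
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(tool_call["arguments"])  # arguments arrive as a JSON string
        result = TOOLS[name](**args)
    except (json.JSONDecodeError, TypeError, KeyError) as exc:
        return json.dumps({"error": str(exc)})
    return json.dumps({"result": result})


# A tool call of the kind a model might emit (hypothetical shape).
call = {"name": "get_exchange_rate", "arguments": '{"base": "EUR", "quote": "USD"}'}
print(dispatch(call))
```

Returning errors as JSON rather than raising lets the application feed the failure back to the model, which can then correct its arguments on the next turn.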

Specifically, DeepSeek R1-0528 demonstrates a remarkable leap in mathematical reasoning. On the AIME 2025 test, its accuracy improved from 70% to an impressive 87.5%, rivaling OpenAI's o3. This improvement is attributed to "enhanced thinking depth," with the model now utilizing significantly more tokens per question, indicating more thorough and systematic logical analysis. The open-source nature of DeepSeek-R1-0528 empowers users to fine-tune and adapt the model to their specific needs, fostering further innovation and advancements within the AI community.

References:
  • Kyle Wiggers: DeepSeek updates its R1 reasoning AI model, releases it on Hugging Face
  • AI News | VentureBeat: VentureBeat article on DeepSeek R1-0528.
  • Analytics Vidhya: New Deepseek R1-0528 Update is INSANE
  • MacStories: Testing DeepSeek R1-0528 on the M3 Ultra Mac Studio and Installing Local GGUF Models with Ollama on macOS
  • www.marktechpost.com: DeepSeek Releases R1-0528: An Open-Source Reasoning AI Model Delivering Enhanced Math and Code Performance with Single-GPU Efficiency
  • NextBigFuture.com: DeepSeek R1 has significantly improved its depth of reasoning and inference capabilities by leveraging increased computational resources and introducing algorithmic optimization mechanisms during post-training.
  • pandaily.com: In the early hours of May 29, Chinese AI startup DeepSeek quietly open-sourced the latest iteration of its R1 large language model, DeepSeek-R1-0528, on the Hugging Face platform .
  • www.computerworld.com: Reports that DeepSeek releases a new version of its R1 reasoning AI model.
  • techcrunch.com: DeepSeek updates its R1 reasoning AI model, releases it on Hugging Face
  • the-decoder.com: Deepseek's R1 model closes the gap with OpenAI and Google after major update
  • Simon Willison: Some notes on the new DeepSeek-R1-0528 - a completely different model from the R1 they released in January, despite having a very similar name. Terrible LLM naming has managed to infect the Chinese AI labs too
  • Analytics India Magazine: The new DeepSeek-R1 Is as good as OpenAI o3 and Gemini 2.5 Pro
  • : The 'Minor Upgrade' That's Anything But: DeepSeek R1-0528 Deep Dive
  • TheSequence: This article provides an overview of the new DeepSeek R1-0528 model and notes its improvements over the prior model released in January.
  • Kyle Wiggers: News about the release of DeepSeek's updated R1 AI model, emphasizing its increased censorship.
  • Fello AI: Reports that the R1-0528 model from DeepSeek is matching the capabilities of OpenAI's o3 and Google's Gemini 2.5 Pro.
  • felloai.com: Latest DeepSeek Update Called R1-0528 Is Matching OpenAI’s o3 & Gemini 2.5 Pro
  • www.tomsguide.com: DeepSeek’s latest update is a serious threat to ChatGPT and Google — here’s why