Top Mathematics discussions

NishMath - #aireasoning

@www.quantamagazine.org //
Recent developments in the field of large language models (LLMs) are focusing on enhancing reasoning capabilities through reinforcement learning. This approach aims to improve model accuracy and problem-solving, particularly in challenging tasks. While some of the latest LLMs, such as GPT-4.5 and Llama 4, were not explicitly trained using reinforcement learning for reasoning, the release of OpenAI's o3 model shows that strategically investing in compute and tailored reinforcement learning methods can yield significant improvements.

Competitors like xAI and Anthropic have also been incorporating more reasoning features into their models, such as the "thinking" or "extended thinking" button in xAI Grok and Anthropic Claude. The somewhat muted response to GPT-4.5 and Llama 4, which lack explicit reasoning training, suggests that simply scaling model size and data may be reaching its limits. The field is now exploring ways to make language models work better, including the use of reinforcement learning.

One of the ways that researchers are making language models work better is to sidestep the requirement for language as an intermediary step. Language isn't always necessary, and that having to turn ideas into language can slow down the thought process. LLMs process information in mathematical spaces, within deep neural networks, however, they must often leave this latent space for the much more constrained one of individual words. Recent papers suggest that deep neural networks can allow language models to continue thinking in mathematical spaces before producing any text.

Recommended read:
References :
  • pub.towardsai.net: The article discusses the application of reinforcement learning to improve the reasoning abilities of LLMs.
  • Sebastian Raschka, PhD: This blog post delves into the current state of reinforcement learning in enhancing LLM reasoning capabilities, highlighting recent advancements and future expectations.
  • Quanta Magazine: This article explores the use of reinforcement learning to make Language Models work better, especially in challenging reasoning tasks.

Chris McKay@Maginative //
OpenAI has released its latest AI models, o3 and o4-mini, designed to enhance reasoning and tool use within ChatGPT. These models aim to provide users with smarter and faster AI experiences by leveraging web search, Python programming, visual analysis, and image generation. The models are designed to solve complex problems and perform tasks more efficiently, positioning OpenAI competitively in the rapidly evolving AI landscape. Greg Brockman from OpenAI noted the models "feel incredibly smart" and have the potential to positively impact daily life and solve challenging problems.

The o3 model stands out due to its ability to use tools independently, which enables more practical applications. The model determines when and how to utilize tools such as web search, file analysis, and image generation, thus reducing the need for users to specify tool usage with each query. The o3 model sets new standards for reasoning, particularly in coding, mathematics, and visual perception, and has achieved state-of-the-art performance on several competition benchmarks. The model excels in programming, business, consulting, and creative ideation.

Usage limits for these models vary, with o3 at 50 queries per week, and o4-mini at 150 queries per day, and o4-mini-high at 50 queries per day for Plus users, alongside 10 Deep Research queries per month. The o3 model is available to ChatGPT Pro and Team subscribers, while the o4-mini models are used across ChatGPT Plus. OpenAI says o3 is also beneficial in generating and critically evaluating novel hypotheses, especially in biology, mathematics, and engineering contexts.

Recommended read:
References :
  • Simon Willison's Weblog: OpenAI are really emphasizing tool use with these: For the first time, our reasoning models can agentically use and combine every tool within ChatGPT—this includes searching the web, analyzing uploaded files and other data with Python, reasoning deeply about visual inputs, and even generating images. Critically, these models are trained to reason about when and how to use tools to produce detailed and thoughtful answers in the right output formats, typically in under a minute, to solve more complex problems.
  • the-decoder.com: OpenAI’s new o3 and o4-mini models reason with images and tools
  • venturebeat.com: OpenAI launches o3 and o4-mini, AI models that ‘think with images’ and use tools autonomously
  • www.analyticsvidhya.com: o3 and o4-mini: OpenAI’s Most Advanced Reasoning Models
  • www.tomsguide.com: OpenAI's o3 and o4-mini models
  • Maginative: OpenAI’s latest models—o3 and o4-mini—introduce agentic reasoning, full tool integration, and multimodal thinking, setting a new bar for AI performance in both speed and sophistication.
  • THE DECODER: OpenAI’s new o3 and o4-mini models reason with images and tools
  • Analytics Vidhya: o3 and o4-mini: OpenAI’s Most Advanced Reasoning Models
  • www.zdnet.com: These new models are the first to independently use all ChatGPT tools.
  • The Tech Basic: OpenAI recently released its new AI models, o3 and o4-mini, to the public. Smart tools employ pictures to address problems through pictures, including sketch interpretation and photo restoration.
  • thetechbasic.com: OpenAI’s new AI Can “See†and Solve Problems with Pictures
  • www.marktechpost.com: OpenAI Introduces o3 and o4-mini: Progressing Towards Agentic AI with Enhanced Multimodal Reasoning
  • MarkTechPost: OpenAI Introduces o3 and o4-mini: Progressing Towards Agentic AI with Enhanced Multimodal Reasoning
  • analyticsindiamag.com: Access to o3 and o4-mini is rolling out today for ChatGPT Plus, Pro, and Team users.
  • THE DECODER: OpenAI is expanding its o-series with two new language models featuring improved tool usage and strong performance on complex tasks.
  • gHacks Technology News: OpenAI released its latest models, o3 and o4-mini, to enhance the performance and speed of ChatGPT in reasoning tasks.
  • www.ghacks.net: OpenAI Launches o3 and o4-Mini models to improve ChatGPT's reasoning abilities
  • Data Phoenix: OpenAI releases new reasoning models o3 and o4-mini amid intense competition. OpenAI has launched o3 and o4-mini, which combine sophisticated reasoning capabilities with comprehensive tool integration.
  • Shelly Palmer: OpenAI Quietly Reshapes the Landscape with o3 and o4-mini. OpenAI just rolled out a major update to ChatGPT, quietly releasing three new models (o3, o4-mini, and o4-mini-high) that offer the most advanced reasoning capabilities the company has ever shipped.
  • THE DECODER: Safety assessments show that OpenAI's o3 is probably the company's riskiest AI model to date
  • shellypalmer.com: OpenAI Quietly Reshapes the Landscape with o3 and o4-mini
  • BleepingComputer: OpenAI details ChatGPT-o3, o4-mini, o4-mini-high usage limits
  • TestingCatalog: OpenAI’s o3 and o4‑mini bring smarter tools and faster reasoning to ChatGPT
  • simonwillison.net: Introducing OpenAI o3 and o4-mini
  • bdtechtalks.com: What to know about o3 and o4-mini, OpenAI’s new reasoning models
  • bdtechtalks.com: What to know about o3 and o4-mini, OpenAI’s new reasoning models
  • thezvi.wordpress.com: OpenAI has finally introduced us to the full o3 along with o4-mini. Greg Brockman (OpenAI): Just released o3 and o4-mini! These models feel incredibly smart. We’ve heard from top scientists that they produce useful novel ideas. Excited to see their …
  • thezvi.wordpress.com: OpenAI has upgraded its entire suite of models. By all reports, they are back in the game for more than images. GPT-4.1 and especially GPT-4.1-mini are their new API non-reasoning models.
  • felloai.com: OpenAI has just launched a brand-new series of GPT models—GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano—that promise major advances in coding, instruction following, and the ability to handle incredibly long contexts.
  • Interconnects: OpenAI's o3: Over-optimization is back and weirder than ever
  • www.ishir.com: OpenAI has released o3 and o4-mini, adding significant reasoning capabilities to its existing models. These advancements will likely transform the way users interact with AI-powered tools, making them more effective and versatile in tackling complex problems.
  • www.bigdatawire.com: OpenAI released the models o3 and o4-mini that offer advanced reasoning capabilities, integrated with tool use, like web searches and code execution.
  • Drew Breunig: OpenAI's o3 and o4-mini models offer enhanced reasoning capabilities in mathematical and coding tasks.
  • TestingCatalog: OpenAI’s o3 and o4-mini bring smarter tools and faster reasoning to ChatGPT
  • www.techradar.com: ChatGPT model matchup - I pitted OpenAI's o3, o4-mini, GPT-4o, and GPT-4.5 AI models against each other and the results surprised me
  • www.techrepublic.com: OpenAI’s o3 and o4-mini models are available now to ChatGPT Plus, Pro, and Team users. Enterprise and education users will get access next week.
  • Last Week in AI: OpenAI’s new GPT-4.1 AI models focus on coding, OpenAI launches a pair of AI reasoning models, o3 and o4-mini, Google’s newest Gemini AI model focuses on efficiency, and more!
  • techcrunch.com: OpenAI’s new reasoning AI models hallucinate more.
  • computational-intelligence.blogspot.com: OpenAI's new reasoning models, o3 and o4-mini, are a step up in certain capabilities compared to prior models, but their accuracy is being questioned due to increased instances of hallucinations.
  • www.unite.ai: unite.ai article discussing OpenAI's o3 and o4-mini new possibilities through multimodal reasoning and integrated toolsets.
  • Unite.AI: On April 16, 2025, OpenAI released upgraded versions of its advanced reasoning models.
  • Digital Information World: OpenAI’s Latest o3 and o4-mini AI Models Disappoint Due to More Hallucinations than Older Models
  • techcrunch.com: TechCrunch reports on OpenAI's GPT-4.1 models focusing on coding.
  • Analytics Vidhya: o3 vs o4-mini vs Gemini 2.5 pro: The Ultimate Reasoning Battle
  • THE DECODER: OpenAI's o3 achieves near-perfect performance on long context benchmark.
  • the-decoder.com: OpenAI's o3 achieves near-perfect performance on long context benchmark
  • www.analyticsvidhya.com: AI models keep getting smarter, but which one truly reasons under pressure? In this blog, we put o3, o4-mini, and Gemini 2.5 Pro through a series of intense challenges: physics puzzles, math problems, coding tasks, and real-world IQ tests.
  • Simon Willison's Weblog: This post explores the use of OpenAI's o3 and o4-mini models for conversational AI, highlighting their ability to use tools in their reasoning process. It also discusses the concept of
  • Simon Willison's Weblog: The benchmark score on OpenAI's internal PersonQA benchmark (as far as I can tell no further details of that evaluation have been shared) going from 0.16 for o1 to 0.33 for o3 is interesting, but I don't know if it it's interesting enough to produce dozens of headlines along the lines of "OpenAI's o3 and o4-mini hallucinate way higher than previous models"
  • techstrong.ai: Techstrong.ai reports OpenAI o3, o4 Reasoning Models Have Some Kinks.
  • www.marktechpost.com: OpenAI Releases a Practical Guide to Identifying and Scaling AI Use Cases in Enterprise Workflows
  • Towards AI: OpenAI's o3 and o4-mini models have demonstrated promising improvements in reasoning tasks, particularly their use of tools in complex thought processes and enhanced reasoning capabilities.
  • Analytics Vidhya: In this article, we explore how OpenAI's o3 reasoning model stands out in tasks demanding analytical thinking and multi-step problem solving, showcasing its capability in accessing and processing information through tools.
  • pub.towardsai.net: TAI#149: OpenAI’s Agentic o3; New Open Weights Inference Optimized Models (DeepMind Gemma, Nvidia…
  • composio.dev: OpenAI o3 vs. Gemini 2.5 Pro vs. o4-mini
  • Composio: OpenAI o3 and o4-mini are out. They are two reasoning state-of-the-art models. They’re expensive, multimodal, and super efficient at tool use.

@simonwillison.net //
Google has broadened access to its advanced AI model, Gemini 2.5 Pro, showcasing impressive capabilities and competitive pricing designed to challenge rival models like OpenAI's GPT-4o and Anthropic's Claude 3.7 Sonnet. Google's latest flagship model is currently recognized as a top performer, excelling in Optical Character Recognition (OCR), audio transcription, and long-context coding tasks. Alphabet CEO Sundar Pichai highlighted Gemini 2.5 Pro as Google's "most intelligent model + now our most in demand." Demand has increased by over 80 percent this month alone across both Google AI Studio and the Gemini API.

Google's expansion includes a tiered pricing structure for the Gemini 2.5 Pro API, offering a more affordable option compared to competitors. Prompts with less than 200,000 tokens are priced at $1.25 per million for input and $10 per million for output, while larger prompts increase to $2.50 and $15 per million tokens, respectively. Although prompt caching is not yet available, its future implementation could potentially lower costs further. The free tier allows 500 free grounding queries with Google Search per day, with an additional 1,500 free queries in the paid tier, with costs per 1,000 queries set at $35 beyond that.

The AI research group EpochAI reported that Gemini 2.5 Pro scored 84% on the GPQA Diamond benchmark, surpassing the typical 70% score of human experts. This benchmark assesses challenging multiple-choice questions in biology, chemistry, and physics, validating Google's benchmark results. The model is now available as a paid model, along with a free tier option. The free tier can use data to improve Google's products while the paid tier cannot. Rates vary by tier and range from 150-2,000/minute. Google will retire the Gemini 2.0 Pro preview entirely in favor of 2.5.

Recommended read:
References :
  • Data Phoenix: Google Unveils Gemini 2.5: Its Most Intelligent AI Model Yet
  • AI News | VentureBeat: Gemini 2.5 Pro is now available without limits and for cheaper than Claude, GPT-4o
  • Simon Willison's Weblog: Google's Gemini 2.5 Pro is currently the top model and, from , a superb model for OCR, audio transcription and long-context coding. You can now pay for it! The new gemini-2.5-pro-preview-03-25 model ID is priced like this: Prompts less than 200,00 tokens: $1.25/million tokens for input, $10/million for output Prompts more than 200,000 tokens (up to the 1,048,576 max): $2.50/million for input, $15/million for output This is priced at around the same level as Gemini 1.5 Pro ($1.25/$5 for input/output below 128,000 tokens, $2.50/$10 above 128,000 tokens), is cheaper than GPT-4o for shorter prompts ($2.50/$10) and is cheaper than Claude 3.7 Sonnet ($3/$15). Gemini 2.5 Pro is a reasoning model, and invisible reasoning tokens are included in the output token count. I just tried prompting "hi" and it charged me 2 tokens for input and 623 for output, of which 613 were "thinking" tokens. That still adds up to just 0.6232 cents (less than a cent) using my which I updated to support the new model just now. I released this morning adding support for the new model: llm install -U llm-gemini llm -m gemini-2.5-pro-preview-03-25 hi Note that the model continues to be available for free under the previous gemini-2.5-pro-exp-03-25 model ID: llm -m gemini-2.5-pro-exp-03-25 hi The free tier is "used to improve our products", the paid tier is not. Rate limits for the paid model - from 150/minute and 1,000/day for tier 1 (billing configured), 1,000/minute and 50,000/day for Tier 2 ($250 total spend) and 2,000/minute and unlimited/day for Tier 3 ($1,000 total spend). Meanwhile the free tier continues to limit you to 5 requests per minute and 25 per day. Google are entirely in favour of 2.5. Via Tags: , , , , , , ,
  • THE DECODER: Google has opened broader access to Gemini 2.5 Pro, its latest AI flagship model, which demonstrates impressive performance in scientific testing while introducing competitive pricing.
  • Bernard Marr: Google's latest AI model, Gemini 2.5 Pro, is poised to streamline complex mathematical and coding operations.
  • The Cognitive Revolution: In this illuminating episode of The Cognitive Revolution, host Nathan Labenz speaks with Jack Rae, principal research scientist at Google DeepMind and technical lead on Google's thinking and inference time scaling work.
  • bsky.app: Gemini 2. 5 Pro pricing was announced today - it's cheaper than both GPT-4o and Claude 3.7 Sonnet I've updated my llm-gemini plugin to add support for the new paid model Full notes here:
  • Last Week in AI: Google unveils a next-gen AI reasoning model, OpenAI rolls out image generation powered by GPT-4o to ChatGPT, Tencent’s Hunyuan T1 AI reasoning model rivals DeepSeek in performance and price

Maximilian Schreiner@THE DECODER //
Google's Gemini 2.5 Pro is making waves as a top-tier reasoning model, marking a leap forward in Google's AI capabilities. Released recently, it's already garnering attention from enterprise technical decision-makers, especially those who have traditionally relied on OpenAI or Claude for production-grade reasoning. Early experiments, benchmark data, and developer reactions suggest Gemini 2.5 Pro is worth serious consideration.

Gemini 2.5 Pro distinguishes itself with its transparent, structured reasoning. Google's step-by-step training approach results in a structured chain of thought that provides clarity. The model presents ideas in numbered steps, with sub-bullets and internal logic that's remarkably coherent and transparent. This breakthrough offers greater trust and steerability, enabling enterprise users to validate, correct, or redirect the model with more confidence when evaluating output for critical tasks.

Recommended read:
References :
  • AI News | VentureBeat: Google’s Gemini 2.5 Pro is the smartest model you’re not using — and 4 reasons it matters for enterprise AI
  • Composio: Gemini 2.5 Pro vs. Claude 3.7 Sonnet (thinking) vs. Grok 3 (think)
  • thezvi.wordpress.com: Gemini 2.5 is the New SoTA
  • www.infoworld.com: Google introduces Gemini 2.5 reasoning models
  • Composio: Gemini 2. 5 Pro vs. Claude 3.7 Sonnet: Coding Comparison
  • Analytics India Magazine: Gemini 2.5 is better than the Claude 3.7 Sonnet for coding in the Aider Polyglot leaderboard.
  • www.tomsguide.com: Surprise move comes just days after Gemini 2.5 Pro Experimental arrived for Advanced subscribers.