Top Mathematics discussions

NishMath

@Techmeme - 46d
OpenAI's new o3 model has achieved a significant breakthrough on the ARC-AGI benchmark, demonstrating advanced reasoning capabilities through a 'private chain of thought' mechanism. This approach involves the model searching over natural language programs to solve tasks, with a substantial increase in compute leading to a vastly improved score of 75.7% on the Semi-Private Evaluation set within a $10k compute limit, and 87.5% in a high-compute configuration. The o3 model uses deep learning to guide program search, moving beyond basic next-token prediction. Its ability to recombine knowledge at test time through program execution marks a major step toward more general AI capabilities.
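What "searching over natural language programs" means concretely is not public, but the idea can be sketched with a toy example. Everything below is hypothetical: the mini-DSL, the training pairs, and the brute-force enumerator (which stands in for the LLM-guided proposal-and-ranking step the summary describes) are illustrative inventions, not o3's actual mechanism.

```python
from itertools import product

# Toy ARC-like task: infer the transformation from input to output.
train_pairs = [([1, 2, 3], [2, 4, 6]), ([0, 5], [0, 10])]

# Hypothetical DSL of primitives that a candidate "program" composes.
PRIMITIVES = {
    "double": lambda xs: [x * 2 for x in xs],
    "increment": lambda xs: [x + 1 for x in xs],
    "reverse": lambda xs: list(reversed(xs)),
}

def run_program(program, xs):
    # Apply each primitive in sequence.
    for op in program:
        xs = PRIMITIVES[op](xs)
    return xs

def score(program):
    # Fraction of training pairs the candidate solves exactly.
    hits = sum(run_program(program, i) == o for i, o in train_pairs)
    return hits / len(train_pairs)

def search(max_len=2):
    # Enumerate all short programs; a guided system would instead have
    # a model propose and rank candidates rather than try everything.
    candidates = (
        tuple(p)
        for n in range(1, max_len + 1)
        for p in product(PRIMITIVES, repeat=n)
    )
    best = max(candidates, key=score)
    return best, score(best)

program, s = search()
print(program, s)
```

Here the one-step program `("double",)` solves both training pairs, so the search returns it with a perfect score. The point of the sketch is only the structure: generate candidate programs, execute them, and keep the best-scoring one, which is the "recombine knowledge at test time through program execution" pattern described above.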

The o3 model's architecture and performance represent a form of deep learning-guided program search, in which the model explores many paths through program space. This process, which can involve tens of millions of tokens and cost thousands of dollars for a single task, is guided by a base LLM. While o3 appears to be more than simple next-token prediction, the core mechanisms of the process remain a matter of speculation. The breakthrough highlights how large increases in compute can drastically improve performance, marking a substantial leap beyond previous GPT-model results. Testing also showed that running o3 in "high efficiency" mode against the 400 public ARC-AGI puzzles cost around $6,677 for a score of 82.8%.
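The figures quoted above imply a per-task cost that is easy to check. A quick sketch of the arithmetic, using only the numbers stated in the summary:

```python
# Per-task cost implied by the reported figures: ~$6,677 total for
# the 400 public ARC-AGI puzzles in "high efficiency" mode.
total_cost_usd = 6677
num_tasks = 400

cost_per_task = total_cost_usd / num_tasks
print(f"~${cost_per_task:.2f} per task")  # ≈ $16.69 per task
```

That is the low-compute configuration; the high-compute configuration mentioned for the 87.5% score is reported to be orders of magnitude more expensive per task.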



References :
  • arcprize.org: OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set - has scored a breakthrough 75.7% on the Semi-Private Evaluation set at our stated public leaderboard $10k compute limit.
  • Simon Willison's Weblog: OpenAI o3 breakthrough high score on ARC-AGI-PUB
  • Techmeme: Techmeme report about O3 model.
  • TechCrunch: TechCrunch reporting on OpenAI's unveiling of o3 and o3-mini with advanced reasoning capabilities.
  • Ars Technica - All content: OpenAI announces o3 and o3-mini, its next simulated reasoning models
  • THE DECODER: OpenAI unveils o3, its most advanced reasoning model yet. A cost-effective mini version is set to launch in late January 2025, followed by the full version.
  • www.heise.de: OpenAI's new o3 model aims to outperform humans in reasoning benchmarks
  • NextBigFuture.com: OpenAI Releases O3 Model With High Performance and High Cost
  • www.techmeme.com: Techmeme post about OpenAI o3 model
  • @julianharris.bsky.social - Julian Harris: OpenAI announced o3 that is significantly better than previous systems, according to an independent benchmark org (The Arc Prize) that apparently got access. Only thing is it’s wildly wildly expensive to run. Like its top end system is around $10k per TASK.
  • shellypalmer.com: OpenAI’s o3: Progress Toward AGI or Just More Hype?
  • pub.towardsai.net: OpenAI’s O3: Pushing the Boundaries of Reasoning with Breakthrough Performance and Cost Efficiency
  • www.marktechpost.com: OpenAI Announces OpenAI o3: A Measured Advancement in AI Reasoning with 87.5% Score on Arc AGI Benchmarks
  • www.rdworldonline.com: Just how big of a deal is OpenAI’s o3 model anyway?
  • NextBigFuture.com: OpenAI O3 Crushes Benchmark Tests But is it Intelligence?
  • Analytics India Magazine: OpenAI soft-launches AGI with o3 models, Enters Next Phase of AI
  • OODAloop: OpenAI’s o3 shows remarkable progress on ARC-AGI, sparking debate on AI reasoning
  • pub.towardsai.net: TAI 131: OpenAI’s o3 Passes Human Experts; LLMs Accelerating With Inference Compute Scaling
@the-decoder.com - 14d
OpenAI's o3 model is facing scrutiny after achieving record-breaking results on the FrontierMath benchmark, an AI math test developed by Epoch AI. It has emerged that OpenAI quietly funded the development of FrontierMath, and had prior access to the benchmark's datasets. The company's involvement was not disclosed until the announcement of o3's unprecedented performance, where it achieved a 25.2% accuracy rate, a significant jump from the 2% scores of previous models. This lack of transparency has drawn comparisons to the Theranos scandal, raising concerns about potential data manipulation and biased results. Epoch AI's associate director has admitted the lack of transparency was a mistake.

The controversy has sparked debate within the AI community, with questions being raised about the legitimacy of o3's performance. While OpenAI claims the data wasn't used for model training, concerns linger as six mathematicians who contributed to the benchmark said that they were not aware of OpenAI's involvement or the company having exclusive access. They also indicated that had they known, they might not have contributed to the project. Epoch AI has said that an "unseen-by-OpenAI hold-out set" was used to verify the model's capabilities. Now, Epoch AI is working on developing new hold-out questions to retest the o3 model's performance, ensuring OpenAI does not have prior access.



References :
  • Analytics India Magazine: The company has had prior access to datasets of a benchmark the o3 model scored record results on. 
  • the-decoder.com: OpenAI's involvement in funding FrontierMath, a leading AI math benchmark, only came to light when the company announced its record-breaking performance on the test.
  • THE DECODER: OpenAI's involvement in funding FrontierMath, a leading AI math benchmark, only came to light when the company announced its record-breaking performance on the test. Now, the benchmark's developer Epoch AI acknowledges they should have been more transparent about the relationship.
  • LessWrong: Some lessons from the OpenAI-FrontierMath debacle
  • Pivot to AI: OpenAI o3 beats FrontierMath — because OpenAI funded the test and had access to the questions