OpenAI Researchers Predict AI Will Handle Most Human Research Within Two Years, Citing Math as the Proving Ground

Mathematics offers unambiguous proof of artificial intelligence progress, and according to OpenAI researchers, AI systems are on track to perform the majority of human research work within the next two years. This forecast, shared by Sébastien Bubeck and Ernest Ryu in an episode of OpenAI's video series, places urgent pressure on leaders across academia and industry to address the implications of increasingly capable AI systems.

Why Mathematics Serves as the Ideal Benchmark for AI

Unlike vague performance tests that can mask software errors, mathematics demands absolute clarity. Bubeck highlights this advantage: "The nice thing about mathematics is that the questions are very clear, non-ambiguous. You can verify the answer. Once a model can give an answer, everybody will agree: was it correct or was it not correct." This precision makes math a rigorous proving ground for evaluating genuine AI progress.
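
To make that verifiability concrete, here is a minimal sketch (my own illustration, not something from the interview) of mechanically checking a claimed closed-form answer with Python's sympy library; the specific claim is only an example:

```python
# Hedged illustration: a mathematical claim is checked mechanically, leaving no room for debate.
# The claim below (the sum of the first n odd numbers equals n^2) is an example, not from the interview.
import sympy as sp

n, k = sp.symbols("n k", integer=True, positive=True)

claimed = n**2                                  # the answer being asserted
computed = sp.summation(2*k - 1, (k, 1, n))     # what the sum actually evaluates to

# Verification is unambiguous: the difference simplifies to zero or it does not.
assert sp.simplify(computed - claimed) == 0
print("claim verified:", computed)
```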

The Real Test: Sustaining Long Chains of Reasoning

Answering a single prompt is no longer sufficient to demonstrate meaningful machine intelligence. The true challenge lies in maintaining consistency across long chains of reasoning. "To resolve a problem, you have to think for a long time and think consistently," Bubeck explains. A single mistake in a multi-step argument can collapse an entire proof. Consequently, the ultimate benchmark for advanced models is the ability to independently detect and correct their own errors mid-process.
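
One hedged reading of that requirement, sketched below with illustrative expressions rather than anything from OpenAI: every intermediate step must check out, and the chain is treated as broken the moment a single step fails.

```python
# Hedged sketch: a multi-step derivation is only as strong as its weakest step.
# The rewrite chain below is illustrative; the final step contains a deliberate slip.
import sympy as sp

x = sp.symbols("x")

steps = [
    (x + 1)**2,
    x**2 + 2*x + 1,
    x**2 + 2*x + 2,   # a single off-by-one error
]

def check_chain(steps):
    """Return where the chain of claimed-equal expressions first breaks, if anywhere."""
    for i, (a, b) in enumerate(zip(steps, steps[1:]), start=1):
        if sp.simplify(a - b) != 0:
            return f"chain collapses at step {i}: {a} != {b}"
    return "chain verified"

print(check_chain(steps))   # -> chain collapses at step 2
```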

Debating the Nature of Scientific Discovery

As AI models pass increasingly difficult mathematical tests, expectations for historic breakthroughs intensify. A heated debate has emerged among technology leaders over whether these systems generate genuinely original ideas or merely uncover hidden connections between existing academic papers. Bubeck initially shared an example of AI solving a complex math problem through a deep search that scanned thousands of papers to link unrelated fields. However, he notes that internal labs have since progressed far beyond simple retrieval. "A few months later, we have more than ten actual solutions that are completely new, publishable in top journals in combinatorics," he reports, underscoring that AI models are now producing truly original, groundbreaking theorems.

Independent Research Requires Endurance and Human Oversight

Generating novel theorems is impressive, but real scientific breakthroughs demand sustained focus over weeks of rigorous testing. Current AI systems still require human supervisors to guide and verify every shift in direction, a bottleneck that prevents the creation of fully independent scientific labs. To measure progress toward autonomy, Bubeck uses the concept of "AGI time" to track how long a model can independently mimic human thinking without breaking down. "Now we are roughly at days or one week," he says, while the industry goal remains extending that endurance to weeks or months, enabling autonomous work in fields like biology and the physical sciences.

Long-Term Memory as the Foundation for Future Research Tools

Enabling AI to reason over months demands persistent memory that outlasts a single software session. This technical upgrade reshapes the kind of infrastructure organizations must fund and monitor. Limited text windows restrict depth, because historic math proofs can exceed 50 pages and require tracing extensive histories. AI agents will need saved notes, tracked guesses, and detailed reasoning logs, much like developers managing large code repositories. As systems advance toward independent planning and self-review, organized work logs become critical for audits. Science tools will also require seamless integrations and continuous test loops to autonomously locate data and detect errors.

Ernest Ryu points to coding environments as a model to overcome these limitations. Most mathematicians operate within a text window equivalent to about 50 pages, which is "not long enough to make true deep math breakthroughs." Code repositories, by contrast, function like expansive math notes, allowing for much longer work sessions and suggesting a path forward for research tools with extended memory.
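
What those saved notes and reasoning logs might look like in practice is still open; as a rough sketch under my own assumptions (the file name and schema are hypothetical, not an OpenAI design), an agent could append conjectures, attempts, and dead ends to a log that outlives any single session:

```python
# Hedged sketch of session-spanning research notes; the schema and file name are hypothetical.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path("research_log.jsonl")  # hypothetical location for the persistent log

@dataclass
class LogEntry:
    kind: str        # e.g. "conjecture", "proof_attempt", "dead_end"
    summary: str     # short, human-auditable description
    detail: str      # full reasoning trace, kept for later audits
    timestamp: str

def append_entry(kind: str, summary: str, detail: str) -> None:
    """Persist one step of reasoning so a later session can pick up where this one stopped."""
    entry = LogEntry(kind, summary, detail, datetime.now(timezone.utc).isoformat())
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(entry)) + "\n")

def load_entries() -> list:
    """Reload the full history at the start of a new session."""
    if not LOG_PATH.exists():
        return []
    with LOG_PATH.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```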

The Rising Value of Human Expertise in an Automated Era

As AI tools become more powerful and autonomous, the risk of blind trust grows. With machines handling more heavy lifting, workplaces will place an increasing premium on elite professionals who possess the deep foundational knowledge necessary to challenge and verify AI-generated results. Bubeck warns that overreliance on AI could erode the human capacity for deep intellectual engagement, resulting in workers who lack the patience "to sit patiently for hours, many days in a row, to understand deeply a result." In this landscape, true human expertise becomes more valuable than ever, as pushing advanced software to its limits still requires years of rigorous training to identify subtle yet convincing AI mistakes.

Transforming Publication Standards and Peer Review

The anticipated flood of AI-assisted research creates a system-wide trust problem that demands new automated filters to manage volume without approving flawed science. Automated checks can reduce review times by flagging weak steps or missing ideas while maintaining human accountability. As content generation becomes cheap, reputation systems will gain importance, turning human responsibility into a scarce and valuable market signal. Fast feedback loops will redefine academic publishing, prioritizing constantly verified results over static, aging papers. Organizations that build internal verification systems stand to gain a significant advantage in uncovering past academic errors.

The traditional academic review cycle, which can take years to manually verify a 300-page proof, stifles rapid innovation. Ryu notes that fatally incorrect proofs occasionally slip through this slow process. While AI can dramatically accelerate verification, Bubeck cautions against handing total quality control to software. Instead, AI should flag potential issues to reduce the human workload while the supervising researcher bears ultimate reputational responsibility for any published errors. "The social structure of mathematics or code has to change," Bubeck concludes.
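
One way to read that division of labor, in a hedged sketch of my own (the check names and record structure are assumptions, not a described system): automated checks only raise flags, and every flag is closed by an explicit decision from a named human reviewer.

```python
# Hedged sketch: automated checks flag concerns; a named human keeps final responsibility.
from dataclasses import dataclass, field

@dataclass
class Flag:
    location: str   # e.g. "Lemma 3, step 2" (illustrative)
    concern: str    # what the automated check considers weak or missing

@dataclass
class Review:
    reviewer: str                                   # the accountable human, named in the record
    flags: list = field(default_factory=list)
    decisions: dict = field(default_factory=dict)

    def decide(self, flag: Flag, accept: bool, note: str) -> None:
        # Nothing is auto-approved: every flag gets an explicit human decision on the record.
        self.decisions[flag.location] = {"accepted": accept, "note": note}

review = Review(reviewer="supervising_researcher")  # hypothetical reviewer name
review.flags.append(Flag("Lemma 3, step 2", "inequality direction not justified"))
for f in review.flags:
    review.decide(f, accept=False, note="needs a written justification before publication")
print(review.decisions)
```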

Broader Perspective and Remaining Challenges

Despite the ambitious timeline proposed by Bubeck and Ryu, independent assessments urge caution. In its 2025 time-horizons study, METR found that current frontier agents remain unable to execute substantive projects independently or serve as direct substitutes for human labor, suggesting that "most human research work" within two years may be an aggressive projection rather than an established trajectory.

Additionally, impressive demonstrations do not yet equate to durable scientific autonomy. Nature reported in March 2026 that an AI Scientist system passed the first round of peer review for a workshop paper, but the same coverage emphasized that the field continues to grapple with the strengths and limitations of autonomous research tools. The critical question—whether AI can sustain original, reliable discovery without heavy human framing—remains unresolved.

The verification infrastructure is already strained even before fully autonomous research becomes widespread. Nature has documented that more than half of researchers have used AI in peer review, often against journal guidelines, while a separate analysis uncovered tens of thousands of 2025 papers containing potentially invalid AI-generated citations. This suggests the bottleneck is not merely generating more research, but rather building a review system robust enough to stem a growing flood of weak or misleading output.
