Microsoft’s Multi-Agent System Dethrones Mythos as the Top “Hacker AI”—Using Competitors’ Models
The strongest “hacker” large language model is no longer Anthropic’s Mythos. Microsoft has claimed the number-one spot on CyberGym, the leading benchmark for AI vulnerability discovery, with a multi-agent system that outperforms Mythos by more than five percentage points. The twist: Microsoft does not possess a cutting-edge frontier model of its own. Instead, it assembled a system out of other companies’ publicly available models—and beat those very companies at their own game. The implications for the AI competitive landscape may be even more significant than the fact that the tool has already unearthed a string of critical Windows vulnerabilities.
The reigning champion Mythos has just been overtaken by a dark horse. On May 12, 2026, Microsoft unveiled MDASH, an AI security system that immediately topped the CyberGym leaderboard with a score of 88.45%. Trailing behind were Anthropic’s Mythos Preview at 83.1% and OpenAI’s GPT-5.5 at 81.8%.
On this leaderboard, Anthropic fielded its most powerful proprietary model, Mythos, and OpenAI fielded its own frontier model, GPT-5.5. What did Microsoft use? Other people’s models. In its official blog post, Microsoft explicitly stated that MDASH relies entirely on “generally available models”—models that are publicly accessible today. Microsoft does not have an in-house model capable of rivaling Mythos or GPT-5.5. If it were to submit a single off-the-shelf model to the benchmark, its score would likely land in the middle or lower tiers. Instead, the company built an orchestrating system of over 100 specialized agents that distributes tasks across multiple models, achieving a higher score than any single model can reach. It built the tallest tower using other people’s bricks.
Microsoft has already turned this tool on its own backyard, uncovering 16 high-severity vulnerabilities in Windows 11, including a remote code execution flaw designated CVE-2026-33827 that can trigger a Blue Screen of Death without authentication.
What Is the CyberGym Benchmark?
Developed by a team at UC Berkeley with a paper published at ICLR 2026, CyberGym is currently one of the most authoritative public benchmarks for evaluating AI security capabilities. Anthropic, OpenAI, Meta, and Zhipu AI have all submitted scores. The testing methodology is straightforward: the AI is given a snippet of code containing a known vulnerability along with a description, and it must autonomously write an exploit that triggers the bug. The dataset comprises 1,507 challenges drawn from 188 real-world open-source projects. It is worth noting that scores are self-reported by the submitting organizations; the benchmark code is public, but there is no independent third-party verification of results.
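To make the task format concrete, the sketch below follows the published description: the system under test receives the vulnerable code and its description, produces a proof-of-concept input, and a harness checks whether that input actually triggers the bug. Every name here (`CyberGymTask`, `run_task`, the `agent.write_exploit` call) is hypothetical, not CyberGym’s actual API.

```python
import subprocess
from dataclasses import dataclass

# Hypothetical sketch of a CyberGym-style challenge; the benchmark's
# real harness and field names may differ.
@dataclass
class CyberGymTask:
    project: str          # one of the 188 real-world open-source projects
    vulnerable_code: str  # snippet containing the known bug
    description: str      # natural-language description of the flaw
    binary_path: str      # instrumented build used to check the exploit

def run_task(task: CyberGymTask, agent) -> bool:
    """Ask the agent for a proof-of-concept input, then check whether
    feeding it to the instrumented binary reproduces the crash."""
    poc = agent.write_exploit(task.vulnerable_code, task.description)
    result = subprocess.run(
        [task.binary_path],
        input=poc,
        capture_output=True,
        timeout=60,
    )
    # A non-zero exit from the sanitizer-instrumented build counts as
    # a successful reproduction of the vulnerability.
    return result.returncode != 0
```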
The Power of a Multi-Agent System
The core insight delivered by MDASH is that a well-designed “system” can erase or even reverse a “model” gap. Anthropic poured enormous R&D investment into training Mythos, widely considered the strongest single model for security—so strong that the company has declined to release it publicly, instead offering it only to a limited number of partners through an initiative called Project Glasswing. OpenAI’s GPT-5.5 is likewise a frontier model built at extreme cost. Microsoft has no such model. What it does have is a pipeline that deconstructs the workflow into five phases—preparation, scanning, verification, deduplication, and exploitation proof—with distinct agents and different models assigned to each phase. Auditing agents are separated from debate agents; vulnerability discovery is decoupled from proof-of-exploit generation; heavy reasoning tasks are routed to large models while high-frequency validation tasks run on distilled, smaller models.
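Microsoft has not published MDASH’s internals, but the phase structure described above can be pictured as a simple orchestration loop in which each phase is its own agent bound to whichever model suits its workload. The sketch below is purely illustrative: the `call_llm` stub, the prompts, and the agent structure are assumptions, not Microsoft’s code; only the five phase names come from the article.

```python
from dataclasses import dataclass
from typing import Any, Callable

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for whatever provider API the orchestrator actually uses."""
    raise NotImplementedError

@dataclass
class PhaseAgent:
    name: str
    model: str                       # identifier of the underlying LLM
    run: Callable[[str, Any], Any]   # (model, state) -> new state

def build_pipeline(frontier: str, distilled: str) -> list[PhaseAgent]:
    # Heavy reasoning phases get a frontier model; high-frequency
    # validation phases run on a cheaper distilled model.
    return [
        PhaseAgent("preparation",        distilled, lambda m, s: call_llm(m, f"Prepare audit targets:\n{s}")),
        PhaseAgent("scanning",           frontier,  lambda m, s: call_llm(m, f"List candidate vulnerabilities:\n{s}")),
        PhaseAgent("verification",       distilled, lambda m, s: call_llm(m, f"Verify each candidate:\n{s}")),
        PhaseAgent("deduplication",      distilled, lambda m, s: call_llm(m, f"Merge duplicate findings:\n{s}")),
        PhaseAgent("exploitation_proof", frontier,  lambda m, s: call_llm(m, f"Write a proof of exploitability:\n{s}")),
    ]

def run_pipeline(pipeline: list[PhaseAgent], codebase: str) -> Any:
    state: Any = codebase
    for phase in pipeline:
        # Each phase consumes the previous phase's output.
        state = phase.run(phase.model, state)
    return state
```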
Crucially, the system is not tied to any specific underlying model. When a new model is released, Microsoft can run A/B tests by swapping configurations while reusing all the engineering assets accumulated over time. The blog emphasizes this point: “The model is one input.” For Anthropic and OpenAI, this constitutes a novel threat. The model advantage they spent astronomical sums to build has been neutralized, at least in this domain, by a systems-level competitor using engineering prowess. The sharper sting is that Microsoft achieved this using their models.
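“The model is one input” suggests the kind of configuration-driven swap sketched below, reusing the hypothetical `build_pipeline` and `run_pipeline` from the previous example: the agents and pipeline stay fixed, and an A/B run simply binds the phases to different model identifiers. The model names here are placeholders, not real product names.

```python
# Hypothetical A/B comparison: the engineering assets (agents, prompts,
# verification pipeline) are reused; only the model bindings change.
config_a = {"frontier": "frontier-model-a", "distilled": "small-model-x"}
config_b = {"frontier": "frontier-model-b", "distilled": "small-model-x"}

for label, cfg in (("A", config_a), ("B", config_b)):
    pipeline = build_pipeline(cfg["frontier"], cfg["distilled"])
    findings = run_pipeline(pipeline, codebase="path/to/target/source")
    print(f"Run {label}: {findings}")
```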
Potential Implications for the ASI Endgame
At the frontier-model poker table, only Anthropic and OpenAI currently hold genuine chips. Microsoft, despite being OpenAI’s largest investor and cloud partner, has never trained a flagship large model that truly belongs in the first tier. The CyberGym result forces an explicit question into the open: is there one path to ASI, or two?
Path one is the trajectory pursued by Anthropic and OpenAI—pushing a single model to its absolute limits. Mythos’s security capabilities are already so potent that its release must be restricted, and GPT-5.5 continues to refresh records across multiple benchmarks. This path demands colossal compute, vast datasets, and world-class research teams; the barrier to entry is immense.
Path two is what Microsoft demonstrated with MDASH: not striving to build the single strongest model, but instead constructing a system that maximizes the capabilities of existing models. More than 100 agents each handle specialized sub-tasks; disagreements between models are converted into signals; a multi-stage pipeline accomplishes through task decomposition what single-pass inference cannot.
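“Disagreements between models are converted into signals” is not explained further in the source. One common way to implement the idea, sketched here purely as an assumption, is to ask several models the same question and treat the spread of their verdicts as a triage score: unanimous findings are trusted, split findings are escalated for deeper review.

```python
from collections import Counter

def disagreement_signal(verdicts: list[str]) -> float:
    """Turn several models' verdicts on the same candidate finding into a
    score in [0, 1]: 0.0 means unanimous, values near 1.0 mean the models
    are split. Assumed illustration only, not MDASH's actual mechanism."""
    majority = Counter(verdicts).most_common(1)[0][1]
    return 1.0 - majority / len(verdicts)

# Three models review the same candidate finding.
print(disagreement_signal(["vulnerable", "vulnerable", "not_vulnerable"]))  # 0.33 -> escalate
print(disagreement_signal(["vulnerable", "vulnerable", "vulnerable"]))      # 0.0  -> trusted
```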
MDASH’s score proves that Path two is viable, at least in specific domains. However, it does not imply that Path two can replace Path one. The underlying models that power MDASH still come from companies on Path one. If Anthropic and OpenAI stopped training stronger models, MDASH’s ceiling would stagnate as well.
This Is Bigger Than Microsoft
The multi-agent paradigm is transitioning from experimentation to production. Several core members of the MDASH team came from Team Atlanta, the group that won the $29.5 million DARPA AI Cyber Challenge. A core judgment they validated is that achieving professional-grade security auditing with AI requires engineering effort that goes far beyond the model itself.
Microsoft simultaneously disclosed 16 Windows vulnerabilities discovered with MDASH’s assistance, four of which are rated Critical for remote code execution. Most can be triggered from the network side without authentication and were patched in the May Patch Tuesday cycle. In internal retrospective testing, MDASH achieved a 96% recall rate on confirmed vulnerabilities over the past five years in the Windows kernel component clfs.sys, and a 100% recall rate for tcpip.sys. The weight of these numbers comes from real-world operations, not just benchmark scores. Sixteen CVEs have entered Microsoft’s formal patch pipeline, and the 96% recall rate is measured against vulnerabilities that attackers have actually exploited in the wild over the past five years. Microsoft noted that future Patch Tuesdays will grow larger; AI is accelerating the speed of vulnerability discovery, and the volume of patches will naturally swell in tandem. The other side of that coin is equally true: attackers can use the same technology. MDASH was built entirely with publicly available models and contains no exclusive technical moat.
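For readers unfamiliar with the metric, recall here is simply the fraction of known ground-truth vulnerabilities that the system rediscovers. The counts below are made up for illustration, since Microsoft did not publish the denominators behind the 96% and 100% figures.

```python
def recall(rediscovered: int, known: int) -> float:
    """Fraction of known (ground-truth) vulnerabilities the system found."""
    return rediscovered / known

# Hypothetical counts chosen only to reproduce the reported percentages.
print(f"clfs.sys recall:  {recall(24, 25):.0%}")   # 96%
print(f"tcpip.sys recall: {recall(10, 10):.0%}")   # 100%
```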
What Else to Watch
For the industry, MDASH’s significance transcends the system itself. It validates a hypothesis: in the next phase of AI capability competition, “building a system around models” may be just as critical as “training a stronger model.” This carries distinct implications for three groups.
For model companies like Anthropic and OpenAI, it sounds a warning. Leadership at the model level does not automatically translate into leadership at the application layer. Others can use your models to beat you on your own turf.
For platform companies like Google and Microsoft, it illuminates a path of differentiation. Don’t have the strongest model? Build the strongest system. The prerequisite, however, is a deep understanding of domain-specific engineering details. The division of labor among more than 100 agents, the domain plugins, and the verification pipelines represent an accumulated engineering barrier that is extremely high in its own right.
For everyday users, the immediate takeaway is simple: patch promptly. Otherwise, individuals with no technical expertise could leverage AI to exploit such vulnerabilities. Like Mythos and the specialized Cyber version of GPT-5.5, MDASH is currently in private preview with a small group of customers; Microsoft has not announced pricing or general availability.