Anthropic has introduced BioMysteryBench, a new benchmark comprising 99 bioinformatics problems crafted by domain experts using real-world datasets. The problems span DNA/RNA sequencing, proteomics, and metabolomics, and each answer is verifiable through the objective properties of the data itself, independent of any specific analytical method chosen by the problem's author. For instance, a task might involve identifying which gene was knocked out in an experimental group from RNA-seq data, or determining parentage from whole-genome sequencing data.
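To make the knockout-identification example concrete, here is a minimal sketch of how such a task might be approached. The gene names, counts, and the fold-change threshold are all invented for illustration; a real solution would work from raw sequencing reads, not a pre-made count table.

```python
def find_knockout(control_counts, experimental_counts, fold_threshold=0.05):
    """Return genes whose experimental expression falls below
    `fold_threshold` times their control expression, i.e. genes
    whose signal collapses in the experimental group."""
    candidates = []
    for gene, ctrl in control_counts.items():
        exp = experimental_counts.get(gene, 0)
        if ctrl > 0 and exp / ctrl < fold_threshold:
            candidates.append(gene)
    return candidates

# Hypothetical per-gene read counts for the two groups.
control = {"GENE_A": 1200, "GENE_B": 850, "GENE_C": 400}
experiment = {"GENE_A": 1150, "GENE_B": 3, "GENE_C": 390}

print(find_knockout(control, experiment))  # → ['GENE_B']
```

The key property the benchmark exploits is visible here: whatever pipeline produces the counts, the knocked-out gene is objectively identifiable from the data.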
In this evaluation, the AI model operates within a container pre-loaded with common bioinformatics tools, granting it the autonomy to install software via pip and conda and to access public databases like NCBI and Ensembl for reference genomes. The analysis method is entirely up to the model; only the final answer is judged. Of the 99 problems, at least one human expert successfully solved 76. The remaining 23 proved unsolvable, even after attempts by up to five different experts.
On the 76 problems solvable by humans, Claude Opus 4.6 achieved an accuracy of 77.4%, with Mythos Preview performing even better. Notably, on the 23 problems that stumped all human experts, Mythos Preview successfully solved 30%.
How does Claude achieve this? One method leverages the vast knowledge from its pre-training on a massive corpus of scientific papers, allowing it to deduce conclusions that would otherwise require a human to perform a meta-analysis of dozens of articles. Another strategy is to run multiple analytical methods in parallel when uncertain, then select the conclusion corroborated by several different approaches.
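The "corroborate across methods" strategy amounts to a majority vote over independent analyses. A minimal sketch, with the method names and their answers entirely hypothetical (each analysis is stood in for by a precomputed result):

```python
from collections import Counter

def consensus_answer(results):
    """Given {method_name: answer}, return the answer supported by
    the most independent methods, plus how many methods agreed."""
    counts = Counter(results.values())
    answer, support = counts.most_common(1)[0]
    return answer, support

# Invented outputs of three different analytical approaches.
results = {
    "differential_expression": "GENE_B",
    "variant_calling": "GENE_B",
    "coverage_analysis": "GENE_D",
}

answer, support = consensus_answer(results)
print(answer, support)  # → GENE_B 2
```

The value of the vote comes from the methods' independence: approaches with uncorrelated failure modes are unlikely to agree on the same wrong answer.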
However, getting an answer right and doing so consistently are two different things. On human-solvable problems, 86% of the answers Opus 4.6 got correct were right in at least 4 out of 5 repeated trials. On the more difficult problems, this consistency rate dropped to 44%, with nearly half of the correct answers succeeding in only 1 or 2 out of 5 attempts, suggesting the model may have stumbled onto a working path by luck rather than reliably reproducing it.
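The consistency metric described above can be sketched as follows: among problems the model ever answered correctly, count the fraction it got right in at least 4 of 5 trials. The trial outcomes below are invented, not the benchmark's actual data.

```python
def consistency_rate(trial_outcomes, threshold=4):
    """trial_outcomes: {problem_id: [bool, ...]} over repeated trials.
    Returns the fraction of ever-correct problems that were answered
    correctly in at least `threshold` trials."""
    ever_correct = {p: runs for p, runs in trial_outcomes.items() if any(runs)}
    if not ever_correct:
        return 0.0
    consistent = sum(1 for runs in ever_correct.values() if sum(runs) >= threshold)
    return consistent / len(ever_correct)

# Hypothetical results over 5 repeated trials per problem.
outcomes = {
    "p1": [True, True, True, True, False],    # consistently solved
    "p2": [True, False, False, False, False], # a lucky one-off
    "p3": [False] * 5,                        # never solved
}

print(consistency_rate(outcomes))  # → 0.5
```

Note that never-solved problems (like "p3") are excluded from the denominator, matching the article's framing of consistency among *correct* answers.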
In a parallel development, Genentech, a subsidiary of Roche, released CompBioBench, a similar benchmark with 100 computational biology problems. On this independent evaluation, Claude Opus 4.6 achieved an overall accuracy of 81% and a 69% success rate on the most difficult subset, reinforcing the conclusions from the BioMysteryBench results.