AI Flash

SemiAnalysis: GPT-5.5 Returns to the Frontier, Though OpenAI Omits a Key Metric Where Opus Prevails

Apr 27, 2026 · 14:41

SemiAnalysis, a prominent semiconductor and AI analytics firm, has released a comprehensive horizontal evaluation of leading AI programming assistants, covering GPT-5.5, Opus 4.7, and DeepSeek V4. The core conclusion is that GPT-5.5 marks OpenAI's first return to the AI frontier in half a year. As a result, engineers at SemiAnalysis have begun alternating between Codex and Claude Code, having previously relied almost exclusively on Claude. GPT-5.5 is built on a new pre-training run codenamed "Spud," representing OpenAI's first expansion of pre-training scale since GPT-4.5.

Practical Application: A Clear Division of Labor

In practical testing, a distinct division of labor emerged:
  • Claude: Preferred for new project planning and initial scaffolding.

  • Codex (GPT-5.5): Superior for inference-intensive bug fixing.

Codex demonstrated stronger capabilities in understanding data structures and logical reasoning but struggled with inferring vague user intent. In a dashboard task, Claude automatically replicated the reference page layout but hallucinated a significant amount of data, whereas Codex skipped the layout details but provided far more accurate data.

The Benchmark Controversy: SWE-bench Pro vs. Expert-SWE

The report exposes a strategic maneuver in benchmark selection. In February, OpenAI published a blog post urging the industry to adopt SWE-bench Pro as the new standard for programming benchmarks. However, the GPT-5.5 announcement switched to a new benchmark named "Expert-SWE."
The reason lies in the fine print: on SWE-bench Pro, GPT-5.5 was surpassed by Opus 4.7 and trailed significantly behind Anthropic's unreleased "Mythos" model (which scored 77.8%).

Opus 4.7: Bugs and "Hidden" Price Hikes

Regarding Opus 4.7, Anthropic released a post-mortem analysis a week after launch, admitting to three bugs in Claude Code between March and April that affected nearly all users for several weeks. Previously, multiple engineers had reported performance degradation in version 4.6; those reports were dismissed at the time as subjective perception.
Additionally, the new tokenizer in Opus 4.7 can increase token usage by up to 35%, a fact Anthropic has acknowledged. At an unchanged per-token price, that amounts to a hidden price hike.
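
To see why this behaves like a price increase, the sketch below (Python) works through the arithmetic: if the same text now costs up to 35% more tokens at an unchanged per-token price, the bill grows by the same factor. The 35% figure comes from the report; the baseline token count and the $15-per-1M-token price are hypothetical illustrations.

    TOKENIZER_INFLATION = 1.35  # up to 35% more tokens for the same text (per the report)

    def effective_cost(base_tokens: int, price_per_m_tokens: float,
                       inflation: float = TOKENIZER_INFLATION) -> float:
        """Dollar cost after tokenizer inflation, at an unchanged unit price."""
        return base_tokens * inflation * price_per_m_tokens / 1_000_000

    # Hypothetical: a workload that previously consumed 1M tokens at $15 per 1M.
    old_cost = 1_000_000 * 15.0 / 1_000_000     # $15.00 before the tokenizer change
    new_cost = effective_cost(1_000_000, 15.0)  # $20.25 after: a 35% larger bill
    print(f"before: ${old_cost:.2f}; after: ${new_cost:.2f}")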

DeepSeek V4: The Cost-Effective Challenger

DeepSeek V4 was evaluated as "following the frontier but not leading it," positioning it as the lowest-cost alternative to closed-source models. The article notably remarked that "Claude still outperforms DeepSeek V4 Pro on high-difficulty Chinese writing tasks," adding that "Claude won against the Chinese model using their own language."

Key Concept: Cost-Per-Task vs. Cost-Per-Token

The report proposes a crucial concept for evaluating model pricing: focus on "cost-per-task" rather than "cost-per-token." While GPT-5.5's unit price is double that of GPT-5.4 (input: $5, output: $30 per 1M tokens), it completes the same tasks using significantly fewer tokens, so the actual cost per task may not be higher. Preliminary data from SemiAnalysis shows Codex has an input-to-output token ratio of 80:1, compared to Claude Code's 100:1.
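
The sketch below (Python) works through that arithmetic under stated assumptions: the $5/$30 per-1M-token prices for GPT-5.5 and the 2x unit-price gap versus GPT-5.4 come from the report, while every token count is a hypothetical illustration.

    def cost_per_task(input_tokens: int, output_tokens: int,
                      input_price: float, output_price: float) -> float:
        """Dollar cost of one task, given token usage and per-1M-token prices."""
        return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

    # GPT-5.5 prices from the report: $5 input / $30 output per 1M tokens.
    # Hypothetical token counts: GPT-5.5 finishes a bug fix in 400k input /
    # 5k output tokens; GPT-5.4 needs 2.5x the tokens at half the unit price.
    gpt55 = cost_per_task(400_000, 5_000, 5.0, 30.0)      # -> $2.15
    gpt54 = cost_per_task(1_000_000, 12_500, 2.5, 15.0)   # -> $2.69
    print(f"GPT-5.5: ${gpt55:.2f} per task; GPT-5.4: ${gpt54:.2f} per task")

Under these assumptions, the model with the doubled unit price is still cheaper per completed task, which is exactly the distinction the report urges buyers to track.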

