Alibaba RTPurboV2: Native Transformer Revival with 10x Sparse Attention via Minimal Training

"Full Attention is Being Forgotten"
With the widespread adoption of AI Agents driving the demand for long-sequence processing, the Attention mechanism in traditional GPT architectures is increasingly viewed as a performance bottleneck due to its O(N²) computational complexity. Consequently, attention architectures are iterating at an unprecedented pace. Mainstream industry solutions currently fall into two broad approaches: Linear Attention and Sparse Attention. Linear Attention, represented by Qwen-Next and Kimi-K2, essentially compresses information through an improved linear formulation, reducing storage costs to O(1) and computational costs to O(N). Sparse Attention instead reduces computational overhead primarily through sparsification, achieving over 90% sparsity in practice; this is the technical route formally adopted in DeepSeek-V4.
However, previous work on RTPurbo [1] has clearly demonstrated that, by combining Full Attention with Sliding Window Attention (SWA), 85% of native Transformer attention heads can be safely converted into SWA without any loss in accuracy. This creates a hybrid architecture of 15% Full Attention and 85% SWA, achieving a 5x compression in both KV cache and Attention computation. Coincidentally, recent open-source architectures such as MIMO, Gemma 4, and GPT-OSS have also adopted this SWA + Full Attention design, reflecting a design philosophy of "simplicity is the ultimate sophistication."
Despite replacing 85% of Full Attention with SWA, the remaining 15% Full Attention still becomes a performance bottleneck in ultra-long sequences (e.g., 1M tokens). Today, to thoroughly resolve the Attention inference bottleneck, Alibaba's RTP team has introduced its second-generation attention compression technology: RTPurboV2. By combining headwise compression, low-rank projection compression, and clustering techniques, RTPurboV2 achieves an additional 16-32x computational compression on the Full Attention portion of the V1 architecture.
RTPurboV2: Comprehensive and Extreme Full Attention Compression
Full Attention models have spontaneously formed highly sparse attention structures during the pre-training process. Our goal is not to "impose" sparsity, but to "release" it. This judgment is built upon four quantifiable key findings:
Finding 1: 85% of Attention Heads are Naturally Suited for Sliding Windows
Researchers have discovered that different Attention Heads in Full Attention models actually assume different responsibilities. Some heads focus on capturing local information (e.g., relationships between adjacent tokens), while others are responsible for capturing long-range dependencies (e.g., associations with self-relevant information).
More specifically, through visual analysis, researchers observed that in the Qwen3 series models:
  • Approximately 15% of the Heads exhibit distinct "Retrieval Head" characteristics: their attention distributions are highly sparse and focus on only a few key tokens, and they are responsible for long-distance information retrieval.

  • The remaining 85% are "Streaming Heads": their attention distributions are relatively uniform, focusing more on local context.

This division of labor is highly stable across different inputs and sequence lengths, representing an intrinsic structure spontaneously learned during pre-training. The direct implication is that 85% of Full Attention computations can be safely replaced by SWA (as seen in RTPurbo) with virtually no impact on model capabilities. The only remaining challenge is the efficient computation of the remaining 15% Retrieval Heads.
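To make the savings concrete, the following back-of-the-envelope sketch estimates how much of the KV cache a hybrid layout keeps relative to pure Full Attention. The function name is hypothetical, and the model of a sliding-window head caching at most `window` tokens is a simplifying assumption rather than a description of the RTPurbo implementation; the default window of 8192 is the value quoted later in this article.

```python
def hybrid_kv_fraction(seq_len: int, full_ratio: float = 0.15, window: int = 8192) -> float:
    """Fraction of the full-attention KV cache kept by a hybrid layout.

    Assumes full heads cache every token and sliding-window heads cache
    at most `window` tokens (a simplification for illustration).
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    swa_tokens = min(seq_len, window)
    kept = full_ratio * seq_len + (1.0 - full_ratio) * swa_tokens
    return kept / seq_len


for n in (8_192, 65_536, 1_048_576):
    frac = hybrid_kv_fraction(n)
    print(f"N={n:>9}: keeps {frac:.3f} of the cache ({1 / frac:.2f}x compression)")
```

Under these assumptions the kept fraction approaches the share of full heads (0.15) as the sequence grows, which is why the Retrieval Heads dominate the cost at very long contexts.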
Finding 2: Long-Range Retrieval is Dominated by Low-Dimensional Subspaces
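The core task of Retrieval Heads is semantic matching across the entire sequence, which still appears to be an O(N²) problem. A core technical upgrade of RTPurboV2 is a deeper understanding of Retrieval Heads and RoPE (Rotary Position Embedding). After an in-depth analysis of the frequency structure of RoPE, the team discovered significant dimensional redundancy in the RoPE components of Retrieval Heads. Under RoPE, the Query-Key attention score can be decomposed into a superposition of different frequency components. In the standard RoPE formulation, with a Query at position m and a Key at position n, this takes the form:

$$
\mathrm{score}(m, n) = \sum_{i=0}^{D/2-1} \left[ a_i \cos(\Delta\,\theta_i) + b_i \sin(\Delta\,\theta_i) \right]
$$

Where Δ = m - n represents the positional offset, θ_i is the rotation frequency of the i-th dimension pair, and the coefficients a_i and b_i depend only on the i-th pair of Query and Key components. Different frequency components play fundamentally different roles: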
  • Low-frequency components (smaller θ_i): Change slowly with positional offset, carrying semantic correlation signals between tokens.

  • High-frequency components (larger θ_i): Oscillate rapidly with positional offset, introducing distance-sensitive interference.
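The frequency decomposition can be checked numerically. The sketch below applies the standard RoPE rotation to a random Query and Key, then compares the direct dot product with a sum of per-frequency terms that depend only on the offset. The helper names, the tiny dimension D=8, and the base of 10000 are illustrative assumptions, not details taken from RTPurboV2.

```python
import math
import random


def rope_rotate(vec, pos, thetas):
    """Apply RoPE: rotate each consecutive (even, odd) pair by pos * theta_i."""
    out = []
    for i, theta in enumerate(thetas):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out += [x * c - y * s, x * s + y * c]
    return out


def frequency_components(q, k, delta, thetas):
    """Per-frequency contribution to the score at relative offset delta = m - n."""
    comps = []
    for i, theta in enumerate(thetas):
        q0, q1, k0, k1 = q[2 * i], q[2 * i + 1], k[2 * i], k[2 * i + 1]
        a = q0 * k0 + q1 * k1  # cosine coefficient
        b = q0 * k1 - q1 * k0  # sine coefficient
        comps.append(a * math.cos(delta * theta) + b * math.sin(delta * theta))
    return comps


random.seed(0)
D = 8
thetas = [10000.0 ** (-2 * i / D) for i in range(D // 2)]  # standard RoPE frequencies
q = [random.gauss(0, 1) for _ in range(D)]
k = [random.gauss(0, 1) for _ in range(D)]
m, n = 1234, 200

direct = sum(a * b for a, b in zip(rope_rotate(q, m, thetas), rope_rotate(k, n, thetas)))
decomposed = sum(frequency_components(q, k, m - n, thetas))
print(abs(direct - decomposed) < 1e-9)
```

Printing the individual entries of `frequency_components` for a range of offsets shows the behavior described in the two bullets: terms with small θ_i drift slowly, while terms with large θ_i flip sign within a few positions.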

For long-distance retrieval, high-frequency components cause attention scores to fluctuate drastically with positional distance, weakening the stable transmission of semantic signals. Given the nature of the retrieval task itself, the retrieval strength of a token should not fluctuate rapidly with relative position. Therefore, it can be inferred that high-frequency components on Retrieval Heads must be in a suppressed state; Retrieval Heads essentially only utilize the low-frequency components of RoPE.
Thus, a natural design is to train a low-dimensional projector. Through low-rank mapping, we compress the original feature dimension from D to r=16 (where r ≪ D), systematically retaining low-frequency semantic components while filtering out high-frequency positional noise. Experiments have verified that merely 16 dimensions can achieve a token recall rate of over 90%.
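One way to read the recall claim is as "how many of the true top-k tokens does the low-rank score still rank in its own top k". The sketch below measures that quantity. Everything in it is hypothetical scaffolding: the function names, the synthetic data, and especially the projector, which here is a fixed coordinate selector standing in for the trained projection matrices.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project(P, x):
    """Map a D-dim vector to r dims with an r x D projection matrix P."""
    return [dot(row, x) for row in P]


def top_k_indices(scores, k):
    return set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])


def token_recall(q, keys, P_q, P_k, k):
    """Share of the true top-k keys that the low-rank scores also rank in the top k."""
    full = [dot(q, key) for key in keys]
    q_low = project(P_q, q)
    low = [dot(q_low, project(P_k, key)) for key in keys]
    return len(top_k_indices(full, k) & top_k_indices(low, k)) / k


random.seed(0)
D, r, N, k = 64, 16, 512, 8
# Synthetic assumption: the signal lives in the first r dims, the rest is small noise.
keys = [[random.gauss(0, 1) for _ in range(r)] + [random.gauss(0, 0.05) for _ in range(D - r)]
        for _ in range(N)]
q = [random.gauss(0, 1) for _ in range(r)] + [random.gauss(0, 0.05) for _ in range(D - r)]
P = [[1.0 if j == i else 0.0 for j in range(D)] for i in range(r)]  # stand-in for a trained projector
print(f"recall@{k}: {token_recall(q, keys, P, P, k):.2f}")
```

The low-rank scoring pass costs r multiplications per key instead of D, which is where the per-step saving comes from when r=16 and D is the full head dimension.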
Finding 3: Redundancy in the Sequence Dimension: Adaptive Clustering Based on High-Quality Features
This is the second core technical upgrade of RTPurboV2. The team realized that the benefits of low-rank projection go beyond directly reducing computational overhead—it fundamentally improves the distribution quality of Key vectors in the semantic space. After high-frequency noise is filtered out, semantically similar tokens naturally cluster together in the low-rank space, while semantically irrelevant tokens move further apart. This creates ideal conditions for further compression in the sequence dimension.
Based on this characteristic, we introduce adaptive clustering in the sequence dimension, constructing a two-stage funnel-shaped computation process:
  1. Coarse-grained matching: Cluster N tokens into K semantic clusters (e.g., K=128). Each Query first performs lightweight matching against the K cluster centers, which costs only O(N·K) across all N queries.

  2. Fine-grained computation: Full Attention computation is executed only within the clusters that were hit.

By cascading these two stages, the overall complexity drops from O(N²) to O(N·K).
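The two-stage funnel can be sketched for a single query as follows. This is a minimal illustration under stated assumptions: cluster assignments are taken as given (for example, produced offline by k-means), every cluster is non-empty, and the names `build_clusters`, `funnel_attention`, and `top_clusters` are hypothetical rather than part of RTPurboV2.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def build_clusters(keys, assign, num_clusters):
    """Group token indices by cluster id and compute each cluster's mean key."""
    members = [[] for _ in range(num_clusters)]
    for i, c in enumerate(assign):
        members[c].append(i)
    dim = len(keys[0])
    centers = [[sum(keys[i][d] for i in m) / len(m) for d in range(dim)] for m in members]
    return members, centers


def funnel_attention(q, keys, values, members, centers, top_clusters):
    """Stage 1: rank clusters by q . center. Stage 2: softmax attention inside the best ones."""
    ranked = sorted(range(len(centers)), key=lambda c: dot(q, centers[c]), reverse=True)
    idx = sorted(i for c in ranked[:top_clusters] for i in members[c])
    scores = [dot(q, keys[i]) / math.sqrt(len(q)) for i in idx]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
           for d in range(len(values[0]))]
    return out, idx


keys = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.9, -0.1]]
values = [[1.0], [2.0], [3.0], [4.0]]
assign = [0, 0, 1, 1]  # cluster ids, assumed to come from an offline clustering step
members, centers = build_clusters(keys, assign, 2)
out, visited = funnel_attention([2.0, 0.0], keys, values, members, centers, top_clusters=1)
print(visited, out)
```

Only the tokens listed in `visited` take part in the softmax, so the fine-grained stage scales with the size of the hit clusters rather than with the full sequence length.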
Significant synergistic gains exist between the two compression steps:
  • Feature dimension compression → Reduces single-step computational overhead while producing high-quality clustering inputs.

  • Sequence dimension compression (clustering) → Skips a large number of semantically irrelevant tokens, reducing total computational steps.

  • Synergistic effect → The purified vectors from feature compression make clustering centers more precise, maintaining high recall rates even under extreme compression ratios.

The two form a multiplicative effect: the more aggressive the compression ratio, the more significant the synergistic gain.
Finding 4: Dynamic Top-p Significantly Outperforms Fixed Top-k
Traditional sparse attention methods typically employ a fixed top-k strategy, retaining only the k tokens with the highest attention scores for each query. However, this approach has a fundamental flaw: the number of context tokens required varies enormously across different attention heads, sequence lengths, and queries.
Taking three Retrieval Heads in the same layer of the same model as an example, under a 64K context, the number of tokens required to cover 90% of the attention mass varies by three orders of magnitude. This means no fixed k value can satisfy all scenarios simultaneously.
Therefore, RTPurboV2 adopts a dynamic top-p strategy: for each query, it retains the smallest set of highest-scoring tokens whose cumulative attention score reaches p (e.g., 0.9). Concentrated queries automatically shrink their budget, while dispersed queries automatically expand coverage. In addition, we designed a sorting-free top-p decoding kernel that replaces the sorting operation with a 256-bin histogram, integrates scoring and filtering into a single kernel launch, and compresses memory overhead to O(1).
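The difference between sorted and sort-free selection can be illustrated on the CPU. The sketch below gives a reference top-p that sorts, next to a histogram variant that never sorts. The article does not describe the kernel's internals, so the binning scheme here (linear buckets scaled by the largest probability) is an assumption chosen for clarity, and both function names are hypothetical.

```python
def top_p_sorted(probs, p):
    """Reference top-p: sort descending, keep the shortest prefix whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(kept)


def top_p_histogram(probs, p, bins=256):
    """Sort-free approximation: bucket probabilities, then pick a threshold bucket.

    Keeps every token at or above the threshold bucket, so the kept mass is at
    least p, but the set may be slightly larger than the exact top-p set.
    """
    peak = max(probs)  # assumed positive, as for any softmax output

    def bucket(x):
        return min(bins - 1, int(x / peak * bins))

    mass_per_bin = [0.0] * bins
    for x in probs:
        mass_per_bin[bucket(x)] += x
    mass, threshold = 0.0, 0
    for b in range(bins - 1, -1, -1):  # walk from the highest bucket down
        mass += mass_per_bin[b]
        if mass >= p:
            threshold = b
            break
    return [i for i, x in enumerate(probs) if bucket(x) >= threshold]


peaked = [0.92, 0.04, 0.02, 0.01, 0.01]  # a concentrated query
flat = [0.2] * 5                         # a dispersed query
for name, probs in (("peaked", peaked), ("flat", flat)):
    print(name, top_p_sorted(probs, 0.9), top_p_histogram(probs, 0.9))
```

The histogram pass needs only a fixed-size array of counters regardless of sequence length, which is consistent with the O(1) memory overhead mentioned above; the price is that ties within the threshold bucket are all kept.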
Two-Stage Fine-Tuning: Adapting to Sparsification in Hundreds of Steps
With these four findings converging, the inference architecture of RTPurboV2 naturally takes shape:
  • Streaming Heads (85%) → SWA (window size 8192)

  • Retrieval Heads (15%) → Low-rank projection + Clustering index + Dynamic top-p

Adapting the model to this sparsified architecture requires only about 600 training steps and approximately 1M label tokens. Specifically, RTPurboV2 training is divided into two stages:
Stage 1 - Projection Alignment: Freeze the main model body and train only the low-rank projection matrices for each Retrieval Head, minimizing the KL divergence between the projected attention distribution and the original distribution.
Stage 2 - End-to-End Self-Distillation: Enable sparse mode, allowing the sparse model to learn the next-token prediction distribution of the original dense model.
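Both stages compare a student distribution against a teacher distribution, which can be sketched with a plain KL divergence. The article names KL divergence only for Stage 1; using the same divergence for Stage 2, and the direction KL(teacher || student), are assumptions made here for illustration, as are the toy logits.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)


# Stage 1: teacher = original attention weights, student = weights from projected Q/K.
teacher_attn = softmax([4.0, 1.0, 0.5, 0.2])
student_attn = softmax([3.5, 1.2, 0.4, 0.1])
print(f"projection alignment loss: {kl_divergence(teacher_attn, student_attn):.4f}")

# Stage 2 (assumed form): dense model's next-token distribution vs the sparse model's.
dense_next = softmax([2.0, 0.1, -1.0])
sparse_next = softmax([1.8, 0.3, -0.9])
print(f"self-distillation loss: {kl_divergence(dense_next, sparse_next):.4f}")
```

In Stage 1 only the projection matrices would receive gradients from this loss, since the main model body is frozen; in Stage 2 the loss is computed on the output of the model running in sparse mode.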
In the context of tens of trillions of pre-training tokens, 1M tokens are virtually negligible. This also verifies the core argument from another angle: the sparsity of Full Attention is endogenous; fine-tuning merely completes the transformation from implicit to explicit.
Experimental Results and Performance Evaluation
To comprehensively validate the effectiveness of RTPurboV2, we conducted systematic evaluations on two mainstream models, Qwen3-Coder-30B-A3B and Qwen3.5-35B-A3B, targeting core long-text benchmarks.
  1. Ruler Benchmark: Breakthroughs in Long-Range Retrieval Accuracy
    On the Qwen3-Coder-30B-A3B model, we identified approximately 15% of critical "Retrieval Heads" through offline calibration. For these heads, we applied Full Attention combined with K Cache clustering during the Prefill stage, and RTPurboV2 for sparsification during the Decode stage; all other Streaming Heads uniformly adopted SWA (local window set to 8192).

As shown in Figure 3, RTPurboV2 achieved the highest average scores at both 32K and 64K sequence lengths (89.69 and 85.61, respectively), significantly outperforming all baseline methods except Full Attention, proving its superior accuracy in long-range information retrieval.
  2. LongBenchV2 Benchmark: Lossless Compression with High Recall Ratios
    For the Qwen3.5-35B-A3B model, calibration showed that over 70% of its Heads possess retrieval characteristics. Therefore, we adopted a full-sparsification strategy. Experimental results (Figure 4) indicate that while significantly reducing computational overhead, RTPurboV2 fully retains the model's foundational capabilities, with accuracy performance on par with Full Attention.
  3. CoT Reasoning Tasks: Stable Support for Complex Logic
    In Chain-of-Thought (CoT) reasoning tasks, RTPurboV2 also performed exceptionally well (Figure 5), achieving near-lossless retention of the model's reasoning capabilities, further validating the robustness of this solution in complex logical scenarios.
The Bigger Picture
Currently, the focus of attention mechanism research is heavily concentrated on designing entirely new, highly efficient architectures. This path undoubtedly has its value. However, RTPurboV2 reveals an easily overlooked fact: Full Attention models themselves contain immense efficiency potential, and the cost of releasing this endogenous sparsity is extremely low.
The headline numbers are about 600 training steps, virtually lossless accuracy, and up to 9.36x Prefill acceleration. This means that teams opting for the SWA + Full Attention hybrid architecture, including MIMO, Gemma 4, and GPT-OSS, can achieve compression efficiency close to that of state-of-the-art new designs without replacing their architecture.
"Native Transformers are Never Outdated. Full Attention Strikes Back."
About the Team
RTP-LLM is a high-performance large model inference engine independently developed by Alibaba's Intelligent Engine Team. It supports the large model inference needs of core businesses such as Taobao, Tmall, and Amap. Originating from Alibaba's search, recommendation, and advertising technologies, the Intelligent Engine team is a pioneer and long-standing contributor in Alibaba's AI engineering field. The team focuses on building AI engineering systems, led the establishment of the big data AI engineering system AI・OS, and continues to provide high-quality AI engineering services for businesses across the Alibaba Group.