AI Flash

Alibaba RTPurboV2: 10x Sparse Attention & the Resurgence of Native Transformer

Jun 8, 2026 · 11:43

Alibaba RTPurboV2: The Renaissance of Native Transformer with 10x Sparse Attention

The Resurgence of Full Attention
As demand for long sequences grows, driven by widespread Agent applications, the Attention mechanism in traditional GPT architectures is increasingly seen as a performance bottleneck due to its $O(N^2)$ computational complexity in the sequence length $N$. While the industry has been rapidly iterating on Attention architectures, primarily bifurcating into Linear Attention (e.g., Qwen-Next, Kimi-K2) and Sparse Attention (e.g., DeepSeek-V4), a counter-narrative is emerging.
Previous work, RTPurbo (V1), demonstrated that a hybrid architecture of 15% Full Attention + 85% Sliding Window Attention (SWA) could achieve a 5x reduction in KV cache and Attention computation without sacrificing accuracy. This suggested that "Full Attention" was being prematurely discarded. Building on this, Alibaba's RTP team has launched the second generation of its Attention compression technology: RTPurboV2. By combining headwise compression, low-rank projection, and clustering, RTPurboV2 achieves a further 16x to 32x computational compression on the remaining Full Attention portion.
RTPurboV2: Holistic Full Attention Compression
The core philosophy of RTPurboV2 is that Full Attention models have inherently developed a highly sparse attention structure during pre-training. Rather than "imposing" sparsity, RTPurboV2 aims to "release" it. This is based on four key quantitative discoveries.
Discovery 1: 85% of Heads are Inherently Suited for Sliding Windows
Through visualization, researchers found that Attention Heads have distinct roles:
  • Streaming Heads (85%): Focus on local context with relatively uniform distribution.

  • Retrieval Heads (15%): Exhibit sparse attention patterns, focusing only on a few critical tokens for long-range dependency.
    This division of labor is stable across different inputs. The inference is direct: the 85% of Full Attention computation handled by Streaming Heads can be safely replaced with SWA.
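To make the head split concrete, here is a minimal sketch of what replacing Streaming Heads with SWA does to the KV cache. The function names, the 4096-token window, and the 131072-token sequence are illustrative assumptions, not values from the RTPurbo work; the 15% ratio is the one quoted above.

```python
def hybrid_kv_fraction(seq_len: int, window: int, full_ratio: float = 0.15) -> float:
    """Fraction of the dense KV cache that remains when only `full_ratio` of
    the heads keep Full Attention and the rest keep a sliding window."""
    swa_tokens = min(window, seq_len)  # an SWA head never stores more than its window
    kept = full_ratio * seq_len + (1.0 - full_ratio) * swa_tokens
    return kept / seq_len


def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Causal SWA mask: query i may attend to keys j with i - window < j <= i."""
    return [[(i - window < j <= i) for j in range(seq_len)] for i in range(seq_len)]


# Hypothetical setting: 128K-token sequence, 4K-token window.
fraction = hybrid_kv_fraction(seq_len=131072, window=4096)
print(f"remaining KV cache: {fraction:.3f} of dense, about {1 / fraction:.1f}x smaller")
```

The saving depends on how long the sequence is relative to the window: for short sequences the window covers everything and nothing is saved, while for very long sequences the fraction approaches the Full Attention ratio itself.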

Discovery 2: Long-Range Retrieval is Dominated by a Low-Dimensional Subspace
For Retrieval Heads, the task is semantic matching. Analysis of the RoPE positional encoding revealed significant dimensional redundancy. High-frequency components in RoPE oscillate rapidly, introducing distance-sensitive noise that hinders stable transmission of the semantic signal. RTPurboV2 introduces a low-rank projector (rank $r = 16$) to systematically retain low-frequency semantic components and filter out high-frequency noise. Experiments show that 16 dimensions are enough to achieve a token recall rate above 90%.
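The frequency argument can be illustrated with the standard RoPE schedule, where dimension pair i rotates at base^(-2i/d) radians per token. The sketch below only lists those frequencies; it is not the learned RTPurboV2 projector, and the head dimension of 128, base of 10000, and 4096-token context are assumptions chosen for illustration.

```python
import math


def rope_frequencies(head_dim: int, base: float = 10000.0) -> list[float]:
    """Angular frequency (radians per token) of each rotated dimension pair."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def wavelength(freq: float) -> float:
    """Number of token positions needed for one full rotation at this frequency."""
    return 2.0 * math.pi / freq


def slow_pairs(head_dim: int, context_len: int, base: float = 10000.0) -> list[int]:
    """Indices of pairs that complete less than one rotation over the context."""
    return [i for i, f in enumerate(rope_frequencies(head_dim, base))
            if wavelength(f) > context_len]


freqs = rope_frequencies(head_dim=128)
print(f"fastest pair wavelength: {wavelength(freqs[0]):.1f} tokens")
print(f"slowest pair wavelength: {wavelength(freqs[-1]):.0f} tokens")
print("pairs slower than a 4096-token context:", len(slow_pairs(128, 4096)))
```

Pairs with short wavelengths change sign many times across a long context, so their contribution to a query-key score depends heavily on exact distance. The slow pairs vary little with distance, which is consistent with the article's claim that a small subspace carries the long-range matching signal.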
Discovery 3: Sequence-Level Redundancy & Adaptive Clustering
Low-rank projection doesn't just reduce computation; it improves the "quality" of Key vectors. By filtering out noise, semantically similar tokens naturally cluster together. RTPurboV2 leverages this with a two-stage "funnel" computation process:
  1. Coarse Matching: Cluster the $N$ tokens into $K$ semantic clusters (e.g., $K = 128$). The Query matches against the cluster centers only ($O(K)$ per query).

  2. Fine-Grained Computation: Execute full Attention only within the relevant clusters.
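    Each query therefore scans $K$ cluster centers plus the keys in the selected clusters instead of all $N$ keys, bringing the overall cost below the $O(N^2)$ of dense Full Attention.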
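The two stages above can be sketched for a single query as follows. This is a simplified illustration under stated assumptions: cluster assignments are given as input rather than computed, centers are plain means of member keys, causal masking is ignored, and `funnel_attention` and `top_clusters` are hypothetical names, not identifiers from RTP-LLM.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def funnel_attention(query, keys, values, assignments, num_clusters, top_clusters):
    # Preparation (done once per sequence in practice): group tokens by cluster.
    members = [[] for _ in range(num_clusters)]
    for idx, c in enumerate(assignments):
        members[c].append(idx)
    centers = [centroid([keys[i] for i in m]) if m else None for m in members]

    # Stage 1, coarse matching: score the query against cluster centers only.
    scored = [(dot(query, ctr), c) for c, ctr in enumerate(centers) if ctr is not None]
    scored.sort(reverse=True)
    selected = sorted(i for _, c in scored[:top_clusters] for i in members[c])

    # Stage 2, fine-grained: ordinary softmax attention over the selected tokens.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out = [sum(w * values[i][d] for w, i in zip(weights, selected)) / total
           for d in range(len(values[0]))]
    return out, selected


keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
values = [[1.0], [1.0], [-1.0], [-1.0]]
out, selected = funnel_attention([1.0, 0.0], keys, values,
                                 assignments=[0, 0, 1, 1],
                                 num_clusters=2, top_clusters=1)
print("attended tokens:", selected, "output:", out)
```

In the toy input the query aligns with the first cluster, so only tokens 0 and 1 reach the softmax. The quality of the result depends on how well the cluster centers represent their members, which is why the article ties clustering to the denoised low-rank keys.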

Discovery 4: Dynamic Top-p vs. Static Top-k
Fixed top-k sparsity is inefficient because different queries need different context sizes. RTPurboV2 adopts a dynamic top-p strategy, retaining tokens until the cumulative attention score reaches a threshold (e.g., 90%). This is implemented efficiently using a 256-bin histogram to avoid expensive sorting operations.
Training: A 600-Step Fine-Tuning Paradigm
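One way a histogram can stand in for a sort is sketched below. The article does not say how the bins are defined, so binning attention probabilities linearly relative to the row maximum is an assumption, as are the function and parameter names.

```python
def top_p_histogram(probs, p=0.9, num_bins=256):
    """Return indices of tokens kept by an approximate top-p selection.

    Instead of sorting, accumulate probability mass per bin, walk the bins
    from largest values to smallest until the target mass is covered, then
    keep every token that falls in a visited bin.
    """
    p_max = max(probs)
    if p_max <= 0.0:
        return []

    def bin_of(x):
        return min(int(x / p_max * num_bins), num_bins - 1)

    bin_mass = [0.0] * num_bins
    for x in probs:
        bin_mass[bin_of(x)] += x

    target = p * sum(probs)
    cumulative = 0.0
    cutoff = 0  # falls back to keeping everything if the target is never met
    for b in range(num_bins - 1, -1, -1):
        cumulative += bin_mass[b]
        if cumulative >= target:
            cutoff = b
            break
    return [i for i, x in enumerate(probs) if bin_of(x) >= cutoff]


peaked = [0.5, 0.3, 0.1, 0.05, 0.03, 0.02]
flat = [1.0 / 6] * 6
print("peaked row keeps:", top_p_histogram(peaked, p=0.85))
print("flat row keeps:  ", top_p_histogram(flat, p=0.85))
```

Both passes are linear in the number of tokens. Because whole bins are kept, the selection can retain slightly more than the requested mass, never less, and the number of kept tokens adapts to how peaked each query's distribution is.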
Adapting the model to this sparse architecture requires only approximately 600 training steps (about 1M labeled tokens). The process involves:
  1. Projection Alignment: Train only the low-rank projector matrices to minimize KL divergence.

  2. End-to-End Self-Distillation: Train the sparse model to mimic the next-token prediction of the dense model.
    This validates that sparsity is an endogenous property of Full Attention models.
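Both training stages compare a sparse student against the dense model. A minimal sketch of the KL objective is shown below; the article names KL divergence only for the projection alignment stage, so applying the same loss to next-token distributions in the self-distillation stage is an assumption, and the logits are made-up numbers.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two discrete distributions of equal length."""
    return sum(t * math.log((t + eps) / (s + eps)) for t, s in zip(teacher, student))


# Hypothetical logits over the same four candidates from the dense (teacher)
# and sparse (student) models.
dense_logits = [2.0, 1.0, 0.1, -1.0]
sparse_logits = [1.8, 1.1, 0.0, -0.7]
loss = kl_divergence(softmax(dense_logits), softmax(sparse_logits))
print(f"distillation loss: {loss:.6f}")
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, so minimizing it pulls the sparse model toward the dense model's behavior without needing external labels for the distillation signal itself.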

Experimental Results & Performance
1. RULER Benchmark: Precision in Long-Range Retrieval
On the Qwen3-Coder-30B-A3B model, using 15% Retrieval Heads with RTPurboV2 achieved state-of-the-art average scores (89.69 at 32K, 85.61 at 64K), proving superior long-range recall capability.
2. LongBenchV2: Lossless Compression
On Qwen3.5-35B-A3B, where over 70% of heads showed retrieval characteristics, full sparsification retained accuracy on par with Full Attention.
3. CoT Reasoning: Robust Logic Support
RTPurboV2 demonstrated near-lossless retention of reasoning capabilities in Chain-of-Thought tasks.
The Bigger Picture
RTPurboV2 suggests that the native Transformer is not obsolete. With minimal training cost (600 steps), it delivers up to 9.36x acceleration in Prefill stages. This means teams using SWA + Full Attention hybrid architectures (like MIMO, Gemma 4, GPT-OSS) can achieve SOTA compression efficiency without changing their core architecture.
"Native Transformer has never been outdated. Full Attention strikes back."
For more details and to access the open-source project, visit the RTP-LLM GitHub repository.