Alibaba RTPurboV2: The RenAIssance of Native Transformer with 10x Sparse Attention
Through visualization, researchers found that Attention Heads have distinct roles:
Streaming Heads (85%): Focus on local context with relatively uniform distribution.
Retrieval Heads (15%): Exhibit sparse attention patterns, focusing only on a few critical tokens for long-range dependency.
This division of labor is stable across different inputs. The implication is direct: the roughly 85% of heads that act as Streaming Heads can safely swap Full Attention for sliding-window attention (SWA).
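To make the SWA replacement concrete, here is a minimal sketch of a causal sliding-window mask. The function name and the `window` parameter are illustrative assumptions, not taken from RTPurboV2 itself.

```python
def sliding_window_mask(seq_len, window):
    """Causal sliding-window mask (illustrative helper, not from the source).

    mask[i][j] is True when query position i may attend to key position j:
    j must not be in the future and must lie within the last `window` tokens.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(mask):
    """Count the query/key pairs that are actually computed under the mask."""
    return sum(sum(1 for allowed in row if allowed) for row in mask)
```

With a fixed window, the number of attended pairs grows linearly with sequence length instead of quadratically, which is why moving most heads to SWA removes the bulk of the attention cost.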
For Retrieval Heads, the task is semantic matching. Analysis of the RoPE positional encoding revealed significant dimensional redundancy. High-frequency components in RoPE oscillate rapidly with token distance, introducing distance-sensitive noise that hinders stable transmission of the semantic signal. RTPurboV2 introduces a low-rank projector that systematically retains the low-frequency semantic components and filters out the high-frequency noise. Experiments show that as few as 16 dimensions suffice for a token recall rate above 90%.
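The frequency argument can be illustrated with the standard RoPE formula, in which dimension pair i rotates at angular frequency base^(-2i/d). The sketch below assumes the conventional base of 10000 and a head dimension of 128; the actual values used by the models in this article may differ.

```python
import math


def rope_frequencies(head_dim, base=10000.0):
    """Angular frequency of each RoPE dimension pair: base ** (-2i / head_dim).

    `base=10000.0` is the conventional default and an assumption here.
    """
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def rope_wavelengths(head_dim, base=10000.0):
    """Token distance needed for each dimension pair to complete one rotation."""
    return [2.0 * math.pi / f for f in rope_frequencies(head_dim, base)]


# The first pairs complete a rotation within a handful of tokens, so their
# contribution to the query-key score flips sign over short distances. The
# last pairs barely rotate across tens of thousands of tokens.
waves = rope_wavelengths(128)
shortest, longest = waves[0], waves[-1]
```

The short-wavelength pairs are the ones described above as distance-sensitive noise, while the long-wavelength pairs change slowly enough to carry a stable matching signal.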
Low-rank projection doesn't just reduce computation; it improves the "quality" of Key vectors. By filtering out noise, semantically similar tokens naturally cluster together. RTPurboV2 leverages this with a two-stage "funnel" computation process:
Coarse Matching: Group tokens into semantic clusters. The Query is matched against the cluster centers.
Fine-Grained Computation: Execute full Attention only within the relevant clusters.
This reduces complexity from to .
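The two-stage funnel can be sketched as follows. This is a simplified illustration under stated assumptions: cluster assignments are given as input, cluster centers are plain means of the keys, and the same key vectors are used in both stages (the article's coarse stage operates on low-rank projected keys). All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cluster_centers(keys, assign, n_clusters):
    """Mean key vector per cluster (empty clusters get a zero vector)."""
    dim = len(keys[0])
    sums = [[0.0] * dim for _ in range(n_clusters)]
    counts = [0] * n_clusters
    for key, c in zip(keys, assign):
        counts[c] += 1
        for j, v in enumerate(key):
            sums[c][j] += v
    return [[s / max(n, 1) for s in row] for row, n in zip(sums, counts)]


def funnel_attention(q, keys, values, assign, n_clusters, top_clusters=1):
    """Stage 1: rank clusters by query/center score. Stage 2: softmax
    attention restricted to tokens in the selected clusters."""
    centers = cluster_centers(keys, assign, n_clusters)
    ranked = sorted(range(n_clusters), key=lambda c: dot(q, centers[c]), reverse=True)
    chosen = set(ranked[:top_clusters])
    idx = [i for i, c in enumerate(assign) if c in chosen]
    if not idx:
        raise ValueError("selected clusters contain no tokens")
    logits = [dot(q, keys[i]) / math.sqrt(len(q)) for i in idx]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    z = sum(weights)
    out = [
        sum(w * values[i][j] for w, i in zip(weights, idx)) / z
        for j in range(len(values[0]))
    ]
    return out, idx
```

The query is scored against one center per cluster rather than every key, and the exact softmax then runs only over the tokens that survive the coarse stage.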
Fixed top-k sparsity is inefficient because different queries need different context sizes. RTPurboV2 adopts a dynamic top-p strategy, retaining tokens until the cumulative attention score reaches a threshold (e.g., 90%). This is implemented efficiently using a 256-bin histogram to avoid expensive sorting operations.
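A minimal sketch of histogram-based top-p selection is shown below. The binning scheme (equal-width bins scaled by the maximum score) is an assumption chosen for illustration; the article only states that a 256-bin histogram replaces sorting.

```python
def select_top_p(scores, p=0.9, bins=256):
    """Return indices of tokens kept under a top-p rule, without sorting.

    `scores` are non-negative attention weights. Each score is dropped into
    one of `bins` equal-width buckets, then buckets are scanned from the
    highest down until the accumulated mass reaches p of the total.
    Selection is at bucket granularity, so it may keep slightly more than
    the exact top-p set.
    """
    hi = max(scores)
    if hi <= 0:
        return list(range(len(scores)))
    total = sum(scores)
    bucket = [min(int(s / hi * bins), bins - 1) for s in scores]
    mass = [0.0] * bins
    for b, s in zip(bucket, scores):
        mass[b] += s
    acc, cutoff = 0.0, 0
    for b in range(bins - 1, -1, -1):
        acc += mass[b]
        if acc >= p * total:
            cutoff = b
            break
    return [i for i, b in enumerate(bucket) if b >= cutoff]


# A peaked distribution keeps few tokens; a flatter one keeps more.
peaked = select_top_p([0.5, 0.3, 0.1, 0.05, 0.03, 0.02], p=0.85)
```

Both passes are linear in the number of tokens and the bucket scan has a fixed cost of 256 steps, so the kept set adapts to each query without a full sort.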
Adapting the model to this sparse architecture requires only approximately 600 training steps (about 1M labeled tokens). The process involves:
Projection Alignment: Train only the low-rank projector matrices to minimize KL divergence.
End-to-End Self-Distillation: Train the sparse model to mimic the next-token prediction of the dense model.
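Both stages compare a sparse model's distribution against a dense reference. As a rough sketch, the KL divergence between two categorical distributions given as logits can be computed as below. Which distributions are compared at each stage and the direction of the divergence are assumptions here, since the article does not specify them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) between the two softmax distributions.

    Assumes the student assigns non-zero probability wherever the teacher
    does, which holds for softmax outputs unless the logits underflow.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

The loss is zero when the sparse model reproduces the dense distribution and grows as the two diverge, which gives the training signal for both the projector alignment and the self-distillation stage.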
This validates that sparsity is an endogenous property of Full Attention models.
On the Qwen3-Coder-30B-A3B model, using 15% Retrieval Heads with RTPurboV2 achieved state-of-the-art average scores (89.69 at 32K, 85.61 at 64K), proving superior long-range recall capability.
On Qwen3.5-35B-A3B, where over 70% of heads showed retrieval characteristics, full sparsification retained accuracy on par with Full Attention.
RTPurboV2 demonstrated near-lossless retention of reasoning capabilities in Chain-of-Thought tasks.
RTPurboV2 suggests that the native Transformer is not obsolete. With minimal training cost (600 steps), it delivers up to 9.36x acceleration in Prefill stages. This means teams using SWA + Full Attention hybrid architectures (like MIMO, Gemma 4, GPT-OSS) can achieve SOTA compression efficiency without changing their core architecture.