Qwen Team has officially open-sourced FlashQLA, a high-performance operator library designed for the Gated Delta Network (GDN), the Linear Attention layer that powers the Qwen series, including Qwen3-Next, Qwen3.5, and Qwen3.6. Benchmarked on the NVIDIA H200, FlashQLA delivers a 2-3x speedup in the forward pass and a 2x speedup in the backward pass compared to the FLA Triton kernel. In TP8 (Tensor Parallelism across 8 devices) scenarios, it achieves up to 5.33x faster forward performance than FlashInfer.
The core of this acceleration lies in leveraging the exponential decay property of GDN's gating values to implement automatic intra-card context parallelism (AutoCP). Traditional methods require computing a correction matrix M to stitch together the states of different sub-sequences, a process whose computational cost can even exceed that of the state matrix itself. FlashQLA observes that for 60%-80% of attention heads the gating values are not constantly 1, so the contribution of any earlier state decays exponentially over subsequent chunks. This allows the system to drive the state error below the noise floor with only 6-8 warm-up chunks, completely bypassing the calculation of matrix M. The system automatically decides whether to enable CP based on batch size, number of heads, and sequence length, eliminating the need for manual configuration.
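To see why a short warm-up can stand in for the correction matrix M, consider a toy version of the chunk-level recurrence. The NumPy sketch below is an illustration of the decay argument, not the FlashQLA implementation: the per-chunk gating value `g` and the dimensions are assumptions. It runs the same recurrence twice, once from the true carried-over state and once "cold" from zeros (as a CP split would), and the gap between the two shrinks geometrically with each warm-up chunk.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 16          # toy head dimension
n_chunks = 12   # chunks processed after the split point
g = 0.35        # assumed per-chunk gating value (decay factor, < 1)

def run(state, n):
    """Simplified gated recurrence: decay the state, then add a chunk's
    update. A fixed seed means both runs see identical chunk inputs."""
    chunk_rng = np.random.default_rng(42)
    states = []
    for _ in range(n):
        update = chunk_rng.standard_normal((d, d))
        state = g * state + update
        states.append(state.copy())
    return states

# Reference run carries the real state across the split; the CP run
# starts cold from zeros and relies on warm-up chunks instead of M.
true_states = run(rng.standard_normal((d, d)), n_chunks)
cold_states = run(np.zeros((d, d)), n_chunks)

for i, (a, b) in enumerate(zip(true_states, cold_states), 1):
    print(f"chunk {i:2d}: max state error = {np.abs(a - b).max():.2e}")
```

Since the two runs differ only in the initial state, the error after chunk n is exactly g^n times the initial state, which is why a handful of warm-up chunks suffices whenever the gating stays below 1.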
At the operator level, FlashQLA adopts a balanced approach, neither fusing the entire computation flow into a single kernel nor breaking it into numerous independent ones. Instead, the flow is divided into two fused kernels with a CP preprocessing step between them. Within each kernel, TileLang is used to implement warp specialization: one producer warp group handles data movement while three consumer warp groups compute different intermediate variables concurrently, and a ping-pong buffering structure overlaps memory access with computation. The largest speedups appear in scenarios with high tensor parallelism and a small number of heads, a typical configuration for multi-GPU deployment of large language models and on-device agent inference.
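The producer/consumer overlap can be illustrated with a minimal ping-pong pipeline. This is only a structural sketch of the pattern in plain Python, not the TileLang kernels: a background "producer" stages chunk i+1 into one buffer while the "consumer" computes on chunk i in the other, so data movement and math proceed in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

n_chunks, chunk = 8, 1024
data = np.arange(n_chunks * chunk, dtype=np.float64).reshape(n_chunks, chunk)

def load(i):
    # Producer role: stage chunk i (stand-in for global->shared copies).
    return data[i].copy()

def compute(buf):
    # Consumer role: some math on the staged chunk.
    return float(buf.sum())

results = []
with ThreadPoolExecutor(max_workers=1) as producer:
    pending = producer.submit(load, 0)      # prefetch the first chunk
    for i in range(n_chunks):
        buf = pending.result()              # wait for the "ping" buffer
        if i + 1 < n_chunks:
            pending = producer.submit(load, i + 1)  # start the "pong" load
        results.append(compute(buf))        # compute overlaps the next load
```

The same idea at kernel scale is what hides memory latency: while the consumer warp groups work on one buffer, the producer warp group is already filling the other.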