The quadratic complexity of the standard attention mechanism in Transformers remains a primary bottleneck for processing long sequences. While sparse attention methods offer a promising solution, they often rely on fixed, static patterns or on complex, learned mechanisms that may not be optimal across all layers of a deep network. We introduce **Elastic Sparse Attention (ESA)**, a novel sparse attention mechanism whose attention pattern deterministically and smoothly adapts with layer depth. Early layers employ a dense, local attention pattern to capture fine-grained local context, while deeper layers transition to a more dilated, long-range pattern to integrate global information. This layer-adaptive strategy is designed to yield a comprehensive receptive field by the final layer, mitigating the risk of "attention holes." We present the algorithm, an optimized Triton kernel implementation, a method for visualizing the patterns, and a rigorous validation script that confirms full receptive field coverage for sequences up to 131,072 tokens. Code is available at https://github.com/HighCWu/elastic-sparse-attention.
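The core idea can be sketched in plain NumPy. The snippet below is an illustrative model of a layer-adaptive dilated pattern and of the receptive-field check the abstract describes, not the paper's Triton kernel: the function names (`esa_mask`, `full_receptive_field`), the `window` parameter, and the exponential dilation schedule are all assumptions made for this sketch.

```python
import numpy as np

def esa_mask(seq_len, layer_idx, num_layers, window=4):
    """Boolean causal attention mask for one layer (illustrative).

    Layer 0 uses a dense local window (dilation 1); the dilation grows
    smoothly with depth, reaching roughly seq_len/window at the last
    layer. The exact schedule in ESA may differ; this interpolation is
    an assumption of the sketch.
    """
    frac = layer_idx / max(num_layers - 1, 1)
    dilation = max(1, round((seq_len / window) ** frac))
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    off = i - j
    # Attend to `window` causal positions spaced `dilation` apart.
    return (off >= 0) & (off % dilation == 0) & (off // dilation < window)

def full_receptive_field(seq_len, num_layers, window=4):
    """Check that stacking the per-layer masks leaves no 'attention
    holes': every query at the top layer can reach every earlier token.
    Reachability composes across layers as a boolean matrix product."""
    reach = np.eye(seq_len, dtype=bool)
    for layer in range(num_layers):
        m = esa_mask(seq_len, layer, num_layers, window)
        reach = (m.astype(int) @ reach.astype(int)) > 0
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    return bool((reach | ~causal).all())
```

For example, `full_receptive_field(64, 4)` holds with four layers (offsets 0–3, then strides 3, 6, and 16 compose into a gap-free causal receptive field), whereas a single local-window layer leaves distant tokens unreachable. The validation script in the repository performs this kind of coverage check at much larger scale (up to 131,072 tokens), presumably with a more memory-efficient reachability computation than a dense boolean matmul.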