


Large CUDA Tiles Reduce Flash Attention TFLOPS by 18‑43% Across Sequences


Flash Attention has become a go‑to kernel for transformer‑style models, promising near‑peak utilization on NVIDIA GPUs when the right tile size is chosen. Researchers set out to fine‑tune the CUDA tile dimensions, assuming that packing more elements into each tile would amortize overhead and push TFLOPS higher. Early benchmarks showed the expected gains for modest tile configurations, but the picture shifted when the tile grew beyond a certain threshold.

Across a range of sequence lengths, the larger‑tile runs consistently fell short of the baseline, delivering noticeably lower throughput. The slowdown traces back to the kernel's arithmetic: each tile now performs separate multiply and add steps instead of fused multiply‑adds and routes values through more precise, slower math paths. Those extra cycles turn what should be a compute‑heavy routine into a bottleneck, throttling overall performance.


The results in TFLOPS: performance degraded by 18-43% across all sequence lengths. This is the trap, where large tiles make performance worse.

- Compute bottleneck: with more elements per tile, inefficient operations (separate mul/add, precise math) become the bottleneck.
- Instruction overhead: more work per tile means more instructions before the next memory operation.

Lesson: tile size and compute efficiency are interdependent. Large tiles only help if the computation is efficient enough to keep up.
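The lesson can be made concrete with a toy cost model (a sketch; all constants here are illustrative assumptions, not measurements from the post): per-tile time is a fixed overhead plus the larger of compute time and memory time, so a bigger tile only pays off when per-element compute stays cheap.

```python
# Toy throughput model of the trap: larger tiles amortize per-tile
# overhead, but only if per-element compute cost stays low.
def flops_per_cycle(tile_elems, cycles_per_elem):
    overhead = 100.0                    # fixed per-tile setup cycles (assumed)
    compute = tile_elems * cycles_per_elem
    memory = tile_elems * 4 / 64.0      # 4 B/elem at 64 B/cycle (assumed)
    return tile_elems * 2 / (overhead + max(compute, memory))

baseline = flops_per_cycle(256, 0.5)    # small tiles, efficient math
trap     = flops_per_cycle(4096, 1.5)   # large tiles, separate mul/add + precise math
rescue   = flops_per_cycle(4096, 0.5)   # large tiles, fused/fast math

assert trap < baseline < rescue         # the trap-and-rescue ordering
```

In this model the large tile loses to the small-tile baseline exactly when per-element cycles rise, and wins once the math is cheap again, mirroring the trap-and-rescue pattern described above.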

NCU insight (SeqLen=1,024, NVIDIA B200):

- Registers/thread jump to 168 (+31%), reducing theoretical occupancy to 18.75%
- Achieved occupancy drops to 16.5%
- Compute throughput collapses to 17.4% (the trap)
- Memory throughput falls to 7.4%
- Grid size shrinks to 512 (fewer blocks from larger tiles)

2. The rescue with fast math

One of the bottlenecks is special functions: exp2 (exponential) and truediv (division). By default, these are IEEE-754 precise: highly accurate, but slow.
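The register-limited occupancy NCU reports can be reproduced with a back-of-envelope calculation. A minimal sketch, assuming 64K 32-bit registers and 64 warp slots per SM (standard values for recent NVIDIA architectures; the real allocator also rounds register counts to an allocation granularity, so this is approximate):

```python
# Back-of-envelope theoretical occupancy from registers per thread.
REGS_PER_SM = 65536        # 32-bit registers per SM (assumed)
MAX_WARPS_PER_SM = 64      # warp slots per SM (assumed)
THREADS_PER_WARP = 32

def theoretical_occupancy(regs_per_thread):
    regs_per_warp = regs_per_thread * THREADS_PER_WARP
    resident_warps = REGS_PER_SM // regs_per_warp   # warps that fit per SM
    return resident_warps / MAX_WARPS_PER_SM

print(theoretical_occupancy(168))  # 12 warps fit -> 0.1875, the 18.75% NCU shows
print(theoretical_occupancy(128))  # ~baseline before the +31% jump -> 0.25
```

At 168 registers per thread only 12 of 64 warps fit, giving the 18.75% ceiling; the +31% jump is consistent with a baseline of roughly 128 registers per thread, which would allow 16 warps (25%).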

For deep learning, we can trade a tiny bit of precision for massive speedups.

Before (precise operations):

```python
p = ct.exp2(qk)
alpha = ct.exp2(m_i - m_ij)
acc = ct.truediv(acc, l_i)
```

After (fast math):

```python
p = ct.exp2(qk, flush_to_zero=True)
alpha = ct.exp2(m_i - m_ij, flush_to_zero=True)
acc = ct.truediv(acc, l_i, flush_to_zero=True, rounding_mode=RMd.APPROX)
```

What these flags do:

- flush_to_zero=True: denormal numbers (extremely small values near zero) become exactly zero. This avoids slow microcode paths on the GPU.
- rounding_mode=RMd.APPROX: skips iterative refinement after the initial hardware approximation.

With fast math, we've "rescued" the large tiles. The results in TFLOPS: we now match or exceed the small-tile baseline, with 10-20% gains for longer sequences.
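What flush_to_zero means numerically can be illustrated on the CPU. This is a conceptual sketch only (the ftz helper below is hypothetical; the GPU applies the behavior in hardware): float32 values smaller in magnitude than the smallest positive normal number, 2^-126, are denormals, and flush-to-zero treats them as exact zeros. In attention, exp2 of large negative score differences underflows into exactly this range, so the flag turns those softmax tails into zeros instead of triggering slow denormal handling.

```python
# Conceptual sketch of flush-to-zero (hypothetical helper, not a library API).
SMALLEST_NORMAL_F32 = 2.0 ** -126   # smallest positive normal float32

def ftz(x):
    """Treat denormal (subnormal) float32 magnitudes as exactly zero."""
    return 0.0 if 0.0 < abs(x) < SMALLEST_NORMAL_F32 else x

print(ftz(1e-40))  # denormal in float32 -> 0.0
print(ftz(1e-30))  # normal magnitude   -> 1e-30, unchanged
```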

NCU insight (SeqLen=1,024, NVIDIA B200):

- Registers/thread: 168 (unchanged)
- Theoretical/achieved occupancy: 18.75% / 16.6% (unchanged)
- Compute throughput rebounds to 24.0%
- Memory throughput improves to 12.9%

Did the larger tiles deliver? Not in this case. The study shows that expanding tile size in Flash Attention can cut TFLOPS by 18‑43% across every tested sequence length.

While the implementation follows NVIDIA’s cuTile and includes a full production‑ready code path, the authors warn that naive scaling of tiles quickly becomes a trap. Because each tile now holds more elements, separate multiply and add steps and precise math turn into a compute bottleneck, throttling performance. Yet the post also outlines a “trap and rescue” workflow, applying FMA patterns, fast‑math intrinsics, loop splitting and adaptive tiling to reclaim lost throughput.

Whether these fixes restore the original speed for all workloads remains unclear, as the results focus on the specific configurations examined. The authors also note that the compute bottleneck stems from inefficient operations such as separate multiplication and addition, which become more pronounced as tile granularity increases. Future experiments could explore different math precisions or hardware scheduling to see if the penalty can be mitigated.

Ultimately, the findings remind developers that aggressive tiling must be paired with careful micro‑architectural tuning, or the intended gains may evaporate.


Common Questions Answered

How do large CUDA tiles impact Flash Attention's performance on NVIDIA GPUs?

Large CUDA tiles can actually degrade performance by 18-43% across different sequence lengths. This performance drop occurs because larger tiles create compute bottlenecks, with inefficient operations like separate multiply/add steps and precise math becoming performance constraints.

Why do larger tiles in Flash Attention not automatically improve computational efficiency?

Larger tiles introduce more instruction overhead and create computational inefficiencies that can throttle performance. As more elements are packed into each tile, the separate computational steps become less efficient, leading to a potential performance degradation instead of the expected performance gains.

What key lesson did researchers learn about tile size and compute efficiency in Flash Attention?

The study revealed that tile size and compute efficiency are deeply interdependent, and simply scaling up tile sizes does not guarantee better performance. Researchers discovered that large tiles only help if the computational characteristics can maintain efficiency, and naive scaling can quickly become a performance trap.