b8680

Apr 6, 2026
Meta/llama.cpp · CLI · vb8680

[CUDA] Write an optimized flash_attn_stream_k_fixup kernel (#21159)

  • Write an optimized flash_attn_stream_k_fixup kernel

Write a specialized, more optimized kernel for cases where nblocks_stream_k is a multiple of ntiles_dst. Make nblocks_stream_k a multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst.

  • Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst to make sure we have enough concurrency on GPUs

  • Address review comments

  • Address review comments

  • Revert variable names to original
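
The block-count selection described in the bullets above can be sketched on the host side roughly as follows. This is a minimal illustration, not the code from the PR: `StreamKPlan` and `plan_stream_k` are hypothetical names, rounding *down* to the nearest multiple is an assumption (the notes only say the count is made a multiple of ntiles_dst), and the 4x threshold is taken from the later bullet, which appears to supersede the earlier 2x condition.

```cpp
#include <cassert>

// Hypothetical result type: how many stream-k blocks to launch and whether
// the specialized fixup kernel applies. Not part of llama.cpp.
struct StreamKPlan {
    int  nblocks_stream_k;      // block count actually used for the launch
    bool use_specialized_fixup; // true when the count is a forced multiple of ntiles_dst
};

// Hypothetical helper illustrating the selection rule from the release notes.
inline StreamKPlan plan_stream_k(int nblocks_stream_k_raw, int ntiles_dst) {
    assert(nblocks_stream_k_raw > 0 && ntiles_dst > 0);

    // Only switch to the specialized path when there are clearly more blocks
    // than output tiles, so that rounding does not cost much concurrency.
    if (nblocks_stream_k_raw > 4 * ntiles_dst) {
        // Assumption: round down to the nearest multiple of ntiles_dst.
        const int rounded = (nblocks_stream_k_raw / ntiles_dst) * ntiles_dst;
        return {rounded, true};
    }

    // Otherwise keep the raw count and fall back to the general fixup kernel.
    return {nblocks_stream_k_raw, false};
}
```

With a multiple of ntiles_dst, every output tile is covered by the same number of stream-k blocks, which is presumably what lets the fixup kernel skip the general per-tile boundary search. Under the assumptions above, a raw count of 100 blocks with 8 tiles would be trimmed to 96, while a raw count of 30 would be left unchanged.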
