b7956

Feb 6, 2026
Meta/llama.cpp · CLI · vb7956

vulkan: For coopmat2 FA, use fp16 accumulators for the final result (#19376)

The CPU and CUDA backends use fp16 for the VKQ accumulator type; this change does the same for Vulkan. This particularly helps with large head sizes, which are very register-limited.

I tried this on the coopmat1 path and it was slightly slower. I didn't try it on the scalar path.

I applied the softmax bias that the CUDA backend uses to avoid overflow, although I was not able to reproduce the original bug without it.
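
To illustrate why a softmax bias gives an fp16 accumulator more headroom, here is a minimal scalar sketch in Python. It is not the Vulkan shader or the CUDA kernel: the function names, the one-scalar-value-per-key simplification, and the bias of 3·ln 2 are all assumptions chosen for illustration. Adding the bias to the running maximum scales every softmax weight, and therefore the accumulator, down by a constant factor (8 here). The denominator is scaled by the same factor, so the final quotient is mathematically unchanged.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack finite values beyond the fp16 range (max 65504).
        return math.copysign(math.inf, x)


def attend(scores, values, bias=0.0):
    """Online-softmax attention for one query row (hypothetical helper).

    The running max and the denominator stay in full precision; only the
    VKQ accumulator is rounded to fp16 after every update.
    """
    m = -math.inf  # running maximum of (score + bias)
    denom = 0.0    # running softmax denominator
    acc = 0.0      # VKQ accumulator, kept in fp16
    for s, v in zip(scores, values):
        m_new = max(m, s + bias)
        scale = math.exp(m - m_new)  # rescales old terms; 0.0 on the first step
        p = math.exp(s - m_new)      # softmax weight, at most exp(-bias)
        denom = denom * scale + p
        acc = to_fp16(to_fp16(acc * scale) + to_fp16(p * v))
        m = m_new
    return acc / denom


if __name__ == "__main__":
    scores = [0.0] * 8
    values = [10000.0] * 8  # the unbiased sum, 80000, exceeds the fp16 maximum
    print(attend(scores, values))                         # accumulator overflows
    print(attend(scores, values, bias=3 * math.log(2)))   # accumulator stays finite
```

In this toy input the unbiased run returns `inf`, while the biased run returns a value close to 10000, with the small difference coming from fp16 rounding in the accumulator. The trade-off is that a larger bias pushes small softmax weights closer to underflow.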
