b7723
HIP: add fattn-mma-f16 for RDNA4 (#18481)
finish VQ mma
flash_attn_ext_f16_iter
KQ_rowsum
correct exp
fix scale error
fix softmax scale
fix softmax scale
enable fattn on cpu side
fix random error
disable fattn-mma-f16 on rdna3
fix wrong col for rdna
use identity mat to transpose
resolve conflicts
basic tuning for DeepSeek-R1-Distill-Qwen-1.5B
fix volta compile error
align rdna4 policy for fattn
adjust fattn policy
adjust kernel selection logic
update per review comments
keep fattn-wmma logic
adjust kernel selection logic
Co-authored-by: zhang hui <you@example.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: