b7739
CUDA: Factor out and re-use block_reduce function (#18785)
CUDA: Refactor and expose two_stage_warp_reduce_* function
Use
two_stage_warp_reducealso in softmax kernel, move smem out of it
Moving smem out of __device__ function to __global__ function
allows for explicit smem reuse, as either compiler or cuda rt seem to not
free it afterwards (cudaFuncSetAttribute fails when not accounting for
it once for each call to two_stage_warp_reduce)
- Update ggml/src/ggml-cuda/common.cuh
Co-authored-by: Aman Gupta amangupta052@gmail.com
Use two_stage_warp_reduce in group_norm_f32
Use two_stage_warp_reduce in rms_norm_f32
Fix smem calculation which expects bytes
Make
two_stage_warp_reduceaccept all values warp_reduce accepts
Also integrate it into norm_f32 function
Use two_stage_warp_reduce in l2_norm_f32
Use type traits for block reduction for better legibility
Also adresss other requests by @am17an such as variable renaming
Make norm tests cover all cuda paths
Mark columns % WARP_SIZE !=0 as supported for RMS_NORM_BACK
Unit-tests passed locally, let's see if they pass in the CI as well
- Use
enum classforblock_reduce_method
This is more type-safe than plain enum
Rename variables as suggested in code review by @am17an
Rename two_stage_warp_reduce -> block_reduce
Fix trailing whitespace in common.cuh
Make condition of static_assert type-dependent
This delays evaluation until the template is actually instantiated. Otherwise, some compilers may evaluate the assert when parsing the template, resulting in build errors as observed here:
https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785
- Inline definitions
Co-authored-by: Aman Gupta amangupta052@gmail.com
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: