b7317
[!WARNING] Release Format Update: Linux releases will soon use .tar.gz archives instead of .zip. Please make the necessary changes to your deployment scripts.
cuda: optimize SOLVE_TRI using registers and FMAF (#17703)
- ggml-cuda: optimize solve_tri_f32_fast and fix stride handling
- Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts.
- Implement explicit `fmaf` instructions for the reduction loop.
- Update kernel arguments to pass strides in bytes rather than elements to align with standard ggml tensor arithmetic (casting to `char *` before addition).
- Remove unused `MAX_K_FAST` definition.
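
For readers unfamiliar with the byte-stride convention mentioned above, here is a minimal host-side sketch in plain C++. It is not the CUDA kernel from this PR and does not model the register-based `x_low`/`x_high` layout. The names `row_ptr` and `solve_lower_tri` are hypothetical, and the lower-triangular forward substitution is assumed purely for illustration. It shows two things: a row address computed by casting to `char *` before adding a stride given in bytes, and a reduction loop written with `fmaf`. Note that the later commits in this list replaced the explicit `fmaf` calls with plain arithmetic.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: address row i of a float matrix whose row stride is
// given in BYTES. Casting to char * first means the offset is not implicitly
// scaled by sizeof(float).
static inline const float * row_ptr(const float * base, std::size_t stride_bytes, std::size_t i) {
    return reinterpret_cast<const float *>(
        reinterpret_cast<const char *>(base) + i * stride_bytes);
}

// Solve L * x = b by forward substitution, where L is an n x n
// lower-triangular matrix. On entry x holds b; on exit it holds the solution.
static void solve_lower_tri(const float * L, std::size_t row_stride_bytes, float * x, std::size_t n) {
    assert(L != nullptr && x != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        const float * row = row_ptr(L, row_stride_bytes, i);
        float acc = x[i];
        for (std::size_t j = 0; j < i; ++j) {
            // Fused multiply-add: acc = acc - row[j] * x[j] with one rounding.
            acc = std::fmaf(-row[j], x[j], acc);
        }
        x[i] = acc / row[i];
    }
}
```

Because the stride is in bytes, the same routine works for padded rows: a 3 x 3 matrix stored with a pitch of four floats is addressed with `row_stride_bytes = 4 * sizeof(float)`.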
- Small cleanup
- Remove comments in solve_tri.cu
- Update ggml/src/ggml-cuda/solve_tri.cu
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
- Update ggml/src/ggml-cuda/solve_tri.cu
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
- Update ggml/src/ggml-cuda/solve_tri.cu
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
- Use const for variables in solve_tri.cu
- Replace fmaf with more readable code
- remove last fmaf
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
macOS/iOS:
Linux:
Windows: