
b7317

Dec 8, 2025
Meta/llama.cpp CLI vb7317

[!WARNING] Release Format Update: Linux releases will soon use .tar.gz archives instead of .zip. Please make the necessary changes to your deployment scripts.

cuda: optimize SOLVE_TRI using registers and FMAF (#17703)

  • ggml-cuda: optimize solve_tri_f32_fast and fix stride handling
      • Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts.
      • Implement explicit fmaf instructions for the reduction loop.
      • Update kernel arguments to pass strides in bytes rather than elements to align with standard ggml tensor arithmetic (casting to char * before addition).
      • Remove unused MAX_K_FAST definition.
  • Small cleanup
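
The byte-stride convention mentioned above can be illustrated with a small host-side C++ sketch. It is not taken from solve_tri.cu: the helper names `row_ptr` and `column_sum` and the parameter name `nb1` are assumptions made for illustration. It only shows why the pointer is cast to `char *` before the offset is added when a stride is expressed in bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from solve_tri.cu): return a pointer to row `i`
// of a float matrix whose row stride `nb1` is given in BYTES.
static const float * row_ptr(const float * base, std::size_t nb1, std::size_t i) {
    // Keep rows float-aligned; a byte stride must be a multiple of sizeof(float) here.
    assert(nb1 % sizeof(float) == 0);
    // Cast to char * first so the addition advances by bytes, not by floats.
    const char * bytes = reinterpret_cast<const char *>(base) + i * nb1;
    return reinterpret_cast<const float *>(bytes);
}

// Hypothetical consumer: sum element `col` of every row, walking rows by byte stride.
static float column_sum(const float * base, std::size_t nb1, std::size_t rows, std::size_t col) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < rows; ++i) {
        sum += row_ptr(base, nb1, i)[col];
    }
    return sum;
}
```

With byte strides the same code handles padded or non-contiguous rows: for a buffer holding 3 rows padded to 4 floats each, the caller passes `nb1 = 4 * sizeof(float)` regardless of how many columns are actually used.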

  • Remove comments in solve_tri.cu

  • Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>


  • Use const for variables in solve_tri.cu

  • Replace fmaf with more readable code

  • remove last fmaf
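
For readers unfamiliar with the operation, the following is a minimal CPU reference sketch of a triangular solve by forward substitution, written once with an explicit fused multiply-add and once in the plain `a * b + c` form. It is not the CUDA kernel from this change. It assumes a lower-triangular, row-major system, and the function name `solve_lower` and its `use_fma` flag are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference forward substitution for L * x = b, where L is an n x n
// lower-triangular matrix stored row-major. Hypothetical CPU sketch only.
static std::vector<float> solve_lower(const std::vector<float> & L,
                                      const std::vector<float> & b,
                                      std::size_t n, bool use_fma) {
    assert(L.size() == n * n && b.size() == n);
    std::vector<float> x(n);
    for (std::size_t i = 0; i < n; ++i) {
        // Accumulate the dot product of row i with the already-solved entries.
        float sum = 0.0f;
        for (std::size_t j = 0; j < i; ++j) {
            if (use_fma) {
                sum = std::fma(L[i * n + j], x[j], sum); // explicit fused multiply-add
            } else {
                sum += L[i * n + j] * x[j];              // plain form; a compiler may fuse it
            }
        }
        assert(L[i * n + i] != 0.0f); // the diagonal must be non-zero
        x[i] = (b[i] - sum) / L[i * n + i];
    }
    return x;
}
```

Both variants compute the same recurrence. Whether the plain form is compiled to a fused instruction depends on the compiler and its floating-point contraction settings; nvcc, for example, contracts multiply-add pairs by default (`--fmad=true`).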


