b8210

Mar 5, 2026
Meta/llama.cpp | CLI | vb8210

CUDA: Improve performance via fewer synchronizations between tokens (#17795)

  • Adds CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async()

  • Adds function to relax sync requirements between input copies on supported backends (CUDA for now)

  • Replaces the synchronous copy with the async copy function.

  • Adds macro guards to allow compilation in non-CUDA builds

  • Reworked backend detection in ggml-backend.cpp to avoid linking conflicts

  • Relaxes the checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues

  • Minor cleanup

  • Generalizes the opt-in for relaxing explicit syncs. Backends such as Vulkan, which require a synchronization between HtoD copies and graph execution, could now adopt this change as well.

  • Reintroduces stricter check for CPU->CUDA backend async copy via GGML_DEVICE_TYPE_CPU.

  • Corrects initialization of ggml_backend_sync_mode in ggml_backend_sched_split initialization

  • Simplifies synchronizations to adhere to saaasg pattern.

  • Apply suggestion from @ggerganov (src->buffer to buf_src)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

  • Apply suggestion from @ggerganov (src->buffer to buf_src) v2
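
The idea behind relaxing syncs between input copies can be illustrated with a toy model. The sketch below is not the ggml or CUDA API: `ToyStream`, `RunResult`, and `run_token` are hypothetical names invented for illustration, and the "device" is just a host vector. It assumes an in-order stream, where queued work runs in submission order, so a graph queued after the copies sees their results without a blocking sync after each copy.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Toy stand-in for an in-order device stream (hypothetical, NOT the CUDA or ggml API).
// enqueue() only records work; synchronize() runs it in submission order.
struct ToyStream {
    std::queue<std::function<void()>> pending;
    int sync_count = 0;  // number of blocking synchronizations issued

    void enqueue(std::function<void()> op) { pending.push(std::move(op)); }

    void synchronize() {
        ++sync_count;
        while (!pending.empty()) {
            pending.front()();
            pending.pop();
        }
    }
};

struct RunResult {
    float sum;   // output of the toy "graph"
    int   syncs; // how many times the host blocked
};

// Copies every host input to toy "device" memory, then runs a "graph" that sums them.
// relaxed == false: block after every input copy, then once more after the graph.
// relaxed == true:  queue all copies and the graph, then block a single time.
RunResult run_token(const std::vector<float> & host_inputs, bool relaxed) {
    ToyStream stream;
    std::vector<float> device(host_inputs.size(), 0.0f);
    float result = 0.0f;

    for (std::size_t i = 0; i < host_inputs.size(); ++i) {
        // Stands in for an asynchronous host-to-device copy.
        stream.enqueue([&device, &host_inputs, i] { device[i] = host_inputs[i]; });
        if (!relaxed) {
            stream.synchronize();
        }
    }

    // Stands in for graph execution; stream ordering guarantees the copies ran first.
    stream.enqueue([&device, &result] {
        for (float v : device) {
            result += v;
        }
    });
    stream.synchronize();

    return {result, stream.sync_count};
}
```

Calling `run_token(inputs, false)` and `run_token(inputs, true)` on the same inputs yields the same sum, but the relaxed variant blocks the host once instead of once per input plus once for the graph. In a real backend the host source buffer must also stay valid until the asynchronous copy has completed, which this model does not capture.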
