b8754
hexagon: improved Op queuing, buffer and cache management (#21705)
- hexagon: introduce op request batching and rewrite buffer management
The host now prepares batches of requests and dispatches them via a single dspqueue message.
Buffers are mapped explicitly by NPU while processing batches.
hex-dma: disable l2 bypass to work around a new issue caused by no flushes between Ops
hex-utils: add explicit l2flush and l2clear helpers
hex-opreq: use fine-grain per tensor l2 management
hex-opreq: avoid redundant invalidates for tensors we already flushed
hex-opreq: update debug messages
htp-opreq: reuse ops_context
hex-opreq: do not flush or invalidate cache lines beyond buffer boundary
hex-opreq: fix errors in log message
Revert "hex-opreq: do not flush or invalidate cache lines beyond buffer boundary"
This reverts commit 8b7f0a55a750a6430ce4eb1874c7feb3d720056d.
hexagon: limit l2 flushes to 1MB which covers l2 cache
hex-opreq: limit cache flush to 4MB
Looks like 4MB of contiguous virtual space should cover the 1MB cache.
hexagon: drop cache flush size to 2MB
hex-opreq: start reworking opreq packing
hex-opreq: introduce new way of packing opbatch where tensors are stored separately
hex-opreq: add a simple fastrpc call to force unmap all buffers
hex-l2flush: 2MB does not seem robust; also clean up step size to use line-size
hex-opreq: bump opreq batch size to 256
hex-mm: place src1 spad at the top of vtcm for easy reuse
hex-ops: introduce internal types and disable src1 reuse for now
Nothing new, just formalizing the repack / dyn.quant types we've been using.
htp-opreq: use tensor pointers instead of copies
hex-opreq: introduce more robust way for tracking vtcm/spad reuse
This removes the SKIP_QUANTIZE flag that became fragile with the addition of HMX and other ops.
hex-cumsum: fix error post opreq merge
hex-opreq: move request batch handling into the session
Prepping everything for using dspqueue buffers and doing that inside the session is much cleaner.
hex-mm: yet another fix for src1 reuse when we're mixing hmx/hvx
hex-bufs: introduce pinned mappings and use non-pinned ones for model buffers
hex-buf: add support for allocating shared/pinned buffer for opreqs
hex-opbatch: make opbatches configurable
hex-naming: better name for ggml_hexagon_shared_buffer
hex-naming: add session->c_name() helper
hex-opbatch: start using shm but still copy for now
hex-opbatch: use shared buffer for packing opbatch
hex-opbatch: better naming for opbatch related classes and code
hex-opbatch: reuse batched tensors with same data/dims/strides
hex-opbatch: update logging
hex-opbatch: add support for vmem limit for op batching
hex-opbatch: update htp side to properly support dynamic mmap/unmap
hex-opbatch: add OB and OQ params for run-completion script and fix the asserts in batch processing
hex-opbatch: fixed src1 handling in act ops
hex-act: fix empty src1 handling in swiglu and friends
Simplify preamble macro while at it
- hex-mm: minor fix for vtcm and dma handling in matmul
cleaning up some left-overs from merges
hex-opbatch: allocate extra 1KB for dspqueue overhead
hexagon: fix softmax for non-aligned tensors and cleanup vtcm alloc
hex-mm: properly handle hmx_disabled flag
hex-ops: update comments
hex-ops: add debug output for get/set-rows
hex-mmap: optimize un/mapping of buffers
hex-opreq: global cache flush and invalidate beyond 128KB threshold
hex-ops: add super simple opfilter regex for debugging
If an Op matches the regex, the hex backend will reject it.
hex-opbatch: wire up newer ops missed in merge and update main switch to detect this in the future
hexagon: improved vtcm acquisition to remove inter-op overhead
Fully compatible with QNN-HTP coex
hex-mm: fixed hvx fallback path
hex-mm: lower the vmem threshold a bit further to ~3GB
hexagon: update debug & error logs
This also fixes an issue with newer llvm merging repack and non-repack functions. We use those pointers to distinguish between buffer types.
- hexagon: move ops context into main context
Just a cleanup. We don't need separate contexts at this point.
hex-opbatch: cleanup naming and headers for opbatch and related descriptors
hex-fa: it's now better to enable FA during TG to reduce graph splits
hexagon: remove GGML_HEXAGON_EXPERIMENTAL env var
It's no longer useful. Please use the more flexible GGML_HEXAGON_OPFILTER to disable Ops if needed for debugging or validation.
hexagon: fixed editorconfig check
Update ggml/src/ggml-hexagon/ggml-hexagon.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
Co-authored-by: Trivikram Reddy tamarnat@qti.qualcomm.com
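A rough sketch of how a regex-based Op filter driven by GGML_HEXAGON_OPFILTER could behave, for readers who want to try it while debugging. Only the environment variable name and the "a matching Op is rejected" behavior come from the notes above. The `OpFilter` struct, the function names, the use of `std::regex_search`, and matching against a plain Op name string are all assumptions made for illustration and may differ from the actual backend code.

```cpp
#include <cstdlib>
#include <regex>
#include <string>

// Hypothetical container for the filter state (not taken from the backend).
struct OpFilter {
    bool       enabled = false;  // false when no pattern was supplied
    std::regex re;               // compiled pattern, valid only if enabled
};

// Build a filter from a pattern string; nullptr or "" disables filtering.
OpFilter make_opfilter(const char * pattern) {
    OpFilter f;
    if (pattern != nullptr && *pattern != '\0') {
        f.re      = std::regex(pattern);
        f.enabled = true;
    }
    return f;
}

// Read the pattern from the environment variable named in the notes.
OpFilter opfilter_from_env() {
    return make_opfilter(std::getenv("GGML_HEXAGON_OPFILTER"));
}

// Returns false when the Op description matches the filter, meaning the
// backend would reject the Op and leave it to another backend.
// Assumption: a substring match (regex_search) rather than a full match.
bool op_supported(const OpFilter & f, const std::string & op_desc) {
    if (f.enabled && std::regex_search(op_desc, f.re)) {
        return false;
    }
    return true;
}
```

Under these assumptions, setting the variable to something like `MUL_MAT|SOFT_MAX` would make the backend decline both Ops while accepting everything else, which is handy for bisecting a misbehaving Op.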
macOS/iOS:
- macOS Apple Silicon (arm64)
- macOS Apple Silicon (arm64, KleidiAI enabled)
- macOS Intel (x64)
- iOS XCFramework
Linux:
- Ubuntu x64 (CPU)
- Ubuntu arm64 (CPU)
- Ubuntu s390x (CPU)
- Ubuntu x64 (Vulkan)
- Ubuntu arm64 (Vulkan)
- Ubuntu x64 (ROCm 7.2)
- Ubuntu x64 (OpenVINO)
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: