b8140
hexagon: refactor all Ops to use local context struct (#19819)
hexagon: refactor set/get/sum-rows ops to use local context
hexagon: refactor ROPE and Softmax Ops to use local context
Improves performance slightly by precomputing values once and storing them in the context.
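As a rough illustration of the local-context pattern described above, here is a minimal plain-C sketch. It is not the backend's actual code: the struct `scale_rows_ctx`, its fields, and both functions are hypothetical names chosen for this example. The idea shown is that values every worker thread needs (here, the per-thread row count) are computed once at setup and stored in the context, so the worker only reads them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical per-op context (not the real hexagon backend struct).
// Values that each worker would otherwise recompute are derived once here.
struct scale_rows_ctx {
    const float *src;
    float       *dst;
    uint32_t     ne0;               // elements per row
    uint32_t     nrows;             // total number of rows
    uint32_t     nrows_per_thread;  // precomputed: ceil(nrows / nthreads)
    float        scale;
};

// Fill the context once, before any worker runs.
static void scale_rows_ctx_init(struct scale_rows_ctx *ctx, const float *src, float *dst,
                                uint32_t ne0, uint32_t nrows, uint32_t nthreads, float scale) {
    assert(nthreads > 0);
    ctx->src              = src;
    ctx->dst              = dst;
    ctx->ne0              = ne0;
    ctx->nrows            = nrows;
    ctx->nrows_per_thread = (nrows + nthreads - 1) / nthreads;
    ctx->scale            = scale;
}

// Worker for thread index ith: uses only values already stored in the context.
static void scale_rows_worker(const struct scale_rows_ctx *ctx, uint32_t ith) {
    uint32_t r0 = ith * ctx->nrows_per_thread;
    uint32_t r1 = r0 + ctx->nrows_per_thread;
    if (r0 > ctx->nrows) r0 = ctx->nrows;  // clamp for trailing threads
    if (r1 > ctx->nrows) r1 = ctx->nrows;

    for (uint32_t r = r0; r < r1; r++) {
        const float *s = ctx->src + (size_t) r * ctx->ne0;
        float       *d = ctx->dst + (size_t) r * ctx->ne0;
        for (uint32_t i = 0; i < ctx->ne0; i++) {
            d[i] = s[i] * ctx->scale;
        }
    }
}
```

A caller would run `scale_rows_ctx_init` once and then invoke `scale_rows_worker(&ctx, ith)` for each thread index; the real ops presumably cache more than a row count, but the structure would be similar.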
hexagon: refactor activation ops to use local context struct
hexagon: refactor unary ops to use local context struct and DMA/VTCM
hexagon: use aligned hvx_scale function
hexagon: remove unused fields from op_context
hexagon: rewrite ROPE to use DMA and VTCM scratchpad
hex-rope: keep N rows in scratchpad (instead of just two)
hex-rope: introduce rowidx cache
hex-rope: remove unused fields
hex-rope: rewrite DMA prefetch logic to allow for multi-row fetch/compute
This also removes the need for fastdiv.
hex-rope: minor formatting
hex-rope: use indices and unroll the loops
hex-rope: more updates to cleanup rope-block handling
hexagon: cleanup supported type/dims checks
hexagon: all reduce funcs return results replicated across lanes
There is no need to explicitly replicate the first value afterwards.
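To see why a lane-replicated reduction makes an explicit broadcast unnecessary, consider the scalar model below. It is a sketch only: it uses plain C arrays rather than HVX intrinsics, `NLANES` and `lanes_reduce_sum` are hypothetical names, and whether the backend's reduce functions use this exact butterfly pattern is an assumption. Each step adds the value from the lane whose index differs in one bit, so after log2(NLANES) steps every lane holds the full sum.

```c
#include <stddef.h>

// Assumed lane count for illustration: a 128-byte vector of 32-bit floats.
#define NLANES 32

// Scalar model of a butterfly (XOR-stride) sum reduction.
// On return, every element of v holds the sum of the original elements,
// so the caller does not need to broadcast lane 0.
static void lanes_reduce_sum(float v[NLANES]) {
    float tmp[NLANES];
    for (size_t stride = NLANES / 2; stride > 0; stride /= 2) {
        // Every lane combines with its partner lane at distance `stride`.
        for (size_t i = 0; i < NLANES; i++) {
            tmp[i] = v[i] + v[i ^ stride];
        }
        for (size_t i = 0; i < NLANES; i++) {
            v[i] = tmp[i];
        }
    }
}
```

With this shape, code that previously took the first lane and splatted it across the vector can use the reduction result directly.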
snapdragon: update adb and windows scripts to use ubatch-size 256
The updated Ops support handles larger ubatches.
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: