b7276
> [!WARNING]
> Release Format Update: Linux releases will soon use `.tar.gz` archives instead of `.zip`. Please make the necessary changes to your deployment scripts.
Add support for CUMSUM and TRI for CUDA. (#17584)
Minor optimizations.
Correct the float2 variant of warp_prefix_inclusive_sum to return float2
Optimize TRI
Whitespace
Fix strides.
Implement double loop
Whitespace
Fix HIP compilation bugs
Optimizations + performance tests for large cases
Implement using CUB with fallback to custom kernel
Remove error message.
Fixes from code review
Comment out CPU-unsupported F16/BF16 cases to fix CI
Fine, you win :P
Fix last cast, use NO_DEVICE_CODE and GGML_UNUSED_VARS
Vary warp-size based on physical warp size
Add GGML_UNUSED_VARS in tri as well
Use constexpr and call prefix_inclusive with warp_size template param
Update ggml/src/ggml-cuda/cumsum.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Change to tid % warp_size
Fix strides; hardcode mask; add ggml_lane_mask_t
Apply missing renames, remove unused get_warp_mask(), make calls to ggml_cuda_info() explicit
Too hasty...
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
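
A few of the changes above are easier to picture with short sketches. The blocks below are illustrative only; they are not the merged ggml-cuda code.

For the warp_prefix_inclusive_sum fix: a minimal sketch of a warp-level inclusive prefix sum built on __shfl_up_sync, plus a float2 overload that returns the full float2 pair. The warp_size template parameter, the hardcoded full mask, and the component-wise float2 handling are assumptions here.

```cpp
// Sketch of a warp-level inclusive prefix sum (Hillis-Steele style) using shuffles.
// After the loop, each lane holds the sum of its own value and of all lower lanes.
template <int warp_size>
__device__ __forceinline__ float warp_prefix_inclusive_sum(float x) {
    const int lane = threadIdx.x % warp_size;
#pragma unroll
    for (int offset = 1; offset < warp_size; offset <<= 1) {
        const float y = __shfl_up_sync(0xFFFFFFFF, x, offset, warp_size);
        if (lane >= offset) {
            x += y;
        }
    }
    return x;
}

// float2 overload: scan both components and return the full float2 pair
// (the bug being fixed was the overload not returning a float2).
template <int warp_size>
__device__ __forceinline__ float2 warp_prefix_inclusive_sum(float2 v) {
    v.x = warp_prefix_inclusive_sum<warp_size>(v.x);
    v.y = warp_prefix_inclusive_sum<warp_size>(v.y);
    return v;
}
```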
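
For "Implement using CUB with fallback to custom kernel": a rough sketch of routing a 1-D cumulative sum through cub::DeviceScan::InclusiveSum when CUB is available, with a hand-written kernel as the fallback path. The USE_CUB guard, the helper name, the plain cudaMalloc for temporary storage, and the deliberately naive single-thread fallback are placeholders, not the actual implementation.

```cpp
#include <cstdint>
#include <cuda_runtime.h>
#ifdef USE_CUB
#include <cub/cub.cuh>
#endif

// Naive fallback: one thread scans the array sequentially. Correct but slow; it only
// stands in for the custom kernel used when CUB is not available.
__global__ void cumsum_fallback_kernel(const float * src, float * dst, int64_t n) {
    if (blockIdx.x != 0 || threadIdx.x != 0) {
        return;
    }
    float acc = 0.0f;
    for (int64_t i = 0; i < n; ++i) {
        acc += src[i];
        dst[i] = acc;
    }
}

static void cumsum_f32_cuda(const float * src, float * dst, int64_t n, cudaStream_t stream) {
#ifdef USE_CUB
    // A first call with a null buffer only queries the required temporary storage size.
    size_t tmp_bytes = 0;
    cub::DeviceScan::InclusiveSum(nullptr, tmp_bytes, src, dst, (int) n, stream);

    void * tmp = nullptr;
    cudaMalloc(&tmp, tmp_bytes);   // real code would use a pooled allocation
    cub::DeviceScan::InclusiveSum(tmp, tmp_bytes, src, dst, (int) n, stream);
    cudaFree(tmp);
#else
    cumsum_fallback_kernel<<<1, 1, 0, stream>>>(src, dst, n);
#endif
}
```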
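
The warp-size review items (vary the warp size based on the physical warp size, pass it as a constexpr template parameter, compute the lane as tid % warp_size) combine into a dispatch pattern roughly like this. cudaGetDeviceProperties stands in for the ggml_cuda_info() helper named in the commits, and the per-warp partial-sum kernel is just an example body.

```cpp
#include <cuda_runtime.h>

// warp_size is a template parameter, so it is a compile-time constant inside the
// kernel; the lane index within the physical warp is tid % warp_size.
template <int warp_size>
__global__ void warp_partial_sums(const float * src, float * dst, int n) {
    const int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    const int lane = threadIdx.x % warp_size;

    float v = tid < n ? src[tid] : 0.0f;
#pragma unroll
    for (int offset = warp_size/2; offset > 0; offset >>= 1) {
        v += __shfl_xor_sync(0xFFFFFFFF, v, offset, warp_size);   // butterfly reduction
    }
    if (lane == 0 && tid < n) {
        dst[tid / warp_size] = v;   // one partial sum per physical warp
    }
}

static void launch_warp_partial_sums(const float * src, float * dst, int n, cudaStream_t stream) {
    int device = 0;
    cudaGetDevice(&device);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    const int block = 256;
    const int grid  = (n + block - 1) / block;

    if (prop.warpSize == 64) {   // AMD wavefront-64 devices (HIP)
        warp_partial_sums<64><<<grid, block, 0, stream>>>(src, dst, n);
    } else {                     // NVIDIA and wavefront-32 devices
        warp_partial_sums<32><<<grid, block, 0, stream>>>(src, dst, n);
    }
}
```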
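
Finally, for "hardcode mask; add ggml_lane_mask_t": shuffle masks hardcoded as 0xFFFFFFFF, as in the sketches above, are only wide enough for 32-lane warps, which is presumably why a dedicated lane-mask type was introduced. The definition below and the GGML_USE_HIP guard are an assumed shape, not the actual typedef.

```cpp
#include <cstdint>

// Assumed sketch: a lane mask needs one bit per lane of the physical warp,
// i.e. 32 bits on NVIDIA warps and 64 bits on AMD wavefront-64 GPUs.
#if defined(GGML_USE_HIP)
typedef uint64_t ggml_lane_mask_t;
#else
typedef uint32_t ggml_lane_mask_t;
#endif

// A hardcoded "all lanes" mask then works for either width (name is illustrative):
static constexpr ggml_lane_mask_t ALL_LANES_MASK = (ggml_lane_mask_t) -1;
```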
Downloads: macOS/iOS, Linux, Windows