
b9112

May 11, 2026
Meta / llama.cpp · CLI · vb9112

CUDA: handle OW > 65535 in im2col (2D and 3D) (#22944)

im2col_cuda and im2col_3d_cuda both dispatch with block_nums.y = OW, but CUDA caps grid Y at 65535. Conv1d encoders on raw 16 kHz audio exceed that cap once T > 65535 samples (about 4.1 s). SEANet at 11 s, for example, lands at OW = 176000, and the launch returns "invalid configuration argument".
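The arithmetic behind the failure can be checked on the host. The sketch below is illustrative only: conv_output_width and exceeds_grid_y are hypothetical helper names, not functions from llama.cpp, and the kernel size, stride and padding used in the usage note are assumed values, not SEANet's actual configuration. It uses the standard convolution output-size formula and the 65535 limit quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Upper bound for gridDim.y quoted in the commit message.
constexpr int64_t MAX_GRIDDIM_Y = 65535;

// Standard convolution output width for input width iw, kernel size k,
// stride s, padding p and dilation d (integer division rounds down).
// Hypothetical helper, named here for illustration only.
int64_t conv_output_width(int64_t iw, int64_t k, int64_t s, int64_t p, int64_t d) {
    assert(s > 0 && d > 0);
    return (iw + 2 * p - d * (k - 1) - 1) / s + 1;
}

// True when one block row per output column would exceed the grid Y limit.
bool exceeds_grid_y(int64_t ow) {
    return ow > MAX_GRIDDIM_Y;
}
```

With an assumed stride-1, "same"-padded first layer (k = 7, p = 3, d = 1), an 11 s clip at 16 kHz gives conv_output_width(11 * 16000, 7, 1, 3, 1) = 176000, for which exceeds_grid_y is true.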

Clamp block_nums.y to MIN(OW, MAX_GRIDDIM_Y) and loop inside the kernel with stride MAX_GRIDDIM_Y. This is the same in-kernel stride pattern already used for the z axis (MAX_GRIDDIM_Z). Both the 2D im2col_kernel and the 3D im2col_3d_kernel need the same fix. Output is bit-identical for OW <= 65535, because the new outer loop then runs exactly one iteration.
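The clamp-and-stride pattern can be emulated on the CPU to confirm that every output column is still visited exactly once. This is a sketch under stated assumptions, not the actual kernel code: clamped_grid_y, emulate_launch and covers_each_column_once are hypothetical names, and the outer loop over block_y stands in for blockIdx.y.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int64_t MAX_GRIDDIM_Y = 65535;

// Number of blocks launched along grid Y after clamping.
int64_t clamped_grid_y(int64_t ow) {
    return std::min(ow, MAX_GRIDDIM_Y);
}

// CPU emulation of the launch: block_y plays the role of blockIdx.y, and the
// inner loop is the in-kernel stride loop over output columns.
// Returns how many times each output column was visited.
std::vector<int> emulate_launch(int64_t ow) {
    assert(ow >= 0);
    std::vector<int> visits(static_cast<size_t>(ow), 0);
    const int64_t grid_y = clamped_grid_y(ow);
    for (int64_t block_y = 0; block_y < grid_y; ++block_y) {
        for (int64_t iow = block_y; iow < ow; iow += MAX_GRIDDIM_Y) {
            visits[static_cast<size_t>(iow)] += 1;
        }
    }
    return visits;
}

// True when every output column is handled by exactly one (block, iteration).
bool covers_each_column_once(int64_t ow) {
    const std::vector<int> visits = emulate_launch(ow);
    return std::all_of(visits.begin(), visits.end(),
                       [](int v) { return v == 1; });
}
```

For OW = 176000 the grid is clamped to 65535 blocks, and each block handles two or three columns. For OW <= 65535 each block's loop body runs once, which matches the single-iteration case described above.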

Tested on T4 / Jetson Orin with a SEANet encoder running on 11 s / 16 kHz audio (im2col reaching OW ~ 176000): before the fix the launch returns "invalid configuration argument"; after the fix it runs to completion. Existing test-backend-ops im2col cases are unchanged.
