b8184
vulkan: improve partial offloading performance on AMD (#19976)
- vulkan: fix and enable cpy_tensor_async function
- use transfer_queue for async transfers on AMD, synchronize with timeline semaphore
- update offload_op logic
- fix missing transfer submission
- disable async transfer queue on AMD GCN
- revert op batch size change
- fix cpy_tensor_async checks
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: