b8278

Mar 11, 2026
llama.cpp CLI b8278

llama-quant : correct n_attention_wv usage (#20357)

  • llama-quant : correct n_attention_wv usage

In #19770, I introduced a regression in how the quantize_state_impl counters were initialized: I was incrementing n_attention_wv and using it in the same loop, when its value should already be final by the time tensor types are decided in llama_tensor_get_type_impl (for use_more_bits).

I never observed a difference in any of my tests:

  • it was only after @bartowski kindly pointed this out that I realized it was incorrect. (Thanks!)
  • simplify
