b8297
ggml : add NVFP4 quantization type support (#19769)
WIP: add NVFP4 quantization support
tests
improve NVFP4 dot product implementation performance and fix bad super call
typo
Use nvfp4 kvalues
vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table
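For context on why the kvalues table can be shared: both MXFP4 and NVFP4 store elements as 4-bit E2M1 codes. A minimal decode sketch, assuming the standard E2M1 layout (bit 3 sign, bits 0..2 indexing the magnitude); ggml's actual `kvalues_mxfp4` table stores these magnitudes doubled as integers, which is an assumption here, not shown:

```python
# E2M1 magnitudes for codes 0..7; codes 8..15 are their negatives.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def e2m1_to_fp32(code: int) -> float:
    """Decode a 4-bit E2M1 code: bit 3 is the sign, bits 0..2 pick the magnitude."""
    mag = E2M1_VALUES[code & 0x7]
    return -mag if code & 0x8 else mag
```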
vulkan and perf fixes
wip
Fix metal
fix vulkan
Rename threshold & fix wrong scale
Fix MoE
Shelf backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD)
Remove NVFP4 support from GPU backends and architecture-specific optimized dot products. These should be added in separate PRs so backend specialists can review them independently.
Reverted files:
- ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh, quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
- ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h, ggml-metal-ops.cpp
- ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
- ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c
Core NVFP4 support (type definition, CPU fallback dot product, quantization, dequantization, conversion) is retained.
- Fix arch-fallback.h: add NVFP4 generic fallback for all platforms
After shelving backend-specific SIMD implementations, the generic CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390 platforms that previously relied on arch-specific versions.
quantize: add NVFP4 as a quantization type option
Fix ggml_fp32_to_ue4m3: handle subnormal values
Previously, values with ue4m3_exp <= 0 were clamped to 0, causing all small scales to underflow. This made NVFP4 quantization via llama-quantize produce garbage (PPL = 5.8M) since typical transformer weights have amax/6.0 in the range 0.001-0.01, which falls in the UE4M3 subnormal range.
Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7), matching the decode path in ggml_ue4m3_to_fp32.
Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33), comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).
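The fix above can be sketched as a small Python model of the UE4M3 scale format, assuming an exponent bias of 7 (consistent with the subnormal formula man * 2^-9) and using a brute-force nearest-code search for encoding rather than ggml's actual bit manipulation:

```python
def ue4m3_to_fp32(code: int) -> float:
    """Decode unsigned 8-bit E4M3: 4 exponent bits, 3 mantissa bits, bias 7 (assumed)."""
    exp = (code >> 3) & 0xF
    man = code & 0x7
    if exp == 0:
        return man * 2.0 ** -9                      # subnormal: man * 2^-9
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)     # normal

def fp32_to_ue4m3(x: float) -> int:
    """Encode by nearest-code search; illustrative only, not the real bit-twiddling."""
    if x <= 0.0:
        return 0
    return min(range(256), key=lambda c: abs(ue4m3_to_fp32(c) - x))
```

With this model, a typical scale amax/6.0 around 0.004 sits below the smallest normal value 2^-6, so it only survives via the subnormal path; clamping exp <= 0 to zero is exactly what destroyed these scales before the fix.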
- Restore ARM NEON NVFP4 dot product implementation
Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products.
tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup
- Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq
- Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy ggml_ue4m3_to_fp32() in the hot loop
- Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
- Accumulate with vfmaq_f32 into float32x4_t vector accumulators
tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)
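In scalar form, the structure of that hot loop looks roughly like the sketch below (hypothetical names; the real kernel uses NEON intrinsics and ggml's actual block structs, and details like the LUT size are assumptions here):

```python
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def _ue4m3_to_fp32(c):
    exp, man = (c >> 3) & 0xF, c & 0x7
    return man * 2.0 ** -9 if exp == 0 else (1.0 + man / 8.0) * 2.0 ** (exp - 7)

# Precomputed once, replacing the branch-heavy decode inside the hot loop.
UE4M3_SCALE_LUT = [_ue4m3_to_fp32(c) for c in range(256)]

def vec_dot_nvfp4_q8(n, nv_scales, nv_codes, q8_scales, q8_vals):
    """Scalar model of the dot product: per 16-element block,
    sum(e2m1[i] * q8[i]) scaled by the block's NVFP4 and Q8 scales."""
    acc = 0.0
    for b in range(n // 16):
        s = 0.0
        for i in range(16):
            c = nv_codes[b * 16 + i]
            v = -E2M1_VALUES[c & 7] if c & 8 else E2M1_VALUES[c & 7]
            s += v * q8_vals[b * 16 + i]
        acc += s * UE4M3_SCALE_LUT[nv_scales[b]] * q8_scales[b]
    return acc
```

The NEON version performs the inner sum as integer dot products on looked-up code values and folds the two scales into a single float multiply-accumulate, which is where the vpaddq/vfmaq steps above come in.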
- ARM NEON NVFP4: rearrange q8 to match nibble layout
Alternative approach: rearrange q8 data to match the NVFP4 lo/hi nibble layout instead of rearranging the looked-up NVFP4 values. Eliminates vcombine_s8(vget_low, vget_low) shuffles.
Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x block overhead from QK=16 vs QK=32, not the shuffle instructions.
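For reference, the block quantization the kernels operate on can be modeled as follows; a sketch assuming QK=16 elements per block (the source of the 2x block overhead vs Q8_0's QK=32) and a scale of amax/6.0 so the largest element maps onto the E2M1 maximum. The real code stores the scale as UE4M3 rather than a float, which this sketch omits:

```python
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
QK_NVFP4 = 16  # elements per block

def quantize_block_nvfp4(xs):
    """Quantize one block of 16 floats to 4-bit E2M1 codes plus a float scale."""
    assert len(xs) == QK_NVFP4
    amax = max(abs(x) for x in xs)
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in xs:
        mag = abs(x) / scale
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))  # nearest magnitude
        codes.append(idx | (0x8 if x < 0 else 0))
    return scale, codes

def dequantize_block_nvfp4(scale, codes):
    return [scale * (-E2M1_VALUES[c & 7] if c & 8 else E2M1_VALUES[c & 7])
            for c in codes]
```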
CPU-only backend: 64 super-block layout
cleanup
Remove unused LUT
int
exclude NVFP4 from unsupported ops in metal build
remove quantization for now
store scales as native UE4M3, preserve original model bits when possible
Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
correct comment
format
reduce duplication and cleanup
Address comments
move detection to prepare_tensors
Use math instead of const
Move
fix comment
Shelf quantize tests
Rebase and move check
cleanup
lint
Update gguf-py/gguf/scripts/gguf_convert_endian.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
Use fallback quant config
Simplify
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
organize
Refactor
Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
add quantize_nvfp4 (required for test_quants.py)
fix return type
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)