b9180
llama + spec: MTP Support (#22673)
spec: support MTP
fix batch size
rename files
cont : simplify (#7)
MTP: clean-up (#9)
MTP: clean-up
review: use llama_context_type instead of llama_graph_type
review: remove llama_model_has_mtp
review: fix convert issues
convert: fix pycheck
review: formatting
use mtp- for identifying mtp models
convert: fix mtp conversion
mtp -> draft-mtp
remove unused llama_arch
add need_embd in speculative
llama: allow partial seq_rm for GDN models for speculative decoding
Currently, speculative checkpointing has to restart from a checkpoint
when some draft tokens are rejected, which wastes work because the
target has to be run again. This PR adds the ability to roll back by up
to draft_max tokens by storing the GDN intermediate states.
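The idea can be illustrated with a toy sketch. This is not the llama.cpp API: the scalar state, the update rule, and the names step, verify_with_snapshots, and rollback are all hypothetical stand-ins for the real GDN state tensors. It only shows why keeping the intermediate state after each draft token lets the target resume from the last accepted token without re-running from a checkpoint.

```python
# Toy model of partial rollback for a recurrent state (NOT the llama.cpp API).
# The "state" is a single float updated once per token; real GDN states are tensors.

def step(state: float, token: int) -> float:
    """Hypothetical recurrent update: decay the state and mix in the token."""
    return 0.5 * state + token


def verify_with_snapshots(state: float, draft: list[int]) -> list[float]:
    """Run the target over all draft tokens, keeping the state after each one."""
    snapshots = []
    for tok in draft:
        state = step(state, tok)
        snapshots.append(state)
    return snapshots


def rollback(state_before: float, snapshots: list[float], n_accepted: int) -> float:
    """Return the state after the first n_accepted draft tokens, with no re-run."""
    if n_accepted == 0:
        return state_before
    return snapshots[n_accepted - 1]


state0 = 1.0
draft = [2, 4, 6]
snaps = verify_with_snapshots(state0, draft)

# Suppose only the first two draft tokens were accepted.
state = rollback(state0, snaps, n_accepted=2)
```

The cost of this approach is memory: one extra state copy per draft token, which is why the number of stored states is bounded (by draft_max in the description above).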
fix pending state
vulkan: add GDN partial rollback
meta: extend check to axis 1
metal: add GDN partial rollback
Extend the gated delta net kernel to store intermediate states for partial rollback support on the Metal backend.
- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior
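A rough sketch of the slot scheme described in the list above, written in Python rather than Metal. The slot mapping is an assumption made for illustration (newest state in slot 0, the state one token earlier in slot 1, and so on); the commit message does not say which slot each token writes to. The function name gdn_like_loop and the scalar update rule are hypothetical.

```python
# Sketch of a token loop that snapshots intermediate states into K slots.
# ASSUMPTION: the state after token t is written to slot (n_tokens - 1 - t),
# so slot 0 always ends up holding the newest state. The real kernel may differ.

def gdn_like_loop(slots: list[float], tokens: list[int]) -> list[float]:
    k = len(slots)            # K: snapshot slot count
    state = slots[0]          # input state is read from slot 0
    out = list(slots)
    n_tokens = len(tokens)
    for t, tok in enumerate(tokens):
        state = 0.5 * state + tok      # placeholder for the real state update
        slot = n_tokens - 1 - t        # assumed mapping, see note above
        if slot < k:
            out[slot] = state          # store an intermediate (or final) state
    return out


# K = 3: every intermediate state of a 3-token batch is kept.
multi = gdn_like_loop([1.0, 0.0, 0.0], [2, 4, 6])

# K = 1: only the final state is written, matching single-slot behavior.
single = gdn_like_loop([1.0], [2, 4, 6])
```

Under this assumed mapping, rolling back by r tokens amounts to reading slot r instead of slot 0, and K=1 degenerates to the original behavior of overwriting a single state.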
Ref: https://github.com/ggml-org/llama.cpp/commit/8c05923630110223669f069af2000e9cf10c02bc
Assisted-by: llama.cpp:local pi
delta_net_base: use ggml_pad instead of new_tensor
review: add need_rs_seq
review: rename part_bounded to n_rs
review: deslop comments
review: rename, add asserts
server : adjust checkpoint logic (#11)
server : adjust checkpoint logic
cont : rm asserts
server-context: fix early exit
spec : fix compatibility with n-gram and add TODOs (#13)
metal : cleanup
llama : fix faulty bitwise check in recurrent memory
server : disable RS-based MTP in combination with other spec types
spec : add TODOs
cont : fix comment
cont : update comment
common : fix logic for ngram + mtp compat
llama-memory: enable checkpointing with partial rollback
cont: add test-case for loading into a dirty ctx
llama-memory-recurrent: clear rs_idx in clear
download: fix mtp path
llama-arch: fix enorm op
docs: update docs
conversion: fix type annotations
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
macOS/iOS:
- macOS Apple Silicon (arm64)
- macOS Apple Silicon (arm64, KleidiAI enabled)
- macOS Intel (x64)
- iOS XCFramework
Linux:
- Ubuntu x64 (CPU)
- Ubuntu arm64 (CPU)
- Ubuntu s390x (CPU)
- Ubuntu x64 (Vulkan)
- Ubuntu arm64 (Vulkan)
- Ubuntu x64 (ROCm 7.2)
- Ubuntu x64 (OpenVINO)
- Ubuntu x64 (SYCL FP32)
- Ubuntu x64 (SYCL FP16)
Android:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: