b7864
spec : add self‑speculative decoding (no draft model required) + refactor (#18471)
server: introduce self-speculative decoding
server: moved self-call into speculative.cpp
can_speculate() includes self-speculation
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
server: can_speculate() tests self-spec
server: replace can_speculate() with slot.can_speculate()
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
common: use %zu format specifier for size_t in logging
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
server: can_speculate() requires a task instance
common: ngram map, config self-speculative decoding
common: add enum common_speculative_type
common: add vector of speculative states
common: add option --spec-draftless
server: cleanup (remove slot.batch_spec, rename)
common: moved self-spec impl to ngram-map
common: cleanup (use common_speculative_state_draft)
spec : refactor
cont : naming
spec: remove --spec-config
doc: (draftless) speculative decoding
common: print performance in spec decoding
minor : cleanup
common : better names
minor : cleanup + fix build
minor: comments
CODEOWNERS: add common/ngram-map.* (#18471)
common : rename speculative.draftless_type -> speculative.type
ngram-map : fix uninitialized values
ngram-map : take into account the input can become shorter
ngram-map : revert len check for now
arg : change --spec-draftless -> --spec-type
spec : add common_speculative_state::accept()
spec : refactor + add common_speculative_begin()
spec : fix begin() call with mtmd
spec : additional refactor + remove common_speculative_params
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
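
For readers unfamiliar with draftless (self-)speculative decoding, the general idea behind an n-gram lookup can be sketched as follows. This is a minimal illustration only and is not the code in common/ngram-map.*: the function names `ngram_draft` and `count_accepted`, their parameters, and the "most recent match wins" policy are all assumptions made for the example. The sketch proposes draft tokens by finding an earlier occurrence of the last `n` tokens in the slot's own history and copying what followed it, then counts how many of those proposals the model's verified tokens confirm.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not a llama.cpp API): propose up to n_draft tokens by
// locating the most recent earlier occurrence of the trailing n-gram of
// `history` and copying the tokens that followed it.
std::vector<int> ngram_draft(const std::vector<int> & history, size_t n, size_t n_draft) {
    std::vector<int> draft;
    if (n == 0 || history.size() <= n) {
        return draft; // not enough context to form and look up an n-gram
    }
    const size_t tail = history.size() - n; // start index of the trailing n-gram

    // scan candidate start positions from most recent to oldest
    for (size_t i = tail; i-- > 0; ) {
        bool match = true;
        for (size_t j = 0; j < n; ++j) {
            if (history[i + j] != history[tail + j]) {
                match = false;
                break;
            }
        }
        if (!match) {
            continue;
        }
        // copy the continuation that followed the earlier occurrence
        for (size_t k = i + n; k < history.size() && draft.size() < n_draft; ++k) {
            draft.push_back(history[k]);
        }
        break;
    }
    return draft;
}

// Hypothetical helper: number of leading draft tokens that agree with the
// tokens the target model actually sampled during verification.
size_t count_accepted(const std::vector<int> & draft, const std::vector<int> & sampled) {
    size_t n_ok = 0;
    while (n_ok < draft.size() && n_ok < sampled.size() && draft[n_ok] == sampled[n_ok]) {
        ++n_ok;
    }
    return n_ok;
}
```

Because the draft comes from the sequence's own history, no second model is loaded; the cost of a wrong guess is only the rejected tokens in the verification batch.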
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: