b9109
spec : parallel drafting support (#22838)
spec : refactor
spec : drop support for incompatible vocabs
spec : update common_speculative_init()
cont : pass seq_id
cont : dedup ctx_seq_rm_type
server : sketch the ctx_dft decode loop
server : draft prompt cache and checkpoints
server : improve ctx names
server, spec : transition to unified spec context
cont : sync main and drft contexts
cont : async drft eval when possible
cont : handle non-ckpt models
cont : pass correct n_past for drafting
cont : process images through the draft context
spec : handle draft running out of context
server : fix mtmd draft processing
server : fix URL for draft model
server : add comment
server : clean-up + dry
speculative-simple : update
spec : fix n_past type
server : fix slot ctx_drft ptr
tools : update readme
naming : improve consistency
spec : refactor for multi-sequence speculative context
cont : prepare params
cont : prepare params
spec : support parallel drafts
server : support parallel drafting
llama : reuse device buffers when possible
server, spec : clean-up
cont : clean-up
cont : minor
spec : reset drafting flag at the end
spec : introduce common_speculative_process()
spec : allow for multiple spec types (chain of speculators)
- replace the old type field (of type common_speculative_type) in the common_params_speculative struct with a vector so that multiple types can be specified
- introduce common_get_enabled_speculative_impls(const std::vector) to determine which implementations the user has enabled
- introduce common_speculative_type_from_names(const std::vector<std::string> & names) to parse the spec types provided by the user
- all speculators run sequentially and the best one wins (its drafted tokens are the ones verified)
- maximize the expected number of accepted tokens for the current round, computed as the product of the probability of accepting a token (n_acc_tokens / n_gen_drafts) and the draft's length
Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
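
The "best one wins" rule in the last commit can be illustrated with a small standalone sketch. It is not the code from the PR: the struct `spec_candidate` and the functions `expected_accepted()` and `pick_best()` are hypothetical names made up for illustration. Only the score itself, (n_acc_tokens / n_gen_drafts) multiplied by the draft length, comes from the commit message. The treatment of a speculator with no history (acceptance rate assumed to be 1.0) and of empty drafts (skipped) are assumptions as well.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-speculator record; the field names mirror the counters
// mentioned in the commit message but are not taken from the real code.
struct spec_candidate {
    std::string name;          // label for the speculator implementation
    std::size_t n_acc_tokens;  // numerator of the acceptance rate
    std::size_t n_gen_drafts;  // denominator of the acceptance rate
    std::size_t draft_len;     // length of the draft proposed this round
};

// Expected accepted tokens for this round: acceptance rate times draft length.
// Assumption: with no history yet, the acceptance rate is taken to be 1.0.
double expected_accepted(const spec_candidate & c) {
    const double p_acc = c.n_gen_drafts == 0
        ? 1.0
        : static_cast<double>(c.n_acc_tokens) / static_cast<double>(c.n_gen_drafts);
    return p_acc * static_cast<double>(c.draft_len);
}

// Returns the index of the candidate with the highest score,
// or -1 if no candidate produced a draft. Ties keep the earlier candidate.
int pick_best(const std::vector<spec_candidate> & cands) {
    int    best       = -1;
    double best_score = 0.0;
    for (std::size_t i = 0; i < cands.size(); ++i) {
        if (cands[i].draft_len == 0) {
            continue; // nothing to verify for this speculator
        }
        const double score = expected_accepted(cands[i]);
        if (best < 0 || score > best_score) {
            best       = static_cast<int>(i);
            best_score = score;
        }
    }
    return best;
}
```

Under this score, a long draft from a speculator with a low acceptance rate can lose to a shorter draft from a more reliable one. For example, 8 tokens at a rate of 0.25 scores 2.0, while 4 tokens at a rate of 0.75 scores 3.0.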
macOS/iOS:
- macOS Apple Silicon (arm64)
- macOS Apple Silicon (arm64, KleidiAI enabled)
- macOS Intel (x64)
- iOS XCFramework
Linux:
- Ubuntu x64 (CPU)
- Ubuntu arm64 (CPU)
- Ubuntu s390x (CPU)
- Ubuntu x64 (Vulkan)
- Ubuntu arm64 (Vulkan)
- Ubuntu x64 (ROCm 7.2)
- Ubuntu x64 (OpenVINO)
- Ubuntu x64 (SYCL FP32)
- Ubuntu x64 (SYCL FP16)
Android:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: