b9045

May 6, 2026
Meta/llama.cpp, CLI, vb9045

mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech) (#22101)

  • mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech)

Conformer encoder with Shaw relative position encoding, QFormer projector, log-mel spectrogram with frame stacking.

Encoder uses GLU gating, folded batch norm, and SSM depthwise conv. QFormer compresses encoder output via windowed cross-attention (window=15, queries=3) into the LLM embedding space.
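As a rough illustration of the compression described above, the sketch below counts how many embeddings reach the LLM for a given number of encoder frames. It assumes the windows are non-overlapping and that a partial final window is padded up to a full one; neither detail is stated in this changelog. The function name `qformer_output_tokens` is hypothetical.

```python
import math


def qformer_output_tokens(n_frames: int, window: int = 15, queries: int = 3) -> int:
    """Count projector outputs for n_frames encoder frames.

    Assumption (not confirmed by the changelog): frames are grouped into
    non-overlapping windows of `window` frames, a partial last window is
    padded to full size, and each window yields `queries` embeddings.
    """
    if n_frames < 0:
        raise ValueError("n_frames must be non-negative")
    n_windows = math.ceil(n_frames / window)
    return n_windows * queries


if __name__ == "__main__":
    for frames in (15, 16, 150):
        print(frames, "encoder frames ->", qformer_output_tokens(frames), "LLM embeddings")
```

Under these assumptions every 15 encoder frames become 3 embeddings, a 5x reduction in sequence length before the LLM sees the audio.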

Audio preprocessing: reflect-padded STFT, 80-bin mel filterbank, dynamic range compression, 2x frame stacking (80->160 mel).
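The 2x frame stacking step above can be sketched in a few lines: each pair of adjacent 80-bin mel frames is concatenated into one 160-value frame, halving the frame rate. This is a minimal pure-Python sketch, not the code from the change; the name `stack_frames` is hypothetical, and dropping a trailing unpaired frame is an assumption.

```python
def stack_frames(mel, factor=2):
    """Concatenate every `factor` consecutive frames into one wider frame.

    mel: list of frames, each a list of floats (e.g. 80 mel bins).
    Assumption: leftover frames that do not fill a full group are dropped.
    """
    usable = len(mel) - len(mel) % factor
    stacked = []
    for start in range(0, usable, factor):
        frame = []
        for offset in range(factor):
            frame.extend(mel[start + offset])
        stacked.append(frame)
    return stacked


if __name__ == "__main__":
    # Five dummy frames of 80 bins; frame t is filled with the value t.
    mel = [[float(t)] * 80 for t in range(5)]
    out = stack_frames(mel)
    print(len(out), "stacked frames of width", len(out[0]))
```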

GGUF converter handles batch norm folding at export time, fused K/V split, and Conv1d weight reshaping.
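Batch norm folding relies on the fact that, at inference time, batch norm is a fixed per-channel affine transform, so it can be precomputed as one scale and one shift per channel. The sketch below shows that identity in plain Python. The helper names and the default `eps` are illustrative assumptions and do not reflect the converter's actual code or tensor layout.

```python
import math


def fold_batch_norm(gamma, beta, mean, var, eps=1e-5):
    """Fold inference-time batch norm into per-channel (scale, shift).

    gamma * (x - mean) / sqrt(var + eps) + beta  ==  scale * x + shift
    """
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    shift = [b - m * s for b, m, s in zip(beta, mean, scale)]
    return scale, shift


def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Reference (unfolded) batch norm for one value per channel."""
    return [
        g * (xi - m) / math.sqrt(v + eps) + b
        for xi, g, b, m, v in zip(x, gamma, beta, mean, var)
    ]


if __name__ == "__main__":
    gamma, beta = [1.5, 0.5], [0.1, -0.2]
    mean, var = [0.3, -1.0], [4.0, 0.25]
    x = [2.0, -3.0]
    scale, shift = fold_batch_norm(gamma, beta, mean, var)
    folded = [s * xi + t for xi, s, t in zip(x, scale, shift)]
    print(folded, batch_norm(x, gamma, beta, mean, var))
```

Doing this once at export time means the runtime graph needs only a multiply and an add per channel. The scale and shift could also be merged into the weights and bias of a preceding convolution, though whether the converter does that is not stated here.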

Tested against HF transformers reference: token-for-token match on 30s/60s audio clips with greedy decoding.

  • mtmd: rename gs_ prefixed tensors to generic/architecture names

  • mtmd: use tensor_mapping.py for all granite_speech tensors

  • convert: fold GraniteSpeechTextModel into GraniteModel

  • mtmd: replace n_layer hack with explicit has_standard_layers flag

  • mtmd: replace hardcoded magic numbers with GGUF hparams for granite speech

  • mtmd: align KEY_A_ define spacing

  • convert: register GraniteModel for GraniteSpeechForConditionalGeneration

  • convert: fix ty type-check for GraniteSpeechMmprojModel registration

  • mtmd: align TN_ define spacing

  • mtmd: use generic layer loop for granite speech tensor loading

  • mtmd: merge qformer_proj_layer into clip_layer

  • mtmd: granite_speech remove redundant ggml_build_forward_expand on inputs

  • mtmd: granite_speech add comment explaining why build_attn is not used

  • mtmd: granite_speech hard-code eps in cpp, remove from GGUF metadata

  • gguf: add spacing between granite_speech tensor mapping blocks

  • mtmd: make generic audio layer_norm_eps read optional

  • mtmd: granite_speech keep encoder eps in GGUF, only hard-code projector eps

  • mtmd: align defines and struct fields in clip-impl.h and clip-model.h

  • mtmd: fix alignment and ordering issues across granite speech files

  • convert: granite_speech use filter_tensors instead of modify_tensors for skipping
