
b7689

Jan 10, 2026
Meta/llama.cpp · CLI · vb7689

mtmd: Add Gemma3n multimodal support with MobileNetV5 vision encoder (#18256)

  • Add Gemma3nVisionModel, a MobileNetV5 vision encoder converter, to convert_hf_to_gguf.py. Add gemma3n to vision projectors in gguf-py/gguf/constants.py.

  • Add mobilenetv5 impl

  • Fix comments, remove unused vars

  • Fix permute and remove transpose of projection weights

  • Fix comments, remove debugging prints from hf_to_gguf

  • Hard-code image_mean = 0 and image_std = 1
  • Use available tensor mapping logic
  • Remove redundant chat template replacement of soft tokens placeholder with media placeholder
  • Move mobilenetv5 helper declarations to the clip_graph_mobilenetv5 struct and definitions to mobilenetv5.cpp
  • Remove unused clip_is_gemma3n func declarations and definitions
  • Remove redundant rescale_image_u8_to_f32 func and use normalize_image_u8_to_f32 with zero mean and unit std
  • Calculate n_patches using image_size / patch_size
  • Remove obsolete comments

  • convert_hf_to_gguf.py, constants.py, tensor_mapping.py: Use explicit mapping: a custom map for double-indexed blocks and tensor_mapping.py for the rest
  • convert_hf_to_gguf.py: Unsqueeze stem bias and layer scale tensors to the correct shape while converting to GGUF
  • mobilenetv5.cpp: Remove explicit reshaping of stem bias and layer scale, which is now handled while converting to GGUF; replace fprintf with LOG_*
  • clip.cpp: Remove unused embedding and hard_emb_norm tensor loading
  • Rename tensors to v.conv..., v.blk..., v.msfa... to better align with existing terminology
  • Fix stem conv bias name

  • Remove explicit handling of bias term for stem conv

  • Change order of addition in "project_per_layer_inputs" to support broadcasting of vision inp_per_layer
  • Simplify the vision embeddings path of "get_per_layer_inputs" to output [n_embd_altup, n_layer, 1], broadcastable
  • clean up conversion script

  • fix code style

  • also preserve audio tensors

  • trailing space

  • split arch A and V

  • rm unused gemma3 func

  • fix alignment
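
Note on the rescale_image_u8_to_f32 removal listed above: a minimal Python sketch of why the dedicated rescale helper is redundant once mean is 0 and std is 1. It assumes the normalize step computes (x / 255 - mean) / std per channel on interleaved RGB data; the Python function names below are illustrative stand-ins, not the C++ signatures from clip.cpp.

```python
def normalize_u8_to_f32(pixels, mean, std):
    """Assumed behaviour: scale u8 to [0, 1], then apply (x - mean) / std per channel."""
    out = []
    for i, p in enumerate(pixels):
        c = i % 3  # interleaved RGB: channel index cycles 0, 1, 2
        out.append((p / 255.0 - mean[c]) / std[c])
    return out


def rescale_u8_to_f32(pixels):
    """Plain rescale of u8 values to [0, 1] (stand-in for the removed helper)."""
    return [p / 255.0 for p in pixels]


# Two hypothetical RGB pixels, flattened.
pixels = [0, 128, 255, 64, 32, 16]

normalized = normalize_u8_to_f32(pixels, mean=(0.0, 0.0, 0.0), std=(1.0, 1.0, 1.0))
rescaled = rescale_u8_to_f32(pixels)

# With zero mean and unit std the two paths give identical values.
same = normalized == rescaled
```

Subtracting 0.0 and dividing by 1.0 are exact in floating point, so the comparison holds bit for bit, not only approximately.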


Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
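
Note on the "project_per_layer_inputs" change listed above: a shape-only Python sketch of why operand order can matter for broadcasting. It assumes the add is one-directional, with the second operand repeated to match the first (the behaviour of ggml_add as far as we know), so the [n_embd_altup, n_layer, 1] vision tensor has to be the second operand. The helper names, the per_layer_proj name, and the sizes are hypothetical.

```python
def can_repeat(small, big):
    """True if each dim of `big` is a whole multiple of the matching dim of `small`."""
    return len(small) == len(big) and all(b % s == 0 for s, b in zip(small, big))


def add_shape(a, b):
    """Result shape of a + b when only b may be broadcast (repeated) to a's shape."""
    if not can_repeat(b, a):
        raise ValueError(f"cannot broadcast {b} to {a}")
    return a


# Hypothetical sizes, chosen only for illustration.
n_embd_altup, n_layer, n_tokens = 256, 30, 7

per_layer_proj = (n_embd_altup, n_layer, n_tokens)  # full per-token tensor
vision_inp = (n_embd_altup, n_layer, 1)             # broadcastable vision input

# Full tensor first, broadcastable tensor second: accepted.
result = add_shape(per_layer_proj, vision_inp)

# Reversed order: rejected, because the larger tensor cannot be repeated
# into the smaller one.
try:
    add_shape(vision_inp, per_layer_proj)
    reversed_ok = True
except ValueError:
    reversed_ok = False
```

Under that assumption, placing the full-size tensor first lets one vision input of width 1 be shared across all tokens without materializing a per-token copy.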
