Patch release: Fix FIM tokenizer

May 30, 2024
Mistral AI/mistral-common | CLI | v1.2.1

As reported in https://huggingface.co/mistralai/Codestral-22B-v0.1/discussions/10, the wrong tokenizer was used for fill-in-the-middle (FIM) requests. This patch release fixes that, so the following now works correctly:

from mistral_common.tokens.tokenizers.base import FIMRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Load the v3 tokenizer and encode a fill-in-the-middle request.
tokenizer = MistralTokenizer.v3()
tokenized = tokenizer.encode_fim(FIMRequest(prompt="def f(", suffix="return a + b"))
# The suffix is emitted before the prefix in the encoded text.
assert tokenized.text == "<s>[SUFFIX]return▁a▁+▁b[PREFIX]▁def▁f("
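
To make the expected layout easier to read, here is a minimal pure-Python sketch that rebuilds the same string by hand. It is not how mistral-common implements encode_fim. The helper name sketch_fim_text and the whitespace rules (spaces shown as "▁", with a leading "▁" on the prefix only) are assumptions inferred from the single assertion above, and the sketch does no real tokenization.

```python
# Illustrative only: mirrors the layout of the expected FIM text above.
# The constants and the helper are hypothetical, not part of mistral-common.
BOS = "<s>"
SUFFIX_TOKEN = "[SUFFIX]"
PREFIX_TOKEN = "[PREFIX]"
SPACE_MARK = "\u2581"  # the "▁" character that stands in for a space


def sketch_fim_text(prompt: str, suffix: str) -> str:
    """Build the debug-text layout: BOS, then suffix, then prefix."""
    # Assumption: the suffix is rendered without a leading space marker.
    suffix_part = suffix.replace(" ", SPACE_MARK)
    # Assumption: the prefix (the prompt) gets a leading space marker.
    prefix_part = SPACE_MARK + prompt.replace(" ", SPACE_MARK)
    return f"{BOS}{SUFFIX_TOKEN}{suffix_part}{PREFIX_TOKEN}{prefix_part}"


print(sketch_fim_text("def f(", "return a + b"))
```

The point to notice is the ordering: the suffix comes first and the prompt comes last, so the model continues generating from the end of the prefix.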