pyrit.message_normalizer.TokenizerTemplateNormalizer#

class TokenizerTemplateNormalizer(*, tokenizer: PreTrainedTokenizerBase, system_message_behavior: Literal['keep', 'squash', 'ignore', 'developer'] = 'keep')[source]#

Bases: MessageStringNormalizer

Applies the chat template stored in a Hugging Face tokenizer to a list of messages. For more details, see https://huggingface.co/docs/transformers/main/en/chat_templating.

__init__(*, tokenizer: PreTrainedTokenizerBase, system_message_behavior: Literal['keep', 'squash', 'ignore', 'developer'] = 'keep') None[source]#

Initialize an instance of the TokenizerTemplateNormalizer class.

Parameters:
  • tokenizer – A Hugging Face tokenizer with a chat template.

  • system_message_behavior – How to handle system messages. Options:
      - “keep”: Keep system messages as-is (default)
      - “squash”: Merge system messages into the first user message
      - “ignore”: Drop system messages entirely
      - “developer”: Change the system role to the developer role
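The four options above can be illustrated with a small sketch. This is not PyRIT's implementation — plain role/content dicts stand in for Message objects, and `apply_behavior` is a hypothetical helper that mirrors the documented semantics of each `system_message_behavior` value:

```python
# Illustrative sketch (not PyRIT internals): what each documented
# system_message_behavior option does to a chat transcript.
from typing import Dict, List

Msg = Dict[str, str]


def apply_behavior(messages: List[Msg], behavior: str) -> List[Msg]:
    if behavior == "keep":
        # Pass system messages through unchanged.
        return list(messages)
    if behavior == "ignore":
        # Drop system messages entirely.
        return [m for m in messages if m["role"] != "system"]
    if behavior == "developer":
        # Re-label the system role as the developer role.
        return [
            {**m, "role": "developer"} if m["role"] == "system" else m
            for m in messages
        ]
    if behavior == "squash":
        # Prepend the system text to the first user message.
        system_text = "\n".join(
            m["content"] for m in messages if m["role"] == "system"
        )
        result = [m for m in messages if m["role"] != "system"]
        for i, m in enumerate(result):
            if m["role"] == "user":
                result[i] = {**m, "content": f"{system_text}\n\n{m['content']}"}
                break
        return result
    raise ValueError(f"Unknown behavior: {behavior}")


chat = [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "Hi!"},
]
print(apply_behavior(chat, "squash"))
# [{'role': 'user', 'content': 'Be concise.\n\nHi!'}]
```

“squash” is useful for models such as Gemma whose chat templates reject a system role outright, which is why the gemma alias defaults to it.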

Methods

__init__(*, tokenizer[, system_message_behavior])

Initialize an instance of the TokenizerTemplateNormalizer class.

from_model(model_name_or_alias, *[, token, ...])

Create a normalizer from a model name or alias.

normalize_string_async(messages)

Apply the chat template stored in the tokenizer to a list of messages.

Attributes

MODEL_ALIASES: ClassVar[Dict[str, TokenizerModelConfig]] = {'chatml': TokenizerModelConfig(model_name='HuggingFaceH4/zephyr-7b-beta', system_message_behavior='keep'), 'falcon': TokenizerModelConfig(model_name='tiiuae/falcon-7b-instruct', system_message_behavior='keep'), 'gemma': TokenizerModelConfig(model_name='google/gemma-7b-it', system_message_behavior='squash'), 'llama2': TokenizerModelConfig(model_name='meta-llama/Llama-2-7b-chat-hf', system_message_behavior='keep'), 'llama3': TokenizerModelConfig(model_name='meta-llama/Meta-Llama-3-8B-Instruct', system_message_behavior='keep'), 'llama3-vision': TokenizerModelConfig(model_name='meta-llama/Llama-3.2-11B-Vision-Instruct', system_message_behavior='keep'), 'mistral': TokenizerModelConfig(model_name='mistralai/Mistral-7B-Instruct-v0.2', system_message_behavior='keep'), 'openchat': TokenizerModelConfig(model_name='openchat/openchat-3.5-0106', system_message_behavior='keep'), 'phi3': TokenizerModelConfig(model_name='microsoft/Phi-3-mini-4k-instruct', system_message_behavior='keep'), 'qwen': TokenizerModelConfig(model_name='Qwen/Qwen2-7B-Instruct', system_message_behavior='keep'), 'tinyllama': TokenizerModelConfig(model_name='TinyLlama/TinyLlama-1.1B-Chat-v1.0', system_message_behavior='keep')}#
classmethod from_model(model_name_or_alias: str, *, token: str | None = None, system_message_behavior: Literal['keep', 'squash', 'ignore', 'developer'] | None = None) TokenizerTemplateNormalizer[source]#

Create a normalizer from a model name or alias.

This factory method simplifies creating a normalizer by loading the tokenizer automatically. Use an alias for common models, or provide a full Hugging Face model path.

Parameters:
  • model_name_or_alias – Either a full Hugging Face model name or an alias (e.g., ‘chatml’, ‘phi3’, ‘llama3’). See MODEL_ALIASES for available aliases.

  • token – Optional Hugging Face token for gated models. If not provided, falls back to the HUGGINGFACE_TOKEN environment variable.

  • system_message_behavior – Override how to handle system messages. If not provided, uses the model’s default config.

Returns:

TokenizerTemplateNormalizer configured with the model’s tokenizer.

Raises:

ValueError – If the tokenizer doesn’t have a chat_template.
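The resolution logic described above can be sketched as follows. This is a hypothetical reconstruction, not PyRIT's source: it shows the documented order of precedence — alias lookup first, then treating the string as a full model name, with an explicit `system_message_behavior` overriding the alias default and `token` falling back to `HUGGINGFACE_TOKEN`. Only two entries from the documented MODEL_ALIASES table are reproduced:

```python
# Hypothetical sketch of from_model()'s resolution step (not PyRIT source).
import os
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TokenizerModelConfig:
    model_name: str
    system_message_behavior: str = "keep"


# Two entries from the documented MODEL_ALIASES table, for illustration.
MODEL_ALIASES: Dict[str, TokenizerModelConfig] = {
    "phi3": TokenizerModelConfig("microsoft/Phi-3-mini-4k-instruct", "keep"),
    "gemma": TokenizerModelConfig("google/gemma-7b-it", "squash"),
}


def resolve(
    model_name_or_alias: str,
    system_message_behavior: Optional[str] = None,
    token: Optional[str] = None,
) -> TokenizerModelConfig:
    # Alias lookup first; otherwise treat the string as a full model name.
    config = MODEL_ALIASES.get(
        model_name_or_alias, TokenizerModelConfig(model_name_or_alias)
    )
    # An explicit behavior overrides the alias default.
    behavior = system_message_behavior or config.system_message_behavior
    # token falls back to the HUGGINGFACE_TOKEN environment variable.
    _ = token or os.environ.get("HUGGINGFACE_TOKEN")
    return TokenizerModelConfig(config.model_name, behavior)
```

For example, `resolve("gemma")` yields the “squash” default from the alias table, while `resolve("gemma", "ignore")` overrides it; an unrecognized string such as `"org/custom-model"` is passed through as a full model name.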

async normalize_string_async(messages: List[Message]) str[source]#

Apply the chat template stored in the tokenizer to a list of messages.

Handles system messages based on the configured system_message_behavior:
  - “keep”: Pass system messages as-is
  - “squash”: Merge system messages into the first user message
  - “ignore”: Drop system messages entirely
  - “developer”: Change the system role to the developer role

Parameters:

messages – A list of Message objects.

Returns:

The formatted chat messages as a string.
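To make the return value concrete, the sketch below renders messages in the ChatML layout (one common chat-template format; the ‘chatml’ alias exists for it). This is a conceptual illustration only: the real method delegates to the tokenizer's own `apply_chat_template`, so the actual string varies by model, and plain dicts stand in here for PyRIT Message objects:

```python
# Conceptual illustration: what a chat-template-formatted string looks like.
# The real normalizer's output depends on the loaded tokenizer's template;
# this stand-in hardcodes the ChatML layout for demonstration.
import asyncio
from typing import Dict, List


async def normalize_string_async(messages: List[Dict[str, str]]) -> str:
    # Wrap each message in ChatML's <|im_start|>role ... <|im_end|> markers.
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


chat = [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "Hi!"},
]
formatted = asyncio.run(normalize_string_async(chat))
print(formatted)
# <|im_start|>system
# Be concise.<|im_end|>
# <|im_start|>user
# Hi!<|im_end|>
```

A Llama or Mistral tokenizer would instead emit its own delimiters (e.g. `[INST] ... [/INST]`), which is exactly why the normalizer reads the template from the tokenizer rather than hardcoding one.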