9. MessageNormalizer#

MessageNormalizers convert PyRIT’s Message format into other formats that specific targets require. Different LLMs and APIs expect messages in different formats:

  • OpenAI-style APIs expect ChatMessage objects with role and content fields

  • HuggingFace models expect specific chat templates (ChatML, Llama, Mistral, etc.)

  • Some models don’t support system messages and need them merged into user messages

  • Attack components sometimes need conversation history as a formatted text string

The MessageNormalizer classes handle these conversions, making it easy to work with any target regardless of its expected input format.

Base Classes#

There are two base normalizer types:

  • MessageListNormalizer[T]: Converts List[Message] → List[T] (e.g., to ChatMessage objects)

  • MessageStringNormalizer: Converts List[Message] → str (e.g., to ChatML format)

Some normalizers implement both interfaces.
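
The method names below match how the normalizers are used throughout this section; as a rough sketch, the base interfaces look approximately like this (the exact signatures in PyRIT may differ):

from abc import ABC, abstractmethod
from typing import Generic, List, TypeVar

from pyrit.models import Message

T = TypeVar("T")


class MessageListNormalizer(ABC, Generic[T]):
    """Converts PyRIT messages to a list of target-specific objects."""

    @abstractmethod
    async def normalize_async(self, messages: List[Message]) -> List[T]:
        ...


class MessageStringNormalizer(ABC):
    """Converts PyRIT messages to a single formatted string."""

    @abstractmethod
    async def normalize_string_async(self, messages: List[Message]) -> str:
        ...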

from pyrit.models import Message

# Create sample messages for demonstration
system_message = Message.from_prompt(prompt="You are a helpful assistant.", role="system")
user_message = Message.from_prompt(prompt="What is the capital of France?", role="user")
assistant_message = Message.from_prompt(prompt="The capital of France is Paris.", role="assistant")
followup_message = Message.from_prompt(prompt="What about Germany?", role="user")

messages = [system_message, user_message, assistant_message, followup_message]

print("Sample messages created:")
for msg in messages:
    print(f"  {msg.role}: {msg.get_piece().converted_value[:50]}...")
Sample messages created:
  system: You are a helpful assistant....
  user: What is the capital of France?...
  assistant: The capital of France is Paris....
  user: What about Germany?...

ChatMessageNormalizer#

The ChatMessageNormalizer converts Message objects to ChatMessage objects, which are the standard format for OpenAI chat-based API calls. It handles both single-part text messages and multipart messages (with images, audio, etc.).

Key features:

  • Single text pieces become simple string content

  • Multiple pieces become content arrays with type information

  • Supports use_developer_role=True for newer OpenAI models that use “developer” instead of “system”

from pyrit.message_normalizer import ChatMessageNormalizer

# Standard usage
normalizer = ChatMessageNormalizer()
chat_messages = await normalizer.normalize_async(messages)  # type: ignore[top-level-await]

print("ChatMessage output:")
for msg in chat_messages:  # type: ignore[assignment]
    print(f"  Role: {msg.role}, Content: {msg.content}")  # type: ignore[attr-defined]
ChatMessage output:
  Role: system, Content: You are a helpful assistant.
  Role: user, Content: What is the capital of France?
  Role: assistant, Content: The capital of France is Paris.
  Role: user, Content: What about Germany?
# With developer role for newer OpenAI models (o1, o3, gpt-4.1+)
dev_normalizer = ChatMessageNormalizer(use_developer_role=True)
dev_chat_messages = await dev_normalizer.normalize_async(messages)  # type: ignore[top-level-await]

print("ChatMessage with developer role:")
for msg in dev_chat_messages:  # type: ignore[assignment]
    print(f"  Role: {msg.role}, Content: {msg.content}")  # type: ignore[attr-defined]
ChatMessage with developer role:
  Role: developer, Content: You are a helpful assistant.
  Role: user, Content: What is the capital of France?
  Role: assistant, Content: The capital of France is Paris.
  Role: user, Content: What about Germany?
# ChatMessageNormalizer also implements MessageStringNormalizer for JSON output
json_output = await normalizer.normalize_string_async(messages)  # type: ignore[top-level-await]
print("JSON string output:")
print(json_output)
JSON string output:
[
  {
    "role": "system",
    "content": "You are a helpful assistant."
  },
  {
    "role": "user",
    "content": "What is the capital of France?"
  },
  {
    "role": "assistant",
    "content": "The capital of France is Paris."
  },
  {
    "role": "user",
    "content": "What about Germany?"
  }
]

GenericSystemSquashNormalizer#

Some models don’t support system messages. The GenericSystemSquashNormalizer merges the system message into the first user message using a standardized instruction format.

The format is:

### Instructions ###

{system_content}

######

{user_content}
from pyrit.message_normalizer import GenericSystemSquashNormalizer

squash_normalizer = GenericSystemSquashNormalizer()
squashed_messages = await squash_normalizer.normalize_async(messages)  # type: ignore[top-level-await]

print(f"Original message count: {len(messages)}")
print(f"Squashed message count: {len(squashed_messages)}")
print("\nFirst message after squashing:")
print(squashed_messages[0].get_piece().converted_value)
Original message count: 4
Squashed message count: 3

First message after squashing:
### Instructions ###

You are a helpful assistant.

######

What is the capital of France?

ConversationContextNormalizer#

The ConversationContextNormalizer formats conversation history as a turn-based text string. This is useful for:

  • Including conversation history in attack prompts

  • Logging and debugging conversations

  • Creating context strings for adversarial chat

The output format is:

Turn 1:
User: <content>
Assistant: <content>

Turn 2:
User: <content>
...
from pyrit.message_normalizer import ConversationContextNormalizer

context_normalizer = ConversationContextNormalizer()
context_string = await context_normalizer.normalize_string_async(messages)  # type: ignore[top-level-await]

print("Conversation context format:")
print(context_string)
Conversation context format:
Turn 1:
User: What is the capital of France?
Assistant: The capital of France is Paris.
Turn 2:
User: What about Germany?

TokenizerTemplateNormalizer#

The TokenizerTemplateNormalizer uses HuggingFace tokenizer chat templates to format messages. This is essential for:

  • Local LLM inference with proper formatting

  • Matching the exact prompt format a model was trained with

  • Working with various open-source models

Using Model Aliases#

For convenience, common models have aliases that automatically configure the normalizer:

| Alias   | Model                               | Notes                                   |
|---------|-------------------------------------|-----------------------------------------|
| chatml  | HuggingFaceH4/zephyr-7b-beta        | No auth required                        |
| phi3    | microsoft/Phi-3-mini-4k-instruct    | No auth required                        |
| qwen    | Qwen/Qwen2-7B-Instruct              | No auth required                        |
| llama3  | meta-llama/Meta-Llama-3-8B-Instruct | Requires HF token                       |
| gemma   | google/gemma-7b-it                  | Requires HF token, auto-squashes system |
| mistral | mistralai/Mistral-7B-Instruct-v0.2  | Requires HF token                       |

from pyrit.message_normalizer import TokenizerTemplateNormalizer

# Using an alias (no auth required for this model)
template_normalizer = TokenizerTemplateNormalizer.from_model("chatml")
formatted = await template_normalizer.normalize_string_async(messages)  # type: ignore[top-level-await]

print("ChatML formatted output:")
print(formatted)
No HuggingFace token provided. Gated models may fail to load without authentication.
ChatML formatted output:
<|system|>
You are a helpful assistant.</s>
<|user|>
What is the capital of France?</s>
<|assistant|>
The capital of France is Paris.</s>
<|user|>
What about Germany?</s>
<|assistant|>

System Message Behavior#

The TokenizerTemplateNormalizer supports different strategies for handling system messages:

  • keep: Pass system messages as-is (default)

  • squash: Merge system into first user message using GenericSystemSquashNormalizer

  • ignore: Drop system messages entirely

  • developer: Change system role to developer role (for newer OpenAI models)

# Using squash behavior for models that don't support system messages
squash_template_normalizer = TokenizerTemplateNormalizer.from_model("chatml", system_message_behavior="squash")
squashed_formatted = await squash_template_normalizer.normalize_string_async(messages)  # type: ignore[top-level-await]

print("ChatML with squashed system message:")
print(squashed_formatted)
No HuggingFace token provided. Gated models may fail to load without authentication.
ChatML with squashed system message:
<|user|>
### Instructions ###

You are a helpful assistant.

######

What is the capital of France?</s>
<|assistant|>
The capital of France is Paris.</s>
<|user|>
What about Germany?</s>
<|assistant|>
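
The ignore behavior is simpler still: system messages are dropped before the chat template is applied. A minimal sketch using the same alias (output omitted here):

# Using ignore behavior to drop system messages entirely
ignore_template_normalizer = TokenizerTemplateNormalizer.from_model("chatml", system_message_behavior="ignore")
ignore_formatted = await ignore_template_normalizer.normalize_string_async(messages)  # type: ignore[top-level-await]

print("ChatML without the system message:")
print(ignore_formatted)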

Using Custom Models#

You can also use any HuggingFace model with a chat template by providing the full model name.

# Using a custom HuggingFace model
# Note: Some models require authentication via HUGGINGFACE_TOKEN env var or token parameter
custom_normalizer = TokenizerTemplateNormalizer.from_model("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
custom_formatted = await custom_normalizer.normalize_string_async(messages)  # type: ignore[top-level-await]

print("TinyLlama formatted output:")
print(custom_formatted)
No HuggingFace token provided. Gated models may fail to load without authentication.
TinyLlama formatted output:
<|system|>
You are a helpful assistant.</s>
<|user|>
What is the capital of France?</s>
<|assistant|>
The capital of France is Paris.</s>
<|user|>
What about Germany?</s>
<|assistant|>
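
For gated models, the note above mentions authentication via the HUGGINGFACE_TOKEN environment variable or a token parameter. A hedged sketch, assuming the parameter is named token (verify against the PyRIT API before relying on it):

import os

# Hypothetical usage for a gated model; the `token` parameter name is an
# assumption based on the note in the previous example.
gated_normalizer = TokenizerTemplateNormalizer.from_model(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    token=os.environ.get("HUGGINGFACE_TOKEN"),
)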

Creating Custom Normalizers#

You can create custom normalizers by extending the base classes.

from typing import List

from pyrit.message_normalizer import MessageStringNormalizer
from pyrit.models import Message


class SimpleMarkdownNormalizer(MessageStringNormalizer):
    """Custom normalizer that formats messages as Markdown."""

    async def normalize_string_async(self, messages: List[Message]) -> str:
        lines = []
        for msg in messages:
            piece = msg.get_piece()
            role = piece.role.capitalize()
            content = piece.converted_value
            lines.append(f"**{role}**: {content}")
        return "\n\n".join(lines)


# Use the custom normalizer
md_normalizer = SimpleMarkdownNormalizer()
md_output = await md_normalizer.normalize_string_async(messages)  # type: ignore[top-level-await]

print("Markdown formatted output:")
print(md_output)
Markdown formatted output:
**System**: You are a helpful assistant.

**User**: What is the capital of France?

**Assistant**: The capital of France is Paris.

**User**: What about Germany?
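
A custom MessageListNormalizer[T] follows the same pattern, except that it implements normalize_async and returns a typed list. A minimal sketch, assuming MessageListNormalizer is importable from pyrit.message_normalizer like its string counterpart (the plain dict format here is illustrative, not a PyRIT type):

from typing import Dict, List

from pyrit.message_normalizer import MessageListNormalizer
from pyrit.models import Message


class DictNormalizer(MessageListNormalizer[Dict[str, str]]):
    """Custom normalizer that converts messages to plain role/content dicts."""

    async def normalize_async(self, messages: List[Message]) -> List[Dict[str, str]]:
        result = []
        for msg in messages:
            piece = msg.get_piece()
            result.append({"role": piece.role, "content": piece.converted_value})
        return result


# Use the custom list normalizer
dict_normalizer = DictNormalizer()
dict_output = await dict_normalizer.normalize_async(messages)  # type: ignore[top-level-await]
print(dict_output)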