pyrit.prompt_converter.AzureSpeechAudioToTextConverter

class AzureSpeechAudioToTextConverter(azure_speech_region: str | None = None, azure_speech_key: str | None = None, azure_speech_resource_id: str | None = None, use_entra_auth: bool = False, recognition_language: str = 'en-US')

Bases: PromptConverter

Transcribes a .wav audio file into text using the Azure AI Speech service.

https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-to-text

__init__(azure_speech_region: str | None = None, azure_speech_key: str | None = None, azure_speech_resource_id: str | None = None, use_entra_auth: bool = False, recognition_language: str = 'en-US') → None

Initializes the converter with Azure Speech service credentials and recognition language.

Parameters:
  • azure_speech_region (str, Optional) – The name of the Azure region.

  • azure_speech_key (str, Optional) – The API key for accessing the service (if not using Entra ID auth).

  • azure_speech_resource_id (str, Optional) – The resource ID for accessing the service when using Entra ID auth. This can be found by selecting ‘Properties’ in the ‘Resource Management’ section of your Azure Speech resource in the Azure portal.

  • use_entra_auth (bool) – Whether to use Entra ID authentication. If True, azure_speech_resource_id must be provided. If False, azure_speech_key must be provided. Defaults to False.

  • recognition_language (str) – The language of the speech to recognize, given as a locale code. Defaults to “en-US”. For the list of supported languages, see https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support

Raises:

ValueError – If the required environment variables are not set, if azure_speech_key is passed in when use_entra_auth is True, or if azure_speech_resource_id is passed in when use_entra_auth is False.
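
The two argument conflicts listed above can be sketched as a standalone check. This is an illustration of the documented rule only, not the library's code: the helper name check_auth_args and its return values are hypothetical, and it does not cover the missing-environment-variable case.

```python
from typing import Optional


def check_auth_args(
    *,
    azure_speech_key: Optional[str],
    azure_speech_resource_id: Optional[str],
    use_entra_auth: bool,
) -> str:
    """Hypothetical helper mirroring the documented argument conflicts."""
    # An API key must not be combined with Entra ID authentication.
    if use_entra_auth and azure_speech_key is not None:
        raise ValueError("azure_speech_key must not be passed when use_entra_auth is True")
    # A resource ID is only meaningful with Entra ID authentication.
    if not use_entra_auth and azure_speech_resource_id is not None:
        raise ValueError("azure_speech_resource_id must not be passed when use_entra_auth is False")
    # The returned label is illustrative only.
    return "entra" if use_entra_auth else "key"
```

Running a check like this before constructing the converter surfaces a mismatched credential early, with a message that names the offending argument.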

Methods

__init__([azure_speech_region, ...])

Initializes the converter with Azure Speech service credentials and recognition language.

convert_async(*, prompt[, input_type])

Converts the given audio file into its text representation.

convert_tokens_async(*, prompt[, ...])

Converts substrings within a prompt that are enclosed by specified start and end tokens.

get_identifier()

Returns an identifier dictionary for the converter.

input_supported(input_type)

Checks if the input type is supported by the converter.

output_supported(output_type)

Checks if the output type is supported by the converter.

recognize_audio(audio_bytes)

Recognizes speech in the given audio and returns the transcribed text.

stop_cb(evt, recognizer)

Callback function that stops continuous recognition when an event (evt) is received.

transcript_cb(evt, transcript)

Callback function that appends transcribed text upon receiving a "recognized" event.

Attributes

AZURE_SPEECH_KEY_ENVIRONMENT_VARIABLE

The API key for accessing the service.

AZURE_SPEECH_REGION_ENVIRONMENT_VARIABLE

The name of the Azure region.

AZURE_SPEECH_RESOURCE_ID_ENVIRONMENT_VARIABLE

The resource ID for accessing the service when using Entra ID auth.

supported_input_types

Returns a list of supported input types for the converter.

supported_output_types

Returns a list of supported output types for the converter.

AZURE_SPEECH_KEY_ENVIRONMENT_VARIABLE: str = 'AZURE_SPEECH_KEY'

The API key for accessing the service.

AZURE_SPEECH_REGION_ENVIRONMENT_VARIABLE: str = 'AZURE_SPEECH_REGION'

The name of the Azure region.

AZURE_SPEECH_RESOURCE_ID_ENVIRONMENT_VARIABLE: str = 'AZURE_SPEECH_RESOURCE_ID'

The resource ID for accessing the service when using Entra ID auth.
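
Because the constructor arguments are optional and a ValueError is raised when required environment variables are missing, a value is presumably read from the environment when no argument is given. A minimal sketch of that lookup follows. The helper name resolve_setting is hypothetical, and the precedence shown (explicit argument first, environment variable second) is an assumption rather than confirmed library behavior.

```python
import os
from typing import Optional


def resolve_setting(passed_value: Optional[str], env_var_name: str) -> str:
    """Hypothetical helper: prefer the explicit argument, else the environment."""
    # Assumption: an explicitly passed value takes precedence.
    if passed_value:
        return passed_value
    value = os.environ.get(env_var_name)
    if not value:
        # Matches the documented failure mode: a missing variable is a ValueError.
        raise ValueError(f"Environment variable {env_var_name} is required but not set")
    return value


# Example: the region comes from AZURE_SPEECH_REGION when no argument is passed.
os.environ["AZURE_SPEECH_REGION"] = "eastus"
region = resolve_setting(None, "AZURE_SPEECH_REGION")
```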

async convert_async(*, prompt: str, input_type: Literal['text', 'image_path', 'audio_path', 'video_path', 'url', 'reasoning', 'error', 'function_call', 'tool_call', 'function_call_output'] = 'audio_path') → ConverterResult

Converts the given audio file into its text representation.

Parameters:
  • prompt (str) – File path to the audio file to be transcribed.

  • input_type (PromptDataType) – The type of input data. Defaults to “audio_path”.

Returns:

The result containing the transcribed text.

Return type:

ConverterResult

Raises:

ValueError – If the input type is not supported or if the provided file is not a .wav file.
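
A usage sketch for convert_async, under several assumptions: the wrapper transcribe_wav is hypothetical, credentials are taken from the environment variables documented on this page, and the transcribed text is read from an output_text attribute on ConverterResult, which this page does not confirm. The extension check is the wrapper's own early guard, and PyRIT may need further setup that is not shown here.

```python
import asyncio


async def transcribe_wav(path: str, language: str = "en-US") -> str:
    """Hypothetical wrapper around convert_async for a single .wav file."""
    # Early guard of our own; the converter also documents a ValueError for non-.wav files.
    if not path.endswith(".wav"):
        raise ValueError(f"Expected a .wav file, got: {path}")

    # Imported lazily so this module loads even where PyRIT is not installed.
    from pyrit.prompt_converter import AzureSpeechAudioToTextConverter

    converter = AzureSpeechAudioToTextConverter(recognition_language=language)
    result = await converter.convert_async(prompt=path, input_type="audio_path")
    # Assumption: ConverterResult exposes the transcription as `output_text`.
    return result.output_text


# Usage (requires PyRIT, Azure credentials, and a real audio file):
# text = asyncio.run(transcribe_wav("sample.wav"))
```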

input_supported(input_type: Literal['text', 'image_path', 'audio_path', 'video_path', 'url', 'reasoning', 'error', 'function_call', 'tool_call', 'function_call_output']) → bool

Checks if the input type is supported by the converter.

Parameters:

input_type (PromptDataType) – The input type to check.

Returns:

True if the input type is supported, False otherwise.

Return type:

bool

output_supported(output_type: Literal['text', 'image_path', 'audio_path', 'video_path', 'url', 'reasoning', 'error', 'function_call', 'tool_call', 'function_call_output']) → bool

Checks if the output type is supported by the converter.

Parameters:

output_type (PromptDataType) – The output type to check.

Returns:

True if the output type is supported, False otherwise.

Return type:

bool

recognize_audio(audio_bytes: bytes) → str

Recognizes speech in the given audio and returns the transcribed text.

Parameters:

audio_bytes (bytes) – The audio data to transcribe, as bytes.

Returns:

Transcribed text.

Return type:

str

stop_cb(evt: Any, recognizer: Any) → None

Callback function that stops continuous recognition when an event (evt) is received.

Parameters:
  • evt (speechsdk.SpeechRecognitionEventArgs) – The event that triggers the stop.

  • recognizer (speechsdk.SpeechRecognizer) – Speech recognizer object.

transcript_cb(evt: Any, transcript: list[str]) → None

Callback function that appends transcribed text upon receiving a “recognized” event.

Parameters:
  • evt (speechsdk.SpeechRecognitionEventArgs) – The “recognized” event carrying the transcribed text.

  • transcript (list) – List to store transcribed text.
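
The interplay of the two callbacks can be simulated without the Speech SDK. The sketch below is not the library's implementation: FakeRecognizer and the SimpleNamespace events are stand-ins for the SDK objects, the callbacks are written as plain functions rather than methods, and joining the fragments with a space is an assumption about how the final text is assembled.

```python
from types import SimpleNamespace
from typing import Any, List


def transcript_cb(evt: Any, transcript: List[str]) -> None:
    # Append the text carried by a "recognized" event to the shared list.
    transcript.append(evt.result.text)


class FakeRecognizer:
    """Stand-in for speechsdk.SpeechRecognizer (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.running = True

    def stop_continuous_recognition_async(self) -> None:
        self.running = False


def stop_cb(evt: Any, recognizer: Any) -> None:
    # Stop continuous recognition once a terminating event arrives.
    recognizer.stop_continuous_recognition_async()


transcript: List[str] = []
recognizer = FakeRecognizer()

# Simulate two "recognized" events followed by a stop event.
for text in ("hello", "world"):
    transcript_cb(SimpleNamespace(result=SimpleNamespace(text=text)), transcript)
stop_cb(SimpleNamespace(result=None), recognizer)

# Assumption: fragments are joined with a single space.
full_text = " ".join(transcript)
```

Passing the list into the callback lets each recognized fragment accumulate in order, so the caller can assemble the full transcript after recognition stops.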