Using Text-to-Speech (TTS) Tools

This guide provides technical documentation for the text-to-speech (TTS) tools available in the Solace Agent Mesh (SAM).

1. Overview

The audio tool group provides two primary TTS tools for generating high-quality audio artifacts:

  1. text_to_speech: Converts a string of text to speech using a single voice, featuring intelligent tone selection.
  2. multi_speaker_text_to_speech: Converts a conversational script, delineated by speaker, into a multi-speaker audio file.

2. Setup and Configuration

Prerequisites

  • API Key: A valid Google Gemini API key with access to the TTS model is required.
  • Dependencies: The pydub library is required for audio processing and format conversion. Install it with pip install pydub. Converting WAV to MP3 can also depend on system audio codecs such as ffmpeg (see Troubleshooting).

Basic Configuration

  1. Environment Variable: The Gemini API key must be set as an environment variable.
    export GEMINI_API_KEY="your_gemini_api_key_here"
  2. Enablement: The audio tool group must be enabled in the agent's app_config.yml.
    tools:
      - tool_type: builtin-group
        group_name: "audio"

3. Advanced Configuration

For finer control over the TTS tools, provide a tool_config block for each tool in your app_config.yml.

text_to_speech Configuration

This example shows how to set a default voice and define the mapping between tones and specific voice models.

- tool_type: builtin
  tool_name: "text_to_speech"
  tool_config:
    gemini_api_key: ${GEMINI_API_KEY}
    model: "gemini-2.5-flash-preview-tts"
    voice_name: "Kore" # Default voice if no tone is matched
    language: "en-US" # Default language
    output_format: "mp3"
    # Voice selection by tone mapping
    voice_tone_mapping:
      bright: ["Zephyr", "Autonoe"]
      upbeat: ["Puck", "Laomedeia"]
      informative: ["Charon", "Rasalgethi"]
      firm: ["Kore", "Orus", "Alnilam"]
      friendly: ["Achird"]
      casual: ["Zubenelgenubi"]
      warm: ["Sulafar"]

multi_speaker_text_to_speech Configuration

This example defines default voice configurations for up to five speakers.

- tool_type: builtin
  tool_name: "multi_speaker_text_to_speech"
  tool_config:
    gemini_api_key: ${GEMINI_API_KEY}
    model: "gemini-2.5-flash-preview-tts"
    language: "en-US"
    output_format: "mp3"
    # Default speaker voice configurations
    default_speakers:
      - { name: "Speaker1", voice: "Kore", tone: "firm" }
      - { name: "Speaker2", voice: "Puck", tone: "upbeat" }
      - { name: "Speaker3", voice: "Charon", tone: "informative" }
      - { name: "Speaker4", voice: "Achird", tone: "friendly" }
      - { name: "Speaker5", voice: "Sulafar", tone: "warm" }
    # The voice_tone_mapping can also be included here

4. Features

Intelligent Tone Selection

The tools support tone-based voice selection: instead of naming a voice explicitly, you can request a tone (an emotional or stylistic quality), and a voice mapped to that tone is selected.

Available Tones: bright, upbeat, informative, firm, excitable, youthful, breezy, easy-going, breathy, clear, smooth, gravelly, soft, even, mature, forward, friendly, casual, gentle, lively, knowledgeable, warm

Tone Aliases:

  • professional → firm
  • cheerful → upbeat
  • calm → soft
  • conversational → casual
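
The selection logic described above can be sketched in Python. This is an illustration, not SAM's implementation: the function name resolve_voice is hypothetical, the mapping is the subset shown in the configuration example in this guide, and choosing randomly among the voices listed for a tone is an assumption. The fallback to the default voice for an unknown or unmapped tone follows the behaviour documented in this guide.

```python
import random

# Alias table and tone-to-voice mapping copied from this guide.
TONE_ALIASES = {
    "professional": "firm",
    "cheerful": "upbeat",
    "calm": "soft",
    "conversational": "casual",
}
VOICE_TONE_MAPPING = {
    "bright": ["Zephyr", "Autonoe"],
    "upbeat": ["Puck", "Laomedeia"],
    "informative": ["Charon", "Rasalgethi"],
    "firm": ["Kore", "Orus", "Alnilam"],
    "friendly": ["Achird"],
    "casual": ["Zubenelgenubi"],
    "warm": ["Sulafar"],
}
DEFAULT_VOICE = "Kore"


def resolve_voice(tone, mapping=VOICE_TONE_MAPPING, default=DEFAULT_VOICE, rng=random):
    """Map a requested tone (or alias) to a voice name (hypothetical helper)."""
    if not tone:
        return default
    key = tone.strip().lower()
    key = TONE_ALIASES.get(key, key)  # translate an alias to its canonical tone
    voices = mapping.get(key)
    if not voices:
        # Unknown or unmapped tone: fall back to the default voice.
        return default
    # Assumption: any voice listed for the tone is acceptable, so pick one at random.
    return rng.choice(voices)


if __name__ == "__main__":
    print(resolve_voice("professional"))  # one of the voices mapped to "firm"
    print(resolve_voice("xyz"))           # unknown tone, falls back to the default
```

With this mapping, the alias calm resolves to soft, which has no voices configured, so the sketch also returns the default voice in that case.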

Multi-Language Support

The tools support over 25 languages, specified via BCP-47 language codes (for example, en-US, fr-FR, es-US, ja-JP).
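
A quick format check can catch malformed language codes before a request is made. The sketch below is an assumption-laden simplification: it only accepts the language-REGION shape used by the examples in this guide (such as en-US), whereas full BCP-47 permits more forms, and it does not know which languages the tools actually support. The function name looks_like_language_code is hypothetical.

```python
import re

# Simplified shape: 2-3 lowercase letters, a hyphen, and a 2-letter uppercase region.
LANGUAGE_CODE_PATTERN = re.compile(r"^[a-z]{2,3}-[A-Z]{2}$")


def looks_like_language_code(code):
    """Return True if the code has the language-REGION shape (format check only)."""
    return bool(LANGUAGE_CODE_PATTERN.match(code))


if __name__ == "__main__":
    for candidate in ["en-US", "fr-FR", "english", "en_us"]:
        print(candidate, looks_like_language_code(candidate))
```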

5. Usage Examples

Single-Voice Text-to-Speech (text_to_speech)

Basic Usage

Convert the following text to speech: "Welcome to the technical briefing on artificial intelligence."

With Tone Selection

Convert this text to speech with a professional tone: "Thank you for joining today's technical review."

Multi-Speaker Text-to-Speech (multi_speaker_text_to_speech)

Basic Conversation

Convert this conversation to speech:
Speaker1: Welcome to the podcast.
Speaker2: Thank you for having me.

With Custom Speaker Tones

Convert this conversation using specific tones for each speaker:
- Speaker1 should sound professional
- Speaker2 should sound friendly

Conversation:
Speaker1: Good morning, this is the daily security briefing.
Speaker2: Hi everyone, let's review the agenda for today's session.
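
To illustrate what a speaker-delineated script looks like once it is split into turns, the following Python sketch parses the "Label: text" format used in the examples above. It is not SAM's parser: the function name parse_conversation is hypothetical, and the rules (a label is any text before the first colon, and an unlabelled line continues the previous turn) are assumptions made for the example.

```python
import re

# A speaker label is any text without a colon, followed by ":" and the utterance.
TURN_PATTERN = re.compile(r"^\s*([^:]+?)\s*:\s*(\S.*)$")


def parse_conversation(script):
    """Split a speaker-labelled script into (speaker, text) turns (hypothetical helper)."""
    turns = []
    for line in script.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = TURN_PATTERN.match(line)
        if match:
            turns.append((match.group(1), match.group(2).strip()))
        elif turns:
            # Unlabelled line: treat it as a continuation of the previous turn.
            speaker, text = turns[-1]
            turns[-1] = (speaker, text + " " + line.strip())
    return turns


if __name__ == "__main__":
    script = (
        "Speaker1: Good morning, this is the daily security briefing.\n"
        "Speaker2: Hi everyone, let's review the agenda for today's session."
    )
    for speaker, text in parse_conversation(script):
        print(speaker, "->", text)
```

Because the sketch treats the first colon on a line as the label separator, a continuation line that itself contains a colon (for example, a time such as 10:30) would be misread as a new turn.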

6. Tool Reference

text_to_speech

Parameter        Type     Description
text             string   The text to be synthesized.
output_filename  string   (Optional) A custom MP3 filename.
voice_name       string   (Optional) A specific voice name to use.
tone             string   (Optional) The desired voice tone.
language         string   (Optional) The BCP-47 language code.

multi_speaker_text_to_speech

Parameter          Type     Description
conversation_text  string   A string of text with speaker labels (for example, S1: ...).
output_filename    string   (Optional) A custom MP3 filename.
speaker_configs    array    (Optional) An array to configure tones for specific speakers.
language           string   (Optional) The BCP-47 language code.

7. Output and Metadata

Both tools generate an MP3 audio artifact whose metadata records:

  • The source text (or a truncated version for long inputs)
  • The voice(s) and language used for synthesis
  • The generation timestamp and the specific tool invoked
  • The requested tone and any speaker-specific configurations
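
As an illustration of the metadata listed above, the Python sketch below assembles a comparable record. The field names, the 500-character truncation limit, and the function name build_audio_metadata are all hypothetical; the actual keys and limits used by SAM are not specified in this guide.

```python
from datetime import datetime, timezone


def build_audio_metadata(tool_name, source_text, voices, language,
                         tone=None, speaker_configs=None, max_text_chars=500):
    """Assemble an illustrative metadata record for a generated audio artifact."""
    truncated = len(source_text) > max_text_chars
    # Keep only the start of long inputs and mark the cut with an ellipsis.
    text = source_text[:max_text_chars] + "..." if truncated else source_text
    return {
        "tool": tool_name,
        "source_text": text,
        "source_text_truncated": truncated,
        "voices": list(voices),
        "language": language,
        "requested_tone": tone,
        "speaker_configs": speaker_configs or [],
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    record = build_audio_metadata(
        "text_to_speech",
        "Welcome to the technical briefing on artificial intelligence.",
        ["Kore"],
        "en-US",
        tone="firm",
    )
    for key, value in record.items():
        print(key, ":", value)
```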

8. Troubleshooting

  • Error: GEMINI_API_KEY is required: This indicates that the GEMINI_API_KEY environment variable has not been set correctly.
  • Warning: Unknown tone 'xyz': The specified tone is not recognized. Refer to the list of supported tones. The system will fall back to a default voice.
  • Error: Failed to convert WAV to MP3: This typically indicates that pydub is not installed or that the underlying system is missing necessary audio codecs (for example, ffmpeg).
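
The environment-related errors above can be checked before starting the agent. The following Python sketch is not part of SAM: the function name check_tts_prerequisites is hypothetical, and looking for an ffmpeg executable on PATH is an assumption about which codec backend is in use. It verifies that GEMINI_API_KEY is non-empty, that pydub is importable, and that ffmpeg can be found.

```python
import importlib.util
import os
import shutil


def check_tts_prerequisites(env=None):
    """Return a list of problems found; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("GEMINI_API_KEY", "").strip():
        problems.append("GEMINI_API_KEY is not set")
    if importlib.util.find_spec("pydub") is None:
        problems.append("pydub is not installed (pip install pydub)")
    if shutil.which("ffmpeg") is None:
        # Assumption: ffmpeg is the codec backend used for MP3 conversion.
        problems.append("ffmpeg was not found on PATH")
    return problems


if __name__ == "__main__":
    issues = check_tts_prerequisites()
    if issues:
        for issue in issues:
            print("Missing prerequisite:", issue)
    else:
        print("All TTS prerequisites look satisfied.")
```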