Custom TTS lets you connect Rapida to any WebSocket-based speech synthesis service without writing a new Go transformer. The UI stores the provider credentials and assistant options. assistant-api reads them through the custom-tts transformer and uses WebSocket DSL rules to map outbound LLM text packets and inbound provider audio frames.

Provider identifier: custom-tts

Where it is configured

1. Create the provider credential

In the Rapida dashboard, open Integrations > Models, choose Custom TTS, and create a credential.

| Credential field | Required | Value |
| --- | --- | --- |
| apiCompatibility | Yes | websocket_v1 |
| baseUrl | Yes | WebSocket URL for your TTS service, for example wss://tts.example.com/v1/speak |
| headers | No | Header map sent on the WebSocket handshake, for example {"Authorization":"Bearer sk_..."} |

The backend also accepts snake case keys: api_compatibility and base_url.

2. Select Custom TTS on the assistant

Open the assistant voice settings and select Custom TTS as the text-to-speech provider.

3. Fill the TTS arguments

Set the voice, model, language, audio format, query parameters, request rules, and response rules. The argument reference below maps one-to-one to the UI fields.

TTS arguments

The UI stores these keys under the "speaker." metadata prefix, but the transformer reads the option keys shown below.

| Option key | Required | Default | Description |
| --- | --- | --- | --- |
| speak.voice.id | Yes | None | Provider voice identifier. Reference it in DSL as voice_id or config.voice.id. |
| speak.model | No | Empty | Provider model identifier. Reference it in DSL as model or config.model. |
| speak.language | No | Empty | Provider language code. Reference it in DSL as language or config.language. |
| speak.audio.encoding | Yes | LINEAR16 | Audio encoding expected back from the provider. Supported values: LINEAR16, MuLaw8. |
| speak.audio.sample_rate | Yes | 16000 | Audio sample rate expected back from the provider. Supported UI values: 8000, 16000, 22050, 24000, 32000, 44100, 48000. |
| speak.ws.query_params | No | {} | Flat JSON object appended to baseUrl as query parameters. Values can use DSL expressions. |
| speak.ws.request_rules | Yes | See example below | Ordered JSON array that tells Rapida what to send for each outbound packet. Must contain at least one text rule. |
| speak.ws.response_rules | Yes | None | Ordered JSON array that tells Rapida how to parse provider WebSocket frames into audio, done, or error events. |

Query parameters

Use speak.ws.query_params when your provider expects configuration in the WebSocket URL. Supported variables:

| Variable | Source |
| --- | --- |
| message_id | Current synthesis message ID |
| voice_id | speak.voice.id |
| model | speak.model |
| language | speak.language |
| encoding | speak.audio.encoding |
| sample_rate | speak.audio.sample_rate |

{
  "voice": { "$var": "voice_id" },
  "model": { "$var": "model" },
  "language": { "$var": "language" },
  "sample_rate": {
    "$cast": "number",
    "value": { "$var": "sample_rate" }
  }
}

Request rules

Request rules are evaluated for normalized TTS packets produced by Rapida.

| Packet | When it is sent | Available paths |
| --- | --- | --- |
| text | LLM text is ready for synthesis | packet.kind, packet.message_id, packet.text, config.voice.id, config.model, config.language, config.audio.encoding, config.audio.sample_rate |
| done | The LLM response is complete | packet.kind, packet.message_id, packet.text, config.* |
| interrupt | User interruption is detected | packet.kind, packet.message_id, packet.text, config.* |

Supported outbound frame types are binary, json, and text.
Add an interrupt rule if your provider needs an explicit cancel/clear message. Without it, queued provider audio can continue after the user starts speaking.

One-shot synthesis

Use this when the provider synthesizes each text packet immediately.
[
  {
    "when": { "packet": "text" },
    "send": {
      "frame": "json",
      "body": {
        "text": { "$path": "packet.text" },
        "voice_id": { "$path": "config.voice.id" },
        "message_id": { "$path": "packet.message_id" },
        "model": { "$path": "config.model" },
        "language": { "$path": "config.language" },
        "audio": {
          "encoding": { "$path": "config.audio.encoding" },
          "sample_rate": {
            "$cast": "number",
            "value": { "$path": "config.audio.sample_rate" }
          }
        }
      }
    }
  }
]

Two-step synthesis with done

Use this when the provider expects text first and a final flush/done packet.
[
  {
    "when": { "packet": "text" },
    "send": {
      "frame": "json",
      "body": {
        "text": { "$path": "packet.text" },
        "voice_id": { "$path": "config.voice.id" },
        "message_id": { "$path": "packet.message_id" }
      }
    }
  },
  {
    "when": { "packet": "done" },
    "send": {
      "frame": "json",
      "body": {
        "type": "done",
        "message_id": { "$path": "packet.message_id" }
      }
    }
  },
  {
    "when": { "packet": "interrupt" },
    "send": {
      "frame": "json",
      "body": {
        "type": "interrupt",
        "message_id": { "$path": "packet.message_id" }
      }
    }
  }
]

Response rules

Response rules parse provider WebSocket frames into Rapida audio packets. Supported inbound frame types:

| Frame | Use when |
| --- | --- |
| binary | The provider streams raw audio frames. |
| json | The provider returns JSON messages with base64 audio or status fields. |

Supported emit keys:

| Emit key | Type | Description |
| --- | --- | --- |
| audio | Binary bytes or base64-decoded bytes | Audio chunk to play. |
| message_id | string | Optional message ID associated with the audio or done event. |
| done | boolean | true when synthesis for the message is complete. |
| error | Any | Provider error value. When present, Rapida treats the frame as an error. |

Binary audio responses

[
  {
    "when": { "frame": "binary" },
    "emit": {
      "audio": { "$frame": "binary" }
    }
  },
  {
    "when": { "frame": "json", "path": "type", "equals": "done" },
    "emit": {
      "message_id": { "$path": "message_id" },
      "done": true
    }
  },
  {
    "when": { "frame": "json", "path": "type", "equals": "error" },
    "emit": {
      "message_id": { "$path": "message_id" },
      "error": { "$path": "error.message" },
      "done": true
    }
  }
]

JSON base64 audio responses

[
  {
    "when": { "frame": "json", "path": "type", "equals": "chunk" },
    "emit": {
      "audio": {
        "$decode": "base64",
        "value": { "$path": "audio" }
      },
      "message_id": { "$path": "message_id" }
    }
  },
  {
    "when": { "frame": "json", "path": "type", "equals": "done" },
    "emit": {
      "message_id": { "$path": "message_id" },
      "done": true
    }
  }
]

TTS DSL design

Custom TTS uses a JSON-template DSL with three sections: query parameters, request rules, and response rules. The DSL is intentionally small. It does not run scripts, call functions, concatenate strings, perform regex matching, or read environment variables.

TTS frame support

| Frame type | Outbound request | Inbound response | Notes |
| --- | --- | --- | --- |
| binary | Yes | Yes | Use inbound binary frames for raw provider audio chunks. |
| json | Yes | Yes | Use for provider request payloads, base64 audio chunks, done events, and errors. |
| text | Yes | No | Use for outbound provider control frames when required. |

Inbound parsing rules:
  • WebSocket message type 2 is treated as binary.
  • Non-binary messages are parsed as JSON when they contain exactly one valid JSON value.
  • Non-JSON messages are treated as text, but TTS response rules do not support text response frames.
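
The inbound parsing rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name classify_frame is hypothetical, the payload is assumed to arrive as bytes, and the real Go transformer may detect JSON differently.

```python
import json

BINARY_MESSAGE_TYPE = 2  # WebSocket message type treated as binary


def classify_frame(message_type: int, payload: bytes):
    """Return (kind, value), where kind is 'binary', 'json', or 'text'."""
    if message_type == BINARY_MESSAGE_TYPE:
        return "binary", payload
    text = payload.decode("utf-8", errors="replace")
    try:
        # json.loads rejects trailing data, so only a payload that holds
        # exactly one JSON value is classified as json.
        return "json", json.loads(text)
    except json.JSONDecodeError:
        # Text frames are recognized, but TTS response rules cannot match them.
        return "text", text
```

A frame classified as text would then simply find no matching response rule.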

TTS operators

Every operator object must contain only that operator and its required field.

| Operator | Where supported | Description |
| --- | --- | --- |
| $var | Query parameters | Reads message_id, voice_id, model, language, encoding, or sample_rate. |
| $path | Request rules, response rules | Reads a dot path from request scope or a JSON response frame. |
| $cast | Query parameters, request rules, response rules | Casts to string, number, or boolean. |
| $frame | Response rules | Reads the full current binary response frame. |
| $decode | Response rules | Decodes a base64 string into bytes. Only base64 is supported. |

Unsupported TTS operators:
  • $frame: "text" and $frame: "json" are not supported for TTS emit rules.
  • $decode supports only base64.

Cast behavior

| Cast | Behavior |
| --- | --- |
| string | Converts strings, bytes, numbers, booleans, and null to string form. |
| number | Converts JSON numbers, numeric values, or numeric strings to an integer or float. |
| boolean | Converts booleans, boolean strings, and numeric values. JSON numbers are accepted as 0 or 1; typed numeric values use zero as false and non-zero as true. |
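
As a rough mental model of the cast table, the following Python sketch mirrors the three casts. The function name cast is hypothetical, and several edge cases are assumptions rather than documented behavior: the string form of null and booleans, case-insensitive boolean strings, and treating every number as zero or non-zero (Python cannot distinguish JSON numbers from typed numeric values).

```python
def cast(kind, value):
    """Sketch of $cast semantics; edge cases are assumptions, see comments."""
    if kind == "string":
        if value is None:
            return "null"  # assumption: the exact string form of null is not documented
        if isinstance(value, bool):
            return "true" if value else "false"  # assumption: JSON-style spelling
        if isinstance(value, bytes):
            return value.decode("utf-8")
        return str(value)
    if kind == "number":
        if isinstance(value, bool):
            raise ValueError("booleans are not treated as numeric here")  # assumption
        if isinstance(value, (int, float)):
            return value
        if isinstance(value, str):
            text = value.strip()
            try:
                return int(text)
            except ValueError:
                return float(text)  # raises ValueError for non-numeric strings
        raise ValueError(f"cannot cast {type(value).__name__} to number")
    if kind == "boolean":
        if isinstance(value, bool):
            return value
        if isinstance(value, str):
            lowered = value.strip().lower()
            if lowered in ("true", "false"):
                return lowered == "true"
            raise ValueError(f"not a boolean string: {value!r}")
        if isinstance(value, (int, float)):
            return value != 0  # simplification of the 0/1 rule for JSON numbers
        raise ValueError(f"cannot cast {type(value).__name__} to boolean")
    raise ValueError(f"unsupported cast: {kind}")
```

The most common use in this document is casting speak.audio.sample_rate, which may be stored as a string, to a number before sending it to the provider.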

JSON path behavior

$path uses dot-separated paths.
{ "$path": "packet.text" }
Objects are traversed by key. Arrays are traversed by numeric index.
{ "$path": "chunks.0.audio" }
Limits:
  • Keys containing a literal dot are not addressable.
  • Request rules can only read from config and packet.
  • Response rules can use $path only with JSON response frames.
  • A missing path in when.path means the rule does not match.
  • A missing path in emit or send.body is an error.
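
A minimal Python sketch of these path rules follows. The names MissingPath, resolve_path, and when_path_matches are hypothetical, and the Go implementation may differ in detail. The point is the split behavior: a missing path raises when a value is being rendered, but only yields "no match" inside a when condition.

```python
class MissingPath(KeyError):
    """Raised when a dot path cannot be resolved."""


def resolve_path(scope, path):
    """Walk a dot-separated path through dicts (by key) and lists (by index)."""
    current = scope
    for segment in path.split("."):
        if isinstance(current, dict) and segment in current:
            current = current[segment]
        elif (isinstance(current, list) and segment.isdecimal()
              and int(segment) < len(current)):
            current = current[int(segment)]
        else:
            # In emit or send.body this is an error.
            raise MissingPath(path)
    return current


def when_path_matches(frame, path, expected):
    """In when.path, a missing path means the rule does not match."""
    try:
        return resolve_path(frame, path) == expected
    except MissingPath:
        return False
```

Because the path is split on every dot, a key such as "audio.data" can never be reached, which is the first limit listed above.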

Query parameter rules

speak.ws.query_params must be a flat JSON object. Each value must resolve to a primitive value: string, number, boolean, or null. Nested objects and arrays are rejected unless the object is a DSL expression.
{
  "voice": { "$var": "voice_id" },
  "model": { "$var": "model" },
  "sample_rate": {
    "$cast": "number",
    "value": { "$var": "sample_rate" }
  }
}
The rendered query parameters are appended to baseUrl. Existing query parameters in baseUrl are preserved unless the same key is rendered by speak.ws.query_params.
TTS request rules can read packet.text, but text is not a supported query parameter variable.
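
To make the merge behavior concrete, here is a small Python sketch of how rendered query parameters could be combined with baseUrl. The helper names (to_number, render_value, build_url) are hypothetical, only $var and the string and number casts are handled, and converting every rendered value with str() is an assumption about serialization.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_number(value):
    """Return ints and floats unchanged; parse numeric strings."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return value
    try:
        return int(value)
    except ValueError:
        return float(value)


def render_value(expr, variables):
    """Render a literal, a $var lookup, or a $cast wrapper."""
    if not isinstance(expr, dict):
        return expr
    if "$var" in expr:
        return variables[expr["$var"]]
    if "$cast" in expr:
        inner = render_value(expr["value"], variables)
        if expr["$cast"] == "number":
            return to_number(inner)
        if expr["$cast"] == "string":
            return str(inner)
    raise ValueError(f"expression not handled in this sketch: {expr!r}")


def build_url(base_url, query_params, variables):
    """Append rendered parameters; rendered keys replace same-named keys in base_url."""
    parts = urlsplit(base_url)
    merged = dict(parse_qsl(parts.query, keep_blank_values=True))
    for key, expr in query_params.items():
        merged[key] = str(render_value(expr, variables))
    return urlunsplit(parts._replace(query=urlencode(merged)))
```

With a baseUrl of wss://tts.example.com/v1/speak?voice=old&format=pcm, the sketch keeps format=pcm and replaces voice with the rendered voice_id.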

Request rule shape

speak.ws.request_rules is an ordered JSON array. Every matching rule is sent, so one packet can produce multiple WebSocket messages.

| Field | Required | Description |
| --- | --- | --- |
| when.packet | Yes | text, done, or interrupt. |
| send.frame | Yes | binary, json, or text. |
| send.body | Yes | Static value or DSL expression tree. |

Body validation depends on send.frame:

| send.frame | send.body must resolve to |
| --- | --- |
| binary | Bytes or string. |
| json | Valid JSON value. Byte arrays are not valid JSON bodies. |
| text | Value convertible to string. |
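
The "every matching rule is sent" behavior can be sketched as follows. This is an illustrative Python model, not the backend code: the function names are hypothetical, only $path is handled, and the way each frame type is serialized is an assumption based on the validation table above.

```python
import json


def resolve_path(scope, path):
    """Read a dot path from the request scope (config and packet)."""
    current = scope
    for segment in path.split("."):
        current = current[segment]  # a KeyError models "missing path is an error"
    return current


def render(node, scope):
    """Render a send.body tree, replacing {"$path": ...} nodes with scope values."""
    if isinstance(node, dict):
        if "$path" in node:
            return resolve_path(scope, node["$path"])
        return {key: render(value, scope) for key, value in node.items()}
    if isinstance(node, list):
        return [render(item, scope) for item in node]
    return node


def frames_for_packet(rules, scope):
    """Return one (frame_type, payload) per matching rule, in rule order."""
    frames = []
    for rule in rules:
        if rule["when"]["packet"] != scope["packet"]["kind"]:
            continue
        frame = rule["send"]["frame"]
        body = render(rule["send"]["body"], scope)
        if frame == "json":
            payload = json.dumps(body)  # raises TypeError for bytes bodies
        elif frame == "text":
            payload = str(body)
        elif frame == "binary":
            payload = body if isinstance(body, bytes) else str(body).encode("utf-8")
        else:
            raise ValueError(f"unsupported send.frame: {frame}")
        frames.append((frame, payload))
    return frames
```

Two rules with "packet": "text" would therefore produce two WebSocket messages for a single text packet, in the order the rules are listed.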

TTS request scope

The request scope is the data available to $path in request rules.
{
  "config": {
    "voice": {
      "id": "voice_123"
    },
    "model": "model-a",
    "language": "en-US",
    "audio": {
      "encoding": "LINEAR16",
      "sample_rate": 16000
    }
  },
  "packet": {
    "kind": "text",
    "message_id": "msg_123",
    "text": "Hello world"
  }
}
For done and interrupt, packet.text is present but may be empty.

Response rule shape

speak.ws.response_rules is an ordered JSON array. The first matching rule is evaluated; later rules are skipped for that frame.

| Field | Required | Description |
| --- | --- | --- |
| when.frame | Yes | binary or json. |
| when.path | No | Dot path inside a JSON frame. Must be paired with when.equals. |
| when.equals | No | Primitive value compared against when.path. |
| emit | Yes | Object containing supported TTS emit keys. |

Match behavior:

| Frame | Match behavior |
| --- | --- |
| binary | Matches by frame type only. when.path and when.equals are not allowed. |
| json | If when.path and when.equals are omitted, matches any JSON frame. If provided, both fields are required and compared exactly. |
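
First-match evaluation can be modelled with a short Python sketch. The names lookup and first_matching_rule are hypothetical, and interpreting "compared exactly" as same type plus equal value is an assumption.

```python
def lookup(frame, path):
    """Return (found, value) for a dot path inside a parsed JSON frame."""
    current = frame
    for segment in path.split("."):
        if isinstance(current, dict) and segment in current:
            current = current[segment]
        elif (isinstance(current, list) and segment.isdecimal()
              and int(segment) < len(current)):
            current = current[int(segment)]
        else:
            return False, None
    return True, current


def first_matching_rule(rules, frame_kind, frame_value=None):
    """Return the first rule whose when clause matches, or None to ignore the frame."""
    for rule in rules:
        when = rule["when"]
        if when["frame"] != frame_kind:
            continue
        if frame_kind == "binary" or "path" not in when:
            return rule  # matches by frame type only
        found, actual = lookup(frame_value, when["path"])
        expected = when["equals"]
        # Assumption: exact comparison means same type and same value.
        if found and type(actual) is type(expected) and actual == expected:
            return rule
    return None
```

Because only the first match is evaluated, a catch-all JSON rule (one without when.path) placed at the top of the array would shadow every more specific JSON rule after it.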

TTS emit keys

| Emit key | Type after evaluation | Effect |
| --- | --- | --- |
| audio | bytes or string | Emits a TTS audio chunk. |
| message_id | string | Associates audio, error, or done with a message. Falls back to the current context ID when omitted. |
| done | boolean | Ends synthesis for the message, closes the connection, and emits a TTS end packet. |
| error | string | Emits a TTS error. |
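
Once a rule has matched, its emit object is evaluated against the frame. The Python sketch below shows one way that evaluation could work for $frame, $path, and $decode. The function names are hypothetical and error handling is simplified to plain exceptions.

```python
import base64


def evaluate(expr, frame_kind, frame_value):
    """Evaluate one emit expression against the current response frame."""
    if not isinstance(expr, dict):
        return expr  # static literal, for example true for done
    if "$frame" in expr:
        if expr["$frame"] != "binary" or frame_kind != "binary":
            raise ValueError("$frame supports only binary response frames")
        return frame_value
    if "$path" in expr:
        if frame_kind != "json":
            raise ValueError("$path requires a JSON response frame")
        current = frame_value
        for segment in expr["$path"].split("."):
            # Lists are indexed numerically, objects by key; a miss raises.
            current = current[int(segment)] if isinstance(current, list) else current[segment]
        return current
    if "$decode" in expr:
        if expr["$decode"] != "base64":
            raise ValueError("$decode supports only base64")
        return base64.b64decode(evaluate(expr["value"], frame_kind, frame_value))
    raise ValueError(f"expression not handled in this sketch: {expr!r}")


def evaluate_emit(emit, frame_kind, frame_value):
    """Return the evaluated emit keys (audio, message_id, done, error)."""
    return {key: evaluate(expr, frame_kind, frame_value) for key, expr in emit.items()}
```

In this model, a JSON chunk frame yields decoded audio bytes plus a message_id, while a binary frame yields the raw payload as audio.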

Binary audio response

Use this for providers that stream raw audio as binary WebSocket frames.
[
  {
    "when": { "frame": "binary" },
    "emit": {
      "audio": { "$frame": "binary" }
    }
  }
]

JSON base64 audio response

Use $decode when the provider returns base64-encoded audio inside JSON.
[
  {
    "when": { "frame": "json", "path": "type", "equals": "chunk" },
    "emit": {
      "audio": {
        "$decode": "base64",
        "value": { "$path": "audio" }
      },
      "message_id": { "$path": "request_id" }
    }
  },
  {
    "when": { "frame": "json", "path": "type", "equals": "done" },
    "emit": {
      "message_id": { "$path": "request_id" },
      "done": true
    }
  },
  {
    "when": { "frame": "json", "path": "type", "equals": "error" },
    "emit": {
      "message_id": { "$path": "request_id" },
      "error": { "$path": "error.message" },
      "done": true
    }
  }
]

Text, done, and interrupt recipe

Use this pattern when the provider expects text payloads, an explicit final message, and an explicit cancel message.
[
  {
    "when": { "packet": "text" },
    "send": {
      "frame": "json",
      "body": {
        "type": "speak",
        "text": { "$path": "packet.text" },
        "voice": { "$path": "config.voice.id" },
        "request_id": { "$path": "packet.message_id" },
        "audio": {
          "encoding": { "$path": "config.audio.encoding" },
          "sample_rate": {
            "$cast": "number",
            "value": { "$path": "config.audio.sample_rate" }
          }
        }
      }
    }
  },
  {
    "when": { "packet": "done" },
    "send": {
      "frame": "json",
      "body": {
        "type": "done",
        "request_id": { "$path": "packet.message_id" }
      }
    }
  },
  {
    "when": { "packet": "interrupt" },
    "send": {
      "frame": "json",
      "body": {
        "type": "interrupt",
        "request_id": { "$path": "packet.message_id" }
      }
    }
  }
]

Runtime behavior

  • The connection URL is built from baseUrl and speak.ws.query_params.
  • Headers are copied from the credential and are not templated.
  • The transformer opens a connection per active message/context. A new context closes the previous connection.
  • text packets open the WebSocket connection if needed.
  • done and interrupt request rules are optional. If no rule exists for that packet, nothing is sent.
  • On interruption, Rapida sends the optional interrupt rule first, then closes the connection.
  • Audio returned by the provider is interpreted as speak.audio.encoding and speak.audio.sample_rate, then resampled to Rapida’s internal audio format when needed.
  • If no response rule matches an inbound frame, the frame is ignored.
  • If a response emits error, Rapida emits a TTS error packet.
  • If a response emits done, Rapida closes the connection and emits a TTS end packet.

Current TTS limits

  • No regex, contains, starts-with, greater-than, or compound match conditions.
  • No string interpolation or concatenation.
  • No fallback values inside expressions.
  • No dynamic headers or dynamic WebSocket path segments.
  • No text response handling for TTS.
  • No $frame: "json" selector in emit rules.
  • $decode supports only base64.

Backend mapping

assistant-api resolves custom-tts in api/assistant-api/internal/transformer/transformer.go, then dispatches to the WebSocket v1 implementation in api/assistant-api/internal/transformer/custom/tts_websocket_v1.

The WebSocket v1 transformer validates:
  • baseUrl is present in the credential.
  • speak.voice.id is present.
  • speak.audio.encoding is not empty.
  • speak.audio.sample_rate is positive.
  • speak.ws.request_rules contains at least one text packet rule.
  • speak.ws.response_rules contains at least one rule.
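
The checks above can be mirrored in a pre-flight script before saving a configuration. The Python sketch below is an approximation under stated assumptions: the function name validate is hypothetical, the rule arrays are assumed to be already parsed from JSON, and the error messages are illustrative rather than the backend's actual wording.

```python
def validate(credential, options):
    """Return a list of problems; an empty list means the checks pass."""
    errors = []
    # The credential may use baseUrl or the snake case key base_url.
    if not (credential.get("baseUrl") or credential.get("base_url")):
        errors.append("baseUrl is required in the credential")
    if not options.get("speak.voice.id"):
        errors.append("speak.voice.id is required")
    if not options.get("speak.audio.encoding"):
        errors.append("speak.audio.encoding must not be empty")
    try:
        sample_rate = int(options.get("speak.audio.sample_rate", 0))
    except (TypeError, ValueError):
        sample_rate = 0
    if sample_rate <= 0:
        errors.append("speak.audio.sample_rate must be positive")
    request_rules = options.get("speak.ws.request_rules") or []
    if not any(rule.get("when", {}).get("packet") == "text" for rule in request_rules):
        errors.append("speak.ws.request_rules needs at least one text packet rule")
    if not options.get("speak.ws.response_rules"):
        errors.append("speak.ws.response_rules needs at least one rule")
    return errors
```

Running such a check locally is only a convenience; the transformer's own validation remains the source of truth.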
