Custom STT lets you connect Rapida to any WebSocket-based transcription service without writing a new Go transformer. The UI stores provider credentials and assistant options; assistant-api reads them through the custom-stt transformer and maps audio packets and provider responses with WebSocket DSL rules.

Provider identifier: custom-stt

Where it is configured

1. Create the provider credential

In the Rapida dashboard, open Integrations > Models, choose Custom STT, and create a credential.

| Credential field | Required | Value |
| --- | --- | --- |
| apiCompatibility | Yes | websocket_v1 |
| baseUrl | Yes | WebSocket URL for your STT service, for example wss://stt.example.com/v1/listen |
| headers | No | Header map sent on the WebSocket handshake, for example {"Authorization":"Bearer sk_..."} |

The backend also accepts snake_case keys: api_compatibility and base_url.

2. Select Custom STT on the assistant

Open the assistant voice settings and select Custom STT as the speech-to-text provider.

3. Fill the STT arguments

Set the model, language, audio format, query parameters, request rules, and response rules. The argument reference below maps one-to-one to the UI fields.

STT arguments

The UI stores these keys with the "microphone." metadata prefix, but the transformer reads the option keys shown below.
| Option key | Required | Default | Description |
| --- | --- | --- | --- |
| listen.model | No | Empty | Provider model identifier. Reference it in DSL as model or config.model. |
| listen.language | No | Empty | Provider language code. Reference it in DSL as language or config.language. |
| listen.audio.encoding | Yes | LINEAR16 | Audio encoding sent to the provider. Supported values: LINEAR16, MuLaw8. |
| listen.audio.sample_rate | Yes | 16000 | Audio sample rate sent to the provider. Supported UI values: 8000, 16000, 22050, 24000, 32000, 44100, 48000. |
| listen.ws.query_params | No | {} | Flat JSON object appended to baseUrl as query parameters. Values can use DSL expressions. |
| listen.ws.request_rules | Yes | See example below | Ordered JSON array that tells Rapida what to send for each outbound packet. Must contain at least one audio rule. |
| listen.ws.response_rules | Yes | None | Ordered JSON array that tells Rapida how to parse provider WebSocket frames into transcripts or errors. |

Query parameters

Use listen.ws.query_params when your provider expects configuration in the WebSocket URL. Supported variables:
| Variable | Source |
| --- | --- |
| model | listen.model |
| language | listen.language |
| encoding | listen.audio.encoding |
| sample_rate | listen.audio.sample_rate |
{
  "language": { "$var": "language" },
  "model": { "$var": "model" },
  "encoding": { "$var": "encoding" },
  "sample_rate": {
    "$cast": "number",
    "value": { "$var": "sample_rate" }
  }
}

Request rules

Request rules are evaluated for normalized packets produced by Rapida.
| Packet | When it is sent | Available paths |
| --- | --- | --- |
| turn_change | A new turn/context starts | packet.kind, packet.context_id, config.model, config.language, config.audio.encoding, config.audio.sample_rate |
| audio | Audio is ready to stream | packet.kind, packet.context_id, packet.audio.bytes, packet.audio.base64, config.* |
| interrupt | User interruption is detected | packet.kind, packet.context_id, config.* |
Supported outbound frame types are binary, json, and text.
packet.audio.bytes is raw audio bytes for binary frames. Use packet.audio.base64 when the provider expects audio inside a JSON payload.

Binary audio stream

[
  {
    "when": { "packet": "audio" },
    "send": {
      "frame": "binary",
      "body": { "$path": "packet.audio.bytes" }
    }
  }
]

JSON audio payload

[
  {
    "when": { "packet": "audio" },
    "send": {
      "frame": "json",
      "body": {
        "audio": { "$path": "packet.audio.base64" },
        "encoding": { "$path": "config.audio.encoding" },
        "sample_rate": {
          "$cast": "number",
          "value": { "$path": "config.audio.sample_rate" }
        }
      }
    }
  }
]

Response rules

Response rules parse provider WebSocket frames into Rapida transcript packets. Supported inbound frame types:
| Frame | Use when |
| --- | --- |
| json | The provider returns JSON messages. |
| text | The provider returns plain transcript text. |
Supported emit keys:
| Emit key | Type | Description |
| --- | --- | --- |
| script | string | Transcript text. When present, Rapida emits an STT transcript packet. |
| confidence | number | Optional transcript confidence. |
| language | string | Optional language code for the transcript. |
| interim | boolean | true for partial transcripts, false for final transcripts. |
| error | Any | Provider error value. When present, Rapida treats the frame as an error. |
[
  {
    "when": { "frame": "json", "path": "type", "equals": "partial" },
    "emit": {
      "script": { "$path": "text" },
      "confidence": {
        "$cast": "number",
        "value": { "$path": "confidence" }
      },
      "language": { "$path": "language" },
      "interim": true
    }
  },
  {
    "when": { "frame": "json", "path": "type", "equals": "final" },
    "emit": {
      "script": { "$path": "text" },
      "confidence": {
        "$cast": "number",
        "value": { "$path": "confidence" }
      },
      "language": { "$path": "language" },
      "interim": false
    }
  },
  {
    "when": { "frame": "json", "path": "type", "equals": "error" },
    "emit": {
      "error": { "$path": "error.message" }
    }
  }
]

STT DSL design

Custom STT uses a JSON-template DSL with three sections: query parameters, request rules, and response rules. The DSL is intentionally small. It does not run scripts, call functions, concatenate strings, perform regex matching, or read environment variables.

STT frame support

| Frame type | Outbound request | Inbound response | Notes |
| --- | --- | --- | --- |
| binary | Yes | No | Use for raw provider audio input. |
| json | Yes | Yes | Use for provider control packets and structured transcripts. |
| text | Yes | Yes | Use for providers that accept or return plain text frames. |
Inbound parsing rules:
  • WebSocket message type 2 is treated as binary, but STT response rules do not support binary response frames.
  • Non-binary messages are parsed as JSON when they contain exactly one valid JSON value.
  • Non-JSON messages are treated as text.
  • For STT, JSON string primitives and non-object JSON values are treated as text; JSON response rules operate on JSON objects.
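
The parsing rules above can be sketched in Python. This is an illustration only, not Rapida's Go implementation: classify_frame and its return shape are hypothetical, and the sketch assumes that non-object JSON values are handed on as the raw frame text.

```python
import json

BINARY_MESSAGE = 2  # WebSocket message type that the rules above treat as binary


def classify_frame(message_type, payload):
    """Return (frame_kind, value) for one inbound WebSocket message.

    Hypothetical helper: the name and return shape are assumptions.
    """
    if message_type == BINARY_MESSAGE:
        # Classified as binary; STT response rules cannot match it.
        return ("binary", payload)
    text = payload.decode("utf-8") if isinstance(payload, bytes) else payload
    try:
        value = json.loads(text)  # fails unless the text is exactly one JSON value
    except json.JSONDecodeError:
        return ("text", text)
    if isinstance(value, dict):
        return ("json", value)
    # Assumption: strings, numbers, arrays, and null are passed on as raw text.
    return ("text", text)
```

Under these assumptions, a frame holding two concatenated JSON objects is classified as text, because it is not exactly one JSON value.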

STT operators

Every operator object must contain only the operator key and the fields that operator requires. For example, a $cast object contains $cast and value, and nothing else.
| Operator | Where supported | Description |
| --- | --- | --- |
| $var | Query parameters | Reads model, language, encoding, or sample_rate. |
| $path | Request rules, response rules | Reads a dot path from request scope or a JSON response frame. |
| $cast | Query parameters, request rules, response rules | Casts to string, number, or boolean. |
| $frame | Response rules | Reads the full current text response frame. |
Unsupported STT operators:
  • $decode is not supported for STT.
  • $frame: "binary" and $frame: "json" are not supported for STT emit rules.

Cast behavior

| Cast | Behavior |
| --- | --- |
| string | Converts strings, bytes, numbers, booleans, and null to string form. |
| number | Converts JSON numbers, numeric values, or numeric strings to an integer or float. |
| boolean | Converts booleans, boolean strings, and numeric values. JSON numbers are accepted as 0 or 1; typed numeric values use zero as false and non-zero as true. |
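
As a rough illustration of the number and boolean casts, here is a minimal Python sketch. The function names are hypothetical, and several details are assumptions rather than documented behavior: the accepted boolean strings are taken to be "true" and "false", numeric input follows the typed-value reading (zero is false, non-zero is true), and the string cast is omitted because the string form of null is not specified here.

```python
def cast_number(value):
    """Cast to int or float, in the spirit of {"$cast": "number"}."""
    if isinstance(value, bool):
        raise TypeError("booleans are not treated as numbers in this sketch")
    if isinstance(value, (int, float)):
        return value
    if isinstance(value, str):
        text = value.strip()
        try:
            return int(text)
        except ValueError:
            return float(text)  # raises ValueError for non-numeric strings
    raise TypeError(f"cannot cast {type(value).__name__} to number")


def cast_boolean(value):
    """Cast to bool, in the spirit of {"$cast": "boolean"}."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return value != 0  # typed-value reading: zero is false, non-zero is true
    if isinstance(value, str) and value.strip().lower() in ("true", "false"):
        # Assumption: only "true" and "false" count as boolean strings.
        return value.strip().lower() == "true"
    raise ValueError(f"cannot cast {value!r} to boolean")
```

This is why the examples wrap sample_rate in a number cast: a value stored as the string "16000" reaches the provider as the number 16000.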

JSON path behavior

$path uses dot-separated paths.
{ "$path": "packet.audio.base64" }
Objects are traversed by key. Arrays are traversed by numeric index.
{ "$path": "results.0.transcript" }
Limits:
  • Keys containing a literal dot are not addressable.
  • Request rules can only read from config and packet.
  • Response rules can use $path only with JSON response frames.
  • A missing path in when.path means the rule does not match.
  • A missing path in emit or send.body is an error.
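
A minimal Python sketch of dot-path traversal, assuming the rules listed above. resolve_path and MissingPath are hypothetical names; the caller decides whether a missing path means "no match" (when.path) or an error (emit and send.body).

```python
class MissingPath(KeyError):
    """Raised when a dot path cannot be resolved (hypothetical error type)."""


def resolve_path(scope, path):
    """Walk a dot-separated path through nested objects and arrays."""
    current = scope
    for segment in path.split("."):
        if isinstance(current, dict) and segment in current:
            current = current[segment]  # objects are traversed by key
        elif (
            isinstance(current, list)
            and segment.isascii()
            and segment.isdigit()
            and int(segment) < len(current)
        ):
            current = current[int(segment)]  # arrays are traversed by index
        else:
            raise MissingPath(path)
    return current
```

Because the path is split on every dot, a key such as "a.b" is looked up as "a" followed by "b", which is why keys containing a literal dot are not addressable.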

Query parameter rules

listen.ws.query_params must be a flat JSON object. Each value must resolve to a primitive value: string, number, boolean, or null. Nested objects and arrays are rejected unless the object is a DSL expression.
{
  "language": { "$var": "language" },
  "model": { "$var": "model" },
  "sample_rate": {
    "$cast": "number",
    "value": { "$var": "sample_rate" }
  }
}
The rendered query parameters are appended to baseUrl. Existing query parameters in baseUrl are preserved unless the same key is rendered by listen.ws.query_params.
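
The merge behavior described above can be sketched in Python with the standard library. This is an illustration under assumptions, not the backend code: build_url and render_query_value are hypothetical names, only the number cast is handled, null values are not handled, and the tier parameter in the usage note is made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def render_query_value(expr, variables):
    """Resolve a literal, a $var lookup, or a number $cast around an expression."""
    if isinstance(expr, list):
        raise ValueError("arrays are rejected in query parameters")
    if not isinstance(expr, dict):
        return expr  # plain primitive value
    if set(expr) == {"$var"}:
        return variables[expr["$var"]]
    if set(expr) == {"$cast", "value"} and expr["$cast"] == "number":
        text = str(render_query_value(expr["value"], variables))
        return float(text) if "." in text else int(text)
    raise ValueError("nested objects are rejected unless they are DSL expressions")


def build_url(base_url, query_params, variables):
    """Append rendered parameters to base_url; rendered keys override existing ones."""
    parts = urlsplit(base_url)
    merged = dict(parse_qsl(parts.query, keep_blank_values=True))
    for key, expr in query_params.items():
        merged[key] = render_query_value(expr, variables)
    return urlunsplit(parts._replace(query=urlencode(merged)))
```

For a baseUrl of wss://stt.example.com/v1/listen?language=fr&tier=pro, rendering language and sample_rate keeps tier=pro, replaces language=fr with the rendered language, and adds sample_rate.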

Request rule shape

listen.ws.request_rules is an ordered JSON array. Every matching rule is sent, so one packet can produce multiple WebSocket messages.
| Field | Required | Description |
| --- | --- | --- |
| when.packet | Yes | turn_change, audio, or interrupt. |
| send.frame | Yes | binary, json, or text. |
| send.body | Yes | Static value or DSL expression tree. |
Body validation depends on send.frame:
| send.frame | send.body must resolve to |
| --- | --- |
| binary | Bytes or string. |
| json | Valid JSON value. Byte arrays are not valid JSON bodies. |
| text | Value convertible to string. |
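
To make the "every matching rule is sent" behavior concrete, here is a minimal Python sketch of request rule evaluation. It is an assumption-laden illustration, not the transformer's code: outbound_frames, render, and lookup are hypothetical names, array paths are omitted, and the number cast handles integers only.

```python
import json


def lookup(scope, path):
    """Follow a dot path through nested dicts (array indexes omitted for brevity)."""
    for key in path.split("."):
        scope = scope[key]  # KeyError: a missing path in send.body is an error
    return scope


def render(node, scope):
    """Evaluate a send.body tree: $path, number $cast, or static values."""
    if isinstance(node, dict):
        if set(node) == {"$path"}:
            return lookup(scope, node["$path"])
        if set(node) == {"$cast", "value"} and node["$cast"] == "number":
            return int(render(node["value"], scope))  # integers only, for brevity
        return {key: render(value, scope) for key, value in node.items()}
    if isinstance(node, list):
        return [render(item, scope) for item in node]
    return node


def outbound_frames(rules, scope):
    """Return one (frame, payload) pair per matching rule, in rule order."""
    frames = []
    for rule in rules:
        if rule["when"]["packet"] != scope["packet"]["kind"]:
            continue
        frame = rule["send"]["frame"]
        body = render(rule["send"]["body"], scope)
        if frame == "binary" and not isinstance(body, (bytes, str)):
            raise TypeError("binary bodies must resolve to bytes or a string")
        if frame == "json":
            body = json.dumps(body)  # raises TypeError if the body contains bytes
        elif frame == "text":
            body = str(body)
        frames.append((frame, body))
    return frames
```

With two audio rules in the array, a single audio packet yields two outbound messages in rule order; rules for other packet kinds are skipped.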

STT request scope

The request scope is the data available to $path in request rules.
{
  "config": {
    "model": "model-a",
    "language": "en-US",
    "audio": {
      "encoding": "LINEAR16",
      "sample_rate": 16000
    }
  },
  "packet": {
    "kind": "audio",
    "context_id": "ctx_123",
    "audio": {
      "bytes": "<raw bytes>",
      "base64": "AAE="
    }
  }
}
packet.audio exists only for audio packets.
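
As a sketch of how such a scope could be assembled, the Python below builds the structure shown above. build_request_scope is a hypothetical helper, not part of Rapida; it only illustrates that packet.audio.base64 is the Base64 encoding of packet.audio.bytes and that packet.audio is absent for non-audio packets.

```python
import base64


def build_request_scope(config, kind, context_id, audio=None):
    """Assemble the data that $path can read in request rules (hypothetical helper)."""
    packet = {"kind": kind, "context_id": context_id}
    if kind == "audio":
        packet["audio"] = {
            "bytes": audio,  # raw bytes, for binary frames
            "base64": base64.b64encode(audio).decode("ascii"),  # for JSON payloads
        }
    return {"config": config, "packet": packet}
```

For the two raw bytes 0x00 0x01, the Base64 form is "AAE=", matching the value shown in the scope example.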

Response rule shape

listen.ws.response_rules is an ordered JSON array. The first matching rule is evaluated; later rules are skipped for that frame.
| Field | Required | Description |
| --- | --- | --- |
| when.frame | Yes | json or text. |
| when.path | No | Dot path inside a JSON frame. Must be paired with when.equals. |
| when.equals | No | Primitive value compared against when.path, or against the full text frame. |
| emit | Yes | Object containing supported STT emit keys. |
Match behavior:
| Frame | Match behavior |
| --- | --- |
| json | If when.path and when.equals are both omitted, matches any JSON object. If either is provided, both are required, and the value at when.path is compared exactly with when.equals. |
| text | If when.equals is omitted, matches any text frame. If provided, compares the full text frame exactly. when.path is not allowed. |
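
The first-match behavior can be sketched in Python as follows. The helper names are hypothetical, rules are assumed to have passed validation already, array paths are omitted, and "compared exactly" is interpreted here as requiring the same type and value, which is an assumption.

```python
def lookup(frame, path):
    """Follow a dot path through nested dicts; raise KeyError when it is missing."""
    current = frame
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            raise KeyError(path)
        current = current[key]
    return current


def rule_matches(when, frame_kind, frame):
    """Decide whether one rule's when clause matches the inbound frame."""
    if when["frame"] != frame_kind:
        return False
    if frame_kind == "text":
        return "equals" not in when or when["equals"] == frame
    if "path" not in when and "equals" not in when:
        return True  # matches any JSON object
    try:
        actual = lookup(frame, when["path"])
    except KeyError:
        return False  # a missing when.path means the rule does not match
    expected = when["equals"]
    # Assumption: exact comparison means same type and same value.
    return type(actual) is type(expected) and actual == expected


def first_match(rules, frame_kind, frame):
    """Return the first matching rule, or None so the frame is ignored."""
    for rule in rules:
        if rule_matches(rule["when"], frame_kind, frame):
            return rule
    return None
```

Because evaluation stops at the first match, put specific rules before a catch-all rule such as {"frame": "json"} with no path.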

STT emit keys

| Emit key | Type after evaluation | Effect |
| --- | --- | --- |
| script | string | Transcript text. Empty transcripts are ignored. |
| confidence | number | Transcript confidence. Defaults to 0 when omitted. |
| language | string | Transcript language. Falls back to listen.language when omitted. |
| interim | boolean | true emits an interim transcript; false emits a completed transcript. |
| error | string | Emits an STT error instead of a transcript. |
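
The defaults in this table can be illustrated with a short Python sketch. to_stt_packet and the packet dictionaries are hypothetical shapes chosen for the example, and treating an omitted interim as false is an assumption, since the table does not state a default for it.

```python
def to_stt_packet(emitted, default_language):
    """Turn evaluated emit values into an illustrative packet dict, or None."""
    if emitted.get("error") is not None:
        # An error takes the place of a transcript.
        return {"kind": "error", "message": str(emitted["error"])}
    script = emitted.get("script", "")
    if not script:
        return None  # empty transcripts are ignored
    return {
        "kind": "transcript",
        "script": script,
        "confidence": emitted.get("confidence", 0),  # defaults to 0
        "language": emitted.get("language", default_language),  # listen.language
        "interim": emitted.get("interim", False),  # assumption: final when omitted
    }
```

Here default_language stands for the configured listen.language value.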

Plain text transcript response

Use this for providers that return transcript chunks as raw text frames.
[
  {
    "when": { "frame": "text" },
    "emit": {
      "script": { "$frame": "text" },
      "interim": false
    }
  }
]

Nested JSON transcript response

[
  {
    "when": { "frame": "json", "path": "result.final", "equals": false },
    "emit": {
      "script": { "$path": "result.transcript" },
      "interim": true
    }
  },
  {
    "when": { "frame": "json", "path": "result.final", "equals": true },
    "emit": {
      "script": { "$path": "result.transcript" },
      "confidence": {
        "$cast": "number",
        "value": { "$path": "result.confidence" }
      },
      "language": { "$path": "result.language" },
      "interim": false
    }
  }
]

Start, audio, and interrupt recipe

Use this pattern when the provider expects a session-start message, binary audio frames, and a flush message on interruption.
[
  {
    "when": { "packet": "turn_change" },
    "send": {
      "frame": "json",
      "body": {
        "type": "start",
        "language": { "$path": "config.language" },
        "sample_rate": {
          "$cast": "number",
          "value": { "$path": "config.audio.sample_rate" }
        }
      }
    }
  },
  {
    "when": { "packet": "audio" },
    "send": {
      "frame": "binary",
      "body": { "$path": "packet.audio.bytes" }
    }
  },
  {
    "when": { "packet": "interrupt" },
    "send": {
      "frame": "json",
      "body": { "type": "flush" }
    }
  }
]

Runtime behavior

  • The connection URL is built from baseUrl and listen.ws.query_params.
  • Headers are copied from the credential and are not templated.
  • Audio is resampled from Rapida’s internal audio format to listen.audio.encoding and listen.audio.sample_rate before request rules are evaluated.
  • turn_change and audio packets open the WebSocket connection if needed.
  • interrupt rules are sent only when a connection is already active.
  • If no response rule matches an inbound frame, the frame is ignored.
  • If a response emits error, Rapida emits an STT error packet.
  • If a response emits non-empty script, Rapida emits a transcript packet and conversation event.

Current STT limits

  • No regex, contains, starts-with, greater-than, or compound match conditions.
  • No string interpolation or concatenation.
  • No fallback values inside expressions.
  • No dynamic headers or dynamic WebSocket path segments.
  • No $decode.
  • No binary response handling for STT.
  • No $frame: "json" selector in emit rules.

Backend mapping

assistant-api resolves custom-stt in api/assistant-api/internal/transformer/transformer.go, then dispatches to the WebSocket v1 implementation in api/assistant-api/internal/transformer/custom/stt_websocket_v1. The WebSocket v1 transformer validates:
  • baseUrl is present in the credential.
  • listen.audio.encoding is not empty.
  • listen.audio.sample_rate is positive.
  • listen.ws.request_rules contains at least one audio packet rule.
  • listen.ws.response_rules contains at least one rule.
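
If you generate configurations programmatically, you can run the same checks before saving. The Python sketch below mirrors the list above; validate_custom_stt is a hypothetical helper, and it assumes the rule options have already been parsed from JSON into lists.

```python
def validate_custom_stt(credential, options):
    """Return a list of problems; an empty list means the checks above pass."""
    problems = []
    # The backend also accepts the snake_case key base_url.
    if not (credential.get("baseUrl") or credential.get("base_url")):
        problems.append("baseUrl is required")
    if not options.get("listen.audio.encoding"):
        problems.append("listen.audio.encoding must not be empty")
    if int(options.get("listen.audio.sample_rate") or 0) <= 0:
        problems.append("listen.audio.sample_rate must be positive")
    request_rules = options.get("listen.ws.request_rules") or []
    if not any(rule.get("when", {}).get("packet") == "audio" for rule in request_rules):
        problems.append("listen.ws.request_rules needs at least one audio rule")
    if not (options.get("listen.ws.response_rules") or []):
        problems.append("listen.ws.response_rules needs at least one rule")
    return problems
```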
