# swaram.live

> Real-time Malayalam voice API. Send a user's voice over one WebSocket and hear natural Malayalam stream back. Two opaque modes (Simple, Premium); OpenAI-Realtime-compatible event protocol; bring your own instructions and tools.

## Docs

- [Introduction](https://swaram.live/docs/index.md): What swaram is, the base URL, and the two modes.
- [Quickstart](https://swaram.live/docs/quickstart.md): Connect, configure a session, stream audio, play the reply (Python/Node/browser).
- [Authentication](https://swaram.live/docs/authentication.md): Server-side secret keys and short-lived browser tokens.
- [Sessions & context](https://swaram.live/docs/sessions.md): Configure instructions, voice, and tools at the start of a session.
- [Function calling](https://swaram.live/docs/tools.md): Give swaram tools to act on; the call/return round-trip (Python/Node/browser).
- [Audio](https://swaram.live/docs/audio.md): PCM16 24 kHz mono base64 audio in and out; chunking, playback, and echo cancellation.
- [Transcripts](https://swaram.live/docs/transcripts.md): Read the text of what the user said and what swaram said back.
- [Turn-taking & barge-in](https://swaram.live/docs/turn-taking.md): Automatic turns, interrupting a reply, and sending only the user's voice (client VAD / push-to-talk).
- [Events reference](https://swaram.live/docs/events.md): Every client and server event, plus close codes.
- [Errors](https://swaram.live/docs/errors.md): Error events, close codes, and reconnection.
- [FAQ](https://swaram.live/docs/faq.md): Common questions about modes, languages, billing, and data.

---

# Introduction

swaram is a real-time **Malayalam voice API**. You send a user's voice over one connection, and hear natural Malayalam stream back — speech in, Malayalam out. You bring the **context** — your instructions, your tools, and your data — and swaram is the voice.
It follows the **OpenAI Realtime** event protocol, so if you've built with a real-time voice model before, this will feel familiar.

## Base URL

```
wss://api.swaram.live/v1/realtime
```

Sign up and manage your API keys at [app.swaram.live](https://app.swaram.live).

## Two modes

You pick a mode with the `model` setting when you connect. Everything else — events, tools, voices — is **identical** between them.

| Mode | `model` | Best for |
|---|---|---|
| Simple | `mal-realtime-simple` | Natural Malayalam voice at low cost. |
| Premium | `mal-realtime-premium` | Lower latency and a more expressive voice. |

Switching modes is just a different `model` value; your code stays the same.

## What you'll do

1. [Create an API key](authentication.html) on the dashboard.
2. [Connect](quickstart.html) and configure your session — instructions, voice, tools.
3. Stream the user's voice and play the Malayalam audio you get back.

## Already using a Realtime client?

swaram speaks the OpenAI Realtime event subset, so most existing real-time voice clients work by changing three things: the **URL** (`wss://api.swaram.live/v1/realtime`), the **API key**, and the **model** (`mal-realtime-simple`). See the [Quickstart](quickstart.html) for plain-WebSocket examples in Python, Node, and the browser.

## For AI agents

The full documentation is available as plain Markdown for tooling and agents: every page is served at `/docs/.md`, and an index lives at [`/llms.txt`](/llms.txt) (with the whole set concatenated at [`/llms-full.txt`](/llms-full.txt)).

---

# Quickstart

Connect, configure a session, stream the user's voice, and play the Malayalam audio that streams back. Here's the whole loop.

## 1. Get an API key

Create an account on [app.swaram.live](https://app.swaram.live) and create a key. It looks like `swaram_…` and is shown once — keep it on your server.

## 2. Connect and talk

Open a WebSocket, send your settings as the first message, then stream audio.
Audio is **16-bit PCM, 24 kHz, mono, base64** in both directions.

```python
import asyncio, base64, json, websockets

API_KEY = "swaram_your_key_here"
URL = "wss://api.swaram.live/v1/realtime?model=mal-realtime-simple"

async def main():
    headers = {"Authorization": f"Bearer {API_KEY}"}
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # 1) configure the session up front (before streaming audio)
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "You are a friendly Malayalam assistant.",
                "voice": "mal-female",
            },
        }))

        # 2) stream the user's microphone as base64 PCM16 @ 24 kHz, in chunks
        #    (pcm16_chunk is raw bytes from your own capture code, not shown here)
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_chunk).decode(),
        }))

        # 3) read events; play the audio you get back
        #    (play() is your own audio output function, not shown here)
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "response.output_audio.delta":
                play(base64.b64decode(event["delta"]))  # PCM16 @ 24 kHz
            elif event["type"] == "response.output_audio_transcript.delta":
                print(event["delta"], end="", flush=True)

asyncio.run(main())
```

```node
import WebSocket from "ws";

const API_KEY = "swaram_your_key_here";
const URL = "wss://api.swaram.live/v1/realtime?model=mal-realtime-simple";

const ws = new WebSocket(URL, {
  headers: { Authorization: `Bearer ${API_KEY}` }
});

ws.on("open", () => {
  // 1) configure the session up front
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      instructions: "You are a friendly Malayalam assistant.",
      voice: "mal-female",
    },
  }));

  // 2) stream the user's microphone as base64 PCM16 @ 24 kHz, in chunks
  //    (pcm16Chunk is a Buffer from your own capture code, not shown here)
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: pcm16Chunk.toString("base64"),
  }));
});

ws.on("message", (data) => {
  const event = JSON.parse(data);
  if (event.type === "response.output_audio.delta") {
    play(Buffer.from(event.delta, "base64")); // PCM16 @ 24 kHz
  } else if (event.type === "response.output_audio_transcript.delta") {
    process.stdout.write(event.delta);
  }
});
```
```browser
// Browsers can't set an Authorization header on a WebSocket, so your backend
// mints a short-lived token (see Authentication) and the page connects with it.
const token = await fetch("/your-backend/realtime-token").then((r) => r.text());

const ws = new WebSocket(
  "wss://api.swaram.live/v1/realtime?model=mal-realtime-simple",
  ["realtime", "openai-insecure-api-key." + token] // pass the token as a subprotocol
);

ws.onopen = () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      instructions: "You are a friendly Malayalam assistant.",
      voice: "mal-female",
    },
  }));
  // capture the mic (getUserMedia with echoCancellation: true so the model
  // doesn't hear itself), downsample to 24 kHz PCM16 with an AudioWorklet, and
  // send chunks as: { type: "input_audio_buffer.append", audio: }
};

ws.onmessage = (e) => {
  const event = JSON.parse(e.data);
  if (event.type === "response.output_audio.delta") {
    playPcm16(event.delta); // base64 → PCM16 @ 24 kHz
  }
};
```

That's the whole loop: **configure once, stream audio, play what comes back.**

## What happens on connect

The server replies with `session.created`, then `session.updated` after your config. As the user speaks it emits `input_audio_buffer.speech_started` / `speech_stopped`, then a `response.created`, a stream of `response.output_audio.delta` (the audio) and `response.output_audio_transcript.delta` (the words), and finally `response.done`. See the [Events reference](events.html) for the full list.

## Next steps

- [Authentication](authentication.html) — server-side keys and browser tokens.
- [Sessions & context](sessions.html) — instructions, voices, and the config contract.
- [Function calling](tools.html) — give swaram tools to act on.
- [Transcripts](transcripts.html) — read the text of what the user said and what swaram said back.

---

# Authentication

Every connection is tied to your account with an API key.
There are two ways to use it, depending on whether the code runs on **your server** or in an **untrusted client** (a browser or mobile app).

## Server-side: your secret key

From your own backend, send your secret key in the `Authorization` header when you open the connection:

```
Authorization: Bearer swaram_your_key_here
```

Your secret key looks like `swaram_…`, is shown once when you create it, and is stored only as a hash — keep it on your server. **Never put a `swaram_` key in a browser or app**, where a user could read it.

## Browser / mobile: a short-lived token

Untrusted clients shouldn't hold your secret key. Instead, your backend mints a short-lived **ephemeral token** (it looks like `swaram_ek_…`) and hands only that to the client. The token can carry locked settings — model, voice, instructions, tools — that the client can't change, and it expires after a minute or so.

**1. On your backend**, exchange your secret key for a token:

```python
import httpx

resp = httpx.post(
    "https://api.swaram.live/v1/realtime/client_secrets",
    headers={"Authorization": "Bearer swaram_your_key_here"},
    json={
        "model": "mal-realtime-simple",
        "session": {
            "instructions": "You are a friendly Malayalam assistant.",
            "voice": "mal-female",
        },
    },
)
token = resp.json()["value"]  # a swaram_ek_… token — send THIS to the client
```

**2. In the client**, connect with that token. Browsers can't set headers on a WebSocket, so pass it as a subprotocol:

```browser
const ws = new WebSocket(
  "wss://api.swaram.live/v1/realtime?model=mal-realtime-simple",
  ["realtime", "openai-insecure-api-key." + token]
);
```

A bad or expired token is rejected and the connection closes (code `4001`).

## How the key is read

The server looks for your credential in this order:

1. `Authorization: Bearer ` header — the server-side path.
2. `?api_key=` query parameter.
3. `Sec-WebSocket-Protocol: openai-insecure-api-key.` subprotocol — the browser path.
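
The lookup order above can be sketched as a small function. This only illustrates the documented precedence; it is not swaram's server code. The name `resolve_credential` and the shapes of its arguments (plain dicts and a list of offered subprotocols) are hypothetical.

```python
from typing import Mapping, Optional, Sequence

SUBPROTOCOL_PREFIX = "openai-insecure-api-key."

def resolve_credential(
    headers: Mapping[str, str],
    query: Mapping[str, str],
    subprotocols: Sequence[str],
) -> Optional[str]:
    """Return the first credential found, following the documented order."""
    # 1) Authorization: Bearer header (the server-side path)
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        value = auth[len("Bearer "):].strip()
        if value:
            return value
    # 2) ?api_key= query parameter
    if query.get("api_key"):
        return query["api_key"]
    # 3) openai-insecure-api-key.* subprotocol (the browser path)
    for proto in subprotocols:
        if proto.startswith(SUBPROTOCOL_PREFIX) and len(proto) > len(SUBPROTOCOL_PREFIX):
            return proto[len(SUBPROTOCOL_PREFIX):]
    return None  # nothing supplied; the real server closes with 4001
```

A practical consequence: if a client sends both a header and a subprotocol token, the header wins, so a stale header can mask a fresh token.
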
## Config at creation

Whichever path you use, send your configuration **at the start** — in the token, or as the first `session.update` right after you connect. It's locked once the conversation begins. See [Sessions & context](sessions.html) for the details.

---

# Sessions & context

A session is one connection. You configure it **once, at the start** — your instructions, voice, and tools — and that configuration applies to the whole call. This is identical in both modes.

## Configure at the start

Send a `session.update` as your first message right after connecting (or lock the settings into a [browser token](authentication.html)). The settings are **fixed once the conversation begins**, so set them before you start streaming audio.

```json
{
  "type": "session.update",
  "session": {
    "instructions": "You are a helpful Malayalam tutor. Keep replies short.",
    "voice": "mal-female",
    "tools": []
  }
}
```

The server echoes the effective configuration back as a `session.updated` event.

## Settings

| Setting | What it does |
|---|---|
| `instructions` | Your system prompt — the persona, the policy, the tone, and any context. |
| `voice` | The Malayalam voice: `mal-female` (default) or `mal-male`. |
| `tools` | Actions the model can call — see [Function calling](tools.html). |
| `tool_choice` | `auto` (default), `none`, or `required`. |
| `turn_detection` | Automatic turn-taking (`default`) — swaram replies when the user stops speaking. |

> **Replies are speakable Malayalam by default.** swaram always returns natural,
> conversational Malayalam written for the ear — numbers, units, and dates as words,
> no markdown or formatting. Your `instructions` layer **on top** of that: they set
> the persona, behaviour, and length (replies default to one or two sentences; ask
> for a longer or different style in your instructions).

## Adding your data

swaram is the voice — your data stays yours.
Put the information you want it to use into the `instructions`, or give it a [tool](tools.html) to look things up at the moment it's needed. Conversations aren't stored; you bring your context each session.

## Voices

Two Malayalam voices, the same in both modes:

| `voice` | |
|---|---|
| `mal-female` | Female voice (default). |
| `mal-male` | Male voice. |

## Why "at the start"?

Setting configuration once and locking it keeps both modes behaving identically and predictably for the whole call. If you need different instructions or tools, start a new session with the new configuration.

---

# Function calling

Give swaram **actions** to take — look up an order, check the weather, save a note. You list your tools when you configure the session; the model decides when one is needed and asks for it; your app runs it and sends the result back; then swaram speaks the answer in Malayalam. It works the same in both modes.

## 1. Declare your tools

Add a `tools` array to your opening `session.update`. Each tool is a function with a name, a description, and a JSON-Schema for its arguments.

```json
{
  "type": "session.update",
  "session": {
    "instructions": "You are a helpful shop assistant. Speak Malayalam.",
    "voice": "mal-female",
    "tools": [
      {
        "type": "function",
        "name": "get_order_status",
        "description": "Look up an order by its id.",
        "parameters": {
          "type": "object",
          "properties": { "order_id": { "type": "string" } },
          "required": ["order_id"]
        }
      }
    ],
    "tool_choice": "auto"
  }
}
```

## 2. Handle the call

When the model wants a tool, you receive a `response.function_call_arguments.done` event with the `name`, a `call_id`, and the `arguments` (a JSON string). Run the function, then send the result back as a `conversation.item.create` of type `function_call_output` — keyed by the same `call_id`. swaram continues and speaks the answer.
```python
async for raw in ws:
    event = json.loads(raw)
    if event["type"] == "response.function_call_arguments.done":
        args = json.loads(event["arguments"])
        result = lookup_order(args["order_id"])  # your code
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps(result),  # a JSON string
            },
        }))
```

```node
ws.on("message", (data) => {
  const event = JSON.parse(data);
  if (event.type === "response.function_call_arguments.done") {
    const args = JSON.parse(event.arguments);
    const result = lookupOrder(args.order_id); // your code
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify(result), // a JSON string
      },
    }));
  }
});
```

```browser
ws.onmessage = (e) => {
  const event = JSON.parse(e.data);
  if (event.type === "response.function_call_arguments.done") {
    const args = JSON.parse(event.arguments);
    const result = lookupOrder(args.order_id); // your code
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify(result), // a JSON string
      },
    }));
  }
};
```

## Choosing when tools run

`tool_choice` controls whether the model may call a tool:

| `tool_choice` | Behaviour |
|---|---|
| `auto` | The model decides (default). |
| `none` | Never call a tool — just speak. |
| `required` | The model must call a tool. |

## Tips

> **Keep tools fast.** A tool call adds a short pause before swaram replies.
> Keep the work quick, and confirm before anything important. You can set your
> instructions so swaram says a brief Malayalam filler while it waits.

- Send the `output` as a **JSON string**. If it isn't valid JSON it's wrapped as `{"result": ""}`.
- The `call_id` ties a result to its call — always echo back the one you received.
- Parallel calls are supported; answer each `call_id` you're given.
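
With more than one tool, or with parallel calls, it may help to route every call through a single dispatcher that always answers with the `call_id` it was given. The sketch below is one possible shape, not part of the swaram API: the `TOOLS` registry, `build_tool_output`, the canned `lookup_order`, and the `{"error": ...}` payload for failures are all assumptions made for illustration.

```python
import json
from typing import Any, Callable, Dict

def lookup_order(order_id: str) -> Dict[str, Any]:
    """Stand-in for your own lookup; returns canned data."""
    return {"order_id": order_id, "status": "shipped"}

# Hypothetical registry: tool name (as declared in session.update) -> handler
TOOLS: Dict[str, Callable[..., Any]] = {"get_order_status": lookup_order}

def build_tool_output(event: Dict[str, Any]) -> Dict[str, Any]:
    """Turn a response.function_call_arguments.done event into the reply event."""
    try:
        handler = TOOLS.get(event["name"])
        if handler is None:
            raise ValueError(f"unknown tool: {event['name']}")
        args = json.loads(event["arguments"] or "{}")  # arguments is a JSON string
        result = handler(**args)
    except Exception as exc:
        # Assumed convention: report the failure so the call is still answered
        result = {"error": str(exc)}
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": event["call_id"],  # always echo the id you received
            "output": json.dumps(result, ensure_ascii=False),  # a JSON string
        },
    }
```

In the receive loop this reduces to `await ws.send(json.dumps(build_tool_output(event)))` for each `response.function_call_arguments.done` event, which answers parallel calls one `call_id` at a time.
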

---

# Audio

Audio in both directions is **16-bit PCM, 24 kHz, mono, little-endian**, sent as **base64** inside JSON events.

## Sending the user's voice

Capture the microphone, resample to 24 kHz mono PCM16, and send it in small chunks as you go:

```json
{ "type": "input_audio_buffer.append", "audio": "" }
```

Send chunks continuously while the user speaks — there's no need to wait for them to finish. swaram detects when they stop and replies on its own (see [Turn-taking](turn-taking.html)).

## Clean capture (echo cancellation)

swaram judges only the audio you send it. If the model's own voice — playing out of the speaker — leaks back into the microphone, swaram hears it as the user and talks over itself or replies to its own words. Make sure you send **only the user's voice**:

- **Enable echo cancellation when you capture.** In the browser, set it on `getUserMedia`:

  ```browser
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      channelCount: 1,
      echoCancellation: true,
      noiseSuppression: true,
      autoGainControl: true
    },
  });
  ```

- **Headphones** remove the problem entirely — the speaker never reaches the mic.
- On a **phone call**, the device and network usually cancel echo for you.
- Where you can't fully cancel it on open speakers, use **push-to-talk** or your own voice detection so the mic is only open while the user is speaking — see [Turn-taking](turn-taking.html).

## Playing the reply

The reply streams back as a series of audio deltas. Decode the base64 and play the PCM16 pieces in order:

```json
{
  "type": "response.output_audio.delta",
  "response_id": "resp_…",
  "item_id": "item_…",
  "delta": ""
}
```

You also get the words as text, alongside the audio:

```json
{ "type": "response.output_audio_transcript.delta", "delta": "നമസ്കാരം" }
```

To capture **both** sides as text — what the user said *and* what swaram said back — see [Transcripts](transcripts.html).
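
The audio format fixes the arithmetic: 24,000 samples per second at 2 bytes each is 48,000 bytes per second, so 200 ms of audio is 9,600 bytes before base64. A minimal sketch of encoding and chunking follows; the helper names and the 200 ms chunk length are assumptions, not part of the API.

```python
import base64
import json
import struct
from typing import Iterator, List

SAMPLE_RATE = 24_000   # Hz, from the format described above
BYTES_PER_SAMPLE = 2   # 16-bit PCM, mono

def floats_to_pcm16(samples: List[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM."""
    ints = [max(-32768, min(32767, round(s * 32767))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)

def append_events(pcm: bytes, chunk_ms: int = 200) -> Iterator[str]:
    """Yield input_audio_buffer.append messages of about chunk_ms each."""
    # 48 bytes per millisecond, so chunks always end on a sample boundary
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

Each yielded string can be passed directly to `ws.send(...)`. Decoding a reply is the mirror image: `base64.b64decode(event["delta"])` gives PCM16 bytes in the same layout.
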
## Tips

- **Chunk size** doesn't have to be exact — a few hundred milliseconds of audio per `append` is fine. Odd/short chunks are handled gracefully.
- **Play in order.** Buffer the `delta` pieces and play them back-to-back for smooth speech.
- **Resampling.** If your mic captures at 16 kHz or 48 kHz, resample to 24 kHz before sending. In the browser, an `AudioWorklet` is the usual way to capture and downsample.
- **Barge-in.** If the user speaks while swaram is talking, stop your playback right away — swaram stops generating too. See [Turn-taking](turn-taking.html).

---

# Transcripts

Every conversation gives you the **text of both sides** — what the caller said and what swaram said back — alongside the audio. Use it to log conversations, show live captions, run analytics, or debug.

There are two streams, delivered by two events:

| You want | Event | The text |
|---|---|---|
| **What the user said** | `conversation.item.input_audio_transcription.completed` | `event.transcript` — the whole turn |
| **What swaram said** | `response.output_audio_transcript.delta` | `event.delta` — a piece; accumulate |

Both work identically in **Simple and Premium** — your handling doesn't change between modes.

## What the user said (input transcript)

When the user finishes a turn, swaram transcribes their speech and sends a single event with the complete text of that turn:

```json
{
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_…",
  "transcript": "കേരളത്തിന്റെ തലസ്ഥാനം ഏതാണ്?"
}
```

It's **one event per user turn** (not streamed in pieces), so `transcript` is the full thing — just read it.

> The same text also rides on the `conversation.item.created` event for that turn
> (`item.role === "user"`, in `item.content[].text`) if you prefer to track items —
> but `…input_audio_transcription.completed` above is the simplest, standard way.

## What swaram said (output transcript)

The reply's text streams back **as deltas**, in step with the audio.
Concatenate the deltas to get the full reply:

```json
{ "type": "response.output_audio_transcript.delta", "delta": "തിരുവനന്തപുരം" }
```

A reply is framed by `response.created` … (audio + transcript deltas) … `response.done`, so accumulate the deltas until `response.done` for one complete model turn.

## Putting both together — a conversation log

Capture each user turn and each model turn into a running log:

```python
conversation = []  # [{"role": "user"|"assistant", "text": ...}]
reply = ""

async for raw in ws:
    event = json.loads(raw)
    t = event["type"]
    if t == "conversation.item.input_audio_transcription.completed":
        conversation.append({"role": "user", "text": event["transcript"]})
    elif t == "response.output_audio_transcript.delta":
        reply += event["delta"]  # accumulate the model's words
    elif t == "response.done":
        if reply:
            conversation.append({"role": "assistant", "text": reply})
        reply = ""
```

```node
const conversation = []; // [{ role: "user"|"assistant", text }]
let reply = "";

ws.on("message", (data) => {
  const event = JSON.parse(data);
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    conversation.push({ role: "user", text: event.transcript });
  } else if (event.type === "response.output_audio_transcript.delta") {
    reply += event.delta; // accumulate the model's words
  } else if (event.type === "response.done") {
    if (reply) conversation.push({ role: "assistant", text: reply });
    reply = "";
  }
});
```

```browser
const conversation = []; // [{ role: "user"|"assistant", text }]
let reply = "";

ws.onmessage = (e) => {
  const event = JSON.parse(e.data);
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    conversation.push({ role: "user", text: event.transcript });
  } else if (event.type === "response.output_audio_transcript.delta") {
    reply += event.delta; // accumulate the model's words
  } else if (event.type === "response.done") {
    if (reply) conversation.push({ role: "assistant", text: reply });
    reply = "";
  }
};
```
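
If you track items rather than transcription events, the user's text can be read from `conversation.item.created` instead. The helper below is a sketch under assumptions: it relies only on `item.role` and `item.content[].text` as described in this section, and the function name and the handling of missing fields are hypothetical.

```python
from typing import Any, Dict, Optional

def user_text_from_item(event: Dict[str, Any]) -> Optional[str]:
    """Return the user's text from a conversation.item.created event, if any."""
    if event.get("type") != "conversation.item.created":
        return None
    item = event.get("item", {})
    if item.get("role") != "user":
        return None  # e.g. a tool result item, which carries no user speech
    parts = [p.get("text", "") for p in item.get("content", []) if isinstance(p, dict)]
    text = "".join(parts)
    return text or None
```

Use one source or the other for the log. Handling both this event and `conversation.item.input_audio_transcription.completed` would record each user turn twice.
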
## Turn structure

For a normal voice turn the events arrive in this order:

1. `input_audio_buffer.speech_started` / `speech_stopped` — the user's turn boundaries
2. `conversation.item.input_audio_transcription.completed` — what they said
3. `response.created` — the reply begins
4. `response.output_audio.delta` + `response.output_audio_transcript.delta` — audio + words
5. `response.done` — the reply is complete

## Notes & edge cases

- **Silent turns.** If the user's audio has no speech, there's nothing to transcribe and no input-transcript event for that turn.
- **Tool calls.** When swaram calls one of your [tools](tools.html) it emits `response.function_call_arguments.done` instead of an audio transcript — that turn has no spoken reply until you return the result.
- **Barge-in.** If the user interrupts mid-reply, the model transcript so far is partial and the turn ends with `response.done` (status `cancelled`). Keep what you accumulated.
- **Same in both modes.** Simple and Premium emit the identical events.

## Related

- [Events reference](events.html) — every event, both directions.
- [Audio](audio.html) — the audio that accompanies these transcripts.
- [Turn-taking & barge-in](turn-taking.html) — how turns start and end.

---

# Turn-taking & barge-in

swaram handles turns for you. It notices when the user stops speaking and replies on its own — automatic turn-taking, the same in both modes.

## Automatic turns

Just stream the user's audio with `input_audio_buffer.append`. When they pause, you'll see:

1. `input_audio_buffer.speech_started` when they begin,
2. `input_audio_buffer.speech_stopped` when they stop,
3. a `response.created`, then the audio and transcript deltas, then `response.done`.

You don't need to tell swaram when a turn ends — automatic turn-taking handles it in both modes.
The `input_audio_buffer.commit` and `response.create` events are **optional** nudges for finer control in the Simple mode; Premium always uses automatic turn detection, so it ignores them.

## Barge-in (interrupting)

If the user starts speaking while swaram is talking, swaram **stops right away** and listens — the in-flight reply is cancelled and audio stops at once.

For the smoothest feel, have your app **stop its own playback** the moment the user starts speaking. You'll know from `input_audio_buffer.speech_started` (or your own client-side voice detection). You can also send `response.cancel` to interrupt explicitly.

> **Tip.** Real-time speech feels best when both sides stop instantly. Drop any
> queued audio you haven't played yet as soon as the user cuts in.

## Send only the user's voice

swaram runs turn detection on whatever audio you send, so you don't mark turn boundaries yourself. But it can only judge what it receives — so do a little on your side to keep it hearing the user and nothing else:

- **Gate the microphone.** Use **push-to-talk** (capture only while a button is held or toggled on) or your own lightweight **voice activity detection**, so the model's playback and background noise aren't streamed back as if the user were speaking.
- **Cancel echo at capture.** Turn on the browser's echo cancellation and noise suppression — see [Audio](audio.html). On open speakers this is what stops the model interrupting itself; **headphones** sidestep it completely.
- **Stop your playback the instant the user speaks** (see Barge-in above) for the snappiest interruption.

---

# Events reference

Messages flow both ways over the connection as JSON on the text channel. Audio is base64 inside those messages. This is the OpenAI Realtime event subset.

## You send

| Event | What it's for |
|---|---|
| `session.update` | Set instructions, voice, and tools (at the start). |
| `input_audio_buffer.append` | Send a piece of the user's voice (base64 PCM16). |
| `input_audio_buffer.commit` | Optional — mark the end of the user's turn (Simple mode). |
| `input_audio_buffer.clear` | Optional — discard buffered audio (Simple mode). |
| `response.create` | Optional — ask for a reply now (Simple mode). |
| `response.cancel` | Stop the current reply (interrupt). |
| `conversation.item.create` | Send back the result of a tool it called. |

> Turn-taking is **automatic in both modes** — you just stream audio. The three
> "optional" events above are best-effort nudges; Premium uses automatic turn
> detection and ignores them.

## You receive

| Event | What it means |
|---|---|
| `session.created` | The session is ready (sent on connect). |
| `session.updated` | Your settings were applied. |
| `input_audio_buffer.speech_started` | The user started speaking. |
| `input_audio_buffer.speech_stopped` | The user stopped speaking. |
| `conversation.item.input_audio_transcription.completed` | The transcript of what the **user** said — see [Transcripts](transcripts.html). |
| `conversation.item.created` | A turn item was added (the user's turn, or a tool result). |
| `response.created` | A reply started. |
| `response.output_audio.delta` | A piece of Malayalam audio to play — the main payload. |
| `response.output_audio_transcript.delta` | The text of what **swaram** is saying — see [Transcripts](transcripts.html). |
| `response.function_call_arguments.done` | It wants to call one of your [tools](tools.html). |
| `response.done` | The reply finished. |
| `error` | Something went wrong — see [Errors](errors.html). |

## Close codes

When the server closes the connection, the WebSocket close code tells you why:

| Code | Meaning |
|---|---|
| `4001` | Invalid or missing API key. |
| `4003` | Out of credits — add credits to continue. |
| `4008` | Too many concurrent connections for your plan. |
| `1013` | Server busy — reconnect shortly. |

See [Errors](errors.html) for how to handle these gracefully.
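
A client can turn the close-code table into a small reconnect policy. The sketch below is an assumption about how one might structure it, not prescribed behaviour: `action_for`, `backoff_delays`, the `"unknown"` fallback for codes not in the table, and the 8-second cap are all hypothetical choices. The 0.5 s starting delay and the doubling follow the backoff pattern suggested on the Errors page.

```python
from typing import Iterator

# Close codes from the table above, grouped by what the client can do about them.
RETRYABLE = {1013}               # server busy: reconnect shortly
NEEDS_FIX = {4001, 4003, 4008}   # key, credits, concurrency: fix the cause first

def action_for(close_code: int) -> str:
    """Classify a WebSocket close code as 'retry', 'fix-first', or 'unknown'."""
    if close_code in RETRYABLE:
        return "retry"
    if close_code in NEEDS_FIX:
        return "fix-first"
    return "unknown"  # not in the table; decide per your own policy

def backoff_delays(base: float = 0.5, cap: float = 8.0, attempts: int = 6) -> Iterator[float]:
    """Yield exponentially growing delays in seconds, never exceeding cap."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= 2
```

On a `"retry"` result, sleep for each delay in turn, open a fresh connection, and re-send the `session.update` configuration before streaming audio again.
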

---

# Errors

Problems arrive as an `error` event with a short, human-readable message. Most errors leave the connection open; auth, credit, and limit problems close it with a [close code](events.html#close-codes).

```json
{
  "type": "error",
  "error": {
    "type": "server_error",
    "code": "server_error",
    "message": "the voice service is temporarily unavailable, please retry"
  }
}
```

## What to handle

| Situation | What happens | What to do |
|---|---|---|
| Invalid or missing key | `error` then close `4001` | Check the key / token. |
| Out of credits | `error` then close `4003` | Add credits, then reconnect. |
| Too many concurrent calls | `error` then close `4008` | You've hit your plan's limit — close another session or upgrade. |
| Service busy or unavailable | `error` asking you to retry | Reconnect after a short backoff. |
| Bad message | `error`, connection stays open | Fix the offending event and continue. |

## Reconnecting

For `1013` (busy) and transient `server_error`s, reconnect with a short **exponential backoff** (e.g. 0.5s, 1s, 2s, capped). Start a fresh session and re-send your `session.update` configuration on the new connection. For `4001`, `4003`, and `4008`, don't blindly retry — fix the cause first (the key, your balance, or the number of open sessions).

> **Note.** Error messages are deliberately generic and safe to surface or log;
> the details you need for debugging are the `code` and the close code.

---

# FAQ

## Does it work with the voice clients I already use?

Yes. swaram follows the OpenAI Realtime event protocol, so most clients work by changing the address (`wss://api.swaram.live/v1/realtime`), the key, and the model name (`mal-realtime-simple` or `mal-realtime-premium`).

## What's the difference between the two modes?

Both speak natural Malayalam with the exact same events, tools, and voices. **Simple** is the low-cost option; **Premium** has lower latency and a more expressive voice. Switch by changing the `model` value.
## What languages does it support?

Malayalam, first and foremost. It also handles common English words, and you set the tone and style through your [instructions](sessions.html).

## Can I use it for phone calls?

Yes. swaram is the voice layer — connect your own telephony setup and send the call audio to it as 24 kHz PCM16.

## The model talks over itself or replies to its own voice — what do I do?

It's hearing its own playback through the microphone. Send it **only the user's voice**: enable **echo cancellation** when you capture (`echoCancellation: true` in the browser), use **headphones**, or switch to **push-to-talk** so the mic is open only while the user speaks. See [Audio](audio.html) and [Turn-taking](turn-taking.html).

## How am I billed?

Per minute, in credits. You can see your balance and usage on the [dashboard](https://app.swaram.live). When your balance reaches zero, the connection is closed with code `4003` (out of credits); add credits, then reconnect.

## Where's my data?

Conversations aren't stored by default. You bring your own context each session, and you own your data.

## Can agents read these docs?

Yes — every page is available as plain Markdown at `/docs/.md`, with an index at [`/llms.txt`](/llms.txt) and the full set at [`/llms-full.txt`](/llms-full.txt).