# swaram.live

> Real-time Malayalam voice API. Send a user's voice over one WebSocket and hear natural Malayalam stream back. Two opaque modes (Simple, Premium); OpenAI-Realtime-compatible event protocol; bring your own instructions and tools.

## Docs

- [Introduction](https://swaram.live/docs/index.md): What swaram is, the base URL, and the two modes.
- [Quickstart](https://swaram.live/docs/quickstart.md): Connect, configure a session, stream audio, play the reply (Python/Node/browser).
- [Authentication](https://swaram.live/docs/authentication.md): Server-side secret keys and short-lived browser tokens.
- [Sessions & context](https://swaram.live/docs/sessions.md): Configure instructions, voice, and tools at the start of a session.
- [Function calling](https://swaram.live/docs/tools.md): Give swaram tools to act on; the call/return round-trip (Python/Node/browser).
- [Audio](https://swaram.live/docs/audio.md): PCM16 24 kHz mono base64 audio in and out; chunking, playback, and echo cancellation.
- [Transcripts](https://swaram.live/docs/transcripts.md): Read the text of what the user said and what swaram said back.
- [Turn-taking & barge-in](https://swaram.live/docs/turn-taking.md): Automatic turns, interrupting a reply, and sending only the user's voice (client VAD / push-to-talk).
- [Events reference](https://swaram.live/docs/events.md): Every client and server event, plus close codes.
- [Errors](https://swaram.live/docs/errors.md): Error events, close codes, and reconnection.
- [FAQ](https://swaram.live/docs/faq.md): Common questions about modes, languages, billing, and data.


---

# Introduction

swaram is a real-time **Malayalam voice API**. You send a user's voice over one
connection, and hear natural Malayalam stream back — speech in, Malayalam out.

You bring the **context** — your instructions, your tools, and your data — and
swaram is the voice. It follows the **OpenAI Realtime** event protocol, so if
you've built with a real-time voice model before, this will feel familiar.

## Base URL

```
wss://api.swaram.live/v1/realtime
```

Sign up and manage your API keys at [app.swaram.live](https://app.swaram.live).

## Two modes

You pick a mode with the `model` setting when you connect. Everything else —
events, tools, voices — is **identical** between them.

| Mode | `model` | Best for |
|---|---|---|
| Simple | `mal-realtime-simple` | Natural Malayalam voice at low cost. |
| Premium | `mal-realtime-premium` | Lower latency and a more expressive voice. |

Switching modes is just a different `model` value; your code stays the same.
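
Because the mode is only a query parameter, the choice can sit in one small helper. The sketch below is illustrative: `realtime_url` and the `MODELS` mapping are hypothetical names, not part of the API; only the base URL and the two `model` values come from this page.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.swaram.live/v1/realtime"

# Mode name -> `model` value, copied from the table above.
MODELS = {
    "simple": "mal-realtime-simple",
    "premium": "mal-realtime-premium",
}


def realtime_url(mode: str = "simple") -> str:
    """Build the WebSocket URL for a mode (hypothetical helper)."""
    if mode not in MODELS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODELS)}")
    return f"{BASE_URL}?{urlencode({'model': MODELS[mode]})}"
```

Calling `realtime_url("premium")` in place of `realtime_url("simple")` would be the only change needed to switch modes.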

## What you'll do

1. [Create an API key](authentication.html) on the dashboard.
2. [Connect](quickstart.html) and configure your session — instructions, voice, tools.
3. Stream the user's voice and play the Malayalam audio you get back.

## Already using a Realtime client?

swaram implements a subset of the OpenAI Realtime event protocol, so most existing
real-time voice clients work after you change three things: the **URL**
(`wss://api.swaram.live/v1/realtime`), the **API key**, and the **model**
(`mal-realtime-simple` or `mal-realtime-premium`). See the
[Quickstart](quickstart.html) for plain-WebSocket examples in Python, Node, and the browser.

## For AI agents

The full documentation is available as plain Markdown for tooling and agents:
every page is served at `/docs/<page>.md`, and an index lives at
[`/llms.txt`](/llms.txt) (with the whole set concatenated at
[`/llms-full.txt`](/llms-full.txt)).

---

# Quickstart

Connect, configure a session, stream the user's voice, and play the Malayalam
audio that streams back. Here's the whole loop.

## 1. Get an API key

Create an account on [app.swaram.live](https://app.swaram.live) and create a key.
It looks like `swaram_…` and is shown once — keep it on your server.

## 2. Connect and talk

Open a WebSocket, send your settings as the first message, then stream audio.
Audio is **16-bit PCM, 24 kHz, mono, base64** in both directions.

```python
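import asyncio, base64, json
import websockets   # v14+; older releases name the header argument `extra_headers`

API_KEY = "swaram_your_key_here"
URL = "wss://api.swaram.live/v1/realtime?model=mal-realtime-simple"

# Placeholders so the example runs: replace with your own capture and playback.
pcm16_chunk = b"\x00\x00" * 2400     # 100 ms of silence, PCM16 @ 24 kHz mono

def play(pcm16: bytes) -> None:
    pass   # hand the decoded PCM16 bytes to your audio output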

async def main():
    headers = {"Authorization": f"Bearer {API_KEY}"}
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # 1) configure the session up front (before streaming audio)
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "You are a friendly Malayalam assistant.",
                "voice": "mal-female",
            },
        }))

        # 2) stream the user's microphone as base64 PCM16 @ 24 kHz, in chunks
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_chunk).decode(),
        }))

        # 3) read events; play the audio you get back
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "response.output_audio.delta":
                play(base64.b64decode(event["delta"]))   # PCM16 @ 24 kHz
            elif event["type"] == "response.output_audio_transcript.delta":
                print(event["delta"], end="", flush=True)

asyncio.run(main())
```

```node
import WebSocket from "ws";

const API_KEY = "swaram_your_key_here";
const URL = "wss://api.swaram.live/v1/realtime?model=mal-realtime-simple";

const ws = new WebSocket(URL, { headers: { Authorization: `Bearer ${API_KEY}` } });

ws.on("open", () => {
  // 1) configure the session up front
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      instructions: "You are a friendly Malayalam assistant.",
      voice: "mal-female",
    },
  }));

  // 2) stream the user's microphone as base64 PCM16 @ 24 kHz, in chunks
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: pcm16Chunk.toString("base64"),
  }));
});

ws.on("message", (data) => {
  const event = JSON.parse(data);
  if (event.type === "response.output_audio.delta") {
    play(Buffer.from(event.delta, "base64"));   // PCM16 @ 24 kHz
  } else if (event.type === "response.output_audio_transcript.delta") {
    process.stdout.write(event.delta);
  }
});
```

```browser
// Browsers can't set an Authorization header on a WebSocket, so your backend
// mints a short-lived token (see Authentication) and the page connects with it.
const token = await fetch("/your-backend/realtime-token").then((r) => r.text());

const ws = new WebSocket(
  "wss://api.swaram.live/v1/realtime?model=mal-realtime-simple",
  ["realtime", "openai-insecure-api-key." + token]   // pass the token as a subprotocol
);

ws.onopen = () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      instructions: "You are a friendly Malayalam assistant.",
      voice: "mal-female",
    },
  }));
  // capture the mic (getUserMedia with echoCancellation: true so the model
  // doesn't hear itself), downsample to 24 kHz PCM16 with an AudioWorklet, and
  // send chunks as: { type: "input_audio_buffer.append", audio: <base64> }
};

ws.onmessage = (e) => {
  const event = JSON.parse(e.data);
  if (event.type === "response.output_audio.delta") {
    playPcm16(event.delta);   // base64 → PCM16 @ 24 kHz
  }
};
```

That's the whole loop: **configure once, stream audio, play what comes back.**

## What happens on connect

The server replies with `session.created`, then `session.updated` after your
config. As the user speaks it emits `input_audio_buffer.speech_started` and
`input_audio_buffer.speech_stopped`, then `response.created`, a stream of
`response.output_audio.delta` (the audio) and
`response.output_audio_transcript.delta` (the words), and finally `response.done`.
See the [Events reference](events.html) for the full list.

## Next steps

- [Authentication](authentication.html) — server-side keys and browser tokens.
- [Sessions & context](sessions.html) — instructions, voices, and the config contract.
- [Function calling](tools.html) — give swaram tools to act on.
- [Transcripts](transcripts.html) — read the text of what the user said and what swaram said back.

---

# Authentication

Every connection is tied to your account with an API key. There are two ways to
use it, depending on whether the code runs on **your server** or in an
**untrusted client** (a browser or mobile app).

## Server-side: your secret key

From your own backend, send your secret key in the `Authorization` header when
you open the connection:

```
Authorization: Bearer swaram_your_key_here
```

Your secret key looks like `swaram_…`, is shown once when you create it, and is
stored only as a hash — keep it on your server. **Never put a `swaram_` key in a
browser or app**, where a user could read it.

## Browser / mobile: a short-lived token

Untrusted clients shouldn't hold your secret key. Instead, your backend mints a
short-lived **ephemeral token** (it looks like `swaram_ek_…`) and hands only that
to the client. The token can carry locked settings — model, voice, instructions,
tools — that the client can't change, and it expires after a minute or so.

**1. On your backend**, exchange your secret key for a token:

```python
import httpx

resp = httpx.post(
    "https://api.swaram.live/v1/realtime/client_secrets",
    headers={"Authorization": "Bearer swaram_your_key_here"},
    json={
        "model": "mal-realtime-simple",
        "session": {
            "instructions": "You are a friendly Malayalam assistant.",
            "voice": "mal-female",
        },
    },
)
resp.raise_for_status()        # fail loudly on a rejected key or bad request
token = resp.json()["value"]   # a swaram_ek_… token: send THIS to the client
```

**2. In the client**, connect with that token. Browsers can't set headers on a
WebSocket, so pass it as a subprotocol:

```browser
const ws = new WebSocket(
  "wss://api.swaram.live/v1/realtime?model=mal-realtime-simple",
  ["realtime", "openai-insecure-api-key." + token]
);
```

A bad or expired token is rejected and the connection closes (code `4001`).

## How the key is read

The server looks for your credential in this order:

1. `Authorization: Bearer <key>` header — the server-side path.
2. `?api_key=<key>` query parameter.
3. `Sec-WebSocket-Protocol: openai-insecure-api-key.<key>` subprotocol — the browser path.
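
The lookup order above can be mirrored in a few lines, which may be handy for a local mock or test double. This is only a sketch of the documented precedence, not the server's actual code; `resolve_credential` and its argument shapes are assumptions made for illustration.

```python
from typing import Mapping, Optional, Sequence

SUBPROTOCOL_PREFIX = "openai-insecure-api-key."
BEARER_PREFIX = "Bearer "


def resolve_credential(
    headers: Mapping[str, str],
    query: Mapping[str, str],
    subprotocols: Sequence[str],
) -> Optional[str]:
    """Return the first credential found, following the documented order.

    Assumes `headers` uses the exact key "Authorization"; a real server
    would compare header names case-insensitively.
    """
    # 1) Authorization: Bearer <key>
    auth = headers.get("Authorization", "")
    if auth.startswith(BEARER_PREFIX) and auth[len(BEARER_PREFIX):].strip():
        return auth[len(BEARER_PREFIX):].strip()
    # 2) ?api_key=<key>
    if query.get("api_key"):
        return query["api_key"]
    # 3) Sec-WebSocket-Protocol: openai-insecure-api-key.<key>
    for proto in subprotocols:
        if proto.startswith(SUBPROTOCOL_PREFIX) and len(proto) > len(SUBPROTOCOL_PREFIX):
            return proto[len(SUBPROTOCOL_PREFIX):]
    return None
```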

## Config at creation

Whichever path you use, send your configuration **at the start** — in the token,
or as the first `session.update` right after you connect. It's locked once the
conversation begins. See [Sessions & context](sessions.html) for the details.

---

# Sessions & context

A session is one connection. You configure it **once, at the start** — your
instructions, voice, and tools — and that configuration applies to the whole
call. This is identical in both modes.

## Configure at the start

Send a `session.update` as your first message right after connecting (or lock the
settings into a [browser token](authentication.html)). The settings are **fixed
once the conversation begins**, so set them before you start streaming audio.

```json
{
  "type": "session.update",
  "session": {
    "instructions": "You are a helpful Malayalam tutor. Keep replies short.",
    "voice": "mal-female",
    "tools": []
  }
}
```

The server echoes the effective configuration back as a `session.updated` event.

## Settings

| Setting | What it does |
|---|---|
| `instructions` | Your system prompt — the persona, the policy, the tone, and any context. |
| `voice` | The Malayalam voice: `mal-female` (default) or `mal-male`. |
| `tools` | Actions the model can call — see [Function calling](tools.html). |
| `tool_choice` | `auto` (default), `none`, or `required`. |
| `turn_detection` | Automatic turn-taking (`default`) — swaram replies when the user stops speaking. |
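
Since the settings are fixed once the conversation begins, it can help to validate them locally before sending. The following is a minimal sketch: `build_session_update` is a hypothetical helper, and the allowed values are simply the ones listed on this page.

```python
import json
from typing import Any, Optional

VOICES = {"mal-female", "mal-male"}
TOOL_CHOICES = {"auto", "none", "required"}


def build_session_update(
    instructions: str,
    voice: str = "mal-female",
    tools: Optional[list[dict[str, Any]]] = None,
    tool_choice: str = "auto",
) -> str:
    """Return a `session.update` message as a JSON string (hypothetical helper)."""
    if voice not in VOICES:
        raise ValueError(f"voice must be one of {sorted(VOICES)}")
    if tool_choice not in TOOL_CHOICES:
        raise ValueError(f"tool_choice must be one of {sorted(TOOL_CHOICES)}")
    session = {
        "instructions": instructions,
        "voice": voice,
        "tools": tools or [],
        "tool_choice": tool_choice,
    }
    # ensure_ascii=False keeps Malayalam text readable in logs.
    return json.dumps({"type": "session.update", "session": session}, ensure_ascii=False)
```

The returned string is what you would pass to `ws.send(...)` as the first message on the connection.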

> **Replies are speakable Malayalam by default.** swaram always returns natural,
> conversational Malayalam written for the ear — numbers, units, and dates as words,
> no markdown or formatting. Your `instructions` layer **on top** of that: they set
> the persona, behaviour, and length (replies default to one or two sentences; ask
> for a longer or different style in your instructions).

## Adding your data

swaram is the voice — your data stays yours. Put the information you want it to
use into the `instructions`, or give it a [tool](tools.html) to look things up at
the moment it's needed. Conversations aren't stored; you bring your context each
session.

## Voices

Two Malayalam voices, the same in both modes:

| `voice` | |
|---|---|
| `mal-female` | Female voice (default). |
| `mal-male` | Male voice. |

## Why "at the start"?

Setting configuration once and locking it keeps both modes behaving identically
and predictably for the whole call. If you need different instructions or tools,
start a new session with the new configuration.

---

# Function calling

Give swaram **actions** to take — look up an order, check the weather, save a
note. You list your tools when you configure the session; the model decides when
one is needed and asks for it; your app runs it and sends the result back; then
swaram speaks the answer in Malayalam. It works the same in both modes.

## 1. Declare your tools

Add a `tools` array to your opening `session.update`. Each tool is a function
with a name, a description, and a JSON-Schema for its arguments.

```json
{
  "type": "session.update",
  "session": {
    "instructions": "You are a helpful shop assistant. Speak Malayalam.",
    "voice": "mal-female",
    "tools": [
      {
        "type": "function",
        "name": "get_order_status",
        "description": "Look up an order by its id.",
        "parameters": {
          "type": "object",
          "properties": { "order_id": { "type": "string" } },
          "required": ["order_id"]
        }
      }
    ],
    "tool_choice": "auto"
  }
}
```

## 2. Handle the call

When the model wants a tool, you receive a
`response.function_call_arguments.done` event with the `name`, a `call_id`, and
the `arguments` (a JSON string). Run the function, then send the result back as a
`conversation.item.create` of type `function_call_output` — keyed by the same
`call_id`. swaram continues and speaks the answer.

```python
async for raw in ws:
    event = json.loads(raw)
    if event["type"] == "response.function_call_arguments.done":
        args = json.loads(event["arguments"])
        result = lookup_order(args["order_id"])          # your code
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps(result),            # a JSON string
            },
        }))
```

```node
ws.on("message", (data) => {
  const event = JSON.parse(data);
  if (event.type === "response.function_call_arguments.done") {
    const args = JSON.parse(event.arguments);
    const result = lookupOrder(args.order_id);            // your code
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify(result),                   // a JSON string
      },
    }));
  }
});
```

```browser
ws.onmessage = (e) => {
  const event = JSON.parse(e.data);
  if (event.type === "response.function_call_arguments.done") {
    const args = JSON.parse(event.arguments);
    const result = lookupOrder(args.order_id);            // your code
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify(result),                   // a JSON string
      },
    }));
  }
};
```

## Choosing when tools run

`tool_choice` controls whether the model may call a tool:

| `tool_choice` | Behaviour |
|---|---|
| `auto` | The model decides (default). |
| `none` | Never call a tool — just speak. |
| `required` | The model must call a tool. |

## Tips

> **Keep tools fast.** A tool call adds a short pause before swaram replies.
> Keep the work quick, and have swaram confirm with the user before it does
> anything important. You can write your instructions so that swaram says a brief
> Malayalam filler while it waits.

- Send the `output` as a **JSON string**. If it isn't valid JSON it's wrapped as
  `{"result": "<your text>"}`.
- The `call_id` ties a result to its call — always echo back the one you received.
- Parallel calls are supported; answer each `call_id` you're given.
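
With several tools, and with parallel calls, a small registry keeps the round-trip uniform. The sketch below is one possible shape: `TOOLS`, `tool_result_event`, and the stand-in `get_order_status` are hypothetical, and returning an `{"error": ...}` object when a tool fails is a design choice of this example rather than something the API requires.

```python
import json
from typing import Any, Callable


def get_order_status(order_id: str) -> dict[str, Any]:
    """Stand-in for your own lookup; returns made-up data."""
    return {"order_id": order_id, "status": "shipped"}


# Tool name -> Python callable (hypothetical registry).
TOOLS: dict[str, Callable[..., Any]] = {"get_order_status": get_order_status}


def tool_result_event(event: dict[str, Any]) -> str:
    """Turn a `response.function_call_arguments.done` event into the reply message."""
    try:
        args = json.loads(event.get("arguments") or "{}")
        result = TOOLS[event["name"]](**args)
    except Exception as exc:  # unknown tool, bad arguments, or a failing tool
        result = {"error": f"{type(exc).__name__}: {exc}"}
    return json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": event["call_id"],            # echo the id you received
            "output": json.dumps(result, ensure_ascii=False),  # a JSON string
        },
    })
```

Each incoming call event is answered independently, so parallel calls need no extra bookkeeping beyond echoing each `call_id`.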

---

# Audio

Audio in both directions is **16-bit PCM, 24 kHz, mono, little-endian**, sent as
**base64** inside JSON events.

## Sending the user's voice

Capture the microphone, resample to 24 kHz mono PCM16, and send it in small
chunks as you go:

```json
{ "type": "input_audio_buffer.append", "audio": "<base64 PCM16 @ 24 kHz>" }
```

Send chunks continuously while the user speaks — there's no need to wait for them
to finish. swaram detects when they stop and replies on its own (see
[Turn-taking](turn-taking.html)).
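
As a rough illustration of chunked sending, the generator below slices a PCM16 buffer into fixed-duration pieces and wraps each in an `input_audio_buffer.append` message. `append_events` and the 200 ms default are assumptions chosen for the example; the event shape and audio format are the ones described above.

```python
import base64
import json
from typing import Iterator

SAMPLE_RATE = 24_000     # Hz
BYTES_PER_SAMPLE = 2     # 16-bit mono


def append_events(pcm16: bytes, chunk_ms: int = 200) -> Iterator[str]:
    """Yield one `input_audio_buffer.append` JSON message per chunk of audio."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    chunk_bytes = SAMPLE_RATE * chunk_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm16), chunk_bytes):
        chunk = pcm16[start:start + chunk_bytes]   # the last chunk may be shorter
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

In a live client you would feed this from the microphone as audio arrives and send each message with `ws.send(...)`, instead of slicing a finished recording.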

## Clean capture (echo cancellation)

swaram judges only the audio you send it. If the model's own voice — playing out
of the speaker — leaks back into the microphone, swaram hears it as the user and
talks over itself or replies to its own words. Make sure you send **only the
user's voice**:

- **Enable echo cancellation when you capture.** In the browser, set it on
  `getUserMedia`:

```browser
const stream = await navigator.mediaDevices.getUserMedia({
  audio: { channelCount: 1, echoCancellation: true, noiseSuppression: true, autoGainControl: true },
});
```

- **Headphones** remove the problem entirely — the speaker never reaches the mic.
- On a **phone call**, the device and network usually cancel echo for you.
- Where you can't fully cancel it on open speakers, use **push-to-talk** or your
  own voice detection so the mic is only open while the user is speaking — see
  [Turn-taking](turn-taking.html).

## Playing the reply

The reply streams back as a series of audio deltas. Decode the base64 and play
the PCM16 pieces in order:

```json
{
  "type": "response.output_audio.delta",
  "response_id": "resp_…",
  "item_id": "item_…",
  "delta": "<base64 PCM16 @ 24 kHz>"
}
```

You also get the words as text, alongside the audio:

```json
{ "type": "response.output_audio_transcript.delta", "delta": "നമസ്കാരം" }
```

To capture **both** sides as text — what the user said *and* what swaram said back —
see [Transcripts](transcripts.html).

## Tips

- **Chunk size** doesn't have to be exact — a few hundred milliseconds of audio
  per `append` is fine. Odd/short chunks are handled gracefully.
- **Play in order.** Buffer the `delta` pieces and play them back-to-back for
  smooth speech.
- **Resampling.** If your mic captures at 16 kHz or 48 kHz, resample to 24 kHz
  before sending. In the browser, an `AudioWorklet` is the usual way to capture
  and downsample.
- **Barge-in.** If the user speaks while swaram is talking, stop your playback
  right away — swaram stops generating too. See [Turn-taking](turn-taking.html).
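
For server-side code that receives 16 kHz or 48 kHz audio, the resampling step might look like the sketch below. It uses plain linear interpolation with only the standard library; `resample_pcm16` is a hypothetical helper, and because it applies no low-pass filter, downsampling from 48 kHz can alias. A dedicated resampling library is likely a better fit for production audio.

```python
import sys
from array import array


def resample_pcm16(pcm16: bytes, src_rate: int, dst_rate: int = 24_000) -> bytes:
    """Resample mono little-endian PCM16 by linear interpolation (no filtering)."""
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if src_rate == dst_rate:
        return pcm16
    src = array("h")
    src.frombytes(pcm16[: len(pcm16) - len(pcm16) % 2])  # drop a stray odd byte
    if sys.byteorder == "big":
        src.byteswap()                     # the wire format is little-endian
    n_src = len(src)
    n_dst = n_src * dst_rate // src_rate
    out = array("h")
    for i in range(n_dst):
        pos = i * src_rate / dst_rate      # fractional position in the source
        j = min(int(pos), n_src - 1)
        frac = pos - j
        a = src[j]
        b = src[j + 1] if j + 1 < n_src else a
        out.append(round(a + (b - a) * frac))
    if sys.byteorder == "big":
        out.byteswap()
    return out.tobytes()
```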

---

# Transcripts

Every conversation gives you the **text of both sides** — what the caller said and
what swaram said back — alongside the audio. Use it to log conversations, show live
captions, run analytics, or debug.

There are two streams, delivered by two events:

| You want | Event | The text |
|---|---|---|
| **What the user said** | `conversation.item.input_audio_transcription.completed` | `event.transcript` — the whole turn |
| **What swaram said** | `response.output_audio_transcript.delta` | `event.delta` — a piece; accumulate |

Both work identically in **Simple and Premium** — your handling doesn't change
between modes.

## What the user said (input transcript)

When the user finishes a turn, swaram transcribes their speech and sends a single
event with the complete text of that turn:

```json
{
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_…",
  "transcript": "കേരളത്തിന്റെ തലസ്ഥാനം ഏതാണ്?"
}
```

It's **one event per user turn** (not streamed in pieces), so `transcript` is the
full thing — just read it.

> The same text also rides on the `conversation.item.created` event for that turn
> (`item.role === "user"`, in `item.content[].text`) if you prefer to track items —
> but `…input_audio_transcription.completed` above is the simplest, standard way.

## What swaram said (output transcript)

The reply's text streams back **as deltas**, in step with the audio. Concatenate the
deltas to get the full reply:

```json
{ "type": "response.output_audio_transcript.delta", "delta": "തിരുവനന്തപുരം" }
```

A reply is framed by `response.created` … (audio + transcript deltas) …
`response.done`, so accumulate the deltas until `response.done` for one complete
model turn.

## Putting both together — a conversation log

Capture each user turn and each model turn into a running log:

```python
conversation = []          # [{"role": "user"|"assistant", "text": ...}]
reply = ""

async for raw in ws:
    event = json.loads(raw)
    t = event["type"]
    if t == "conversation.item.input_audio_transcription.completed":
        conversation.append({"role": "user", "text": event["transcript"]})
    elif t == "response.output_audio_transcript.delta":
        reply += event["delta"]                      # accumulate the model's words
    elif t == "response.done":
        if reply:
            conversation.append({"role": "assistant", "text": reply})
        reply = ""
```

```node
const conversation = [];   // [{ role: "user"|"assistant", text }]
let reply = "";

ws.on("message", (data) => {
  const event = JSON.parse(data);
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    conversation.push({ role: "user", text: event.transcript });
  } else if (event.type === "response.output_audio_transcript.delta") {
    reply += event.delta;                            // accumulate the model's words
  } else if (event.type === "response.done") {
    if (reply) conversation.push({ role: "assistant", text: reply });
    reply = "";
  }
});
```

```browser
const conversation = [];   // [{ role: "user"|"assistant", text }]
let reply = "";

ws.onmessage = (e) => {
  const event = JSON.parse(e.data);
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    conversation.push({ role: "user", text: event.transcript });
  } else if (event.type === "response.output_audio_transcript.delta") {
    reply += event.delta;                            // accumulate the model's words
  } else if (event.type === "response.done") {
    if (reply) conversation.push({ role: "assistant", text: reply });
    reply = "";
  }
};
```

## Turn structure

For a normal voice turn the events arrive in this order:

1. `input_audio_buffer.speech_started` / `speech_stopped` — the user's turn boundaries
2. `conversation.item.input_audio_transcription.completed` — what they said
3. `response.created` — the reply begins
4. `response.output_audio.delta` + `response.output_audio_transcript.delta` — audio + words
5. `response.done` — the reply is complete

## Notes & edge cases

- **Silent turns.** If the user's audio has no speech, there's nothing to
  transcribe and no input-transcript event for that turn.
- **Tool calls.** When swaram calls one of your [tools](tools.html) it emits
  `response.function_call_arguments.done` instead of an audio transcript — that turn
  has no spoken reply until you return the result.
- **Barge-in.** If the user interrupts mid-reply, the model transcript so far is
  partial and the turn ends with `response.done` (status `cancelled`). Keep what you
  accumulated.
- **Same in both modes.** Simple and Premium emit the identical events.
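
The edge cases above can be folded into the conversation log so that interrupted replies are marked rather than lost. This is a sketch under one loud assumption: that the reply status sits at `event["response"]["status"]`, as it does in the OpenAI Realtime protocol; this page only states that the status is `cancelled`. `ConversationLog` is a hypothetical class.

```python
from typing import Any


class ConversationLog:
    """Collects user and assistant turns from server events (hypothetical helper)."""

    def __init__(self) -> None:
        self.turns: list[dict[str, Any]] = []
        self._reply = ""

    def handle(self, event: dict[str, Any]) -> None:
        kind = event.get("type")
        if kind == "conversation.item.input_audio_transcription.completed":
            # Silent turns never produce this event, so nothing is logged for them.
            self.turns.append({"role": "user", "text": event["transcript"]})
        elif kind == "response.output_audio_transcript.delta":
            self._reply += event["delta"]
        elif kind == "response.done":
            # ASSUMPTION: status location follows the OpenAI Realtime shape.
            status = (event.get("response") or {}).get("status")
            if self._reply:  # tool-call turns have no spoken text yet
                self.turns.append({
                    "role": "assistant",
                    "text": self._reply,
                    "partial": status == "cancelled",
                })
            self._reply = ""
```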

## Related

- [Events reference](events.html) — every event, both directions.
- [Audio](audio.html) — the audio that accompanies these transcripts.
- [Turn-taking & barge-in](turn-taking.html) — how turns start and end.

---

# Turn-taking & barge-in

swaram handles turns for you. It notices when the user stops speaking and replies
on its own — automatic turn-taking, the same in both modes.

## Automatic turns

Just stream the user's audio with `input_audio_buffer.append`. When they pause,
you'll see:

1. `input_audio_buffer.speech_started` when they begin,
2. `input_audio_buffer.speech_stopped` when they stop,
3. a `response.created`, then the audio and transcript deltas, then `response.done`.

You don't need to tell swaram when a turn ends; automatic turn-taking handles it
in both modes. The `input_audio_buffer.commit` and `response.create` events are
**optional** nudges for finer control in Simple mode. Premium always uses
automatic turn detection, so it ignores them.

## Barge-in (interrupting)

If the user starts speaking while swaram is talking, swaram **stops right away**
and listens — the in-flight reply is cancelled and audio stops at once.

For the smoothest feel, have your app **stop its own playback** the moment the
user starts speaking. You'll know from `input_audio_buffer.speech_started` (or
your own client-side voice detection). You can also send `response.cancel` to
interrupt explicitly.

> **Tip.** Real-time speech feels best when both sides stop instantly. Drop any
> queued audio you haven't played yet as soon as the user cuts in.
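
One way to drop queued audio on barge-in is to keep undelivered chunks in a queue and clear it when `input_audio_buffer.speech_started` arrives. The class below is a sketch with hypothetical names (`PlaybackQueue`, `next_chunk`); how the chunks reach the speaker is left to your audio stack.

```python
import base64
from collections import deque
from typing import Any, Optional


class PlaybackQueue:
    """Buffers reply audio in order and discards it when the user cuts in."""

    def __init__(self) -> None:
        self._chunks: deque[bytes] = deque()

    def handle(self, event: dict[str, Any]) -> None:
        kind = event.get("type")
        if kind == "response.output_audio.delta":
            self._chunks.append(base64.b64decode(event["delta"]))
        elif kind == "input_audio_buffer.speech_started":
            self._chunks.clear()   # barge-in: drop everything not yet played

    def next_chunk(self) -> Optional[bytes]:
        """Return the next PCM16 chunk to play, or None if the queue is empty."""
        return self._chunks.popleft() if self._chunks else None

    def __len__(self) -> int:
        return len(self._chunks)
```

Your audio callback would pull from `next_chunk()`; stopping whatever chunk is already playing is still up to the player itself.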

## Send only the user's voice

swaram runs turn detection on whatever audio you send, so you don't mark turn
boundaries yourself. But it can only judge what it receives — so do a little on
your side to keep it hearing the user and nothing else:

- **Gate the microphone.** Use **push-to-talk** (capture only while a button is
  held or toggled on) or your own lightweight **voice activity detection**, so the
  model's playback and background noise aren't streamed back as if the user were
  speaking.
- **Cancel echo at capture.** Turn on the browser's echo cancellation and noise
  suppression — see [Audio](audio.html). On open speakers this is what stops the
  model interrupting itself; **headphones** sidestep it completely.
- **Stop your playback the instant the user speaks** (see Barge-in above) for the
  snappiest interruption.
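
A lightweight client-side gate can be as simple as an energy threshold with a short hangover, so trailing syllables are not clipped. The sketch below is an assumption-heavy illustration: `VoiceGate`, the RMS threshold of 500, and the five-chunk hangover are made-up starting points that would need tuning for a real microphone, and an energy check is far cruder than a proper voice activity detector.

```python
import math
import sys
from array import array


def rms(pcm16: bytes) -> float:
    """Root-mean-square level of a little-endian PCM16 chunk."""
    samples = array("h")
    samples.frombytes(pcm16[: len(pcm16) - len(pcm16) % 2])
    if sys.byteorder == "big":
        samples.byteswap()
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class VoiceGate:
    """Decides, chunk by chunk, whether microphone audio should be sent."""

    def __init__(self, threshold: float = 500.0, hangover_chunks: int = 5) -> None:
        self.threshold = threshold
        self.hangover_chunks = hangover_chunks
        self._open = False
        self._quiet = 0

    def should_send(self, pcm16: bytes) -> bool:
        if rms(pcm16) >= self.threshold:
            self._open = True          # loud enough: open (or keep open) the gate
            self._quiet = 0
        elif self._open:
            self._quiet += 1           # quiet chunk while open: count the hangover
            if self._quiet > self.hangover_chunks:
                self._open = False
        return self._open
```

Only chunks for which `should_send(...)` returns `True` would be wrapped in `input_audio_buffer.append` and sent.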

---

# Events reference

Messages flow both ways over the connection as JSON text frames, with audio
base64-encoded inside them. The events are a subset of the OpenAI Realtime protocol.

## You send

| Event | What it's for |
|---|---|
| `session.update` | Set instructions, voice, and tools (at the start). |
| `input_audio_buffer.append` | Send a piece of the user's voice (base64 PCM16). |
| `input_audio_buffer.commit` | Optional — mark the end of the user's turn (Simple mode). |
| `input_audio_buffer.clear` | Optional — discard buffered audio (Simple mode). |
| `response.create` | Optional — ask for a reply now (Simple mode). |
| `response.cancel` | Stop the current reply (interrupt). |
| `conversation.item.create` | Send back the result of a tool the model called. |

> Turn-taking is **automatic in both modes** — you just stream audio. The three
> "optional" events above are best-effort nudges; Premium uses automatic turn
> detection and ignores them.

## You receive

| Event | What it means |
|---|---|
| `session.created` | The session is ready (sent on connect). |
| `session.updated` | Your settings were applied. |
| `input_audio_buffer.speech_started` | The user started speaking. |
| `input_audio_buffer.speech_stopped` | The user stopped speaking. |
| `conversation.item.input_audio_transcription.completed` | The transcript of what the **user** said — see [Transcripts](transcripts.html). |
| `conversation.item.created` | A turn item was added (the user's turn, or a tool result). |
| `response.created` | A reply started. |
| `response.output_audio.delta` | A piece of Malayalam audio to play — the main payload. |
| `response.output_audio_transcript.delta` | The text of what **swaram** is saying — see [Transcripts](transcripts.html). |
| `response.function_call_arguments.done` | The model wants to call one of your [tools](tools.html). |
| `response.done` | The reply finished. |
| `error` | Something went wrong — see [Errors](errors.html). |

## Close codes

When the server closes the connection, the WebSocket close code tells you why:

| Code | Meaning |
|---|---|
| `4001` | Invalid or missing API key. |
| `4003` | Out of credits — add credits to continue. |
| `4008` | Too many concurrent connections for your plan. |
| `1013` | Server busy — reconnect shortly. |

See [Errors](errors.html) for how to handle these gracefully.

---

# Errors

Problems arrive as an `error` event with a short, human-readable message. Most
errors leave the connection open; auth, credit, and limit problems close it with
a [close code](events.html#close-codes).

```json
{
  "type": "error",
  "error": {
    "type": "server_error",
    "code": "server_error",
    "message": "the voice service is temporarily unavailable, please retry"
  }
}
```

## What to handle

| Situation | What happens | What to do |
|---|---|---|
| Invalid or missing key | `error` then close `4001` | Check the key / token. |
| Out of credits | `error` then close `4003` | Add credits, then reconnect. |
| Too many concurrent calls | `error` then close `4008` | You've hit your plan's limit — close another session or upgrade. |
| Service busy or unavailable | `error` asking you to retry | Reconnect after a short backoff. |
| Bad message | `error`, connection stays open | Fix the offending event and continue. |

## Reconnecting

For `1013` (busy) and transient `server_error`s, reconnect with a short
**exponential backoff** (e.g. 0.5s, 1s, 2s, capped). Start a fresh session and
re-send your `session.update` configuration on the new connection.

For `4001`, `4003`, and `4008`, don't blindly retry — fix the cause first (the
key, your balance, or the number of open sessions).
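
The retry policy above can be captured in a few small functions. This is a sketch: the function names, the 8-second cap, and the limit of six attempts are assumptions for illustration; only the close codes, the `server_error` code, and the 0.5s, 1s, 2s progression come from this page.

```python
from typing import Any, Iterator

RETRYABLE_CLOSE_CODES = {1013}                  # server busy
FATAL_CLOSE_CODES = {
    4001: "check the key or token",
    4003: "add credits",
    4008: "close another session or upgrade",
}


def should_reconnect(close_code: int) -> bool:
    """True only for close codes this page describes as worth retrying."""
    return close_code in RETRYABLE_CLOSE_CODES


def is_transient_error(event: dict[str, Any]) -> bool:
    """True for an `error` event whose code is `server_error`."""
    return event.get("type") == "error" and (event.get("error") or {}).get("code") == "server_error"


def backoff_delays(base: float = 0.5, cap: float = 8.0, attempts: int = 6) -> Iterator[float]:
    """Yield capped exponential delays in seconds: base, 2*base, 4*base, ..."""
    for attempt in range(attempts):
        yield min(cap, base * 2 ** attempt)
```

A reconnect loop would sleep for each delay in turn, open a fresh connection, and re-send `session.update` before streaming audio again. Adding a little random jitter to each delay is a common refinement.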

> **Note.** Error messages are deliberately generic and safe to surface or log;
> the details you need for debugging are the `code` and the close code.

---

# FAQ

## Does it work with the voice clients I already use?

Yes. swaram follows the OpenAI Realtime event protocol, so most clients work
after you change the address (`wss://api.swaram.live/v1/realtime`), the key, and
the model name (`mal-realtime-simple` or `mal-realtime-premium`).

## What's the difference between the two modes?

Both speak natural Malayalam with the exact same events, tools, and voices.
**Simple** is the low-cost option; **Premium** has lower latency and a more
expressive voice. Switch by changing the `model` value.

## What languages does it support?

Malayalam, first and foremost. It also handles common English words, and you set
the tone and style through your [instructions](sessions.html).

## Can I use it for phone calls?

Yes. swaram is the voice layer — connect your own telephony setup and send the
call audio to it as 24 kHz PCM16.

## The model talks over itself or replies to its own voice — what do I do?

It's hearing its own playback through the microphone. Send it **only the user's
voice**: enable **echo cancellation** when you capture (`echoCancellation: true`
in the browser), use **headphones**, or switch to **push-to-talk** so the mic is
open only while the user speaks. See [Audio](audio.html) and
[Turn-taking](turn-taking.html).

## How am I billed?

Per minute, in credits. You can see your balance and usage on the
[dashboard](https://app.swaram.live). When your balance reaches zero, the session
is refused or ended with close code `4003` (out of credits).

## Where's my data?

Conversations aren't stored by default. You bring your own context each session,
and you own your data.

## Can agents read these docs?

Yes — every page is available as plain Markdown at `/docs/<page>.md`, with an
index at [`/llms.txt`](/llms.txt) and the full set at
[`/llms-full.txt`](/llms-full.txt).