
Quickstart

Connect, configure a session, stream the user's voice, and play the Malayalam audio that streams back. Here's the whole loop.

1. Get an API key

Create an account at app.swaram.live and generate a key. Keys look like swaram_… and are shown only once, so store yours on your server and never expose it in client-side code.

2. Connect and talk

Open a WebSocket, send your settings as the first message, then stream audio. Audio is 16-bit PCM, 24 kHz, mono, and base64-encoded in both directions.

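# Python (server-side). Requires the third-party `websockets` package.
import asyncio, base64, json, websockets

API_KEY = "swaram_your_key_here"
URL = "wss://api.swaram.live/v1/realtime?model=mal-realtime-simple"

def play(pcm16: bytes) -> None:
    """Placeholder: hand the decoded PCM16 bytes to your audio output."""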

async def main():
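    headers = {"Authorization": f"Bearer {API_KEY}"}
    # `additional_headers` is the keyword in recent `websockets` releases;
    # older releases call it `extra_headers`.
    async with websockets.connect(URL, additional_headers=headers) as ws: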
        # 1) configure the session up front (before streaming audio)
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "You are a friendly Malayalam assistant.",
                "voice": "mal-female",
            },
        }))

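        # 2) stream the user's microphone as base64 PCM16 @ 24 kHz, in chunks.
        #    Stand-in so the example runs: 100 ms of silence. In a real app,
        #    send one message per chunk captured from the microphone.
        pcm16_chunk = bytes(4800)
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_chunk).decode(),
        }))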

        # 3) read events; play the audio you get back
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "response.output_audio.delta":
                play(base64.b64decode(event["delta"]))   # PCM16 @ 24 kHz
            elif event["type"] == "response.output_audio_transcript.delta":
                print(event["delta"], end="", flush=True)

asyncio.run(main())

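// Node.js (server-side). Requires the `ws` package.
import WebSocket from "ws";

const API_KEY = "swaram_your_key_here";
const URL = "wss://api.swaram.live/v1/realtime?model=mal-realtime-simple";

// Placeholder: hand the decoded PCM16 bytes to your audio output.
function play(pcm16) {}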

const ws = new WebSocket(URL, { headers: { Authorization: `Bearer ${API_KEY}` } });

ws.on("open", () => {
  // 1) configure the session up front
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      instructions: "You are a friendly Malayalam assistant.",
      voice: "mal-female",
    },
  }));

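  // 2) stream the user's microphone as base64 PCM16 @ 24 kHz, in chunks.
  //    Stand-in so the example runs: 100 ms of silence. In a real app,
  //    send one message per chunk captured from the microphone.
  const pcm16Chunk = Buffer.alloc(4800);
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: pcm16Chunk.toString("base64"),
  }));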
});

ws.on("message", (data) => {
  const event = JSON.parse(data.toString());   // `ws` delivers messages as Buffers
  if (event.type === "response.output_audio.delta") {
    play(Buffer.from(event.delta, "base64"));   // PCM16 @ 24 kHz
  } else if (event.type === "response.output_audio_transcript.delta") {
    process.stdout.write(event.delta);
  }
});

// Browsers can't set an Authorization header on a WebSocket, so your backend
// mints a short-lived token (see Authentication) and the page connects with it.
// Top-level await requires a module script (<script type="module">).
const token = await fetch("/your-backend/realtime-token").then((r) => r.text());

const ws = new WebSocket(
  "wss://api.swaram.live/v1/realtime?model=mal-realtime-simple",
  ["realtime", "openai-insecure-api-key." + token]   // pass the token as a subprotocol
);

ws.onopen = () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      instructions: "You are a friendly Malayalam assistant.",
      voice: "mal-female",
    },
  }));
  // capture the mic (getUserMedia with echoCancellation: true so the model
  // doesn't hear itself), downsample to 24 kHz PCM16 with an AudioWorklet, and
  // send chunks as: { type: "input_audio_buffer.append", audio: <base64> }
};

ws.onmessage = (e) => {
  const event = JSON.parse(e.data);
  if (event.type === "response.output_audio.delta") {
    playPcm16(event.delta);   // placeholder for your player: base64 → PCM16 @ 24 kHz
  }
};

That's the whole loop: configure once, stream audio, play what comes back.
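
The examples above take a ready-made PCM16 chunk as given. As a rough sketch of how captured audio might be turned into input_audio_buffer.append messages, the helper below converts float samples to 16-bit PCM and splits them into fixed-size chunks. The function names, the 20 ms chunk size, and the little-endian byte order are assumptions made for illustration; the source only specifies 16-bit PCM, 24 kHz, mono, base64.

```python
import base64
import json
import struct

SAMPLE_RATE = 24_000   # Hz, as stated in this quickstart
BYTES_PER_SAMPLE = 2   # 16-bit PCM, mono


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to 16-bit PCM bytes.

    Little-endian byte order is an assumption, not something the docs state.
    """
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))


def append_messages(pcm16, chunk_ms=20):
    """Yield one input_audio_buffer.append JSON string per chunk.

    chunk_ms=20 is a hypothetical default; 20 ms at 24 kHz is 960 bytes.
    """
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm16), chunk_bytes):
        chunk = pcm16[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

Inside an open connection, each yielded string could be passed straight to ws.send(...). Smaller chunks tend to reduce latency at the cost of more messages, so the right size depends on your capture pipeline.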

What happens on connect

The server sends session.created as soon as you connect, then session.updated once it has applied your session.update. As the user speaks, it emits input_audio_buffer.speech_started and input_audio_buffer.speech_stopped, followed by response.created, a stream of response.output_audio.delta (the audio) and response.output_audio_transcript.delta (the words), and finally response.done. See the Events reference for the full list.
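
One way to act on this sequence is to collect each response as it streams in. The sketch below is illustrative only: ResponseCollector and its methods are hypothetical names, and it assumes the audio and transcript events carry their payload in a "delta" field, as in the examples above. It resets on response.created, accumulates audio and transcript deltas, and marks the response finished on response.done. The WAV export uses Python's standard wave module and is just a convenient way to inspect the audio offline.

```python
import base64
import io
import wave


class ResponseCollector:
    """Accumulates one response's audio and transcript (hypothetical helper)."""

    def __init__(self):
        self.audio = bytearray()
        self.transcript = []
        self.done = False

    def handle(self, event):
        kind = event.get("type")
        if kind == "response.created":
            # A new response starts: drop anything left from the previous one.
            self.audio.clear()
            self.transcript.clear()
            self.done = False
        elif kind == "response.output_audio.delta":
            self.audio.extend(base64.b64decode(event["delta"]))
        elif kind == "response.output_audio_transcript.delta":
            self.transcript.append(event["delta"])
        elif kind == "response.done":
            self.done = True
        # Other events (session.created, speech_started, ...) are ignored here.

    def text(self):
        return "".join(self.transcript)

    def wav_bytes(self):
        """Wrap the collected PCM16 @ 24 kHz mono audio in a WAV container."""
        buf = io.BytesIO()
        with wave.open(buf, "wb") as w:
            w.setnchannels(1)
            w.setsampwidth(2)
            w.setframerate(24_000)
            w.writeframes(bytes(self.audio))
        return buf.getvalue()
```

In the Python example's read loop, calling collector.handle(event) for every parsed event would keep the collector up to date. A real-time player would more likely play each delta as it arrives instead of waiting for response.done.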

Next steps

Authentication — server-side keys and browser tokens.
Sessions & context — instructions, voices, and the config contract.
Function calling — give swaram tools to act on.
Transcripts — read the text of what the user said and what swaram said back.
