Audio

Audio in both directions is 16-bit PCM, 24 kHz, mono, little-endian, sent as base64 inside JSON events.

Sending the user's voice

Capture the microphone, resample to 24 kHz mono PCM16, and send it in small chunks as you go:

JSON
{ "type": "input_audio_buffer.append", "audio": "<base64 PCM16 @ 24 kHz>" }
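As a rough illustration of how one such event could be built, the sketch below converts floating-point samples (the format the Web Audio API hands you) into little-endian PCM16, base64-encodes the bytes, and wraps them in the JSON event shown above. The helper names `floatToPcm16Bytes`, `bytesToBase64`, and `buildAppendEvent` are hypothetical and not part of swaram. The samples are assumed to be already at 24 kHz mono.

```javascript
// Hypothetical helpers, not part of swaram: a minimal sketch only.

// Convert Float32 samples in [-1, 1] to 16-bit little-endian PCM bytes.
function floatToPcm16Bytes(samples) {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp out-of-range input
    const scaled = s < 0 ? s * 0x8000 : s * 0x7fff;   // map to the int16 range
    view.setInt16(i * 2, Math.round(scaled), true);   // true = little-endian
  }
  return bytes;
}

// Base64-encode raw bytes using btoa (available in browsers and recent Node.js).
function bytesToBase64(bytes) {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Build the JSON text for one input_audio_buffer.append event.
function buildAppendEvent(samples) {
  return JSON.stringify({
    type: "input_audio_buffer.append",
    audio: bytesToBase64(floatToPcm16Bytes(samples)),
  });
}
```

The resulting string is what you would send over whatever connection you hold to swaram; how that connection is opened is outside the scope of this sketch.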

Send chunks continuously while the user speaks — there's no need to wait for them to finish. swaram detects when they stop and replies on its own (see Turn-taking).
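The resampling and chunking steps can be sketched as follows. This is an illustration under stated assumptions, not swaram's recommended pipeline: `resampleLinear` and `chunkSamples` are hypothetical helpers, linear interpolation is a deliberately simple stand-in for a proper filtered resampler (it can alias when downsampling), and the 480-sample chunk size in the usage note is an arbitrary choice, since this page does not specify one.

```javascript
// Hypothetical helpers: a minimal sketch, not a production resampler.

// Resample mono Float32 audio by linear interpolation between neighbours.
function resampleLinear(input, fromRate, toRate) {
  if (fromRate === toRate) return Float32Array.from(input);
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1); // stay inside the input
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}

// Yield consecutive fixed-size views over the samples; the last may be shorter.
function* chunkSamples(samples, chunkSize) {
  for (let start = 0; start < samples.length; start += chunkSize) {
    yield samples.subarray(start, start + chunkSize);
  }
}
```

For example, a 48 kHz capture could be passed through `resampleLinear(input, 48000, 24000)` and then iterated with `chunkSamples(resampled, 480)`, which gives 20 ms pieces at 24 kHz, sending each piece as soon as it is available.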

Clean capture (echo cancellation)

swaram judges only the audio you send it. If the model's own voice, playing out of the speaker, leaks back into the microphone, swaram hears it as the user and either talks over itself or replies to its own words. Make sure you send only the user's voice:

Browser
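// Request a mono microphone stream with the browser's built-in echo
// cancellation, noise suppression, and automatic gain control enabled.
const stream = await navigator.mediaDevices.getUserMedia({
  audio: { channelCount: 1, echoCancellation: true, noiseSuppression: true, autoGainControl: true },
});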

Playing the reply

The reply streams back as a series of audio deltas. Decode the base64 and play the PCM16 pieces in order:

JSON
{
  "type": "response.output_audio.delta",
  "response_id": "resp_…",
  "item_id": "item_…",
  "delta": "<base64 PCM16 @ 24 kHz>"
}
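Decoding one delta might look like the sketch below. `decodeAudioDelta` is a hypothetical helper, not part of swaram. It assumes each delta carries a whole number of 16-bit samples; a trailing odd byte, if one ever appeared, would be dropped.

```javascript
// Hypothetical helper: decode one response.output_audio.delta event into
// Float32 samples in [-1, 1), the format Web Audio buffers expect.
function decodeAudioDelta(event) {
  const binary = atob(event.delta); // base64 -> binary string
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  const view = new DataView(bytes.buffer);
  const samples = new Float32Array(bytes.length >> 1); // two bytes per sample
  for (let i = 0; i < samples.length; i++) {
    samples[i] = view.getInt16(i * 2, true) / 0x8000; // true = little-endian
  }
  return samples;
}
```

In a browser, one way to play the result is to copy each decoded array into an `AudioBuffer` created with a 24000 Hz sample rate and start each `AudioBufferSourceNode` at the time the previous one ends, so the pieces play back to back in arrival order.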

You also get the words as text, alongside the audio:

JSON
{ "type": "response.output_audio_transcript.delta", "delta": "നമസ്കാരം" }
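A small collector for these text events could be sketched like this. `createTranscriptCollector` is a hypothetical helper, and the sketch assumes that each `delta` is a fragment to be concatenated in arrival order, which the event name suggests but this page does not state outright.

```javascript
// Hypothetical helper: accumulate transcript deltas into one string.
function createTranscriptCollector() {
  let text = "";
  return {
    // Pass every incoming event here; non-transcript events are ignored.
    handle(event) {
      if (event.type === "response.output_audio_transcript.delta") {
        text += event.delta;
      }
      return text;
    },
    // The transcript accumulated so far.
    get text() {
      return text;
    },
  };
}
```

Calling `handle` for each parsed event keeps `text` current, so a UI can show the words as they arrive while the audio plays.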

To capture both sides as text — what the user said and what swaram said back — see Transcripts.

Tips
