Audio

Audio in both directions is 16-bit PCM, 24 kHz, mono, little-endian, sent as base64 inside JSON events.

Sending the user's voice

Capture the microphone, resample to 24 kHz mono PCM16, and send it in small chunks as you go:

JSON

{ "type": "input_audio_buffer.append", "audio": "<base64 PCM16 @ 24 kHz>" }

Send chunks continuously while the user speaks — there's no need to wait for them to finish. swaram detects when they stop and replies on its own (see Turn-taking).
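As a rough sketch of the encoding step, the snippet below turns a chunk of Float32 samples (already at 24 kHz mono) into the append event shown above. The helper names floatToPcm16Bytes, bytesToBase64, and makeAppendEvent are made up for this illustration and are not part of swaram. It assumes btoa is available, as it is in browsers and recent Node.js.

```javascript
// Hypothetical helper: Float32 samples in [-1, 1] -> 16-bit little-endian PCM bytes.
function floatToPcm16Bytes(samples) {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Scale so that -1 maps to -32768 and +1 maps to 32767.
    const value = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
    view.setInt16(i * 2, value, true); // true = little-endian
  }
  return bytes;
}

// Hypothetical helper: base64-encode bytes in slices so large chunks
// do not exceed the engine's argument limit.
function bytesToBase64(bytes) {
  let binary = "";
  for (let i = 0; i < bytes.length; i += 0x8000) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + 0x8000));
  }
  return btoa(binary);
}

// Hypothetical helper: build the JSON text for one append event.
function makeAppendEvent(samples) {
  return JSON.stringify({
    type: "input_audio_buffer.append",
    audio: bytesToBase64(floatToPcm16Bytes(samples)),
  });
}
```

Call makeAppendEvent once per captured chunk and send the resulting string over whatever connection you hold to swaram.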

Clean capture (echo cancellation)

swaram judges only the audio you send it. If the model's own voice — playing out of the speaker — leaks back into the microphone, swaram hears it as the user and talks over itself or replies to its own words. Make sure you send only the user's voice:

Enable echo cancellation when you capture. In the browser, set it on getUserMedia:

Browser

const stream = await navigator.mediaDevices.getUserMedia({
  audio: { channelCount: 1, echoCancellation: true, noiseSuppression: true, autoGainControl: true },
});

Headphones remove the problem entirely — the speaker never reaches the mic.
On a phone call, the device and network usually cancel echo for you.
On open speakers where echo can't be fully cancelled, use push-to-talk or your own voice detection so the mic is open only while the user is speaking (see Turn-taking).
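One way to picture the push-to-talk option is a small gate that sits between capture and sending. This is only a sketch of the capture side: createMicGate and its methods are hypothetical names, the send callback is whatever you already use to deliver append events, and anything swaram needs in order to end the turn is covered on the Turn-taking page, not here.

```javascript
// Hypothetical push-to-talk gate: forwards chunks only while the button is held.
function createMicGate(send) {
  let open = false;
  return {
    press() { open = true; },    // user pressed the talk button
    release() { open = false; }, // user let go
    isOpen() { return open; },
    // Forward the chunk if the gate is open; otherwise drop it.
    // Returns true when the chunk was forwarded.
    push(chunk) {
      if (!open) return false;
      send(chunk);
      return true;
    },
  };
}
```

Wire press and release to your talk button (for example pointerdown and pointerup) and call push for every captured chunk. Chunks that arrive while the gate is closed never leave the device, so speaker echo cannot reach swaram during playback.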

Playing the reply

The reply streams back as a series of audio deltas. Decode the base64 and play the PCM16 pieces in order:

JSON

{
  "type": "response.output_audio.delta",
  "response_id": "resp_…",
  "item_id": "item_…",
  "delta": "<base64 PCM16 @ 24 kHz>"
}
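A minimal sketch of the decode-and-play-in-order step follows. The names base64ToBytes, pcm16BytesToFloat, decodeAudioDelta, and createPlaybackClock are hypothetical. It assumes each delta holds whole 16-bit samples and that atob is available, as in browsers and recent Node.js.

```javascript
// Hypothetical helper: base64 text -> raw bytes.
function base64ToBytes(b64) {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

// Hypothetical helper: 16-bit little-endian PCM bytes -> Float32 samples in [-1, 1).
// A trailing odd byte is ignored (assumes deltas carry whole samples).
function pcm16BytesToFloat(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const samples = new Float32Array(Math.floor(bytes.byteLength / 2));
  for (let i = 0; i < samples.length; i++) {
    samples[i] = view.getInt16(i * 2, true) / 0x8000; // true = little-endian
  }
  return samples;
}

// Returns the samples for an audio delta event, or null for any other event.
function decodeAudioDelta(event) {
  if (event.type !== "response.output_audio.delta") return null;
  return pcm16BytesToFloat(base64ToBytes(event.delta));
}

// Hypothetical scheduler: hands out back-to-back start times (in seconds)
// so consecutive chunks play without gaps or overlap.
function createPlaybackClock(sampleRate = 24000) {
  let nextStart = 0;
  return {
    schedule(sampleCount, now) {
      const start = Math.max(nextStart, now); // never schedule in the past
      nextStart = start + sampleCount / sampleRate;
      return start;
    },
    reset() { nextStart = 0; }, // e.g. after playback is interrupted
  };
}
```

In a browser, one option is to copy each decoded chunk into an AudioBuffer created with audioContext.createBuffer(1, samples.length, 24000), attach it to a buffer source, and start it at clock.schedule(samples.length, audioContext.currentTime).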

You also get the words as text, alongside the audio:

JSON

{ "type": "response.output_audio_transcript.delta", "delta": "നമസ്കാരം" }
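The transcript arrives in pieces, so a caption or chat bubble needs the deltas joined in arrival order. A small sketch, where createTranscriptCollector is a hypothetical helper and only the type and delta fields shown above are relied on:

```javascript
// Hypothetical collector: joins transcript deltas into the text spoken so far.
function createTranscriptCollector() {
  let text = "";
  return {
    // Feed every incoming event; events of other types are ignored.
    // Returns the text accumulated so far.
    handle(event) {
      if (event.type === "response.output_audio_transcript.delta") {
        text += event.delta;
      }
      return text;
    },
    getText() { return text; },
    clear() { text = ""; }, // call when a new reply begins
  };
}
```

Pass each parsed event to handle and render the returned string.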

To capture both sides as text — what the user said and what swaram said back — see Transcripts.

Tips

Chunk size. It doesn't have to be exact: a few hundred milliseconds of audio per append is fine, and odd-sized or short chunks are handled gracefully.
Play in order. Buffer the delta pieces and play them back-to-back for smooth speech.
Resampling. If your mic captures at another rate, such as 16 kHz or 48 kHz, resample to 24 kHz before sending. In the browser, an AudioWorklet is the usual way to capture and resample.
Barge-in. If the user speaks while swaram is talking, stop your playback right away — swaram stops generating too. See Turn-taking.
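To illustrate the resampling tip, here is a minimal linear-interpolation resampler. The name resampleLinear is hypothetical, and this is a simplification: it applies no low-pass filter, so downsampling from 48 kHz can alias, and resampling chunk by chunk ignores the sample on the other side of each chunk boundary. A production pipeline would likely want a filtered resampler.

```javascript
// Hypothetical helper: resample mono Float32 audio by linear interpolation.
// No anti-aliasing filter is applied; treat this as a sketch only.
function resampleLinear(input, fromRate, toRate = 24000) {
  if (fromRate === toRate) return Float32Array.from(input);
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const output = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    // Clamp so the last output sample does not read past the end.
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    output[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return output;
}
```

For example, resampleLinear(chunk, 48000) halves the sample count of a 48 kHz chunk, and resampleLinear(chunk, 16000) produces one and a half times as many samples from a 16 kHz chunk.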
