Audio
Audio in both directions is 16-bit PCM, 24 kHz, mono, little-endian, sent as base64 inside JSON events.
Sending the user's voice
Capture the microphone, resample to 24 kHz mono PCM16, and send it in small chunks as you go:
{ "type": "input_audio_buffer.append", "audio": "<base64 PCM16 @ 24 kHz>" }
Send chunks continuously while the user speaks — there's no need to wait for them to finish. swaram detects when they stop and replies on its own (see Turn-taking).
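As a minimal sketch of the encoding step, the snippet below converts a chunk of Float32 samples (the format Web Audio produces) into little-endian PCM16, base64-encodes it, and wraps it in the append event shown above. It assumes the chunk has already been resampled to 24 kHz mono. The helper names `floatToPcm16Bytes`, `bytesToBase64`, and `buildAppendEvent` are illustrative, not part of swaram.

```javascript
// Convert Float32 samples in [-1, 1] to 16-bit little-endian PCM bytes.
function floatToPcm16Bytes(samples) {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Scale asymmetrically: -1 maps to -32768, +1 maps to 32767.
    const value = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
    view.setInt16(i * 2, value, true); // true = little-endian
  }
  return bytes;
}

// Base64-encode raw bytes using the built-in btoa.
function bytesToBase64(bytes) {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Build the JSON text for one input_audio_buffer.append event.
function buildAppendEvent(samples) {
  return JSON.stringify({
    type: "input_audio_buffer.append",
    audio: bytesToBase64(floatToPcm16Bytes(samples)),
  });
}
```

With an open connection (a hypothetical `socket` object here), each captured chunk would be sent as `socket.send(buildAppendEvent(chunk))`.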
Clean capture (echo cancellation)
swaram judges only the audio you send it. If the model's own voice — playing out of the speaker — leaks back into the microphone, swaram hears it as the user and talks over itself or replies to its own words. Make sure you send only the user's voice:
- Enable echo cancellation when you capture. In the browser, set it on getUserMedia:
const stream = await navigator.mediaDevices.getUserMedia({
audio: { channelCount: 1, echoCancellation: true, noiseSuppression: true, autoGainControl: true },
});
- Headphones remove the problem entirely — the speaker never reaches the mic.
- On a phone call, the device and network usually cancel echo for you.
- Where you can't fully cancel it on open speakers, use push-to-talk or your own voice detection so the mic is only open while the user is speaking — see Turn-taking.
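For the "your own voice detection" option, one rough approach is an energy gate: forward a chunk only when its loudness crosses a threshold, and keep the gate open for a few chunks afterwards so word endings are not clipped. The sketch below is an assumption-laden illustration, not a real voice activity detector. The names `rms` and `createVoiceGate`, and the default `threshold` and `hangoverChunks` values, are hypothetical and would need tuning. An energy gate cannot tell the user's voice from speaker echo, so the threshold has to sit above the level of any leaked playback.

```javascript
// Root-mean-square level of a chunk of Float32 samples in [-1, 1].
function rms(samples) {
  if (samples.length === 0) return 0;
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

// Returns a function that decides, chunk by chunk, whether to send audio.
// threshold and hangoverChunks are illustrative defaults, not recommendations.
function createVoiceGate({ threshold = 0.02, hangoverChunks = 5 } = {}) {
  // Number of consecutive quiet chunks seen; Infinity means "never heard speech".
  let quietChunks = Infinity;
  return function shouldSend(samples) {
    if (rms(samples) >= threshold) {
      quietChunks = 0; // loud chunk: open the gate
    } else {
      quietChunks += 1; // quiet chunk: count towards closing
    }
    return quietChunks <= hangoverChunks;
  };
}
```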
Playing the reply
The reply streams back as a series of audio deltas. Decode the base64 and play the PCM16 pieces in order:
{
"type": "response.output_audio.delta",
"response_id": "resp_…",
"item_id": "item_…",
"delta": "<base64 PCM16 @ 24 kHz>"
}
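A minimal sketch of the decoding step: it takes one delta event, decodes the base64 `delta` field into bytes, reads them as little-endian PCM16, and returns Float32 samples in [-1, 1), which is what the Web Audio API expects for playback. The function name `decodeAudioDelta` is illustrative, and the sketch assumes each delta carries a whole number of 16-bit samples.

```javascript
// Decode one response.output_audio.delta event into Float32 samples.
function decodeAudioDelta(event) {
  // atob yields a binary string: one character per byte.
  const binary = atob(event.delta);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  const view = new DataView(bytes.buffer);
  const samples = new Float32Array(bytes.length >> 1); // 2 bytes per sample
  for (let i = 0; i < samples.length; i++) {
    // true = little-endian; divide by 32768 to normalise to [-1, 1).
    samples[i] = view.getInt16(i * 2, true) / 0x8000;
  }
  return samples;
}
```

Reading through a `DataView` with an explicit little-endian flag keeps the result independent of the host machine's byte order.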
You also get the words as text, alongside the audio:
{ "type": "response.output_audio_transcript.delta", "delta": "നമസ്കാരം" }
To capture both sides as text — what the user said and what swaram said back — see Transcripts.
Tips
- Chunk size doesn't have to be exact: a few hundred milliseconds of audio per append is fine. Odd/short chunks are handled gracefully.
- Play in order. Buffer the delta pieces and play them back-to-back for smooth speech.
- Resampling. If your mic captures at 16 kHz or 48 kHz, resample to 24 kHz before sending. In the browser, an AudioWorklet is the usual way to capture and downsample.
- Barge-in. If the user speaks while swaram is talking, stop your playback right away; swaram stops generating too. See Turn-taking.
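To illustrate the resampling tip, here is a rough sketch of a linear-interpolation resampler that converts a Float32 chunk from one sample rate to another, for example 48 kHz or 16 kHz to 24 kHz. The function name `resampleLinear` is illustrative. Linear interpolation is the simplest option, not necessarily the best: when downsampling, a production pipeline would typically low-pass filter first to limit aliasing, and this sketch also treats each chunk independently rather than carrying state across chunk boundaries.

```javascript
// Resample mono Float32 audio from fromRate to toRate by linear interpolation.
function resampleLinear(input, fromRate, toRate) {
  if (fromRate === toRate) return Float32Array.from(input);
  const ratio = fromRate / toRate; // input samples per output sample
  const outLength = Math.floor(input.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio; // fractional position in the input
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1); // clamp at the last sample
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}
```

For a 48 kHz capture, `resampleLinear(chunk, 48000, 24000)` keeps every second sample; for 16 kHz, `resampleLinear(chunk, 16000, 24000)` interpolates three output samples for every two input samples.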