# Audio

Audio in both directions is **16-bit PCM, 24 kHz, mono, little-endian**, sent as
**base64** inside JSON events.

## Sending the user's voice

Capture the microphone, resample to 24 kHz mono PCM16, and send it in small
chunks as you go:

```json
{ "type": "input_audio_buffer.append", "audio": "<base64 PCM16 @ 24 kHz>" }
```

Send chunks continuously while the user speaks — there's no need to wait for them
to finish. swaram detects when they stop and replies on its own (see
[Turn-taking](turn-taking.html)).
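
The capture steps above can be sketched as follows. This is a minimal illustration, not swaram's own client code: the helper names (`resampleLinear`, `floatToPcm16`, `bytesToBase64`, `makeAppendEvent`) are hypothetical, and it assumes your capture code hands you mono `Float32Array` samples in the range -1 to 1, as the Web Audio API does. The linear-interpolation resampler is the simplest option and applies no low-pass filter, so a production pipeline may want a proper filter when downsampling from 48 kHz.

```javascript
// Resample mono Float32 samples by linear interpolation (no anti-alias filter).
function resampleLinear(input, fromRate, toRate = 24000) {
  if (fromRate === toRate) return Float32Array.from(input);
  const outLength = Math.round((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = (i * fromRate) / toRate; // position in the input signal
    const i0 = Math.min(Math.floor(pos), input.length - 1);
    const i1 = Math.min(i0 + 1, input.length - 1);
    out[i] = input[i0] + (input[i1] - input[i0]) * (pos - i0);
  }
  return out;
}

// Convert Float32 samples in [-1, 1] to 16-bit little-endian PCM bytes.
function floatToPcm16(samples) {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    // -1 maps to -32768 and +1 to 32767; `true` selects little-endian.
    view.setInt16(i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
  }
  return bytes;
}

// Base64-encode raw bytes (btoa works on binary strings).
function bytesToBase64(bytes) {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

// Build the JSON text for one `input_audio_buffer.append` event.
function makeAppendEvent(samples, captureRate = 24000) {
  const pcm = floatToPcm16(resampleLinear(samples, captureRate));
  return JSON.stringify({ type: "input_audio_buffer.append", audio: bytesToBase64(pcm) });
}
```

Each string returned by `makeAppendEvent` can be sent as-is over whatever connection you hold to swaram. Resampling chunk by chunk like this can leave tiny discontinuities at chunk boundaries; keeping the last input sample between calls avoids that if it matters for your use case.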

## Clean capture (echo cancellation)

swaram judges only the audio you send it. If the model's own voice — playing out
of the speaker — leaks back into the microphone, swaram hears it as the user and
talks over itself or replies to its own words. Make sure you send **only the
user's voice**:

- **Enable echo cancellation when you capture.** In the browser, set it on
  `getUserMedia`:

```js
const stream = await navigator.mediaDevices.getUserMedia({
  audio: { channelCount: 1, echoCancellation: true, noiseSuppression: true, autoGainControl: true },
});
```

- **Headphones** remove the problem entirely — the speaker never reaches the mic.
- On a **phone call**, the device and network usually cancel echo for you.
- Where you can't fully cancel it on open speakers, use **push-to-talk** or your
  own voice detection so the mic is only open while the user is speaking — see
  [Turn-taking](turn-taking.html).
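
As one way to picture the push-to-talk option, here is a small gate that forwards audio events only while a button is held. It is a sketch under assumptions: `createPushToTalk` and the `send` callback are hypothetical names, and `send` stands in for whatever function writes an event to your swaram connection.

```javascript
// Forward captured audio only while the talk button is held down.
// `send` is a placeholder for your own function that transmits one event.
function createPushToTalk(send) {
  let held = false;
  return {
    press() { held = true; },
    release() { held = false; },
    // Call this for every captured chunk; chunks are dropped while the
    // button is up, so speaker leakage never reaches swaram.
    onChunk(event) {
      if (held) send(event);
      return held;
    },
  };
}
```

In a browser you might call `press()` on `pointerdown` and `release()` on `pointerup` of a talk button.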

## Playing the reply

The reply streams back as a series of audio deltas. Decode the base64 and play
the PCM16 pieces in order:

```json
{
  "type": "response.output_audio.delta",
  "response_id": "resp_…",
  "item_id": "item_…",
  "delta": "<base64 PCM16 @ 24 kHz>"
}
```

You also get the words as text, alongside the audio:

```json
{ "type": "response.output_audio_transcript.delta", "delta": "നമസ്കാരം" }
```

To capture **both** sides as text — what the user said *and* what swaram said back —
see [Transcripts](transcripts.html).

## Tips

- **Chunk size** doesn't have to be exact. A few hundred milliseconds of audio
  per `append` is fine; at 24 kHz mono PCM16, 100 ms is 2,400 samples (4,800
  bytes before base64). Short or odd-sized chunks are handled gracefully.
- **Play in order.** Buffer the `delta` pieces and play them back-to-back for
  smooth speech.
- **Resampling.** If your mic captures at another rate, such as 16 kHz or
  48 kHz, resample to 24 kHz before sending. In the browser, an `AudioWorklet`
  is the usual way to capture and resample.
- **Barge-in.** If the user speaks while swaram is talking, stop your playback
  right away — swaram stops generating too. See [Turn-taking](turn-taking.html).
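
The play-in-order and barge-in tips could be combined in a small player like the one below. It is a sketch, not a prescribed implementation: `createPlayer` is a hypothetical name, it assumes a Web Audio `AudioContext` and chunks already decoded to `Float32Array`, and it schedules each chunk to start exactly where the previous one ends.

```javascript
// Queue decoded chunks back-to-back on a Web Audio context.
function createPlayer(audioCtx, sampleRate = 24000) {
  let nextStart = 0; // context time at which the next chunk should begin
  const active = new Set(); // sources that are scheduled or playing
  return {
    enqueue(samples) {
      if (samples.length === 0) return null; // createBuffer rejects length 0
      const buffer = audioCtx.createBuffer(1, samples.length, sampleRate);
      buffer.copyToChannel(samples, 0);
      const source = audioCtx.createBufferSource();
      source.buffer = buffer;
      source.connect(audioCtx.destination);
      source.onended = () => active.delete(source);
      // Never schedule in the past: after a gap, restart from "now".
      const startAt = Math.max(nextStart, audioCtx.currentTime);
      source.start(startAt);
      active.add(source);
      nextStart = startAt + samples.length / sampleRate;
      return startAt;
    },
    // Barge-in: silence everything that is playing or still queued.
    stop() {
      for (const source of active) {
        source.onended = null;
        source.stop();
      }
      active.clear();
      nextStart = 0;
    },
  };
}
```

Call `enqueue` for each decoded delta as it arrives, and `stop` as soon as you learn the user has started speaking.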
