
Turn-taking & barge-in

swaram handles turn-taking for you. It detects when the user stops speaking and replies on its own, and this works the same way in both modes (Simple and Premium).

Automatic turns

Just stream the user's audio with input_audio_buffer.append. As they speak and then pause, you'll receive these events in order:

input_audio_buffer.speech_started when they begin,
input_audio_buffer.speech_stopped when they stop,
a response.created, then the audio and transcript deltas, then response.done.
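To make the event order above concrete, here is a minimal sketch of a client-side turn tracker. Only the event names come from this page; the `TurnPhase` type and the `nextPhase` and `run` helpers are hypothetical names introduced for illustration, and the sketch assumes each server event arrives with its type available as a string.

```typescript
// Hypothetical client-side phases; swaram itself does not define these names.
type TurnPhase = "idle" | "user_speaking" | "awaiting_response" | "assistant_responding";

// Map one incoming event type onto the next phase. Unknown events leave the phase unchanged.
function nextPhase(phase: TurnPhase, eventType: string): TurnPhase {
  switch (eventType) {
    case "input_audio_buffer.speech_started":
      return "user_speaking";
    case "input_audio_buffer.speech_stopped":
      return "awaiting_response";
    case "response.created":
      return "assistant_responding";
    case "response.done":
      // If the user has already cut in, stay in the phase that speech_started set.
      return phase === "assistant_responding" ? "idle" : phase;
    default:
      return phase;
  }
}

// Fold a whole event sequence into the final phase, starting from "idle".
function run(eventTypes: string[]): TurnPhase {
  let phase: TurnPhase = "idle";
  for (let i = 0; i < eventTypes.length; i++) {
    phase = nextPhase(phase, eventTypes[i]);
  }
  return phase;
}
```

Keeping the tracker as a pure function makes it easy to drive UI state (for example a "listening" indicator) without tying it to any particular socket library.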

You don't need to tell swaram when a turn ends. In Simple mode, the input_audio_buffer.commit and response.create events are optional nudges for finer control. Premium always uses automatic turn detection, so it ignores them.
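As a rough illustration of the optional nudge, the sketch below ends a turn by hand in Simple mode and does nothing in Premium. Several details are assumptions not confirmed by this page: that events are sent as JSON objects with a `type` field, that commit is sent before response.create, and the `Mode` type and `endTurnManually` helper, which are hypothetical.

```typescript
// Hypothetical mode labels for illustration; use whatever your session setup exposes.
type Mode = "simple" | "premium";

// Sends the two optional events in Simple mode and reports whether anything was sent.
// `send` stands in for your transport, for example a WebSocket's send method.
function endTurnManually(mode: Mode, send: (message: string) => void): boolean {
  if (mode !== "simple") {
    // Premium ignores these events, so skip the round trip entirely.
    return false;
  }
  // Assumed wire format: a JSON object whose `type` is the event name.
  send(JSON.stringify({ type: "input_audio_buffer.commit" }));
  send(JSON.stringify({ type: "response.create" }));
  return true;
}
```

A push-to-talk button release is one natural place to call a helper like this, since the app already knows the user has finished.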

Barge-in (interrupting)

If the user starts speaking while swaram is talking, swaram stops and listens: the in-flight reply is cancelled and its audio stops immediately.

For the smoothest feel, have your app stop its own playback the moment the user starts speaking. You'll know from input_audio_buffer.speech_started (or your own client-side voice detection). You can also send response.cancel to interrupt explicitly.

Tip. Real-time speech feels best when both sides stop instantly. Drop any queued audio you haven't played yet as soon as the user cuts in.
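One way to drop queued audio on barge-in is to keep unplayed chunks in a queue that can be flushed in one call. The sketch below is a minimal version of that idea. `PlaybackQueue` and `onServerEvent` are hypothetical names, and the JSON shape used for response.cancel (an object with a `type` field) is an assumption about the wire format.

```typescript
// Holds audio chunks that have been received but not yet handed to the audio output.
class PlaybackQueue {
  private chunks: Uint8Array[] = [];

  enqueue(chunk: Uint8Array): void {
    this.chunks.push(chunk);
  }

  // Returns the next chunk to play, or undefined when the queue is empty.
  next(): Uint8Array | undefined {
    return this.chunks.shift();
  }

  // Discards everything still waiting and reports how many chunks were dropped.
  flush(): number {
    const dropped = this.chunks.length;
    this.chunks = [];
    return dropped;
  }

  pending(): number {
    return this.chunks.length;
  }
}

// On barge-in, drop unplayed audio and explicitly cancel the reply.
function onServerEvent(
  eventType: string,
  queue: PlaybackQueue,
  send: (message: string) => void,
): void {
  if (eventType !== "input_audio_buffer.speech_started") {
    return;
  }
  queue.flush();
  // Optional explicit interrupt; the payload shape here is assumed.
  send(JSON.stringify({ type: "response.cancel" }));
}
```

In a real player you would also stop whichever chunk is currently sounding, since flushing only removes audio that has not started playing yet.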

Send only the user's voice

swaram runs turn detection on whatever audio you send, so you don't mark turn boundaries yourself. But it can only judge what it receives — so do a little on your side to keep it hearing the user and nothing else:

Gate the microphone. Use push-to-talk (capture only while a button is held or toggled on) or your own lightweight voice activity detection, so the model's playback and background noise aren't streamed back as if the user were speaking.
Cancel echo at capture. Turn on the browser's echo cancellation and noise suppression — see Audio. On open speakers this is what stops the model interrupting itself; headphones sidestep it completely.
Stop your playback the instant the user speaks (see Barge-in above) for the snappiest interruption.
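The first two points can be sketched as a capture configuration plus a small gate that decides whether each microphone frame is worth sending. `echoCancellation` and `noiseSuppression` are standard browser media track constraints. The `MicGate` class, its energy threshold, and its hangover length are hypothetical illustration values rather than swaram recommendations, and the sketch assumes frames arrive as `Float32Array` samples in the range -1 to 1.

```typescript
// In a browser, pass this object to navigator.mediaDevices.getUserMedia.
const captureConstraints = {
  audio: { echoCancellation: true, noiseSuppression: true },
};

// Root-mean-square level of one frame, used as a crude loudness measure.
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < frame.length; i++) {
    sum += frame[i] * frame[i];
  }
  return Math.sqrt(sum / frame.length);
}

// Lightweight gate: pass frames while push-to-talk is held or the level is above
// the threshold, plus a few trailing frames so word endings are not clipped.
class MicGate {
  private readonly threshold: number;
  private readonly hangoverFrames: number;
  private hangoverLeft = 0;

  constructor(threshold = 0.02, hangoverFrames = 5) {
    this.threshold = threshold;
    this.hangoverFrames = hangoverFrames;
  }

  shouldSend(frame: Float32Array, pushToTalkHeld = false): boolean {
    if (pushToTalkHeld || rms(frame) >= this.threshold) {
      this.hangoverLeft = this.hangoverFrames;
      return true;
    }
    if (this.hangoverLeft > 0) {
      this.hangoverLeft -= 1;
      return true;
    }
    return false;
  }
}
```

Only frames for which `shouldSend` returns true would be streamed with input_audio_buffer.append. A plain energy threshold is the simplest possible detector, so expect to tune it per device or swap in a stronger voice activity detector in noisy rooms.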
