Transcripts
Every conversation gives you the text of both sides — what the caller said and what swaram said back — alongside the audio. Use it to log conversations, show live captions, run analytics, or debug.
There are two streams, delivered by two events:
| You want | Event | The text |
|---|---|---|
| What the user said | conversation.item.input_audio_transcription.completed | event.transcript (the whole turn) |
| What swaram said | response.output_audio_transcript.delta | event.delta (a piece; accumulate) |
Both work identically in Simple and Premium — your handling doesn't change between modes.
What the user said (input transcript)
When the user finishes a turn, swaram transcribes their speech and sends a single event with the complete text of that turn:
{
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_…",
  "transcript": "കേരളത്തിന്റെ തലസ്ഥാനം ഏതാണ്?"
}
It's one event per user turn (not streamed in pieces), so transcript is the
full thing — just read it.
The same text also rides on the conversation.item.created event for that turn (item.role === "user", in item.content[].text) if you prefer to track items, but …input_audio_transcription.completed above is the simplest, standard way.
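As a minimal sketch of the two routes just described, the helper below pulls the user's text out of either event. The name `user_text_from_event` is hypothetical and not part of swaram. It assumes the item event carries the text in `item.content[].text` as stated above, and that content parts without a `text` field can simply be skipped.

```python
from typing import Optional


def user_text_from_event(event: dict) -> Optional[str]:
    """Return the user's transcript carried by an event, or None.

    Hypothetical helper for illustration; not part of swaram's API.
    """
    event_type = event.get("type")

    # Simplest route: one completed event per user turn, full text in `transcript`.
    if event_type == "conversation.item.input_audio_transcription.completed":
        return event.get("transcript")

    # Alternative route: the item created for the user's turn.
    if event_type == "conversation.item.created":
        item = event.get("item", {})
        if item.get("role") != "user":
            return None
        # Assumption: content parts are dicts and only some of them carry `text`.
        parts = [part.get("text") for part in item.get("content", [])]
        text = "".join(p for p in parts if p)
        return text or None

    return None
```

If you log from this helper, handle only one of the two event types per turn; otherwise the same user text is recorded twice.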
What swaram said (output transcript)
The reply's text streams back as deltas, in step with the audio. Concatenate the deltas to get the full reply:
{ "type": "response.output_audio_transcript.delta", "delta": "തിരുവനന്തപുരം" }
A reply is framed by response.created … (audio + transcript deltas) …
response.done, so accumulate the deltas until response.done for one complete
model turn.
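The framing above can be sketched as a small accumulator that resets on response.created, appends each delta, and hands back the full text on response.done. The class name `ReplyAccumulator` and its `feed` method are illustrative assumptions, not part of swaram; events are assumed to be already decoded into dicts.

```python
from typing import List, Optional


class ReplyAccumulator:
    """Collects transcript deltas for one model turn (illustrative only)."""

    def __init__(self) -> None:
        self._pieces: List[str] = []

    def feed(self, event: dict) -> Optional[str]:
        """Feed one decoded event; return the full reply when the turn ends."""
        event_type = event.get("type")
        if event_type == "response.created":
            self._pieces = []  # a new reply begins
        elif event_type == "response.output_audio_transcript.delta":
            self._pieces.append(event["delta"])
        elif event_type == "response.done":
            text = "".join(self._pieces)
            self._pieces = []
            return text or None  # None when the turn carried no transcript text
        return None
```

Collecting pieces in a list and joining once at the end avoids repeated string copies on long replies; for short replies plain `+=` works just as well.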
Putting both together — a conversation log
Capture each user turn and each model turn into a running log:
import json

conversation = []  # [{"role": "user"|"assistant", "text": ...}]
reply = ""

# Assumes `ws` is an open WebSocket connection and that this loop
# runs inside an async function.
async for raw in ws:
    event = json.loads(raw)
    t = event["type"]
    if t == "conversation.item.input_audio_transcription.completed":
        conversation.append({"role": "user", "text": event["transcript"]})
    elif t == "response.output_audio_transcript.delta":
        reply += event["delta"]  # accumulate the model's words
    elif t == "response.done":
        if reply:
            conversation.append({"role": "assistant", "text": reply})
        reply = ""
// Node.js (event-emitter style client, e.g. the ws package)
const conversation = []; // [{ role: "user"|"assistant", text }]
let reply = "";

ws.on("message", (data) => {
  const event = JSON.parse(data);
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    conversation.push({ role: "user", text: event.transcript });
  } else if (event.type === "response.output_audio_transcript.delta") {
    reply += event.delta; // accumulate the model's words
  } else if (event.type === "response.done") {
    if (reply) conversation.push({ role: "assistant", text: reply });
    reply = "";
  }
});
// Browser (native WebSocket)
const conversation = []; // [{ role: "user"|"assistant", text }]
let reply = "";

ws.onmessage = (e) => {
  const event = JSON.parse(e.data);
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    conversation.push({ role: "user", text: event.transcript });
  } else if (event.type === "response.output_audio_transcript.delta") {
    reply += event.delta; // accumulate the model's words
  } else if (event.type === "response.done") {
    if (reply) conversation.push({ role: "assistant", text: reply });
    reply = "";
  }
};
Turn structure
For a normal voice turn the events arrive in this order:
1. input_audio_buffer.speech_started / speech_stopped: the user's turn boundaries
2. conversation.item.input_audio_transcription.completed: what they said
3. response.created: the reply begins
4. response.output_audio.delta + response.output_audio_transcript.delta: audio + words
5. response.done: the reply is complete
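When debugging, it can help to check a captured turn against the order listed above. The following is a rough sketch only: `TURN_STEP` and `first_out_of_order` are hypothetical names, the check covers a single normal voice turn, and it assumes speech_stopped shares the input_audio_buffer. prefix.

```python
from typing import Iterable, Optional

# Step number for each event type, following the order listed above.
# Assumption: `speech_stopped` shares the `input_audio_buffer.` prefix.
TURN_STEP = {
    "input_audio_buffer.speech_started": 1,
    "input_audio_buffer.speech_stopped": 1,
    "conversation.item.input_audio_transcription.completed": 2,
    "response.created": 3,
    "response.output_audio.delta": 4,
    "response.output_audio_transcript.delta": 4,
    "response.done": 5,
}


def first_out_of_order(event_types: Iterable[str]) -> Optional[str]:
    """Return the first event type that arrives after a later step, else None."""
    highest = 0
    for event_type in event_types:
        step = TURN_STEP.get(event_type)
        if step is None:
            continue  # ignore events this sketch does not know about
        if step < highest:
            return event_type
        highest = step
    return None
```

Pass it the event types of one turn at a time; feeding several turns in a row would flag the start of the second turn, because the step counter is never reset.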
Notes & edge cases
- Silent turns. If the user's audio has no speech, there's nothing to transcribe and no input-transcript event for that turn.
- Tool calls. When swaram calls one of your tools it emits response.function_call_arguments.done instead of an audio transcript, so that turn has no spoken reply until you return the result.
- Barge-in. If the user interrupts mid-reply, the model transcript so far is partial and the turn ends with response.done (status cancelled). Keep what you accumulated.
- Same in both modes. Simple and Premium emit identical events.
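One possible way to carry these edge cases into a log is to flag interrupted replies instead of dropping them. The sketch below is illustrative: `response_status` and `log_turns` are hypothetical helpers, and where the status lives on the response.done payload is an assumption (the text above only says the turn ends with status cancelled), so the lookup tries both the top level and a nested `response` object.

```python
def response_status(event: dict):
    """Best-effort lookup of the status on a response.done event.

    Assumption: the status sits either at the top level or under `response`.
    """
    return event.get("status") or event.get("response", {}).get("status")


def log_turns(events):
    """Build a conversation log, flagging replies cut short by barge-in."""
    conversation, reply = [], ""
    for event in events:
        t = event.get("type")
        if t == "conversation.item.input_audio_transcription.completed":
            # Silent turns never produce this event, so nothing is logged for them.
            conversation.append({"role": "user", "text": event["transcript"]})
        elif t == "response.output_audio_transcript.delta":
            reply += event["delta"]
        elif t == "response.done":
            if reply:  # tool-call turns carry no transcript, so they are skipped
                conversation.append({
                    "role": "assistant",
                    "text": reply,
                    "partial": response_status(event) == "cancelled",
                })
            reply = ""
    return conversation
```

Keeping the partial text with a flag preserves what the caller actually heard before interrupting, which tends to be more useful for analytics than discarding the turn.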
Related
- Events reference — every event, both directions.
- Audio — the audio that accompanies these transcripts.
- Turn-taking & barge-in — how turns start and end.