
Transcripts

Every conversation gives you the text of both sides — what the caller said and what swaram said back — alongside the audio. Use it to log conversations, show live captions, run analytics, or debug.

There are two streams, delivered by two events:

| You want | Event | The text |
| --- | --- | --- |
| What the user said | `conversation.item.input_audio_transcription.completed` | `event.transcript`: the whole turn |
| What swaram said | `response.output_audio_transcript.delta` | `event.delta`: a piece; accumulate |

Both work identically in Simple and Premium — your handling doesn't change between modes.

What the user said (input transcript)

When the user finishes a turn, swaram transcribes their speech and sends a single event with the complete text of that turn:

JSON

{
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_…",
  "transcript": "കേരളത്തിന്റെ തലസ്ഥാനം ഏതാണ്?"
}

It's one event per user turn (not streamed in pieces), so the transcript field already holds the complete text of that turn.

The same text also appears on the conversation.item.created event for that turn (item.role === "user", in item.content[].text) if you prefer to track items, but conversation.item.input_audio_transcription.completed above is the simplest, standard way to read it.
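If you do track items, reading the user's text from conversation.item.created might look like the sketch below. It relies only on the fields named above (item.role and item.content[].text). The helper name user_text_from_item_created is hypothetical, and joining several text parts with a space is an assumption, not documented behavior.

```python
from typing import Optional


def user_text_from_item_created(event: dict) -> Optional[str]:
    """Return the user's text from a conversation.item.created event, else None."""
    if event.get("type") != "conversation.item.created":
        return None
    item = event.get("item") or {}
    if item.get("role") != "user":
        return None
    # Collect every content part that carries a non-empty "text" field.
    parts = [
        part["text"]
        for part in item.get("content", [])
        if isinstance(part, dict) and part.get("text")
    ]
    # Assumption: if there are several text parts, join them with a space.
    return " ".join(parts) if parts else None
```

The function returns None for any other event type, for non-user items, and for items without text, so it can be called on every incoming event.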

What swaram said (output transcript)

The reply's text streams back as deltas, in step with the audio. Concatenate the deltas to get the full reply:

JSON

{ "type": "response.output_audio_transcript.delta", "delta": "തിരുവനന്തപുരം" }

A reply is framed by response.created … (audio + transcript deltas) … response.done, so accumulate the deltas until response.done for one complete model turn.
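For live captions, one way to use this framing is to emit the running text after every delta and flag the last update as final. The sketch below works on an iterable of already-parsed events; the function name caption_updates and the (text, is_final) tuple shape are illustrative choices, not part of the API.

```python
from typing import Iterable, Iterator, Tuple


def caption_updates(events: Iterable[dict]) -> Iterator[Tuple[str, bool]]:
    """Yield (caption_so_far, is_final) while a reply streams in."""
    caption = ""
    for event in events:
        kind = event.get("type")
        if kind == "response.created":
            caption = ""                  # a new reply starts: clear the caption
        elif kind == "response.output_audio_transcript.delta":
            caption += event["delta"]     # append the next piece of text
            yield caption, False
        elif kind == "response.done":
            yield caption, True           # the reply is over: caption is final
            caption = ""


events = [
    {"type": "response.created"},
    {"type": "response.output_audio_transcript.delta", "delta": "Thiru"},
    {"type": "response.output_audio_transcript.delta", "delta": "vananthapuram"},
    {"type": "response.done"},
]
updates = list(caption_updates(events))
```

Here updates holds two in-progress captions followed by one final caption with the full reply text.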

Putting both together — a conversation log

Capture each user turn and each model turn into a running log:

Python

import json

conversation = []          # [{"role": "user"|"assistant", "text": ...}]
reply = ""

async for raw in ws:
    event = json.loads(raw)
    t = event["type"]
    if t == "conversation.item.input_audio_transcription.completed":
        conversation.append({"role": "user", "text": event["transcript"]})
    elif t == "response.output_audio_transcript.delta":
        reply += event["delta"]                      # accumulate the model's words
    elif t == "response.done":
        if reply:
            conversation.append({"role": "assistant", "text": reply})
        reply = ""

JavaScript (Node.js)

const conversation = [];   // [{ role: "user"|"assistant", text }]
let reply = "";

ws.on("message", (data) => {
  const event = JSON.parse(data);
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    conversation.push({ role: "user", text: event.transcript });
  } else if (event.type === "response.output_audio_transcript.delta") {
    reply += event.delta;                            // accumulate the model's words
  } else if (event.type === "response.done") {
    if (reply) conversation.push({ role: "assistant", text: reply });
    reply = "";
  }
});

JavaScript (browser)

const conversation = [];   // [{ role: "user"|"assistant", text }]
let reply = "";

ws.onmessage = (e) => {
  const event = JSON.parse(e.data);
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    conversation.push({ role: "user", text: event.transcript });
  } else if (event.type === "response.output_audio_transcript.delta") {
    reply += event.delta;                            // accumulate the model's words
  } else if (event.type === "response.done") {
    if (reply) conversation.push({ role: "assistant", text: reply });
    reply = "";
  }
};

Turn structure

For a normal voice turn the events arrive in this order:

1. input_audio_buffer.speech_started / speech_stopped: the user's turn boundaries
2. conversation.item.input_audio_transcription.completed: what they said
3. response.created: the reply begins
4. response.output_audio.delta + response.output_audio_transcript.delta: audio + words
5. response.done: the reply is complete

Notes & edge cases

Silent turns. If the user's audio has no speech, there's nothing to transcribe and no input-transcript event for that turn.
Tool calls. When swaram calls one of your tools it emits response.function_call_arguments.done instead of an audio transcript — that turn has no spoken reply until you return the result.
Barge-in. If the user interrupts mid-reply, the model transcript so far is partial and the turn ends with response.done (status cancelled). Keep what you accumulated.
Same in both modes. Simple and Premium emit the identical events.
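The notes above can be folded into the logging loop. The sketch below keeps partial replies after a barge-in and skips turns that produced no transcript. Two things in it are assumptions rather than documented behavior: that the cancelled status is found at event["response"]["status"], and that a tool-call turn also ends with response.done. The record helper and the "partial" key are hypothetical names.

```python
def record(conversation: list, state: dict, event: dict) -> None:
    """Update the log for one event; state["reply"] holds the reply in progress."""
    kind = event.get("type")
    if kind == "conversation.item.input_audio_transcription.completed":
        # Silent turns never produce this event, so nothing is logged for them.
        conversation.append({"role": "user", "text": event["transcript"]})
    elif kind == "response.created":
        state["reply"] = ""
    elif kind == "response.output_audio_transcript.delta":
        state["reply"] = state.get("reply", "") + event["delta"]
    elif kind == "response.done":
        # Assumption: the status lives at event["response"]["status"].
        status = (event.get("response") or {}).get("status")
        reply = state.get("reply", "")
        if reply:  # tool-call turns carry no transcript, so they are skipped
            conversation.append(
                {"role": "assistant", "text": reply, "partial": status == "cancelled"}
            )
        state["reply"] = ""


conversation, state = [], {}
for event in [
    {"type": "conversation.item.input_audio_transcription.completed", "transcript": "hi"},
    {"type": "response.created"},
    {"type": "response.output_audio_transcript.delta", "delta": "Hello, how"},
    {"type": "response.done", "response": {"status": "cancelled"}},  # barge-in
    {"type": "response.created"},
    {"type": "response.function_call_arguments.done"},               # tool call
    {"type": "response.done"},  # assumed to close the tool-call turn
]:
    record(conversation, state, event)
```

After this run the log contains the user turn and one assistant entry marked partial; the tool-call turn adds nothing.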

Events reference — every event, both directions.
Audio — the audio that accompanies these transcripts.
Turn-taking & barge-in — how turns start and end.