Transcripts

Every conversation gives you the text of both sides — what the caller said and what swaram said back — alongside the audio. Use it to log conversations, show live captions, run analytics, or debug.

There are two streams, delivered by two events:

| You want | Event | The text |
| --- | --- | --- |
| What the user said | conversation.item.input_audio_transcription.completed | event.transcript (the whole turn) |
| What swaram said | response.output_audio_transcript.delta | event.delta (a piece; accumulate) |

Both work identically in Simple and Premium — your handling doesn't change between modes.

What the user said (input transcript)

When the user finishes a turn, swaram transcribes their speech and sends a single event with the complete text of that turn:

JSON
{
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_…",
  "transcript": "കേരളത്തിന്റെ തലസ്ഥാനം ഏതാണ്?"
}

swaram sends one event per user turn (it is not streamed in pieces), so transcript already holds the full text of the turn and can be read directly.

The same text is also carried on the conversation.item.created event for that turn (item.role === "user", in item.content[].text) if you prefer to track items, but …input_audio_transcription.completed above is the simplest, standard way.
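
As a rough sketch of the two routes described above, the helper below pulls a user turn's text out of either event. The name `user_text` is hypothetical, and the shape of the `item.content[]` entries (a list of objects, each with an optional `text` field) is an assumption drawn from the description above, not a confirmed schema.

```python
from typing import Optional


def user_text(event: dict) -> Optional[str]:
    """Return the user's transcript carried by an event, or None if there is none."""
    etype = event.get("type")

    # Route 1: the dedicated transcription event holds the whole turn.
    if etype == "conversation.item.input_audio_transcription.completed":
        return event.get("transcript")

    # Route 2: the item-created event (assumed shape: item.content is a list
    # of dicts, some of which carry a "text" key).
    if etype == "conversation.item.created":
        item = event.get("item", {})
        if item.get("role") != "user":
            return None
        parts = [part["text"] for part in item.get("content", []) if part.get("text")]
        return "".join(parts) or None

    return None
```

In practice you would likely handle only one of the two events in a given handler; reacting to both would record each user turn twice.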

What swaram said (output transcript)

The reply's text streams back as deltas, in step with the audio. Concatenate the deltas to get the full reply:

JSON
{ "type": "response.output_audio_transcript.delta", "delta": "തിരുവനന്തപുരം" }

A reply is framed by response.created … (audio + transcript deltas) … response.done, so accumulate the deltas until response.done for one complete model turn.
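
The framing above can be sketched as a small generator that turns a stream of already-parsed events into complete replies. The function name `replies` is hypothetical, and resetting the buffer on response.created is a defensive choice made for this sketch, not documented behaviour.

```python
from typing import Iterable, Iterator


def replies(events: Iterable[dict]) -> Iterator[str]:
    """Yield one complete reply per response.created ... response.done span."""
    parts = []  # transcript deltas of the reply in progress
    for event in events:
        etype = event.get("type")
        if etype == "response.created":
            parts = []                    # a new reply begins; start clean
        elif etype == "response.output_audio_transcript.delta":
            parts.append(event["delta"])  # one piece of the reply text
        elif etype == "response.done":
            if parts:                     # skip replies that produced no text
                yield "".join(parts)
            parts = []
```

Collecting the pieces in a list and joining once at response.done avoids repeated string concatenation on long replies; for live captions you could instead display each delta as it arrives.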

Putting both together — a conversation log

Capture each user turn and each model turn into a running log:

Python
import json

conversation = []          # [{"role": "user"|"assistant", "text": ...}]

async def log_conversation(ws):
    """Append each user and model turn read from an open WebSocket `ws`."""
    reply = ""             # text of the model turn in progress
    async for raw in ws:
        event = json.loads(raw)
        t = event["type"]
        if t == "conversation.item.input_audio_transcription.completed":
            conversation.append({"role": "user", "text": event["transcript"]})
        elif t == "response.output_audio_transcript.delta":
            reply += event["delta"]                  # accumulate the model's words
        elif t == "response.done":
            if reply:                                # skip replies with no text
                conversation.append({"role": "assistant", "text": reply})
            reply = ""                               # reset for the next model turn

Node.js
const conversation = [];   // [{ role: "user"|"assistant", text }]
let reply = "";

ws.on("message", (data) => {
  const event = JSON.parse(data);
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    conversation.push({ role: "user", text: event.transcript });
  } else if (event.type === "response.output_audio_transcript.delta") {
    reply += event.delta;                            // accumulate the model's words
  } else if (event.type === "response.done") {
    if (reply) conversation.push({ role: "assistant", text: reply });
    reply = "";
  }
});

Browser
const conversation = [];   // [{ role: "user"|"assistant", text }]
let reply = "";

ws.onmessage = (e) => {
  const event = JSON.parse(e.data);
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    conversation.push({ role: "user", text: event.transcript });
  } else if (event.type === "response.output_audio_transcript.delta") {
    reply += event.delta;                            // accumulate the model's words
  } else if (event.type === "response.done") {
    if (reply) conversation.push({ role: "assistant", text: reply });
    reply = "";
  }
};

Turn structure

For a normal voice turn the events arrive in this order:

  1. input_audio_buffer.speech_started / speech_stopped — the user's turn boundaries
  2. conversation.item.input_audio_transcription.completed — what they said
  3. response.created — the reply begins
  4. response.output_audio.delta + response.output_audio_transcript.delta — audio + words
  5. response.done — the reply is complete
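
When debugging, it can help to check a captured turn against the order listed above. The following is a minimal sketch, assuming the second event in step 1 is named input_audio_buffer.speech_stopped in full; `STAGE` and `follows_normal_order` are hypothetical names. It only checks the sequence of a single normal voice turn and says nothing about other cases.

```python
# Position of each event type within one normal voice turn.
STAGE = {
    "input_audio_buffer.speech_started": 0,
    "input_audio_buffer.speech_stopped": 1,  # assumed full event name
    "conversation.item.input_audio_transcription.completed": 2,
    "response.created": 3,
    "response.output_audio.delta": 4,             # audio and transcript deltas
    "response.output_audio_transcript.delta": 4,  # share a stage and may interleave
    "response.done": 5,
}


def follows_normal_order(event_types):
    """Return True if the known event types never step back to an earlier stage."""
    last = -1
    for etype in event_types:
        stage = STAGE.get(etype)
        if stage is None:
            continue  # ignore event types outside the listed sequence
        if stage < last:
            return False
        last = stage
    return True
```

Pass it the `type` values of one turn's events in arrival order, for example `follows_normal_order(e["type"] for e in turn_events)`, where `turn_events` is whatever list of parsed events you captured.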

Notes & edge cases
