# Transcripts

Every conversation gives you the **text of both sides** — what the caller said and
what swaram said back — alongside the audio. Use it to log conversations, show live
captions, run analytics, or debug.

There are two streams, delivered by two events:

| You want | Event | The text |
|---|---|---|
| **What the user said** | `conversation.item.input_audio_transcription.completed` | `event.transcript` — the whole turn |
| **What swaram said** | `response.output_audio_transcript.delta` | `event.delta` — a piece; accumulate |

Both work identically in **Simple and Premium** — your handling doesn't change
between modes.

## What the user said (input transcript)

When the user finishes a turn, swaram transcribes their speech and sends a single
event with the complete text of that turn:

```json
{
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_…",
  "transcript": "കേരളത്തിന്റെ തലസ്ഥാനം ഏതാണ്?"
}
```

It's **one event per user turn** (not streamed in pieces), so `transcript` is the
full thing — just read it.

> The same text is also included in the `conversation.item.created` event for that turn
> (`item.role === "user"`, in `item.content[].text`) if you prefer to track items,
> but `…input_audio_transcription.completed` above is the simplest, standard way.
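
If you do track items, a minimal sketch of reading the user's text from `conversation.item.created` might look like the following. The helper name `user_text_from_item` is hypothetical, and the code assumes `item.content` is a list of objects, some of which carry a `text` field, as described in the note above. Joining multiple parts with a space is also an assumption.

```python
def user_text_from_item(event):
    """Return the user's text from a conversation.item.created event, or None.

    Assumes item.content is a list of dicts and that text-bearing entries
    expose it under the "text" key; entries without text are skipped.
    """
    if event.get("type") != "conversation.item.created":
        return None
    item = event.get("item", {})
    if item.get("role") != "user":
        return None                      # only user turns are of interest here
    parts = [
        part["text"]
        for part in item.get("content", [])
        if isinstance(part, dict) and part.get("text")
    ]
    # Joining with a space is an assumption for turns with several parts.
    return " ".join(parts) if parts else None
```

Call it on each parsed event and ignore `None` results. If you also handle `…input_audio_transcription.completed`, pick one source per turn so the same text is not logged twice.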

## What swaram said (output transcript)

The reply's text streams back **as deltas**, in step with the audio. Concatenate the
deltas to get the full reply:

```json
{ "type": "response.output_audio_transcript.delta", "delta": "തിരുവനന്തപുരം" }
```

A reply is framed by `response.created` … (audio + transcript deltas) …
`response.done`, so accumulate the deltas until `response.done` for one complete
model turn.
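
For live captions you will usually want the partial text on every delta rather than only at `response.done`. The sketch below shows one way to do that; `CaptionBuffer` and its `on_update` callback are hypothetical names for illustration, not part of swaram.

```python
class CaptionBuffer:
    """Accumulates transcript deltas for one reply and reports the running text."""

    def __init__(self, on_update):
        self.on_update = on_update   # called as on_update(text, final)
        self.text = ""

    def handle(self, event):
        t = event.get("type")
        if t == "response.created":
            self.text = ""                       # a new reply starts empty
        elif t == "response.output_audio_transcript.delta":
            self.text += event["delta"]
            self.on_update(self.text, False)     # partial caption so far
        elif t == "response.done":
            self.on_update(self.text, True)      # no more deltas for this reply
```

A possible usage is `captions = CaptionBuffer(lambda text, final: print(text))`, followed by `captions.handle(json.loads(raw))` for every incoming message.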

## Putting both together — a conversation log

Capture each user turn and each model turn into a running log:

```python
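import json

conversation = []          # [{"role": "user"|"assistant", "text": ...}]
reply = ""                 # transcript of the reply currently streaming

# `ws` is your open WebSocket connection; run this loop inside a coroutine.
async for raw in ws: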
    event = json.loads(raw)
    t = event["type"]
    if t == "conversation.item.input_audio_transcription.completed":
        conversation.append({"role": "user", "text": event["transcript"]})
    elif t == "response.output_audio_transcript.delta":
        reply += event["delta"]                      # accumulate the model's words
    elif t == "response.done":
        if reply:
            conversation.append({"role": "assistant", "text": reply})
        reply = ""
```

```node
const conversation = [];   // [{ role: "user"|"assistant", text }]
let reply = "";

ws.on("message", (data) => {
  const event = JSON.parse(data);
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    conversation.push({ role: "user", text: event.transcript });
  } else if (event.type === "response.output_audio_transcript.delta") {
    reply += event.delta;                            // accumulate the model's words
  } else if (event.type === "response.done") {
    if (reply) conversation.push({ role: "assistant", text: reply });
    reply = "";
  }
});
```

```browser
const conversation = [];   // [{ role: "user"|"assistant", text }]
let reply = "";

ws.onmessage = (e) => {
  const event = JSON.parse(e.data);
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    conversation.push({ role: "user", text: event.transcript });
  } else if (event.type === "response.output_audio_transcript.delta") {
    reply += event.delta;                            // accumulate the model's words
  } else if (event.type === "response.done") {
    if (reply) conversation.push({ role: "assistant", text: reply });
    reply = "";
  }
};
```

## Turn structure

For a normal voice turn the events arrive in this order:

1. `input_audio_buffer.speech_started` / `speech_stopped` — the user's turn boundaries
2. `conversation.item.input_audio_transcription.completed` — what they said
3. `response.created` — the reply begins
4. `response.output_audio.delta` + `response.output_audio_transcript.delta` — audio + words
5. `response.done` — the reply is complete
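
When debugging, it can help to check a captured event stream against the order above. The following is a rough sketch, not a validator for every case: `TURN_STEP` and `out_of_order` are hypothetical names, the full name `input_audio_buffer.speech_stopped` is assumed from the shorthand in step 1, and silent turns, tool calls, and barge-in legitimately deviate from this sequence.

```python
# Step number of each event type in a normal voice turn, per the list above.
TURN_STEP = {
    "input_audio_buffer.speech_started": 1,
    "input_audio_buffer.speech_stopped": 1,   # assumed full name of `speech_stopped`
    "conversation.item.input_audio_transcription.completed": 2,
    "response.created": 3,
    "response.output_audio.delta": 4,
    "response.output_audio_transcript.delta": 4,
    "response.done": 5,
}

def out_of_order(event_types):
    """Return (index, type) pairs that arrive after a later step of the same turn.

    Only ordering is checked; missing steps are not reported.
    """
    problems, last_step = [], 0
    for i, t in enumerate(event_types):
        step = TURN_STEP.get(t)
        if step is None:
            continue                           # not in the list above; ignore
        if step < last_step:
            problems.append((i, t))
        last_step = 0 if step == 5 else step   # response.done closes the turn
    return problems
```

Feed it the `type` of each event in arrival order; an empty result means the stream matched the normal sequence.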

## Notes & edge cases

- **Silent turns.** If the user's audio has no speech, there's nothing to
  transcribe and no input-transcript event for that turn.
- **Tool calls.** When swaram calls one of your [tools](tools.html) it emits
  `response.function_call_arguments.done` instead of an audio transcript — that turn
  has no spoken reply until you return the result.
- **Barge-in.** If the user interrupts mid-reply, the model transcript so far is
  partial and the turn ends with `response.done` (status `cancelled`). Keep what you
  accumulated.
- **Same in both modes.** Simple and Premium emit the identical events.
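
The barge-in and tool-call notes can be folded into the logging step. The sketch below flags cut-off replies instead of treating them as complete. Both helper names are hypothetical, and the location of the status field is an assumption: the code looks at `event["status"]` and `event["response"]["status"]`, so check the events reference for the actual shape.

```python
def reply_status(done_event):
    """Best-effort read of the status on a response.done event.

    Assumption: the status sits at event["status"] or event["response"]["status"].
    """
    response = done_event.get("response")
    nested = response.get("status") if isinstance(response, dict) else None
    return done_event.get("status") or nested

def finish_reply(conversation, reply, done_event):
    """Append the accumulated reply on response.done, flagging barge-in cut-offs."""
    if not reply:
        return   # tool-call turn or no spoken text: nothing to log
    conversation.append({
        "role": "assistant",
        "text": reply,                                        # keep what was accumulated
        "partial": reply_status(done_event) == "cancelled",   # interrupted mid-reply
    })
```

Call `finish_reply(conversation, reply, event)` in the `response.done` branch of your handler, then reset `reply` to an empty string as before.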

## Related

- [Events reference](events.html) — every event, both directions.
- [Audio](audio.html) — the audio that accompanies these transcripts.
- [Turn-taking & barge-in](turn-taking.html) — how turns start and end.