OpenAI Realtime Out Of Band Transcription

Realtime Out Of Band Transcription

chatgptopenaigpt-4examplesopenai-apiopenai-cookbook

alph-notebooks/openai-cookbook / Realtime_out_of_band_transcription.ipynb

Export

Run Notebooks

Contents

No cells yet

Add cells to see them here

Transcribing User Audio with a Separate Realtime Request

Purpose: This notebook demonstrates how to use the Realtime model itself to accurately transcribe user audio out-of-band using the same websocket session connection, avoiding errors and inconsistencies common when relying on a separate transcription model (gpt-4o-transcribe/whisper-1).

We call this out-of-band transcription using the Realtime model. It’s simply a second response.create request on the same Realtime WebSocket, tagged so it doesn’t write back to the active conversation state. The model runs again with a different set of instructions (a transcription prompt), triggering a new inference pass that’s separate from the assistant’s main speech turn.

It covers how to build a server-to-server client that:

Streams microphone audio to an OpenAI Realtime voice agent.
Plays back the agent's spoken replies.
After each user turn, generates a high-quality text-only transcript using the same Realtime model.

This is achieved via a secondary response.create request:

{
    "type": "response.create",
    "response": {
        "conversation": "none",
        "output_modalities": ["text"],
        "instructions": transcription_instructions
    }
}

This notebook demonstrates using the Realtime model itself for transcription:

Context-aware transcription: Uses the full session context to improve transcript accuracy.
Non-intrusive: Runs outside the live conversation, so the transcript is never added back to session state.
Customizable instructions: Allows tailoring transcription prompts to specific use-cases. Realtime model is better than the transcription model at following instructions.

1. Why use out-of-band transcription?

The Realtime API offers built-in user input transcription, but this relies on a separate ASR model (e.g., gpt-4o-transcribe). Using different models for transcription and response generation can lead to discrepancies. For example:

User speech transcribed as: I had otoo accident
Realtime response interpreted correctly as: Got it, you had an auto accident

Accurate transcriptions can be very important, particularly when:

Transcripts trigger downstream actions (e.g., tool calls), where errors propagate through the system.
Transcripts are summarized or passed to other components, risking context pollution.
Transcripts are displayed to end users, leading to poor user experiences if errors occur.

The potential advantages of using out-of-band transcription include:

Reduced Mismatch: The same model is used for both transcription and generation, minimizing inconsistencies between what the user says and how the agent responds.
Greater Steerability: The Realtime model is more steerable, can better follow custom instructions for higher transcription quality, and is not limited by a 1024-token input maximum.
Session Context Awareness: The model has access to the full session context, so, for example, if you mention your name multiple times, it will transcribe it correctly.

In terms of trade-offs:

Realtime Model (for transcription):
- Audio Input → Text Output: $32.00 per 1M audio tokens +$ 16.00 per 1M text tokens out.
- Cached Session Context: $0.40 per 1M cached context tokens.
- Total Cost (for 1M audio tokens in + 1M text tokens out): ≈ $48.00
GPT-4o Transcription:
- Audio Input: $6.00 per 1M audio tokens
- Text Input: $2.50 per 1M tokens.
- Text Output: $10.00 per 1M tokens
- Total Cost (for 1M audio tokens in + 1M text tokens out): ≈ $16.00
Direct Cost Comparison (see examples in the end of the cookbook):
- Using full session context: 16-22x (if transcription cost is 0.001 $/session, realtime transcription will be 0.016$ /session)
  - The cost is higher since you are always passing the growing session context. However, this can potentially help with transcription.
- Using only latest user turn: 3-5x (if transcription cost is 0.001 $/session, realtime transcription will be 0.003$ /session)
  - The cost is lower since you are only transcribing the latest user audio turn. However, you no longer have access to the session context for transcription quality.
- Using 1 < N (turn) < Full Context, the price would be between 3-20x more expensive depending on how many turns you decide to keep in context
- Note: These cost estimates are specific to the examples covered in this cookbook. Actual costs may vary depending on factors such as session length, how often context is cached, the ratio of audio to text input, and the details of your particular use case.
Other Considerations:
- Implementing transcription via the Realtime model might be slightly more complex compared to using the built-in GPT-4o transcription option through the Realtime API.

Note: Out-of-band responses using the Realtime model can be used for other use cases beyond user turn transcription. Examples include generating structured summaries, triggering background actions, or performing validation tasks without affecting the main conversation.

2. Requirements & Setup

Ensure your environment meets these requirements:

Python 3.10 or later
PortAudio (required by sounddevice):
- macOS:
```
brew install portaudio
```
Python Dependencies:
```
pip install sounddevice websockets
```
OpenAI API Key (with Realtime API access): Set your key as an environment variable:
```
export OPENAI_API_KEY=sk-...
```

[100]

3. Prompts

We use two distinct prompts:

Voice Agent Prompt (REALTIME_MODEL_PROMPT): This is an example prompt used with the Realtime model for the Speech 2 Speech interactions.
Transcription Prompt (REALTIME_MODEL_TRANSCRIPTION_PROMPT): Silently returns a precise, verbatim transcript of the user's most recent speech turn. You can modify this prompt to iterate in transcription quality.

For the REALTIME_MODEL_TRANSCRIPTION_PROMPT, you can start from this base prompt, but the goal would be for you to iterate on the prompt to tailor it to your use case. Just remember to remove the Policy Number formatting rules since it might not apply to your use case!

[ ]

4. Core configuration

We define:

Imports
Audio and model defaults
Constants for transcription event handling

[126]

/var/folders/cn/p1ryy08146b7vvvhbh24j9b00000gn/T/ipykernel_48882/2514869342.py:10: DeprecationWarning: websockets.client.WebSocketClientProtocol is deprecated
  from websockets.client import WebSocketClientProtocol

[127]

5. Building the Realtime session & the out‑of‑band request

The Realtime session (session.update) configures:

Audio input/output
Server‑side VAD
Set built‑in transcription (input_audio_transcription_model)
- We set this so that we can compare to the Realtime model transcription

The out‑of‑band transcription is a response.create triggered after user input audio is committed input_audio_buffer.committed:

conversation: "none" – use session state but don’t write to the main conversation session state
output_modalities: ["text"] – get a text transcript only

Note: The REALTIME_MODEL_TRANSCRIPTION_PROMPT is not passed to the gpt-4o-transcribe model because the Realtime API enforces a 1024 token maximum for prompts.

[138]

6. Audio streaming: mic → Realtime → speakers

We now define:

encode_audio – base64 helper
playback_audio – play assistant audio on the default output device
send_audio_from_queue – send buffered mic audio to input_audio_buffer
stream_microphone_audio – capture PCM16 from the mic and feed the queue

[139]

7. Extracting and comparing transcripts

The function below enables us to generate two transcripts for each user turn:

Realtime model transcript: from our out-of-band response.create call.
Built-in ASR transcript: from the standard transcription model (input_audio_transcription_model).

We align and display both clearly in the terminal:

=== User Turn (Realtime Transcript) ===
...

=== User Turn (Built-in ASR Transcript) ===
...

[140]

8. Listening for Realtime events

listen_for_events drives the session:

Watches for speech_started / speech_stopped / committed
Sends the out‑of‑band transcription request when a user turn finishes (input_audio_buffer.committed) when only_last_user_turn == False
Sends the out‑of‑band transcription request when a user turn is added to conversation (conversation.item.added") when only_last_user_turn == True
Calculates token usage and cost for both transcription methods
Streams assistant audio to the playback queue
Buffers text deltas per response_id

[148]

[149]

9. Run Script

In this step, we run the code which will allow us to view the Realtime model transcription vs transcription model transcriptions. The code does the following:

Loads configuration and prompts
Establishes a WebSocket connection
Starts concurrent tasks:
- listen_for_events (handle incoming messages)
- stream_microphone_audio (send microphone audio)
- Mutes mic when assistant is speaking
- playback_audio (play assistant responses)
- prints realtime and transcription model transcripts when they are both returned. It uses shared_state to ensure both are returned before printing.
Run session until you interrupt

Output should look like:

[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== User turn (Realtime transcript) ===
Hello.

=== User turn (Transcription model) ===
Hello


=== Assistant response ===
Hello, and thank you for calling. Let's start with your full name, please.

[150]

[ ]

Streaming microphone audio at 24000 Hz (mono). Speak naturally; server VAD will stop listening when you pause.

[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== User turn (Realtime transcript) ===
Hello.

=== User turn (Transcription model) ===
Hello


=== Assistant response ===
Hello! Let's get started with your claim. Can you tell me your full name, please?


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== User turn (Realtime transcript) ===
My name is M I N H A J U L H O Q U E

=== User turn (Transcription model) ===
My name is Minhajul Hoque.


=== Assistant response ===
Thank you. Just to confirm, I heard your full name as Minhajul Hoque. Is that correct?


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== User turn (Realtime transcript) ===
Yep.

=== User turn (Transcription model) ===
Yep.


=== Assistant response ===
Great, thank you for confirming. Now, could you provide your policy number, please?


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== User turn (Realtime transcript) ===
My policy number is X077-B025.

=== User turn (Transcription model) ===
My policy number is X077B025.


=== Assistant response ===
Thank you. Let me confirm: I have your policy number as X077B025. Is that correct?


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== Assistant response ===
Of course. Your full name is Minhajul Hoque. Now, let’s move on. What type of accident are you reporting—auto, home, or something else?


=== User turn (Realtime transcript) ===
Yeah, can you ask me my name again?

=== User turn (Transcription model) ===
Can you ask me my name again?


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== User turn (Realtime transcript) ===
No, can you ask me my name again, this is important.

=== User turn (Transcription model) ===
No, can you ask me by name again?


=== Assistant response ===
Understood. Let me repeat your full name again to confirm. Your name is Minhajul Hoque. Is that correct?


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== User turn (Realtime transcript) ===
My name is Minhajul Hoque.

=== User turn (Transcription model) ===
My name is Minhaj ul Haq.

Session cancelled; closing.

From the above example, we can notice:

The Realtime Model Transcription quality matches or surpasses that of the transcription model in various turns. In one of the turns, the transcription model misses "this is important." while the realtime transcription gets it correctly.
The Realtime model correctly applies rules for Policy Number formatting (XXXX-XXXX).
With context from the entire session, including previous turns where I spelled out my name, the Realtime model accurately transcribes my name when the assistant asked my name again while the transcription model makes errors (e.g., "Minhaj ul Haq").

Example with Cost Calculations

There are significant price differences between the available methods for transcribing user audio. GPT-4o-Transcribe is by far the most cost-effective approach: it charges only for the raw audio input and a small amount of text output, resulting in transcripts that cost just fractions of a cent per turn. In contrast, using the Realtime model for out-of-band transcription is more expensive. If you transcribe only the latest user turn with Realtime, it typically costs about 3–5× more than GPT-4o-Transcribe. If you include the full session context in each transcription request, the cost can increase to about 16–20× higher. This is because each request to the Realtime model processes the entire session context again at higher pricing, and the cost grows as the conversation gets longer.

Cost for Transcribing Only the Latest Turn

Let's walk through an example that uses full session context for realtime out-of-band transcription:

[111]

Streaming microphone audio at 24000 Hz (mono). Speak naturally; server VAD will stop listening when you pause.

[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...
conversation.item.added: {'id': 'item_Cfpt8RCQdpsNsz2OZ4rxQ', 'type': 'message', 'status': 'completed', 'role': 'user', 'content': [{'type': 'input_audio', 'transcript': None}]}
conversation.item.added: {'id': 'item_Cfpt9JS3PCvlCxoO15mLt', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}

=== User turn (Realtime transcript) ===
Hello. How can I help you today?
[Realtime out-of-band transcription usage]
{
  "total_tokens": 1841,
  "input_tokens": 1830,
  "output_tokens": 11,
  "input_token_details": {
    "text_tokens": 1830,
    "audio_tokens": 0,
    "image_tokens": 0,
    "cached_tokens": 0,
    "cached_tokens_details": {
      "text_tokens": 0,
      "audio_tokens": 0,
      "image_tokens": 0
    }
  },
  "output_token_details": {
    "text_tokens": 11,
    "audio_tokens": 0
  }
}
[Realtime out-of-band transcription cost estimate] text_in=$0.007320, text_in_cached=$0.000000, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000176, audio_out=$0.000000, total=$0.007496

=== User turn (Transcription model) ===
Hello
[Transcription model usage]
{
  "type": "tokens",
  "total_tokens": 19,
  "input_tokens": 16,
  "input_token_details": {
    "text_tokens": 0,
    "audio_tokens": 16
  },
  "output_tokens": 3
}
[Transcription model cost estimate] audio_in=$0.000096, text_in=$0.000000, text_out=$0.000030, total=$0.000126

[Realtime usage]
{
  "total_tokens": 1327,
  "input_tokens": 1042,
  "output_tokens": 285,
  "input_token_details": {
    "text_tokens": 1026,
    "audio_tokens": 16,
    "image_tokens": 0,
    "cached_tokens": 0,
    "cached_tokens_details": {
      "text_tokens": 0,
      "audio_tokens": 0,
      "image_tokens": 0
    }
  },
  "output_token_details": {
    "text_tokens": 66,
    "audio_tokens": 219
  }
}

=== Assistant response ===
Thank you for calling OpenAI Insurance Claims. My name is Ava, and I’ll help you file your claim today. Let’s start with your full legal name as it appears on your policy. Could you share that with me, please?


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...
conversation.item.added: {'id': 'item_CfptNPygis1UcQYQMDh1f', 'type': 'message', 'status': 'completed', 'role': 'user', 'content': [{'type': 'input_audio', 'transcript': None}]}
conversation.item.added: {'id': 'item_CfptSg4tU6WnRkdiPvR3D', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}

=== User turn (Realtime transcript) ===
My full legal name would be M-I-N-H, H-O-Q-U-E.
[Realtime out-of-band transcription usage]
{
  "total_tokens": 2020,
  "input_tokens": 2001,
  "output_tokens": 19,
  "input_token_details": {
    "text_tokens": 1906,
    "audio_tokens": 95,
    "image_tokens": 0,
    "cached_tokens": 1856,
    "cached_tokens_details": {
      "text_tokens": 1856,
      "audio_tokens": 0,
      "image_tokens": 0
    }
  },
  "output_token_details": {
    "text_tokens": 19,
    "audio_tokens": 0
  }
}
[Realtime out-of-band transcription cost estimate] text_in=$0.000200, text_in_cached=$0.000742, audio_in=$0.003040, audio_in_cached=$0.000000, text_out=$0.000304, audio_out=$0.000000, total=$0.004286

=== User turn (Transcription model) ===
My full legal name would be Minhajul Hoque.
[Transcription model usage]
{
  "type": "tokens",
  "total_tokens": 71,
  "input_tokens": 57,
  "input_token_details": {
    "text_tokens": 0,
    "audio_tokens": 57
  },
  "output_tokens": 14
}
[Transcription model cost estimate] audio_in=$0.000342, text_in=$0.000000, text_out=$0.000140, total=$0.000482

[Realtime usage]
{
  "total_tokens": 1675,
  "input_tokens": 1394,
  "output_tokens": 281,
  "input_token_details": {
    "text_tokens": 1102,
    "audio_tokens": 292,
    "image_tokens": 0,
    "cached_tokens": 1344,
    "cached_tokens_details": {
      "text_tokens": 1088,
      "audio_tokens": 256,
      "image_tokens": 0
    }
  },
  "output_token_details": {
    "text_tokens": 63,
    "audio_tokens": 218
  }
}

=== Assistant response ===
Thank you, Minhajul Hoque. I’ve got your full name noted. Next, may I have your policy number? Please share it in the format of four digits, a dash, and then four more digits.


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...
conversation.item.added: {'id': 'item_CfpthEQKfNqaoD86Iolvf', 'type': 'message', 'status': 'completed', 'role': 'user', 'content': [{'type': 'input_audio', 'transcript': None}]}
conversation.item.added: {'id': 'item_CfptnqCGAdlEXuAxGUvvK', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}

=== User turn (Realtime transcript) ===
My policy number is P-0-0-2-X-0-7-5.
[Realtime out-of-band transcription usage]
{
  "total_tokens": 2137,
  "input_tokens": 2116,
  "output_tokens": 21,
  "input_token_details": {
    "text_tokens": 1963,
    "audio_tokens": 153,
    "image_tokens": 0,
    "cached_tokens": 1856,
    "cached_tokens_details": {
      "text_tokens": 1856,
      "audio_tokens": 0,
      "image_tokens": 0
    }
  },
  "output_token_details": {
    "text_tokens": 21,
    "audio_tokens": 0
  }
}
[Realtime out-of-band transcription cost estimate] text_in=$0.000428, text_in_cached=$0.000742, audio_in=$0.004896, audio_in_cached=$0.000000, text_out=$0.000336, audio_out=$0.000000, total=$0.006402

=== User turn (Transcription model) ===
My policy number is P002X075.
[Transcription model usage]
{
  "type": "tokens",
  "total_tokens": 70,
  "input_tokens": 59,
  "input_token_details": {
    "text_tokens": 0,
    "audio_tokens": 59
  },
  "output_tokens": 11
}
[Transcription model cost estimate] audio_in=$0.000354, text_in=$0.000000, text_out=$0.000110, total=$0.000464

[Realtime usage]
{
  "total_tokens": 1811,
  "input_tokens": 1509,
  "output_tokens": 302,
  "input_token_details": {
    "text_tokens": 1159,
    "audio_tokens": 350,
    "image_tokens": 0,
    "cached_tokens": 832,
    "cached_tokens_details": {
      "text_tokens": 832,
      "audio_tokens": 0,
      "image_tokens": 0
    }
  },
  "output_token_details": {
    "text_tokens": 57,
    "audio_tokens": 245
  }
}

=== Assistant response ===
I want to confirm I heard that correctly. It sounded like your policy number is P002-X075. Could you please confirm if that’s correct, or provide any clarification if needed?


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...
conversation.item.added: {'id': 'item_Cfpu59HqXhBMHvHmW0SvX', 'type': 'message', 'status': 'completed', 'role': 'user', 'content': [{'type': 'input_audio', 'transcript': None}]}
conversation.item.added: {'id': 'item_Cfpu8juH7cCWuQAxCsYUT', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}

=== User turn (Realtime transcript) ===
That is indeed correct.
[Realtime out-of-band transcription usage]
{
  "total_tokens": 2233,
  "input_tokens": 2226,
  "output_tokens": 7,
  "input_token_details": {
    "text_tokens": 2014,
    "audio_tokens": 212,
    "image_tokens": 0,
    "cached_tokens": 1856,
    "cached_tokens_details": {
      "text_tokens": 1856,
      "audio_tokens": 0,
      "image_tokens": 0
    }
  },
  "output_token_details": {
    "text_tokens": 7,
    "audio_tokens": 0
  }
}
[Realtime out-of-band transcription cost estimate] text_in=$0.000632, text_in_cached=$0.000742, audio_in=$0.006784, audio_in_cached=$0.000000, text_out=$0.000112, audio_out=$0.000000, total=$0.008270

=== User turn (Transcription model) ===
That is indeed correct.
[Transcription model usage]
{
  "type": "tokens",
  "total_tokens": 39,
  "input_tokens": 32,
  "input_token_details": {
    "text_tokens": 0,
    "audio_tokens": 32
  },
  "output_tokens": 7
}
[Transcription model cost estimate] audio_in=$0.000192, text_in=$0.000000, text_out=$0.000070, total=$0.000262

[Realtime usage]
{
  "total_tokens": 1818,
  "input_tokens": 1619,
  "output_tokens": 199,
  "input_token_details": {
    "text_tokens": 1210,
    "audio_tokens": 409,
    "image_tokens": 0,
    "cached_tokens": 832,
    "cached_tokens_details": {
      "text_tokens": 832,
      "audio_tokens": 0,
      "image_tokens": 0
    }
  },
  "output_token_details": {
    "text_tokens": 49,
    "audio_tokens": 150
  }
}

=== Assistant response ===
Thank you for confirming. Now, could you tell me the type of accident you’re filing this claim for—whether it’s auto, home, or something else?


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...
conversation.item.added: {'id': 'item_CfpuJcnmWJEzfxS2MgHv0', 'type': 'message', 'status': 'completed', 'role': 'user', 'content': [{'type': 'input_audio', 'transcript': None}]}
conversation.item.added: {'id': 'item_CfpuPtFYTrlz1uQJBKMVF', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}

=== User turn (Realtime transcript) ===
It's an auto one, but I think you got my name wrong. Can you ask my name again?
[Realtime out-of-band transcription usage]
{
  "total_tokens": 2255,
  "input_tokens": 2232,
  "output_tokens": 23,
  "input_token_details": {
    "text_tokens": 2055,
    "audio_tokens": 177,
    "image_tokens": 0,
    "cached_tokens": 1856,
    "cached_tokens_details": {
      "text_tokens": 1856,
      "audio_tokens": 0,
      "image_tokens": 0
    }
  },
  "output_token_details": {
    "text_tokens": 23,
    "audio_tokens": 0
  }
}
[Realtime out-of-band transcription cost estimate] text_in=$0.000796, text_in_cached=$0.000742, audio_in=$0.005664, audio_in_cached=$0.000000, text_out=$0.000368, audio_out=$0.000000, total=$0.007570

=== User turn (Transcription model) ===
It's a auto one, but I think you got my name wrong, can you ask my name again?
[Transcription model usage]
{
  "type": "tokens",
  "total_tokens": 83,
  "input_tokens": 60,
  "input_token_details": {
    "text_tokens": 0,
    "audio_tokens": 60
  },
  "output_tokens": 23
}
[Transcription model cost estimate] audio_in=$0.000360, text_in=$0.000000, text_out=$0.000230, total=$0.000590

[Realtime usage]
{
  "total_tokens": 1779,
  "input_tokens": 1625,
  "output_tokens": 154,
  "input_token_details": {
    "text_tokens": 1251,
    "audio_tokens": 374,
    "image_tokens": 0,
    "cached_tokens": 832,
    "cached_tokens_details": {
      "text_tokens": 832,
      "audio_tokens": 0,
      "image_tokens": 0
    }
  },
  "output_token_details": {
    "text_tokens": 41,
    "audio_tokens": 113
  }
}

=== Assistant response ===
Of course, let’s make sure I have it correct. Could you please spell out your full legal name for me again, carefully?


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...
conversation.item.added: {'id': 'item_CfpuYJBwNQubeb7uuHqQQ', 'type': 'message', 'status': 'completed', 'role': 'user', 'content': [{'type': 'input_audio', 'transcript': None}]}
conversation.item.added: {'id': 'item_CfpuaI6ZvKBwZG6yXxE1l', 'type': 'message', 'status': 'in_progress', 'role': 'assistant', 'content': []}

=== User turn (Realtime transcript) ===
Minhajul Hoque.
[Realtime out-of-band transcription usage]
{
  "total_tokens": 2261,
  "input_tokens": 2252,
  "output_tokens": 9,
  "input_token_details": {
    "text_tokens": 2092,
    "audio_tokens": 160,
    "image_tokens": 0,
    "cached_tokens": 1856,
    "cached_tokens_details": {
      "text_tokens": 1856,
      "audio_tokens": 0,
      "image_tokens": 0
    }
  },
  "output_token_details": {
    "text_tokens": 9,
    "audio_tokens": 0
  }
}
[Realtime out-of-band transcription cost estimate] text_in=$0.000944, text_in_cached=$0.000742, audio_in=$0.005120, audio_in_cached=$0.000000, text_out=$0.000144, audio_out=$0.000000, total=$0.006950

=== User turn (Transcription model) ===
مينهاجو حق.
[Transcription model usage]
{
  "type": "tokens",
  "total_tokens": 27,
  "input_tokens": 20,
  "input_token_details": {
    "text_tokens": 0,
    "audio_tokens": 20
  },
  "output_tokens": 7
}
[Transcription model cost estimate] audio_in=$0.000120, text_in=$0.000000, text_out=$0.000070, total=$0.000190

[Realtime usage]
{
  "total_tokens": 1902,
  "input_tokens": 1645,
  "output_tokens": 257,
  "input_token_details": {
    "text_tokens": 1288,
    "audio_tokens": 357,
    "image_tokens": 0,
    "cached_tokens": 832,
    "cached_tokens_details": {
      "text_tokens": 832,
      "audio_tokens": 0,
      "image_tokens": 0
    }
  },
  "output_token_details": {
    "text_tokens": 54,
    "audio_tokens": 203
  }
}

=== Assistant response ===
Thank you. Let me confirm: your full legal name is spelled M-I-N-H-A-J-U-L, and the last name H-O-Q-U-E. Is that correct?

Session cancelled; closing.

Transcription Cost Comparison

Costs Summary

Realtime Out-of-Band (OOB): $0.040974 total (~$ 0.006829 per turn)
Dedicated Transcription: $0.002114 total (~$ 0.000352 per turn)
OOB is ~19× more expensive using full session context

Considerations

Caching: Because these conversations are short, you benefit little from caching beyond the initial system prompt.
Transcription System Prompt: The transcription model uses a minimal system prompt, so input costs would typically be higher.

Recommended Cost-Saving Strategy

Limit transcription to recent turns: Minimizing audio/text context significantly reduces OOB transcription costs.

Understanding Cache Behavior

Effective caching requires stable prompt instructions (usually 1,024+ tokens).
Different instruction prompts between OOB and main assistant sessions result in separate caches.

Cost for Transcribing Only the Latest Turn

You can limit transcription to only the latest user turn by supplying input item_references like this:

    if item_ids:
        response["input"] = [
            {"type": "item_reference", "id": item_id} for item_id in item_ids
        ]

    return {
        "type": "response.create",
        "response": response,
    }

Transcribing just the most recent user turn lowers costs by restricting the session context sent to the model. However, this approach has trade-offs: the model won’t have access to previous conversation history to help resolve ambiguities or correct errors (for example, accurately recalling a username mentioned earlier). Additionally, because you’re always updating which input is referenced, little caching benefit is realized, the cache prefix changes each turn, so you don’t accumulate reusable context.

Now, let’s look at a second example that uses only the most recent user audio turn for realtime out-of-band transcription:

[136]

Streaming microphone audio at 24000 Hz (mono). Speak naturally; server VAD will stop listening when you pause.

[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== User turn (Realtime transcript) ===
Hello.
[Realtime out-of-band transcription usage]
{
  "total_tokens": 1813,
  "input_tokens": 1809,
  "output_tokens": 4,
  "input_token_details": {
    "text_tokens": 1809,
    "audio_tokens": 0,
    "image_tokens": 0,
    "cached_tokens": 0,
    "cached_tokens_details": {
      "text_tokens": 0,
      "audio_tokens": 0,
      "image_tokens": 0
    }
  },
  "output_token_details": {
    "text_tokens": 4,
    "audio_tokens": 0
  }
}
[Realtime out-of-band transcription cost estimate] text_in=$0.007236, text_in_cached=$0.000000, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000064, audio_out=$0.000000, total=$0.007300

=== User turn (Transcription model) ===
Hello
[Transcription model usage]
{
  "type": "tokens",
  "total_tokens": 17,
  "input_tokens": 14,
  "input_token_details": {
    "text_tokens": 0,
    "audio_tokens": 14
  },
  "output_tokens": 3
}
[Transcription model cost estimate] audio_in=$0.000084, text_in=$0.000000, text_out=$0.000030, total=$0.000114


=== Assistant response ===
Thank you for calling OpenAI Insurance Claims. My name is Alex, and I’ll help you file your claim today. May I please have your full legal name as it appears on your policy?


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== User turn (Realtime transcript) ===
My full legal name is M-I-N-H A-J-U-L H-O-Q-U-E
[Realtime out-of-band transcription usage]
{
  "total_tokens": 1829,
  "input_tokens": 1809,
  "output_tokens": 20,
  "input_token_details": {
    "text_tokens": 1809,
    "audio_tokens": 0,
    "image_tokens": 0,
    "cached_tokens": 1792,
    "cached_tokens_details": {
      "text_tokens": 1792,
      "audio_tokens": 0,
      "image_tokens": 0
    }
  },
  "output_token_details": {
    "text_tokens": 20,
    "audio_tokens": 0
  }
}
[Realtime out-of-band transcription cost estimate] text_in=$0.000068, text_in_cached=$0.000717, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000320, audio_out=$0.000000, total=$0.001105

=== User turn (Transcription model) ===
My full legal name is Minhajul Hoque.
[Transcription model usage]
{
  "type": "tokens",
  "total_tokens": 87,
  "input_tokens": 74,
  "input_token_details": {
    "text_tokens": 0,
    "audio_tokens": 74
  },
  "output_tokens": 13
}
[Transcription model cost estimate] audio_in=$0.000444, text_in=$0.000000, text_out=$0.000130, total=$0.000574


=== Assistant response ===
Thank you, Minhajul Hoque. I’ve noted your full legal name. Next, could you please provide your policy number? Remember, it's usually in a format like XXXX-XXXX.


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== User turn (Realtime transcript) ===
My policy number is X007-PX75.
[Realtime out-of-band transcription usage]
{
  "total_tokens": 1821,
  "input_tokens": 1809,
  "output_tokens": 12,
  "input_token_details": {
    "text_tokens": 1809,
    "audio_tokens": 0,
    "image_tokens": 0,
    "cached_tokens": 1792,
    "cached_tokens_details": {
      "text_tokens": 1792,
      "audio_tokens": 0,
      "image_tokens": 0
    }
  },
  "output_token_details": {
    "text_tokens": 12,
    "audio_tokens": 0
  }
}
[Realtime out-of-band transcription cost estimate] text_in=$0.000068, text_in_cached=$0.000717, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000192, audio_out=$0.000000, total=$0.000977

=== User turn (Transcription model) ===
Sure, my policy number is AG007-PX75.
[Transcription model usage]
{
  "type": "tokens",
  "total_tokens": 102,
  "input_tokens": 88,
  "input_token_details": {
    "text_tokens": 0,
    "audio_tokens": 88
  },
  "output_tokens": 14
}
[Transcription model cost estimate] audio_in=$0.000528, text_in=$0.000000, text_out=$0.000140, total=$0.000668


=== Assistant response ===
Thank you. Just to confirm, I heard your policy number as E G 0 0 7 - P X 7 5. Is that correct?


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== User turn (Realtime transcript) ===
No, I said X007-PX75.
[Realtime out-of-band transcription usage]
{
  "total_tokens": 1821,
  "input_tokens": 1809,
  "output_tokens": 12,
  "input_token_details": {
    "text_tokens": 1809,
    "audio_tokens": 0,
    "image_tokens": 0,
    "cached_tokens": 1792,
    "cached_tokens_details": {
      "text_tokens": 1792,
      "audio_tokens": 0,
      "image_tokens": 0
    }
  },
  "output_token_details": {
    "text_tokens": 12,
    "audio_tokens": 0
  }
}
[Realtime out-of-band transcription cost estimate] text_in=$0.000068, text_in_cached=$0.000717, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000192, audio_out=$0.000000, total=$0.000977

=== User turn (Transcription model) ===
No, I said X007-PX75.
[Transcription model usage]
{
  "type": "tokens",
  "total_tokens": 65,
  "input_tokens": 53,
  "input_token_details": {
    "text_tokens": 0,
    "audio_tokens": 53
  },
  "output_tokens": 12
}
[Transcription model cost estimate] audio_in=$0.000318, text_in=$0.000000, text_out=$0.000120, total=$0.000438


=== Assistant response ===
Thank you for clarifying. I’ve got it now. Your policy number is E G 0 0 7 - P X 7 5. Let’s move on. Could you tell me the type of accident—is it auto, home, or something else?


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== User turn (Realtime transcript) ===
It's an auto, but I think you got my name wrong, can you ask me again?
[Realtime out-of-band transcription usage]
{
  "total_tokens": 1830,
  "input_tokens": 1809,
  "output_tokens": 21,
  "input_token_details": {
    "text_tokens": 1809,
    "audio_tokens": 0,
    "image_tokens": 0,
    "cached_tokens": 1792,
    "cached_tokens_details": {
      "text_tokens": 1792,
      "audio_tokens": 0,
      "image_tokens": 0
    }
  },
  "output_token_details": {
    "text_tokens": 21,
    "audio_tokens": 0
  }
}
[Realtime out-of-band transcription cost estimate] text_in=$0.000068, text_in_cached=$0.000717, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000336, audio_out=$0.000000, total=$0.001121

=== User turn (Transcription model) ===
It's an auto, but I think you got my name wrong. Can you ask me again?
[Transcription model usage]
{
  "type": "tokens",
  "total_tokens": 67,
  "input_tokens": 46,
  "input_token_details": {
    "text_tokens": 0,
    "audio_tokens": 46
  },
  "output_tokens": 21
}
[Transcription model cost estimate] audio_in=$0.000276, text_in=$0.000000, text_out=$0.000210, total=$0.000486


=== Assistant response ===
Of course, I’m happy to correct that. Let’s go back. Could you please spell your full legal name for me, so I can make sure I’ve got it exactly right?


[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== User turn (Realtime transcript) ===
Yeah, my full legal name is Minhajul Haque.
[Realtime out-of-band transcription usage]
{
  "total_tokens": 1824,
  "input_tokens": 1809,
  "output_tokens": 15,
  "input_token_details": {
    "text_tokens": 1809,
    "audio_tokens": 0,
    "image_tokens": 0,
    "cached_tokens": 1792,
    "cached_tokens_details": {
      "text_tokens": 1792,
      "audio_tokens": 0,
      "image_tokens": 0
    }
  },
  "output_token_details": {
    "text_tokens": 15,
    "audio_tokens": 0
  }
}
[Realtime out-of-band transcription cost estimate] text_in=$0.000068, text_in_cached=$0.000717, audio_in=$0.000000, audio_in_cached=$0.000000, text_out=$0.000240, audio_out=$0.000000, total=$0.001025

=== User turn (Transcription model) ===
Yeah, my full legal name is Minhajul Haque.
[Transcription model usage]
{
  "type": "tokens",
  "total_tokens": 60,
  "input_tokens": 45,
  "input_token_details": {
    "text_tokens": 0,
    "audio_tokens": 45
  },
  "output_tokens": 15
}
[Transcription model cost estimate] audio_in=$0.000270, text_in=$0.000000, text_out=$0.000150, total=$0.000420


=== Assistant response ===
Thank you for that. Just to confirm, your full legal name is Minhajul Hoque. Is that correct?

Session cancelled; closing.

Cost Analysis Summary

Realtime Out-of-Band Transcription (OOB)

Total Cost: $0.013354
Average per Turn: ~$0.001908

Dedicated Transcription Model

Total Cost: $0.002630
Average per Turn: ~$0.000376

Difference in Costs

Additional cost using OOB: +$0.010724
Cost Multiplier: OOB is about 5× more expensive than the dedicated transcription model.

This approach costs significantly less than using the full session context. You should evaluate your use case to decide whether regular transcription, out-of-band transcription with full context, or transcribing only the latest turn best fits your needs. You can also choose an intermediate strategy, such as including just the last N turns in the input.

Conclusion

Exploring out-of-band transcription could be beneficial for your use case if:

You're still experiencing unreliable transcriptions, even after optimizing the transcription model prompt.
You need a more reliable and steerable method for generating transcriptions.
The current transcripts fail to normalize entities correctly, causing downstream issues.

Keep in mind the trade-offs:

Cost: Out-of-band (OOB) transcription is more expensive. Be sure that the extra expense makes sense for your typical session lengths and business needs.
Complexity: Implementing OOB transcription takes extra engineering effort to connect all the pieces correctly. Only choose this approach if its benefits are important for your use case.

If you decide to pursue this method, make sure you:

Set up the transcription trigger correctly, ensuring it activates after the audio commit.
Carefully iterate and refine the prompt to align closely with your specific use case and needs.

Realtime Out Of Band Transcription

Transcribing User Audio with a Separate Realtime Request

1. Why use out-of-band transcription?

2. Requirements & Setup

3. Prompts

4. Core configuration

5. Building the Realtime session & the out‑of‑band request

6. Audio streaming: mic → Realtime → speakers

7. Extracting and comparing transcripts

8. Listening for Realtime events

9. Run Script

Example with Cost Calculations

Cost for Transcribing Only the Latest Turn

Transcription Cost Comparison

Costs Summary

Considerations

Recommended Cost-Saving Strategy

Understanding Cache Behavior

Cost for Transcribing Only the Latest Turn

Cost Analysis Summary

Conclusion

Documentation: