[Agent Name] · Realtime Voice
A phone-callable AI voice agent. A customer dials your Twilio number, the agent picks up, and the conversation streams through OpenAI's Realtime API. Your business logic is exposed as function calls for booking, lookups, and warm transfers.
Source of truth
Production runs as a long-lived WebSocket relay on Fly.io. Twilio Media Streams send raw audio in; OpenAI Realtime sends audio back; your function calls hit your business backend. Postgres holds call transcripts + outcomes.
Tech stack
Node 22 + TypeScript + Fastify with ws for the Twilio Media Streams WebSocket. OpenAI Realtime API (gpt-realtime-mini for cost, full Realtime for harder calls). Twilio Programmable Voice (<Stream> TwiML). Postgres for transcripts. Optional: ElevenLabs voice for non-OpenAI voice options.
Deploy
Run fly deploy. Point the Twilio number's voice webhook at https://yourdomain.com/twilio/incoming. Set the <Stream> URL in the returned TwiML to wss://yourdomain.com/twilio/stream.
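As a rough sketch of what the /twilio/incoming webhook might return, the helper below builds TwiML that connects the call to the stream endpoint. The function names `buildStreamTwiml` and `escapeXml` are hypothetical, and using `<Connect><Stream>` (the bidirectional form) is an assumption based on the agent needing to send audio back to the caller.

```typescript
// Hypothetical helper: escapes a value for safe use inside an XML attribute.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical helper: builds the TwiML body for the incoming-call webhook.
// Assumes <Connect><Stream>, which opens a bidirectional Media Stream.
function buildStreamTwiml(streamUrl: string): string {
  if (!streamUrl.startsWith("wss://")) {
    throw new Error("Expected a wss:// URL for the Media Stream");
  }
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    `    <Stream url="${escapeXml(streamUrl)}" />`,
    "  </Connect>",
    "</Response>",
  ].join("\n");
}
```

A Fastify handler could send this string with a `text/xml` content type, for example `reply.type("text/xml").send(buildStreamTwiml("wss://yourdomain.com/twilio/stream"))`.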
File map
- src/twilio/incoming.ts: HTTP webhook returning TwiML that opens the stream
- src/twilio/stream.ts: WebSocket: Twilio audio in <-> OpenAI audio out
- src/realtime/session.ts: OpenAI Realtime session setup + event handlers
- src/functions/: business logic exposed as Realtime function calls (lookup_hours, book_appointment, transfer_to_human)
- src/db/: transcripts, call_outcomes tables
- src/lib/audio.ts: mulaw <-> PCM16 conversion (Twilio is mulaw, OpenAI wants PCM16)
- prompts/system.md: agent personality + escalation rules
.env keys
- OPENAI_API_KEY
- TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN
- TWILIO_PHONE_NUMBER: your bought number
- BUSINESS_API_BASE: your booking + lookup backend
- DATABASE_URL
- HUMAN_TRANSFER_NUMBER: where to warm-transfer when agent escalates
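One way to catch a missing key at boot rather than mid-call is a small startup check. This is a minimal sketch: the key names come from the list above, while `loadConfig`, the `Config` type, and the assumption that every key is required are illustrative.

```typescript
// Key names are taken from the .env list; treating all of them as required is an assumption.
const REQUIRED_KEYS = [
  "OPENAI_API_KEY",
  "TWILIO_ACCOUNT_SID",
  "TWILIO_AUTH_TOKEN",
  "TWILIO_PHONE_NUMBER",
  "BUSINESS_API_BASE",
  "DATABASE_URL",
  "HUMAN_TRANSFER_NUMBER",
] as const;

type ConfigKey = (typeof REQUIRED_KEYS)[number];
type Config = Record<ConfigKey, string>;

// Hypothetical loader: throws one error naming every missing or blank key.
function loadConfig(env: Record<string, string | undefined>): Config {
  const missing = REQUIRED_KEYS.filter((key) => !env[key]?.trim());
  if (missing.length > 0) {
    throw new Error(`Missing required env keys: ${missing.join(", ")}`);
  }
  const config = {} as Config;
  for (const key of REQUIRED_KEYS) {
    config[key] = env[key] as string;
  }
  return config;
}
```

Calling `loadConfig(process.env)` once at startup means a misconfigured deploy fails before the first call arrives.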
Hard rules
- Audio formats: Twilio sends 8kHz mulaw; OpenAI Realtime wants PCM16 (24kHz). Convert both the encoding and the sample rate in both directions. There are off-by-one bugs here that ruin audio quality; test with a real phone call, not a recorded clip.
- The agent's first turn MUST greet within 500ms or callers hang up. Keep the system prompt short and the instructions direct.
- Function calls return JSON. Keep them small. The agent has to talk while waiting; don't make functions take 3 seconds.
- Always have a human-transfer fallback. Some calls are emergencies. The agent must know its limits.
- Recordings are sensitive. Comply with two-party consent states (announce recording in the greeting if any of your callers are in IL, CA, FL, MA, MD, MT, NH, PA, WA).
- Cost: Realtime API is per-second-of-audio expensive. Cap call duration. Disconnect dormant calls.
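The audio-format rule above can be sketched as follows. This is an illustrative version of what src/lib/audio.ts might contain, not its actual contents: the function names are hypothetical, the mulaw math follows the standard G.711 mu-law layout, and the resampling (linear interpolation up, three-sample averaging down) is a deliberately naive stand-in for a proper low-pass filter.

```typescript
const MULAW_BIAS = 0x84; // 132, the standard G.711 mu-law bias
const MULAW_CLIP = 32635; // largest magnitude that survives adding the bias

// Decode one mu-law byte to a signed 16-bit PCM sample.
function mulawDecodeSample(byte: number): number {
  const u = ~byte & 0xff; // mu-law bytes are stored bit-inverted
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + MULAW_BIAS) << exponent) - MULAW_BIAS;
  return u & 0x80 ? -magnitude : magnitude;
}

// Encode one signed 16-bit PCM sample to a mu-law byte.
function mulawEncodeSample(sample: number): number {
  const sign = sample < 0 ? 0x80 : 0;
  const magnitude = Math.min(Math.abs(sample), MULAW_CLIP) + MULAW_BIAS;
  // Find the segment: bit 14 set means exponent 7, bit 7 means exponent 0.
  let exponent = 7;
  for (let mask = 0x4000; (magnitude & mask) === 0 && exponent > 0; mask >>= 1) {
    exponent--;
  }
  const mantissa = (magnitude >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

// Naive 8 kHz -> 24 kHz: each input sample becomes three, linearly interpolated.
function upsample8kTo24k(input: Int16Array): Int16Array {
  const out = new Int16Array(input.length * 3);
  for (let i = 0; i < input.length; i++) {
    const current = input[i];
    const next = i + 1 < input.length ? input[i + 1] : current;
    out[i * 3] = current;
    out[i * 3 + 1] = Math.round(current + (next - current) / 3);
    out[i * 3 + 2] = Math.round(current + (2 * (next - current)) / 3);
  }
  return out;
}

// Naive 24 kHz -> 8 kHz: average each group of three samples.
// A trailing group of fewer than three samples is dropped.
function downsample24kTo8k(input: Int16Array): Int16Array {
  const out = new Int16Array(Math.floor(input.length / 3));
  for (let i = 0; i < out.length; i++) {
    const base = i * 3;
    out[i] = Math.round((input[base] + input[base + 1] + input[base + 2]) / 3);
  }
  return out;
}

// Twilio mulaw bytes -> PCM16 samples at 24 kHz.
function twilioToOpenAi(mulaw: Uint8Array): Int16Array {
  return upsample8kTo24k(Int16Array.from(mulaw, (b) => mulawDecodeSample(b)));
}

// PCM16 samples at 24 kHz -> Twilio mulaw bytes.
function openAiToTwilio(pcm24k: Int16Array): Uint8Array {
  return Uint8Array.from(downsample24kTo8k(pcm24k), (s) => mulawEncodeSample(s));
}
```

The buffer-length arithmetic (times three going up, floor of a third coming down) is one likely spot for the off-by-one bugs the rule warns about, so chunk boundaries deserve a check on a real call.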
Recent significant changes
- 2026-05-18: Scaffolded. Locked: OpenAI Realtime over Deepgram+ElevenLabs pipeline (latency wins), Twilio over Vonage (better docs), Fly over AWS Lambda (long-lived WebSockets).
Next session: start here
- Buy a Twilio number. Set webhook to /twilio/incoming (use ngrok in dev).
- Wire OPENAI_API_KEY. Test the WebSocket relay with a real phone call.
- Implement first function call (e.g. lookup_business_hours). Make sure the agent uses it.
- Test escalation: ask something hard, confirm the warm transfer fires.
- Record 10 real test calls before going live. Listen to all of them.
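For the first function call, a dispatcher along these lines could keep results small and bounded in time. It is a sketch under assumptions: `runFunctionCall`, the `handlers` registry, the 1500 ms default, and the `{ ok, result | error }` envelope are all hypothetical, and the `lookup_business_hours` stub returns placeholder data where a real handler would call BUSINESS_API_BASE. It assumes the model's arguments arrive as a JSON string and that the returned string is sent back as the function call output.

```typescript
type FunctionHandler = (args: Record<string, unknown>) => Promise<unknown>;

// Hypothetical registry. The stub below returns placeholder hours, not real data.
const handlers: Record<string, FunctionHandler> = {
  lookup_business_hours: async () => ({ open: "09:00", close: "17:00" }),
};

// Runs one function call and always resolves to a small JSON string,
// so the agent gets an answer (or an error it can talk around) quickly.
async function runFunctionCall(
  name: string,
  rawArguments: string,
  timeoutMs = 1500,
  registry: Record<string, FunctionHandler> = handlers,
): Promise<string> {
  const handler = registry[name];
  if (!handler) {
    return JSON.stringify({ ok: false, error: `unknown function: ${name}` });
  }

  let args: Record<string, unknown>;
  try {
    args = rawArguments ? JSON.parse(rawArguments) : {};
  } catch {
    return JSON.stringify({ ok: false, error: "invalid JSON arguments" });
  }

  // Race the handler against a timer so a slow backend cannot stall the call.
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), timeoutMs);
  });

  try {
    const result = await Promise.race([handler(args), timeout]);
    return JSON.stringify({ ok: true, result });
  } catch (err) {
    const message = err instanceof Error ? err.message : "unknown error";
    return JSON.stringify({ ok: false, error: message });
  } finally {
    clearTimeout(timer);
  }
}
```

Returning an error envelope instead of throwing lets the prompt tell the agent what to say on failure, which is also a natural trigger for the warm-transfer test.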