For the last 6 months we at Aceverse have been building voice AI agents for online retailers — both inbound and outbound. By now I have a clear map of where voice genuinely takes load off and makes money, and where it gets bolted on "because it's trendy" and just annoys customers.
The key point: the value isn't in the LLM, it's in the integrations. An agent not wired into your CRM and tracking is an expensive voicemail. Wired in, it's an operator that works 24/7.
Inbound scenarios
1. "Where's my order?" — the agent pulls status from the CRM and tracking from the delivery API and answers out loud. The most common, highly repetitive call — ideal to automate.
2. Returns and exchanges — walks the customer through the flow and files a request in the CRM; escalates conflict cases to a human.
3. FAQ, availability, shipping — RAG over your catalog and knowledge base: instant answers from current data.
4. AI receptionist with human handoff (HITL): answers 24/7, resolves simple cases, and hands complex ones to an operator with full context. In our experience, roughly 50% of after-hours calls used to go unanswered, and each of those was a potential lost order.
5. Voice order capture — taking an order in conversation and writing it to the CRM.
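The handoff from scenario 4 could be sketched roughly as follows. This is a minimal illustration, not our production code: `HandoffContext`, `should_escalate`, the intent labels and the thresholds are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HandoffContext:
    """What the human operator receives on escalation (hypothetical shape)."""
    caller_phone: str
    intent: str
    reason: str
    order_id: Optional[str] = None
    transcript: List[str] = field(default_factory=list)


# Assumed escalation triggers; tune them on real calls.
ESCALATE_INTENTS = {"complaint", "legal", "medical"}
MAX_FAILED_TURNS = 2


def should_escalate(intent: str, failed_turns: int, asked_for_human: bool) -> Optional[str]:
    """Return a reason string if the call should go to a human, else None."""
    if asked_for_human:
        return "caller requested a human"
    if intent in ESCALATE_INTENTS:
        return f"sensitive intent: {intent}"
    if failed_turns >= MAX_FAILED_TURNS:
        return "agent could not resolve the request"
    return None


def build_handoff(phone: str, intent: str, reason: str,
                  transcript: List[str], order_id: Optional[str] = None) -> HandoffContext:
    """Bundle everything the operator needs so the customer does not repeat themselves."""
    return HandoffContext(caller_phone=phone, intent=intent, reason=reason,
                          order_id=order_id, transcript=list(transcript))
```

The point of the context object is that the operator picks up mid-conversation with the intent, order and transcript already on screen.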
Outbound scenarios
6. Cart-abandonment recovery — a voice call lands where email gets ignored. One call, politely, with an opt-out.
7. Win-back of dormant customers — personal reactivation; segmentation matters more than the script.
8. Order confirmation — verifying cash-on-delivery orders cuts non-pickups and saves on logistics.
9. Parcel pickup reminder: "your parcel is waiting at the post office for 2 more days; after that it ships back" → fewer returns caused by "forgot to pick up".
10. Anti-fraud: on high-risk orders (common for US merchants), fraudsters enter the cardholder's phone number to get past Shopify/Stripe anti-fraud checks. The agent immediately calls that number to verify that the cardholder actually placed the order; if they did not, the order is refunded right away, which avoids chargebacks and losses.
11. NPS, reviews, upsell — voice feedback after purchase; the whole conversation lands as a transcript in the CRM.
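The "one call, politely, with an opt-out" rule for outbound scenarios can be enforced with a small gate in front of the dialer. The sketch below is illustrative only: the event names, the `OutboundCandidate` shape and the per-event caps are assumptions, not values from a real deployment.

```python
from dataclasses import dataclass


@dataclass
class OutboundCandidate:
    phone: str
    event: str           # e.g. "abandoned_cart", "parcel_waiting", "risky_order"
    opted_out: bool      # the customer asked not to be called
    prior_attempts: int  # calls already placed for this specific event


# Assumed attempt caps per event type; adjust to your own policy.
MAX_ATTEMPTS = {
    "abandoned_cart": 1,
    "win_back": 1,
    "parcel_waiting": 1,
    "risky_order": 2,
}


def may_call(candidate: OutboundCandidate) -> bool:
    """Decide whether the dialer is allowed to place this call."""
    if candidate.opted_out:
        return False
    cap = MAX_ATTEMPTS.get(candidate.event, 0)  # unknown events are never dialed
    return candidate.prior_attempts < cap
```

Keeping this check outside the agent itself means a prompt change can never turn a single polite call into repeated dialing.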
How to build it: architecture
There are two approaches. Realtime / speech-to-speech (S2S) — one neural network listens, reasons and speaks (audio → audio): lower latency, more natural barge-in, but less control and fewer voices/languages. Cascaded (STT → LLM → TTS) — separate components: more hops, but every layer is swappable and tunable per language.
Rule of thumb: S2S when the language is well supported and latency is critical; cascaded when you need quality and control for a specific language. For our English-language agent Anna we chose S2S (OpenAI Realtime on LiveKit); for Ukrainian contact centers we more often go cascaded.
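To make "every layer is swappable" concrete, here is a toy version of the cascaded approach. It is a sketch under heavy simplification: the `STT`, `LLM` and `TTS` interfaces and the stand-in classes are hypothetical, and a real pipeline streams audio instead of processing one whole turn at a time.

```python
from typing import Protocol


class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def reply(self, text: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class CascadedAgent:
    """One conversational turn as three hops: audio -> text -> text -> audio."""

    def __init__(self, stt: STT, llm: LLM, tts: TTS) -> None:
        self.stt = stt
        self.llm = llm
        self.tts = tts

    def turn(self, audio_in: bytes) -> bytes:
        text_in = self.stt.transcribe(audio_in)
        text_out = self.llm.reply(text_in)
        return self.tts.synthesize(text_out)


# Stand-in components so the sketch runs without any vendor SDK.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")  # pretend the audio bytes are already text


class UpperLLM:
    def reply(self, text: str) -> str:
        return text.upper()


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


agent = CascadedAgent(EchoSTT(), UpperLLM(), BytesTTS())
print(agent.turn(b"where is my order"))
```

Swapping the STT for a model that handles a specific language better only means passing a different object into `CascadedAgent`. An S2S model collapses all three hops into one call, which is where both the latency gain and the loss of control come from.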
Realtime models (official vendor data only)
| Model | Type | Languages | Latency | Notes |
|---|---|---|---|---|
| OpenAI Realtime (gpt-realtime-2) | proprietary | multilingual | "low latency" (no figure) | function calling + MCP; WebRTC/WS/SIP; our stack for Anna |
| xAI Grok Voice Agent | proprietary | 20+ | sub-second (claimed) | OpenAI Realtime API-compatible + official LiveKit plugin; $0.05/min |
| Google Gemini Live | proprietary | 70 | not stated | barge-in, affective dialog, Google Search |
| Amazon Nova 2 Sonic | proprietary | EN/FR/IT/DE/ES/PT/HI | "low-latency" | polyglot voices, RAG, via Bedrock |
| Kyutai Moshi | open-source | English | 160 ms theoretical / ~200 ms on an L4 GPU | fully self-hosted, CC-BY 4.0 weights |
Cascaded stack: STT — Deepgram, Soniox, OpenAI Whisper, Mistral Voxtral; TTS — ElevenLabs, Cartesia, Respeecher. Measure WER/MOS on your own calls — public benchmarks on studio-clean speech don't reflect the phone channel.
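Measuring WER on your own calls needs nothing more than human-checked reference transcripts and the STT output. A minimal sketch of the standard word-level edit-distance calculation follows; the lowercase-and-split normalization is an assumption, and real evaluations usually also normalize punctuation and numbers.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


print(wer("where is my order", "where is order"))
```

Run it over a few hundred real phone recordings per STT candidate and compare averages; that comparison is what the public leaderboards cannot give you.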
Why orchestration: LiveKit / Pipecat
Between "model" and "phone call" there's a lot of real-time plumbing: transport (WebRTC/SIP), VAD, end-of-turn detection, interruptions, wiring components together, scaling. Building it by hand takes months. So you use a framework: LiveKit Agents (infrastructure-first, native WebRTC + SIP — our main choice) or Pipecat (pipeline-first, control of every step). Managed platforms — Vapi / Retell / Synthflow — give a fast start at the cost of per-minute markup and vendor lock-in.
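To give a feel for just one piece of that plumbing, here is a deliberately naive end-of-turn detector based on frame energy. It is an illustration only: the threshold, the frame count and the function name are assumptions, and frameworks such as LiveKit Agents and Pipecat rely on trained VAD and turn-detection models, not a fixed energy cutoff.

```python
from typing import Optional, Sequence


def detect_end_of_turn(frame_energies: Sequence[float],
                       speech_threshold: float = 0.02,
                       silence_frames_needed: int = 25) -> Optional[int]:
    """Return the index of the frame where the caller's turn is considered finished.

    The turn ends once speech has started and is then followed by
    `silence_frames_needed` consecutive frames below `speech_threshold`.
    With 20 ms frames (an assumption), 25 frames is about 500 ms of silence.
    Returns None if no end of turn is found.
    """
    speech_started = False
    silent_run = 0
    for index, energy in enumerate(frame_energies):
        if energy >= speech_threshold:
            speech_started = True
            silent_run = 0  # any speech resets the silence counter
        elif speech_started:
            silent_run += 1
            if silent_run >= silence_frames_needed:
                return index
    return None
```

Even this toy version shows the trade-off: a short silence window makes the agent interrupt people who pause mid-sentence, a long one makes every reply feel slow.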
CRM / CMS integration
Tools (function calling) are where the value lives: the model calls get_order_status(phone) → your backend → the store and delivery APIs (Shopify Admin, WooCommerce, carrier tracking, CRM) → the answer is spoken. For standardized access, expose these systems as MCP servers. Outbound scenarios fire on a webhook event (abandoned cart, parcel at the post office, risky order) and place the call through the same stack.
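A minimal sketch of the tool path, assuming an in-memory stand-in for the CRM and the tracking API. `FAKE_CRM`, `FAKE_TRACKING`, `handle_tool_call` and the sample data are hypothetical, and the tool description is a generic JSON-Schema-style shape; the exact envelope differs between vendors and frameworks.

```python
import json

# Hypothetical in-memory stand-ins for the CRM and the carrier tracking API.
FAKE_CRM = {"+15550100": {"order_id": "A-1001", "status": "shipped", "tracking": "TRK123"}}
FAKE_TRACKING = {"TRK123": "out for delivery"}

# Description handed to the model so it knows when and how to call the tool.
GET_ORDER_STATUS_TOOL = {
    "name": "get_order_status",
    "description": "Look up the latest order for the caller's phone number.",
    "parameters": {
        "type": "object",
        "properties": {"phone": {"type": "string"}},
        "required": ["phone"],
    },
}


def get_order_status(phone: str) -> dict:
    """Backend side of the tool: CRM lookup plus delivery tracking."""
    order = FAKE_CRM.get(phone)
    if order is None:
        return {"found": False}
    return {
        "found": True,
        "order_id": order["order_id"],
        "status": order["status"],
        "delivery": FAKE_TRACKING.get(order["tracking"], "unknown"),
    }


TOOLS = {"get_order_status": get_order_status}


def handle_tool_call(name: str, arguments_json: str) -> str:
    """Run the tool the model asked for and return JSON for it to speak from."""
    func = TOOLS.get(name)
    if func is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)
        return json.dumps(func(**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed or mismatched arguments from the model must not crash the call.
        return json.dumps({"error": str(exc)})


print(handle_tool_call("get_order_status", '{"phone": "+15550100"}'))
```

Returning a structured error instead of raising lets the agent say "I could not find that order" and offer a handoff, which matters more on a live call than in a chat window.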
Benefits, honestly
- 24/7 and peak scalability — the agent doesn't choke on Black Friday.
- No lost calls = no lost orders — the most direct money effect.
- Lower cost per contact on routine requests vs a live operator.
- One transcript in the CRM — analytics, QA and training data at once.
Where voice AI does NOT belong
- Emotionally complex / conflict calls — a human is needed; the agent's job is to escalate gracefully.
- Low call volume — the integration won't pay off. Honestly: don't.
- Dirty CRM data — voice will just announce your mess more loudly.
- Legally/medically sensitive topics — not without a human.
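The "low call volume" point can be checked with back-of-the-envelope arithmetic before any build starts. The function below is a rough sketch; every input is something you supply from your own numbers, and the figures in the usage line are placeholders, not measurements.

```python
import math
from typing import Optional


def break_even_calls_per_month(fixed_monthly_cost: float,
                               human_cost_per_call: float,
                               agent_cost_per_call: float,
                               automatable_share: float) -> Optional[int]:
    """Monthly call volume at which savings cover the fixed cost of the integration.

    automatable_share is the fraction of calls the agent fully resolves (0..1).
    Returns None when the agent saves nothing per call under these inputs.
    """
    saving_per_call = automatable_share * (human_cost_per_call - agent_cost_per_call)
    if saving_per_call <= 0:
        return None
    return math.ceil(fixed_monthly_cost / saving_per_call)


# Placeholder inputs purely for illustration.
print(break_even_calls_per_month(1000.0, 2.0, 0.5, 0.5))
```

If your actual monthly volume is well below the number this returns, that is the "honestly: don't" case.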
The full article was published on DOU (in Ukrainian): dou.ua/forums/topic/60031.


