Voice AI in E-commerce: Where It Works and Where It Doesn't

For the last 6 months we at Aceverse have been building voice AI agents for online retailers — both inbound and outbound. By now I have a clear map of where voice genuinely takes load off and makes money, and where it gets bolted on "because it's trendy" and just annoys customers.

The key point: the value isn't in the LLM, it's in the integrations. An agent not wired into your CRM and tracking is an expensive voicemail. Wired in, it's an operator that works 24/7.

Inbound scenarios

1. "Where's my order?" — the agent pulls status from the CRM and tracking from the delivery API and answers out loud. The most common, highly repetitive call — ideal to automate.

2. Returns and exchanges — walks the customer through the flow and files a request in the CRM; escalates conflict cases to a human.

3. FAQ, availability, shipping — RAG over your catalog and knowledge base: instant answers from current data.

4. AI receptionist with human handoff (HITL): answers 24/7, resolves simple cases, and hands complex ones to an operator with full context. In our experience, roughly 50% of after-hours calls used to go unanswered, and every one of those is a potentially missed order.

5. Voice order capture — taking an order in conversation and writing it to the CRM.

Outbound scenarios

6. Cart-abandonment recovery — a voice call lands where email gets ignored. One call, politely, with an opt-out.

7. Win-back of dormant customers — personal reactivation; segmentation matters more than the script.

8. Order confirmation — verifying cash-on-delivery orders cuts non-pickups and saves on logistics.

9. Parcel pickup reminder — "it's at the post office for 2 days, otherwise it ships back" → fewer returns from "forgot to pick up".

10. Anti-fraud: on high-risk orders (common for US merchants), fraudsters enter the real cardholder's phone number to get past Shopify/Stripe anti-fraud checks. The agent immediately calls that number to verify whether the cardholder placed the order; if not, the order is refunded right away, avoiding chargebacks and losses.

11. NPS, reviews, upsell — voice feedback after purchase; the whole conversation lands as a transcript in the CRM.
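The "one call, politely, with an opt-out" rule from the outbound scenarios is easy to state and easy to break in production, so it helps to make it an explicit gate in front of the dialer. Below is a minimal sketch, assuming an in-memory store; `OutboundPolicy`, `should_call`, `record_call` and the event IDs are hypothetical names for illustration, not part of any framework.

```python
from dataclasses import dataclass, field


@dataclass
class OutboundPolicy:
    """Hypothetical gate in front of the dialer (in-memory, for illustration only)."""
    opted_out: set[str] = field(default_factory=set)
    called: set[tuple[str, str]] = field(default_factory=set)

    def should_call(self, phone: str, event_id: str) -> bool:
        """Allow at most one call per (phone, event); never call opted-out numbers."""
        if phone in self.opted_out:
            return False
        return (phone, event_id) not in self.called

    def record_call(self, phone: str, event_id: str, asked_to_opt_out: bool = False) -> None:
        """Remember that the call happened and honor an opt-out request."""
        self.called.add((phone, event_id))
        if asked_to_opt_out:
            self.opted_out.add(phone)


policy = OutboundPolicy()
if policy.should_call("+15550100", "cart-1"):
    # ... place the call here, then record the outcome
    policy.record_call("+15550100", "cart-1")
```

In a real deployment the two sets would live in the CRM or a database, so that the opt-out survives restarts and is shared across all outbound scenarios.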

How to build it: architecture

There are two approaches. Realtime / speech-to-speech (S2S) — one neural network listens, reasons and speaks (audio → audio): lower latency, more natural barge-in, but less control and fewer voices/languages. Cascaded (STT → LLM → TTS) — separate components: more hops, but every layer is swappable and tunable per language.

Rule of thumb: S2S when the language is well supported and latency is critical; cascaded when you need quality and control for a specific language. For our English-language agent Anna we chose S2S (OpenAI Realtime on LiveKit); for Ukrainian contact centers we more often go cascaded.
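To make "every layer is swappable" concrete, here is a minimal sketch of one cascaded turn. The three stages are plain callables, so any vendor SDK could be wrapped to fit. `CascadedAgent` and the `fake_*` stages are hypothetical stand-ins for illustration; real stages would call STT, LLM and TTS services and stream audio rather than pass whole buffers.

```python
from dataclasses import dataclass
from typing import Callable

# Each stage is just a callable, so a vendor SDK can be wrapped to match.
SpeechToText = Callable[[bytes], str]
Reasoner = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class CascadedAgent:
    stt: SpeechToText
    llm: Reasoner
    tts: TextToSpeech

    def handle_turn(self, caller_audio: bytes) -> bytes:
        """One conversational turn: audio in, audio out, with three hops."""
        transcript = self.stt(caller_audio)
        reply_text = self.llm(transcript)
        return self.tts(reply_text)


# Stand-in stages for illustration only.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You asked: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


agent = CascadedAgent(stt=fake_stt, llm=fake_llm, tts=fake_tts)
reply = agent.handle_turn(b"where is my order")

# Swapping one layer (say, a TTS tuned for another language) leaves the rest untouched.
other_voice = CascadedAgent(
    stt=fake_stt, llm=fake_llm, tts=lambda t: t.upper().encode("utf-8")
)
```

An S2S model collapses `handle_turn` into a single call, which is where both the latency win and the loss of per-layer control come from.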

Realtime models (official vendor data only)

Model	Type	Languages	Latency	Notes
OpenAI Realtime (gpt-realtime-2)	proprietary	multilingual	"low latency" (no figure)	function calling + MCP; WebRTC/WS/SIP; our stack for Anna
xAI Grok Voice Agent	proprietary	20+	sub-second (claimed)	OpenAI Realtime API-compatible + official LiveKit plugin; $0.05/min
Google Gemini Live	proprietary	70	not stated	barge-in, affective dialog, Google Search
Amazon Nova 2 Sonic	proprietary	EN/FR/IT/DE/ES/PT/HI	"low-latency"	polyglot voices, RAG, via Bedrock
Kyutai Moshi	open-source	English	160 ms theoretical / ~200 ms on an L4 GPU	fully self-hosted, CC-BY 4.0 weights

Cascaded stack: for STT, Deepgram, Soniox, OpenAI Whisper, Mistral Voxtral; for TTS, ElevenLabs, Cartesia, Respeecher. Measure WER (word error rate) for STT and MOS (mean opinion score) for TTS on your own calls: public benchmarks on studio-clean speech don't reflect the phone channel.
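Measuring WER on your own calls needs nothing more than human-corrected reference transcripts and the STT output. Below is a minimal sketch of the standard word-level edit-distance definition; the lowercase-and-split normalization is a simplifying assumption, and a real evaluation would also normalize punctuation and numerals consistently. MOS is a listening-test score, so it is not something this kind of script can compute.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Word-level Levenshtein distance, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of four reference words.
wer = word_error_rate("where is my order", "where is order")
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, so treat it as an error ratio rather than a percentage of "wrong" words.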

Why orchestration: LiveKit / Pipecat

Between "model" and "phone call" there's a lot of real-time plumbing: transport (WebRTC/SIP), voice activity detection (VAD), end-of-turn detection, interruptions, wiring components together, scaling. Building it by hand takes months. So you use a framework: LiveKit Agents (infrastructure-first, native WebRTC + SIP; our main choice) or Pipecat (pipeline-first, control of every step). Managed platforms (Vapi, Retell, Synthflow) give a fast start at the cost of per-minute markup and vendor lock-in.
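To give a feel for what end-of-turn detection involves, here is a deliberately naive sketch: an energy threshold plus a run of silent frames. Everything in it (`EndOfTurnDetector`, the 0.02 threshold, 25 frames) is an assumption for illustration. Production frameworks rely on trained VAD and turn-detection models, which is exactly the plumbing you don't want to write yourself.

```python
from dataclasses import dataclass, field


@dataclass
class EndOfTurnDetector:
    """Toy end-of-turn detector: speech followed by enough consecutive silent frames."""
    energy_threshold: float = 0.02   # assumed RMS level separating speech from silence
    silence_frames_needed: int = 25  # e.g. 25 frames of 20 ms = 500 ms of silence
    heard_speech: bool = field(default=False, init=False)
    silent_run: int = field(default=0, init=False)

    def push(self, frame_rms: float) -> bool:
        """Feed one frame's RMS energy; returns True once the caller's turn has ended."""
        if frame_rms >= self.energy_threshold:
            self.heard_speech = True
            self.silent_run = 0  # any speech restarts the silence countdown
            return False
        if not self.heard_speech:
            return False  # leading silence: the caller has not started talking yet
        self.silent_run += 1
        return self.silent_run >= self.silence_frames_needed

    def reset(self) -> None:
        """Call this after the agent has replied, before listening for the next turn."""
        self.heard_speech = False
        self.silent_run = 0
```

The weak spot is visible right away: a caller who pauses mid-sentence for longer than the silence window gets cut off, and a noisy line never drops below the threshold. Both are common on the phone channel.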

CRM / CMS integration

Tools (function calling) are where the value lives: the model calls get_order_status(phone), your backend queries the relevant API (Shopify Admin, WooCommerce, delivery tracking, CRM), and the answer is spoken back to the caller. For standardized access to these systems, use MCP (Model Context Protocol) servers. Outbound scenarios fire on a webhook event (abandoned cart, parcel at the post office, risky order) and place a call via the same stack.
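The tool-calling loop can be sketched without any vendor SDK. In the example below, `get_order_status(phone)` is the tool named above; the in-memory `FAKE_ORDERS`, the `TOOL_SPECS` description and `dispatch_tool_call` are hypothetical, and a real handler would query your CRM and delivery API instead. The JSON-schema tool description follows the general shape function-calling APIs expect, but check your provider's docs for the exact envelope.

```python
import json
from typing import Any, Callable

# Hypothetical in-memory stand-in for the CRM and delivery-tracking lookups.
FAKE_ORDERS = {
    "+15550100": {"order_id": "A-1001", "status": "shipped", "eta": "tomorrow"},
}


def get_order_status(phone: str) -> dict[str, Any]:
    """Look up the latest order by the caller's phone number."""
    order = FAKE_ORDERS.get(phone)
    if order is None:
        return {"found": False}
    return {"found": True, **order}


# Tool description the model sees (JSON-schema style parameters).
TOOL_SPECS = [{
    "name": "get_order_status",
    "description": "Return the latest order status for a caller's phone number.",
    "parameters": {
        "type": "object",
        "properties": {"phone": {"type": "string"}},
        "required": ["phone"],
    },
}]

TOOL_REGISTRY: dict[str, Callable[..., dict[str, Any]]] = {
    "get_order_status": get_order_status,
}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run the tool the model asked for and return a JSON string to feed back to it."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)
        return json.dumps(handler(**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed arguments from the model should not crash the call.
        return json.dumps({"error": str(exc)})


result = dispatch_tool_call("get_order_status", '{"phone": "+15550100"}')
```

Returning errors as JSON instead of raising lets the model apologize or re-ask the caller, which matters more on a live call than in a chat window.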

Benefits, honestly

24/7 and peak scalability — the agent doesn't choke on Black Friday.
No lost calls = no lost orders — the most direct money effect.
Lower cost per contact on routine requests vs a live operator.
One transcript in the CRM — analytics, QA and training data at once.

Where voice AI does NOT belong

Emotionally complex / conflict calls — a human is needed; the agent's job is to escalate gracefully.
Low call volume — the integration won't pay off. Honestly: don't.
Dirty CRM data — voice will just announce your mess more loudly.
Legally/medically sensitive topics — not without a human.

The full article was published on DOU (in Ukrainian): dou.ua/forums/topic/60031.

Does voice AI pay off for a small online store?

It pays off where there's a volume of repetitive requests (status, returns, FAQ) and clean CRM data. If you get a few calls a day, the integration won't pay off — and we say so honestly.

Which model is best for the Ukrainian language?

For Ukrainian a cascaded approach (STT → LLM → TTS) usually wins: a strong STT (Whisper, Google Chirp, Azure, Soniox cover Ukrainian) plus a TTS with Ukrainian support (ElevenLabs, Respeecher). Realtime models don't yet offer strong native Ukrainian voices.

Will voice AI replace support operators?

No. The right split is routine to the machine, complex to the human: the agent closes typical requests 24/7 and hands complex cases to an operator with full context (human-in-the-loop).

How do you start an implementation?

Start with a single scenario (usually "order status") and run a ~2-week pilot on real traffic, with human escalation from day one. Measure deflection rate (the share of calls resolved without a human) and CSAT: if it works, you expand; if not, you wind it down honestly.
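Both pilot metrics fall straight out of the call log. Below is a minimal sketch with assumed definitions: deflection rate as the share of calls that never reached a human, and CSAT as the share of 4 and 5 ratings on a 1-5 post-call survey. `CallRecord` and its fields are hypothetical; agree on your own definitions before the pilot starts, not after.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallRecord:
    escalated_to_human: bool
    csat_score: Optional[int] = None  # 1-5 post-call rating; None if the caller skipped it


def deflection_rate(calls: list[CallRecord]) -> float:
    """Share of calls the agent closed without handing off to an operator."""
    if not calls:
        raise ValueError("no calls in the pilot window")
    resolved = sum(1 for c in calls if not c.escalated_to_human)
    return resolved / len(calls)


def csat(calls: list[CallRecord]) -> Optional[float]:
    """Share of satisfied callers (score 4 or 5) among those who left a rating."""
    scores = [c.csat_score for c in calls if c.csat_score is not None]
    if not scores:
        return None
    return sum(1 for s in scores if s >= 4) / len(scores)


pilot = [
    CallRecord(escalated_to_human=False, csat_score=5),
    CallRecord(escalated_to_human=False, csat_score=3),
    CallRecord(escalated_to_human=True, csat_score=4),
    CallRecord(escalated_to_human=False),
]
print(f"deflection: {deflection_rate(pilot):.0%}")
```

It is worth reading the two numbers together: a high deflection rate with falling CSAT usually means the agent is holding on to calls it should have escalated.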