How to Build a Ukrainian-Speaking Voice AI Agent for Your Business in 2026
First, what can voice agents actually do? The same things as text agents, but with voice :-) In principle that's exactly right, with one important addition — voice agents can make and receive phone calls and carry on conversations, which opens up additional value for your business.
I won't list every possible use case, because at Aceverse we focus on building solutions that actually work in your business.
How to Create a Real-Time Streaming Agent in 2026?
First, let's understand how it works and what options exist. Solutions can be roughly divided into two approaches:
- Speech-to-Speech models (e.g. kyutai-labs/moshi) — end-to-end approach
- Cascade pipeline — modular approach using Deepgram, ElevenLabs and others
The first approach gives incredible latency — down to 200ms end-to-end from the moment the user stops speaking to the agent's first audible response. These models also feature native emotional expressiveness.
However, using these models for specific business tasks is practically impossible today — they work purely speech-to-speech and accept only a prompt at session creation. No RAG, no tool calling, no ReAct. In simple terms, it's a great chatbot, but it can't integrate with your database or CRM.
Cascade Pipeline: How It Works
That's why the cascade pipeline is the real deal today. Why "cascade"? Because it chains several specialized models in sequence, each handling one stage of the conversation.
VAD — Voice Activity Detection
The pipeline starts with VAD — Voice Activity Detection. It determines where speech exists in the audio stream versus silence or noise. These are typically small, fast neural networks, such as Silero VAD.
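To make the idea concrete, here is a minimal sketch of what a VAD does. Real systems like Silero VAD use a small neural network; this toy version uses a plain RMS-energy threshold instead, purely to illustrate the frame-in, speech/silence-label-out interface. The frame size and threshold are arbitrary assumptions.

```python
import math

def energy_vad(samples, frame_size=320, threshold=0.01):
    """Label each frame of PCM samples (floats in [-1, 1]) as speech or
    silence based on root-mean-square energy. Production VADs (e.g. Silero
    VAD) replace this fixed threshold with a small neural network."""
    labels = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(s * s for s in frame) / frame_size)
        labels.append(rms > threshold)
    return labels

# A near-silent frame followed by a loud sine-wave frame:
silence = [0.001] * 320
speech = [math.sin(i / 10) * 0.5 for i in range(320)]
print(energy_vad(silence + speech))  # → [False, True]
```

The interface is the important part: downstream stages only ever see audio that the VAD has flagged as speech.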
STT (Speech-to-Text) — The Biggest Problem for Ukrainian
Next comes STT — probably the biggest challenge for anyone building real-time voice AI agents. These models recognize speech and convert it to text.
Unfortunately, there are currently no production-grade open-source models that work with real-time streaming and have good WER metrics for Ukrainian. What's particularly frustrating is that even Mistral's Voxtral-transcribe-2 supports 12 languages, including Russian, but not Ukrainian.
So we have to use APIs from Deepgram, ElevenLabs and others. These models account for the largest portion of latency — 400–600ms.
LLM — The Agent's Brain
Once we have text, we use an LLM to answer user questions or call tools. There are many more options here, but latency concerns remain.
Even the relatively fast GPT-4o has approximately 450–500ms latency. So we use the Groq API with open models, which brings this stage down to 200–300ms.
Even better — running models locally on GPU for faster inference, though this is costly at low call volumes.
TTS — Voice Generation
For voice generation, powerful options are emerging, but no production-ready model for Ukrainian that you can self-host exists yet.
The best sounding option is Respeecher AI with English and Ukrainian models. The voice quality is impressive, but the 400–500ms latency is too much for production.
Cartesia has 1–2 Ukrainian voices, but they don't sound right. So our favorite remains ElevenLabs eleven_flash_v2_5 — fast (under 200ms), reliable, many voices.
Total Latency — and Why It's a Problem
If you add up the total latency from when the user stops speaking to when the agent starts responding, it comes to over a second. Add SIP latency on top, and conversations with such an agent become unpleasant — users notice immediately.
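Adding up the figures from this article makes the problem obvious. The per-stage numbers below are the midpoints of the ranges quoted earlier, and the SIP/network figure is a rough assumption for illustration:

```python
# Per-stage latency budget (ms). STT, LLM and TTS figures are the midpoints
# of the ranges quoted above; the SIP/network figure is a rough assumption.
budget = {
    "VAD / end-of-speech pause": 200,
    "STT (streaming API)": 500,        # 400–600ms range
    "LLM (Groq, open model)": 250,     # 200–300ms range
    "TTS (eleven_flash_v2_5)": 200,    # "under 200ms"
    "SIP / network": 150,              # assumed, varies by carrier
}
total = sum(budget.values())
print(f"{total} ms")  # → 1300 ms — noticeably over the ~1 s comfort line
```

Humans start perceiving a response gap as unnatural well before 1.3 seconds, which is why the optimizations below matter.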
Practical Approaches to Reducing Latency
The biggest win comes from reducing the number of LLM calls per turn. We replaced a two-stage pipeline (classify LLM → respond LLM) with smart_respond — a single LLM call that simultaneously classifies the message and generates a response. This saved ~300ms per message.
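A minimal sketch of the idea behind a combined call like smart_respond: ask the model to classify and answer in one pass, returning structured JSON. The prompt wording, intent labels, and the stubbed-out LLM call are all hypothetical — in production the fake_llm function would be a real chat-completion request (e.g. to the Groq API).

```python
import json

def fake_llm(prompt: str) -> str:
    # Stand-in for a real chat-completion call; returns the JSON the
    # prompt asks for. Replace with an actual API request in production.
    return json.dumps({
        "intent": "pricing_question",
        "reply": "Our basic plan starts at $49 per month.",
    })

SMART_RESPOND_PROMPT = (
    "Classify the user's message AND write the reply in a single pass. "
    "Return JSON with keys 'intent' and 'reply'.\n\nUser message: "
)

def smart_respond(message: str) -> tuple[str, str]:
    """One LLM round-trip instead of two (classify -> respond),
    saving roughly one full model latency per turn."""
    raw = fake_llm(SMART_RESPOND_PROMPT + message)
    data = json.loads(raw)
    return data["intent"], data["reply"]

intent, reply = smart_respond("How much does it cost?")
print(intent)  # → pricing_question
```

The saving comes from serialization: two sequential LLM calls pay two full model latencies, while one structured call pays only one.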
Keyword pre-filter — regex patterns that execute in effectively zero time, intercepting greetings, goodbyes, and profanity without spending any time on the LLM at all.
Connection pooling (httpx persistent connections to Groq API) eliminates TCP/TLS handshake on each request — ~20–50ms savings.
Turn detector (EOUModel) adds semantic understanding of phrase endings on top of VAD — instead of reacting to any 200ms pause, the model evaluates whether the person actually finished their thought.
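To show the shape of the idea, here is a toy turn detector. The real EOUModel is a trained model; this sketch replaces it with a trivial trailing-word heuristic, and the pause thresholds are invented for illustration:

```python
def is_end_of_utterance(transcript: str, pause_ms: int) -> bool:
    """Combine the VAD's measured pause with a semantic guess about whether
    the phrase is finished. A real system uses a trained end-of-utterance
    model here; this heuristic just checks for trailing connector words."""
    looks_unfinished = transcript.rstrip().endswith((",", "and", "so", "but"))
    if looks_unfinished:
        return pause_ms > 1000   # wait much longer mid-thought
    return pause_ms > 200        # a short pause suffices after a full phrase

print(is_end_of_utterance("I want to book a table", 300))      # → True
print(is_end_of_utterance("I want to book a table and", 300))  # → False
```

The payoff: the agent stops interrupting people who pause mid-sentence, without adding a fixed delay to every single turn.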
Streaming Overlap: STT → LLM → TTS
The key idea is don't wait for the previous stage to complete — start the next one as soon as the first data appears. In the ideal pipeline, all three models work simultaneously.
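One way to see the overlap is with lazy generators: each stage pulls from the previous one, so a chunk flows through all three stages before the next chunk is even read. The stage bodies below are stand-ins, not real STT/LLM/TTS calls:

```python
def stt(audio_frames):
    """Yields recognized words as they arrive (stand-in for streaming STT)."""
    for frame in audio_frames:
        yield frame.upper()

def llm(words):
    """Starts generating as soon as the first words arrive (stand-in)."""
    for w in words:
        yield f"[{w}]"

def tts(tokens):
    """Synthesizes each chunk immediately instead of waiting (stand-in)."""
    for t in tokens:
        yield f"<audio {t}>"

# Because generators are lazy, the stages interleave in time rather than
# running strictly one after another.
pipeline = tts(llm(stt(["pry", "vit"])))
print(list(pipeline))  # → ['<audio [PRY]>', '<audio [VIT]>']
```

In a real agent the same pattern runs over async streams and websockets, but the control flow is identical: nothing ever waits for a full upstream result.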
STT → LLM
Streaming STT produces interim results before the person finishes speaking. A more aggressive approach is speculative inference: run the LLM on interim text, and if the final matches — the response is already ready.
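A bare-bones sketch of speculative inference, with the LLM stubbed out: fire the call on the interim transcript, and if the final transcript turns out identical, the answer is already waiting.

```python
cache: dict[str, str] = {}

def llm(text: str) -> str:
    return f"answer({text})"   # stand-in for a real LLM call

def on_interim(text: str) -> None:
    """Speculative call on an interim transcript; may be wasted work."""
    cache[text] = llm(text)

def on_final(text: str) -> str:
    if text in cache:          # speculation paid off: near-zero latency
        return cache.pop(text)
    return llm(text)           # transcript changed: fall back to normal call

on_interim("book a table")
print(on_final("book a table"))  # → answer(book a table)
```

The trade-off is extra LLM spend on speculative calls that get discarded, so this only makes sense where latency matters more than token cost.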
LLM → TTS
The most critical junction. The LLM generates a response token by token. Instead of waiting for the complete response, TTS starts synthesizing speech as soon as the first sentence accumulates.
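The usual mechanism is a sentence accumulator sitting between the two streams. A sketch, assuming a simple punctuation-based sentence boundary (real systems often use smarter segmentation):

```python
import re

def sentences_from_tokens(tokens):
    """Accumulate streamed LLM tokens and yield each sentence the moment it
    completes, so TTS can start long before the full reply is generated."""
    buf = ""
    for tok in tokens:
        buf += tok
        # A sentence is "done" when .!? is followed by whitespace.
        while (m := re.search(r"[.!?]\s", buf)):
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()      # flush whatever remains at end of stream

tokens = ["Hello", "! ", "Your order ", "ships ", "tomorrow", "."]
print(list(sentences_from_tokens(tokens)))
# → ['Hello!', 'Your order ships tomorrow.']
```

With this in place, the user hears the first sentence while the model is still writing the second.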
The key metric is TTFB (Time To First Byte) of audio: time from end of user speech to first sound of response.
Phrase Caching
Pre-cached wav files are delivered to the user with zero latency. This significantly improves the conversation, especially at the beginning of the dialogue, when the question is whether the person will hang up or keep listening.
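Conceptually this is just a lookup table consulted before the TTS call. The phrase IDs and placeholder byte strings below are invented; in production the cache would hold real pre-rendered wav data:

```python
# Hypothetical phrase cache: frequent lines are synthesized once, offline,
# and served as ready-made audio with zero TTS latency at call time.
PHRASE_CACHE: dict[str, bytes] = {
    "greeting": b"<pre-rendered greeting.wav bytes>",
    "hold_on":  b"<pre-rendered hold_on.wav bytes>",
}

def synthesize(text: str) -> bytes:
    return f"<tts:{text}>".encode()   # stand-in for a real TTS API call

def speak(phrase_id: str, text: str) -> bytes:
    audio = PHRASE_CACHE.get(phrase_id)
    if audio is not None:
        return audio                  # cache hit: skip TTS entirely
    return synthesize(text)           # cache miss: normal TTS call

print(speak("greeting", "Hello!") == PHRASE_CACHE["greeting"])  # → True
```

The greeting is the highest-value entry: it plays in the first second of the call, exactly when the caller decides whether to stay on the line.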
Try the agent right now
Try Aceverse's voice AI agent on our website or book a free demo for your business.