How to Build a Ukrainian-Speaking Voice AI Agent for Your Business in 2026
First, what can voice agents actually do? The same things as text agents, but with voice :-) In principle that's exactly right, with one important addition — voice agents can make and receive phone calls and carry on conversations, which opens up additional value for your business.
I won't list every possible use case, because at Aceverse we focus on building solutions that actually work in your business.
How to Create a Real-Time Streaming Agent in 2026?
First, let's understand how it works and what options exist. Solutions can be roughly divided into two approaches:
- Speech-to-Speech models (e.g. kyutai-labs/moshi) — end-to-end approach
- Cascade pipeline — modular approach using Deepgram, ElevenLabs and others
The first approach gives incredible latency — down to 200ms end-to-end from the moment the user stops speaking to the agent's first audible response. These models also feature native emotional expressiveness.
However, using these models for specific business tasks is practically impossible today — they work purely speech-to-speech and accept only a prompt at session creation. No RAG, no tool calling, no ReAct. In simple terms, it's a great chatbot, but it can't integrate with your database or CRM.
Cascade Pipeline: How It Works
That's why the cascade pipeline is the real deal today. Why "cascade"? Because it chains several specialized models in sequence, each handling one stage of the conversation.
VAD — Voice Activity Detection
The pipeline starts with VAD — Voice Activity Detection. It determines where speech exists in the audio stream versus silence or noise. These are typically small, fast neural networks, such as Silero VAD.
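To make the idea concrete, here is a minimal sketch of what a VAD does. Real systems like Silero VAD use a small neural network; this toy version uses a plain RMS-energy threshold instead, purely to illustrate the frame-in, speech/silence-label-out interface. The frame size and threshold are arbitrary assumptions.

```python
import math

def energy_vad(samples, frame_size=320, threshold=0.01):
    """Label each frame of PCM samples (floats in [-1, 1]) as speech or
    silence based on root-mean-square energy. Production VADs (e.g. Silero
    VAD) replace this fixed threshold with a small neural network."""
    labels = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(s * s for s in frame) / frame_size)
        labels.append(rms > threshold)
    return labels

# A near-silent frame followed by a loud sine-wave frame:
silence = [0.001] * 320
speech = [math.sin(i / 10) * 0.5 for i in range(320)]
print(energy_vad(silence + speech))  # → [False, True]
```

The interface is the important part: downstream stages only ever see audio that the VAD has flagged as speech.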
STT (Speech-to-Text) — The Biggest Problem for Ukrainian
Next comes STT — probably the biggest challenge for anyone building real-time voice AI agents. These models recognize speech and convert it to text.
Unfortunately, there are currently no production-grade open-source models that work with real-time streaming and have good WER metrics for Ukrainian. What's particularly frustrating is that even Mistral's Voxtral-transcribe-2 supports 12 languages, including Russian, but not Ukrainian.
So we have to use APIs from Deepgram, ElevenLabs and others. These models account for the largest portion of latency — 400–600ms.
LLM — The Agent's Brain
Once we have text, we use an LLM to answer user questions or call tools. There are many more options here, but latency concerns remain.
Even the relatively fast GPT-4o has approximately 450–500ms latency. So we use the Groq API with open models, which brings this stage down to 200–300ms.
Even better — running models locally on GPU for faster inference, though this is costly at low call volumes.
TTS — Voice Generation
For voice generation, powerful options are emerging, but no production-ready model for Ukrainian that you can self-host exists yet.
The best sounding option is Respeecher AI with English and Ukrainian models. The voice quality is impressive, but the 400–500ms latency is too much for production.
Cartesia has 1–2 Ukrainian voices, but they don't sound right. So our favorite remains ElevenLabs eleven_flash_v2_5 — fast (under 200ms), reliable, many voices.
Total Latency — and Why It's a Problem
If you add up the total latency from when the user stops speaking to when the agent starts responding, it comes to over a second. Add SIP latency on top, and conversations with such an agent become unpleasant — users notice immediately.
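Adding up the figures from this article makes the problem obvious. The per-stage numbers below are the midpoints of the ranges quoted earlier, and the SIP/network figure is a rough assumption for illustration:

```python
# Per-stage latency budget (ms). STT, LLM and TTS figures are the midpoints
# of the ranges quoted above; the SIP/network figure is a rough assumption.
budget = {
    "VAD / end-of-speech pause": 200,
    "STT (streaming API)": 500,        # 400–600ms range
    "LLM (Groq, open model)": 250,     # 200–300ms range
    "TTS (eleven_flash_v2_5)": 200,    # "under 200ms"
    "SIP / network": 150,              # assumed, varies by carrier
}
total = sum(budget.values())
print(f"{total} ms")  # → 1300 ms — noticeably over the ~1 s comfort line
```

Humans start perceiving a response gap as unnatural well before 1.3 seconds, which is why the optimizations below matter.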
Practical Approaches to Reducing Latency
The biggest win comes from reducing the number of LLM calls per turn. We replaced a two-stage pipeline (classify LLM → respond LLM) with smart_respond — a single LLM call that simultaneously classifies the message and generates a response. This saved ~300ms per message.
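A minimal sketch of the idea behind a combined call like smart_respond: ask the model to classify and answer in one pass, returning structured JSON. The prompt wording, intent labels, and the stubbed-out LLM call are all hypothetical — in production the fake_llm function would be a real chat-completion request (e.g. to the Groq API).

```python
import json

def fake_llm(prompt: str) -> str:
    # Stand-in for a real chat-completion call; returns the JSON the
    # prompt asks for. Replace with an actual API request in production.
    return json.dumps({
        "intent": "pricing_question",
        "reply": "Our basic plan starts at $49 per month.",
    })

SMART_RESPOND_PROMPT = (
    "Classify the user's message AND write the reply in a single pass. "
    "Return JSON with keys 'intent' and 'reply'.\n\nUser message: "
)

def smart_respond(message: str) -> tuple[str, str]:
    """One LLM round-trip instead of two (classify -> respond),
    saving roughly one full model latency per turn."""
    raw = fake_llm(SMART_RESPOND_PROMPT + message)
    data = json.loads(raw)
    return data["intent"], data["reply"]

intent, reply = smart_respond("How much does it cost?")
print(intent)  # → pricing_question
```

The saving comes from serialization: two sequential LLM calls pay two full model latencies, while one structured call pays only one.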
Keyword pre-filter — regex patterns that execute in effectively zero time, intercepting greetings, goodbyes, and profanity without spending any time on the LLM at all.
Connection pooling (httpx persistent connections to Groq API) eliminates TCP/TLS handshake on each request — ~20–50ms savings.
Turn detector (EOUModel) adds semantic understanding of phrase endings on top of VAD — instead of reacting to any 200ms pause, the model evaluates whether the person actually finished their thought.
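To show the shape of the idea, here is a toy turn detector. The real EOUModel is a trained model; this sketch replaces it with a trivial trailing-word heuristic, and the pause thresholds are invented for illustration:

```python
def is_end_of_utterance(transcript: str, pause_ms: int) -> bool:
    """Combine the VAD's measured pause with a semantic guess about whether
    the phrase is finished. A real system uses a trained end-of-utterance
    model here; this heuristic just checks for trailing connector words."""
    looks_unfinished = transcript.rstrip().endswith((",", "and", "so", "but"))
    if looks_unfinished:
        return pause_ms > 1000   # wait much longer mid-thought
    return pause_ms > 200        # a short pause suffices after a full phrase

print(is_end_of_utterance("I want to book a table", 300))      # → True
print(is_end_of_utterance("I want to book a table and", 300))  # → False
```

The payoff: the agent stops interrupting people who pause mid-sentence, without adding a fixed delay to every single turn.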
Streaming Overlap: STT → LLM → TTS
The key idea is don't wait for the previous stage to complete — start the next one as soon as the first data appears. In the ideal pipeline, all three models work simultaneously.
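One way to see the overlap is with lazy generators: each stage pulls from the previous one, so a chunk flows through all three stages before the next chunk is even read. The stage bodies below are stand-ins, not real STT/LLM/TTS calls:

```python
def stt(audio_frames):
    """Yields recognized words as they arrive (stand-in for streaming STT)."""
    for frame in audio_frames:
        yield frame.upper()

def llm(words):
    """Starts generating as soon as the first words arrive (stand-in)."""
    for w in words:
        yield f"[{w}]"

def tts(tokens):
    """Synthesizes each chunk immediately instead of waiting (stand-in)."""
    for t in tokens:
        yield f"<audio {t}>"

# Because generators are lazy, the stages interleave in time rather than
# running strictly one after another.
pipeline = tts(llm(stt(["pry", "vit"])))
print(list(pipeline))  # → ['<audio [PRY]>', '<audio [VIT]>']
```

In a real agent the same pattern runs over async streams and websockets, but the control flow is identical: nothing ever waits for a full upstream result.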
STT → LLM
Streaming STT produces interim results before the person finishes speaking. A more aggressive approach is speculative inference: run the LLM on interim text, and if the final matches — the response is already ready.
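A bare-bones sketch of speculative inference, with the LLM stubbed out: fire the call on the interim transcript, and if the final transcript turns out identical, the answer is already waiting.

```python
cache: dict[str, str] = {}

def llm(text: str) -> str:
    return f"answer({text})"   # stand-in for a real LLM call

def on_interim(text: str) -> None:
    """Speculative call on an interim transcript; may be wasted work."""
    cache[text] = llm(text)

def on_final(text: str) -> str:
    if text in cache:          # speculation paid off: near-zero latency
        return cache.pop(text)
    return llm(text)           # transcript changed: fall back to normal call

on_interim("book a table")
print(on_final("book a table"))  # → answer(book a table)
```

The trade-off is extra LLM spend on speculative calls that get discarded, so this only makes sense where latency matters more than token cost.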
LLM → TTS
The most critical junction. The LLM generates a response token by token. Instead of waiting for the complete response, TTS starts synthesizing speech as soon as the first sentence accumulates.
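The usual mechanism is a sentence accumulator sitting between the two streams. A sketch, assuming a simple punctuation-based sentence boundary (real systems often use smarter segmentation):

```python
import re

def sentences_from_tokens(tokens):
    """Accumulate streamed LLM tokens and yield each sentence the moment it
    completes, so TTS can start long before the full reply is generated."""
    buf = ""
    for tok in tokens:
        buf += tok
        # A sentence is "done" when .!? is followed by whitespace.
        while (m := re.search(r"[.!?]\s", buf)):
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()      # flush whatever remains at end of stream

tokens = ["Hello", "! ", "Your order ", "ships ", "tomorrow", "."]
print(list(sentences_from_tokens(tokens)))
# → ['Hello!', 'Your order ships tomorrow.']
```

With this in place, the user hears the first sentence while the model is still writing the second.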
The key metric is TTFB (Time To First Byte) of audio: time from end of user speech to first sound of response.
Phrase Caching
Pre-cached wav files are delivered to the user with zero latency. This significantly improves the conversation, especially at the beginning of the dialogue, when the question is whether the person will hang up or keep listening.
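Conceptually this is just a lookup table consulted before the TTS call. The phrase IDs and placeholder byte strings below are invented; in production the cache would hold real pre-rendered wav data:

```python
# Hypothetical phrase cache: frequent lines are synthesized once, offline,
# and served as ready-made audio with zero TTS latency at call time.
PHRASE_CACHE: dict[str, bytes] = {
    "greeting": b"<pre-rendered greeting.wav bytes>",
    "hold_on":  b"<pre-rendered hold_on.wav bytes>",
}

def synthesize(text: str) -> bytes:
    return f"<tts:{text}>".encode()   # stand-in for a real TTS API call

def speak(phrase_id: str, text: str) -> bytes:
    audio = PHRASE_CACHE.get(phrase_id)
    if audio is not None:
        return audio                  # cache hit: skip TTS entirely
    return synthesize(text)           # cache miss: normal TTS call

print(speak("greeting", "Hello!") == PHRASE_CACHE["greeting"])  # → True
```

The greeting is the highest-value entry: it plays in the first second of the call, exactly when the caller decides whether to stay on the line.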
Try the agent right now
Try Aceverse's voice AI agent on our website or book a free demo for your business.