Preamble

Whisper is automatic speech recognition: audio in, text out. Listing it next to Kokoro and Qwen3-TTS still makes sense because voice products need both directions. The failure mode to watch for is teams treating Whisper like a speech generator or expecting TTS to “listen.” Naming the loop clearly saves weeks of misfired tickets.


Strengths to lean on

Whisper-class models are a strong open baseline for:

  • Captions and meeting notes when privacy policy allows cloud or self-hosted inference.
  • Voice commands where vocabulary is broad and accents vary.
  • RAG ingestion over spoken content: transcribe audio, chunk text, embed — the same retrieval loop from RAG Foundations: Embeddings, Chunking, and the Retrieval Loop, with a microphone upstream.

The robustness across noise and accents is the headline feature; it is why Whisper became the default mental model for ASR in many stacks.


Costs and real-time constraints

Larger tiers improve accuracy and robustness; they also raise latency and compute. Real-time use often means chunked inference, voice activity detection (VAD) to skip silence, or smaller distilled ASR models when Whisper is overkill.
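Production VAD should come from a dedicated library (silero-vad, WebRTC VAD). As a hedged illustration of what "skip silence" means mechanically, here is a naive energy-threshold gate over fixed-size PCM frames; the threshold and frame length are arbitrary assumptions, and unlike a real VAD this will mistake loud music for speech:

```python
import array
import math

def energy_vad(pcm16: bytes, sample_rate: int = 16000,
               frame_ms: int = 30, threshold: float = 500.0):
    """Naive energy-based VAD: yield (start_s, end_s, is_speech) per frame.

    Illustrative only — real systems should use silero-vad or WebRTC VAD.
    This merely thresholds RMS energy per frame.
    """
    samples = array.array("h", pcm16)            # 16-bit signed PCM
    frame_len = sample_rate * frame_ms // 1000
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        t0 = i / sample_rate
        yield (t0, t0 + frame_ms / 1000, rms >= threshold)

# Example: 1 s of silence followed by 1 s of a loud 440 Hz tone.
silence = array.array("h", [0] * 16000)
tone = array.array("h", [int(8000 * math.sin(2 * math.pi * 440 * t / 16000))
                         for t in range(16000)])
flags = [speech for _, _, speech in energy_vad((silence + tone).tobytes())]
```

Frames over the silent first half come back as non-speech and would be skipped before any decode call, which is the latency and hallucination win the paragraph above describes.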

Hardware is sized the same way as LLM serving: expected concurrency, SLO for end-to-end voice turns, and batching strategy.
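A back-of-envelope version of that sizing exercise, with every number an illustrative assumption rather than a benchmark (real-time factor and batching efficiency must be measured on your hardware):

```python
import math

def asr_gpu_estimate(concurrent_streams: int, rtf: float,
                     batch_efficiency: float = 0.8) -> int:
    """Rough GPU count for streaming ASR.

    rtf: real-time factor on one GPU (compute seconds per second of
    audio); rtf=0.25 means one GPU keeps up with ~4 real-time streams.
    batch_efficiency discounts batching and scheduling overhead.
    All values here are placeholders, not measured numbers.
    """
    streams_per_gpu = (1.0 / rtf) * batch_efficiency
    return math.ceil(concurrent_streams / streams_per_gpu)

# 40 concurrent callers at an assumed rtf of 0.25:
# 1/0.25 * 0.8 = 3.2 streams per GPU -> 13 GPUs.
print(asr_gpu_estimate(40, 0.25))
```

The SLO enters through rtf: a tighter end-to-end voice-turn budget forces a smaller model or a lower effective rtf, which pushes the GPU count up.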


Failure modes: hallucinations and music

ASR models can emit plausible text in silent segments or under music—classic hallucination territory. Mitigations include VAD, confidence heuristics, and post-filters that reject empty-audio transcripts. None are perfect; residual risk should be documented for product and legal review.
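One such post-filter can be sketched as an overlap check between transcript segments and VAD-detected speech spans; the overlap ratio and its 0.5 default are assumed heuristics, not a tuned policy:

```python
def filter_segments(segments, speech_spans, min_overlap=0.5):
    """Drop transcript segments that don't overlap detected speech.

    segments: list of (start_s, end_s, text) from the ASR decode.
    speech_spans: list of (start_s, end_s) from a VAD pass.
    A segment survives only if at least `min_overlap` of its duration
    falls inside speech spans. Heuristic sketch with residual risk:
    VAD itself can mislabel music as speech.
    """
    kept = []
    for s0, s1, text in segments:
        dur = max(s1 - s0, 1e-9)
        overlap = sum(max(0.0, min(s1, b) - max(s0, a))
                      for a, b in speech_spans)
        if overlap / dur >= min_overlap:
            kept.append((s0, s1, text))
    return kept

# A hallucinated segment over a music-only region (no speech span) is dropped.
segs = [(0.0, 2.0, "hello there"), (10.0, 12.0, "thanks for watching")]
speech = [(0.0, 2.5)]
print(filter_segments(segs, speech))
```

"Thanks for watching" over silence or music is a classic Whisper hallucination pattern, which is why the second segment is the one this filter rejects.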


Architecture placement

For assistants, Whisper belongs before the LLM:

microphone → ASR → text → (RAG + LLM) → text → TTS → speaker

That ordering keeps responsibilities clean. Qwen3-TTS and Multilingual Neural TTS Trade-offs focuses on neural TTS output quality; RVC in the Stack: Voice Conversion After TTS adds RVC when the output timbre must match a specific speaker profile.
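The ordering can be made concrete as a composition of three components; every body below is a placeholder stub (the strings and byte values are invented for illustration), but the function boundaries are the real architectural point:

```python
# Hypothetical stubs: in a real system each function wraps an ASR model,
# a retriever + LLM, and a TTS engine respectively.
def asr(audio: bytes) -> str:
    return "what is the return policy"        # stub transcript

def rag_llm(user_text: str) -> str:
    return f"Answering: {user_text}"          # stub retrieval + generation

def tts(text: str) -> bytes:
    return text.encode("utf-8")               # stub waveform

def voice_turn(audio: bytes) -> bytes:
    """microphone -> ASR -> text -> (RAG + LLM) -> text -> TTS -> speaker."""
    return tts(rag_llm(asr(audio)))

print(voice_turn(b"\x00\x01"))
```

Because each stage takes and returns a plain type (bytes or str), each can be swapped, timed, and tested independently — the "clean responsibilities" the text argues for.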


Step-by-step: batch transcription you can regression-test

  1. Pin model — large-v3 vs medium is a product choice; record the checkpoint file hash or faster-whisper wheel version.
  2. Normalize audio — Resample to what the model expects (often 16 kHz mono); store ffmpeg command lines in the ADR.
  3. Chunk long files — VAD-based segments (silero, WebRTC VAD) reduce hallucinated text in music-only regions.
  4. Decode — faster-whisper or whisper.cpp with fixed beam_size / temperature for reproducibility; log the detected language explicitly, or force language="en" when appropriate.
  5. Post-filter — Drop segments with no speech per VAD; optionally cap max segment duration.

CLI-shaped reproducible run (faster-whisper example):

python - <<'PY'
from faster_whisper import WhisperModel

# Pin checkpoint and decode settings so runs are reproducible.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "sample.wav", beam_size=5, temperature=0.0, vad_filter=True
)
print("lang", info.language, "prob", info.language_probability)
for s in segments:
    print(f"[{s.start:.2f}-{s.end:.2f}] {s.text}")
PY

Pros: excellent out-of-the-box WER across diverse accents. Cons: GPU memory and latency for the large-* tiers, and hallucinations under silence or music unless VAD and a post-filter policy are in place.


Feeding RAG from audio (retrieval loop + microphone upstream)

After ASR, the pipeline is plain text: transcript → chunk → embed → index (same shape as RAG Foundations: Embeddings, Chunking, and the Retrieval Loop). Record ASR model id as metadata on derived chunks so you can re-index when ASR improves. Do not embed raw audio in a text embedder without a dedicated multimodal model—that is a different architecture.
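A minimal sketch of that handoff, assuming (start, end, text) segment tuples and a flat dict per chunk; the character-budget chunking is a deliberate simplification of the strategies in the RAG chapter, and the field names are illustrative:

```python
def chunk_transcript(segments, asr_model_id: str, max_chars: int = 200):
    """Pack ASR segments into text chunks that carry ASR provenance.

    segments: list of (start_s, end_s, text). Each chunk records the
    ASR model id so derived embeddings can be re-indexed when the ASR
    model improves. Chunking by character budget is a simplification.
    """
    chunks, buf, t0 = [], [], None
    for s0, s1, text in segments:
        if t0 is None:
            t0 = s0
        buf.append(text)
        if sum(len(t) for t in buf) >= max_chars:
            chunks.append({"text": " ".join(buf),
                           "start_s": t0, "end_s": s1,
                           "asr_model": asr_model_id})
            buf, t0 = [], None
    if buf:  # flush the trailing partial chunk
        chunks.append({"text": " ".join(buf),
                       "start_s": t0, "end_s": segments[-1][1],
                       "asr_model": asr_model_id})
    return chunks

chunks = chunk_transcript([(0.0, 5.0, "hello world"),
                           (5.0, 9.0, "more speech")],
                          asr_model_id="large-v3", max_chars=10)
print(len(chunks), chunks[0]["asr_model"])
```

Keeping start_s / end_s on each chunk also lets retrieval results deep-link back into the original audio, which is hard to retrofit later.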


Conclusion

Whisper is the ears of the stack, not the mouth. Treating ASR and TTS symmetrically at the API layer—timeouts, quotas, logging—makes the full voice pipeline operable.