<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Design Patterns on Veydh&#39;s Blog</title>
    <link>https://blog.veydh.com/tags/design-patterns/</link>
    <description>Recent content in Design Patterns on Veydh&#39;s Blog</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <lastBuildDate>Wed, 10 Apr 2024 00:00:00 +0000</lastBuildDate>
    
    <atom:link href="https://blog.veydh.com/tags/design-patterns/" rel="self" type="application/rss+xml" />
    <item>
      <title>Qwen3-TTS and Multilingual Neural TTS Trade-offs</title>
      <link>https://blog.veydh.com/2026/2026-04-qwen3-tts-and-multilingual-neural-tts/</link>
      <pubDate>Sat, 04 Apr 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-04-qwen3-tts-and-multilingual-neural-tts/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;Large &lt;strong&gt;multilingual TTS&lt;/strong&gt; stacks—including &lt;strong&gt;Qwen3-TTS&lt;/strong&gt;-class models—aim for natural &lt;strong&gt;prosody&lt;/strong&gt;, broad &lt;strong&gt;language coverage&lt;/strong&gt;, and assistant-grade clarity. They occupy the opposite corner of the design space from &lt;a href=&#34;https://blog.veydh.com/2026/2026-03-kokoro-tts-lightweight-synthesis/&#34;&gt;Kokoro TTS: Lightweight Neural Speech and Where It Fits&lt;/a&gt;’s lightweight engines: more expressive, heavier to run, and proportionally more complex to operate.&lt;/p&gt;
&lt;p&gt;This post is not a benchmark manifesto; it is a frame for deciding &lt;strong&gt;when&lt;/strong&gt; to pay that cost.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;when-heavy-tts-earns-its-gpus&#34;&gt;When heavy TTS earns its GPUs&lt;/h2&gt;
&lt;p&gt;Moving up-stack makes sense when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The product is &lt;strong&gt;voice-first&lt;/strong&gt;—users judge quality in seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multilingual&lt;/strong&gt; parity matters; lightweight bundles do not cover locales we sell into.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Controllability&lt;/strong&gt; (rate, style, explicit speaker control when supported) is part of the UX contract.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In those cases, cost per minute is still tracked, but it is weighed against churn and brand perception—not against batch IVR pricing alone.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;operational-reality&#34;&gt;Operational reality&lt;/h2&gt;
&lt;p&gt;Bigger models mean &lt;strong&gt;more VRAM&lt;/strong&gt;, longer cold starts unless kept warm, and trickier autoscaling. TTS capacity is co-designed with &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-llm-serving-apis-openai-compatible/&#34;&gt;Serving Distilled Models Behind an HTTP API&lt;/a&gt;’s LLM serving: both contend for GPU pools unless fleets are split.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Licensing&lt;/strong&gt; and &lt;strong&gt;voice rights&lt;/strong&gt; do not disappear because audio is synthetic. Terms of use for checkpoints, commercial redistribution, and cloned speaker data still bind shipping products. Legal review belongs in the same loop as for third-party LLM APIs.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;comparison-back-to-kokoro-class&#34;&gt;Comparison back to Kokoro-class&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://blog.veydh.com/2026/2026-03-kokoro-tts-lightweight-synthesis/&#34;&gt;Kokoro TTS: Lightweight Neural Speech and Where It Fits&lt;/a&gt;’s lightweight path trades &lt;strong&gt;range and nuance&lt;/strong&gt; for &lt;strong&gt;efficiency&lt;/strong&gt;. Heavy TTS inverts that trade. Neither is “correct” globally—&lt;strong&gt;requirements&lt;/strong&gt; pick the winner.&lt;/p&gt;
&lt;p&gt;When marketing later demands a &lt;strong&gt;specific&lt;/strong&gt; timbre not native to the TTS roster, &lt;strong&gt;RVC&lt;/strong&gt; (&lt;a href=&#34;https://blog.veydh.com/2026/2026-04-rvc-voice-conversion-architecture/&#34;&gt;RVC in the Stack: Voice Conversion After TTS&lt;/a&gt;) becomes the bridge: synthesize with a strong base, then convert timbre—at the cost of latency and governance overhead.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;decision-worksheet-fill-before-you-buy-gpus&#34;&gt;Decision worksheet (fill before you buy GPUs)&lt;/h2&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Question&lt;/th&gt;
          &lt;th&gt;If “yes” →&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Do we sell into locales beyond the lightweight roster?&lt;/td&gt;
          &lt;td&gt;Bias toward &lt;strong&gt;heavy multilingual&lt;/strong&gt;; prove WER/MOS per locale.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Is voice a &lt;strong&gt;primary&lt;/strong&gt; differentiator (podcast, companion app)?&lt;/td&gt;
          &lt;td&gt;Bias toward &lt;strong&gt;heavy&lt;/strong&gt;; budget warm pools.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Is this &lt;strong&gt;IVR / notifications&lt;/strong&gt; at huge volume?&lt;/td&gt;
          &lt;td&gt;Bias toward &lt;strong&gt;light&lt;/strong&gt; (&lt;a href=&#34;https://blog.veydh.com/2026/2026-03-kokoro-tts-lightweight-synthesis/&#34;&gt;Kokoro TTS: Lightweight Neural Speech and Where It Fits&lt;/a&gt;); measure RTF.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Must timbre match a &lt;strong&gt;specific&lt;/strong&gt; human?&lt;/td&gt;
          &lt;td&gt;Plan &lt;strong&gt;RVC&lt;/strong&gt; (&lt;a href=&#34;https://blog.veydh.com/2026/2026-04-rvc-voice-conversion-architecture/&#34;&gt;RVC in the Stack: Voice Conversion After TTS&lt;/a&gt;) + legal; rarely solved by TTS speaker id alone.&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2 id=&#34;operational-comparison-vs-lightweight-kokoro-class-tts&#34;&gt;Operational comparison vs lightweight Kokoro-class TTS&lt;/h2&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Dimension&lt;/th&gt;
          &lt;th&gt;Heavy multilingual (Qwen3-TTS-class)&lt;/th&gt;
          &lt;th&gt;Lightweight TTS (&lt;a href=&#34;https://blog.veydh.com/2026/2026-03-kokoro-tts-lightweight-synthesis/&#34;&gt;Kokoro TTS: Lightweight Neural Speech and Where It Fits&lt;/a&gt;)&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;GPU RAM&lt;/td&gt;
          &lt;td&gt;High; often multi-GPU at scale&lt;/td&gt;
          &lt;td&gt;Low; CPU viable&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Cold start&lt;/td&gt;
          &lt;td&gt;Keep-warm or pay latency&lt;/td&gt;
          &lt;td&gt;Usually friendly to autoscale&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Voice control&lt;/td&gt;
          &lt;td&gt;Richer style/locale knobs&lt;/td&gt;
          &lt;td&gt;Smaller roster&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Upgrade risk&lt;/td&gt;
          &lt;td&gt;Large checkpoint swaps&lt;/td&gt;
          &lt;td&gt;Smaller binary swaps&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Reproducibility:&lt;/strong&gt; pin &lt;strong&gt;checkpoint&lt;/strong&gt;, &lt;strong&gt;speaker/style tokens&lt;/strong&gt;, &lt;strong&gt;sample rate&lt;/strong&gt;, and &lt;strong&gt;inference framework commit&lt;/strong&gt; the same way &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-mistral-huggingface-lora-training-notes/&#34;&gt;Fine-Tuning a Mistral-Class Checkpoint with LoRA on Hugging Face&lt;/a&gt; pins LLM Hub revisions.&lt;/p&gt;
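&lt;p&gt;A minimal sketch of that pinning discipline, assuming hypothetical field names (nothing here is a real Qwen3-TTS API): record the four identifiers as one manifest and store it beside the golden waveforms.&lt;/p&gt;

```python
# Hypothetical manifest shape; field names are illustrative, not a real API.
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class TtsManifest:
    checkpoint: str        # content hash or Hub revision of the TTS checkpoint
    speaker_token: str     # pinned speaker/style identifier
    sample_rate_hz: int    # output sample rate baked into the goldens
    framework_commit: str  # inference framework git commit

manifest = TtsManifest(
    checkpoint="sha256:0e1f2a3b",     # placeholder value
    speaker_token="style_formal_01",
    sample_rate_hz=24000,
    framework_commit="deadbeef",
)

# Persist beside golden waveforms so any regression diff names its exact stack.
manifest_json = json.dumps(asdict(manifest), sort_keys=True)
```

Any change to any of the four fields should invalidate the golden comparison, the same way a Hub revision bump invalidates a pinned LoRA run.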
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Choose &lt;strong&gt;heavy TTS&lt;/strong&gt; when quality and multilingual breadth dominate unit economics; choose &lt;strong&gt;light TTS&lt;/strong&gt; when cost per minute and footprint dominate. For an optional &lt;strong&gt;RVC&lt;/strong&gt; second stage aimed at timbre control, see &lt;a href=&#34;https://blog.veydh.com/2026/2026-04-rvc-voice-conversion-architecture/&#34;&gt;RVC in the Stack: Voice Conversion After TTS&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>RVC in the Stack: Voice Conversion After TTS</title>
      <link>https://blog.veydh.com/2026/2026-04-rvc-voice-conversion-architecture/</link>
      <pubDate>Sat, 04 Apr 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-04-rvc-voice-conversion-architecture/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Retrieval-based Voice Conversion (RVC)&lt;/strong&gt; maps &lt;strong&gt;audio&lt;/strong&gt; from one timbre toward another using models trained on &lt;strong&gt;target-speaker&lt;/strong&gt; data. It is not a text front-end: it consumes waveforms. That is why the common assistant pattern chains &lt;strong&gt;TTS → RVC&lt;/strong&gt; when you want machine-generated speech to resemble a &lt;strong&gt;specific&lt;/strong&gt; voice embedding learned from clean clips.&lt;/p&gt;
&lt;p&gt;This is separate from “pick speaker ID 3 in the TTS API.” RVC is about &lt;strong&gt;conversion&lt;/strong&gt;, not selecting a bundled persona—though the two blur in product marketing.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;reference-architecture&#34;&gt;Reference architecture&lt;/h2&gt;
&lt;p&gt;The TTS-first pipeline:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Text → TTS engine (neutral or roster speaker) → waveform → RVC → output audio&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Each hop adds &lt;strong&gt;latency&lt;/strong&gt; and a &lt;strong&gt;failure domain&lt;/strong&gt;. TTS prosody mistakes propagate; RVC adjusts timbre but cannot always rewrite cadence or emotion wholesale.&lt;/p&gt;
&lt;p&gt;An alternate path is &lt;strong&gt;human speech → RVC&lt;/strong&gt; for dubbing, anonymization, or creative workflows. Consent and contract terms differ sharply from synthetic pipelines; governance checklists should not be reused blindly across those use cases.&lt;/p&gt;
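&lt;p&gt;The TTS-first chain can be sketched as two stages timed independently; &lt;code&gt;synth_tts&lt;/code&gt; and &lt;code&gt;convert_rvc&lt;/code&gt; are placeholders for the real engines, not actual APIs:&lt;/p&gt;

```python
# Sketch of the TTS-then-RVC chain. Each hop is its own failure domain, so
# each hop gets its own timer. The stage bodies are stand-ins.
import time

def synth_tts(text):
    # Placeholder: a real engine returns PCM samples at a known sample rate.
    return [0.0] * (len(text) * 160)

def convert_rvc(waveform, target_voice):
    # Placeholder: real RVC maps timbre toward target_voice; here it passes through.
    return list(waveform)

def speak(text, target_voice):
    t0 = time.perf_counter()
    wave = synth_tts(text)
    t1 = time.perf_counter()
    out = convert_rvc(wave, target_voice)
    t2 = time.perf_counter()
    # Separate timings make it obvious which stage blew the turn budget.
    timings = {"t_tts_s": t1 - t0, "t_rvc_s": t2 - t1}
    return out, timings
```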
&lt;hr&gt;
&lt;h2 id=&#34;why-add-rvc-at-all&#34;&gt;Why add RVC at all&lt;/h2&gt;
&lt;p&gt;When branding demands a &lt;strong&gt;recognizable&lt;/strong&gt; voice and the base TTS cannot reproduce it natively, RVC offers a &lt;strong&gt;reskin&lt;/strong&gt; without retraining the entire synthesis stack. You can also swap TTS backends while retaining a stable timbre layer—if ops accepts the extra GPU work.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;costs-and-risks&#34;&gt;Costs and risks&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Cascaded errors&lt;/strong&gt; are the engineering tax: two models mean two places to tune, monitor, and version.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Extra latency&lt;/strong&gt; per second of audio matters for interactive assistants; batch offline jobs tolerate it more easily.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data ethics&lt;/strong&gt; are non-negotiable. Train only on speakers who &lt;strong&gt;consented&lt;/strong&gt;; treat unauthorized cloning as an abuse incident, not a support ticket. &lt;strong&gt;Deepfakes&lt;/strong&gt; carry legal exposure; product, legal, and security should sign the same threat model.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;operations&#34;&gt;Operations&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;RVC model revision&lt;/strong&gt;, &lt;strong&gt;voice profile ID&lt;/strong&gt;, and linkage back to &lt;strong&gt;TTS engine version&lt;/strong&gt; should be logged—the same audit trail mindset as logging embedder IDs in RAG. Without that, debugging “she sounded different last Tuesday” is guesswork.&lt;/p&gt;
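&lt;p&gt;A sketch of that audit record (field names are illustrative, not a real schema): one JSON line per emitted utterance, linking audio back to the exact model pair that produced it.&lt;/p&gt;

```python
# Illustrative audit record; the field names are assumptions, not a real schema.
import json
import uuid
import datetime

def synthesis_audit_record(request_id, tts_engine_version, rvc_revision, voice_profile_id):
    # One record per emitted utterance, shipped to the same log pipeline as LLM requests.
    return {
        "request_id": request_id,
        "tts_engine_version": tts_engine_version,
        "rvc_revision": rvc_revision,
        "voice_profile_id": voice_profile_id,
        "logged_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

record = synthesis_audit_record(str(uuid.uuid4()), "tts-2026.03", "rvc-r42", "brand_a")
line = json.dumps(record, sort_keys=True)
```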
&lt;hr&gt;
&lt;h2 id=&#34;step-by-step-train-and-convert-engineering-outline&#34;&gt;Step-by-step: train-and-convert (engineering outline)&lt;/h2&gt;
&lt;p&gt;Training recipes change with upstream repos; the &lt;strong&gt;invariants&lt;/strong&gt; to enforce are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Consent artifact&lt;/strong&gt; — Written permission + scope (commercial, regions, duration) stored beside the dataset hash.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clean reference clips&lt;/strong&gt; — Denoise, strip silence, single speaker; 10–30 minutes often beats hours of noisy data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Feature extract&lt;/strong&gt; — Use the RVC toolchain’s pinned &lt;strong&gt;content encoder&lt;/strong&gt; (often HuBERT-based); record its revision.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Train&lt;/strong&gt; — Fixed seed, fixed batch, validation loss curve saved; no “best guess” checkpoint without eval listening.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Export&lt;/strong&gt; — Versioned &lt;code&gt;.pth&lt;/code&gt; / ONNX / whatever the inference server loads; &lt;strong&gt;never&lt;/strong&gt; overwrite in place.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inference test&lt;/strong&gt; — &lt;code&gt;TTS(reference_speaker) → RVC(target)&lt;/code&gt; on held-out text; compare to golden wavs.&lt;/li&gt;
&lt;/ol&gt;
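&lt;p&gt;Invariant 5 (never overwrite in place) can be enforced mechanically with content-addressed export paths; a minimal sketch, with &lt;code&gt;export_checkpoint&lt;/code&gt; as a hypothetical helper:&lt;/p&gt;

```python
# Content-addressed export: the filename is derived from the bytes, so an
# export can never silently replace a different checkpoint.
import hashlib
import tempfile
from pathlib import Path

def export_checkpoint(blob, out_dir):
    digest = hashlib.sha256(blob).hexdigest()[:12]
    path = Path(out_dir) / f"rvc_{digest}.pth"
    if path.exists():
        # Same bytes produce the same name; an existing file is simply reused.
        return path
    path.write_bytes(blob)
    return path

out = Path(tempfile.mkdtemp())
p1 = export_checkpoint(b"fake-weights", out)
p2 = export_checkpoint(b"fake-weights", out)   # idempotent re-export
```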
&lt;p&gt;&lt;strong&gt;Latency math:&lt;/strong&gt; &lt;code&gt;T_tts + T_rvc + buffering&lt;/code&gt; must fit your assistant turn budget; profile with &lt;strong&gt;batch size = 1&lt;/strong&gt; first.&lt;/p&gt;
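&lt;p&gt;The budget arithmetic is trivial to encode; the point is to compute headroom explicitly rather than eyeball it (the numbers below are made up):&lt;/p&gt;

```python
# Turn-budget check for the cascaded pipeline. Profile t_tts_s and t_rvc_s
# at batch size 1 before trusting any batched numbers.
def turn_latency(t_tts_s, t_rvc_s, buffering_s, budget_s):
    """Return total cascade latency and remaining headroom against the budget."""
    total = t_tts_s + t_rvc_s + buffering_s
    headroom = budget_s - total   # negative headroom means the turn budget is blown
    return total, headroom
```

A negative headroom at batch size 1 means the cascade cannot fit an interactive turn even before concurrent load enters the picture.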
&lt;hr&gt;
&lt;h2 id=&#34;when-not-to-use-rvc&#34;&gt;When &lt;strong&gt;not&lt;/strong&gt; to use RVC&lt;/h2&gt;
&lt;p&gt;Skip the second stage when &lt;strong&gt;native multispeaker TTS&lt;/strong&gt; already matches brand needs, or when &lt;strong&gt;extra GPU + governance&lt;/strong&gt; cannot be justified. RVC shines when &lt;strong&gt;timbre&lt;/strong&gt; is non-negotiable and &lt;strong&gt;base TTS&lt;/strong&gt; is fungible.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;RVC is a &lt;strong&gt;power tool&lt;/strong&gt; with &lt;strong&gt;policy weight&lt;/strong&gt;: reach for it when timbre is non-negotiable, and keep consent artifacts and model versions in the same audit trail as the rest of the stack.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Whisper in the Loop: ASR, Not TTS</title>
      <link>https://blog.veydh.com/2026/2026-04-whisper-in-the-speech-stack-asr/</link>
      <pubDate>Sat, 04 Apr 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-04-whisper-in-the-speech-stack-asr/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Whisper&lt;/strong&gt; is &lt;strong&gt;automatic speech recognition&lt;/strong&gt;: audio in, text out. Listing it next to Kokoro and Qwen3-TTS still makes sense because voice products need &lt;strong&gt;both&lt;/strong&gt; directions. The failure mode to watch for is teams treating Whisper like a speech &lt;em&gt;generator&lt;/em&gt; or expecting TTS to “listen.” Naming the loop clearly saves weeks of misfired tickets.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;strengths-to-lean-on&#34;&gt;Strengths to lean on&lt;/h2&gt;
&lt;p&gt;Whisper-class models are a strong &lt;strong&gt;open baseline&lt;/strong&gt; for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Captions&lt;/strong&gt; and meeting notes when privacy policy allows cloud or self-hosted inference.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Voice commands&lt;/strong&gt; where vocabulary is broad and accents vary.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RAG ingestion&lt;/strong&gt; over spoken content: transcribe audio, chunk text, embed—&lt;a href=&#34;https://blog.veydh.com/2026/2026-01-rag-foundations-embeddings-and-retrieval/&#34;&gt;RAG Foundations: Embeddings, Chunking, and the Retrieval Loop&lt;/a&gt;’s loop with a microphone upstream.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The robustness across noise and accents is the headline feature; it is why Whisper became the default mental model for ASR in many stacks.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;costs-and-real-time-constraints&#34;&gt;Costs and real-time constraints&lt;/h2&gt;
&lt;p&gt;Larger tiers improve accuracy and robustness; they also raise &lt;strong&gt;latency&lt;/strong&gt; and &lt;strong&gt;compute&lt;/strong&gt;. Real-time use often means &lt;strong&gt;chunked&lt;/strong&gt; inference, voice activity detection (&lt;strong&gt;VAD&lt;/strong&gt;) to skip silence, or smaller distilled ASR models when Whisper is overkill.&lt;/p&gt;
&lt;p&gt;Hardware is sized the same way as LLM serving: expected concurrency, SLO for end-to-end voice turns, and batching strategy.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;failure-modes-hallucinations-and-music&#34;&gt;Failure modes: hallucinations and music&lt;/h2&gt;
&lt;p&gt;ASR models can emit plausible text in &lt;strong&gt;silent&lt;/strong&gt; segments or under &lt;strong&gt;music&lt;/strong&gt;—classic hallucination territory. Mitigations include VAD, confidence heuristics, and post-filters that reject empty-audio transcripts. None are perfect; residual risk should be documented for product and legal review.&lt;/p&gt;
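&lt;p&gt;A post-filter sketch over decoded segments; the &lt;code&gt;no_speech_prob&lt;/code&gt; and &lt;code&gt;avg_logprob&lt;/code&gt; fields mirror what common Whisper tooling exposes, but treat the names and thresholds as assumptions to validate against your own stack:&lt;/p&gt;

```python
# Confidence-based post-filter for ASR segments. Thresholds are starting
# points, not tuned values; calibrate them on held-out audio.
from operator import gt

def keep_segment(seg, max_no_speech=0.6, min_avg_logprob=-1.0):
    if not seg["text"].strip():
        return False   # empty transcript: nothing to keep
    if gt(seg["no_speech_prob"], max_no_speech):
        return False   # the model itself flags the span as probable non-speech
    if gt(min_avg_logprob, seg["avg_logprob"]):
        return False   # decode confidence below the floor: likely hallucinated
    return True
```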
&lt;hr&gt;
&lt;h2 id=&#34;architecture-placement&#34;&gt;Architecture placement&lt;/h2&gt;
&lt;p&gt;For assistants, Whisper belongs &lt;strong&gt;before&lt;/strong&gt; the LLM:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;microphone → ASR → text → (RAG + LLM) → text → TTS → speaker&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;That ordering keeps responsibilities clean. &lt;a href=&#34;https://blog.veydh.com/2026/2026-04-qwen3-tts-and-multilingual-neural-tts/&#34;&gt;Qwen3-TTS and Multilingual Neural TTS Trade-offs&lt;/a&gt; focuses on &lt;strong&gt;neural TTS&lt;/strong&gt; output quality; &lt;a href=&#34;https://blog.veydh.com/2026/2026-04-rvc-voice-conversion-architecture/&#34;&gt;RVC in the Stack: Voice Conversion After TTS&lt;/a&gt; adds &lt;strong&gt;RVC&lt;/strong&gt; when the output timbre must match a specific speaker profile.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;step-by-step-batch-transcription-you-can-regression-test&#34;&gt;Step-by-step: batch transcription you can regression-test&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Pin model&lt;/strong&gt; — &lt;code&gt;large-v3&lt;/code&gt; vs &lt;code&gt;medium&lt;/code&gt; is a product choice; record &lt;strong&gt;checkpoint file hash&lt;/strong&gt; or &lt;code&gt;faster-whisper&lt;/code&gt; wheel version.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Normalize audio&lt;/strong&gt; — Resample to what the model expects (often 16 kHz mono); store &lt;strong&gt;ffmpeg&lt;/strong&gt; command lines in the ADR.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk long files&lt;/strong&gt; — VAD-based segments (silero, WebRTC VAD) reduce hallucinated text in music-only regions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Decode&lt;/strong&gt; — &lt;code&gt;faster-whisper&lt;/code&gt; or &lt;code&gt;whisper.cpp&lt;/code&gt; with fixed &lt;code&gt;beam_size&lt;/code&gt; / &lt;code&gt;temperature&lt;/code&gt; for reproducibility; log &lt;strong&gt;language&lt;/strong&gt; explicitly or force &lt;code&gt;language=&amp;quot;en&amp;quot;&lt;/code&gt; when appropriate.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Post-filter&lt;/strong&gt; — Drop segments with &lt;strong&gt;no speech&lt;/strong&gt; per VAD; optionally cap &lt;strong&gt;max segment duration&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;CLI-shaped reproducible run&lt;/strong&gt; (&lt;code&gt;faster-whisper&lt;/code&gt; example):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;python - &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;lt;&amp;lt;&amp;#39;PY&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;from faster_whisper import WhisperModel
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;model = WhisperModel(&amp;#34;large-v3&amp;#34;, device=&amp;#34;cuda&amp;#34;, compute_type=&amp;#34;float16&amp;#34;)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;segments, info = model.transcribe(&amp;#34;sample.wav&amp;#34;, beam_size=5, vad_filter=True)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;print(&amp;#34;lang&amp;#34;, info.language, &amp;#34;prob&amp;#34;, info.language_probability)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;for s in segments:
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;    print(f&amp;#34;[{s.start:.2f}-{s.end:.2f}] {s.text}&amp;#34;)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;PY&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; low out-of-the-box WER across diverse accents. &lt;strong&gt;Cons:&lt;/strong&gt; GPU memory and latency for &lt;code&gt;large-*&lt;/code&gt; tiers, and hallucinations under silence or music unless VAD and post-filter policies are in place.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;feeding-rag-from-audio-retrieval-loop--microphone-upstream&#34;&gt;Feeding RAG from audio (retrieval loop + microphone upstream)&lt;/h2&gt;
&lt;p&gt;After ASR, the pipeline is plain text: &lt;strong&gt;transcript → chunk → embed → index&lt;/strong&gt; (same shape as &lt;a href=&#34;https://blog.veydh.com/2026/2026-01-rag-foundations-embeddings-and-retrieval/&#34;&gt;RAG Foundations: Embeddings, Chunking, and the Retrieval Loop&lt;/a&gt;). Record &lt;strong&gt;ASR model id&lt;/strong&gt; as metadata on derived chunks so you can re-index when ASR improves. Do &lt;strong&gt;not&lt;/strong&gt; embed raw audio in a text embedder without a dedicated multimodal model—that is a different architecture.&lt;/p&gt;
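&lt;p&gt;A sketch of that transcript-to-chunk step, carrying the ASR model id as metadata on every derived chunk; the embed-and-index call is deliberately left out, since from here on it is the same path as any other text:&lt;/p&gt;

```python
# Group ASR segments into character-bounded chunks, stamping each chunk with
# the ASR model id so a later ASR upgrade can trigger targeted re-indexing.
from operator import ge

def chunk_transcript(segments, asr_model_id, max_chars=800):
    chunks, buf, size = [], [], 0
    for seg in segments:
        buf.append(seg["text"])
        size += len(seg["text"])
        if ge(size, max_chars):
            chunks.append({"text": " ".join(buf), "asr_model_id": asr_model_id})
            buf, size = [], 0
    if buf:
        chunks.append({"text": " ".join(buf), "asr_model_id": asr_model_id})
    return chunks
```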
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Whisper is the &lt;strong&gt;ears&lt;/strong&gt; of the stack, not the &lt;strong&gt;mouth&lt;/strong&gt;. Treating ASR and TTS symmetrically at the API layer—timeouts, quotas, logging—makes the full voice pipeline operable.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Kokoro TTS: Lightweight Neural Speech and Where It Fits</title>
      <link>https://blog.veydh.com/2026/2026-03-kokoro-tts-lightweight-synthesis/</link>
      <pubDate>Wed, 04 Mar 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-03-kokoro-tts-lightweight-synthesis/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;Text-to-speech sits at the far end of the assistant pipeline: the LLM (and maybe RAG) decides &lt;em&gt;what&lt;/em&gt; to say; &lt;strong&gt;TTS&lt;/strong&gt; turns that text into audio users actually hear. &lt;strong&gt;Kokoro&lt;/strong&gt;-class engines target &lt;strong&gt;efficiency&lt;/strong&gt;—fast cold start, modest RAM, and throughput that fits small VMs or edge devices. Exact checkpoints and revisions evolve; the &lt;strong&gt;design trade-off&lt;/strong&gt; to anchor is footprint versus expressiveness.&lt;/p&gt;
&lt;p&gt;This post is the lightweight leg of the 2026 speech arc; &lt;a href=&#34;https://blog.veydh.com/2026/2026-04-qwen3-tts-and-multilingual-neural-tts/&#34;&gt;Qwen3-TTS and Multilingual Neural TTS Trade-offs&lt;/a&gt; contrasts it with heavier multilingual stacks; &lt;a href=&#34;https://blog.veydh.com/2026/2026-04-rvc-voice-conversion-architecture/&#34;&gt;RVC in the Stack: Voice Conversion After TTS&lt;/a&gt; adds &lt;strong&gt;RVC&lt;/strong&gt; when timbre cloning enters the picture.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;where-lightweight-tts-wins&#34;&gt;Where lightweight TTS wins&lt;/h2&gt;
&lt;p&gt;Smaller neural TTS fits when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cost per minute&lt;/strong&gt; dominates—IVR prompts, notification speech, or high-volume read-aloud.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Provisioning&lt;/strong&gt; is tight—no room for multi-gigabyte multispeaker servers on every node.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency&lt;/strong&gt; budgets are aggressive and the script is short.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In those regimes, “good enough” prosody beats “stunning but slow.” Users forgive a slightly flat delivery on a parking confirmation more than they forgive a three-second wait.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;trade-offs-to-expect-up-front&#34;&gt;Trade-offs to expect up front&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Prosody and emotional range&lt;/strong&gt; rarely match the largest commercial or open giants. You get clarity and speed; you may not get nuanced acting.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Voice roster&lt;/strong&gt; may be limited to bundled speakers. If marketing demands a &lt;strong&gt;specific&lt;/strong&gt; timbre, you either fine-tune within the engine’s constraints or add a &lt;strong&gt;second stage&lt;/strong&gt;—see &lt;a href=&#34;https://blog.veydh.com/2026/2026-04-rvc-voice-conversion-architecture/&#34;&gt;RVC in the Stack: Voice Conversion After TTS&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Language coverage&lt;/strong&gt; varies by checkpoint. Every claimed locale should be validated; multilingual tokenization issues are not unique to LLMs.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;integration-beside-the-llm-layer&#34;&gt;Integration beside the LLM layer&lt;/h2&gt;
&lt;p&gt;TTS belongs &lt;strong&gt;behind the same API discipline&lt;/strong&gt; as &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-llm-serving-apis-openai-compatible/&#34;&gt;Serving Distilled Models Behind an HTTP API&lt;/a&gt;’s chat surface: timeouts, quotas, and structured errors when synthesis fails mid-utterance. Streaming audio raises questions of chunking and buffering; clients should handle partial playback and cancellation when the user interrupts.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;relation-to-the-rest-of-the-stack&#34;&gt;Relation to the rest of the stack&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://blog.veydh.com/2026/2026-04-whisper-in-the-speech-stack-asr/&#34;&gt;Whisper in the Loop: ASR, Not TTS&lt;/a&gt; covers &lt;strong&gt;Whisper&lt;/strong&gt; on the &lt;strong&gt;listening&lt;/strong&gt; side (ASR). The mental model is directional: &lt;strong&gt;Whisper → text → LLM/RAG → text → TTS → audio&lt;/strong&gt;. Mixing ASR and TTS responsibilities is a common beginner confusion; naming the boundary clearly keeps architecture reviews honest.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;reproducible-evaluation-loop-lightweight-tts&#34;&gt;Reproducible evaluation loop (lightweight TTS)&lt;/h2&gt;
&lt;p&gt;Treat TTS like any other model artifact: &lt;strong&gt;pin revision&lt;/strong&gt;, &lt;strong&gt;pin voice&lt;/strong&gt;, &lt;strong&gt;record sample rate&lt;/strong&gt;, &lt;strong&gt;store golden waveforms&lt;/strong&gt; for regression.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Freeze the stack&lt;/strong&gt; — Note Python version, package versions, and whether you run ONNX, CPU, or GPU build. One-line smoke: synthesize the same 3 sentences on every deploy.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Measure what product cares about&lt;/strong&gt; — &lt;strong&gt;RTF&lt;/strong&gt; (real-time factor) = wall_time / audio_duration, where values below 1 mean faster than real time; &lt;strong&gt;cold start&lt;/strong&gt; ms until first byte; &lt;strong&gt;RSS&lt;/strong&gt; at steady state for N concurrent streams.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ABX or MOS-lite&lt;/strong&gt; — For internal QA, maintain a spreadsheet of &lt;code&gt;{text, reference_wav}&lt;/code&gt; pairs; reviewers rate naturalness 1–5. Cheap, catches prosody regressions automated metrics miss.&lt;/li&gt;
&lt;/ol&gt;
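&lt;p&gt;Step 2 can be measured in a few lines; &lt;code&gt;synthesize&lt;/code&gt; here is a stand-in for whatever engine call you standardize on:&lt;/p&gt;

```python
# RTF measurement harness. RTF = wall_time / audio_duration; values under 1.0
# mean the engine runs faster than real time.
import time

def measure_rtf(synthesize, text, sample_rate_hz):
    t0 = time.perf_counter()
    samples = synthesize(text)          # stand-in for the real engine call
    wall = time.perf_counter() - t0
    audio_s = len(samples) / sample_rate_hz
    return wall / audio_s
```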
&lt;p&gt;&lt;strong&gt;Illustrative Python call pattern&lt;/strong&gt; (replace with the actual Kokoro / edge engine API you standardize on—the shape stays the same):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Pseudocode shape: engine.load(voice_id, revision=REV)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# audio = engine.synthesize(text, sample_rate=24000)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Path(&amp;#34;golden/hello.wav&amp;#34;).write_bytes(audio.to_wav())&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Log &lt;code&gt;voice_id&lt;/code&gt;, &lt;code&gt;engine_revision&lt;/code&gt;, and &lt;code&gt;sample_rate&lt;/code&gt; beside the &lt;code&gt;request_id&lt;/code&gt; discipline from &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-llm-serving-apis-openai-compatible/&#34;&gt;Serving Distilled Models Behind an HTTP API&lt;/a&gt; so ops can answer “what spoke this?”&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;tool-comparison-same-text-in-audio-out&#34;&gt;Tool comparison (same text in, audio out)&lt;/h2&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Option&lt;/th&gt;
          &lt;th&gt;Footprint&lt;/th&gt;
          &lt;th&gt;Quality headroom&lt;/th&gt;
          &lt;th&gt;Ops note&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Lightweight neural TTS (Kokoro-class)&lt;/td&gt;
          &lt;td&gt;Low RAM/CPU&lt;/td&gt;
          &lt;td&gt;Moderate prosody&lt;/td&gt;
          &lt;td&gt;Best for alerts/IVR-scale volume&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Cloud vendor TTS&lt;/td&gt;
          &lt;td&gt;Zero local GPU&lt;/td&gt;
          &lt;td&gt;High&lt;/td&gt;
          &lt;td&gt;Contract + egress + voice licensing&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Heavy open multilingual (&lt;a href=&#34;https://blog.veydh.com/2026/2026-04-qwen3-tts-and-multilingual-neural-tts/&#34;&gt;Qwen3-TTS and Multilingual Neural TTS Trade-offs&lt;/a&gt;)&lt;/td&gt;
          &lt;td&gt;High GPU&lt;/td&gt;
          &lt;td&gt;High&lt;/td&gt;
          &lt;td&gt;Fleet planning + checkpoint compliance&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Kokoro-tier TTS is a &lt;strong&gt;practical default&lt;/strong&gt; when efficiency matters more than maximal expressiveness. &lt;a href=&#34;https://blog.veydh.com/2026/2026-04-qwen3-tts-and-multilingual-neural-tts/&#34;&gt;Qwen3-TTS and Multilingual Neural TTS Trade-offs&lt;/a&gt; compares &lt;strong&gt;Qwen3-TTS&lt;/strong&gt;-class heavy stacks when quality and control win the negotiation.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Token Budgets: Enforcing Limits at the API Layer</title>
      <link>https://blog.veydh.com/2026/2026-03-token-limits-budgets-and-truncation/</link>
      <pubDate>Wed, 04 Mar 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-03-token-limits-budgets-and-truncation/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;Uncapped generation is how you burn &lt;strong&gt;VRAM&lt;/strong&gt;, &lt;strong&gt;money&lt;/strong&gt;, and &lt;strong&gt;user patience&lt;/strong&gt; in a single request. Limits belong in &lt;strong&gt;middleware&lt;/strong&gt;—before the GPU does heavy work—not as a vague suggestion in the front-end. Token budgets are &lt;strong&gt;policy&lt;/strong&gt;: product, security, and finance get a say; engineering implements them deterministically.&lt;/p&gt;
&lt;p&gt;This pairs directly with &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-llm-serving-apis-openai-compatible/&#34;&gt;Serving Distilled Models Behind an HTTP API&lt;/a&gt;’s serving layer and with &lt;a href=&#34;https://blog.veydh.com/2026/2026-01-rag-foundations-embeddings-and-retrieval/&#34;&gt;RAG Foundations: Embeddings, Chunking, and the Retrieval Loop&lt;/a&gt;’s RAG context stuffing: the same window is contested by system prompts, retrieved chunks, tool outputs, and the user’s message history.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;hard-caps-max_tokens-and-friends&#34;&gt;Hard caps: &lt;code&gt;max_tokens&lt;/code&gt; and friends&lt;/h2&gt;
&lt;p&gt;Server-side &lt;strong&gt;&lt;code&gt;max_new_tokens&lt;/code&gt;&lt;/strong&gt; (or API &lt;code&gt;max_tokens&lt;/code&gt;, depending on the stack) is the simplest brake: stop after N new tokens regardless of whether the model “feels done.” Defaults should be sane per tier, with higher quotas for authenticated power users where appropriate.&lt;/p&gt;
&lt;p&gt;The important detail is &lt;strong&gt;consistent semantics&lt;/strong&gt; across streaming and non-streaming paths. Partial streams must surface a &lt;strong&gt;stop reason&lt;/strong&gt; so clients know the answer was truncated by length, not completed naturally.&lt;/p&gt;
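&lt;p&gt;A minimal sketch of that clamp, assuming hypothetical tier names and caps:&lt;/p&gt;

```python
# Minimal sketch (tier names and caps hypothetical): clamp a requested
# max_tokens against per-tier policy and report why generation stopped.
TIER_CAPS = {"free": 256, "pro": 1024}

def effective_max_tokens(requested, tier):
    cap = TIER_CAPS.get(tier, 256)
    return cap if requested is None else min(requested, cap)

def finish_reason(tokens_emitted, cap, saw_eos):
    # "stop" means the model ended naturally; "length" means we truncated.
    if saw_eos:
        return "stop"
    return "length" if tokens_emitted >= cap else "stop"
```

&lt;p&gt;The same &lt;code&gt;finish_reason&lt;/code&gt; value must be attached to both streaming and non-streaming responses.&lt;/p&gt;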
&lt;hr&gt;
&lt;h2 id=&#34;pre-flight-token-counting&#34;&gt;Pre-flight token counting&lt;/h2&gt;
&lt;p&gt;Rejecting oversize prompts &lt;strong&gt;before&lt;/strong&gt; allocation saves money. Token counts should use the same tokenizer family the model uses—approximations from rough character heuristics drift and cause surprise denials.&lt;/p&gt;
&lt;p&gt;When a prompt is too large, responses should be &lt;strong&gt;actionable&lt;/strong&gt;: which limit tripped, whether retry is possible after truncation, and never a generic 500 when the fix is user-side editing.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;truncation-strategies-for-long-threads&#34;&gt;Truncation strategies for long threads&lt;/h2&gt;
&lt;p&gt;For chat history, naive “keep the last K messages” drops important early instructions. Alternatives to mix and match:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Drop middle&lt;/strong&gt;: keep system/developer instructions and recent turns; summarize or elide the middle.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Summarize&lt;/strong&gt;: periodically roll older turns into a running summary (watch for summary drift and stale facts).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retrieve instead of paste&lt;/strong&gt;: RAG pulls fresh chunks rather than embedding entire old threads in the prompt.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each strategy has failure modes; the right choice depends on whether the product is &lt;strong&gt;support&lt;/strong&gt; (needs ticket context) or &lt;strong&gt;creative&lt;/strong&gt; (needs recent style).&lt;/p&gt;
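&lt;p&gt;The drop-middle strategy can be sketched in a few lines (message shape and token counter are assumptions; a real version would insert a summary or elision marker where turns were dropped):&lt;/p&gt;

```python
# Hedged sketch of "drop middle": keep system instructions and the most recent
# turns, eliding older non-system turns until the estimated budget fits.
def drop_middle(messages, count_tokens, budget):
    system = [m for m in messages if m["role"] == "system"]
    kept = [m for m in messages if m["role"] != "system"]
    while kept and count_tokens(system + kept) > budget:
        kept.pop(0)  # drop the oldest non-system turn first
    return system + kept
```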
&lt;hr&gt;
&lt;h2 id=&#34;streaming-and-client-contracts&#34;&gt;Streaming and client contracts&lt;/h2&gt;
&lt;p&gt;With &lt;strong&gt;SSE&lt;/strong&gt; or chunked responses, clients must handle incremental tokens, &lt;strong&gt;finish reasons&lt;/strong&gt;, and errors mid-stream. Whether partial JSON tool calls can arrive and how to recover should be documented—same care as any long-lived HTTP stream.&lt;/p&gt;
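&lt;p&gt;A hedged sketch of the client side, assuming OpenAI-style &lt;code&gt;data:&lt;/code&gt; lines with a &lt;code&gt;[DONE]&lt;/code&gt; sentinel:&lt;/p&gt;

```python
import json

# Fold an SSE stream into text while tracking the finish reason, so the client
# can tell a length-truncated answer from a naturally completed one.
def fold_sse(lines):
    text, finish = [], None
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        choice = json.loads(payload)["choices"][0]
        text.append(choice.get("delta", {}).get("content") or "")
        finish = choice.get("finish_reason") or finish
    return "".join(text), finish
```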
&lt;hr&gt;
&lt;h2 id=&#34;observability-and-fairness&#34;&gt;Observability and fairness&lt;/h2&gt;
&lt;p&gt;Histograms for &lt;strong&gt;prompt length&lt;/strong&gt;, &lt;strong&gt;output length&lt;/strong&gt;, and &lt;strong&gt;time-to-first-token&lt;/strong&gt; expose abusive patterns and misconfigured clients before finance does. Per-tenant &lt;strong&gt;quotas&lt;/strong&gt; pair with these metrics so one noisy integration does not starve the rest.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;implementation-sketch-pre-flight-with-tiktoken-or-the-hub-tokenizer&#34;&gt;Implementation sketch: pre-flight with &lt;code&gt;tiktoken&lt;/code&gt; or the Hub tokenizer&lt;/h2&gt;
&lt;p&gt;Use the &lt;strong&gt;same tokenizer class&lt;/strong&gt; the model uses—not a rough &lt;code&gt;len(text)//4&lt;/code&gt; heuristic.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; transformers &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; AutoTokenizer
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;tok &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; AutoTokenizer&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;from_pretrained(MODEL_ID, revision&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;MODEL_REV)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;count_messages&lt;/span&gt;(messages) &lt;span style=&#34;color:#f92672&#34;&gt;-&amp;gt;&lt;/span&gt; int:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#75715e&#34;&gt;# Prefer the model&amp;#39;s chat template when available&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;return&lt;/span&gt; len(tok&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;apply_chat_template(messages, tokenize&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;True&lt;/span&gt;, add_generation_prompt&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;True&lt;/span&gt;))
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;assert_within_budget&lt;/span&gt;(messages, max_prompt_tokens: int) &lt;span style=&#34;color:#f92672&#34;&gt;-&amp;gt;&lt;/span&gt; &lt;span style=&#34;color:#66d9ef&#34;&gt;None&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    n &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; count_messages(messages)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; n &lt;span style=&#34;color:#f92672&#34;&gt;&amp;gt;&lt;/span&gt; max_prompt_tokens:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#66d9ef&#34;&gt;raise&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;ValueError&lt;/span&gt;(&lt;span style=&#34;color:#e6db74&#34;&gt;f&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;prompt_tokens=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;{&lt;/span&gt;n&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt; exceeds max_prompt_tokens=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;{&lt;/span&gt;max_prompt_tokens&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In &lt;strong&gt;middleware&lt;/strong&gt;, return &lt;strong&gt;HTTP 413&lt;/strong&gt; or &lt;strong&gt;400&lt;/strong&gt; with a JSON body &lt;code&gt;{ &amp;quot;error&amp;quot;: &amp;quot;context_length&amp;quot;, &amp;quot;prompt_tokens&amp;quot;: n, &amp;quot;limit&amp;quot;: max_prompt_tokens }&lt;/code&gt; so clients know to truncate or summarize—mirroring how public APIs surface limits.&lt;/p&gt;
&lt;p&gt;For &lt;strong&gt;RAG&lt;/strong&gt;, count tokens &lt;strong&gt;after&lt;/strong&gt; you assemble system + retrieved passages + user message; if over budget, drop lowest-scoring chunks first (&lt;a href=&#34;https://blog.veydh.com/2026/2026-01-rag-operations-re-ranking-and-evaluation/&#34;&gt;RAG in Production: Re-ranking, HyDE, and Simple Evals&lt;/a&gt;’s reranker makes that ordering meaningful).&lt;/p&gt;
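&lt;p&gt;A minimal sketch of that trimming, assuming chunks carry a retrieval &lt;code&gt;score&lt;/code&gt; (field names hypothetical):&lt;/p&gt;

```python
# Hedged sketch: trim retrieved chunks to a token budget, dropping the
# lowest-scoring ones first while preserving the prompt order of survivors.
def fit_chunks(chunks, count_tokens, budget):
    # chunks: list of {"id", "text", "score"} already ordered for the prompt
    keep = list(chunks)
    while keep and sum(count_tokens(c["text"]) for c in keep) > budget:
        worst = min(keep, key=lambda c: c["score"])
        keep.remove(worst)
    return keep
```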
&lt;hr&gt;
&lt;h2 id=&#34;reproducible-truncation-regression-test&#34;&gt;Reproducible truncation regression test&lt;/h2&gt;
&lt;p&gt;Keep a &lt;strong&gt;fixture&lt;/strong&gt; of long threads in &lt;code&gt;tests/fixtures/long_chat.json&lt;/code&gt;. After any change to truncation or summarization:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Token-count before/after.&lt;/li&gt;
&lt;li&gt;Golden hash of the &lt;strong&gt;exact&lt;/strong&gt; byte sequence sent to the model (or the list of chunk ids injected for RAG).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;CI fails when the fixture’s effective context changes without a version bump—same instinct as index versioning in &lt;a href=&#34;https://blog.veydh.com/2026/2026-01-rag-foundations-embeddings-and-retrieval/&#34;&gt;RAG Foundations: Embeddings, Chunking, and the Retrieval Loop&lt;/a&gt;.&lt;/p&gt;
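&lt;p&gt;The golden-hash check is a few lines of stdlib (the golden-file shape is an assumption):&lt;/p&gt;

```python
import hashlib

# Hash the exact bytes destined for the model so CI can detect silent
# effective-context changes after truncation/summarization edits.
def context_fingerprint(prompt_bytes: bytes) -> str:
    return hashlib.sha256(prompt_bytes).hexdigest()

def check_fixture(rendered: bytes, golden: dict) -> bool:
    # golden: e.g. {"sha256": "...", "version": 3} stored next to the fixture
    return context_fingerprint(rendered) == golden["sha256"]
```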
</description>
    </item>
    
    <item>
      <title>Fine-Tuning a Mistral-Class Checkpoint with LoRA on Hugging Face</title>
      <link>https://blog.veydh.com/2026/2026-02-mistral-huggingface-lora-training-notes/</link>
      <pubDate>Fri, 13 Feb 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-02-mistral-huggingface-lora-training-notes/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Mistral&lt;/strong&gt;-family checkpoints are common bases for assistants: strong quality per dollar, good tooling, and a large community of derivative models. Pulling weights from the Hub is only the first step. Training &lt;strong&gt;without&lt;/strong&gt; the model’s official &lt;strong&gt;chat template&lt;/strong&gt;—or mixing tokenizer specials inconsistently—is how you get clean loss curves and broken inference.&lt;/p&gt;
&lt;p&gt;The snippets below assume &lt;strong&gt;PyTorch + Transformers + PEFT&lt;/strong&gt;; translate the same invariants to your runner of choice.&lt;/p&gt;
&lt;p&gt;This post is a checklist to run before any serious LoRA SFT job on Mistral-class bases. It mirrors the dependency-pinning mindset from earlier DevOps writing: reproducibility beats heroics.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;pin-the-checkpoint-not-just-the-name&#34;&gt;Pin the checkpoint, not just the name&lt;/h2&gt;
&lt;p&gt;Hub models move. &lt;strong&gt;&lt;code&gt;revision&lt;/code&gt;&lt;/strong&gt; should be pinned to a commit hash for training &lt;em&gt;and&lt;/em&gt; for inference. The same applies to tokenizer files bundled with that revision. “Latest” is fine for a weekend experiment; it is a liability when you need to diff a regression three weeks later.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;chat-templates-and-special-tokens&#34;&gt;Chat templates and special tokens&lt;/h2&gt;
&lt;p&gt;Instruction-tuned models expect &lt;strong&gt;BOS/EOS&lt;/strong&gt; and role markers in a specific layout. Training on raw concatenated strings—user and assistant blobs jammed together without structure—teaches the adapter the wrong conditional distribution.&lt;/p&gt;
&lt;p&gt;Always:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Load &lt;code&gt;tokenizer&lt;/code&gt; from the same revision as &lt;code&gt;model&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Apply the model card’s recommended &lt;strong&gt;chat template&lt;/strong&gt; (or &lt;code&gt;tokenizer.apply_chat_template&lt;/code&gt; when available) to build training examples.&lt;/li&gt;
&lt;li&gt;Mask labels so loss applies only to &lt;strong&gt;assistant&lt;/strong&gt; tokens when that is the objective; masking prevents the model from learning to imitate user prefixes.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If any of those steps feel optional, revisit &lt;a href=&#34;https://blog.veydh.com/2026/2026-01-rag-foundations-embeddings-and-retrieval/&#34;&gt;RAG Foundations: Embeddings, Chunking, and the Retrieval Loop&lt;/a&gt;: garbage structure upstream becomes garbage behavior downstream.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;memory-precision-and-sequence-length&#34;&gt;Memory, precision, and sequence length&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Gradient checkpointing&lt;/strong&gt; trades compute for activation memory; on long contexts it is often mandatory. Prefer &lt;strong&gt;bf16&lt;/strong&gt; when the hardware supports it well; &lt;strong&gt;fp16&lt;/strong&gt; can work but needs monitoring for overflow. &lt;strong&gt;OOM&lt;/strong&gt; during the first real batch is a signal to shorten sequences, reduce the micro-batch size, or enable ZeRO-style sharding—not to immediately give up on LoRA.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;saving-and-merging-adapters&#34;&gt;Saving and merging adapters&lt;/h2&gt;
&lt;p&gt;During iteration, save &lt;strong&gt;adapter weights only&lt;/strong&gt;—small, easy to diff and A/B swap. &lt;strong&gt;Merging&lt;/strong&gt; adapters into a full weight file is useful for some deployment paths (single artifact, certain runtimes); it complicates updates because every merge is a new binary. Document which path production uses so nobody confuses “adapter v3” with “merged full model v3”.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;documentation-alongside-the-weights&#34;&gt;Documentation alongside the weights&lt;/h2&gt;
&lt;p&gt;An &lt;strong&gt;ADR-style&lt;/strong&gt; note (even a short Markdown file in the repo) should record Hub IDs, dataset hashes, seeds, learning rate schedule, and eval commands. That is the same habit as recording Docker base images and lockfiles: future you is also a teammate.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;end-to-end-recipe-sft-on-jsonl-chat-rows&#34;&gt;End-to-end recipe (SFT on JSONL chat rows)&lt;/h2&gt;
&lt;p&gt;Assume each training row is &lt;code&gt;{&amp;quot;messages&amp;quot;: [{&amp;quot;role&amp;quot;:&amp;quot;system&amp;quot;,&amp;quot;content&amp;quot;:...}, {&amp;quot;role&amp;quot;:&amp;quot;user&amp;quot;,...}, {&amp;quot;role&amp;quot;:&amp;quot;assistant&amp;quot;,...}]}&lt;/code&gt; — one assistant turn per example for simplicity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1 — Load pinned revision&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; torch
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; transformers &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; AutoModelForCausalLM, AutoTokenizer
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;MODEL_ID &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;mistralai/Mistral-7B-Instruct-v0.3&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;REV &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;abc123deadbeef...&amp;#34;&lt;/span&gt;  &lt;span style=&#34;color:#75715e&#34;&gt;# Hub commit sha&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;tokenizer &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; AutoTokenizer&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;from_pretrained(MODEL_ID, revision&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;REV)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;model &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; AutoModelForCausalLM&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;from_pretrained(
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    MODEL_ID, revision&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;REV, torch_dtype&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;torch&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;bfloat16, device_map&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;auto&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Step 2 — Apply chat template and mask user/system tokens&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Use &lt;code&gt;tokenizer.apply_chat_template&lt;/code&gt; with &lt;code&gt;return_dict=True&lt;/code&gt; and token masks, or build labels manually: set &lt;code&gt;labels[i] = -100&lt;/code&gt; for tokens that belong to system + user prefixes so &lt;strong&gt;only assistant tokens contribute loss&lt;/strong&gt;. Skipping this is how you get a model that parrots the user channel.&lt;/p&gt;
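&lt;p&gt;The manual-labels variant reduces to a few lines once the prefix and assistant token ids are split (a sketch; tokenization itself is elided):&lt;/p&gt;

```python
# Hedged sketch of manual label masking: -100 is the ignore index for
# cross-entropy, so loss applies only to assistant tokens.
def build_labels(prompt_ids, answer_ids):
    # prompt_ids: tokens for the system + user prefix
    # answer_ids: tokens for the assistant turn (the part we want to learn)
    input_ids = prompt_ids + answer_ids
    labels = [-100] * len(prompt_ids) + list(answer_ids)
    assert len(labels) == len(input_ids)
    return input_ids, labels
```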
&lt;p&gt;&lt;strong&gt;Step 3 — Attach LoRA and train&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Match &lt;code&gt;target_modules&lt;/code&gt; to Mistral’s linear names (&lt;code&gt;q_proj&lt;/code&gt;, &lt;code&gt;k_proj&lt;/code&gt;, &lt;code&gt;v_proj&lt;/code&gt;, &lt;code&gt;o_proj&lt;/code&gt; are the usual set). Watch &lt;strong&gt;train loss&lt;/strong&gt; &lt;em&gt;and&lt;/em&gt; a small &lt;strong&gt;held-out generation&lt;/strong&gt; set throughout the run—loss can fall while JSON/tool format falls apart.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 4 — Save adapter + manifest&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;lora-out/
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  adapter_config.json
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  adapter_model.safetensors
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;manifest.json   # { &amp;#34;base&amp;#34;: MODEL_ID, &amp;#34;revision&amp;#34;: REV, &amp;#34;dataset_sha256&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;seed&amp;#34;: 42 }
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Step 5 — Inference smoke test&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Load base + adapter, run the same &lt;code&gt;apply_chat_template&lt;/code&gt; path used in training, and compare against a &lt;strong&gt;golden&lt;/strong&gt; transcript file in CI if possible.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;cli-alternative-trl--axolotl--llama-factory&#34;&gt;CLI alternative: &lt;code&gt;trl&lt;/code&gt; / Axolotl / LLaMA-Factory&lt;/h2&gt;
&lt;p&gt;Higher-level runners (&lt;code&gt;trl&lt;/code&gt; SFT, Axolotl, LLaMA-Factory YAMLs) encode the same steps: dataset → template → masked labels → PEFT. &lt;strong&gt;Pros:&lt;/strong&gt; less boilerplate, proven configs on Discord. &lt;strong&gt;Cons:&lt;/strong&gt; another layer to debug when templates change between library versions—still pin the &lt;strong&gt;library version&lt;/strong&gt; and export the rendered tokenization for one row to disk when incidents happen.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Mistral-class fine-tunes are won or lost on &lt;strong&gt;tokenizer fidelity&lt;/strong&gt; and &lt;strong&gt;pinned revisions&lt;/strong&gt;. &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-qwen-and-llama-distills-lora-on-hf/&#34;&gt;Qwen and LLaMA Distilled Checkpoints: LoRA Gotchas&lt;/a&gt; extends the same discipline to &lt;strong&gt;Qwen&lt;/strong&gt; and &lt;strong&gt;LLaMA&lt;/strong&gt; derivatives where multilingual vocabs and RoPE extensions add their own footguns.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Qwen and LLaMA Distilled Checkpoints: LoRA Gotchas</title>
      <link>https://blog.veydh.com/2026/2026-02-qwen-and-llama-distills-lora-on-hf/</link>
      <pubDate>Thu, 12 Feb 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-02-qwen-and-llama-distills-lora-on-hf/</guid>
      <description>&lt;h2 id=&#34;preamble-plain-english&#34;&gt;Preamble (plain English)&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Hugging Face (HF)&lt;/strong&gt; hosts thousands of model checkpoints. Among them, &lt;strong&gt;Qwen&lt;/strong&gt; and &lt;strong&gt;LLaMA&lt;/strong&gt; families are everywhere: raw bases, chat-tuned “instruct” versions, and &lt;strong&gt;distilled&lt;/strong&gt; smaller models trained to imitate bigger ones for speed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LoRA&lt;/strong&gt; (Low-Rank Adaptation) means you do not retrain the whole model—you add a thin trainable “side path” and leave most weights frozen. That is cheap and fast, but the adapter is not a free-floating brain: it is tied to &lt;strong&gt;one specific base checkpoint&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The training recipe from &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-mistral-huggingface-lora-training-notes/&#34;&gt;Fine-Tuning a Mistral-Class Checkpoint with LoRA on Hugging Face&lt;/a&gt; still applies. This post is the “read me before you merge adapters” note: &lt;strong&gt;tokenizer&lt;/strong&gt;, &lt;strong&gt;position encoding (RoPE)&lt;/strong&gt;, &lt;strong&gt;attention layout (GQA)&lt;/strong&gt;, and &lt;strong&gt;licenses&lt;/strong&gt; differ by family. Getting any of those wrong often fails &lt;strong&gt;silently&lt;/strong&gt; (the model runs, but quality drifts or breaks in edge cases).&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;tiny-glossary-terms-that-show-up-in-configs-and-papers&#34;&gt;Tiny glossary (terms that show up in configs and papers)&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tokenizer&lt;/strong&gt; — The piece that turns raw text into integer &lt;strong&gt;tokens&lt;/strong&gt; the model sees. Different families use different tokenizers; swap the tokenizer and you have effectively changed the “alphabet” the model was trained on.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RoPE&lt;/strong&gt; — &lt;strong&gt;Rotary Positional Embedding&lt;/strong&gt;. A standard way to tell the model &lt;strong&gt;how far apart tokens are&lt;/strong&gt; in the sequence (position information) without paying the full cost of older absolute-position schemes. “&lt;strong&gt;RoPE scaling&lt;/strong&gt;” tweaks that math so the model can &lt;strong&gt;pretend&lt;/strong&gt; it was trained on longer sequences than it mostly saw—useful for long context, but easy to get wrong between train and serve.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GQA&lt;/strong&gt; — &lt;strong&gt;Grouped-Query Attention&lt;/strong&gt;. A memory-saving trick: several &lt;strong&gt;query&lt;/strong&gt; heads share the same &lt;strong&gt;key/value&lt;/strong&gt; heads instead of every query head getting its own K/V. Layout differs across checkpoints; mixing configs can break loading or silently change behavior if something maps layers wrong.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SFT&lt;/strong&gt; — &lt;strong&gt;Supervised Fine-Tuning&lt;/strong&gt;. You show (prompt, desired answer) pairs and train the model to imitate them—typical “instruction tuning.”&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PEFT&lt;/strong&gt; — &lt;strong&gt;Parameter-Efficient Fine-Tuning&lt;/strong&gt; (e.g. LoRA via the &lt;code&gt;peft&lt;/code&gt; library). You train few parameters; the rest stay frozen.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ADR&lt;/strong&gt; — &lt;strong&gt;Architecture Decision Record&lt;/strong&gt; — a short doc in your repo that says &lt;em&gt;why&lt;/em&gt; you picked a base model and how you pin versions.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;qwen-multilingual-tokenization-changes-the-effective-budget&#34;&gt;Qwen: multilingual tokenization changes the effective budget&lt;/h2&gt;
&lt;p&gt;Think of the tokenizer as a &lt;strong&gt;ruler&lt;/strong&gt; that measures text in &lt;strong&gt;tokens&lt;/strong&gt;, not characters. Qwen’s tokenizer is built for &lt;strong&gt;many languages&lt;/strong&gt;. That is great for products that are not English-only; it also changes planning in ways people miss.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gotcha 1 — Same “length” in characters ≠ same token budget.&lt;/strong&gt;&lt;br&gt;
A paragraph in English might become 800 tokens; the same &lt;em&gt;idea&lt;/em&gt; in another script might become 1,400 tokens. Your “4k context” window is &lt;strong&gt;4k tokens&lt;/strong&gt;, not 4k letters—so non-English locales can &lt;strong&gt;feel&lt;/strong&gt; like they have less room for instructions, tools, and retrieved documents.&lt;/p&gt;
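&lt;p&gt;A hedged back-of-envelope (all numbers illustrative) of how token density eats the same window:&lt;/p&gt;

```python
# Illustrative numbers only: subtract fixed overheads from the window, then
# see how many retrieved chunks fit at different token densities.
def user_budget(window, system_tokens, tool_tokens, reserved_output):
    return window - system_tokens - tool_tokens - reserved_output

room = user_budget(4096, 600, 300, 512)  # 2684 tokens left for history + RAG
en_chunks = room // 800      # 3 chunks at English-like token density
dense_chunks = room // 1400  # only 1 chunk at the heavier density
```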
&lt;p&gt;&lt;strong&gt;Gotcha 2 — English-only metrics lie.&lt;/strong&gt;&lt;br&gt;
If you only watch loss or accuracy on English dev data, you can ship something that looks fine in dashboards but degrades in other languages. Check &lt;strong&gt;each language you actually ship&lt;/strong&gt;, including formatting quirks (dates, numbers, mixed scripts).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gotcha 3 — RAG stacks compound the bias.&lt;/strong&gt;&lt;br&gt;
If your chunking, embedders, and eval sets were built English-first, retrieval quality drops where the tokenizer and embeddings were not aligned with real user text. The LoRA cannot fix upstream retrieval that never surfaces the right chunks.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;llama-rope-gqa-and-long-context-bases&#34;&gt;LLaMA: RoPE, GQA, and “long context” bases&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;RoPE and “long context” marketing&lt;/strong&gt;&lt;br&gt;
A checkpoint may advertise a large context window, but pretraining data is still &lt;strong&gt;mostly shorter&lt;/strong&gt; sequences. The model learns long-context behavior partly through &lt;strong&gt;extrapolation&lt;/strong&gt; (math and scaling tricks). If your &lt;strong&gt;SFT&lt;/strong&gt; data jumps wildly between very short and very long examples without a clear pattern, you can get brittle behavior at the lengths you care about in production.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gotcha 4 — Train-time vs serve-time RoPE settings must match.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;rope_scaling&lt;/code&gt;, &lt;code&gt;max_position_embeddings&lt;/code&gt;, and related flags in &lt;code&gt;config.json&lt;/code&gt; are not cosmetic. If training uses scaling recipe &lt;strong&gt;A&lt;/strong&gt; and your inference server uses &lt;strong&gt;B&lt;/strong&gt; (or an older Transformers version interprets them differently), you often get &lt;strong&gt;no crash&lt;/strong&gt;—just &lt;strong&gt;worse answers&lt;/strong&gt;, especially on long prompts. Pin library versions and dump the effective config next to your adapter in the same manifest.&lt;/p&gt;
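&lt;p&gt;Dumping the effective fields is cheap; a sketch (the key list is an assumption—copy whatever your training and serving stacks actually read):&lt;/p&gt;

```python
# Copy the position-encoding fields the serving stack reads into the run
# manifest so train/serve mismatches become a one-line diff.
ROPE_KEYS = ("rope_scaling", "rope_theta", "max_position_embeddings")

def rope_manifest(config: dict) -> dict:
    return {k: config.get(k) for k in ROPE_KEYS}
```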
&lt;p&gt;&lt;strong&gt;Gotcha 5 — GQA layout is part of the architecture contract.&lt;/strong&gt;&lt;br&gt;
Community and vendor builds differ in how many heads group together. An adapter trained on one revision is &lt;strong&gt;not&lt;/strong&gt; a guaranteed drop-in on another revision if attention layout or hidden sizes changed. Treat “same model name, different revision” like a &lt;strong&gt;different car engine&lt;/strong&gt;—the body looks similar, parts may not interchange.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gotcha 6 — “Eval average loss” hides long-context bugs.&lt;/strong&gt;&lt;br&gt;
Short training runs and average metrics smooth over failures that only appear at &lt;strong&gt;max production context&lt;/strong&gt;. Smoke tests at 200 steps are fine for wiring; you still need at least one eval that stresses &lt;strong&gt;the longest context you plan to use&lt;/strong&gt; (or the longest you can afford to simulate).&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;licenses-and-redistribution&#34;&gt;Licenses and redistribution&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Gotcha 7 — LoRA does not reset licensing.&lt;/strong&gt;&lt;br&gt;
You still ship a story that includes the &lt;strong&gt;base weights&lt;/strong&gt; (loaded at runtime) &lt;strong&gt;plus&lt;/strong&gt; your adapter (or a &lt;strong&gt;merged&lt;/strong&gt; full checkpoint). Some variants restrict &lt;strong&gt;commercial&lt;/strong&gt; use, require &lt;strong&gt;attribution&lt;/strong&gt;, or impose &lt;strong&gt;compliance&lt;/strong&gt; steps. “We only fine-tuned a little” does not exempt you from the base model’s terms.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gotcha 8 — Document Hub IDs like you document dependencies.&lt;/strong&gt;&lt;br&gt;
Put license notes beside &lt;strong&gt;exact&lt;/strong&gt; model IDs and &lt;strong&gt;revisions&lt;/strong&gt; (git-style hashes on the Hub) in the same ADR where you record dataset hashes and seeds. That is what legal and security reviews actually read.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;one-adapter-per-base-revision&#34;&gt;One adapter per base revision&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Gotcha 9 — Do not casually “port” adapters across bases.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;One LoRA per base revision&lt;/strong&gt; is the safe default. Research sometimes transfers adapters between related checkpoints; in production that is &lt;strong&gt;unforced risk&lt;/strong&gt;—small numerical differences in layernorms, embeddings, or RoPE implementation can accumulate into large behavior changes.&lt;/p&gt;
&lt;p&gt;If you must try transfer, treat it like a &lt;strong&gt;new model&lt;/strong&gt;: full eval gates, regression tests on golden prompts, and no “we eyeballed one benchmark screenshot.”&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;step-by-step-when-switching-families-qwen--llama&#34;&gt;Step-by-step when switching families (Qwen ↔ LLaMA)&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://blog.veydh.com/2026/2026-01-lora-and-small-distilled-models-primer/&#34;&gt;LoRA: Fine-Tuning Small Distilled Models Without Full Weights&lt;/a&gt; and &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-mistral-huggingface-lora-training-notes/&#34;&gt;Fine-Tuning a Mistral-Class Checkpoint with LoRA on Hugging Face&lt;/a&gt;’s &lt;strong&gt;PEFT&lt;/strong&gt; recipe stays; below are &lt;strong&gt;extra&lt;/strong&gt; gates per family.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tokenizer probe&lt;/strong&gt; — For each locale you ship, tokenize a &lt;strong&gt;fixed&lt;/strong&gt; paragraph of representative text. Record &lt;strong&gt;token count&lt;/strong&gt; and verify &lt;strong&gt;round-trip&lt;/strong&gt;: for string inputs, &lt;code&gt;decode(encode(x))&lt;/code&gt; should equal &lt;code&gt;x&lt;/code&gt; (or document intentional normalization). Multilingual bases punish “character budget” planning that assumed English token density.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Template probe&lt;/strong&gt; — Dump &lt;code&gt;tokenizer.apply_chat_template&lt;/code&gt; on one golden multi-turn example to a &lt;code&gt;.txt&lt;/code&gt; file in git; &lt;strong&gt;diff&lt;/strong&gt; that file in PRs that touch chat formatting. Qwen and LLaMA instruct variants use different &lt;strong&gt;role markers&lt;/strong&gt; and spacing; mixing templates means you train the model to expect the wrong framing at inference.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;RoPE / context flags&lt;/strong&gt; — Copy into your run manifest: &lt;code&gt;rope_scaling&lt;/code&gt;, &lt;code&gt;max_position_embeddings&lt;/code&gt;, and any other fields your &lt;strong&gt;training&lt;/strong&gt; and &lt;strong&gt;serving&lt;/strong&gt; stacks read from &lt;code&gt;config.json&lt;/code&gt;. Training with scaling &lt;strong&gt;A&lt;/strong&gt; and serving with scaling &lt;strong&gt;B&lt;/strong&gt; is a classic &lt;strong&gt;silent&lt;/strong&gt; quality regression.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Train short, eval long&lt;/strong&gt; — Run a short smoke job for plumbing; then evaluate at the &lt;strong&gt;maximum context&lt;/strong&gt; you intend to run in production (or the nearest affordable subset). Extrapolation bugs show up here, not in smoothed training loss.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Merge policy&lt;/strong&gt; — Decide early whether production loads &lt;strong&gt;&lt;code&gt;PeftModel&lt;/code&gt;&lt;/strong&gt; (base + adapter at runtime) or a &lt;strong&gt;merged&lt;/strong&gt; &lt;code&gt;save_pretrained&lt;/code&gt; directory. &lt;strong&gt;Merge&lt;/strong&gt; simplifies some C++ or single-binary servers; it complicates “swap adapter only” deploys and can blur audit trails if you lose the separation between base and delta.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
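&lt;p&gt;Gate 1 is easy to script. A minimal sketch of the &lt;strong&gt;tokenizer probe&lt;/strong&gt;, written against any object exposing &lt;code&gt;encode&lt;/code&gt;/&lt;code&gt;decode&lt;/code&gt;; the whitespace stub below is a stand-in so the snippet runs anywhere, and in practice you would pass a pinned &lt;code&gt;AutoTokenizer&lt;/code&gt; instead:&lt;/p&gt;

```python
# Illustrative probe: works with any object exposing encode/decode, e.g. a
# Hugging Face tokenizer loaded with AutoTokenizer.from_pretrained(...).
def probe(tokenizer, samples):
    """Per-sample token count plus whether decode(encode(x)) equals x."""
    report = []
    for text in samples:
        ids = tokenizer.encode(text)
        round_trip = tokenizer.decode(ids)
        report.append({
            "tokens": len(ids),
            "round_trip_exact": round_trip == text,
        })
    return report

# Trivial stand-in tokenizer so the probe runs without downloading weights.
class WhitespaceTokenizer:
    def encode(self, text):
        return text.split(" ")
    def decode(self, ids):
        return " ".join(ids)

result = probe(WhitespaceTokenizer(), ["bonjour le monde", "hello  world"])
print(result)
```

&lt;p&gt;Run it on the same fixed paragraph per locale and diff the token counts against the English baseline.&lt;/p&gt;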
&lt;p&gt;&lt;strong&gt;Optional merge&lt;/strong&gt; (when you explicitly want one directory of full weights):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; torch
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; transformers &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; AutoModelForCausalLM
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; peft &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; PeftModel
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;base &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; AutoModelForCausalLM&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;from_pretrained(BASE_ID, revision&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;BASE_REV, torch_dtype&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;torch&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;bfloat16, device_map&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;cpu&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;base &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; PeftModel&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;from_pretrained(base, &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;./lora-out&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;merged &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; base&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;merge_and_unload()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;merged&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;save_pretrained(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;./merged-model&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;tokenizer&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;save_pretrained(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;./merged-model&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Gotcha 10 — Re-evaluate after merge.&lt;/strong&gt;&lt;br&gt;
Run the &lt;strong&gt;same&lt;/strong&gt; eval suite on merged weights. Do not assume merge is bitwise-identical to “base + adapter” forward pass in every runtime without checking—numerical paths and optimizations can differ.&lt;/p&gt;
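&lt;p&gt;The merge itself is plain arithmetic: the merged weight equals the base plus the scaled low-rank delta, so on toy float64 matrices the single-matrix forward pass matches base-plus-adapter exactly. Real runtimes in bf16 with fused kernels need not match this cleanly, which is exactly why the re-evaluation matters:&lt;/p&gt;

```python
import numpy as np

# LoRA merge math on toy matrices; shapes and values are illustrative,
# not taken from any real checkpoint.
rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4
W = rng.normal(size=(d, d))   # base weight
A = rng.normal(size=(r, d))   # LoRA down-projection
B = rng.normal(size=(d, r))   # LoRA up-projection

# Merged weight: base plus scaled low-rank delta.
merged = W + (alpha / r) * (B @ A)

# Base-plus-adapter forward pass on a toy input matches the merged matrix.
x = rng.normal(size=(d,))
adapter_path = W @ x + (alpha / r) * (B @ (A @ x))
assert np.allclose(merged @ x, adapter_path)
```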
&lt;hr&gt;
&lt;h2 id=&#34;hub--tooling-matrix-same-training-different-ergonomics&#34;&gt;Hub / tooling matrix (same training, different ergonomics)&lt;/h2&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Workflow&lt;/th&gt;
          &lt;th&gt;Tooling&lt;/th&gt;
          &lt;th&gt;When to use it&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Notebook → script&lt;/td&gt;
          &lt;td&gt;Transformers + PEFT + &lt;strong&gt;TRL&lt;/strong&gt; (Transformer Reinforcement Learning library—helpers for SFT/RLHF-style training on HF)&lt;/td&gt;
          &lt;td&gt;Small teams, full control&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;YAML-driven SFT&lt;/td&gt;
          &lt;td&gt;LLaMA-Factory, Axolotl&lt;/td&gt;
          &lt;td&gt;Faster iteration when configs are shared&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Managed fine-tune&lt;/td&gt;
          &lt;td&gt;Cloud vendor jobs&lt;/td&gt;
          &lt;td&gt;When compliance wants a hosted trail; still export adapter + manifest&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The &lt;strong&gt;reproducibility&lt;/strong&gt; invariant is the same: pinned &lt;strong&gt;base&lt;/strong&gt; revision, frozen dataset hash, logged &lt;code&gt;lora_config&lt;/code&gt;, &lt;strong&gt;identical&lt;/strong&gt; chat template in train and serve.&lt;/p&gt;
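&lt;p&gt;That invariant can be as small as one JSON file per run; the field names below are illustrative, not a standard schema:&lt;/p&gt;

```python
import hashlib
import json

# Hypothetical run manifest capturing the invariants named above.
def dataset_hash(files):
    """Stable digest over sorted filenames plus contents."""
    h = hashlib.sha256()
    for name in sorted(files):
        h.update(name.encode())
        h.update(files[name])
    return h.hexdigest()

manifest = {
    "base_id": "mistralai/Mistral-7B-Instruct-v0.3",
    "base_revision": "deadbeef",          # placeholder: pin a commit, never a branch
    "dataset_sha256": dataset_hash({"train.jsonl": b"file-bytes"}),
    "lora_config": {"r": 16, "lora_alpha": 32,
                    "target_modules": ["q_proj", "v_proj"]},
    "chat_template_sha256": hashlib.sha256(b"template-dump").hexdigest(),
}
print(json.dumps(manifest, indent=2))
```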
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In one sentence: &lt;strong&gt;family-specific tokenizer and RoPE (and attention layout) details&lt;/strong&gt; decide whether your fine-tune is stable—or fragile in ways that only show up under real prompts and real languages. &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-llm-serving-apis-openai-compatible/&#34;&gt;Serving Distilled Models Behind an HTTP API&lt;/a&gt; moves from training to &lt;strong&gt;serving&lt;/strong&gt;: HTTP boundaries, OpenAI-compatible APIs, and runtime choices once the adapter exists.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Serving Distilled Models Behind an HTTP API</title>
      <link>https://blog.veydh.com/2026/2026-02-llm-serving-apis-openai-compatible/</link>
      <pubDate>Mon, 09 Feb 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-02-llm-serving-apis-openai-compatible/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;Training a LoRA is satisfying (especially once outputs start leaning toward the behavior you&amp;rsquo;re after); &lt;strong&gt;serving&lt;/strong&gt; it is where software engineering returns. Clients should see stable JSON, explicit errors, and predictable latency—not a Jupyter cell calling &lt;code&gt;model.generate&lt;/code&gt;. The HTTP layer is a &lt;strong&gt;contract&lt;/strong&gt;: versioning, observability, and back-pressure belong there, not buried inside CUDA kernels.&lt;/p&gt;
&lt;p&gt;This post frames &lt;strong&gt;OpenAI-compatible&lt;/strong&gt; surfaces and the runtimes behind them, in the same spirit as the interop and API-design posts from 2024.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;why-openai-compatible-routes-help&#34;&gt;Why OpenAI-compatible routes help&lt;/h2&gt;
&lt;p&gt;Exposing &lt;code&gt;/v1/chat/completions&lt;/code&gt; (or a close cousin) means existing SDKs, proxies, and internal tools often &lt;strong&gt;drop in&lt;/strong&gt; without a rewrite. You still document your own limits—max context, supported modalities, tool-calling quirks—but you avoid inventing yet another message schema for every new service.&lt;/p&gt;
&lt;p&gt;Compatibility is not perfection. Deltas should be documented explicitly (streaming format, stop sequences, tool JSON) so client teams do not rely on undocumented behavior.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;runtime-choices-vllm-tgi-llamacpp-and-friends&#34;&gt;Runtime choices: vLLM, TGI, llama.cpp, and friends&lt;/h2&gt;
&lt;p&gt;The right engine depends on constraints more than on hype:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multi-GPU, high throughput, Python ecosystem&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;vLLM&lt;/strong&gt; and similar servers emphasize batching, PagedAttention-style memory use, and throughput. Good when you have NVIDIA hardware and want many concurrent sessions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hugging Face TGI&lt;/strong&gt;&lt;br&gt;
Another solid path when you want tight integration with Hub workflows and well-traveled Docker images.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;CPU or edge, smaller models&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;llama.cpp&lt;/strong&gt;-family servers shine when GPUs are scarce or you need a portable binary. Latency per token differs; total cost of ownership may still win.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The choice rests on &lt;strong&gt;SLO&lt;/strong&gt;, &lt;strong&gt;hardware&lt;/strong&gt;, and &lt;strong&gt;team familiarity&lt;/strong&gt;—not benchmark leaderboard bragging rights alone.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;operations-health-versions-request-ids&#34;&gt;Operations: health, versions, request IDs&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Health checks&lt;/strong&gt; should validate more than “process is up”: load the tokenizer, warm the model, and optionally run a tiny forward pass in staging. &lt;strong&gt;Model version&lt;/strong&gt; belongs in response headers or metadata so clients can log which revision answered.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Request IDs&lt;/strong&gt; propagate through auth, rate limiting, inference, and logging—the same discipline as wiring gRPC and HTTP across Python and Java services. When something regresses, one string traces the path.&lt;/p&gt;
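&lt;p&gt;One framework-agnostic way to thread a request ID through those layers is a context variable; the layer names and log format here are illustrative:&lt;/p&gt;

```python
import contextvars
import uuid

# Sketch: one request ID flowing through auth, inference, and logging via a
# context variable, so every log line carries the same string.
request_id = contextvars.ContextVar("request_id", default="-")

def log(msg):
    print(f"[{request_id.get()}] {msg}")

def handle(prompt):
    token = request_id.set(uuid.uuid4().hex[:8])
    try:
        log("auth ok")
        log(f"inference start ({len(prompt)} chars)")
        log("response sent")
    finally:
        request_id.reset(token)   # restore the default outside the request

handle("Say OK.")
```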
&lt;hr&gt;
&lt;h2 id=&#34;cost-control-restricting-token-usage&#34;&gt;Cost control: restricting token usage&lt;/h2&gt;
&lt;p&gt;Cost tracks &lt;strong&gt;tokens processed&lt;/strong&gt; (prompt + completion) and &lt;strong&gt;how often&lt;/strong&gt; you run inference. To manage spend you need &lt;strong&gt;hard limits&lt;/strong&gt; clients cannot bypass, &lt;strong&gt;visibility&lt;/strong&gt; into who burned what, and &lt;strong&gt;fairness&lt;/strong&gt; so one integration cannot exhaust the pool.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Cap generation length server-side.&lt;/strong&gt; OpenAI-shaped APIs accept &lt;code&gt;max_tokens&lt;/code&gt; (or &lt;code&gt;max_completion_tokens&lt;/code&gt;) on the request; that only helps if you &lt;strong&gt;enforce a ceiling&lt;/strong&gt;: ignore values above your tier’s cap, or clamp them downward before the request reaches the engine. Otherwise a buggy or malicious client can ask for tens of thousands of new tokens per call. The engine should still return a clear &lt;strong&gt;finish reason&lt;/strong&gt; when generation stops on length.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Bound the context window at the runtime.&lt;/strong&gt; Engines expose a maximum sequence length (e.g. vLLM’s &lt;code&gt;--max-model-len&lt;/code&gt;, llama.cpp’s &lt;code&gt;-c&lt;/code&gt;). That limits &lt;strong&gt;worst-case&lt;/strong&gt; memory and per-request work. It does not replace &lt;strong&gt;policy&lt;/strong&gt;: you still want pre-flight checks (same tokenizer as the model) so oversize prompts fail fast with a 4xx instead of tying up a slot—see &lt;a href=&#34;https://blog.veydh.com/2026/2026-03-token-limits-budgets-and-truncation/&#34;&gt;Token Budgets: Enforcing Limits at the API Layer&lt;/a&gt; for counting, truncation, and streaming semantics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Rate-limit and quota by tenant or key.&lt;/strong&gt; Tokens per minute (TPM) and requests per minute (RPM), optionally combined with daily or monthly &lt;strong&gt;budgets&lt;/strong&gt;, belong in the &lt;strong&gt;gateway or auth layer&lt;/strong&gt; (API gateway, Envoy, Kong, or a thin BFF) keyed off API keys or JWT claims. That caps &lt;strong&gt;aggregate&lt;/strong&gt; burn even when each individual request is small.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4. Meter and alert.&lt;/strong&gt; Log or export &lt;code&gt;prompt_tokens&lt;/code&gt;, &lt;code&gt;completion_tokens&lt;/code&gt;, and &lt;strong&gt;total&lt;/strong&gt; per request ID and tenant; chart burn rate and set alerts before the bill surprises anyone. Chargeback by team or product line turns “model cost” from a black box into an engineering lever.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5. Operational knobs that are not a substitute for policy.&lt;/strong&gt; Continuous batching and high concurrency improve &lt;strong&gt;throughput&lt;/strong&gt;; they do not stop runaway prompts. Default &lt;strong&gt;low&lt;/strong&gt; &lt;code&gt;max_tokens&lt;/code&gt; for anonymous or trial tiers; require review or higher entitlements for large completions.&lt;/p&gt;
&lt;p&gt;In short: &lt;strong&gt;runtime&lt;/strong&gt; caps define physics (max length, VRAM); &lt;strong&gt;API/gateway&lt;/strong&gt; caps define &lt;strong&gt;product and finance&lt;/strong&gt; (per-call generation limit, per-tenant TPM, monthly budget). You want both.&lt;/p&gt;
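&lt;p&gt;The per-call half of that policy is a few lines at the gateway. A sketch of clamping &lt;code&gt;max_tokens&lt;/code&gt; to a tier ceiling before the body reaches the engine; the tier names and caps are assumptions, not a standard:&lt;/p&gt;

```python
# Illustrative gateway-side clamp for an OpenAI-shaped request body.
# Tier names and caps are assumptions; tune them to your product.
TIER_CAPS = {"trial": 128, "standard": 1024, "internal": 4096}

def clamp_generation(body, tier):
    cap = TIER_CAPS.get(tier, TIER_CAPS["trial"])
    requested = body.get("max_tokens") or cap   # missing field defaults to the cap
    clamped = dict(body)                        # never mutate the caller's dict
    clamped["max_tokens"] = min(requested, cap)
    return clamped

req = {"model": "m", "messages": [], "max_tokens": 50000}
print(clamp_generation(req, "trial"))   # max_tokens clamped to the trial cap
```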
&lt;hr&gt;
&lt;h2 id=&#34;concrete-runtimes-commands-and-trade-offs&#34;&gt;Concrete runtimes: commands and trade-offs&lt;/h2&gt;
&lt;p&gt;These are &lt;strong&gt;illustrative&lt;/strong&gt;; pin container tags and model revisions in your own manifests.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;vLLM (multi-session GPU throughput)&lt;/strong&gt; — OpenAI-compatible server, good when you already have CUDA and want continuous batching:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Example only: replace model path and image tag with pinned versions&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# --max-model-len caps context (and thus per-request upper bound on work)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;docker run --gpus all -p 8000:8000 &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  vllm/vllm-openai:latest &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  --model mistralai/Mistral-7B-Instruct-v0.3 &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  --dtype bfloat16 &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  --max-model-len &lt;span style=&#34;color:#ae81ff&#34;&gt;8192&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;TGI (Text Generation Inference)&lt;/strong&gt; — Strong Hub/Docker story; similar niche to vLLM for transformer serving behind HTTP.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;llama.cpp server&lt;/strong&gt; — CPU or smaller GPUs; serve the same quantization you evaluated (&lt;code&gt;Q4_K_M&lt;/code&gt;, etc.):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;./server -m ./models/model.gguf --port &lt;span style=&#34;color:#ae81ff&#34;&gt;8080&lt;/span&gt; -c &lt;span style=&#34;color:#ae81ff&#34;&gt;4096&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Runtime choice:&lt;/strong&gt; &lt;strong&gt;vLLM/TGI&lt;/strong&gt; when concurrent GPU sessions and batching dominate SLOs; &lt;strong&gt;llama.cpp&lt;/strong&gt; when footprint, CPU-only, or edge packaging matters more than peak tokens/sec. &lt;strong&gt;Always&lt;/strong&gt; re-benchmark &lt;strong&gt;your&lt;/strong&gt; adapter + prompt template on the chosen runtime—kernel and quant interact with LoRA paths.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;minimal-openai-shaped-curl-contract-test&#34;&gt;Minimal OpenAI-shaped &lt;code&gt;curl&lt;/code&gt; contract test&lt;/h2&gt;
&lt;p&gt;After deploy, hit &lt;strong&gt;non-streaming&lt;/strong&gt; chat with a frozen prompt and compare &lt;strong&gt;SHA&lt;/strong&gt; of normalized JSON (strip volatile fields like &lt;code&gt;id&lt;/code&gt; timestamps if needed):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;curl -s http://localhost:8000/v1/chat/completions &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  -H &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Content-Type: application/json&amp;#34;&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  -d &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;{&amp;#34;model&amp;#34;:&amp;#34;mistralai/Mistral-7B-Instruct-v0.3&amp;#34;,&amp;#34;messages&amp;#34;:[{&amp;#34;role&amp;#34;:&amp;#34;user&amp;#34;,&amp;#34;content&amp;#34;:&amp;#34;Say OK.&amp;#34;}],&amp;#34;max_tokens&amp;#34;:8,&amp;#34;temperature&amp;#34;:0}&amp;#39;&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  | jq .
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Check &lt;code&gt;choices[0].finish_reason&lt;/code&gt;, token usage fields, and error schema for oversize prompts—those are the surfaces &lt;a href=&#34;https://blog.veydh.com/2026/2026-03-token-limits-budgets-and-truncation/&#34;&gt;Token Budgets: Enforcing Limits at the API Layer&lt;/a&gt; tightens.&lt;/p&gt;
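&lt;p&gt;The normalization step can live in a tiny helper: drop volatile fields, serialize canonically, hash. The field names beyond &lt;code&gt;id&lt;/code&gt; are assumptions about your response schema:&lt;/p&gt;

```python
import hashlib
import json

# Illustrative normalizer for contract tests: two responses that differ only
# in volatile fields should hash identically.
VOLATILE = {"id", "created"}

def normalized_sha(response):
    stable = {k: v for k, v in response.items() if k not in VOLATILE}
    blob = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()

a = {"id": "cmpl-1", "created": 1, "choices": [{"finish_reason": "stop"}]}
b = {"id": "cmpl-2", "created": 2, "choices": [{"finish_reason": "stop"}]}
assert normalized_sha(a) == normalized_sha(b)
```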
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This post closes the loop from “weights on disk” to “service behind a load balancer,” including &lt;strong&gt;where&lt;/strong&gt; to enforce token and rate limits so costs stay predictable: ceilings on completion length and context at the engine, plus TPM/RPM and budgets at the gateway. &lt;a href=&#34;https://blog.veydh.com/2026/2026-03-token-limits-budgets-and-truncation/&#34;&gt;Token Budgets: Enforcing Limits at the API Layer&lt;/a&gt; goes deeper on &lt;strong&gt;pre-flight counting&lt;/strong&gt;, truncation strategies, and client contracts—stack it with the serving layer you run here.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>RAG Foundations: Embeddings, Chunking, and the Retrieval Loop</title>
      <link>https://blog.veydh.com/2026/2026-01-rag-foundations-embeddings-and-retrieval/</link>
      <pubDate>Wed, 14 Jan 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-01-rag-foundations-embeddings-and-retrieval/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Retrieval-Augmented Generation&lt;/strong&gt; is the pattern everyone names when they want the model to cite &lt;em&gt;your&lt;/em&gt; PDFs, tickets, or runbooks instead of improvising from weights alone. It is less like a single model feature and more like a small &lt;strong&gt;data product&lt;/strong&gt;: you ingest documents, turn them into searchable units, pull the best candidates at query time, and only then hand text to the generator.&lt;/p&gt;
&lt;p&gt;This month locks in vocabulary and failure modes before the later posts go deeper on re-ranking, evaluation, and the distilled-model stack paired with RAG through the rest of 2026.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;the-core-loop-end-to-end&#34;&gt;The core loop, end to end&lt;/h2&gt;
&lt;p&gt;The happy path looks boring on purpose—that is a good sign.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ingest&lt;/strong&gt;&lt;br&gt;
You load sources (HTML, Markdown, PDF text, Slack exports, whatever). Normalization matters: strip boilerplate headers, fix encoding, and keep enough metadata (source URL, section, &lt;code&gt;updated_at&lt;/code&gt;) that you can filter later.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Chunk&lt;/strong&gt;&lt;br&gt;
Models do not read whole repositories in one gulp. You split text into overlapping segments—classically hundreds to low thousands of &lt;strong&gt;tokens&lt;/strong&gt; per chunk, with &lt;strong&gt;overlap&lt;/strong&gt; so sentences at boundaries are not orphaned. Tables and lists punish naive “split every N characters” strategies; structure-aware splitting is preferable when the format allows it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Embed&lt;/strong&gt;&lt;br&gt;
An embedding model maps each chunk to a dense vector. At query time you embed the user question (some stacks use &lt;strong&gt;asymmetric&lt;/strong&gt; encoders—one flavor for passages, one for questions—while many use a single model for both).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Retrieve&lt;/strong&gt;&lt;br&gt;
You score candidates with cosine similarity, dot product (equivalent to cosine once vectors are unit-normalized), or a hybrid of &lt;strong&gt;BM25 + vectors&lt;/strong&gt;. You return &lt;strong&gt;top-k&lt;/strong&gt; chunks, optionally after metadata filters (tenant, product line, language).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Generate&lt;/strong&gt;&lt;br&gt;
The LLM receives a system or developer instruction, the retrieved passages in a clearly delimited block, and the user message. The model’s job is to &lt;strong&gt;ground&lt;/strong&gt; the answer in those passages while refusing when nothing supports a claim.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;That loop is simple to diagram and surprisingly easy to get subtly wrong in production.&lt;/p&gt;
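&lt;p&gt;Step 4 in miniature, with random vectors standing in for real embeddings: normalize, take dot products (cosine on unit vectors), return the top-k indices:&lt;/p&gt;

```python
import numpy as np

# Toy retrieval scoring; the vectors are random stand-ins for real embeddings.
rng = np.random.default_rng(7)
chunks = rng.normal(size=(5, 16))   # 5 chunk vectors, 16 dims each
query = rng.normal(size=(16,))

def cosine_top_k(q, m, k):
    m_n = m / np.linalg.norm(m, axis=1, keepdims=True)
    q_n = q / np.linalg.norm(q)
    scores = m_n @ q_n               # dot product of unit vectors is cosine
    order = np.argsort(-scores)[:k]  # highest scores first
    return list(zip(order.tolist(), scores[order].tolist()))

top = cosine_top_k(query, chunks, 2)
print(top)
```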
&lt;hr&gt;
&lt;h2 id=&#34;chunking-where-the-most-silent-damage-shows-up&#34;&gt;Chunking: where the most silent damage shows up&lt;/h2&gt;
&lt;p&gt;If chunks are too large, you dilute relevance—the vector averages over too many topics and retrieval becomes fuzzy. If chunks are too small, you lose definitions and antecedents (“it”, “the policy”) that only make sense in a wider window.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Overlap&lt;/strong&gt; is not waste; it is insurance that a sentence sliced in half still appears whole in at least one neighbor chunk. Overlap percentage is a tunable knob alongside chunk size, not a fixed constant from a tutorial.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tables&lt;/strong&gt; deserve a callout. Row-wise splitting can destroy joins across columns; sometimes keeping a whole small table in one chunk is better than perfect token parity. When tables are huge, consider extracting them to a structured store and retrieving them by key while still RAG-ing the narrative docs.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;embeddings-and-the-vector-store&#34;&gt;Embeddings and the vector store&lt;/h2&gt;
&lt;p&gt;The embedding model defines the geometry of your search. A model trained for &lt;strong&gt;semantic&lt;/strong&gt; similarity helps with paraphrases; a weaker or badly matched embedder sends the generator irrelevant paragraphs that &lt;em&gt;look&lt;/em&gt; related in keyword space only.&lt;/p&gt;
&lt;p&gt;Vector databases and libraries differ in filtering, hybrid search, quantization, and ops maturity. For early work, &lt;strong&gt;reproducibility&lt;/strong&gt; (fixed model revision, recorded chunking parameters) matters more than shaving microseconds off single-query latency. Once traffic grows, batching, caching query embeddings, and separating hot tenants matter.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;pitfalls-to-plan-for-explicitly&#34;&gt;Pitfalls to plan for explicitly&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Stale corpora&lt;/strong&gt;&lt;br&gt;
If the retrieval index lags the truth, the model confidently cites outdated policy. Version your index builds and expose “as of” metadata when users need it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lost in the middle&lt;/strong&gt;&lt;br&gt;
Stuffing ten long chunks into the context window does not mean the model weighs them evenly. Keep k modest, summarize long threads, or use re-ranking (&lt;a href=&#34;https://blog.veydh.com/2026/2026-01-rag-operations-re-ranking-and-evaluation/&#34;&gt;RAG in Production: Re-ranking, HyDE, and Simple Evals&lt;/a&gt;) to present fewer, sharper passages.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Evaluation theater&lt;/strong&gt;&lt;br&gt;
Eyeballing answers in a notebook is not a test harness. Hold out question–answer pairs, measure whether the &lt;em&gt;right&lt;/em&gt; chunk would be retrieved, and track regressions when you change embedder or chunking.&lt;/p&gt;
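&lt;p&gt;Even the smallest harness beats eyeballing. A hit-rate-at-k sketch over held-out pairs; the chunk IDs are illustrative:&lt;/p&gt;

```python
# Minimal retrieval eval: fraction of (gold_chunk, retrieved_list) pairs where
# the gold chunk appears in the top-k. Data below is illustrative.
def hit_rate_at_k(results, k):
    hits = sum(1 for gold, retrieved in results if gold in retrieved[:k])
    return hits / len(results)

eval_set = [
    ("chunk-policy-3",  ["chunk-policy-3", "chunk-intro-1"]),
    ("chunk-refunds-7", ["chunk-intro-1", "chunk-refunds-7"]),
    ("chunk-sla-2",     ["chunk-intro-1", "chunk-policy-3"]),
]
print(hit_rate_at_k(eval_set, 2))   # 2 of 3 gold chunks appear in the top-2
```

&lt;p&gt;Track this number across embedder and chunking changes; a silent drop here is exactly the regression eyeballing misses.&lt;/p&gt;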
&lt;hr&gt;
&lt;h2 id=&#34;tooling-map-how-teams-actually-wire-the-loop&#34;&gt;Tooling map: how teams actually wire the loop&lt;/h2&gt;
&lt;p&gt;The loop above is runtime-agnostic. In practice you glue &lt;strong&gt;ingest → chunk → embed → index → query → generate&lt;/strong&gt; with one of these families (often mixed):&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Layer&lt;/th&gt;
          &lt;th&gt;Representative tools&lt;/th&gt;
          &lt;th&gt;What you trade&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Orchestration&lt;/td&gt;
          &lt;td&gt;&lt;a href=&#34;https://python.langchain.com/&#34;&gt;LangChain&lt;/a&gt;, &lt;a href=&#34;https://www.llamaindex.ai/&#34;&gt;LlamaIndex&lt;/a&gt;, &lt;a href=&#34;https://haystack.deepset.ai/&#34;&gt;Haystack&lt;/a&gt;, custom Python&lt;/td&gt;
          &lt;td&gt;LangChain/LlamaIndex ship fast demos and integrations; custom code minimizes magic and keeps dependencies thin.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Chunking&lt;/td&gt;
          &lt;td&gt;Framework splitters, &lt;a href=&#34;https://unstructured.io/&#34;&gt;Unstructured&lt;/a&gt; for messy PDFs, &lt;code&gt;langchain_text_splitters&lt;/code&gt;, hand-rolled Markdown headers&lt;/td&gt;
          &lt;td&gt;PDFs/HTML reward structure-aware parsers; naive &lt;code&gt;CharacterTextSplitter&lt;/code&gt; is fine only for clean text.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Embeddings&lt;/td&gt;
          &lt;td&gt;&lt;a href=&#34;https://www.sbert.net/&#34;&gt;sentence-transformers&lt;/a&gt;, OpenAI/Cohere/Voyage APIs, models on the Hub&lt;/td&gt;
          &lt;td&gt;APIs = ops simplicity + cost; self-hosted = reproducibility + no egress; pin &lt;strong&gt;model revision&lt;/strong&gt;.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Vector index&lt;/td&gt;
          &lt;td&gt;&lt;a href=&#34;https://github.com/facebookresearch/faiss&#34;&gt;FAISS&lt;/a&gt; (in-process), &lt;a href=&#34;https://www.trychroma.com/&#34;&gt;Chroma&lt;/a&gt;, &lt;a href=&#34;https://qdrant.tech/&#34;&gt;Qdrant&lt;/a&gt;, &lt;a href=&#34;https://github.com/pgvector/pgvector&#34;&gt;pgvector&lt;/a&gt;, managed cloud DBs&lt;/td&gt;
          &lt;td&gt;FAISS/Chroma for prototypes; &lt;strong&gt;pgvector&lt;/strong&gt; when you already own Postgres and want SQL + metadata filters; Qdrant/Milvus when you outgrow a single node.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Lexical / hybrid&lt;/td&gt;
          &lt;td&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Okapi_BM25&#34;&gt;BM25&lt;/a&gt; via Elasticsearch/OpenSearch, built-in hybrid in some vector DBs&lt;/td&gt;
          &lt;td&gt;Pure vectors miss exact SKUs and codes; hybrid fixes “semantic near-miss” failures.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Generation&lt;/td&gt;
          &lt;td&gt;Any chat API or vLLM/TGI self-host (see &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-llm-serving-apis-openai-compatible/&#34;&gt;Serving Distilled Models Behind an HTTP API&lt;/a&gt;)&lt;/td&gt;
          &lt;td&gt;Same RAG stack; swap generator without re-indexing if prompts stay stable.&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Pros / cons in one sentence each:&lt;/strong&gt; &lt;em&gt;Framework-first&lt;/em&gt; (LangChain/LlamaIndex) accelerates time-to-demo but can hide tokenizer/chunk bugs until production. &lt;em&gt;Vector-in-Postgres&lt;/em&gt; reduces moving parts for teams that already run Postgres and need row-level security. &lt;em&gt;Separate vector DB&lt;/em&gt; scales ANN search and hybrid features but adds another backup and monitoring target.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;minimal-reproducible-path-self-hosted-embed--local-index&#34;&gt;Minimal reproducible path (self-hosted embed + local index)&lt;/h2&gt;
&lt;p&gt;Goal: one machine, fixed dependency versions, same chunks every run.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Environment&lt;/strong&gt; — Python 3.11+, &lt;code&gt;pip install chromadb sentence-transformers&lt;/code&gt; (pin versions in &lt;code&gt;requirements.txt&lt;/code&gt;; record &lt;code&gt;pip freeze&lt;/code&gt; hash in your ADR). For header-aware splitting later, add &lt;code&gt;langchain-text-splitters&lt;/code&gt; or Unstructured and &lt;strong&gt;record the extra dependency&lt;/strong&gt; as part of the index version.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Corpus&lt;/strong&gt; — Start with a folder of &lt;code&gt;.md&lt;/code&gt; or &lt;code&gt;.txt&lt;/code&gt; files; later swap in Unstructured for PDFs. Record file list + checksums.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk&lt;/strong&gt; — Fixed &lt;code&gt;chunk_size&lt;/code&gt; (tokens or characters), fixed &lt;code&gt;chunk_overlap&lt;/code&gt;, single splitter implementation. Log the parameters beside the index name.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Embed&lt;/strong&gt; — Load one Hub model id with &lt;code&gt;revision=&lt;/code&gt;; embed documents in batches; persist the Chroma collection name + embed model id.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retrieve&lt;/strong&gt; — For each eval question, &lt;code&gt;collection.query(query_texts=[q], n_results=k)&lt;/code&gt;; log returned chunk ids + scores.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generate&lt;/strong&gt; — Build one prompt template: system instruction + &lt;code&gt;---&lt;/code&gt; delimited passages + user question. Pin chat model &lt;strong&gt;revision&lt;/strong&gt; the same way as the embedder.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A compact indexing sketch (adapt names and paths):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# index_build.py — illustrative; pin package versions in production&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; chromadb
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; chromadb.utils &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; embedding_functions
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;CHUNK_SIZE &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;800&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;CHUNK_OVERLAP &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;120&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;chunk_text&lt;/span&gt;(text: str, size: int &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; CHUNK_SIZE, overlap: int &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; CHUNK_OVERLAP) &lt;span style=&#34;color:#f92672&#34;&gt;-&amp;gt;&lt;/span&gt; list[str]:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&amp;#34;&amp;#34;Character-based chunks; swap for structure-aware splitters when format allows.&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    chunks: list[str] &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; []
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    start &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    n &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; len(text)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;while&lt;/span&gt; start &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;&lt;/span&gt; n:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        end &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; min(start &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; size, n)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        chunks&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;append(text[start:end])
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; end &lt;span style=&#34;color:#f92672&#34;&gt;==&lt;/span&gt; n:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#66d9ef&#34;&gt;break&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        start &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; max(end &lt;span style=&#34;color:#f92672&#34;&gt;-&lt;/span&gt; overlap, start &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;return&lt;/span&gt; chunks
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;load_docs&lt;/span&gt;(paths: list[str]) &lt;span style=&#34;color:#f92672&#34;&gt;-&amp;gt;&lt;/span&gt; list[str]:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    chunks: list[str] &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; []
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; p &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; paths:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        text &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; open(p, encoding&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;utf-8&amp;#34;&lt;/span&gt;)&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;read()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        chunks&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;extend(chunk_text(text))
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;return&lt;/span&gt; chunks
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;main&lt;/span&gt;():
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    ef &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; embedding_functions&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;SentenceTransformerEmbeddingFunction(
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        model_name&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;sentence-transformers/all-MiniLM-L6-v2&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    )
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    client &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; chromadb&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;PersistentClient(path&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;./chroma_db&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    col &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; client&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;get_or_create_collection(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;docs&amp;#34;&lt;/span&gt;, embedding_function&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;ef)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    chunks &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; load_docs([&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;./corpus/a.md&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;./corpus/b.md&amp;#34;&lt;/span&gt;])
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    ids &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; [&lt;span style=&#34;color:#e6db74&#34;&gt;f&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;chunk-&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;{&lt;/span&gt;i&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&lt;/span&gt; &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; i &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; range(len(chunks))]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    col&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;add(ids&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;ids, documents&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;chunks)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;indexed&amp;#34;&lt;/span&gt;, len(chunks), &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;chunks&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; __name__ &lt;span style=&#34;color:#f92672&#34;&gt;==&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;__main__&amp;#34;&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    main()
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;What “reproducible” means here:&lt;/strong&gt; anyone with the same corpus bytes, splitter parameters, embedder revision, and index path gets the same retrieval ids for the same queries. Change any of those without a version bump and you should expect eval drift.&lt;/p&gt;
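&lt;p&gt;The &lt;strong&gt;Retrieve&lt;/strong&gt; and &lt;strong&gt;Generate&lt;/strong&gt; steps can be sketched the same way. A minimal prompt assembler for the template described above (the instruction wording and helper names are assumptions, not a fixed API):&lt;/p&gt;

```python
# rag_prompt.py -- illustrative companion to index_build.py.
# The system instruction wording and variable names are assumptions.

def build_prompt(passages: list[str], question: str) -> str:
    """System instruction + ----delimited passages + user question."""
    context = "\n---\n".join(passages)
    return (
        "Answer using only the passages below. "
        "If the answer is not present, say so.\n---\n"
        f"{context}\n---\nQuestion: {question}"
    )

# Retrieval side (sketch), using the collection from index_build.py:
#   res = col.query(query_texts=[question], n_results=k)
#   passages = res["documents"][0]   # log res["ids"][0] beside the answer
```

&lt;p&gt;Logging the returned chunk ids next to each generated answer is what makes the eval drift described above detectable at all.&lt;/p&gt;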
&lt;hr&gt;
&lt;h2 id=&#34;when-to-add-hybrid-search-or-metadata-first&#34;&gt;When to add hybrid search or metadata first&lt;/h2&gt;
&lt;p&gt;If users query &lt;strong&gt;IDs, error codes, or proper nouns&lt;/strong&gt;, add &lt;strong&gt;BM25 or hybrid&lt;/strong&gt; before chasing a larger embedder. If you have &lt;strong&gt;tenant_id&lt;/strong&gt; or &lt;strong&gt;product_line&lt;/strong&gt;, add &lt;strong&gt;metadata filters&lt;/strong&gt; on the vector store before raising &lt;code&gt;k&lt;/code&gt;. Both are cheaper than swapping a 7B generator to “fix” bad retrieval.&lt;/p&gt;
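&lt;p&gt;A cheap hybrid baseline is &lt;strong&gt;reciprocal rank fusion&lt;/strong&gt; over the BM25 and vector rankings. A minimal sketch, assuming you already have two ranked lists of chunk ids:&lt;/p&gt;

```python
# Reciprocal rank fusion (RRF): merge rankings without comparing raw scores.
# Sketch only; k=60 is the commonly cited default, not a tuned value.

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each id scores sum(1 / (k + rank)) across the input rankings."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

&lt;p&gt;Because RRF works on ranks rather than scores, it sidesteps the problem that BM25 scores and cosine distances live on incomparable scales.&lt;/p&gt;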
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;RAG is &lt;strong&gt;systems engineering&lt;/strong&gt;: ingestion quality, chunk boundaries, and retrieval metrics matter as much as the chat model’s charisma. The distilled LLMs fine-tuned later in this series only shine when the passages they read are the right passages.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://blog.veydh.com/2026/2026-01-rag-operations-re-ranking-and-evaluation/&#34;&gt;RAG in Production: Re-ranking, HyDE, and Simple Evals&lt;/a&gt; picks up &lt;strong&gt;re-ranking, query reformulation, and lightweight evaluation&lt;/strong&gt; so naive top-k does not become the ceiling.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>RAG in Production: Re-ranking, HyDE, and Simple Evals</title>
      <link>https://blog.veydh.com/2026/2026-01-rag-operations-re-ranking-and-evaluation/</link>
      <pubDate>Sun, 11 Jan 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-01-rag-operations-re-ranking-and-evaluation/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;Naive &lt;strong&gt;top-k&lt;/strong&gt; vector search is a decent demo and a fragile production strategy. Users paraphrase; documents use different jargon; multi-hop questions need facts that never co-occur in a single chunk. When the backend is a &lt;strong&gt;smaller distilled&lt;/strong&gt; model (Mistral-class, Qwen, LLaMA derivatives), the generator is less able to “paper over” bad retrieval with parametric knowledge—so the retrieval layer has to pull its weight.&lt;/p&gt;
&lt;p&gt;This post collects upgrades worth reaching for &lt;strong&gt;before&lt;/strong&gt; touching fancier graph RAG or agent loops. They pair well with the foundations in &lt;a href=&#34;https://blog.veydh.com/2026/2026-01-rag-foundations-embeddings-and-retrieval/&#34;&gt;RAG Foundations: Embeddings, Chunking, and the Retrieval Loop&lt;/a&gt; and with the LoRA/serving posts that follow (&lt;a href=&#34;https://blog.veydh.com/2026/2026-01-lora-and-small-distilled-models-primer/&#34;&gt;LoRA: Fine-Tuning Small Distilled Models Without Full Weights&lt;/a&gt;, &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-llm-serving-apis-openai-compatible/&#34;&gt;Serving Distilled Models Behind an HTTP API&lt;/a&gt;).&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;re-ranking-a-second-opinion-on-query-passage-pairs&#34;&gt;Re-ranking: a second opinion on (query, passage) pairs&lt;/h2&gt;
&lt;p&gt;Bi-encoders (your embedding model) score candidates cheaply because query and document vectors are precomputed. &lt;strong&gt;Cross-encoders&lt;/strong&gt; and learned &lt;strong&gt;re-rankers&lt;/strong&gt; jointly attend to query and text; they are heavier per pair but far sharper at deciding whether a chunk truly answers the question.&lt;/p&gt;
&lt;p&gt;In production the usual pattern is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Retrieve a &lt;strong&gt;wider&lt;/strong&gt; candidate set (e.g. 50–200) with fast vectors.&lt;/li&gt;
&lt;li&gt;Re-rank down to &lt;strong&gt;3–10&lt;/strong&gt; with a cross-encoder or a small reranker model.&lt;/li&gt;
&lt;li&gt;Pass only that short list to the LLM.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The latency trade-off is real. Watch &lt;strong&gt;p95&lt;/strong&gt; end-to-end latency; in practice, reranking is often limited to “high value” intents or gated behind a cheap ambiguity classifier.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;query-expansion-and-hyde&#34;&gt;Query expansion and HyDE&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Query expansion&lt;/strong&gt; adds terms or sub-questions so lexical and embedding gaps shrink. Simple forms include synonyms from a domain glossary or LLM-generated bullet points that rephrase the user ask.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;HyDE&lt;/strong&gt; (Hypothetical Document Embeddings) flips the idea: ask the LLM to write a &lt;em&gt;fake&lt;/em&gt; answer as if it knew the corpus, embed that synthetic paragraph, and search with it. It can bridge vocabulary mismatch; it can also amplify hallucinated specifics if you are not careful. HyDE fits better as an optional path behind a feature flag, not the default for regulated domains.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;metadata-filters-and-tenancy&#34;&gt;Metadata filters and tenancy&lt;/h2&gt;
&lt;p&gt;Vectors without filters become a soup. Retrieval should be scoped by &lt;strong&gt;tenant&lt;/strong&gt;, &lt;strong&gt;product&lt;/strong&gt;, &lt;strong&gt;language&lt;/strong&gt;, &lt;strong&gt;document type&lt;/strong&gt;, or &lt;strong&gt;time range&lt;/strong&gt; whenever those axes exist in the source system. That mirrors the “bounded context” thinking from earlier architecture writing: fewer collisions, easier debugging, and less chance of leaking one customer’s chunk into another’s session.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;evaluation-that-catches-regressions&#34;&gt;Evaluation that catches regressions&lt;/h2&gt;
&lt;p&gt;Three layers of signal beat one pretty demo:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Retrieval quality&lt;/strong&gt;&lt;br&gt;
Labeled (query, relevant_chunk_id) pairs—or weaker proxies like “any gold paragraph in top-k”—give you recall@k and MRR. When embedders or chunking change, this layer screams first.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Answer groundedness&lt;/strong&gt;&lt;br&gt;
Does the final answer stick to retrieved text? LLM-as-judge helps &lt;strong&gt;if&lt;/strong&gt; the judge is validated against human ratings and prompts stay versioned. A single automated score should not pass as compliance evidence alone.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Latency and cost&lt;/strong&gt;&lt;br&gt;
Track histograms for embed, search, rerank, and generate. A “smarter” stack that misses SLOs is a failed deploy.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Automate a regression job when the corpus or models change. Same discipline as pinning dependencies: &lt;strong&gt;record model revisions, index build IDs, and dataset hashes&lt;/strong&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;step-by-step-widen--re-rank--generate&#34;&gt;Step-by-step: widen → re-rank → generate&lt;/h2&gt;
&lt;p&gt;This is the default production upgrade path once naive &lt;code&gt;top_k&lt;/code&gt; plateaus.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Vector retrieve&lt;/strong&gt; — Pull &lt;code&gt;k_wide&lt;/code&gt; (50–200) ids + texts with scores from your index (same embedder revision as in &lt;a href=&#34;https://blog.veydh.com/2026/2026-01-rag-foundations-embeddings-and-retrieval/&#34;&gt;RAG Foundations: Embeddings, Chunking, and the Retrieval Loop&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Re-rank&lt;/strong&gt; — Score each &lt;code&gt;(query, passage)&lt;/code&gt; pair with a cross-encoder or hosted rerank API; sort descending; keep &lt;code&gt;k_tight&lt;/code&gt; (3–10).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompt&lt;/strong&gt; — Inject only &lt;code&gt;k_tight&lt;/code&gt; passages into the chat template; log their ids on the response object for debugging.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Measure&lt;/strong&gt; — Offline: recall@k on labeled data &lt;em&gt;before&lt;/em&gt; and &lt;em&gt;after&lt;/em&gt; rerank (rerank can hurt: a cross-encoder that mis-scores length-biased chunks may demote a gold passage the bi-encoder had already surfaced in the wide set).&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h2 id=&#34;tooling-comparison-for-re-ranking&#34;&gt;Tooling comparison for re-ranking&lt;/h2&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Approach&lt;/th&gt;
          &lt;th&gt;Examples&lt;/th&gt;
          &lt;th&gt;Pros&lt;/th&gt;
          &lt;th&gt;Cons&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Cross-encoder (self-hosted)&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;sentence-transformers&lt;/code&gt; &lt;code&gt;CrossEncoder&lt;/code&gt;, small Hub models&lt;/td&gt;
          &lt;td&gt;Strong accuracy; no per-query API bill; full control&lt;/td&gt;
          &lt;td&gt;GPU latency; batching complexity at high QPS&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Hosted rerank&lt;/td&gt;
          &lt;td&gt;Cohere rerank, Voyage, Jina rerank APIs&lt;/td&gt;
          &lt;td&gt;Fast integration; scales with vendor&lt;/td&gt;
          &lt;td&gt;Cost + egress; must pin API version behavior&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;LLM pointwise scoring&lt;/td&gt;
          &lt;td&gt;“Rate 0–5 how well this passage answers…”&lt;/td&gt;
          &lt;td&gt;Flexible rubric&lt;/td&gt;
          &lt;td&gt;Slow, expensive, judge drift; validate against humans&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;ColBERT-style late interaction&lt;/td&gt;
          &lt;td&gt;RAGatouille, ColBERT indexes&lt;/td&gt;
          &lt;td&gt;Excellent retrieval quality for some corpora&lt;/td&gt;
          &lt;td&gt;Heavier index + ops than flat vectors&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Cross-encoder or hosted rerank&lt;/strong&gt; is the usual workhorse; LLM-as-judge for reranking is a last resort after simpler baselines fail.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Illustrative cross-encoder pass&lt;/strong&gt; (after you already have candidate strings &lt;code&gt;passages&lt;/code&gt;):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; sentence_transformers &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; CrossEncoder
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;model &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; CrossEncoder(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;cross-encoder/ms-marco-MiniLM-L-6-v2&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;pairs &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; [[query, p] &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; p &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; passages]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;scores &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; model&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;predict(pairs)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;ranked &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; [p &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; _, p &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; sorted(zip(scores, passages), reverse&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;True&lt;/span&gt;)]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;tight &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; ranked[:&lt;span style=&#34;color:#ae81ff&#34;&gt;8&lt;/span&gt;]
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Pin the &lt;code&gt;CrossEncoder&lt;/code&gt; weights the same way you pin Hub LLM revisions.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;hyde-and-query-expansion-controlled-experiments&#34;&gt;HyDE and query expansion: controlled experiments&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Query expansion&lt;/strong&gt; — Maintain a domain glossary JSON; at query time append synonyms, or ask a small LLM for 3 paraphrases and embed the centroid / run multi-query fusion (retrieve per variant, merge with max score). Log which variant hit gold.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;HyDE&lt;/strong&gt; — (1) Prompt LLM: “Write a short factual paragraph that would answer: …” (2) Embed that paragraph. (3) Search. &lt;strong&gt;A/B&lt;/strong&gt; behind a flag: measure precision@k on a holdout set; watch for hallucinated proper nouns in the synthetic paragraph polluting retrieval.&lt;/p&gt;
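&lt;p&gt;The “retrieve per variant, merge with max score” step from the expansion paragraph can be sketched directly (the input shape is an assumption about your retriever’s output):&lt;/p&gt;

```python
# Multi-query fusion (sketch): retrieve per paraphrase, keep each chunk's best
# score across variants. `retrieved` maps variant -> list of (chunk_id, score);
# that shape is an assumption, not a specific library's return type.

def max_score_fusion(retrieved: dict[str, list[tuple[str, float]]]) -> list[tuple[str, float]]:
    best: dict[str, float] = {}
    origin: dict[str, str] = {}  # which variant hit each chunk -- log this
    for variant, hits in retrieved.items():
        for chunk_id, score in hits:
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
                origin[chunk_id] = variant
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```

&lt;p&gt;Keeping the &lt;code&gt;origin&lt;/code&gt; map is what lets you log which paraphrase actually hit gold.&lt;/p&gt;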
&lt;hr&gt;
&lt;h2 id=&#34;lightweight-eval-harness-copy-paste-structure&#34;&gt;Lightweight eval harness (copy-paste structure)&lt;/h2&gt;
&lt;p&gt;Keep a &lt;code&gt;questions.jsonl&lt;/code&gt; with &lt;code&gt;question&lt;/code&gt;, &lt;code&gt;gold_chunk_ids[]&lt;/code&gt;, optional &lt;code&gt;metadata_filter&lt;/code&gt;. For each line:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Run retrieval (and optional rerank).&lt;/li&gt;
&lt;li&gt;Compute &lt;strong&gt;recall@k&lt;/strong&gt;: did any gold id appear in the top k returned ids?&lt;/li&gt;
&lt;li&gt;Emit a CSV: &lt;code&gt;question_id, gold_hit@20, gold_rank, latency_ms&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Gate merges on &lt;strong&gt;recall@k&lt;/strong&gt; not moving backward when you change chunking, embedder, or reranker. The “answer groundedness” layer in this post stacks on top once retrieval is no longer random.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Treat RAG as a &lt;strong&gt;product surface&lt;/strong&gt;. Log which chunk IDs supported each answer; wire request IDs through search and generation; and assume distilled backends will not hide bad retrieval.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://blog.veydh.com/2026/2026-01-lora-and-small-distilled-models-primer/&#34;&gt;LoRA: Fine-Tuning Small Distilled Models Without Full Weights&lt;/a&gt; picks up &lt;strong&gt;LoRA&lt;/strong&gt; on those same backends so domain tone and tools improve without full fine-tunes.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>LoRA: Fine-Tuning Small Distilled Models Without Full Weights</title>
      <link>https://blog.veydh.com/2026/2026-01-lora-and-small-distilled-models-primer/</link>
      <pubDate>Sat, 10 Jan 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-01-lora-and-small-distilled-models-primer/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Low-Rank Adaptation (LoRA)&lt;/strong&gt; is the compromise when a model should sound like &lt;em&gt;our&lt;/em&gt; support desk, respect &lt;em&gt;our&lt;/em&gt; JSON tool schema, or stop doing &lt;em&gt;that one thing&lt;/em&gt;—without paying the full price of updating every weight in a multi-billion-parameter checkpoint.&lt;/p&gt;
&lt;p&gt;The idea is elementary linear algebra dressed as a training trick: instead of learning a full update matrix for a layer, you learn a &lt;strong&gt;low-rank factorization&lt;/strong&gt; added to frozen weights. VRAM drops, iteration speed rises, and you can often ship &lt;strong&gt;adapter-only&lt;/strong&gt; artifacts that are easy to version beside your data.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;why-pair-distillation-with-lora&#34;&gt;Why pair distillation with LoRA&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Distilled&lt;/strong&gt; students (from Mistral, Qwen, LLaMA families on the Hub) trade some ceiling capability for &lt;strong&gt;throughput&lt;/strong&gt; and &lt;strong&gt;cost&lt;/strong&gt;. They are attractive bases for assistants because you can serve them on fewer GPUs or smaller batches.&lt;/p&gt;
&lt;p&gt;LoRA sits on top of that trade: you keep the frozen student as a stable anchor, then nudge behavior with a thin adapter trained on curated SFT data. You are not trying to re-teach the entire internet—just &lt;strong&gt;steer&lt;/strong&gt; tone, format, and domain vocabulary.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;what-actually-gets-trained&#34;&gt;What actually gets trained&lt;/h2&gt;
&lt;p&gt;Implementations differ, but the mental model is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Frozen&lt;/strong&gt;: most of the transformer—embedding tables and the bulk of attention/MLP weights stay put.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trainable&lt;/strong&gt;: small matrices injected into chosen projections (&lt;code&gt;q_proj&lt;/code&gt;, &lt;code&gt;k_proj&lt;/code&gt;, &lt;code&gt;v_proj&lt;/code&gt;, &lt;code&gt;o_proj&lt;/code&gt;, sometimes MLP gates). The library merges or applies these at forward time.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Rank &lt;code&gt;r&lt;/code&gt;&lt;/strong&gt; controls adapter capacity. Tiny &lt;code&gt;r&lt;/code&gt; underfits quirky datasets; large &lt;code&gt;r&lt;/code&gt; inches toward full fine-tune behavior and cost.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Alpha scaling&lt;/strong&gt; (often paired with &lt;code&gt;r&lt;/code&gt;) scales the effective strength of the adapter. Both are tuned together rather than leaving defaults unquestioned.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dropout&lt;/strong&gt; on adapter paths helps when the SFT set is small; it fights memorization of noisy examples.&lt;/p&gt;
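&lt;p&gt;The size trade is easy to make concrete. For one adapted projection, LoRA trains an &lt;code&gt;r×d_in&lt;/code&gt; down-matrix and a &lt;code&gt;d_out×r&lt;/code&gt; up-matrix; the arithmetic below assumes a 4096-dim model with 32 layers purely for illustration:&lt;/p&gt;

```python
# Back-of-envelope LoRA parameter count. The 4096 hidden size, 32 layers, and
# square q/k/v/o projections are illustrative assumptions, not a specific model.

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A (r x d_in) plus B (d_out x r) per adapted projection."""
    return r * d_in + d_out * r

hidden = 4096
per_proj = lora_params(hidden, hidden, r=16)   # 131072 trainable params
per_layer = 4 * per_proj                       # q,k,v,o all targeted
total = 32 * per_layer                         # 16777216 across 32 layers --
                                               # about one frozen 4096x4096 matrix
print(per_proj, total)
```

&lt;p&gt;That comparison is the whole pitch: the adapter across every layer weighs roughly as much as a single frozen projection matrix.&lt;/p&gt;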
&lt;hr&gt;
&lt;h2 id=&#34;data-hygiene-beats-clever-hyperparameters&#34;&gt;Data hygiene beats clever hyperparameters&lt;/h2&gt;
&lt;p&gt;Datasets are versioned like code: &lt;strong&gt;git&lt;/strong&gt;, manifests, and hashes. Deduplication matters—paraphrased duplicates inflate metrics and encourage verbatim regurgitation. &lt;strong&gt;System&lt;/strong&gt; vs &lt;strong&gt;user&lt;/strong&gt; turns should be labeled or separated consistently so chat templates do not scramble during training.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;step-by-step-lora-with-transformers--peft-mental-checklist&#34;&gt;Step-by-step: LoRA with Transformers + PEFT (mental checklist)&lt;/h2&gt;
&lt;p&gt;This is the “one canonical recipe” kept stable; &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-mistral-huggingface-lora-training-notes/&#34;&gt;Fine-Tuning a Mistral-Class Checkpoint with LoRA on Hugging Face&lt;/a&gt; and &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-qwen-and-llama-distills-lora-on-hf/&#34;&gt;Qwen and LLaMA Distilled Checkpoints: LoRA Gotchas&lt;/a&gt; only swap tokenizer/template details.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Pin base&lt;/strong&gt; — &lt;code&gt;model_id&lt;/code&gt; + &lt;code&gt;revision&lt;/code&gt; (git commit on the Hub) for weights &lt;em&gt;and&lt;/em&gt; tokenizer JSON.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Choose targets&lt;/strong&gt; — Default: attention projections (&lt;code&gt;q_proj&lt;/code&gt;, &lt;code&gt;k_proj&lt;/code&gt;, &lt;code&gt;v_proj&lt;/code&gt;, &lt;code&gt;o_proj&lt;/code&gt;). Add MLP only if evals plateau and VRAM allows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Set PEFT&lt;/strong&gt; — &lt;code&gt;LoraConfig(r=8 or 16, lora_alpha=16 or 32, lora_dropout=0.05, target_modules=[...], bias=&amp;quot;none&amp;quot;, task_type=&amp;quot;CAUSAL_LM&amp;quot;)&lt;/code&gt;. Treat &lt;code&gt;r&lt;/code&gt; and &lt;code&gt;alpha&lt;/code&gt; as &lt;strong&gt;coupled&lt;/strong&gt;: effective scale is roughly &lt;code&gt;alpha/r&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Load in bf16&lt;/strong&gt; — &lt;code&gt;model = AutoModelForCausalLM.from_pretrained(..., torch_dtype=torch.bfloat16)&lt;/code&gt; when the GPU supports it; enable &lt;strong&gt;gradient checkpointing&lt;/strong&gt; if sequence length × batch threatens OOM.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tokenize with chat template&lt;/strong&gt; — Build &lt;code&gt;messages&lt;/code&gt; list → &lt;code&gt;tokenizer.apply_chat_template(..., tokenize=True, return_tensors=&amp;quot;pt&amp;quot;, add_generation_prompt=False)&lt;/code&gt; for SFT; mask &lt;strong&gt;non-assistant&lt;/strong&gt; token positions with &lt;code&gt;-100&lt;/code&gt; in &lt;code&gt;labels&lt;/code&gt; so loss applies only where you want imitation (&lt;a href=&#34;https://blog.veydh.com/2026/2026-02-mistral-huggingface-lora-training-notes/&#34;&gt;Fine-Tuning a Mistral-Class Checkpoint with LoRA on Hugging Face&lt;/a&gt; expands Mistral-specific layout).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Train&lt;/strong&gt; — &lt;code&gt;Trainer&lt;/code&gt; or &lt;code&gt;SFTTrainer&lt;/code&gt; with small &lt;strong&gt;learning rate&lt;/strong&gt; (e.g. &lt;code&gt;1e-4&lt;/code&gt; to &lt;code&gt;2e-4&lt;/code&gt; for adapters, lower if unstable), &lt;strong&gt;cosine&lt;/strong&gt; or linear decay, &lt;strong&gt;weight decay&lt;/strong&gt; 0–0.1, &lt;strong&gt;max_steps&lt;/strong&gt; or &lt;strong&gt;epochs&lt;/strong&gt; chosen from a tiny dry run that completes without NaN.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Save&lt;/strong&gt; — &lt;code&gt;trainer.model.save_pretrained(&amp;quot;./lora-out&amp;quot;)&lt;/code&gt; saves &lt;strong&gt;adapter-only&lt;/strong&gt; shards; record dataset hash, seed, and CLI args beside the folder.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Load for inference&lt;/strong&gt; — &lt;code&gt;PeftModel.from_pretrained(base, &amp;quot;./lora-out&amp;quot;)&lt;/code&gt; or merge for deployment paths that require a single weight file.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Minimal wiring&lt;/strong&gt; (abbreviated; align with your model family’s &lt;code&gt;target_modules&lt;/code&gt; names):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; torch
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; transformers &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; peft &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; LoraConfig, get_peft_model, TaskType
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;MODEL_ID &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;mistralai/Mistral-7B-Instruct-v0.3&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;REV &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&amp;lt;pin-a-commit-sha&amp;gt;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;tokenizer &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; AutoTokenizer&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;from_pretrained(MODEL_ID, revision&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;REV)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;model &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; AutoModelForCausalLM&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;from_pretrained(
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    MODEL_ID,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    revision&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;REV,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    torch_dtype&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;torch&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;bfloat16,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    device_map&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;auto&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;lora &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; LoraConfig(
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    r&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;16&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    lora_alpha&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;32&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    lora_dropout&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;0.05&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    bias&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;none&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    task_type&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;TaskType&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;CAUSAL_LM,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    target_modules&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;[&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;q_proj&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;k_proj&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;v_proj&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;o_proj&amp;#34;&lt;/span&gt;],
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;model &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; get_peft_model(model, lora)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# ... build tokenized dataset with masked labels, then:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;args &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; TrainingArguments(
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    output_dir&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;./lora-run&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    per_device_train_batch_size&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    gradient_accumulation_steps&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;8&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    learning_rate&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;2e-4&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    num_train_epochs&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    logging_steps&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;10&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    save_steps&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;200&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    bf16&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;True&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    gradient_checkpointing&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;True&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;trainer &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; Trainer(model&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;model, args&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;args, train_dataset&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;tokenized_train)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;trainer&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;train()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;trainer&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;model&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;save_pretrained(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;./lora-out&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Pros of this path:&lt;/strong&gt; battle-tested libraries, easy diff between runs, adapter artifacts stay small. &lt;strong&gt;Cons:&lt;/strong&gt; you still own data hygiene, template correctness, and eval gates—PEFT does not fix mis-tokenized chats.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Pick &lt;strong&gt;one canonical training recipe&lt;/strong&gt;—Transformers + PEFT, or a higher-level runner—and change only data and hyperparameters between experiments. &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-mistral-huggingface-lora-training-notes/&#34;&gt;Fine-Tuning a Mistral-Class Checkpoint with LoRA on Hugging Face&lt;/a&gt; and &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-qwen-and-llama-distills-lora-on-hf/&#34;&gt;Qwen and LLaMA Distilled Checkpoints: LoRA Gotchas&lt;/a&gt; walk through &lt;strong&gt;family-specific&lt;/strong&gt; quirks (Mistral templates, Qwen tokenization, LLaMA RoPE) so the same LoRA discipline does not stumble on tokenizer details.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Algorithms Retrospective: DFS, BFS, Dijkstra, and Backtracking</title>
      <link>https://blog.veydh.com/2025/2025-11-algorithms-retrospective-dfs-bfs-dijkstra/</link>
      <pubDate>Wed, 12 Nov 2025 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2025/2025-11-algorithms-retrospective-dfs-bfs-dijkstra/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;The 2020–2022 graph posts on this blog—&lt;strong&gt;DFS&lt;/strong&gt;, &lt;strong&gt;BFS&lt;/strong&gt;, &lt;strong&gt;Dijkstra&lt;/strong&gt;, grid islands, word-ladder style searches, &lt;strong&gt;N-Queens&lt;/strong&gt; backtracking—were exercises in &lt;strong&gt;state exploration&lt;/strong&gt; with clear &lt;strong&gt;invariants&lt;/strong&gt;. November pulls them into one narrative: what each technique buys you, when it misleads you, and how that thinking echoes &lt;strong&gt;systems&lt;/strong&gt; work like scheduling and search over configuration spaces.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;dfs-depth-first-exploration&#34;&gt;DFS: depth-first exploration&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;DFS&lt;/strong&gt; walks until it hits a wall, then backtracks. It is natural for &lt;strong&gt;exhaustive&lt;/strong&gt; search, connectivity checks, and puzzles with deep decision chains. It does not automatically find the shortest path unless the problem structure guarantees one—knowing &lt;strong&gt;why&lt;/strong&gt; saves interview time and production bugs.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;bfs-layers-and-shortest-path-in-unweighted-graphs&#34;&gt;BFS: layers and shortest path in unweighted graphs&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;BFS&lt;/strong&gt; expands &lt;strong&gt;layer by layer&lt;/strong&gt;, yielding shortest hop counts when edges have &lt;strong&gt;uniform&lt;/strong&gt; cost. Grid problems and social graphs love BFS when “minimum steps” is the metric.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;dijkstra-non-negative-weighted-shortest-paths&#34;&gt;Dijkstra: non-negative weighted shortest paths&lt;/h2&gt;
&lt;p&gt;When weights appear, &lt;strong&gt;Dijkstra&lt;/strong&gt; (with a priority queue) is the conservative default for &lt;strong&gt;non-negative&lt;/strong&gt; edges. I still narrate &lt;strong&gt;relaxation&lt;/strong&gt; aloud while coding—if I cannot explain the invariant, I should not trust the heap updates.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;backtracking-constraint-propagation-with-undo&#34;&gt;Backtracking: constraint propagation with undo&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;N-Queens&lt;/strong&gt; and similar puzzles are &lt;strong&gt;backtracking&lt;/strong&gt; with &lt;strong&gt;pruning&lt;/strong&gt;. The pattern is choose, recurse, &lt;strong&gt;undo&lt;/strong&gt;—the same shape as exploratory configuration changes with rollback in operational tooling (metaphorically, not literally identical code).&lt;/p&gt;
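&lt;p&gt;A minimal &lt;strong&gt;N-Queens&lt;/strong&gt; counter makes the choose/recurse/undo shape concrete (a sketch, not the original series’ code; the diagonal-set pruning is one common formulation):&lt;/p&gt;

```python
def solve_n_queens(n):
    """Count N-Queens placements via choose, recurse, undo."""
    cols, diag1, diag2 = set(), set(), set()
    count = 0

    def place(row):
        nonlocal count
        if row == n:
            count += 1
            return
        for col in range(n):
            # prune: skip columns and diagonals already under attack
            if col in cols or (row - col) in diag1 or (row + col) in diag2:
                continue
            cols.add(col); diag1.add(row - col); diag2.add(row + col)  # choose
            place(row + 1)                                             # recurse
            cols.discard(col); diag1.discard(row - col); diag2.discard(row + col)  # undo

    place(0)
    return count
```

&lt;p&gt;The three marked lines are the whole pattern; everything else is pruning.&lt;/p&gt;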
&lt;hr&gt;
&lt;h2 id=&#34;study-loop-that-stuck-for-me&#34;&gt;Study loop that stuck for me&lt;/h2&gt;
&lt;p&gt;Implement each pattern &lt;strong&gt;twice&lt;/strong&gt;—I used &lt;strong&gt;Python&lt;/strong&gt; and &lt;strong&gt;Java&lt;/strong&gt; in the original series—timed against &lt;strong&gt;randomized&lt;/strong&gt; inputs. Explaining &lt;strong&gt;invariants&lt;/strong&gt; out loud catches fence-post errors faster than staring at IDE highlights.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;mini-implementations-side-by-side-same-graph-api&#34;&gt;Mini implementations side by side (same graph API)&lt;/h2&gt;
&lt;p&gt;Assume &lt;code&gt;adj: Dict[int, List[Tuple[int, int]]]&lt;/code&gt; maps &lt;strong&gt;node → [(neighbor, weight), …]&lt;/strong&gt;; for BFS/DFS, ignore weights or set them to &lt;code&gt;1&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;BFS (shortest hops, unweighted):&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; collections &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; deque
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;bfs&lt;/span&gt;(adj, start, target):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    q &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; deque([start])
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    prev &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; {start: &lt;span style=&#34;color:#66d9ef&#34;&gt;None&lt;/span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;while&lt;/span&gt; q:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        u &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; q&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;popleft()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; u &lt;span style=&#34;color:#f92672&#34;&gt;==&lt;/span&gt; target:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#66d9ef&#34;&gt;break&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; v, _w &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; adj&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;get(u, []):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; v &lt;span style=&#34;color:#f92672&#34;&gt;not&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; prev:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                prev[v] &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; u
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                q&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;append(v)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;return&lt;/span&gt; prev  &lt;span style=&#34;color:#75715e&#34;&gt;# reconstruct path from prev&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Dijkstra (non-negative weights):&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; heapq
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;dijkstra&lt;/span&gt;(adj, start):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    dist &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; {start: &lt;span style=&#34;color:#ae81ff&#34;&gt;0&lt;/span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    pq &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; [(&lt;span style=&#34;color:#ae81ff&#34;&gt;0&lt;/span&gt;, start)]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;while&lt;/span&gt; pq:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        d, u &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; heapq&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;heappop(pq)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; d &lt;span style=&#34;color:#f92672&#34;&gt;!=&lt;/span&gt; dist&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;get(u, float(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;inf&amp;#34;&lt;/span&gt;)):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#66d9ef&#34;&gt;continue&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; v, w &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; adj&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;get(u, []):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            nd &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; d &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; w
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; nd &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;&lt;/span&gt; dist&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;get(v, float(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;inf&amp;#34;&lt;/span&gt;)):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                dist[v] &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; nd
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                heapq&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;heappush(pq, (nd, v))
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;return&lt;/span&gt; dist
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;DFS (connectivity / exhaustive search):&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;dfs&lt;/span&gt;(adj, start, seen&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;None&lt;/span&gt;):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; seen &lt;span style=&#34;color:#f92672&#34;&gt;is&lt;/span&gt; &lt;span style=&#34;color:#66d9ef&#34;&gt;None&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        seen &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; set()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    seen&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;add(start)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; v, _w &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; adj&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;get(start, []):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; v &lt;span style=&#34;color:#f92672&#34;&gt;not&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; seen:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            dfs(adj, v, seen)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;return&lt;/span&gt; seen
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;When to use which (decision table):&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Goal&lt;/th&gt;
          &lt;th&gt;Algorithm&lt;/th&gt;
          &lt;th&gt;Invariant you should be able to say aloud&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Min hops, unweighted&lt;/td&gt;
          &lt;td&gt;BFS&lt;/td&gt;
          &lt;td&gt;Nodes dequeued in nondecreasing distance from start&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Min cost, edges ≥ 0&lt;/td&gt;
          &lt;td&gt;Dijkstra&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;dist[u]&lt;/code&gt; is final when &lt;code&gt;u&lt;/code&gt; is popped&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Explore all / detect cycle&lt;/td&gt;
          &lt;td&gt;DFS&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;seen&lt;/code&gt; partitions visited vs not&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Port the same three functions to &lt;strong&gt;Java&lt;/strong&gt; (&lt;code&gt;ArrayDeque&lt;/code&gt;, &lt;code&gt;PriorityQueue&lt;/code&gt;) to mirror the 2020–2022 posts—identical inputs should yield identical shortest-path distances.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Algorithm fluency and &lt;strong&gt;systems&lt;/strong&gt; fluency both reward &lt;strong&gt;careful state management&lt;/strong&gt;. &lt;a href=&#34;https://blog.veydh.com/2025/2025-09-debugging-concurrent-systems-books-and-practices/&#34;&gt;Debugging Concurrent Systems: Books and Practices&lt;/a&gt; ties &lt;strong&gt;debugging&lt;/strong&gt; and &lt;strong&gt;concurrency&lt;/strong&gt; reading to how you hold graph and systems work in your head.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Architecture for Evolvable Services After Polyglot Expansion</title>
      <link>https://blog.veydh.com/2025/2025-10-architecture-for-evolvable-services/</link>
      <pubDate>Tue, 14 Oct 2025 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2025/2025-10-architecture-for-evolvable-services/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;Adding &lt;strong&gt;Gleam&lt;/strong&gt; or &lt;strong&gt;Rust&lt;/strong&gt; workers does not rewrite your &lt;strong&gt;domain rules&lt;/strong&gt;—it changes &lt;strong&gt;deployment topology&lt;/strong&gt;, &lt;strong&gt;build graphs&lt;/strong&gt;, and &lt;strong&gt;operational surface area&lt;/strong&gt;. If core logic leaks across process boundaries, every new language multiplies rewrite cost. This post is the checklist after &lt;a href=&#34;https://blog.veydh.com/2024/2024-11-polyglot-interop-http-grpc-python-java/&#34;&gt;Polyglot Interop: HTTP and gRPC Between Python and Java&lt;/a&gt;: keep &lt;strong&gt;contracts&lt;/strong&gt; honest and &lt;strong&gt;observability&lt;/strong&gt; uniform.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;apis-as-the-real-boundary&#34;&gt;APIs as the real boundary&lt;/h2&gt;
&lt;p&gt;HTTP or gRPC schemas are the &lt;strong&gt;stability layer&lt;/strong&gt;: internals can be refactored freely behind them while protobuf/OpenAPI versions &lt;strong&gt;negotiate&lt;/strong&gt; compatibility explicitly. &lt;strong&gt;Contract tests&lt;/strong&gt; that fail CI when consumers and producers silently diverge are worth the investment.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;feature-flags-and-cutover&#34;&gt;Feature flags and cutover&lt;/h2&gt;
&lt;p&gt;Rolling out a new runtime behind a &lt;strong&gt;flag&lt;/strong&gt; de-risks partial failures. Metrics should compare &lt;strong&gt;error budgets&lt;/strong&gt; and &lt;strong&gt;latency&lt;/strong&gt; between old and new paths on identical traffic slices—not on cherry-picked demos.&lt;/p&gt;
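&lt;p&gt;“Identical traffic slices” is easy to get wrong; a &lt;strong&gt;deterministic&lt;/strong&gt; hash bucket keeps the canary population stable across restarts (a sketch; &lt;code&gt;in_canary&lt;/code&gt; and the request-id scheme are illustrative assumptions):&lt;/p&gt;

```python
import hashlib

def in_canary(request_id: str, percent: int) -> bool:
    """Route a fixed, deterministic slice of traffic to the new runtime.

    The same request_id always lands in the same bucket, so old-vs-new
    latency and error-budget comparisons see identical populations.
    """
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = digest[0] * 256 + digest[1]   # stable bucket in 0..65535
    return percent > bucket * 100 // 65536
```

&lt;p&gt;Ramping is then a config change to &lt;code&gt;percent&lt;/code&gt;, and every request already in the canary stays in it.&lt;/p&gt;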
&lt;hr&gt;
&lt;h2 id=&#34;observability-parity&#34;&gt;Observability parity&lt;/h2&gt;
&lt;p&gt;The same &lt;strong&gt;trace ID&lt;/strong&gt; must traverse Python, Java, Gleam, and Rust services. Span names and attributes should be boringly consistent. If one stack logs strings and another logs structured JSON only, on-call pays the tax forever.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;clean-architecture-echoes&#34;&gt;Clean architecture echoes&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Clean Architecture&lt;/em&gt; and &lt;strong&gt;PoEAA&lt;/strong&gt; posts from earlier years still apply: frameworks at the &lt;strong&gt;edge&lt;/strong&gt;, domain in the &lt;strong&gt;center&lt;/strong&gt;. Languages become &lt;strong&gt;plugins&lt;/strong&gt; to boundaries you already named.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;contract-tests-make-polyglot-changes-reproducible&#34;&gt;Contract tests: make polyglot changes reproducible&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Consumer-driven contract&lt;/strong&gt; sketch (concept applies to Pact, Spring Cloud Contract, or hand-rolled):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Commit an &lt;strong&gt;OpenAPI&lt;/strong&gt; or &lt;strong&gt;protobuf&lt;/strong&gt; schema as the source of truth.&lt;/li&gt;
&lt;li&gt;CI runs &lt;strong&gt;schemathesis&lt;/strong&gt; / Dredd / buf breaking-change checks on each PR.&lt;/li&gt;
&lt;li&gt;For each language worker (Python, Java, Gleam, Rust), a &lt;strong&gt;thin adapter&lt;/strong&gt; maps DTOs ↔ domain types; tests assert the adapter round-trips &lt;strong&gt;golden JSON&lt;/strong&gt; fixtures.&lt;/li&gt;
&lt;/ol&gt;
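&lt;p&gt;Step 3’s round-trip check fits in a few lines (a sketch; &lt;code&gt;Order&lt;/code&gt;, the field names, and the fixture are hypothetical):&lt;/p&gt;

```python
import json
from dataclasses import dataclass

# Hypothetical domain type and DTO adapter; names are illustrative only.
@dataclass(frozen=True)
class Order:
    order_id: str
    cents: int

def dto_to_domain(dto: dict) -> Order:
    return Order(order_id=dto["id"], cents=dto["amount_cents"])

def domain_to_dto(order: Order) -> dict:
    return {"id": order.order_id, "amount_cents": order.cents}

# Golden fixture: committed once, compared byte-for-byte after a round trip.
GOLDEN = json.dumps({"id": "ord-42", "amount_cents": 1999}, sort_keys=True)

def test_round_trip():
    dto = json.loads(GOLDEN)
    assert json.dumps(domain_to_dto(dto_to_domain(dto)), sort_keys=True) == GOLDEN
```

&lt;p&gt;Commit the golden fixture once; any adapter change that alters the serialized form fails CI before a consumer sees it.&lt;/p&gt;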
&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; regressions surface before deploy; &lt;strong&gt;Cons:&lt;/strong&gt; contract suites need curation or they become flaky—pin API &lt;strong&gt;version&lt;/strong&gt; headers and time fields in fixtures.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;observability-one-trace-through-four-runtimes&#34;&gt;Observability: one trace through four runtimes&lt;/h2&gt;
&lt;p&gt;Propagate &lt;code&gt;traceparent&lt;/code&gt; (W3C) or your vendor’s equivalent. Minimum viable attributes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;service.name&lt;/code&gt;, &lt;code&gt;service.version&lt;/code&gt; (git SHA)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;http.route&lt;/code&gt; or gRPC method&lt;/li&gt;
&lt;li&gt;&lt;code&gt;messaging.destination&lt;/code&gt; for queues&lt;/li&gt;
&lt;/ul&gt;
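&lt;p&gt;Propagation is mechanical once the header format is pinned down; a stdlib-only sketch of &lt;strong&gt;W3C&lt;/strong&gt; &lt;code&gt;traceparent&lt;/code&gt; handling (real services should lean on an OpenTelemetry SDK rather than hand-rolling this):&lt;/p&gt;

```python
import re
import secrets

# version 00: "00-{trace_id:32 hex}-{span_id:16 hex}-{flags:2 hex}"
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def new_traceparent(sampled=True):
    """Start a fresh trace with a random trace id and span id."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"

def continue_trace(header):
    """Keep the trace id from an inbound header; mint a new span id."""
    m = TRACEPARENT_RE.match(header)
    if m is None:
        return new_traceparent()   # broken header: start a fresh trace
    return f"00-{m.group(1)}-{secrets.token_hex(8)}-{m.group(3)}"
```

&lt;p&gt;Every runtime applying the same two rules—keep the trace id, replace the span id—is what makes the one-trace-per-request promise hold.&lt;/p&gt;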
&lt;p&gt;&lt;strong&gt;Reproducible debug session:&lt;/strong&gt; given &lt;code&gt;request_id&lt;/code&gt;, you should fetch &lt;strong&gt;one&lt;/strong&gt; trace spanning all hops without grepping raw logs first.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Good architecture &lt;strong&gt;minimizes&lt;/strong&gt; how many places must change when languages do. &lt;a href=&#34;https://blog.veydh.com/2025/2025-11-algorithms-retrospective-dfs-bfs-dijkstra/&#34;&gt;Algorithms Retrospective: DFS, BFS, Dijkstra, and Backtracking&lt;/a&gt; loops back to &lt;strong&gt;algorithms&lt;/strong&gt;—graph search patterns that underpin both interviews and scheduling intuition.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Debugging Concurrent Systems: Books and Practices</title>
      <link>https://blog.veydh.com/2025/2025-09-debugging-concurrent-systems-books-and-practices/</link>
      <pubDate>Thu, 11 Sep 2025 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2025/2025-09-debugging-concurrent-systems-books-and-practices/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;Concurrent bugs &lt;strong&gt;laugh at printf&lt;/strong&gt;. Ordering issues hide until load, timing, and deployment topology align just wrong. September is the month to stop pretending heroics scale: &lt;strong&gt;systematic methods&lt;/strong&gt; and &lt;strong&gt;canonical books&lt;/strong&gt; belong in the same toolkit as traces and benchmarks.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;agans-debugging&#34;&gt;Agans, &lt;em&gt;Debugging&lt;/em&gt;&lt;/h2&gt;
&lt;p&gt;Agans’ habits—&lt;strong&gt;reproduce&lt;/strong&gt;, &lt;strong&gt;simplify&lt;/strong&gt;, &lt;strong&gt;read backwards&lt;/strong&gt;, &lt;strong&gt;divide and conquer&lt;/strong&gt;—apply whether the symptom is a segfault or a stuck BEAM process. For concurrency, “reproduce” often means &lt;strong&gt;stress&lt;/strong&gt; plus &lt;strong&gt;fault injection&lt;/strong&gt;, not a single lucky curl.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;herlihy--shavit-the-art-of-multiprocessor-programming&#34;&gt;Herlihy &amp;amp; Shavit, &lt;em&gt;The Art of Multiprocessor Programming&lt;/em&gt;&lt;/h2&gt;
&lt;p&gt;When &lt;strong&gt;locks&lt;/strong&gt;, &lt;strong&gt;lock-free&lt;/strong&gt; structures, or &lt;strong&gt;memory ordering&lt;/strong&gt; are in play, &lt;strong&gt;invariants&lt;/strong&gt; should be stated in plain language: what must be true before and after each step? The book is dense; even selective chapters upgrade intuition about &lt;strong&gt;why&lt;/strong&gt; a “tiny” race takes weeks to find.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;practices-that-paid-off&#34;&gt;Practices that paid off&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Record-replay&lt;/strong&gt; and &lt;strong&gt;time-travel&lt;/strong&gt; debuggers where available—expensive to set up, cheap at incident o’clock.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stress tests&lt;/strong&gt; with injected crashes (the 2025 workload theme).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sequence numbers&lt;/strong&gt; in logs so “event A before B” survives async reordering.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distributed traces&lt;/strong&gt; (&lt;a href=&#34;https://blog.veydh.com/2024/2024-07-observability-tracing-opentelemetry/&#34;&gt;OpenTelemetry Traces Across Python and Java&lt;/a&gt;) tying HTTP/gRPC spans to internal task lifetimes.&lt;/li&gt;
&lt;/ul&gt;
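&lt;p&gt;The sequence-number habit is cheap to implement (a sketch; consumers sort by &lt;code&gt;seq&lt;/code&gt; before reasoning about order):&lt;/p&gt;

```python
import itertools
import json
import threading

class SeqLogger:
    """Stamp every record with a monotonic sequence number so that
    "A happened before B" survives async transport reordering."""

    def __init__(self):
        self._seq = itertools.count()
        self._lock = threading.Lock()
        self.records = []

    def log(self, event, **fields):
        # increment and append under one lock so seq order matches list order
        with self._lock:
            self.records.append(
                json.dumps({"seq": next(self._seq), "event": event, **fields})
            )

logger = SeqLogger()
logger.log("lock_acquired", task="worker-1")
logger.log("lock_released", task="worker-1")
```

&lt;p&gt;Even if the transport shuffles these lines, the &lt;code&gt;seq&lt;/code&gt; field reconstructs the original order.&lt;/p&gt;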
&lt;hr&gt;
&lt;h2 id=&#34;cross-training-runtimes&#34;&gt;Cross-training runtimes&lt;/h2&gt;
&lt;p&gt;BEAM &lt;strong&gt;observer&lt;/strong&gt; and &lt;strong&gt;tracing&lt;/strong&gt; plus Rust &lt;strong&gt;backtraces&lt;/strong&gt; plus &lt;strong&gt;OpenTelemetry&lt;/strong&gt; spans shorten unknown-unknowns. No single tool wins; &lt;strong&gt;correlation IDs&lt;/strong&gt; are the glue.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;reproducible-incident-checklist-printable&#34;&gt;Reproducible incident checklist (printable)&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Freeze the world&lt;/strong&gt; — Git SHA, container digest, OTP/Rust toolchain versions, feature flags, traffic slice.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;One graph&lt;/strong&gt; — Pick &lt;em&gt;one&lt;/em&gt; metric that hurts (p99, error rate, mailbox depth); screenshot with &lt;strong&gt;absolute time&lt;/strong&gt; axis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Minimal replay&lt;/strong&gt; — Script that replays &lt;strong&gt;one&lt;/strong&gt; failing request against staging with the same seeds as &lt;a href=&#34;https://blog.veydh.com/2025/2025-01-concurrent-workload-specification/&#34;&gt;A Language-Agnostic Concurrent Workload for 2025 Comparisons&lt;/a&gt; when possible.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Divide&lt;/strong&gt; — Binary-search components: disable rerank, disable RAG, shrink concurrency—same tactic as Agans’ divide-and-conquer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write the invariant&lt;/strong&gt; — One sentence: “We assumed X; under load Y became false.” File it beside the fix.&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h2 id=&#34;tooling-matrix-what-to-reach-for-first&#34;&gt;Tooling matrix (what to reach for first)&lt;/h2&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Symptom stack&lt;/th&gt;
          &lt;th&gt;First tool&lt;/th&gt;
          &lt;th&gt;Second tool&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;BEAM tail latency&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;observer&lt;/code&gt;, &lt;code&gt;process_info&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;recon&lt;/code&gt; / &lt;code&gt;etop&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Tokio stalls&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;tokio-console&lt;/code&gt;, tracing spans&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;perf&lt;/code&gt; / flamegraph&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Cross-service&lt;/td&gt;
          &lt;td&gt;OpenTelemetry trace by &lt;code&gt;request_id&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Log join on span id&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Tools and languages rotate; &lt;strong&gt;methodical debugging&lt;/strong&gt; does not. &lt;a href=&#34;https://blog.veydh.com/2025/2025-10-architecture-for-evolvable-services/&#34;&gt;Architecture for Evolvable Services After Polyglot Expansion&lt;/a&gt; returns to &lt;strong&gt;architecture&lt;/strong&gt; when Gleam/Rust workers join Python/Java systems—boundaries matter more than logos.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Send, Sync, and Fearless Concurrency in Rust</title>
      <link>https://blog.veydh.com/2025/2025-08-rust-send-sync-and-fearless-concurrency/</link>
      <pubDate>Wed, 13 Aug 2025 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2025/2025-08-rust-send-sync-and-fearless-concurrency/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;Rust’s concurrency story is not only &lt;strong&gt;ownership&lt;/strong&gt;—it is also &lt;strong&gt;&lt;code&gt;Send&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;Sync&lt;/code&gt;&lt;/strong&gt;, marker traits the compiler uses to prove whether values may cross &lt;strong&gt;thread&lt;/strong&gt; or &lt;strong&gt;task&lt;/strong&gt; boundaries safely. When &lt;a href=&#34;https://blog.veydh.com/2025/2025-04-rust-tokio-same-concurrent-workload/&#34;&gt;Rust and Tokio: The Same Concurrent Workload in Type-Safe Threads&lt;/a&gt;’s benchmark introduced &lt;strong&gt;shared aggregators&lt;/strong&gt;, these traits stopped being trivia and became &lt;strong&gt;API design&lt;/strong&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;send-moving-ownership-across-threads&#34;&gt;&lt;code&gt;Send&lt;/code&gt;: moving ownership across threads&lt;/h2&gt;
&lt;p&gt;A type is &lt;strong&gt;&lt;code&gt;Send&lt;/code&gt;&lt;/strong&gt; when ownership can transfer to another thread without breaking aliasing rules. Most owned data is &lt;code&gt;Send&lt;/code&gt;; &lt;strong&gt;&lt;code&gt;Rc&lt;/code&gt;&lt;/strong&gt; is the classic counterexample: its reference count is not atomic, so sending it across threads would race. If a spawned task closure captures something non-&lt;code&gt;Send&lt;/code&gt;, the compiler refuses—&lt;strong&gt;good&lt;/strong&gt;—because you were about to ship a data race.&lt;/p&gt;
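&lt;p&gt;A compile-time sketch of that boundary, using plain &lt;code&gt;std::thread&lt;/code&gt; rather than the benchmark’s Tokio runtime: &lt;code&gt;Arc&lt;/code&gt; crosses the spawn, &lt;code&gt;Rc&lt;/code&gt; is rejected at the spawn site.&lt;/p&gt;

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

// Move an Arc clone into a worker thread and get its length back.
// Compiles because Arc<String> is Send; swap Arc for Rc and the same
// spawn is rejected at compile time.
fn len_on_worker(shared: Arc<String>) -> usize {
    let handle = thread::spawn(move || shared.len());
    handle.join().unwrap()
}

fn main() {
    assert_eq!(len_on_worker(Arc::new(String::from("report"))), 6);

    let local = Rc::new(String::from("report"));
    let _len = local.len(); // fine on one thread
    // thread::spawn(move || local.len());
    // ^ error[E0277]: `Rc<String>` cannot be sent between threads safely
}
```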
&lt;hr&gt;
&lt;h2 id=&#34;sync-shared-references-across-threads&#34;&gt;&lt;code&gt;Sync&lt;/code&gt;: shared references across threads&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;Sync&lt;/code&gt;&lt;/strong&gt; means &lt;code&gt;&amp;amp;T&lt;/code&gt; is &lt;code&gt;Send&lt;/code&gt;: many threads may hold &lt;strong&gt;shared&lt;/strong&gt; references concurrently, because any interior mutability the type offers is itself thread-safe. &lt;strong&gt;&lt;code&gt;Mutex&amp;lt;T&amp;gt;&lt;/code&gt;&lt;/strong&gt; is &lt;code&gt;Sync&lt;/code&gt; when &lt;code&gt;T&lt;/code&gt; is &lt;code&gt;Send&lt;/code&gt;—the mutex serializes mutation so references stay disciplined.&lt;/p&gt;
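&lt;p&gt;A small sketch of that discipline with &lt;code&gt;std::thread&lt;/code&gt; workers (a toy counter, not the benchmark’s aggregator): sharing &lt;code&gt;&amp;amp;Mutex&amp;lt;u64&amp;gt;&lt;/code&gt; across threads is legal precisely because the lock serializes every mutation.&lt;/p&gt;

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Mutex<u64> is Sync (u64 is Send), so references to it may be shared
// across threads; the lock serializes the actual increments.
fn parallel_count(threads: usize, per_thread: u64) -> u64 {
    let counter = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}

fn main() {
    // Every increment lands exactly once: 4 * 1000.
    assert_eq!(parallel_count(4, 1000), 4000);
}
```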
&lt;hr&gt;
&lt;h2 id=&#34;aggregators-in-the-benchmark&#34;&gt;Aggregators in the benchmark&lt;/h2&gt;
&lt;p&gt;Global counters tempt &lt;strong&gt;&lt;code&gt;Arc&amp;lt;Mutex&amp;lt;Stats&amp;gt;&amp;gt;&lt;/code&gt;&lt;/strong&gt;. That works until &lt;strong&gt;contention&lt;/strong&gt; dominates; then &lt;strong&gt;message passing&lt;/strong&gt; to a single owner task often simplifies invariants and reduces lock thrashing. The “right” choice is &lt;strong&gt;measured&lt;/strong&gt;, not ideological.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;compiler-errors-as-design-feedback&#34;&gt;Compiler errors as design feedback&lt;/h2&gt;
&lt;p&gt;When Rust rejects a spawn site, the error can be read as &lt;strong&gt;a design review&lt;/strong&gt; from the typechecker. Languages without these checks still have the races—they just discover them &lt;strong&gt;after&lt;/strong&gt; deploy, with fewer breadcrumbs.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;code-arcmutext-versus-a-single-owner-stats-task&#34;&gt;Code: &lt;code&gt;Arc&amp;lt;Mutex&amp;lt;T&amp;gt;&amp;gt;&lt;/code&gt; versus a single-owner stats task&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Shared mutex (simple, contends under load):&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-rust&#34; data-lang=&#34;rust&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;use&lt;/span&gt; std::sync::{Arc, Mutex};
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;// use tokio::sync::mpsc; // in full binary
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;#[derive(Default)]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;struct&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;Stats&lt;/span&gt; { done: &lt;span style=&#34;color:#66d9ef&#34;&gt;u64&lt;/span&gt;, errors: &lt;span style=&#34;color:#66d9ef&#34;&gt;u64&lt;/span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;// async fn worker(..., stats: Arc&amp;lt;Mutex&amp;lt;Stats&amp;gt;&amp;gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;//     let mut g = stats.lock().unwrap();
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;//     g.done += 1;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;// }
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Message passing (often less thrash; mirrors a dedicated collector process on the BEAM):&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-rust&#34; data-lang=&#34;rust&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;pub&lt;/span&gt; &lt;span style=&#34;color:#66d9ef&#34;&gt;struct&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;StatsDelta&lt;/span&gt; { &lt;span style=&#34;color:#66d9ef&#34;&gt;pub&lt;/span&gt; done: &lt;span style=&#34;color:#66d9ef&#34;&gt;u64&lt;/span&gt;, &lt;span style=&#34;color:#66d9ef&#34;&gt;pub&lt;/span&gt; errors: &lt;span style=&#34;color:#66d9ef&#34;&gt;u64&lt;/span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;// async fn stats_task(mut srx: mpsc::Receiver&amp;lt;StatsDelta&amp;gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;//     let mut acc = StatsDelta { done: 0, errors: 0 };
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;//     while let Some(d) = srx.recv().await {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;//         acc.done += d.done;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;//         acc.errors += d.errors;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;//     }
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;// }
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Uncomment and wire &lt;code&gt;mpsc&lt;/code&gt; channels when you port the JSONL metrics from &lt;a href=&#34;https://blog.veydh.com/2025/2025-01-concurrent-workload-specification/&#34;&gt;A Language-Agnostic Concurrent Workload for 2025 Comparisons&lt;/a&gt;: one &lt;strong&gt;owner&lt;/strong&gt; task serializes updates without a hot mutex. If profiling shows &lt;strong&gt;lock wait&lt;/strong&gt; is negligible, &lt;code&gt;Arc&amp;lt;Mutex&amp;lt;Stats&amp;gt;&amp;gt;&lt;/code&gt; stays the smaller program.&lt;/p&gt;
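&lt;p&gt;As a runnable stand-in, the same single-owner shape works with &lt;code&gt;std::sync::mpsc&lt;/code&gt; and OS threads (the &lt;code&gt;tokio::sync::mpsc&lt;/code&gt; sketch above is the async equivalent): many senders, one owner folding deltas, no shared mutex.&lt;/p&gt;

```rust
use std::sync::mpsc;
use std::thread;

pub struct StatsDelta { pub done: u64, pub errors: u64 }

// Single-owner collector over std::sync::mpsc: workers send deltas,
// one receiver owns the accumulator, so no lock is ever contended.
fn collect(workers: usize, per_worker: u64) -> (u64, u64) {
    let (tx, rx) = mpsc::channel::<StatsDelta>();
    for _ in 0..workers {
        let tx = tx.clone();
        thread::spawn(move || {
            for _ in 0..per_worker {
                tx.send(StatsDelta { done: 1, errors: 0 }).unwrap();
            }
        });
    }
    drop(tx); // receive loop ends once every worker's sender hangs up
    let mut acc = StatsDelta { done: 0, errors: 0 };
    for d in rx {
        acc.done += d.done;
        acc.errors += d.errors;
    }
    (acc.done, acc.errors)
}

fn main() {
    assert_eq!(collect(4, 100), (400, 0));
}
```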
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Prefer &lt;strong&gt;message passing&lt;/strong&gt; when it clarifies ownership; use &lt;strong&gt;locks&lt;/strong&gt; when metrics structures are genuinely shared and hot—then profile. &lt;a href=&#34;https://blog.veydh.com/2025/2025-09-debugging-concurrent-systems-books-and-practices/&#34;&gt;Debugging Concurrent Systems: Books and Practices&lt;/a&gt; folds in &lt;strong&gt;debugging practices&lt;/strong&gt; and classic texts so incidents close faster than printf loops.&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>
