<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Design Patterns on Veydh&#39;s Blog</title>
    <link>https://blog.veydh.com/tags/design-patterns/</link>
    <description>Recent content in Design Patterns on Veydh&#39;s Blog</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <lastBuildDate>Wed, 10 Apr 2024 00:00:00 +0000</lastBuildDate>
    
    <atom:link href="https://blog.veydh.com/tags/design-patterns/" rel="self" type="application/rss+xml" />
    <item>
      <title>Qwen3-TTS and Multilingual Neural TTS Trade-offs</title>
      <link>https://blog.veydh.com/2026/2026-04-qwen3-tts-and-multilingual-neural-tts/</link>
      <pubDate>Sat, 04 Apr 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-04-qwen3-tts-and-multilingual-neural-tts/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;Large &lt;strong&gt;multilingual TTS&lt;/strong&gt; stacks—including &lt;strong&gt;Qwen3-TTS&lt;/strong&gt;-class models—aim for natural &lt;strong&gt;prosody&lt;/strong&gt;, broad &lt;strong&gt;language coverage&lt;/strong&gt;, and assistant-grade clarity. They occupy the opposite corner of the design space from &lt;a href=&#34;https://blog.veydh.com/2026/2026-03-kokoro-tts-lightweight-synthesis/&#34;&gt;Kokoro TTS: Lightweight Neural Speech and Where It Fits&lt;/a&gt;’s lightweight engines: more expressive, heavier to run, and proportionally more complex to operate.&lt;/p&gt;
&lt;p&gt;This post is not a benchmark manifesto; it is a frame for deciding &lt;strong&gt;when&lt;/strong&gt; to pay that cost.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;when-heavy-tts-earns-its-gpus&#34;&gt;When heavy TTS earns its GPUs&lt;/h2&gt;
&lt;p&gt;Moving up-stack makes sense when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The product is &lt;strong&gt;voice-first&lt;/strong&gt;—users judge quality in seconds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multilingual&lt;/strong&gt; parity matters; lightweight bundles do not cover locales we sell into.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Controllability&lt;/strong&gt; (rate, style, explicit speaker control when supported) is part of the UX contract.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In those cases, cost per minute is still tracked, but it is weighed against churn and brand perception—not against batch IVR pricing alone.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;operational-reality&#34;&gt;Operational reality&lt;/h2&gt;
&lt;p&gt;Bigger models mean &lt;strong&gt;more VRAM&lt;/strong&gt;, longer cold starts unless kept warm, and trickier autoscaling. TTS capacity is co-designed with &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-llm-serving-apis-openai-compatible/&#34;&gt;Serving Distilled Models Behind an HTTP API&lt;/a&gt;’s LLM serving: both contend for GPU pools unless fleets are split.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Licensing&lt;/strong&gt; and &lt;strong&gt;voice rights&lt;/strong&gt; do not disappear because audio is synthetic. Terms of use for checkpoints, commercial redistribution, and cloned speaker data still bind shipping products. Legal review belongs in the same loop as for third-party LLM APIs.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;comparison-back-to-kokoro-class&#34;&gt;Comparison back to Kokoro-class&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://blog.veydh.com/2026/2026-03-kokoro-tts-lightweight-synthesis/&#34;&gt;Kokoro TTS: Lightweight Neural Speech and Where It Fits&lt;/a&gt;’s lightweight path trades &lt;strong&gt;range and nuance&lt;/strong&gt; for &lt;strong&gt;efficiency&lt;/strong&gt;. Heavy TTS inverts that trade. Neither is “correct” globally—&lt;strong&gt;requirements&lt;/strong&gt; pick the winner.&lt;/p&gt;
&lt;p&gt;When marketing later demands a &lt;strong&gt;specific&lt;/strong&gt; timbre not native to the TTS roster, &lt;strong&gt;RVC&lt;/strong&gt; (&lt;a href=&#34;https://blog.veydh.com/2026/2026-04-rvc-voice-conversion-architecture/&#34;&gt;RVC in the Stack: Voice Conversion After TTS&lt;/a&gt;) becomes the bridge: synthesize with a strong base, then convert timbre—at the cost of latency and governance overhead.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;decision-worksheet-fill-before-you-buy-gpus&#34;&gt;Decision worksheet (fill before you buy GPUs)&lt;/h2&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Question&lt;/th&gt;
          &lt;th&gt;If “yes” →&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Do we sell into locales beyond the lightweight roster?&lt;/td&gt;
          &lt;td&gt;Bias toward &lt;strong&gt;heavy multilingual&lt;/strong&gt;; prove WER/MOS per locale.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Is voice a &lt;strong&gt;primary&lt;/strong&gt; differentiator (podcast, companion app)?&lt;/td&gt;
          &lt;td&gt;Bias toward &lt;strong&gt;heavy&lt;/strong&gt;; budget warm pools.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Is this &lt;strong&gt;IVR / notifications&lt;/strong&gt; at huge volume?&lt;/td&gt;
          &lt;td&gt;Bias toward &lt;strong&gt;light&lt;/strong&gt; (&lt;a href=&#34;https://blog.veydh.com/2026/2026-03-kokoro-tts-lightweight-synthesis/&#34;&gt;Kokoro TTS: Lightweight Neural Speech and Where It Fits&lt;/a&gt;); measure RTF.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Must timbre match a &lt;strong&gt;specific&lt;/strong&gt; human?&lt;/td&gt;
          &lt;td&gt;Plan &lt;strong&gt;RVC&lt;/strong&gt; (&lt;a href=&#34;https://blog.veydh.com/2026/2026-04-rvc-voice-conversion-architecture/&#34;&gt;RVC in the Stack: Voice Conversion After TTS&lt;/a&gt;) + legal; rarely solved by TTS speaker id alone.&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2 id=&#34;operational-comparison-vs-lightweight-kokoro-class-tts&#34;&gt;Operational comparison vs lightweight Kokoro-class TTS&lt;/h2&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Dimension&lt;/th&gt;
          &lt;th&gt;Heavy multilingual (Qwen3-TTS-class)&lt;/th&gt;
          &lt;th&gt;Lightweight TTS (&lt;a href=&#34;https://blog.veydh.com/2026/2026-03-kokoro-tts-lightweight-synthesis/&#34;&gt;Kokoro TTS: Lightweight Neural Speech and Where It Fits&lt;/a&gt;)&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;GPU RAM&lt;/td&gt;
          &lt;td&gt;High; often multi-GPU at scale&lt;/td&gt;
          &lt;td&gt;Low; CPU viable&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Cold start&lt;/td&gt;
          &lt;td&gt;Keep-warm or pay latency&lt;/td&gt;
          &lt;td&gt;Usually friendly to autoscale&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Voice control&lt;/td&gt;
          &lt;td&gt;Richer style/locale knobs&lt;/td&gt;
          &lt;td&gt;Smaller roster&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Upgrade risk&lt;/td&gt;
          &lt;td&gt;Large checkpoint swaps&lt;/td&gt;
          &lt;td&gt;Smaller binary swaps&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Reproducibility:&lt;/strong&gt; pin &lt;strong&gt;checkpoint&lt;/strong&gt;, &lt;strong&gt;speaker/style tokens&lt;/strong&gt;, &lt;strong&gt;sample rate&lt;/strong&gt;, and &lt;strong&gt;inference framework commit&lt;/strong&gt; the same way &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-mistral-huggingface-lora-training-notes/&#34;&gt;Fine-Tuning a Mistral-Class Checkpoint with LoRA on Hugging Face&lt;/a&gt; pins LLM Hub revisions.&lt;/p&gt;
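&lt;p&gt;A minimal sketch of that pinning discipline, assuming hypothetical field names (nothing here is a real Qwen3-TTS API): record the four identifiers as one manifest and store it beside the golden waveforms.&lt;/p&gt;

```python
# Hypothetical manifest shape; field names are illustrative, not a real API.
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class TtsManifest:
    checkpoint: str        # content hash or Hub revision of the TTS checkpoint
    speaker_token: str     # pinned speaker/style identifier
    sample_rate_hz: int    # output sample rate baked into the goldens
    framework_commit: str  # inference framework git commit

manifest = TtsManifest(
    checkpoint="sha256:0e1f2a3b",     # placeholder value
    speaker_token="style_formal_01",
    sample_rate_hz=24000,
    framework_commit="deadbeef",
)

# Persist beside golden waveforms so any regression diff names its exact stack.
manifest_json = json.dumps(asdict(manifest), sort_keys=True)
```

Any change to any of the four fields should invalidate the golden comparison, the same way a Hub revision bump invalidates a pinned LoRA run.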
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Choose &lt;strong&gt;heavy TTS&lt;/strong&gt; when quality and multilingual breadth dominate unit economics; choose &lt;strong&gt;light TTS&lt;/strong&gt; when cost per minute and footprint dominate. For an optional &lt;strong&gt;RVC&lt;/strong&gt; second stage aimed at timbre control, see &lt;a href=&#34;https://blog.veydh.com/2026/2026-04-rvc-voice-conversion-architecture/&#34;&gt;RVC in the Stack: Voice Conversion After TTS&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>RVC in the Stack: Voice Conversion After TTS</title>
      <link>https://blog.veydh.com/2026/2026-04-rvc-voice-conversion-architecture/</link>
      <pubDate>Sat, 04 Apr 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-04-rvc-voice-conversion-architecture/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Retrieval-based Voice Conversion (RVC)&lt;/strong&gt; maps &lt;strong&gt;audio&lt;/strong&gt; from one timbre toward another using models trained on &lt;strong&gt;target-speaker&lt;/strong&gt; data. It is not a text front-end: it consumes waveforms. That is why the common assistant pattern chains &lt;strong&gt;TTS → RVC&lt;/strong&gt; when you want machine-generated speech to resemble a &lt;strong&gt;specific&lt;/strong&gt; voice embedding learned from clean clips.&lt;/p&gt;
&lt;p&gt;This is separate from “pick speaker ID 3 in the TTS API.” RVC is about &lt;strong&gt;conversion&lt;/strong&gt;, not selecting a bundled persona—though the two blur in product marketing.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;reference-architecture&#34;&gt;Reference architecture&lt;/h2&gt;
&lt;p&gt;The TTS-first pipeline:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Text → TTS engine (neutral or roster speaker) → waveform → RVC → output audio&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Each hop adds &lt;strong&gt;latency&lt;/strong&gt; and a &lt;strong&gt;failure domain&lt;/strong&gt;. TTS prosody mistakes propagate; RVC adjusts timbre but cannot always rewrite cadence or emotion wholesale.&lt;/p&gt;
&lt;p&gt;An alternate path is &lt;strong&gt;human speech → RVC&lt;/strong&gt; for dubbing, anonymization, or creative workflows. Consent and contract terms differ sharply from synthetic pipelines; governance checklists should not be reused blindly across those use cases.&lt;/p&gt;
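&lt;p&gt;The TTS-first chain can be sketched as two stages timed independently; &lt;code&gt;synth_tts&lt;/code&gt; and &lt;code&gt;convert_rvc&lt;/code&gt; are placeholders for the real engines, not actual APIs:&lt;/p&gt;

```python
# Sketch of the TTS-then-RVC chain. Each hop is its own failure domain, so
# each hop gets its own timer. The stage bodies are stand-ins.
import time

def synth_tts(text):
    # Placeholder: a real engine returns PCM samples at a known sample rate.
    return [0.0] * (len(text) * 160)

def convert_rvc(waveform, target_voice):
    # Placeholder: real RVC maps timbre toward target_voice; here it passes through.
    return list(waveform)

def speak(text, target_voice):
    t0 = time.perf_counter()
    wave = synth_tts(text)
    t1 = time.perf_counter()
    out = convert_rvc(wave, target_voice)
    t2 = time.perf_counter()
    # Separate timings make it obvious which stage blew the turn budget.
    timings = {"t_tts_s": t1 - t0, "t_rvc_s": t2 - t1}
    return out, timings
```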
&lt;hr&gt;
&lt;h2 id=&#34;why-add-rvc-at-all&#34;&gt;Why add RVC at all&lt;/h2&gt;
&lt;p&gt;When branding demands a &lt;strong&gt;recognizable&lt;/strong&gt; voice and the base TTS cannot reproduce it natively, RVC offers a &lt;strong&gt;reskin&lt;/strong&gt; without retraining the entire synthesis stack. You can also swap TTS backends while retaining a stable timbre layer—if ops accepts the extra GPU work.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;costs-and-risks&#34;&gt;Costs and risks&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Cascaded errors&lt;/strong&gt; are the engineering tax: two models mean two places to tune, monitor, and version.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Extra latency&lt;/strong&gt; per second of audio matters for interactive assistants; batch offline jobs tolerate it more easily.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data ethics&lt;/strong&gt; are non-negotiable. Train only on speakers who &lt;strong&gt;consented&lt;/strong&gt;; treat unauthorized cloning as an abuse incident, not a support ticket. &lt;strong&gt;Deepfakes&lt;/strong&gt; carry legal exposure; product, legal, and security should sign the same threat model.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;operations&#34;&gt;Operations&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;RVC model revision&lt;/strong&gt;, &lt;strong&gt;voice profile ID&lt;/strong&gt;, and linkage back to &lt;strong&gt;TTS engine version&lt;/strong&gt; should be logged—the same audit trail mindset as logging embedder IDs in RAG. Without that, debugging “she sounded different last Tuesday” is guesswork.&lt;/p&gt;
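&lt;p&gt;A sketch of that audit record (field names are illustrative, not a real schema): one JSON line per emitted utterance, linking audio back to the exact model pair that produced it.&lt;/p&gt;

```python
# Illustrative audit record; the field names are assumptions, not a real schema.
import json
import uuid
import datetime

def synthesis_audit_record(request_id, tts_engine_version, rvc_revision, voice_profile_id):
    # One record per emitted utterance, shipped to the same log pipeline as LLM requests.
    return {
        "request_id": request_id,
        "tts_engine_version": tts_engine_version,
        "rvc_revision": rvc_revision,
        "voice_profile_id": voice_profile_id,
        "logged_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

record = synthesis_audit_record(str(uuid.uuid4()), "tts-2026.03", "rvc-r42", "brand_a")
line = json.dumps(record, sort_keys=True)
```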
&lt;hr&gt;
&lt;h2 id=&#34;step-by-step-train-and-convert-engineering-outline&#34;&gt;Step-by-step: train-and-convert (engineering outline)&lt;/h2&gt;
&lt;p&gt;Training recipes change with upstream repos; the &lt;strong&gt;invariants&lt;/strong&gt; to enforce are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Consent artifact&lt;/strong&gt; — Written permission + scope (commercial, regions, duration) stored beside the dataset hash.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clean reference clips&lt;/strong&gt; — Denoise, strip silence, single speaker; 10–30 minutes often beats hours of noisy data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Feature extract&lt;/strong&gt; — Use the RVC toolchain’s pinned &lt;strong&gt;content encoder&lt;/strong&gt; (often HuBERT-based); record its revision.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Train&lt;/strong&gt; — Fixed seed, fixed batch, validation loss curve saved; no “best guess” checkpoint without eval listening.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Export&lt;/strong&gt; — Versioned &lt;code&gt;.pth&lt;/code&gt; / ONNX / whatever the inference server loads; &lt;strong&gt;never&lt;/strong&gt; overwrite in place.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inference test&lt;/strong&gt; — &lt;code&gt;TTS(reference_speaker) → RVC(target)&lt;/code&gt; on held-out text; compare to golden wavs.&lt;/li&gt;
&lt;/ol&gt;
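&lt;p&gt;Invariant 5 (never overwrite in place) can be enforced mechanically with content-addressed export paths; a minimal sketch, with &lt;code&gt;export_checkpoint&lt;/code&gt; as a hypothetical helper:&lt;/p&gt;

```python
# Content-addressed export: the filename is derived from the bytes, so an
# export can never silently replace a different checkpoint.
import hashlib
import tempfile
from pathlib import Path

def export_checkpoint(blob, out_dir):
    digest = hashlib.sha256(blob).hexdigest()[:12]
    path = Path(out_dir) / f"rvc_{digest}.pth"
    if path.exists():
        # Same bytes produce the same name; an existing file is simply reused.
        return path
    path.write_bytes(blob)
    return path

out = Path(tempfile.mkdtemp())
p1 = export_checkpoint(b"fake-weights", out)
p2 = export_checkpoint(b"fake-weights", out)   # idempotent re-export
```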
&lt;p&gt;&lt;strong&gt;Latency math:&lt;/strong&gt; &lt;code&gt;T_tts + T_rvc + buffering&lt;/code&gt; must fit your assistant turn budget; profile with &lt;strong&gt;batch size = 1&lt;/strong&gt; first.&lt;/p&gt;
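&lt;p&gt;The budget arithmetic is trivial to encode; the point is to compute headroom explicitly rather than eyeball it (the numbers below are made up):&lt;/p&gt;

```python
# Turn-budget check for the cascaded pipeline. Profile t_tts_s and t_rvc_s
# at batch size 1 before trusting any batched numbers.
def turn_latency(t_tts_s, t_rvc_s, buffering_s, budget_s):
    """Return total cascade latency and remaining headroom against the budget."""
    total = t_tts_s + t_rvc_s + buffering_s
    headroom = budget_s - total   # negative headroom means the turn budget is blown
    return total, headroom
```

A negative headroom at batch size 1 means the cascade cannot fit an interactive turn even before concurrent load enters the picture.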
&lt;hr&gt;
&lt;h2 id=&#34;when-not-to-use-rvc&#34;&gt;When &lt;strong&gt;not&lt;/strong&gt; to use RVC&lt;/h2&gt;
&lt;p&gt;Skip the second stage when &lt;strong&gt;native multispeaker TTS&lt;/strong&gt; already matches brand needs, or when &lt;strong&gt;extra GPU + governance&lt;/strong&gt; cannot be justified. RVC shines when &lt;strong&gt;timbre&lt;/strong&gt; is non-negotiable and &lt;strong&gt;base TTS&lt;/strong&gt; is fungible.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;RVC is a &lt;strong&gt;power tool&lt;/strong&gt; with &lt;strong&gt;policy weight&lt;/strong&gt;: reach for it when timbre is non-negotiable, and keep consent artifacts and model versions in the same audit trail as the rest of the stack.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Whisper in the Loop: ASR, Not TTS</title>
      <link>https://blog.veydh.com/2026/2026-04-whisper-in-the-speech-stack-asr/</link>
      <pubDate>Sat, 04 Apr 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-04-whisper-in-the-speech-stack-asr/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Whisper&lt;/strong&gt; is &lt;strong&gt;automatic speech recognition&lt;/strong&gt;: audio in, text out. Listing it next to Kokoro and Qwen3-TTS still makes sense because voice products need &lt;strong&gt;both&lt;/strong&gt; directions. The failure mode to watch for is teams treating Whisper like a speech &lt;em&gt;generator&lt;/em&gt; or expecting TTS to “listen.” Naming the loop clearly saves weeks of misfired tickets.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;strengths-to-lean-on&#34;&gt;Strengths to lean on&lt;/h2&gt;
&lt;p&gt;Whisper-class models are a strong &lt;strong&gt;open baseline&lt;/strong&gt; for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Captions&lt;/strong&gt; and meeting notes when privacy policy allows cloud or self-hosted inference.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Voice commands&lt;/strong&gt; where vocabulary is broad and accents vary.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RAG ingestion&lt;/strong&gt; over spoken content: transcribe audio, chunk text, embed—&lt;a href=&#34;https://blog.veydh.com/2026/2026-01-rag-foundations-embeddings-and-retrieval/&#34;&gt;RAG Foundations: Embeddings, Chunking, and the Retrieval Loop&lt;/a&gt;’s loop with a microphone upstream.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The robustness across noise and accents is the headline feature; it is why Whisper became the default mental model for ASR in many stacks.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;costs-and-real-time-constraints&#34;&gt;Costs and real-time constraints&lt;/h2&gt;
&lt;p&gt;Larger tiers improve accuracy and robustness; they also raise &lt;strong&gt;latency&lt;/strong&gt; and &lt;strong&gt;compute&lt;/strong&gt;. Real-time use often means &lt;strong&gt;chunked&lt;/strong&gt; inference, voice activity detection (&lt;strong&gt;VAD&lt;/strong&gt;) to skip silence, or smaller distilled ASR models when Whisper is overkill.&lt;/p&gt;
&lt;p&gt;Hardware is sized the same way as LLM serving: expected concurrency, SLO for end-to-end voice turns, and batching strategy.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;failure-modes-hallucinations-and-music&#34;&gt;Failure modes: hallucinations and music&lt;/h2&gt;
&lt;p&gt;ASR models can emit plausible text in &lt;strong&gt;silent&lt;/strong&gt; segments or under &lt;strong&gt;music&lt;/strong&gt;—classic hallucination territory. Mitigations include VAD, confidence heuristics, and post-filters that reject empty-audio transcripts. None are perfect; residual risk should be documented for product and legal review.&lt;/p&gt;
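&lt;p&gt;A post-filter sketch over decoded segments; the &lt;code&gt;no_speech_prob&lt;/code&gt; and &lt;code&gt;avg_logprob&lt;/code&gt; fields mirror what common Whisper tooling exposes, but treat the names and thresholds as assumptions to validate against your own stack:&lt;/p&gt;

```python
# Confidence-based post-filter for ASR segments. Thresholds are starting
# points, not tuned values; calibrate them on held-out audio.
from operator import gt

def keep_segment(seg, max_no_speech=0.6, min_avg_logprob=-1.0):
    if not seg["text"].strip():
        return False   # empty transcript: nothing to keep
    if gt(seg["no_speech_prob"], max_no_speech):
        return False   # the model itself flags the span as probable non-speech
    if gt(min_avg_logprob, seg["avg_logprob"]):
        return False   # decode confidence below the floor: likely hallucinated
    return True
```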
&lt;hr&gt;
&lt;h2 id=&#34;architecture-placement&#34;&gt;Architecture placement&lt;/h2&gt;
&lt;p&gt;For assistants, Whisper belongs &lt;strong&gt;before&lt;/strong&gt; the LLM:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;microphone → ASR → text → (RAG + LLM) → text → TTS → speaker&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;That ordering keeps responsibilities clean. &lt;a href=&#34;https://blog.veydh.com/2026/2026-04-qwen3-tts-and-multilingual-neural-tts/&#34;&gt;Qwen3-TTS and Multilingual Neural TTS Trade-offs&lt;/a&gt; focuses on &lt;strong&gt;neural TTS&lt;/strong&gt; output quality; &lt;a href=&#34;https://blog.veydh.com/2026/2026-04-rvc-voice-conversion-architecture/&#34;&gt;RVC in the Stack: Voice Conversion After TTS&lt;/a&gt; adds &lt;strong&gt;RVC&lt;/strong&gt; when the output timbre must match a specific speaker profile.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;step-by-step-batch-transcription-you-can-regression-test&#34;&gt;Step-by-step: batch transcription you can regression-test&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Pin model&lt;/strong&gt; — &lt;code&gt;large-v3&lt;/code&gt; vs &lt;code&gt;medium&lt;/code&gt; is a product choice; record &lt;strong&gt;checkpoint file hash&lt;/strong&gt; or &lt;code&gt;faster-whisper&lt;/code&gt; wheel version.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Normalize audio&lt;/strong&gt; — Resample to what the model expects (often 16 kHz mono); store &lt;strong&gt;ffmpeg&lt;/strong&gt; command lines in the ADR.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk long files&lt;/strong&gt; — VAD-based segments (silero, WebRTC VAD) reduce hallucinated text in music-only regions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Decode&lt;/strong&gt; — &lt;code&gt;faster-whisper&lt;/code&gt; or &lt;code&gt;whisper.cpp&lt;/code&gt; with fixed &lt;code&gt;beam_size&lt;/code&gt; / &lt;code&gt;temperature&lt;/code&gt; for reproducibility; log &lt;strong&gt;language&lt;/strong&gt; explicitly or force &lt;code&gt;language=&amp;quot;en&amp;quot;&lt;/code&gt; when appropriate.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Post-filter&lt;/strong&gt; — Drop segments with &lt;strong&gt;no speech&lt;/strong&gt; per VAD; optionally cap &lt;strong&gt;max segment duration&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;CLI-shaped reproducible run&lt;/strong&gt; (&lt;code&gt;faster-whisper&lt;/code&gt; example):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;python - &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;lt;&amp;lt;&amp;#39;PY&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;from faster_whisper import WhisperModel
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;model = WhisperModel(&amp;#34;large-v3&amp;#34;, device=&amp;#34;cuda&amp;#34;, compute_type=&amp;#34;float16&amp;#34;)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;segments, info = model.transcribe(&amp;#34;sample.wav&amp;#34;, beam_size=5, vad_filter=True)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;print(&amp;#34;lang&amp;#34;, info.language, &amp;#34;prob&amp;#34;, info.language_probability)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;for s in segments:
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;    print(f&amp;#34;[{s.start:.2f}-{s.end:.2f}] {s.text}&amp;#34;)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;PY&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; low out-of-the-box WER across diverse accents. &lt;strong&gt;Cons:&lt;/strong&gt; GPU memory and latency for &lt;code&gt;large-*&lt;/code&gt; tiers, and hallucinations under silence or music unless VAD and post-filter policies are in place.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;feeding-rag-from-audio-retrieval-loop--microphone-upstream&#34;&gt;Feeding RAG from audio (retrieval loop + microphone upstream)&lt;/h2&gt;
&lt;p&gt;After ASR, the pipeline is plain text: &lt;strong&gt;transcript → chunk → embed → index&lt;/strong&gt; (same shape as &lt;a href=&#34;https://blog.veydh.com/2026/2026-01-rag-foundations-embeddings-and-retrieval/&#34;&gt;RAG Foundations: Embeddings, Chunking, and the Retrieval Loop&lt;/a&gt;). Record &lt;strong&gt;ASR model id&lt;/strong&gt; as metadata on derived chunks so you can re-index when ASR improves. Do &lt;strong&gt;not&lt;/strong&gt; embed raw audio in a text embedder without a dedicated multimodal model—that is a different architecture.&lt;/p&gt;
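&lt;p&gt;A sketch of that transcript-to-chunk step, carrying the ASR model id as metadata on every derived chunk; the embed-and-index call is deliberately left out, since from here on it is the same path as any other text:&lt;/p&gt;

```python
# Group ASR segments into character-bounded chunks, stamping each chunk with
# the ASR model id so a later ASR upgrade can trigger targeted re-indexing.
from operator import ge

def chunk_transcript(segments, asr_model_id, max_chars=800):
    chunks, buf, size = [], [], 0
    for seg in segments:
        buf.append(seg["text"])
        size += len(seg["text"])
        if ge(size, max_chars):
            chunks.append({"text": " ".join(buf), "asr_model_id": asr_model_id})
            buf, size = [], 0
    if buf:
        chunks.append({"text": " ".join(buf), "asr_model_id": asr_model_id})
    return chunks
```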
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Whisper is the &lt;strong&gt;ears&lt;/strong&gt; of the stack, not the &lt;strong&gt;mouth&lt;/strong&gt;. Treating ASR and TTS symmetrically at the API layer—timeouts, quotas, logging—makes the full voice pipeline operable.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Kokoro TTS: Lightweight Neural Speech and Where It Fits</title>
      <link>https://blog.veydh.com/2026/2026-03-kokoro-tts-lightweight-synthesis/</link>
      <pubDate>Wed, 04 Mar 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-03-kokoro-tts-lightweight-synthesis/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;Text-to-speech sits at the far end of the assistant pipeline: the LLM (and maybe RAG) decides &lt;em&gt;what&lt;/em&gt; to say; &lt;strong&gt;TTS&lt;/strong&gt; turns that text into audio users actually hear. &lt;strong&gt;Kokoro&lt;/strong&gt;-class engines target &lt;strong&gt;efficiency&lt;/strong&gt;—fast cold start, modest RAM, and throughput that fits small VMs or edge devices. Exact checkpoints and revisions evolve; the &lt;strong&gt;design trade-off&lt;/strong&gt; to anchor is footprint versus expressiveness.&lt;/p&gt;
&lt;p&gt;This post is the lightweight leg of the 2026 speech arc; &lt;a href=&#34;https://blog.veydh.com/2026/2026-04-qwen3-tts-and-multilingual-neural-tts/&#34;&gt;Qwen3-TTS and Multilingual Neural TTS Trade-offs&lt;/a&gt; contrasts it with heavier multilingual stacks; &lt;a href=&#34;https://blog.veydh.com/2026/2026-04-rvc-voice-conversion-architecture/&#34;&gt;RVC in the Stack: Voice Conversion After TTS&lt;/a&gt; adds &lt;strong&gt;RVC&lt;/strong&gt; when timbre cloning enters the picture.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;where-lightweight-tts-wins&#34;&gt;Where lightweight TTS wins&lt;/h2&gt;
&lt;p&gt;Smaller neural TTS fits when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cost per minute&lt;/strong&gt; dominates—IVR prompts, notification speech, or high-volume read-aloud.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Provisioning&lt;/strong&gt; is tight—no room for multi-gigabyte multispeaker servers on every node.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency&lt;/strong&gt; budgets are aggressive and the script is short.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In those regimes, “good enough” prosody beats “stunning but slow.” Users forgive a slightly flat delivery on a parking confirmation more than they forgive a three-second wait.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;trade-offs-to-expect-up-front&#34;&gt;Trade-offs to expect up front&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Prosody and emotional range&lt;/strong&gt; rarely match the largest commercial or open giants. You get clarity and speed; you may not get nuanced acting.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Voice roster&lt;/strong&gt; may be limited to bundled speakers. If marketing demands a &lt;strong&gt;specific&lt;/strong&gt; timbre, you either fine-tune within the engine’s constraints or add a &lt;strong&gt;second stage&lt;/strong&gt;—see &lt;a href=&#34;https://blog.veydh.com/2026/2026-04-rvc-voice-conversion-architecture/&#34;&gt;RVC in the Stack: Voice Conversion After TTS&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Language coverage&lt;/strong&gt; varies by checkpoint. Every claimed locale should be validated; multilingual tokenization issues are not unique to LLMs.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;integration-beside-the-llm-layer&#34;&gt;Integration beside the LLM layer&lt;/h2&gt;
&lt;p&gt;TTS belongs &lt;strong&gt;behind the same API discipline&lt;/strong&gt; as &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-llm-serving-apis-openai-compatible/&#34;&gt;Serving Distilled Models Behind an HTTP API&lt;/a&gt;’s chat surface: timeouts, quotas, and structured errors when synthesis fails mid-utterance. Streaming audio raises questions of chunking and buffering; clients should handle partial playback and cancellation when the user interrupts.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;relation-to-the-rest-of-the-stack&#34;&gt;Relation to the rest of the stack&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://blog.veydh.com/2026/2026-04-whisper-in-the-speech-stack-asr/&#34;&gt;Whisper in the Loop: ASR, Not TTS&lt;/a&gt; covers &lt;strong&gt;Whisper&lt;/strong&gt; on the &lt;strong&gt;listening&lt;/strong&gt; side (ASR). The mental model is directional: &lt;strong&gt;Whisper → text → LLM/RAG → text → TTS → audio&lt;/strong&gt;. Mixing ASR and TTS responsibilities is a common beginner confusion; naming the boundary clearly keeps architecture reviews honest.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;reproducible-evaluation-loop-lightweight-tts&#34;&gt;Reproducible evaluation loop (lightweight TTS)&lt;/h2&gt;
&lt;p&gt;Treat TTS like any other model artifact: &lt;strong&gt;pin revision&lt;/strong&gt;, &lt;strong&gt;pin voice&lt;/strong&gt;, &lt;strong&gt;record sample rate&lt;/strong&gt;, &lt;strong&gt;store golden waveforms&lt;/strong&gt; for regression.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Freeze the stack&lt;/strong&gt; — Note Python version, package versions, and whether you run ONNX, CPU, or GPU build. One-line smoke: synthesize the same 3 sentences on every deploy.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Measure what product cares about&lt;/strong&gt; — &lt;strong&gt;RTF&lt;/strong&gt; (real-time factor) = wall_time / audio_duration, where values below 1 mean faster than real time; &lt;strong&gt;cold start&lt;/strong&gt; ms until first byte; &lt;strong&gt;RSS&lt;/strong&gt; at steady state for N concurrent streams.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ABX or MOS-lite&lt;/strong&gt; — For internal QA, maintain a spreadsheet of &lt;code&gt;{text, reference_wav}&lt;/code&gt; pairs; reviewers rate naturalness 1–5. Cheap, catches prosody regressions automated metrics miss.&lt;/li&gt;
&lt;/ol&gt;
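&lt;p&gt;Step 2 can be measured in a few lines; &lt;code&gt;synthesize&lt;/code&gt; here is a stand-in for whatever engine call you standardize on:&lt;/p&gt;

```python
# RTF measurement harness. RTF = wall_time / audio_duration; values under 1.0
# mean the engine runs faster than real time.
import time

def measure_rtf(synthesize, text, sample_rate_hz):
    t0 = time.perf_counter()
    samples = synthesize(text)          # stand-in for the real engine call
    wall = time.perf_counter() - t0
    audio_s = len(samples) / sample_rate_hz
    return wall / audio_s
```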
&lt;p&gt;&lt;strong&gt;Illustrative Python call pattern&lt;/strong&gt; (replace with the actual Kokoro / edge engine API you standardize on—the shape stays the same):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Pseudocode shape: engine.load(voice_id, revision=REV)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# audio = engine.synthesize(text, sample_rate=24000)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Path(&amp;#34;golden/hello.wav&amp;#34;).write_bytes(audio.to_wav())&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Log &lt;code&gt;voice_id&lt;/code&gt;, &lt;code&gt;engine_revision&lt;/code&gt;, and &lt;code&gt;sample_rate&lt;/code&gt; beside the &lt;code&gt;request_id&lt;/code&gt; discipline from &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-llm-serving-apis-openai-compatible/&#34;&gt;Serving Distilled Models Behind an HTTP API&lt;/a&gt; so ops can answer “what spoke this?”&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;tool-comparison-same-text-in-audio-out&#34;&gt;Tool comparison (same text in, audio out)&lt;/h2&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Option&lt;/th&gt;
          &lt;th&gt;Footprint&lt;/th&gt;
          &lt;th&gt;Quality headroom&lt;/th&gt;
          &lt;th&gt;Ops note&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Lightweight neural TTS (Kokoro-class)&lt;/td&gt;
          &lt;td&gt;Low RAM/CPU&lt;/td&gt;
          &lt;td&gt;Moderate prosody&lt;/td&gt;
          &lt;td&gt;Best for alerts/IVR-scale volume&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Cloud vendor TTS&lt;/td&gt;
          &lt;td&gt;Zero local GPU&lt;/td&gt;
          &lt;td&gt;High&lt;/td&gt;
          &lt;td&gt;Contract + egress + voice licensing&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Heavy open multilingual (&lt;a href=&#34;https://blog.veydh.com/2026/2026-04-qwen3-tts-and-multilingual-neural-tts/&#34;&gt;Qwen3-TTS and Multilingual Neural TTS Trade-offs&lt;/a&gt;)&lt;/td&gt;
          &lt;td&gt;High GPU&lt;/td&gt;
          &lt;td&gt;High&lt;/td&gt;
          &lt;td&gt;Fleet planning + checkpoint compliance&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Kokoro-tier TTS is a &lt;strong&gt;practical default&lt;/strong&gt; when efficiency matters more than maximal expressiveness. &lt;a href=&#34;https://blog.veydh.com/2026/2026-04-qwen3-tts-and-multilingual-neural-tts/&#34;&gt;Qwen3-TTS and Multilingual Neural TTS Trade-offs&lt;/a&gt; compares &lt;strong&gt;Qwen3-TTS&lt;/strong&gt;-class heavy stacks when quality and control win the negotiation.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Token Budgets: Enforcing Limits at the API Layer</title>
      <link>https://blog.veydh.com/2026/2026-03-token-limits-budgets-and-truncation/</link>
      <pubDate>Wed, 04 Mar 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-03-token-limits-budgets-and-truncation/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;Uncapped generation is how you burn &lt;strong&gt;VRAM&lt;/strong&gt;, &lt;strong&gt;money&lt;/strong&gt;, and &lt;strong&gt;user patience&lt;/strong&gt; in a single request. Limits belong in &lt;strong&gt;middleware&lt;/strong&gt;—before the GPU does heavy work—not as a vague suggestion in the front-end. Token budgets are &lt;strong&gt;policy&lt;/strong&gt;: product, security, and finance get a say; engineering implements them deterministically.&lt;/p&gt;
&lt;p&gt;This pairs directly with &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-llm-serving-apis-openai-compatible/&#34;&gt;Serving Distilled Models Behind an HTTP API&lt;/a&gt;’s serving layer and with &lt;a href=&#34;https://blog.veydh.com/2026/2026-01-rag-foundations-embeddings-and-retrieval/&#34;&gt;RAG Foundations: Embeddings, Chunking, and the Retrieval Loop&lt;/a&gt;’s RAG context stuffing: the same window is contested by system prompts, retrieved chunks, tool outputs, and the user’s message history.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;hard-caps-max_tokens-and-friends&#34;&gt;Hard caps: &lt;code&gt;max_tokens&lt;/code&gt; and friends&lt;/h2&gt;
&lt;p&gt;Server-side &lt;strong&gt;&lt;code&gt;max_new_tokens&lt;/code&gt;&lt;/strong&gt; (or API &lt;code&gt;max_tokens&lt;/code&gt;, depending on the stack) is the simplest brake: stop after N new tokens regardless of whether the model “feels done.” Defaults should be sane per tier, with higher quotas for authenticated power users where appropriate.&lt;/p&gt;
&lt;p&gt;The important detail is &lt;strong&gt;consistent semantics&lt;/strong&gt; across streaming and non-streaming paths. Partial streams must surface a &lt;strong&gt;stop reason&lt;/strong&gt; so clients know the answer was truncated by length, not completed naturally.&lt;/p&gt;
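&lt;p&gt;A minimal sketch of that clamp, assuming hypothetical tier names and caps:&lt;/p&gt;

```python
# Minimal sketch (tier names and caps hypothetical): clamp a requested
# max_tokens against per-tier policy and report why generation stopped.
TIER_CAPS = {"free": 256, "pro": 1024}

def effective_max_tokens(requested, tier):
    cap = TIER_CAPS.get(tier, 256)
    return cap if requested is None else min(requested, cap)

def finish_reason(tokens_emitted, cap, saw_eos):
    # "stop" means the model ended naturally; "length" means we truncated.
    if saw_eos:
        return "stop"
    return "length" if tokens_emitted >= cap else "stop"
```

&lt;p&gt;The same &lt;code&gt;finish_reason&lt;/code&gt; value must be attached to both streaming and non-streaming responses.&lt;/p&gt;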
&lt;hr&gt;
&lt;h2 id=&#34;pre-flight-token-counting&#34;&gt;Pre-flight token counting&lt;/h2&gt;
&lt;p&gt;Rejecting oversize prompts &lt;strong&gt;before&lt;/strong&gt; allocation saves money. Token counts should use the same tokenizer family the model uses—approximations from rough character heuristics drift and cause surprise denials.&lt;/p&gt;
&lt;p&gt;When a prompt is too large, responses should be &lt;strong&gt;actionable&lt;/strong&gt;: which limit tripped, whether retry is possible after truncation, and never a generic 500 when the fix is user-side editing.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;truncation-strategies-for-long-threads&#34;&gt;Truncation strategies for long threads&lt;/h2&gt;
&lt;p&gt;For chat history, naive “keep the last K messages” drops important early instructions. Alternatives to mix and match:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Drop middle&lt;/strong&gt;: keep system/developer instructions and recent turns; summarize or elide the middle.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Summarize&lt;/strong&gt;: periodically roll older turns into a running summary (watch for summary drift and stale facts).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retrieve instead of paste&lt;/strong&gt;: RAG pulls fresh chunks rather than embedding entire old threads in the prompt.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each strategy has failure modes; the right choice depends on whether the product is &lt;strong&gt;support&lt;/strong&gt; (needs ticket context) or &lt;strong&gt;creative&lt;/strong&gt; (needs recent style).&lt;/p&gt;
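&lt;p&gt;The drop-middle strategy can be sketched in a few lines (message shape and token counter are assumptions; a real version would insert a summary or elision marker where turns were dropped):&lt;/p&gt;

```python
# Hedged sketch of "drop middle": keep system instructions and the most recent
# turns, eliding older non-system turns until the estimated budget fits.
def drop_middle(messages, count_tokens, budget):
    system = [m for m in messages if m["role"] == "system"]
    kept = [m for m in messages if m["role"] != "system"]
    while kept and count_tokens(system + kept) > budget:
        kept.pop(0)  # drop the oldest non-system turn first
    return system + kept
```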
&lt;hr&gt;
&lt;h2 id=&#34;streaming-and-client-contracts&#34;&gt;Streaming and client contracts&lt;/h2&gt;
&lt;p&gt;With &lt;strong&gt;SSE&lt;/strong&gt; or chunked responses, clients must handle incremental tokens, &lt;strong&gt;finish reasons&lt;/strong&gt;, and errors mid-stream. Whether partial JSON tool calls can arrive and how to recover should be documented—same care as any long-lived HTTP stream.&lt;/p&gt;
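&lt;p&gt;A hedged sketch of the client side, assuming OpenAI-style &lt;code&gt;data:&lt;/code&gt; lines with a &lt;code&gt;[DONE]&lt;/code&gt; sentinel:&lt;/p&gt;

```python
import json

# Fold an SSE stream into text while tracking the finish reason, so the client
# can tell a length-truncated answer from a naturally completed one.
def fold_sse(lines):
    text, finish = [], None
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        choice = json.loads(payload)["choices"][0]
        text.append(choice.get("delta", {}).get("content") or "")
        finish = choice.get("finish_reason") or finish
    return "".join(text), finish
```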
&lt;hr&gt;
&lt;h2 id=&#34;observability-and-fairness&#34;&gt;Observability and fairness&lt;/h2&gt;
&lt;p&gt;Histograms for &lt;strong&gt;prompt length&lt;/strong&gt;, &lt;strong&gt;output length&lt;/strong&gt;, and &lt;strong&gt;time-to-first-token&lt;/strong&gt; expose abusive patterns and misconfigured clients before finance does. Per-tenant &lt;strong&gt;quotas&lt;/strong&gt; pair with these metrics so one noisy integration does not starve the rest.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;implementation-sketch-pre-flight-with-tiktoken-or-the-hub-tokenizer&#34;&gt;Implementation sketch: pre-flight with &lt;code&gt;tiktoken&lt;/code&gt; or the Hub tokenizer&lt;/h2&gt;
&lt;p&gt;Use the &lt;strong&gt;same tokenizer class&lt;/strong&gt; the model uses—not a rough &lt;code&gt;len(text)//4&lt;/code&gt; heuristic.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; transformers &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; AutoTokenizer
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;tok &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; AutoTokenizer&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;from_pretrained(MODEL_ID, revision&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;MODEL_REV)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;count_messages&lt;/span&gt;(messages) &lt;span style=&#34;color:#f92672&#34;&gt;-&amp;gt;&lt;/span&gt; int:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#75715e&#34;&gt;# Prefer the model&amp;#39;s chat template when available&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;return&lt;/span&gt; len(tok&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;apply_chat_template(messages, tokenize&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;True&lt;/span&gt;, add_generation_prompt&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;True&lt;/span&gt;))
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;assert_within_budget&lt;/span&gt;(messages, max_prompt_tokens: int) &lt;span style=&#34;color:#f92672&#34;&gt;-&amp;gt;&lt;/span&gt; &lt;span style=&#34;color:#66d9ef&#34;&gt;None&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    n &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; count_messages(messages)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; n &lt;span style=&#34;color:#f92672&#34;&gt;&amp;gt;&lt;/span&gt; max_prompt_tokens:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#66d9ef&#34;&gt;raise&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;ValueError&lt;/span&gt;(&lt;span style=&#34;color:#e6db74&#34;&gt;f&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;prompt_tokens=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;{&lt;/span&gt;n&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt; exceeds max_prompt_tokens=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;{&lt;/span&gt;max_prompt_tokens&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In &lt;strong&gt;middleware&lt;/strong&gt;, return &lt;strong&gt;HTTP 413&lt;/strong&gt; or &lt;strong&gt;400&lt;/strong&gt; with a JSON body &lt;code&gt;{ &amp;quot;error&amp;quot;: &amp;quot;context_length&amp;quot;, &amp;quot;prompt_tokens&amp;quot;: n, &amp;quot;limit&amp;quot;: max_prompt_tokens }&lt;/code&gt; so clients know to truncate or summarize—mirroring how public APIs surface limits.&lt;/p&gt;
&lt;p&gt;For &lt;strong&gt;RAG&lt;/strong&gt;, count tokens &lt;strong&gt;after&lt;/strong&gt; you assemble system + retrieved passages + user message; if over budget, drop lowest-scoring chunks first (&lt;a href=&#34;https://blog.veydh.com/2026/2026-01-rag-operations-re-ranking-and-evaluation/&#34;&gt;RAG in Production: Re-ranking, HyDE, and Simple Evals&lt;/a&gt;’s reranker makes that ordering meaningful).&lt;/p&gt;
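&lt;p&gt;A minimal sketch of that trimming, assuming chunks carry a retrieval &lt;code&gt;score&lt;/code&gt; (field names hypothetical):&lt;/p&gt;

```python
# Hedged sketch: trim retrieved chunks to a token budget, dropping the
# lowest-scoring ones first while preserving the prompt order of survivors.
def fit_chunks(chunks, count_tokens, budget):
    # chunks: list of {"id", "text", "score"} already ordered for the prompt
    keep = list(chunks)
    while keep and sum(count_tokens(c["text"]) for c in keep) > budget:
        worst = min(keep, key=lambda c: c["score"])
        keep.remove(worst)
    return keep
```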
&lt;hr&gt;
&lt;h2 id=&#34;reproducible-truncation-regression-test&#34;&gt;Reproducible truncation regression test&lt;/h2&gt;
&lt;p&gt;Keep a &lt;strong&gt;fixture&lt;/strong&gt; of long threads in &lt;code&gt;tests/fixtures/long_chat.json&lt;/code&gt;. After any change to truncation or summarization:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Token-count before/after.&lt;/li&gt;
&lt;li&gt;Golden hash of the &lt;strong&gt;exact&lt;/strong&gt; byte sequence sent to the model (or the list of chunk ids injected for RAG).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;CI fails when the fixture’s effective context changes without a version bump—same instinct as index versioning in &lt;a href=&#34;https://blog.veydh.com/2026/2026-01-rag-foundations-embeddings-and-retrieval/&#34;&gt;RAG Foundations: Embeddings, Chunking, and the Retrieval Loop&lt;/a&gt;.&lt;/p&gt;
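&lt;p&gt;The golden-hash check is a few lines of stdlib (the golden-file shape is an assumption):&lt;/p&gt;

```python
import hashlib

# Hash the exact bytes destined for the model so CI can detect silent
# effective-context changes after truncation/summarization edits.
def context_fingerprint(prompt_bytes: bytes) -> str:
    return hashlib.sha256(prompt_bytes).hexdigest()

def check_fixture(rendered: bytes, golden: dict) -> bool:
    # golden: e.g. {"sha256": "...", "version": 3} stored next to the fixture
    return context_fingerprint(rendered) == golden["sha256"]
```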
</description>
    </item>
    
    <item>
      <title>Fine-Tuning a Mistral-Class Checkpoint with LoRA on Hugging Face</title>
      <link>https://blog.veydh.com/2026/2026-02-mistral-huggingface-lora-training-notes/</link>
      <pubDate>Fri, 13 Feb 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-02-mistral-huggingface-lora-training-notes/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Mistral&lt;/strong&gt;-family checkpoints are common bases for assistants: strong quality per dollar, good tooling, and a large community of derivative models. Pulling weights from the Hub is only the first step. Training &lt;strong&gt;without&lt;/strong&gt; the model’s official &lt;strong&gt;chat template&lt;/strong&gt;—or mixing tokenizer specials inconsistently—is how you get clean loss curves and broken inference.&lt;/p&gt;
&lt;p&gt;The snippets below assume &lt;strong&gt;PyTorch + Transformers + PEFT&lt;/strong&gt;; translate the same invariants to your runner of choice.&lt;/p&gt;
&lt;p&gt;This post is a checklist to run before any serious LoRA SFT job on Mistral-class bases. It mirrors the dependency-pinning mindset from earlier DevOps writing: reproducibility beats heroics.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;pin-the-checkpoint-not-just-the-name&#34;&gt;Pin the checkpoint, not just the name&lt;/h2&gt;
&lt;p&gt;Hub models move. &lt;strong&gt;&lt;code&gt;revision&lt;/code&gt;&lt;/strong&gt; should be pinned to a commit hash for training &lt;em&gt;and&lt;/em&gt; for inference. The same applies to tokenizer files bundled with that revision. “Latest” is fine for a weekend experiment; it is a liability when you need to diff a regression three weeks later.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;chat-templates-and-special-tokens&#34;&gt;Chat templates and special tokens&lt;/h2&gt;
&lt;p&gt;Instruction-tuned models expect &lt;strong&gt;BOS/EOS&lt;/strong&gt; and role markers in a specific layout. Training on raw concatenated strings—user and assistant blobs jammed together without structure—teaches the adapter the wrong conditional distribution.&lt;/p&gt;
&lt;p&gt;Always:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Load &lt;code&gt;tokenizer&lt;/code&gt; from the same revision as &lt;code&gt;model&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Apply the model card’s recommended &lt;strong&gt;chat template&lt;/strong&gt; (or &lt;code&gt;tokenizer.apply_chat_template&lt;/code&gt; when available) to build training examples.&lt;/li&gt;
&lt;li&gt;Mask labels so loss applies only to &lt;strong&gt;assistant&lt;/strong&gt; tokens when that is the objective; masking prevents the model from learning to imitate user prefixes.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If any of those steps feel optional, revisit &lt;a href=&#34;https://blog.veydh.com/2026/2026-01-rag-foundations-embeddings-and-retrieval/&#34;&gt;RAG Foundations: Embeddings, Chunking, and the Retrieval Loop&lt;/a&gt;: garbage structure upstream becomes garbage behavior downstream.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;memory-precision-and-sequence-length&#34;&gt;Memory, precision, and sequence length&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Gradient checkpointing&lt;/strong&gt; trades compute for activation memory; on long contexts it is often mandatory. Prefer &lt;strong&gt;bf16&lt;/strong&gt; when the hardware supports it well; &lt;strong&gt;fp16&lt;/strong&gt; can work but needs monitoring for overflow. &lt;strong&gt;OOM&lt;/strong&gt; during the first real batch is a signal to shorten sequences, reduce the micro-batch size, or enable ZeRO-style sharding—not to immediately give up on LoRA.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;saving-and-merging-adapters&#34;&gt;Saving and merging adapters&lt;/h2&gt;
&lt;p&gt;During iteration, save &lt;strong&gt;adapter weights only&lt;/strong&gt;—small, easy to diff and A/B swap. &lt;strong&gt;Merging&lt;/strong&gt; adapters into a full weight file is useful for some deployment paths (single artifact, certain runtimes); it complicates updates because every merge is a new binary. Document which path production uses so nobody confuses “adapter v3” with “merged full model v3”.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;documentation-alongside-the-weights&#34;&gt;Documentation alongside the weights&lt;/h2&gt;
&lt;p&gt;An &lt;strong&gt;ADR-style&lt;/strong&gt; note (even a short Markdown file in the repo) should record Hub IDs, dataset hashes, seeds, learning rate schedule, and eval commands. That is the same habit as recording Docker base images and lockfiles: future you is also a teammate.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;end-to-end-recipe-sft-on-jsonl-chat-rows&#34;&gt;End-to-end recipe (SFT on JSONL chat rows)&lt;/h2&gt;
&lt;p&gt;Assume each training row is &lt;code&gt;{&amp;quot;messages&amp;quot;: [{&amp;quot;role&amp;quot;:&amp;quot;system&amp;quot;,&amp;quot;content&amp;quot;:...}, {&amp;quot;role&amp;quot;:&amp;quot;user&amp;quot;,...}, {&amp;quot;role&amp;quot;:&amp;quot;assistant&amp;quot;,...}]}&lt;/code&gt; — one assistant turn per example for simplicity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1 — Load pinned revision&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; torch
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; transformers &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; AutoModelForCausalLM, AutoTokenizer
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;MODEL_ID &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;mistralai/Mistral-7B-Instruct-v0.3&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;REV &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;abc123deadbeef...&amp;#34;&lt;/span&gt;  &lt;span style=&#34;color:#75715e&#34;&gt;# Hub commit sha&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;tokenizer &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; AutoTokenizer&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;from_pretrained(MODEL_ID, revision&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;REV)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;model &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; AutoModelForCausalLM&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;from_pretrained(
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    MODEL_ID, revision&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;REV, torch_dtype&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;torch&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;bfloat16, device_map&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;auto&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Step 2 — Apply chat template and mask user/system tokens&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Use &lt;code&gt;tokenizer.apply_chat_template&lt;/code&gt; with &lt;code&gt;return_dict=True&lt;/code&gt; and token masks, or build labels manually: set &lt;code&gt;labels[i] = -100&lt;/code&gt; for tokens that belong to system + user prefixes so &lt;strong&gt;only assistant tokens contribute loss&lt;/strong&gt;. Skipping this is how you get a model that parrots the user channel.&lt;/p&gt;
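&lt;p&gt;The manual-labels variant reduces to a few lines once the prefix and assistant token ids are split (a sketch; tokenization itself is elided):&lt;/p&gt;

```python
# Hedged sketch of manual label masking: -100 is the ignore index for
# cross-entropy, so loss applies only to assistant tokens.
def build_labels(prompt_ids, answer_ids):
    # prompt_ids: tokens for the system + user prefix
    # answer_ids: tokens for the assistant turn (the part we want to learn)
    input_ids = prompt_ids + answer_ids
    labels = [-100] * len(prompt_ids) + list(answer_ids)
    assert len(labels) == len(input_ids)
    return input_ids, labels
```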
&lt;p&gt;&lt;strong&gt;Step 3 — Attach LoRA and train&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Match &lt;code&gt;target_modules&lt;/code&gt; to Mistral’s linear names (&lt;code&gt;q_proj&lt;/code&gt;, &lt;code&gt;k_proj&lt;/code&gt;, &lt;code&gt;v_proj&lt;/code&gt;, &lt;code&gt;o_proj&lt;/code&gt; are the usual set). Watch &lt;strong&gt;train loss&lt;/strong&gt; &lt;em&gt;and&lt;/em&gt; a small &lt;strong&gt;held-out generation&lt;/strong&gt; set throughout the run—loss can fall while JSON/tool format falls apart.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 4 — Save adapter + manifest&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;lora-out/
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  adapter_config.json
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  adapter_model.safetensors
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;manifest.json   # { &amp;#34;base&amp;#34;: MODEL_ID, &amp;#34;revision&amp;#34;: REV, &amp;#34;dataset_sha256&amp;#34;: &amp;#34;...&amp;#34;, &amp;#34;seed&amp;#34;: 42 }
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Step 5 — Inference smoke test&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Load base + adapter, run the same &lt;code&gt;apply_chat_template&lt;/code&gt; path used in training, and compare against a &lt;strong&gt;golden&lt;/strong&gt; transcript file in CI if possible.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;cli-alternative-trl--axolotl--llama-factory&#34;&gt;CLI alternative: &lt;code&gt;trl&lt;/code&gt; / Axolotl / LLaMA-Factory&lt;/h2&gt;
&lt;p&gt;Higher-level runners (&lt;code&gt;trl&lt;/code&gt; SFT, Axolotl, LLaMA-Factory YAMLs) encode the same steps: dataset → template → masked labels → PEFT. &lt;strong&gt;Pros:&lt;/strong&gt; less boilerplate, proven configs on Discord. &lt;strong&gt;Cons:&lt;/strong&gt; another layer to debug when templates change between library versions—still pin the &lt;strong&gt;library version&lt;/strong&gt; and export the rendered tokenization for one row to disk when incidents happen.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Mistral-class fine-tunes are won or lost on &lt;strong&gt;tokenizer fidelity&lt;/strong&gt; and &lt;strong&gt;pinned revisions&lt;/strong&gt;. &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-qwen-and-llama-distills-lora-on-hf/&#34;&gt;Qwen and LLaMA Distilled Checkpoints: LoRA Gotchas&lt;/a&gt; extends the same discipline to &lt;strong&gt;Qwen&lt;/strong&gt; and &lt;strong&gt;LLaMA&lt;/strong&gt; derivatives where multilingual vocabs and RoPE extensions add their own footguns.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Qwen and LLaMA Distilled Checkpoints: LoRA Gotchas</title>
      <link>https://blog.veydh.com/2026/2026-02-qwen-and-llama-distills-lora-on-hf/</link>
      <pubDate>Thu, 12 Feb 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-02-qwen-and-llama-distills-lora-on-hf/</guid>
      <description>&lt;h2 id=&#34;preamble-plain-english&#34;&gt;Preamble (plain English)&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Hugging Face (HF)&lt;/strong&gt; hosts thousands of model checkpoints. Among them, &lt;strong&gt;Qwen&lt;/strong&gt; and &lt;strong&gt;LLaMA&lt;/strong&gt; families are everywhere: raw bases, chat-tuned “instruct” versions, and &lt;strong&gt;distilled&lt;/strong&gt; smaller models trained to imitate bigger ones for speed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LoRA&lt;/strong&gt; (Low-Rank Adaptation) means you do not retrain the whole model—you add a thin trainable “side path” and leave most weights frozen. That is cheap and fast, but the adapter is not a free-floating brain: it is tied to &lt;strong&gt;one specific base checkpoint&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The training recipe from &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-mistral-huggingface-lora-training-notes/&#34;&gt;Fine-Tuning a Mistral-Class Checkpoint with LoRA on Hugging Face&lt;/a&gt; still applies. This post is the “read me before you merge adapters” note: &lt;strong&gt;tokenizer&lt;/strong&gt;, &lt;strong&gt;position encoding (RoPE)&lt;/strong&gt;, &lt;strong&gt;attention layout (GQA)&lt;/strong&gt;, and &lt;strong&gt;licenses&lt;/strong&gt; differ by family. Getting any of those wrong often fails &lt;strong&gt;silently&lt;/strong&gt; (the model runs, but quality drifts or breaks in edge cases).&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;tiny-glossary-terms-that-show-up-in-configs-and-papers&#34;&gt;Tiny glossary (terms that show up in configs and papers)&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tokenizer&lt;/strong&gt; — The piece that turns raw text into integer &lt;strong&gt;tokens&lt;/strong&gt; the model sees. Different families use different tokenizers; swap the tokenizer and you have effectively changed the “alphabet” the model was trained on.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RoPE&lt;/strong&gt; — &lt;strong&gt;Rotary Positional Embedding&lt;/strong&gt;. A standard way to tell the model &lt;strong&gt;how far apart tokens are&lt;/strong&gt; in the sequence (position information) without paying the full cost of older absolute-position schemes. “&lt;strong&gt;RoPE scaling&lt;/strong&gt;” tweaks that math so the model can &lt;strong&gt;pretend&lt;/strong&gt; it was trained on longer sequences than it mostly saw—useful for long context, but easy to get wrong between train and serve.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GQA&lt;/strong&gt; — &lt;strong&gt;Grouped-Query Attention&lt;/strong&gt;. A memory-saving trick: several &lt;strong&gt;query&lt;/strong&gt; heads share the same &lt;strong&gt;key/value&lt;/strong&gt; heads instead of every query head getting its own K/V. Layout differs across checkpoints; mixing configs can break loading or silently change behavior if something maps layers wrong.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SFT&lt;/strong&gt; — &lt;strong&gt;Supervised Fine-Tuning&lt;/strong&gt;. You show (prompt, desired answer) pairs and train the model to imitate them—typical “instruction tuning.”&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PEFT&lt;/strong&gt; — &lt;strong&gt;Parameter-Efficient Fine-Tuning&lt;/strong&gt; (e.g. LoRA via the &lt;code&gt;peft&lt;/code&gt; library). You train few parameters; the rest stay frozen.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ADR&lt;/strong&gt; — &lt;strong&gt;Architecture Decision Record&lt;/strong&gt; — a short doc in your repo that says &lt;em&gt;why&lt;/em&gt; you picked a base model and how you pin versions.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;qwen-multilingual-tokenization-changes-the-effective-budget&#34;&gt;Qwen: multilingual tokenization changes the effective budget&lt;/h2&gt;
&lt;p&gt;Think of the tokenizer as a &lt;strong&gt;ruler&lt;/strong&gt; that measures text in &lt;strong&gt;tokens&lt;/strong&gt;, not characters. Qwen’s tokenizer is built for &lt;strong&gt;many languages&lt;/strong&gt;. That is great for products that are not English-only; it also changes planning in ways people miss.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gotcha 1 — Same “length” in characters ≠ same token budget.&lt;/strong&gt;&lt;br&gt;
A paragraph in English might become 800 tokens; the same &lt;em&gt;idea&lt;/em&gt; in another script might become 1,400 tokens. Your “4k context” window is &lt;strong&gt;4k tokens&lt;/strong&gt;, not 4k letters—so non-English locales can &lt;strong&gt;feel&lt;/strong&gt; like they have less room for instructions, tools, and retrieved documents.&lt;/p&gt;
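&lt;p&gt;A hedged back-of-envelope (all numbers illustrative) of how token density eats the same window:&lt;/p&gt;

```python
# Illustrative numbers only: subtract fixed overheads from the window, then
# see how many retrieved chunks fit at different token densities.
def user_budget(window, system_tokens, tool_tokens, reserved_output):
    return window - system_tokens - tool_tokens - reserved_output

room = user_budget(4096, 600, 300, 512)  # 2684 tokens left for history + RAG
en_chunks = room // 800      # 3 chunks at English-like token density
dense_chunks = room // 1400  # only 1 chunk at the heavier density
```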
&lt;p&gt;&lt;strong&gt;Gotcha 2 — English-only metrics lie.&lt;/strong&gt;&lt;br&gt;
If you only watch loss or accuracy on English dev data, you can ship something that looks fine in dashboards but degrades in other languages. Check &lt;strong&gt;each language you actually ship&lt;/strong&gt;, including formatting quirks (dates, numbers, mixed scripts).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gotcha 3 — RAG stacks compound the bias.&lt;/strong&gt;&lt;br&gt;
If your chunking, embedders, and eval sets were built English-first, retrieval quality drops where the tokenizer and embeddings were not aligned with real user text. The LoRA cannot fix upstream retrieval that never surfaces the right chunks.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;llama-rope-gqa-and-long-context-bases&#34;&gt;LLaMA: RoPE, GQA, and “long context” bases&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;RoPE and “long context” marketing&lt;/strong&gt;&lt;br&gt;
A checkpoint may advertise a large context window, but pretraining data is still &lt;strong&gt;mostly shorter&lt;/strong&gt; sequences. The model learns long-context behavior partly through &lt;strong&gt;extrapolation&lt;/strong&gt; (math and scaling tricks). If your &lt;strong&gt;SFT&lt;/strong&gt; data jumps wildly between very short and very long examples without a clear pattern, you can get brittle behavior at the lengths you care about in production.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gotcha 4 — Train-time vs serve-time RoPE settings must match.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;rope_scaling&lt;/code&gt;, &lt;code&gt;max_position_embeddings&lt;/code&gt;, and related flags in &lt;code&gt;config.json&lt;/code&gt; are not cosmetic. If training uses scaling recipe &lt;strong&gt;A&lt;/strong&gt; and your inference server uses &lt;strong&gt;B&lt;/strong&gt; (or an older Transformers version interprets them differently), you often get &lt;strong&gt;no crash&lt;/strong&gt;—just &lt;strong&gt;worse answers&lt;/strong&gt;, especially on long prompts. Pin library versions and dump the effective config next to your adapter in the same manifest.&lt;/p&gt;
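&lt;p&gt;Dumping the effective fields is cheap; a sketch (the key list is an assumption—copy whatever your training and serving stacks actually read):&lt;/p&gt;

```python
# Copy the position-encoding fields the serving stack reads into the run
# manifest so train/serve mismatches become a one-line diff.
ROPE_KEYS = ("rope_scaling", "rope_theta", "max_position_embeddings")

def rope_manifest(config: dict) -> dict:
    return {k: config.get(k) for k in ROPE_KEYS}
```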
&lt;p&gt;&lt;strong&gt;Gotcha 5 — GQA layout is part of the architecture contract.&lt;/strong&gt;&lt;br&gt;
Community and vendor builds differ in how many heads group together. An adapter trained on one revision is &lt;strong&gt;not&lt;/strong&gt; a guaranteed drop-in on another revision if attention layout or hidden sizes changed. Treat “same model name, different revision” like a &lt;strong&gt;different car engine&lt;/strong&gt;—the body looks similar, parts may not interchange.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gotcha 6 — “Eval average loss” hides long-context bugs.&lt;/strong&gt;&lt;br&gt;
Short training runs and average metrics smooth over failures that only appear at &lt;strong&gt;max production context&lt;/strong&gt;. Smoke tests at 200 steps are fine for wiring; you still need at least one eval that stresses &lt;strong&gt;the longest context you plan to use&lt;/strong&gt; (or the longest you can afford to simulate).&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;licenses-and-redistribution&#34;&gt;Licenses and redistribution&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Gotcha 7 — LoRA does not reset licensing.&lt;/strong&gt;&lt;br&gt;
You still ship a story that includes the &lt;strong&gt;base weights&lt;/strong&gt; (loaded at runtime) &lt;strong&gt;plus&lt;/strong&gt; your adapter (or a &lt;strong&gt;merged&lt;/strong&gt; full checkpoint). Some variants restrict &lt;strong&gt;commercial&lt;/strong&gt; use, require &lt;strong&gt;attribution&lt;/strong&gt;, or impose &lt;strong&gt;compliance&lt;/strong&gt; steps. “We only fine-tuned a little” does not exempt you from the base model’s terms.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gotcha 8 — Document Hub IDs like you document dependencies.&lt;/strong&gt;&lt;br&gt;
Put license notes beside &lt;strong&gt;exact&lt;/strong&gt; model IDs and &lt;strong&gt;revisions&lt;/strong&gt; (git-style hashes on the Hub) in the same ADR where you record dataset hashes and seeds. That is what legal and security reviews actually read.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;one-adapter-per-base-revision&#34;&gt;One adapter per base revision&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Gotcha 9 — Do not casually “port” adapters across bases.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;One LoRA per base revision&lt;/strong&gt; is the safe default. Research sometimes transfers adapters between related checkpoints; in production that is &lt;strong&gt;unforced risk&lt;/strong&gt;—small numerical differences in layernorms, embeddings, or RoPE implementation can accumulate into large behavior changes.&lt;/p&gt;
&lt;p&gt;If you must try transfer, treat it like a &lt;strong&gt;new model&lt;/strong&gt;: full eval gates, regression tests on golden prompts, and no “we eyeballed one benchmark screenshot.”&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;step-by-step-when-switching-families-qwen--llama&#34;&gt;Step-by-step when switching families (Qwen ↔ LLaMA)&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://blog.veydh.com/2026/2026-01-lora-and-small-distilled-models-primer/&#34;&gt;LoRA: Fine-Tuning Small Distilled Models Without Full Weights&lt;/a&gt; and &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-mistral-huggingface-lora-training-notes/&#34;&gt;Fine-Tuning a Mistral-Class Checkpoint with LoRA on Hugging Face&lt;/a&gt;’s &lt;strong&gt;PEFT&lt;/strong&gt; recipe stays; below are &lt;strong&gt;extra&lt;/strong&gt; gates per family.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tokenizer probe&lt;/strong&gt; — For each locale you ship, tokenize a &lt;strong&gt;fixed&lt;/strong&gt; paragraph of representative text. Record &lt;strong&gt;token count&lt;/strong&gt; and verify &lt;strong&gt;round-trip&lt;/strong&gt;: for string inputs, &lt;code&gt;decode(encode(x))&lt;/code&gt; should equal &lt;code&gt;x&lt;/code&gt; (or document intentional normalization). Multilingual bases punish “character budget” planning that assumed English token density.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Template probe&lt;/strong&gt; — Dump &lt;code&gt;tokenizer.apply_chat_template&lt;/code&gt; on one golden multi-turn example to a &lt;code&gt;.txt&lt;/code&gt; file in git; &lt;strong&gt;diff&lt;/strong&gt; that file in PRs that touch chat formatting. Qwen and LLaMA instruct variants use different &lt;strong&gt;role markers&lt;/strong&gt; and spacing; mixing templates means you train the model to expect the wrong framing at inference.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;RoPE / context flags&lt;/strong&gt; — Copy into your run manifest: &lt;code&gt;rope_scaling&lt;/code&gt;, &lt;code&gt;max_position_embeddings&lt;/code&gt;, and any other fields your &lt;strong&gt;training&lt;/strong&gt; and &lt;strong&gt;serving&lt;/strong&gt; stacks read from &lt;code&gt;config.json&lt;/code&gt;. Training with scaling &lt;strong&gt;A&lt;/strong&gt; and serving with scaling &lt;strong&gt;B&lt;/strong&gt; is a classic &lt;strong&gt;silent&lt;/strong&gt; quality regression.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Train short, eval long&lt;/strong&gt; — Run a short smoke job for plumbing; then evaluate at the &lt;strong&gt;maximum context&lt;/strong&gt; you intend to run in production (or the nearest affordable subset). Extrapolation bugs show up here, not in smoothed training loss.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Merge policy&lt;/strong&gt; — Decide early whether production loads &lt;strong&gt;&lt;code&gt;PeftModel&lt;/code&gt;&lt;/strong&gt; (base + adapter at runtime) or a &lt;strong&gt;merged&lt;/strong&gt; &lt;code&gt;save_pretrained&lt;/code&gt; directory. &lt;strong&gt;Merge&lt;/strong&gt; simplifies some C++ or single-binary servers; it complicates “swap adapter only” deploys and can blur audit trails if you lose the separation between base and delta.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
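&lt;p&gt;Gate 1 is easy to script. A minimal sketch of the &lt;strong&gt;tokenizer probe&lt;/strong&gt;, written against any object exposing &lt;code&gt;encode&lt;/code&gt;/&lt;code&gt;decode&lt;/code&gt;; the whitespace stub below is a stand-in so the snippet runs anywhere, and in practice you would pass a pinned &lt;code&gt;AutoTokenizer&lt;/code&gt; instead:&lt;/p&gt;

```python
# Illustrative probe: works with any object exposing encode/decode, e.g. a
# Hugging Face tokenizer loaded with AutoTokenizer.from_pretrained(...).
def probe(tokenizer, samples):
    """Per-sample token count plus whether decode(encode(x)) equals x."""
    report = []
    for text in samples:
        ids = tokenizer.encode(text)
        round_trip = tokenizer.decode(ids)
        report.append({
            "tokens": len(ids),
            "round_trip_exact": round_trip == text,
        })
    return report

# Trivial stand-in tokenizer so the probe runs without downloading weights.
class WhitespaceTokenizer:
    def encode(self, text):
        return text.split(" ")
    def decode(self, ids):
        return " ".join(ids)

result = probe(WhitespaceTokenizer(), ["bonjour le monde", "hello  world"])
print(result)
```

&lt;p&gt;Run it on the same fixed paragraph per locale and diff the token counts against the English baseline.&lt;/p&gt;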
&lt;p&gt;&lt;strong&gt;Optional merge&lt;/strong&gt; (when you explicitly want one directory of full weights):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; torch
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; transformers &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; AutoModelForCausalLM
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; peft &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; PeftModel
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;base &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; AutoModelForCausalLM&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;from_pretrained(BASE_ID, revision&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;BASE_REV, torch_dtype&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;torch&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;bfloat16, device_map&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;cpu&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;base &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; PeftModel&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;from_pretrained(base, &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;./lora-out&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;merged &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; base&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;merge_and_unload()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;merged&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;save_pretrained(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;./merged-model&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;tokenizer&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;save_pretrained(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;./merged-model&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Gotcha 10 — Re-evaluate after merge.&lt;/strong&gt;&lt;br&gt;
Run the &lt;strong&gt;same&lt;/strong&gt; eval suite on merged weights. Do not assume merge is bitwise-identical to “base + adapter” forward pass in every runtime without checking—numerical paths and optimizations can differ.&lt;/p&gt;
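&lt;p&gt;The merge itself is plain arithmetic: the merged weight equals the base plus the scaled low-rank delta, so on toy float64 matrices the single-matrix forward pass matches base-plus-adapter exactly. Real runtimes in bf16 with fused kernels need not match this cleanly, which is exactly why the re-evaluation matters:&lt;/p&gt;

```python
import numpy as np

# LoRA merge math on toy matrices; shapes and values are illustrative,
# not taken from any real checkpoint.
rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4
W = rng.normal(size=(d, d))   # base weight
A = rng.normal(size=(r, d))   # LoRA down-projection
B = rng.normal(size=(d, r))   # LoRA up-projection

# Merged weight: base plus scaled low-rank delta.
merged = W + (alpha / r) * (B @ A)

# Base-plus-adapter forward pass on a toy input matches the merged matrix.
x = rng.normal(size=(d,))
adapter_path = W @ x + (alpha / r) * (B @ (A @ x))
assert np.allclose(merged @ x, adapter_path)
```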
&lt;hr&gt;
&lt;h2 id=&#34;hub--tooling-matrix-same-training-different-ergonomics&#34;&gt;Hub / tooling matrix (same training, different ergonomics)&lt;/h2&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Workflow&lt;/th&gt;
          &lt;th&gt;Tooling&lt;/th&gt;
          &lt;th&gt;When to use it&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Notebook → script&lt;/td&gt;
          &lt;td&gt;Transformers + PEFT + &lt;strong&gt;TRL&lt;/strong&gt; (Transformer Reinforcement Learning library—helpers for SFT/RLHF-style training on HF)&lt;/td&gt;
          &lt;td&gt;Small teams, full control&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;YAML-driven SFT&lt;/td&gt;
          &lt;td&gt;LLaMA-Factory, Axolotl&lt;/td&gt;
          &lt;td&gt;Faster iteration when configs are shared&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Managed fine-tune&lt;/td&gt;
          &lt;td&gt;Cloud vendor jobs&lt;/td&gt;
          &lt;td&gt;When compliance wants a hosted trail; still export adapter + manifest&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The &lt;strong&gt;reproducibility&lt;/strong&gt; invariant is the same: pinned &lt;strong&gt;base&lt;/strong&gt; revision, frozen dataset hash, logged &lt;code&gt;lora_config&lt;/code&gt;, &lt;strong&gt;identical&lt;/strong&gt; chat template in train and serve.&lt;/p&gt;
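&lt;p&gt;That invariant can be as small as one JSON file per run; the field names below are illustrative, not a standard schema:&lt;/p&gt;

```python
import hashlib
import json

# Hypothetical run manifest capturing the invariants named above.
def dataset_hash(files):
    """Stable digest over sorted filenames plus contents."""
    h = hashlib.sha256()
    for name in sorted(files):
        h.update(name.encode())
        h.update(files[name])
    return h.hexdigest()

manifest = {
    "base_id": "mistralai/Mistral-7B-Instruct-v0.3",
    "base_revision": "deadbeef",          # placeholder: pin a commit, never a branch
    "dataset_sha256": dataset_hash({"train.jsonl": b"file-bytes"}),
    "lora_config": {"r": 16, "lora_alpha": 32,
                    "target_modules": ["q_proj", "v_proj"]},
    "chat_template_sha256": hashlib.sha256(b"template-dump").hexdigest(),
}
print(json.dumps(manifest, indent=2))
```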
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In one sentence: &lt;strong&gt;family-specific tokenizer and RoPE (and attention layout) details&lt;/strong&gt; decide whether your fine-tune is stable—or fragile in ways that only show up under real prompts and real languages. &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-llm-serving-apis-openai-compatible/&#34;&gt;Serving Distilled Models Behind an HTTP API&lt;/a&gt; moves from training to &lt;strong&gt;serving&lt;/strong&gt;: HTTP boundaries, OpenAI-compatible APIs, and runtime choices once the adapter exists.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Serving Distilled Models Behind an HTTP API</title>
      <link>https://blog.veydh.com/2026/2026-02-llm-serving-apis-openai-compatible/</link>
      <pubDate>Mon, 09 Feb 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-02-llm-serving-apis-openai-compatible/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;Training a LoRA is satisfying (especially once outputs start leaning toward the behavior you&amp;rsquo;re after); &lt;strong&gt;serving&lt;/strong&gt; it is where software engineering returns. Clients should see stable JSON, explicit errors, and predictable latency—not a Jupyter cell calling &lt;code&gt;model.generate&lt;/code&gt;. The HTTP layer is a &lt;strong&gt;contract&lt;/strong&gt;: versioning, observability, and back-pressure belong there, not buried inside CUDA kernels.&lt;/p&gt;
&lt;p&gt;This post frames &lt;strong&gt;OpenAI-compatible&lt;/strong&gt; surfaces and the runtimes behind them, in the same spirit as the interop and API-design posts from 2024.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;why-openai-compatible-routes-help&#34;&gt;Why OpenAI-compatible routes help&lt;/h2&gt;
&lt;p&gt;Exposing &lt;code&gt;/v1/chat/completions&lt;/code&gt; (or a close cousin) means existing SDKs, proxies, and internal tools often &lt;strong&gt;drop in&lt;/strong&gt; without a rewrite. You still document your own limits—max context, supported modalities, tool-calling quirks—but you avoid inventing yet another message schema for every new service.&lt;/p&gt;
&lt;p&gt;Compatibility is not perfection. Deltas should be documented explicitly (streaming format, stop sequences, tool JSON) so client teams do not rely on undocumented behavior.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;runtime-choices-vllm-tgi-llamacpp-and-friends&#34;&gt;Runtime choices: vLLM, TGI, llama.cpp, and friends&lt;/h2&gt;
&lt;p&gt;The right engine depends on constraints more than on hype:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multi-GPU, high throughput, Python ecosystem&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;vLLM&lt;/strong&gt; and similar servers emphasize batching, PagedAttention-style memory use, and throughput. Good when you have NVIDIA hardware and want many concurrent sessions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hugging Face TGI&lt;/strong&gt;&lt;br&gt;
Another solid path when you want tight integration with Hub workflows and well-traveled Docker images.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;CPU or edge, smaller models&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;llama.cpp&lt;/strong&gt;-family servers shine when GPUs are scarce or you need a portable binary. Latency per token differs; total cost of ownership may still win.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The choice rests on &lt;strong&gt;SLO&lt;/strong&gt;, &lt;strong&gt;hardware&lt;/strong&gt;, and &lt;strong&gt;team familiarity&lt;/strong&gt;—not benchmark leaderboard bragging rights alone.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;operations-health-versions-request-ids&#34;&gt;Operations: health, versions, request IDs&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Health checks&lt;/strong&gt; should validate more than “process is up”: load the tokenizer, warm the model, and optionally run a tiny forward pass in staging. &lt;strong&gt;Model version&lt;/strong&gt; belongs in response headers or metadata so clients can log which revision answered.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Request IDs&lt;/strong&gt; propagate through auth, rate limiting, inference, and logging—the same discipline as wiring gRPC and HTTP across Python and Java services. When something regresses, one string traces the path.&lt;/p&gt;
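&lt;p&gt;One framework-agnostic way to thread a request ID through those layers is a context variable; the layer names and log format here are illustrative:&lt;/p&gt;

```python
import contextvars
import uuid

# Sketch: one request ID flowing through auth, inference, and logging via a
# context variable, so every log line carries the same string.
request_id = contextvars.ContextVar("request_id", default="-")

def log(msg):
    print(f"[{request_id.get()}] {msg}")

def handle(prompt):
    token = request_id.set(uuid.uuid4().hex[:8])
    try:
        log("auth ok")
        log(f"inference start ({len(prompt)} chars)")
        log("response sent")
    finally:
        request_id.reset(token)   # restore the default outside the request

handle("Say OK.")
```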
&lt;hr&gt;
&lt;h2 id=&#34;cost-control-restricting-token-usage&#34;&gt;Cost control: restricting token usage&lt;/h2&gt;
&lt;p&gt;Cost tracks &lt;strong&gt;tokens processed&lt;/strong&gt; (prompt + completion) and &lt;strong&gt;how often&lt;/strong&gt; you run inference. To manage spend you need &lt;strong&gt;hard limits&lt;/strong&gt; clients cannot bypass, &lt;strong&gt;visibility&lt;/strong&gt; into who burned what, and &lt;strong&gt;fairness&lt;/strong&gt; so one integration cannot exhaust the pool.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Cap generation length server-side.&lt;/strong&gt; OpenAI-shaped APIs accept &lt;code&gt;max_tokens&lt;/code&gt; (or &lt;code&gt;max_completion_tokens&lt;/code&gt;) on the request; that only helps if you &lt;strong&gt;enforce a ceiling&lt;/strong&gt;: ignore values above your tier’s cap, or clamp them downward before the request reaches the engine. Otherwise a buggy or malicious client can ask for tens of thousands of new tokens per call. The engine should still return a clear &lt;strong&gt;finish reason&lt;/strong&gt; when generation stops on length.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Bound the context window at the runtime.&lt;/strong&gt; Engines expose a maximum sequence length (e.g. vLLM’s &lt;code&gt;--max-model-len&lt;/code&gt;, llama.cpp’s &lt;code&gt;-c&lt;/code&gt;). That limits &lt;strong&gt;worst-case&lt;/strong&gt; memory and per-request work. It does not replace &lt;strong&gt;policy&lt;/strong&gt;: you still want pre-flight checks (same tokenizer as the model) so oversize prompts fail fast with a 4xx instead of tying up a slot—see &lt;a href=&#34;https://blog.veydh.com/2026/2026-03-token-limits-budgets-and-truncation/&#34;&gt;Token Budgets: Enforcing Limits at the API Layer&lt;/a&gt; for counting, truncation, and streaming semantics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Rate-limit and quota by tenant or key.&lt;/strong&gt; Tokens per minute (TPM) and requests per minute (RPM), optionally combined with daily or monthly &lt;strong&gt;budgets&lt;/strong&gt;, belong in the &lt;strong&gt;gateway or auth layer&lt;/strong&gt; (API gateway, Envoy, Kong, or a thin BFF) keyed off API keys or JWT claims. That caps &lt;strong&gt;aggregate&lt;/strong&gt; burn even when each individual request is small.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4. Meter and alert.&lt;/strong&gt; Log or export &lt;code&gt;prompt_tokens&lt;/code&gt;, &lt;code&gt;completion_tokens&lt;/code&gt;, and &lt;strong&gt;total&lt;/strong&gt; per request ID and tenant; chart burn rate and set alerts before the bill surprises anyone. Chargeback by team or product line turns “model cost” from a black box into an engineering lever.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5. Operational knobs that are not a substitute for policy.&lt;/strong&gt; Continuous batching and high concurrency improve &lt;strong&gt;throughput&lt;/strong&gt;; they do not stop runaway prompts. Default &lt;strong&gt;low&lt;/strong&gt; &lt;code&gt;max_tokens&lt;/code&gt; for anonymous or trial tiers; require review or higher entitlements for large completions.&lt;/p&gt;
&lt;p&gt;In short: &lt;strong&gt;runtime&lt;/strong&gt; caps define physics (max length, VRAM); &lt;strong&gt;API/gateway&lt;/strong&gt; caps define &lt;strong&gt;product and finance&lt;/strong&gt; (per-call generation limit, per-tenant TPM, monthly budget). You want both.&lt;/p&gt;
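&lt;p&gt;The per-call half of that policy is a few lines at the gateway. A sketch of clamping &lt;code&gt;max_tokens&lt;/code&gt; to a tier ceiling before the body reaches the engine; the tier names and caps are assumptions, not a standard:&lt;/p&gt;

```python
# Illustrative gateway-side clamp for an OpenAI-shaped request body.
# Tier names and caps are assumptions; tune them to your product.
TIER_CAPS = {"trial": 128, "standard": 1024, "internal": 4096}

def clamp_generation(body, tier):
    cap = TIER_CAPS.get(tier, TIER_CAPS["trial"])
    requested = body.get("max_tokens") or cap   # missing field defaults to the cap
    clamped = dict(body)                        # never mutate the caller's dict
    clamped["max_tokens"] = min(requested, cap)
    return clamped

req = {"model": "m", "messages": [], "max_tokens": 50000}
print(clamp_generation(req, "trial"))   # max_tokens clamped to the trial cap
```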
&lt;hr&gt;
&lt;h2 id=&#34;concrete-runtimes-commands-and-trade-offs&#34;&gt;Concrete runtimes: commands and trade-offs&lt;/h2&gt;
&lt;p&gt;These are &lt;strong&gt;illustrative&lt;/strong&gt;; pin container tags and model revisions in your own manifests.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;vLLM (multi-session GPU throughput)&lt;/strong&gt; — OpenAI-compatible server, good when you already have CUDA and want continuous batching:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# Example only: replace model path and image tag with pinned versions&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# --max-model-len caps context (and thus per-request upper bound on work)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;docker run --gpus all -p 8000:8000 &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  vllm/vllm-openai:latest &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  --model mistralai/Mistral-7B-Instruct-v0.3 &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  --dtype bfloat16 &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  --max-model-len &lt;span style=&#34;color:#ae81ff&#34;&gt;8192&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;TGI (Text Generation Inference)&lt;/strong&gt; — Strong Hub/Docker story; similar niche to vLLM for transformer serving behind HTTP.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;llama.cpp server&lt;/strong&gt; — CPU or smaller GPUs; serve the same quantization you evaluated (&lt;code&gt;Q4_K_M&lt;/code&gt;, etc.):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;./server -m ./models/model.gguf --port &lt;span style=&#34;color:#ae81ff&#34;&gt;8080&lt;/span&gt; -c &lt;span style=&#34;color:#ae81ff&#34;&gt;4096&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Runtime choice:&lt;/strong&gt; &lt;strong&gt;vLLM/TGI&lt;/strong&gt; when concurrent GPU sessions and batching dominate SLOs; &lt;strong&gt;llama.cpp&lt;/strong&gt; when footprint, CPU-only, or edge packaging matters more than peak tokens/sec. &lt;strong&gt;Always&lt;/strong&gt; re-benchmark &lt;strong&gt;your&lt;/strong&gt; adapter + prompt template on the chosen runtime—kernel and quant interact with LoRA paths.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;minimal-openai-shaped-curl-contract-test&#34;&gt;Minimal OpenAI-shaped &lt;code&gt;curl&lt;/code&gt; contract test&lt;/h2&gt;
&lt;p&gt;After deploy, hit &lt;strong&gt;non-streaming&lt;/strong&gt; chat with a frozen prompt and compare &lt;strong&gt;SHA&lt;/strong&gt; of normalized JSON (strip volatile fields like &lt;code&gt;id&lt;/code&gt; timestamps if needed):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;curl -s http://localhost:8000/v1/chat/completions &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  -H &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Content-Type: application/json&amp;#34;&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  -d &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;{&amp;#34;model&amp;#34;:&amp;#34;mistralai/Mistral-7B-Instruct-v0.3&amp;#34;,&amp;#34;messages&amp;#34;:[{&amp;#34;role&amp;#34;:&amp;#34;user&amp;#34;,&amp;#34;content&amp;#34;:&amp;#34;Say OK.&amp;#34;}],&amp;#34;max_tokens&amp;#34;:8,&amp;#34;temperature&amp;#34;:0}&amp;#39;&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  | jq .
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Check &lt;code&gt;choices[0].finish_reason&lt;/code&gt;, token usage fields, and error schema for oversize prompts—those are the surfaces &lt;a href=&#34;https://blog.veydh.com/2026/2026-03-token-limits-budgets-and-truncation/&#34;&gt;Token Budgets: Enforcing Limits at the API Layer&lt;/a&gt; tightens.&lt;/p&gt;
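&lt;p&gt;The normalization step can live in a tiny helper: drop volatile fields, serialize canonically, hash. The field names beyond &lt;code&gt;id&lt;/code&gt; are assumptions about your response schema:&lt;/p&gt;

```python
import hashlib
import json

# Illustrative normalizer for contract tests: two responses that differ only
# in volatile fields should hash identically.
VOLATILE = {"id", "created"}

def normalized_sha(response):
    stable = {k: v for k, v in response.items() if k not in VOLATILE}
    blob = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()

a = {"id": "cmpl-1", "created": 1, "choices": [{"finish_reason": "stop"}]}
b = {"id": "cmpl-2", "created": 2, "choices": [{"finish_reason": "stop"}]}
assert normalized_sha(a) == normalized_sha(b)
```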
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This post closes the loop from “weights on disk” to “service behind a load balancer,” including &lt;strong&gt;where&lt;/strong&gt; to enforce token and rate limits so costs stay predictable: ceilings on completion length and context at the engine, plus TPM/RPM and budgets at the gateway. &lt;a href=&#34;https://blog.veydh.com/2026/2026-03-token-limits-budgets-and-truncation/&#34;&gt;Token Budgets: Enforcing Limits at the API Layer&lt;/a&gt; goes deeper on &lt;strong&gt;pre-flight counting&lt;/strong&gt;, truncation strategies, and client contracts—stack it with the serving layer you run here.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>RAG Foundations: Embeddings, Chunking, and the Retrieval Loop</title>
      <link>https://blog.veydh.com/2026/2026-01-rag-foundations-embeddings-and-retrieval/</link>
      <pubDate>Wed, 14 Jan 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-01-rag-foundations-embeddings-and-retrieval/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Retrieval-Augmented Generation&lt;/strong&gt; is the pattern everyone names when they want the model to cite &lt;em&gt;your&lt;/em&gt; PDFs, tickets, or runbooks instead of improvising from weights alone. It is less like a single model feature and more like a small &lt;strong&gt;data product&lt;/strong&gt;: you ingest documents, turn them into searchable units, pull the best candidates at query time, and only then hand text to the generator.&lt;/p&gt;
&lt;p&gt;This month locks in vocabulary and failure modes before the later posts go deeper on re-ranking, evaluation, and the distilled-model stack paired with RAG through the rest of 2026.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;the-core-loop-end-to-end&#34;&gt;The core loop, end to end&lt;/h2&gt;
&lt;p&gt;The happy path looks boring on purpose—that is a good sign.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ingest&lt;/strong&gt;&lt;br&gt;
You load sources (HTML, Markdown, PDF text, Slack exports, whatever). Normalization matters: strip boilerplate headers, fix encoding, and keep enough metadata (source URL, section, &lt;code&gt;updated_at&lt;/code&gt;) that you can filter later.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Chunk&lt;/strong&gt;&lt;br&gt;
Models do not read whole repositories in one gulp. You split text into overlapping segments—classically hundreds to low thousands of &lt;strong&gt;tokens&lt;/strong&gt; per chunk, with &lt;strong&gt;overlap&lt;/strong&gt; so sentences at boundaries are not orphaned. Tables and lists punish naive “split every N characters” strategies; structure-aware splitting is preferable when the format allows it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Embed&lt;/strong&gt;&lt;br&gt;
An embedding model maps each chunk to a dense vector. At query time you embed the user question (some stacks use &lt;strong&gt;asymmetric&lt;/strong&gt; encoders—one flavor for passages, one for questions—while many use a single model for both).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Retrieve&lt;/strong&gt;&lt;br&gt;
You score candidates with cosine similarity, dot product (equivalent to cosine once vectors are unit-normalized), or a hybrid of &lt;strong&gt;BM25 + vectors&lt;/strong&gt;. You return &lt;strong&gt;top-k&lt;/strong&gt; chunks, optionally after metadata filters (tenant, product line, language).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Generate&lt;/strong&gt;&lt;br&gt;
The LLM receives a system or developer instruction, the retrieved passages in a clearly delimited block, and the user message. The model’s job is to &lt;strong&gt;ground&lt;/strong&gt; the answer in those passages while refusing when nothing supports a claim.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;That loop is simple to diagram and surprisingly easy to get subtly wrong in production.&lt;/p&gt;
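&lt;p&gt;Step 4 in miniature, with random vectors standing in for real embeddings: normalize, take dot products (cosine on unit vectors), return the top-k indices:&lt;/p&gt;

```python
import numpy as np

# Toy retrieval scoring; the vectors are random stand-ins for real embeddings.
rng = np.random.default_rng(7)
chunks = rng.normal(size=(5, 16))   # 5 chunk vectors, 16 dims each
query = rng.normal(size=(16,))

def cosine_top_k(q, m, k):
    m_n = m / np.linalg.norm(m, axis=1, keepdims=True)
    q_n = q / np.linalg.norm(q)
    scores = m_n @ q_n               # dot product of unit vectors is cosine
    order = np.argsort(-scores)[:k]  # highest scores first
    return list(zip(order.tolist(), scores[order].tolist()))

top = cosine_top_k(query, chunks, 2)
print(top)
```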
&lt;hr&gt;
&lt;h2 id=&#34;chunking-where-the-most-silent-damage-shows-up&#34;&gt;Chunking: where the most silent damage shows up&lt;/h2&gt;
&lt;p&gt;If chunks are too large, you dilute relevance—the vector averages over too many topics and retrieval becomes fuzzy. If chunks are too small, you lose definitions and antecedents (“it”, “the policy”) that only make sense in a wider window.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Overlap&lt;/strong&gt; is not waste; it is insurance that a sentence sliced in half still appears whole in at least one neighbor chunk. Overlap percentage is a tunable knob alongside chunk size, not a fixed constant from a tutorial.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tables&lt;/strong&gt; deserve a callout. Row-wise splitting can destroy joins across columns; sometimes keeping a whole small table in one chunk is better than perfect token parity. When tables are huge, consider extracting them to a structured store and retrieving them by key while still RAG-ing the narrative docs.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;embeddings-and-the-vector-store&#34;&gt;Embeddings and the vector store&lt;/h2&gt;
&lt;p&gt;The embedding model defines the geometry of your search. A model trained for &lt;strong&gt;semantic&lt;/strong&gt; similarity helps with paraphrases; a weaker or badly matched embedder sends the generator irrelevant paragraphs that &lt;em&gt;look&lt;/em&gt; related in keyword space only.&lt;/p&gt;
&lt;p&gt;Vector databases and libraries differ in filtering, hybrid search, quantization, and ops maturity. For early work, &lt;strong&gt;reproducibility&lt;/strong&gt; (fixed model revision, recorded chunking parameters) matters more than shaving microseconds off single-query latency. Once traffic grows, batching, caching query embeddings, and separating hot tenants matter.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;pitfalls-to-plan-for-explicitly&#34;&gt;Pitfalls to plan for explicitly&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Stale corpora&lt;/strong&gt;&lt;br&gt;
If the retrieval index lags the truth, the model confidently cites outdated policy. Version your index builds and expose “as of” metadata when users need it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lost in the middle&lt;/strong&gt;&lt;br&gt;
Stuffing ten long chunks into the context window does not mean the model weighs them evenly. Keep k modest, summarize long threads, or use re-ranking (&lt;a href=&#34;https://blog.veydh.com/2026/2026-01-rag-operations-re-ranking-and-evaluation/&#34;&gt;RAG in Production: Re-ranking, HyDE, and Simple Evals&lt;/a&gt;) to present fewer, sharper passages.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Evaluation theater&lt;/strong&gt;&lt;br&gt;
Eyeballing answers in a notebook is not a test harness. Hold out question–answer pairs, measure whether the &lt;em&gt;right&lt;/em&gt; chunk would be retrieved, and track regressions when you change embedder or chunking.&lt;/p&gt;
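&lt;p&gt;Even the smallest harness beats eyeballing. A hit-rate-at-k sketch over held-out pairs; the chunk IDs are illustrative:&lt;/p&gt;

```python
# Minimal retrieval eval: fraction of (gold_chunk, retrieved_list) pairs where
# the gold chunk appears in the top-k. Data below is illustrative.
def hit_rate_at_k(results, k):
    hits = sum(1 for gold, retrieved in results if gold in retrieved[:k])
    return hits / len(results)

eval_set = [
    ("chunk-policy-3",  ["chunk-policy-3", "chunk-intro-1"]),
    ("chunk-refunds-7", ["chunk-intro-1", "chunk-refunds-7"]),
    ("chunk-sla-2",     ["chunk-intro-1", "chunk-policy-3"]),
]
print(hit_rate_at_k(eval_set, 2))   # 2 of 3 gold chunks appear in the top-2
```

&lt;p&gt;Track this number across embedder and chunking changes; a silent drop here is exactly the regression eyeballing misses.&lt;/p&gt;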
&lt;hr&gt;
&lt;h2 id=&#34;tooling-map-how-teams-actually-wire-the-loop&#34;&gt;Tooling map: how teams actually wire the loop&lt;/h2&gt;
&lt;p&gt;The loop above is runtime-agnostic. In practice you glue &lt;strong&gt;ingest → chunk → embed → index → query → generate&lt;/strong&gt; with one of these families (often mixed):&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Layer&lt;/th&gt;
          &lt;th&gt;Representative tools&lt;/th&gt;
          &lt;th&gt;What you trade&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Orchestration&lt;/td&gt;
          &lt;td&gt;&lt;a href=&#34;https://python.langchain.com/&#34;&gt;LangChain&lt;/a&gt;, &lt;a href=&#34;https://www.llamaindex.ai/&#34;&gt;LlamaIndex&lt;/a&gt;, &lt;a href=&#34;https://haystack.deepset.ai/&#34;&gt;Haystack&lt;/a&gt;, custom Python&lt;/td&gt;
          &lt;td&gt;LangChain/LlamaIndex ship fast demos and integrations; custom code minimizes magic and keeps dependencies thin.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Chunking&lt;/td&gt;
          &lt;td&gt;Framework splitters, &lt;a href=&#34;https://unstructured.io/&#34;&gt;Unstructured&lt;/a&gt; for messy PDFs, &lt;code&gt;langchain_text_splitters&lt;/code&gt;, hand-rolled Markdown headers&lt;/td&gt;
          &lt;td&gt;PDFs/HTML reward structure-aware parsers; naive &lt;code&gt;CharacterTextSplitter&lt;/code&gt; is fine only for clean text.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Embeddings&lt;/td&gt;
          &lt;td&gt;&lt;a href=&#34;https://www.sbert.net/&#34;&gt;sentence-transformers&lt;/a&gt;, OpenAI/Cohere/Voyage APIs, models on the Hub&lt;/td&gt;
          &lt;td&gt;APIs = ops simplicity + cost; self-hosted = reproducibility + no egress; pin &lt;strong&gt;model revision&lt;/strong&gt;.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Vector index&lt;/td&gt;
          &lt;td&gt;&lt;a href=&#34;https://github.com/facebookresearch/faiss&#34;&gt;FAISS&lt;/a&gt; (in-process), &lt;a href=&#34;https://www.trychroma.com/&#34;&gt;Chroma&lt;/a&gt;, &lt;a href=&#34;https://qdrant.tech/&#34;&gt;Qdrant&lt;/a&gt;, &lt;a href=&#34;https://github.com/pgvector/pgvector&#34;&gt;pgvector&lt;/a&gt;, managed cloud DBs&lt;/td&gt;
          &lt;td&gt;FAISS/Chroma for prototypes; &lt;strong&gt;pgvector&lt;/strong&gt; when you already own Postgres and want SQL + metadata filters; Qdrant/Milvus when you outgrow a single node.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Lexical / hybrid&lt;/td&gt;
          &lt;td&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Okapi_BM25&#34;&gt;BM25&lt;/a&gt; via Elasticsearch/OpenSearch, built-in hybrid in some vector DBs&lt;/td&gt;
          &lt;td&gt;Pure vectors miss exact SKUs and codes; hybrid fixes “semantic near-miss” failures.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Generation&lt;/td&gt;
          &lt;td&gt;Any chat API or vLLM/TGI self-host (see &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-llm-serving-apis-openai-compatible/&#34;&gt;Serving Distilled Models Behind an HTTP API&lt;/a&gt;)&lt;/td&gt;
          &lt;td&gt;Same RAG stack; swap generator without re-indexing if prompts stay stable.&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Pros / cons in one sentence each:&lt;/strong&gt; &lt;em&gt;Framework-first&lt;/em&gt; (LangChain/LlamaIndex) accelerates time-to-demo but can hide tokenizer/chunk bugs until production. &lt;em&gt;Vector-in-Postgres&lt;/em&gt; reduces moving parts for teams that already run Postgres and need row-level security. &lt;em&gt;Separate vector DB&lt;/em&gt; scales ANN search and hybrid features but adds another backup and monitoring target.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;minimal-reproducible-path-self-hosted-embed--local-index&#34;&gt;Minimal reproducible path (self-hosted embed + local index)&lt;/h2&gt;
&lt;p&gt;Goal: one machine, fixed dependency versions, same chunks every run.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Environment&lt;/strong&gt; — Python 3.11+, &lt;code&gt;pip install chromadb sentence-transformers&lt;/code&gt; (pin versions in &lt;code&gt;requirements.txt&lt;/code&gt;; record &lt;code&gt;pip freeze&lt;/code&gt; hash in your ADR). For header-aware splitting later, add &lt;code&gt;langchain-text-splitters&lt;/code&gt; or Unstructured and &lt;strong&gt;record the extra dependency&lt;/strong&gt; as part of the index version.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Corpus&lt;/strong&gt; — Start with a folder of &lt;code&gt;.md&lt;/code&gt; or &lt;code&gt;.txt&lt;/code&gt; files; later swap in Unstructured for PDFs. Record file list + checksums.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunk&lt;/strong&gt; — Fixed &lt;code&gt;chunk_size&lt;/code&gt; (tokens or characters), fixed &lt;code&gt;chunk_overlap&lt;/code&gt;, single splitter implementation. Log the parameters beside the index name.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Embed&lt;/strong&gt; — Load one Hub model id with &lt;code&gt;revision=&lt;/code&gt;; embed documents in batches; persist the Chroma collection name + embed model id.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retrieve&lt;/strong&gt; — For each eval question, &lt;code&gt;collection.query(query_texts=[q], n_results=k)&lt;/code&gt;; log returned chunk ids + scores.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generate&lt;/strong&gt; — Build one prompt template: system instruction + &lt;code&gt;---&lt;/code&gt; delimited passages + user question. Pin chat model &lt;strong&gt;revision&lt;/strong&gt; the same way as the embedder.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A compact indexing sketch (adapt names and paths):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# index_build.py — illustrative; pin package versions in production&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; chromadb
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; chromadb.utils &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; embedding_functions
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;CHUNK_SIZE &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;800&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;CHUNK_OVERLAP &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;120&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;chunk_text&lt;/span&gt;(text: str, size: int &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; CHUNK_SIZE, overlap: int &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; CHUNK_OVERLAP) &lt;span style=&#34;color:#f92672&#34;&gt;-&amp;gt;&lt;/span&gt; list[str]:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&amp;#34;&amp;#34;Character-based chunks; swap for structure-aware splitters when format allows.&amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    chunks: list[str] &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; []
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    start &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    n &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; len(text)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;while&lt;/span&gt; start &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;&lt;/span&gt; n:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        end &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; min(start &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; size, n)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        chunks&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;append(text[start:end])
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; end &lt;span style=&#34;color:#f92672&#34;&gt;==&lt;/span&gt; n:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#66d9ef&#34;&gt;break&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        start &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; max(end &lt;span style=&#34;color:#f92672&#34;&gt;-&lt;/span&gt; overlap, start &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;return&lt;/span&gt; chunks
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;load_docs&lt;/span&gt;(paths: list[str]) &lt;span style=&#34;color:#f92672&#34;&gt;-&amp;gt;&lt;/span&gt; list[str]:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    chunks: list[str] &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; []
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; p &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; paths:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        text &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; open(p, encoding&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;utf-8&amp;#34;&lt;/span&gt;)&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;read()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        chunks&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;extend(chunk_text(text))
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;return&lt;/span&gt; chunks
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;main&lt;/span&gt;():
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    ef &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; embedding_functions&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;SentenceTransformerEmbeddingFunction(
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        model_name&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;sentence-transformers/all-MiniLM-L6-v2&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    )
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    client &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; chromadb&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;PersistentClient(path&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;./chroma_db&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    col &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; client&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;get_or_create_collection(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;docs&amp;#34;&lt;/span&gt;, embedding_function&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;ef)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    chunks &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; load_docs([&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;./corpus/a.md&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;./corpus/b.md&amp;#34;&lt;/span&gt;])
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    ids &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; [&lt;span style=&#34;color:#e6db74&#34;&gt;f&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;chunk-&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;{&lt;/span&gt;i&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&lt;/span&gt; &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; i &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; range(len(chunks))]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    col&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;add(ids&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;ids, documents&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;chunks)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;indexed&amp;#34;&lt;/span&gt;, len(chunks), &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;chunks&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; __name__ &lt;span style=&#34;color:#f92672&#34;&gt;==&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;__main__&amp;#34;&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    main()
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;What “reproducible” means here:&lt;/strong&gt; anyone with the same corpus bytes, splitter parameters, embedder revision, and index path gets the same retrieval ids for the same queries. Change any of those without a version bump and you should expect eval drift.&lt;/p&gt;
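&lt;p&gt;The &lt;strong&gt;Retrieve&lt;/strong&gt; and &lt;strong&gt;Generate&lt;/strong&gt; steps can be sketched the same way. A minimal prompt assembler for the template described above (the instruction wording and helper names are assumptions, not a fixed API):&lt;/p&gt;

```python
# rag_prompt.py -- illustrative companion to index_build.py.
# The system instruction wording and variable names are assumptions.

def build_prompt(passages: list[str], question: str) -> str:
    """System instruction + ----delimited passages + user question."""
    context = "\n---\n".join(passages)
    return (
        "Answer using only the passages below. "
        "If the answer is not present, say so.\n---\n"
        f"{context}\n---\nQuestion: {question}"
    )

# Retrieval side (sketch), using the collection from index_build.py:
#   res = col.query(query_texts=[question], n_results=k)
#   passages = res["documents"][0]   # log res["ids"][0] beside the answer
```

&lt;p&gt;Logging the returned chunk ids next to each generated answer is what makes the eval drift described above detectable at all.&lt;/p&gt;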
&lt;hr&gt;
&lt;h2 id=&#34;when-to-add-hybrid-search-or-metadata-first&#34;&gt;When to add hybrid search or metadata first&lt;/h2&gt;
&lt;p&gt;If users query &lt;strong&gt;IDs, error codes, or proper nouns&lt;/strong&gt;, add &lt;strong&gt;BM25 or hybrid&lt;/strong&gt; before chasing a larger embedder. If you have &lt;strong&gt;tenant_id&lt;/strong&gt; or &lt;strong&gt;product_line&lt;/strong&gt;, add &lt;strong&gt;metadata filters&lt;/strong&gt; on the vector store before raising &lt;code&gt;k&lt;/code&gt;. Both are cheaper than swapping a 7B generator to “fix” bad retrieval.&lt;/p&gt;
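&lt;p&gt;A cheap hybrid baseline is &lt;strong&gt;reciprocal rank fusion&lt;/strong&gt; over the BM25 and vector rankings. A minimal sketch, assuming you already have two ranked lists of chunk ids:&lt;/p&gt;

```python
# Reciprocal rank fusion (RRF): merge rankings without comparing raw scores.
# Sketch only; k=60 is the commonly cited default, not a tuned value.

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each id scores sum(1 / (k + rank)) across the input rankings."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

&lt;p&gt;Because RRF works on ranks rather than scores, it sidesteps the problem that BM25 scores and cosine distances live on incomparable scales.&lt;/p&gt;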
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;RAG is &lt;strong&gt;systems engineering&lt;/strong&gt;: ingestion quality, chunk boundaries, and retrieval metrics matter as much as the chat model’s charisma. The distilled LLMs fine-tuned later in this series only shine when the passages they read are the right passages.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://blog.veydh.com/2026/2026-01-rag-operations-re-ranking-and-evaluation/&#34;&gt;RAG in Production: Re-ranking, HyDE, and Simple Evals&lt;/a&gt; picks up &lt;strong&gt;re-ranking, query reformulation, and lightweight evaluation&lt;/strong&gt; so naive top-k does not become the ceiling.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>RAG in Production: Re-ranking, HyDE, and Simple Evals</title>
      <link>https://blog.veydh.com/2026/2026-01-rag-operations-re-ranking-and-evaluation/</link>
      <pubDate>Sun, 11 Jan 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-01-rag-operations-re-ranking-and-evaluation/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;Naive &lt;strong&gt;top-k&lt;/strong&gt; vector search is a decent demo and a fragile production strategy. Users paraphrase; documents use different jargon; multi-hop questions need facts that never co-occur in a single chunk. When the backend is a &lt;strong&gt;smaller distilled&lt;/strong&gt; model (Mistral-class, Qwen, LLaMA derivatives), the generator is less able to “paper over” bad retrieval with parametric knowledge—so the retrieval layer has to pull its weight.&lt;/p&gt;
&lt;p&gt;This post collects upgrades worth reaching for &lt;strong&gt;before&lt;/strong&gt; touching fancier graph RAG or agent loops. They pair well with the foundations in &lt;a href=&#34;https://blog.veydh.com/2026/2026-01-rag-foundations-embeddings-and-retrieval/&#34;&gt;RAG Foundations: Embeddings, Chunking, and the Retrieval Loop&lt;/a&gt; and with the LoRA/serving posts that follow (&lt;a href=&#34;https://blog.veydh.com/2026/2026-01-lora-and-small-distilled-models-primer/&#34;&gt;LoRA: Fine-Tuning Small Distilled Models Without Full Weights&lt;/a&gt;, &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-llm-serving-apis-openai-compatible/&#34;&gt;Serving Distilled Models Behind an HTTP API&lt;/a&gt;).&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;re-ranking-a-second-opinion-on-query-passage-pairs&#34;&gt;Re-ranking: a second opinion on (query, passage) pairs&lt;/h2&gt;
&lt;p&gt;Bi-encoders (your embedding model) score candidates cheaply because query and document vectors are precomputed. &lt;strong&gt;Cross-encoders&lt;/strong&gt; and learned &lt;strong&gt;re-rankers&lt;/strong&gt; jointly attend to query and text; they are heavier per pair but far sharper at deciding whether a chunk truly answers the question.&lt;/p&gt;
&lt;p&gt;In production the usual pattern is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Retrieve a &lt;strong&gt;wider&lt;/strong&gt; candidate set (e.g. 50–200) with fast vectors.&lt;/li&gt;
&lt;li&gt;Re-rank down to &lt;strong&gt;3–10&lt;/strong&gt; with a cross-encoder or a small reranker model.&lt;/li&gt;
&lt;li&gt;Pass only that short list to the LLM.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The latency trade-off is real. Watch &lt;strong&gt;p95&lt;/strong&gt; end-to-end latency; in practice, reranking is often limited to “high value” intents or gated behind a cheap ambiguity classifier.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;query-expansion-and-hyde&#34;&gt;Query expansion and HyDE&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Query expansion&lt;/strong&gt; adds terms or sub-questions so lexical and embedding gaps shrink. Simple forms include synonyms from a domain glossary or LLM-generated bullet points that rephrase the user ask.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;HyDE&lt;/strong&gt; (Hypothetical Document Embeddings) flips the idea: ask the LLM to write a &lt;em&gt;fake&lt;/em&gt; answer as if it knew the corpus, embed that synthetic paragraph, and search with it. It can bridge vocabulary mismatch; it can also amplify hallucinated specifics if you are not careful. HyDE fits better as an optional path behind a feature flag, not the default for regulated domains.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;metadata-filters-and-tenancy&#34;&gt;Metadata filters and tenancy&lt;/h2&gt;
&lt;p&gt;Vectors without filters become a soup. Retrieval should be scoped by &lt;strong&gt;tenant&lt;/strong&gt;, &lt;strong&gt;product&lt;/strong&gt;, &lt;strong&gt;language&lt;/strong&gt;, &lt;strong&gt;document type&lt;/strong&gt;, or &lt;strong&gt;time range&lt;/strong&gt; whenever those axes exist in the source system. That mirrors the “bounded context” thinking from earlier architecture writing: fewer collisions, easier debugging, and less chance of leaking one customer’s chunk into another’s session.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;evaluation-that-catches-regressions&#34;&gt;Evaluation that catches regressions&lt;/h2&gt;
&lt;p&gt;Three layers of signal beat one pretty demo:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Retrieval quality&lt;/strong&gt;&lt;br&gt;
Labeled (query, relevant_chunk_id) pairs—or weaker proxies like “any gold paragraph in top-k”—give you recall@k and MRR. When embedders or chunking change, this layer screams first.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Answer groundedness&lt;/strong&gt;&lt;br&gt;
Does the final answer stick to retrieved text? LLM-as-judge helps &lt;strong&gt;if&lt;/strong&gt; the judge is validated against human ratings and prompts stay versioned. A single automated score should not pass as compliance evidence alone.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Latency and cost&lt;/strong&gt;&lt;br&gt;
Track histograms for embed, search, rerank, and generate. A “smarter” stack that misses SLOs is a failed deploy.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Automate a regression job when the corpus or models change. Same discipline as pinning dependencies: &lt;strong&gt;record model revisions, index build IDs, and dataset hashes&lt;/strong&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;step-by-step-widen--re-rank--generate&#34;&gt;Step-by-step: widen → re-rank → generate&lt;/h2&gt;
&lt;p&gt;This is the default production upgrade path once naive &lt;code&gt;top_k&lt;/code&gt; plateaus.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Vector retrieve&lt;/strong&gt; — Pull &lt;code&gt;k_wide&lt;/code&gt; (50–200) ids + texts with scores from your index (same embedder revision as in &lt;a href=&#34;https://blog.veydh.com/2026/2026-01-rag-foundations-embeddings-and-retrieval/&#34;&gt;RAG Foundations: Embeddings, Chunking, and the Retrieval Loop&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Re-rank&lt;/strong&gt; — Score each &lt;code&gt;(query, passage)&lt;/code&gt; pair with a cross-encoder or hosted rerank API; sort descending; keep &lt;code&gt;k_tight&lt;/code&gt; (3–10).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompt&lt;/strong&gt; — Inject only &lt;code&gt;k_tight&lt;/code&gt; passages into the chat template; log their ids on the response object for debugging.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Measure&lt;/strong&gt; — Offline: recall@k on labeled data &lt;em&gt;before&lt;/em&gt; and &lt;em&gt;after&lt;/em&gt; rerank (rerank can hurt: a cross-encoder that mis-scores length-biased chunks may demote a gold passage the bi-encoder had already surfaced in the wide set).&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h2 id=&#34;tooling-comparison-for-re-ranking&#34;&gt;Tooling comparison for re-ranking&lt;/h2&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Approach&lt;/th&gt;
          &lt;th&gt;Examples&lt;/th&gt;
          &lt;th&gt;Pros&lt;/th&gt;
          &lt;th&gt;Cons&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Cross-encoder (self-hosted)&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;sentence-transformers&lt;/code&gt; &lt;code&gt;CrossEncoder&lt;/code&gt;, small Hub models&lt;/td&gt;
          &lt;td&gt;Strong accuracy; no per-query API bill; full control&lt;/td&gt;
          &lt;td&gt;GPU latency; batching complexity at high QPS&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Hosted rerank&lt;/td&gt;
          &lt;td&gt;Cohere rerank, Voyage, Jina rerank APIs&lt;/td&gt;
          &lt;td&gt;Fast integration; scales with vendor&lt;/td&gt;
          &lt;td&gt;Cost + egress; must pin API version behavior&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;LLM pointwise scoring&lt;/td&gt;
          &lt;td&gt;“Rate 0–5 how well this passage answers…”&lt;/td&gt;
          &lt;td&gt;Flexible rubric&lt;/td&gt;
          &lt;td&gt;Slow, expensive, judge drift; validate against humans&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;ColBERT-style late interaction&lt;/td&gt;
          &lt;td&gt;RAGatouille, ColBERT indexes&lt;/td&gt;
          &lt;td&gt;Excellent retrieval quality for some corpora&lt;/td&gt;
          &lt;td&gt;Heavier index + ops than flat vectors&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Cross-encoder or hosted rerank&lt;/strong&gt; is the usual workhorse; LLM-as-judge for reranking is a last resort after simpler baselines fail.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Illustrative cross-encoder pass&lt;/strong&gt; (after you already have candidate strings &lt;code&gt;passages&lt;/code&gt;):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; sentence_transformers &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; CrossEncoder
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;model &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; CrossEncoder(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;cross-encoder/ms-marco-MiniLM-L-6-v2&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;pairs &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; [[query, p] &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; p &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; passages]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;scores &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; model&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;predict(pairs)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;ranked &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; [p &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; _, p &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; sorted(zip(scores, passages), reverse&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;True&lt;/span&gt;)]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;tight &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; ranked[:&lt;span style=&#34;color:#ae81ff&#34;&gt;8&lt;/span&gt;]
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Pin the &lt;code&gt;CrossEncoder&lt;/code&gt; weights the same way you pin Hub LLM revisions.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;hyde-and-query-expansion-controlled-experiments&#34;&gt;HyDE and query expansion: controlled experiments&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Query expansion&lt;/strong&gt; — Maintain a domain glossary JSON; at query time append synonyms, or ask a small LLM for 3 paraphrases and embed the centroid / run multi-query fusion (retrieve per variant, merge with max score). Log which variant hit gold.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;HyDE&lt;/strong&gt; — (1) Prompt LLM: “Write a short factual paragraph that would answer: …” (2) Embed that paragraph. (3) Search. &lt;strong&gt;A/B&lt;/strong&gt; behind a flag: measure precision@k on a holdout set; watch for hallucinated proper nouns in the synthetic paragraph polluting retrieval.&lt;/p&gt;
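&lt;p&gt;The “retrieve per variant, merge with max score” step from the expansion paragraph can be sketched directly (the input shape is an assumption about your retriever’s output):&lt;/p&gt;

```python
# Multi-query fusion (sketch): retrieve per paraphrase, keep each chunk's best
# score across variants. `retrieved` maps variant -> list of (chunk_id, score);
# that shape is an assumption, not a specific library's return type.

def max_score_fusion(retrieved: dict[str, list[tuple[str, float]]]) -> list[tuple[str, float]]:
    best: dict[str, float] = {}
    origin: dict[str, str] = {}  # which variant hit each chunk -- log this
    for variant, hits in retrieved.items():
        for chunk_id, score in hits:
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
                origin[chunk_id] = variant
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```

&lt;p&gt;Keeping the &lt;code&gt;origin&lt;/code&gt; map is what lets you log which paraphrase actually hit gold.&lt;/p&gt;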
&lt;hr&gt;
&lt;h2 id=&#34;lightweight-eval-harness-copy-paste-structure&#34;&gt;Lightweight eval harness (copy-paste structure)&lt;/h2&gt;
&lt;p&gt;Keep a &lt;code&gt;questions.jsonl&lt;/code&gt; with &lt;code&gt;question&lt;/code&gt;, &lt;code&gt;gold_chunk_ids[]&lt;/code&gt;, optional &lt;code&gt;metadata_filter&lt;/code&gt;. For each line:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Run retrieval (and optional rerank).&lt;/li&gt;
&lt;li&gt;Compute &lt;strong&gt;recall@k&lt;/strong&gt;: did any gold id appear in the top k returned ids?&lt;/li&gt;
&lt;li&gt;Emit a CSV: &lt;code&gt;question_id, gold_hit@20, gold_rank, latency_ms&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Gate merges on &lt;strong&gt;recall@k&lt;/strong&gt; not moving backward when you change chunking, embedder, or reranker. The “answer groundedness” layer in this post stacks on top once retrieval is no longer random.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Treat RAG as a &lt;strong&gt;product surface&lt;/strong&gt;. Log which chunk IDs supported each answer; wire request IDs through search and generation; and assume distilled backends will not hide bad retrieval.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://blog.veydh.com/2026/2026-01-lora-and-small-distilled-models-primer/&#34;&gt;LoRA: Fine-Tuning Small Distilled Models Without Full Weights&lt;/a&gt; picks up &lt;strong&gt;LoRA&lt;/strong&gt; on those same backends so domain tone and tools improve without full fine-tunes.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>LoRA: Fine-Tuning Small Distilled Models Without Full Weights</title>
      <link>https://blog.veydh.com/2026/2026-01-lora-and-small-distilled-models-primer/</link>
      <pubDate>Sat, 10 Jan 2026 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2026/2026-01-lora-and-small-distilled-models-primer/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Low-Rank Adaptation (LoRA)&lt;/strong&gt; is the compromise when a model should sound like &lt;em&gt;our&lt;/em&gt; support desk, respect &lt;em&gt;our&lt;/em&gt; JSON tool schema, or stop doing &lt;em&gt;that one thing&lt;/em&gt;—without paying the full price of updating every weight in a multi-billion-parameter checkpoint.&lt;/p&gt;
&lt;p&gt;The idea is elementary linear algebra dressed as a training trick: instead of learning a full update matrix for a layer, you learn a &lt;strong&gt;low-rank factorization&lt;/strong&gt; added to frozen weights. VRAM drops, iteration speed rises, and you can often ship &lt;strong&gt;adapter-only&lt;/strong&gt; artifacts that are easy to version beside your data.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;why-pair-distillation-with-lora&#34;&gt;Why pair distillation with LoRA&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Distilled&lt;/strong&gt; students (from Mistral, Qwen, LLaMA families on the Hub) trade some ceiling capability for &lt;strong&gt;throughput&lt;/strong&gt; and &lt;strong&gt;cost&lt;/strong&gt;. They are attractive bases for assistants because you can serve them on fewer GPUs or smaller batches.&lt;/p&gt;
&lt;p&gt;LoRA sits on top of that trade: you keep the frozen student as a stable anchor, then nudge behavior with a thin adapter trained on curated SFT data. You are not trying to re-teach the entire internet—just &lt;strong&gt;steer&lt;/strong&gt; tone, format, and domain vocabulary.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;what-actually-gets-trained&#34;&gt;What actually gets trained&lt;/h2&gt;
&lt;p&gt;Implementations differ, but the mental model is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Frozen&lt;/strong&gt;: most of the transformer—embedding tables and the bulk of attention/MLP weights stay put.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trainable&lt;/strong&gt;: small matrices injected into chosen projections (&lt;code&gt;q_proj&lt;/code&gt;, &lt;code&gt;k_proj&lt;/code&gt;, &lt;code&gt;v_proj&lt;/code&gt;, &lt;code&gt;o_proj&lt;/code&gt;, sometimes MLP gates). The library merges or applies these at forward time.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Rank &lt;code&gt;r&lt;/code&gt;&lt;/strong&gt; controls adapter capacity. Tiny &lt;code&gt;r&lt;/code&gt; underfits quirky datasets; large &lt;code&gt;r&lt;/code&gt; inches toward full fine-tune behavior and cost.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Alpha scaling&lt;/strong&gt; (often paired with &lt;code&gt;r&lt;/code&gt;) scales the effective strength of the adapter. Both are tuned together rather than leaving defaults unquestioned.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dropout&lt;/strong&gt; on adapter paths helps when the SFT set is small; it fights memorization of noisy examples.&lt;/p&gt;
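&lt;p&gt;The size trade is easy to make concrete. For one adapted projection, LoRA trains an &lt;code&gt;r×d_in&lt;/code&gt; down-matrix and a &lt;code&gt;d_out×r&lt;/code&gt; up-matrix; the arithmetic below assumes a 4096-dim model with 32 layers purely for illustration:&lt;/p&gt;

```python
# Back-of-envelope LoRA parameter count. The 4096 hidden size, 32 layers, and
# square q/k/v/o projections are illustrative assumptions, not a specific model.

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A (r x d_in) plus B (d_out x r) per adapted projection."""
    return r * d_in + d_out * r

hidden = 4096
per_proj = lora_params(hidden, hidden, r=16)   # 131072 trainable params
per_layer = 4 * per_proj                       # q,k,v,o all targeted
total = 32 * per_layer                         # 16777216 across 32 layers --
                                               # about one frozen 4096x4096 matrix
print(per_proj, total)
```

&lt;p&gt;That comparison is the whole pitch: the adapter across every layer weighs roughly as much as a single frozen projection matrix.&lt;/p&gt;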
&lt;hr&gt;
&lt;h2 id=&#34;data-hygiene-beats-clever-hyperparameters&#34;&gt;Data hygiene beats clever hyperparameters&lt;/h2&gt;
&lt;p&gt;Datasets are versioned like code: &lt;strong&gt;git&lt;/strong&gt;, manifests, and hashes. Deduplication matters—paraphrased duplicates inflate metrics and encourage verbatim regurgitation. &lt;strong&gt;System&lt;/strong&gt; vs &lt;strong&gt;user&lt;/strong&gt; turns should be labeled or separated consistently so chat templates do not scramble during training.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;step-by-step-lora-with-transformers--peft-mental-checklist&#34;&gt;Step-by-step: LoRA with Transformers + PEFT (mental checklist)&lt;/h2&gt;
&lt;p&gt;This is the “one canonical recipe” kept stable; &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-mistral-huggingface-lora-training-notes/&#34;&gt;Fine-Tuning a Mistral-Class Checkpoint with LoRA on Hugging Face&lt;/a&gt; and &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-qwen-and-llama-distills-lora-on-hf/&#34;&gt;Qwen and LLaMA Distilled Checkpoints: LoRA Gotchas&lt;/a&gt; only swap tokenizer/template details.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Pin base&lt;/strong&gt; — &lt;code&gt;model_id&lt;/code&gt; + &lt;code&gt;revision&lt;/code&gt; (git commit on the Hub) for weights &lt;em&gt;and&lt;/em&gt; tokenizer JSON.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Choose targets&lt;/strong&gt; — Default: attention projections (&lt;code&gt;q_proj&lt;/code&gt;, &lt;code&gt;k_proj&lt;/code&gt;, &lt;code&gt;v_proj&lt;/code&gt;, &lt;code&gt;o_proj&lt;/code&gt;). Add MLP only if evals plateau and VRAM allows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Set PEFT&lt;/strong&gt; — &lt;code&gt;LoraConfig(r=8 or 16, lora_alpha=16 or 32, lora_dropout=0.05, target_modules=[...], bias=&amp;quot;none&amp;quot;, task_type=&amp;quot;CAUSAL_LM&amp;quot;)&lt;/code&gt;. Treat &lt;code&gt;r&lt;/code&gt; and &lt;code&gt;alpha&lt;/code&gt; as &lt;strong&gt;coupled&lt;/strong&gt;: effective scale is roughly &lt;code&gt;alpha/r&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Load in bf16&lt;/strong&gt; — &lt;code&gt;model = AutoModelForCausalLM.from_pretrained(..., torch_dtype=torch.bfloat16)&lt;/code&gt; when the GPU supports it; enable &lt;strong&gt;gradient checkpointing&lt;/strong&gt; if sequence length × batch threatens OOM.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tokenize with chat template&lt;/strong&gt; — Build &lt;code&gt;messages&lt;/code&gt; list → &lt;code&gt;tokenizer.apply_chat_template(..., tokenize=True, return_tensors=&amp;quot;pt&amp;quot;, add_generation_prompt=False)&lt;/code&gt; for SFT; mask &lt;strong&gt;non-assistant&lt;/strong&gt; token positions with &lt;code&gt;-100&lt;/code&gt; in &lt;code&gt;labels&lt;/code&gt; so loss applies only where you want imitation (&lt;a href=&#34;https://blog.veydh.com/2026/2026-02-mistral-huggingface-lora-training-notes/&#34;&gt;Fine-Tuning a Mistral-Class Checkpoint with LoRA on Hugging Face&lt;/a&gt; expands Mistral-specific layout).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Train&lt;/strong&gt; — &lt;code&gt;Trainer&lt;/code&gt; or &lt;code&gt;SFTTrainer&lt;/code&gt; with small &lt;strong&gt;learning rate&lt;/strong&gt; (e.g. &lt;code&gt;1e-4&lt;/code&gt; to &lt;code&gt;2e-4&lt;/code&gt; for adapters, lower if unstable), &lt;strong&gt;cosine&lt;/strong&gt; or linear decay, &lt;strong&gt;weight decay&lt;/strong&gt; 0–0.1, &lt;strong&gt;max_steps&lt;/strong&gt; or &lt;strong&gt;epochs&lt;/strong&gt; chosen from a tiny dry run that completes without NaN.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Save&lt;/strong&gt; — &lt;code&gt;trainer.model.save_pretrained(&amp;quot;./lora-out&amp;quot;)&lt;/code&gt; saves &lt;strong&gt;adapter-only&lt;/strong&gt; shards; record dataset hash, seed, and CLI args beside the folder.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Load for inference&lt;/strong&gt; — &lt;code&gt;PeftModel.from_pretrained(base, &amp;quot;./lora-out&amp;quot;)&lt;/code&gt; or merge for deployment paths that require a single weight file.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Minimal wiring&lt;/strong&gt; (abbreviated; align with your model family’s &lt;code&gt;target_modules&lt;/code&gt; names):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; torch
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; transformers &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; peft &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; LoraConfig, get_peft_model, TaskType
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;MODEL_ID &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;mistralai/Mistral-7B-Instruct-v0.3&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;REV &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&amp;lt;pin-a-commit-sha&amp;gt;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;tokenizer &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; AutoTokenizer&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;from_pretrained(MODEL_ID, revision&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;REV)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;model &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; AutoModelForCausalLM&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;from_pretrained(
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    MODEL_ID,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    revision&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;REV,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    torch_dtype&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;torch&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;bfloat16,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    device_map&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;auto&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;lora &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; LoraConfig(
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    r&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;16&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    lora_alpha&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;32&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    lora_dropout&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;0.05&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    bias&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;none&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    task_type&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;TaskType&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;CAUSAL_LM,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    target_modules&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;[&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;q_proj&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;k_proj&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;v_proj&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;o_proj&amp;#34;&lt;/span&gt;],
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;model &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; get_peft_model(model, lora)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;# ... build tokenized dataset with masked labels, then:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;args &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; TrainingArguments(
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    output_dir&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;./lora-run&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    per_device_train_batch_size&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    gradient_accumulation_steps&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;8&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    learning_rate&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;2e-4&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    num_train_epochs&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    logging_steps&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;10&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    save_steps&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;200&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    bf16&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;True&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    gradient_checkpointing&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;True&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;trainer &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; Trainer(model&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;model, args&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;args, train_dataset&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;tokenized_train)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;trainer&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;train()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;trainer&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;model&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;save_pretrained(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;./lora-out&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Pros of this path:&lt;/strong&gt; battle-tested libraries, easy diff between runs, adapter artifacts stay small. &lt;strong&gt;Cons:&lt;/strong&gt; you still own data hygiene, template correctness, and eval gates—PEFT does not fix mis-tokenized chats.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Pick &lt;strong&gt;one canonical training recipe&lt;/strong&gt;—Transformers + PEFT, or a higher-level runner—and change only data and hyperparameters between experiments. &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-mistral-huggingface-lora-training-notes/&#34;&gt;Fine-Tuning a Mistral-Class Checkpoint with LoRA on Hugging Face&lt;/a&gt; and &lt;a href=&#34;https://blog.veydh.com/2026/2026-02-qwen-and-llama-distills-lora-on-hf/&#34;&gt;Qwen and LLaMA Distilled Checkpoints: LoRA Gotchas&lt;/a&gt; walk through &lt;strong&gt;family-specific&lt;/strong&gt; quirks (Mistral templates, Qwen tokenization, LLaMA RoPE) so the same LoRA discipline does not stumble on tokenizer details.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Algorithms Retrospective: DFS, BFS, Dijkstra, and Backtracking</title>
      <link>https://blog.veydh.com/2025/2025-11-algorithms-retrospective-dfs-bfs-dijkstra/</link>
      <pubDate>Wed, 12 Nov 2025 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2025/2025-11-algorithms-retrospective-dfs-bfs-dijkstra/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;The 2020–2022 graph posts on this blog—&lt;strong&gt;DFS&lt;/strong&gt;, &lt;strong&gt;BFS&lt;/strong&gt;, &lt;strong&gt;Dijkstra&lt;/strong&gt;, grid islands, word-ladder style searches, &lt;strong&gt;N-Queens&lt;/strong&gt; backtracking—were exercises in &lt;strong&gt;state exploration&lt;/strong&gt; with clear &lt;strong&gt;invariants&lt;/strong&gt;. November pulls them into one narrative: what each technique buys you, when it misleads you, and how that thinking echoes &lt;strong&gt;systems&lt;/strong&gt; work like scheduling and search over configuration spaces.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;dfs-depth-first-exploration&#34;&gt;DFS: depth-first exploration&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;DFS&lt;/strong&gt; walks until it hits a wall, then backtracks. It is natural for &lt;strong&gt;exhaustive&lt;/strong&gt; search, connectivity checks, and puzzles with deep decision chains. It does not automatically find the shortest path unless the problem structure guarantees one—knowing &lt;strong&gt;why&lt;/strong&gt; saves interview time and production bugs.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;bfs-layers-and-shortest-path-in-unweighted-graphs&#34;&gt;BFS: layers and shortest path in unweighted graphs&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;BFS&lt;/strong&gt; expands &lt;strong&gt;layer by layer&lt;/strong&gt;, yielding shortest hop counts when edges have &lt;strong&gt;uniform&lt;/strong&gt; cost. Grid problems and social graphs love BFS when “minimum steps” is the metric.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;dijkstra-non-negative-weighted-shortest-paths&#34;&gt;Dijkstra: non-negative weighted shortest paths&lt;/h2&gt;
&lt;p&gt;When weights appear, &lt;strong&gt;Dijkstra&lt;/strong&gt; (with a priority queue) is the conservative default for &lt;strong&gt;non-negative&lt;/strong&gt; edges. I still narrate &lt;strong&gt;relaxation&lt;/strong&gt; aloud while coding—if I cannot explain the invariant, I should not trust the heap updates.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;backtracking-constraint-propagation-with-undo&#34;&gt;Backtracking: constraint propagation with undo&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;N-Queens&lt;/strong&gt; and similar puzzles are &lt;strong&gt;backtracking&lt;/strong&gt; with &lt;strong&gt;pruning&lt;/strong&gt;. The pattern is choose, recurse, &lt;strong&gt;undo&lt;/strong&gt;—the same shape as exploratory configuration changes with rollback in operational tooling (metaphorically, not literally identical code).&lt;/p&gt;
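&lt;p&gt;A minimal &lt;strong&gt;N-Queens&lt;/strong&gt; counter makes the choose/recurse/undo shape concrete (a sketch, not the original series’ code; the diagonal-set pruning is one common formulation):&lt;/p&gt;

```python
def solve_n_queens(n):
    """Count N-Queens placements via choose, recurse, undo."""
    cols, diag1, diag2 = set(), set(), set()
    count = 0

    def place(row):
        nonlocal count
        if row == n:
            count += 1
            return
        for col in range(n):
            # prune: skip columns and diagonals already under attack
            if col in cols or (row - col) in diag1 or (row + col) in diag2:
                continue
            cols.add(col); diag1.add(row - col); diag2.add(row + col)  # choose
            place(row + 1)                                             # recurse
            cols.discard(col); diag1.discard(row - col); diag2.discard(row + col)  # undo

    place(0)
    return count
```

&lt;p&gt;The three marked lines are the whole pattern; everything else is pruning.&lt;/p&gt;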
&lt;hr&gt;
&lt;h2 id=&#34;study-loop-that-stuck-for-me&#34;&gt;Study loop that stuck for me&lt;/h2&gt;
&lt;p&gt;Implement each pattern &lt;strong&gt;twice&lt;/strong&gt;—I used &lt;strong&gt;Python&lt;/strong&gt; and &lt;strong&gt;Java&lt;/strong&gt; in the original series—timed against &lt;strong&gt;randomized&lt;/strong&gt; inputs. Explaining &lt;strong&gt;invariants&lt;/strong&gt; out loud catches fence-post errors faster than staring at IDE highlights.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;mini-implementations-side-by-side-same-graph-api&#34;&gt;Mini implementations side by side (same graph API)&lt;/h2&gt;
&lt;p&gt;Assume &lt;code&gt;adj: Dict[int, List[Tuple[int, int]]]&lt;/code&gt; maps &lt;strong&gt;node → [(neighbor, weight), …]&lt;/strong&gt;; for BFS/DFS, ignore weights or set them to &lt;code&gt;1&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;BFS (shortest hops, unweighted):&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; collections &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; deque
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;bfs&lt;/span&gt;(adj, start, target):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    q &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; deque([start])
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    prev &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; {start: &lt;span style=&#34;color:#66d9ef&#34;&gt;None&lt;/span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;while&lt;/span&gt; q:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        u &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; q&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;popleft()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; u &lt;span style=&#34;color:#f92672&#34;&gt;==&lt;/span&gt; target:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#66d9ef&#34;&gt;break&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; v, _w &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; adj&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;get(u, []):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; v &lt;span style=&#34;color:#f92672&#34;&gt;not&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; prev:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                prev[v] &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; u
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                q&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;append(v)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;return&lt;/span&gt; prev  &lt;span style=&#34;color:#75715e&#34;&gt;# reconstruct path from prev&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Dijkstra (non-negative weights):&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; heapq
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;dijkstra&lt;/span&gt;(adj, start):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    dist &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; {start: &lt;span style=&#34;color:#ae81ff&#34;&gt;0&lt;/span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    pq &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; [(&lt;span style=&#34;color:#ae81ff&#34;&gt;0&lt;/span&gt;, start)]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;while&lt;/span&gt; pq:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        d, u &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; heapq&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;heappop(pq)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; d &lt;span style=&#34;color:#f92672&#34;&gt;!=&lt;/span&gt; dist&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;get(u, float(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;inf&amp;#34;&lt;/span&gt;)):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#66d9ef&#34;&gt;continue&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; v, w &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; adj&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;get(u, []):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            nd &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; d &lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt; w
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; nd &lt;span style=&#34;color:#f92672&#34;&gt;&amp;lt;&lt;/span&gt; dist&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;get(v, float(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;inf&amp;#34;&lt;/span&gt;)):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                dist[v] &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; nd
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                heapq&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;heappush(pq, (nd, v))
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;return&lt;/span&gt; dist
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;DFS (connectivity / exhaustive search):&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;dfs&lt;/span&gt;(adj, start, seen&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;None&lt;/span&gt;):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; seen &lt;span style=&#34;color:#f92672&#34;&gt;is&lt;/span&gt; &lt;span style=&#34;color:#66d9ef&#34;&gt;None&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        seen &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; set()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    seen&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;add(start)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; v, _w &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; adj&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;get(start, []):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; v &lt;span style=&#34;color:#f92672&#34;&gt;not&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;in&lt;/span&gt; seen:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            dfs(adj, v, seen)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#66d9ef&#34;&gt;return&lt;/span&gt; seen
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;When to use which (decision table):&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Goal&lt;/th&gt;
          &lt;th&gt;Algorithm&lt;/th&gt;
          &lt;th&gt;Invariant you should be able to say aloud&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Min hops, unweighted&lt;/td&gt;
          &lt;td&gt;BFS&lt;/td&gt;
          &lt;td&gt;Nodes dequeued in nondecreasing distance from start&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Min cost, edges ≥ 0&lt;/td&gt;
          &lt;td&gt;Dijkstra&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;dist[u]&lt;/code&gt; is final when &lt;code&gt;u&lt;/code&gt; is popped&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Explore all / detect cycle&lt;/td&gt;
          &lt;td&gt;DFS&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;seen&lt;/code&gt; partitions visited vs not&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Port the same three functions to &lt;strong&gt;Java&lt;/strong&gt; (&lt;code&gt;ArrayDeque&lt;/code&gt;, &lt;code&gt;PriorityQueue&lt;/code&gt;) to mirror the 2020–2022 posts—identical inputs should yield identical shortest-path distances.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Algorithm fluency and &lt;strong&gt;systems&lt;/strong&gt; fluency both reward &lt;strong&gt;careful state management&lt;/strong&gt;. &lt;a href=&#34;https://blog.veydh.com/2025/2025-09-debugging-concurrent-systems-books-and-practices/&#34;&gt;Debugging Concurrent Systems: Books and Practices&lt;/a&gt; ties &lt;strong&gt;debugging&lt;/strong&gt; and &lt;strong&gt;concurrency&lt;/strong&gt; reading to how you hold graph and systems work in your head.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Architecture for Evolvable Services After Polyglot Expansion</title>
      <link>https://blog.veydh.com/2025/2025-10-architecture-for-evolvable-services/</link>
      <pubDate>Tue, 14 Oct 2025 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2025/2025-10-architecture-for-evolvable-services/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;Adding &lt;strong&gt;Gleam&lt;/strong&gt; or &lt;strong&gt;Rust&lt;/strong&gt; workers does not rewrite your &lt;strong&gt;domain rules&lt;/strong&gt;—it changes &lt;strong&gt;deployment topology&lt;/strong&gt;, &lt;strong&gt;build graphs&lt;/strong&gt;, and &lt;strong&gt;operational surface area&lt;/strong&gt;. If core logic leaks across process boundaries, every new language multiplies rewrite cost. This post is the checklist after &lt;a href=&#34;https://blog.veydh.com/2024/2024-11-polyglot-interop-http-grpc-python-java/&#34;&gt;Polyglot Interop: HTTP and gRPC Between Python and Java&lt;/a&gt;: keep &lt;strong&gt;contracts&lt;/strong&gt; honest and &lt;strong&gt;observability&lt;/strong&gt; uniform.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;apis-as-the-real-boundary&#34;&gt;APIs as the real boundary&lt;/h2&gt;
&lt;p&gt;HTTP or gRPC schemas are the &lt;strong&gt;stability layer&lt;/strong&gt;: internals can be refactored freely behind them while protobuf/OpenAPI versions &lt;strong&gt;negotiate&lt;/strong&gt; compatibility explicitly. &lt;strong&gt;Contract tests&lt;/strong&gt; that fail CI when consumers and producers silently diverge are worth the investment.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;feature-flags-and-cutover&#34;&gt;Feature flags and cutover&lt;/h2&gt;
&lt;p&gt;Rolling out a new runtime behind a &lt;strong&gt;flag&lt;/strong&gt; de-risks partial failures. Metrics should compare &lt;strong&gt;error budgets&lt;/strong&gt; and &lt;strong&gt;latency&lt;/strong&gt; between old and new paths on identical traffic slices—not on cherry-picked demos.&lt;/p&gt;
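&lt;p&gt;“Identical traffic slices” is easy to get wrong; a &lt;strong&gt;deterministic&lt;/strong&gt; hash bucket keeps the canary population stable across restarts (a sketch; &lt;code&gt;in_canary&lt;/code&gt; and the request-id scheme are illustrative assumptions):&lt;/p&gt;

```python
import hashlib

def in_canary(request_id: str, percent: int) -> bool:
    """Route a fixed, deterministic slice of traffic to the new runtime.

    The same request_id always lands in the same bucket, so old-vs-new
    latency and error-budget comparisons see identical populations.
    """
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = digest[0] * 256 + digest[1]   # stable bucket in 0..65535
    return percent > bucket * 100 // 65536
```

&lt;p&gt;Ramping is then a config change to &lt;code&gt;percent&lt;/code&gt;, and every request already in the canary stays in it.&lt;/p&gt;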
&lt;hr&gt;
&lt;h2 id=&#34;observability-parity&#34;&gt;Observability parity&lt;/h2&gt;
&lt;p&gt;The same &lt;strong&gt;trace ID&lt;/strong&gt; must traverse Python, Java, Gleam, and Rust services. Span names and attributes should be boringly consistent. If one stack logs strings and another logs structured JSON only, on-call pays the tax forever.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;clean-architecture-echoes&#34;&gt;Clean architecture echoes&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;Clean Architecture&lt;/em&gt; and &lt;strong&gt;PoEAA&lt;/strong&gt; posts from earlier years still apply: frameworks at the &lt;strong&gt;edge&lt;/strong&gt;, domain in the &lt;strong&gt;center&lt;/strong&gt;. Languages become &lt;strong&gt;plugins&lt;/strong&gt; to boundaries you already named.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;contract-tests-make-polyglot-changes-reproducible&#34;&gt;Contract tests: make polyglot changes reproducible&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Consumer-driven contract&lt;/strong&gt; sketch (concept applies to Pact, Spring Cloud Contract, or hand-rolled):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Commit an &lt;strong&gt;OpenAPI&lt;/strong&gt; or &lt;strong&gt;protobuf&lt;/strong&gt; schema as the source of truth.&lt;/li&gt;
&lt;li&gt;CI runs &lt;strong&gt;schemathesis&lt;/strong&gt; / Dredd / buf breaking-change checks on each PR.&lt;/li&gt;
&lt;li&gt;For each language worker (Python, Java, Gleam, Rust), a &lt;strong&gt;thin adapter&lt;/strong&gt; maps DTOs ↔ domain types; tests assert the adapter round-trips &lt;strong&gt;golden JSON&lt;/strong&gt; fixtures.&lt;/li&gt;
&lt;/ol&gt;
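&lt;p&gt;Step 3’s round-trip check fits in a few lines (a sketch; &lt;code&gt;Order&lt;/code&gt;, the field names, and the fixture are hypothetical):&lt;/p&gt;

```python
import json
from dataclasses import dataclass

# Hypothetical domain type and DTO adapter; names are illustrative only.
@dataclass(frozen=True)
class Order:
    order_id: str
    cents: int

def dto_to_domain(dto: dict) -> Order:
    return Order(order_id=dto["id"], cents=dto["amount_cents"])

def domain_to_dto(order: Order) -> dict:
    return {"id": order.order_id, "amount_cents": order.cents}

# Golden fixture: committed once, compared byte-for-byte after a round trip.
GOLDEN = json.dumps({"id": "ord-42", "amount_cents": 1999}, sort_keys=True)

def test_round_trip():
    dto = json.loads(GOLDEN)
    assert json.dumps(domain_to_dto(dto_to_domain(dto)), sort_keys=True) == GOLDEN
```

&lt;p&gt;Commit the golden fixture once; any adapter change that alters the serialized form fails CI before a consumer sees it.&lt;/p&gt;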
&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; regressions surface before deploy; &lt;strong&gt;Cons:&lt;/strong&gt; contract suites need curation or they become flaky—pin API &lt;strong&gt;version&lt;/strong&gt; headers and time fields in fixtures.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;observability-one-trace-through-four-runtimes&#34;&gt;Observability: one trace through four runtimes&lt;/h2&gt;
&lt;p&gt;Propagate &lt;code&gt;traceparent&lt;/code&gt; (W3C) or your vendor’s equivalent. Minimum viable attributes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;service.name&lt;/code&gt;, &lt;code&gt;service.version&lt;/code&gt; (git SHA)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;http.route&lt;/code&gt; or gRPC method&lt;/li&gt;
&lt;li&gt;&lt;code&gt;messaging.destination&lt;/code&gt; for queues&lt;/li&gt;
&lt;/ul&gt;
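&lt;p&gt;Propagation is mechanical once the header format is pinned down; a stdlib-only sketch of &lt;strong&gt;W3C&lt;/strong&gt; &lt;code&gt;traceparent&lt;/code&gt; handling (real services should lean on an OpenTelemetry SDK rather than hand-rolling this):&lt;/p&gt;

```python
import re
import secrets

# version 00: "00-{trace_id:32 hex}-{span_id:16 hex}-{flags:2 hex}"
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def new_traceparent(sampled=True):
    """Start a fresh trace with a random trace id and span id."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"

def continue_trace(header):
    """Keep the trace id from an inbound header; mint a new span id."""
    m = TRACEPARENT_RE.match(header)
    if m is None:
        return new_traceparent()   # broken header: start a fresh trace
    return f"00-{m.group(1)}-{secrets.token_hex(8)}-{m.group(3)}"
```

&lt;p&gt;Every runtime applying the same two rules—keep the trace id, replace the span id—is what makes the one-trace-per-request promise hold.&lt;/p&gt;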
&lt;p&gt;&lt;strong&gt;Reproducible debug session:&lt;/strong&gt; given &lt;code&gt;request_id&lt;/code&gt;, you should fetch &lt;strong&gt;one&lt;/strong&gt; trace spanning all hops without grepping raw logs first.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Good architecture &lt;strong&gt;minimizes&lt;/strong&gt; how many places must change when languages do. &lt;a href=&#34;https://blog.veydh.com/2025/2025-11-algorithms-retrospective-dfs-bfs-dijkstra/&#34;&gt;Algorithms Retrospective: DFS, BFS, Dijkstra, and Backtracking&lt;/a&gt; loops back to &lt;strong&gt;algorithms&lt;/strong&gt;—graph search patterns that underpin both interviews and scheduling intuition.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Debugging Concurrent Systems: Books and Practices</title>
      <link>https://blog.veydh.com/2025/2025-09-debugging-concurrent-systems-books-and-practices/</link>
      <pubDate>Thu, 11 Sep 2025 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2025/2025-09-debugging-concurrent-systems-books-and-practices/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;Concurrent bugs &lt;strong&gt;laugh at printf&lt;/strong&gt;. Ordering issues hide until load, timing, and deployment topology align just wrong. September is the month to stop pretending heroics scale: &lt;strong&gt;systematic methods&lt;/strong&gt; and &lt;strong&gt;canonical books&lt;/strong&gt; belong in the same toolkit as traces and benchmarks.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;agans-debugging&#34;&gt;Agans, &lt;em&gt;Debugging&lt;/em&gt;&lt;/h2&gt;
&lt;p&gt;Agans’ habits—&lt;strong&gt;reproduce&lt;/strong&gt;, &lt;strong&gt;simplify&lt;/strong&gt;, &lt;strong&gt;read backwards&lt;/strong&gt;, &lt;strong&gt;divide and conquer&lt;/strong&gt;—apply whether the symptom is a segfault or a stuck BEAM process. For concurrency, “reproduce” often means &lt;strong&gt;stress&lt;/strong&gt; plus &lt;strong&gt;fault injection&lt;/strong&gt;, not a single lucky curl.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;herlihy--shavit-the-art-of-multiprocessor-programming&#34;&gt;Herlihy &amp;amp; Shavit, &lt;em&gt;The Art of Multiprocessor Programming&lt;/em&gt;&lt;/h2&gt;
&lt;p&gt;When &lt;strong&gt;locks&lt;/strong&gt;, &lt;strong&gt;lock-free&lt;/strong&gt; structures, or &lt;strong&gt;memory ordering&lt;/strong&gt; are in play, &lt;strong&gt;invariants&lt;/strong&gt; should be stated in plain language: what must be true before and after each step? The book is dense; even selective chapters upgrade intuition about &lt;strong&gt;why&lt;/strong&gt; a “tiny” race takes weeks to find.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;practices-that-paid-off&#34;&gt;Practices that paid off&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Record-replay&lt;/strong&gt; and &lt;strong&gt;time-travel&lt;/strong&gt; debuggers where available—expensive to set up, cheap at incident o’clock.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stress tests&lt;/strong&gt; with injected crashes (the 2025 workload theme).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sequence numbers&lt;/strong&gt; in logs so “event A before B” survives async reordering.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distributed traces&lt;/strong&gt; (&lt;a href=&#34;https://blog.veydh.com/2024/2024-07-observability-tracing-opentelemetry/&#34;&gt;OpenTelemetry Traces Across Python and Java&lt;/a&gt;) tying HTTP/gRPC spans to internal task lifetimes.&lt;/li&gt;
&lt;/ul&gt;
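&lt;p&gt;The sequence-number habit is cheap to implement (a sketch; consumers sort by &lt;code&gt;seq&lt;/code&gt; before reasoning about order):&lt;/p&gt;

```python
import itertools
import json
import threading

class SeqLogger:
    """Stamp every record with a monotonic sequence number so that
    "A happened before B" survives async transport reordering."""

    def __init__(self):
        self._seq = itertools.count()
        self._lock = threading.Lock()
        self.records = []

    def log(self, event, **fields):
        # increment and append under one lock so seq order matches list order
        with self._lock:
            self.records.append(
                json.dumps({"seq": next(self._seq), "event": event, **fields})
            )

logger = SeqLogger()
logger.log("lock_acquired", task="worker-1")
logger.log("lock_released", task="worker-1")
```

&lt;p&gt;Even if the transport shuffles these lines, the &lt;code&gt;seq&lt;/code&gt; field reconstructs the original order.&lt;/p&gt;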
&lt;hr&gt;
&lt;h2 id=&#34;cross-training-runtimes&#34;&gt;Cross-training runtimes&lt;/h2&gt;
&lt;p&gt;BEAM &lt;strong&gt;observer&lt;/strong&gt; and &lt;strong&gt;tracing&lt;/strong&gt; plus Rust &lt;strong&gt;backtraces&lt;/strong&gt; plus &lt;strong&gt;OpenTelemetry&lt;/strong&gt; spans shorten unknown-unknowns. No single tool wins; &lt;strong&gt;correlation IDs&lt;/strong&gt; are the glue.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;reproducible-incident-checklist-printable&#34;&gt;Reproducible incident checklist (printable)&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Freeze the world&lt;/strong&gt; — Git SHA, container digest, OTP/Rust toolchain versions, feature flags, traffic slice.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;One graph&lt;/strong&gt; — Pick &lt;em&gt;one&lt;/em&gt; metric that hurts (p99, error rate, mailbox depth); screenshot with &lt;strong&gt;absolute time&lt;/strong&gt; axis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Minimal replay&lt;/strong&gt; — Script that replays &lt;strong&gt;one&lt;/strong&gt; failing request against staging with the same seeds as &lt;a href=&#34;https://blog.veydh.com/2025/2025-01-concurrent-workload-specification/&#34;&gt;A Language-Agnostic Concurrent Workload for 2025 Comparisons&lt;/a&gt; when possible.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Divide&lt;/strong&gt; — Binary-search components: disable rerank, disable RAG, shrink concurrency—same tactic as Agans’ divide-and-conquer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write the invariant&lt;/strong&gt; — One sentence: “We assumed X; under load Y became false.” File it beside the fix.&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h2 id=&#34;tooling-matrix-what-to-reach-for-first&#34;&gt;Tooling matrix (what to reach for first)&lt;/h2&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Symptom stack&lt;/th&gt;
          &lt;th&gt;First tool&lt;/th&gt;
          &lt;th&gt;Second tool&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;BEAM tail latency&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;observer&lt;/code&gt;, &lt;code&gt;process_info&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;recon&lt;/code&gt; / &lt;code&gt;etop&lt;/code&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Tokio stalls&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;tokio-console&lt;/code&gt;, tracing spans&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;perf&lt;/code&gt; / flamegraph&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Cross-service&lt;/td&gt;
          &lt;td&gt;OpenTelemetry trace by &lt;code&gt;request_id&lt;/code&gt;&lt;/td&gt;
          &lt;td&gt;Log join on span id&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Tools and languages rotate; &lt;strong&gt;methodical debugging&lt;/strong&gt; does not. &lt;a href=&#34;https://blog.veydh.com/2025/2025-10-architecture-for-evolvable-services/&#34;&gt;Architecture for Evolvable Services After Polyglot Expansion&lt;/a&gt; returns to &lt;strong&gt;architecture&lt;/strong&gt; when Gleam/Rust workers join Python/Java systems—boundaries matter more than logos.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Send, Sync, and Fearless Concurrency in Rust</title>
      <link>https://blog.veydh.com/2025/2025-08-rust-send-sync-and-fearless-concurrency/</link>
      <pubDate>Wed, 13 Aug 2025 00:00:00 +0000</pubDate>
      <author>Veydh Gooljar</author>
      
      <guid>https://blog.veydh.com/2025/2025-08-rust-send-sync-and-fearless-concurrency/</guid>
      <description>&lt;h2 id=&#34;preamble&#34;&gt;Preamble&lt;/h2&gt;
&lt;p&gt;Rust’s concurrency story is not only &lt;strong&gt;ownership&lt;/strong&gt;—it is also &lt;strong&gt;&lt;code&gt;Send&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;Sync&lt;/code&gt;&lt;/strong&gt;, marker traits the compiler uses to prove whether values may cross &lt;strong&gt;thread&lt;/strong&gt; or &lt;strong&gt;task&lt;/strong&gt; boundaries safely. When &lt;a href=&#34;https://blog.veydh.com/2025/2025-04-rust-tokio-same-concurrent-workload/&#34;&gt;Rust and Tokio: The Same Concurrent Workload in Type-Safe Threads&lt;/a&gt;’s benchmark introduced &lt;strong&gt;shared aggregators&lt;/strong&gt;, these traits stopped being trivia and became &lt;strong&gt;API design&lt;/strong&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;send-moving-ownership-across-threads&#34;&gt;&lt;code&gt;Send&lt;/code&gt;: moving ownership across threads&lt;/h2&gt;
&lt;p&gt;A type is &lt;strong&gt;&lt;code&gt;Send&lt;/code&gt;&lt;/strong&gt; when ownership can transfer to another thread without breaking aliasing rules. Most owned data is &lt;code&gt;Send&lt;/code&gt;; &lt;strong&gt;&lt;code&gt;Rc&lt;/code&gt;&lt;/strong&gt; is the classic counterexample: its reference count is not atomic, so sending it across threads would race. If a spawned task closure captures something non-&lt;code&gt;Send&lt;/code&gt;, the compiler refuses—&lt;strong&gt;good&lt;/strong&gt;—because you were about to ship a data race.&lt;/p&gt;
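&lt;p&gt;A compile-time sketch of that boundary, using plain &lt;code&gt;std::thread&lt;/code&gt; rather than the benchmark’s Tokio runtime: &lt;code&gt;Arc&lt;/code&gt; crosses the spawn, &lt;code&gt;Rc&lt;/code&gt; is rejected at the spawn site.&lt;/p&gt;

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

// Move an Arc clone into a worker thread and get its length back.
// Compiles because Arc<String> is Send; swap Arc for Rc and the same
// spawn is rejected at compile time.
fn len_on_worker(shared: Arc<String>) -> usize {
    let handle = thread::spawn(move || shared.len());
    handle.join().unwrap()
}

fn main() {
    assert_eq!(len_on_worker(Arc::new(String::from("report"))), 6);

    let local = Rc::new(String::from("report"));
    let _len = local.len(); // fine on one thread
    // thread::spawn(move || local.len());
    // ^ error[E0277]: `Rc<String>` cannot be sent between threads safely
}
```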
&lt;hr&gt;
&lt;h2 id=&#34;sync-shared-references-across-threads&#34;&gt;&lt;code&gt;Sync&lt;/code&gt;: shared references across threads&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;Sync&lt;/code&gt;&lt;/strong&gt; means &lt;code&gt;&amp;amp;T&lt;/code&gt; is &lt;code&gt;Send&lt;/code&gt;: many threads may hold &lt;strong&gt;shared&lt;/strong&gt; references concurrently, because any interior mutability the type offers is itself thread-safe. &lt;strong&gt;&lt;code&gt;Mutex&amp;lt;T&amp;gt;&lt;/code&gt;&lt;/strong&gt; is &lt;code&gt;Sync&lt;/code&gt; when &lt;code&gt;T&lt;/code&gt; is &lt;code&gt;Send&lt;/code&gt;—the mutex serializes mutation so references stay disciplined.&lt;/p&gt;
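&lt;p&gt;A small sketch of that discipline with &lt;code&gt;std::thread&lt;/code&gt; workers (a toy counter, not the benchmark’s aggregator): sharing &lt;code&gt;&amp;amp;Mutex&amp;lt;u64&amp;gt;&lt;/code&gt; across threads is legal precisely because the lock serializes every mutation.&lt;/p&gt;

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Mutex<u64> is Sync (u64 is Send), so references to it may be shared
// across threads; the lock serializes the actual increments.
fn parallel_count(threads: usize, per_thread: u64) -> u64 {
    let counter = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}

fn main() {
    // Every increment lands exactly once: 4 * 1000.
    assert_eq!(parallel_count(4, 1000), 4000);
}
```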
&lt;hr&gt;
&lt;h2 id=&#34;aggregators-in-the-benchmark&#34;&gt;Aggregators in the benchmark&lt;/h2&gt;
&lt;p&gt;Global counters tempt &lt;strong&gt;&lt;code&gt;Arc&amp;lt;Mutex&amp;lt;Stats&amp;gt;&amp;gt;&lt;/code&gt;&lt;/strong&gt;. That works until &lt;strong&gt;contention&lt;/strong&gt; dominates; then &lt;strong&gt;message passing&lt;/strong&gt; to a single owner task often simplifies invariants and reduces lock thrashing. The “right” choice is &lt;strong&gt;measured&lt;/strong&gt;, not ideological.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;compiler-errors-as-design-feedback&#34;&gt;Compiler errors as design feedback&lt;/h2&gt;
&lt;p&gt;When Rust rejects a spawn site, the error can be read as &lt;strong&gt;a design review&lt;/strong&gt; from the typechecker. Languages without these checks still have the races—they just discover them &lt;strong&gt;after&lt;/strong&gt; deploy, with fewer breadcrumbs.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;code-arcmutext-versus-a-single-owner-stats-task&#34;&gt;Code: &lt;code&gt;Arc&amp;lt;Mutex&amp;lt;T&amp;gt;&amp;gt;&lt;/code&gt; versus a single-owner stats task&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Shared mutex (simple, contends under load):&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-rust&#34; data-lang=&#34;rust&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;use&lt;/span&gt; std::sync::{Arc, Mutex};
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;// use tokio::sync::mpsc; // in full binary
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;#[derive(Default)]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;struct&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;Stats&lt;/span&gt; { done: &lt;span style=&#34;color:#66d9ef&#34;&gt;u64&lt;/span&gt;, errors: &lt;span style=&#34;color:#66d9ef&#34;&gt;u64&lt;/span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;// async fn worker(..., stats: Arc&amp;lt;Mutex&amp;lt;Stats&amp;gt;&amp;gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;//     let mut g = stats.lock().unwrap();
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;//     g.done += 1;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;// }
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Message passing (often less thrash; mirrors a dedicated collector process on the BEAM):&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-rust&#34; data-lang=&#34;rust&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;pub&lt;/span&gt; &lt;span style=&#34;color:#66d9ef&#34;&gt;struct&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;StatsDelta&lt;/span&gt; { &lt;span style=&#34;color:#66d9ef&#34;&gt;pub&lt;/span&gt; done: &lt;span style=&#34;color:#66d9ef&#34;&gt;u64&lt;/span&gt;, &lt;span style=&#34;color:#66d9ef&#34;&gt;pub&lt;/span&gt; errors: &lt;span style=&#34;color:#66d9ef&#34;&gt;u64&lt;/span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;// async fn stats_task(mut srx: mpsc::Receiver&amp;lt;StatsDelta&amp;gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;//     let mut acc = StatsDelta { done: 0, errors: 0 };
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;//     while let Some(d) = srx.recv().await {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;//         acc.done += d.done;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;//         acc.errors += d.errors;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;//     }
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;// }
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Uncomment and wire &lt;code&gt;mpsc&lt;/code&gt; channels when you port the JSONL metrics from &lt;a href=&#34;https://blog.veydh.com/2025/2025-01-concurrent-workload-specification/&#34;&gt;A Language-Agnostic Concurrent Workload for 2025 Comparisons&lt;/a&gt;: one &lt;strong&gt;owner&lt;/strong&gt; task serializes updates without a hot mutex. If profiling shows &lt;strong&gt;lock wait&lt;/strong&gt; is negligible, &lt;code&gt;Arc&amp;lt;Mutex&amp;lt;Stats&amp;gt;&amp;gt;&lt;/code&gt; stays the smaller program.&lt;/p&gt;
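&lt;p&gt;As a runnable stand-in, the same single-owner shape works with &lt;code&gt;std::sync::mpsc&lt;/code&gt; and OS threads (the &lt;code&gt;tokio::sync::mpsc&lt;/code&gt; sketch above is the async equivalent): many senders, one owner folding deltas, no shared mutex.&lt;/p&gt;

```rust
use std::sync::mpsc;
use std::thread;

pub struct StatsDelta { pub done: u64, pub errors: u64 }

// Single-owner collector over std::sync::mpsc: workers send deltas,
// one receiver owns the accumulator, so no lock is ever contended.
fn collect(workers: usize, per_worker: u64) -> (u64, u64) {
    let (tx, rx) = mpsc::channel::<StatsDelta>();
    for _ in 0..workers {
        let tx = tx.clone();
        thread::spawn(move || {
            for _ in 0..per_worker {
                tx.send(StatsDelta { done: 1, errors: 0 }).unwrap();
            }
        });
    }
    drop(tx); // receive loop ends once every worker's sender hangs up
    let mut acc = StatsDelta { done: 0, errors: 0 };
    for d in rx {
        acc.done += d.done;
        acc.errors += d.errors;
    }
    (acc.done, acc.errors)
}

fn main() {
    assert_eq!(collect(4, 100), (400, 0));
}
```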
&lt;hr&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Prefer &lt;strong&gt;message passing&lt;/strong&gt; when it clarifies ownership; use &lt;strong&gt;locks&lt;/strong&gt; when metrics structures are genuinely shared and hot—then profile. &lt;a href=&#34;https://blog.veydh.com/2025/2025-09-debugging-concurrent-systems-books-and-practices/&#34;&gt;Debugging Concurrent Systems: Books and Practices&lt;/a&gt; folds in &lt;strong&gt;debugging practices&lt;/strong&gt; and classic texts so incidents close faster than printf loops.&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>
