Preamble (plain English)
Hugging Face (HF) hosts thousands of model checkpoints. Among them, Qwen and LLaMA families are everywhere: raw bases, chat-tuned “instruct” versions, and distilled smaller models trained to imitate bigger ones for speed.
LoRA (Low-Rank Adaptation) means you do not retrain the whole model—you add a thin trainable “side path” and leave most weights frozen. That is cheap and fast, but the adapter is not a free-floating brain: it is tied to one specific base checkpoint.
The training recipe from Fine-Tuning a Mistral-Class Checkpoint with LoRA on Hugging Face still applies. This post is the “read me before you merge adapters” note: tokenizer, position encoding (RoPE), attention layout (GQA), and licenses differ by family. Getting any of those wrong often fails silently (the model runs, but quality drifts or breaks in edge cases).
Tiny glossary (terms that show up in configs and papers)
- Tokenizer — The piece that turns raw text into integer tokens the model sees. Different families use different tokenizers; swap the tokenizer and you have effectively changed the “alphabet” the model was trained on.
- RoPE — Rotary Positional Embedding. A standard way to tell the model how far apart tokens are in the sequence (position information) without paying the full cost of older absolute-position schemes. “RoPE scaling” tweaks that math so the model can pretend it was trained on longer sequences than it mostly saw—useful for long context, but easy to get wrong between train and serve.
- GQA — Grouped-Query Attention. A memory-saving trick: several query heads share the same key/value heads instead of every query head getting its own K/V. Layout differs across checkpoints; mixing configs can break loading or silently change behavior if something maps layers wrong.
- SFT — Supervised Fine-Tuning. You show (prompt, desired answer) pairs and train the model to imitate them—typical “instruction tuning.”
- PEFT — Parameter-Efficient Fine-Tuning (e.g. LoRA via the `peft` library). You train a few parameters; the rest stay frozen.
- ADR — Architecture Decision Record. A short doc in your repo that says why you picked a base model and how you pin versions.
Qwen: multilingual tokenization changes the effective budget
Think of the tokenizer as a ruler that measures text in tokens, not characters. Qwen’s tokenizer is built for many languages. That is great for products that are not English-only; it also changes planning in ways people miss.
Gotcha 1 — Same “length” in characters ≠ same token budget.
A paragraph in English might become 800 tokens; the same idea in another script might become 1,400 tokens. Your “4k context” window is 4k tokens, not 4k letters—so non-English locales can feel like they have less room for instructions, tools, and retrieved documents.
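A quick way to make this concrete is to count tokens per locale for the same content. The sketch below uses a deliberately crude stand-in tokenizer (whitespace words for ASCII, one token per character otherwise, loosely mimicking how unfamiliar scripts fragment); in practice you would pass your real tokenizer's `encode`, e.g. from `AutoTokenizer.from_pretrained(...)`. The sample texts and the stand-in are illustrative assumptions, not measurements of any real checkpoint.

```python
# Sketch: measure the real token budget per locale instead of assuming
# English character density. toy_encode is a stand-in; swap in your
# actual tokenizer's encode method for real numbers.

def token_counts(encode, samples: dict) -> dict:
    """Map locale -> token count for the same content in each language."""
    return {locale: len(encode(text)) for locale, text in samples.items()}

def toy_encode(text: str) -> list:
    # ASCII words stay whole; other scripts fragment into one token per
    # character, mimicking a tokenizer trained mostly on English.
    tokens = []
    for word in text.split():
        if word.isascii():
            tokens.append(word)
        else:
            tokens.extend(word)
    return tokens

samples = {
    "en": "The quick brown fox jumps over the lazy dog",
    "zh": "敏捷的棕色狐狸跳过懒狗",
}
counts = token_counts(toy_encode, samples)
print(counts)  # → {'en': 9, 'zh': 11}
```

Plan prompt budgets against the worst locale you ship, not the English count.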
Gotcha 2 — English-only metrics lie.
If you only watch loss or accuracy on English dev data, you can ship something that looks fine in dashboards but degrades in other languages. Check each language you actually ship, including formatting quirks (dates, numbers, mixed scripts).
Gotcha 3 — RAG stacks compound the bias.
If your chunking, embedders, and eval sets were built English-first, retrieval quality drops where the tokenizer and embeddings were not aligned with real user text. The LoRA cannot fix upstream retrieval that never surfaces the right chunks.
LLaMA: RoPE, GQA, and “long context” bases
RoPE and “long context” marketing
A checkpoint may advertise a large context window, but pretraining data is still mostly shorter sequences. The model learns long-context behavior partly through extrapolation (math and scaling tricks). If your SFT data jumps wildly between very short and very long examples without a clear pattern, you can get brittle behavior at the lengths you care about in production.
Gotcha 4 — Train-time vs serve-time RoPE settings must match.
`rope_scaling`, `max_position_embeddings`, and related flags in `config.json` are not cosmetic. If training uses scaling recipe A and your inference server uses B (or an older Transformers version interprets them differently), you often get no crash—just worse answers, especially on long prompts. Pin library versions and dump the effective config next to your adapter in the same manifest.
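One minimal way to "dump the effective config" is a small manifest written next to the adapter. In this sketch, `config` is a hand-written stand-in for the dict you would get from loading `config.json` (or `model.config.to_dict()`); the field values are illustrative, not from a real checkpoint.

```python
import json

# Sketch: snapshot the position-encoding contract next to the adapter so
# train-time and serve-time settings can be diffed. `config` is a stand-in
# with illustrative values.
config = {
    "max_position_embeddings": 32768,
    "rope_scaling": {"type": "linear", "factor": 4.0},
    "rope_theta": 1000000.0,
    "hidden_size": 4096,
}

ROPE_FIELDS = ("max_position_embeddings", "rope_scaling", "rope_theta")

def rope_manifest(cfg: dict) -> dict:
    # Keep absent fields explicit as None so a missing flag shows up in diffs
    # instead of silently disappearing.
    return {k: cfg.get(k) for k in ROPE_FIELDS}

manifest = rope_manifest(config)
with open("rope_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2, sort_keys=True)
```

Commit the same file from your serving environment and diff the two in CI.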
Gotcha 5 — GQA layout is part of the architecture contract.
Community and vendor builds differ in how many heads group together. An adapter trained on one revision is not a guaranteed drop-in on another revision if attention layout or hidden sizes changed. Treat “same model name, different revision” like a different car engine—the body looks similar, parts may not interchange.
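A cheap gate is to compare the architecture fields of the revision the adapter was trained on against the revision you are about to load it onto, and refuse on any mismatch. The field names below match common LLaMA-style `config.json` keys; the two example configs are illustrative.

```python
# Sketch: check the attention-layout contract before loading an adapter
# onto a different revision. Field names follow LLaMA-style configs;
# the example values are illustrative.

CONTRACT_FIELDS = (
    "hidden_size",
    "num_attention_heads",
    "num_key_value_heads",   # GQA grouping: query heads per shared K/V head
    "num_hidden_layers",
    "intermediate_size",
)

def contract_diff(train_cfg: dict, serve_cfg: dict) -> dict:
    """Return {field: (train_value, serve_value)} for every mismatch."""
    return {
        k: (train_cfg.get(k), serve_cfg.get(k))
        for k in CONTRACT_FIELDS
        if train_cfg.get(k) != serve_cfg.get(k)
    }

train_cfg = {"hidden_size": 4096, "num_attention_heads": 32, "num_key_value_heads": 8}
serve_cfg = {"hidden_size": 4096, "num_attention_heads": 32, "num_key_value_heads": 32}
diff = contract_diff(train_cfg, serve_cfg)
# diff == {"num_key_value_heads": (8, 32)} — refuse to load the adapter.
```

An empty diff is necessary but not sufficient; it catches the loud incompatibilities, not every weight-level change.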
Gotcha 6 — “Eval average loss” hides long-context bugs.
Short training runs and average metrics smooth over failures that only appear at max production context. Smoke tests at 200 steps are fine for wiring; you still need at least one eval that stresses the longest context you plan to use (or the longest you can afford to simulate).
Licenses and redistribution
Gotcha 7 — LoRA does not reset licensing.
You still ship a story that includes the base weights (loaded at runtime) plus your adapter (or a merged full checkpoint). Some variants restrict commercial use, require attribution, or impose compliance steps. “We only fine-tuned a little” does not exempt you from the base model’s terms.
Gotcha 8 — Document Hub IDs like you document dependencies.
Put license notes beside exact model IDs and revisions (git-style hashes on the Hub) in the same ADR where you record dataset hashes and seeds. That is what legal and security reviews actually read.
One adapter per base revision
Gotcha 9 — Do not casually “port” adapters across bases.
One LoRA per base revision is the safe default. Research sometimes transfers adapters between related checkpoints; in production that is unforced risk—small numerical differences in layernorms, embeddings, or RoPE implementation can accumulate into large behavior changes.
If you must try transfer, treat it like a new model: full eval gates, regression tests on golden prompts, and no “we eyeballed one benchmark screenshot.”
Step-by-step when switching families (Qwen ↔ LLaMA)
The PEFT recipe from Fine-Tuning Small Distilled Models Without Full Weights and Fine-Tuning a Mistral-Class Checkpoint with LoRA on Hugging Face still applies; below are the extra gates per family.
- Tokenizer probe — For each locale you ship, tokenize a fixed paragraph of representative text. Record the token count and verify the round-trip: for string inputs, `decode(encode(x))` should equal `x` (or document intentional normalization). Multilingual bases punish "character budget" planning that assumed English token density.
- Template probe — Dump `tokenizer.apply_chat_template` on one golden multi-turn example to a `.txt` file in git; diff that file in PRs that touch chat formatting. Qwen and LLaMA instruct variants use different role markers and spacing; mixing templates means you train the model to expect the wrong framing at inference.
- RoPE / context flags — Copy into your run manifest: `rope_scaling`, `max_position_embeddings`, and any other fields your training and serving stacks read from `config.json`. Training with scaling A and serving with scaling B is a classic silent quality regression.
- Train short, eval long — Run a short smoke job for plumbing; then evaluate at the maximum context you intend to use in production (or the nearest affordable subset). Extrapolation bugs show up here, not in smoothed training loss.
- Merge policy — Decide early whether production loads a `PeftModel` (base + adapter at runtime) or a merged `save_pretrained` directory. Merging simplifies some C++ or single-binary servers; it complicates "swap adapter only" deploys and can blur audit trails if you lose the separation between base and delta.
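The round-trip check from the tokenizer probe fits in a few lines. In this sketch, `tok` is anything with `encode`/`decode`; `NormalizingTok` is a toy that lowercases and collapses whitespace, standing in for the normalization a real tokenizer may apply, so the demo deliberately shows a failing round-trip.

```python
# Sketch of the round-trip gate from the tokenizer probe. NormalizingTok
# is a toy tokenizer: lossy on purpose, to show what the report catches.

def roundtrip_report(tok, text: str):
    """Return (ok, decoded) so intentional normalization can be documented."""
    decoded = tok.decode(tok.encode(text))
    return decoded == text, decoded

class NormalizingTok:
    def encode(self, text: str) -> list:
        return text.lower().split()     # lossy: case and spacing are dropped
    def decode(self, ids: list) -> str:
        return " ".join(ids)

ok, decoded = roundtrip_report(NormalizingTok(), "Bonjour  Qwen")
# ok is False: the capital letters and double space did not survive.
```

When `ok` is `False` for a real tokenizer, either the normalization is documented and acceptable, or it is a bug in your preprocessing.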
Optional merge (when you explicitly want one directory of full weights):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# BASE_ID / BASE_REV: the exact Hub ID and revision the adapter was
# trained against, from your run manifest.
base = AutoModelForCausalLM.from_pretrained(
    BASE_ID, revision=BASE_REV, torch_dtype=torch.bfloat16, device_map="cpu"
)
base = PeftModel.from_pretrained(base, "./lora-out")

# Fold the LoRA deltas into the base weights and drop the adapter wrappers.
merged = base.merge_and_unload()
merged.save_pretrained("./merged-model")

# Ship the tokenizer alongside the weights so the directory is self-contained.
tokenizer = AutoTokenizer.from_pretrained(BASE_ID, revision=BASE_REV)
tokenizer.save_pretrained("./merged-model")
```
Gotcha 10 — Re-evaluate after merge.
Run the same eval suite on merged weights. Do not assume merge is bitwise-identical to “base + adapter” forward pass in every runtime without checking—numerical paths and optimizations can differ.
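Beyond the eval suite, a numeric spot check is cheap: run the same golden prompts through both variants and bound the logit drift. The sketch below uses two short float lists standing in for flattened logits (illustrative values); the tolerance is an assumption you would tune per dtype, since bf16 merges are rarely bitwise identical.

```python
# Sketch: a minimal numeric gate for "merged vs base + adapter". In practice
# the two lists would be logits from both model variants on the same golden
# prompts; the values and tolerance here are illustrative.

def max_abs_diff(a: list, b: list) -> float:
    assert len(a) == len(b), "logit shapes must match"
    return max(abs(x - y) for x, y in zip(a, b))

runtime_logits = [0.10, -1.52, 3.04]   # base + adapter at runtime
merged_logits  = [0.10, -1.52, 3.05]   # merged checkpoint

drift = max_abs_diff(runtime_logits, merged_logits)
TOLERANCE = 1e-1  # pick per dtype; bf16 merges will not be bitwise identical
assert drift <= TOLERANCE, f"merge drift {drift} exceeds tolerance"
```

Small drift plus a green eval suite is the signal you want; either one alone can mislead.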
Hub / tooling matrix (same training, different ergonomics)
| Workflow | Tooling | When to use it |
|---|---|---|
| Notebook → script | Transformers + PEFT + TRL (Transformer Reinforcement Learning library—helpers for SFT/RLHF-style training on HF) | Small teams, full control |
| YAML-driven SFT | LLaMA-Factory, Axolotl | Faster iteration when configs are shared |
| Managed fine-tune | Cloud vendor jobs | When compliance wants a hosted trail; still export adapter + manifest |
The reproducibility invariant is the same: pinned base revision, frozen dataset hash, logged `lora_config`, identical chat template in train and serve.
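That invariant can be written down as one manifest per training run. All values below are illustrative placeholders (the Hub ID, the revision hash, the LoRA hyperparameters); `dataset_sha256` would normally be computed over the actual training file, not an inline byte string.

```python
import hashlib
import json

# Sketch: one manifest per training run capturing the reproducibility
# invariant. All concrete values are illustrative placeholders.

def dataset_hash(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

manifest = {
    "base_model": "meta-llama/Llama-3.1-8B",       # example Hub ID
    "base_revision": "abc123def456",               # placeholder git hash
    "dataset_sha256": dataset_hash(b"train.jsonl contents go here"),
    "lora_config": {"r": 16, "lora_alpha": 32,
                    "target_modules": ["q_proj", "v_proj"]},
    "chat_template_file": "chat_template.txt",     # same file used at serve time
}
with open("run_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2, sort_keys=True)
```

This is the same file your ADR, legal review, and serving config should all point at.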
Conclusion
In one sentence: family-specific tokenizer and RoPE (and attention layout) details decide whether your fine-tune is stable—or fragile in ways that only show up under real prompts and real languages. Serving Distilled Models Behind an HTTP API moves from training to serving: HTTP boundaries, OpenAI-compatible APIs, and runtime choices once the adapter exists.