Preamble
Mistral-family checkpoints are common bases for assistants: strong quality per dollar, good tooling, and a large community of derivative models. Pulling weights from the Hub is only the first step. Training without the model’s official chat template—or mixing tokenizer specials inconsistently—is how you get sparkling loss curves and broken inference.
The snippets below assume PyTorch + Transformers + PEFT; translate the same invariants to your runner of choice.
This post is a checklist to run before any serious LoRA SFT job on Mistral-class bases. It mirrors the dependency-pinning mindset from earlier DevOps writing: reproducibility beats heroics.
Pin the checkpoint, not just the name
Hub models move. Pin revision to a commit hash for training and for inference, and pin the tokenizer files bundled with that revision along with it. “Latest” is fine for a weekend experiment; it is a liability when you need to diff a regression three weeks later.
Chat templates and special tokens
Instruction-tuned models expect BOS/EOS and role markers in a specific layout. Training on raw concatenated strings—user and assistant blobs jammed together without structure—teaches the adapter the wrong conditional distribution.
Always:
- Load tokenizer from the same revision as model.
- Apply the model card’s recommended chat template (or tokenizer.apply_chat_template when available) to build training examples.
- Mask labels so loss applies only to assistant tokens when that is the objective; masking prevents the model from learning to imitate user prefixes.
If any of those steps feel optional, revisit RAG Foundations: Embeddings, Chunking, and the Retrieval Loop: garbage structure upstream becomes garbage behavior downstream.
Memory, precision, and sequence length
Gradient checkpointing trades compute for activation memory; on long contexts it is often mandatory. Use bf16 when the hardware supports it well; fp16 can work but needs watching for overflow. OOM during the first real batch is a signal to shorten sequences, reduce the micro-batch, or enable ZeRO-style sharding, not to give up on LoRA.
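When you shorten sequences or shrink the micro-batch to dodge OOM, it helps to keep the effective batch constant by raising gradient accumulation. A minimal sketch of that arithmetic (the helper name is my own):

```python
def effective_tokens_per_step(micro_batch: int, grad_accum: int,
                              world_size: int, seq_len: int) -> int:
    """Tokens contributing to one optimizer step.

    Halving micro_batch can be offset by doubling grad_accum so the
    effective batch (and thus the LR schedule's meaning) stays comparable.
    """
    return micro_batch * grad_accum * world_size * seq_len

# e.g. micro-batch 2, accum 8, 4 GPUs, 4096-token sequences
print(effective_tokens_per_step(2, 8, 4, 4096))  # → 262144
```

The same number falls out of micro-batch 1 with accumulation 16, which is the knob to reach for when the first real batch OOMs.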
Saving and merging adapters
During iteration, save adapter weights only: they are small, easy to diff, and easy to A/B swap. Merging adapters into a full weight file is useful for some deployment paths (single artifact, certain runtimes), but it complicates updates because every merge is a new binary. Document which path production uses so nobody confuses “adapter v3” with “merged full model v3”.
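A hedged sketch of the merge path, assuming peft; the adapter and output directories are illustrative:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the pinned base, attach the trained adapter, fold it into the weights.
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3", revision="abc123deadbeef..."
)
merged = PeftModel.from_pretrained(base, "lora-out").merge_and_unload()

# This is a new full-weight artifact: version and document it separately
# from the adapter it was produced from.
merged.save_pretrained("merged-out")
```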
Documentation alongside the weights
An ADR-style note (even a short Markdown file in the repo) should record Hub IDs, dataset hashes, seeds, learning rate schedule, and eval commands. That is the same habit as recording Docker base images and lockfiles: future you is also a teammate.
End-to-end recipe (SFT on JSONL chat rows)
Assume each training row is {"messages": [{"role":"system","content":...}, {"role":"user",...}, {"role":"assistant",...}]} — one assistant turn per example for simplicity.
Step 1 — Load pinned revision
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"
REV = "abc123deadbeef..."  # Hub commit sha

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=REV)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, revision=REV, torch_dtype=torch.bfloat16, device_map="auto"
)
Step 2 — Apply chat template and mask user/system tokens
Use tokenizer.apply_chat_template with return_dict=True and token masks, or build labels manually: set labels[i] = -100 for tokens that belong to system + user prefixes so only assistant tokens contribute loss. Skipping this is how you get a model that parrots the user channel.
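A minimal sketch of the manual masking logic, assuming you can tokenize the prompt-only prefix (system + user, rendered through the same chat template) separately to learn its length; the function name is my own:

```python
IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss

def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy input_ids as labels, masking the first prompt_len tokens.

    prompt_len = token count of the system+user prefix, obtained by
    tokenizing the prompt-only rendering of the same chat template.
    """
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

# A 3-token prompt prefix is masked; assistant tokens keep their ids.
print(mask_prompt_labels([5, 6, 7, 42, 43, 2], prompt_len=3))
# → [-100, -100, -100, 42, 43, 2]
```

Only the last three positions contribute loss, which is exactly the conditional distribution you want the adapter to learn.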
Step 3 — Attach LoRA and train
Match target_modules to Mistral’s linear layer names (q_proj, k_proj, v_proj, o_proj are the usual set). Watch train loss and generations on a small held-out set as the run progresses; loss can keep falling while JSON/tool formatting falls apart.
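A hedged sketch of attaching the adapter with peft; r, lora_alpha, and dropout are illustrative starting points, not recommendations from the model card:

```python
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,                    # rank; illustrative
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)  # `model` from Step 1
model.print_trainable_parameters()       # sanity check: a small fraction trainable
```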
Step 4 — Save adapter + manifest
lora-out/
  adapter_config.json
  adapter_model.safetensors
  manifest.json  # { "base": MODEL_ID, "revision": REV, "dataset_sha256": "...", "seed": 42 }
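A runnable sketch of writing the manifest next to the adapter files; the field names mirror the tree above, and the helper name is my own:

```python
import hashlib
import json
from pathlib import Path

def write_manifest(out_dir: str, base: str, revision: str,
                   dataset_path: str, seed: int) -> dict:
    """Hash the training dataset and record provenance beside the adapter."""
    digest = hashlib.sha256(Path(dataset_path).read_bytes()).hexdigest()
    manifest = {"base": base, "revision": revision,
                "dataset_sha256": digest, "seed": seed}
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

peft's save_pretrained writes adapter_config.json and adapter_model.safetensors into the same directory, so the whole tree ships together.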
Step 5 — Inference smoke test
Load base + adapter, run the same apply_chat_template path used in training, and compare against a golden transcript file in CI if possible.
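The loading half needs the GPU box; the comparison half is plain Python and CI-friendly. A sketch, with the golden file path and the whitespace-normalization policy as assumptions:

```python
# Model side (runs where the weights live):
#   base   = AutoModelForCausalLM.from_pretrained(MODEL_ID, revision=REV)
#   model  = PeftModel.from_pretrained(base, "lora-out")
#   output = generate(...) through the same apply_chat_template path as training
from pathlib import Path

def matches_golden(output: str, golden_path: str) -> bool:
    """Whitespace-normalized comparison against a checked-in transcript."""
    normalize = lambda s: " ".join(s.split())
    return normalize(output) == normalize(Path(golden_path).read_text())
```

Exact-match goldens are brittle under sampling; run the smoke test with greedy decoding (or a fixed seed) so a diff means the adapter changed, not the dice.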
CLI alternative: trl / Axolotl / LLaMA-Factory
Higher-level runners (trl SFT, Axolotl, LLaMA-Factory YAMLs) encode the same steps: dataset → template → masked labels → PEFT. Pros: less boilerplate, proven configs shared on Discord. Cons: another layer to debug when templates change between library versions. Still pin the runner’s version, and when an incident happens, export the rendered tokenization for one row to disk so you can see exactly what the model trained on.
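For comparison, the same job sketched through trl, assuming a recent trl with SFTConfig; the dataset path and hyperparameters are placeholders, and you should check how your trl version pins the model revision:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Rows shaped like the JSONL above, with a "messages" column.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # trl loads model + tokenizer
    train_dataset=dataset,
    args=SFTConfig(output_dir="lora-out"),
    peft_config=LoraConfig(
        r=16, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]
    ),
)
trainer.train()
```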
Conclusion
Mistral-class fine-tunes are won or lost on tokenizer fidelity and pinned revisions. Qwen and LLaMA Distilled Checkpoints: LoRA Gotchas extends the same discipline to Qwen and LLaMA derivatives where multilingual vocabs and RoPE extensions add their own footguns.