Preamble
Mistral-family checkpoints are common bases for assistants: strong quality per dollar, good tooling, and a large community of derivative models. Pulling weights from the Hub is only the first step. Training without the model’s official chat template—or mixing tokenizer specials inconsistently—is how you get sparkling loss curves and broken inference.
The snippets below assume PyTorch + Transformers + PEFT; translate the same invariants to your runner of choice.
This post is a checklist to run before any serious LoRA SFT job on Mistral-class bases. It mirrors the dependency-pinning mindset from earlier DevOps writing: reproducibility beats heroics.
Pin the checkpoint, not just the name
Hub models move. Pin revision to a commit hash for training and for inference, and pin the tokenizer files bundled with that revision along with it. “Latest” is fine for a weekend experiment; it is a liability when you need to diff a regression three weeks later.
Chat templates and special tokens
Instruction-tuned models expect BOS/EOS and role markers in a specific layout. Training on raw concatenated strings—user and assistant blobs jammed together without structure—teaches the adapter the wrong conditional distribution.
Always:
- Load tokenizer from the same revision as model.
- Apply the model card’s recommended chat template (or tokenizer.apply_chat_template when available) to build training examples.
- Mask labels so loss applies only to assistant tokens when that is the objective; masking prevents the model from learning to imitate user prefixes.
If any of those steps feel optional, revisit RAG Foundations: Embeddings, Chunking, and the Retrieval Loop: garbage structure upstream becomes garbage behavior downstream.
Memory, precision, and sequence length
Gradient checkpointing trades compute for activation memory; on long contexts it is often mandatory. Use bf16 when the hardware supports it well; fp16 can work but needs watching for overflow. OOM during the first real batch is a signal to shorten sequences, reduce the micro-batch, or enable ZeRO-style sharding, not to give up on LoRA.
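When you shorten sequences or shrink the micro-batch to dodge OOM, it helps to keep the effective batch constant by raising gradient accumulation. A minimal sketch of that arithmetic (the helper name is my own):

```python
def effective_tokens_per_step(micro_batch: int, grad_accum: int,
                              world_size: int, seq_len: int) -> int:
    """Tokens contributing to one optimizer step.

    Halving micro_batch can be offset by doubling grad_accum so the
    effective batch (and thus the LR schedule's meaning) stays comparable.
    """
    return micro_batch * grad_accum * world_size * seq_len

# e.g. micro-batch 2, accum 8, 4 GPUs, 4096-token sequences
print(effective_tokens_per_step(2, 8, 4, 4096))  # → 262144
```

The same number falls out of micro-batch 1 with accumulation 16, which is the knob to reach for when the first real batch OOMs.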
Saving and merging adapters
During iteration, save adapter weights only: they are small, easy to diff, and easy to A/B swap. Merging adapters into a full weight file is useful for some deployment paths (single artifact, certain runtimes), but it complicates updates because every merge is a new binary. Document which path production uses so nobody confuses “adapter v3” with “merged full model v3”.
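A hedged sketch of the merge path, assuming peft; the adapter and output directories are illustrative:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the pinned base, attach the trained adapter, fold it into the weights.
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3", revision="abc123deadbeef..."
)
merged = PeftModel.from_pretrained(base, "lora-out").merge_and_unload()

# This is a new full-weight artifact: version and document it separately
# from the adapter it was produced from.
merged.save_pretrained("merged-out")
```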
Documentation alongside the weights
An ADR-style note (even a short Markdown file in the repo) should record Hub IDs, dataset hashes, seeds, learning rate schedule, and eval commands. That is the same habit as recording Docker base images and lockfiles: future you is also a teammate.
End-to-end recipe (SFT on JSONL chat rows)
Assume each training row is {"messages": [{"role":"system","content":...}, {"role":"user",...}, {"role":"assistant",...}]} — one assistant turn per example for simplicity.
Step 1 — Load pinned revision
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"
REV = "abc123deadbeef..."  # Hub commit sha

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=REV)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, revision=REV, torch_dtype=torch.bfloat16, device_map="auto"
)
Step 2 — Apply chat template and mask user/system tokens
Use tokenizer.apply_chat_template with return_dict=True and token masks, or build labels manually: set labels[i] = -100 for tokens that belong to system + user prefixes so only assistant tokens contribute loss. Skipping this is how you get a model that parrots the user channel.
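A minimal sketch of the manual masking logic, assuming you can tokenize the prompt-only prefix (system + user, rendered through the same chat template) separately to learn its length; the function name is my own:

```python
IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss

def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy input_ids as labels, masking the first prompt_len tokens.

    prompt_len = token count of the system+user prefix, obtained by
    tokenizing the prompt-only rendering of the same chat template.
    """
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

# A 3-token prompt prefix is masked; assistant tokens keep their ids.
print(mask_prompt_labels([5, 6, 7, 42, 43, 2], prompt_len=3))
# → [-100, -100, -100, 42, 43, 2]
```

Only the last three positions contribute loss, which is exactly the conditional distribution you want the adapter to learn.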
Step 3 — Attach LoRA and train
Match target_modules to Mistral’s linear layer names (q_proj, k_proj, v_proj, o_proj are the usual set). Watch train loss and generations on a small held-out set as the run progresses; loss can keep falling while JSON/tool formatting falls apart.
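A hedged sketch of attaching the adapter with peft; r, lora_alpha, and dropout are illustrative starting points, not recommendations from the model card:

```python
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,                    # rank; illustrative
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)  # `model` from Step 1
model.print_trainable_parameters()       # sanity check: a small fraction trainable
```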
Step 4 — Save adapter + manifest
lora-out/
  adapter_config.json
  adapter_model.safetensors
  manifest.json  # { "base": MODEL_ID, "revision": REV, "dataset_sha256": "...", "seed": 42 }
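A runnable sketch of writing the manifest next to the adapter files; the field names mirror the tree above, and the helper name is my own:

```python
import hashlib
import json
from pathlib import Path

def write_manifest(out_dir: str, base: str, revision: str,
                   dataset_path: str, seed: int) -> dict:
    """Hash the training dataset and record provenance beside the adapter."""
    digest = hashlib.sha256(Path(dataset_path).read_bytes()).hexdigest()
    manifest = {"base": base, "revision": revision,
                "dataset_sha256": digest, "seed": seed}
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

peft's save_pretrained writes adapter_config.json and adapter_model.safetensors into the same directory, so the whole tree ships together.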
Step 5 — Inference smoke test
Load base + adapter, run the same apply_chat_template path used in training, and compare against a golden transcript file in CI if possible.
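The loading half needs the GPU box; the comparison half is plain Python and CI-friendly. A sketch, with the golden file path and the whitespace-normalization policy as assumptions:

```python
# Model side (runs where the weights live):
#   base   = AutoModelForCausalLM.from_pretrained(MODEL_ID, revision=REV)
#   model  = PeftModel.from_pretrained(base, "lora-out")
#   output = generate(...) through the same apply_chat_template path as training
from pathlib import Path

def matches_golden(output: str, golden_path: str) -> bool:
    """Whitespace-normalized comparison against a checked-in transcript."""
    normalize = lambda s: " ".join(s.split())
    return normalize(output) == normalize(Path(golden_path).read_text())
```

Exact-match goldens are brittle under sampling; run the smoke test with greedy decoding (or a fixed seed) so a diff means the adapter changed, not the dice.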
CLI alternative: trl / Axolotl / LLaMA-Factory
Higher-level runners (trl SFT, Axolotl, LLaMA-Factory YAMLs) encode the same steps: dataset → template → masked labels → PEFT. Pros: less boilerplate, proven configs shared on Discord. Cons: another layer to debug when templates change between library versions. Still pin the runner’s version, and when an incident happens, export the rendered tokenization for one row to disk so you can see exactly what the model trained on.
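For comparison, the same job sketched through trl, assuming a recent trl with SFTConfig; the dataset path and hyperparameters are placeholders, and you should check how your trl version pins the model revision:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Rows shaped like the JSONL above, with a "messages" column.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # trl loads model + tokenizer
    train_dataset=dataset,
    args=SFTConfig(output_dir="lora-out"),
    peft_config=LoraConfig(
        r=16, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]
    ),
)
trainer.train()
```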
Conclusion
Mistral-class fine-tunes are won or lost on tokenizer fidelity and pinned revisions. Qwen and LLaMA Distilled Checkpoints: LoRA Gotchas extends the same discipline to Qwen and LLaMA derivatives where multilingual vocabs and RoPE extensions add their own footguns.