Preamble

Low-Rank Adaptation (LoRA) is the compromise when a model should sound like our support desk, respect our JSON tool schema, or stop doing that one thing—without paying the full price of updating every weight in a multi-billion-parameter checkpoint.

The idea is elementary linear algebra dressed as a training trick: instead of learning a full update matrix for a layer, you learn a low-rank factorization added to frozen weights. VRAM drops, iteration speed rises, and you can often ship adapter-only artifacts that are easy to version beside your data.


Why pair distillation with LoRA

Distilled students (from Mistral, Qwen, LLaMA families on the Hub) trade some ceiling capability for throughput and cost. They are attractive bases for assistants because you can serve them on fewer GPUs or smaller batches.

LoRA sits on top of that trade: you keep the frozen student as a stable anchor, then nudge behavior with a thin adapter trained on curated SFT data. You are not trying to re-teach the entire internet—just steer tone, format, and domain vocabulary.


What actually gets trained

Implementations differ, but the mental model is:

  • Frozen: most of the transformer—embedding tables and the bulk of attention/MLP weights stay put.
  • Trainable: small matrices injected into chosen projections (q_proj, k_proj, v_proj, o_proj, sometimes MLP gates). The library merges or applies these at forward time.
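A toy, dependency-free sketch of how the adapter path is applied at forward time. Shapes follow the usual convention (W is the frozen d_out×d_in weight, A is r×d_in, B is d_out×r) and the alpha/r scale matches the one used later in this recipe; all concrete values here are illustrative:

```python
def matvec(M, x):
    # plain matrix-vector product over nested lists
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=16, r=1):
    frozen = matvec(W, x)               # W @ x, never updated
    low_rank = matvec(B, matvec(A, x))  # B @ (A @ x), the trainable path
    scale = alpha / r
    return [f + scale * l for f, l in zip(frozen, low_rank)]

# identity frozen weight, rank-1 adapter touching only the first output dim
y = lora_forward([[1, 0], [0, 1]], [[1, 1]], [[1], [0]], [2, 3], alpha=1, r=1)
# → [7, 3]
```

Because A and B are thin, the adapter path costs r·(d_in + d_out) multiplications per token instead of d_in·d_out.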

Rank r controls adapter capacity. Tiny r underfits quirky datasets; large r inches toward full fine-tune behavior and cost.

Alpha (lora_alpha, usually set alongside r) scales the effective strength of the adapter; the applied scale is roughly alpha/r, so tune the two together rather than leaving defaults unquestioned.
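To make the capacity trade concrete, a back-of-envelope sizing sketch; the 4096×4096 projection is a hypothetical Mistral-7B-like attention shape, not a measured value:

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    # A is (r x d_in), B is (d_out x r); delta W = (alpha / r) * B @ A
    return r * d_in + d_out * r

full = 4096 * 4096                      # params in one frozen projection
adapter = lora_params(4096, 4096, r=16)
print(adapter, adapter / full)          # 131072 params, ~0.78% of the matrix
```

Doubling r doubles adapter size and capacity linearly, which is why sweeping r in {8, 16, 32} is cheap compared to any full fine-tune.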

Dropout on adapter paths helps when the SFT set is small; it fights memorization of noisy examples.


Data hygiene beats clever hyperparameters

Datasets are versioned like code: git, manifests, and hashes. Deduplication matters—paraphrased duplicates inflate metrics and encourage verbatim regurgitation. System vs user turns should be labeled or separated consistently so chat templates do not scramble during training.
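A minimal dedup sketch along these lines; the "prompt"/"response" field names are assumptions about your dataset layout, and hashing a normalized form only catches near-verbatim duplicates, not paraphrases:

```python
import hashlib

def normalize(text: str) -> str:
    # crude canonical form: lowercase and collapse whitespace
    return " ".join(text.lower().split())

def dedupe(examples):
    # keep the first occurrence of each normalized prompt/response pair
    seen, kept = set(), []
    for ex in examples:
        key = normalize(ex["prompt"]) + "\x1f" + normalize(ex["response"])
        digest = hashlib.sha256(key.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(ex)
    return kept
```

The same digests double as a cheap dataset fingerprint to record beside each training run.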


Step-by-step: LoRA with Transformers + PEFT (mental checklist)

This is the “one canonical recipe” to keep stable; the companion pieces Fine-Tuning a Mistral-Class Checkpoint with LoRA on Hugging Face and Qwen and LLaMA Distilled Checkpoints: LoRA Gotchas only swap tokenizer/template details.

  1. Pin base model_id + revision (git commit on the Hub) for weights and tokenizer JSON.
  2. Choose targets — Default: attention projections (q_proj, k_proj, v_proj, o_proj). Add MLP only if evals plateau and VRAM allows.
  3. Set PEFT LoraConfig(r=8 or 16, lora_alpha=16 or 32, lora_dropout=0.05, target_modules=[...], bias="none", task_type="CAUSAL_LM"). Treat r and alpha as coupled: effective scale is roughly alpha/r.
  4. Load in bf16 — model = AutoModelForCausalLM.from_pretrained(..., torch_dtype=torch.bfloat16) when the GPU supports it; enable gradient checkpointing if sequence length × batch threatens OOM.
  5. Tokenize with chat template — Build messages list → tokenizer.apply_chat_template(..., tokenize=True, return_tensors="pt", add_generation_prompt=False) for SFT; mask non-assistant token positions with -100 in labels so loss applies only where you want imitation (Fine-Tuning a Mistral-Class Checkpoint with LoRA on Hugging Face expands Mistral-specific layout).
  6. Train — Trainer or SFTTrainer with a small learning rate (e.g. 1e-4 to 2e-4 for adapters, lower if unstable), cosine or linear decay, weight decay 0–0.1, and max_steps or epochs chosen from a tiny dry run that completes without NaN.
  7. Save — trainer.model.save_pretrained("./lora-out") saves adapter-only shards; record the dataset hash, seed, and CLI args beside the folder.
  8. Load for inference — PeftModel.from_pretrained(base, "./lora-out"), or merge for deployment paths that require a single weight file.
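Step 5's label masking is the part that goes wrong most often. A minimal sketch, assuming you have already computed which token positions belong to assistant turns (deriving that boolean mask from the chat template is the model-family-specific part):

```python
def mask_labels(input_ids, assistant_mask, ignore_index=-100):
    # Labels start as a copy of input_ids; every position outside an
    # assistant turn is replaced with ignore_index so cross-entropy
    # loss skips it and the model only imitates assistant text.
    return [tok if is_assistant else ignore_index
            for tok, is_assistant in zip(input_ids, assistant_mask)]

# system/user tokens masked, assistant tokens kept
labels = mask_labels([1, 5, 9, 42, 7], [False, False, True, True, False])
# → [-100, -100, 9, 42, -100]
```

If loss drops implausibly fast, check this mask first: training on user turns inflates the loss signal with text the model should never generate.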

Minimal wiring (abbreviated; align with your model family’s target_modules names):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model, TaskType

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"
REV = "<pin-a-commit-sha>"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=REV)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    revision=REV,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora)

# ... build tokenized dataset with masked labels, then:
args = TrainingArguments(
    output_dir="./lora-run",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
    save_steps=200,
    bf16=True,
    gradient_checkpointing=True,
)
trainer = Trainer(model=model, args=args, train_dataset=tokenized_train)  # pass a data_collator if your examples need padding
trainer.train()
trainer.model.save_pretrained("./lora-out")

Pros of this path: battle-tested libraries, easy diff between runs, adapter artifacts stay small. Cons: you still own data hygiene, template correctness, and eval gates—PEFT does not fix mis-tokenized chats.
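The merge in step 8 is plain arithmetic: fold the scaled low-rank product back into the frozen weight. A dependency-free sketch of what happens per target module (PEFT does this with real tensors; the shapes here are toy):

```python
def merge_lora(W, A, B, alpha, r):
    # merged = W + (alpha / r) * (B @ A)
    # W: d_out x d_in frozen weight, A: r x d_in, B: d_out x r
    scale = alpha / r
    d_out, d_in = len(W), len(W[0])
    return [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(d_in)]
            for i in range(d_out)]
```

After merging, inference pays no adapter overhead and ships as a single weight file, at the price of losing hot-swappable adapters.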


Conclusion

Pick one canonical training recipe—Transformers + PEFT, or a higher-level runner—and change only data and hyperparameters between experiments. Fine-Tuning a Mistral-Class Checkpoint with LoRA on Hugging Face and Qwen and LLaMA Distilled Checkpoints: LoRA Gotchas walk through the family-specific quirks (Mistral templates, Qwen tokenization, LLaMA RoPE) so the same LoRA discipline does not stumble on tokenizer details.