Key Takeaways
- LoRA adds small trainable adapter matrices to a frozen pre-trained model. Only a small fraction of the parameter count (typically a few percent or less) is trained, dramatically reducing VRAM and time.
- Fine-tuning requirements: 500–1000 high-quality examples, 8–16 GB VRAM, 1–4 hours training time.
- Best tools: unsloth (fastest), Hugging Face TRL, Axolotl (most flexible).
- LoRA rank (r): Lower (r=8) is smaller, faster; higher (r=64) is more expressive. Default: r=16–32.
- As of April 2026, LoRA is production-ready and widely supported across inference engines.
How Does LoRA Work?
LoRA adds small "adapter" matrices alongside the original model weights. During training, only the adapters are updated; the original weights stay frozen.
Example: A 13B model has 13 billion weights. LoRA adds only ~50 million trainable parameters (~0.4% of the original), so each step updates a tiny fraction of the model and fine-tuning runs far faster and in far less memory than full fine-tuning.
At inference, the low-rank adapter update can be folded (added) into the base weights, after which the merged model runs at full speed; left unmerged, the extra adapter matmul costs only a small penalty (~5%).
Result: A domain-specific model that performs better on your tasks, using only 8 GB VRAM instead of 26 GB.
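To make the merge claim concrete, here is a toy NumPy sketch (an illustrative 512×512 layer, not a real model): running the frozen weight plus the low-rank adapter path gives the same output as folding the adapter into the weight once.

```python
import numpy as np

d, r, alpha = 512, 16, 32          # toy layer width, LoRA rank, LoRA alpha
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))    # frozen pre-trained weight
# A trained adapter pair. In real training B starts zero-initialized;
# here both are nonzero to mimic an already-trained adapter.
A = rng.standard_normal((r, d)) * 0.01
B = rng.standard_normal((d, r)) * 0.01
scale = alpha / r

x = rng.standard_normal((4, d))    # a small batch of activations

# Unmerged inference: base path plus the low-rank adapter path
y_unmerged = x @ W.T + scale * (x @ A.T) @ B.T

# Merged inference: fold the adapter into the base weight once
W_merged = W + scale * (B @ A)
y_merged = x @ W_merged.T

assert np.allclose(y_unmerged, y_merged)

# Trainable fraction: 2*d*r adapter params vs d*d frozen params
print(2 * d * r / (d * d))  # 0.0625, i.e. ~6% for this toy size
```

For a real transformer, `d` is much larger and the adapter is applied to several projection matrices, which is how the fraction drops well below 1%.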
Should You Fine-Tune or Use RAG?
Decision matrix:
| Criteria | Fine-Tuning | RAG |
|---|---|---|
| Document change frequency | Annual or less | Weekly or more |
| Knowledge requirements | Model needs deep understanding | Retrieval suffices |
| Training data available | Need 500+ high-quality examples | Any documents work |
| Cost (long-term) | One-time ($50–200) | Continuous embeddings |
| Latency | Faster (no retrieval) | Slower (retrieval + LLM) |
| Best for | Code, creative writing, domain style | Knowledge bases, Q&A |
How Do You Prepare Training Data?
Quality training data determines fine-tuning success. Poor data = poor model.
Minimum: 500 examples. Each example = input + expected output.
Optimal: 1000–5000 examples. More data = better accuracy.
Format: a JSON array, or JSONL where each line is one training example.
[
{"instruction": "Translate to French", "input": "Hello world", "output": "Bonjour le monde"},
{"instruction": "Summarize", "input": "Long text...", "output": "Summary..."},
{"instruction": "Code review", "input": "Python code...", "output": "Review comments..."}
]
# OR instruction-only format:
[
{"text": "<|user|>Translate to French\nHello<|assistant|>Bonjour"},
{"text": "<|user|>Summarize\nText<|assistant|>Summary"}
]
Fine-Tuning Setup With Unsloth
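Before launching training, a quick sanity check on the JSONL file catches format errors early. This minimal sketch assumes the instruction/input/output schema shown above; adjust `REQUIRED_KEYS` to your own schema.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}  # matches the format above

def validate_jsonl(path):
    """Parse every line and verify the expected keys; returns the example count."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            example = json.loads(line)  # raises on malformed JSON
            missing = REQUIRED_KEYS - example.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
            count += 1
    return count
```

Running `validate_jsonl("training.jsonl")` before training costs seconds and avoids discovering a malformed line an hour into a run.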
Unsloth is the fastest LoRA framework (4× speed vs standard training):
# Install unsloth (quote the extra so shells don't glob the brackets)
pip install "unsloth[colab-new]" xformers bitsandbytes

from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load the 4-bit base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters (the LoRA options belong here, not in from_pretrained)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Load training data
dataset = load_dataset("json", data_files="training.jsonl")

# Configure trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        output_dir="output",
    ),
)

# Train
trainer.train()
Key Hyperparameters for LoRA Fine-Tuning
| Hyperparameter | Recommended Value | Typical Range | Effect |
|---|---|---|---|
| LoRA rank (r) | 16 | 8–64 | Adapter capacity; higher r is more expressive but trains more parameters |
| lora_alpha | 32 | 16–64 | Scales the adapter update (commonly set to 2×r) |
| lora_dropout | 0.05 | 0–0.1 | Regularizes the adapter path |
| Learning rate | 2e-4 | 1e-4–5e-4 | Step size; too low = flat loss, too high = unstable training |
| Epochs | 3 | 1–5 | More epochs raise overfitting risk |
| Batch size (per device) | 4 | 1–8 | Larger batches use more VRAM |
How Do You Evaluate Fine-Tuned Models?
Training loss: Should decrease over epochs. If flat, learning rate may be too low.
Validation loss: Should decrease but stay above training loss (normal). If increases, overfitting.
Manual testing: Run the fine-tuned model on test examples and compare outputs to expected results.
Benchmark tasks: Use standard benchmarks (MMLU, HumanEval) to measure improvement.
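Manual testing can be scripted. The sketch below scores predictions against expected outputs with normalized exact match — a deliberately crude metric that assumes each task has one correct answer; free-form generations need fuzzier scoring (e.g., semantic similarity or an LLM judge).

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences don't count."""
    return " ".join(text.lower().split())

def exact_match_rate(predictions, references):
    """Fraction of predictions that exactly match their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be the same length")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

# First pair matches after normalization, second does not
print(exact_match_rate(["Bonjour  le monde", "salut"],
                       ["bonjour le monde", "bonjour"]))  # 0.5
```

Run the same scorer on the base model and the fine-tuned model over a held-out test set; the delta between the two rates is the honest measure of what fine-tuning bought you.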
Common Fine-Tuning Mistakes
- Too few training examples. <200 examples often leads to overfitting. Collect at least 500.
- Training for too many epochs. Model memorizes data instead of learning generalizable patterns. Stop at 3–5 epochs max.
- Not validating on unseen data. Always split data into train/validation (80/20). Validate frequently to catch overfitting.
- Using the same data for fine-tuning and evaluation. Reported accuracy is meaningless if evaluated on training data.
- Not saving checkpoints. Training can take hours. Save every epoch so you can recover from crashes.
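The 80/20 split mentioned above needs nothing beyond the standard library; a seeded shuffle keeps the validation set stable across runs (a minimal sketch):

```python
import random

def train_val_split(examples, val_frac=0.2, seed=42):
    """Shuffle deterministically, then hold out val_frac of examples for validation."""
    examples = list(examples)
    rng = random.Random(seed)
    rng.shuffle(examples)
    n_val = max(1, int(len(examples) * val_frac))
    return examples[n_val:], examples[:n_val]

train, val = train_val_split(range(1000))
print(len(train), len(val))  # 800 200
```

Because the seed is fixed, every training run evaluates against the same validation examples, so loss curves from different runs are directly comparable.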
Common Questions About LoRA Fine-Tuning
How much training data is needed?
Minimum 500 examples, optimal 1000–5000. Quality matters more than quantity. 100 high-quality examples > 1000 low-quality examples.
Can I fine-tune on a laptop?
Yes. Use 4-bit quantization and LoRA. A 7B model fits in about 8 GB of VRAM; a small fine-tuning run takes roughly 10–15 minutes on a GPU, while CPU-only training is possible but far slower.
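The 8 GB figure can be sanity-checked with rough arithmetic (estimates only; actual usage also depends on sequence length, batch size, and framework overhead):

```python
params = 7e9                              # 7B base model
weights_gib = params * 0.5 / 2**30        # 4-bit quantization = 0.5 bytes/weight

# Rough upper bound: assume ~1% of params are trainable LoRA weights, and Adam
# keeps ~16 bytes per trainable param (fp32 copy + gradient + two moments).
trainable = params * 0.01
optimizer_gib = trainable * 16 / 2**30

print(round(weights_gib, 1), round(optimizer_gib, 1))  # 3.3 1.0
# Activations, KV cache, and CUDA overhead consume the rest of the ~8 GB budget.
```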
How do I merge LoRA adapters into the base model?
Use unsloth's save utilities or PEFT's `model.merge_and_unload()`. This folds the adapter into the base weights, producing a single set of model files ready for inference (~3–4 GB for a 4-bit-quantized 7B model).
Can I combine multiple LoRA adapters?
Yes, with restrictions. Adapters can be applied sequentially (stacked), or combined with weighted merging (e.g., PEFT's `add_weighted_adapter`).
Is fine-tuned model quality better than RAG?
For most tasks, yes. Fine-tuned models understand domain concepts deeply. RAG is better when documents are large and change frequently.
Sources
- LoRA Paper (Hu et al.) — arxiv.org/abs/2106.09685
- Unsloth GitHub — github.com/unslothai/unsloth
- HuggingFace TRL — github.com/huggingface/trl
- Axolotl — github.com/OpenAccess-AI-Collective/axolotl