Key Takeaways
- Fine-tuning (recommended): 8 GB VRAM, 500+ training examples, 1–4 hours. Cost: $100–500.
- Pre-training: 8+ GPUs, 100B+ tokens, weeks of training. Cost: $50k–500k.
- Most organizations should fine-tune, not pre-train. Diminishing returns for custom pre-training.
- Best approach: Start with fine-tuning on your domain data, then evaluate if pre-training is justified.
- As of April 2026, pre-training is rarely justified unless you need a proprietary model.
Fine-Tuning vs Pre-Training
| Aspect | Fine-Tuning | Pre-Training |
|---|---|---|
| Training time | 1–4 hours | Weeks–months |
| VRAM required | 8 GB | 100+ GB (multi-GPU) |
| Data required | 500–5k examples | 100B+ tokens |
| Cost | $100–500 | $50k–500k |
| Customization | Domain knowledge | Proprietary model |
| When to use | 99% of cases | Rare, specialized needs |
Fine-Tuning Path (Recommended)
1. Collect 500–5,000 domain-specific examples (high quality matters more than volume).
2. Choose a base model (Llama 3.1 8B, Qwen 7B, etc.).
3. Use LoRA for efficient training (4× faster, comparable quality).
4. Train for 3–5 epochs on a GPU.
5. Evaluate on a held-out test set (precision, recall, custom metrics).
6. Merge the LoRA adapter into the base model.
7. Deploy as the production model.
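The efficiency claim in step 3 and the merge in step 6 follow from LoRA's low-rank formulation: instead of updating a full weight matrix W, you train two small matrices B and A and later fold W + B·A back into the base model. A minimal parameter-count sketch (pure Python; the dimensions are illustrative, not a real model config):

```python
# LoRA sketch: rather than updating the full weight W (d_out x d_in),
# train B (d_out x r) and A (r x d_in) with rank r much smaller than d.
# Merging (step 6) is a plain matrix add, W_merged = W + B @ A, so the
# deployed model has the same inference cost as the unmodified base model.
d_out, d_in, r = 512, 512, 8  # illustrative sizes

full_params = d_out * d_in          # params updated by full fine-tuning
lora_params = d_out * r + r * d_in  # params updated by LoRA

print(f"full fine-tune: {full_params:,} params per layer")
print(f"LoRA (r={r}):   {lora_params:,} params per layer "
      f"({lora_params / full_params:.1%} of full)")
```

At rank 8 on a 512×512 layer, LoRA trains about 3% of the parameters, which is where the memory and speed savings come from.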
Pre-Training: When and Why
Pre-training means training a model from scratch on raw data (books, documents, code). It is only justified if:
1. You have >10 billion tokens of unique, valuable data.
2. Pre-trained models consistently fail on your domain.
3. Budget is >$50k (realistic cost).
4. You need a proprietary model (competitive advantage).
Example: A genomics company with 500GB of private research data might justify custom pre-training.
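The budget floor in point 3 follows from a back-of-envelope compute estimate using the common ~6·N·D FLOPs approximation for training (N parameters, D tokens, per the Chinchilla scaling-law analysis cited below). The GPU throughput, utilization, and hourly rate here are assumptions for illustration, not quotes:

```python
def pretrain_cost(n_params, n_tokens, peak_flops=312e12, mfu=0.4,
                  usd_per_gpu_hour=2.0):
    """Rough pre-training cost via the ~6 * N * D FLOPs estimate.

    peak_flops: assumed per-GPU peak (e.g. an A100-class GPU in bf16);
    mfu: assumed model FLOPs utilization; both are illustrative.
    """
    total_flops = 6 * n_params * n_tokens
    gpu_hours = total_flops / (peak_flops * mfu) / 3600
    return gpu_hours, gpu_hours * usd_per_gpu_hour

# 7B-parameter model on 140B tokens (roughly the Chinchilla-optimal
# ~20 tokens per parameter)
hours, usd = pretrain_cost(7e9, 140e9)
print(f"~{hours:,.0f} GPU-hours, ~${usd:,.0f} in raw GPU time")
```

Even this single clean run lands in the tens of thousands of dollars of raw GPU time; failed runs, data pipelines, and engineering push realistic budgets into the $50k–500k range above.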
Domain Adaptation Strategies
Without full pre-training, improve model performance on your domain:
- Continued pre-training: Take a base model and keep training it on your domain data (10B+ tokens). Much cheaper than pre-training from scratch.
- LoRA fine-tuning: Most practical. Tune on 500+ examples.
- Prompt engineering: Craft good prompts. Free, but limited.
- RAG: Retrieve documents, provide context. Works without retraining.
- Ensemble: Combine multiple models.
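Of the strategies above, RAG is the cheapest to prototype: retrieve relevant documents, then pack them into the prompt. A toy sketch, with word-overlap retrieval standing in for a real embedding index (all names and documents are illustrative):

```python
def retrieve(query, docs, k=2):
    # Toy lexical retriever: rank documents by word overlap with the query.
    # A production system would use an embedding index instead.
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                  reverse=True)[:k]

def build_prompt(query, docs):
    # Prepend retrieved context so the base model can answer
    # domain questions without any retraining.
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Gene BRCA1 variants are linked to breast cancer risk.",
    "Quarterly revenue grew 12% year over year.",
    "CRISPR enables targeted gene editing in mammalian cells.",
]
print(build_prompt("Which gene is linked to breast cancer risk?", docs))
```

The prompt-construction step is the same regardless of retriever quality, which is why teams often start with RAG and only move to fine-tuning when retrieval plus prompting plateaus.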
Evaluation Metrics
Measure model quality:
- Task-specific metrics: Accuracy, F1 score, BLEU (for text generation).
- Benchmark tests: Run on standard benchmarks (MMLU, HumanEval).
- Human evaluation: Manual scoring (time-consuming but accurate).
- Business metrics: Does model improve actual business outcomes?
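Accuracy and F1 from the task-specific bullet are simple enough to compute without a library. A minimal sketch for binary labels:

```python
def accuracy(preds, labels):
    # Fraction of predictions that match the ground-truth labels.
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def f1_score(preds, labels, positive=1):
    # F1 is the harmonic mean of precision and recall for the positive class.
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

preds  = [1, 1, 0, 1, 0]
labels = [1, 0, 0, 1, 1]
print(f"accuracy={accuracy(preds, labels):.2f}, "
      f"F1={f1_score(preds, labels):.2f}")
# accuracy=0.60, F1=0.67
```

Compute these on held-out data only; metrics on the training set say little about generalization.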
Common Mistakes
- Pre-training without sufficient data. <10B tokens is wasted compute. Fine-tune instead.
- Not evaluating properly. Training loss alone is misleading; test on unseen data.
- Expecting custom model to match GPT-4. Gap between open models and frontier models is large.
- Ignoring inference costs. Larger custom models = higher inference costs. Consider trade-off.
Sources
- Chinchilla Scaling Laws — arxiv.org/abs/2203.15556
- Instruction Tuning Survey — arxiv.org/abs/2308.10792