LoRA Fine-Tuning for Local LLMs 2026: Unsloth Tutorial on 8 GB VRAM with Llama 3.1

13 min read · By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool

Fine-tuning adapts a pre-trained model to your domain. LoRA (Low-Rank Adaptation) does this by adding small adapter layers (about 0.4% of total weights) instead of retraining the entire model. A Llama 3.1 8B fine-tune requires 8 GB VRAM and 1–2 hours on consumer hardware using Unsloth (4× faster than standard training). As of April 2026, LoRA and QLoRA (4-bit quantized LoRA) are production-ready across Ollama, LM Studio, and vLLM.

Key Takeaways

  • LoRA = add small trainable layers to a pre-trained model. Only a small fraction of the weights is trainable (about 0.4% in the examples in this article), dramatically reducing VRAM and time.
  • Fine-tuning requirements: 500-1000 high-quality examples, 8-16 GB VRAM, 1-4 hours training time.
  • Best tools: Unsloth (fastest), Hugging Face TRL, Axolotl (most flexible).
  • LoRA rank (r): Lower (r=8) is smaller, faster; higher (r=64) is more expressive. Default: r=16-32.
  • As of April 2026, LoRA is production-ready and widely supported across inference engines.

How Does LoRA Work?

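LoRA adds small "adapter" matrices alongside the original model weights. During training, only the adapters are updated; the original weights stay frozen.

Example: A 13B model has 13 billion weights. LoRA adds only 50 million trainable parameters (~0.4% of the original). That is roughly 260× fewer parameters to update than in full fine-tuning, which sharply cuts gradient and optimizer memory and shortens training.

At inference, each adapter's output (the input multiplied by two small low-rank matrices) is added to the output of the frozen layer it sits beside. The speed penalty is minimal (~5%), and it disappears entirely if the adapter is merged into the base weights.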

Result: A domain-specific model that performs better on your tasks, using only 8 GB VRAM instead of 26 GB.
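
To make the adapter mechanics concrete, here is a minimal NumPy sketch of a single LoRA-augmented linear layer. The function names, the toy layer size, and the zero initialization of `B` are illustrative assumptions rather than code from any particular library; real frameworks apply the same idea to many layers at once.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Frozen base output plus the scaled low-rank adapter output."""
    r = A.shape[1]                           # LoRA rank
    return x @ W + (alpha / r) * ((x @ A) @ B)

def merge_lora(W, A, B, alpha):
    """Fold the adapter into the base weight so inference needs one matmul."""
    r = A.shape[1]
    return W + (alpha / r) * (A @ B)

def trainable_fraction(d_in, d_out, r):
    """Adapter parameters divided by the frozen layer's parameters."""
    return r * (d_in + d_out) / (d_in * d_out)

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 8, 16        # toy sizes so the demo runs instantly

W = rng.normal(size=(d_in, d_out))           # frozen base weight
A = rng.normal(size=(d_in, r)) * 0.01        # trainable, small random init
B = np.zeros((r, d_out))                     # trainable, zero init: adapter starts as a no-op
x = rng.normal(size=(4, d_in))

y_start = lora_forward(x, W, A, B, alpha)    # identical to the base model's output

B_trained = rng.normal(size=(r, d_out))      # stand-in for values learned in training
y_adapter = lora_forward(x, W, A, B_trained, alpha)
y_merged = x @ merge_lora(W, A, B_trained, alpha)

print(np.allclose(y_adapter, y_merged))              # True
print(f"{trainable_fraction(4096, 4096, 16):.2%}")   # 0.78%
```

The last line shows why the trainable share stays so small: a rank-16 adapter on a 4096×4096 layer adds under 1% of that layer's parameters, and layers without adapters add nothing.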

LoRA adds small trainable adapter matrices alongside frozen base model weights. Only 0.4% of the 13B Llama model's parameters are updated during training, which sharply reduces VRAM use and training time.

What Is QLoRA (4-Bit Quantized LoRA)?

QLoRA combines LoRA with 4-bit quantization: the base model loads in 4-bit while only the adapter trains in 16-bit. This roughly halves VRAM requirements compared with 16-bit LoRA (see the table in this section).

As of April 2026, QLoRA is the default for consumer hardware. The `load_in_4bit=True` flag in the Unsloth setup code later in this article enables QLoRA automatically. The ~2% quality difference vs. full LoRA is negligible for most domain-adaptation tasks.

When to use LoRA (16-bit) over QLoRA (4-bit):

  • Tasks requiring maximum precision (medical, legal contract analysis)
  • You have 16+ GB VRAM available
  • Fine-tuning 3B or smaller models (QLoRA savings are minimal at small sizes)

| Method | 7B Model VRAM | 13B Model VRAM | Quality vs Full |
| --- | --- | --- | --- |
| Full fine-tuning | 28 GB | 52 GB | 100% (baseline) |
| LoRA (16-bit base) | 16 GB | 30 GB | ~97% |
| QLoRA (4-bit base) | 8 GB | 14 GB | ~95% |
VRAM requirements by fine-tuning method for 7B and 13B models. Full fine-tuning requires 28 GB for 7B; QLoRA reduces this to 8 GB. For enterprises, QLoRA enables fine-tuning 70B models on dual RTX 4090s (~40 GB required, 48 GB available).
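
As a rough cross-check on these figures, the memory taken by the base weights alone follows directly from parameter count and bit width. The sketch below is back-of-the-envelope arithmetic under stated assumptions: `weight_memory_gb` is a hypothetical helper, 1 GB is taken as 1e9 bytes, and the small overhead of quantization constants is ignored.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

for name, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    int4 = weight_memory_gb(n_params, 4)
    print(f"{name}: 16-bit weights {fp16:.1f} GB, 4-bit weights {int4:.1f} GB")

# 7B: 16-bit weights 14.0 GB, 4-bit weights 3.5 GB
# 13B: 16-bit weights 26.0 GB, 4-bit weights 6.5 GB
# 70B: 16-bit weights 140.0 GB, 4-bit weights 35.0 GB
```

The gap between these numbers and the totals in the table (for example, 3.5 GB of 4-bit weights versus 8 GB for a 7B QLoRA run) is taken up by adapter weights, gradients, optimizer state, and activations, which grow with batch size and sequence length.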

Should You Fine-Tune or Use RAG?

Before investing in LoRA fine-tuning, verify that better prompting cannot solve the problem first: prompt engineering is faster, reversible, and model-agnostic. For the full decision framework, see prompt engineering vs fine-tuning: how to decide.

Fine-tuning is one path to keeping a coding workflow productive offline. For the broader offline setup (model, IDE, package cache, docs mirror), see Local Coding LLM Without Internet.

Decision matrix:

| Criteria | Fine-Tuning | RAG |
| --- | --- | --- |
| Document change frequency | Annual or less | Weekly or more |
| Knowledge requirements | Model needs deep understanding | Retrieval suffices |
| Training data available | Need 500+ high-quality examples | Any documents work |
| Cost (long-term) | One-time ($50-200) | Continuous embeddings |
| Latency | Faster (no retrieval) | Slower (retrieval + LLM) |
| Best for | Code, creative writing, domain style | Knowledge bases, Q&A |

How Do You Prepare Training Data?

Quality training data determines fine-tuning success. Poor data = poor model.

Minimum: 500 examples. Each example = input + expected output.

Optimal: 1000-5000 examples. More data = better accuracy.

Format: JSON or JSONL. In JSONL, each line is one training example:

```json
{"instruction": "Translate to French", "input": "Hello world", "output": "Bonjour le monde"}
{"instruction": "Summarize", "input": "Long text...", "output": "Summary..."}
{"instruction": "Code review", "input": "Python code...", "output": "Review comments..."}
```

Or, with each example pre-rendered into a single `text` field (the field the Unsloth training code in this article reads):

```json
{"text": "<|user|>Translate to French\nHello<|assistant|>Bonjour"}
{"text": "<|user|>Summarize\nText<|assistant|>Summary"}
```
Training data preparation workflow: collect 500+ domain-specific instruction/output pairs, format as JSONL (one per line), and load into SFTTrainer. Quality matters more than quantity: 100 high-quality examples outperform 1000 low-quality ones.
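
The conversion from instruction/input/output records to the single-field `text` format can be scripted. The following is a minimal sketch using only the standard library; the helper names (`to_text`, `to_jsonl`, `parse_jsonl`) are hypothetical, and the `<|user|>` / `<|assistant|>` markers simply mirror the example above rather than any specific model's official chat template.

```python
import json

REQUIRED_KEYS = ("instruction", "input", "output")

def to_text(example):
    """Render one instruction record into the single-string `text` format."""
    missing = [k for k in REQUIRED_KEYS if k not in example]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return (
        f"<|user|>{example['instruction']}\n{example['input']}"
        f"<|assistant|>{example['output']}"
    )

def to_jsonl(examples):
    """Serialize records as JSONL: one {"text": ...} object per line."""
    return "\n".join(
        json.dumps({"text": to_text(ex)}, ensure_ascii=False) for ex in examples
    )

def parse_jsonl(blob):
    """Parse JSONL back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in blob.split("\n") if line.strip()]

examples = [
    {"instruction": "Translate to French", "input": "Hello", "output": "Bonjour"},
    {"instruction": "Summarize", "input": "Text", "output": "Summary"},
]
jsonl = to_jsonl(examples)
print(jsonl)
```

Writing `jsonl` to a file such as `training.jsonl` (with UTF-8 encoding) gives a dataset in the one-example-per-line layout described above. Rejecting records with missing keys early is a cheap way to catch malformed examples before a multi-hour training run.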

Fine-Tuning Setup With Unsloth

Unsloth is the fastest LoRA framework (4× the speed of standard training):

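```python
# Install first, in a shell (not inside Python):
#   pip install "unsloth[colab-new]" xformers bitsandbytes

# Import unsloth before trl/transformers so its patches are applied
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load the 4-bit quantized base model (QLoRA)
# Repo ID as published by Unsloth; verify it on the Hugging Face Hub
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; the LoRA settings go here, not in from_pretrained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Load training data (JSONL with one "text" field per line)
dataset = load_dataset("json", data_files="training.jsonl")

# Configure trainer
# Note: newer TRL releases move dataset_text_field and max_seq_length
# into SFTConfig; adjust to match your installed version
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        output_dir="output",
    ),
)

# Train
trainer.train()
```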

Key Hyperparameters for LoRA Fine-Tuning

| Hyperparameter | Recommended Value | Typical Range | Effect |
| --- | --- | --- | --- |
| learning_rate | 2e-4 | 1e-5 to 1e-3 | Lower = stable, slower convergence |
| lora_r (rank) | 16 | 4 to 64 | Higher = more expressive, slower |
| lora_alpha | 32 | 8 to 256 | Higher = stronger LoRA effect |
| num_train_epochs | 3 | 1 to 10 | More epochs = risk of overfitting |
| batch_size | 4 | 1 to 32 | Larger = faster training, more VRAM |
| warmup_steps | 100 | 0 to 1000 | Gradual LR increase, stabilizes training |
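
To see how these values interact, the sketch below works through the arithmetic for the recommended settings. It assumes layer shapes taken from the published Llama 3.1 8B configuration, adapters on all seven attention and MLP projections, a single GPU, and no gradient accumulation; the helper names are hypothetical.

```python
import math

# Layer shapes assumed from the published Llama 3.1 8B config
HIDDEN, INTERMEDIATE, KV_DIM, N_LAYERS = 4096, 14336, 1024, 32

# (d_in, d_out) for each projection that receives an adapter
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_trainable_params(rank):
    """Each adapter adds rank * (d_in + d_out) parameters per projection."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * N_LAYERS

def lora_scaling(rank, alpha):
    """The adapter output is multiplied by alpha / rank."""
    return alpha / rank

def optimizer_steps(n_examples, batch_size, epochs):
    """Total optimizer steps, assuming no gradient accumulation."""
    return math.ceil(n_examples / batch_size) * epochs

print(lora_trainable_params(16))       # 41943040
print(lora_trainable_params(64))       # 167772160
print(lora_scaling(16, 32))            # 2.0
print(optimizer_steps(1000, 4, 3))     # 750
```

At rank 16 the adapters hold about 42 million parameters, roughly half a percent of the model's ~8 billion weights. With 1,000 examples, batch size 4, and 3 epochs, training runs for 750 steps, so 100 warmup steps cover a little over an eighth of the run; on much smaller datasets a lower warmup value may be appropriate.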

How Do You Evaluate Fine-Tuned Models?

Training loss: Should decrease over epochs. If it stays flat, the learning rate may be too low.

Validation loss: Should decrease but stay above training loss (normal). If it starts to increase while training loss keeps falling, the model is overfitting.

Manual testing: Run the fine-tuned model on test examples and compare outputs to expected results.

Benchmark tasks: Use standard benchmarks (MMLU, HumanEval) to measure improvement.
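
The loss-curve checks above can be automated. This is a minimal sketch, assuming you have collected one training and one validation loss per epoch; the function names, the `patience` parameter, and the loss values are made up for illustration and are not results from a real run.

```python
def best_epoch(val_losses):
    """Index (0-based) of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)

def is_overfitting(train_losses, val_losses, patience=2):
    """True if validation loss rose for `patience` consecutive epochs
    while training loss kept falling over the same epochs."""
    if len(val_losses) <= patience:
        return False
    recent_val = val_losses[-(patience + 1):]
    recent_train = train_losses[-(patience + 1):]
    val_rising = all(b > a for a, b in zip(recent_val, recent_val[1:]))
    train_falling = all(b < a for a, b in zip(recent_train, recent_train[1:]))
    return val_rising and train_falling

# Hypothetical per-epoch losses, for illustration only
train_losses = [1.90, 1.40, 1.10, 0.85, 0.60]
val_losses = [1.95, 1.55, 1.40, 1.48, 1.60]

print(best_epoch(val_losses))                     # 2
print(is_overfitting(train_losses, val_losses))   # True
```

In this made-up run, the checkpoint from the third epoch (index 2) would be the one to keep, since validation loss climbs afterwards even though training loss continues to drop.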

What Are the Most Common Fine-Tuning Mistakes?

  • Too few training examples. Fewer than 200 examples often lead to overfitting. Collect at least 500.
  • Training for too many epochs. The model memorizes the data instead of learning generalizable patterns. Stop at 3-5 epochs max.
  • Not validating on unseen data. Always split data into train/validation (80/20). Validate frequently to catch overfitting.
  • Using the same data for fine-tuning and evaluation. Reported accuracy is meaningless if evaluated on training data.
  • Not saving checkpoints. Training can take hours. Save every epoch so you can recover from crashes.
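
The 80/20 split recommended above takes only a few lines. This sketch uses the standard library so it runs anywhere; `train_val_split` is a hypothetical helper and the 500 placeholder records stand in for real training data.

```python
import random

def train_val_split(examples, val_fraction=0.2, seed=42):
    """Shuffle a copy of the data and hold out `val_fraction` for validation."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)      # seeded, so the split is reproducible
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# Placeholder records standing in for real training examples
examples = [{"text": f"example {i}"} for i in range(500)]
train_set, val_set = train_val_split(examples)

print(len(train_set), len(val_set))   # 400 100
```

If the data is already loaded with Hugging Face `datasets`, its `train_test_split(test_size=0.2, seed=42)` method should give an equivalent split, and setting `save_strategy="epoch"` in `TrainingArguments` covers the per-epoch checkpoint advice. Fixing the seed matters: a split that changes between runs can leak validation examples into training when you resume from a checkpoint.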

Common Questions About LoRA Fine-Tuning

How much training data is needed?

Minimum 500 examples, optimal 1000-5000. Quality matters more than quantity. 100 high-quality examples > 1000 low-quality examples.

Can I fine-tune on a laptop?

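Yes. Use 4-bit quantization and LoRA. A 7B model requires 8 GB VRAM. Training time depends on dataset size and hardware: anywhere from 10-15 minutes for a small dataset to 1-2 hours on a laptop GPU. CPU-only training is possible but much slower.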

How do I merge LoRA adapters into the base model?

Use Unsloth or Hugging Face PEFT: `model.merge_and_unload()`. This creates a single standalone model, ready for inference (~14 GB for 7B at 16-bit; ~3-4 GB after 4-bit quantization).

Can I combine multiple LoRA adapters?

Yes, with restrictions. Stack adapters for sequential application, or merge them with a weighted combination (e.g., PEFT's `add_weighted_adapter`).

Is fine-tuned model quality better than RAG?

For tasks that hinge on domain style, format, or behavior, usually yes: fine-tuned models internalize domain concepts deeply. RAG is better when documents are large and change frequently.

What is the difference between LoRA and QLoRA?

LoRA loads the base model in 16-bit and trains small adapter layers. QLoRA loads the base model in 4-bit and trains adapters in 16-bit. QLoRA uses roughly half the VRAM: 8 GB for 7B vs 16 GB for LoRA. The quality difference is ~2%, which is negligible for most tasks. Unsloth enables QLoRA with `load_in_4bit=True`.

How do I use a LoRA fine-tuned model in Ollama?

After training, merge the adapter into the base model: `model.merge_and_unload()`. Convert to GGUF using llama.cpp's `convert_hf_to_gguf.py` (named `convert.py` in older llama.cpp checkouts). Create an Ollama Modelfile pointing to the GGUF file: `FROM ./my-finetuned-model.gguf`. Then run `ollama create my-model -f Modelfile` and `ollama run my-model`. The fine-tuned model runs identically to any Ollama model.
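
The export steps can be collected into a small script. The sketch below only builds the Modelfile text and the command lists; it does not run anything. The `./merged-model` directory name and both helper functions are hypothetical, and the converter script name depends on your llama.cpp version, so treat it as an assumption to check against your checkout.

```python
def build_modelfile(gguf_path):
    """Minimal Ollama Modelfile that points at a local GGUF file."""
    return f"FROM {gguf_path}\n"

def build_commands(merged_dir, gguf_path, model_name, modelfile_path="Modelfile"):
    """Commands to run in order, as argument lists suitable for subprocess.run."""
    return [
        # Run from the llama.cpp directory; script name varies by version
        ["python", "convert_hf_to_gguf.py", merged_dir, "--outfile", gguf_path],
        ["ollama", "create", model_name, "-f", modelfile_path],
        ["ollama", "run", model_name],
    ]

modelfile = build_modelfile("./my-finetuned-model.gguf")
commands = build_commands("./merged-model", "./my-finetuned-model.gguf", "my-model")

print(modelfile, end="")
for cmd in commands:
    print(" ".join(cmd))
```

Keeping the commands as argument lists makes it straightforward to pass each one to `subprocess.run(cmd, check=True)` once the paths have been verified, so a failed conversion stops the pipeline before `ollama create` runs against a missing file.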

Can I fine-tune Llama 3.3 70B with LoRA on consumer hardware?

Yes, with QLoRA. Llama 3.3 70B at 4-bit requires ~40 GB VRAM, which fits on dual RTX 4090 (2×24 GB) or a single A100 80GB. Training time: 4–8 hours on 1,000 examples. For most users, fine-tuning 7B or 13B models is more practical and yields 90%+ of the 70B quality gain for domain tasks.

Sources

  • Hu, E. et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." https://arxiv.org/abs/2106.09685 - Original LoRA paper demonstrating 0.4% trainable parameters matching full fine-tuning quality.
  • Dettmers, T. et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." https://arxiv.org/abs/2305.14314 - QLoRA paper: 4-bit quantized base model + 16-bit LoRA adapters halving VRAM requirements.
  • Unsloth. (2026). "Unsloth: 4× Faster LoRA Training." https://github.com/unslothai/unsloth - Fastest LoRA framework, supports Llama 3.x, Qwen2.5, Mistral with 4× training speedup.
  • Hugging Face. (2025). "TRL: Transformer Reinforcement Learning." https://github.com/huggingface/trl - SFTTrainer for supervised fine-tuning with LoRA adapter support.

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider's official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.
