Best Practices

Fine-tuning and LoRA/PEFT

Adapting pre-trained LLMs to specific domains — full fine-tuning, LoRA, QLoRA, and parameter-efficient methods with practical examples

Published: 2026-04-06 · Last updated: 2026-04-13

Fine-tuning and LoRA/PEFT

Pre-trained LLMs are remarkably capable out of the box, but domain-specific applications often require adaptation. Fine-tuning adjusts a model's weights on your data to improve performance on specific tasks or domains.

When to Fine-tune vs. Prompt

Criterion	Prompt Engineering	Fine-tuning
Data available	No training data needed	100–10,000+ examples
Cost	Pay per API call	Upfront training cost
Latency	Same as base model	Same or faster (smaller model possible)
Quality ceiling	Limited by base model	Can exceed base model on target task
Maintenance	Easy to update prompts	Requires retraining pipeline
Data privacy	Data sent to API	Train on-prem, data stays local

Rule of thumb: Start with prompt engineering. Fine-tune when:

Prompts can't achieve target quality consistently
You need to reduce inference costs (fine-tune a smaller model)
You have domain-specific language/style requirements
Data privacy prevents API usage

Full Fine-tuning

Training all model weights on domain-specific data:

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

# Prepare dataset
train_data = [
    {"input": "What is our company's refund policy?", 
     "output": "Customers can request a refund within 30 days..."},
    # ... more examples
]

# Full fine-tuning trains ALL parameters
# For a 3B model: 3 billion parameters → needs significant GPU memory

Hardware requirements:

Model Size	GPUs Needed (A100 80GB)	Training Time (1K examples)
3B	1	~2 hours
8B	2-4	~4 hours
70B	8-16	~24 hours

Parameter-Efficient Fine-tuning (PEFT)

Instead of updating all parameters, PEFT methods train a small number of additional parameters while freezing the base model.

LoRA (Low-Rank Adaptation)

LoRA injects trainable low-rank matrices into attention layers:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

# LoRA configuration
lora_config = LoraConfig(
    r=16,                    # Rank of update matrices
    lora_alpha=32,           # Scaling factor
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Where to inject
    task_type="CAUSAL_LM"
)

# The model now has only ~0.1% trainable parameters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# "trainable params: 4,194,304 || all params: 3,212,756,992 || trainable%: 0.13%"

How LoRA works:

Original:    W @ x
With LoRA:   W @ x + (B @ A) @ x

Where:
  W: frozen weight matrix (d_model × d_model)
  A: trainable down-projection (r × d_model), r << d_model
  B: trainable up-projection (d_model × r)
  
For r=16, d_model=4096:
  Original: 4096 × 4096 = 16.7M params
  LoRA: 2 × (16 × 4096) = 131K params per matrix (127× fewer!)

QLoRA (Quantized LoRA)

Combines LoRA with 4-bit quantization for even lower memory:

from transformers import BitsAndBytesConfig
from peft import LoraConfig, prepare_model_for_kbit_training

# 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",       # Normal Float 4-bit
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,  # Double quantization
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=bnb_config,
    device_map="auto"
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

Memory comparison for 70B model:

Method	GPU Memory	Quality
Full fine-tuning (FP16)	~280 GB	Best
LoRA (FP16)	~140 GB	Near-full
QLoRA (4-bit)	~48 GB	~95-99% of full

Training Data Format

Instruction Format (Alpaca-style)

[
  {
    "instruction": "Explain the law of supply and demand",
    "input": "",
    "output": "The law of supply and demand states that..."
  },
  {
    "instruction": "Translate to Spanish",
    "input": "Good morning, how can I help you?",
    "output": "Buenos días, ¿en qué puedo ayudarle?"
  }
]

Chat Format (Llama 3)

[
  {
    "messages": [
      {"role": "system", "content": "You are a customer support agent."},
      {"role": "user", "content": "I need to return an item"},
      {"role": "assistant", "content": "I'd be happy to help with your return..."}
    ]
  }
]

Training with Hugging Face Transformers

from transformers import TrainingArguments, Trainer
from datasets import load_dataset

dataset = load_dataset("json", data_files="training_data.json")

training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,            # Higher LR for LoRA than full fine-tuning
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    optim="paged_adamw_8bit",      # Memory-efficient optimizer
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
)

trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset.get("validation"),
    args=training_args,
)

trainer.train()
model.save_pretrained("./fine-tuned-model")

Merging LoRA Adapters

After training, you can merge the adapter with the base model for deployment:

from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
lora_model = PeftModel.from_pretrained(base_model, "./fine-tuned-model")

# Merge adapter weights into base model
merged = lora_model.merge_and_unload()
merged.save_pretrained("./merged-model")

Common Fine-tuning Tasks

Task	Format	Example
Instruction following	Instruction → Response	Q&A, summarization
Style transfer	Input → Styled output	Formal writing, brand voice
Domain adaptation	Domain text → Domain text	Legal, medical, code
Tool use	Query → Tool call + result	API calling, code execution
Classification	Text → Label	Sentiment, intent, topic

Best Practices

Data quality > Data quantity: 500 clean examples often beat 10,000 noisy ones
Start small: Try LoRA on a 3-8B model before scaling up
Validate thoroughly: Test on held-out data AND real-world inputs
Avoid catastrophic forgetting: Mix some general data with domain data
Monitor for overfitting: Training loss should decrease; eval loss should too
Use QLoRA for large models: 4-bit LoRA makes 70B models trainable on consumer hardware

Key Takeaways

LoRA reduces trainable parameters by 100-1000× with near-full fine-tuning quality
QLoRA enables fine-tuning 70B models on a single 48GB GPU
Fine-tune when prompting alone can't achieve your quality targets
Data quality is the most important factor in fine-tuning success
Always validate on real-world inputs, not just held-out test sets

Model Training — How pre-training works
Evaluation Metrics — Measuring fine-tuning success
RAG Systems — Alternative to fine-tuning

Related docs

Code LLM Specialization

Code-specific LLM techniques — code tokenization, repository-level context, code fine-tuning, program synthesis evaluation, and code-specific RAG

LLM Fine-Tuning Data Preparation

How to prepare high-quality fine-tuning datasets — data collection, formatting, cleaning, augmentation, and quality validation pipelines

Fine-Tuning Architecture Patterns

Reference patterns, tradeoffs, and building blocks for designing fine-tuning systems.