Supervised Fine-tuning and Alignment
Transforming pre-trained models into helpful assistants — SFT, RLHF, DPO, and constitutional AI techniques
Published: 2026-04-09 · Last updated: 2026-04-13
A pre-trained LLM can generate fluent text but doesn't know how to follow instructions, be helpful, or avoid harmful outputs. The alignment pipeline transforms a raw pre-trained model into a useful, safe assistant.
The Alignment Pipeline
Pre-trained Model (base LM)
│
▼
Supervised Fine-Tuning (SFT)
│ Trained on instruction-response pairs
▼
Instruction-Following Model
│
▼
Preference Alignment (RLHF / DPO)
│ Trained on human/AI preference data
▼
Aligned Assistant
│
▼
Safety Filters (optional additional layer)
Step 1: Supervised Fine-Tuning (SFT)
SFT teaches the model to follow instructions by training on high-quality examples:
# SFT training data format
sft_examples = [
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain photosynthesis in simple terms."},
{"role": "assistant", "content": "Photosynthesis is the process where plants..."}
]
},
# ... thousands more examples
]
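In practice, each `messages` list is rendered into one token sequence, and the loss is commonly computed only on the assistant's tokens. The sketch below illustrates that masking idea under loud assumptions: the `<|role|>` and `<|end|>` markers and the whitespace "tokenizer" are hypothetical stand-ins, not any real model's chat template.

```python
# Hypothetical template and whitespace tokenizer, for illustration only.
def render_with_mask(messages):
    """Return (tokens, mask); mask[i] is 1 where the training loss is computed."""
    tokens, mask = [], []
    for message in messages:
        header = [f"<|{message['role']}|>"]
        body = message["content"].split() + ["<|end|>"]
        # Only assistant content (and its end marker) contributes to the loss.
        trainable = 1 if message["role"] == "assistant" else 0
        tokens += header + body
        mask += [0] * len(header) + [trainable] * len(body)
    return tokens, mask


example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain photosynthesis in simple terms."},
    {"role": "assistant", "content": "Plants turn sunlight into food."},
]
tokens, mask = render_with_mask(example)
supervised = [t for t, m in zip(tokens, mask) if m]
```

Here `supervised` holds only the assistant's words plus the end marker, so the model is never trained to reproduce the system or user text.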
Data sources:
| Source | Quality | Scale (examples) | Cost |
|---|---|---|---|
| Human-written instructions | Highest | 1K-10K | Very high |
| AI-generated + human-reviewed | High | 10K-100K | High |
| AI-generated + AI-filtered | Good | 100K-1M+ | Moderate |
| Distilled from frontier models | Variable | 100K-1M | API costs |
SFT training:
from transformers import TrainingArguments
from trl import SFTTrainer # SFTTrainer ships in TRL, not transformers
training_args = TrainingArguments(
output_dir="./sft-output",
num_train_epochs=3,
per_device_train_batch_size=4,
learning_rate=2e-5, # Much lower than pre-training
lr_scheduler_type="cosine",
warmup_ratio=0.05,
fp16=True,
)
trainer = SFTTrainer(
model=model,
train_dataset=sft_dataset,
args=training_args,
)
trainer.train()
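To make the `lr_scheduler_type="cosine"` and `warmup_ratio=0.05` settings concrete, here is a minimal sketch of a linear-warmup, cosine-decay schedule. It assumes decay all the way to zero and a floor-based warmup step count; a given library may round the warmup length differently, and `lr_at_step` is a name chosen for this example.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_ratio=0.05):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000
schedule = [lr_at_step(s, total) for s in range(total + 1)]
```

With 1,000 total steps, the rate climbs linearly for the first 50 steps, peaks at `2e-5`, and then decays smoothly for the rest of training.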
Step 2: Preference Alignment
After SFT, the model follows instructions but may not align with human preferences. This step teaches the model what humans consider "better" responses.
Reward Model Training
Train a separate model to score response quality:
# Preference data format
preference_pairs = [
{
"prompt": "How do I make pasta?",
"chosen": "Here's a step-by-step guide:\n1. Boil water...", # Preferred
"rejected": "Pasta is Italian. Italy is in Europe.", # Not preferred
}
]
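# Reward model training (sketch; model_name is assumed to point at the SFT checkpoint)
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=1 # Single scalar score per sequence
)
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)
def score(prompt, response):
    # Encode prompt and response as one sequence pair and return the scalar score
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    return reward_model(**inputs).logits.squeeze(-1)
# Train so that the chosen response scores higher than the rejected one
for pair in preference_pairs:
    chosen_score = score(pair["prompt"], pair["chosen"])
    rejected_score = score(pair["prompt"], pair["rejected"])
    # Pairwise (Bradley-Terry) loss: -log sigmoid(chosen - rejected)
    loss = -F.logsigmoid(chosen_score - rejected_score).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()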
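The pairwise loss is easier to reason about with plain numbers. The following sketch uses only the standard library to show how the loss and the implied preference probability depend on the score margin; the function names are chosen for this example.

```python
import math


def pairwise_loss(chosen_score, rejected_score):
    """-log(sigmoid(chosen - rejected)), written in a numerically stable form."""
    margin = chosen_score - rejected_score
    # log(1 + exp(-margin)) equals -log(sigmoid(margin))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(chosen_score, rejected_score):
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(chosen_score - rejected_score)))


tied = pairwise_loss(0.0, 0.0)      # no margin: loss is log(2)
correct = pairwise_loss(2.0, 0.0)   # chosen scored higher: small loss
inverted = pairwise_loss(0.0, 2.0)  # rejected scored higher: large loss
```

When the two scores are equal the model is indifferent and the loss is log 2; the loss shrinks as the chosen response pulls ahead and grows when the ranking is inverted.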
RLHF (Reinforcement Learning from Human Feedback)
Policy Model (SFT) → generates response → Reward Model → score
↓
PPO optimization ← advantage ← value function
# Conceptual RLHF with PPO (pseudocode; PPOTrainer arguments and the training loop differ across TRL versions)
from trl import PPOTrainer, PPOConfig
config = PPOConfig(
model_name="llama-3-8b-sft",
learning_rate=1e-6, # Very small LR
ppo_epochs=4,
)
ppo_trainer = PPOTrainer(
config=config,
model=policy_model,
ref_model=sft_model, # Reference model (frozen SFT)
reward_model=reward_model,
)
for batch in prompt_dataset: # PPO needs prompts only; responses are sampled from the policy
    # Generate responses with the current policy
    responses = ppo_trainer.generate(batch["prompt"])
    # Score with reward model
    rewards = reward_model(batch["prompt"], responses)
    # PPO update: maximize reward while staying close to reference
    ppo_trainer.step(batch["prompt"], responses, rewards)
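"Staying close to the reference" is usually implemented as a KL penalty subtracted from the reward model's score. The sketch below shows that shaping step at the sequence level, assuming per-token log-probabilities are already available for the sampled response; `kl_coef` and the function name are hypothetical, and real implementations often apply the penalty per token instead.

```python
def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, kl_coef=0.1):
    """Reward-model score minus a KL estimate between policy and reference."""
    # The summed per-token log-ratio on a sampled response is a
    # single-sample estimate of KL(policy || reference).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - kl_coef * kl_estimate


unchanged = kl_penalized_reward(1.0, [-1.5, -1.0], [-1.5, -1.0])
drifted = kl_penalized_reward(1.0, [-1.0, -0.5], [-1.5, -1.0])
```

A policy that has not moved from the reference keeps the full reward, while one that assigns its own samples much higher probability than the reference does pays a penalty proportional to the drift.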
Challenges with RLHF:
- Computationally expensive (keeps four models in memory: policy, reference, reward, value)
- Unstable training dynamics
- Reward hacking (model learns to game the reward model)
- Requires careful hyperparameter tuning
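A back-of-the-envelope estimate shows why the four-model setup is costly. This sketch assumes all models are the same size, frozen models hold bf16 weights only (2 bytes per parameter), and trainable models use mixed-precision Adam (about 16 bytes per parameter for weights, gradients, fp32 master weights, and two moment buffers). Activations, KV caches, and any sharding or parameter-efficient tricks are ignored, so treat the numbers as rough.

```python
def training_memory_gb(params_billion, trainable_models, frozen_models):
    """Rough weight and optimizer memory in GB (1 GB = 1e9 bytes)."""
    bytes_frozen = 2      # bf16 weights only
    bytes_trainable = 16  # bf16 weights + grads, fp32 master weights + Adam moments
    total_bytes = params_billion * 1e9 * (
        trainable_models * bytes_trainable + frozen_models * bytes_frozen
    )
    return total_bytes / 1e9


# RLHF with PPO: policy and value are trained; reference and reward are frozen.
rlhf_gb = training_memory_gb(8, trainable_models=2, frozen_models=2)
# Policy-plus-reference setups train one model and freeze one.
two_model_gb = training_memory_gb(8, trainable_models=1, frozen_models=1)
```

Under these assumptions an 8B-parameter setup needs roughly 288 GB for PPO versus 144 GB for a policy-plus-reference method, before counting activations.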
DPO: Direct Preference Optimization
DPO simplifies preference alignment by eliminating the separate reward model and RL loop:
# DPO directly optimizes the policy model on preference pairs
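# (exact arguments vary by TRL version; some versions also require passing the tokenizer)
from datasets import Dataset
from trl import DPOConfig, DPOTrainer
dpo_config = DPOConfig(
    output_dir="./dpo-output",
    beta=0.1, # Strength of the implicit KL constraint toward the reference model
    learning_rate=5e-7,
)
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=sft_model,
    args=dpo_config,
    train_dataset=Dataset.from_list(preference_pairs), # Columns: prompt, chosen, rejected
)
dpo_trainer.train()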
DPO loss function:
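import torch.nn.functional as F
def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """
    Direct Preference Optimization loss.
    Intuitively: increase probability of chosen, decrease probability of rejected,
    both measured relative to the frozen reference model.
    """
    # Log-ratios of policy to reference for each response (summed over tokens)
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio))
    return loss.mean()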
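A worked example with plain numbers helps show what the DPO objective rewards. The sketch below evaluates the loss for a single pair using the standard library only; the inputs are made-up summed sequence log-probabilities under the policy and the frozen reference, and `dpo_pair_loss` is a name chosen for this example.

```python
import math


def dpo_pair_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# At initialization the policy equals the reference, so the margin is zero.
start = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)
# Later: the chosen response became more likely, the rejected one less likely.
later = dpo_pair_loss(-10.0, -18.0, -12.0, -15.0)
```

The loss starts at log 2 (about 0.693) and falls to about 0.474 once the policy favors the chosen response more strongly than the reference does. Note that only movement relative to the reference counts; the raw gap between chosen and rejected log-probabilities does not.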
| Method | Components | Stability | Quality | Cost |
|---|---|---|---|---|
| RLHF | Policy + Ref + Reward + Value | Unstable | Best | Very high |
| DPO | Policy + Ref | Stable | Near-RLHF | Moderate |
| ORPO | Policy only | Most stable | Good | Low |
| KTO | Policy + Ref (unpaired good/bad labels) | Stable | Good | Low |
Constitutional AI (Claude's Approach)
Instead of relying on human labels for every judgment, Constitutional AI uses a written set of principles (a "constitution") that the model applies to critique and revise its own outputs:
# Phase 1: Self-critique
constitution = [
"Choose the response that is most helpful and honest.",
"Choose the response that avoids harmful or misleading content.",
"Choose the response that best respects the user's autonomy.",
]
# Generate critiques using the model itself
critique_prompt = """
Given the following response, identify any issues based on the principle:
"{principle}"
Response: {response}
What should be improved?
"""
# Phase 2: Revise based on critiques
revised_response = model.generate(
f"Original: {response}\nCritique: {critique}\nRevised:"
)
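# Phase 3: Fine-tune on the revised responses (same procedure as SFT).
# The published method then adds a preference stage in which the model,
# guided by the constitution, labels response pairs (RLAIF).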
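The critique and revision phases can be tied together in a single loop. This is a minimal sketch, not the published pipeline: `generate` is a hypothetical callable that maps a prompt string to a completion string (for example, a thin wrapper around a model), and the prompt wording simply reuses the templates shown above.

```python
def critique_and_revise(generate, prompt, constitution):
    """Run one critique-revision pass per principle; return an SFT-style record."""
    response = generate(prompt)
    for principle in constitution:
        # Ask the model to critique its own response against one principle.
        critique = generate(
            f'Identify any issues based on the principle: "{principle}"\n'
            f"Response: {response}\nWhat should be improved?"
        )
        # Ask the model to rewrite the response in light of the critique.
        response = generate(
            f"Original: {response}\nCritique: {critique}\nRevised:"
        )
    return {"prompt": prompt, "response": response}
```

Each principle costs two extra generations, so a three-principle constitution takes seven model calls per prompt. The returned record pairs the original prompt with the final revision, which is the form needed for the fine-tuning step.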
The "Alignment Tax"
Alignment can reduce raw model capability:
| Metric | Base Model | After SFT | After RLHF/DPO |
|---|---|---|---|
| MMLU (knowledge) | 75% | 73% | 70% |
| Human preference | 40% | 70% | 85% |
| Code generation | 65% | 68% | 63% |
| Safety compliance | 30% | 60% | 90% |
The "tax" occurs because alignment restricts the model's output distribution. Research into alignment without capability loss is ongoing.
Best Practices
- High-quality SFT data: 10K carefully curated examples often beat 1M noisy ones
- Diverse prompts: Cover many topics, styles, and difficulty levels
- Preference data quality: Annotator agreement should be >70%
- DPO over RLHF: For most teams, DPO recovers most of RLHF's quality at a fraction of the cost (roughly 90% at 30%, as a rough rule of thumb)
- Evaluate alignment separately: Measure both capability AND safety metrics
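The annotator-agreement guideline above can be checked with a few lines of code. The sketch below uses one possible interpretation, per-comparison majority agreement, and drops comparisons at or below the threshold; the data layout and function names are assumptions made for this example, and chance-corrected measures such as Cohen's kappa may be preferable in practice.

```python
from collections import Counter


def agreement_rate(labels):
    """Fraction of annotators who picked the majority label for one comparison."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def filter_by_agreement(comparisons, threshold=0.7):
    """Keep comparisons whose majority agreement exceeds the threshold."""
    return [c for c in comparisons if agreement_rate(c["labels"]) > threshold]


comparisons = [
    {"id": 1, "labels": ["A", "A", "A", "B", "A"]},  # 4 of 5 agree
    {"id": 2, "labels": ["A", "B", "A", "B", "B"]},  # 3 of 5 agree
    {"id": 3, "labels": ["B", "B", "B", "B", "B"]},  # unanimous
]
kept = filter_by_agreement(comparisons)
```

With a 70% threshold, comparisons 1 and 3 are kept and comparison 2 is dropped as too contested to serve as a clean training signal.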
Key Takeaways
- SFT teaches instruction-following; preference alignment teaches human values
- RLHF is powerful but complex; DPO is the pragmatic alternative for most teams
- Constitutional AI reduces reliance on human annotation
- Alignment often trades off some raw capability for safety and helpfulness
- Data quality is the dominant factor at every alignment stage
Related Documentation
- Fine-tuning — Parameter-efficient adaptation methods
- Safety and Red-teaming — Testing aligned models
- Evaluation Metrics — Measuring alignment quality