TRL (Transformer Reinforcement Learning) is Hugging Face's library for training and aligning LLMs; install it with pip install trl. SFTTrainer handles instruction fine-tuning: from trl import SFTTrainer, SFTConfig, then trainer = SFTTrainer(model=model, tokenizer=tokenizer, train_dataset=dataset, args=SFTConfig(output_dir="sft-model", max_seq_length=2048, packing=True, num_train_epochs=3)) followed by trainer.train(). The expected dataset format is {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]} — SFTTrainer applies the model's chat template automatically.
For preference alignment, DPO uses from trl import DPOTrainer, DPOConfig with datasets of {"chosen": "...", "rejected": "..."} or {"prompt": "...", "chosen": "...", "rejected": "..."} pairs: trainer = DPOTrainer(model=model, ref_model=ref_model, tokenizer=tokenizer, train_dataset=dataset, args=DPOConfig(beta=0.1, output_dir="dpo-model")), where beta=0.1 controls the strength of the KL penalty against the reference model. ORPO (from trl import ORPOTrainer, ORPOConfig) needs no reference model; its odds-ratio weight (λ in the paper) is set with ORPOConfig(beta=0.1). RewardTrainer (from trl import RewardTrainer, RewardConfig) trains reward models on tokenized pairs: {"input_ids_chosen": ..., "input_ids_rejected": ...}.
PPO follows a generate/score/step loop: ppo_trainer = PPOTrainer(config=PPOConfig(...), model=ppo_model, ref_model=ref_model, tokenizer=tokenizer); for each batch, generate responses with ppo_trainer.generate(query_tensors), score them with a reward model, and call ppo_trainer.step(query_tensors, response_tensors, rewards). For parameter-efficient training, pass peft_config=LoraConfig(...) directly to SFTTrainer or DPOTrainer; DataCollatorForCompletionOnlyLM masks the prompt tokens so the loss backpropagates only through completions. Claude Code generates TRL training scripts, DPO/ORPO preference datasets, reward model training, PPO RL loops, and PEFT-integrated configs.
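The PPO loop described above, as a runnable sketch. This assumes the classic PPOTrainer API from trl 0.8–0.11 (the PPO interface was reworked in later releases); the base-model ID is only an example, and the length-based reward is a stand-in for a trained reward model.
# ppo_loop_sketch.py — one PPO step with the classic generate/score/step API (sketch)
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"  # or your SFT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

# Policy and frozen reference, each wrapped with a value head for PPO
model = AutoModelForCausalLMWithValueHead.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

ppo_config = PPOConfig(batch_size=8, mini_batch_size=2, learning_rate=1e-5)
ppo_trainer = PPOTrainer(config=ppo_config, model=model, ref_model=ref_model, tokenizer=tokenizer)

prompts = ["Write a haiku about Python.", "Explain list comprehensions."] * 4
query_tensors = [tokenizer(p, return_tensors="pt").input_ids.squeeze(0) for p in prompts]

generation_kwargs = {"max_new_tokens": 64, "do_sample": True, "pad_token_id": tokenizer.eos_token_id}
response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, **generation_kwargs)
responses = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)

# Placeholder reward (response length in words): replace with a reward model's scores
rewards = [torch.tensor(float(len(r.split()))) for r in responses]

stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
ppo_trainer.log_stats(stats, {"query": prompts, "response": responses}, rewards)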
CLAUDE.md for TRL
## TRL Stack
- Version: trl >= 0.8, transformers >= 4.40
- SFT: SFTTrainer(model, tokenizer, train_dataset, args=SFTConfig(packing=True))
Dataset: {"messages": [{"role": "user"|"system"|"assistant", "content": str}]}
- DPO: DPOTrainer(model, ref_model, tokenizer, dataset, args=DPOConfig(beta=0.1))
Dataset: {"prompt": str, "chosen": str, "rejected": str}
- ORPO: ORPOTrainer(model, tokenizer, dataset, args=ORPOConfig(beta=0.1)) — no ref_model needed; for IPO use DPOTrainer with DPOConfig(loss_type="ipo")
- Reward: RewardTrainer(model, tokenizer, dataset, args=RewardConfig(...))
- PEFT: pass peft_config=LoraConfig(...) to any TRL trainer
- Chat template: SFTTrainer auto-applies tokenizer.chat_template to "messages" format
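To see exactly what SFTTrainer feeds the model from a "messages" record, you can render the chat template yourself. A small sketch, assuming a model whose tokenizer ships a chat template:
# chat_template_preview.py — render one "messages" record the way SFTTrainer will (sketch)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
messages = [
    {"role": "system", "content": "You are an expert Python programmer."},
    {"role": "user", "content": "Reverse a string in Python."},
    {"role": "assistant", "content": "Use slicing: s[::-1]"},
]
# apply_chat_template inserts the model's own special tokens and role headers
print(tokenizer.apply_chat_template(messages, tokenize=False))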
SFT Training
# finetune/sft_trainer.py — SFTTrainer for instruction fine-tuning
from __future__ import annotations
import os
import torch
from datasets import Dataset, load_dataset
from peft import LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DataCollatorForCompletionOnlyLM, SFTConfig, SFTTrainer
MODEL_ID = os.environ.get("BASE_MODEL", "meta-llama/Llama-3.2-3B-Instruct")
OUTPUT_DIR = "outputs/sft-llama3"
def load_model_and_tokenizer():
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
quantization_config=bnb_config,
device_map="auto",
use_cache=False,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
return model, tokenizer
def build_sft_dataset(n: int = 3000) -> Dataset:
"""Build an instruction dataset in messages format."""
raw = load_dataset("iamtarun/python_code_instructions_18k_alpaca", split=f"train[:{n}]")
def to_messages(example):
messages = [
{"role": "system", "content": "You are an expert Python programmer."},
{"role": "user", "content": example.get("instruction", "")},
{"role": "assistant", "content": example.get("output", "")},
]
return {"messages": messages}
return raw.map(to_messages, remove_columns=raw.column_names)
def run_sft():
model, tokenizer = load_model_and_tokenizer()
dataset = build_sft_dataset()
peft_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM,
)
sft_config = SFTConfig(
output_dir=OUTPUT_DIR,
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
gradient_checkpointing=True,
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.05,
bf16=True,
max_seq_length=2048,
packing=True, # Packs multiple short examples into one sequence
logging_steps=10,
save_steps=100,
save_total_limit=2,
report_to=["tensorboard"],
)
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
peft_config=peft_config,
args=sft_config,
)
trainer.train()
trainer.save_model(OUTPUT_DIR)
print(f"SFT model saved: {OUTPUT_DIR}")
DPO Training
# finetune/dpo_trainer.py — DPO for preference alignment
from __future__ import annotations
import os
import torch
from datasets import Dataset
from peft import LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer
MODEL_ID = "outputs/sft-llama3" # Start from SFT model
OUTPUT_DIR = "outputs/dpo-llama3"
def build_preference_dataset() -> Dataset:
"""
Build DPO dataset with chosen/rejected pairs.
Format: {"prompt": str, "chosen": str, "rejected": str}
Where chosen/rejected are complete messages in chat format OR just the assistant turn.
"""
examples = [
{
"prompt": "Explain recursion in Python",
"chosen": (
"Recursion is a technique where a function calls itself. "
"Here's a classic example:\n\n```python\ndef factorial(n):\n"
" if n <= 1:\n return 1\n return n * factorial(n - 1)\n```\n"
"The base case (n <= 1) prevents infinite recursion."
),
"rejected": (
"Recursion is when a function calls itself. "
"Example: def f(n): return f(n-1)"
),
},
# Add more real preference pairs from human annotations or model outputs
]
return Dataset.from_list(examples)
def run_dpo():
"""Run DPO training on a preference dataset."""
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load policy model (the model we're aligning)
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID, quantization_config=bnb_config, device_map="auto", use_cache=False
)
    # No separate reference model is loaded here: when a peft_config is passed,
    # TRL uses the base model with the LoRA adapters disabled as the frozen reference.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
dataset = build_preference_dataset()
peft_config = LoraConfig(
r=8, lora_alpha=16,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05, bias="none",
task_type=TaskType.CAUSAL_LM,
)
dpo_config = DPOConfig(
output_dir=OUTPUT_DIR,
beta=0.1, # KL divergence penalty
num_train_epochs=1,
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
learning_rate=5e-5,
bf16=True,
max_length=1024,
max_prompt_length=512,
logging_steps=10,
save_steps=50,
report_to=["tensorboard"],
)
trainer = DPOTrainer(
model=model,
        ref_model=None,  # adapters-disabled base model serves as the reference
tokenizer=tokenizer,
train_dataset=dataset,
peft_config=peft_config,
args=dpo_config,
)
trainer.train()
trainer.save_model(OUTPUT_DIR)
print(f"DPO model saved: {OUTPUT_DIR}")
# ── ORPO (no reference model needed) ─────────────────────────────────────────
def run_orpo():
"""ORPO training — no reference model, fewer hyperparameters than DPO."""
from trl import ORPOConfig, ORPOTrainer
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID, quantization_config=bnb_config, device_map="auto", use_cache=False
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
dataset = build_preference_dataset()
orpo_config = ORPOConfig(
output_dir="outputs/orpo-llama3",
        beta=0.1,  # Weight of the odds-ratio loss term (λ in the ORPO paper)
num_train_epochs=1,
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
learning_rate=8e-6,
bf16=True,
max_length=1024,
max_prompt_length=512,
logging_steps=10,
save_steps=50,
)
trainer = ORPOTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
args=orpo_config,
)
trainer.train()
trainer.save_model(orpo_config.output_dir)
print("ORPO training complete")
if __name__ == "__main__":
    run_dpo()
    # run_orpo()  # reference-model-free alternative to DPO
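The intro also mentions RewardTrainer, which the scripts above do not cover, so here is a minimal, hedged sketch. The tiny in-line dataset and the DistilBERT backbone are illustrative assumptions — in practice you would tokenize a real preference corpus and usually start from a backbone in the same family as the policy model.
# finetune/reward_trainer.py — RewardTrainer on tokenized chosen/rejected pairs (sketch)
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

BACKBONE = "distilbert-base-uncased"  # illustrative; pick a backbone from your model family

tokenizer = AutoTokenizer.from_pretrained(BACKBONE)
model = AutoModelForSequenceClassification.from_pretrained(BACKBONE, num_labels=1)

# Illustrative preference pairs; a real run would load an annotated corpus
pairs = [
    {"chosen": "Use a context manager: with open(path) as f: data = f.read()",
     "rejected": "Call open(path) and don't bother closing the file."},
]

def tokenize_pair(example):
    chosen = tokenizer(example["chosen"], truncation=True, max_length=512)
    rejected = tokenizer(example["rejected"], truncation=True, max_length=512)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

dataset = Dataset.from_list(pairs).map(tokenize_pair, remove_columns=["chosen", "rejected"])

reward_config = RewardConfig(
    output_dir="outputs/reward-model",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    max_length=512,
    remove_unused_columns=False,
)
trainer = RewardTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=reward_config,
)
trainer.train()
trainer.save_model(reward_config.output_dir)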
When you only need supervised instruction fine-tuning without preference alignment, a PEFT-only LoRA setup on the plain Hugging Face Trainer is enough; TRL's SFTTrainer adds packing and completion-only loss masking via DataCollatorForCompletionOnlyLM, and its DPO/ORPO/PPO trainers come into play when you want to align model outputs to human preferences beyond instruction following. If you prefer a configuration-driven framework, Axolotl orchestrates PEFT, DeepSpeed, Flash Attention, and multiple dataset formats through a single YAML config file without writing Python training code; it wraps TRL and related libraries, while TRL itself provides the low-level Python API for custom training loops, reward functions, and non-standard preference datasets. The Claude Skills 360 bundle includes TRL skill sets covering SFTTrainer instruction fine-tuning, DPOTrainer preference alignment, ORPO training, RewardTrainer, and PPO loop configurations. Start with the free tier to try LLM alignment training generation.