Axolotl fine-tunes LLMs through YAML configuration: install with pip install axolotl, preprocess with python -m axolotl.cli.preprocess config.yml, train with accelerate launch -m axolotl.cli.train config.yml, and merge LoRA adapters with python -m axolotl.cli.merge_lora config.yml. A config names the base model (base_model, model_type, tokenizer_type), an adapter (lora or qlora, with load_in_4bit: true for 4-bit quantization, plus lora_r, lora_alpha, and lora_target_modules), and one or more datasets in alpaca (instruction/input/output), sharegpt (conversations list), completion (raw text), or chat_template (messages) format, optionally weighted with ds_weight when mixing multiple entries. The same YAML covers training hyperparameters (num_epochs, micro_batch_size, gradient_accumulation_steps, learning_rate, lr_scheduler, warmup_steps), memory savers (flash_attention, sample_packing with eval_sample_packing: false), distributed backends (deepspeed: configs/zero2.json; fsdp: [full_shard, auto_wrap] with fsdp_config), sequence_len, checkpointing (output_dir, saves_per_epoch), Weights & Biases logging (wandb_project, wandb_name), Hub upload after training (hub_model_id), and DPO alignment (rl: dpo with chatml.intel-format pair datasets). Claude Code generates Axolotl YAML configs, multi-dataset recipes, DeepSpeed integration, DPO configs, and CLI training scripts.
# CLAUDE.md for Axolotl
## Axolotl Stack
- Version: axolotl >= 0.4, transformers >= 4.40, deepspeed >= 0.14
- Train: accelerate launch -m axolotl.cli.train config.yml
- Config: base_model, model_type, adapter (lora|qlora), datasets[{path, type}]
- Formats: alpaca, sharegpt, completion, chat_template, chatml.intel (for DPO)
- Flash Attn: flash_attention: true (requires flash-attn>=2.0)
- Sample packing: sample_packing: true (efficient for short sequences)
- Merge: python -m axolotl.cli.merge_lora config.yml → merged_model/
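The dataset formats above expect specific record shapes. A minimal sketch of what each type looks like on disk, with a hypothetical validity check (the field names follow the format descriptions above; everything else is illustrative):

```python
# Illustrative record shapes for Axolotl dataset types (not real training data).
alpaca_record = {
    "instruction": "Write a function to reverse a string.",
    "input": "",  # optional context; often empty
    "output": "def reverse(s):\n    return s[::-1]",
}

sharegpt_record = {
    "conversations": [
        {"from": "human", "value": "What is a list comprehension?"},
        {"from": "gpt", "value": "A concise way to build lists in Python."},
    ]
}

completion_record = {"text": "Raw pretraining-style text goes here."}


def is_valid_sharegpt(rec: dict) -> bool:
    """Hypothetical structural check: non-empty turns with known roles."""
    turns = rec.get("conversations", [])
    return bool(turns) and all(
        t.get("from") in {"human", "gpt"} and "value" in t for t in turns
    )
```

This kind of lightweight check is useful before pointing a config at a local JSONL file, since Axolotl's preprocessing will fail later on malformed rows.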
## Training Configs
# configs/qlora_llama3.yml — QLoRA fine-tuning on a single GPU
base_model: meta-llama/Llama-3.2-3B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
# QLoRA settings
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true # Target ALL linear layers automatically
# or explicit: lora_target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
# Dataset — can mix multiple sources
datasets:
  - path: iamtarun/python_code_instructions_18k_alpaca
    type: alpaca            # instruction + input + output fields
    split: train[:5000]
    shards: 1
  - path: HuggingFaceH4/ultrachat_200k
    type: sharegpt          # conversations: [{from: human|gpt, value: str}]
    conversation: llama-3   # Chat template to apply
    split: train_sft[:1000]
    ds_weight: 0.3          # 30% sampling weight when mixing datasets
# Sequence
sequence_len: 2048
sample_packing: true # Pack short examples into max-length chunks
eval_sample_packing: false # Disable for accurate eval loss
# Training hyperparameters
num_epochs: 3
micro_batch_size: 2 # batch per GPU
gradient_accumulation_steps: 4 # effective global batch = 2 * 4 * num_gpus (8 on a single GPU)
optimizer: adamw_bnb_8bit # 8-bit AdamW from bitsandbytes
lr_scheduler: cosine
learning_rate: 2e-4
warmup_steps: 50
weight_decay: 0.0
# Memory-efficiency
gradient_checkpointing: true
flash_attention: true # fused attention kernels; requires the flash-attn package
bf16: auto
# Output
output_dir: outputs/qlora-llama3
saves_per_epoch: 1
save_safetensors: true
# Logging
logging_steps: 10
eval_steps: 100
wandb_project: axolotl-runs
wandb_name: qlora-llama3-inst
# Hub upload (optional)
# hub_model_id: your-username/llama3-qlora
# hub_strategy: every_save
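The batch hyperparameters above compose into an effective global batch size. A quick sketch of the arithmetic (effective_batch is a hypothetical helper, not an Axolotl API):

```python
# Effective global batch size implied by an Axolotl config:
# micro_batch_size * gradient_accumulation_steps * num_gpus.
def effective_batch(micro_batch_size: int, grad_accum: int, num_gpus: int = 1) -> int:
    return micro_batch_size * grad_accum * num_gpus


# Values from configs/qlora_llama3.yml above:
print(effective_batch(2, 4))              # 8 on a single GPU
print(effective_batch(2, 4, num_gpus=4))  # 32 across four GPUs
```

Keeping the effective batch constant while scaling GPUs usually means dividing gradient_accumulation_steps as num_gpus grows.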
# configs/dpo_llama3.yml — Direct Preference Optimization
base_model: outputs/qlora-llama3/merged # Start from SFT-merged model
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
load_in_4bit: true
adapter: qlora
lora_r: 8
lora_alpha: 16
lora_target_linear: true
# DPO-specific
rl: dpo
dpo_beta: 0.1
datasets:
  - path: Intel/orca_dpo_pairs
    type: chatml.intel # DPO format: system + prompt + chosen + rejected
    split: train[:2000]
sequence_len: 1024
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 5e-5
warmup_ratio: 0.05
bf16: auto
flash_attention: true
output_dir: outputs/dpo-llama3
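The dpo_beta setting scales the preference margin in the DPO objective. A minimal numeric sketch of that loss, assuming summed per-sequence log-probabilities as inputs (this is the textbook formula, not Axolotl's internal implementation):

```python
import math

# DPO loss for one (chosen, rejected) pair:
# -log sigmoid(beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)))
def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))


# When the policy prefers the chosen response more than the reference does,
# the margin is positive and the loss falls below log(2) ≈ 0.693.
print(dpo_loss(-10.0, -14.0, -12.0, -13.0))  # margin = 3.0 here
```

A larger beta penalizes deviation from the reference model more sharply, which is why DPO runs typically pair a small beta (0.1 here) with a low learning rate.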
# configs/zero2_multinode.yml — ZeRO-2 multi-GPU config
base_model: meta-llama/Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
# Full fine-tuning (no adapter) with DeepSpeed ZeRO-2
adapter: ~
load_in_4bit: false
bf16: true
datasets:
  - path: HuggingFaceH4/ultrachat_200k
    type: sharegpt
    conversation: llama-3
    split: train_sft[:50000]
sequence_len: 4096
sample_packing: true
deepspeed: configs/zero2.json # Path to DeepSpeed config
num_epochs: 1
micro_batch_size: 1
gradient_accumulation_steps: 8
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 1e-5
warmup_ratio: 0.03
gradient_checkpointing: true
flash_attention: true
output_dir: outputs/full-ft-llama3-8b
saves_per_epoch: 2
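The deepspeed key above points at a JSON file that must exist before launch. A minimal ZeRO stage-2 sketch using standard DeepSpeed config keys (Axolotl ships reference configs in its repo whose exact contents may differ; this hand-rolled version is illustrative):

```python
import json

# Minimal ZeRO-2 DeepSpeed config to pair with `deepspeed: configs/zero2.json`.
# "auto" values defer to the values Axolotl/accelerate pass at runtime.
zero2 = {
    "zero_optimization": {
        "stage": 2,                    # shard optimizer state + gradients
        "overlap_comm": True,          # overlap reduce with backward pass
        "contiguous_gradients": True,  # reduce memory fragmentation
    },
    "bf16": {"enabled": "auto"},
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_clipping": "auto",
}

print(json.dumps(zero2, indent=2))  # write this to configs/zero2.json
```

ZeRO-2 shards optimizer state and gradients but keeps full parameters on every GPU; for models that don't fit even then, ZeRO-3 (stage: 3) also shards the parameters.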
## Python Preprocessing and Launch
# scripts/axolotl_launch.py — programmatic config generation and training launch
from __future__ import annotations
import subprocess
import sys
from pathlib import Path
import yaml
def build_qlora_config(
    base_model: str = "meta-llama/Llama-3.2-3B-Instruct",
    dataset_path: str = "iamtarun/python_code_instructions_18k_alpaca",
    output_dir: str = "outputs/axolotl-run",
    lora_r: int = 16,
    epochs: int = 3,
    seq_len: int = 2048,
) -> dict:
    """Generate a QLoRA Axolotl config programmatically."""
    return {
        "base_model": base_model,
        "model_type": "AutoModelForCausalLM",
        "tokenizer_type": "AutoTokenizer",
        "load_in_4bit": True,
        "adapter": "qlora",
        "lora_r": lora_r,
        "lora_alpha": lora_r * 2,
        "lora_dropout": 0.05,
        "lora_target_linear": True,
        "datasets": [
            {
                "path": dataset_path,
                "type": "alpaca",
                "split": "train[:5000]",
            }
        ],
        "sequence_len": seq_len,
        "sample_packing": True,
        "eval_sample_packing": False,
        "num_epochs": epochs,
        "micro_batch_size": 2,
        "gradient_accumulation_steps": 4,
        "optimizer": "adamw_bnb_8bit",
        "lr_scheduler": "cosine",
        "learning_rate": 2e-4,
        "warmup_steps": 50,
        "gradient_checkpointing": True,
        "flash_attention": True,
        "bf16": "auto",
        "output_dir": output_dir,
        "saves_per_epoch": 1,
        "logging_steps": 10,
    }


def save_config(config: dict, path: str = "config.yml") -> Path:
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with open(out, "w") as f:
        yaml.dump(config, f, default_flow_style=False, sort_keys=False)
    print(f"Config saved: {out}")
    return out


def preprocess(config_path: str) -> None:
    """Preprocess and cache dataset before training."""
    subprocess.run(
        [sys.executable, "-m", "axolotl.cli.preprocess", config_path],
        check=True,
    )


def train(config_path: str, num_gpus: int = 1) -> None:
    """Launch Axolotl training via accelerate."""
    cmd = [
        "accelerate", "launch",
        f"--num_processes={num_gpus}",
        "-m", "axolotl.cli.train",
        config_path,
    ]
    subprocess.run(cmd, check=True)


def merge_lora(config_path: str) -> None:
    """Merge LoRA adapter weights into base model."""
    subprocess.run(
        [sys.executable, "-m", "axolotl.cli.merge_lora", config_path],
        check=True,
    )


def run_inference(config_path: str, prompt: str) -> None:
    """Interactive inference with the trained model."""
    subprocess.run(
        [sys.executable, "-m", "axolotl.cli.inference", config_path,
         "--prompter", "None", "--message", prompt],
        check=True,
    )


if __name__ == "__main__":
    config = build_qlora_config(
        base_model="meta-llama/Llama-3.2-3B-Instruct",
        dataset_path="iamtarun/python_code_instructions_18k_alpaca",
        output_dir="outputs/axolotl-qlora",
        epochs=1,
    )
    config_path = str(save_config(config, "configs/generated_qlora.yml"))

    print("Preprocessing dataset...")
    preprocess(config_path)

    print("Starting training...")
    train(config_path, num_gpus=1)

    print("Merging LoRA weights...")
    merge_lora(config_path)

    print("Testing inference...")
    run_inference(config_path, "Write a Python function to binary search a sorted list.")
Consider the Unsloth alternative when training on a single consumer GPU and maximum memory efficiency matters: Unsloth's custom Triton kernels claim roughly 2x speedup and 60% less VRAM versus a standard setup. Axolotl's YAML-driven approach, by contrast, handles multi-GPU clusters, multi-dataset mixing pipelines, and DPO/ORPO training without writing Python, making it the better choice for teams that need reproducible, version-controlled training recipes. Consider the TRL SFTTrainer alternative when you need a pure-Python programmatic API for custom data collators, reward functions, or non-standard training loops: TRL gives direct API control, while Axolotl wraps TRL and DeepSpeed in a configuration layer that prevents boilerplate mistakes and keeps hyperparameters consistent across experiments. The Claude Skills 360 bundle includes Axolotl skill sets covering QLoRA YAML configs, multi-dataset recipes, DPO alignment configs, DeepSpeed integration, and training launch scripts. Start with the free tier to try config-driven LLM fine-tuning generation.