timm (PyTorch Image Models) provides 1000+ pretrained vision models. Install with pip install timm, then import timm.
- List models: timm.list_models("efficientnet*") filters by wildcard.
- Load: model = timm.create_model("efficientnet_b3", pretrained=True).
- Custom classes: model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10).
- Feature extractor: model = timm.create_model("resnet50", pretrained=True, features_only=True, out_indices=(1, 2, 3, 4)); the forward pass features = model(x) returns a list of feature maps at the selected stages.
- Custom head: model = timm.create_model("convnext_base", pretrained=True, num_classes=0) removes the classifier head and returns pooled features.
- Transforms: data_config = timm.data.resolve_model_data_config(model), then transform = timm.data.create_transform(**data_config, is_training=False). Training transforms: timm.data.create_transform(**data_config, is_training=True, auto_augment="rand-m9-mstd0.5").
- Mixup: from timm.data import Mixup, then mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0, num_classes=100); apply with samples, targets = mixup_fn(samples, targets).
- Model info: from timm.models import get_pretrained_cfg, then cfg = get_pretrained_cfg("resnet50") and read cfg.input_size.
- Multi-scale: timm.create_model("swin_base_patch4_window7_224", img_size=384).
- In-channels: timm.create_model("resnet50", in_chans=1) for grayscale.
- Global pool: model = timm.create_model("efficientnet_b0", global_pool="avgmax").
- Export: torch.onnx.export(model, dummy, "model.onnx").
Claude Code generates timm fine-tuning scripts, feature extractors, classification heads, augmentation pipelines, and model comparison scripts.
CLAUDE.md for timm
## timm Stack
- Version: timm >= 1.0
- Load: timm.create_model(name, pretrained=True, num_classes=N)
- Discover: timm.list_models("efficientnet*") | timm.list_pretrained()
- Features: create_model(..., features_only=True, out_indices=(1,2,3,4))
- No head: create_model(..., num_classes=0) → pooled feature vector
- Transforms: timm.data.resolve_model_data_config(model) → create_transform(**cfg)
- Mixup/CutMix: Mixup(mixup_alpha, cutmix_alpha, num_classes) → callable
- Auto-aug: create_transform(..., auto_augment="rand-m9-mstd0.5-inc1")
- Multi-channel: create_model(..., in_chans=1) for grayscale/medical imaging
timm Fine-tuning and Feature Extraction
# vision/timm_finetuning.py — transfer learning and feature extraction with timm
from __future__ import annotations
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import timm
from timm.data import Mixup, create_transform, resolve_model_data_config
from timm.loss import SoftTargetCrossEntropy, LabelSmoothingCrossEntropy
# ── 1. Model creation ─────────────────────────────────────────────────────────
def create_classifier(
model_name: str = "efficientnet_b3",
num_classes: int = 10,
pretrained: bool = True,
drop_rate: float = 0.2,
drop_path: float = 0.1,
in_chans: int = 3,
) -> nn.Module:
"""
Load a pretrained timm model with custom classifier head.
drop_path is stochastic depth — key for ViT and ConvNeXt training.
"""
model = timm.create_model(
model_name,
pretrained=pretrained,
num_classes=num_classes,
drop_rate=drop_rate,
drop_path_rate=drop_path,
in_chans=in_chans,
)
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Model: {model_name} | classes: {num_classes} | params: {n_params:,}")
return model
def create_feature_extractor(
model_name: str = "resnet50",
pretrained: bool = True,
out_indices: tuple = (1, 2, 3, 4),
) -> nn.Module:
"""
Create a feature pyramid extractor for dense prediction tasks.
Returns multi-scale feature maps for detection/segmentation heads.
"""
model = timm.create_model(
model_name,
pretrained=pretrained,
features_only=True,
out_indices=out_indices,
)
# Print output shapes
dummy = torch.randn(1, 3, 224, 224)
with torch.no_grad():
features = model(dummy)
for i, f in enumerate(features):
print(f" Stage {out_indices[i]}: {tuple(f.shape)}")
return model
def create_embedding_model(
model_name: str = "vit_base_patch16_224",
pretrained: bool = True,
embed_dim: int = 512,
) -> nn.Module:
"""
Create an embedding model for similarity search / retrieval.
Uses global average pooling (num_classes=0) + optional projection.
"""
backbone = timm.create_model(model_name, pretrained=pretrained, num_classes=0)
backbone_dim = backbone.num_features
if embed_dim and embed_dim != backbone_dim:
model = nn.Sequential(
backbone,
nn.Linear(backbone_dim, embed_dim),
nn.LayerNorm(embed_dim),
)
else:
model = backbone
return model
# ── 2. Transform configuration ────────────────────────────────────────────────
def get_transforms(model, augment_level: str = "medium"):
"""
Get model-appropriate transforms using timm's auto-configuration.
augment_level: "none" | "light" | "medium" | "heavy"
"""
data_config = resolve_model_data_config(model)
aug_configs = {
"none": {},
"light": {"auto_augment": "rand-m5-mstd0.5"},
"medium": {"auto_augment": "rand-m9-mstd0.5-inc1", "re_prob": 0.25},
"heavy": {"auto_augment": "3augment", "re_prob": 0.3, "re_count": 2},
}
train_transform = create_transform(
**data_config,
is_training=True,
**aug_configs.get(augment_level, {}),
)
val_transform = create_transform(**data_config, is_training=False)
print(f"Input: {data_config['input_size']} | Norm: mean={data_config['mean']}")
return train_transform, val_transform
# ── 3. Fine-tuning setup ──────────────────────────────────────────────────────
def configure_fine_tuning(
model: nn.Module,
strategy: str = "full", # "full" | "head_only" | "last_n_blocks"
n_blocks: int = 4,
lr_head: float = 1e-3,
lr_backbone: float = 1e-4,
) -> list[dict]:
"""
Configure layer-wise learning rates for fine-tuning.
Returns param_groups for optimizer.
"""
if strategy == "head_only":
# Freeze backbone, train head only
for name, param in model.named_parameters():
if "head" not in name and "classifier" not in name and "fc" not in name:
param.requires_grad = False
params = [{"params": [p for p in model.parameters() if p.requires_grad], "lr": lr_head}]
elif strategy == "last_n_blocks":
# Freeze all but last N transformer blocks / stages
all_layers = list(model.named_parameters())
n_total = len(all_layers)
        freeze_until = max(0, n_total - n_blocks * 20)  # rough heuristic: ~20 parameter tensors per block
for i, (_, param) in enumerate(all_layers):
param.requires_grad = i >= freeze_until
params = [
{"params": [p for p in model.parameters() if p.requires_grad], "lr": lr_head}
]
else: # "full" fine-tuning with layer-wise LR decay
# Head
head_params = []
backbone_params = []
for name, param in model.named_parameters():
if any(k in name for k in ["head", "classifier", "fc"]):
head_params.append(param)
else:
backbone_params.append(param)
params = [
{"params": head_params, "lr": lr_head},
{"params": backbone_params, "lr": lr_backbone},
]
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Fine-tuning strategy: {strategy} | trainable: {trainable:,}")
return params
# ── 4. Training loop ──────────────────────────────────────────────────────────
class ImageClassificationTrainer:
"""Complete fine-tuning trainer with Mixup and label smoothing."""
def __init__(
self,
model: nn.Module,
num_classes: int,
device: str = "cuda",
mixup_alpha: float = 0.8,
cutmix_alpha: float = 1.0,
smooth: float = 0.1,
use_mixup: bool = True,
):
self.model = model.to(device)
self.device = device
self.use_mixup = use_mixup
if use_mixup:
self.mixup_fn = Mixup(
mixup_alpha=mixup_alpha,
cutmix_alpha=cutmix_alpha,
label_smoothing=smooth,
num_classes=num_classes,
)
self.criterion = SoftTargetCrossEntropy()
else:
self.criterion = LabelSmoothingCrossEntropy(smoothing=smooth)
self.mixup_fn = None
def train_epoch(
self,
loader: DataLoader,
optimizer: optim.Optimizer,
scheduler = None,
) -> dict:
self.model.train()
total_loss = 0.0
correct = 0
total = 0
for images, labels in loader:
images = images.to(self.device)
labels = labels.to(self.device)
if self.use_mixup and self.mixup_fn is not None:
images, labels = self.mixup_fn(images, labels)
optimizer.zero_grad()
logits = self.model(images)
loss = self.criterion(logits, labels)
loss.backward()
torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
optimizer.step()
if scheduler:
scheduler.step()
total_loss += loss.item()
if not self.use_mixup:
pred = logits.argmax(dim=1)
correct += (pred == labels).sum().item()
total += labels.size(0)
return {
"loss": total_loss / len(loader),
"acc": correct / total if total else 0.0,
}
@torch.no_grad()
def evaluate(self, loader: DataLoader) -> dict:
self.model.eval()
total_loss = 0.0
correct = 0
total = 0
criterion = nn.CrossEntropyLoss()
for images, labels in loader:
images, labels = images.to(self.device), labels.to(self.device)
logits = self.model(images)
loss = criterion(logits, labels.long())
pred = logits.argmax(dim=1)
correct += (pred == labels).sum().item()
total += labels.size(0)
total_loss += loss.item()
return {
"loss": total_loss / len(loader),
"acc": correct / total,
}
# ── 5. Feature extraction for downstream tasks ───────────────────────────────
@torch.no_grad()
def extract_embeddings(
model: nn.Module,
dataloader: DataLoader,
device: str = "cpu",
) -> tuple[torch.Tensor, torch.Tensor]:
"""
Extract feature embeddings for all samples.
Returns (embeddings, labels) as tensors.
"""
model.eval().to(device)
all_feats = []
all_labels = []
for images, labels in dataloader:
images = images.to(device)
feats = model(images)
if feats.dim() > 2:
feats = feats.mean(dim=[2, 3]) # GAP for feature maps
all_feats.append(feats.cpu())
all_labels.append(labels)
return torch.cat(all_feats), torch.cat(all_labels)
def find_similar_images(
query_embedding: torch.Tensor, # (D,)
gallery_embeddings: torch.Tensor, # (N, D)
top_k: int = 10,
) -> tuple[torch.Tensor, torch.Tensor]:
"""Cosine similarity nearest-neighbor search."""
    query = query_embedding / (query_embedding.norm() + 1e-8)
    gallery = gallery_embeddings / (gallery_embeddings.norm(dim=1, keepdim=True) + 1e-8)
    scores = gallery @ query
    values, indices = torch.topk(scores, k=min(top_k, scores.numel()))
return values, indices
# ── 6. Model comparison ───────────────────────────────────────────────────────
def compare_models(
model_names: list[str],
input_size: tuple = (1, 3, 224, 224),
device: str = "cpu",
) -> list[dict]:
"""Compare models by parameter count, throughput, and memory."""
import time
results = []
dummy = torch.randn(*input_size).to(device)
for name in model_names:
try:
model = timm.create_model(name, pretrained=False, num_classes=1000).to(device)
model.eval()
n_params = sum(p.numel() for p in model.parameters())
            # (FLOPs/MACs would need an extra package such as ptflops)
# Throughput benchmark
with torch.no_grad():
for _ in range(3): # Warmup
model(dummy)
t0 = time.perf_counter()
for _ in range(20):
model(dummy)
elapsed = time.perf_counter() - t0
throughput = 20 / elapsed * input_size[0] # samples/sec
results.append({
"model": name,
"params_M": n_params / 1e6,
"throughput": throughput,
"input_size": input_size[-1],
})
print(f" {name:<40} {n_params/1e6:>6.1f}M params | {throughput:>6.0f} imgs/s")
except Exception as e:
print(f" {name}: Error — {e}")
return results
if __name__ == "__main__":
# Discover models
efficient_models = timm.list_models("efficientnet_b*")
print(f"EfficientNet variants: {efficient_models}")
vit_models = timm.list_models("vit_base*", pretrained=True)
print(f"Pretrained ViT-Base variants: {len(vit_models)}")
# Create model and get transforms
model = create_classifier("efficientnet_b3", num_classes=100)
train_tf, val_tf = get_transforms(model, "medium")
print(f"Train transform: {train_tf}")
# Feature extractor
feat_model = create_feature_extractor("resnet50", pretrained=True, out_indices=(2, 3, 4))
# Embedding model
embed_model = create_embedding_model("vit_base_patch16_224", embed_dim=256)
# Compare models
print("\nModel comparison:")
compare_models(["resnet18", "efficientnet_b0", "mobilenetv3_small_100", "vit_tiny_patch16_224"])
Choose torchvision instead when you want built-in PyTorch vision models (ResNet, EfficientNet, ViT) with official PyTorch support and tight DataLoader integration. torchvision covers the standard architectures, while timm's 1000+ models include the latest architectural innovations (ConvNeXt V2, EVA, SwinV2, MaxViT, DaViT) with regularly updated pretrained checkpoints and a unified create_model API that makes multi-model benchmarking trivial, with no changes to training code.

Choose HuggingFace transformers instead when you need ViT, CLIP, BLIP, and other transformer vision models with HuggingFace Hub integration and the pipeline API. The HF ecosystem handles multimodal and vision-language models, while timm's features_only=True feature pyramids, Mixup/CutMix augmentation, and layer-wise learning-rate support make it the stronger choice for training discriminative vision models from scratch or fine-tuning them for classification and retrieval.

The Claude Skills 360 bundle includes timm skill sets covering model creation, feature extraction, embedding models, Mixup/CutMix augmentation, layer-wise fine-tuning, training loops, throughput benchmarking, and model comparison. Start with the free tier to try vision-model transfer-learning code generation.