Weights & Biases (W&B) tracks ML experiments with a few lines of code. Install with `pip install wandb`; `wandb.login()` authenticates with your API key. `wandb.init(project="churn-model", name="gbm-v1", config={"lr": 0.05, "n_estimators": 200}, tags=["production", "sklearn"])` starts a run. `wandb.log({"loss": 0.23, "auc": 0.87, "epoch": 5})` logs metrics; call it in a loop to build time series. Built-in plots cover common evaluation charts: `wandb.log({"roc_curve": wandb.plot.roc_curve(y_true, y_probas, labels=["no_churn", "churn"])})` and `wandb.log({"conf_matrix": wandb.plot.confusion_matrix(probs=y_probas, y_true=y_true, class_names=["no", "yes"])})`, where `y_probas` holds one probability column per class. `wandb.log({"predictions": wandb.Table(dataframe=df)})` logs tabular data, and `wandb.log({"chart": wandb.Image(fig)})` logs images. Artifacts version datasets and models: create one with `artifact = wandb.Artifact("churn-dataset", type="dataset")`, add content with `artifact.add_dir("data/")` or `artifact.add_file("model.pkl")`, and record it with `run.log_artifact(artifact)`. Download later with `artifact = run.use_artifact("churn-model:latest"); artifact.download()`. `run.summary["best_auc"] = 0.91` sets a per-run summary metric. Sweeps automate hyperparameter search: define `sweep_config = {"method": "bayes", "metric": {"goal": "maximize", "name": "auc"}, "parameters": {"lr": {"distribution": "log_uniform_values", "min": 0.001, "max": 0.1}}}`, then run `sweep_id = wandb.sweep(sweep_config, project="churn-model")` and `wandb.agent(sweep_id, function=train, count=50)`. Hugging Face integrates automatically via `TrainingArguments(report_to="wandb")`. Call `run.finish()` when done, or use `wandb.init()` as a context manager. Claude Code generates W&B training loops, sweep configs, artifact management, and TypeScript API clients.
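The built-in plots above take a per-class probability matrix, but binary classifiers typically expose only the positive-class column via `predict_proba(...)[:, 1]`. A minimal stdlib sketch of rebuilding the expected two-column layout (`to_two_class_probs` is an illustrative helper, not part of the wandb API):

```python
def to_two_class_probs(pos_probs: list[float]) -> list[list[float]]:
    """Expand P(positive) into [P(negative), P(positive)] rows,
    the (n_samples, n_classes) shape wandb.plot.roc_curve and
    wandb.plot.confusion_matrix expect for binary problems."""
    return [[1.0 - p, p] for p in pos_probs]

rows = to_two_class_probs([0.9, 0.2, 0.65])
print(rows)  # three rows, each summing to 1.0
```

With NumPy the same reshaping is `np.column_stack([1 - pos_probs, pos_probs])`.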
## CLAUDE.md for W&B
## W&B Stack
- Version: wandb >= 0.16
- Init: wandb.init(project, name, config=hyperparams_dict, tags=[...])
- Log: wandb.log({"metric": value}) — call per step/epoch for time series
- Summary: wandb.summary["key"] = value — per-run summary metric
- Artifacts: wandb.Artifact(name, type) → artifact.add_file/add_dir → run.log_artifact()
- Sweeps: wandb.sweep(config) → wandb.agent(sweep_id, fn, count)
- Finish: wandb.finish() — or use as context manager with wandb.init() as run:
## Training with W&B Logging
```python
# train.py — complete training script with W&B experiment tracking
from __future__ import annotations

import pickle

import numpy as np
import pandas as pd
import wandb
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

FEATURE_COLS = ["age", "tenure_days", "monthly_spend", "support_tickets", "last_login_days"]
TARGET_COL = "churned"
PROJECT = "churn-prediction"


def build_pipeline(cfg) -> Pipeline:
    """Assemble the scaler + GBM pipeline from the run config."""
    return Pipeline([
        ("scaler", StandardScaler()),
        ("clf", GradientBoostingClassifier(
            n_estimators=cfg.n_estimators,
            learning_rate=cfg.learning_rate,
            max_depth=cfg.max_depth,
            min_samples_leaf=cfg.min_samples_leaf,
            subsample=cfg.subsample,
            random_state=42,
        )),
    ])


def train_model(config: dict | None = None) -> float:
    """Training function — works standalone or as a W&B sweep agent."""
    with wandb.init(project=PROJECT, config=config) as run:
        cfg = run.config

        # ── Load data ────────────────────────────────────────────────────
        train_df = pd.read_csv("data/train.csv")
        X = train_df[FEATURE_COLS].values
        y = train_df[TARGET_COL].values

        # ── Log dataset artifact ─────────────────────────────────────────
        dataset_artifact = wandb.Artifact(
            name="churn-dataset",
            type="dataset",
            description="Churn training dataset",
            metadata={"rows": len(train_df), "features": FEATURE_COLS},
        )
        dataset_artifact.add_file("data/train.csv")
        run.log_artifact(dataset_artifact)

        # ── Cross-validation with per-fold logging ───────────────────────
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        fold_aucs: list[float] = []
        for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
            X_tr, X_val = X[train_idx], X[val_idx]
            y_tr, y_val = y[train_idx], y[val_idx]
            pipeline = build_pipeline(cfg)
            pipeline.fit(X_tr, y_tr)
            y_proba = pipeline.predict_proba(X_val)[:, 1]
            auc = roc_auc_score(y_val, y_proba)
            ap = average_precision_score(y_val, y_proba)
            fold_aucs.append(auc)
            run.log({"fold": fold, "fold_auc": auc, "fold_ap": ap})

        mean_auc = float(np.mean(fold_aucs))

        # ── Final model on full training data ────────────────────────────
        final_pipeline = build_pipeline(cfg)
        final_pipeline.fit(X, y)
        y_proba_all = final_pipeline.predict_proba(X)[:, 1]
        # wandb.plot helpers expect one probability column per class
        probs_2col = np.column_stack([1 - y_proba_all, y_proba_all])

        # ── Log rich metrics ─────────────────────────────────────────────
        run.log({
            "cv_auc_mean": mean_auc,
            "cv_auc_std": float(np.std(fold_aucs)),
            "train_auc": roc_auc_score(y, y_proba_all),
            "roc_curve": wandb.plot.roc_curve(y, probs_2col, labels=["no_churn", "churn"]),
            "conf_matrix": wandb.plot.confusion_matrix(
                probs=probs_2col,
                y_true=y,
                class_names=["no_churn", "churn"],
            ),
        })

        # Log feature importance
        importances = final_pipeline.named_steps["clf"].feature_importances_
        fi_table = wandb.Table(
            columns=["feature", "importance"],
            data=sorted(zip(FEATURE_COLS, importances), key=lambda x: -x[1]),
        )
        run.log({"feature_importance": wandb.plot.bar(
            fi_table, "feature", "importance", title="Feature Importances"
        )})

        # ── Model artifact ───────────────────────────────────────────────
        model_path = "model.pkl"
        with open(model_path, "wb") as f:
            pickle.dump(final_pipeline, f)
        model_artifact = wandb.Artifact(
            name="churn-model",
            type="model",
            description=f"GBM churn model - AUC {mean_auc:.4f}",
            metadata={
                "cv_auc": mean_auc,
                "n_estimators": cfg.n_estimators,
                "learning_rate": cfg.learning_rate,
                "max_depth": cfg.max_depth,
            },
        )
        model_artifact.add_file(model_path)
        run.log_artifact(model_artifact)

        run.summary["cv_auc"] = mean_auc
        run.summary["best_fold"] = int(np.argmax(fold_aucs))
        return mean_auc


# ── Standalone training ──────────────────────────────────────────────────────
DEFAULT_CONFIG = {
    "n_estimators": 200,
    "learning_rate": 0.05,
    "max_depth": 4,
    "min_samples_leaf": 10,
    "subsample": 0.8,
}

if __name__ == "__main__":
    wandb.login()
    train_model(config=DEFAULT_CONFIG)
```
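The script pickles the fitted pipeline before attaching it to the model artifact, and a consumer that calls `use_artifact("churn-model:latest").download()` reloads it the same way. A minimal round-trip sketch using a stand-in object, so no sklearn or network access is assumed:

```python
import os
import pickle
import tempfile

model = {"kind": "gbm", "cv_auc": 0.91}  # stand-in for the fitted Pipeline

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)        # what train.py does before add_file("model.pkl")

with open(path, "rb") as f:
    restored = pickle.load(f)    # what an inference service does after download()

assert restored == model
```

In production, pickle files should only be loaded from trusted artifacts, since unpickling executes arbitrary code.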
## Hyperparameter Sweeps
```python
# sweeps.py — W&B Bayesian hyperparameter sweep
import wandb

from train import PROJECT, train_model

SWEEP_CONFIG = {
    "method": "bayes",  # bayes | grid | random
    "metric": {
        "goal": "maximize",
        "name": "cv_auc_mean",
    },
    "parameters": {
        "n_estimators": {"values": [100, 200, 400, 600]},
        "learning_rate": {"distribution": "log_uniform_values", "min": 0.005, "max": 0.2},
        "max_depth": {"values": [2, 3, 4, 5, 6]},
        "min_samples_leaf": {"values": [5, 10, 20, 50]},
        "subsample": {"distribution": "uniform", "min": 0.6, "max": 1.0},
    },
    "early_terminate": {
        "type": "hyperband",
        "min_iter": 3,
    },
}


def run_sweep(count: int = 50) -> None:
    """Launch a Bayesian sweep; run this script on multiple machines to parallelize."""
    sweep_id = wandb.sweep(SWEEP_CONFIG, project=PROJECT)
    print(f"Sweep ID: {sweep_id}")
    # Each agent executes `count` runs sequentially, pulling configs from the sweep server
    wandb.agent(sweep_id, function=train_model, count=count)


if __name__ == "__main__":
    wandb.login()
    run_sweep(count=50)
```
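The `learning_rate` search space uses `log_uniform_values`, which samples uniformly in log space between `min` and `max`, so each order of magnitude gets equal probability and the median falls at the geometric mean rather than the arithmetic midpoint. A stdlib sketch of an equivalent draw (`sample_log_uniform` is illustrative, not the wandb implementation):

```python
import math
import random

def sample_log_uniform(lo: float, hi: float, rng: random.Random) -> float:
    """Sample uniformly in log space: exp(U(log lo, log hi))."""
    return math.exp(rng.uniform(math.log(lo), math.log(hi)))

rng = random.Random(0)
draws = [sample_log_uniform(0.005, 0.2, rng) for _ in range(1000)]

# All draws stay in range, and about half land below the geometric mean
geo_mean = math.sqrt(0.005 * 0.2)
below = sum(d < geo_mean for d in draws)
print(f"{below / len(draws):.2f} of draws below {geo_mean:.4f}")
```

A plain `uniform` distribution over the same range would put roughly 97% of the mass above 0.01, starving the small-learning-rate region that often matters most.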
## TypeScript API Client
```typescript
// lib/wandb/client.ts — query W&B runs via REST API
const WANDB_API = "https://api.wandb.ai"
const WANDB_API_KEY = process.env.WANDB_API_KEY ?? ""

export type WandbRun = {
  id: string
  name: string
  state: string
  summary: Record<string, number>
  config: Record<string, unknown>
  created_at: string
}

async function wandbFetch<T>(path: string): Promise<T> {
  const res = await fetch(`${WANDB_API}${path}`, {
    headers: { Authorization: `Basic ${btoa(`api:${WANDB_API_KEY}`)}` },
  })
  if (!res.ok) throw new Error(`W&B API ${res.status}: ${await res.text()}`)
  return res.json() as Promise<T>
}

/** List runs for a project, sorted by best AUC */
export async function listRuns(entity: string, project: string): Promise<WandbRun[]> {
  const data = await wandbFetch<{ runs: WandbRun[] }>(
    `/api/v1/runs/${entity}/${project}?order=-summary_metrics.cv_auc_mean&per_page=50`
  )
  return data.runs
}

/** Get the best run by a summary metric */
export async function getBestRun(
  entity: string,
  project: string,
  metric: string = "cv_auc_mean",
): Promise<WandbRun | null> {
  const runs = await listRuns(entity, project)
  return runs.sort((a, b) => (b.summary[metric] ?? 0) - (a.summary[metric] ?? 0))[0] ?? null
}

/** Resolve an artifact download URL */
export async function getArtifactUrl(
  entity: string,
  project: string,
  artifact: string,
  version: string = "latest",
): Promise<string> {
  // Throws via wandbFetch if the artifact version does not exist
  await wandbFetch<{ artifact: { id: string; currentVersion: { id: string } } }>(
    `/api/v1/artifacts/${entity}/${project}/${artifact}/${version}`
  )
  return `${WANDB_API}/artifacts/${entity}/${project}/${artifact}/${version}`
}
```
Choose the MLflow alternative when you need a self-hosted, open-source experiment tracking server that runs inside your own infrastructure without a SaaS dependency: MLflow's tracking server and model registry work identically on-prem or in any cloud, while W&B is a managed SaaS service with richer built-in visualizations, collaboration features, interactive dashboards, and the Sweeps distributed hyperparameter search system. Choose the Neptune.ai alternative when you need unlimited storage for large media artifacts, custom metadata namespaces, an advanced query language for run filtering, and team collaboration without per-seat pricing: Neptune offers more flexible artifact retention, while W&B's Artifacts system with lineage tracking and the W&B Table comparison tool provide industry-standard visualization for model debugging and dataset versioning. The Claude Skills 360 bundle includes W&B skill sets covering experiment logging, artifact management, Bayesian sweeps, and TypeScript API clients. Start with the free tier to try ML experiment tracking generation.