DVC versions data and models alongside Git. Install with `pip install dvc dvc-s3`, then run `dvc init` to add DVC to a Git repo. `dvc add data/train.csv` tracks large files: it creates `data/train.csv.dvc` (committed to Git) and adds the actual file to `.gitignore`. `dvc remote add -d myremote s3://my-bucket/dvc-store` sets the default remote; `dvc push` uploads to S3 and `dvc pull` downloads. Supported remote types include s3, gs, azure, ssh, and local.

DVC Pipelines: `dvc.yaml` defines stages, each with `cmd`, `deps`, `params`, `outs`, and `metrics` — for example a `prepare` stage with `cmd: python prepare.py`, `deps: [data/raw/, prepare.py]`, `params: [prepare.yaml:split_ratio]`, `outs: [data/prepared/]`, and `metrics: [metrics/prepare_metrics.json]`. `dvc repro` re-runs only the stages whose inputs changed; `dvc dag` shows the pipeline graph.

Parameters: `params.yaml` stores hyperparameters, and `dvc params diff` compares them between commits. Metrics: `dvc metrics show` displays them and `dvc metrics diff` compares them between commits. Experiments: `dvc exp run --set-param train.lr=0.01` runs the pipeline with different params without creating a new commit, `dvc exp show` prints a tabular comparison, and `dvc exp push` shares experiments.

DVCLive: `from dvclive import Live`, then `live.log_metric("auc", 0.87)`, `live.log_params(config)`, and `live.next_step()` auto-generate `dvclive/metric_history.tsv` and integrate with DVC. `dvc plots show dvclive/metric_history.tsv` renders charts.

`dvc checkout` restores data to any version, `dvc status` shows what's changed, and `dvc gc --workspace` cleans unused cache. Claude Code generates dvc.yaml pipelines, params.yaml files, DVCLive training scripts, remote configs, and GitHub Actions workflows.
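The commands above chain into a short end-to-end workflow. A sketch, assuming an existing Git repo and a placeholder S3 bucket name:

```shell
# One-time setup in an existing Git repo
pip install "dvc[s3]"
dvc init
git commit -m "Initialize DVC"

# Track a large file: DVC writes data/train.csv.dvc and gitignores the data itself
dvc add data/train.csv
git add data/train.csv.dvc data/.gitignore
git commit -m "Track training data with DVC"

# Configure the default remote (bucket path is a placeholder) and sync
dvc remote add -d myremote s3://my-bucket/dvc-store
git add .dvc/config
git commit -m "Configure DVC remote"
dvc push        # upload cached data to S3

# On another machine, or after git checkout <rev>:
dvc pull        # download the data matching the current .dvc files
dvc checkout    # restore workspace files from the local cache
```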
## CLAUDE.md for DVC
```markdown
## DVC Stack
- Version: dvc >= 3.x + dvc-s3 / dvc-gs / dvc-azure (remote-specific)
- Init: dvc init in git repo root
- Track: dvc add data/file.csv → commits .dvc file, data goes into .gitignore
- Remote: dvc remote add -d myremote s3://bucket/path
- Pipeline: dvc.yaml → stages with cmd/deps/params/outs/metrics
- Repro: dvc repro — re-runs only changed stages
- Experiments: dvc exp run --set-param --name; dvc exp show for comparison table
- DVCLive: from dvclive import Live → live.log_metric / log_params / next_step
```
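For reference, the `.dvc` pointer file that `dvc add` commits to Git is a small YAML stub. A sketch of its shape — the hash and size values here are illustrative, not real:

```yaml
# data/file.csv.dvc — committed to Git in place of the large file
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00   # illustrative content hash
  size: 1048576                            # illustrative byte count
  hash: md5
  path: file.csv
```

Git versions this stub while the actual file lives in the DVC cache and the remote; checking out an old commit and running `dvc checkout` restores the matching data.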
## DVC Pipeline (dvc.yaml)
```yaml
# dvc.yaml — complete ML pipeline definition
stages:
  # ── 1. Data preparation ───────────────────────────────────────────────────
  prepare:
    cmd: python src/prepare.py
    deps:
      - data/raw/customers.csv
      - src/prepare.py
    params:
      - params.yaml:
          - prepare.test_size
          - prepare.random_seed
          - prepare.feature_cols
    outs:
      - data/prepared/train.csv
      - data/prepared/test.csv
    metrics:
      - metrics/prepare.json:
          cache: false

  # ── 2. Feature engineering ────────────────────────────────────────────────
  featurize:
    cmd: python src/featurize.py
    deps:
      - data/prepared/train.csv
      - data/prepared/test.csv
      - src/featurize.py
    params:
      - params.yaml:
          - featurize.max_categories
          - featurize.scale_features
    outs:
      - data/features/train_features.pkl
      - data/features/test_features.pkl
      - data/features/scaler.pkl

  # ── 3. Model training ─────────────────────────────────────────────────────
  train:
    cmd: python src/train.py
    deps:
      - data/features/train_features.pkl
      - data/features/scaler.pkl
      - src/train.py
    params:
      - params.yaml:
          - train.n_estimators
          - train.learning_rate
          - train.max_depth
          - train.min_samples_leaf
    outs:
      - models/churn_model.pkl
    metrics:
      - metrics/train.json:
          cache: false
    plots:
      - dvclive/metric_history.tsv:
          cache: false

  # ── 4. Model evaluation ───────────────────────────────────────────────────
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - data/features/test_features.pkl
      - models/churn_model.pkl
      - src/evaluate.py
    metrics:
      - metrics/evaluate.json:
          cache: false
    plots:
      - metrics/roc_curve.json:
          x: fpr
          y: tpr
          cache: false
      - metrics/confusion_matrix.json:
          template: confusion
          cache: false
```
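With this `dvc.yaml` in place, the day-to-day loop looks roughly like the following; the experiment name is a placeholder:

```shell
dvc repro                 # re-run only the stages whose deps/params changed
dvc dag                   # visualize the stage graph
dvc metrics show          # display the metrics/*.json values

# Try a different learning rate without creating a commit
dvc exp run --set-param train.learning_rate=0.01 --name lr-low
dvc exp show              # tabular comparison of experiments
dvc exp apply lr-low      # promote the chosen experiment into the workspace
```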
```yaml
# params.yaml — hyperparameters file
prepare:
  test_size: 0.2
  random_seed: 42
  feature_cols:
    - age
    - tenure_days
    - monthly_spend
    - support_tickets
    - last_login_days

featurize:
  max_categories: 20
  scale_features: true

train:
  n_estimators: 200
  learning_rate: 0.05
  max_depth: 4
  min_samples_leaf: 10
```
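Conceptually, `dvc params diff` flattens this nested file into dotted keys (e.g. `train.learning_rate`) and reports values that differ between two revisions. A minimal stdlib-only sketch of that idea — `flatten` and `params_diff` are illustrative helpers, not DVC internals:

```python
# Sketch of what `dvc params diff` computes conceptually: flatten the nested
# params from two revisions into dotted keys, then report the changed ones.

def flatten(d: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted keys, e.g. train.learning_rate."""
    flat = {}
    for key, value in d.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat

def params_diff(old: dict, new: dict) -> dict:
    """Return {dotted_key: {'old': ..., 'new': ...}} for every changed key."""
    old_flat, new_flat = flatten(old), flatten(new)
    return {
        key: {"old": old_flat.get(key), "new": new_flat.get(key)}
        for key in sorted(old_flat.keys() | new_flat.keys())
        if old_flat.get(key) != new_flat.get(key)
    }

old = {"train": {"n_estimators": 200, "learning_rate": 0.05}}
new = {"train": {"n_estimators": 200, "learning_rate": 0.01}}
print(params_diff(old, new))
# {'train.learning_rate': {'old': 0.05, 'new': 0.01}}
```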
## Training Script with DVCLive
```python
# src/train.py — DVC pipeline training stage with DVCLive logging
from __future__ import annotations

import json
import pickle
from pathlib import Path

import numpy as np
import yaml
from dvclive import Live
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score


def load_params(params_file: str = "params.yaml") -> dict:
    with open(params_file) as f:
        return yaml.safe_load(f)["train"]


def main():
    params = load_params()

    # Load features (created by the featurize stage)
    with open("data/features/train_features.pkl", "rb") as f:
        X_train, y_train = pickle.load(f)

    # DVCLive for real-time metric tracking
    with Live(dir="dvclive", report="auto") as live:
        live.log_params(params)

        clf = GradientBoostingClassifier(
            n_estimators=params["n_estimators"],
            learning_rate=params["learning_rate"],
            max_depth=params["max_depth"],
            min_samples_leaf=params["min_samples_leaf"],
            warm_start=True,  # enable incremental fitting for per-step logging
            random_state=42,
        )

        # Train incrementally to log per-iteration metrics
        step_size = max(1, params["n_estimators"] // 20)
        for n in range(step_size, params["n_estimators"] + 1, step_size):
            clf.n_estimators = n
            clf.fit(X_train, y_train)
            auc = roc_auc_score(y_train, clf.predict_proba(X_train)[:, 1])
            live.log_metric("train_auc", auc)
            live.next_step()

        # Final cross-validation score
        cv_aucs = cross_val_score(clf, X_train, y_train, cv=5, scoring="roc_auc", n_jobs=-1)
        live.log_metric("cv_auc_mean", float(np.mean(cv_aucs)))
        live.log_metric("cv_auc_std", float(np.std(cv_aucs)))
        live.log_metric("final_train_auc", roc_auc_score(y_train, clf.predict_proba(X_train)[:, 1]))

    # Save model artifact
    Path("models").mkdir(exist_ok=True)
    with open("models/churn_model.pkl", "wb") as f:
        pickle.dump(clf, f)

    # Write DVC metrics JSON
    Path("metrics").mkdir(exist_ok=True)
    with open("metrics/train.json", "w") as f:
        json.dump({
            "cv_auc_mean": round(float(np.mean(cv_aucs)), 4),
            "cv_auc_std": round(float(np.std(cv_aucs)), 4),
            "n_estimators": params["n_estimators"],
        }, f, indent=2)

    print(f"Training complete. CV AUC: {np.mean(cv_aucs):.4f} ± {np.std(cv_aucs):.4f}")


if __name__ == "__main__":
    main()
```
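The `evaluate` stage's plot files follow a simple contract: DVC renders a JSON file containing a list of records whose field names match the `x`/`y` keys declared in `dvc.yaml`. A stdlib-only sketch of producing `metrics/roc_curve.json` in that shape — the toy labels and scores are illustrative, and a real `src/evaluate.py` would load the test features and model instead:

```python
# Sketch: write an ROC curve as a list of {"fpr": ..., "tpr": ...} records,
# the shape dvc.yaml's `x: fpr` / `y: tpr` plot declaration expects.
import json
from pathlib import Path

def roc_points(y_true: list[int], y_score: list[float]) -> list[dict]:
    """Compute (fpr, tpr) at each distinct score threshold, stdlib-only."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s >= threshold)
        fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s >= threshold)
        points.append({"fpr": fp / neg, "tpr": tp / pos})
    return points

# Toy labels and predicted scores for illustration only
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

Path("metrics").mkdir(exist_ok=True)
with open("metrics/roc_curve.json", "w") as f:
    json.dump(roc_points(y_true, y_score), f, indent=2)
```

`dvc plots show metrics/roc_curve.json` then picks the `fpr` and `tpr` fields off each record to draw the curve.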
## GitHub Actions CI
```yaml
# .github/workflows/dvc-ci.yml — DVC pipeline in CI/CD
name: ML Pipeline CI

on:
  pull_request:
    paths:
      - "src/**"
      - "params.yaml"
      - "dvc.yaml"
      - "data/**/*.dvc"

jobs:
  dvc-repro:
    runs-on: ubuntu-latest
    env:
      # Job-level credentials so every dvc command (pull, repro, push) can reach S3
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so HEAD~1 exists for params/metrics diff
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - name: Install dependencies
        run: pip install "dvc[s3]" dvclive scikit-learn pandas pyyaml
      - name: Configure DVC remote
        run: dvc remote modify myremote region us-east-1
      - name: Pull data from remote
        run: dvc pull --run-cache
      - name: Reproduce pipeline
        run: dvc repro
      - name: Show metrics
        run: |
          dvc metrics show
          dvc params diff HEAD~1
      - name: Push updated cache
        run: dvc push
      - name: Set up CML
        uses: iterative/setup-cml@v2
      - name: CML report
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          echo "## DVC Pipeline Results" > report.md
          dvc metrics show --md >> report.md
          dvc params diff HEAD~1 --md >> report.md
          cml comment create report.md
```
Consider the MLflow alternative when you need an experiment tracking server, a model registry with approval workflows, and deployment integrations that work across ML frameworks including TensorFlow, PyTorch, and scikit-learn. MLflow focuses on experiment tracking and model deployment, while DVC focuses on data and pipeline versioning; the two can be used together, with DVCLive logging to DVC while MLflow tracks runs. Consider the Weights & Biases alternative when you want a managed SaaS platform for experiment tracking with rich visualization, team collaboration, and Sweeps for hyperparameter search, without setting up infrastructure. W&B handles experiment tracking beautifully, while DVC handles the data versioning problem that Git alone can't solve for large files, so W&B and DVC complement each other rather than compete.

The Claude Skills 360 bundle includes DVC skill sets covering pipeline YAML definitions, DVCLive training loops, S3 remote configuration, experiment comparison, and GitHub Actions CI. Start with the free tier to try data version control generation.