TorchMetrics provides modular, distributed-safe ML evaluation metrics for PyTorch. Install with pip install torchmetrics and import with from torchmetrics import Accuracy, F1Score, AUROC, MeanSquaredError. Classification follows the stateful pattern acc = Accuracy(task="binary"); acc.update(preds, target); result = acc.compute(); acc.reset(). Multiclass: Accuracy(task="multiclass", num_classes=10). Multilabel: Accuracy(task="multilabel", num_labels=5, average="macro"). Precision/Recall: Precision(task="binary"), Recall(task="multiclass", num_classes=10, average="macro"). F1: F1Score(task="binary"), F1Score(task="multiclass", num_classes=10, average="weighted"). AUROC: AUROC(task="binary") expects probabilities, not logits. Confusion matrix: ConfusionMatrix(task="multiclass", num_classes=5). Regression: MeanSquaredError(), MeanAbsoluteError(), R2Score(); MeanSquaredError(squared=False) gives RMSE. Collections: metrics = MetricCollection([Accuracy(task="binary"), F1Score(task="binary"), AUROC(task="binary")]). Lightning: self.train_metrics = MetricCollection({...}); logging a metric object with self.log("val/acc", acc) lets Lightning handle compute and reset automatically. Detection mAP: from torchmetrics.detection import MeanAveragePrecision; map_metric = MeanAveragePrecision(); map_metric.update(preds, targets). IoU: IntersectionOverUnion(). BLEU: BLEUScore(n_gram=4). ROUGE: ROUGEScore(). Functional API: from torchmetrics.functional import accuracy; acc = accuracy(preds, target, task="binary"). Device: acc = acc.to(device), so the metric lives on the same device as its input tensors. Distributed: metric states are synced across GPUs automatically when compute() is called. Claude Code generates TorchMetrics training loops, MetricCollection validation reporters, and detection mAP evaluators.
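A hedged sketch of the Lightning pattern mentioned above, assuming lightning >= 2.0; LitClassifier, the tiny linear head, and the 32-feature input are illustrative stand-ins. clone(prefix=...) keeps separate metric state per stage, and logging the collection object lets Lightning run compute() and reset() at epoch boundaries.
# lightning_metrics_sketch.py — illustrative only; model and shapes are made up
import torch
import torch.nn as nn
import lightning.pytorch as pl
from torchmetrics import Accuracy, F1Score, AUROC, MetricCollection

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 1))
        base = MetricCollection({
            "acc": Accuracy(task="binary"),
            "f1": F1Score(task="binary"),
            "auroc": AUROC(task="binary"),
        })
        self.train_metrics = base.clone(prefix="train/")  # independent state per stage
        self.val_metrics = base.clone(prefix="val/")

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.net(x).squeeze(-1)
        loss = nn.functional.binary_cross_entropy_with_logits(logits, y.float())
        self.train_metrics.update(torch.sigmoid(logits), y)
        # Logging the collection itself defers compute()/reset() to Lightning
        self.log_dict(self.train_metrics, on_step=False, on_epoch=True)
        self.log("train/loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        preds = torch.sigmoid(self.net(x).squeeze(-1))
        self.val_metrics.update(preds, y)
        self.log_dict(self.val_metrics, on_epoch=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)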
CLAUDE.md for TorchMetrics
## TorchMetrics Stack
- Version: torchmetrics >= 1.4
- Pattern: metric.update(preds, target) → metric.compute() → metric.reset() (see the sketch after this list)
- Classification: task="binary"|"multiclass"|"multilabel", num_classes=N
- Average: "micro" | "macro" | "weighted" | "none" (per-class)
- Collection: MetricCollection([...]) — call update/compute/reset once
- Device: metrics must be on same device as predictions (metric.to(device))
- Functional: torchmetrics.functional.* — stateless single-batch ops
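A minimal quick-start sketch of the bullets above, using synthetic tensors and assuming torchmetrics >= 1.4: the stateful update → compute → reset loop on the right device, plus the stateless functional form for a single batch.
# quickstart_sketch.py — synthetic data, illustrative only
import torch
from torchmetrics import Accuracy, F1Score, AUROC, MetricCollection
from torchmetrics.functional import accuracy

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
metrics = MetricCollection({
    "acc": Accuracy(task="binary"),
    "f1": F1Score(task="binary"),
    "auroc": AUROC(task="binary"),
}).to(device)                                       # same device as the predictions

for _ in range(5):                                  # stand-in for a validation DataLoader
    preds = torch.rand(64, device=device)           # probabilities, not logits
    target = torch.randint(0, 2, (64,), device=device)
    metrics.update(preds, target)                   # accumulate per-batch state

epoch_scores = metrics.compute()                    # dict of scalar tensors
metrics.reset()                                     # clear state before the next epoch

batch_acc = accuracy(preds, target, task="binary")  # functional: stateless, one batch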
TorchMetrics Evaluation Pipeline
# ml/torchmetrics_pipeline.py — ML evaluation metrics with TorchMetrics
from __future__ import annotations
from typing import Any
import torch
import torch.nn as nn
from torchmetrics import (
Accuracy, Precision, Recall, F1Score, AUROC,
ConfusionMatrix, CohenKappa, MatthewsCorrCoef,
MeanSquaredError, MeanAbsoluteError, R2Score,
MeanAbsolutePercentageError,
MetricCollection,
)
from torchmetrics.classification import (
BinaryPrecisionRecallCurve, MulticlassROC,
BinaryCalibrationError, BinaryStatScores,
MultilabelAccuracy,
)
# ── 0. Classification metric suites ──────────────────────────────────────────
def binary_metric_suite(device: torch.device | None = None) -> MetricCollection:
"""
Complete binary classification metric collection.
Covers accuracy, F1, AUROC, precision, recall, and calibration.
"""
metrics = MetricCollection({
"accuracy": Accuracy(task="binary"),
"precision": Precision(task="binary"),
"recall": Recall(task="binary"),
"f1": F1Score(task="binary"),
"auroc": AUROC(task="binary"),
"kappa": CohenKappa(task="binary"),
"mcc": MatthewsCorrCoef(task="binary"),
"ece": BinaryCalibrationError(n_bins=15),
})
if device:
metrics = metrics.to(device)
return metrics
def multiclass_metric_suite(
num_classes: int,
    device: torch.device | None = None,
) -> MetricCollection:
"""
Multi-class classification metrics with macro and weighted averaging.
Pass integer class labels as targets; pass softmax probabilities as preds.
"""
metrics = MetricCollection({
"accuracy_macro": Accuracy(task="multiclass", num_classes=num_classes, average="macro"),
"accuracy_weighted": Accuracy(task="multiclass", num_classes=num_classes, average="weighted"),
"f1_macro": F1Score(task="multiclass", num_classes=num_classes, average="macro"),
"f1_weighted": F1Score(task="multiclass", num_classes=num_classes, average="weighted"),
"precision_macro": Precision(task="multiclass", num_classes=num_classes, average="macro"),
"recall_macro": Recall(task="multiclass", num_classes=num_classes, average="macro"),
"auroc_macro": AUROC(task="multiclass", num_classes=num_classes, average="macro"),
"kappa": CohenKappa(task="multiclass", num_classes=num_classes),
})
if device:
metrics = metrics.to(device)
return metrics
def per_class_metrics(
num_classes: int,
    device: torch.device | None = None,
) -> MetricCollection:
"""Per-class precision, recall, F1 (average='none')."""
metrics = MetricCollection({
"precision_per_class": Precision(task="multiclass", num_classes=num_classes, average="none"),
"recall_per_class": Recall(task="multiclass", num_classes=num_classes, average="none"),
"f1_per_class": F1Score(task="multiclass", num_classes=num_classes, average="none"),
})
if device:
metrics = metrics.to(device)
return metrics
# ── 1. Regression metric suite ────────────────────────────────────────────────
def regression_metric_suite(device: torch.device | None = None) -> MetricCollection:
"""
Standard regression metrics: MSE, RMSE, MAE, MAPE, R².
"""
metrics = MetricCollection({
"mse": MeanSquaredError(squared=True),
"rmse": MeanSquaredError(squared=False),
"mae": MeanAbsoluteError(),
"mape": MeanAbsolutePercentageError(),
"r2": R2Score(),
})
if device:
metrics = metrics.to(device)
return metrics
# ── 2. Training loop integration ──────────────────────────────────────────────
class MetricTracker:
"""
Wraps MetricCollection for use in a training/validation loop.
Handles update, compute, reset, and per-epoch logging in one place.
"""
def __init__(
self,
metrics: MetricCollection,
prefix: str = "val",
):
self.metrics = metrics
self.prefix = prefix
def update(self, preds: torch.Tensor, targets: torch.Tensor) -> None:
"""Accumulate batch predictions."""
self.metrics.update(preds, targets)
def compute(self) -> dict[str, float]:
"""Compute epoch-level metrics and reset state."""
results = self.metrics.compute()
self.metrics.reset()
# Flatten nested tensors to scalars
flat = {}
for k, v in results.items():
if v.ndim == 0:
flat[f"{self.prefix}/{k}"] = v.item()
else:
# per-class vector: log each class separately
for i, vi in enumerate(v.tolist()):
flat[f"{self.prefix}/{k}_class{i}"] = vi
return flat
def reset(self) -> None:
self.metrics.reset()
def run_validation_epoch(
model: nn.Module,
loader, # DataLoader
metrics: MetricTracker,
device: torch.device,
task: str = "binary", # "binary" | "multiclass"
) -> dict[str, float]:
"""
Standard validation loop with TorchMetrics accumulation.
Returns dict of metric_name → float for logging.
"""
model.eval()
with torch.no_grad():
for batch in loader:
x, y = batch
x, y = x.to(device), y.to(device)
logits = model(x)
if task == "binary":
preds = torch.sigmoid(logits).squeeze(-1)
else:
preds = torch.softmax(logits, dim=-1)
metrics.update(preds, y)
return metrics.compute()
# ── 3. Confusion matrix utilities ────────────────────────────────────────────
def compute_confusion_matrix(
preds: torch.Tensor,
targets: torch.Tensor,
num_classes: int,
normalize: str | None = None, # None | "true" | "pred" | "all"
) -> torch.Tensor:
"""
Compute confusion matrix.
normalize="true": row-normalized (recall per class)
normalize="pred": col-normalized (precision per class)
normalize="all": divide all cells by total count
"""
cm_metric = ConfusionMatrix(
task="multiclass",
num_classes=num_classes,
normalize=normalize,
)
cm_metric.update(preds, targets)
return cm_metric.compute()
def per_class_report(
preds: torch.Tensor,
targets: torch.Tensor,
num_classes: int,
    class_names: list[str] | None = None,
) -> list[dict]:
"""
Per-class precision, recall, F1 report as a list of dicts.
Analogous to sklearn's classification_report.
"""
metrics = per_class_metrics(num_classes)
metrics.update(preds, targets)
results = metrics.compute()
metrics.reset()
rows = []
for i in range(num_classes):
name = class_names[i] if class_names else f"class_{i}"
rows.append({
"class": name,
"precision": round(results["precision_per_class"][i].item(), 4),
"recall": round(results["recall_per_class"][i].item(), 4),
"f1": round(results["f1_per_class"][i].item(), 4),
})
return rows
# ── 4. Object detection metrics ───────────────────────────────────────────────
def compute_map(
preds: list[dict], # [{boxes:Tensor, scores:Tensor, labels:Tensor}, ...]
targets: list[dict], # [{boxes:Tensor, labels:Tensor}, ...]
    iou_thresholds: list[float] | None = None,
) -> dict[str, float]:
"""
Compute COCO-style mAP using TorchMetrics MeanAveragePrecision.
preds / targets follow the TorchMetrics dict format.
"""
from torchmetrics.detection import MeanAveragePrecision
metric = MeanAveragePrecision(
iou_thresholds=iou_thresholds, # None = COCO [0.5:0.05:0.95]
box_format="xyxy",
)
metric.update(preds, targets)
result = metric.compute()
return {k: round(float(v), 4) for k, v in result.items() if v.ndim == 0}
# ── 5. NLP metrics ────────────────────────────────────────────────────────────
def compute_bleu(
predictions: list[str],
references: list[list[str]],
n_gram: int = 4,
) -> float:
"""Compute corpus-level BLEU score."""
from torchmetrics.text import BLEUScore
metric = BLEUScore(n_gram=n_gram)
metric.update(predictions, references)
return round(float(metric.compute()), 4)
def compute_rouge(
predictions: list[str],
references: list[str],
) -> dict[str, float]:
"""Compute ROUGE-1, ROUGE-2, ROUGE-L F1 scores."""
from torchmetrics.text import ROUGEScore
metric = ROUGEScore()
metric.update(predictions, references)
result = metric.compute()
return {k: round(float(v), 4) for k, v in result.items()}
# ── Demo ──────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
print("TorchMetrics Evaluation Demo")
print("=" * 50)
torch.manual_seed(42)
n = 1000
# Binary classification
print("\n── Binary Classification ──")
probs = torch.sigmoid(torch.randn(n))
targets = torch.randint(0, 2, (n,))
suite = binary_metric_suite()
suite.update(probs, targets)
results = suite.compute()
suite.reset()
for name, val in results.items():
print(f" {name:<12} {float(val):.4f}")
# Multi-class
print("\n── Multi-class (5 classes) ──")
num_classes = 5
mc_probs = torch.softmax(torch.randn(n, num_classes), dim=1)
mc_targets = torch.randint(0, num_classes, (n,))
mc_suite = multiclass_metric_suite(num_classes)
mc_suite.update(mc_probs, mc_targets)
mc_results = mc_suite.compute()
mc_suite.reset()
for name, val in mc_results.items():
print(f" {name:<22} {float(val):.4f}")
# Regression
print("\n── Regression ──")
y_pred = torch.randn(n)
y_true = y_pred + 0.3 * torch.randn(n)
reg_suite = regression_metric_suite()
reg_suite.update(y_pred, y_true)
reg_results = reg_suite.compute()
reg_suite.reset()
for name, val in reg_results.items():
print(f" {name:<6} {float(val):.4f}")
# Per-class report
print("\n── Per-class Report (3 classes) ──")
report = per_class_report(
mc_probs[:200, :3].softmax(dim=1),
mc_targets[:200] % 3,
num_classes=3,
class_names=["cat", "dog", "bird"],
)
for row in report:
print(f" {row['class']:<6} P={row['precision']:.3f} R={row['recall']:.3f} F1={row['f1']:.3f}")
# BLEU
print("\n── BLEU Score ──")
preds = ["the quick brown fox", "hello world"]
refs = [["the quick brown fox jumps"], ["hello world today"]]
bleu = compute_bleu(preds, refs, n_gram=2)
print(f" BLEU-2: {bleu}")
Compared with the sklearn.metrics alternative: sklearn metrics are one-shot functions that need every prediction in memory at once, while TorchMetrics' stateful metric.update(batch_preds, batch_targets) accumulates intermediate statistics batch by batch, so you get correct epoch-level metrics without concatenating thousands of prediction arrays, and MetricCollection.update/compute/reset tracks ten metrics in one call with native GPU tensor support (a short comparison sketch closes this section).

Compared with manual Pandas/NumPy computation: hand-rolled precision/recall loops break on edge cases such as all-negative batches (division by zero), multi-GPU training (per-rank statistics must be reduced across processes), and low-precision or integer accumulators (overflow), while TorchMetrics handles all three internally; AUROC(task="multiclass", num_classes=N, average="macro") integrates the ROC curve correctly for imbalanced classes, and MatthewsCorrCoef gives a single balanced summary even under heavy class imbalance.

The Claude Skills 360 bundle includes TorchMetrics skill sets covering binary/multiclass/multilabel accuracy, precision/recall/F1/AUROC, MetricCollection, the MetricTracker training-loop wrapper, confusion matrix normalization, per-class reports, MeanAveragePrecision for detection, BLEU/ROUGE NLP metrics, regression MSE/RMSE/MAE/MAPE/R², and BinaryCalibrationError. Start with the free tier to try ML evaluation code generation.
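To make the sklearn comparison above concrete, a small sketch with synthetic tensors (scikit-learn assumed installed; shapes are illustrative): sklearn needs the full prediction array up front, while the TorchMetrics object streams batches and yields the same epoch-level number.
# accumulation_sketch.py — synthetic data, illustrative only
import torch
from sklearn.metrics import accuracy_score
from torchmetrics import Accuracy

torch.manual_seed(0)
batches = [(torch.rand(256), torch.randint(0, 2, (256,))) for _ in range(8)]

# sklearn: concatenate every prediction, then compute once
all_preds = torch.cat([p for p, _ in batches])
all_targets = torch.cat([t for _, t in batches])
sk_acc = accuracy_score(all_targets.numpy(), (all_preds > 0.5).int().numpy())

# TorchMetrics: stream batches through the stateful metric, never concatenating
tm_acc_metric = Accuracy(task="binary")             # thresholds probabilities at 0.5
for preds, target in batches:
    tm_acc_metric.update(preds, target)             # keeps running correct/total counts
tm_acc = tm_acc_metric.compute().item()

print(f"sklearn={sk_acc:.4f}  torchmetrics={tm_acc:.4f}")  # values should agree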