Pandera validates pandas and Polars DataFrames with schemas. Install: pip install "pandera[pandas]". Schema: schema = pa.DataFrameSchema({"price": pa.Column(float, pa.Check.gt(0)), "quantity": pa.Column(int, pa.Check.ge(1))}). Validate: schema.validate(df); schema.validate(df, lazy=True) collects all errors before raising. DataFrameModel: class OrderSchema(pa.DataFrameModel): price: float = pa.Field(gt=0); quantity: int = pa.Field(ge=1). Validate model: OrderSchema.validate(df). Decorator: @pa.check_types def process(df: pa.typing.DataFrame[OrderSchema]) -> pd.DataFrame. Dtypes: pa.Column(float) checks the dtype but leaves coercion disabled by default; pa.Column(float, coerce=True) casts to float first. Nullable: pa.Column(str, nullable=True). Optional column: pa.Column(str, required=False) — column may be absent. Built-in checks: pa.Check.isin(["a","b","c"]), pa.Check.between(0,100), pa.Check.str_contains("@"), pa.Check.str_matches(r"^\d{5}$"), pa.Check.str_length(min_value=1,max_value=50). Custom check: pa.Check(lambda s: s.str.contains("@").all(), error="invalid email"). Element-wise: pa.Check(lambda x: x > 0, element_wise=True). Index: pa.Index(int, pa.Check.ge(0)). MultiIndex: pa.MultiIndex([pa.Index(str, name="region"), pa.Index(str, name="product")]). Partial validation: schema.validate(df, head=5) validates only the first 5 rows. lazy=True raises SchemaErrors carrying all failures. DataFrameModel inheritance: class SalesSchema(OrderSchema): revenue: float = pa.Field(ge=0). Polars: pa.DataFrameSchema(...).validate(polars_df). SeriesSchema: pa.SeriesSchema(float, pa.Check.between(0,1)).validate(series). OrderSchema.to_schema() — export the model's backing DataFrameSchema object. Claude Code generates Pandera schemas, ETL validation gates, and check_types-decorated pipeline functions.
# CLAUDE.md for Pandera
## Pandera Stack
- Version: pandera >= 0.19 | pip install "pandera[pandas]"
- Schema: pa.DataFrameSchema({"col": pa.Column(type, pa.Check.gt(0))})
- Model: class S(pa.DataFrameModel): col: float = pa.Field(gt=0, nullable=False)
- Validate: schema.validate(df) | schema.validate(df, lazy=True) for all errors
- Decorator: @pa.check_types on functions with pa.typing.DataFrame[Schema] hints
- Built-in: Check.isin | between | gt/ge/lt/le | str_contains | str_matches
- Custom: pa.Check(lambda s: ..., error="msg") | element_wise=True per value
## Pandera DataFrame Validation Pipeline
# app/schemas/dataframe_schemas.py — Pandera validation schemas
from __future__ import annotations
import re
from datetime import date, datetime
from typing import Optional
import pandas as pd
import pandera as pa
from pandera import DataFrameModel, Field
from pandera.typing import DataFrame, Index, Series
# ─────────────────────────────────────────────────────────────────────────────
# Custom check functions — reusable across schemas
# ─────────────────────────────────────────────────────────────────────────────
def check_valid_email(series: pd.Series) -> bool:
"""Vectorised email format check."""
pattern = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
return series.str.match(pattern, na=False).all()
def check_sku_format(series: pd.Series) -> bool:
"""SKU must be uppercase alphanumeric with hyphens: PROD-12345."""
return series.str.match(r"^[A-Z0-9]+-\d{4,}$", na=False).all()
def check_no_future_date(series: pd.Series) -> bool:
return (series <= pd.Timestamp.now(tz=None)).all()
# ─────────────────────────────────────────────────────────────────────────────
# 1. Functional API — pa.DataFrameSchema
# ─────────────────────────────────────────────────────────────────────────────
USER_SCHEMA = pa.DataFrameSchema(
columns={
"user_id": pa.Column(
int,
checks=[pa.Check.gt(0)],
nullable=False,
description="Positive integer user identifier",
),
"email": pa.Column(
str,
checks=[
pa.Check(check_valid_email, error="email must be valid"),
pa.Check.str_length(min_value=5, max_value=254),
],
nullable=False,
),
"first_name": pa.Column(
str,
checks=pa.Check.str_length(min_value=1, max_value=100),
nullable=False,
),
"last_name": pa.Column(
str,
checks=pa.Check.str_length(min_value=1, max_value=100),
nullable=False,
),
"role": pa.Column(
str,
checks=pa.Check.isin(["user", "moderator", "admin"]),
nullable=False,
),
"age": pa.Column(
int,
checks=pa.Check.between(0, 130),
nullable=True,
required=False, # column may be absent from the DataFrame
),
"created_at": pa.Column(
"datetime64[ns]",
checks=pa.Check(check_no_future_date, error="created_at cannot be in the future"),
nullable=False,
coerce=True, # convert strings/dates to datetime automatically
),
},
    index=pa.Index(int, pa.Check.ge(0)),  # unnamed, so a default RangeIndex passes
strict=False, # allow extra columns (use strict=True to forbid them)
coerce=False,
ordered=False,
name="UserSchema",
)
ORDER_SCHEMA = pa.DataFrameSchema(
columns={
"order_id": pa.Column(int, pa.Check.gt(0)),
"user_id": pa.Column(int, pa.Check.gt(0)),
"status": pa.Column(str, pa.Check.isin(["pending","paid","shipped","delivered","cancelled"])),
"total": pa.Column(float, [pa.Check.ge(0), pa.Check.le(1_000_000)]),
"line_count": pa.Column(int, pa.Check.ge(1)),
"created_at": pa.Column("datetime64[ns]", coerce=True),
},
coerce=True,
)
# ─────────────────────────────────────────────────────────────────────────────
# 2. Class-based API — DataFrameModel (recommended for larger schemas)
# ─────────────────────────────────────────────────────────────────────────────
class ProductSchema(DataFrameModel):
"""Schema for the products table — used in ETL pipeline validation."""
product_id: int = Field(gt=0, alias="id", check_name=True)
sku: str = Field(str_matches=r"^[A-Z0-9]+-\d{4,}$")
name: str = Field(str_length={"min_value": 1, "max_value": 200})
price: float = Field(gt=0, le=100_000)
stock: int = Field(ge=0)
is_active: bool
category: str = Field(isin=["Electronics","Clothing","Books","Home","Sports"])
    weight_kg: Optional[float] = Field(ge=0, nullable=True)  # pa.Field is keyword-only
class Config:
name = "ProductSchema"
strict = False # allow extra columns
coerce = True # auto-cast compatible types
@pa.check("sku", name="sku_format")
@classmethod
def validate_sku(cls, series: pd.Series) -> bool:
return check_sku_format(series)
@pa.dataframe_check
@classmethod
def price_vs_stock(cls, df: pd.DataFrame) -> pd.Series:
"""Business rule: out-of-stock products must not have zero price."""
return ~((df["stock"] == 0) & (df["price"] == 0))
class SalesLineSchema(DataFrameModel):
"""Line-item schema for order analytics — inherits common fields."""
line_id: int = Field(gt=0)
order_id: int = Field(gt=0)
product_id: int = Field(gt=0)
quantity: int = Field(ge=1, le=10_000)
unit_price: float = Field(gt=0)
discount: float = Field(ge=0, le=1) # 0.0–1.0 fraction
@pa.dataframe_check
@classmethod
def subtotal_positive(cls, df: pd.DataFrame) -> pd.Series:
subtotal = df["quantity"] * df["unit_price"] * (1 - df["discount"])
return subtotal > 0
# ─────────────────────────────────────────────────────────────────────────────
# 3. @pa.check_types — validate at function boundaries
# ─────────────────────────────────────────────────────────────────────────────
@pa.check_types
def compute_order_totals(
lines: DataFrame[SalesLineSchema],
) -> pd.DataFrame:
"""
Pandera validates `lines` against SalesLineSchema before the body runs.
Invalid DataFrames raise SchemaError before any computation.
"""
lines = lines.copy()
lines["subtotal"] = lines["quantity"] * lines["unit_price"] * (1 - lines["discount"])
return (
lines
.groupby("order_id", as_index=False)
.agg(
total=("subtotal", "sum"),
line_count=("line_id", "count"),
)
)
@pa.check_types
def filter_active_products(
df: DataFrame[ProductSchema],
min_stock: int = 0,
) -> pd.DataFrame:
"""Filters are safe to apply because the schema was already validated."""
    return df[df["is_active"] & (df["stock"] >= min_stock)]
# ─────────────────────────────────────────────────────────────────────────────
# 4. Lazy validation — collect all errors before raising
# ─────────────────────────────────────────────────────────────────────────────
def validate_raw_data(df: pd.DataFrame, schema: pa.DataFrameSchema) -> list[str]:
"""
Returns a list of error messages rather than raising on first failure.
Useful for reporting all data quality issues to the user at once.
"""
errors: list[str] = []
try:
schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as exc:
for failure in exc.failure_cases.itertuples():
errors.append(
f"Column '{failure.column}' check '{failure.check}' failed "
f"at index {failure.index}: value={failure.failure_case!r}"
)
return errors
# ─────────────────────────────────────────────────────────────────────────────
# 5. SeriesSchema — validate individual series
# ─────────────────────────────────────────────────────────────────────────────
PRICE_SERIES_SCHEMA = pa.SeriesSchema(
float,
checks=[
pa.Check.gt(0),
pa.Check.le(10_000),
],
nullable=False,
name="price",
)
def validate_price_column(prices: pd.Series) -> pd.Series:
return PRICE_SERIES_SCHEMA.validate(prices)
# ─────────────────────────────────────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
# Valid product data
products = pd.DataFrame({
"id": [1, 2, 3],
"sku": ["PROD-1001", "BOOK-2005", "ELEC-3010"],
"name": ["Widget", "Python Book", "Keyboard"],
"price": [9.99, 39.99, 89.99],
"stock": [100, 50, 25],
"is_active": [True, True, False],
"category": ["Home", "Books", "Electronics"],
"weight_kg": [0.2, 0.5, None],
})
try:
validated = ProductSchema.validate(products)
print("Products valid:", len(validated), "rows")
except pa.errors.SchemaError as e:
print("Validation failed:", e)
# Lazy validation with multiple errors
bad_df = pd.DataFrame({
"user_id": [1, -2, 3], # -2 fails gt(0)
"email": ["[email protected]", "not-an-email", "[email protected]"],
"first_name": ["Alice", "", "Carol"], # empty string fails min_length
"last_name": ["Smith", "Jones", "White"],
"role": ["user", "superuser", "admin"], # superuser not in isin
"created_at": ["2024-01-01", "2099-01-01", "2024-06-01"], # 2099 in future
})
errors = validate_raw_data(bad_df, USER_SCHEMA)
print(f"\nFound {len(errors)} validation errors:")
for err in errors[:3]:
print(" ", err)
For the great_expectations alternative: Great Expectations requires defining Expectation Suites, Data Sources, and Checkpoints, then running a validation pipeline. Pandera's DataFrameModel defines the entire schema as a Python class with pa.Field(gt=0, isin=[...]) annotations, and @pa.check_types enforces it at the function signature: no YAML or checkpoint files.

For the assert df.dtypes["price"] == float alternative: manual asserts report one failure at a time and don't check value ranges, business rules, or string patterns. schema.validate(df, lazy=True) collects every failing row and column into SchemaErrors.failure_cases, a DataFrame with columns column, check, failure_case, and index that can be logged, emailed, or stored for dashboards in one pass over the data.

The Claude Skills 360 bundle includes Pandera skill sets covering DataFrameSchema column definitions, DataFrameModel class-based schemas, Check built-ins (isin, between, str_matches, str_length), custom check functions, @pa.check_types decorator enforcement, lazy validation for bulk error collection, SeriesSchema for individual columns, @pa.dataframe_check for multi-column business rules, coerce for type casting, and strict mode for rejecting extra columns. Start with the free tier to try DataFrame validation code generation.