# CLAUDE.md for Soda

Soda validates data quality with SodaCL, a human-readable YAML check language: row counts, missing and duplicate values, freshness, schema checks, valid values, numeric distributions, custom SQL "failed rows" queries, and ML-based anomaly detection. Checks run via the CLI (`soda scan`) or the Python `Scan` API for pipeline integration, and `soda test-connection` verifies datasource connectivity. An optional `soda_cloud` block pushes results to Soda Cloud for history and alerting (e.g. Slack messages on check failure). Claude Code generates SodaCL check files, scan configurations, Python scan runners, and CI/CD integration scripts.
## Soda Stack
- Version: soda-core >= 3.3 + connector: soda-core-postgres / soda-core-bigquery / etc.
- Config: configuration.yml — datasource connections + optional soda_cloud block
- Checks: checks/*.yml — SodaCL YAML check definitions per table
- Scan CLI: soda scan -d datasource_name -c configuration.yml checks/table.yml (check files are positional arguments)
- Python: from soda.scan import Scan — programmatic execution for pipelines
- CI gate: sys.exit(1 if scan.has_check_fails() else 0)
- Cloud: soda_cloud block in configuration.yml → pushes results to Soda Cloud
## SodaCL Check Files
```yaml
# checks/orders.yml — data quality checks for the orders table
checks for orders:
  # ── Volume ──────────────────────────────────────────────────────────────
  - row_count > 0:
      name: Orders table not empty
  - row_count:
      name: Minimum row count
      fail: when < 100
      warn: when < 1000
  # ── Freshness ───────────────────────────────────────────────────────────
  - freshness(created_at) < 2h:
      name: Recent orders exist (< 2 hours old)
  - freshness(updated_at) < 4h:
      name: Orders updated recently
  # ── Completeness ────────────────────────────────────────────────────────
  - missing_count(order_id) = 0:
      name: No missing order IDs
  - missing_count(user_id) = 0:
      name: No missing user IDs
  - missing_percent(email) < 0.5%:
      name: Email mostly populated
  # ── Uniqueness ──────────────────────────────────────────────────────────
  - duplicate_count(order_id) = 0:
      name: Order IDs are unique
  # ── Validity ────────────────────────────────────────────────────────────
  - invalid_count(status) = 0:
      name: Valid order statuses only
      valid values: [pending, processing, completed, refunded, cancelled]
  - invalid_count(currency) = 0:
      valid values: [USD, EUR, GBP, JPY, CAD, AUD]
  # ── Numeric ranges ──────────────────────────────────────────────────────
  - min(amount_usd) >= 0:
      name: No negative order amounts
  - max(amount_usd) < 1000000:
      name: No unreasonably large orders
  - avg(amount_usd):
      name: Average order value in expected range
      warn: when not between 15 and 600
      fail: when not between 5 and 2000
  - percentile(amount_usd, 0.99) < 5000:
      name: 99th percentile amount reasonable
  # ── Schema ──────────────────────────────────────────────────────────────
  - schema:
      name: Required columns present
      fail:
        when required column missing: [order_id, user_id, amount_usd, status, created_at]
      warn:
        when forbidden column present: [password, ssn, credit_card]
        when wrong column type:
          amount_usd: double precision
  # ── Custom SQL ──────────────────────────────────────────────────────────
  - failed rows:
      name: No negative refunds
      fail query: |
        SELECT * FROM orders
        WHERE status = 'refunded'
          AND amount_usd > 0  -- refunded orders should have negative or zero amounts
  - failed rows:
      name: Completed orders have valid amounts
      warn query: |
        SELECT * FROM orders
        WHERE status = 'completed'
          AND amount_usd <= 0
  # ── Anomaly detection (Soda Cloud) ──────────────────────────────────────
  - anomaly score for row_count < default:
      name: Row count anomaly detection
  - anomaly score for avg(amount_usd) < default:
      name: Avg order value anomaly
```

```yaml
# checks/users.yml — user table checks
checks for users:
  - row_count > 0
  - duplicate_count(id) = 0:
      name: User IDs unique
  - duplicate_count(email) = 0:
      name: Emails unique
  - missing_count(email) = 0
  - invalid_count(email) = 0:
      valid format: email
  - invalid_count(plan) = 0:
      valid values: [free, pro, enterprise]
  - freshness(created_at) < 48h:
      name: User data fresh (new signups within 48h)
```
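A SodaCL file is plain YAML, so a quick pre-flight parse catches indentation mistakes before they surface as scan errors. A minimal sketch of such a linter, assuming PyYAML is installed (`lint_sodacl` and the inline sample are hypothetical, not part of Soda):

```python
# scripts/lint_checks.py — sanity-check SodaCL YAML before running a scan.
# Hypothetical helper, not part of Soda itself; assumes PyYAML is available.
import yaml

SAMPLE = """
checks for orders:
  - row_count > 0:
      name: Orders table not empty
  - duplicate_count(order_id) = 0
"""

def lint_sodacl(text: str) -> list[str]:
    """Return the dataset targets ('checks for <table>') found in a SodaCL doc."""
    doc = yaml.safe_load(text)
    if not isinstance(doc, dict):
        raise ValueError("SodaCL file must be a YAML mapping")
    targets = [key for key in doc if key.startswith("checks for ")]
    if not targets:
        raise ValueError("no 'checks for <dataset>:' section found")
    return targets

print(lint_sodacl(SAMPLE))  # → ['checks for orders']
```

Running this across `checks/*.yml` in a pre-commit hook rejects malformed files cheaply, before a scan ever touches the warehouse.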
## Configuration
```yaml
# configuration.yml — datasource connections
data_source production_postgres:
  type: postgres
  connection:
    host: ${POSTGRES_HOST}
    port: 5432
    database: ${POSTGRES_DB}
    schema: public
    username: ${POSTGRES_USER}
    password: ${POSTGRES_PASSWORD}

data_source dev_duckdb:
  type: duckdb
  path: data/warehouse.duckdb

data_source bigquery_warehouse:
  type: bigquery
  connection:
    project_id: ${GCP_PROJECT}
    dataset: analytics
    service_account_json_key: ${GOOGLE_APPLICATION_CREDENTIALS}

soda_cloud:
  host: cloud.soda.io
  api_key_id: ${SODA_API_KEY_ID}
  api_key_secret: ${SODA_API_KEY_SECRET}
```
## Python Scan Runner
```python
# scripts/run_data_quality.py — programmatic Soda scan for pipeline integration
import os
import sys

from soda.scan import Scan


def run_checks(
    datasource: str = "production_postgres",
    tables: list[str] | None = None,
    config: str = "configuration.yml",
) -> bool:
    """Run Soda checks and return True if all pass."""
    tables = tables or ["orders", "users", "events"]
    scan = Scan()
    scan.set_verbose(True)
    scan.set_data_source_name(datasource)
    scan.add_configuration_yaml_file(config)
    for table in tables:
        check_file = f"checks/{table}.yml"
        if os.path.exists(check_file):
            scan.add_sodacl_yaml_file(check_file)
        else:
            print(f"[Soda] No check file found for {table}, skipping.")
    scan.execute()
    scan.assert_no_error_logs()
    if scan.has_check_fails():
        print("\n[Soda] FAILED CHECKS:")
        print(scan.get_checks_fail_text())
        return False
    if scan.has_checks_warn_or_fail():
        print("\n[Soda] WARNINGS:")
        for check in scan.get_checks_warn_or_fail():
            print(f"  ⚠ {check.name}: {check.outcome}")
    print(f"\n[Soda] All checks passed for {tables}")
    return True


if __name__ == "__main__":
    datasource = os.environ.get("SODA_DATASOURCE", "production_postgres")
    tables = sys.argv[1:] if len(sys.argv) > 1 else None
    passed = run_checks(datasource=datasource, tables=tables)
    sys.exit(0 if passed else 1)
```
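The scan runner slots into CI as a blocking step, since its exit code reflects check outcomes. A hedged sketch of a GitHub Actions job (the workflow name, secret names, and the `soda-core-postgres` extra are assumptions to adapt):

```yaml
# .github/workflows/data-quality.yml — example CI gate (adapt names/secrets)
name: data-quality
on: [push]
jobs:
  soda-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install soda-core-postgres
      - name: Run Soda checks
        env:
          POSTGRES_HOST: ${{ secrets.POSTGRES_HOST }}
          POSTGRES_DB: ${{ secrets.POSTGRES_DB }}
          POSTGRES_USER: ${{ secrets.POSTGRES_USER }}
          POSTGRES_PASSWORD: ${{ secrets.POSTGRES_PASSWORD }}
        run: python scripts/run_data_quality.py orders users
```

A non-zero exit from the runner fails the job, so bad data blocks the merge or deploy that depends on it.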
## Alternatives
- Great Expectations: choose it when you need a Python-native framework with a rich assertion library, detailed HTML profiling reports, custom Expectation classes, and tight Spark/Pandas DataFrame integration. It is more powerful for complex statistical validation; SodaCL is a simpler, YAML-first syntax that analysts can write without Python knowledge.
- dbt tests: choose them when you are already in the dbt ecosystem and want data quality checks co-located with dbt model SQL. dbt generic tests (`not_null`, `unique`, `accepted_values`) run at model build time; Soda is better for continuous freshness monitoring and post-load validation outside the dbt pipeline.
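For comparison, a rough dbt equivalent of the uniqueness and accepted-values checks on `orders` would live in a model schema file (a sketch; the file path and model name are assumptions, and newer dbt versions spell the key `data_tests`):

```yaml
# models/schema.yml (dbt) — rough equivalent of the SodaCL uniqueness/validity checks
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - accepted_values:
              values: [pending, processing, completed, refunded, cancelled]
```

These run only when the model builds, which is the core trade-off against Soda's scheduled, post-load scans.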