# CLAUDE.md for Soda

Soda validates data quality with SodaCL, a human-readable YAML check language: row counts, missing and duplicate values, freshness, schema checks, valid values, numeric distributions, custom SQL "failed rows" queries, and ML-based anomaly detection. Checks run via the CLI (`soda scan`) or the Python `Scan` API for pipeline integration, and `soda test-connection` verifies datasource connectivity. An optional `soda_cloud` block pushes results to Soda Cloud for history and alerting (e.g. Slack messages on check failure). Claude Code generates SodaCL check files, scan configurations, Python scan runners, and CI/CD integration scripts.
## Soda Stack
- Version: soda-core >= 3.3 + connector: soda-core-postgres / soda-core-bigquery / etc.
- Config: configuration.yml — datasource connections + optional soda_cloud block
- Checks: checks/*.yml — SodaCL YAML check definitions per table
- Scan CLI: soda scan -d datasource_name -c configuration.yml checks/table.yml (check files are positional arguments)
- Python: from soda.scan import Scan — programmatic execution for pipelines
- CI gate: sys.exit(1 if scan.has_check_fails() else 0)
- Cloud: soda_cloud block in configuration.yml → pushes results to Soda Cloud
## SodaCL Check Files
```yaml
# checks/orders.yml — data quality checks for the orders table
checks for orders:
  # ── Volume ──────────────────────────────────────────────────────────────
  - row_count > 0:
      name: Orders table not empty
  - row_count:
      name: Minimum row count
      fail: when < 100
      warn: when < 1000
  # ── Freshness ───────────────────────────────────────────────────────────
  - freshness(created_at) < 2h:
      name: Recent orders exist (< 2 hours old)
  - freshness(updated_at) < 4h:
      name: Orders updated recently
  # ── Completeness ────────────────────────────────────────────────────────
  - missing_count(order_id) = 0:
      name: No missing order IDs
  - missing_count(user_id) = 0:
      name: No missing user IDs
  - missing_percent(email) < 0.5%:
      name: Email mostly populated
  # ── Uniqueness ──────────────────────────────────────────────────────────
  - duplicate_count(order_id) = 0:
      name: Order IDs are unique
  # ── Validity ────────────────────────────────────────────────────────────
  - invalid_count(status) = 0:
      name: Valid order statuses only
      valid values: [pending, processing, completed, refunded, cancelled]
  - invalid_count(currency) = 0:
      valid values: [USD, EUR, GBP, JPY, CAD, AUD]
  # ── Numeric ranges ──────────────────────────────────────────────────────
  - min(amount_usd) >= 0:
      name: No negative order amounts
  - max(amount_usd) < 1000000:
      name: No unreasonably large orders
  - avg(amount_usd):
      name: Average order value in expected range
      warn: when not between 15 and 600
      fail: when not between 5 and 2000
  - percentile(amount_usd, 0.99) < 5000:
      name: 99th percentile amount reasonable
  # ── Schema ──────────────────────────────────────────────────────────────
  - schema:
      name: Required columns present
      fail:
        when required column missing: [order_id, user_id, amount_usd, status, created_at]
      warn:
        when forbidden column present: [password, ssn, credit_card]
        when wrong column type:
          amount_usd: double precision
  # ── Custom SQL ──────────────────────────────────────────────────────────
  - failed rows:
      name: No negative refunds
      fail query: |
        SELECT * FROM orders
        WHERE status = 'refunded'
          AND amount_usd > 0  -- refunded orders should have negative or zero amounts
  - failed rows:
      name: Completed orders have valid amounts
      warn query: |
        SELECT * FROM orders
        WHERE status = 'completed'
          AND amount_usd <= 0
  # ── Anomaly detection (Soda Cloud) ──────────────────────────────────────
  - anomaly score for row_count < default:
      name: Row count anomaly detection
  - anomaly score for avg(amount_usd) < default:
      name: Avg order value anomaly
```

```yaml
# checks/users.yml — user table checks
checks for users:
  - row_count > 0
  - duplicate_count(id) = 0:
      name: User IDs unique
  - duplicate_count(email) = 0:
      name: Emails unique
  - missing_count(email) = 0
  - invalid_count(email) = 0:
      valid format: email
  - invalid_count(plan) = 0:
      valid values: [free, pro, enterprise]
  - freshness(created_at) < 48h:
      name: User data fresh (new signups within 48h)
```
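A SodaCL file is plain YAML, so a quick pre-flight parse catches indentation mistakes before they surface as scan errors. A minimal sketch of such a linter, assuming PyYAML is installed (`lint_sodacl` and the inline sample are hypothetical, not part of Soda):

```python
# scripts/lint_checks.py — sanity-check SodaCL YAML before running a scan.
# Hypothetical helper, not part of Soda itself; assumes PyYAML is available.
import yaml

SAMPLE = """
checks for orders:
  - row_count > 0:
      name: Orders table not empty
  - duplicate_count(order_id) = 0
"""

def lint_sodacl(text: str) -> list[str]:
    """Return the dataset targets ('checks for <table>') found in a SodaCL doc."""
    doc = yaml.safe_load(text)
    if not isinstance(doc, dict):
        raise ValueError("SodaCL file must be a YAML mapping")
    targets = [key for key in doc if key.startswith("checks for ")]
    if not targets:
        raise ValueError("no 'checks for <dataset>:' section found")
    return targets

print(lint_sodacl(SAMPLE))  # → ['checks for orders']
```

Running this across `checks/*.yml` in a pre-commit hook rejects malformed files cheaply, before a scan ever touches the warehouse.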
## Configuration
```yaml
# configuration.yml — datasource connections
data_source production_postgres:
  type: postgres
  connection:
    host: ${POSTGRES_HOST}
    port: 5432
    database: ${POSTGRES_DB}
    schema: public
    username: ${POSTGRES_USER}
    password: ${POSTGRES_PASSWORD}

data_source dev_duckdb:
  type: duckdb
  path: data/warehouse.duckdb

data_source bigquery_warehouse:
  type: bigquery
  connection:
    project_id: ${GCP_PROJECT}
    dataset: analytics
    service_account_json_key: ${GOOGLE_APPLICATION_CREDENTIALS}

soda_cloud:
  host: cloud.soda.io
  api_key_id: ${SODA_API_KEY_ID}
  api_key_secret: ${SODA_API_KEY_SECRET}
```
## Python Scan Runner
```python
# scripts/run_data_quality.py — programmatic Soda scan for pipeline integration
import os
import sys

from soda.scan import Scan


def run_checks(
    datasource: str = "production_postgres",
    tables: list[str] | None = None,
    config: str = "configuration.yml",
) -> bool:
    """Run Soda checks and return True if all pass."""
    tables = tables or ["orders", "users", "events"]
    scan = Scan()
    scan.set_verbose(True)
    scan.set_data_source_name(datasource)
    scan.add_configuration_yaml_file(config)
    for table in tables:
        check_file = f"checks/{table}.yml"
        if os.path.exists(check_file):
            scan.add_sodacl_yaml_file(check_file)
        else:
            print(f"[Soda] No check file found for {table}, skipping.")
    scan.execute()
    scan.assert_no_error_logs()
    if scan.has_check_fails():
        print("\n[Soda] FAILED CHECKS:")
        print(scan.get_checks_fail_text())
        return False
    if scan.has_checks_warn_or_fail():
        print("\n[Soda] WARNINGS:")
        for check in scan.get_checks_warn_or_fail():
            print(f"  ⚠ {check.name}: {check.outcome}")
    print(f"\n[Soda] All checks passed for {tables}")
    return True


if __name__ == "__main__":
    datasource = os.environ.get("SODA_DATASOURCE", "production_postgres")
    tables = sys.argv[1:] if len(sys.argv) > 1 else None
    passed = run_checks(datasource=datasource, tables=tables)
    sys.exit(0 if passed else 1)
```
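The scan runner slots into CI as a blocking step, since its exit code reflects check outcomes. A hedged sketch of a GitHub Actions job (the workflow name, secret names, and the `soda-core-postgres` extra are assumptions to adapt):

```yaml
# .github/workflows/data-quality.yml — example CI gate (adapt names/secrets)
name: data-quality
on: [push]
jobs:
  soda-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install soda-core-postgres
      - name: Run Soda checks
        env:
          POSTGRES_HOST: ${{ secrets.POSTGRES_HOST }}
          POSTGRES_DB: ${{ secrets.POSTGRES_DB }}
          POSTGRES_USER: ${{ secrets.POSTGRES_USER }}
          POSTGRES_PASSWORD: ${{ secrets.POSTGRES_PASSWORD }}
        run: python scripts/run_data_quality.py orders users
```

A non-zero exit from the runner fails the job, so bad data blocks the merge or deploy that depends on it.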
## Alternatives
- Great Expectations: choose it when you need a Python-native framework with a rich assertion library, detailed HTML profiling reports, custom Expectation classes, and tight Spark/Pandas DataFrame integration. It is more powerful for complex statistical validation; SodaCL is a simpler, YAML-first syntax that analysts can write without Python knowledge.
- dbt tests: choose them when you are already in the dbt ecosystem and want data quality checks co-located with dbt model SQL. dbt generic tests (`not_null`, `unique`, `accepted_values`) run at model build time; Soda is better for continuous freshness monitoring and post-load validation outside the dbt pipeline.
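For comparison, a rough dbt equivalent of the uniqueness and accepted-values checks on `orders` would live in a model schema file (a sketch; the file path and model name are assumptions, and newer dbt versions spell the key `data_tests`):

```yaml
# models/schema.yml (dbt) — rough equivalent of the SodaCL uniqueness/validity checks
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - accepted_values:
              values: [pending, processing, completed, refunded, cancelled]
```

These run only when the model builds, which is the core trade-off against Soda's scheduled, post-load scans.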