Blog / AI / Claude Code for xml.sax: Python SAX XML Parser

Claude Code for xml.sax: Python SAX XML Parser

Published: January 4, 2029

•

Read time: 5 min read

•

By: Claude Skills 360

Python’s xml.sax module provides a streaming event-driven XML parser — ideal for large files where loading the entire tree into memory is impractical. import xml.sax. Define handler: subclass xml.sax.handler.ContentHandler and override startElement(name, attrs), endElement(name), characters(content), startDocument(), endDocument(). Parse: xml.sax.parse("data.xml", MyHandler()) or xml.sax.parseString(b"<r/>", MyHandler()). Attributes: attrs.getValue("id"), attrs.getNames() → list[str], attrs.getLength(). Namespace mode: parser = xml.sax.make_parser(); parser.setFeature(xml.sax.handler.feature_namespaces, True) — handler gets startElementNS((ns, local), qname, attrs). Error handler: parser.setErrorHandler(xml.sax.handler.ErrorHandler()). Incremental feed: parser.setContentHandler(h); parser.feed(chunk_bytes); parser.close(). Note: xml.sax is vulnerable to billion-laughs and quadratic attacks on untrusted XML — use defusedxml.sax (PyPI) for external data. Claude Code generates streaming large-file parsers, element collectors, path-based extractors, namespace-aware processors, and SAX-to-dict converters.

CLAUDE.md for xml.sax

## xml.sax Stack
- Stdlib: import xml.sax, xml.sax.handler
- Handler: class MyHandler(xml.sax.handler.ContentHandler):
-              def startElement(self, name, attrs): ...
-              def endElement(self, name): ...
-              def characters(self, content): ...
- Parse:  xml.sax.parse("file.xml", MyHandler())
-         xml.sax.parseString(xml_bytes, MyHandler())
- NS:     parser.setFeature(xml.sax.handler.feature_namespaces, True)
- Note:   use defusedxml.sax for untrusted XML input

xml.sax Streaming Parser Pipeline

# app/saxutil.py — element collector, path extractor, counter, dict builder, streamer
from __future__ import annotations

import io
import xml.sax
import xml.sax.handler
import xml.sax.xmlreader
from dataclasses import dataclass, field
from typing import Any, Callable


# ─────────────────────────────────────────────────────────────────────────────
# 1. Base handler helpers
# ─────────────────────────────────────────────────────────────────────────────

def attrs_to_dict(attrs: xml.sax.xmlreader.AttributesImpl) -> dict[str, str]:
    """
    Convert a SAX Attributes object to a plain dict.

    Example:
        def startElement(self, name, attrs):
            d = attrs_to_dict(attrs)
            print(d)   # {"id": "42", "lang": "en"}
    """
    return {attrs.getQNameByName(n): attrs.getValueByQName(attrs.getQNameByName(n))
            for n in attrs.getNames()} if attrs.getLength() else {}


def _safe_attrs_dict(attrs: Any) -> dict[str, str]:
    """Convert SAX Attributes to dict regardless of implementation."""
    try:
        return {name: attrs.getValue(name) for name in attrs.getNames()}
    except Exception:
        return {}


# ─────────────────────────────────────────────────────────────────────────────
# 2. Element collector — collect all instances of one tag
# ─────────────────────────────────────────────────────────────────────────────

@dataclass
class SAXRecord:
    """One collected element record."""
    tag:   str
    attrs: dict[str, str]
    text:  str


class ElementCollector(xml.sax.handler.ContentHandler):
    """
    SAX handler that collects all elements with a given tag name.
    Efficiently handles large XML files without loading the whole tree.

    Example:
        collector = ElementCollector("book")
        xml.sax.parse("library.xml", collector)
        for record in collector.records:
            print(record.attrs.get("id"), record.text.strip())
    """

    def __init__(self, target_tag: str) -> None:
        super().__init__()
        self.target_tag = target_tag
        self.records: list[SAXRecord] = []
        self._inside = False
        self._attrs: dict[str, str] = {}
        self._buf: list[str] = []

    def startElement(self, name: str, attrs: Any) -> None:
        if name == self.target_tag:
            self._inside = True
            self._attrs = _safe_attrs_dict(attrs)
            self._buf = []

    def endElement(self, name: str) -> None:
        if name == self.target_tag and self._inside:
            self.records.append(SAXRecord(
                tag=name,
                attrs=self._attrs,
                text="".join(self._buf),
            ))
            self._inside = False

    def characters(self, content: str) -> None:
        if self._inside:
            self._buf.append(content)


def collect_elements(source: "str | bytes | io.IOBase", tag: str) -> list[SAXRecord]:
    """
    Parse source and return all SAXRecords for the given tag.

    Example:
        records = collect_elements("library.xml", "book")
        records = collect_elements(xml_bytes, "item")
    """
    handler = ElementCollector(tag)
    if isinstance(source, (str, bytes)):
        b = source.encode() if isinstance(source, str) else source
        xml.sax.parseString(b, handler)
    else:
        xml.sax.parse(source, handler)
    return handler.records


# ─────────────────────────────────────────────────────────────────────────────
# 3. Path extractor — extract text at a specific element path
# ─────────────────────────────────────────────────────────────────────────────

class PathExtractor(xml.sax.handler.ContentHandler):
    """
    SAX handler that collects text content at a specific element path
    (e.g. "catalog/book/title").

    Example:
        extractor = PathExtractor("catalog/book/title")
        xml.sax.parseString(xml_bytes, extractor)
        print(extractor.values)   # all title texts
    """

    def __init__(self, path: str) -> None:
        super().__init__()
        self._path_parts = path.split("/")
        self._stack: list[str] = []
        self._collecting = False
        self._buf: list[str] = []
        self.values: list[str] = []

    def startElement(self, name: str, attrs: Any) -> None:
        self._stack.append(name)
        if self._stack == self._path_parts:
            self._collecting = True
            self._buf = []

    def endElement(self, name: str) -> None:
        if self._collecting and self._stack == self._path_parts:
            self.values.append("".join(self._buf).strip())
            self._collecting = False
        if self._stack:
            self._stack.pop()

    def characters(self, content: str) -> None:
        if self._collecting:
            self._buf.append(content)


def extract_path(source: "str | bytes", path: str) -> list[str]:
    """
    Parse XML and return all text values at the given element path.

    Example:
        titles = extract_path(xml_bytes, "catalog/book/title")
    """
    handler = PathExtractor(path)
    b = source.encode() if isinstance(source, str) else source
    xml.sax.parseString(b, handler)
    return handler.values


# ─────────────────────────────────────────────────────────────────────────────
# 4. Tag counter
# ─────────────────────────────────────────────────────────────────────────────

class TagCounter(xml.sax.handler.ContentHandler):
    """
    SAX handler that counts how many times each tag appears.

    Example:
        counter = TagCounter()
        xml.sax.parseString(xml_bytes, counter)
        print(counter.counts)   # {"catalog": 1, "book": 5, "title": 5, ...}
    """

    def __init__(self) -> None:
        super().__init__()
        self.counts: dict[str, int] = {}

    def startElement(self, name: str, attrs: Any) -> None:
        self.counts[name] = self.counts.get(name, 0) + 1


# ─────────────────────────────────────────────────────────────────────────────
# 5. SAX-to-dict builder
# ─────────────────────────────────────────────────────────────────────────────

class DictBuilder(xml.sax.handler.ContentHandler):
    """
    SAX handler that builds a nested dict/list structure from XML.
    Lists are created when the same tag appears multiple times.

    Example:
        builder = DictBuilder()
        xml.sax.parseString(xml_bytes, builder)
        print(builder.result)
    """

    def __init__(self) -> None:
        super().__init__()
        self._stack: list[dict[str, Any]] = []
        self._keys: list[str] = []
        self._text_buf: list[str] = []
        self.result: "dict[str, Any] | None" = None

    def startElement(self, name: str, attrs: Any) -> None:
        node: dict[str, Any] = {}
        a = _safe_attrs_dict(attrs)
        if a:
            node["@attrs"] = a
        self._stack.append(node)
        self._keys.append(name)
        self._text_buf = []

    def endElement(self, name: str) -> None:
        node = self._stack.pop()
        text = "".join(self._text_buf).strip()
        if text and len(node) == 0:
            node = text  # type: ignore[assignment]   # pure text node → string
        elif text:
            node["#text"] = text

        if self._stack:
            parent = self._stack[-1]
            if name in parent:
                if not isinstance(parent[name], list):
                    parent[name] = [parent[name]]
                parent[name].append(node)
            else:
                parent[name] = node
        else:
            self.result = {name: node}
        self._text_buf = []

    def characters(self, content: str) -> None:
        self._text_buf.append(content)


def xml_to_dict(source: "str | bytes") -> "dict[str, Any] | None":
    """
    Parse XML and return as a nested dict/list structure.

    Example:
        d = xml_to_dict(xml_bytes)
        print(d["catalog"]["book"][0]["title"])
    """
    builder = DictBuilder()
    b = source.encode() if isinstance(source, str) else source
    xml.sax.parseString(b, builder)
    return builder.result


# ─────────────────────────────────────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────────────────────────────────────

if __name__ == "__main__":
    import json

    print("=== xml.sax demo ===")

    xml_src = b"""<?xml version="1.0"?>
<catalog>
  <book id="1" lang="en">
    <title>Python Cookbook</title>
    <author>Beazley</author>
    <year>2013</year>
  </book>
  <book id="2" lang="en">
    <title>Fluent Python</title>
    <author>Ramalho</author>
    <year>2022</year>
  </book>
  <book id="3" lang="de">
    <title>Python lernen</title>
    <author>Ziadé</author>
    <year>2021</year>
  </book>
</catalog>"""

    # ── collect_elements ──────────────────────────────────────────────────────
    print("\n--- collect_elements('book') ---")
    books = collect_elements(xml_src, "book")
    print(f"  found {len(books)} books")
    for b in books:
        print(f"  id={b.attrs.get('id')}  lang={b.attrs.get('lang')}")

    # ── extract_path ──────────────────────────────────────────────────────────
    print("\n--- extract_path ---")
    titles = extract_path(xml_src, "catalog/book/title")
    authors = extract_path(xml_src, "catalog/book/author")
    for t, a in zip(titles, authors):
        print(f"  {t!r:30s}  by {a!r}")

    # ── TagCounter ────────────────────────────────────────────────────────────
    print("\n--- TagCounter ---")
    counter = TagCounter()
    xml.sax.parseString(xml_src, counter)
    for tag, count in sorted(counter.counts.items()):
        print(f"  {tag:15s}: {count}")

    # ── xml_to_dict ───────────────────────────────────────────────────────────
    print("\n--- xml_to_dict ---")
    d = xml_to_dict(xml_src)
    print(json.dumps(d, indent=2)[:400])

    print("\n=== done ===")

For the xml.etree.ElementTree alternative — ET.parse()/ET.iterparse(source) provides both tree-mode and event-mode iteration; iterparse yields (event, element) tuples and is suitable for streaming large files without the boilerplate of writing a ContentHandler subclass — prefer ET.iterparse() over xml.sax when you want streaming without subclassing; use xml.sax when you need full SAX/LotusXML compliance or are integrating with systems that expect ContentHandler callbacks. For the lxml.etree.iterparse alternative — lxml.etree.iterparse(source, events=("start","end")) is 2–5× faster than stdlib SAX on large files and supports XPath and schema validation in the same pass — use lxml.iterparse for performance-critical large-XML pipelines; use xml.sax for zero-dependency stdlib-only streaming. The Claude Skills 360 bundle includes xml.sax skill sets covering attrs_to_dict() / _safe_attrs_dict() helpers, SAXRecord dataclass + ElementCollector / collect_elements(), PathExtractor / extract_path(), TagCounter, and DictBuilder / xml_to_dict() SAX-to-nested-dict converter. Start with the free tier to try streaming XML patterns and xml.sax pipeline code generation.

Keep Reading

Claude Code for email.contentmanager: Python Email Content Accessors

Read and write EmailMessage body content with Python's email.contentmanager module and Claude Code — email contentmanager ContentManager for the class that maps content types to get and set handler functions allowing EmailMessage to support get_content and set_content with type-specific behaviour, email contentmanager raw_data_manager for the ContentManager instance that handles raw bytes and str payloads without any conversion, email contentmanager content_manager for the standard ContentManager instance used by email.policy.default that intelligently handles text plain text html multipart and binary content types, email contentmanager get_content_text for the handler that returns the decoded text payload of a text-star message part as a str, email contentmanager get_content_binary for the handler that returns the raw decoded bytes payload of a non-text message part, email contentmanager get_data_manager for the get-handler lookup used by EmailMessage get_content to find the right reader function for the content type, email contentmanager set_content text for the handler that creates and sets a text part correctly choosing charset and transfer encoding, email contentmanager set_content bytes for the handler that creates and sets a binary part with base64 encoding and optional filename Content-Disposition, email contentmanager EmailMessage get_content for the method that reads the message body using the registered content manager handlers, email contentmanager EmailMessage set_content for the method that sets the message body and MIME headers in one call, email contentmanager EmailMessage make_alternative make_mixed make_related for the methods that convert a simple message into a multipart container, email contentmanager EmailMessage add_attachment for the method that attaches a file or bytes to a multipart message, and email contentmanager integration with email.message and email.policy and email.mime and io for building high-level email readers attachment extractors text body accessors HTML readers and policy-aware MIME construction pipelines.

5 min read Feb 12, 2029

Claude Code for email.charset: Python Email Charset Encoding

Control header and body encoding for international email with Python's email.charset module and Claude Code — email charset Charset for the class that wraps a character set name with the encoding rules for header encoding and body encoding describing how to encode text for that charset in email messages, email charset Charset header_encoding for the attribute specifying whether headers using this charset should use QP quoted-printable encoding BASE64 encoding or no encoding, email charset Charset body_encoding for the attribute specifying the Content-Transfer-Encoding to use for message bodies in this charset such as QP or BASE64, email charset Charset output_codec for the attribute giving the Python codec name used to encode the string to bytes for the wire format, email charset Charset input_codec for the attribute giving the Python codec name used to decode incoming bytes to str, email charset Charset get_output_charset for returning the output charset name, email charset Charset header_encode for encoding a header string using the charset's header_encoding method, email charset Charset body_encode for encoding body content using the charset's body_encoding, email charset Charset convert for converting a string from the input_codec to the output_codec, email charset add_charset for registering a new charset with custom encoding rules in the global charset registry, email charset add_alias for adding an alias name that maps to an existing registered charset, email charset add_codec for registering a codec name mapping for use by the charset machinery, and email charset integration with email.message and email.mime and email.policy and email.encoders for building international email senders non-ASCII header encoders Content-Transfer-Encoding selectors charset-aware message constructors and MIME encoding pipelines.

5 min read Feb 11, 2029

Claude Code for email.utils: Python Email Address and Header Utilities

Parse and format RFC 2822 email addresses and dates with Python's email.utils module and Claude Code — email utils parseaddr for splitting a display-name plus angle-bracket address string into a realname and email address tuple, email utils formataddr for combining a realname and address string into a properly quoted RFC 2822 address with angle brackets, email utils getaddresses for parsing a list of raw address header strings each potentially containing multiple comma-separated addresses into a list of realname address tuples, email utils parsedate for parsing an RFC 2822 date string into a nine-tuple compatible with time.mktime, email utils parsedate_tz for parsing an RFC 2822 date string into a ten-tuple that includes the UTC offset timezone in seconds, email utils parsedate_to_datetime for parsing an RFC 2822 date string into an aware datetime object with timezone, email utils formatdate for formatting a POSIX timestamp or the current time as an RFC 2822 date string with optional usegmt and localtime flags, email utils format_datetime for formatting a datetime object as an RFC 2822 date string, email utils make_msgid for generating a globally unique Message-ID string with optional idstring and domain components, email utils decode_rfc2231 for decoding an RFC 2231 encoded parameter value into a tuple of charset language and value, email utils encode_rfc2231 for encoding a string as an RFC 2231 encoded parameter value, email utils collapse_rfc2231_value for collapsing a decoded RFC 2231 tuple to a Unicode string, and email utils integration with email.message and email.headerregistry and datetime and time for building address parsers date formatters message-id generators header extractors and RFC-compliant email construction utilities.

5 min read Feb 10, 2029

Put these ideas into practice

Claude Skills 360 gives you production-ready skills for everything in this article — and 2,350+ more. Start free or go all-in.

Get 360 skills free

Free $39