Python’s xml.parsers.expat module provides direct bindings to the Expat C library — the fastest XML parser in the stdlib. import xml.parsers.expat as expat. Create a parser: p = expat.ParserCreate(encoding="utf-8", namespace_separator="|"). Register handlers: p.StartElementHandler = fn(name, attrs), p.EndElementHandler = fn(name), p.CharacterDataHandler = fn(data), p.ProcessingInstructionHandler = fn(target, data), p.CommentHandler = fn(data), p.StartNamespaceDeclHandler = fn(prefix, uri). Feed data: p.Parse(chunk, False) — the isfinal flag is positional-only, so pass it positionally; p.ParseFile(fp). Position: p.CurrentLineNumber, p.CurrentColumnNumber, p.CurrentByteIndex. Errors: expat.ExpatError (subclass of Exception) has .lineno, .offset, .code; use expat.ErrorString(code) to get the message. Security: always set p.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_NEVER) to prevent XXE attacks; never parse untrusted XML with default settings. Namespace mode: ParserCreate(namespace_separator="|") makes tag names "uri|localname"; set separator to something unlikely to appear in URIs. Expat is approximately 2–5× faster than xml.sax and 3–10× faster than xml.dom.minidom for streaming workloads. Claude Code generates ultra-fast streaming XML processors, element counters, tag-frequency analyzers, namespace extractors, and large-file XML scanners.
CLAUDE.md for xml.parsers.expat
## xml.parsers.expat Stack
- Stdlib: import xml.parsers.expat as expat
- Create: p = expat.ParserCreate("utf-8")
- p.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_NEVER) # XXE guard
- Handlers:
- p.StartElementHandler = fn(name, attrs) # attrs is dict
- p.EndElementHandler = fn(name)
- p.CharacterDataHandler = fn(data)
- p.ProcessingInstructionHandler = fn(target, data)
- p.CommentHandler = fn(data)
- p.StartNamespaceDeclHandler = fn(prefix, uri)
- Feed: p.Parse(chunk, False) / p.Parse(b"", True) # is_final
- p.ParseFile(fp)
- Pos: p.CurrentLineNumber / p.CurrentColumnNumber
- Error: expat.ExpatError .lineno .offset .code
- expat.ErrorString(code)
xml.parsers.expat Streaming Parser Pipeline
# app/xmlexpatutil.py — count, collect, namespaces, error, stream-large
from __future__ import annotations
import io
import xml.parsers.expat as _expat
from dataclasses import dataclass, field
from typing import Any, Callable
# ─────────────────────────────────────────────────────────────────────────────
# 1. Tag-frequency counter (fastest path — no text buffering)
# ─────────────────────────────────────────────────────────────────────────────
def count_elements(xml_source: "bytes | str",
tag: str | None = None) -> dict[str, int]:
"""
Count element occurrences using Expat (no DOM construction).
If tag is given, count only that tag; otherwise count all.
Example:
counts = count_elements(xml_bytes)
counts = count_elements(xml_bytes, "item")
"""
counts: dict[str, int] = {}
if isinstance(xml_source, str):
xml_source = xml_source.encode("utf-8")
p = _expat.ParserCreate("utf-8")
p.SetParamEntityParsing(_expat.XML_PARAM_ENTITY_PARSING_NEVER)
def start(name: str, attrs: dict) -> None:
if tag is None or name == tag:
counts[name] = counts.get(name, 0) + 1
p.StartElementHandler = start
p.Parse(xml_source, True)
return counts
# ─────────────────────────────────────────────────────────────────────────────
# 2. Element text extractor
# ─────────────────────────────────────────────────────────────────────────────
@dataclass
class ExtractedElement:
    """One matched element: tag name, attributes, and accumulated text."""
    tag: str                # element name as reported by Expat
    attrs: dict[str, str]   # attributes of the matched element
    text: str               # concatenated character data, stripped

def extract_elements(xml_source: "bytes | str",
                     target_tag: str,
                     max_items: int = 1000) -> list[ExtractedElement]:
    """
    Extract up to *max_items* elements with the given tag, including their
    attributes and the character data of the element and its descendants.

    Bug fixes over the previous version:
      * text inside nested child elements was silently dropped — each closing
        child now folds its text back into its parent's buffer;
      * a target tag nested inside an active match was appended to the
        results past max_items (and with empty attrs); the limit is now
        enforced in the end handler as well.

    Example:
        items = extract_elements(rss_bytes, "item", max_items=20)
        for item in items:
            print(item.attrs, item.text[:60])
    """
    if isinstance(xml_source, str):
        xml_source = xml_source.encode("utf-8")
    results: list[ExtractedElement] = []
    # Stack of (tag, attrs, text_parts) for every open element at or below a
    # matched target.  A non-empty stack means "currently inside a match",
    # which is why the separate _active flag of the old version was redundant.
    stack: list[tuple[str, dict[str, str], list[str]]] = []
    p = _expat.ParserCreate("utf-8")
    p.SetParamEntityParsing(_expat.XML_PARAM_ENTITY_PARSING_NEVER)  # XXE guard

    def start(name: str, attrs: dict) -> None:
        if stack or (name == target_tag and len(results) < max_items):
            # Only target elements need their real attributes; descendants
            # merely accumulate text.
            stack.append((name, dict(attrs) if name == target_tag else {}, []))

    def characters(data: str) -> None:
        if stack:
            stack[-1][2].append(data)

    def end(name: str) -> None:
        if not stack:
            return
        tag_n, tag_attrs, text_parts = stack.pop()
        if tag_n == target_tag and len(results) < max_items:
            results.append(ExtractedElement(
                tag=tag_n,
                attrs=tag_attrs,
                text="".join(text_parts).strip(),
            ))
        if stack:
            # Fold this element's text into its parent so descendant text is
            # not lost (bug fix).
            stack[-1][2].extend(text_parts)

    p.StartElementHandler = start
    p.CharacterDataHandler = characters
    p.EndElementHandler = end
    p.Parse(xml_source, True)
    return results
# ─────────────────────────────────────────────────────────────────────────────
# 3. Namespace extractor
# ─────────────────────────────────────────────────────────────────────────────
def extract_namespaces(xml_source: "bytes | str") -> dict[str, str]:
    """
    Collect every namespace prefix→URI declaration in an XML document.

    The default (unprefixed) namespace is reported under the key
    "(default)".  Parsing is best-effort: declarations seen before a
    well-formedness error are still returned.

    Example:
        ns = extract_namespaces(xml_bytes)
        print(ns)  # {"xsi": "http://www.w3.org/2001/XMLSchema-instance", ...}
    """
    data = xml_source.encode("utf-8") if isinstance(xml_source, str) else xml_source
    declared: dict[str, str] = {}
    # Namespace-aware mode requires a separator argument.
    parser = _expat.ParserCreate("utf-8", "|")
    parser.SetParamEntityParsing(_expat.XML_PARAM_ENTITY_PARSING_NEVER)  # XXE guard

    def on_ns_decl(prefix: str, uri: str) -> None:
        # Expat passes None for the default namespace prefix.
        declared[prefix or "(default)"] = uri

    parser.StartNamespaceDeclHandler = on_ns_decl
    try:
        parser.Parse(data, True)
    except _expat.ExpatError:
        # Deliberate best-effort: keep whatever was declared before the error.
        pass
    return declared
# ─────────────────────────────────────────────────────────────────────────────
# 4. Error collector
# ─────────────────────────────────────────────────────────────────────────────
@dataclass
class ParseError:
    """Location and description of the first well-formedness error."""
    message: str   # human-readable Expat error string
    line: int      # 1-based line number of the error
    column: int    # 0-based column offset within that line
    offset: int    # byte offset from the start of the document (-1 if unknown)
    code: int      # numeric Expat error code

def validate_xml(xml_source: "bytes | str") -> "ParseError | None":
    """
    Check an XML document for well-formedness.

    Returns None when the document parses cleanly, otherwise a ParseError
    describing the first error encountered.

    Example:
        err = validate_xml(xml_bytes)
        if err:
            print(f"Line {err.line}: {err.message}")
    """
    if isinstance(xml_source, str):
        xml_source = xml_source.encode("utf-8")
    p = _expat.ParserCreate("utf-8")
    p.SetParamEntityParsing(_expat.XML_PARAM_ENTITY_PARSING_NEVER)  # XXE guard
    try:
        p.Parse(xml_source, True)
        return None
    except _expat.ExpatError as e:
        return ParseError(
            message=_expat.ErrorString(e.code),
            line=e.lineno,
            column=e.offset,
            # BUG FIX: ExpatError has no 'byteoffset' attribute, so the old
            # getattr(e, "byteoffset", -1) always produced -1.  The parser
            # itself records the byte position in ErrorByteIndex.
            offset=p.ErrorByteIndex,
            code=e.code,
        )
# ─────────────────────────────────────────────────────────────────────────────
# 5. Streaming large-file parser
# ─────────────────────────────────────────────────────────────────────────────
def stream_parse(stream: "io.RawIOBase | io.BufferedIOBase",
                 start_handler: "Callable[[str, dict], None] | None" = None,
                 end_handler: "Callable[[str], None] | None" = None,
                 char_handler: "Callable[[str], None] | None" = None,
                 chunk_size: int = 65536) -> "ParseError | None":
    """
    Parse a large XML stream incrementally, reading chunk_size bytes at a
    time.  Handlers left as None are simply not registered, so those events
    are skipped at C speed.

    Returns None on success, ParseError on the first well-formedness error.

    Example:
        counts = {}
        def start(name, attrs): counts[name] = counts.get(name, 0) + 1
        with open("large.xml", "rb") as f:
            err = stream_parse(f, start_handler=start)
        print(counts)
    """
    p = _expat.ParserCreate("utf-8")
    p.SetParamEntityParsing(_expat.XML_PARAM_ENTITY_PARSING_NEVER)  # XXE guard
    if start_handler:
        p.StartElementHandler = start_handler
    if end_handler:
        p.EndElementHandler = end_handler
    if char_handler:
        p.CharacterDataHandler = char_handler
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                # Final empty parse lets Expat flag truncated documents.
                p.Parse(b"", True)
                break
            p.Parse(chunk, False)
    except _expat.ExpatError as e:
        return ParseError(
            message=_expat.ErrorString(e.code),
            line=e.lineno,
            column=e.offset,
            # BUG FIX: previously hard-coded to -1; the parser tracks the
            # byte position of the error in ErrorByteIndex.
            offset=p.ErrorByteIndex,
            code=e.code,
        )
    return None
# ─────────────────────────────────────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    # Self-demo: exercises every helper in this module against one small
    # mixed-namespace document.
    print("=== xml.parsers.expat demo ===")
    sample = b"""<?xml version="1.0"?>
<catalog xmlns:dc="http://purl.org/dc/elements/1.1/">
<book id="b1" lang="en">
<dc:title>Python Cookbook</dc:title>
<author>David Beazley</author>
<price>39.99</price>
</book>
<book id="b2" lang="fr">
<dc:title>Apprendre Python</dc:title>
<author>Mark Lutz</author>
<price>29.99</price>
</book>
<magazine id="m1">
<dc:title>Python Weekly</dc:title>
</magazine>
</catalog>"""
    # ── count_elements ────────────────────────────────────────────────────
    # Tag-frequency tally over the whole document.
    print("\n--- count_elements ---")
    counts = count_elements(sample)
    for name, n in sorted(counts.items()):
        print(f" {name:20s}: {n}")
    # ── extract_elements ──────────────────────────────────────────────────
    # Pull each <book> with its attributes.
    print("\n--- extract_elements (book) ---")
    books = extract_elements(sample, "book")
    for b in books:
        print(f" id={b.attrs.get('id')} lang={b.attrs.get('lang')}")
    # ── extract_namespaces ────────────────────────────────────────────────
    # Expect the dc prefix declared on <catalog>.
    print("\n--- extract_namespaces ---")
    ns = extract_namespaces(sample)
    for prefix, uri in ns.items():
        print(f" {prefix:12s} → {uri}")
    # ── validate_xml ──────────────────────────────────────────────────────
    # Well-formed sample vs. a deliberately broken fragment.
    print("\n--- validate_xml ---")
    good_err = validate_xml(sample)
    bad_err = validate_xml(b"<root><unclosed></root>")
    print(f" good: {good_err}")
    print(f" bad : message={bad_err.message!r} line={bad_err.line}")
    # ── stream_parse ──────────────────────────────────────────────────────
    # Same document fed through the chunked streaming API (tiny chunks to
    # prove incremental parsing works).
    print("\n--- stream_parse ---")
    tag_counts: dict[str, int] = {}
    def on_start(name: str, attrs: dict) -> None:
        tag_counts[name] = tag_counts.get(name, 0) + 1
    stream_err = stream_parse(io.BytesIO(sample),
                              start_handler=on_start, chunk_size=128)
    print(f" error: {stream_err}")
    for name, n in sorted(tag_counts.items()):
        print(f" {name:20s}: {n}")
    print("\n=== done ===")
For the xml.sax stdlib alternative — xml.sax.parseString(data, handler) provides the same event-driven parsing as Expat but through the standard SAX2 interface with ContentHandler.startElement(), endElement(), and characters() callbacks plus ErrorHandler and EntityResolver — use xml.sax when you want a standardised SAX2 API with swappable parser backends; use xml.parsers.expat directly when you need maximum performance or access to Expat-specific features like CurrentByteIndex, namespace-prefix mode, or incremental Parse() control. For the lxml.etree (PyPI) alternative — lxml.etree.iterparse(source, events=("start","end")) provides SAX-speed streaming with a cleaner API, DTD/schema validation, XPath, and XSLT — use lxml for all production large-file XML work; use xml.parsers.expat for zero-dependency, maximum-speed stdlib-only XML streaming. The Claude Skills 360 bundle includes xml.parsers.expat skill sets covering count_elements() tag-frequency counter, ExtractedElement/extract_elements() text extractor, extract_namespaces() namespace mapper, ParseError/validate_xml() well-formedness validator, and stream_parse() large-file streaming parser. Start with the free tier to try Expat streaming patterns and xml.parsers.expat pipeline code generation.