Python’s codecs module provides an interface to all text and binary codecs available in the interpreter. import codecs. codecs.open: f = codecs.open("file.txt", "r", encoding="utf-8", errors="replace") — like open() but accepts any codec name. encode/decode: codecs.encode("hello", "rot_13") → "uryyb"; codecs.decode(b"\x89PNG...", "latin-1"). lookup: info = codecs.lookup("utf-8") → CodecInfo(name, encode, decode, streamreader, streamwriter, ...); info.name is the canonical name. Encodings: Python supports "utf-8", "utf-16", "utf-32", "latin-1", "ascii", "idna", "punycode", "base64", "hex_codec", "zlib_codec", "bz2_codec", "rot_13", "uu_codec". BOM: codecs.BOM_UTF8 = b"\xef\xbb\xbf"; codecs.BOM_UTF16 follows the platform’s native byte order — it equals BOM_UTF16_LE (b"\xff\xfe") on little-endian systems and BOM_UTF16_BE (b"\xfe\xff") on big-endian systems; use the explicit _LE/_BE constants when the byte order matters. StreamReader/StreamWriter: wrap a file-like object — reader = codecs.getreader("utf-8")(raw_io). IncrementalDecoder: dec = codecs.getincrementaldecoder("utf-8")(); dec.decode(chunk, final=True) — stateful; handles multi-byte boundaries. errors: "strict", "ignore", "replace", "xmlcharrefreplace", "backslashreplace", "surrogateescape". register: codecs.register(search_fn) — add custom codec. codecs.encode(data, "zlib_codec") — zlib compress as one-liner. Claude Code generates encoding-aware file readers, BOM-stripping utilities, incremental streaming decoders, and multi-encoding document converters.
CLAUDE.md for codecs
## codecs Stack
- Stdlib: import codecs
- File: f = codecs.open("file.txt", "r", encoding="utf-8-sig", errors="replace")
- Quick: codecs.encode(text, "utf-8"); codecs.decode(data, "utf-8")
- Zlib: compressed = codecs.encode(data, "zlib_codec")
- BOM: data = data.removeprefix(codecs.BOM_UTF8)  # never lstrip(): bytes.lstrip strips any of those byte *values* repeatedly, not a single prefix
- Stream: reader = codecs.getreader("utf-8")(raw_binary_io)
## codecs Encoding Pipeline
# app/codecutil.py — open, detect BOM, incremental decode, transform, custom
from __future__ import annotations
import codecs
import io
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable, Generator, Iterator
# ─────────────────────────────────────────────────────────────────────────────
# 1. File and BOM helpers
# ─────────────────────────────────────────────────────────────────────────────
# BOM prefixes mapped to codec names. The four-byte UTF-32 marks MUST come
# before the two-byte UTF-16 ones: BOM_UTF32_LE begins with BOM_UTF16_LE,
# so a first-match scan would otherwise misreport UTF-32 input as UTF-16.
_BOM_TABLE: list[tuple[bytes, str]] = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom_encoding(data: bytes) -> str | None:
    """
    Identify the encoding signalled by a BOM at the start of *data*.

    Returns the codec name ("utf-8-sig", "utf-16-le", ...) for the first
    matching BOM prefix, or None when the bytes carry no recognised BOM.

    Example:
        enc = detect_bom_encoding(Path("file.txt").read_bytes())
        if enc:
            text = data.decode(enc)
    """
    return next((enc for bom, enc in _BOM_TABLE if data.startswith(bom)), None)
def strip_bom(data: bytes) -> tuple[bytes, str | None]:
    """
    Remove a leading BOM from *data* when one is present.

    Returns (payload, codec_name); when no BOM is recognised the payload is
    returned untouched and codec_name is None.

    Example:
        stripped, enc = strip_bom(raw_bytes)
        text = stripped.decode(enc or "utf-8")
    """
    for prefix, codec_name in _BOM_TABLE:
        if data.startswith(prefix):
            # Slice off exactly the matched BOM bytes.
            return data[len(prefix):], codec_name
    return data, None
def read_text_auto(path: str | Path, fallback: str = "utf-8",
                   errors: str = "replace") -> str:
    """
    Read a text file, auto-detecting BOM encoding and falling back to `fallback`.

    Args:
        path: File to read.
        fallback: Codec used when no BOM is found.
        errors: Decode error handler (default "replace", so malformed bytes
            become U+FFFD instead of raising).

    Example:
        text = read_text_auto("notes.txt")
    """
    raw = Path(path).read_bytes()
    # strip_bom() both detects and removes the BOM in a single prefix scan;
    # the previous detect-then-strip sequence walked the BOM table twice.
    stripped, enc = strip_bom(raw)
    return stripped.decode(enc or fallback, errors=errors)
def open_text(
    path: str | Path,
    mode: str = "r",
    encoding: str = "utf-8",
    errors: str = "strict",
) -> codecs.StreamReaderWriter:
    """
    Open a text file via the codecs machinery, returning a file-like object.

    Unlike builtin open(), codecs.open() accepts any registered codec name;
    note that it always opens the underlying file in binary mode and does no
    universal-newline translation, so it is not a drop-in open() replacement.

    Args:
        path: File to open.
        mode: File mode; codecs.open() adds 'b' to it internally.
        encoding: Any codec name the codecs registry knows.
        errors: Codec error handler.

    Example:
        with open_text("win.txt", encoding="cp1252") as f:
            for line in f:
                print(line)
    """
    # Fix: codecs.open() returns a codecs.StreamReaderWriter, not an
    # io.TextIOWrapper — with the annotation corrected, the old
    # "type: ignore[return-value]" is no longer needed.
    return codecs.open(str(path), mode, encoding=encoding, errors=errors)
# ─────────────────────────────────────────────────────────────────────────────
# 2. Encoding / decoding helpers
# ─────────────────────────────────────────────────────────────────────────────
def encode_to(text: str, encoding: str, errors: str = "strict") -> bytes:
    """
    Encode a string to bytes using the named text codec.

    Args:
        text: String to encode.
        encoding: Codec name, e.g. "utf-8", "latin-1".
        errors: Error handler ("strict", "replace", "ignore", ...).

    Raises:
        LookupError: Unknown codec name (or a non-text codec such as
            "hex_codec", which str.encode refuses).
        UnicodeEncodeError: Unrepresentable characters with errors="strict".

    Example:
        data = encode_to("hello", "utf-8")
        data = encode_to("naïve", "latin-1", errors="replace")
    """
    # Fix: the previous implementation built `codecs.lookup(...).incrementalencoder`
    # and immediately discarded it — dead code. str.encode() already resolves
    # the codec and raises LookupError for unknown names, so nothing is lost.
    return text.encode(encoding, errors=errors)
def decode_from(data: bytes, encoding: str, errors: str = "strict") -> str:
    """
    Decode *data* to text with the named codec.

    Example:
        text = decode_from(b"\\xff\\xfeh\\x00i\\x00", "utf-16")
    """
    # str(bytes, encoding, errors) is the constructor form of bytes.decode().
    return str(data, encoding, errors)
def transcode(data: bytes, src_enc: str, dst_enc: str, errors: str = "replace") -> bytes:
    """
    Re-encode *data* from src_enc to dst_enc.

    Example:
        utf8_bytes = transcode(latin1_bytes, "latin-1", "utf-8")
    """
    # Two explicit steps: bytes -> text under the source codec, then
    # text -> bytes under the destination codec.
    text = data.decode(src_enc, errors=errors)
    return text.encode(dst_enc, errors=errors)
def try_decode(data: bytes, encodings: list[str],
               errors: str = "strict") -> tuple[str, str] | None:
    """
    Attempt each encoding in order; return (text, encoding) for the first
    that decodes *data* successfully, or None when every attempt fails.

    Unknown codec names are skipped, not raised.

    Example:
        result = try_decode(raw, ["utf-8", "utf-16", "latin-1"])
        if result:
            text, enc = result
    """
    for candidate in encodings:
        try:
            text = data.decode(candidate, errors=errors)
        except (UnicodeDecodeError, LookupError):
            continue
        return text, candidate
    return None
# ─────────────────────────────────────────────────────────────────────────────
# 3. Incremental decoder for streaming
# ─────────────────────────────────────────────────────────────────────────────
class StreamingDecoder:
    """
    Incremental decoder wrapper for byte streams that arrive in chunks
    (sockets, HTTP bodies, pipes).

    Partial multi-byte sequences are buffered between feed() calls, so chunk
    boundaries may fall anywhere inside a character.

    Example:
        dec = StreamingDecoder("utf-8")
        for chunk in socket_chunks:
            process(dec.feed(chunk))
        process(dec.finish())
    """

    def __init__(self, encoding: str = "utf-8", errors: str = "replace") -> None:
        # Resolve the codec's incremental-decoder factory, then build one
        # stateful decoder instance from it.
        factory = codecs.getincrementaldecoder(encoding)
        self._decoder = factory(errors=errors)
        self._encoding = encoding

    def feed(self, data: bytes) -> str:
        """Decode one chunk; a trailing partial sequence is held back."""
        return self._decoder.decode(data, final=False)

    def finish(self) -> str:
        """Signal end-of-stream and flush whatever is still buffered."""
        return self._decoder.decode(b"", final=True)

    def reset(self) -> None:
        # Drop buffered state so the same instance can decode a new stream.
        self._decoder.reset()

    def decode_iter(self, source: Iterator[bytes]) -> Generator[str, None, None]:
        """
        Lazily decode every chunk from *source*, yielding only non-empty
        strings, then yield the flushed tail (if any).

        Example:
            for text in dec.decode_iter(response.iter_content()):
                print(text, end="")
        """
        for piece in source:
            decoded = self.feed(piece)
            if decoded:
                yield decoded
        leftover = self.finish()
        if leftover:
            yield leftover
# ─────────────────────────────────────────────────────────────────────────────
# 4. Data transform codecs
# ─────────────────────────────────────────────────────────────────────────────
def zlib_compress(data: bytes) -> bytes:
    """
    Compress *data* with zlib through the codecs transform interface.

    Example:
        compressed = zlib_compress(large_bytes)
        original = zlib_decompress(compressed)
    """
    # CodecInfo.encode returns (output_bytes, input_length); keep the bytes.
    packed, _consumed = codecs.lookup("zlib_codec").encode(data)
    return packed
def zlib_decompress(data: bytes) -> bytes:
    """Inflate zlib-compressed *data* back to the original bytes."""
    # CodecInfo.decode returns (output_bytes, input_length); keep the bytes.
    restored, _consumed = codecs.lookup("zlib_codec").decode(data)
    return restored
def bz2_compress(data: bytes) -> bytes:
    """Compress *data* with bzip2 via the codecs transform interface."""
    packed = codecs.encode(data, "bz2_codec")
    return packed
def bz2_decompress(data: bytes) -> bytes:
    """Reverse bz2_compress: expand bzip2 bytes to the original payload."""
    unpacked = codecs.decode(data, "bz2_codec")
    return unpacked
def rot13(text: str) -> str:
    """
    Apply ROT-13 to *text* via codecs.

    ROT-13 is its own inverse, so the same call both encodes and decodes.

    Example:
        rot13("Hello World") # "Uryyb Jbeyq"
        rot13("Uryyb Jbeyq") # "Hello World"
    """
    # decode and encode are identical transforms for rot_13; decode is
    # spelled here to emphasise the symmetry.
    return codecs.decode(text, "rot_13")
def hex_encode(data: bytes) -> str:
    """Return the lowercase hex representation of *data* as a str."""
    raw_hex = codecs.encode(data, "hex_codec")  # bytes, e.g. b"deadbeef"
    return raw_hex.decode("ascii")
def hex_decode(hex_str: str) -> bytes:
    """Parse a hex string (as produced by hex_encode) back into bytes."""
    ascii_bytes = hex_str.encode("ascii")
    return codecs.decode(ascii_bytes, "hex_codec")
# ─────────────────────────────────────────────────────────────────────────────
# 5. Codec info and custom codec helpers
# ─────────────────────────────────────────────────────────────────────────────
@dataclass
class CodecSummary:
    """Lightweight view of a codec: canonical name plus alias spellings."""

    name: str           # canonical name, as reported by codecs.lookup().name
    aliases: list[str]  # alternative spellings resolving to the same codec

    def __str__(self) -> str:
        shown = self.aliases[:4]  # cap the display at four aliases
        return f"{self.name} aliases={shown}"
def codec_info(name: str) -> CodecSummary:
    """
    Build a CodecSummary (canonical name + alias spellings) for *name*.

    Raises LookupError if *name* itself is not a known codec.

    Example:
        print(codec_info("utf-8"))
        print(codec_info("latin-1"))
    """
    info = codecs.lookup(name)
    # Probe a handful of common spelling variants; keep those that resolve
    # to the same canonical codec without being the canonical spelling.
    probes = [name.replace("-", "_"), name.replace("_", "-"),
              name.upper(), name.lower()]
    found: list[str] = []
    for probe in probes:
        try:
            canonical = codecs.lookup(probe).name
        except LookupError:
            continue
        if canonical == info.name and probe != info.name:
            found.append(probe)
    return CodecSummary(name=info.name, aliases=found)
def supported_encodings() -> list[str]:
    """
    List well-known codec names that resolve on this interpreter.

    Probes a fixed candidate set with codecs.lookup() and keeps every name
    that does not raise LookupError.
    """
    candidates = [
        "utf-8", "utf-16", "utf-32", "ascii", "latin-1", "cp1252",
        "cp1251", "iso-8859-1", "iso-8859-2", "gbk", "big5", "shift_jis",
        "euc-jp", "euc-kr", "utf-8-sig", "idna", "punycode",
        "base64", "hex_codec", "zlib_codec", "bz2_codec", "rot_13",
    ]
    available: list[str] = []
    for candidate in candidates:
        try:
            codecs.lookup(candidate)
        except LookupError:
            continue
        available.append(candidate)
    return available
# ─────────────────────────────────────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────────────────────────────────────
# Smoke-test demo: exercises every helper above and prints the results.
if __name__ == "__main__":
    print("=== codecs demo ===")
    # ── BOM detection ─────────────────────────────────────────────────────────
    print("\n--- BOM detection ---")
    # Label → raw bytes; only the first two samples carry a BOM prefix.
    samples = {
        "UTF-8 BOM": codecs.BOM_UTF8 + "hello world".encode("utf-8"),
        "UTF-16 LE BOM": codecs.BOM_UTF16_LE + "hello".encode("utf-16-le"),
        "No BOM": "plain text".encode("utf-8"),
    }
    for label, raw in samples.items():
        # enc is None for the BOM-less sample; strip_bom leaves it untouched.
        stripped, enc = strip_bom(raw)
        print(f" {label:15s}: enc={enc!r:15s} stripped starts={stripped[:8]!r}")
    # ── transcode ─────────────────────────────────────────────────────────────
    print("\n--- transcode latin-1 → utf-8 ---")
    # Accented characters are 1 byte in latin-1 but 2 bytes in UTF-8.
    latin1 = "café résumé".encode("latin-1")
    utf8 = transcode(latin1, "latin-1", "utf-8")
    print(f" latin1: {latin1!r}")
    print(f" utf-8: {utf8!r}")
    # ── try_decode ────────────────────────────────────────────────────────────
    print("\n--- try_decode ---")
    snippets = [
        "hello 世界".encode("utf-8"),
        # latin-1 "café" is invalid UTF-8, so try_decode falls through to latin-1.
        "café".encode("latin-1"),
    ]
    for raw in snippets:
        result = try_decode(raw, ["utf-8", "latin-1"])
        if result:
            text, enc = result
            print(f" decoded as {enc!r}: {text!r}")
    # ── StreamingDecoder ──────────────────────────────────────────────────────
    print("\n--- StreamingDecoder ---")
    msg = "hello 世界".encode("utf-8")
    dec = StreamingDecoder("utf-8")
    # Chunk boundaries deliberately split the 3-byte CJK characters to show
    # that the incremental decoder buffers partial sequences correctly.
    parts = [msg[:5], msg[5:10], msg[10:]]
    reconstructed = ""
    for chunk in parts:
        reconstructed += dec.feed(chunk)
    reconstructed += dec.finish()
    print(f" streamed in 3 chunks → {reconstructed!r}")
    # ── transform codecs ──────────────────────────────────────────────────────
    print("\n--- transform codecs ---")
    # Highly repetitive payload, so zlib achieves a large compression ratio.
    data = b"hello world " * 100
    compressed = zlib_compress(data)
    decompressed = zlib_decompress(compressed)
    print(f" zlib: {len(data)} → {len(compressed)} bytes (ratio {len(data)/len(compressed):.1f}x)")
    print(f" roundtrip ok: {decompressed == data}")
    print(f"\n rot13('Hello World') = {rot13('Hello World')!r}")
    print(f" hex_encode(b'\\xde\\xad') = {hex_encode(b'\\xde\\xad')!r}")
    print(f" hex_decode('deadbeef') = {hex_decode('deadbeef')!r}")
    # ── supported encodings ───────────────────────────────────────────────────
    print("\n--- supported_encodings (sample) ---")
    # Show only the first eight names to keep the demo output short.
    for enc in supported_encodings()[:8]:
        print(f" {enc}")
    print("\n=== done ===")
For the chardet / charset-normalizer alternative — chardet (PyPI) samples the byte distribution of an unknown-encoding document and returns a probability-weighted encoding guess; charset-normalizer does the same as a drop-in replacement used by requests — use these when you receive a file with no declared encoding and must infer it from the bytes; use codecs when you already know or can determine the encoding (from BOM, HTTP header, XML declaration, or user configuration) and simply need a streaming-capable encoder/decoder interface. For the io.TextIOWrapper alternative — io.TextIOWrapper wraps a binary io.RawIOBase with a codec and is what Python’s built-in open() returns for text mode; it offers newline, line_buffering, and write_through options; codecs.StreamReader/StreamWriter provide the same wrapping but accept any codec name including zlib_codec and base64 — use io.TextIOWrapper for standard text-file I/O; use codecs.open() or StreamReader/StreamWriter when you need a codec name that open() doesn’t recognise (e.g. "zlib_codec") or when building a custom codec pipeline. The Claude Skills 360 bundle includes codecs skill sets covering detect_bom_encoding()/strip_bom()/read_text_auto()/open_text() BOM-aware readers, encode_to()/decode_from()/transcode()/try_decode() encode/decode helpers, StreamingDecoder with feed()/finish()/decode_iter() incremental streaming, zlib_compress()/zlib_decompress()/bz2_compress()/rot13()/hex_encode()/hex_decode() transform codec wrappers, and codec_info()/supported_encodings() introspection. Start with the free tier to try encoding pipeline patterns and codecs pipeline code generation.