Python’s unicodedata module exposes Unicode character properties from the Unicode Character Database (UCD). After `import unicodedata`:

- name: unicodedata.name("A") → "LATIN CAPITAL LETTER A"
- lookup: unicodedata.lookup("SNOWMAN") → "☃"
- category: unicodedata.category("A") → "Lu" (Uppercase Letter); categories include "Ll" (lowercase), "Nd" (decimal digit), "Zs" (space), "Cc" (control), "Po" (other punctuation), "So" (other symbol)
- combining: unicodedata.combining("\u0301") → 230 (canonical combining class; 0 = non-combining)
- bidirectional: unicodedata.bidirectional("A") → "L" (left-to-right); also "R", "AL", "AN", "EN", etc.
- east_asian_width: unicodedata.east_asian_width("A") → "Na" (narrow); "W" (wide, CJK)
- mirrored: unicodedata.mirrored("(") → 1
- decomposition: unicodedata.decomposition("é") → "0065 0301" (e + combining acute)
- normalize: unicodedata.normalize("NFC", s) (composed); "NFD" (decomposed); "NFKC"/"NFKD" (compatibility forms); is_normalized(form, s) checks without converting
- digit/numeric/decimal: unicodedata.digit("5") → 5; unicodedata.numeric("½") → 0.5
- unidata_version: unicodedata.unidata_version → e.g. "15.0.0"

Claude Code generates slugifiers, accent removers, character-class validators, text normalizers, and Unicode security analyzers.
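The property lookups listed above can be exercised directly in a REPL; a quick tour using only the standard library:

```python
import unicodedata

print(unicodedata.name("A"))            # LATIN CAPITAL LETTER A
print(unicodedata.lookup("SNOWMAN"))    # ☃
print(unicodedata.category("5"))        # Nd
print(unicodedata.combining("\u0301"))  # 230
print(unicodedata.bidirectional("A"))   # L
print(unicodedata.mirrored("("))        # 1
print(unicodedata.decomposition("é"))   # 0065 0301
print(unicodedata.numeric("½"))         # 0.5
```

Note that `name()` raises ValueError for unassigned or unnamed code points unless a default is passed positionally.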
# CLAUDE.md for unicodedata
## unicodedata Stack
- Stdlib: import unicodedata
- Name: unicodedata.name(ch, "")  # default must be positional; name() takes no keyword arguments
- Cat: unicodedata.category(ch) # e.g. "Lu", "Nd", "Zs"
- Norm: unicodedata.normalize("NFC", text)
- Strip: ''.join(c for c in unicodedata.normalize("NFD", text) if unicodedata.category(c) != "Mn")
- Width: unicodedata.east_asian_width(ch) # "W" | "Na" | "N" | ...
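A minimal sketch exercising the stack entries above (normalize, accent strip, width):

```python
import unicodedata

text = "café"
# NFC: compose; NFD: decompose so accents become separate Mn marks
nfc = unicodedata.normalize("NFC", text)
stripped = "".join(
    c for c in unicodedata.normalize("NFD", text)
    if unicodedata.category(c) != "Mn"  # drop non-spacing marks
)
print(stripped)                            # cafe
print(unicodedata.category("é"))           # Ll
print(unicodedata.east_asian_width("中"))  # W
```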
## unicodedata Character Analysis Pipeline
# app/unicodeutil.py — normalize, slug, strip accents, classify, width
from __future__ import annotations

import re
import unicodedata
# ─────────────────────────────────────────────────────────────────────────────
# 1. Normalization helpers
# ─────────────────────────────────────────────────────────────────────────────

def normalize(text: str, form: str = "NFC") -> str:
    """
    Apply Unicode normalization to text.

    form: "NFC" (composed, default), "NFD" (decomposed),
          "NFKC" (compatibility composed), "NFKD" (compatibility decomposed).

    Example:
        normalize("café")         # NFC: single composed é
        normalize("café", "NFD")  # NFD: e + combining acute accent
    """
    return unicodedata.normalize(form, text)


def is_normalized(text: str, form: str = "NFC") -> bool:
    """Return True if text is already in the given normalization form (Python 3.8+)."""
    return unicodedata.is_normalized(form, text)
def strip_accents(text: str) -> str:
    """
    Remove combining diacritical marks from text (e.g. é → e, ñ → n).

    Applies NFD decomposition, then removes Mn (Mark, Nonspacing) characters.

    Example:
        strip_accents("Ångström café naïve")  # "Angstrom cafe naive"
    """
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if unicodedata.category(c) != "Mn")
def to_ascii_slug(text: str, separator: str = "-") -> str:
    """
    Convert arbitrary Unicode text to an ASCII slug safe for URLs and filenames.

    Strips accents, lowercases, replaces non-alphanumeric runs with separator.
    Note: non-Latin scripts (Cyrillic, CJK, ...) are dropped, not transliterated;
    use a transliteration library such as unidecode if you need those.

    Example:
        to_ascii_slug("Ångström Café 2025!")  # "angstrom-cafe-2025"
        to_ascii_slug("Привет мир")           # "" (nothing survives the ASCII filter)
    """
    clean = strip_accents(text)
    # Keep only ASCII alphanumerics + whitespace, then collapse runs to separator
    clean = re.sub(r"[^a-zA-Z0-9\s]", "", clean).lower().strip()
    return re.sub(r"\s+", separator, clean).strip(separator)
def nfkc_casefold(text: str) -> str:
    """
    NFKC-normalize and casefold (aggressive case-insensitive comparison fold).

    Like str.casefold() but applies compatibility normalization first.

    Example:
        nfkc_casefold("ﬁ")  # "fi" (U+FB01 ligature → two chars)
        nfkc_casefold("Ω")  # "ω"
    """
    return unicodedata.normalize("NFKC", text).casefold()
# ─────────────────────────────────────────────────────────────────────────────
# 2. Character classification
# ─────────────────────────────────────────────────────────────────────────────

def char_category(ch: str) -> str:
    """
    Return the two-letter Unicode general category for a single character.

    Example:
        char_category("A")  # "Lu"
        char_category("5")  # "Nd"
        char_category(" ")  # "Zs"
        char_category("!")  # "Po"
    """
    return unicodedata.category(ch)


_ALPHA_CATS = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo"})
_DIGIT_CATS = frozenset({"Nd"})
_SPACE_CATS = frozenset({"Zs", "Zl", "Zp"})
_PUNCT_CATS = frozenset({"Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps"})
_CTRL_CATS = frozenset({"Cc", "Cf"})


def is_alpha(ch: str) -> bool:
    return unicodedata.category(ch) in _ALPHA_CATS


def is_digit(ch: str) -> bool:
    return unicodedata.category(ch) in _DIGIT_CATS


def is_space(ch: str) -> bool:
    return unicodedata.category(ch) in _SPACE_CATS or ch in "\t\n\r\f\v"


def is_punct(ch: str) -> bool:
    return unicodedata.category(ch) in _PUNCT_CATS


def is_control(ch: str) -> bool:
    return unicodedata.category(ch) in _CTRL_CATS
def classify_string(text: str) -> dict[str, int]:
    """
    Count characters by Unicode general category in text.

    Example:
        counts = classify_string("Hello, 世界! 42")
        # {"Lu": 1, "Ll": 4, "Po": 2, "Zs": 2, "Lo": 2, "Nd": 2}
    """
    counts: dict[str, int] = {}
    for ch in text:
        cat = unicodedata.category(ch)
        counts[cat] = counts.get(cat, 0) + 1
    return counts
# ─────────────────────────────────────────────────────────────────────────────
# 3. Display width (for terminal monospace rendering)
# ─────────────────────────────────────────────────────────────────────────────

_WIDE_WIDTH = frozenset({"W", "F"})  # Wide, Fullwidth: count as 2 columns


def char_width(ch: str) -> int:
    """
    Return the display column width of a single Unicode character (1 or 2).

    Wide/fullwidth CJK characters count as 2 columns.

    Example:
        char_width("A")   # 1
        char_width("中")  # 2
    """
    return 2 if unicodedata.east_asian_width(ch) in _WIDE_WIDTH else 1


def display_width(text: str) -> int:
    """
    Return the total display column width of a string (monospace terminal).

    Example:
        display_width("Hello")  # 5
        display_width("你好")   # 4
        display_width("A中B")   # 4
    """
    return sum(char_width(c) for c in text)


def ljust_unicode(text: str, width: int, fillchar: str = " ") -> str:
    """
    Left-justify text in a field of `width` display columns.

    Accounts for wide CJK characters.

    Example:
        ljust_unicode("你好", 10)  # "你好" + 6 spaces (4 cols used, 6 to fill)
    """
    pad = width - display_width(text)
    return text + fillchar * max(0, pad)
# ─────────────────────────────────────────────────────────────────────────────
# 4. Numeric and name utilities
# ─────────────────────────────────────────────────────────────────────────────

def char_numeric(ch: str) -> float | None:
    """
    Return the numeric value of a Unicode character, or None.

    Example:
        char_numeric("5")  # 5.0
        char_numeric("½")  # 0.5
        char_numeric("Ⅷ")  # 8.0
        char_numeric("A")  # None
    """
    try:
        return unicodedata.numeric(ch)
    except ValueError:
        return None


def char_name(ch: str, default: str = "") -> str:
    """
    Return the Unicode name for a character, or `default` if it has none.

    Note: control characters have no Name property ("NULL" is an alias),
    so the default is returned for them.

    Example:
        char_name("☃")     # "SNOWMAN"
        char_name("\x00")  # "" (the default)
    """
    return unicodedata.name(ch, default)
def find_char(name_fragment: str, limit: int = 10) -> list[tuple[str, str]]:
    """
    Find Unicode characters whose names contain name_fragment (case-insensitive).

    Scans every code point (U+0000 through U+10FFFF); may be slow for broad queries.

    Example:
        find_char("SNOWMAN")    # [("☃", "SNOWMAN"), ...]
        find_char("DIGIT ONE")  # [("1", "DIGIT ONE"), ("١", "ARABIC-INDIC DIGIT ONE"), ...]
    """
    fragment = name_fragment.upper()
    results: list[tuple[str, str]] = []
    for cp in range(0x110000):
        try:
            n = unicodedata.name(chr(cp))
        except ValueError:  # unassigned or unnamed code point
            continue
        if fragment in n:
            results.append((chr(cp), n))
            if len(results) >= limit:
                break
    return results
# ─────────────────────────────────────────────────────────────────────────────
# 5. Text sanitization helpers
# ─────────────────────────────────────────────────────────────────────────────

def remove_control_chars(text: str) -> str:
    """
    Remove Unicode control characters (categories Cc and Cf) from text.

    Preserves tabs, newlines, and carriage returns (Cc, but printable).

    Example:
        remove_control_chars("hello\x00\x08world")  # "helloworld"
    """
    return "".join(
        c for c in text
        if unicodedata.category(c) not in _CTRL_CATS or c in "\t\n\r"
    )


def homoglyph_normalize(text: str) -> str:
    """
    Apply NFKC to collapse compatibility equivalents and homoglyphs.

    Useful for preventing lookalike-character attacks in usernames.

    Example:
        homoglyph_normalize("ℌello")   # "Hello" (U+210C → "H")
        homoglyph_normalize("１２３")  # "123" (fullwidth digits → ASCII)
    """
    return unicodedata.normalize("NFKC", text)
def ascii_only(text: str) -> str:
    """
    Keep only ASCII characters (code points 0–127); everything else is dropped.

    Example:
        ascii_only("café résumé")  # "caf rsum" (accented chars dropped, not stripped)
    """
    return text.encode("ascii", errors="ignore").decode("ascii")
# ─────────────────────────────────────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────────────────────────────────────

if __name__ == "__main__":
    print("=== unicodedata demo ===")
    print(f"  Unicode version: {unicodedata.unidata_version}")

    # ── normalization ─────────────────────────────────────────────────────────
    print("\n--- normalize + strip_accents ---")
    words = ["café", "résumé", "naïve", "Ångström", "Ñoño"]
    for w in words:
        print(f"  {w:15s} → stripped={strip_accents(w)!r}")

    # ── slugify ───────────────────────────────────────────────────────────────
    print("\n--- to_ascii_slug ---")
    for title in ["Hello World!", "Ångström Café 2025", "Python 3.12 — New Features"]:
        print(f"  {title!r:35s} → {to_ascii_slug(title)!r}")

    # ── char info ─────────────────────────────────────────────────────────────
    print("\n--- char properties ---")
    for ch in ["A", "5", " ", "☃", "é", "中", "½", "①"]:
        cat = unicodedata.category(ch)
        name = char_name(ch, "?")
        num = char_numeric(ch)
        w = char_width(ch)
        print(f"  {ch!r:4s} cat={cat:3s} w={w} num={num!s:5s} name={name}")

    # ── display width ─────────────────────────────────────────────────────────
    print("\n--- display_width ---")
    for text in ["Hello", "你好世界", "A中B", "３ＡＢ"]:
        print(f"  {text!r:12s} → {display_width(text)} cols")

    # ── classify_string ───────────────────────────────────────────────────────
    print("\n--- classify_string ---")
    sample = "Hello, 世界! Answer: 42."
    cats = classify_string(sample)
    for cat, count in sorted(cats.items(), key=lambda x: -x[1]):
        print(f"  {cat}: {count}")

    # ── homoglyph_normalize ───────────────────────────────────────────────────
    print("\n--- homoglyph_normalize ---")
    for s in ["ＨＥＬＬＯｗｏｒｌｄ", "１２３ＡＢＣ", "ℌello"]:
        print(f"  {s!r:15s} → {homoglyph_normalize(s)!r}")

    print("\n=== done ===")
For the unidecode alternative: unidecode (PyPI) transliterates any Unicode character to an approximate ASCII representation using lookup tables (e.g. "Ångström" → "Angstrom", "中文" → "Zhong Wen"). It covers many scripts that strip_accents() cannot handle because they are not decomposable via NFD. Use unidecode for broad transliteration of non-Latin scripts in user-facing slug generation; use unicodedata when you need the precise normalization algorithm (NFC/NFKC for database comparison, NFD for accent stripping) or when you are validating characters by Unicode property category.

For the regex alternative: the regex PyPI package extends re with Unicode property escapes such as \p{Lu} (uppercase letter) and \p{Script=Latin} (Latin script), which are more expressive than unicodedata.category() lookups. Use regex when your pattern-matching logic needs Unicode script or property classes inline in a regex; use unicodedata directly when you are inspecting individual character properties in Python code rather than in patterns.

The Claude Skills 360 bundle includes unicodedata skill sets covering normalize()/is_normalized()/strip_accents()/to_ascii_slug()/nfkc_casefold() normalization tools, char_category()/is_alpha()/is_digit()/is_space()/is_punct()/is_control()/classify_string() classifiers, char_width()/display_width()/ljust_unicode() terminal width helpers, char_numeric()/char_name()/find_char() property utilities, and remove_control_chars()/homoglyph_normalize()/ascii_only() sanitizers. Start with the free tier to try Unicode text processing patterns and unicodedata pipeline code generation.
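The decomposability limitation described above can be observed with the stdlib alone, no unidecode required; a minimal sketch, redefining the pipeline's strip_accents inline so it is self-contained:

```python
import unicodedata

def strip_accents(text: str) -> str:
    # NFD splits base letters from combining marks, which we then drop
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if unicodedata.category(c) != "Mn")

# Latin letters with diacritics decompose, so stripping works:
print(strip_accents("Ångström"))  # Angstrom
# Cyrillic letters have no diacritics to strip; the text passes through
# unchanged, and a subsequent ASCII filter would drop it entirely.
# This is exactly the gap a transliteration library fills.
print(strip_accents("Привет"))    # Привет
```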