Python’s unicodedata module exposes Unicode character properties from the Unicode Character Database (UCD). After `import unicodedata`:

- name: unicodedata.name("A") → "LATIN CAPITAL LETTER A"
- lookup: unicodedata.lookup("SNOWMAN") → "☃"
- category: unicodedata.category("A") → "Lu" (Uppercase Letter); categories include "Ll" (lowercase), "Nd" (decimal digit), "Zs" (space), "Cc" (control), "Po" (other punctuation), "So" (other symbol)
- combining: unicodedata.combining("\u0301") → 230 (canonical combining class; 0 = non-combining)
- bidirectional: unicodedata.bidirectional("A") → "L" (left-to-right); also "R", "AL", "AN", "EN", etc.
- east_asian_width: unicodedata.east_asian_width("A") → "Na" (narrow); "W" (wide, CJK)
- mirrored: unicodedata.mirrored("(") → 1
- decomposition: unicodedata.decomposition("é") → "0065 0301" (e + combining acute)
- normalize: unicodedata.normalize("NFC", s) (composed); "NFD" (decomposed); "NFKC"/"NFKD" (compatibility forms); is_normalized(form, s) checks without converting
- digit/numeric/decimal: unicodedata.digit("5") → 5; unicodedata.numeric("½") → 0.5
- unidata_version: unicodedata.unidata_version → e.g. "15.0.0"

Claude Code generates slugifiers, accent removers, character-class validators, text normalizers, and Unicode security analyzers.
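The property lookups listed above can be exercised directly in a REPL; a quick tour using only the standard library:

```python
import unicodedata

print(unicodedata.name("A"))            # LATIN CAPITAL LETTER A
print(unicodedata.lookup("SNOWMAN"))    # ☃
print(unicodedata.category("5"))        # Nd
print(unicodedata.combining("\u0301"))  # 230
print(unicodedata.bidirectional("A"))   # L
print(unicodedata.mirrored("("))        # 1
print(unicodedata.decomposition("é"))   # 0065 0301
print(unicodedata.numeric("½"))         # 0.5
```

Note that `name()` raises ValueError for unassigned or unnamed code points unless a default is passed positionally.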
# CLAUDE.md for unicodedata
## unicodedata Stack
- Stdlib: import unicodedata
- Name: unicodedata.name(ch, "")  # default must be positional; name() takes no keyword arguments
- Cat: unicodedata.category(ch) # e.g. "Lu", "Nd", "Zs"
- Norm: unicodedata.normalize("NFC", text)
- Strip: ''.join(c for c in unicodedata.normalize("NFD", text) if unicodedata.category(c) != "Mn")
- Width: unicodedata.east_asian_width(ch) # "W" | "Na" | "N" | ...
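A minimal sketch exercising the stack entries above (normalize, accent strip, width):

```python
import unicodedata

text = "café"
# NFC: compose; NFD: decompose so accents become separate Mn marks
nfc = unicodedata.normalize("NFC", text)
stripped = "".join(
    c for c in unicodedata.normalize("NFD", text)
    if unicodedata.category(c) != "Mn"  # drop non-spacing marks
)
print(stripped)                            # cafe
print(unicodedata.category("é"))           # Ll
print(unicodedata.east_asian_width("中"))  # W
```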
## unicodedata Character Analysis Pipeline
# app/unicodeutil.py — normalize, slug, strip accents, classify, width
from __future__ import annotations

import re
import unicodedata
# ─────────────────────────────────────────────────────────────────────────────
# 1. Normalization helpers
# ─────────────────────────────────────────────────────────────────────────────

def normalize(text: str, form: str = "NFC") -> str:
    """
    Apply Unicode normalization to text.

    form: "NFC" (composed, default), "NFD" (decomposed),
          "NFKC" (compatibility composed), "NFKD" (compatibility decomposed).

    Example:
        normalize("café")         # NFC: single composed é
        normalize("café", "NFD")  # NFD: e + combining acute accent
    """
    return unicodedata.normalize(form, text)


def is_normalized(text: str, form: str = "NFC") -> bool:
    """Return True if text is already in the given normalization form (Python 3.8+)."""
    return unicodedata.is_normalized(form, text)
def strip_accents(text: str) -> str:
    """
    Remove combining diacritical marks from text (e.g. é → e, ñ → n).

    Applies NFD decomposition, then removes Mn (Mark, Nonspacing) characters.

    Example:
        strip_accents("Ångström café naïve")  # "Angstrom cafe naive"
    """
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if unicodedata.category(c) != "Mn")
def to_ascii_slug(text: str, separator: str = "-") -> str:
    """
    Convert arbitrary Unicode text to an ASCII slug safe for URLs and filenames.

    Strips accents, lowercases, replaces non-alphanumeric runs with separator.
    Note: non-Latin scripts (Cyrillic, CJK, ...) are dropped, not transliterated;
    use a transliteration library such as unidecode if you need those.

    Example:
        to_ascii_slug("Ångström Café 2025!")  # "angstrom-cafe-2025"
        to_ascii_slug("Привет мир")           # "" (nothing survives the ASCII filter)
    """
    clean = strip_accents(text)
    # Keep only ASCII alphanumerics + whitespace, then collapse runs to separator
    clean = re.sub(r"[^a-zA-Z0-9\s]", "", clean).lower().strip()
    return re.sub(r"\s+", separator, clean).strip(separator)
def nfkc_casefold(text: str) -> str:
    """
    NFKC-normalize and casefold (aggressive case-insensitive comparison fold).

    Like str.casefold() but applies compatibility normalization first.

    Example:
        nfkc_casefold("ﬁ")  # "fi" (U+FB01 ligature → two chars)
        nfkc_casefold("Ω")  # "ω"
    """
    return unicodedata.normalize("NFKC", text).casefold()
# ─────────────────────────────────────────────────────────────────────────────
# 2. Character classification
# ─────────────────────────────────────────────────────────────────────────────

def char_category(ch: str) -> str:
    """
    Return the two-letter Unicode general category for a single character.

    Example:
        char_category("A")  # "Lu"
        char_category("5")  # "Nd"
        char_category(" ")  # "Zs"
        char_category("!")  # "Po"
    """
    return unicodedata.category(ch)


_ALPHA_CATS = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo"})
_DIGIT_CATS = frozenset({"Nd"})
_SPACE_CATS = frozenset({"Zs", "Zl", "Zp"})
_PUNCT_CATS = frozenset({"Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps"})
_CTRL_CATS = frozenset({"Cc", "Cf"})


def is_alpha(ch: str) -> bool:
    return unicodedata.category(ch) in _ALPHA_CATS


def is_digit(ch: str) -> bool:
    return unicodedata.category(ch) in _DIGIT_CATS


def is_space(ch: str) -> bool:
    return unicodedata.category(ch) in _SPACE_CATS or ch in "\t\n\r\f\v"


def is_punct(ch: str) -> bool:
    return unicodedata.category(ch) in _PUNCT_CATS


def is_control(ch: str) -> bool:
    return unicodedata.category(ch) in _CTRL_CATS
def classify_string(text: str) -> dict[str, int]:
    """
    Count characters by Unicode general category in text.

    Example:
        counts = classify_string("Hello, 世界! 42")
        # {"Lu": 1, "Ll": 4, "Po": 2, "Zs": 2, "Lo": 2, "Nd": 2}
    """
    counts: dict[str, int] = {}
    for ch in text:
        cat = unicodedata.category(ch)
        counts[cat] = counts.get(cat, 0) + 1
    return counts
# ─────────────────────────────────────────────────────────────────────────────
# 3. Display width (for terminal monospace rendering)
# ─────────────────────────────────────────────────────────────────────────────

_WIDE_WIDTH = frozenset({"W", "F"})  # Wide, Fullwidth: count as 2 columns


def char_width(ch: str) -> int:
    """
    Return the display column width of a single Unicode character (1 or 2).

    Wide/fullwidth CJK characters count as 2 columns.

    Example:
        char_width("A")   # 1
        char_width("中")  # 2
    """
    return 2 if unicodedata.east_asian_width(ch) in _WIDE_WIDTH else 1


def display_width(text: str) -> int:
    """
    Return the total display column width of a string (monospace terminal).

    Example:
        display_width("Hello")  # 5
        display_width("你好")   # 4
        display_width("A中B")   # 4
    """
    return sum(char_width(c) for c in text)


def ljust_unicode(text: str, width: int, fillchar: str = " ") -> str:
    """
    Left-justify text in a field of `width` display columns.

    Accounts for wide CJK characters.

    Example:
        ljust_unicode("你好", 10)  # "你好" + 6 spaces (4 cols used, 6 to fill)
    """
    pad = width - display_width(text)
    return text + fillchar * max(0, pad)
# ─────────────────────────────────────────────────────────────────────────────
# 4. Numeric and name utilities
# ─────────────────────────────────────────────────────────────────────────────

def char_numeric(ch: str) -> float | None:
    """
    Return the numeric value of a Unicode character, or None.

    Example:
        char_numeric("5")  # 5.0
        char_numeric("½")  # 0.5
        char_numeric("Ⅷ")  # 8.0
        char_numeric("A")  # None
    """
    try:
        return unicodedata.numeric(ch)
    except ValueError:
        return None


def char_name(ch: str, default: str = "") -> str:
    """
    Return the Unicode name for a character, or `default` if it has none.

    Note: control characters have no Name property ("NULL" is an alias),
    so the default is returned for them.

    Example:
        char_name("☃")     # "SNOWMAN"
        char_name("\x00")  # "" (the default)
    """
    return unicodedata.name(ch, default)
def find_char(name_fragment: str, limit: int = 10) -> list[tuple[str, str]]:
    """
    Find Unicode characters whose names contain name_fragment (case-insensitive).

    Scans every code point (U+0000 through U+10FFFF); may be slow for broad queries.

    Example:
        find_char("SNOWMAN")    # [("☃", "SNOWMAN"), ...]
        find_char("DIGIT ONE")  # [("1", "DIGIT ONE"), ("١", "ARABIC-INDIC DIGIT ONE"), ...]
    """
    fragment = name_fragment.upper()
    results: list[tuple[str, str]] = []
    for cp in range(0x110000):
        try:
            n = unicodedata.name(chr(cp))
        except ValueError:  # unassigned or unnamed code point
            continue
        if fragment in n:
            results.append((chr(cp), n))
            if len(results) >= limit:
                break
    return results
# ─────────────────────────────────────────────────────────────────────────────
# 5. Text sanitization helpers
# ─────────────────────────────────────────────────────────────────────────────

def remove_control_chars(text: str) -> str:
    """
    Remove Unicode control characters (categories Cc and Cf) from text.

    Preserves tabs, newlines, and carriage returns (Cc, but printable).

    Example:
        remove_control_chars("hello\x00\x08world")  # "helloworld"
    """
    return "".join(
        c for c in text
        if unicodedata.category(c) not in _CTRL_CATS or c in "\t\n\r"
    )


def homoglyph_normalize(text: str) -> str:
    """
    Apply NFKC to collapse compatibility equivalents and homoglyphs.

    Useful for preventing lookalike-character attacks in usernames.

    Example:
        homoglyph_normalize("ℌello")   # "Hello" (U+210C → "H")
        homoglyph_normalize("１２３")  # "123" (fullwidth digits → ASCII)
    """
    return unicodedata.normalize("NFKC", text)
def ascii_only(text: str) -> str:
    """
    Keep only ASCII characters (code points 0–127); everything else is dropped.

    Example:
        ascii_only("café résumé")  # "caf rsum" (accented chars dropped, not stripped)
    """
    return text.encode("ascii", errors="ignore").decode("ascii")
# ─────────────────────────────────────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────────────────────────────────────

if __name__ == "__main__":
    print("=== unicodedata demo ===")
    print(f"  Unicode version: {unicodedata.unidata_version}")

    # ── normalization ─────────────────────────────────────────────────────────
    print("\n--- normalize + strip_accents ---")
    words = ["café", "résumé", "naïve", "Ångström", "Ñoño"]
    for w in words:
        print(f"  {w:15s} → stripped={strip_accents(w)!r}")

    # ── slugify ───────────────────────────────────────────────────────────────
    print("\n--- to_ascii_slug ---")
    for title in ["Hello World!", "Ångström Café 2025", "Python 3.12 — New Features"]:
        print(f"  {title!r:35s} → {to_ascii_slug(title)!r}")

    # ── char info ─────────────────────────────────────────────────────────────
    print("\n--- char properties ---")
    for ch in ["A", "5", " ", "☃", "é", "中", "½", "①"]:
        cat = unicodedata.category(ch)
        name = char_name(ch, "?")
        num = char_numeric(ch)
        w = char_width(ch)
        print(f"  {ch!r:4s} cat={cat:3s} w={w} num={num!s:5s} name={name}")

    # ── display width ─────────────────────────────────────────────────────────
    print("\n--- display_width ---")
    for text in ["Hello", "你好世界", "A中B", "３ＡＢ"]:
        print(f"  {text!r:12s} → {display_width(text)} cols")

    # ── classify_string ───────────────────────────────────────────────────────
    print("\n--- classify_string ---")
    sample = "Hello, 世界! Answer: 42."
    cats = classify_string(sample)
    for cat, count in sorted(cats.items(), key=lambda x: -x[1]):
        print(f"  {cat}: {count}")

    # ── homoglyph_normalize ───────────────────────────────────────────────────
    print("\n--- homoglyph_normalize ---")
    for s in ["ＨＥＬＬＯｗｏｒｌｄ", "１２３ＡＢＣ", "ℌello"]:
        print(f"  {s!r:15s} → {homoglyph_normalize(s)!r}")

    print("\n=== done ===")
For the unidecode alternative: unidecode (PyPI) transliterates any Unicode character to an approximate ASCII representation using lookup tables (e.g. "Ångström" → "Angstrom", "中文" → "Zhong Wen"). It covers many scripts that strip_accents() cannot handle because they are not decomposable via NFD. Use unidecode for broad transliteration of non-Latin scripts in user-facing slug generation; use unicodedata when you need the precise normalization algorithm (NFC/NFKC for database comparison, NFD for accent stripping) or when you are validating characters by Unicode property category.

For the regex alternative: the regex PyPI package extends re with Unicode property escapes such as \p{Lu} (uppercase letter) and \p{Script=Latin} (Latin script), which are more expressive than unicodedata.category() lookups. Use regex when your pattern-matching logic needs Unicode script or property classes inline in a regex; use unicodedata directly when you are inspecting individual character properties in Python code rather than in patterns.

The Claude Skills 360 bundle includes unicodedata skill sets covering normalize()/is_normalized()/strip_accents()/to_ascii_slug()/nfkc_casefold() normalization tools, char_category()/is_alpha()/is_digit()/is_space()/is_punct()/is_control()/classify_string() classifiers, char_width()/display_width()/ljust_unicode() terminal width helpers, char_numeric()/char_name()/find_char() property utilities, and remove_control_chars()/homoglyph_normalize()/ascii_only() sanitizers. Start with the free tier to try Unicode text processing patterns and unicodedata pipeline code generation.
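The decomposability limitation described above can be observed with the stdlib alone, no unidecode required; a minimal sketch, redefining the pipeline's strip_accents inline so it is self-contained:

```python
import unicodedata

def strip_accents(text: str) -> str:
    # NFD splits base letters from combining marks, which we then drop
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if unicodedata.category(c) != "Mn")

# Latin letters with diacritics decompose, so stripping works:
print(strip_accents("Ångström"))  # Angstrom
# Cyrillic letters have no diacritics to strip; the text passes through
# unchanged, and a subsequent ASCII filter would drop it entirely.
# This is exactly the gap a transliteration library fills.
print(strip_accents("Привет"))    # Привет
```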