Unidecode transliterates Unicode text to ASCII using per-character transliteration tables. Install with pip install Unidecode. Basic: from unidecode import unidecode; unidecode("München") → "Munchen". unidecode("Ångström") → "Angstrom". unidecode("café") → "cafe". Chinese: unidecode("北京") → "Bei Jing " (CJK transliterations carry a trailing space per character). Japanese: unidecode("東京") → "Dong Jing " (kanji are transliterated with their Chinese readings). Korean: unidecode("서울") → "seoul". Russian: unidecode("Москва") → "Moskva". Arabic: unidecode("مرحبا") → "mrhba". Greek: unidecode("Ελλάδα") → "Ellada". Thai: unidecode("ไทย") → "thyy". Hebrew: unidecode("שלום") → "shlwm". German umlauts: unidecode("Ä") → "A" | unidecode("ö") → "o" | unidecode("ü") → "u". Performance: unidecode_expect_ascii(text) is faster when input is mostly ASCII; unidecode_expect_nonascii(text) is optimized for non-ASCII-heavy text. Error handling: unidecode(text, errors="ignore") silently drops characters with no replacement; unidecode(text, errors="replace", replace_str="?") substitutes a custom string. Slug: combine with re.sub(r"[^a-z0-9]+", "-", unidecode(text).lower()).strip("-"). Claude Code generates Unidecode romanizers, name normalizers, and ASCII slug pipelines.
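A minimal sketch of the basics above, assuming the Unidecode package is installed:

```python
from unidecode import unidecode

print(unidecode("München"))  # Munchen
print(unidecode("Москва"))   # Moskva
# CJK output carries a trailing space per transliterated character:
print(unidecode("北京"))     # Bei Jing
```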
CLAUDE.md for Unidecode
## Unidecode Stack
- Version: Unidecode >= 1.3 | pip install Unidecode
- Basic: unidecode("München") → "Munchen" | unidecode("北京") → "Bei Jing"
- Performance: unidecode_expect_ascii(text) for mostly-ASCII | _expect_nonascii for mostly Unicode
- Errors: unidecode(text, errors="ignore") | errors="replace", replace_str="?"
- Slug: re.sub(r"[^a-z0-9]+", "-", unidecode(text).lower()).strip("-")
- Per-char: unidecode(ch) on a single character returns its ASCII equivalent ("ü" → "u")
- Lossy: transliteration is one-way; keep the original string if you need it back (ftfy repairs mojibake, it does not reverse transliteration)
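A short sketch of the errors modes, using U+E000 (a Private Use Area code point, which by design has no transliteration table entry):

```python
from unidecode import unidecode, UnidecodeError

text = "ab\ue000cd"
print(unidecode(text))                                    # "abcd" — dropped (default errors="ignore")
print(unidecode(text, errors="replace", replace_str="?")) # "ab?cd"
print(unidecode(text, errors="preserve"))                 # untransliterated char kept verbatim
try:
    unidecode(text, errors="strict")
except UnidecodeError as exc:
    print("unhandled character at index", exc.index)      # index of U+E000 in the input
```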
Unidecode Transliteration Pipeline
# app/romanize.py — Unidecode transliteration, romanization, and ASCII slug pipeline
from __future__ import annotations

import re
import unicodedata

from unidecode import (
    unidecode,
    unidecode_expect_ascii,
    unidecode_expect_nonascii,
)


# ─────────────────────────────────────────────────────────────────────────────
# 1. Core transliteration helpers
# ─────────────────────────────────────────────────────────────────────────────
def to_ascii(text: str) -> str:
    """
    Transliterate Unicode text to ASCII using Unidecode tables.
    Lossy — accents, CJK, Cyrillic etc. are converted to their closest ASCII.

    "München" → "Munchen"
    "café" → "cafe"
    "北京" → "Bei Jing "
    "Москва" → "Moskva"
    """
    return unidecode(text)


def to_ascii_fast(text: str) -> str:
    """
    Faster variant when text is mostly ASCII with occasional Unicode characters.
    unidecode_expect_ascii returns pure-ASCII input immediately via a cheap
    encode check before falling back to the table lookup.
    """
    return unidecode_expect_ascii(text)


def to_ascii_unicode(text: str) -> str:
    """
    Optimized variant for Unicode-heavy text (CJK, Arabic, Cyrillic, etc.).
    unidecode_expect_nonascii skips the ASCII fast-path check.
    """
    return unidecode_expect_nonascii(text)


def char_to_ascii(ch: str) -> str:
    """
    Return the ASCII transliteration of a single Unicode character.
    Useful for per-character analysis or custom replacement logic.

    "ü" → "u"
    "北" → "Bei "
    """
    if ord(ch) < 0x80:
        return ch
    return unidecode(ch)
# ─────────────────────────────────────────────────────────────────────────────
# 2. Slug generation
# ─────────────────────────────────────────────────────────────────────────────
def ascii_slug(text: str, separator: str = "-", max_length: int = 80) -> str:
    """
    Generate an ASCII URL slug from any Unicode text.

    1. Unidecode to ASCII
    2. Lowercase
    3. Replace non-alphanumeric runs with the separator
    4. Strip leading/trailing separators
    5. Truncate at max_length (at a word boundary when possible)

    "München Biergarten 2024" → "munchen-biergarten-2024"
    "北京旅游指南" → "bei-jing-lu-you-zhi-nan"
    """
    ascii_text = unidecode(text).lower()
    slug = re.sub(r"[^a-z0-9]+", separator, ascii_text).strip(separator)
    if max_length and len(slug) > max_length:
        slug = slug[:max_length]
        # Truncate at the last separator to avoid cutting a word in half
        last_sep = slug.rfind(separator)
        if last_sep > max_length // 2:
            slug = slug[:last_sep]
    return slug


def ascii_filename(text: str, extension: str = "") -> str:
    """
    Generate a safe ASCII filename from Unicode text.
    Path separators, null bytes, and other unsafe characters are already
    removed by the slug step.

    ascii_filename("報告書 (最終版)", "pdf") → "bao-gao-shu-zui-zhong-ban.pdf"
    """
    base = ascii_slug(text, separator="-", max_length=120)
    if extension:
        ext = extension.lstrip(".")
        return f"{base}.{ext}"
    return base
# ─────────────────────────────────────────────────────────────────────────────
# 3. Name and address romanization
# ─────────────────────────────────────────────────────────────────────────────
def romanize_name(name: str) -> str:
    """
    Romanize a person's name to ASCII for systems that don't support Unicode.
    Re-capitalizes the first letter of each word.

    "Ångström" → "Angstrom"
    "Björk Guðmundsdóttir" → "Bjork Gudmundsdottir"
    "李小龙" → "Li Xiao Long"
    """
    ascii_name = unidecode(name).strip()
    # Re-capitalize each word (transliteration can change casing)
    return " ".join(w.capitalize() for w in ascii_name.split() if w)


def romanize_address(address: str) -> str:
    """
    Romanize a postal address to ASCII.

    "Schloßstraße 12, München" → "Schlossstrasse 12, Munchen"
    """
    return unidecode(address).strip()


def normalize_for_search(text: str) -> str:
    """
    Normalize text for ASCII-only search indexes.

    1. NFC-normalize (compose combining marks)
    2. Transliterate to ASCII
    3. Lowercase
    4. Collapse whitespace

    Good for matching "München" against "munchen" queries.
    """
    nfc = unicodedata.normalize("NFC", text)
    ascii_text = unidecode(nfc).lower()
    return re.sub(r"\s+", " ", ascii_text).strip()
# ─────────────────────────────────────────────────────────────────────────────
# 4. Comparison and matching
# ─────────────────────────────────────────────────────────────────────────────
def ascii_equal(a: str, b: str) -> bool:
    """
    Return True if two strings are equal after ASCII transliteration.
    Strips edges because CJK transliterations carry trailing spaces.

    ascii_equal("café", "cafe") → True
    ascii_equal("München", "Munchen") → True
    """
    return unidecode(a).strip().lower() == unidecode(b).strip().lower()


def ascii_startswith(text: str, prefix: str) -> bool:
    """Case-insensitive ASCII prefix check across Unicode strings."""
    return unidecode(text).lower().startswith(unidecode(prefix).lower())


def ascii_contains(text: str, query: str) -> bool:
    """Case-insensitive ASCII substring search across Unicode strings."""
    return unidecode(query).lower() in unidecode(text).lower()
# ─────────────────────────────────────────────────────────────────────────────
# 5. Batch processing
# ─────────────────────────────────────────────────────────────────────────────
def romanize_batch(texts: list[str | None]) -> list[str | None]:
    """Transliterate a list, passing None values through unchanged."""
    return [unidecode(t) if t is not None else None for t in texts]


def slug_batch(texts: list[str | None]) -> list[str | None]:
    """Generate ASCII slugs for a list, passing None values through unchanged."""
    return [ascii_slug(t) if t is not None else None for t in texts]


def detect_non_ascii(texts: list[str]) -> list[int]:
    """Return indices of strings that contain non-ASCII characters."""
    return [i for i, t in enumerate(texts) if any(ord(c) >= 128 for c in t)]
# ─────────────────────────────────────────────────────────────────────────────
# 6. Pandas integration
# ─────────────────────────────────────────────────────────────────────────────
def romanize_column(df, column: str, new_column: str | None = None):
    """
    Add a romanized (ASCII) version of a DataFrame column.

    df = romanize_column(df, "name", new_column="name_ascii")
    """
    out_col = new_column or f"{column}_ascii"
    df[out_col] = df[column].apply(
        lambda x: unidecode(str(x)) if x is not None else None
    )
    return df


def slug_column(df, column: str, new_column: str | None = None):
    """Add an ASCII slug column derived from a text column."""
    out_col = new_column or f"{column}_slug"
    df[out_col] = df[column].apply(
        lambda x: ascii_slug(str(x)) if x is not None else None
    )
    return df
# ─────────────────────────────────────────────────────────────────────────────
# 7. Jinja2 filter registration
# ─────────────────────────────────────────────────────────────────────────────
def register_unidecode_filters(env) -> None:
    """
    Register Unidecode filters on a Jinja2 environment.

    Usage:
        {{ author.name | romanize }}
        {{ post.title | ascii_slug }}
    """
    env.filters["romanize"] = to_ascii
    env.filters["ascii_slug"] = ascii_slug
    env.filters["romanize_name"] = romanize_name
# ─────────────────────────────────────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    samples = [
        ("German", "München Straße Ärger"),
        ("French", "café naïve façade"),
        ("Spanish", "señor año jalapeño"),
        ("Russian", "Москва Санкт-Петербург"),
        ("Chinese", "北京 上海 香港"),
        ("Japanese", "東京 大阪"),
        ("Korean", "서울 부산"),
        ("Arabic", "مرحبا بالعالم"),
        ("Greek", "Αθήνα Ελλάδα"),
        ("Hindi", "नमस्ते दुनिया"),
        ("Thai", "กรุงเทพมหานคร"),
    ]
    print("=== Transliteration ===")
    for lang, text in samples:
        print(f"  {lang:10} {text!r:30} → {unidecode(text)!r}")

    print("\n=== Slug generation ===")
    slug_inputs = [
        "München Biergarten 2024",
        "北京旅游指南",
        "Ångström & Friends",
        "Björk Guðmundsdóttir",
        "señor jalapeño café",
        "100% Pure — Alpine Water",
    ]
    for text in slug_inputs:
        print(f"  {text!r:35} → {ascii_slug(text)!r}")

    print("\n=== Name romanization ===")
    names = [
        "Ångström",
        "Björk Guðmundsdóttir",
        "李小龙",
        "Михаил Булгаков",
        "Γεώργιος Παπανδρέου",
    ]
    for name in names:
        print(f"  {name!r:30} → {romanize_name(name)!r}")

    print("\n=== Search normalization ===")
    pairs = [
        ("München", "munchen"),
        ("café", "cafe"),
        ("北京", "bei jing"),
    ]
    for original, query in pairs:
        norm = normalize_for_search(original)
        match = ascii_equal(original, query)
        print(f"  {original!r:10} normalize={norm!r:15} matches {query!r}={match}")

    print("\n=== Batch ===")
    texts = ["café", None, "Москва", "normal", "北京"]
    for orig, rom in zip(texts, romanize_batch(texts)):
        print(f"  {str(orig):15} → {str(rom)!r}")
For the python-slugify alternative: python-slugify uses Unidecode internally for transliteration when the input contains non-ASCII characters, but wraps it with stop-word removal, word-boundary truncation, and a custom-replacements API. Use python-slugify when you're building URL slugs with stop-word filtering; use Unidecode directly when you need plain transliteration for search normalization, address romanization, or filename sanitization where stop-words must be preserved.

For the unicodedata.normalize("NFKD") alternative: NFKD decomposition followed by filtering to ASCII strips accents ("é" → "e") but only covers Latin characters; it produces empty strings for CJK, Arabic, Cyrillic, and Greek text. Unidecode covers 200+ scripts using transliteration tables, so "北京" → "Bei Jing" and "Москва" → "Moskva" instead of becoming empty strings.

The Claude Skills 360 bundle includes Unidecode skill sets covering unidecode() transliteration, the unidecode_expect_ascii() and unidecode_expect_nonascii() performance variants, char_to_ascii() per-character lookup, ascii_slug() URL slug generation, ascii_filename() safe filename building, romanize_name() name capitalization, normalize_for_search() search normalization, ascii_equal()/ascii_contains() fuzzy matching, romanize_batch() and slug_batch() list processing, romanize_column() and slug_column() pandas integration, and Jinja2 filter registration. Start with the free tier to try Unicode transliteration code generation.
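The NFKD-vs-Unidecode difference described above can be demonstrated in a few lines; `nfkd_ascii` is a hypothetical helper name used for illustration:

```python
import unicodedata
from unidecode import unidecode

def nfkd_ascii(text: str) -> str:
    # Decompose, then drop everything that doesn't survive an ASCII encode.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

print(nfkd_ascii("café"))  # cafe — Latin accents strip cleanly
print(nfkd_ascii("北京"))  # ""   — CJK has no ASCII decomposition
print(unidecode("北京"))   # Bei Jing — transliteration tables still work
```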