Unidecode transliterates Unicode text to ASCII using per-character transliteration tables. Install with pip install Unidecode. Basic: from unidecode import unidecode; unidecode("München") → "Munchen". unidecode("Ångström") → "Angstrom". unidecode("café") → "cafe". Chinese: unidecode("北京") → "Bei Jing " (CJK transliterations carry a trailing space per character). Japanese: unidecode("東京") → "Dong Jing " (kanji are transliterated with their Chinese readings). Korean: unidecode("서울") → "seoul". Russian: unidecode("Москва") → "Moskva". Arabic: unidecode("مرحبا") → "mrhba". Greek: unidecode("Ελλάδα") → "Ellada". Thai: unidecode("ไทย") → "thyy". Hebrew: unidecode("שלום") → "shlwm". German umlauts: unidecode("Ä") → "A" | unidecode("ö") → "o" | unidecode("ü") → "u". Performance: unidecode_expect_ascii(text) is faster when input is mostly ASCII; unidecode_expect_nonascii(text) is optimized for non-ASCII-heavy text. Error handling: unidecode(text, errors="ignore") silently drops characters with no replacement; unidecode(text, errors="replace", replace_str="?") substitutes a custom string. Slug: combine with re.sub(r"[^a-z0-9]+", "-", unidecode(text).lower()).strip("-"). Claude Code generates Unidecode romanizers, name normalizers, and ASCII slug pipelines.
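A minimal sketch of the basics above, assuming the Unidecode package is installed:

```python
from unidecode import unidecode

print(unidecode("München"))  # Munchen
print(unidecode("Москва"))   # Moskva
# CJK output carries a trailing space per transliterated character:
print(unidecode("北京"))     # Bei Jing
```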
CLAUDE.md for Unidecode
## Unidecode Stack
- Version: Unidecode >= 1.3 | pip install Unidecode
- Basic: unidecode("München") → "Munchen" | unidecode("北京") → "Bei Jing"
- Performance: unidecode_expect_ascii(text) for mostly-ASCII | _expect_nonascii for mostly Unicode
- Errors: unidecode(text, errors="ignore") | errors="replace", replace_str="?"
- Slug: re.sub(r"[^a-z0-9]+", "-", unidecode(text).lower()).strip("-")
- Per-char: unidecode(ch) on a single character returns its ASCII equivalent ("ü" → "u")
- Lossy: transliteration is one-way; keep the original string if you need it back (ftfy repairs mojibake, it does not reverse transliteration)
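A short sketch of the errors modes, using U+E000 (a Private Use Area code point, which by design has no transliteration table entry):

```python
from unidecode import unidecode, UnidecodeError

text = "ab\ue000cd"
print(unidecode(text))                                    # "abcd" — dropped (default errors="ignore")
print(unidecode(text, errors="replace", replace_str="?")) # "ab?cd"
print(unidecode(text, errors="preserve"))                 # untransliterated char kept verbatim
try:
    unidecode(text, errors="strict")
except UnidecodeError as exc:
    print("unhandled character at index", exc.index)      # index of U+E000 in the input
```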
Unidecode Transliteration Pipeline
# app/romanize.py — Unidecode transliteration, romanization, and ASCII slug pipeline
from __future__ import annotations

import re
import unicodedata

from unidecode import (
    unidecode,
    unidecode_expect_ascii,
    unidecode_expect_nonascii,
)


# ─────────────────────────────────────────────────────────────────────────────
# 1. Core transliteration helpers
# ─────────────────────────────────────────────────────────────────────────────
def to_ascii(text: str) -> str:
    """
    Transliterate Unicode text to ASCII using Unidecode tables.
    Lossy — accents, CJK, Cyrillic etc. are converted to their closest ASCII.

    "München" → "Munchen"
    "café" → "cafe"
    "北京" → "Bei Jing "
    "Москва" → "Moskva"
    """
    return unidecode(text)


def to_ascii_fast(text: str) -> str:
    """
    Faster variant when text is mostly ASCII with occasional Unicode characters.
    unidecode_expect_ascii returns pure-ASCII input immediately via a cheap
    encode check before falling back to the table lookup.
    """
    return unidecode_expect_ascii(text)


def to_ascii_unicode(text: str) -> str:
    """
    Optimized variant for Unicode-heavy text (CJK, Arabic, Cyrillic, etc.).
    unidecode_expect_nonascii skips the ASCII fast-path check.
    """
    return unidecode_expect_nonascii(text)


def char_to_ascii(ch: str) -> str:
    """
    Return the ASCII transliteration of a single Unicode character.
    Useful for per-character analysis or custom replacement logic.

    "ü" → "u"
    "北" → "Bei "
    """
    if ord(ch) < 0x80:
        return ch
    return unidecode(ch)
# ─────────────────────────────────────────────────────────────────────────────
# 2. Slug generation
# ─────────────────────────────────────────────────────────────────────────────
def ascii_slug(text: str, separator: str = "-", max_length: int = 80) -> str:
    """
    Generate an ASCII URL slug from any Unicode text.

    1. Unidecode to ASCII
    2. Lowercase
    3. Replace non-alphanumeric runs with the separator
    4. Strip leading/trailing separators
    5. Truncate at max_length (at a word boundary when possible)

    "München Biergarten 2024" → "munchen-biergarten-2024"
    "北京旅游指南" → "bei-jing-lu-you-zhi-nan"
    """
    ascii_text = unidecode(text).lower()
    slug = re.sub(r"[^a-z0-9]+", separator, ascii_text).strip(separator)
    if max_length and len(slug) > max_length:
        slug = slug[:max_length]
        # Truncate at the last separator to avoid cutting a word in half
        last_sep = slug.rfind(separator)
        if last_sep > max_length // 2:
            slug = slug[:last_sep]
    return slug


def ascii_filename(text: str, extension: str = "") -> str:
    """
    Generate a safe ASCII filename from Unicode text.
    Path separators, null bytes, and other unsafe characters are already
    removed by the slug step.

    ascii_filename("報告書 (最終版)", "pdf") → "bao-gao-shu-zui-zhong-ban.pdf"
    """
    base = ascii_slug(text, separator="-", max_length=120)
    if extension:
        ext = extension.lstrip(".")
        return f"{base}.{ext}"
    return base
# ─────────────────────────────────────────────────────────────────────────────
# 3. Name and address romanization
# ─────────────────────────────────────────────────────────────────────────────
def romanize_name(name: str) -> str:
    """
    Romanize a person's name to ASCII for systems that don't support Unicode.
    Re-capitalizes the first letter of each word.

    "Ångström" → "Angstrom"
    "Björk Guðmundsdóttir" → "Bjork Gudmundsdottir"
    "李小龙" → "Li Xiao Long"
    """
    ascii_name = unidecode(name).strip()
    # Re-capitalize each word (transliteration can change casing)
    return " ".join(w.capitalize() for w in ascii_name.split() if w)


def romanize_address(address: str) -> str:
    """
    Romanize a postal address to ASCII.

    "Schloßstraße 12, München" → "Schlossstrasse 12, Munchen"
    """
    return unidecode(address).strip()


def normalize_for_search(text: str) -> str:
    """
    Normalize text for ASCII-only search indexes.

    1. NFC-normalize (compose combining marks)
    2. Transliterate to ASCII
    3. Lowercase
    4. Collapse whitespace

    Good for matching "München" against "munchen" queries.
    """
    nfc = unicodedata.normalize("NFC", text)
    ascii_text = unidecode(nfc).lower()
    return re.sub(r"\s+", " ", ascii_text).strip()
# ─────────────────────────────────────────────────────────────────────────────
# 4. Comparison and matching
# ─────────────────────────────────────────────────────────────────────────────
def ascii_equal(a: str, b: str) -> bool:
    """
    Return True if two strings are equal after ASCII transliteration.
    Strips edges because CJK transliterations carry trailing spaces.

    ascii_equal("café", "cafe") → True
    ascii_equal("München", "Munchen") → True
    """
    return unidecode(a).strip().lower() == unidecode(b).strip().lower()


def ascii_startswith(text: str, prefix: str) -> bool:
    """Case-insensitive ASCII prefix check across Unicode strings."""
    return unidecode(text).lower().startswith(unidecode(prefix).lower())


def ascii_contains(text: str, query: str) -> bool:
    """Case-insensitive ASCII substring search across Unicode strings."""
    return unidecode(query).lower() in unidecode(text).lower()
# ─────────────────────────────────────────────────────────────────────────────
# 5. Batch processing
# ─────────────────────────────────────────────────────────────────────────────
def romanize_batch(texts: list[str | None]) -> list[str | None]:
    """Transliterate a list, passing None values through unchanged."""
    return [unidecode(t) if t is not None else None for t in texts]


def slug_batch(texts: list[str | None]) -> list[str | None]:
    """Generate ASCII slugs for a list, passing None values through unchanged."""
    return [ascii_slug(t) if t is not None else None for t in texts]


def detect_non_ascii(texts: list[str]) -> list[int]:
    """Return indices of strings that contain non-ASCII characters."""
    return [i for i, t in enumerate(texts) if any(ord(c) >= 128 for c in t)]
# ─────────────────────────────────────────────────────────────────────────────
# 6. Pandas integration
# ─────────────────────────────────────────────────────────────────────────────
def romanize_column(df, column: str, new_column: str | None = None):
    """
    Add a romanized (ASCII) version of a DataFrame column.

    df = romanize_column(df, "name", new_column="name_ascii")
    """
    out_col = new_column or f"{column}_ascii"
    df[out_col] = df[column].apply(
        lambda x: unidecode(str(x)) if x is not None else None
    )
    return df


def slug_column(df, column: str, new_column: str | None = None):
    """Add an ASCII slug column derived from a text column."""
    out_col = new_column or f"{column}_slug"
    df[out_col] = df[column].apply(
        lambda x: ascii_slug(str(x)) if x is not None else None
    )
    return df
# ─────────────────────────────────────────────────────────────────────────────
# 7. Jinja2 filter registration
# ─────────────────────────────────────────────────────────────────────────────
def register_unidecode_filters(env) -> None:
    """
    Register Unidecode filters on a Jinja2 environment.

    Usage:
        {{ author.name | romanize }}
        {{ post.title | ascii_slug }}
    """
    env.filters["romanize"] = to_ascii
    env.filters["ascii_slug"] = ascii_slug
    env.filters["romanize_name"] = romanize_name
# ─────────────────────────────────────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    samples = [
        ("German", "München Straße Ärger"),
        ("French", "café naïve façade"),
        ("Spanish", "señor año jalapeño"),
        ("Russian", "Москва Санкт-Петербург"),
        ("Chinese", "北京 上海 香港"),
        ("Japanese", "東京 大阪"),
        ("Korean", "서울 부산"),
        ("Arabic", "مرحبا بالعالم"),
        ("Greek", "Αθήνα Ελλάδα"),
        ("Hindi", "नमस्ते दुनिया"),
        ("Thai", "กรุงเทพมหานคร"),
    ]
    print("=== Transliteration ===")
    for lang, text in samples:
        print(f"  {lang:10} {text!r:30} → {unidecode(text)!r}")

    print("\n=== Slug generation ===")
    slug_inputs = [
        "München Biergarten 2024",
        "北京旅游指南",
        "Ångström & Friends",
        "Björk Guðmundsdóttir",
        "señor jalapeño café",
        "100% Pure — Alpine Water",
    ]
    for text in slug_inputs:
        print(f"  {text!r:35} → {ascii_slug(text)!r}")

    print("\n=== Name romanization ===")
    names = [
        "Ångström",
        "Björk Guðmundsdóttir",
        "李小龙",
        "Михаил Булгаков",
        "Γεώργιος Παπανδρέου",
    ]
    for name in names:
        print(f"  {name!r:30} → {romanize_name(name)!r}")

    print("\n=== Search normalization ===")
    pairs = [
        ("München", "munchen"),
        ("café", "cafe"),
        ("北京", "bei jing"),
    ]
    for original, query in pairs:
        norm = normalize_for_search(original)
        match = ascii_equal(original, query)
        print(f"  {original!r:10} normalize={norm!r:15} matches {query!r}={match}")

    print("\n=== Batch ===")
    texts = ["café", None, "Москва", "normal", "北京"]
    for orig, rom in zip(texts, romanize_batch(texts)):
        print(f"  {str(orig):15} → {str(rom)!r}")
For the python-slugify alternative: python-slugify uses Unidecode internally for transliteration when the input contains non-ASCII characters, but wraps it with stop-word removal, word-boundary truncation, and a custom-replacements API. Use python-slugify when you're building URL slugs with stop-word filtering; use Unidecode directly when you need plain transliteration for search normalization, address romanization, or filename sanitization where stop-words must be preserved.

For the unicodedata.normalize("NFKD") alternative: NFKD decomposition followed by filtering to ASCII strips accents ("é" → "e") but only covers Latin characters; it produces empty strings for CJK, Arabic, Cyrillic, and Greek text. Unidecode covers 200+ scripts using transliteration tables, so "北京" → "Bei Jing" and "Москва" → "Moskva" instead of becoming empty strings.

The Claude Skills 360 bundle includes Unidecode skill sets covering unidecode() transliteration, the unidecode_expect_ascii() and unidecode_expect_nonascii() performance variants, char_to_ascii() per-character lookup, ascii_slug() URL slug generation, ascii_filename() safe filename building, romanize_name() name capitalization, normalize_for_search() search normalization, ascii_equal()/ascii_contains() fuzzy matching, romanize_batch() and slug_batch() list processing, romanize_column() and slug_column() pandas integration, and Jinja2 filter registration. Start with the free tier to try Unicode transliteration code generation.
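The NFKD-vs-Unidecode difference described above can be demonstrated in a few lines; `nfkd_ascii` is a hypothetical helper name used for illustration:

```python
import unicodedata
from unidecode import unidecode

def nfkd_ascii(text: str) -> str:
    # Decompose, then drop everything that doesn't survive an ASCII encode.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

print(nfkd_ascii("café"))  # cafe — Latin accents strip cleanly
print(nfkd_ascii("北京"))  # ""   — CJK has no ASCII decomposition
print(unidecode("北京"))   # Bei Jing — transliteration tables still work
```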