ftfy (fixes text for you) repairs broken Unicode and mojibake in Python strings. pip install ftfy. Basic: import ftfy; ftfy.fix_text("Ã©tude") → "étude". Mojibake: ftfy.fix_text("â€œHelloâ€") → '"Hello"'. Encoding: ftfy.fix_encoding("Ã©tude") → "étude". Line breaks: ftfy.fix_line_breaks("text\r\nmore") → "text\nmore". Fix surrogates: ftfy.fix_surrogates("\ud83d\ude00") → "😀". Latin ligatures: ftfy.fix_latin_ligatures("ﬁle") → "file". Normalize: ftfy.fix_text("café", normalization="NFC"). Explain: ftfy.explain_unicode("Ã©") — prints per-codepoint analysis. Fix and explain: ftfy.fix_and_explain("Ã©tude") → (fixed, explanation). Config: from ftfy import TextFixerConfig; ftfy.fix_text(text, config=TextFixerConfig(fix_encoding=True, fix_surrogates=False)). File: ftfy.fix_file(open("broken.txt", encoding="utf-8")) — yields repaired lines from an open file. Selective: TextFixerConfig(unescape_html=True, remove_control_chars=True, fix_encoding=True). HTML: ftfy.fix_text("&lt;b&gt;Hello&lt;/b&gt;", unescape_html=True) → "<b>Hello</b>". Bad entities: ftfy.fix_text("Schr&ouml;dinger's cat") → "Schrödinger's cat". CLI: ftfy < broken.txt > fixed.txt. Batch: [ftfy.fix_text(t) for t in texts]. Claude Code generates ftfy text repair pipelines, batch normalizers, and Unicode diagnostic tools.
CLAUDE.md for ftfy
## ftfy Stack
- Version: ftfy >= 6.1 | pip install ftfy
- Basic: ftfy.fix_text(text) — auto-detects and repairs mojibake, bad encoding
- Encoding: ftfy.fix_encoding(text) — targets UTF-8 bytes that were misread as Latin-1/Windows-1252
- Config: TextFixerConfig(fix_encoding=True, unescape_html=True, remove_control_chars=True)
- Explain: ftfy.fix_and_explain(text) → (fixed_str, explanation_list)
- File: ftfy.fix_file(open("file.txt", encoding="utf-8")) — yields repaired lines; streaming large-file repair
- Batch: [ftfy.fix_text(t) for t in series] | df["text"].apply(ftfy.fix_text)
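A minimal stdlib sketch of how this kind of mojibake arises in the first place — the encode/decode pair below illustrates the error that ftfy's heuristics reverse; it is not ftfy's actual implementation:

```python
# Mojibake: UTF-8 bytes decoded with the wrong codec.
good = "étude"
bad = good.encode("utf-8").decode("latin-1")  # "Ã©tude" — classic mojibake
# ftfy.fix_text(bad) detects and undoes this; the manual inverse is:
repaired = bad.encode("latin-1").decode("utf-8")
assert repaired == good
```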
ftfy Unicode Repair Pipeline
# app/text_repair.py — ftfy mojibake repair, normalization, and batch processing
from __future__ import annotations
import unicodedata
from typing import Any
import ftfy
from ftfy import TextFixerConfig
# ─────────────────────────────────────────────────────────────────────────────
# Shared configs — create once per application
# ─────────────────────────────────────────────────────────────────────────────
# Standard repair: fix encoding, surrogates, line breaks, HTML entities
_STANDARD_CONFIG = TextFixerConfig(
fix_encoding=True,
fix_surrogates=True,
fix_line_breaks=True,
unescape_html=False,
remove_control_chars=True,
fix_latin_ligatures=True,
fix_character_width=True,
uncurl_quotes=False,
normalization="NFC",
)
# Aggressive repair: also unescape HTML entities and uncurl quotes
_AGGRESSIVE_CONFIG = TextFixerConfig(
fix_encoding=True,
fix_surrogates=True,
fix_line_breaks=True,
unescape_html=True,
remove_control_chars=True,
fix_latin_ligatures=True,
fix_character_width=True,
uncurl_quotes=True,
normalization="NFC",
)
# Minimal: encoding fix only, preserve everything else
_ENCODING_ONLY_CONFIG = TextFixerConfig(
fix_encoding=True,
fix_surrogates=False,
fix_line_breaks=False,
unescape_html=False,
remove_control_chars=False,
fix_latin_ligatures=False,
fix_character_width=False,
uncurl_quotes=False,
)
# ─────────────────────────────────────────────────────────────────────────────
# 1. Core repair helpers
# ─────────────────────────────────────────────────────────────────────────────
def fix(text: str, aggressive: bool = False) -> str:
"""
Repair broken Unicode text.
aggressive=False (default): standard repair — encoding, surrogates, ligatures.
aggressive=True: also unescapes HTML entities and uncurls curly quotes.
"""
config = _AGGRESSIVE_CONFIG if aggressive else _STANDARD_CONFIG
return ftfy.fix_text(text, config=config)
def fix_encoding_only(text: str) -> str:
"""
    Repair only mojibake (UTF-8 bytes that were decoded as Latin-1/Windows-1252).
    Leaves line breaks, HTML entities, and quotes untouched.
    "Ã©tude" → "étude"
    "â€œHelloâ€" → "“Hello”"
"""
return ftfy.fix_text(text, config=_ENCODING_ONLY_CONFIG)
def fix_html_text(text: str) -> str:
"""
Fix encoding AND unescape HTML entities.
Useful for text scraped from web pages with double-encoded content.
    "Schr&#246;dinger's cat" → "Schrödinger's cat"
"""
return ftfy.fix_text(text, config=_AGGRESSIVE_CONFIG)
def normalize(text: str, form: str = "NFC") -> str:
"""
Apply Unicode normalization without other repairs.
NFC: canonical composition (standard for text storage).
NFD: canonical decomposition.
NFKC: compatibility composition (collapses ligatures, unicode fractions).
NFKD: compatibility decomposition.
"""
return unicodedata.normalize(form, text)
# ─────────────────────────────────────────────────────────────────────────────
# 2. Diagnostic helpers
# ─────────────────────────────────────────────────────────────────────────────
def explain(text: str) -> list[dict[str, Any]]:
"""
Return per-character Unicode information for diagnostic purposes.
Useful for understanding why a string looks broken.
Returns [{"char", "codepoint", "name", "category"}]
"""
result = []
for ch in text[:200]: # cap at 200 chars for safety
cp = ord(ch)
try:
name = unicodedata.name(ch)
except ValueError:
name = "(no name)"
result.append({
"char": ch,
"codepoint": f"U+{cp:04X}",
"name": name,
"category": unicodedata.category(ch),
})
return result
def fix_and_explain(text: str) -> dict[str, Any]:
"""
Repair text and return the fixed string plus explanation.
Returns {"original", "fixed", "changed", "explanation"}.
"""
fixed, explanation = ftfy.fix_and_explain(text)
return {
"original": text,
"fixed": fixed,
"changed": text != fixed,
"explanation": explanation,
}
def is_broken(text: str) -> bool:
"""
Return True if ftfy would make any changes to the text.
Can be used to filter or flag suspicious strings before bulk repair.
"""
return ftfy.fix_text(text) != text
# ─────────────────────────────────────────────────────────────────────────────
# 3. Batch repair
# ─────────────────────────────────────────────────────────────────────────────
def fix_batch(
texts: list[str | None],
aggressive: bool = False,
) -> list[str | None]:
"""
Repair a list of strings. None values pass through unchanged.
Returns a new list of the same length.
"""
return [fix(t, aggressive=aggressive) if t is not None else None for t in texts]
def fix_dict_values(
record: dict[str, Any],
keys: list[str] | None = None,
) -> dict[str, Any]:
"""
Repair string values in a dict in-place (returns new dict).
keys: if provided, only repair those keys; otherwise repair all str values.
"""
result = dict(record)
target_keys = keys if keys is not None else list(record.keys())
for k in target_keys:
if k in result and isinstance(result[k], str):
result[k] = fix(result[k])
return result
def count_broken(texts: list[str | None]) -> int:
"""Count how many strings in the list need repair."""
return sum(1 for t in texts if t is not None and is_broken(t))
# ─────────────────────────────────────────────────────────────────────────────
# 4. File repair
# ─────────────────────────────────────────────────────────────────────────────
def fix_file_to_string(path: str, encoding: str = "utf-8") -> str:
"""
    Read a file, repair its text, and return the fixed content as a string.
    Reads line by line, so repair work is done per-line; note the full
    result is still assembled in memory as one string.
"""
lines = []
with open(path, encoding=encoding, errors="replace") as fh:
for line in fh:
lines.append(fix(line))
return "".join(lines)
def fix_file_inplace(path: str, encoding: str = "utf-8") -> int:
"""
Repair a text file in place. Returns number of lines changed.
"""
with open(path, encoding=encoding, errors="replace") as fh:
original_lines = fh.readlines()
fixed_lines = [fix(line) for line in original_lines]
changed = sum(1 for o, f in zip(original_lines, fixed_lines) if o != f)
if changed > 0:
with open(path, "w", encoding=encoding) as fh:
fh.writelines(fixed_lines)
return changed
# ─────────────────────────────────────────────────────────────────────────────
# 5. Pandas integration
# ─────────────────────────────────────────────────────────────────────────────
def fix_dataframe_column(df, column: str, aggressive: bool = False):
"""
Repair all string values in a pandas DataFrame column.
df["title"] = fix_dataframe_column(df, "title")
"""
df[column] = df[column].apply(
lambda x: fix(x, aggressive=aggressive) if isinstance(x, str) else x
)
return df[column]
def broken_rows(df, column: str):
"""Return a boolean Series — True where the column value needs repair."""
import pandas as pd
return df[column].apply(lambda x: is_broken(str(x)) if pd.notna(x) else False)
# ─────────────────────────────────────────────────────────────────────────────
# 6. Jinja2 filter registration
# ─────────────────────────────────────────────────────────────────────────────
def register_ftfy_filters(env) -> None:
"""
Register ftfy text repair filters for Jinja2.
Usage:
{{ post.body | fix_text }}
{{ scraped_title | fix_encoding }}
"""
env.filters["fix_text"] = fix
env.filters["fix_encoding"] = fix_encoding_only
env.filters["fix_html"] = fix_html_text
# ─────────────────────────────────────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
broken_samples = [
        "Ã©tude",                        # é encoded as UTF-8, decoded as Latin-1
        "â€œHello â€",                   # curly-quote mojibake
        "CafÃ©",                         # café mojibake
        "MÃ¼nchen",                      # München mojibake
        "â€˜single quotesâ€™",           # single-quote mojibake
        "ÃƒÂ©tude",                      # double-encoded mojibake (é mangled twice)
        "ﬁle ﬂow",                       # fi/fl ligature characters
        "&lt;b&gt;Hello&lt;/b&gt;",      # escaped HTML
        "Schr&#246;dinger's cat",        # decimal HTML entity
        "normal text — stays the same",  # clean text
]
print("=== Basic repair ===")
for text in broken_samples:
fixed = fix(text)
marker = "✓" if text != fixed else "·"
print(f" {marker} {text!r:40} → {fixed!r}")
print("\n=== Aggressive repair (HTML unescape) ===")
html_samples = [
        "&lt;b&gt;bold&lt;/b&gt;",
        "Schr&#246;dinger's cat",
        "Caf&eacute; &amp; Boulanger",
]
for text in html_samples:
print(f" {text!r:35} → {fix(text, aggressive=True)!r}")
print("\n=== is_broken detection ===")
for text in broken_samples:
print(f" broken={is_broken(text)} {text!r}")
print("\n=== fix_and_explain ===")
    result = fix_and_explain("CafÃ©")
print(f" original: {result['original']!r}")
print(f" fixed: {result['fixed']!r}")
print(f" changed: {result['changed']}")
print(f" explanation: {result['explanation']}")
print("\n=== Batch repair ===")
    batch = ["CafÃ©", None, "MÃ¼nchen", "normal", "â€œquotesâ€"]
fixed_batch = fix_batch(batch)
for orig, fixed in zip(batch, fixed_batch):
print(f" {str(orig):30} → {str(fixed)!r}")
print("\n=== Character explanation ===")
    for char_info in explain("Ã©")[:4]:
print(f" {char_info['char']!r:4} {char_info['codepoint']:8} {char_info['category']:3} {char_info['name']}")
print("\n=== count_broken ===")
n = count_broken(broken_samples)
print(f" {n} of {len(broken_samples)} strings need repair")
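The normalization forms documented in normalize() above can be checked with the stdlib alone — a minimal sketch, no ftfy required (the escape sequences spell out the exact codepoints to avoid ambiguity):

```python
import unicodedata

# NFC composes combining marks; NFKC additionally folds compatibility chars.
decomposed = "cafe\u0301"                     # "café" as e + combining acute
assert unicodedata.normalize("NFC", decomposed) == "caf\u00e9"   # composed é
assert unicodedata.normalize("NFKC", "\ufb01le") == "file"       # ﬁ ligature → "fi"
```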
For the chardet + codecs.decode alternative — detecting an encoding with chardet and then manually decoding can fix mojibake when you still have the raw bytes, but ftfy operates on Python str objects that have already been decoded (often incorrectly) and applies heuristics to reverse the misinterpretation. That means ftfy fixes text read from a database, an API, or a CSV that has already been decoded into str — exactly the case where chardet can't help. For the unidecode alternative — unidecode transliterates Unicode to ASCII ("München" → "Munchen"), which is intentionally lossy conversion; ftfy preserves the original characters ("CafÃ©" → "Café") by reversing the encoding error rather than throwing away the accents. The Claude Skills 360 bundle includes ftfy skill sets covering ftfy.fix_text() with TextFixerConfig, fix_encoding() for UTF-8/Latin-1 mojibake, fix_surrogates() for lone surrogates, fix_line_breaks() normalization, fix_latin_ligatures() for fi/fl, fix_and_explain() annotated output, is_broken() detection predicate, fix_batch() list repair, fix_dict_values() record repair, fix_file_to_string() and fix_file_inplace() for files, fix_dataframe_column() pandas integration, broken_rows() boolean Series, and Jinja2 filter registration. Start with the free tier to try Unicode text repair code generation.
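The bytes-vs-str boundary described above can be sketched with the stdlib alone; the chardet and unidecode calls mentioned in the comments are illustrative and not executed here:

```python
# bytes in hand → detect-and-decode territory (chardet); str in hand → ftfy.
raw = "München".encode("utf-8")        # raw bytes: chardet.detect(raw), then decode
assert raw.decode("utf-8") == "München"

mangled = raw.decode("latin-1")        # already (mis)decoded str: ftfy territory
assert mangled == "MÃ¼nchen"
# unidecode.unidecode("München") would return the lossy "Munchen";
# ftfy.fix_text(mangled) restores "München" losslessly.
```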