chardet detects character encodings from raw bytes using statistical analysis. Install with pip install chardet. Basic usage: chardet.detect(b"Hello") returns {"encoding": "ascii", "confidence": 1.0, "language": ""}; chardet.detect("café".encode("utf-8")) reports "utf-8", while chardet.detect("café".encode("latin-1")) typically reports "ISO-8859-1". chardet.detect_all(data) returns a ranked list of candidate encodings. For streaming, use UniversalDetector: feed() chunks, call close(), then read .result. For files, read the bytes in binary mode and detect: with open(path, "rb") as f: raw = f.read(); enc = chardet.detect(raw)["encoding"], then decode with raw.decode(enc or "utf-8", errors="replace"). The result's "confidence" field is a float from 0 to 1; require a high threshold (e.g. >= 0.85) before trusting detection in production. The "language" field reports e.g. "Russian" or "Chinese" for language-specific multi-byte encodings. Normalize aliases to names Python's codecs accept: "GB2312" → "gbk"; "windows-1252" is common for Western European Windows files. For CSV, detect the encoding before pd.read_csv(path, encoding=enc). requests uses chardet internally for resp.apparent_encoding (newer requests releases may use charset-normalizer instead). CLI: python -m chardet file.txt. Claude Code generates chardet encoding detectors, safe file readers, and CSV import pipelines.
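The detect-then-decode round trip described above can be sketched in a few lines (a minimal example; the sample text and the utf-8 fallback are illustrative choices):

```python
import chardet

# Bytes whose encoding we do not know in advance (here: Latin-1 for the demo).
raw = "café naïve façade déjà vu".encode("latin-1")

result = chardet.detect(raw)
# result is a dict like {"encoding": "ISO-8859-1", "confidence": 0.73, "language": ""}

enc = result["encoding"] or "utf-8"   # detection can return None; fall back
text = raw.decode(enc, errors="replace")
```

Decoding with errors="replace" guarantees the call never raises, at the cost of substituting U+FFFD for any bytes the chosen codec cannot map.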
# CLAUDE.md for chardet
## chardet Stack
- Version: chardet >= 5.2 | pip install chardet
- Detect: chardet.detect(bytes) → {"encoding": str, "confidence": float, "language": str}
- All: chardet.detect_all(bytes) → ranked list of candidates
- Stream: UniversalDetector().feed(chunk); detector.close(); detector.result
- Threshold: require confidence >= 0.85 before trusting detection
- Decode: raw.decode(detected_enc or "utf-8", errors="replace")
- CSV: detect encoding → pd.read_csv(path, encoding=enc)
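The streaming entry in the stack above can be exercised without a file by feeding chunks of an in-memory payload (a minimal sketch; the Russian sample text and 1 KB chunk size are illustrative):

```python
from chardet import UniversalDetector

# Simulate a large input arriving in chunks (e.g. network or file reads).
payload = ("Москва и Санкт-Петербург " * 40).encode("koi8-r")

detector = UniversalDetector()
for i in range(0, len(payload), 1024):
    detector.feed(payload[i:i + 1024])
    if detector.done:        # detector is confident; stop feeding early
        break
detector.close()             # must be called before reading .result
print(detector.result)       # {"encoding": ..., "confidence": ..., "language": ...}
```

Stopping as soon as detector.done is set avoids scanning the remainder of a large input once the detector has settled on an answer.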
## chardet Encoding Detection Pipeline
# app/encoding.py — chardet detection, safe decoding, and file/CSV import pipeline
from __future__ import annotations
import codecs
from pathlib import Path
from typing import Any
import chardet
from chardet import UniversalDetector
# ─────────────────────────────────────────────────────────────────────────────
# Constants
# ─────────────────────────────────────────────────────────────────────────────
# Minimum confidence to trust chardet's detection
CONFIDENCE_THRESHOLD = 0.85
# Fallback encoding when detection fails or confidence is too low
FALLBACK_ENCODING = "utf-8"
# Known encoding aliases that Python's codec doesn't accept — normalize them
_ENCODING_ALIASES: dict[str, str] = {
    "gb2312": "gbk",
    "x-sjis": "shift_jis",
    "iso-8859-8-i": "iso-8859-8",
    "iso-8859-8-e": "iso-8859-8",
    "tis-620": "cp874",
}
# ─────────────────────────────────────────────────────────────────────────────
# 1. Detection helpers
# ─────────────────────────────────────────────────────────────────────────────
def detect(data: bytes) -> dict[str, Any]:
    """
    Detect the encoding of a byte string.
    Returns {"encoding": str | None, "confidence": float, "language": str}.
    encoding is None if chardet cannot detect.
    """
    return chardet.detect(data)
def detect_all(data: bytes, ignore_confidence: bool = False) -> list[dict[str, Any]]:
    """
    Return all candidate encodings ranked by confidence.
    ignore_confidence=True: includes low-confidence candidates.
    """
    return chardet.detect_all(data, ignore_threshold=ignore_confidence)
def best_encoding(
    data: bytes,
    threshold: float = CONFIDENCE_THRESHOLD,
    fallback: str = FALLBACK_ENCODING,
) -> str:
    """
    Return the most likely encoding, or fallback if confidence is below threshold.
    Normalizes encoding aliases to Python codec names.
    """
    result = chardet.detect(data)
    enc = result.get("encoding") or ""
    conf = result.get("confidence") or 0.0
    if not enc or conf < threshold:
        return fallback
    return _normalize_encoding(enc)
def _normalize_encoding(enc: str) -> str:
    """Normalize encoding name to one Python's codecs module accepts."""
    normalized = enc.lower().replace(" ", "-")
    normalized = _ENCODING_ALIASES.get(normalized, normalized)
    # Verify Python accepts this codec
    try:
        codecs.lookup(normalized)
        return normalized
    except LookupError:
        return FALLBACK_ENCODING
# ─────────────────────────────────────────────────────────────────────────────
# 2. Safe decoding
# ─────────────────────────────────────────────────────────────────────────────
def decode(
    data: bytes,
    threshold: float = CONFIDENCE_THRESHOLD,
    fallback: str = FALLBACK_ENCODING,
    errors: str = "replace",
) -> str:
    """
    Detect encoding and decode bytes to str.
    Falls back to `fallback` encoding if confidence is below threshold.
    errors="replace": replace undecodable bytes with U+FFFD.
    errors="ignore": silently drop undecodable bytes.
    """
    enc = best_encoding(data, threshold=threshold, fallback=fallback)
    return data.decode(enc, errors=errors)
def decode_with_info(
    data: bytes,
    threshold: float = CONFIDENCE_THRESHOLD,
    fallback: str = FALLBACK_ENCODING,
) -> dict[str, Any]:
    """
    Detect, decode, and return metadata.
    Returns {"text", "encoding", "confidence", "language", "used_fallback"}.
    """
    result = chardet.detect(data)
    raw_enc = result.get("encoding") or ""
    conf = result.get("confidence") or 0.0
    language = result.get("language") or ""
    used_fallback = not raw_enc or conf < threshold
    enc = _normalize_encoding(raw_enc) if not used_fallback else fallback
    text = data.decode(enc, errors="replace")
    return {
        "text": text,
        "encoding": enc,
        "confidence": conf,
        "language": language,
        "used_fallback": used_fallback,
    }
# ─────────────────────────────────────────────────────────────────────────────
# 3. File helpers
# ─────────────────────────────────────────────────────────────────────────────
def detect_file_encoding(path: str | Path, sample_bytes: int = 65536) -> dict[str, Any]:
    """
    Detect the encoding of a file without reading all of it.
    sample_bytes: read at most this many bytes for detection (default 64 KB).
    For large files, use detect_file_encoding_stream() instead.
    """
    path = Path(path)
    with path.open("rb") as fh:
        raw = fh.read(sample_bytes)
    return chardet.detect(raw)
def detect_file_encoding_stream(path: str | Path) -> dict[str, Any]:
    """
    Detect encoding of a large file by streaming through UniversalDetector.
    Stops early once the detector is confident. Safe for files of any size.
    """
    detector = UniversalDetector()
    path = Path(path)
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            detector.feed(chunk)
            if detector.done:
                break
    detector.close()
    return detector.result
def read_text_file(
    path: str | Path,
    threshold: float = CONFIDENCE_THRESHOLD,
    fallback: str = FALLBACK_ENCODING,
) -> str:
    """
    Read a text file with automatic encoding detection.
    Returns the decoded string content.
    """
    raw = Path(path).read_bytes()
    return decode(raw, threshold=threshold, fallback=fallback)
def read_text_file_with_info(path: str | Path) -> dict[str, Any]:
    """
    Read a text file and return content plus encoding metadata.
    Returns {"text", "encoding", "confidence", "language", "used_fallback", "path"}.
    """
    raw = Path(path).read_bytes()
    result = decode_with_info(raw)
    result["path"] = str(path)
    return result
# ─────────────────────────────────────────────────────────────────────────────
# 4. Batch file processing
# ─────────────────────────────────────────────────────────────────────────────
def detect_directory(
    directory: str | Path,
    pattern: str = "*.txt",
    sample_bytes: int = 65536,
) -> list[dict[str, Any]]:
    """
    Detect encodings for all files matching a glob pattern in a directory.
    Returns [{"path", "encoding", "confidence", "language"}].
    """
    results = []
    for path in sorted(Path(directory).glob(pattern)):
        info = detect_file_encoding(path, sample_bytes=sample_bytes)
        info["path"] = str(path)
        results.append(info)
    return results
def find_non_utf8_files(
    directory: str | Path,
    pattern: str = "**/*.txt",
) -> list[dict[str, Any]]:
    """
    Return files in a directory that are not UTF-8 encoded.
    Useful for auditing a codebase or content directory.
    """
    non_utf8 = []
    for path in Path(directory).glob(pattern):
        try:
            path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, ValueError):
            info = detect_file_encoding(path)
            info["path"] = str(path)
            non_utf8.append(info)
    return non_utf8
# ─────────────────────────────────────────────────────────────────────────────
# 5. CSV / pandas integration
# ─────────────────────────────────────────────────────────────────────────────
def read_csv_auto(
    path: str | Path,
    sample_bytes: int = 65536,
    threshold: float = CONFIDENCE_THRESHOLD,
    **pandas_kwargs: Any,
):
    """
    Read a CSV file with automatic encoding detection.
    Passes extra kwargs to pd.read_csv (e.g. sep, header, dtype).
    """
    import pandas as pd
    raw_path = Path(path)
    # Read only a sample for detection instead of loading the whole file
    with raw_path.open("rb") as fh:
        raw = fh.read(sample_bytes)
    enc = best_encoding(raw, threshold=threshold)
    return pd.read_csv(raw_path, encoding=enc, **pandas_kwargs)
def detect_csv_encodings(paths: list[str | Path]) -> list[dict[str, Any]]:
    """Detect encodings for a list of CSV file paths."""
    return [{**detect_file_encoding(p), "path": str(p)} for p in paths]
# ─────────────────────────────────────────────────────────────────────────────
# 6. HTTP response decoding
# ─────────────────────────────────────────────────────────────────────────────
def decode_response_bytes(
    content: bytes,
    content_type: str = "",
    threshold: float = CONFIDENCE_THRESHOLD,
) -> str:
    """
    Decode HTTP response bytes, using the charset from Content-Type first,
    then falling back to chardet detection.
    content_type: e.g. "text/html; charset=windows-1252"
    """
    # Try to extract the charset from the Content-Type header
    if "charset=" in content_type:
        declared = content_type.split("charset=")[-1].split(";")[0].strip().strip("\"'")
        try:
            return content.decode(_normalize_encoding(declared), errors="replace")
        except (LookupError, UnicodeDecodeError):
            pass
    return decode(content, threshold=threshold)
# ─────────────────────────────────────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    test_strings = {
        "UTF-8 (ASCII)": "Hello World".encode("ascii"),
        "UTF-8 (accents)": "café naïve façade".encode("utf-8"),
        "Latin-1": "café naïve".encode("latin-1"),
        "Windows-1252": "café\x93quote\x94".encode("windows-1252"),
        "UTF-16 LE": "Hello".encode("utf-16-le"),
        "GB2312 (Chinese)": "北京上海".encode("gb2312"),
        "Shift-JIS (Japanese)": "東京大阪".encode("shift_jis"),
        "KOI8-R (Russian)": "Москва".encode("koi8-r"),
    }
    print("=== Detection ===")
    for label, data in test_strings.items():
        result = chardet.detect(data)
        enc = result.get("encoding") or "?"    # encoding can be None
        conf = result.get("confidence") or 0.0
        lang = result.get("language") or ""
        lang_str = f" [{lang}]" if lang else ""
        print(f"  {label:30} enc={enc:18} conf={conf:.2f}{lang_str}")
    print("\n=== Safe decode ===")
    for label, data in test_strings.items():
        text = decode(data)
        print(f"  {label:30} → {text!r:.50}")
    print("\n=== decode_with_info ===")
    result = decode_with_info("café naïve".encode("latin-1"))
    for k, v in result.items():
        if k != "text":
            print(f"  {k:15}: {v}")
    print(f"  {'text':15}: {result['text']!r}")
    print("\n=== detect_all ===")
    data = "café naïve".encode("latin-1")
    all_results = detect_all(data)
    for r in all_results[:4]:
        print(f"  enc={r.get('encoding') or '?':20} conf={r.get('confidence') or 0:.2f}")
    print("\n=== best_encoding fallback ===")
    short_data = "café".encode("latin-1")  # too short for confident detection
    enc = best_encoding(short_data, threshold=0.99, fallback="utf-8")
    print(f"  short sample with threshold=0.99 → {enc!r}")
    enc2 = best_encoding("café naïve".encode("utf-8"), threshold=0.85)
    print(f"  UTF-8 café → {enc2!r}")
For the charset-normalizer alternative: charset-normalizer (pip install charset-normalizer) is chardet's spiritual successor, with better detection accuracy for some encodings and no external C dependencies; it ships with requests as the default charset detector since requests 2.26. chardet is the older, more widely documented library. Both provide similar APIs (detect() returning {"encoding", "confidence", ...}), so charset-normalizer is close to a drop-in replacement if you hit chardet accuracy issues.

For the UnicodeDammit (BeautifulSoup) alternative: bs4.UnicodeDammit(raw).unicode_markup detects encoding and returns decoded text in one step, but it requires BeautifulSoup as a dependency and is tightly coupled to HTML/XML; chardet as a standalone library is lighter and appropriate for encoding detection in CSV, log files, and API responses outside of HTML parsing contexts.

The Claude Skills 360 bundle includes chardet skill sets covering chardet.detect() with encoding/confidence/language, detect_all() for ranked candidates, UniversalDetector streaming for large files, best_encoding() with confidence threshold, decode() and decode_with_info() for safe bytes-to-str, detect_file_encoding() and detect_file_encoding_stream(), read_text_file() and read_text_file_with_info(), detect_directory() and find_non_utf8_files() batch audit, read_csv_auto() pandas integration, decode_response_bytes() HTTP helper, and encoding alias normalization. Start with the free tier to try encoding detection code generation.
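Because the two libraries expose compatible detect() entry points, the swap can be sketched with a guarded import (a minimal illustration; which branch runs depends on what is installed in your environment):

```python
try:
    # charset-normalizer ships a chardet-compatible detect() wrapper
    from charset_normalizer import detect
except ImportError:
    from chardet import detect  # fall back if charset-normalizer is absent

result = detect("naïve café déjà vu".encode("utf-8"))
print(result["encoding"], result["confidence"])
```

Code written against this shape keeps working regardless of which detector is present, which is useful when migrating a codebase library by library.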