The regex module extends Python’s re with Unicode properties, fuzzy matching, and more. pip install regex. Drop-in: import regex as re — all re functions work. Unicode properties: regex.findall(r"\p{Letter}+", text) — matches Unicode letters. \p{Script=Latin} \p{Script=Han} \p{Category=Lu} \p{Block=CJK_Unified_Ideographs}. \P{Digit} — negated Unicode property. Fuzzy: regex.search(r"(?:word){e<=1}", text) — match “word” with up to 1 error. {s<=1,i<=1,d<=1} — substitutions, insertions, deletions. {e<=2} — up to 2 errors. {i<=1,d<=1,e<=2} — combined constraints. match.fuzzy_counts → (subs, ins, dels). match.fuzzy_changes → positions of changes. Overlapping: regex.findall(r"(?=(\d+))", "123", overlapped=True). Variable lookbehind: regex.search(r"(?<=a+)b", "aaab") — works in regex, not re. Possessive: r"a++b" — no backtrack. Atomic group: r"(?>a+)b". Branch reset: r"(?|(\w+)|(\d+))" — same group numbers for both. \K — keep nothing to left; resets start of match. regex.sub(r"pattern", repl, text, count=1, flags=regex.MULTILINE). Full-string match: regex.fullmatch(). Timeout: recent regex versions accept a timeout= argument on matching functions (raises TimeoutError); on older versions fall back to signal or concurrent.futures. Claude Code generates regex Unicode extractors, fuzzy search pipelines, and text normalization patterns.
CLAUDE.md for regex
## regex Stack
- Version: regex >= 2023 | pip install regex | import regex as re (drop-in)
- Unicode: \p{Letter}, \p{Script=Latin}, \p{Category=Lu}, \P{Digit}
- Fuzzy: r"(?:word){e<=1}" — 1 error | fuzzy_counts → (subs, ins, dels)
- Overlapping: findall(pattern, text, overlapped=True)
- Variable lookbehind: (?<=a+)b — not supported in stdlib re
- Possessive: a++ | atomic group: (?>a+) — prevent catastrophic backtracking
- Branch reset: (?|(\w+)|(\d+)) — same group number in alternatives
regex Advanced Pattern Matching Pipeline
# app/advanced_re.py — regex Unicode properties, fuzzy matching, and extraction
from __future__ import annotations
from typing import Any
import regex
# ─────────────────────────────────────────────────────────────────────────────
# 1. Unicode property matching
# ─────────────────────────────────────────────────────────────────────────────
# Compiled patterns for common Unicode categories
# \p{...} property escapes span every script (Latin, Han, Cyrillic, Arabic,
# ...), so these patterns work on multilingual text without per-language
# special cases.  Compiled once at import time for reuse.

# Identifier-style "word": a letter followed by letters, digits, or connector
# punctuation (e.g. underscore).  NOTE(review): not referenced by any function
# in this module — confirm whether it is part of the public surface.
_UNICODE_WORD = regex.compile(r"\p{Letter}[\p{Letter}\p{Number}\p{Connector_Punctuation}]*")
# One or more Unicode letters in any script (used by extract_words).
_UNICODE_LETTER = regex.compile(r"\p{Letter}+")
# Decimal digits in any script: ASCII 0-9, Eastern Arabic, Devanagari, ...
# (used by extract_numbers).
_UNICODE_DIGIT = regex.compile(r"\p{Decimal_Digit_Number}+")
# Latin-script runs; \p{Mark} keeps combining accents attached to their base
# letter.  NOTE(review): unused here (split_by_script compiles its own copy).
_LATIN_WORD = regex.compile(r"[\p{Script=Latin}\p{Mark}]+")
# Han-script (CJK ideograph) runs (used by extract_cjk).
_CJK = regex.compile(r"\p{Script=Han}+")
# NOTE(review): the next two are unused in this module as written.
_ARABIC = regex.compile(r"\p{Script=Arabic}+")
_CYRILLIC = regex.compile(r"\p{Script=Cyrillic}+")
def extract_words(text: str) -> list[str]:
    """
    Return every run of Unicode letters found in *text*.

    Script-agnostic: "München", "北京", and "Москва" all match, because the
    pattern is \\p{Letter}+ rather than ASCII [a-zA-Z]+.
    """
    return [m.group() for m in _UNICODE_LETTER.finditer(text)]
def extract_numbers(text: str) -> list[str]:
    """
    Return every run of decimal digits in *text*, in any Unicode script
    (ASCII numerals, Eastern Arabic digits, Devanagari digits, ...).
    """
    return [m.group() for m in _UNICODE_DIGIT.finditer(text)]
def extract_cjk(text: str) -> list[str]:
    """Return each contiguous run of Han-script (CJK) characters in *text*."""
    matches = _CJK.finditer(text)
    return [m.group() for m in matches]
# Script-tagged patterns for split_by_script, compiled once at import time.
# (Previously these eight patterns were recompiled on every call.)
_SCRIPT_PATTERNS = {
    "Han": regex.compile(r"\p{Script=Han}+"),
    # \p{Mark} keeps combining accents inside the Latin segment.
    "Latin": regex.compile(r"[\p{Script=Latin}\p{Mark}]+"),
    "Cyrillic": regex.compile(r"\p{Script=Cyrillic}+"),
    "Arabic": regex.compile(r"\p{Script=Arabic}+"),
    "Greek": regex.compile(r"\p{Script=Greek}+"),
    "Hiragana": regex.compile(r"\p{Script=Hiragana}+"),
    "Katakana": regex.compile(r"\p{Script=Katakana}+"),
    "Hangul": regex.compile(r"\p{Script=Hangul}+"),
}


def split_by_script(text: str) -> list[tuple[str, str]]:
    """
    Split text into (script, segment) pairs, in document order:
    "Hello 北京 World" → [("Latin", "Hello"), ("Han", "北京"), ("Latin", "World")]

    Only the scripts in _SCRIPT_PATTERNS are recognized; characters outside
    them (spaces, punctuation, digits) are skipped entirely.

    Args:
        text: Input string in any mix of scripts.

    Returns:
        List of (script_name, segment) tuples ordered by position in *text*.
    """
    matches: list[tuple[int, str, str]] = []
    for script_name, pat in _SCRIPT_PATTERNS.items():
        for m in pat.finditer(text):
            # Runs of different scripts cannot overlap, so the start offset
            # alone is enough to restore document order.
            matches.append((m.start(), script_name, m.group()))
    matches.sort(key=lambda item: item[0])
    return [(script, segment) for _, script, segment in matches]
def strip_diacritics(text: str) -> str:
    """
    Drop combining diacritical marks (accents, tildes, ...) from *text*.

    "café" → "cafe" | "naïve" → "naive"

    The string is first NFD-normalized so precomposed characters split into
    a base character plus combining marks; substituting away \\p{Mark} then
    deletes just the marks.
    """
    import unicodedata

    decomposed = unicodedata.normalize("NFD", text)
    return regex.sub(r"\p{Mark}", "", decomposed)
def normalize_whitespace(text: str) -> str:
    """
    Collapse runs of Unicode whitespace and control characters (non-breaking
    spaces, tabs, newlines, etc.) into a single ASCII space, trimming the ends.

    Bug fixed: the old pattern ``\\p{Separator}+|\\p{Control}`` replaced each
    control character individually, so "a\\n\\nb" became "a  b" (two spaces),
    contradicting this docstring.  A single character class under one ``+``
    quantifier collapses mixed separator/control runs into exactly one space.

    NOTE(review): zero-width characters such as U+200B are category Cf
    (Format), not Separator/Control, and are left untouched — confirm
    whether they should also be normalized.
    """
    return regex.sub(r"[\p{Separator}\p{Control}]+", " ", text).strip()
# ─────────────────────────────────────────────────────────────────────────────
# 2. Fuzzy matching
# ─────────────────────────────────────────────────────────────────────────────
def fuzzy_search(
    pattern: str,
    text: str,
    max_errors: int = 1,
) -> list[dict[str, Any]]:
    """
    Locate every fuzzy occurrence of *pattern* in *text*.

    A hit may differ from the literal pattern by at most *max_errors* total
    edits (substitutions + insertions + deletions), expressed via the regex
    module's {e<=N} fuzzy quantifier.  Matching runs with overlapped=True,
    so overlapping candidate spans are all reported.

    Returns:
        [{"match", "start", "end", "errors", "substitutions",
          "insertions", "deletions"}] — one dict per hit.
    """
    fuzzy_pat = regex.compile(rf"(?:{regex.escape(pattern)}){{e<={max_errors}}}")
    hits: list[dict[str, Any]] = []
    for m in fuzzy_pat.finditer(text, overlapped=True):
        subs, ins, dels = m.fuzzy_counts
        hits.append(
            {
                "match": m.group(),
                "start": m.start(),
                "end": m.end(),
                "errors": subs + ins + dels,
                "substitutions": subs,
                "insertions": ins,
                "deletions": dels,
            }
        )
    return hits
def fuzzy_match(
    pattern: str,
    text: str,
    max_substitutions: int = 1,
    max_insertions: int = 1,
    max_deletions: int = 1,
) -> dict[str, Any] | None:
    """
    Return the first fuzzy match of *pattern* in *text* under separate
    per-error-type budgets, or None when nothing fits the budgets.

    Uses the regex module's constraint syntax: (?:word){s<=1,i<=1,d<=1}
    permits at most 1 substitution, 1 insertion, and 1 deletion.
    """
    constraints = f"s<={max_substitutions},i<={max_insertions},d<={max_deletions}"
    fuzzy_pat = regex.compile(rf"(?:{regex.escape(pattern)}){{{constraints}}}")
    found = fuzzy_pat.search(text)
    if found is None:
        return None
    subs, ins, dels = found.fuzzy_counts
    return {
        "match": found.group(),
        "start": found.start(),
        "end": found.end(),
        "substitutions": subs,
        "insertions": ins,
        "deletions": dels,
    }
def find_typos(
    word: str,
    text: str,
    max_errors: int = 1,
) -> list[str]:
    """
    Collect the misspelled variants of *word* that occur in *text*.

    Delegates to fuzzy_search and returns each distinct matched spelling
    that differs from the exact word, sorted alphabetically.
    """
    variants: set[str] = set()
    for hit in fuzzy_search(word, text, max_errors=max_errors):
        if hit["match"] != word:
            variants.add(hit["match"])
    return sorted(variants)
# ─────────────────────────────────────────────────────────────────────────────
# 3. Overlapping matches
# ─────────────────────────────────────────────────────────────────────────────
def find_overlapping(pattern: str, text: str, flags: int = 0) -> list[str]:
    """
    Return every overlapping match of *pattern* in *text*.

    Standard re.findall only yields non-overlapping matches; with the regex
    module's overlapped=True:
    find_overlapping(r"\\d\\d", "12345") → ["12", "23", "34", "45"]
    """
    compiled = regex.compile(pattern, flags=flags)
    return compiled.findall(text, overlapped=True)
def find_overlapping_spans(pattern: str, text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, matched_text) for every overlapping match of *pattern*."""
    spans: list[tuple[int, int, str]] = []
    for m in regex.finditer(pattern, text, overlapped=True):
        spans.append((m.start(), m.end(), m.group()))
    return spans
# ─────────────────────────────────────────────────────────────────────────────
# 4. Variable-length lookbehind (not in stdlib re)
# ─────────────────────────────────────────────────────────────────────────────
def find_after_prefix(text: str, prefix_pattern: str, word_pattern: str) -> list[str]:
    """
    Find every occurrence of *word_pattern* immediately preceded by
    *prefix_pattern*, via a variable-length lookbehind.

    e.g. find_after_prefix(text, r"https?://", r"\\S+") grabs the part of a
    URL after its scheme.  Stdlib re rejects variable-length lookbehinds;
    the regex module supports them.
    """
    lookbehind = rf"(?<={prefix_pattern}){word_pattern}"
    return regex.compile(lookbehind).findall(text)
# ─────────────────────────────────────────────────────────────────────────────
# 5. Practical extraction patterns
# ─────────────────────────────────────────────────────────────────────────────
# Compiled for reuse
# Pragmatic (not RFC-5322-complete) email matcher; case-insensitive.
_EMAIL_RE = regex.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+", regex.IGNORECASE)
# http/https URLs: scheme plus a run of URL-legal characters.
_URL_RE = regex.compile(r"https?://[\w\-._~:/?#\[\]@!$&'()*+,;=%]+", regex.IGNORECASE)
# Hashtag/mention: marker, then a letter, then letters/digits/underscores.
# \p{Letter} makes these work in any script (#北京, @Москва).
_HASHTAG = regex.compile(r"#\p{Letter}[\p{Letter}\p{Number}_]*")
_MENTION = regex.compile(r"@\p{Letter}[\p{Letter}\p{Number}_]*")
# Loose international phone matcher: optional "+", then 7-20 phone-ish
# characters, ending in a digit.
# NOTE(review): unused in this module — no extract_phones() wrapper exists.
_PHONE_RE = regex.compile(r"\+?[\d\s\-\(\)]{7,20}\d")
def extract_emails(text: str) -> list[str]:
    """Return all email-like substrings found in *text*."""
    return [m.group() for m in _EMAIL_RE.finditer(text)]
def extract_urls(text: str) -> list[str]:
    """Return all http/https URLs found in *text*."""
    return [m.group() for m in _URL_RE.finditer(text)]
def extract_hashtags(text: str) -> list[str]:
    """Return hashtags in any Unicode script: #python #北京 #Москва."""
    return [m.group() for m in _HASHTAG.finditer(text)]
def extract_mentions(text: str) -> list[str]:
    """Return @-mentions, allowing letters from any Unicode script."""
    return [m.group() for m in _MENTION.finditer(text)]
def remove_emoji(text: str) -> str:
    """
    Delete emoji and similar pictographic characters from *text*.

    Strips code points in the So (symbol, other), Cs (surrogate), and
    Co (private use) general categories.
    """
    emoji_like = r"\p{So}|\p{Cs}|\p{Co}"
    return regex.sub(emoji_like, "", text)
def keep_only_letters_digits(text: str) -> str:
    """
    Keep only Unicode letters, decimal digits, and whitespace; everything
    else (punctuation, symbols, ...) is removed.
    """
    disallowed = r"[^\p{Letter}\p{Decimal_Digit_Number}\s]"
    return regex.sub(disallowed, "", text)
# ─────────────────────────────────────────────────────────────────────────────
# 6. Pandas integration
# ─────────────────────────────────────────────────────────────────────────────
def extract_column(df, column: str, pattern: str, new_column: str | None = None, group: int = 0):
    """
    Add a column to *df* holding the first regex match found in *column*.

    Args:
        df: pandas DataFrame — mutated in place and also returned.
        column: Name of the source column; cells are coerced via str().
        pattern: Regex pattern to search each cell with.
        new_column: Destination column name; defaults to "<column>_extracted".
        group: 0 for the whole match, 1+ for a capture group.

    Returns:
        The same DataFrame, with the new column added (None where no match).
    """
    target = f"{column}_extracted" if new_column is None else new_column
    compiled = regex.compile(pattern)

    def _first_match(cell):
        # str() coercion lets non-string cells (numbers, NaN) be searched.
        found = compiled.search(str(cell))
        return found.group(group) if found is not None else None

    df[target] = df[column].apply(_first_match)
    return df
def filter_rows_matching(df, column: str, pattern: str, fuzzy_errors: int = 0):
    """
    Select the rows of *df* whose *column* matches *pattern*.

    With fuzzy_errors > 0 the pattern is wrapped in the regex module's
    {e<=N} fuzzy quantifier so near-matches also pass.  The pattern is used
    as-is (not escaped), so regex metacharacters remain active.
    """
    if fuzzy_errors > 0:
        compiled = regex.compile(rf"(?:{pattern}){{e<={fuzzy_errors}}}")
    else:
        compiled = regex.compile(pattern)
    keep = df[column].apply(lambda cell: compiled.search(str(cell)) is not None)
    return df[keep]
# ─────────────────────────────────────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    # Smoke-test demo: exercises each section of the module and prints results.
    print("=== Unicode property matching ===")
    multilingual = "Hello 世界 Привет мир مرحبا café"
    print(f" extract_words: {extract_words(multilingual)}")
    print(f" extract_cjk: {extract_cjk(multilingual)}")
    print(f" split_by_script: {split_by_script(multilingual)}")

    print("\n=== Unicode properties ===")
    property_cases = [
        (r"\p{Letter}+", "Unicode letters", "café München 北京"),
        (r"\p{Script=Latin}+", "Latin script", "café München"),
        (r"\p{Uppercase_Letter}", "Uppercase", "Hello World ABC"),
        (r"\p{Decimal_Digit_Number}+", "Digits", "abc 123 ۱۲۳"),
    ]
    for prop_pat, label, sample in property_cases:
        matches = regex.findall(prop_pat, sample)
        print(f" {label:20}: {matches}")

    print("\n=== Fuzzy matching ===")
    typo_text = "The woord Python is programing langgauge."
    for target in ["word", "Python", "programming", "language"]:
        hits = fuzzy_search(target, typo_text, max_errors=1)
        if hits:
            first = hits[0]
            print(f" {target!r:15} → matched {first['match']!r:15} "
                  f"(s={first['substitutions']},i={first['insertions']},d={first['deletions']})")

    print("\n=== find_typos ===")
    british_spellings = ["colour", "behaviour", "analyse"]
    american_text = "Use colors and behaviors in your analysis"
    for spelling in british_spellings:
        variants = find_typos(spelling, american_text, max_errors=2)
        print(f" {spelling!r:15} → variants: {variants}")

    print("\n=== Overlapping ===")
    overlap_cases = [
        (r"\d\d", "12345"),
        (r"aa", "aaaa"),
        (r"[aeiou]", "beautiful"),
    ]
    for ov_pat, ov_text in overlap_cases:
        overlapping = find_overlapping(ov_pat, ov_text)
        print(f" {ov_pat!r:15} in {ov_text!r:15} → {overlapping}")

    print("\n=== Extraction ===")
    sample_text = "Email me at [email protected] or [email protected]. See https://example.com."
    print(f" emails: {extract_emails(sample_text)}")
    print(f" urls: {extract_urls(sample_text)}")
    social = "Hey @Alice and #python fans — check #北京 updates from @Москва"
    print(f" hashtags: {extract_hashtags(social)}")
    print(f" mentions: {extract_mentions(social)}")

    print("\n=== Strip diacritics ===")
    for accented in ["café", "naïve", "München", "Ångström", "señor"]:
        print(f" {accented!r:15} → {strip_diacritics(accented)!r}")

    print("\n=== remove_emoji ===")
    emoji_text = "Hello 👋 World 🌍 Python 🐍"
    print(f" {emoji_text!r} → {remove_emoji(emoji_text)!r}")
For the stdlib re alternative — Python’s re module doesn’t support Unicode property escapes (\p{Letter}), fuzzy matching, overlapping matches, or variable-length lookbehind; regex is exactly a superset — it passes all re tests and adds these features. The import regex as re idiom is a safe drop-in once you have the package installed. For the pyparsing alternative — pyparsing constructs grammars from composable Python objects and is better for structured parsing of domain-specific languages where whitespace and precedence rules matter; regex is better for pattern-based text extraction and transformation where a single expression can express what you need — they’re complementary, with pyparsing handling grammar-level parsing and regex handling extraction and search within text. The Claude Skills 360 bundle includes regex skill sets covering \p{Letter}/\p{Script=Latin}/\p{Script=Han} Unicode properties, extract_words()/extract_cjk()/split_by_script() multilingual extraction, strip_diacritics() with NFD + \p{Mark} removal, normalize_whitespace() Unicode whitespace collapse, fuzzy_search() and fuzzy_match() with error budgets, find_typos() variant finder, find_overlapping() and find_overlapping_spans(), variable-length lookbehind, extract_emails()/extract_urls()/extract_hashtags()/extract_mentions(), remove_emoji() \p{So} cleanup, keep_only_letters_digits() filter, and pandas extract_column()/filter_rows_matching(). Start with the free tier to try advanced Unicode regex code generation.