tldextract parses URLs into subdomain, domain, and TLD using the Public Suffix List. Install with pip install tldextract. Basic usage: import tldextract; r = tldextract.extract("https://www.google.com/path") gives r.subdomain == "www", r.domain == "google", r.suffix == "com", r.registered_domain == "google.com", and r.fqdn == "www.google.com". Multi-part TLDs work automatically: tldextract.extract("blog.example.co.uk") yields subdomain="blog", domain="example", suffix="co.uk". With no subdomain, tldextract.extract("github.com") yields subdomain="" and domain="github". IDN hosts are supported: tldextract.extract("münchen.de") handles punycode. Configuration: extract = tldextract.TLDExtract(cache_dir=".tldextract_cache"). Private domains: tldextract.TLDExtract(include_psl_private_domains=True) treats github.io, blogspot.com, etc. as suffixes, so "mysite.github.io" becomes subdomain="", domain="mysite", suffix="github.io". IPv4: tldextract.extract("http://192.168.1.1/") yields domain="192.168.1.1" with empty subdomain and suffix. Localhost: tldextract.extract("http://localhost:8000") yields domain="localhost". You can pass a full URL string: the scheme, port, path, and query are stripped automatically. Extra TLDs: TLDExtract(extra_suffixes=["internal", "local"]). Batch: [tldextract.extract(url) for url in urls]. Offline: TLDExtract(suffix_list_urls=()) uses the bundled list only, with no network fetch. Claude Code generates tldextract URL parsers, domain normalizers, and log analyzers.
# CLAUDE.md for tldextract
## tldextract Stack
- Version: tldextract >= 5.1 | pip install tldextract
- Extract: r = tldextract.extract(url) → r.subdomain / r.domain / r.suffix
- Key: r.registered_domain → "example.co.uk" | r.fqdn → fully qualified hostname
- Multi TLD: auto-handles co.uk, com.au, ac.jp from Public Suffix List
- Private: TLDExtract(include_psl_private_domains=True) → github.io as TLD
- Cache: TLDExtract(cache_dir="/tmp/.tld") for Docker | suffix_list_urls=() for offline
- Batch: [tldextract.extract(url) for url in urls] — extract is picklable
## tldextract URL Domain Parsing Pipeline
# app/url_parse.py — tldextract domain extraction, normalization, and log analysis
from __future__ import annotations
import re
from collections import Counter
from typing import Any
from urllib.parse import urlparse
import tldextract
# ─────────────────────────────────────────────────────────────────────────────
# Shared extractor — configure once per application
# ─────────────────────────────────────────────────────────────────────────────
# Standard extractor: ICANN suffixes only; github.io parses as domain="github", suffix="io"
_extractor = tldextract.TLDExtract(
cache_dir=None, # no disk cache — uses bundled list
suffix_list_urls=(), # offline mode: don't fetch updated PSL
include_psl_private_domains=False,
)
# PSL-private extractor: github.io, blogspot.com treated as registrable TLDs
_private_extractor = tldextract.TLDExtract(
cache_dir=None,
suffix_list_urls=(),
include_psl_private_domains=True,
)
# ─────────────────────────────────────────────────────────────────────────────
# 1. Core extraction
# ─────────────────────────────────────────────────────────────────────────────
def extract(url: str, private_domains: bool = False) -> dict[str, str]:
"""
Extract subdomain, domain, suffix, and registered_domain from a URL.
private_domains=True: treats github.io/blogspot.com/etc. as TLDs,
so "mysite.github.io" → subdomain="", domain="mysite", suffix="github.io".
"""
ex = _private_extractor if private_domains else _extractor
r = ex(url)
return {
"subdomain": r.subdomain,
"domain": r.domain,
"suffix": r.suffix,
"registered_domain": r.registered_domain,
"fqdn": r.fqdn,
}
def registered_domain(url: str) -> str | None:
"""
Return the effective registered domain (eTLD+1).
This is the part you pay for at a domain registrar.
"www.google.com" → "google.com"
"blog.example.co.uk" → "example.co.uk"
    "mysite.github.io" → "github.io" (this helper always uses the standard extractor;
    use extract(url, private_domains=True) for the PSL-private view)
Returns None for IPs, localhost, and empty strings.
"""
r = _extractor(url)
return r.registered_domain or None
def apex_domain(url: str) -> str | None:
"""Alias for registered_domain — the apex/root domain without subdomain."""
return registered_domain(url)
def subdomain(url: str) -> str:
"""Return the subdomain part: 'www', 'api', 'mail', or '' for none."""
return _extractor(url).subdomain
def tld(url: str) -> str:
"""Return the TLD/suffix: 'com', 'co.uk', 'io', etc."""
return _extractor(url).suffix
# ─────────────────────────────────────────────────────────────────────────────
# 2. URL normalization
# ─────────────────────────────────────────────────────────────────────────────
def normalize_url(url: str) -> str:
    """
    Normalize a URL to a canonical form:
    - ensure an https:// scheme
    - lowercase the hostname
    - drop the port, path, query, and fragment
    """
    url = url.strip()
    # Add scheme if missing so urlparse puts the host in netloc
    if "://" not in url:
        url = "https://" + url
    host = urlparse(url).netloc.split(":")[0].lower()  # strip port, lowercase
    return f"https://{host}"
def is_same_domain(url1: str, url2: str) -> bool:
"""
Return True if two URLs belong to the same registered domain.
"www.example.com" and "api.example.com" → same registered domain "example.com".
"""
d1 = registered_domain(url1)
d2 = registered_domain(url2)
return bool(d1 and d2 and d1 == d2)
def is_subdomain_of(child_url: str, parent_url: str) -> bool:
    """
    Return True if child_url's host is a strict subdomain of parent_url's host.
    "api.example.com" is_subdomain_of "example.com" → True.
    "api.example.com" is_subdomain_of "www.example.com" → False.
    """
    child = _extractor(child_url).fqdn
    parent = _extractor(parent_url).fqdn
    if not child or not parent:
        return False
    return child != parent and child.endswith("." + parent)
# ─────────────────────────────────────────────────────────────────────────────
# 3. Batch processing and log analysis
# ─────────────────────────────────────────────────────────────────────────────
def extract_batch(urls: list[str]) -> list[dict[str, str]]:
"""Extract domain components from a list of URLs."""
return [extract(url) for url in urls]
def count_domains(urls: list[str]) -> dict[str, int]:
"""Count occurrences of registered domains in a URL list."""
domains = [registered_domain(url) for url in urls]
valid = [d for d in domains if d]
return dict(Counter(valid).most_common())
def count_tlds(urls: list[str]) -> dict[str, int]:
"""Count occurrences of TLD suffixes."""
tlds = [tld(url) for url in urls]
valid = [t for t in tlds if t]
return dict(Counter(valid).most_common())
def group_by_domain(urls: list[str]) -> dict[str, list[str]]:
"""Group URLs by their registered domain."""
groups: dict[str, list[str]] = {}
for url in urls:
dom = registered_domain(url) or "(unknown)"
groups.setdefault(dom, []).append(url)
return groups
def top_domains_in_logs(log_lines: list[str], n: int = 20) -> list[tuple[str, int]]:
"""
Extract URLs from log lines and return the top-N domains by frequency.
Uses a simple URL regex — suitable for access logs and chat logs.
"""
_URL_RE = re.compile(r"https?://[^\s\"'>]+", re.IGNORECASE)
urls = [m.group() for line in log_lines for m in _URL_RE.finditer(line)]
counts = count_domains(urls)
return list(counts.items())[:n]
# ─────────────────────────────────────────────────────────────────────────────
# 4. Domain classification
# ─────────────────────────────────────────────────────────────────────────────
_COMMON_TECH_SLD = {
"github", "gitlab", "bitbucket", "heroku", "vercel", "netlify",
"cloudflare", "aws", "azure", "gcp",
}
_COMMON_EMAIL_PROVIDERS = {
"gmail", "yahoo", "outlook", "hotmail", "icloud", "protonmail",
}
def is_email_provider(url: str) -> bool:
"""Return True if the domain is a common email provider."""
r = _extractor(url)
return r.domain.lower() in _COMMON_EMAIL_PROVIDERS
def is_tech_platform(url: str) -> bool:
"""Return True if the domain is a known tech SaaS platform."""
r = _extractor(url)
return r.domain.lower() in _COMMON_TECH_SLD
def classify_url(url: str) -> str:
"""
Classify a URL as: 'ip', 'localhost', 'private', 'public', or 'invalid'.
"""
r = _extractor(url)
if not r.domain:
return "invalid"
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.netloc.split(":")[0]
    # Check loopback first, so 127.0.0.1 reports as localhost rather than ip
    if host == "localhost" or host.startswith("127."):
        return "localhost"
    # IP address
    if re.match(r"^\d{1,3}(\.\d{1,3}){3}$", host):
        return "ip"
    if not r.suffix:
        return "private"
    return "public"
# ─────────────────────────────────────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────────────────────────────────────
SAMPLE_URLS = [
"https://www.google.com/search?q=python",
"https://api.github.com/repos/user/repo",
"http://blog.example.co.uk/post/1",
"https://mysite.github.io/docs",
"ftp://subdomain.example.com",
"https://192.168.1.1/admin",
"http://localhost:8000/api",
"example.com", # no scheme
"HTTPS://WWW.AMAZON.CO.UK/dp/B001",
]
if __name__ == "__main__":
print("=== Domain extraction ===")
for url in SAMPLE_URLS:
r = extract(url)
print(f" {url[:45]:45} → reg={r['registered_domain']:20} sub={r['subdomain']:8} suf={r['suffix']}")
print("\n=== Private PSL domains ===")
    private_urls = ["mysite.github.io", "blog.example.blogspot.com", "mysite.pages.dev"]
for url in private_urls:
standard = registered_domain(url)
private = _private_extractor(url).registered_domain
print(f" {url:40} standard={standard:25} private={private}")
print("\n=== Normalization ===")
for url in ["google.com", "HTTP://WWW.GOOGLE.COM/", "https://google.com/path?q=1"]:
print(f" {url:40} → {normalize_url(url)}")
print("\n=== Domain counts (sample) ===")
counts = count_domains(SAMPLE_URLS)
for dom, n in list(counts.items())[:5]:
print(f" {dom:25} : {n}")
print("\n=== TLD counts ===")
tld_counts = count_tlds(SAMPLE_URLS)
for t, n in tld_counts.items():
print(f" .{t:12} : {n}")
print("\n=== Classification ===")
for url in SAMPLE_URLS:
cls = classify_url(url)
print(f" {url[:40]:40} → {cls}")
For the urllib.parse.urlparse alternative: urlparse("https://www.google.co.uk").netloc returns "www.google.co.uk" but does not know that "co.uk" is a two-part suffix, so you would need to implement Public Suffix List (PSL) logic yourself to extract "google.co.uk" as the eTLD+1. tldextract maintains the current Public Suffix List, handles multi-part TLDs (co.uk, com.au, gov.sg, etc.) and private domains (github.io, blogspot.com) automatically, and provides r.registered_domain, the "google.co.uk" part, in one call.

For the tld package alternative: the tld package also uses the PSL but has a different API and is less commonly used. tldextract is more widely downloaded, better maintained, and handles more edge cases (IPv4, punycode, localhost), and tldextract.extract() strips the scheme, port, and path before parsing, so you can pass a full URL without pre-processing.

The Claude Skills 360 bundle includes tldextract skill sets covering tldextract.extract() with subdomain/domain/suffix/registered_domain/fqdn, TLDExtract with cache_dir and suffix_list_urls=() for offline use, include_psl_private_domains for github.io/blogspot.com handling, the registered_domain() eTLD+1 helper, normalize_url() for canonical form, is_same_domain() and is_subdomain_of(), count_domains() and count_tlds() frequency analysis, group_by_domain() grouping, the top_domains_in_logs() log scanner, and classify_url() for ip/localhost/private/public classification. Start with the free tier to try URL domain parsing code generation.
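To make the urlparse comparison concrete, here is a toy sketch of the suffix-matching idea tldextract implements. The two-entry suffix set is illustrative only; the real Public Suffix List has thousands of entries plus wildcard and exception rules that this sketch ignores:

```python
# Toy eTLD+1 lookup over a hardcoded two-entry suffix set.
# Illustrative only: the real PSL has thousands of entries, including
# wildcard (*.ck) and exception (!www.ck) rules this sketch ignores.
from urllib.parse import urlparse

TOY_SUFFIXES = {"com", "co.uk"}

def toy_registered_domain(url: str) -> str:
    host = urlparse(url).netloc.split(":")[0].lower()
    labels = host.split(".")
    # Scan candidates from longest to shortest; keep one label beyond the suffix.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in TOY_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return host

print(toy_registered_domain("https://www.google.co.uk"))  # google.co.uk
print(toy_registered_domain("https://www.google.com"))    # google.com
```

urlparse alone stops at "www.google.co.uk"; everything after that host split is PSL bookkeeping, which is exactly what tldextract packages and keeps current for you.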