tldextract parses URLs into subdomain, domain, and TLD using the Public Suffix List. Install with pip install tldextract. Basic usage: import tldextract; r = tldextract.extract("https://www.google.com/path") gives r.subdomain == "www", r.domain == "google", r.suffix == "com", r.registered_domain == "google.com", and r.fqdn == "www.google.com". Multi-part TLDs work automatically: tldextract.extract("blog.example.co.uk") yields subdomain="blog", domain="example", suffix="co.uk". With no subdomain, tldextract.extract("github.com") yields subdomain="" and domain="github". IDN hosts are supported: tldextract.extract("münchen.de") handles punycode. Configuration: extract = tldextract.TLDExtract(cache_dir=".tldextract_cache"). Private domains: tldextract.TLDExtract(include_psl_private_domains=True) treats github.io, blogspot.com, etc. as suffixes, so "mysite.github.io" becomes subdomain="", domain="mysite", suffix="github.io". IPv4: tldextract.extract("http://192.168.1.1/") yields domain="192.168.1.1" with empty subdomain and suffix. Localhost: tldextract.extract("http://localhost:8000") yields domain="localhost". You can pass a full URL string: the scheme, port, path, and query are stripped automatically. Extra TLDs: TLDExtract(extra_suffixes=["internal", "local"]). Batch: [tldextract.extract(url) for url in urls]. Offline: TLDExtract(suffix_list_urls=()) uses the bundled list only, with no network fetch. Claude Code generates tldextract URL parsers, domain normalizers, and log analyzers.
# CLAUDE.md for tldextract
## tldextract Stack
- Version: tldextract >= 5.1 | pip install tldextract
- Extract: r = tldextract.extract(url) → r.subdomain / r.domain / r.suffix
- Key: r.registered_domain → "example.co.uk" | r.fqdn → fully qualified hostname
- Multi TLD: auto-handles co.uk, com.au, ac.jp from Public Suffix List
- Private: TLDExtract(include_psl_private_domains=True) → github.io as TLD
- Cache: TLDExtract(cache_dir="/tmp/.tld") for Docker | suffix_list_urls=() for offline
- Batch: [tldextract.extract(url) for url in urls] — extract is picklable
## tldextract URL Domain Parsing Pipeline
# app/url_parse.py — tldextract domain extraction, normalization, and log analysis
from __future__ import annotations
import re
from collections import Counter
from typing import Any
from urllib.parse import urlparse
import tldextract
# ─────────────────────────────────────────────────────────────────────────────
# Shared extractor — configure once per application
# ─────────────────────────────────────────────────────────────────────────────
# Standard extractor: ICANN suffixes only; github.io parses as domain="github", suffix="io"
_extractor = tldextract.TLDExtract(
cache_dir=None, # no disk cache — uses bundled list
suffix_list_urls=(), # offline mode: don't fetch updated PSL
include_psl_private_domains=False,
)
# PSL-private extractor: github.io, blogspot.com treated as registrable TLDs
_private_extractor = tldextract.TLDExtract(
cache_dir=None,
suffix_list_urls=(),
include_psl_private_domains=True,
)
# ─────────────────────────────────────────────────────────────────────────────
# 1. Core extraction
# ─────────────────────────────────────────────────────────────────────────────
def extract(url: str, private_domains: bool = False) -> dict[str, str]:
"""
Extract subdomain, domain, suffix, and registered_domain from a URL.
private_domains=True: treats github.io/blogspot.com/etc. as TLDs,
so "mysite.github.io" → subdomain="", domain="mysite", suffix="github.io".
"""
ex = _private_extractor if private_domains else _extractor
r = ex(url)
return {
"subdomain": r.subdomain,
"domain": r.domain,
"suffix": r.suffix,
"registered_domain": r.registered_domain,
"fqdn": r.fqdn,
}
def registered_domain(url: str) -> str | None:
"""
Return the effective registered domain (eTLD+1).
This is the part you pay for at a domain registrar.
"www.google.com" → "google.com"
"blog.example.co.uk" → "example.co.uk"
    "mysite.github.io" → "github.io" (this helper always uses the standard extractor;
    use extract(url, private_domains=True) for the PSL-private view)
Returns None for IPs, localhost, and empty strings.
"""
r = _extractor(url)
return r.registered_domain or None
def apex_domain(url: str) -> str | None:
"""Alias for registered_domain — the apex/root domain without subdomain."""
return registered_domain(url)
def subdomain(url: str) -> str:
"""Return the subdomain part: 'www', 'api', 'mail', or '' for none."""
return _extractor(url).subdomain
def tld(url: str) -> str:
"""Return the TLD/suffix: 'com', 'co.uk', 'io', etc."""
return _extractor(url).suffix
# ─────────────────────────────────────────────────────────────────────────────
# 2. URL normalization
# ─────────────────────────────────────────────────────────────────────────────
def normalize_url(url: str) -> str:
    """
    Normalize a URL to a canonical form:
    - ensure an https:// scheme
    - lowercase the hostname
    - drop the port, path, query, and fragment
    """
    url = url.strip()
    # Add scheme if missing so urlparse puts the host in netloc
    if "://" not in url:
        url = "https://" + url
    host = urlparse(url).netloc.split(":")[0].lower()  # strip port, lowercase
    return f"https://{host}"
def is_same_domain(url1: str, url2: str) -> bool:
"""
Return True if two URLs belong to the same registered domain.
"www.example.com" and "api.example.com" → same registered domain "example.com".
"""
d1 = registered_domain(url1)
d2 = registered_domain(url2)
return bool(d1 and d2 and d1 == d2)
def is_subdomain_of(child_url: str, parent_url: str) -> bool:
    """
    Return True if child_url's host is a strict subdomain of parent_url's host.
    "api.example.com" is_subdomain_of "example.com" → True.
    "api.example.com" is_subdomain_of "www.example.com" → False.
    """
    child = _extractor(child_url).fqdn
    parent = _extractor(parent_url).fqdn
    if not child or not parent:
        return False
    return child != parent and child.endswith("." + parent)
# ─────────────────────────────────────────────────────────────────────────────
# 3. Batch processing and log analysis
# ─────────────────────────────────────────────────────────────────────────────
def extract_batch(urls: list[str]) -> list[dict[str, str]]:
"""Extract domain components from a list of URLs."""
return [extract(url) for url in urls]
def count_domains(urls: list[str]) -> dict[str, int]:
"""Count occurrences of registered domains in a URL list."""
domains = [registered_domain(url) for url in urls]
valid = [d for d in domains if d]
return dict(Counter(valid).most_common())
def count_tlds(urls: list[str]) -> dict[str, int]:
"""Count occurrences of TLD suffixes."""
tlds = [tld(url) for url in urls]
valid = [t for t in tlds if t]
return dict(Counter(valid).most_common())
def group_by_domain(urls: list[str]) -> dict[str, list[str]]:
"""Group URLs by their registered domain."""
groups: dict[str, list[str]] = {}
for url in urls:
dom = registered_domain(url) or "(unknown)"
groups.setdefault(dom, []).append(url)
return groups
def top_domains_in_logs(log_lines: list[str], n: int = 20) -> list[tuple[str, int]]:
"""
Extract URLs from log lines and return the top-N domains by frequency.
Uses a simple URL regex — suitable for access logs and chat logs.
"""
_URL_RE = re.compile(r"https?://[^\s\"'>]+", re.IGNORECASE)
urls = [m.group() for line in log_lines for m in _URL_RE.finditer(line)]
counts = count_domains(urls)
return list(counts.items())[:n]
# ─────────────────────────────────────────────────────────────────────────────
# 4. Domain classification
# ─────────────────────────────────────────────────────────────────────────────
_COMMON_TECH_SLD = {
"github", "gitlab", "bitbucket", "heroku", "vercel", "netlify",
"cloudflare", "aws", "azure", "gcp",
}
_COMMON_EMAIL_PROVIDERS = {
"gmail", "yahoo", "outlook", "hotmail", "icloud", "protonmail",
}
def is_email_provider(url: str) -> bool:
"""Return True if the domain is a common email provider."""
r = _extractor(url)
return r.domain.lower() in _COMMON_EMAIL_PROVIDERS
def is_tech_platform(url: str) -> bool:
"""Return True if the domain is a known tech SaaS platform."""
r = _extractor(url)
return r.domain.lower() in _COMMON_TECH_SLD
def classify_url(url: str) -> str:
"""
Classify a URL as: 'ip', 'localhost', 'private', 'public', or 'invalid'.
"""
r = _extractor(url)
if not r.domain:
return "invalid"
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.netloc.split(":")[0]
    # Check loopback first, so 127.0.0.1 reports as localhost rather than ip
    if host == "localhost" or host.startswith("127."):
        return "localhost"
    # IP address
    if re.match(r"^\d{1,3}(\.\d{1,3}){3}$", host):
        return "ip"
    if not r.suffix:
        return "private"
    return "public"
# ─────────────────────────────────────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────────────────────────────────────
SAMPLE_URLS = [
"https://www.google.com/search?q=python",
"https://api.github.com/repos/user/repo",
"http://blog.example.co.uk/post/1",
"https://mysite.github.io/docs",
"ftp://subdomain.example.com",
"https://192.168.1.1/admin",
"http://localhost:8000/api",
"example.com", # no scheme
"HTTPS://WWW.AMAZON.CO.UK/dp/B001",
]
if __name__ == "__main__":
print("=== Domain extraction ===")
for url in SAMPLE_URLS:
r = extract(url)
print(f" {url[:45]:45} → reg={r['registered_domain']:20} sub={r['subdomain']:8} suf={r['suffix']}")
print("\n=== Private PSL domains ===")
    private_urls = ["mysite.github.io", "blog.example.blogspot.com", "mysite.pages.dev"]
for url in private_urls:
standard = registered_domain(url)
private = _private_extractor(url).registered_domain
print(f" {url:40} standard={standard:25} private={private}")
print("\n=== Normalization ===")
for url in ["google.com", "HTTP://WWW.GOOGLE.COM/", "https://google.com/path?q=1"]:
print(f" {url:40} → {normalize_url(url)}")
print("\n=== Domain counts (sample) ===")
counts = count_domains(SAMPLE_URLS)
for dom, n in list(counts.items())[:5]:
print(f" {dom:25} : {n}")
print("\n=== TLD counts ===")
tld_counts = count_tlds(SAMPLE_URLS)
for t, n in tld_counts.items():
print(f" .{t:12} : {n}")
print("\n=== Classification ===")
for url in SAMPLE_URLS:
cls = classify_url(url)
print(f" {url[:40]:40} → {cls}")
For the urllib.parse.urlparse alternative: urlparse("https://www.google.co.uk").netloc returns "www.google.co.uk" but does not know that "co.uk" is a two-part suffix, so you would need to implement Public Suffix List (PSL) logic yourself to extract "google.co.uk" as the eTLD+1. tldextract maintains the current Public Suffix List, handles multi-part TLDs (co.uk, com.au, gov.sg, etc.) and private domains (github.io, blogspot.com) automatically, and provides r.registered_domain, the "google.co.uk" part, in one call.

For the tld package alternative: the tld package also uses the PSL but has a different API and is less commonly used. tldextract is more widely downloaded, better maintained, and handles more edge cases (IPv4, punycode, localhost), and tldextract.extract() strips the scheme, port, and path before parsing, so you can pass a full URL without pre-processing.

The Claude Skills 360 bundle includes tldextract skill sets covering tldextract.extract() with subdomain/domain/suffix/registered_domain/fqdn, TLDExtract with cache_dir and suffix_list_urls=() for offline use, include_psl_private_domains for github.io/blogspot.com handling, the registered_domain() eTLD+1 helper, normalize_url() for canonical form, is_same_domain() and is_subdomain_of(), count_domains() and count_tlds() frequency analysis, group_by_domain() grouping, the top_domains_in_logs() log scanner, and classify_url() for ip/localhost/private/public classification. Start with the free tier to try URL domain parsing code generation.
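To make the urlparse comparison concrete, here is a toy sketch of the suffix-matching idea tldextract implements. The two-entry suffix set is illustrative only; the real Public Suffix List has thousands of entries plus wildcard and exception rules that this sketch ignores:

```python
# Toy eTLD+1 lookup over a hardcoded two-entry suffix set.
# Illustrative only: the real PSL has thousands of entries, including
# wildcard (*.ck) and exception (!www.ck) rules this sketch ignores.
from urllib.parse import urlparse

TOY_SUFFIXES = {"com", "co.uk"}

def toy_registered_domain(url: str) -> str:
    host = urlparse(url).netloc.split(":")[0].lower()
    labels = host.split(".")
    # Scan candidates from longest to shortest; keep one label beyond the suffix.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in TOY_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return host

print(toy_registered_domain("https://www.google.co.uk"))  # google.co.uk
print(toy_registered_domain("https://www.google.com"))    # google.com
```

urlparse alone stops at "www.google.co.uk"; everything after that host split is PSL bookkeeping, which is exactly what tldextract packages and keeps current for you.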