Python’s urllib.robotparser module parses robots.txt files and answers whether a given user-agent may fetch a given URL. from urllib.robotparser import RobotFileParser. Create and fetch: rp = RobotFileParser(); rp.set_url("https://example.com/robots.txt"); rp.read() — fetches and parses; or rp.parse(lines) for pre-fetched content. Query: rp.can_fetch("*", url) → bool — "*" tests the wildcard agent; rp.can_fetch("Googlebot", url) → bool — agent-specific rule. Rate hints: rp.crawl_delay("*") → int | None — seconds to wait between requests; rp.request_rate("*") → RequestRate(requests=N, seconds=M) | None — N requests per M seconds. Freshness: rp.mtime() → float (time.time() of last read); rp.modified() → sets mtime to now (for cache freshness tracking). Site-specific: agent names are case-insensitive; the stdlib parser applies rules in file order (first matching prefix wins — not the RFC 9309 longest-match rule); Disallow: / blocks everything; an Allow: line overrides a Disallow: only when it precedes it in the file; * wildcards in paths are not supported by the stdlib parser (paths match as literal prefixes). Sitemaps: rp.site_maps() → list[str] | None returns the file's Sitemap: URLs (stdlib since Python 3.8). Claude Code generates polite crawlers, site scraper guards, fetch policy validators, and crawl rate limiters.
CLAUDE.md for urllib.robotparser
## urllib.robotparser Stack
- Stdlib: from urllib.robotparser import RobotFileParser
- Fetch: rp = RobotFileParser()
- rp.set_url("https://example.com/robots.txt")
- rp.read() # HTTP GET
- Query: rp.can_fetch("*", url) # True = allowed
- rp.can_fetch("MyCrawler", url)
- Rate: delay = rp.crawl_delay("*") # int seconds or None
- rate = rp.request_rate("*") # RequestRate(N, M) or None
- Parse: rp.parse(["User-agent: *", "Disallow: /admin"])
urllib.robotparser Robots.txt Pipeline
# app/robotsutil.py — fetch, cache, check, rate-limit, policy report
from __future__ import annotations
import time
import urllib.error
import urllib.parse
import urllib.request
from dataclasses import dataclass, field
from pathlib import Path
from urllib.robotparser import RobotFileParser
# ─────────────────────────────────────────────────────────────────────────────
# 1. Fetch and cache helpers
# ─────────────────────────────────────────────────────────────────────────────
def robots_url_for(site_url: str) -> str:
    """
    Build the canonical robots.txt URL for any URL on a site.

    Example:
        robots_url_for("https://example.com/path/page")
        # 'https://example.com/robots.txt'
    """
    parts = urllib.parse.urlsplit(site_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"
def fetch_robots(
    robots_url: str,
    user_agent: str = "Mozilla/5.0 (compatible; PythonCrawler/1.0)",
    timeout: int = 10,
) -> RobotFileParser:
    """
    Fetch and parse a robots.txt file from a URL.

    Mirrors the error semantics of RobotFileParser.read():
      * HTTP 401/403 -> everything disallowed (site forbids robots access)
      * other 4xx (e.g. 404 "no robots.txt") -> everything allowed
      * unreachable host / timeout / bad URL -> everything allowed (best effort)

    Returns a RobotFileParser ready for can_fetch() queries.

    Example:
        rp = fetch_robots("https://example.com/robots.txt")
        print(rp.can_fetch("*", "https://example.com/private"))
    """
    rp = RobotFileParser()
    rp.set_url(robots_url)
    try:
        req = urllib.request.Request(
            robots_url,
            headers={"User-Agent": user_agent},
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            content = resp.read().decode("utf-8", errors="replace")
        rp.parse(content.splitlines())
        rp.modified()
    except urllib.error.HTTPError as err:
        # Same policy as RobotFileParser.read(): an auth error means the
        # site refuses robots access entirely; any other 4xx means the
        # file simply does not exist, so no rules apply.
        if err.code in (401, 403):
            rp.parse(["User-agent: *", "Disallow: /"])
        else:
            rp.parse(["User-agent: *", "Allow: /"])
    except Exception:
        # Network failure or malformed URL -> allow all (best effort).
        rp.parse(["User-agent: *", "Allow: /"])
    return rp
def parse_robots_text(text: str) -> RobotFileParser:
    """
    Build a RobotFileParser from an in-memory robots.txt string (no HTTP).

    Example:
        rp = parse_robots_text('''
        User-agent: *
        Disallow: /admin
        Crawl-delay: 2
        ''')
    """
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser
# ─────────────────────────────────────────────────────────────────────────────
# 2. Per-domain robot cache
# ─────────────────────────────────────────────────────────────────────────────
@dataclass
class RobotCache:
    """
    Per-origin cache of parsed RobotFileParser objects with a TTL.

    Each origin (scheme + netloc) triggers at most one robots.txt fetch
    per `ttl_seconds` window; stale entries are refetched on demand.
    """
    ttl_seconds: int = 3600
    # origin -> (parser, unix timestamp of the fetch)
    _cache: dict[str, tuple[RobotFileParser, float]] = field(
        default_factory=dict, repr=False
    )

    def _origin(self, url: str) -> str:
        split = urllib.parse.urlparse(url)
        return f"{split.scheme}://{split.netloc}"

    def get(self, url: str) -> RobotFileParser:
        """
        Return a (possibly cached) RobotFileParser for the URL's origin.

        Example:
            cache = RobotCache(ttl_seconds=300)
            rp = cache.get("https://example.com/page")
            print(rp.can_fetch("*", "https://example.com/page"))
        """
        origin = self._origin(url)
        cached = self._cache.get(origin)
        if cached is not None:
            parser, stamp = cached
            if time.time() - stamp < self.ttl_seconds:
                return parser
        parser = fetch_robots(f"{origin}/robots.txt")
        self._cache[origin] = (parser, time.time())
        return parser

    def can_fetch(self, user_agent: str, url: str) -> bool:
        """True if cached robots.txt for url's origin permits user_agent."""
        return self.get(url).can_fetch(user_agent, url)

    def crawl_delay(self, url: str, user_agent: str = "*") -> float:
        """Crawl delay in seconds for the URL's origin, or 0.0 if unset."""
        parser = self.get(url)
        delay = parser.crawl_delay(user_agent)
        if delay is not None:
            return float(delay)
        rate = parser.request_rate(user_agent)
        if rate is not None:
            # Derive an average spacing from "N requests per M seconds".
            return rate.seconds / max(rate.requests, 1)
        return 0.0

    def clear(self) -> None:
        """Drop every cached entry."""
        self._cache.clear()
# ─────────────────────────────────────────────────────────────────────────────
# 3. Policy report
# ─────────────────────────────────────────────────────────────────────────────
@dataclass
class RobotPolicy:
    """Human-readable summary of a robots.txt policy for one user-agent."""
    origin: str
    agent: str
    crawl_delay: float | None  # seconds between requests, if declared
    request_rate: str | None  # formatted "N/M sec" string, if declared
    test_results: list[tuple[str, bool]]  # (url, allowed) pairs

    def __str__(self) -> str:
        out = [f"Origin: {self.origin} Agent: {self.agent}"]
        if self.crawl_delay is not None:
            out.append(f" Crawl-delay: {self.crawl_delay}s")
        if self.request_rate:
            out.append(f" Request-rate: {self.request_rate}")
        out.extend(
            f" {'✓' if ok else '✗'} {url}" for url, ok in self.test_results
        )
        return "\n".join(out)
def inspect_policy(
    rp: RobotFileParser,
    origin: str,
    agent: str,
    test_paths: list[str],
) -> RobotPolicy:
    """
    Build a RobotPolicy report for a user-agent against a parsed robots.txt.

    Example:
        rp = parse_robots_text("User-agent: *\\nDisallow: /admin")
        report = inspect_policy(rp, "https://example.com", "*",
                                ["/", "/admin", "/about"])
        print(report)
    """
    delay = rp.crawl_delay(agent)
    rate = rp.request_rate(agent)
    checks = [
        (full_url, rp.can_fetch(agent, full_url))
        for full_url in (urllib.parse.urljoin(origin, p) for p in test_paths)
    ]
    return RobotPolicy(
        origin=origin,
        agent=agent,
        crawl_delay=None if delay is None else float(delay),
        request_rate=None if rate is None else f"{rate.requests}/{rate.seconds}s",
        test_results=checks,
    )
# ─────────────────────────────────────────────────────────────────────────────
# 4. Polite crawl rate limiter
# ─────────────────────────────────────────────────────────────────────────────
class PoliteCrawler:
    """
    Robots-aware fetch gatekeeper with per-origin crawl-delay pacing.

    Example:
        crawler = PoliteCrawler("MyCrawler/1.0", default_delay=1.0)
        for url in urls:
            if crawler.may_fetch(url):
                crawler.wait(url)
                content = fetch(url)
                crawler.record_fetch(url)
    """

    def __init__(
        self,
        user_agent: str,
        default_delay: float = 1.0,
        robots_ttl: int = 3600,
    ):
        self.user_agent = user_agent
        self.default_delay = default_delay
        self._cache = RobotCache(ttl_seconds=robots_ttl)
        # origin -> unix timestamp of the most recent recorded fetch
        self._last_fetch: dict[str, float] = {}

    def _origin(self, url: str) -> str:
        parts = urllib.parse.urlparse(url)
        return f"{parts.scheme}://{parts.netloc}"

    def may_fetch(self, url: str) -> bool:
        """True when robots.txt permits this user-agent to fetch url."""
        return self._cache.can_fetch(self.user_agent, url)

    def seconds_to_wait(self, url: str) -> float:
        """How long to sleep before fetching url (honours Crawl-delay)."""
        required = self._cache.crawl_delay(url, self.user_agent) or self.default_delay
        since_last = time.time() - self._last_fetch.get(self._origin(url), 0.0)
        return max(0.0, required - since_last)

    def wait(self, url: str) -> None:
        """Block until the origin's crawl delay has elapsed."""
        pause = self.seconds_to_wait(url)
        if pause > 0:
            time.sleep(pause)

    def record_fetch(self, url: str) -> None:
        """Mark url's origin as fetched now, restarting its delay window."""
        self._last_fetch[self._origin(url)] = time.time()
# ─────────────────────────────────────────────────────────────────────────────
# Demo
# ─────────────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
print("=== urllib.robotparser demo ===")
# ── parse_robots_text ─────────────────────────────────────────────────────
print("\n--- parse_robots_text ---")
robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 2
Allow: /public/
User-agent: Googlebot
Disallow:
"""
rp = parse_robots_text(robots_txt)
origin = "https://example.com"
test_paths = ["/", "/about", "/admin/settings", "/private/data",
"/public/docs", "/admin/login"]
for path in test_paths:
url = f"{origin}{path}"
allowed = rp.can_fetch("*", url)
google_ok = rp.can_fetch("Googlebot", url)
print(f" {'✓' if allowed else '✗'} * | {'✓' if google_ok else '✗'} Googlebot {path}")
# ── inspect_policy ────────────────────────────────────────────────────────
print("\n--- inspect_policy ---")
report = inspect_policy(rp, origin, "*", test_paths[:5])
print(report)
# ── crawl_delay / request_rate ─────────────────────────────────────────────
print("\n--- crawl_delay / request_rate ---")
print(f" crawl_delay(*): {rp.crawl_delay('*')}")
# Robots with Request-rate
rp2 = parse_robots_text("User-agent: *\nRequest-rate: 3/10\nDisallow: /secret")
rate = rp2.request_rate("*")
print(f" request_rate(*): {rate}")
# ── robots_url_for ─────────────────────────────────────────────────────────
print("\n--- robots_url_for ---")
for url in [
"https://example.com/blog/post?id=1",
"https://shop.example.com/products",
"http://api.example.org/v2/users",
]:
print(f" {url}")
print(f" → {robots_url_for(url)}")
# ── PoliteCrawler demo ────────────────────────────────────────────────────
print("\n--- PoliteCrawler ---")
crawler = PoliteCrawler("TestBot/1.0", default_delay=0.5)
urls = [
f"{origin}/",
f"{origin}/about",
f"{origin}/admin/secret",
f"{origin}/public/page",
]
for url in urls:
# Use local rp instead of fetching
allowed = rp.can_fetch("*", url)
print(f" {'fetch' if allowed else 'skip ':5s} {url}")
print("\n=== done ===")
For the scrapy (PyPI) alternative — Scrapy’s RobotsTxtMiddleware automatically fetches, caches, and enforces robots.txt rules for every spider request via ROBOTSTXT_OBEY = True, with per-domain queuing and rate limiting built in — use Scrapy when running a full-featured production crawler; use urllib.robotparser for lightweight scripts, one-off scrapers, or any situation where Scrapy’s process model and configuration overhead is too heavy. For the reppy / robotexclusionrulesparser (PyPI) alternatives — these third-party parsers handle the RFC 9309 extended syntax (wildcards *, end-of-URL $) and longest-match rule precedence more completely than the stdlib, whose parser matches literal prefixes in file order (the stdlib does expose Sitemap: URLs via site_maps() since Python 3.8) — use a third-party parser for production crawlers that need strict RFC 9309 compliance; use urllib.robotparser for basic allow/disallow checking where the extended syntax is not required. The Claude Skills 360 bundle includes urllib.robotparser skill sets covering robots_url_for()/fetch_robots()/parse_robots_text() fetch helpers, RobotCache TTL-based per-origin cache, RobotPolicy with inspect_policy() report generator, and PoliteCrawler with may_fetch()/wait()/record_fetch() polite crawl rate limiter. Start with the free tier to try robots.txt patterns and urllib.robotparser pipeline code generation.