Scrapy is a production-grade web scraping framework. Install it with pip install scrapy, scaffold a project with scrapy startproject myproject, and generate a spider with scrapy genspider myspider example.com. A spider inherits scrapy.Spider, sets name and start_urls, and implements parse(self, response). CSS selectors: response.css("h1::text").get() and response.css("a::attr(href)").getall(); XPath: response.xpath("//div[@class='item']/text()").getall(). Follow links with yield response.follow(url, callback=self.parse_item). Define items as scrapy.Item subclasses (class ProductItem(scrapy.Item): title = scrapy.Field()) and populate them with an ItemLoader: loader = ItemLoader(item=ProductItem(), response=response), then loader.add_css("title", "h1::text") and loader.add_xpath("price", "//span[@class='price']/text()"), with processors such as input_processor=MapCompose(str.strip) and output_processor=TakeFirst(). A pipeline is a class with a process_item method that filters, validates, dedupes, and stores items. Tune settings like CONCURRENT_REQUESTS=16, DOWNLOAD_DELAY=0.5, RANDOMIZE_DOWNLOAD_DELAY=True, and AUTOTHROTTLE_ENABLED=True. Downloader middlewares implement process_request for proxy/UA rotation; spider middlewares implement process_spider_output. CrawlSpider: rules = [Rule(LinkExtractor(allow=r"/product/"), callback="parse_item", follow=True)]. Run with scrapy crawl myspider -o items.jsonl. Use scrapy-playwright for JavaScript-rendered pages. Configure feeds via FEEDS={"data.jsonl": {"format": "jsonlines"}}. Read stats with self.crawler.stats.get_value("item_scraped_count"). Claude Code generates Scrapy spiders, pipelines, middleware stacks, and CrawlSpider site scanners.
CLAUDE.md for Scrapy
## Scrapy Stack
- Version: scrapy >= 2.11
- Spider: class MySpider(scrapy.Spider): name | start_urls | parse(response)
- Selectors: response.css("sel::text").get() | .getall() | .xpath("//el").get()
- Follow: yield response.follow(url, callback=self.parse_page)
- Items: scrapy.Item subclass | ItemLoader with add_css/add_xpath/add_value
- Pipeline: process_item(item, spider) → validate/dedup/store
- Settings: CONCURRENT_REQUESTS | DOWNLOAD_DELAY | AUTOTHROTTLE_ENABLED=True
- Run: scrapy crawl spidername -o output.jsonl -s CLOSESPIDER_ITEMCOUNT=1000
Scrapy Web Scraping Pipeline
# scraping/scrapy_pipeline.py — production web scraping with Scrapy
# NOTE: This module shows patterns for all spider/pipeline/middleware components.
# In a real project, split into: spiders/, items.py, pipelines.py, middlewares.py, settings.py
from __future__ import annotations
import hashlib
import logging
import re
from datetime import datetime
from typing import Generator
import scrapy
from scrapy.http import Response, Request
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule
# Processors moved to the itemloaders package (a Scrapy dependency);
# scrapy.loader.processors was removed in Scrapy 2.x.
from itemloaders.processors import MapCompose, TakeFirst, Join
from scrapy.exceptions import DropItem
from itemadapter import ItemAdapter

logger = logging.getLogger(__name__)
# ── 1. Items ──────────────────────────────────────────────────────────────────
class ProductItem(scrapy.Item):
    """Structured item for product data."""

    # Identity
    url = scrapy.Field()
    source = scrapy.Field()
    # Content
    title = scrapy.Field()
    price = scrapy.Field()        # Cleaned float
    price_raw = scrapy.Field()    # Raw string before cleaning
    currency = scrapy.Field()
    description = scrapy.Field()
    images = scrapy.Field()       # List of absolute URLs
    categories = scrapy.Field()   # List of strings
    # Metadata
    scraped_at = scrapy.Field()
    item_id = scrapy.Field()      # Fingerprint for dedup
# ── 2. Basic Spider ───────────────────────────────────────────────────────────
class ProductSpider(scrapy.Spider):
    """
    Single-domain product scraper.
    Usage: scrapy crawl products -o products.jsonl -s CLOSESPIDER_ITEMCOUNT=500
    """

    name = "products"
    allowed_domains = ["example-shop.com"]
    start_urls = ["https://example-shop.com/products/"]
    custom_settings = {
        "CONCURRENT_REQUESTS": 8,
        "DOWNLOAD_DELAY": 1.0,
        "RANDOMIZE_DOWNLOAD_DELAY": True,
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 4,
        "ROBOTSTXT_OBEY": True,
        "COOKIES_ENABLED": False,
        "FEEDS": {"products.jsonl": {"format": "jsonlines", "overwrite": True}},
    }

    def parse(self, response: Response) -> Generator:
        """Parse product list page — follow links and paginate."""
        # Product card links
        for href in response.css("a.product-card::attr(href)").getall():
            yield response.follow(
                href, callback=self.parse_product, cb_kwargs={"source": response.url}
            )
        # Pagination
        next_page = response.css("a.pagination__next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response: Response, source: str = "") -> Generator:
        """Parse individual product page and yield ProductItem."""
        loader = ItemLoader(item=ProductItem(), response=response)
        loader.add_css("title", "h1.product-title::text", MapCompose(str.strip))
        loader.add_css("price_raw", "span.price::text", MapCompose(str.strip))
        loader.add_css(
            "description", "div.product-description p::text",
            MapCompose(str.strip), Join(" "),
        )
        loader.add_css("categories", "nav.breadcrumb a::text", MapCompose(str.strip))
        loader.add_xpath(
            "images", "//img[@class='product-image']/@src",
            MapCompose(response.urljoin),
        )
        loader.add_value("url", response.url)
        loader.add_value("source", source)
        loader.add_value("scraped_at", datetime.utcnow().isoformat())
        # Output processors: instance attributes are resolved at load_item() time
        loader.default_output_processor = TakeFirst()
        loader.categories_out = list  # Keep as list
        loader.images_out = list
        item = loader.load_item()
        item["item_id"] = hashlib.md5(response.url.encode()).hexdigest()
        yield item
# ── 3. CrawlSpider ────────────────────────────────────────────────────────────
class SiteCrawlSpider(CrawlSpider):
    """
    Full-site crawler using CrawlSpider rules.
    Follows links matching /product/ and /category/ patterns.
    """

    name = "site_crawler"
    allowed_domains = ["example-shop.com"]
    start_urls = ["https://example-shop.com/"]
    rules = (
        # Follow category pages, don't parse them as items
        Rule(LinkExtractor(allow=r"/category/"), follow=True),
        # Parse product pages as items
        Rule(
            LinkExtractor(allow=r"/product/\d+"),
            callback="parse_product",
            follow=False,
        ),
    )

    def parse_product(self, response: Response) -> Generator:
        yield {
            "url": response.url,
            "title": response.css("h1::text").get("").strip(),
            "price": response.css("[class*=price]::text").get("").strip(),
        }
# ── 4. Pipelines ─────────────────────────────────────────────────────────────
class PriceCleanPipeline:
    """Clean raw price strings → float."""

    PRICE_RE = re.compile(r"[\d,]+\.?\d*")

    def process_item(self, item: dict, spider) -> dict:
        adapter = ItemAdapter(item)
        price_raw = adapter.get("price_raw", "")
        if price_raw:
            match = self.PRICE_RE.search(price_raw.replace(",", ""))
            adapter["price"] = float(match.group()) if match else None
            # Detect currency symbol
            if "$" in price_raw:
                adapter["currency"] = "USD"
            elif "€" in price_raw:
                adapter["currency"] = "EUR"
            elif "£" in price_raw:
                adapter["currency"] = "GBP"
        return item


class DuplicateFilterPipeline:
    """Drop already-seen items by fingerprint (in-memory set)."""

    def open_spider(self, spider):
        self.seen_ids: set[str] = set()

    def process_item(self, item: dict, spider) -> dict:
        adapter = ItemAdapter(item)
        item_id = adapter.get("item_id")
        if item_id in self.seen_ids:
            raise DropItem(f"Duplicate item: {item_id}")
        if item_id:
            self.seen_ids.add(item_id)
        return item


class ValidationPipeline:
    """Drop items missing required fields."""

    REQUIRED = ["title", "url"]

    def process_item(self, item: dict, spider) -> dict:
        adapter = ItemAdapter(item)
        for field in self.REQUIRED:
            if not adapter.get(field):
                raise DropItem(f"Missing required field '{field}': {adapter.get('url', '')}")
        return item


class JsonLinesExportPipeline:
    """Append items to a JSONL file (alternative to Scrapy FEEDS)."""

    def __init__(self, output_path: str = "output.jsonl"):
        self.output_path = output_path

    @classmethod
    def from_crawler(cls, crawler):
        return cls(output_path=crawler.settings.get("JSONL_OUTPUT", "output.jsonl"))

    def open_spider(self, spider):
        import json
        self.file = open(self.output_path, "w", encoding="utf-8")
        self._json = json
        spider.logger.info(f"Writing items to {self.output_path}")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item: dict, spider) -> dict:
        line = self._json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item
# ── 5. Middleware ─────────────────────────────────────────────────────────────
class RotatingUserAgentMiddleware:
    """
    Rotate User-Agent strings on every request.
    Enable in settings: DOWNLOADER_MIDDLEWARES = {"...RotatingUserAgent...": 400}
    """

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
    ]
    _idx = 0

    def process_request(self, request: Request, spider) -> None:
        request.headers["User-Agent"] = self.USER_AGENTS[self._idx % len(self.USER_AGENTS)]
        self.__class__._idx += 1


class ProxyMiddleware:
    """
    Route requests through a rotating proxy list.
    Enable in settings: DOWNLOADER_MIDDLEWARES = {"...ProxyMiddleware...": 410}
    """

    def __init__(self, proxies: list[str]):
        self.proxies = proxies
        self._idx = 0

    @classmethod
    def from_crawler(cls, crawler):
        proxies = crawler.settings.getlist("PROXY_LIST", [])
        return cls(proxies)

    def process_request(self, request: Request, spider) -> None:
        if self.proxies:
            proxy = self.proxies[self._idx % len(self.proxies)]
            request.meta["proxy"] = proxy
            self._idx += 1


class RetryOnStatusMiddleware:
    """Retry requests that return specific status codes."""

    RETRY_CODES = {429, 503, 520, 521, 522, 524}

    def process_response(self, request: Request, response: Response, spider) -> Response | Request:
        if response.status in self.RETRY_CODES:
            retries = request.meta.get("retry_count", 0)
            if retries < 3:
                new_request = request.copy()
                new_request.meta["retry_count"] = retries + 1
                # Bypass the dupefilter, or the retried URL is dropped as already seen
                new_request.dont_filter = True
                spider.logger.warning(
                    f"Retry {retries + 1}/3 for {request.url} (HTTP {response.status})"
                )
                return new_request
        return response
# ── 6. Settings template ──────────────────────────────────────────────────────
RECOMMENDED_SETTINGS = """
# scrapy_pipeline/settings.py
BOT_NAME = "mybot"
SPIDER_MODULES = ["mybot.spiders"]
# Politeness
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1.0
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_TARGET_CONCURRENCY = 3.0
# Output
FEEDS = {
"%(name)s_%(time)s.jsonl": {"format": "jsonlines"},
}
# Pipelines (order 100–900, lower = runs first)
ITEM_PIPELINES = {
"mybot.pipelines.PriceCleanPipeline": 100,
"mybot.pipelines.ValidationPipeline": 200,
"mybot.pipelines.DuplicateFilterPipeline": 300,
"mybot.pipelines.JsonLinesExportPipeline": 800,
}
# Middlewares
DOWNLOADER_MIDDLEWARES = {
"scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
"mybot.middlewares.RotatingUserAgentMiddleware": 400,
"mybot.middlewares.ProxyMiddleware": 410,
"mybot.middlewares.RetryOnStatusMiddleware": 550,
}
# HTTP cache (dev/debugging only)
HTTPCACHE_ENABLED = False
HTTPCACHE_EXPIRATION_SECS = 3600
# Logging
LOG_LEVEL = "INFO"
"""
# ── Demo (run-from-script pattern) ────────────────────────────────────────────
if __name__ == "__main__":
    print("Scrapy Project Structure Demo")
    print("=" * 50)
    print("\n1. Create project: scrapy startproject myshop")
    print("2. Generate spider: scrapy genspider products example-shop.com")
    print("3. Run spider: scrapy crawl products -o products.jsonl")
    print("4. Limit items: scrapy crawl products -s CLOSESPIDER_ITEMCOUNT=100")
    print("\nPipeline order:")
    print("  Request → Spider → Item → Pipelines → Export")
    print("\nKey settings:")
    for line in RECOMMENDED_SETTINGS.split("\n"):
        if line.strip() and not line.startswith("#"):
            print(f"  {line.strip()}")
    print("\nScrapy shell for testing selectors:")
    print("  scrapy shell 'https://example-shop.com/product/123'")
    print("  >>> response.css('h1::text').get()")
    print("  >>> response.xpath('//span[@class=\"price\"]/text()').get()")
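For comparison with the lighter-weight approach, a single product page can be scraped without any Scrapy project at all. A sketch using requests + BeautifulSoup, assuming both packages are installed; the selectors mirror the hypothetical ProductSpider markup above, and parsing is split from fetching so it can be tried offline:

```python
from bs4 import BeautifulSoup

def parse_product_html(html: str, url: str) -> dict:
    """Extract title/price from product-page HTML (parsing only, no network)."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h1.product-title")
    price = soup.select_one("span.price")
    return {
        "url": url,
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
    }

def scrape_product(url: str) -> dict:
    """Fetch a live page and parse it."""
    import requests  # Imported here so parsing alone needs only beautifulsoup4
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    resp.raise_for_status()
    return parse_product_html(resp.text, url)
```

This is the whole program; no throttling, retries, dedup, or export come with it, which is exactly the gap Scrapy fills.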
For quick one-off scraping of single pages, requests + BeautifulSoup is the simpler choice. Scrapy pays off at scale: built-in AutoThrottle adapts the download rate to server response times, the CrawlSpider + LinkExtractor combination crawls entire sites without a hand-written link-following loop, and the pipeline/middleware architecture enforces clean separation between extraction, validation, and storage, preventing the common anti-pattern of one 500-line scraping script with mixed concerns. For JavaScript-heavy sites, plain Playwright or Selenium is better suited to pure browser automation, but scrapy-playwright integrates Playwright as a Scrapy download handler, so JavaScript pages are rendered transparently without losing Scrapy's scheduler, rate limiting, item pipelines, or stats collection, making it the production choice for large JS-heavy crawls. The Claude Skills 360 bundle includes Scrapy skill sets covering Spider and CrawlSpider, CSS and XPath selectors, ItemLoader with processors, price-cleaning pipelines, dedup and validation pipelines, user-agent and proxy rotation middlewares, retry-on-error middleware, recommended settings with AutoThrottle, and JSONL export. Start with the free tier to try web scraping code generation.
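The scrapy-playwright integration mentioned above is enabled through settings rather than spider rewrites. A minimal configuration sketch, per the scrapy-playwright documentation (the dict name here is just a convenient wrapper for illustration):

```python
# Settings to route requests through Playwright via scrapy-playwright.
PLAYWRIGHT_SETTINGS = {
    # Replace the default HTTP/HTTPS download handlers with Playwright's.
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    # scrapy-playwright requires the asyncio-based Twisted reactor.
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}

# In a spider, opt individual requests into browser rendering:
#   yield scrapy.Request(url, meta={"playwright": True})
# Requests without the flag still use the plain (fast) HTTP downloader.
```

Because rendering is opt-in per request, a crawl can use Playwright only for the JS-heavy pages and keep the cheap HTTP path for everything else.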