Scrapy is a production-grade web scraping framework. Install it with pip install scrapy, scaffold a project with scrapy startproject myproject, and generate a spider with scrapy genspider myspider example.com. A spider inherits scrapy.Spider, sets name and start_urls, and implements parse(self, response). CSS selectors: response.css("h1::text").get() and response.css("a::attr(href)").getall(); XPath: response.xpath("//div[@class='item']/text()").getall(). Follow links with yield response.follow(url, callback=self.parse_item). Define items as scrapy.Item subclasses (class ProductItem(scrapy.Item): title = scrapy.Field()) and populate them with an ItemLoader: loader = ItemLoader(item=ProductItem(), response=response), then loader.add_css("title", "h1::text") and loader.add_xpath("price", "//span[@class='price']/text()"), with processors such as input_processor=MapCompose(str.strip) and output_processor=TakeFirst(). A pipeline is a class with a process_item method that filters, validates, dedupes, and stores items. Tune settings like CONCURRENT_REQUESTS=16, DOWNLOAD_DELAY=0.5, RANDOMIZE_DOWNLOAD_DELAY=True, and AUTOTHROTTLE_ENABLED=True. Downloader middlewares implement process_request for proxy/UA rotation; spider middlewares implement process_spider_output. CrawlSpider: rules = [Rule(LinkExtractor(allow=r"/product/"), callback="parse_item", follow=True)]. Run with scrapy crawl myspider -o items.jsonl. Use scrapy-playwright for JavaScript-rendered pages. Configure feeds via FEEDS={"data.jsonl": {"format": "jsonlines"}}. Read stats with self.crawler.stats.get_value("item_scraped_count"). Claude Code generates Scrapy spiders, pipelines, middleware stacks, and CrawlSpider site scanners.
CLAUDE.md for Scrapy
## Scrapy Stack
- Version: scrapy >= 2.11
- Spider: class MySpider(scrapy.Spider): name | start_urls | parse(response)
- Selectors: response.css("sel::text").get() | .getall() | .xpath("//el").get()
- Follow: yield response.follow(url, callback=self.parse_page)
- Items: scrapy.Item subclass | ItemLoader with add_css/add_xpath/add_value
- Pipeline: process_item(item, spider) → validate/dedup/store
- Settings: CONCURRENT_REQUESTS | DOWNLOAD_DELAY | AUTOTHROTTLE_ENABLED=True
- Run: scrapy crawl spidername -o output.jsonl -s CLOSESPIDER_ITEMCOUNT=1000
Scrapy Web Scraping Pipeline
# scraping/scrapy_pipeline.py — production web scraping with Scrapy
# NOTE: This module shows patterns for all spider/pipeline/middleware components.
# In a real project, split into: spiders/, items.py, pipelines.py, middlewares.py, settings.py
from __future__ import annotations
import hashlib
import logging
import re
from datetime import datetime
from typing import Generator
import scrapy
from scrapy.http import Response, Request
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule
# Processors moved to the itemloaders package (a Scrapy dependency);
# scrapy.loader.processors was removed in Scrapy 2.x.
from itemloaders.processors import MapCompose, TakeFirst, Join
from scrapy.exceptions import DropItem
from itemadapter import ItemAdapter

logger = logging.getLogger(__name__)
# ── 1. Items ──────────────────────────────────────────────────────────────────
class ProductItem(scrapy.Item):
    """Structured item for product data."""

    # Identity
    url = scrapy.Field()
    source = scrapy.Field()
    # Content
    title = scrapy.Field()
    price = scrapy.Field()        # Cleaned float
    price_raw = scrapy.Field()    # Raw string before cleaning
    currency = scrapy.Field()
    description = scrapy.Field()
    images = scrapy.Field()       # List of absolute URLs
    categories = scrapy.Field()   # List of strings
    # Metadata
    scraped_at = scrapy.Field()
    item_id = scrapy.Field()      # Fingerprint for dedup
# ── 2. Basic Spider ───────────────────────────────────────────────────────────
class ProductSpider(scrapy.Spider):
    """
    Single-domain product scraper.
    Usage: scrapy crawl products -o products.jsonl -s CLOSESPIDER_ITEMCOUNT=500
    """

    name = "products"
    allowed_domains = ["example-shop.com"]
    start_urls = ["https://example-shop.com/products/"]
    custom_settings = {
        "CONCURRENT_REQUESTS": 8,
        "DOWNLOAD_DELAY": 1.0,
        "RANDOMIZE_DOWNLOAD_DELAY": True,
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 4,
        "ROBOTSTXT_OBEY": True,
        "COOKIES_ENABLED": False,
        "FEEDS": {"products.jsonl": {"format": "jsonlines", "overwrite": True}},
    }

    def parse(self, response: Response) -> Generator:
        """Parse product list page — follow links and paginate."""
        # Product card links
        for href in response.css("a.product-card::attr(href)").getall():
            yield response.follow(
                href, callback=self.parse_product, cb_kwargs={"source": response.url}
            )
        # Pagination
        next_page = response.css("a.pagination__next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_product(self, response: Response, source: str = "") -> Generator:
        """Parse individual product page and yield ProductItem."""
        loader = ItemLoader(item=ProductItem(), response=response)
        loader.add_css("title", "h1.product-title::text", MapCompose(str.strip))
        loader.add_css("price_raw", "span.price::text", MapCompose(str.strip))
        loader.add_css(
            "description", "div.product-description p::text",
            MapCompose(str.strip), Join(" "),
        )
        loader.add_css("categories", "nav.breadcrumb a::text", MapCompose(str.strip))
        loader.add_xpath(
            "images", "//img[@class='product-image']/@src",
            MapCompose(response.urljoin),
        )
        loader.add_value("url", response.url)
        loader.add_value("source", source)
        loader.add_value("scraped_at", datetime.utcnow().isoformat())
        # Output processors: instance attributes are resolved at load_item() time
        loader.default_output_processor = TakeFirst()
        loader.categories_out = list  # Keep as list
        loader.images_out = list
        item = loader.load_item()
        item["item_id"] = hashlib.md5(response.url.encode()).hexdigest()
        yield item
# ── 3. CrawlSpider ────────────────────────────────────────────────────────────
class SiteCrawlSpider(CrawlSpider):
    """
    Full-site crawler using CrawlSpider rules.
    Follows links matching /product/ and /category/ patterns.
    """

    name = "site_crawler"
    allowed_domains = ["example-shop.com"]
    start_urls = ["https://example-shop.com/"]
    rules = (
        # Follow category pages, don't parse them as items
        Rule(LinkExtractor(allow=r"/category/"), follow=True),
        # Parse product pages as items
        Rule(
            LinkExtractor(allow=r"/product/\d+"),
            callback="parse_product",
            follow=False,
        ),
    )

    def parse_product(self, response: Response) -> Generator:
        yield {
            "url": response.url,
            "title": response.css("h1::text").get("").strip(),
            "price": response.css("[class*=price]::text").get("").strip(),
        }
# ── 4. Pipelines ─────────────────────────────────────────────────────────────
class PriceCleanPipeline:
    """Clean raw price strings → float."""

    PRICE_RE = re.compile(r"[\d,]+\.?\d*")

    def process_item(self, item: dict, spider) -> dict:
        adapter = ItemAdapter(item)
        price_raw = adapter.get("price_raw", "")
        if price_raw:
            match = self.PRICE_RE.search(price_raw.replace(",", ""))
            adapter["price"] = float(match.group()) if match else None
            # Detect currency symbol
            if "$" in price_raw:
                adapter["currency"] = "USD"
            elif "€" in price_raw:
                adapter["currency"] = "EUR"
            elif "£" in price_raw:
                adapter["currency"] = "GBP"
        return item


class DuplicateFilterPipeline:
    """Drop already-seen items by fingerprint (in-memory set)."""

    def open_spider(self, spider):
        self.seen_ids: set[str] = set()

    def process_item(self, item: dict, spider) -> dict:
        adapter = ItemAdapter(item)
        item_id = adapter.get("item_id")
        if item_id in self.seen_ids:
            raise DropItem(f"Duplicate item: {item_id}")
        if item_id:
            self.seen_ids.add(item_id)
        return item


class ValidationPipeline:
    """Drop items missing required fields."""

    REQUIRED = ["title", "url"]

    def process_item(self, item: dict, spider) -> dict:
        adapter = ItemAdapter(item)
        for field in self.REQUIRED:
            if not adapter.get(field):
                raise DropItem(f"Missing required field '{field}': {adapter.get('url', '')}")
        return item


class JsonLinesExportPipeline:
    """Append items to a JSONL file (alternative to Scrapy FEEDS)."""

    def __init__(self, output_path: str = "output.jsonl"):
        self.output_path = output_path

    @classmethod
    def from_crawler(cls, crawler):
        return cls(output_path=crawler.settings.get("JSONL_OUTPUT", "output.jsonl"))

    def open_spider(self, spider):
        import json
        self.file = open(self.output_path, "w", encoding="utf-8")
        self._json = json
        spider.logger.info(f"Writing items to {self.output_path}")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item: dict, spider) -> dict:
        line = self._json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item
# ── 5. Middleware ─────────────────────────────────────────────────────────────
class RotatingUserAgentMiddleware:
    """
    Rotate User-Agent strings on every request.
    Enable in settings: DOWNLOADER_MIDDLEWARES = {"...RotatingUserAgent...": 400}
    """

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
    ]
    _idx = 0

    def process_request(self, request: Request, spider) -> None:
        request.headers["User-Agent"] = self.USER_AGENTS[self._idx % len(self.USER_AGENTS)]
        self.__class__._idx += 1


class ProxyMiddleware:
    """
    Route requests through a rotating proxy list.
    Enable in settings: DOWNLOADER_MIDDLEWARES = {"...ProxyMiddleware...": 410}
    """

    def __init__(self, proxies: list[str]):
        self.proxies = proxies
        self._idx = 0

    @classmethod
    def from_crawler(cls, crawler):
        proxies = crawler.settings.getlist("PROXY_LIST", [])
        return cls(proxies)

    def process_request(self, request: Request, spider) -> None:
        if self.proxies:
            proxy = self.proxies[self._idx % len(self.proxies)]
            request.meta["proxy"] = proxy
            self._idx += 1


class RetryOnStatusMiddleware:
    """Retry requests that return specific status codes."""

    RETRY_CODES = {429, 503, 520, 521, 522, 524}

    def process_response(self, request: Request, response: Response, spider) -> Response | Request:
        if response.status in self.RETRY_CODES:
            retries = request.meta.get("retry_count", 0)
            if retries < 3:
                new_request = request.copy()
                new_request.meta["retry_count"] = retries + 1
                # Bypass the dupefilter, or the retried URL is dropped as already seen
                new_request.dont_filter = True
                spider.logger.warning(
                    f"Retry {retries + 1}/3 for {request.url} (HTTP {response.status})"
                )
                return new_request
        return response
# ── 6. Settings template ──────────────────────────────────────────────────────
RECOMMENDED_SETTINGS = """
# scrapy_pipeline/settings.py
BOT_NAME = "mybot"
SPIDER_MODULES = ["mybot.spiders"]
# Politeness
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1.0
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_TARGET_CONCURRENCY = 3.0
# Output
FEEDS = {
"%(name)s_%(time)s.jsonl": {"format": "jsonlines"},
}
# Pipelines (order 100–900, lower = runs first)
ITEM_PIPELINES = {
"mybot.pipelines.PriceCleanPipeline": 100,
"mybot.pipelines.ValidationPipeline": 200,
"mybot.pipelines.DuplicateFilterPipeline": 300,
"mybot.pipelines.JsonLinesExportPipeline": 800,
}
# Middlewares
DOWNLOADER_MIDDLEWARES = {
"scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
"mybot.middlewares.RotatingUserAgentMiddleware": 400,
"mybot.middlewares.ProxyMiddleware": 410,
"mybot.middlewares.RetryOnStatusMiddleware": 550,
}
# HTTP cache (dev/debugging only)
HTTPCACHE_ENABLED = False
HTTPCACHE_EXPIRATION_SECS = 3600
# Logging
LOG_LEVEL = "INFO"
"""
# ── Demo (run-from-script pattern) ────────────────────────────────────────────
if __name__ == "__main__":
    print("Scrapy Project Structure Demo")
    print("=" * 50)
    print("\n1. Create project: scrapy startproject myshop")
    print("2. Generate spider: scrapy genspider products example-shop.com")
    print("3. Run spider: scrapy crawl products -o products.jsonl")
    print("4. Limit items: scrapy crawl products -s CLOSESPIDER_ITEMCOUNT=100")
    print("\nPipeline order:")
    print("  Request → Spider → Item → Pipelines → Export")
    print("\nKey settings:")
    for line in RECOMMENDED_SETTINGS.split("\n"):
        if line.strip() and not line.startswith("#"):
            print(f"  {line.strip()}")
    print("\nScrapy shell for testing selectors:")
    print("  scrapy shell 'https://example-shop.com/product/123'")
    print("  >>> response.css('h1::text').get()")
    print("  >>> response.xpath('//span[@class=\"price\"]/text()').get()")
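For comparison with the lighter-weight approach, a single product page can be scraped without any Scrapy project at all. A sketch using requests + BeautifulSoup, assuming both packages are installed; the selectors mirror the hypothetical ProductSpider markup above, and parsing is split from fetching so it can be tried offline:

```python
from bs4 import BeautifulSoup

def parse_product_html(html: str, url: str) -> dict:
    """Extract title/price from product-page HTML (parsing only, no network)."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h1.product-title")
    price = soup.select_one("span.price")
    return {
        "url": url,
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
    }

def scrape_product(url: str) -> dict:
    """Fetch a live page and parse it."""
    import requests  # Imported here so parsing alone needs only beautifulsoup4
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    resp.raise_for_status()
    return parse_product_html(resp.text, url)
```

This is the whole program; no throttling, retries, dedup, or export come with it, which is exactly the gap Scrapy fills.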
For quick one-off scraping of single pages, requests + BeautifulSoup is the simpler choice. Scrapy pays off at scale: built-in AutoThrottle adapts the download rate to server response times, the CrawlSpider + LinkExtractor combination crawls entire sites without a hand-written link-following loop, and the pipeline/middleware architecture enforces clean separation between extraction, validation, and storage, preventing the common anti-pattern of one 500-line scraping script with mixed concerns. For JavaScript-heavy sites, plain Playwright or Selenium is better suited to pure browser automation, but scrapy-playwright integrates Playwright as a Scrapy download handler, so JavaScript pages are rendered transparently without losing Scrapy's scheduler, rate limiting, item pipelines, or stats collection, making it the production choice for large JS-heavy crawls. The Claude Skills 360 bundle includes Scrapy skill sets covering Spider and CrawlSpider, CSS and XPath selectors, ItemLoader with processors, price-cleaning pipelines, dedup and validation pipelines, user-agent and proxy rotation middlewares, retry-on-error middleware, recommended settings with AutoThrottle, and JSONL export. Start with the free tier to try web scraping code generation.
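The scrapy-playwright integration mentioned above is enabled through settings rather than spider rewrites. A minimal configuration sketch, per the scrapy-playwright documentation (the dict name here is just a convenient wrapper for illustration):

```python
# Settings to route requests through Playwright via scrapy-playwright.
PLAYWRIGHT_SETTINGS = {
    # Replace the default HTTP/HTTPS download handlers with Playwright's.
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    # scrapy-playwright requires the asyncio-based Twisted reactor.
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}

# In a spider, opt individual requests into browser rendering:
#   yield scrapy.Request(url, meta={"playwright": True})
# Requests without the flag still use the plain (fast) HTTP downloader.
```

Because rendering is opt-in per request, a crawl can use Playwright only for the JS-heavy pages and keep the cheap HTTP path for everything else.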