Claude Code for Web Scraping: Playwright, Anti-Bot Handling, and Data Pipelines — Claude Skills 360 Blog

Claude Code for Web Scraping: Playwright, Anti-Bot Handling, and Data Pipelines

Published: September 4, 2026
Read time: 8 min
By: Claude Skills 360

Web scraping ranges from simple HTML parsing to browser automation that handles JavaScript-rendered content, CAPTCHAs, and anti-bot systems. Claude Code generates scraping code with appropriate techniques for each scenario and the data pipeline patterns to process extracted content at scale.

This guide covers web scraping with Claude Code: Playwright automation, structured extraction, anti-bot considerations, and data pipelines.

CLAUDE.md for Scraping Projects

## Scraping Stack
- Playwright for JS-rendered content
- cheerio for static HTML parsing (faster, no browser overhead)
- Rate limiting: 1-2 req/sec default, respect robots.txt
- Retry: exponential backoff on 429/503
- Storage: PostgreSQL for structured data, S3 for raw HTML

## Ethics
- Always check robots.txt before scraping
- Scrape during the target's off-peak hours to reduce server load
- Set a descriptive User-Agent that identifies your bot and a contact address
- Never scrape PII you don't have rights to use
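The robots.txt rule above can be enforced with a small gate before each fetch. A minimal sketch of such a check (simplified: it only honors `User-agent` and `Disallow` lines; a production crawler should use a full parser such as the `robots-parser` package):

```typescript
// Minimal robots.txt check (illustrative; real parsers also handle Allow,
// wildcards, and Crawl-delay). Returns true if `path` may be fetched.
export function isPathAllowed(robotsTxt: string, botName: string, path: string): boolean {
  const lines = robotsTxt.split('\n').map(l => l.split('#')[0].trim());
  let applies = false;
  const disallowed: string[] = [];

  for (const line of lines) {
    const [rawKey, ...rest] = line.split(':');
    const key = rawKey.toLowerCase();
    const value = rest.join(':').trim();
    if (key === 'user-agent') {
      // A group applies to us if it targets * or our bot name
      applies = value === '*' || botName.toLowerCase().includes(value.toLowerCase());
    } else if (key === 'disallow' && applies && value) {
      disallowed.push(value);
    }
  }
  return !disallowed.some(prefix => path.startsWith(prefix));
}
```

Fetch `https://example.com/robots.txt` once per host, cache it, and call this before every request.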

Playwright Browser Automation

Scrape a job listings site that loads results via JavaScript.
Extract: job title, company, location, salary range, posted date, URL.
// src/scrapers/jobs-scraper.ts
import { chromium, Browser, Page } from 'playwright';

interface JobListing {
  title: string;
  company: string;
  location: string;
  salary?: string;
  postedAt: string;
  url: string;
  scrapedAt: Date;
}

export class JobsScraper {
  private browser: Browser | null = null;

  async initialize() {
    this.browser = await chromium.launch({
      headless: true,
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-blink-features=AutomationControlled', // Reduce fingerprint
      ],
    });
  }

  async scrapeListings(searchUrl: string, maxPages = 5): Promise<JobListing[]> {
    if (!this.browser) throw new Error('Call initialize() first');

    const context = await this.browser.newContext({
      userAgent: 'Mozilla/5.0 (compatible; MyJobBot/1.0; [email protected])',
      viewport: { width: 1280, height: 720 },
      locale: 'en-US',
    });

    const page = await context.newPage();
    const results: JobListing[] = [];

    try {
      for (let pageNum = 1; pageNum <= maxPages; pageNum++) {
        const url = pageNum === 1 ? searchUrl : `${searchUrl}&page=${pageNum}`;
        
        await page.goto(url, { waitUntil: 'networkidle' });

        // Wait for job listings to load
        await page.waitForSelector('[data-testid="job-card"]', { timeout: 10000 });

        // Extract all listings on this page
        const listings = await page.evaluate(() => {
          const cards = document.querySelectorAll('[data-testid="job-card"]');
          return Array.from(cards).map(card => ({
            title: card.querySelector('.job-title')?.textContent?.trim() ?? '',
            company: card.querySelector('.company-name')?.textContent?.trim() ?? '',
            location: card.querySelector('.location')?.textContent?.trim() ?? '',
            salary: card.querySelector('.salary-range')?.textContent?.trim(),
            postedAt: card.querySelector('[data-posted]')?.getAttribute('data-posted') ?? '',
            url: (card.querySelector('a.job-link') as HTMLAnchorElement)?.href ?? '',
          }));
        });

        results.push(...listings.map(l => ({ ...l, scrapedAt: new Date() })));

        // Stop when there's no next page (button missing or disabled).
        // Check count() first: getAttribute() on a missing locator would
        // wait for it and eventually throw a timeout.
        const nextButton = page.locator('[aria-label="Next page"]');
        if (await nextButton.count() === 0) break;
        if (await nextButton.getAttribute('disabled') !== null) break;

        // Respectful rate limiting
        await page.waitForTimeout(1500 + Math.random() * 1000);
      }
    } finally {
      await context.close();
    }

    return results;
  }

  async close() {
    await this.browser?.close();
  }
}
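Fields like `postedAt` often arrive as relative strings ("3 days ago") rather than dates. A hypothetical normalization helper (not part of the scraper above) that converts them before storage might look like:

```typescript
// Convert relative "posted" strings (e.g. "3 days ago", "today") into a
// Date, relative to `now`. Returns null when the format is unrecognized,
// so callers can keep the raw string instead of storing bad data.
export function parsePostedAt(text: string, now: Date = new Date()): Date | null {
  const t = text.trim().toLowerCase();
  if (t === 'today' || t === 'just posted') return new Date(now);

  const m = t.match(/^(\d+)\s+(hour|day|week)s?\s+ago$/);
  if (!m) return null;

  const n = Number(m[1]);
  const unitMs = { hour: 3_600_000, day: 86_400_000, week: 604_800_000 }[
    m[2] as 'hour' | 'day' | 'week'
  ];
  return new Date(now.getTime() - n * unitMs);
}
```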

Anti-Bot Handling

The site is blocking our scraper with a 403 after 10 requests.
What anti-bot measures might be in place and how do we handle them?

Common defenses and approaches:

// src/scrapers/resilient-page.ts
import { Browser } from 'playwright';

// Pool of realistic desktop User-Agent strings to rotate through
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
];

function getRotatingUserAgent(): string {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

export async function createStealthPage(browser: Browser) {
  const context = await browser.newContext({
    userAgent: getRotatingUserAgent(),
    
    // Consistent fingerprint
    viewport: { width: 1366, height: 768 },
    locale: 'en-US',
    timezoneId: 'America/New_York',
    
    // Proxy rotation (use residential proxies for high-security sites)
    proxy: process.env.PROXY_URL ? {
      server: process.env.PROXY_URL,
      username: process.env.PROXY_USER,
      password: process.env.PROXY_PASS,
    } : undefined,
  });

  const page = await context.newPage();

  // Mask Playwright automation detection
  await page.addInitScript(() => {
    // Remove webdriver property
    Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
    
    // Fake plugins list
    Object.defineProperty(navigator, 'plugins', {
      get: () => [{ name: 'Chrome PDF Plugin' }, { name: 'Chrome PDF Viewer' }],
    });
  });

  // Random mouse movements to simulate human behavior
  await page.mouse.move(
    Math.floor(Math.random() * 800 + 200),
    Math.floor(Math.random() * 400 + 200),
  );

  return { page, context };
}

// Rate limiter with jitter
export class RateLimiter {
  private lastRequest = 0;

  constructor(
    private minDelayMs: number = 1000,
    private maxDelayMs: number = 3000,
  ) {}

  async wait() {
    const elapsed = Date.now() - this.lastRequest;
    const delay = this.minDelayMs + Math.random() * (this.maxDelayMs - this.minDelayMs);

    if (elapsed < delay) {
      await new Promise(resolve => setTimeout(resolve, delay - elapsed));
    }

    this.lastRequest = Date.now();
  }
}
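The rate limiter spaces requests out; for requests that still come back 429/503, a generic retry wrapper with exponential backoff and jitter complements it. A sketch, independent of any particular HTTP client:

```typescript
// Retry an async operation with exponential backoff + jitter.
// `isRetryable` decides whether a given error warrants another attempt
// (e.g. HTTP 429/503); anything else is rethrown immediately.
export async function withRetry<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxAttempts = 4,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts || !isRetryable(err)) throw err;
      // 1s, 2s, 4s, ... scaled by random jitter in [0.5, 1.5)
      const delay = baseDelayMs * 2 ** (attempt - 1) * (0.5 + Math.random());
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
```

Wrap each `page.goto` (or `got` call) in `withRetry` alongside `RateLimiter.wait()` so transient blocks don't kill the whole run.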

Cheerio for Static HTML

The site is static HTML — use the lightweight option.
Extract product prices from a pricing page.
// src/scrapers/static-scraper.ts
import * as cheerio from 'cheerio';
import got from 'got'; // HTTP client with built-in retry support

export async function scrapePricingPage(url: string): Promise<{ plan: string; price: string; features: string[] }[]> {
  const html = await got(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (compatible; PriceBot/1.0)',
      'Accept-Language': 'en-US,en;q=0.9',
    },
    retry: {
      limit: 3,
      statusCodes: [429, 503],
      // Exponential backoff: 2s, 4s, 8s
      calculateDelay: ({ attemptCount }) => (2 ** attemptCount) * 1000,
    },
  }).text();

  const $ = cheerio.load(html);
  const plans: { plan: string; price: string; features: string[] }[] = [];

  $('.pricing-card, [data-plan]').each((i, el) => {
    const card = $(el);
    plans.push({
      plan: card.find('.plan-name, h2').first().text().trim(),
      price: card.find('.price, [data-price]').first().text().trim(),
      features: card.find('.feature, li').map((_, li) => $(li).text().trim()).get().filter(Boolean),
    });
  });

  return plans;
}
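Before reaching for Playwright at all, it's worth checking whether the data is already present in the raw HTML. A dependency-free heuristic sketch (the marker strings are an assumption: whatever you expect to appear in the rendered data, such as a known `data-testid`):

```typescript
// Heuristic pre-check: fetch the page without a browser and look for a
// string you expect in the rendered output. If none of the markers appear
// in the raw HTML, the content is likely injected by JavaScript and needs
// Playwright; otherwise the lightweight cheerio path suffices.
export function likelyNeedsBrowser(rawHtml: string, expectedMarkers: string[]): boolean {
  return !expectedMarkers.some(marker => rawHtml.includes(marker));
}
```

Running this once per target site keeps browser overhead reserved for pages that actually require it.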

Data Pipeline for Scraped Content

// src/pipeline/scrape-pipeline.ts
import { Queue, Worker } from 'bullmq';
import IORedis from 'ioredis';

// BullMQ workers need maxRetriesPerRequest: null on their connection
const redis = new IORedis(process.env.REDIS_URL ?? 'redis://localhost:6379', {
  maxRetriesPerRequest: null,
});

// Scrape queue — decouples scraping from processing
const scrapeQueue = new Queue('scrape', { connection: redis });

// Process scraped data: validate, deduplicate, store
const processor = new Worker('scrape', async (job) => {
  const { url, rawData } = job.data;

  // 1. Parse and validate
  const parsed = parseJobListing(rawData);
  if (!parsed) {
    job.log(`Skipping: could not parse listing from ${url}`);
    return;
  }

  // 2. Deduplicate by URL fingerprint
  const existing = await db.jobListings.findOne({ sourceUrl: parsed.url });
  if (existing) {
    // Update if data changed
    if (hasChanged(existing, parsed)) {
      await db.jobListings.update(existing.id, { ...parsed, updatedAt: new Date() });
    }
    return;
  }

  // 3. Store new listing (the DB assigns the id on insert)
  const created = await db.jobListings.create(parsed);

  // 4. Trigger downstream: notify matching users, update search index
  await notificationQueue.add('new-job-match', { jobId: created.id });
}, { connection: redis });
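The `hasChanged` check used above is left undefined in the pipeline; one possible implementation is a shallow comparison over the fields you track (the field list here is an assumption):

```typescript
// Shallow change detection over tracked fields. Avoids rewriting rows
// (and re-triggering downstream jobs) when a re-scrape returns
// identical data.
const TRACKED_FIELDS = ['title', 'company', 'location', 'salary', 'postedAt'] as const;

export function hasChanged(
  existing: Record<string, unknown>,
  incoming: Record<string, unknown>,
): boolean {
  return TRACKED_FIELDS.some(field => existing[field] !== incoming[field]);
}
```

Strict equality is enough for the flat string fields here; nested objects would need a deep compare or a content hash.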

For Playwright end-to-end testing (same library, different use case), see the Playwright testing guide. For processing scraped data with background jobs and queues, see the background jobs guide. The Claude Skills 360 bundle includes automation skill sets for browser automation, data extraction, and pipeline processing. Start with the free tier to try web scraping code generation.
