DuckDB is an in-process OLAP database: no server, no setup, just SQL on files. It scans Parquet, CSV, JSON, and Arrow data directly with columnar, vectorized execution. The httpfs extension queries S3 objects in place, with no download step. Window functions, PIVOT, and recursive CTEs work natively, and DuckDB processes multi-gigabyte datasets on a laptop in seconds. Claude Code generates DuckDB SQL queries, Python integration code, data transformation pipelines, and the lakehouse architecture patterns that replace heavyweight analytics infrastructure.
## CLAUDE.md for DuckDB Projects
## DuckDB Stack
- Version: duckdb >= 1.0 (stable persistent file format)
- Extensions: httpfs (S3), spatial (GIS), json (JSON functions), excel (XLSX read)
- Formats: Parquet preferred, Delta Lake and Iceberg via extensions
- Python: duckdb.connect() once per process; give each thread its own cursor via con.cursor()
- Performance: memory_limit and threads config for large machines
- Persistence: .db file for mutable tables, Parquet for immutable data
- Patterns: COPY TO PARQUET for exports, INSERT INTO SELECT for transforms
## Querying Files Directly

```sql
-- Scan a single Parquet file
SELECT * FROM read_parquet('data/orders.parquet')
WHERE status = 'delivered'
LIMIT 10;

-- Glob pattern: scan all files matching the pattern
SELECT
    DATE_TRUNC('month', created_at) AS month,
    COUNT(*) AS orders,
    SUM(total_cents) / 100.0 AS revenue,
    AVG(total_cents) / 100.0 AS avg_order_value
FROM read_parquet('data/orders/year=*/month=*/*.parquet')
GROUP BY 1
ORDER BY 1;

-- CSV with schema inference
SELECT * FROM read_csv_auto('data/customers.csv', header = true)
LIMIT 5;

-- CSV with an explicit schema
SELECT * FROM read_csv(
    'data/events.csv',
    columns = {
        'user_id': 'VARCHAR',
        'event_type': 'VARCHAR',
        'created_at': 'TIMESTAMP',
        'amount_cents': 'INTEGER'
    },
    timestampformat = '%Y-%m-%d %H:%M:%S'
);

-- Newline-delimited JSON; records=false yields one JSON value per line
-- in a single column named "json"
SELECT
    json_extract_string(json, '$.user.id') AS user_id,
    json_extract(json, '$.items') AS items
FROM read_json('data/events/*.json', format = 'newline_delimited', records = false);
```
## S3 / Cloud Storage with httpfs

```sql
-- Install and load httpfs
INSTALL httpfs;
LOAD httpfs;

-- Configure S3 credentials via the Secrets Manager
CREATE SECRET s3_creds (
    TYPE S3,
    KEY_ID 'AKIA...',          -- placeholder; never hard-code real keys
    SECRET '...',
    REGION 'us-east-1',
    URL_STYLE 'vhost'
);

-- Or pick up env vars, AWS profiles, or an instance role automatically
CREATE OR REPLACE SECRET s3_creds (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN
);

-- Query S3 directly, no download required
SELECT
    DATE_TRUNC('week', order_date) AS week,
    product_category,
    SUM(revenue_cents) / 100.0 AS weekly_revenue
FROM read_parquet('s3://my-data-lake/orders/year=2026/**/*.parquet')
GROUP BY 1, 2
ORDER BY 1, 2;

-- Write query results back to S3
COPY (
    SELECT
        customer_id,
        COUNT(*) AS order_count,
        SUM(total_cents) AS lifetime_value_cents
    FROM read_parquet('s3://my-data-lake/orders/**/*.parquet')
    GROUP BY customer_id
) TO 's3://my-data-lake/analytics/customer_ltv.parquet'
    (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000);
```
## Window Functions and Analytics

```sql
-- Comprehensive window function showcase
SELECT
    customer_id,
    order_id,
    created_at,
    total_cents,
    -- Ranking
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at) AS order_num,
    RANK() OVER (ORDER BY total_cents DESC) AS spend_rank,
    NTILE(4) OVER (ORDER BY total_cents) AS spend_quartile,  -- Q1-Q4
    -- Running calculations
    SUM(total_cents) OVER (
        PARTITION BY customer_id
        ORDER BY created_at
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS lifetime_spend,
    -- Moving averages
    AVG(total_cents) OVER (
        PARTITION BY customer_id
        ORDER BY created_at
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS rolling_3_order_avg,
    -- Lead/lag for cohort analysis
    DATE_DIFF('day',
        LAG(created_at) OVER (PARTITION BY customer_id ORDER BY created_at),
        created_at
    ) AS days_since_last_order,
    LEAD(created_at) OVER (
        PARTITION BY customer_id ORDER BY created_at
    ) AS next_order_date,
    -- Percent of total
    total_cents * 100.0 / SUM(total_cents) OVER (PARTITION BY customer_id)
        AS pct_of_customer_spend
FROM orders
ORDER BY customer_id, created_at;
```
## PIVOT and UNPIVOT

```sql
-- PIVOT: rows to columns (crosstab)
PIVOT (
    SELECT
        customer_id,
        DATE_TRUNC('month', created_at)::DATE AS month,
        SUM(total_cents) AS revenue
    FROM orders
    GROUP BY 1, 2
)
ON month
USING SUM(revenue)
ORDER BY customer_id;

-- Dynamic PIVOT: output columns are determined by the data
PIVOT orders
ON status
USING COUNT(*) AS count, SUM(total_cents) AS revenue
GROUP BY customer_id;

-- UNPIVOT: columns to rows
UNPIVOT (
    SELECT customer_id, jan_revenue, feb_revenue, mar_revenue
    FROM monthly_summary
)
ON jan_revenue, feb_revenue, mar_revenue
INTO NAME month VALUE revenue;
```
## CTEs and Recursive Queries

```sql
-- Cohort retention analysis with CTEs
WITH cohorts AS (
    SELECT
        customer_id,
        DATE_TRUNC('month', MIN(created_at))::DATE AS cohort_month
    FROM orders
    GROUP BY customer_id
),
order_months AS (
    SELECT
        o.customer_id,
        c.cohort_month,
        DATE_TRUNC('month', o.created_at)::DATE AS order_month,
        DATE_DIFF('month', c.cohort_month, DATE_TRUNC('month', o.created_at)::DATE) AS months_since_first
    FROM orders o
    JOIN cohorts c ON o.customer_id = c.customer_id
),
retention AS (
    SELECT
        cohort_month,
        months_since_first,
        COUNT(DISTINCT customer_id) AS customers
    FROM order_months
    GROUP BY 1, 2
),
cohort_sizes AS (
    SELECT cohort_month, customers AS cohort_size
    FROM retention
    WHERE months_since_first = 0
)
SELECT
    r.cohort_month,
    r.months_since_first,
    r.customers,
    s.cohort_size,
    ROUND(r.customers * 100.0 / s.cohort_size, 1) AS retention_rate
FROM retention r
JOIN cohort_sizes s ON r.cohort_month = s.cohort_month
ORDER BY cohort_month, months_since_first;

-- Recursive CTE: category hierarchy
WITH RECURSIVE category_tree AS (
    -- Base case: root categories
    SELECT id, name, parent_id, 0 AS depth, name AS path
    FROM categories
    WHERE parent_id IS NULL
    UNION ALL
    -- Recursive case: children
    SELECT c.id, c.name, c.parent_id,
           ct.depth + 1,
           ct.path || ' > ' || c.name
    FROM categories c
    JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT * FROM category_tree ORDER BY path;
```
## Python Integration

```python
import duckdb
import pandas as pd
import polars as pl

# In-memory connection (default)
con = duckdb.connect()

# Persistent database
con = duckdb.connect('analytics.db')

# Configure for large workloads
con.execute("""
    SET memory_limit = '8GB';
    SET threads = 8;
    SET enable_progress_bar = true;
""")

# Query results in various formats
df_pandas = con.execute("SELECT * FROM orders LIMIT 100").df()       # pandas
df_polars = con.execute("SELECT * FROM orders LIMIT 100").pl()       # Polars
arrow_table = con.execute("SELECT * FROM orders LIMIT 100").arrow()  # PyArrow
rows = con.execute("SELECT * FROM orders LIMIT 5").fetchall()        # list of tuples

# Register a pandas DataFrame as a virtual table
orders_pd = pd.read_parquet("orders.parquet")
con.register("orders_view", orders_pd)
result = con.execute("SELECT COUNT(*) FROM orders_view WHERE status = 'delivered'").fetchone()

# Register a Polars LazyFrame (materialized when queried)
orders_pl = pl.scan_parquet("orders.parquet")
con.register("orders_lazy", orders_pl)

# Use Python variables in queries
min_amount = 5000
result = con.execute(
    "SELECT * FROM orders WHERE total_cents > ? AND status = ?",
    [min_amount, "delivered"]
).pl()

# Parameterize for security (never interpolate user input with f-strings)
status_filter = "delivered"  # pretend this is user input
safe_result = con.execute(
    "SELECT COUNT(*) FROM read_parquet(?) WHERE status = ?",
    ["data/orders.parquet", status_filter]
).fetchone()

con.close()
```
## Building a Local Data Lakehouse

```python
import datetime
from pathlib import Path

import duckdb


class LocalLakehouse:
    """Simple data lakehouse using DuckDB + Parquet files."""

    def __init__(self, base_path: str):
        self.base_path = Path(base_path)
        self.con = duckdb.connect(str(self.base_path / "catalog.db"))
        self._setup()

    def _setup(self):
        self.con.execute("""
            INSTALL httpfs; LOAD httpfs;
            INSTALL delta; LOAD delta;
            CREATE TABLE IF NOT EXISTS ingestion_log (
                table_name VARCHAR,
                partition_path VARCHAR,
                row_count BIGINT,
                ingested_at TIMESTAMP DEFAULT NOW()
            );
        """)

    def write_partition(self, table: str, df, partition_date: datetime.date):
        """Write a dated partition as Parquet."""
        path = self.base_path / table / f"date={partition_date}" / "data.parquet"
        path.parent.mkdir(parents=True, exist_ok=True)
        # "df" is resolved via DuckDB's replacement scan of local variables
        self.con.execute(f"""
            COPY (SELECT * FROM df)
            TO '{path}'
            (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000)
        """)
        row_count = self.con.execute(f"SELECT COUNT(*) FROM '{path}'").fetchone()[0]
        self.con.execute(
            "INSERT INTO ingestion_log VALUES (?, ?, ?, NOW())",
            [table, str(path), row_count],
        )
        return row_count

    def query(self, table: str, sql_filter: str = "1=1") -> duckdb.DuckDBPyRelation:
        """Query all partitions of a table."""
        glob_path = str(self.base_path / table / "**/*.parquet")
        return self.con.sql(
            f"SELECT * FROM read_parquet('{glob_path}', hive_partitioning=true) WHERE {sql_filter}"
        )

    def create_view(self, view_name: str, sql: str):
        """Create a persistent view for frequently used queries."""
        self.con.execute(f"CREATE OR REPLACE VIEW {view_name} AS {sql}")


# Usage
import polars as pl

lake = LocalLakehouse("/data/lakehouse")

# Ingest daily orders
orders_today = pl.read_csv("orders_2026-12-18.csv")
count = lake.write_partition("orders", orders_today, datetime.date(2026, 12, 18))
print(f"Ingested {count} rows")

# Query across all partitions
result = lake.query(
    "orders",
    "date >= '2026-10-01' AND status = 'delivered'"
).pl()
```
## JSON Queries and Pattern Matching

```sql
-- JSON function usage
SELECT
    json_extract_string(metadata, '$.source') AS acquisition_source,
    json_extract(metadata, '$.items[0].quantity')::INTEGER AS first_item_qty,
    json_array_length(json_extract(metadata, '$.items')) AS item_count,
    json_keys(metadata) AS all_keys
FROM orders
WHERE json_extract_string(metadata, '$.source') = 'instagram';

-- Unnest a JSON array into rows (the json_each table function requires DuckDB 1.1+)
SELECT
    order_id,
    item.value ->> 'product_id' AS product_id,
    (item.value ->> 'quantity')::INTEGER AS quantity,
    (item.value ->> 'price_cents')::INTEGER AS price_cents
FROM orders,
     json_each(metadata -> 'items') AS item;

-- Pattern matching with SIMILAR TO and ILIKE
SELECT * FROM orders
WHERE notes SIMILAR TO '.*(urgent|rush|asap).*'
   OR shipping_address ILIKE '%new york%';
```
Polars DataFrames integrate tightly with DuckDB for Python-native analytics pipelines; see the Polars guide for the expression API and lazy evaluation patterns. For a server-based OLAP alternative suited to high-concurrency production analytics, the ClickHouse guide covers that columnar engine. The Claude Skills 360 bundle includes DuckDB skill sets covering S3 queries, window functions, and local lakehouse patterns. Start with the free tier to try analytical SQL generation.