DuckDB is an in-process OLAP database: no server, no setup, just SQL on files. It scans Parquet, CSV, JSON, and Arrow data directly with columnar, vectorized execution. The httpfs extension queries S3 objects in place, with no download step. Window functions, PIVOT, and recursive CTEs work natively, and DuckDB processes multi-gigabyte datasets on a laptop in seconds. Claude Code generates DuckDB SQL queries, Python integration code, data transformation pipelines, and the lakehouse architecture patterns that replace heavyweight analytics infrastructure.
## CLAUDE.md for DuckDB Projects
## DuckDB Stack
- Version: duckdb >= 1.0 (stable persistent file format)
- Extensions: httpfs (S3), spatial (GIS), json (JSON functions), excel (XLSX read)
- Formats: Parquet preferred, Delta Lake and Iceberg via extensions
- Python: duckdb.connect() once per process; give each thread its own cursor via con.cursor()
- Performance: memory_limit and threads config for large machines
- Persistence: .db file for mutable tables, Parquet for immutable data
- Patterns: COPY TO PARQUET for exports, INSERT INTO SELECT for transforms
## Querying Files Directly

```sql
-- Scan a single Parquet file
SELECT * FROM read_parquet('data/orders.parquet')
WHERE status = 'delivered'
LIMIT 10;

-- Glob pattern: scan all files matching the pattern
SELECT
    DATE_TRUNC('month', created_at) AS month,
    COUNT(*) AS orders,
    SUM(total_cents) / 100.0 AS revenue,
    AVG(total_cents) / 100.0 AS avg_order_value
FROM read_parquet('data/orders/year=*/month=*/*.parquet')
GROUP BY 1
ORDER BY 1;

-- CSV with schema inference
SELECT * FROM read_csv_auto('data/customers.csv', header = true)
LIMIT 5;

-- CSV with an explicit schema
SELECT * FROM read_csv(
    'data/events.csv',
    columns = {
        'user_id': 'VARCHAR',
        'event_type': 'VARCHAR',
        'created_at': 'TIMESTAMP',
        'amount_cents': 'INTEGER'
    },
    timestampformat = '%Y-%m-%d %H:%M:%S'
);

-- Newline-delimited JSON; records=false yields one JSON value per line
-- in a single column named "json"
SELECT
    json_extract_string(json, '$.user.id') AS user_id,
    json_extract(json, '$.items') AS items
FROM read_json('data/events/*.json', format = 'newline_delimited', records = false);
```
## S3 / Cloud Storage with httpfs

```sql
-- Install and load httpfs
INSTALL httpfs;
LOAD httpfs;

-- Configure S3 credentials via the Secrets Manager
CREATE SECRET s3_creds (
    TYPE S3,
    KEY_ID 'AKIA...',          -- placeholder; never hard-code real keys
    SECRET '...',
    REGION 'us-east-1',
    URL_STYLE 'vhost'
);

-- Or pick up env vars, AWS profiles, or an instance role automatically
CREATE OR REPLACE SECRET s3_creds (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN
);

-- Query S3 directly, no download required
SELECT
    DATE_TRUNC('week', order_date) AS week,
    product_category,
    SUM(revenue_cents) / 100.0 AS weekly_revenue
FROM read_parquet('s3://my-data-lake/orders/year=2026/**/*.parquet')
GROUP BY 1, 2
ORDER BY 1, 2;

-- Write query results back to S3
COPY (
    SELECT
        customer_id,
        COUNT(*) AS order_count,
        SUM(total_cents) AS lifetime_value_cents
    FROM read_parquet('s3://my-data-lake/orders/**/*.parquet')
    GROUP BY customer_id
) TO 's3://my-data-lake/analytics/customer_ltv.parquet'
    (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000);
```
## Window Functions and Analytics

```sql
-- Comprehensive window function showcase
SELECT
    customer_id,
    order_id,
    created_at,
    total_cents,
    -- Ranking
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at) AS order_num,
    RANK() OVER (ORDER BY total_cents DESC) AS spend_rank,
    NTILE(4) OVER (ORDER BY total_cents) AS spend_quartile,  -- Q1-Q4
    -- Running calculations
    SUM(total_cents) OVER (
        PARTITION BY customer_id
        ORDER BY created_at
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS lifetime_spend,
    -- Moving averages
    AVG(total_cents) OVER (
        PARTITION BY customer_id
        ORDER BY created_at
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS rolling_3_order_avg,
    -- Lead/lag for cohort analysis
    DATE_DIFF('day',
        LAG(created_at) OVER (PARTITION BY customer_id ORDER BY created_at),
        created_at
    ) AS days_since_last_order,
    LEAD(created_at) OVER (
        PARTITION BY customer_id ORDER BY created_at
    ) AS next_order_date,
    -- Percent of total
    total_cents * 100.0 / SUM(total_cents) OVER (PARTITION BY customer_id)
        AS pct_of_customer_spend
FROM orders
ORDER BY customer_id, created_at;
```
## PIVOT and UNPIVOT

```sql
-- PIVOT: rows to columns (crosstab)
PIVOT (
    SELECT
        customer_id,
        DATE_TRUNC('month', created_at)::DATE AS month,
        SUM(total_cents) AS revenue
    FROM orders
    GROUP BY 1, 2
)
ON month
USING SUM(revenue)
ORDER BY customer_id;

-- Dynamic PIVOT: output columns are determined by the data
PIVOT orders
ON status
USING COUNT(*) AS count, SUM(total_cents) AS revenue
GROUP BY customer_id;

-- UNPIVOT: columns to rows
UNPIVOT (
    SELECT customer_id, jan_revenue, feb_revenue, mar_revenue
    FROM monthly_summary
)
ON jan_revenue, feb_revenue, mar_revenue
INTO NAME month VALUE revenue;
```
## CTEs and Recursive Queries

```sql
-- Cohort retention analysis with CTEs
WITH cohorts AS (
    SELECT
        customer_id,
        DATE_TRUNC('month', MIN(created_at))::DATE AS cohort_month
    FROM orders
    GROUP BY customer_id
),
order_months AS (
    SELECT
        o.customer_id,
        c.cohort_month,
        DATE_TRUNC('month', o.created_at)::DATE AS order_month,
        DATE_DIFF('month', c.cohort_month, DATE_TRUNC('month', o.created_at)::DATE) AS months_since_first
    FROM orders o
    JOIN cohorts c ON o.customer_id = c.customer_id
),
retention AS (
    SELECT
        cohort_month,
        months_since_first,
        COUNT(DISTINCT customer_id) AS customers
    FROM order_months
    GROUP BY 1, 2
),
cohort_sizes AS (
    SELECT cohort_month, customers AS cohort_size
    FROM retention
    WHERE months_since_first = 0
)
SELECT
    r.cohort_month,
    r.months_since_first,
    r.customers,
    s.cohort_size,
    ROUND(r.customers * 100.0 / s.cohort_size, 1) AS retention_rate
FROM retention r
JOIN cohort_sizes s ON r.cohort_month = s.cohort_month
ORDER BY cohort_month, months_since_first;

-- Recursive CTE: category hierarchy
WITH RECURSIVE category_tree AS (
    -- Base case: root categories
    SELECT id, name, parent_id, 0 AS depth, name AS path
    FROM categories
    WHERE parent_id IS NULL
    UNION ALL
    -- Recursive case: children
    SELECT c.id, c.name, c.parent_id,
           ct.depth + 1,
           ct.path || ' > ' || c.name
    FROM categories c
    JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT * FROM category_tree ORDER BY path;
```
## Python Integration

```python
import duckdb
import pandas as pd
import polars as pl

# In-memory connection (default)
con = duckdb.connect()

# Persistent database
con = duckdb.connect('analytics.db')

# Configure for large workloads
con.execute("""
    SET memory_limit = '8GB';
    SET threads = 8;
    SET enable_progress_bar = true;
""")

# Query results in various formats
df_pandas = con.execute("SELECT * FROM orders LIMIT 100").df()       # pandas
df_polars = con.execute("SELECT * FROM orders LIMIT 100").pl()       # Polars
arrow_table = con.execute("SELECT * FROM orders LIMIT 100").arrow()  # PyArrow
rows = con.execute("SELECT * FROM orders LIMIT 5").fetchall()        # list of tuples

# Register a pandas DataFrame as a virtual table
orders_pd = pd.read_parquet("orders.parquet")
con.register("orders_view", orders_pd)
result = con.execute("SELECT COUNT(*) FROM orders_view WHERE status = 'delivered'").fetchone()

# Register a Polars LazyFrame (materialized when queried)
orders_pl = pl.scan_parquet("orders.parquet")
con.register("orders_lazy", orders_pl)

# Use Python variables in queries
min_amount = 5000
result = con.execute(
    "SELECT * FROM orders WHERE total_cents > ? AND status = ?",
    [min_amount, "delivered"]
).pl()

# Parameterize for security (never interpolate user input with f-strings)
status_filter = "delivered"  # pretend this is user input
safe_result = con.execute(
    "SELECT COUNT(*) FROM read_parquet(?) WHERE status = ?",
    ["data/orders.parquet", status_filter]
).fetchone()

con.close()
```
## Building a Local Data Lakehouse

```python
import datetime
from pathlib import Path

import duckdb


class LocalLakehouse:
    """Simple data lakehouse using DuckDB + Parquet files."""

    def __init__(self, base_path: str):
        self.base_path = Path(base_path)
        self.con = duckdb.connect(str(self.base_path / "catalog.db"))
        self._setup()

    def _setup(self):
        self.con.execute("""
            INSTALL httpfs; LOAD httpfs;
            INSTALL delta; LOAD delta;
            CREATE TABLE IF NOT EXISTS ingestion_log (
                table_name VARCHAR,
                partition_path VARCHAR,
                row_count BIGINT,
                ingested_at TIMESTAMP DEFAULT NOW()
            );
        """)

    def write_partition(self, table: str, df, partition_date: datetime.date):
        """Write a dated partition as Parquet."""
        path = self.base_path / table / f"date={partition_date}" / "data.parquet"
        path.parent.mkdir(parents=True, exist_ok=True)
        # "df" is resolved via DuckDB's replacement scan of local variables
        self.con.execute(f"""
            COPY (SELECT * FROM df)
            TO '{path}'
            (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000)
        """)
        row_count = self.con.execute(f"SELECT COUNT(*) FROM '{path}'").fetchone()[0]
        self.con.execute(
            "INSERT INTO ingestion_log VALUES (?, ?, ?, NOW())",
            [table, str(path), row_count],
        )
        return row_count

    def query(self, table: str, sql_filter: str = "1=1") -> duckdb.DuckDBPyRelation:
        """Query all partitions of a table."""
        glob_path = str(self.base_path / table / "**/*.parquet")
        return self.con.sql(
            f"SELECT * FROM read_parquet('{glob_path}', hive_partitioning=true) WHERE {sql_filter}"
        )

    def create_view(self, view_name: str, sql: str):
        """Create a persistent view for frequently used queries."""
        self.con.execute(f"CREATE OR REPLACE VIEW {view_name} AS {sql}")


# Usage
import polars as pl

lake = LocalLakehouse("/data/lakehouse")

# Ingest daily orders
orders_today = pl.read_csv("orders_2026-12-18.csv")
count = lake.write_partition("orders", orders_today, datetime.date(2026, 12, 18))
print(f"Ingested {count} rows")

# Query across all partitions
result = lake.query(
    "orders",
    "date >= '2026-10-01' AND status = 'delivered'"
).pl()
```
## JSON Queries and Pattern Matching

```sql
-- JSON function usage
SELECT
    json_extract_string(metadata, '$.source') AS acquisition_source,
    json_extract(metadata, '$.items[0].quantity')::INTEGER AS first_item_qty,
    json_array_length(json_extract(metadata, '$.items')) AS item_count,
    json_keys(metadata) AS all_keys
FROM orders
WHERE json_extract_string(metadata, '$.source') = 'instagram';

-- Unnest a JSON array into rows (the json_each table function requires DuckDB 1.1+)
SELECT
    order_id,
    item.value ->> 'product_id' AS product_id,
    (item.value ->> 'quantity')::INTEGER AS quantity,
    (item.value ->> 'price_cents')::INTEGER AS price_cents
FROM orders,
     json_each(metadata -> 'items') AS item;

-- Pattern matching with SIMILAR TO and ILIKE
SELECT * FROM orders
WHERE notes SIMILAR TO '.*(urgent|rush|asap).*'
   OR shipping_address ILIKE '%new york%';
```
Polars DataFrames integrate tightly with DuckDB for Python-native analytics pipelines; see the Polars guide for the expression API and lazy evaluation patterns. For a server-based OLAP alternative suited to high-concurrency production analytics, the ClickHouse guide covers that columnar engine. The Claude Skills 360 bundle includes DuckDB skill sets covering S3 queries, window functions, and local lakehouse patterns. Start with the free tier to try analytical SQL generation.