Data science work involves pattern-matching against known solutions: handling missing values, encoding categorical features, diagnosing model performance issues, vectorizing slow loops. Claude Code accelerates this significantly because these are solved problems — the challenge is knowing which solution fits your specific data shape and domain.
This guide covers data science workflows with Claude Code: exploratory data analysis, data cleaning, scikit-learn pipelines, model evaluation, and moving from notebook to production.
EDA and Data Quality
Load this CSV, give me a data quality report, and identify
the columns that need cleaning before modeling.
Claude Code runs a systematic EDA:
import pandas as pd
import numpy as np
df = pd.read_csv('customers.csv')
def data_quality_report(df: pd.DataFrame) -> pd.DataFrame:
    report = pd.DataFrame({
        'dtype': df.dtypes,
        'null_pct': (df.isna().sum() / len(df) * 100).round(2),
        'unique_values': df.nunique(),
        'sample_values': df.apply(lambda col: col.dropna().head(3).tolist()),
    })
    report['action_needed'] = ''
    report.loc[report['null_pct'] > 20, 'action_needed'] += 'HIGH_NULLS '
    # Compare dtypes as strings — the dtype column holds dtype objects, not str
    low_card = (report['dtype'].astype(str).isin(['int64', 'float64'])) & (report['unique_values'] < 10)
    report.loc[low_card, 'action_needed'] += 'MAYBE_CATEGORICAL '
    return report
print(data_quality_report(df).to_string())
Claude Code reads the output, identifies problem columns, and proposes a cleaning plan specific to your data domain.
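One common follow-up on columns flagged MAYBE_CATEGORICAL is converting them to pandas' category dtype, which documents intent and lets downstream encoders treat them correctly. A minimal sketch, with a hypothetical dataframe and cardinality threshold:

```python
import pandas as pd

df = pd.DataFrame({
    'plan_id': [1, 2, 1, 3, 2],          # low-cardinality int: likely categorical
    'monthly_spend': [42.0, 17.5, 88.0, 23.0, 51.0],
})

# Numeric columns with few distinct values are candidates for category dtype
low_card_cols = [c for c in df.select_dtypes('number').columns if df[c].nunique() < 4]
for c in low_card_cols:
    df[c] = df[c].astype('category')

print(df.dtypes['plan_id'])  # category
```

The threshold here is illustrative; the right cutoff depends on your data.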
CLAUDE.md for Data Science Projects
## Data Science Environment
- Python 3.11, pandas 2.x, numpy 1.26, scikit-learn 1.4
- Data: raw in data/raw/, processed in data/processed/
- Never modify raw data — always write to processed/
- Use pandas 2.x copy-on-write mode (pd.options.mode.copy_on_write = True)
- Prefer vectorized operations over loops — profile with %timeit if unsure
- All experiments tracked with MLflow (mlflow.set_experiment('project-name'))
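The copy-on-write guideline above is worth seeing in action: with CoW enabled, writing to a filtered view-like object never silently mutates the parent frame (and no SettingWithCopyWarning fires). A small demonstration:

```python
import pandas as pd

pd.options.mode.copy_on_write = True  # pandas 2.x

df = pd.DataFrame({'a': [1, 2, 3]})
subset = df[df['a'] > 1]
subset['a'] = 0  # under CoW this modifies only the copy, never df

print(df['a'].tolist())  # original untouched: [1, 2, 3]
```

This becomes the default behavior in pandas 3.0, so opting in early keeps code forward-compatible.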
Data Cleaning Automation
This dataframe has messy address data. Standardize city names
(many variations like "NYC", "New York City", "new york"),
fix ZIP codes, and extract state from address where state is null.
import re
import pandas as pd
CITY_NORMALIZATIONS = {
    'nyc': 'New York', 'new york city': 'New York', 'ny': 'New York',
    'sf': 'San Francisco', 'la': 'Los Angeles', 'chi': 'Chicago',
}

def clean_city(city: str | None) -> str | None:
    if pd.isna(city):
        return None
    normalized = city.strip().lower()
    return CITY_NORMALIZATIONS.get(normalized, city.strip().title())

def clean_zip(zip_code: str | None) -> str | None:
    if pd.isna(zip_code):
        return None
    match = re.match(r'^(\d{5})(?:-\d{4})?$', str(zip_code).strip())
    return match.group(1) if match else None
STATE_PATTERN = re.compile(
    r'\b(AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|MA|'
    r'MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|SD|TN|TX|'
    r'UT|VT|VA|WA|WV|WI|WY)\b'
)

def extract_state(row: pd.Series) -> str | None:
    if pd.notna(row['state']):
        return row['state']
    if pd.isna(row['address']):
        return None
    match = STATE_PATTERN.search(str(row['address']).upper())
    return match.group(1) if match else None
df['city_clean'] = df['city'].apply(clean_city)
df['zip_clean'] = df['zip_code'].apply(clean_zip)
df['state_clean'] = df.apply(extract_state, axis=1)
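On large frames, the element-wise apply calls above can themselves become a bottleneck. The same city normalization can be vectorized with string accessors and map — a sketch assuming a trimmed-down version of the CITY_NORMALIZATIONS dict:

```python
import pandas as pd

CITY_NORMALIZATIONS = {
    'nyc': 'New York', 'new york city': 'New York', 'ny': 'New York',
}

df = pd.DataFrame({'city': ['NYC', ' new york city ', 'Boston', None]})

# Vectorized equivalent of clean_city: normalize once, map known aliases,
# fall back to title-case for everything else; nulls stay null throughout.
stripped = df['city'].str.strip()
df['city_clean'] = stripped.str.lower().map(CITY_NORMALIZATIONS).fillna(stripped.str.title())
```

The behavior matches clean_city row for row, but runs entirely in pandas' string machinery.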
Vectorizing Slow Operations
This data processing loop takes 45 minutes on 2M rows. Speed it up.
# Original — row-wise apply (axis=1) runs Python code per row,
# ~100x slower than vectorized operations
def calculate_discount(row):
    if row['customer_tier'] == 'gold':
        return row['price'] * 0.15
    elif row['customer_tier'] == 'silver':
        return row['price'] * 0.10
    elif row['purchase_count'] > 10:
        return row['price'] * 0.05
    return 0.0

df['discount'] = df.apply(calculate_discount, axis=1)  # 45 minutes
# Vectorized — 50-200x faster using numpy.select
import numpy as np
conditions = [
    df['customer_tier'] == 'gold',
    df['customer_tier'] == 'silver',
    df['purchase_count'] > 10,
]
choices = [
    df['price'] * 0.15,
    df['price'] * 0.10,
    df['price'] * 0.05,
]
df['discount'] = np.select(conditions, choices, default=0.0)  # ~15 seconds
Claude Code identifies the bottleneck, explains why row-wise apply is slow, and rewrites it with numpy.select or Series.map depending on the logic pattern.
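When the branching is a pure lookup with no overlapping conditions, Series.map is even simpler than np.select. A sketch with a hypothetical tier-to-rate table:

```python
import pandas as pd

df = pd.DataFrame({
    'customer_tier': ['gold', 'silver', 'bronze', 'gold'],
    'price': [100.0, 50.0, 80.0, 20.0],
})

# Pure lookup logic: map tiers to rates, default to 0 for unmapped tiers
TIER_RATES = {'gold': 0.15, 'silver': 0.10}
df['discount'] = df['price'] * df['customer_tier'].map(TIER_RATES).fillna(0.0)
```

This only works when each row's discount depends on a single key; the np.select version is needed once conditions mix columns, as with the purchase_count fallback above.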
scikit-learn Pipelines
Build a churn prediction pipeline. Features: customer_age, monthly_spend,
tenure_months, support_tickets, last_login_days_ago, plan_type (categorical).
Handle missing values and scale properly.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
NUMERIC_FEATURES = ['customer_age', 'monthly_spend', 'tenure_months',
                    'support_tickets', 'last_login_days_ago']
CATEGORICAL_FEATURES = ['plan_type']

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ]), NUMERIC_FEATURES),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
        ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
    ]), CATEGORICAL_FEATURES),
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(
        n_estimators=200, learning_rate=0.05, max_depth=4,
        subsample=0.8, random_state=42,
    )),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(
    pipeline, X_train, y_train, cv=cv,
    scoring=['accuracy', 'roc_auc', 'f1'],
)
print(f"CV AUC: {cv_results['test_roc_auc'].mean():.3f} ± {cv_results['test_roc_auc'].std():.3f}")
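The snippet above assumes X_train and y_train already exist. A minimal, hypothetical setup (synthetic data standing in for the real churn table) might look like:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    'customer_age': rng.integers(18, 80, n),
    'monthly_spend': rng.gamma(2.0, 30.0, n),
    'tenure_months': rng.integers(1, 120, n),
    'support_tickets': rng.poisson(1.5, n),
    'last_login_days_ago': rng.integers(0, 90, n),
    'plan_type': rng.choice(['basic', 'pro', 'enterprise'], n),
    'churned': rng.integers(0, 2, n),
})

X = df.drop(columns='churned')
y = df['churned']

# Hold out a test set before any fitting; stratify to preserve class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42,
)
```

Splitting before any preprocessing matters: the pipeline re-fits imputers and scalers inside each CV fold, which is what prevents leakage from the held-out data.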
Feature Importance
Which features are driving the churn predictions?
# Requires a fitted pipeline — cross_validate fits clones, so fit once first:
# pipeline.fit(X_train, y_train)
ohe_feature_names = (pipeline.named_steps['preprocessor']
                     .named_transformers_['cat']
                     .named_steps['encoder']
                     .get_feature_names_out(CATEGORICAL_FEATURES))
feature_names = NUMERIC_FEATURES + list(ohe_feature_names)
importances = pipeline.named_steps['classifier'].feature_importances_
importance_df = (pd.DataFrame({'feature': feature_names, 'importance': importances})
                 .sort_values('importance', ascending=False)
                 .head(15))
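Impurity-based importances can overstate high-cardinality or easily-split features. Permutation importance on held-out data is a useful cross-check: shuffle one feature at a time and measure the score drop. A sketch with stand-in data (in the churn project this would run against the fitted pipeline and X_test):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target
X, y = make_classification(n_samples=400, n_features=5, n_informative=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the AUC drop
result = permutation_importance(
    clf, X_test, y_test, scoring='roc_auc', n_repeats=10, random_state=42,
)
ranked = pd.Series(result.importances_mean,
                   index=[f'f{i}' for i in range(5)]).sort_values(ascending=False)
print(ranked)
```

Features whose permutation hurts held-out AUC are the ones the model genuinely relies on; features important only by the impurity measure deserve scrutiny.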
Moving from Notebook to Production
This notebook works for experiments but I need to deploy it
as an API that scores customers daily. Convert it.
# src/models/churn_predictor.py
import joblib
import pandas as pd
from pathlib import Path
MODEL_PATH = Path('models/churn_pipeline_v2.joblib')
class ChurnPredictor:
    def __init__(self):
        self.pipeline = joblib.load(MODEL_PATH)

    def predict(self, features: pd.DataFrame) -> list[dict]:
        proba = self.pipeline.predict_proba(features)[:, 1]
        return [
            {
                'customer_id': str(cid),
                'churn_probability': round(float(p), 4),
                'risk_level': 'high' if p >= 0.6 else 'medium' if p >= 0.3 else 'low',
            }
            for cid, p in zip(features.index, proba)
        ]
# FastAPI serving layer
from fastapi import FastAPI
app = FastAPI()
predictor = ChurnPredictor()
@app.post('/score')
async def score_customers(customer_ids: list[str]) -> list[dict]:
    # fetch_customer_features: async feature-store lookup, defined elsewhere
    features = await fetch_customer_features(customer_ids)
    return predictor.predict(features)
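The serving class assumes models/churn_pipeline_v2.joblib already exists. After training, the fitted pipeline is persisted with joblib — a sketch using a stand-in pipeline in place of the full churn preprocessing pipeline:

```python
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the fitted preprocessing + model pipeline from training
X, y = make_classification(n_samples=200, random_state=42)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', GradientBoostingClassifier(random_state=42)),
]).fit(X, y)

out = Path('models')
out.mkdir(exist_ok=True)
joblib.dump(pipeline, out / 'churn_pipeline_v2.joblib')

# Round-trip check: the reloaded pipeline scores identically
reloaded = joblib.load(out / 'churn_pipeline_v2.joblib')
assert (reloaded.predict(X) == pipeline.predict(X)).all()
```

Persisting the whole pipeline (preprocessing included) is what lets the API pass raw features straight to predict_proba without duplicating the transformation logic.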
Debugging Common ML Issues
My model has 99% training accuracy but 58% validation accuracy. What's wrong?
Claude Code diagnoses overfitting and generates a diagnostic checklist:
# Check if you have data leakage (most common cause of suspiciously high train accuracy)
# 1. Verify target variable isn't in features
# 2. Verify temporal features don't use future information
# 3. Verify train/val split was done before preprocessing
# Visualize the gap — learning curves
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
train_sizes, train_scores, val_scores = learning_curve(
    pipeline, X_train, y_train,
    cv=5, scoring='roc_auc',
    train_sizes=np.linspace(0.1, 1.0, 10),
)
plt.plot(train_sizes, train_scores.mean(axis=1), label='Train AUC')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation AUC')
plt.xlabel('Training examples')
plt.ylabel('ROC AUC')
plt.legend()
plt.savefig('learning_curves.png')
If the training curve stays high while the validation curve plateaus well below it even as data grows, it’s true overfitting — reduce model complexity, add regularization, or collect more diverse training data.
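For the GradientBoostingClassifier used earlier, reducing complexity usually means shallower trees, subsampling, and built-in early stopping. A hedged sketch of a more regularized configuration, on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=42)

# Regularized configuration: shallow trees plus early stopping on an
# internal validation split (n_iter_no_change enables it)
clf = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=2,             # shallower trees generalize better
    subsample=0.8,           # row subsampling adds variance reduction
    validation_fraction=0.2,
    n_iter_no_change=10,     # stop when the validation score stops improving
    random_state=42,
).fit(X, y)

print(clf.n_estimators_)  # the number of trees actually fit before stopping
```

The exact knob values are illustrative; the point is that early stopping caps model complexity using the data itself rather than a hand-picked tree count.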
For deploying the scoring API, see the serverless guide or Docker guide. For Python backend APIs with FastAPI, see the Python FastAPI guide. The Claude Skills 360 bundle includes data science skill sets for EDA automation, feature engineering pipelines, and model evaluation patterns. Start with the free tier to try data science code generation on your datasets.