Data science work involves pattern-matching against known solutions: handling missing values, encoding categorical features, diagnosing model performance issues, vectorizing slow loops. Claude Code accelerates this significantly because these are solved problems — the challenge is knowing which solution fits your specific data shape and domain.
This guide covers data science workflows with Claude Code: exploratory data analysis, data cleaning, scikit-learn pipelines, model evaluation, and moving from notebook to production.
EDA and Data Quality
Load this CSV, give me a data quality report, and identify
the columns that need cleaning before modeling.
Claude Code runs a systematic EDA:
import pandas as pd
import numpy as np
df = pd.read_csv('customers.csv')
def data_quality_report(df: pd.DataFrame) -> pd.DataFrame:
    report = pd.DataFrame({
        'dtype': df.dtypes,
        'null_pct': (df.isna().sum() / len(df) * 100).round(2),
        'unique_values': df.nunique(),
        'sample_values': df.apply(lambda col: col.dropna().head(3).tolist()),
    })
    report['action_needed'] = ''
    report.loc[report['null_pct'] > 20, 'action_needed'] += 'HIGH_NULLS '
    # Compare dtypes as strings — the dtype column holds dtype objects, not str
    low_card = (report['dtype'].astype(str).isin(['int64', 'float64'])) & (report['unique_values'] < 10)
    report.loc[low_card, 'action_needed'] += 'MAYBE_CATEGORICAL '
    return report
print(data_quality_report(df).to_string())
Claude Code reads the output, identifies problem columns, and proposes a cleaning plan specific to your data domain.
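One common follow-up on columns flagged MAYBE_CATEGORICAL is converting them to pandas' category dtype, which documents intent and lets downstream encoders treat them correctly. A minimal sketch, with a hypothetical dataframe and cardinality threshold:

```python
import pandas as pd

df = pd.DataFrame({
    'plan_id': [1, 2, 1, 3, 2],          # low-cardinality int: likely categorical
    'monthly_spend': [42.0, 17.5, 88.0, 23.0, 51.0],
})

# Numeric columns with few distinct values are candidates for category dtype
low_card_cols = [c for c in df.select_dtypes('number').columns if df[c].nunique() < 4]
for c in low_card_cols:
    df[c] = df[c].astype('category')

print(df.dtypes['plan_id'])  # category
```

The threshold here is illustrative; the right cutoff depends on your data.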
CLAUDE.md for Data Science Projects
## Data Science Environment
- Python 3.11, pandas 2.x, numpy 1.26, scikit-learn 1.4
- Data: raw in data/raw/, processed in data/processed/
- Never modify raw data — always write to processed/
- Use pandas 2.x copy-on-write mode (pd.options.mode.copy_on_write = True)
- Prefer vectorized operations over loops — profile with %timeit if unsure
- All experiments tracked with MLflow (mlflow.set_experiment('project-name'))
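The copy-on-write guideline above is worth seeing in action: with CoW enabled, writing to a filtered view-like object never silently mutates the parent frame (and no SettingWithCopyWarning fires). A small demonstration:

```python
import pandas as pd

pd.options.mode.copy_on_write = True  # pandas 2.x

df = pd.DataFrame({'a': [1, 2, 3]})
subset = df[df['a'] > 1]
subset['a'] = 0  # under CoW this modifies only the copy, never df

print(df['a'].tolist())  # original untouched: [1, 2, 3]
```

This becomes the default behavior in pandas 3.0, so opting in early keeps code forward-compatible.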
Data Cleaning Automation
This dataframe has messy address data. Standardize city names
(many variations like "NYC", "New York City", "new york"),
fix ZIP codes, and extract state from address where state is null.
import re
import pandas as pd
CITY_NORMALIZATIONS = {
    'nyc': 'New York', 'new york city': 'New York', 'ny': 'New York',
    'sf': 'San Francisco', 'la': 'Los Angeles', 'chi': 'Chicago',
}

def clean_city(city: str | None) -> str | None:
    if pd.isna(city):
        return None
    normalized = city.strip().lower()
    return CITY_NORMALIZATIONS.get(normalized, city.strip().title())

def clean_zip(zip_code: str | None) -> str | None:
    if pd.isna(zip_code):
        return None
    match = re.match(r'^(\d{5})(?:-\d{4})?$', str(zip_code).strip())
    return match.group(1) if match else None
STATE_PATTERN = re.compile(
    r'\b(AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|MA|'
    r'MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|SD|TN|TX|'
    r'UT|VT|VA|WA|WV|WI|WY)\b'
)

def extract_state(row: pd.Series) -> str | None:
    if pd.notna(row['state']):
        return row['state']
    if pd.isna(row['address']):
        return None
    match = STATE_PATTERN.search(str(row['address']).upper())
    return match.group(1) if match else None
df['city_clean'] = df['city'].apply(clean_city)
df['zip_clean'] = df['zip_code'].apply(clean_zip)
df['state_clean'] = df.apply(extract_state, axis=1)
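On large frames, the element-wise apply calls above can themselves become a bottleneck. The same city normalization can be vectorized with string accessors and map — a sketch assuming a trimmed-down version of the CITY_NORMALIZATIONS dict:

```python
import pandas as pd

CITY_NORMALIZATIONS = {
    'nyc': 'New York', 'new york city': 'New York', 'ny': 'New York',
}

df = pd.DataFrame({'city': ['NYC', ' new york city ', 'Boston', None]})

# Vectorized equivalent of clean_city: normalize once, map known aliases,
# fall back to title-case for everything else; nulls stay null throughout.
stripped = df['city'].str.strip()
df['city_clean'] = stripped.str.lower().map(CITY_NORMALIZATIONS).fillna(stripped.str.title())
```

The behavior matches clean_city row for row, but runs entirely in pandas' string machinery.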
Vectorizing Slow Operations
This data processing loop takes 45 minutes on 2M rows. Speed it up.
# Original — row-wise apply (axis=1) runs Python code per row,
# ~100x slower than vectorized operations
def calculate_discount(row):
    if row['customer_tier'] == 'gold':
        return row['price'] * 0.15
    elif row['customer_tier'] == 'silver':
        return row['price'] * 0.10
    elif row['purchase_count'] > 10:
        return row['price'] * 0.05
    return 0.0

df['discount'] = df.apply(calculate_discount, axis=1)  # 45 minutes
# Vectorized — 50-200x faster using numpy.select
import numpy as np
conditions = [
    df['customer_tier'] == 'gold',
    df['customer_tier'] == 'silver',
    df['purchase_count'] > 10,
]
choices = [
    df['price'] * 0.15,
    df['price'] * 0.10,
    df['price'] * 0.05,
]
df['discount'] = np.select(conditions, choices, default=0.0)  # ~15 seconds
Claude Code identifies the bottleneck, explains why row-wise apply is slow, and rewrites it with numpy.select or Series.map depending on the logic pattern.
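When the branching is a pure lookup with no overlapping conditions, Series.map is even simpler than np.select. A sketch with a hypothetical tier-to-rate table:

```python
import pandas as pd

df = pd.DataFrame({
    'customer_tier': ['gold', 'silver', 'bronze', 'gold'],
    'price': [100.0, 50.0, 80.0, 20.0],
})

# Pure lookup logic: map tiers to rates, default to 0 for unmapped tiers
TIER_RATES = {'gold': 0.15, 'silver': 0.10}
df['discount'] = df['price'] * df['customer_tier'].map(TIER_RATES).fillna(0.0)
```

This only works when each row's discount depends on a single key; the np.select version is needed once conditions mix columns, as with the purchase_count fallback above.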
scikit-learn Pipelines
Build a churn prediction pipeline. Features: customer_age, monthly_spend,
tenure_months, support_tickets, last_login_days_ago, plan_type (categorical).
Handle missing values and scale properly.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
NUMERIC_FEATURES = ['customer_age', 'monthly_spend', 'tenure_months',
                    'support_tickets', 'last_login_days_ago']
CATEGORICAL_FEATURES = ['plan_type']

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ]), NUMERIC_FEATURES),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
        ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
    ]), CATEGORICAL_FEATURES),
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(
        n_estimators=200, learning_rate=0.05, max_depth=4,
        subsample=0.8, random_state=42,
    )),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(
    pipeline, X_train, y_train, cv=cv,
    scoring=['accuracy', 'roc_auc', 'f1'],
)
print(f"CV AUC: {cv_results['test_roc_auc'].mean():.3f} ± {cv_results['test_roc_auc'].std():.3f}")
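The snippet above assumes X_train and y_train already exist. A minimal, hypothetical setup (synthetic data standing in for the real churn table) might look like:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    'customer_age': rng.integers(18, 80, n),
    'monthly_spend': rng.gamma(2.0, 30.0, n),
    'tenure_months': rng.integers(1, 120, n),
    'support_tickets': rng.poisson(1.5, n),
    'last_login_days_ago': rng.integers(0, 90, n),
    'plan_type': rng.choice(['basic', 'pro', 'enterprise'], n),
    'churned': rng.integers(0, 2, n),
})

X = df.drop(columns='churned')
y = df['churned']

# Hold out a test set before any fitting; stratify to preserve class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42,
)
```

Splitting before any preprocessing matters: the pipeline re-fits imputers and scalers inside each CV fold, which is what prevents leakage from the held-out data.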
Feature Importance
Which features are driving the churn predictions?
# Requires a fitted pipeline — cross_validate fits clones, so fit once first:
# pipeline.fit(X_train, y_train)
ohe_feature_names = (pipeline.named_steps['preprocessor']
                     .named_transformers_['cat']
                     .named_steps['encoder']
                     .get_feature_names_out(CATEGORICAL_FEATURES))
feature_names = NUMERIC_FEATURES + list(ohe_feature_names)
importances = pipeline.named_steps['classifier'].feature_importances_
importance_df = (pd.DataFrame({'feature': feature_names, 'importance': importances})
                 .sort_values('importance', ascending=False)
                 .head(15))
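Impurity-based importances can overstate high-cardinality or easily-split features. Permutation importance on held-out data is a useful cross-check: shuffle one feature at a time and measure the score drop. A sketch with stand-in data (in the churn project this would run against the fitted pipeline and X_test):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target
X, y = make_classification(n_samples=400, n_features=5, n_informative=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the AUC drop
result = permutation_importance(
    clf, X_test, y_test, scoring='roc_auc', n_repeats=10, random_state=42,
)
ranked = pd.Series(result.importances_mean,
                   index=[f'f{i}' for i in range(5)]).sort_values(ascending=False)
print(ranked)
```

Features whose permutation hurts held-out AUC are the ones the model genuinely relies on; features important only by the impurity measure deserve scrutiny.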
Moving from Notebook to Production
This notebook works for experiments but I need to deploy it
as an API that scores customers daily. Convert it.
# src/models/churn_predictor.py
import joblib
import pandas as pd
from pathlib import Path
MODEL_PATH = Path('models/churn_pipeline_v2.joblib')
class ChurnPredictor:
    def __init__(self):
        self.pipeline = joblib.load(MODEL_PATH)

    def predict(self, features: pd.DataFrame) -> list[dict]:
        proba = self.pipeline.predict_proba(features)[:, 1]
        return [
            {
                'customer_id': str(cid),
                'churn_probability': round(float(p), 4),
                'risk_level': 'high' if p >= 0.6 else 'medium' if p >= 0.3 else 'low',
            }
            for cid, p in zip(features.index, proba)
        ]
# FastAPI serving layer
from fastapi import FastAPI
app = FastAPI()
predictor = ChurnPredictor()
@app.post('/score')
async def score_customers(customer_ids: list[str]) -> list[dict]:
    # fetch_customer_features: async feature-store lookup, defined elsewhere
    features = await fetch_customer_features(customer_ids)
    return predictor.predict(features)
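The serving class assumes models/churn_pipeline_v2.joblib already exists. After training, the fitted pipeline is persisted with joblib — a sketch using a stand-in pipeline in place of the full churn preprocessing pipeline:

```python
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the fitted preprocessing + model pipeline from training
X, y = make_classification(n_samples=200, random_state=42)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', GradientBoostingClassifier(random_state=42)),
]).fit(X, y)

out = Path('models')
out.mkdir(exist_ok=True)
joblib.dump(pipeline, out / 'churn_pipeline_v2.joblib')

# Round-trip check: the reloaded pipeline scores identically
reloaded = joblib.load(out / 'churn_pipeline_v2.joblib')
assert (reloaded.predict(X) == pipeline.predict(X)).all()
```

Persisting the whole pipeline (preprocessing included) is what lets the API pass raw features straight to predict_proba without duplicating the transformation logic.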
Debugging Common ML Issues
My model has 99% training accuracy but 58% validation accuracy. What's wrong?
Claude Code diagnoses overfitting and generates a diagnostic checklist:
# Check if you have data leakage (most common cause of suspiciously high train accuracy)
# 1. Verify target variable isn't in features
# 2. Verify temporal features don't use future information
# 3. Verify train/val split was done before preprocessing
# Visualize the gap — learning curves
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
train_sizes, train_scores, val_scores = learning_curve(
    pipeline, X_train, y_train,
    cv=5, scoring='roc_auc',
    train_sizes=np.linspace(0.1, 1.0, 10),
)
plt.plot(train_sizes, train_scores.mean(axis=1), label='Train AUC')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation AUC')
plt.xlabel('Training examples')
plt.ylabel('ROC AUC')
plt.legend()
plt.savefig('learning_curves.png')
If the training curve stays high while the validation curve plateaus well below it even as data grows, it’s true overfitting — reduce model complexity, add regularization, or collect more diverse training data.
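For the GradientBoostingClassifier used earlier, reducing complexity usually means shallower trees, subsampling, and built-in early stopping. A hedged sketch of a more regularized configuration, on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=42)

# Regularized configuration: shallow trees plus early stopping on an
# internal validation split (n_iter_no_change enables it)
clf = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=2,             # shallower trees generalize better
    subsample=0.8,           # row subsampling adds variance reduction
    validation_fraction=0.2,
    n_iter_no_change=10,     # stop when the validation score stops improving
    random_state=42,
).fit(X, y)

print(clf.n_estimators_)  # the number of trees actually fit before stopping
```

The exact knob values are illustrative; the point is that early stopping caps model complexity using the data itself rather than a hand-picked tree count.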
For deploying the scoring API, see the serverless guide or Docker guide. For Python backend APIs with FastAPI, see the Python FastAPI guide. The Claude Skills 360 bundle includes data science skill sets for EDA automation, feature engineering pipelines, and model evaluation patterns. Start with the free tier to try data science code generation on your datasets.