scikit-learn provides a complete, consistent API for classical machine learning; install it with `pip install scikit-learn`. A `Pipeline` chains preprocessing and a model — `pipe = Pipeline([("scaler", StandardScaler()), ("clf", RandomForestClassifier())])` — then `pipe.fit(X_train, y_train)` trains and `pipe.predict(X_test)` predicts. `ColumnTransformer` routes column subsets to different transformers: `ct = ColumnTransformer([("num", StandardScaler(), num_cols), ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)])`. Impute missing values with `SimpleImputer(strategy="median")` (other strategies: "mean", "most_frequent", "constant"). Evaluate with `cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")`; tune with `GridSearchCV(pipe, param_grid={"clf__n_estimators": [100, 200]}, cv=5, n_jobs=-1)` or `RandomizedSearchCV(pipe, param_distributions, n_iter=50, cv=5)`. Metrics include `classification_report(y_test, y_pred)`, `confusion_matrix`, and `roc_auc_score(y_test, probs)`. For feature selection, use `SelectFromModel(RandomForestClassifier(), threshold="mean")` or `RFECV(estimator, cv=5, scoring="roc_auc")`. Persist models with `joblib.dump(pipe, "model.pkl")` and reload with `joblib.load("model.pkl")`. Key estimators: `LogisticRegression(C=1, max_iter=1000)`, `SVC(C=1, kernel="rbf", probability=True)`, `RandomForestClassifier(n_estimators=200)`, `GradientBoostingClassifier(n_estimators=200, max_depth=3)`, `KNeighborsClassifier(n_neighbors=5)`. `CalibratedClassifierCV(base_clf, cv=5)` yields reliable probabilities; `VotingClassifier` and `StackingClassifier` build ensembles. Claude Code generates sklearn pipelines, preprocessing workflows, hyperparameter search scripts, and model evaluation frameworks.
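The core fit/predict workflow described above can be sketched in a few lines; this uses a synthetic dataset purely for illustration, with hyperparameters chosen arbitrarily:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, just for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaler and model live in one Pipeline, so the scaler is fitted on
# the training split only — no leakage into the test split
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")
```

Because the scaler and estimator are one object, cross-validation and grid search re-fit the whole chain per fold, which is what keeps the evaluation honest.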
## CLAUDE.md for scikit-learn
## scikit-learn Stack
- Version: scikit-learn >= 1.4
- Pipeline: Pipeline([("name", transform), ..., ("model", estimator)])
- Preprocessing: ColumnTransformer([("name", transformer, columns)])
- Numeric: StandardScaler | MinMaxScaler | RobustScaler | PowerTransformer
- Categorical: OneHotEncoder(handle_unknown="ignore") | OrdinalEncoder
- Impute: SimpleImputer(strategy="median") | KNNImputer(n_neighbors=5)
- Evaluate: cross_val_score(pipe, X, y, cv=5, scoring="roc_auc", n_jobs=-1)
- Tune: GridSearchCV | RandomizedSearchCV | HalvingRandomSearchCV
- Persist: joblib.dump(pipe, path) | joblib.load(path)
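HalvingRandomSearchCV, listed in the tuning options above, still lives behind an experimental-import flag in scikit-learn; a minimal sketch on synthetic data (parameter ranges are illustrative):

```python
# The experimental import is required to expose HalvingRandomSearchCV
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from scipy.stats import randint

X, y = make_classification(n_samples=600, n_features=10, random_state=42)

# Successive halving: evaluate many candidates on a small resource budget,
# keep the best 1/factor of them, and re-evaluate survivors with more resources
search = HalvingRandomSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": randint(20, 150), "max_depth": randint(3, 12)},
    factor=3,
    random_state=42,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```

This typically finds comparable hyperparameters to a full RandomizedSearchCV at a fraction of the fit cost, since weak candidates are discarded early.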
## scikit-learn ML Pipeline
# ml/sklearn_pipeline.py — complete ML pipeline with scikit-learn
from __future__ import annotations
import warnings
import numpy as np
import pandas as pd
from pathlib import Path
import joblib
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
StandardScaler, MinMaxScaler, RobustScaler,
OneHotEncoder, OrdinalEncoder, LabelEncoder,
PowerTransformer, PolynomialFeatures,
)
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.feature_selection import SelectFromModel, RFECV, VarianceThreshold
from sklearn.model_selection import (
cross_val_score, cross_validate, GridSearchCV,
RandomizedSearchCV, StratifiedKFold, train_test_split,
)
from sklearn.metrics import (
classification_report, confusion_matrix,
roc_auc_score, average_precision_score,
mean_squared_error, mean_absolute_error, r2_score,
)
from sklearn.ensemble import (
RandomForestClassifier, RandomForestRegressor,
GradientBoostingClassifier, GradientBoostingRegressor,
VotingClassifier, StackingClassifier,
)
from sklearn.linear_model import (
LogisticRegression, Ridge, Lasso, ElasticNet,
)
from sklearn.svm import SVC, SVR
from sklearn.calibration import CalibratedClassifierCV
# ── 1. Preprocessing builder ──────────────────────────────────────────────────
def build_preprocessor(
num_cols: list[str],
cat_cols: list[str],
strategy_numeric: str = "median",
scaler: str = "standard", # "standard" | "robust" | "minmax" | "power"
cat_strategy: str = "onehot", # "onehot" | "ordinal"
add_poly: bool = False,
poly_degree: int = 2,
) -> ColumnTransformer:
"""
Build a ColumnTransformer that handles numeric and categorical columns separately.
scaler choices:
- standard: assumes Gaussian, zero-mean unit-variance
- robust: better for outlier-heavy features (uses IQR)
- minmax: preserves 0 range (e.g. for SVM, KNN)
- power: Box-Cox/Yeo-Johnson for skewed distributions
"""
scaler_map = {
"standard": StandardScaler(),
"robust": RobustScaler(),
"minmax": MinMaxScaler(),
"power": PowerTransformer(method="yeo-johnson"),
}
chosen_scaler = scaler_map[scaler]
numeric_steps = [
("imputer", SimpleImputer(strategy=strategy_numeric)),
("scaler", chosen_scaler),
]
if add_poly:
numeric_steps.append(("poly", PolynomialFeatures(degree=poly_degree, include_bias=False)))
numeric_pipe = Pipeline(numeric_steps)
if cat_strategy == "onehot":
cat_encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
else:
cat_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
categorical_pipe = Pipeline([
("imputer", SimpleImputer(strategy="most_frequent")),
("encoder", cat_encoder),
])
transformers = []
if num_cols:
transformers.append(("numeric", numeric_pipe, num_cols))
if cat_cols:
transformers.append(("categorical", categorical_pipe, cat_cols))
return ColumnTransformer(transformers, remainder="drop")
# ── 2. Pipeline builder ───────────────────────────────────────────────────────
def build_pipeline(
    preprocessor,
    model_type: str = "rf",  # "lr" | "rf" | "gbm" | "svm"
    **model_kwargs,
) -> Pipeline:
    """
    Assemble a full preprocessing + model Pipeline.
    model_type:
        - lr: LogisticRegression (baseline, fast)
        - rf: RandomForestClassifier (strong default)
        - gbm: GradientBoostingClassifier (accurate, slower to train)
        - svm: calibrated SVC (good for small datasets)
    """
    # Lazy factories: only the chosen estimator is constructed, so
    # model_kwargs never reach estimators that don't accept them
    # (eagerly building every model would raise TypeError on mismatched kwargs).
    model_factories = {
        "lr": lambda: LogisticRegression(max_iter=1000, random_state=42, **model_kwargs),
        "rf": lambda: RandomForestClassifier(n_estimators=200, random_state=42, **model_kwargs),
        "gbm": lambda: GradientBoostingClassifier(n_estimators=200, max_depth=4,
                                                  learning_rate=0.05, random_state=42, **model_kwargs),
        "svm": lambda: CalibratedClassifierCV(SVC(kernel="rbf", probability=False), cv=3),
    }
    model = model_factories.get(model_type, model_factories["rf"])()
    return Pipeline([
        ("preprocessor", preprocessor),
        ("model", model),
    ])
# ── 3. Model evaluation ───────────────────────────────────────────────────────
def evaluate(
pipeline,
X_test: pd.DataFrame | np.ndarray,
y_test: np.ndarray,
task: str = "classification",
) -> dict:
"""Evaluate pipeline on held-out test data."""
if task == "classification":
y_pred = pipeline.predict(X_test)
try:
y_proba = pipeline.predict_proba(X_test)[:, 1]
except AttributeError:
y_proba = pipeline.decision_function(X_test)
report = classification_report(y_test, y_pred, output_dict=True)
return {
"accuracy": round(report["accuracy"], 4),
"f1_macro": round(report["macro avg"]["f1-score"], 4),
"roc_auc": round(roc_auc_score(y_test, y_proba), 4),
"avg_prec": round(average_precision_score(y_test, y_proba), 4),
}
else:
y_pred = pipeline.predict(X_test)
return {
"rmse": round(np.sqrt(mean_squared_error(y_test, y_pred)), 4),
"mae": round(mean_absolute_error(y_test, y_pred), 4),
"r2": round(r2_score(y_test, y_pred), 4),
}
def cross_validate_pipeline(
pipeline,
X: pd.DataFrame | np.ndarray,
y: np.ndarray,
cv: int = 5,
scoring: str = "roc_auc",
n_jobs: int = -1,
) -> dict:
"""Run stratified k-fold cross-validation."""
scores = cross_val_score(pipeline, X, y, cv=cv, scoring=scoring, n_jobs=n_jobs)
print(f"CV {scoring}: {scores.mean():.4f} ± {scores.std():.4f}")
return {"mean": round(float(scores.mean()), 4), "std": round(float(scores.std()), 4)}
# ── 4. Hyperparameter search ──────────────────────────────────────────────────
def grid_search(
pipeline,
param_grid: dict,
X_train: np.ndarray,
y_train: np.ndarray,
cv: int = 5,
scoring: str = "roc_auc",
n_jobs: int = -1,
verbose: int = 1,
) -> GridSearchCV:
"""
Exhaustive grid search over param_grid.
param_grid keys use double underscore for nested params:
{"model__n_estimators": [100, 200], "preprocessor__numeric__scaler__with_std": [True, False]}
"""
gs = GridSearchCV(
pipeline, param_grid, cv=cv, scoring=scoring,
n_jobs=n_jobs, verbose=verbose, refit=True,
)
gs.fit(X_train, y_train)
print(f"Best {scoring}: {gs.best_score_:.4f}")
print(f"Best params: {gs.best_params_}")
return gs
def random_search(
pipeline,
param_distributions: dict,
X_train: np.ndarray,
y_train: np.ndarray,
n_iter: int = 50,
cv: int = 5,
scoring: str = "roc_auc",
n_jobs: int = -1,
) -> RandomizedSearchCV:
"""Randomized hyperparameter search — faster than GridSearch for large spaces."""
rs = RandomizedSearchCV(
pipeline, param_distributions,
n_iter=n_iter, cv=cv, scoring=scoring,
n_jobs=n_jobs, random_state=42, verbose=1, refit=True,
)
rs.fit(X_train, y_train)
print(f"Best {scoring}: {rs.best_score_:.4f}")
return rs
# ── 5. Feature selection ──────────────────────────────────────────────────────
def select_features_rfecv(
X_train: np.ndarray,
y_train: np.ndarray,
estimator = None,
cv: int = 5,
scoring: str = "roc_auc",
min_features: int = 1,
) -> tuple[np.ndarray, list[int]]:
"""
Recursive Feature Elimination with CV to find optimal feature count.
Returns X_train with only selected features, and selected indices.
"""
if estimator is None:
estimator = RandomForestClassifier(n_estimators=100, random_state=42)
selector = RFECV(estimator, cv=cv, scoring=scoring, min_features_to_select=min_features, n_jobs=-1)
X_selected = selector.fit_transform(X_train, y_train)
selected = selector.get_support(indices=True).tolist()
print(f"Features selected: {len(selected)} / {X_train.shape[1]}")
return X_selected, selected
# ── 6. Ensemble methods ───────────────────────────────────────────────────────
def build_voting_ensemble(
models: list[tuple[str, object]],
voting: str = "soft", # "hard" | "soft" (soft requires predict_proba)
) -> VotingClassifier:
"""Build a VotingClassifier from a list of (name, estimator) tuples."""
return VotingClassifier(estimators=models, voting=voting, n_jobs=-1)
def build_stacking_ensemble(
base_models: list[tuple[str, object]],
final_model = None,
cv: int = 5,
stack_method: str = "predict_proba",
) -> StackingClassifier:
"""
Stacking: base models generate out-of-fold predictions as features
for the final estimator (meta-learner).
"""
if final_model is None:
final_model = LogisticRegression(max_iter=1000)
return StackingClassifier(
estimators=base_models,
final_estimator=final_model,
cv=cv,
stack_method=stack_method,
n_jobs=-1,
)
# ── 7. Persistence ────────────────────────────────────────────────────────────
def save_pipeline(pipeline, path: str) -> None:
joblib.dump(pipeline, path, compress=3)
size = Path(path).stat().st_size / 1024
print(f"Saved: {path} ({size:.0f} KB)")
def load_pipeline(path: str):
return joblib.load(path)
# ── Demo ──────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
from sklearn.datasets import make_classification
    from scipy.stats import randint
print("scikit-learn Pipeline Demo")
print("="*50)
# Create synthetic tabular data
n = 5_000
rng = np.random.default_rng(42)
df = pd.DataFrame({
"age": rng.integers(18, 80, n).astype(float),
"income": rng.normal(50_000, 20_000, n),
"score": rng.uniform(0, 1, n),
"city": rng.choice(["NYC", "LA", "Chicago", "Miami"], n),
"education": rng.choice(["high_school", "bachelor", "master", "phd"], n),
})
# Add some missing values
for col in ["age", "income"]:
mask = rng.random(n) < 0.05
df.loc[mask, col] = np.nan
y = (df["income"].fillna(50000) > 55000).astype(int).values
X = df
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
num_cols = ["age", "income", "score"]
cat_cols = ["city", "education"]
# Build and train
preprocessor = build_preprocessor(num_cols, cat_cols, scaler="robust")
pipeline = build_pipeline(preprocessor, model_type="rf")
pipeline.fit(X_tr, y_tr)
# Evaluate
metrics = evaluate(pipeline, X_te, y_te)
print(f"\nTest metrics: {metrics}")
# Cross-validate
cross_validate_pipeline(pipeline, X, y, cv=5, scoring="roc_auc")
# Random search
param_dist = {
"model__n_estimators": randint(100, 400),
"model__max_depth": randint(3, 15),
"model__min_samples_split": randint(2, 20),
}
rs = random_search(pipeline, param_dist, X_tr, y_tr, n_iter=10, cv=3)
best_metrics = evaluate(rs.best_estimator_, X_te, y_te)
print(f"\nTuned metrics: {best_metrics}")
save_pipeline(rs.best_estimator_, "/tmp/sklearn_model.pkl")
loaded = load_pipeline("/tmp/sklearn_model.pkl")
print(f"Loaded model predicts: {loaded.predict(X_te[:3])}")
Compared with XGBoost/LightGBM: reach for those libraries when optimizing for raw predictive performance on large tabular datasets — gradient boosting outperforms Random Forest on most tabular benchmarks — but scikit-learn's Pipeline + ColumnTransformer remains the gold-standard API for reproducible preprocessing that prevents data leakage (every transform is fitted on the training split and only applied to the test split), and GridSearchCV with n_jobs=-1 plus cross_validate with multiple scoring metrics provide model-selection infrastructure that XGBoost's native API lacks.

Compared with statsmodels: choose statsmodels when you need statistical inference — p-values, confidence intervals, and diagnostic plots for linear models — while sklearn's consistent fit/predict/score API across 50+ estimators lets you compare models in a single `for estimator in models` loop, and a StackingClassifier meta-learner combining heterogeneous models (RF + LR + SVM) often outperforms any single model on structured prediction tasks.

The Claude Skills 360 bundle includes scikit-learn skill sets covering Pipeline and ColumnTransformer preprocessing, numeric and categorical transformers, imputation strategies, cross-validation, GridSearchCV and RandomizedSearchCV, feature selection with RFECV, VotingClassifier and StackingClassifier ensembles, and joblib persistence. Start with the free tier to try ML pipeline code generation.
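The multi-metric `cross_validate` mentioned above looks like this in practice — one cross-validation run scored on several metrics at once, rather than one `cross_val_score` call per metric (synthetic data; metric names follow sklearn's scoring strings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# One CV run, several metrics — the estimator is fitted once per fold,
# not once per fold per metric
results = cross_validate(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y, cv=5,
    scoring=["accuracy", "roc_auc", "f1"],
    n_jobs=-1,
)
for metric in ("accuracy", "roc_auc", "f1"):
    scores = results[f"test_{metric}"]
    print(f"{metric}: {scores.mean():.3f} ± {scores.std():.3f}")
```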