#25

Data Preprocessing Pipelines

Medium · 🧠 Deep Learning · W6 D1


Make preprocessing *correct and repeatable* using `Pipeline` + `ColumnTransformer`, while avoiding the most common leakage bugs.

Students can:
- build end-to-end sklearn pipelines for mixed-type data,
- handle missing values, scaling, and one-hot encoding correctly,
- use CV without leakage,
- debug common pipeline mistakes (wrong dtypes, unseen categories, fit on full data).


FAANG Gotchas

• If you don’t set `handle_unknown='ignore'`, production can crash on unseen categories.
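The gotcha above can be reproduced in a few lines. This is a minimal sketch on a toy frame; the column name `plan` and its values are made up for illustration and are not taken from the churn dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: "enterprise" never appears during training (hypothetical values).
train = pd.DataFrame({"plan": ["basic", "pro", "basic"]})
new = pd.DataFrame({"plan": ["enterprise"]})

# Default behaviour (handle_unknown='error'): transform raises on unseen categories.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(new)
    strict_failed = False
except ValueError:
    strict_failed = True

# With handle_unknown='ignore', the unseen category becomes an all-zero row.
tolerant = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = tolerant.transform(new).toarray()

print(strict_failed, encoded)
```

The all-zero row means the model sees "none of the known categories", which is usually a safer failure mode than an exception at inference time.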

Asked At

Google, GitHub

Python 3 — Notebook


Dataset & Setup

Data Preprocessing Pipelines — Student Lab (Customer Churn)

Focus: build leakage-safe preprocessing with ColumnTransformer + Pipeline.

Section 0 — Load churn dataset (with fallback)

Expected path: data/churn/churn.csv
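One way the load-with-fallback step might look is sketched below. The lab does not say what the fallback is, so the synthetic generator here is an assumption, as are all column names (`tenure`, `monthly_charges`, `contract`, `payment_method`, `Churn`) and the helper names `make_synthetic_churn` and `load_churn`.

```python
from pathlib import Path

import numpy as np
import pandas as pd

DATA_PATH = Path("data/churn/churn.csv")  # expected path from the lab


def make_synthetic_churn(n=500, seed=0):
    """Hypothetical fallback: a small mixed-type frame with some missing values."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "tenure": rng.integers(1, 72, size=n).astype(float),
        "monthly_charges": rng.normal(65, 20, size=n).round(2),
        "contract": rng.choice(["month-to-month", "one-year", "two-year"], size=n),
        "payment_method": rng.choice(["card", "bank", "check"], size=n),
    })
    # Inject roughly 5% missing values into one numeric and one categorical column.
    df.loc[rng.choice(n, size=n // 20, replace=False), "monthly_charges"] = np.nan
    df.loc[rng.choice(n, size=n // 20, replace=False), "contract"] = np.nan
    # Shorter tenure -> higher churn probability (illustrative only).
    p_churn = 1.0 / (1.0 + np.exp(0.04 * (df["tenure"] - 24)))
    df["Churn"] = np.where(rng.random(n) < p_churn, "Yes", "No")
    return df


def load_churn(path=DATA_PATH):
    """Read the CSV if it exists, otherwise fall back to synthetic data."""
    if path.exists():
        return pd.read_csv(path)
    return make_synthetic_churn()


df = load_churn()
print(df.shape)
```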


Section 1 — Robust target parsing

Task 1.1: Create y in {0,1}
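A possible approach to Task 1.1 is sketched below. The helper name `parse_binary_target` and the set of accepted spellings are assumptions; the idea is to normalise case and whitespace, map known labels to 0/1, and fail loudly on anything unmapped rather than silently producing NaN.

```python
import pandas as pd


def parse_binary_target(raw: pd.Series) -> pd.Series:
    """Map common yes/no spellings to {0, 1}; raise on anything else."""
    mapping = {
        "yes": 1, "no": 0,
        "true": 1, "false": 0,
        "1": 1, "0": 0,
        "1.0": 1, "0.0": 0,
    }
    # Normalise: stringify, trim whitespace, lowercase.
    cleaned = raw.astype(str).str.strip().str.lower()
    y = cleaned.map(mapping)
    if y.isna().any():
        bad = cleaned[y.isna()].unique().tolist()
        raise ValueError(f"Unmapped target values: {bad}")
    return y.astype(int)


y = parse_binary_target(pd.Series([" Yes", "no", "YES", 0, 1, True]))
print(y.tolist())
```

Missing targets are deliberately treated as errors here; dropping those rows before parsing is another reasonable choice.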


Section 2 — Build ColumnTransformer

Task 2.1: Identify numeric vs categorical columns

Task 2.2: Create preprocess with two branches

●Numeric: impute median + scale
●Categorical: impute most_frequent + one-hot (handle_unknown='ignore')
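The two branches described in Tasks 2.1 and 2.2 could be assembled roughly as follows. The toy frame and its column names are hypothetical stand-ins for the churn features, and splitting columns by dtype is one simple heuristic, not the only option.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy mixed-type features with missing values (hypothetical columns).
X = pd.DataFrame({
    "tenure": [1.0, 24.0, np.nan, 60.0],
    "monthly_charges": [70.5, np.nan, 30.0, 99.9],
    "contract": ["month-to-month", "two-year", np.nan, "month-to-month"],
})

# Task 2.1: split columns by dtype.
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

# Task 2.2: one branch per column type.
numeric_branch = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_branch = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_branch, numeric_cols),
    ("cat", categorical_branch, categorical_cols),
])

Xt = preprocess.fit_transform(X)
print(numeric_cols, categorical_cols, Xt.shape)
```

Checking `numeric_cols` and `categorical_cols` by eye is worthwhile: numeric codes stored as strings, or categories stored as integers, end up in the wrong branch under a dtype-based split.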


Section 3 — Stratified holdout split

Task 3.1

Create an explicit holdout split, stratified on y:

●train: 80%
●holdout: 20%

Keep holdout untouched until the final check.
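A minimal sketch of the 80/20 stratified split, using toy data with a 25% positive rate. The feature column and `random_state` value are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 200 rows, 25% positives (hypothetical).
rng = np.random.default_rng(0)
X = pd.DataFrame({"tenure": rng.integers(1, 72, size=200)})
y = pd.Series([1] * 50 + [0] * 150)

# stratify=y keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(len(X_train), len(X_test), y_train.mean(), y_test.mean())
```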


Section 4 — CV on training split only

Task 4.1

Run 5-fold StratifiedKFold CV on X_train, y_train using a Pipeline(preprocess -> model).
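Task 4.1 might be sketched like this. The lab does not name a model, so `LogisticRegression` is an assumption, as are the synthetic `X_train`/`y_train` and the `roc_auc` scoring choice. Because the preprocessing lives inside the pipeline, it is refit on the training part of each fold.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the training split (hypothetical columns).
rng = np.random.default_rng(0)
n = 300
X_train = pd.DataFrame({
    "tenure": rng.integers(1, 72, size=n).astype(float),
    "contract": rng.choice(["monthly", "yearly"], size=n),
})
y_train = (X_train["tenure"] + rng.normal(0, 15, size=n) < 30).astype(int)

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["tenure"]),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["contract"]),
])

# Preprocess and model travel together, so each CV fold fits its own preprocessing.
clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X_train, y_train, cv=cv, scoring="roc_auc")
print(scores.mean(), scores.std())
```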

Task 4.2

(Leakage demo) Show the wrong way: fit preprocess on all of X_train before CV, then run CV on the already-transformed data.
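The contrast in Task 4.2 could be sketched as below, on synthetic numeric data with a simplified stand-in for `preprocess` (imputer plus scaler only). In the leaky version, the medians and scaling statistics are computed from rows that later serve as validation folds. With simple imputation and scaling the score difference is often small, so the point of the demo is the structure of the mistake rather than a dramatic gap.

```python
import numpy as np
from sklearn.base import clone
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split, with ~5% missing cells.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 3))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)
X_train[rng.random(X_train.shape) < 0.05] = np.nan

preprocess = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# WRONG: preprocessing is fit once on all of X_train, so every validation fold
# has already influenced the imputation medians and scaling statistics.
X_leaky = preprocess.fit_transform(X_train)
leaky_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_train, cv=cv, scoring="roc_auc"
)

# RIGHT: preprocessing is inside the pipeline and is refit within each fold.
safe_clf = Pipeline([
    ("preprocess", clone(preprocess)),
    ("model", LogisticRegression(max_iter=1000)),
])
safe_scores = cross_val_score(safe_clf, X_train, y_train, cv=cv, scoring="roc_auc")

print("leaky:", leaky_scores.mean(), "safe:", safe_scores.mean())
```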


Section 5 — Final holdout evaluation

Task 5.1

Fit the pipeline on the full training split (X_train, y_train) and evaluate ROC-AUC on untouched holdout (X_test, y_test).

This is your final unbiased estimate after model development on training data.
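A sketch of the final check, again on synthetic numeric data with an assumed `LogisticRegression` model and a simplified preprocessing step. The pipeline is fit once on the training split, and ROC-AUC is computed from predicted probabilities on the holdout.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Fit on the full training split only; the holdout is touched once, here.
clf.fit(X_train, y_train)
holdout_proba = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
holdout_auc = roc_auc_score(y_test, holdout_proba)
print(holdout_auc)
```

If the holdout score is then used to pick between models or tune settings, it stops being an unbiased estimate, so it is best computed once at the end.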
