S30 AI Lab (www.thes30.com)
#25

Data Preprocessing Pipelines

Medium · 🧠 Deep Learning · W6 D1

Make preprocessing *correct and repeatable* using `Pipeline` + `ColumnTransformer`, while avoiding the most common leakage bugs.

Students can:

  • build end-to-end sklearn pipelines for mixed-type data,
  • handle missing values, scaling, and one-hot encoding correctly,
  • use CV without leakage,
  • debug common pipeline mistakes (wrong dtypes, unseen categories, fit on full data).

1. Tasks
2. Robust target parsing
3. Build ColumnTransformer
4. Stratified holdout split
5. CV on training split only
6. Final holdout evaluation

FAANG Gotchas

  • If you don't set `handle_unknown='ignore'` on the one-hot encoder, `transform` raises an error on categories it never saw during fit, so production can crash on unseen categories.
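The gotcha above can be reproduced in a few lines. This is a minimal sketch, assuming scikit-learn's `OneHotEncoder` and a made-up `plan` column with hypothetical category names; it is not taken from the lab's solution.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training categories and one category that only shows up later.
train = np.array([["basic"], ["premium"], ["basic"]], dtype=object)
new = np.array([["enterprise"]], dtype=object)

# Default behaviour (handle_unknown="error"): transform raises on unseen categories.
strict = OneHotEncoder()
strict.fit(train)
try:
    strict.transform(new)
    strict_failed = False
except ValueError:
    strict_failed = True

# With handle_unknown="ignore" the unseen category becomes an all-zero row.
tolerant = OneHotEncoder(handle_unknown="ignore")
tolerant.fit(train)
encoded = tolerant.transform(new).toarray()

print("strict encoder failed:", strict_failed)
print("tolerant encoding:", encoded)
```

The all-zero row means the model treats the unseen category as "none of the known ones", which is usually safer than an exception at serving time, though it is worth monitoring how often it happens.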

Asked At

Google, GitHub

Python 3 notebook
Dataset & Setup

Data Preprocessing Pipelines: Student Lab (Customer Churn)

Focus: build leakage-safe preprocessing with ColumnTransformer + Pipeline.

Section 0: Load churn dataset (with fallback)

Expected path: data/churn/churn.csv
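One possible shape for the loader is sketched below. The lab does not say what the fallback is, so the synthetic generator, its column names (`tenure_months`, `monthly_charges`, `contract`, `payment_method`, `churn`), and the helper names `make_synthetic_churn` and `load_churn` are all assumptions made for illustration.

```python
from pathlib import Path

import numpy as np
import pandas as pd

DATA_PATH = Path("data/churn/churn.csv")  # expected path from the lab text


def make_synthetic_churn(n_rows=500, seed=0):
    """Hypothetical fallback: a small synthetic churn-like table."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "tenure_months": rng.integers(1, 72, size=n_rows).astype(float),
        "monthly_charges": rng.normal(65.0, 20.0, size=n_rows).round(2),
        "contract": rng.choice(["month-to-month", "one-year", "two-year"], size=n_rows),
        "payment_method": rng.choice(["card", "bank", "check"], size=n_rows),
        "churn": rng.choice(["Yes", "No"], size=n_rows, p=[0.3, 0.7]),
    })
    # Inject a few missing values so the imputation steps have work to do.
    missing = rng.random(n_rows) < 0.05
    df.loc[missing, "monthly_charges"] = np.nan
    return df


def load_churn(path=DATA_PATH):
    """Read the CSV if it exists, otherwise fall back to synthetic data."""
    if path.exists():
        return pd.read_csv(path)
    return make_synthetic_churn()


df = load_churn()
print(df.shape)
print(df.head())
```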

Section 1: Robust target parsing

Task 1.1: Create y in {0,1}
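A sketch of one way to make the target parsing robust: normalise case and whitespace, map known labels to 0/1, and fail loudly on anything unrecognised instead of silently producing NaN. The accepted label set and the helper name `parse_target` are assumptions; adjust them to whatever the real churn column contains.

```python
import pandas as pd

# Assumed label vocabulary; extend it for the actual dataset.
LABEL_TO_INT = {
    "yes": 1, "y": 1, "true": 1, "1": 1,
    "no": 0, "n": 0, "false": 0, "0": 0,
}


def parse_target(raw: pd.Series) -> pd.Series:
    """Map a messy churn column to integers in {0, 1}."""
    cleaned = raw.astype(str).str.strip().str.lower()
    # Floats such as 1.0 become the string "1.0"; strip the trailing ".0".
    cleaned = cleaned.str.replace(r"\.0$", "", regex=True)
    y = cleaned.map(LABEL_TO_INT)  # unknown labels (including "nan") map to NaN
    if y.isna().any():
        bad = sorted(cleaned[y.isna()].unique())
        raise ValueError(f"Unrecognized target labels: {bad}")
    return y.astype(int)


example = pd.Series(["Yes", "no", " TRUE ", 0, 1.0])
y = parse_target(example)
print(y.tolist())
```

Raising on unknown labels is a deliberate choice in this sketch: a missing or misspelled target is better caught at parse time than discovered as a silent class imbalance later.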

Section 2: Build ColumnTransformer

Task 2.1: Identify numeric vs categorical columns
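For Task 2.1, a dtype-based split is a reasonable starting point, with one caveat: numeric columns stored as text will be misclassified as categorical unless they are coerced first. The sketch below uses a tiny hypothetical frame; the column names and the choice of which column is stored as text are assumptions.

```python
import pandas as pd

# Hypothetical feature frame; "monthly_charges" is numeric but stored as text.
X = pd.DataFrame({
    "tenure_months": [1, 24, 60],
    "monthly_charges": ["29.85", "56.95", " "],
    "contract": ["month-to-month", "one-year", "two-year"],
})

# Coerce the text column to numbers; unparseable entries become NaN
# and are left for the imputer to handle.
X["monthly_charges"] = pd.to_numeric(X["monthly_charges"], errors="coerce")

numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```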

Task 2.2: Create preprocess with two branches

  • Numeric: impute median + scale
  • Categorical: impute most_frequent + one-hot (handle_unknown='ignore')
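The two branches described above might be wired together as follows. This is a sketch, not the lab's reference solution: `StandardScaler` is assumed for the "scale" step, and the column names and demo rows are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["tenure_months", "monthly_charges"]  # hypothetical names
categorical_cols = ["contract"]                       # hypothetical name

numeric_branch = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_branch = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_branch, numeric_cols),
    ("cat", categorical_branch, categorical_cols),
])

# Small demo frame with missing values in both kinds of column.
X_demo = pd.DataFrame({
    "tenure_months": [1.0, 24.0, np.nan, 60.0],
    "monthly_charges": [29.85, np.nan, 70.10, 99.00],
    "contract": pd.Series(
        ["month-to-month", "one-year", np.nan, "month-to-month"], dtype=object
    ),
})
Xt = preprocess.fit_transform(X_demo)
print(Xt.shape)  # 2 scaled numeric columns + 2 one-hot columns
```

Keeping imputation inside each branch means the medians and most-frequent values are learned only from whatever data `fit` receives, which is what makes the transformer safe to place inside cross-validation.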

Section 3: Stratified holdout split

Task 3.1

Create an explicit holdout split:

  • train: 80%
  • holdout: 20%

Stratify the split on y.

Keep holdout untouched until the final check.
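A sketch of the split using `train_test_split`, on synthetic stand-in data with a 30% positive rate. The `random_state` value and the toy feature are assumptions; the names `X_train`, `X_test`, `y_train`, `y_test` follow the ones used later in the lab.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 rows, 60 positives (30%).
rng = np.random.default_rng(0)
X = pd.DataFrame({"tenure_months": rng.integers(1, 72, size=200)})
y = pd.Series(np.r_[np.ones(60, dtype=int), np.zeros(140, dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

print("train size:", len(X_train), "positive rate:", round(y_train.mean(), 3))
print("holdout size:", len(X_test), "positive rate:", round(y_test.mean(), 3))
```

Stratifying keeps the churn rate roughly equal in both parts, which matters for imbalanced targets where a plain random split can leave the holdout with too few positives.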

Section 4: CV on training split only

Task 4.1

Run 5-fold StratifiedKFold CV on X_train, y_train using a Pipeline(preprocess -> model).
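Task 4.1 could look like the sketch below. The lab does not name the model, so `LogisticRegression` is an assumption, as are the synthetic training data, the column names, and the use of ROC-AUC as the CV metric (chosen to match the final evaluation).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the training split (hypothetical columns).
rng = np.random.default_rng(0)
n = 400
X_train = pd.DataFrame({
    "tenure_months": rng.integers(1, 72, size=n).astype(float),
    "monthly_charges": rng.normal(65.0, 20.0, size=n),
    "contract": rng.choice(["month-to-month", "one-year", "two-year"], size=n),
})
# Toy target: shorter tenure makes churn more likely.
churn_prob = 1.0 / (1.0 + np.exp((X_train["tenure_months"].to_numpy() - 30.0) / 10.0))
y_train = pd.Series((rng.random(n) < churn_prob).astype(int))

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["tenure_months", "monthly_charges"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["contract"]),
])

# Preprocessing lives inside the pipeline, so it is re-fit on each training fold.
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="roc_auc")
print(f"CV ROC-AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```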

Task 4.2

(Leakage demo) Show the wrong way: fit preprocess on all of X_train before running CV, so each validation fold has already influenced the imputation and scaling statistics.
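The contrast between the leaky and the safe pattern can be sketched side by side. The data, column names, and `LogisticRegression` model are assumptions. On toy data like this the two scores may be nearly identical, because median imputation and scaling leak very little; the gap tends to grow with steps that depend on the target, such as feature selection or target encoding.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the training split (hypothetical columns).
rng = np.random.default_rng(0)
n = 400
X_train = pd.DataFrame({
    "tenure_months": rng.integers(1, 72, size=n).astype(float),
    "monthly_charges": rng.normal(65.0, 20.0, size=n),
    "contract": rng.choice(["month-to-month", "one-year", "two-year"], size=n),
})
X_train.loc[rng.random(n) < 0.1, "monthly_charges"] = np.nan
churn_prob = 1.0 / (1.0 + np.exp((X_train["tenure_months"].to_numpy() - 30.0) / 10.0))
y_train = pd.Series((rng.random(n) < churn_prob).astype(int))

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["tenure_months", "monthly_charges"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["contract"]),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# WRONG: the preprocessor sees every row of X_train, including rows that
# will later act as validation folds.
X_leaky = clone(preprocess).fit_transform(X_train)
leaky_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_train, cv=cv, scoring="roc_auc"
)

# RIGHT: the preprocessor is part of the estimator, so it is re-fit per fold.
safe_pipe = Pipeline([
    ("preprocess", clone(preprocess)),
    ("model", LogisticRegression(max_iter=1000)),
])
safe_scores = cross_val_score(safe_pipe, X_train, y_train, cv=cv, scoring="roc_auc")

print(f"leaky CV ROC-AUC: {leaky_scores.mean():.4f}")
print(f"safe  CV ROC-AUC: {safe_scores.mean():.4f}")
```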

Section 5: Final holdout evaluation

Task 5.1

Fit the pipeline on the full training split (X_train, y_train) and evaluate ROC-AUC on the untouched holdout (X_test, y_test).

This is your final unbiased estimate after model development on training data.
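Putting the pieces together, the final check might look like this sketch. It is self-contained on synthetic data, so the columns, the `LogisticRegression` model, and the `random_state` values are assumptions; only the overall flow (fit on the training split, score ROC-AUC once on the holdout) comes from the lab text.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the churn data (hypothetical columns).
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "tenure_months": rng.integers(1, 72, size=n).astype(float),
    "monthly_charges": rng.normal(65.0, 20.0, size=n),
    "contract": rng.choice(["month-to-month", "one-year", "two-year"], size=n),
})
churn_prob = 1.0 / (1.0 + np.exp((X["tenure_months"].to_numpy() - 30.0) / 10.0))
y = pd.Series((rng.random(n) < churn_prob).astype(int))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

pipe = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), ["tenure_months", "monthly_charges"]),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["contract"]),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])

# Fit once on the full training split, then score the holdout exactly once.
pipe.fit(X_train, y_train)
holdout_proba = pipe.predict_proba(X_test)[:, 1]  # probability of class 1
holdout_auc = roc_auc_score(y_test, holdout_proba)
print(f"Holdout ROC-AUC: {holdout_auc:.3f}")
```

The estimate only stays unbiased if the holdout was not used for any earlier decision; re-tuning after looking at this number turns the holdout into another validation set.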
