Data Preprocessing Pipelines
Make preprocessing *correct and repeatable* using `Pipeline` + `ColumnTransformer`, while avoiding the most common leakage bugs.
Students can:
- build end-to-end sklearn pipelines for mixed-type data,
- handle missing values, scaling, and one-hot encoding correctly,
- use CV without leakage,
- debug common pipeline mistakes (wrong dtypes, unseen categories, fit on full data).
FAANG Gotchas
- If you don't set `handle_unknown='ignore'` on `OneHotEncoder`, `transform` raises an error on categories not seen during `fit`, which can crash production inference.
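The gotcha above can be reproduced in a few lines. This is a minimal sketch: the category values (`monthly`, `yearly`, `two-year`) are made up for illustration and are not taken from the churn dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category values, used only to illustrate the behavior.
train = np.array([["monthly"], ["yearly"], ["monthly"]], dtype=object)
new = np.array([["two-year"]], dtype=object)  # never seen during fit

# Default encoder (handle_unknown='error') rejects the unseen category.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(new)
    strict_failed = False
except ValueError:
    strict_failed = True

# With handle_unknown='ignore', the unseen category becomes an all-zero row.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded_unknown = lenient.transform(new).toarray()
```

The all-zero row means the model sees "none of the known categories", which is usually preferable to an exception at inference time.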
Data Preprocessing Pipelines: Student Lab (Customer Churn)
Focus: build leakage-safe preprocessing with ColumnTransformer + Pipeline.
Section 0: Load churn dataset (with fallback)
Expected path: data/churn/churn.csv
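One possible shape for the loader is sketched below. The lab does not say what the fallback is, so the synthetic generator here is an assumption, as are the helper names (`load_churn`, `make_synthetic_churn`) and every column name.

```python
from pathlib import Path

import numpy as np
import pandas as pd

DATA_PATH = Path("data/churn/churn.csv")


def make_synthetic_churn(n_rows=500, seed=0):
    """Hypothetical stand-in for the real file; all column names are assumptions."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "tenure": rng.integers(0, 72, n_rows).astype(float),
        "monthly_charges": rng.normal(70, 20, n_rows).round(2),
        "contract": rng.choice(["monthly", "yearly", "two-year"], n_rows),
        "payment_method": rng.choice(["card", "bank", "check"], n_rows),
    })
    # Short-tenure customers churn more often in this toy generator.
    p_churn = 1 / (1 + np.exp(0.05 * (df["tenure"] - 24)))
    df["churn"] = np.where(rng.random(n_rows) < p_churn, "Yes", "No")
    # Inject missing values so the imputers have something to do.
    df.loc[rng.random(n_rows) < 0.05, "monthly_charges"] = np.nan
    df.loc[rng.random(n_rows) < 0.05, "contract"] = np.nan
    return df


def load_churn(path=DATA_PATH):
    """Read the CSV if it exists, otherwise fall back to synthetic data."""
    if Path(path).exists():
        return pd.read_csv(path)
    return make_synthetic_churn()


df = load_churn()
```

Keeping the fallback in one function makes it easy to swap in a different source later without touching the rest of the lab.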
Section 1: Robust target parsing
Task 1.1: Create y in {0,1}
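A minimal sketch of Task 1.1 follows. The accepted spellings in `POSITIVE` and `NEGATIVE` are assumptions about how a churn label might be stored, and `parse_target` is a hypothetical helper name; adjust both to the actual file.

```python
import pandas as pd

# Assumed label spellings; extend these sets to match the real data.
POSITIVE = {"1", "yes", "y", "true", "t", "churn", "churned"}
NEGATIVE = {"0", "no", "n", "false", "f", "stay", "stayed"}


def parse_target(raw: pd.Series) -> pd.Series:
    """Map a messy label column to integers in {0, 1}, failing loudly otherwise."""
    # Normalize case and whitespace; turn float-like "1.0" into "1".
    text = (
        raw.astype(str)
        .str.strip()
        .str.lower()
        .str.replace(r"\.0$", "", regex=True)
    )
    mapping = {**{k: 1 for k in POSITIVE}, **{k: 0 for k in NEGATIVE}}
    y = text.map(mapping)  # unmapped values become NaN
    if y.isna().any():
        bad = sorted(text[y.isna()].unique())
        raise ValueError(f"Unparseable target values: {bad}")
    return y.astype(int)


y = parse_target(pd.Series(["Yes", " no ", "TRUE", 0, 1.0]))
```

Raising on unknown values (including missing labels, which become the string `"nan"`) is a deliberate choice in this sketch: silently coercing them to 0 would corrupt the target.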
Section 2: Build ColumnTransformer
Task 2.1: Identify numeric vs categorical columns
Task 2.2: Create preprocess with two branches
- Numeric: impute median + scale
- Categorical: impute most_frequent + one-hot (handle_unknown='ignore')
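Tasks 2.1 and 2.2 could be sketched as follows. The toy frame and its column names are assumptions; splitting columns by dtype is one reasonable approach, but numeric-coded categories (for example an integer region code) would need to be moved to the categorical list by hand.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with hypothetical column names and some missing values.
X = pd.DataFrame({
    "tenure": [1.0, 24.0, np.nan, 60.0],
    "monthly_charges": [29.9, np.nan, 80.5, 99.0],
    "contract": ["monthly", "yearly", np.nan, "monthly"],
})

# Task 2.1: split columns by dtype.
num_cols = X.select_dtypes(include="number").columns.tolist()
cat_cols = X.select_dtypes(exclude="number").columns.tolist()

# Task 2.2: one branch per column type.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, num_cols),
    ("cat", categorical, cat_cols),
])

# Fitting here is only a smoke test; in the lab, fit inside a Pipeline.
Xt = preprocess.fit_transform(X)
```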
Section 3: Stratified holdout split
Task 3.1
Create an explicit holdout split, stratified on y:
- train: 80%
- holdout: 20%
Keep holdout untouched until the final check.
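A minimal sketch of the split, using synthetic stand-ins for `X` and `y` (the data and the `random_state` value are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 20% positives.
rng = np.random.default_rng(0)
X = pd.DataFrame({"tenure": rng.integers(0, 72, 1000)})
y = pd.Series((rng.random(1000) < 0.2).astype(int), name="churn")

# stratify=y keeps the churn rate nearly identical in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_rate, holdout_rate = y_train.mean(), y_test.mean()
```

Comparing `train_rate` and `holdout_rate` is a quick sanity check that stratification worked.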
Section 4: CV on training split only
Task 4.1
Run 5-fold StratifiedKFold CV on X_train, y_train using a Pipeline(preprocess -> model).
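Task 4.1 might look like the sketch below. The lab does not name the model, so `LogisticRegression` is an assumption, as are the synthetic `X_train` and `y_train` and the ROC-AUC scoring choice (picked to match the final holdout metric).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the training split (hypothetical columns).
rng = np.random.default_rng(0)
n = 400
X_train = pd.DataFrame({
    "tenure": rng.integers(0, 72, n).astype(float),
    "contract": rng.choice(["monthly", "yearly"], n),
})
p_churn = 1 / (1 + np.exp(0.05 * (X_train["tenure"] - 24)))
y_train = (rng.random(n) < p_churn).astype(int)

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["tenure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["contract"]),
])

# Preprocessing lives inside the Pipeline, so it is refit on each fold's
# training rows and never sees that fold's validation rows.
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="roc_auc")
```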
Task 4.2
(Leakage demo) Show the wrong way: fit preprocess on all of X_train before CV, so every validation fold has already influenced the imputation and scaling statistics.
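The contrast can be sketched as below. For brevity this uses a numeric-only `preprocess` (impute + scale) on synthetic data, both of which are assumptions. With plain imputation and scaling the score gap is often small, so the sketch also compares the fitted scaler means directly to show that the leaky version really did use validation rows.

```python
import numpy as np
from sklearn.base import clone
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric stand-in for X_train, y_train, with some missing values.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 5))
y_train = (X_train[:, 0] + rng.normal(size=300) > 0).astype(int)
X_train[rng.random(X_train.shape) < 0.05] = np.nan

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# WRONG: preprocess is fitted once on every row, then CV runs on the result.
leaky = clone(preprocess).fit(X_train)
X_leaky = leaky.transform(X_train)
leaky_scores = cross_val_score(model, X_leaky, y_train, cv=cv, scoring="roc_auc")

# RIGHT: preprocess is refit inside each fold as part of the Pipeline.
pipe = Pipeline([("preprocess", clone(preprocess)), ("model", model)])
clean_scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="roc_auc")

# Evidence of the leak: statistics fitted on all rows differ from those
# fitted on the first fold's training rows only.
train_idx, _ = next(cv.split(X_train, y_train))
fold_only = clone(preprocess).fit(X_train[train_idx])
mean_gap = np.abs(
    leaky.named_steps["scale"].mean_ - fold_only.named_steps["scale"].mean_
).max()
```

A nonzero `mean_gap` shows the leaky scaler was shaped by rows that should have been unseen. The effect on scores grows with steps that use more information, such as target-aware encoders or feature selection.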
Section 5: Final holdout evaluation
Task 5.1
Fit the pipeline on the full training split (X_train, y_train) and evaluate ROC-AUC on untouched holdout (X_test, y_test).
This is your final unbiased estimate after model development on training data.
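A compact sketch of Task 5.1 on synthetic data follows; the data, column names, and `LogisticRegression` model are assumptions. ROC-AUC needs scores rather than hard labels, so the positive-class probability from `predict_proba` is used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the churn data (hypothetical columns).
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "tenure": rng.integers(0, 72, n).astype(float),
    "contract": rng.choice(["monthly", "yearly"], n),
})
y = (rng.random(n) < 1 / (1 + np.exp(0.05 * (X["tenure"] - 24)))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["tenure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["contract"]),
])
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Fit on the training split only; the holdout is touched exactly once.
pipe.fit(X_train, y_train)
holdout_auc = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])
```

If the holdout score is used to pick between models or tune settings, it stops being an unbiased estimate, so run this step once at the end.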